Summary

This article discusses a deep learning approach for handling a combination of time series and tabular data, using multimodal learning and wide and deep learning techniques.

Abstract

The article presents a scenario where a mixture of time series inputs (sequence of heart rate and respiratory rate) and tabular features (time since sleep onset and personalized embedding representing users' sleep pattern) are used to predict a sleep stage at any given minute. The author suggests treating this as a multimodal deep learning problem and incorporating "wide and deep learning" as introduced by Google Research. The proposed approach involves passing time series sequences through an RNN or LSTM or 1D CNN to capture the hidden state or CNN embedding as a representation of the sequence, and then concatenating embeddings for each sequence with other tabular features. The article also discusses some interesting decisions to consider, such as treating the sequences independently or as multi-channel input, and the use of RNN, LSTM, or 1D CNN for different sequences. A sample model implementation in PyTorch is provided.

Opinions

Late fusion seems to work better than early fusion for multiple time series input sequences.
RNN seems to work better for shorter sequences and bidirectional LSTM for longer sequences.
The fusion layer is not necessary and is not required in the absence of baseline values.
The author invites readers who have worked on similar problems to share their experiences and approaches.

Deep learning on a combination of time series and tabular data.

We often times find ourself in a situation where we have combination of different features that we want to leverage in our model. The input data to your model is a mix of time series and tabular data. Architecting a deep learning model to work well in this scenario is an interesting problem.

One example scenario is: You have data from a device like fitbit and you want to predict a sleep stage at any given minute:

You have a mixture of:

Time series inputs:

sequence of heart rate
sequence of respiratory rate

Tabular features:

time since sleep onset
personalized embedding representing this users sleep pattern

and bunch of other features.

One way to approach this is treat it as multimodal deep learning.

Multimodal learning — Photo on ResearchGate

And mix in “wide and deep learning” as introduced by google research here

Wide and deep learning — Photo on google blog

So how do we go about this?

Pass in time series sequence though an RNN or LSTM or 1D CNN and capture the hidden state or CNN embedding as a representation of the sequence.
Concatenate embeddings for each of the sequence with other tabular features.

Some interesting decisions to consider:

For the multiple time series input sequences:

do you treat the sequences independent and fuse/concatenate the representations late (late fusion)?
Or do you treat them as multi-channel input with each time series as a channel (early fusion)?

In general late fusion seems to work better than early fusion and also does not require padding the input when different input sequences are not of same length. However it is really dependent on the problem space and correlation of input sequences.

In general RNN seems to work better for shorter sequences and bidirectional LSTM for longer sequences. In late fusion you can mix and match RNN/LSTM/1d CNN for different sequences.

Here is a sample model implementation (in pytorch):

This example uses 2 layer bidi LSTM with late fusion.

For each of the time series features (feature_1 and feature_2) we also have the baseline values, hence we have fusion layers where we fuse representation of sequence (hidden state of LSTM) with the baseline values. However this fusion layer is not necessary and is not required in the absence of baseline values.

class TestModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_size = 70
        self.hidden_size = 8
        self.num_layers = 2
        self.output_dim = 2
        self.lstm_feature_1 = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True,
                                      bidirectional=True)
        self.lstm_feature_2 = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True,
                                      bidirectional=True)
        self.fc_feature_1 = nn.Linear((self.hidden_size * 2) + 1, 1)
        self.fc_feature_2 = nn.Linear((self.hidden_size * 2) + 1, 1)

        self.fc = nn.Linear(4, self.output_dim)

    def forward(self, f, device=None):
        if not device:
            device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        x_f1, f1, x_f2, f2, f3, f4 = f

        # x_f1 is feature_1_seq
        # f1 is feature_1_baseline
        # x_f2 is feature_2_seq
        # f2 is feature_2_baseline

        # f3 and f4 are tabular features

        x_f1 = x_f1.view(x_f1.shape[0], 1, -1)
        h0_f1, c0_f1 = self.init_hidden(x_f1, device)
        h_t_f1, c_t_f1 = self.lstm_feature_1(x_f1, (h0_f1, c0_f1))
        x_f1 = h_t_f1
        x_f1 = x_f1.view(x_f1.shape[0], -1)

        x_f2 = x_f2.view(x_f2.shape[0], 1, -1)
        h0_f2, c0_f2 = self.init_hidden(x_f2, device)
        h_t_f2, c_t_f2 = self.lstm_feature_2(x_f2, (h0_f2, c0_f2))
        x_f2 = h_t_f2
        x_f2 = x_f2.view(x_f2.shape[0], -1)

        x_f1 = torch.cat((x_f1, f1), 1)
        x_f1 = self.fc_feature_1(x_f1)

        x_f2 = torch.cat((x_f2, f2), 1)
        x_f2 = self.fc_feature_2(x_f2)

        x = torch.cat((x_f1, x_f2, f3, f4), 1)
        x = self.fc(x)
        x = F.log_softmax(x, dim=1)
        return x

    def init_hidden(self, x, device):
        batch_size = x.size(0)
        h0 = torch.zeros(self.num_layers * 2, batch_size, self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers * 2, batch_size, self.hidden_size).to(device)
        return h0, c0

If you have worked on a problem that has involved similar style of input: I would love to hear from you on what has worked for you or any other ways you have tried to approach it.