LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that can learn long-term dependencies in time series data. Unlike an MLP, an LSTM processes the data with dependencies between time steps, so it is likely to be more suitable for time series prediction than an MLP.
We use stacked (unidirectional) LSTM layers to capture long-term dependencies effectively and to discover features automatically. The last hidden state of the final LSTM layer is flattened and concatenated with the input data to form a residual connection. Finally, a fully connected (FC) layer produces the prediction.
Concatenating the input before the FC layer can be viewed as a skip connection in the spirit of a Residual Network (ResNet). It allows the model to account for both the general patterns learned by the LSTM and the unique characteristics of each individual input window.
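The architecture above can be sketched as follows. This is a minimal PyTorch interpretation, not the authors' exact implementation; the class name, layer sizes, and single-feature output head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LSTMResidual(nn.Module):
    """Stacked LSTM whose final hidden state is concatenated with the
    flattened input window (a skip connection) before the FC head.
    Hyperparameter defaults here are illustrative, not from the text."""

    def __init__(self, seq_len, n_features, hidden_dim=64, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(
            n_features, hidden_dim, num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
        )
        # The FC layer sees both the LSTM summary and the raw flattened window.
        self.fc = nn.Linear(hidden_dim + seq_len * n_features, 1)

    def forward(self, x):                    # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        last_hidden = out[:, -1, :]          # last time step of the last layer
        skip = x.flatten(start_dim=1)        # residual-style concatenation
        return self.fc(torch.cat([last_hidden, skip], dim=1))
```

Concatenation (rather than ResNet-style addition) lets the FC layer weigh the raw inputs and the learned features independently.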
For training, sliding windows of fixed size are generated, each paired with a prediction target: the Bitcoin price n days after the window. During prediction, the input is likewise split into sliding windows, each window produces one prediction, and the predictions are stacked to cover the entire input. One drawback is that no prediction can be made for the first (sequence length - 1) days, but a fixed sequence length makes it easier to learn the general patterns of the data.
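The windowing step can be sketched like this. The function name and the one-day horizon are assumptions; the source only specifies fixed-size windows with targets n days later.

```python
import numpy as np

def make_windows(series, seq_len, horizon):
    """Slide a fixed-size window over the series; the target is the
    value `horizon` steps after the end of each window."""
    X, y = [], []
    for start in range(len(series) - seq_len - horizon + 1):
        X.append(series[start:start + seq_len])
        y.append(series[start + seq_len + horizon - 1])
    return np.array(X), np.array(y)

# Toy stand-in for a daily price series.
prices = np.arange(10, dtype=float)
X, y = make_windows(prices, seq_len=3, horizon=1)
# The first window [0, 1, 2] is paired with the next day's value, 3.
```

Note how the first `seq_len - 1` positions of the series never get a prediction of their own, matching the drawback described above.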
To demonstrate model performance, we show the model's predictions on the air passengers dataset. Cross validation identified both the best transformation for making the time series stationary and the optimal hyperparameters; the root mean squared error (RMSE) on the next day's closing price was used to select the best model.
In the chart, we display the model's predictions for the last cross-validation split and for the test data.
The LSTM model predicts the percentage increase of air passengers, as described in the section on capturing trends. The sequence length, number of LSTM layers, hidden dimensions, dropout rates, etc. were determined by grid search.