TS Mixer

TS Mixer (Time-Series Mixer), developed by Google Cloud AI, is an architecture that applies MLPs to time series data across both the feature and time dimensions.

TS Mixer Architecture

Figure 1 of the TS Mixer paper illustrates the architecture well. In summary, the model is a stack of mixer layers followed by a temporal projection layer that maps the input length to the forecast horizon.

Each mixer layer consists of three components: a time-mixing MLP, a feature-mixing MLP, and 2D batch normalization. The batch normalization is noteworthy because it normalizes across both the time and feature dimensions before the time-mixing and feature-mixing operations.
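
To make this concrete, below is a minimal PyTorch sketch of such a mixer layer. The class name, hyperparameters, and the exact ordering of normalization, activation, and dropout are our simplifying assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """Sketch of a TSMixer-style mixer layer for input of shape
    (batch, time, features): normalize, mix along time, normalize,
    mix along features, with a residual connection around each step."""

    def __init__(self, seq_len: int, n_features: int, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        # BatchNorm2d(1) normalizes over both the time and feature dimensions
        # by viewing (batch, time, features) as a one-channel 2D "image".
        self.time_norm = nn.BatchNorm2d(1)
        self.feature_norm = nn.BatchNorm2d(1)
        self.time_mlp = nn.Sequential(          # mixes along the time axis
            nn.Linear(seq_len, seq_len), nn.ReLU(), nn.Dropout(dropout)
        )
        self.feature_mlp = nn.Sequential(       # mixes along the feature axis
            nn.Linear(n_features, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, n_features), nn.Dropout(dropout),
        )

    @staticmethod
    def _norm2d(norm: nn.BatchNorm2d, x: torch.Tensor) -> torch.Tensor:
        # (B, T, F) -> (B, 1, T, F) -> normalize -> (B, T, F)
        return norm(x.unsqueeze(1)).squeeze(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Time mixing: transpose so the Linear acts on the time dimension.
        y = self._norm2d(self.time_norm, x).transpose(1, 2)
        x = x + self.time_mlp(y).transpose(1, 2)      # residual connection
        # Feature mixing: the Linear acts on the last (feature) dimension.
        y = self._norm2d(self.feature_norm, x)
        x = x + self.feature_mlp(y)                   # residual connection
        return x
```

For example, `MixerLayer(seq_len=30, n_features=8, hidden_dim=64)(torch.randn(16, 30, 8))` returns a tensor of the same shape, so the layers can be stacked.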

The paper notes that feature mixing can change the data's feature dimension; in such cases, a fully connected layer on the residual branch is used to match dimensions. However, we omit these cases during hyperparameter optimization to reduce the search space.
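
For illustration, a hypothetical sketch of such a projected residual (the class and names are ours, not from the paper):

```python
import torch
import torch.nn as nn

class ProjectedResidual(nn.Module):
    """Hypothetical illustration: when feature mixing maps n_in features
    to n_out features, a fully connected layer projects the residual
    branch so the two paths can be summed."""

    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.mix = nn.Linear(n_in, n_out)   # feature mixing that changes dims
        # An identity residual only works when n_in == n_out; otherwise project.
        self.proj = nn.Linear(n_in, n_out) if n_in != n_out else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x) + self.mix(x)
```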

Instead of mixing features separately as depicted in Figure 4 of the paper, we feed in all the generated features at once, since our features exist for all time steps. There is no need for alignment between historical prices and features.

Also, while we apply data scaling as preprocessing, no local normalization is applied. Our goal is practical use rather than benchmarking, and we speculate that 2D batch normalization is sufficient.
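
As a sketch of what global scaling without local normalization looks like, assuming scikit-learn's StandardScaler (the scaler choice, placeholder data, and variable names are ours):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_values = rng.normal(size=(100, 3))   # placeholder (n_rows, n_features) data
test_values = rng.normal(size=(20, 3))

# Fit the scaler on the training portion only and reuse it elsewhere;
# no per-window (local) normalization is applied afterwards.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)
test_scaled = scaler.transform(test_values)

# Model outputs live in scaled space; map them back before evaluation.
# (test_scaled stands in for model predictions here.)
preds_original = scaler.inverse_transform(test_scaled)
```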

Strengths of TS Mixer

TS Mixer is efficient in that it combines feature-mixing and time-mixing operations with residual connections. In the experiments presented in the paper, it outperformed Transformer-based models such as FEDformer, Autoformer, and Informer. It is also faster to train and predict with than recurrent models such as LSTMs.

TS Mixer for Air Passengers

To demonstrate model performance, we show the model's prediction results for the air passengers dataset. Cross validation identified the best transformation for making the time series stationary as well as the optimal hyperparameters. The root mean squared error (RMSE) of the next step's prediction was used to select the best model.
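
The selection logic can be sketched as follows; the split scheme, the `fit_predict` interface, and the candidate structure are hypothetical stand-ins for our actual pipeline:

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def expanding_window_splits(n: int, n_splits: int, horizon: int):
    """Yield (train_idx, valid_idx) index arrays that respect time order."""
    fold = (n - horizon) // (n_splits + 1)
    for i in range(1, n_splits + 1):
        end = fold * i
        yield np.arange(end), np.arange(end, end + horizon)

def select_best(series, candidates, fit_predict, n_splits=3, horizon=1):
    """Pick the candidate (stationarizing transformation plus
    hyperparameters) with the lowest mean validation RMSE."""
    best, best_score = None, np.inf
    for cand in candidates:
        scores = [
            rmse(series[va], fit_predict(series[tr], cand, horizon))
            for tr, va in expanding_window_splits(len(series), n_splits, horizon)
        ]
        if np.mean(scores) < best_score:
            best, best_score = cand, np.mean(scores)
    return best
```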

In the chart, we display the model's predictions for the last split of cross validation together with the test data.

  1. train: Training data of the last split.
  2. validation: Validation data of the last split.
  3. prediction (train, validation): Predictions for the train and validation periods. For each row (or sliding window) of data, predictions are made n days into the future (with n set to 1, 2, and 7) and then combined into a single series of dots, as sketched after this list. Since prediction accuracy decreases for large n, some hiccups appear in the predictions. Predictions from the tail of the train period spill into the validation period, since that is the future from the train period's viewpoint. These settings are somewhat peculiar, but they work well for testing whether the model's predictions are good enough.
  4. test(input): Test input data.
  5. test(actual): Test actual data.
  6. prediction(test): The model's prediction given the test input. There is only one prediction, made from the last row (or last sliding window) of the test input, which corresponds to 1, 2, and 7 days after 'test(input)'.
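
To illustrate how the per-window predictions at horizons 1, 2, and 7 are combined into a single series of dots (item 3 above), here is a hypothetical sketch; the helper names and the naive last-value predictor are ours:

```python
import numpy as np

def combine_multi_horizon(windows, end_indices, predict, horizons=(1, 2, 7)):
    """For the sliding window ending at time index t, place the model's
    h-step-ahead prediction at index t + h; several horizons can land on
    the same index, giving multiple dots per point in the chart."""
    combined: dict[int, list[float]] = {}
    for window, t in zip(windows, end_indices):
        for h, pred in zip(horizons, predict(window)):
            combined.setdefault(t + h, []).append(float(pred))
    return combined

# Toy usage with a naive "repeat the last value" predictor:
series = np.arange(20.0)
ends = range(5, 20)
windows = [series[t - 5:t + 1] for t in ends]  # window ends at index t
dots = combine_multi_horizon(windows, ends, lambda w: [w[-1]] * 3)
# dots[t] holds the 1-, 2-, and 7-step predictions that land on index t.
```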