
Data Scaling

Data scaling refers to the process of transforming the data to a common scale so that the model can learn the patterns more effectively. Models that rely on distance (or similarity), such as k-NN (k-Nearest Neighbors) and SVM (Support Vector Machine), as well as neural network models such as MLPs and LSTMs, require data scaling.

Models that compute distance

Let's consider k-NN, which finds the k nearest points to a query point. If the data is not scaled, the distance is dominated by the feature with the larger scale. For example, if one feature is in the range [0, 1] and another is in the range [0, 1000], the distance is almost entirely determined by the second feature. This is why we need to scale the data to bring all features onto the same scale.
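
To make this concrete, here is a minimal sketch (the two points and their values are made up for illustration) comparing the Euclidean distance between two points before and after min-max scaling:

```python
# A minimal sketch: Euclidean distance before and after min-max scaling.
# The feature ranges ([0, 1] and [0, 1000]) mirror the example above;
# the points themselves are made up for illustration.
import numpy as np

a = np.array([0.2, 100.0])   # feature 1 in [0, 1], feature 2 in [0, 1000]
b = np.array([0.9, 110.0])

# Unscaled: the second feature dominates the distance almost entirely.
print(np.linalg.norm(a - b))                 # ~10.02, driven by |100 - 110|

# After min-max scaling the second feature to [0, 1], both features contribute.
a_scaled = np.array([0.2, 100.0 / 1000.0])   # [0.2, 0.10]
b_scaled = np.array([0.9, 110.0 / 1000.0])   # [0.9, 0.11]
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.70, driven by the first feature
```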

Neural network models

There are a couple of reasons why neural networks require scaling:

- Features with very different scales make the loss surface ill-conditioned, so gradient descent converges slowly or unstably.
- Large input values can push activations into the saturated regions of functions such as sigmoid or tanh, which leads to vanishing gradients.

To help with this, deep learning libraries also provide normalization layers such as Batch Normalization and Layer Normalization.
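
As an illustration, the sketch below assumes PyTorch (the text does not name a specific library) and shows Batch Normalization inside a small MLP, plus Layer Normalization applied directly to an input batch:

```python
# A minimal sketch assuming PyTorch; the layer sizes are arbitrary.
import torch
import torch.nn as nn

x = torch.randn(32, 10)  # batch of 32 samples, 10 features

# Batch Normalization inside an MLP: normalizes each hidden unit across the batch.
mlp = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),  # normalize the 64 hidden activations per batch
    nn.ReLU(),
    nn.Linear(64, 1),
)
out = mlp(x)

# Layer Normalization: normalizes across the feature dimension of each sample.
layer_norm = nn.LayerNorm(10)
x_normed = layer_norm(x)
```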

Scaling methods

There are a couple of common scaling methods:

- Min-max scaling: maps each feature to a fixed range, typically [0, 1].
- Standardization (z-score scaling): transforms each feature to zero mean and unit variance.
- Robust scaling: uses the median and interquartile range, which makes it less sensitive to outliers.

These are well supported by ML libraries, e.g., scikit-learn's preprocessing module.
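
For example, the following sketch (with made-up data) applies scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler to the same small feature matrix:

```python
# A minimal sketch of scikit-learn's preprocessing scalers on made-up data.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[0.2, 100.0],
              [0.9, 110.0],
              [0.5, 950.0]])

# Standardization: zero mean, unit variance per feature.
print(StandardScaler().fit_transform(X))

# Min-max scaling: each feature mapped to [0, 1].
print(MinMaxScaler().fit_transform(X))

# Robust scaling: uses the median and IQR, less sensitive to outliers.
print(RobustScaler().fit_transform(X))
```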

When to scale

As explained, scaling requires computing statistics (e.g., the mean, or the minimum and maximum) over the data. Therefore, to avoid data leakage, we fit the scaler AFTER the data split, using only the training set, so that we do not accidentally include information from future or unseen data during training. See also feature generation and stationary time series.
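
A minimal sketch of this workflow with scikit-learn (made-up data; StandardScaler stands in for whichever scaler is used):

```python
# A minimal sketch: fit the scaler on the training split only, then reuse it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)   # made-up feature matrix
y = np.random.rand(100)      # made-up target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics; no leakage
```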