
Data Scaling

Data scaling refers to the process of transforming the data to a common scale so that the model can learn the patterns more effectively. Models that rely on distance (or similarity), such as k-NN (k-Nearest Neighbors) and SVM (Support Vector Machine), as well as neural network models such as MLPs and LSTMs, require data scaling.

Models that compute distance

Let's consider k-NN, which finds the k nearest points to a query point. If the data is not scaled, the distance is dominated by the feature with the larger scale. For example, if one feature is in the range [0, 1] and another is in the range [0, 1000], the distance is almost entirely determined by the second feature. This is why we need to scale the data to bring all features onto the same scale.
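
To make this concrete, here is a minimal sketch (the two points and their values are made up for illustration) comparing the Euclidean distance between two points before and after min-max scaling:

```python
# A minimal sketch: Euclidean distance before and after min-max scaling.
# The feature ranges ([0, 1] and [0, 1000]) mirror the example above;
# the points themselves are made up for illustration.
import numpy as np

a = np.array([0.2, 100.0])   # feature 1 in [0, 1], feature 2 in [0, 1000]
b = np.array([0.9, 110.0])

# Unscaled: the second feature dominates the distance almost entirely.
print(np.linalg.norm(a - b))                 # ~10.02, driven by |100 - 110|

# After min-max scaling the second feature to [0, 1], both features contribute.
a_scaled = np.array([0.2, 100.0 / 1000.0])   # [0.2, 0.10]
b_scaled = np.array([0.9, 110.0 / 1000.0])   # [0.9, 0.11]
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.70, driven by the first feature
```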

Neural network models

There are a couple of reasons why neural networks require scaling:

- Features with very different scales make the loss surface ill-conditioned, so gradient descent converges slowly or unstably.
- Large input values can push activations into the saturated regions of functions such as sigmoid or tanh, which leads to vanishing gradients.

To help with this, deep learning libraries also provide normalization layers such as Batch Normalization and Layer Normalization.
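
As an illustration, the sketch below assumes PyTorch (the text does not name a specific library) and shows Batch Normalization inside a small MLP, plus Layer Normalization applied directly to an input batch:

```python
# A minimal sketch assuming PyTorch; the layer sizes are arbitrary.
import torch
import torch.nn as nn

x = torch.randn(32, 10)  # batch of 32 samples, 10 features

# Batch Normalization inside an MLP: normalizes each hidden unit across the batch.
mlp = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),  # normalize the 64 hidden activations per batch
    nn.ReLU(),
    nn.Linear(64, 1),
)
out = mlp(x)

# Layer Normalization: normalizes across the feature dimension of each sample.
layer_norm = nn.LayerNorm(10)
x_normed = layer_norm(x)
```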

Scaling methods

There are a couple of common scaling methods:

- Min-max scaling: maps each feature to a fixed range, typically [0, 1].
- Standardization (z-score scaling): transforms each feature to zero mean and unit variance.
- Robust scaling: uses the median and interquartile range, which makes it less sensitive to outliers.

These are well supported by ML libraries, e.g., scikit-learn's preprocessing module.
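
For example, the following sketch (with made-up data) applies scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler to the same small feature matrix:

```python
# A minimal sketch of scikit-learn's preprocessing scalers on made-up data.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[0.2, 100.0],
              [0.9, 110.0],
              [0.5, 950.0]])

# Standardization: zero mean, unit variance per feature.
print(StandardScaler().fit_transform(X))

# Min-max scaling: each feature mapped to [0, 1].
print(MinMaxScaler().fit_transform(X))

# Robust scaling: uses the median and IQR, less sensitive to outliers.
print(RobustScaler().fit_transform(X))
```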

When to scale

As explained, scaling requires computing statistics (e.g., the mean, or the minimum and maximum) over the data. Therefore, to avoid data leakage, we fit the scaler AFTER the data split, using only the training set, so that we do not accidentally include information from future or unseen data during training. See also feature generation and stationary time series.
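
A minimal sketch of this workflow with scikit-learn (made-up data; StandardScaler stands in for whichever scaler is used):

```python
# A minimal sketch: fit the scaler on the training split only, then reuse it.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)   # made-up feature matrix
y = np.random.rand(100)      # made-up target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics; no leakage
```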