Metrics

After training a model, we need to evaluate how well it performs, and metrics serve this purpose. Because optimization algorithms are usually formulated as minimization problems, the metrics described here are defined so that smaller values indicate a better-performing model.

Mean Squared Error

Mean Squared Error (MSE), also called L2 loss, is a common metric defined as $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points. For example, if the actual data is [1, 2, 3] and the prediction is [1, 2, 5], then MSE is $((1-1)^2 + (2-2)^2 + (3-5)^2)/3 = (0 + 0 + 4) / 3 = 1.333...$.
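The worked example above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `mean_squared_error` is chosen here for illustration and is not tied to any particular library.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


# Data from the example in the text: only the third prediction is off, by 2.
mse = mean_squared_error([1, 2, 3], [1, 2, 5])
print(mse)  # 4 / 3, roughly 1.333
```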

MSE is particularly useful because it is a convex function of the error, so optimization algorithms can find its minimum easily. The chart below shows $y=t^2$, the function applied to each error when computing MSE.

MSE penalizes large errors more heavily than small ones because each error is squared. If the actual value is 1 and the prediction is 10, the squared error is 81. If the actual value is 10 and the prediction is 11, the squared error is 1. Absolute errors of 9 and 1 become 81 and 1 after squaring, so the gap between them is exaggerated.
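The two cases above can be checked directly. The helper `squared_error` below is a hypothetical name used only for this sketch.

```python
def squared_error(actual, predicted):
    """Squared difference for a single data point."""
    return (actual - predicted) ** 2


large = squared_error(1, 10)   # absolute error of 9
small = squared_error(10, 11)  # absolute error of 1

# The absolute errors differ by a factor of 9; the squared errors by a factor of 81.
print(large, small, large / small)
```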

On one hand, this can be a problem because it makes MSE very sensitive to a few large errors. On the other hand, if large mistakes are especially costly, for example a badly wrong bitcoin price prediction, then it is a desirable property.

Root Mean Squared Error

Root Mean Squared Error (RMSE), defined as $\sqrt{\text{MSE}}$, addresses the problem that MSE is expressed in the square of the original data's unit, which makes it harder to interpret. Taking the square root puts the metric back on the same scale as the original data.
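As a minimal sketch, RMSE can be computed by taking the square root of the MSE. The function name `root_mean_squared_error` is an illustrative choice, and the data reuses the [1, 2, 3] versus [1, 2, 5] example from the MSE section.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, in the same unit as the data."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    n = len(actual)
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n
    return math.sqrt(mse)


rmse = root_mean_squared_error([1, 2, 3], [1, 2, 5])
print(rmse)  # sqrt(4 / 3), roughly 1.155
```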

Mean Absolute Error

Mean Absolute Error (MAE), also called L1 loss, is another common metric defined as $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points. For example, if the actual data is [1, 2, 3] and the prediction is [1, 2, 5], then MAE is $(|1-1| + |2-2| + |3-5|)/3 = (0 + 0 + 2) / 3 = 0.666...$. Below is the chart of $y=|t|$.

While fairly intuitive, MAE has the weakness that $|t|$ is not differentiable at $t=0$. This makes it harder for gradient-based optimization algorithms to find the minimum.
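The MAE example and the kink at 0 can both be sketched in Python. The names `mean_absolute_error` and `one_sided_slope` are hypothetical helpers introduced for this illustration. The second function estimates the slope of $|t|$ at 0 from each side; the two estimates disagree, which is what "not differentiable at 0" means in practice.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    n = len(actual)
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n


def one_sided_slope(h):
    """Difference quotient of |t| at t = 0 with step h (h may be negative)."""
    return (abs(0 + h) - abs(0)) / h


mae = mean_absolute_error([1, 2, 3], [1, 2, 5])
print(mae)  # 2 / 3, roughly 0.667

slope_right = one_sided_slope(1e-6)   # approaching 0 from the right
slope_left = one_sided_slope(-1e-6)   # approaching 0 from the left
print(slope_right, slope_left)        # the two slopes have opposite signs
```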

Huber Loss

Some loss functions address the non-differentiability of MAE while keeping the easy-to-optimize property of MSE. One of them is Huber loss, which for an error $t$ and a threshold $\delta$ is defined as $\begin{cases} \frac{1}{2} t^2 & \text{if } |t| \leq \delta, \\ \delta (|t| - \frac{1}{2} \delta) & \text{if } |t| > \delta \end{cases}$.
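The piecewise definition translates directly into code. The sketch below assumes $t$ is the error $y_i - \hat{y}_i$ for a single data point and uses $\delta = 1$ as the default; the names `huber` and `mean_huber_loss` are illustrative and not taken from any library.

```python
def huber(t, delta=1.0):
    """Huber loss for a single error t: quadratic near 0, linear beyond delta."""
    if abs(t) <= delta:
        return 0.5 * t * t
    return delta * (abs(t) - 0.5 * delta)


def mean_huber_loss(actual, predicted, delta=1.0):
    """Average Huber loss over all data points."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    n = len(actual)
    return sum(huber(y - y_hat, delta) for y, y_hat in zip(actual, predicted)) / n


print(huber(0.5))  # quadratic branch: 0.5 * 0.5**2 = 0.125
print(huber(3.0))  # linear branch: 1 * (3 - 0.5) = 2.5
print(mean_huber_loss([1, 2, 3], [1, 2, 5]))  # (0 + 0 + 1.5) / 3 = 0.5
```

The two branches agree at $|t| = \delta$ (both give $\frac{1}{2}\delta^2$), so the function has no jump where it switches from quadratic to linear.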

The following chart plots $y=\text{Huber}(t)$ with $\delta=1$.