Cross Validation

Cross validation is a technique to evaluate the performance of a model.

Data Split

When dealing with time series data, such as bitcoin prices, we evaluate the performance of a model by training on past data and testing on future data.

For example, a model is trained on the first 80% of the data and evaluated on the remaining 20%. It's important to note that the test data lies in the future relative to the training data, so that we evaluate the performance of the model on the unseen future.

[Diagram: the data is split chronologically into a training segment followed by a test segment.]
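A chronological split like this can be sketched in a few lines. The series below is a hypothetical stand-in for real price data; the 0.8 ratio matches the 80/20 split described above.

```python
import numpy as np

# Hypothetical time-ordered series (e.g. daily bitcoin closing prices).
prices = np.arange(100, dtype=float)

# Time-based split: the first 80% is training data, the last 20% is test data.
# No shuffling -- the test set must come strictly after the training set.
split = int(len(prices) * 0.8)
train, test = prices[:split], prices[split:]
```

Note that, unlike a random split used for non-temporal data, the order of the rows is preserved so the test set never leaks information from the "future" into training.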

Cross Validation

Cross validation extends the idea of the data split by performing the split repeatedly.

The downside of a single data split is that the evaluation rests on just one split. There is a chance that the model performs well (or poorly) purely by luck.

Cross validation addresses this problem by performing training and evaluation over multiple data splits. Below is a diagram showing the idea of cross validation. Since it splits the data 5 times, it's called 5-fold cross validation, or 5-CV for short.

[Diagram: five expanding-window splits, each with a training segment immediately followed by a test segment.]

In the first run, we use the first 16.7% (= 100% / (5 + 1)) of the data for training, and the immediately following 16.7% for testing. The rest of the data is unused. Likewise, in the second run, we use the first 33.3% (= 16.7% × 2) of the data for training, and the immediately following 16.7% for testing. We repeat this for a total of 5 rounds of training and evaluation. The final result is the average of the performance over those runs.
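The expanding-window scheme above can be sketched as follows. This is a minimal implementation assuming the data divides evenly into 6 equal blocks; scikit-learn's `TimeSeriesSplit` implements the same idea with more options.

```python
import numpy as np

def time_series_cv_splits(n_samples, n_folds=5):
    """Expanding-window cross-validation splits.

    The data is cut into n_folds + 1 equal blocks. Fold k trains on the
    first k blocks and tests on block k + 1, so the training window grows
    while the test window always lies immediately after it in time.
    """
    block = n_samples // (n_folds + 1)
    splits = []
    for k in range(1, n_folds + 1):
        train_idx = np.arange(0, k * block)
        test_idx = np.arange(k * block, (k + 1) * block)
        splits.append((train_idx, test_idx))
    return splits

# With 120 samples and 5 folds, each block is 20 samples:
# fold 1 trains on samples 0-19 and tests on 20-39,
# fold 2 trains on samples 0-39 and tests on 40-59, and so on.
splits = time_series_cv_splits(120, n_folds=5)
```

In practice, you would train and score the model once per fold and average the 5 scores to obtain the final result, as described above.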