Machine Learning is always an exciting topic to discuss. What’s not exciting about talking about how a system mines data to solve complex problems? Many people consider Cross-Validation to be one of the sexier corners of the field. (OK, maybe just me and a handful of others). Joking aside, this topic is key to obtaining a forecast prediction model that performs optimally, and hence generates the most accurate forecasts for your future Amazon sales orders and inventory levels.
A Short Glossary of Terms
For the uninitiated to ML Forecasting, the terms below will be helpful for understanding the process of Time Series Cross-Validation, which is covered further below (a short code sketch after the list shows what a few of these terms look like in practice):
- Time Series: A sequence of data points recorded at set time intervals and ordered chronologically. The obvious example here is the daily product sales of all listings by an Amazon seller.
- Training Data: The portion of a dataset (known as the in-sample portion) used to train statistical or machine learning models. The model “sees” and learns from this data during the training phase of the forecasting process. This data is strictly used to build the model, and part of that building process includes tuning the model to select the best hyperparameters.
- Test Data: The portion of a dataset (known as the out-of-sample portion) used to test statistical or machine learning models. Unlike the in-sample portion, the model never sees or learns from this data during the training phase. Keeping the test data hidden from the model ensures its predictions are evaluated without bias, and hence that the test results are trustworthy.
- Forecast Horizon: The number of observations, or, stated simply, the length of time into the future that a forecast model is intended to predict. This could be in days, weeks, months, etc.
- Cross-Validation (CV): A process in machine learning that involves splitting the data into multiple Training and Test sets in order to perform validation several times when building (and ultimately testing and selecting) the forecast prediction model.
- Model Overfitting: A problem that sometimes occurs in machine learning when a model is trained too well on a specific dataset but does not perform well when it encounters new data. To use a simple analogy, this would be like preparing for an exam by cramming only the answers to a single past multiple-choice exam instead of studying the course material, then walking into a new exam. If the same questions came up in the new exam, great for you, but if a whole new set of questions were presented, you would likely fail. In the same vein, overfitting causes models to generate good predictions when presented with data they already know, but not so much in the real world generally.
The Traditional Approach to Forecast Validation
A common approach many forecasters take to training and testing their models is to split the dataset just once into an 80-20 (or 70-30) split. An 80-20 split means the first 80% of the data is used to train and validate the model, and the remaining 20% is used to test and evaluate it. This is an unsophisticated method compared to the cross-validation approach defined in the glossary above (and explored further below).
Let’s define a really simple example where we have 16 observations of data, which, to put it in a context we understand, could represent 16 weeks of sales data from the Amazon Seller Central database. Let’s then imagine we wanted to forecast the subsequent 4 weeks of sales orders based on this data. In reality, you would more than likely have many more weeks, months or years’ worth of data for your ASIN.
A roughly 80-20 split would assign Weeks 1 - 12 (the first 12 data points) as Training Data and Weeks 13 - 16 (the last 4 points) as Test Data, as illustrated below:
Let’s take a closer look at the diagram above. In the 80-20 Split section, the first 12 weeks of sales data (the Training Data points) are used to train, tune and build a number of different forecasting models, which are in turn used to predict the last 4 weeks (the Test Data points). Since we have the data for all 16 weeks and know how many products/ASINs were actually sold in the last 4 weeks, we can calculate a Forecast Accuracy value using a chosen accuracy metric. There are several Forecast Accuracy metrics to choose from, but we will use the Root Mean Squared Error (RMSE) in our example. (Note to self to write a post specifically about forecast accuracy metrics in the future).
The final model is then chosen by comparing all the candidates and selecting the one with the lowest RMSE; the model with the least error should, in theory, be the most accurate and predictive model. You know, the one that’s closest to the mark.
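As a rough sketch of this single split and the RMSE-based selection, reusing the `sales` series from the earlier snippet: the two candidate “models” below are deliberately naive stand-ins, since the point here is the evaluation mechanics rather than the models themselves.

```python
import numpy as np

# Single chronological split: Weeks 1-12 to train, Weeks 13-16 to test
train = sales.iloc[:12]   # in-sample portion
test = sales.iloc[12:]    # out-of-sample portion (the 4-week horizon)

def rmse(actual, predicted):
    """Root Mean Squared Error: the square root of the average squared difference."""
    actual, predicted = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Candidate 1: repeat the last observed week across the horizon (naive forecast)
naive_forecast = np.repeat(train.iloc[-1], len(test))

# Candidate 2: extend the average week-on-week change seen in training (drift forecast)
slope = (train.iloc[-1] - train.iloc[0]) / (len(train) - 1)
drift_forecast = train.iloc[-1] + slope * np.arange(1, len(test) + 1)

scores = {
    "naive": rmse(test, naive_forecast),
    "drift": rmse(test, drift_forecast),
}
best_model = min(scores, key=scores.get)   # lowest RMSE wins
print(scores, "->", best_model)
```

In a real pipeline, the candidates would be properly trained statistical or ML models, but the selection step (score each one on the held-out weeks, keep the lowest RMSE) works the same way.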
What about Overfitting?
Splitting the data in the manner described above is meant to allocate enough data to training and validating the model (so it can pick up any patterns, trends or anomalies), while leaving enough data to test the model with and assess its predictive capability.
The problem with splitting the data just once is that you’re most likely not building a model that performs optimally when pitted against unseen data (i.e. your model could be overfitting). It may have learned the 80% it’s been trained on far too specifically to adapt to the range of possible outcomes in the unknown future. (Please refer to the definition of overfitting in the glossary if you need more detail or a refresher). In short, you’re not using the data you’ve got to its fullest predictive potential.
Where Time Series Cross-Validation Comes In
We already know cross-validation iterates through the process of training, tuning and testing several times. For Time Series data, which is ordered by time, a very specific method of cross-validation is used: Time Series Cross-Validation always respects the temporal order of the data. This means no split of the dataset ever trains a model on future data points and then tests it on data from the past.
Forward-chaining is the approach used here: each iteration includes a little more “past” data for training the model, while testing on the data that comes after it. To illustrate this, let’s consider another example below, where we have the same 16 weeks of actual sales data, but use Time Series Cross-Validation to build models that predict a Forecast Horizon of 4 weeks of sales.
Taking a closer look at this diagram, observe how the appropriate ML or statistical models are trained, tuned and tested over several iterations. Each model predicted 4 weeks into the future 5 times (once per CV Iteration), and each time it was evaluated by comparing its predictions against the actual sales. We then obtain an RMSE for each iteration, and calculate a single RMSE for each model by taking the average of all of them, i.e. (RMSE 1 + RMSE 2 + RMSE 3 + RMSE 4 + RMSE 5) / 5.
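Here is a minimal sketch of that forward-chaining loop, again reusing the `sales` series and `rmse` helper from the earlier snippets. The initial training window of 8 weeks and the one-week step between iterations are assumptions chosen so the 16-week example yields exactly 5 iterations, and the naive forecast once more stands in for the real candidate models.

```python
import numpy as np

horizon = 4       # each CV iteration forecasts the next 4 weeks
min_train = 8     # assumed size of the first training window

fold_rmses = []
# Forecast origins at Weeks 8, 9, 10, 11 and 12 -> 5 CV iterations
for origin in range(min_train, len(sales) - horizon + 1):
    cv_train = sales.iloc[:origin]                    # everything up to the origin (grows each iteration)
    cv_test = sales.iloc[origin:origin + horizon]     # the next 4 weeks, never seen during training

    # Stand-in "model": repeat the last observed week across the horizon
    forecast = np.repeat(cv_train.iloc[-1], horizon)

    fold_rmses.append(rmse(cv_test, forecast))

# One score per candidate model: the average RMSE across all 5 iterations
mean_rmse = float(np.mean(fold_rmses))
print(fold_rmses, "->", mean_rmse)
```

Repeating this loop for every candidate model and comparing the mean RMSEs gives the “lowest average error wins” selection described above.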
Just as in the 80-20 section above, we select the model with the lowest error. But this cross-validation approach used the data to its fullest potential and ensured a well-generalized predictive model was used to model future sales. No “cramming the answers for a single exam” this time.
As an aside, one more thing I’d like to mention is how even the first CV Iteration uses more than one or two data points for training. As you may have already guessed, this is because you need to observe something an adequate number of times to feel comfortable making predictions about it. Think of how you’d describe a good friend’s habits as opposed to those of someone you just met, for example.
Gain the Edge with Cross-Validation (TL;DR)
FarSight, Ronin’s proprietary forecasting engine, uses cross-validation (along with other Machine Learning techniques) to generate demand forecasts for all your ASINs/products as well as your FBA/Amazon Inventory Levels.
Is it a more computationally involved process? Yes. Does it require more time and computing power to crunch all the numbers? Yes. (Not to worry, all these calculations occur behind the scenes). Nevertheless, it’s all worth it because Ronin’s process:
- Maximizes Data Utilization: The process allows the platform to use as much of the available data as possible for both training and testing, maximizing the learning process and providing a more reliable estimate of the model’s accuracy.
- Prevents Overfitting: By using multiple subsets of the dataset, the process helps mitigate overfitting by building and evaluating the model over several different iterations. This ensures the model is not overly specialized to the training data and can perform well on diverse samples, ultimately enhancing model robustness.
- Facilitates Better Model Selection: As a result of taking several passes at training and testing, and getting a much more comprehensive evaluation of model accuracy, our platform selects the model best placed to perform in real-world settings.
- Improves Forecast Predictions: This is the ultimate goal. Cross-validation greatly reduces bias compared to a single-split process like the 80-20 or 70-30 split. The resulting model, which generalizes better to unseen data, generates the best Demand and Inventory Forecasts for your Amazon business.