June 28, 2021

How good will my forecast be?

One of the challenging things about using a machine learning system like Forecast Forge is learning how much you can trust the results. The system obviously isn't infallible, so it is important to have an idea of where it can go wrong, and how badly, when you are talking through a forecast with your boss or clients.

One way that people familiarise themselves with Forecast Forge is to run a few backtests. A backtest is where you make a forecast for something that has already happened, so that you can compare the forecast against the actual values.

For example, you might make a forecast for the first six months of 2021 using only data from 2020 and earlier. Then you can compare what the forecast said would happen against the real data.

[Diagram: a timeline from 2016 onwards. Use data from the earlier years to predict the most recent period that has already happened, then use the same methodology on the unknown period after 2020.]
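
You don't need anything fancy to run a backtest; the sketch below shows the idea in Python with pandas, using made-up daily data and a seasonal-naive baseline (repeat the same dates from one year earlier) as a stand-in for whatever method you are really testing. Forecast Forge does all of this inside a spreadsheet, so treat this purely as an illustration of the split-and-compare idea.

    # A minimal backtest sketch. The data below is made up and the "forecast" is
    # a simple seasonal-naive baseline standing in for whatever method you are
    # actually testing; the point is the cutoff and the comparison with actuals.
    import numpy as np
    import pandas as pd

    dates = pd.date_range("2016-01-01", "2021-06-30", freq="D")
    rng = np.random.default_rng(0)
    sessions = pd.Series(1000 + 200 * np.sin(2 * np.pi * dates.dayofyear / 365)
                         + rng.normal(0, 50, len(dates)), index=dates)

    cutoff = "2020-12-31"                # pretend "today" is the end of 2020
    train = sessions[:cutoff]            # the only data the forecast may see
    actual = sessions[cutoff:].iloc[1:]  # what really happened afterwards

    # Seasonal-naive stand-in forecast: same calendar dates, one year earlier.
    forecast = train.reindex(actual.index - pd.DateOffset(years=1)).to_numpy()

    # Compare the forecast against the actual values with a simple error metric.
    mape = np.mean(np.abs((actual.to_numpy() - forecast) / actual.to_numpy())) * 100
    print(f"Backtest MAPE: {mape:.1f}%")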

Sometimes when you do this you will see the forecast do something dumb and you will think “stupid forecast. That obviously can’t be right because we launched the new range then” or something like that - the reason will probably be very specific to your industry or website. If you can convert your reasoning here into a regressor column then you can use this to make the forecast better.

[Figure: backtesting several different forecasting methods with different transforms and regressors]

Once you start doing this, there is a temptation to keep adding and adjusting regressor columns to make your backtest fit as well as possible.

Figuring these things out and seeing your error metrics improve is one of the most fun parts of data science (and have no doubt that, even if you're doing it in a spreadsheet, this is data science), but it also opens the door to the bad practice of overfitting. If you overfit, the forecast will perform a lot better on a backtest than it will in the future.

An extreme example of this would be making a revenue forecast and, for your backtest, using the actual revenue as a regressor column. In this case Forecast Forge would learn to use that column to make an extremely good prediction! But when it came to making a real forecast you would have to use a different method to fill in the future values of the regressor column, and then either your real forecast would be rubbish or you would have figured out a way to forecast revenue without using Forecast Forge - good for you!
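
To see just how misleading this kind of leak is, here is a deliberately silly sketch with made-up numbers: if the actual revenue is sitting in a regressor column, a model only has to copy that column to produce a flawless-looking backtest.

    # Exaggerated leakage demo with made-up numbers: the "regressor" is the
    # actual revenue itself, so copying it gives a perfect backtest score that
    # tells you nothing about real forecasting performance.
    import numpy as np

    rng = np.random.default_rng(2)
    actual_revenue = 5000 + rng.normal(0, 500, 180)  # 180 days of backtest actuals
    leaked_regressor = actual_revenue                # the column you should not have

    forecast = leaked_regressor                      # the "model" just copies it
    mape = np.mean(np.abs((actual_revenue - forecast) / actual_revenue)) * 100
    print(f"Backtest MAPE: {mape:.1f}%")             # 0.0% - too good to be true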

There are a few different techniques you can use to help avoid this trap and to estimate how good your forecast will be in the real world outside of a backtest.

1. Separate test and validation

The idea here is that you test out different regressors and transforms to see which performs well on your test data. Then, once you have picked the best, you run a final test on validation data to decide whether or not to proceed with using the forecast.

For example, if you want to make a 6 month forecast you could use data up until July 2020 to make a forecast through to December 2020. July to December 2020 is your test data, where you can play around with different regressors and transforms to see what gives you the best result. Then use the same method with data through to December 2020 to make a forecast for January 2021 to June 2021. This is your validation data, where you check that your forecasting method gives good enough results on out-of-sample data; this gives a "best guess" estimate of how well the forecast will perform on unseen data. For this reason you must not peek at the validation data before the final test. If you do, then it isn't really validation but just a poorly used form of test data.
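
Here is a rough sketch of that split in Python, with made-up data and two toy baseline methods standing in for the different regressor and transform combinations you might try in Forecast Forge. The important part is the discipline: the test period is used to pick a method, and the validation period is only looked at once, right at the end.

    # Train / test / validation split, sketched with made-up data and two toy
    # baseline "methods" standing in for different forecasting set-ups.
    import numpy as np
    import pandas as pd

    dates = pd.date_range("2016-01-01", "2021-06-30", freq="D")
    rng = np.random.default_rng(0)
    sessions = pd.Series(1000 + 200 * np.sin(2 * np.pi * dates.dayofyear / 365)
                         + rng.normal(0, 50, len(dates)), index=dates)

    def mape(actual, forecast):
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return np.mean(np.abs((actual - forecast) / actual)) * 100

    # Two candidate "methods"; in practice these would be different regressor
    # columns or transforms rather than different baselines.
    def same_period_last_year(train, index):
        return train.reindex(index - pd.DateOffset(years=1)).to_numpy()

    def mean_of_last_28_days(train, index):
        return np.full(len(index), train.iloc[-28:].mean())

    methods = {"same period last year": same_period_last_year,
               "mean of last 28 days": mean_of_last_28_days}

    train = sessions[:"2020-06-30"]                   # data for building models
    test = sessions["2020-07-01":"2020-12-31"]        # tune and compare methods here
    validation = sessions["2021-01-01":"2021-06-30"]  # look at this exactly once

    # 1. Pick the best method using the test period only.
    test_scores = {name: mape(test, fn(train, test.index)) for name, fn in methods.items()}
    best = min(test_scores, key=test_scores.get)

    # 2. Refit on everything up to the end of 2020 and check the winner once.
    final_train = sessions[:"2020-12-31"]
    val_score = mape(validation, methods[best](final_train, validation.index))
    print(f"Best on test: {best} ({test_scores[best]:.1f}% MAPE); validation MAPE: {val_score:.1f}%")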

This is a good method but it does have two main flaws:

  1. For longer forecasts you end up needing a lot of training data. For example, if you want to make a 12 month forecast then you'd use June 2020 to June 2021 as your validation data, June 2019 to June 2020 as your test data, and then you'd still need at least a couple of years of training data before that for you to use when you're trying to figure out the best model. So, in this example, you'd need good quality data going back to at least June 2017, and June 2016 would be even better. This is a long time to have a consistent, non-broken analytics implementation; according to surveys done by Dipesh Shah, fewer than 20% of businesses have this data (see a case study on how he uses Forecast Forge with one of his ecommerce clients).
  2. At the moment people care a lot about how well their forecasts model things like the effect of the covid-19 pandemic. But it is likely that all or most of the pandemic will occur only within your testing and validation periods so the model will perform very poorly during testing because there is no pandemic data during the training period. There is no way around this because the validation period has to be after the test period and the test period has to come after the training data.

2. Make models for similar things at the same time

If you have to make forecasts for a variety of similar things (e.g. similar brands in one market or the same brand in different countries) then you can use some of them for testing and improving your methodology and some of them for validating how well you expect it to perform on unseen data.

For example, if you are forecasting for five similar brands (e.g. they are different faces for the same conglomerate) you can find the forecasting method that works best with three of them and then save the final two for validation - checking how well your method is likely to perform on unseen data.
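
As a sketch of what that split might look like (with invented brand names and data, and the same toy baselines from the earlier sketch standing in for real forecasting methods): three brands are used to choose the method, and the other two are only used to check how well that choice holds up.

    # Pick a method on three brands, validate the choice on two held-out brands.
    # The brand names, data and baseline "methods" are all made up for illustration.
    import numpy as np
    import pandas as pd

    def mape(actual, forecast):
        actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
        return np.mean(np.abs((actual - forecast) / actual)) * 100

    def same_period_last_year(train, index):
        return train.reindex(index - pd.DateOffset(years=1)).to_numpy()

    def mean_of_last_28_days(train, index):
        return np.full(len(index), train.iloc[-28:].mean())

    methods = {"same period last year": same_period_last_year,
               "mean of last 28 days": mean_of_last_28_days}

    # Five similar brands: same seasonality, different scale, different noise.
    dates = pd.date_range("2018-01-01", "2021-06-30", freq="D")
    rng = np.random.default_rng(1)
    brands = {f"brand_{c}": pd.Series(scale * (1000 + 200 * np.sin(2 * np.pi * dates.dayofyear / 365))
                                      + rng.normal(0, 50 * scale, len(dates)), index=dates)
              for c, scale in zip("ABCDE", [1.0, 0.8, 1.3, 0.6, 1.1])}

    def backtest(series, method, cutoff="2020-12-31"):
        train, actual = series[:cutoff], series[cutoff:].iloc[1:]
        return mape(actual, method(train, actual.index))

    tuning = ["brand_A", "brand_B", "brand_C"]   # choose the method on these
    holdout = ["brand_D", "brand_E"]             # only used to validate the choice

    avg_scores = {name: np.mean([backtest(brands[b], fn) for b in tuning])
                  for name, fn in methods.items()}
    best = min(avg_scores, key=avg_scores.get)

    holdout_score = np.mean([backtest(brands[b], methods[best]) for b in holdout])
    print(f"Chosen method: {best}; average MAPE on held-out brands: {holdout_score:.1f}%")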

This technique only works well if the things you are forecasting are similar enough that the same method will work well for all of them. For example, if you try this with brands that are completely different, then the performance of a forecast for brand A will not tell you very much about how the same method will perform for brand C.

Whether things are “similar enough” for this technique to work is quite a tough question to answer. In many ways the typical use case for Forecast Forge makes this even harder; if you were trying to forecast 1000 metrics then you'd code the model in Python or R, and it would be obvious that you wouldn't have time to customise each model, so you'd just use whichever method performed best on average. But part of the point of Forecast Forge is to make it easier for people to customise their forecasts, and Forecast Forge users tend to be making fewer than ten forecasts at once, so the temptation to customise and overfit forecasts is a lot harder to avoid.

Both of these techniques rely on the patterns and features that the Forecast Forge algorithm learned from the past continuing into the future. As we all saw in 2020, sometimes crazy things happen, and then your forecast will be rubbish unless you knew about the craziness in advance.

