At the end of the last tutorial we had made a forecast for the daily number of pageviews for the Easter page on wikipedia.
There are two main problems with this forecast:
The two methods I will teach you to help solve these problems are:
Without Data Transform ┌──────────────────┐ │ Data │ └──────────────────┘ │ │ ▼ ┌──────────────────┐ │ Machine Learning │ └──────────────────┘ │ │ ▼ ┌──────────────────┐ │ Forecast │ └──────────────────┘
With Data Transform ┌───────────────────┐ │ Data │ └───────────────────┘ │ │ ▼ ┌───────────────────┐ │ Transform │ └───────────────────┘ │ │ ▼ ┌───────────────────┐ │ Machine Learning │ └───────────────────┘ │ │ ▼ ┌───────────────────┐ │ Inverse transform │ └───────────────────┘ │ │ ▼ ┌───────────────────┐ │ Forecast │ └───────────────────┘
The idea here is that you do some kind of transformation of the data before it reaches the forecasting algorithm. Then the forecast produced by the algorithm predicts what the transformed values will be in the future. You need to do the inverse transformation to get the correct predictions at the end.
A simple example might make this clearer. Suppose you perform the transformation
+5 on every data point before you send the data to the forecast algorithm.
Then the predictions from the algorithm will be higher than what they would be if
the algorithm was trained on the original data. You need to perform the inverse
transformation, in this case
-5, to turn the forecast output into a
Any invertable function you care to choose could be used as a transformation. But there are some commonly used transformations and one of these will be helpful in most cases.
|Logarithmic||Take the logarithm of the data. The inverse is exponentiating.
For example, in Google Sheets
|Square Root||Take the square root of the data. The inverse is squaring. For example,
|Box-Cox||This is a more general transform for which the above are special cases. You can use
it through the
|Logit||The above transforms “squash” down from the top. The logit squeezes from both
sides. Another difference is that the results are bounded above and below.
This is very useful if you have a fixed range which you know the forecasted values
must lie in; for example, a conversion rate must be between 0% and 100%. You can use
this transform with the
The idea is to pick a transform where the algorithm can make better predictions on the transformed data than it can on the raw data.
For example, machine learning algorithms generally are not good at dealing with data with outliers; this is what happens every Easter with our pageview dataset. A logarithmic transform will “squash” the outliers in more than the rest of the data so the size of the outliers will be smaller. This helps the algorithm make better predictions.
In this case, another advantage of a logarithmic transform is that the inverse transform (exponentiation) always outputs a positive value. This means we will never predict a negative pageview count again.
Start by making sure you have enough columns in the spreadsheet. You will need at least 11 columns (up to column K) to fit all the steps in.
LOG10 function to populate a new column with the log transformed data.
As always, it is worth plotting the transformed data to see what the transformation has done.
There are still big spikes around Easter Sunday but these are much less outlying than in the untransformed data that we started with.
This will make it easier for the machine learning algorithm to learn the historical trends which is uses to make predictions for the future.
FORGE_FORECAST function in a similar way to last time. Keep the
Date column the same, but change the other column to be the transformed data.
In this example the formula is:
because column G is where the transformed data is.
Remember that you need to run the inverse transform on the forecast output in order to make your pageview predictions.
Do this with the
Comparing our new forecast with the old and with what happened in Easter 2020 shows that there is still a long way to go before we have a forecast that looks right.
The log transform has help with the forecast for the rest of the year; the weekly spikes that you can see on the red line are muted on the yellow, and it never forecasts a negative number. But the forecasted traffic levels around Easter are much lower.
I will show you how to fix these problems in the next section on Regressors
A regressor is a column of extra data which helps the machine learning algorithm make its predictions.
A good regressor has the following properties:
A good example of a regressor is paid search budgets when you are forecasting paid search clicks, conversions or revenue. The amount you spend is quite well correlated with the ouput and you have a good idea what your budgets will be in the future.
You can also create regressors for special events like sales or the launch of new product lines; as long as you know when these things occured in the past and when they will occur in the future.
I will teach you this approach to help improve our Easter forecast.
┌──────────────────┐ │ Data │ └──────────────────┘ │ │ ▼ ┌────────────┐ ┌──────────────────┐ │ Regressors │ ──▶ │ Machine Learning │ └────────────┘ └──────────────────┘ │ │ ▼ ┌──────────────────┐ │ Forecast │ └──────────────────┘
Start with the simplest thing that might work.
Here I have created a new column with a zero on every day except for Easter Sunday where the value is one.
This tells the algorithm that there is something very special about Easter Sunday so it should predict differently for 4th April 2021.
To use a regressor in the
FORGE_FORECAST custom function just insert
the regressor range as the third argument.
In the example spreadsheet, the Easter Sunday regressor column is in column M
so the formula is:
NB: we are still using the log transformed values.
The chart above shows a big improvement in the forecast; it looks much more like the trend we have seen in past years.
The forecast algorithm automatically looks for annual and weekly trends but Easter is confusing because Easter Sundays are not 365 days apart. By adding the regressor column we have given the algorithm enough of a hint to figure out what is going on.
Comparing what our forecast looks like with the trend in previous years it looks like the algorithm is not really learning about the uptick in pageviews during Lent.
Next I will add extra regressor columns to help the algorithm with this.
There are a few different ways you could do this. I will do the following and let the algorithm figure out which one (or which combination) of them it uses:
Using multiple regressors is very similar to working with only one regressor column; just use a range when calling the function.
In this example, the regressors are in columns S, T, U and V so the
It is the range
S2:V2234 which specifies the regressors.
Running the forecast and doing the inverse transform on the result ends up looking like this:
All this must look like quite a lot of work.
It is probably worth it for a business where Easter is an extremely important time of year, but to have to go through this for every movable feast (Thanksgiving and Black Friday are also not on a 365 day cycle!) doesn’t look worth it.
Luckily there is an easier way!
FORGE_FORECAST function you can specify a country code
which will automatically add regressors for holidays that are celebrated in
In this example the formula is
Notice that the regressors have been left blank (the double comma
You can still use regressors here if you want to.
The results are OK:
The auto holiday feature works well in general but, as you can see in the plot above, it doesn’t quite capture the crazy effect of Easter Sunday on the pageviews of the Easter page.
If are making a forecast for a similar metric, where one part of the year is way more important than the others, then I advise you to use custom regressors as in the examples above.
Do you have any more questions about how to get the most out of your data?Ask Me!