Anomaly Detection is identifying “anomalous” data.
Usually it means using machine learning to identify outliers but you can do anomaly detection with any v1.0 human eyeball too.
Aside from all the complexities of machine learning an important difficulty is knowing what an anomaly is. Sometimes it is entirely expected to have some data that is very different to the rest.
Forecast Forge is focussed on timeseries data; data where each observation is associated with a specific time. An anomaly is then where an observation is very different from what we would expect given the trend and seasonality in the data.
Because Forecast Forge works as part of Google Sheets you can include Forecast Forge forecasts in your Data Studio reports. And with addons like the official Google Analytics spreadsheet addon you can schedule these reports to update and run automatically.
First you will need to get setup with the Google Analytics spreadsheet addon. You can do this easily be following the instructions.
Once you have the addon installed you can create a report in the sidebar:
For Forecast Forge you must have
date as a dimension. If you add other dimensions you’ll have to do some funky spreadsheet wizardry to get everything formatted correctly but I’m sure that is possible.
Once you’ve created the report the addon will create a new sheet called “Report Configuration” in your spreadsheet that looks a bit like this:
There are a few edits you will probably want to make to this sheet:
The start date can be a specific date (e.g.
N is a positive integer. You can also use Sheets formula to configure the date if you want (e.g.
Using a fixed start date vs. a variable start date changes how the forecast can be used. I’ll start by showing you a variable start date example in this tutorial.
When you have finished the configuration, click “Run Report” in the menu to pull the data into Google Sheets.
The addon will create another new sheet and you should see it populated with some lovely Google Analytics data.
If you scroll to the bottom of the sheet you will notice that the date fills the sheet right to the end.
To keep things clean and separate I prefer to make the forecast in a separate sheet.
Because the Google Analytics data will always contain the same number of days (this will get a lot more complicated if you add other dimensions!) it is easy to pull the dates and metrics into another sheet:
Then extend the dates column into the future for as many days as you want to forecasts:
Then make your forecast. For the scheduled updates to work you must use the custom functions for this rather than the sidebar. I suggest you follow my forecasting workflow to figure out what makes the best forecast for your data and then implement it using the custom functions.
When the forecast has run it should look something like this:
Label the forecast columns in row 1 if you are later going to import this into Data Studio.
Next setup the scheduled update:
Running hourly won’t help you very much but you can select daily, weekly or monthly for automatic updates.
In Data Studio you need to setup your Google Sheet as a new data source:
And then add the columns from the sheets as metrics for display:
Make sure you add the metric (in this case
Sessions) and the forecast (
Forecast). You can add the upper and lower bounds too if you want them included in your report.
I like to restrict the date range of the Data Studio report so that the viewer doesn’t have to see all the training data; they can just see the most relevant parts. Remember to extend this date range into the future to include the forecast.
Using this you will be able to produce tables and charts that contain both historical data and the forecast from your Google Sheet. These will update when the Google Sheet updates on the schedule you specified earlier.Read more
The early tutorials have shown you how to make forecasts using the
FORGE_FORECAST function. You can also make forecasts by using the sidebar; this tutorial will show you how to do this and some of the extra features you can use when making a forecast this way.
You can watch this video for a quick demo of how things work and bit of explanation. Or read on below…
The first thing you will have to do is open the sidebar if it isn’t open already.
After a short loading period you should see it appear on the right of your screen:Read more
If you have enough data then the Forecast Forge addon will estimate how accurate your forecast is likely to be.
We don’t know what will happen in the future so it is impossible to be certain how good or bad your forecast will be. But we can use the same forecasting algorithm to make a forecast for the recent past and then compare how accurate that forecast is against what actually happened.
For example, you might pretend you don’t know what happened between April 2019 and April 2020 (and I think we’d all like to imagine this didn’t happen at all!) and use the data from April 2017 to March 2019 to feed into the forecasting algorithm.
Then you can compare the results of this forecast with the actual data for 2019/20 to see how good the forecasting algorithm is at predicting with your data.
You have this data 2018 2019 2020 /----------/-----------/----- |~~~~~~ And you want to forecast this Use this data 2018 2019 2020 /----------/-----------/----- |~~~~~~ To forecast this
The Forecast Forge addon shows you four different ways of measuring the error. They are each useful in different circumstances.
Every error metric is based on the daily errors; the difference between the actual value and the forecast value for each day in the forecast.
Take all the error values and find the mean.
This is the simplest error metric but it doesn’t always tell you the full story because positive and negative errors (where the forecast over- and under-estimates) can cancel each other out.
The main thing the Mean Error tells you is whether the forecast tends to overestimate (positive error) or underestimate (negative error).
Take the absolute value of the errors (i.e. make them all positive) and then find the mean.
This fixes the problem with Mean Error described above.
Square all the error values, find the mean of this and then take the square root.
This is a very commonly used error metric in machine learning. I strongly suggest you try to minimise this error when working to improve your forecasts unless you have a very good reason not to.
However, this can be a bit harder to understand than the other error metrics so once you have your model figured out you can report MAE or MAPE to your clients who aren’t elbows deep in forecasting.
Find the error values as a percentage, take the absolute value and then calculate the mean of this.
This is a very useful error metric because it is a percentage; it doesn’t matter what scale the values being forecast are.
For example, imagine I tell you that I’ve made a forecast for average order value (AOV) and that my MAE is
15. Is this good or bad?
It is impossible to say without knowing more about the average order value. If it is very high (e.g. over
15 is quite good. If it is very low (e.g.
15 is very bad!
But if you have a MAPE of
10% then you don’t need to know how big or small the AOV is to assess how much of a problem the error might be.
For more detail on running backtests manually or using other error metrics read the Backtesting Forecasts to Estimate Future Accuracy post.
As with just about everything, it’s a bit more complicated than that!
The Logit Transform is most useful when the metric you are forecasting has both a ceiling and a floor. For example a forecast for a conversion rate must be between 0% and 100%. Or, the number of users for a site must be between 0 and the total population of the world.
That last one is probably only a concern for Google and Facebook!
In Search you might use this if you have an idea of how many searches are going to be done through the year; the number of impressions you get can’t be higher than this number and it can’t be lower than 0. NB in this example the ceiling cap is variable; the number of searches isn’t the same every day.
For this example I will, again, show you something with Wikipedia pageview data.
This Google Sheet has three columns of data:
The proportion is what we are interested in here; we know that this can never be less than 0% or more than 100%.Read more