November 23, 2021
# All I Overfitting for Overfitting is Overfitting

### Never miss a post

A couple of weeks ago I used Forecast Forge to build a forecast for the number of pageviews of the “All I Want for Christmas is You” wikipedia page in the run-up to Christmas.

pageviews for the “All I Want For Christmas is You” wikipedia page will peak on 24th December close to 30k

This forecast “looks” good - it has the peak roughly when we’d expect it and the magnitude is comparable with previous years. It passes the smell test, but I am still concerned about whether it predicts the size of the peak correctly. As you can see from the chart, the peak in 2019 was way higher than any of the preceeding years and getting this right is the most important part of my prediction. How can we tell in advance what kind of peak there is going to be?

My hypothesis is that the peak in 2019 was driven by the re-release of the *Merry Christmas* album; something that would have been known about on the 8th November 2019 when the corresponding 2019 forecast would have been made. If I was a Mariah Carey expert I would be able to include broader information like this in a regression column; probably by giving each December a mark out of 10 for how much Mariahhad going on that year.

But I am not a Mariah Carey expert so I don’t trust my intuition on this; I need some hard data to work with here that might give me an inkling of whether or not she is having a big year.

The first idea that sprang to mind was that it might be possible to use traffic data from other wikipedia pages for this. The challenge is then to find which pages are a leading indicator of Christmas traffic for “All I Want for Christmas is You”.

Doing this analysis for *all* pages on Wikipedia would make the challenge more about dealing with big data so instead I first need to select a subset of pages for first analysis. All the pages that link to “All I Want for Christmas is You” seems like a good starting point. You can see a list of these using the linksto: operator and download the pageview data from the Massviews tool.

The rest of this post will be me analysing this data and trying (failing!) to use it to make a better forecast.

You can download the Jupyter Notebook from here. All the code an explanations are also included below.

First let’s load the data into python

```
import pandas as pd
= pd.read_csv("https://www.forecastforge.com/files/massviews-20150701-20211116.csv",
inlinks # Need to set this otherwise cloudflare (I think?) 403's the pandas request
= {'User-Agent': 'Pandas!'})
storage_options inlinks
```

Title | 2015-07-01 | 2015-07-02 | 2015-07-03 | 2015-07-04 | 2015-07-05 | 2015-07-06 | 2015-07-07 | 2015-07-08 | 2015-07-09 | … | 2021-11-07 | 2021-11-08 | 2021-11-09 | 2021-11-10 | 2021-11-11 | 2021-11-12 | 2021-11-13 | 2021-11-14 | 2021-11-15 | 2021-11-16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | Justin Bieber | 11212 | 11419 | 10973 | 10816 | 10737 | 11675 | 14246 | 13780 | 12273 | … | 13804 | 13002 | 12540 | 13343 | 14681 | 13842 | 12980 | 13267 | 14624.0 | 14028.0 |

1 | Millie Bobby Brown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 14229 | 11248 | 9497 | 8314 | 8197 | 8453 | 12435 | 11840 | 9445.0 | 9979.0 |

2 | Mariah Carey | 9832 | 14382 | 10988 | 9595 | 9428 | 9329 | 8735 | 13985 | 10900 | … | 16638 | 14650 | 13315 | 13851 | 15706 | 14629 | 12662 | 13895 | 13326.0 | 10419.0 |

3 | Fifth Harmony | 11649 | 11674 | 10864 | 10733 | 11292 | 12147 | 12247 | 12316 | 11701 | … | 2385 | 2071 | 2169 | 2334 | 2098 | 2151 | 2244 | 2091 | 2104.0 | 1918.0 |

4 | Jenna Dewan | 13182 | 26490 | 21592 | 16881 | 14074 | 16984 | 12811 | 11701 | 10932 | … | 3598 | 4857 | 3929 | 3307 | 2906 | 2582 | 2590 | 3292 | 5002.0 | 4402.0 |

… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |

628 | Ngayong Pasko | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 4 | 1 | 0 | 4 | 2 | 0 | 2 | 1.0 | 1.0 |

629 | Česko Slovenská SuperStar (season 4) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 4 | 8 | 3 | 4 | 2 | 2 | 0 | 2 | 1.0 | 7.0 |

630 | Mr. Vocalist X’Mas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 3 | 2 | 2 | 0 | 0 | 1 | 3 | 0 | 3.0 | 1.0 |

631 | 24 Hours (Agnes song) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 19 | 14 | 15 | 10 | 9 | 9 | 8 | 6 | 14.0 | 4.0 |

632 | We Were Born for This | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 3 | 1 | 4 | 4 | 3 | 4 | 6 | 1 | 3.0 | 12.0 |

633 rows × 2332 columns

This looks OK, but I would prefer it if each page was a separate column with a row for each day

```
= inlinks.transpose()
inlinks ## Column names are in the first row
= inlinks.iloc[0]
inlinks.columns ## Delete the first row
= inlinks.drop(["Title"])
inlinks inlinks
```

Title | Justin Bieber | Millie Bobby Brown | Mariah Carey | Fifth Harmony | Jenna Dewan | Love Actually | Michael Bublé | Despacito | Tori Kelly | Hallelujah (Leonard Cohen song) | … | Colleen Madden | List of Tawag ng Tanghalan finalists (season 3) | List of top 10 singles for 2018 in Australia | The Masked Singer (Bulgarian season 1) | List of Billboard Global 200 number ones of 2021 | Ngayong Pasko | Česko Slovenská SuperStar (season 4) | Mr. Vocalist X’Mas | 24 Hours (Agnes song) | We Were Born for This |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

2015-07-01 | 11212 | 0 | 9832 | 11649 | 13182 | 2014 | 2129 | 0 | 12672 | 2091 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-02 | 11419 | 0 | 14382 | 11674 | 26490 | 2409 | 2628 | 0 | 10764 | 2239 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-03 | 10973 | 0 | 10988 | 10864 | 21592 | 2532 | 2350 | 0 | 8501 | 2290 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-04 | 10816 | 0 | 9595 | 10733 | 16881 | 2514 | 2195 | 0 | 7252 | 2643 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-05 | 10737 | 0 | 9428 | 11292 | 14074 | 2445 | 2344 | 0 | 6113 | 2675 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |

2021-11-12 | 13842 | 8453 | 14629 | 2151 | 2582 | 4978 | 7419 | 1089 | 1088 | 1801 | … | 0 | 0 | 1 | 21 | 73 | 2 | 2 | 1 | 9 | 4 |

2021-11-13 | 12980 | 12435 | 12662 | 2244 | 2590 | 6175 | 4041 | 1181 | 1272 | 1947 | … | 1 | 1 | 1 | 27 | 74 | 0 | 0 | 3 | 8 | 6 |

2021-11-14 | 13267 | 11840 | 13895 | 2091 | 3292 | 7413 | 3608 | 971 | 2016 | 2002 | … | 0 | 3 | 4 | 15 | 61 | 2 | 2 | 0 | 6 | 1 |

2021-11-15 | 14624.0 | 9445.0 | 13326.0 | 2104.0 | 5002.0 | 6195.0 | 3051.0 | 1066.0 | 1057.0 | 1709.0 | … | 0.0 | 3.0 | 2.0 | 4.0 | 76.0 | 1.0 | 1.0 | 3.0 | 14.0 | 3.0 |

2021-11-16 | 14028.0 | 9979.0 | 10419.0 | 1918.0 | 4402.0 | 6276.0 | 3255.0 | 960.0 | 1225.0 | 1524.0 | … | 1.0 | NaN | 1.0 | 13.0 | 67.0 | 1.0 | 7.0 | 1.0 | 4.0 | 12.0 |

2331 rows × 633 columns

Much better!

Now load the “All I Want for Christmas is You” pageview data

```
= pd.read_csv("https://www.forecastforge.com/files/pageviews-20150701-20211108.csv",
alliwant = {'User-Agent': 'Pandas!'})
storage_options alliwant
```

Date | All I Want for Christmas Is You | |
---|---|---|

0 | 2015-07-01 | 189 |

1 | 2015-07-02 | 248 |

2 | 2015-07-03 | 179 |

3 | 2015-07-04 | 208 |

4 | 2015-07-05 | 245 |

… | … | … |

2318 | 2021-11-04 | 4103 |

2319 | 2021-11-05 | 3556 |

2320 | 2021-11-06 | 3192 |

2321 | 2021-11-07 | 3046 |

2322 | 2021-11-08 | 2760 |

2323 rows × 2 columns

Match the timeseries up by joining the dataframes together

```
## Default is inner join so rows match up
= alliwant.join(inlinks, on="Date")
all_series = all_series.set_index("Date")
all_series all_series
```

All I Want for Christmas Is You | Justin Bieber | Millie Bobby Brown | Mariah Carey | Fifth Harmony | Jenna Dewan | Love Actually | Michael Bublé | Despacito | Tori Kelly | … | Colleen Madden | List of Tawag ng Tanghalan finalists (season 3) | List of top 10 singles for 2018 in Australia | The Masked Singer (Bulgarian season 1) | List of Billboard Global 200 number ones of 2021 | Ngayong Pasko | Česko Slovenská SuperStar (season 4) | Mr. Vocalist X’Mas | 24 Hours (Agnes song) | We Were Born for This | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Date | |||||||||||||||||||||

2015-07-01 | 189 | 11212 | 0 | 9832 | 11649 | 13182 | 2014 | 2129 | 0 | 12672 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-02 | 248 | 11419 | 0 | 14382 | 11674 | 26490 | 2409 | 2628 | 0 | 10764 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-03 | 179 | 10973 | 0 | 10988 | 10864 | 21592 | 2532 | 2350 | 0 | 8501 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-04 | 208 | 10816 | 0 | 9595 | 10733 | 16881 | 2514 | 2195 | 0 | 7252 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2015-07-05 | 245 | 10737 | 0 | 9428 | 11292 | 14074 | 2445 | 2344 | 0 | 6113 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |

2021-11-04 | 4103 | 12832 | 8495 | 18005 | 2393 | 3147 | 2783 | 2875 | 973 | 1525 | … | 0 | 1 | 1 | 5 | 0 | 0 | 2 | 1 | 6 | 4 |

2021-11-05 | 3556 | 12473 | 7724 | 18080 | 2206 | 2789 | 4056 | 2925 | 992 | 1341 | … | 3 | 0 | 1 | 24 | 0 | 2 | 4 | 4 | 13 | 2 |

2021-11-06 | 3192 | 13273 | 11944 | 16802 | 2231 | 3408 | 5880 | 2740 | 1067 | 1322 | … | 0 | 2 | 0 | 27 | 42 | 3 | 7 | 5 | 15 | 5 |

2021-11-07 | 3046 | 13804 | 14229 | 16638 | 2385 | 3598 | 6717 | 2705 | 1194 | 1269 | … | 1 | 4 | 2 | 20 | 65 | 1 | 4 | 3 | 19 | 3 |

2021-11-08 | 2760 | 13002 | 11248 | 14650 | 2071 | 4857 | 5436 | 2427 | 956 | 1073 | … | 4 | 1 | 2 | 24 | 111 | 4 | 8 | 2 | 14 | 1 |

2323 rows × 634 columns

Eventually I want to run some kind of regression on the lagged timeseries; and I want to easily be able to tell which regression coefficients are important and which are not. So I am going to normalise all the data so it has mean of `0`

and standard deviation of `1`

```
=(all_series-all_series.mean())/all_series.std()
normalised normalised
```

All I Want for Christmas Is You | Justin Bieber | Millie Bobby Brown | Mariah Carey | Fifth Harmony | Jenna Dewan | Love Actually | Michael Bublé | Despacito | Tori Kelly | … | Colleen Madden | List of Tawag ng Tanghalan finalists (season 3) | List of top 10 singles for 2018 in Australia | The Masked Singer (Bulgarian season 1) | List of Billboard Global 200 number ones of 2021 | Ngayong Pasko | Česko Slovenská SuperStar (season 4) | Mr. Vocalist X’Mas | 24 Hours (Agnes song) | We Were Born for This | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Date | |||||||||||||||||||||

2015-07-01 | -0.360837 | -0.666008 | -0.60886 | -0.136275 | 0.833743 | 0.602052 | -0.38695 | -0.350378 | -0.572981 | 1.37151 | … | -0.382595 | -0.361893 | -0.156449 | -0.105454 | -0.033438 | -0.257078 | -0.249068 | -0.188576 | -0.063978 | -0.075707 |

2015-07-02 | -0.343616 | -0.646519 | -0.60886 | 0.180788 | 0.837845 | 1.721202 | -0.329304 | -0.261714 | -0.572981 | 1.083843 | … | -0.382595 | -0.361893 | -0.156449 | -0.105454 | -0.033438 | -0.257078 | -0.249068 | -0.188576 | -0.063978 | -0.075707 |

2015-07-03 | -0.363756 | -0.68851 | -0.60886 | -0.05572 | 0.704938 | 1.309299 | -0.311353 | -0.31111 | -0.572981 | 0.742653 | … | -0.382595 | -0.361893 | -0.156449 | -0.105454 | -0.033438 | -0.257078 | -0.249068 | -0.188576 | -0.063978 | -0.075707 |

2015-07-04 | -0.355291 | -0.703292 | -0.60886 | -0.15279 | 0.683444 | 0.913123 | -0.31398 | -0.338651 | -0.572981 | 0.554343 | … | -0.382595 | -0.361893 | -0.156449 | -0.105454 | -0.033438 | -0.257078 | -0.249068 | -0.188576 | -0.063978 | -0.075707 |

2015-07-05 | -0.344492 | -0.710729 | -0.60886 | -0.164428 | 0.775165 | 0.677065 | -0.32405 | -0.312176 | -0.572981 | 0.382617 | … | -0.382595 | -0.361893 | -0.156449 | -0.105454 | -0.033438 | -0.257078 | -0.249068 | -0.188576 | -0.063978 | -0.075707 |

… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |

2021-11-04 | 0.781595 | -0.513486 | -0.1901 | 0.433254 | -0.684997 | -0.241852 | -0.274723 | -0.217826 | -0.424959 | -0.30911 | … | -0.382595 | 0.478784 | 0.210653 | 1.332167 | -0.033438 | -0.257078 | 2.131942 | 1.238336 | 6.043733 | 12.053036 |

2021-11-05 | 0.621934 | -0.547286 | -0.228106 | 0.438481 | -0.715681 | -0.271958 | -0.088942 | -0.208942 | -0.422069 | -0.336852 | … | 2.077097 | -0.361893 | 0.210653 | 6.795127 | -0.033438 | 2.062114 | 4.512953 | 5.519072 | 13.169396 | 5.988665 |

2021-11-06 | 0.515689 | -0.471966 | -0.020082 | 0.349424 | -0.711579 | -0.219903 | 0.177251 | -0.241814 | -0.410659 | -0.339716 | … | -0.382595 | 1.31946 | -0.156449 | 7.6577 | 14.93164 | 3.22171 | 8.084469 | 6.945984 | 15.2053 | 15.085222 |

2021-11-07 | 0.473074 | -0.421973 | 0.092557 | 0.337996 | -0.68631 | -0.203925 | 0.299402 | -0.248033 | -0.391339 | -0.347707 | … | 0.437302 | 3.000813 | 0.577756 | 5.64503 | 23.126802 | 0.902518 | 4.512953 | 4.09216 | 19.277107 | 9.020851 |

2021-11-08 | 0.389595 | -0.497481 | -0.054391 | 0.199464 | -0.737832 | -0.098048 | 0.112454 | -0.297429 | -0.427545 | -0.377258 | … | 2.896995 | 0.478784 | 0.577756 | 6.795127 | 39.517125 | 4.381307 | 9.274975 | 2.665248 | 14.187348 | 2.956479 |

2323 rows × 634 columns

```
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
set(rc = {'figure.figsize':(12,6)})
sns.="Date",y="All I Want for Christmas Is You", kind="line", data=normalised) sns.relplot(x
```

`<seaborn.axisgrid.FacetGrid at 0x7fd4a4a6b6d0>`

The “All I Want for Christmas is You” data looks how it should.

Eventually I want to start to use the pageview data from the in-linking pages to make a better forecast for this. There are 633 of these pages but I will also need to look at whether a page is a leading indicator by 30 days, 50 days etc. This quickly multiplies out to be **a lot** of potential regressor columns; I am nervous about chucking this amount of data at a regression algorithm and hoping it can figure out what is and isn’t important.

There are certainly situations where this would be a good approach but for this specific problem I don’t like it. There might be 2323 rows of data but there are only six Christmas peaks - and if I reserve one of them for testing then that only leaves five for training. This means there is a big risk of overfitting and the more regression variables I throw into the mix, the bigger this risk.

My plan here is to follow a two step process:

- Filter out a lot of the regression variables based on correlation value
- Setup the forecasting model so that it will only use a small number of them

There are a few different ways of doing step 2. Tensorflow supports “Structural Time Series” (more on this shortly) with a Sparse Linear Regression component.

SparseLinearRegression uses a parameterization of a Horseshoe prior to encode the assumption that many of the weights are zero, i.e., many of the covariate time series are irrelevant

Which sounds pretty close to what I want so let’s go with that for now.

Rather than diving straight into the deep end let’s start by making a simpler model so we can verify it works

```
import tensorflow_probability as tfp
import tensorflow.compat.v2 as tf
## Forgetting this line is my number one cause of mystery tf errors that I can't figure out
tf.enable_v2_behavior()
```

The “structure” in a structural timeseries describes how the timeseries evolves given certain parameters. For example, we might model the value at day `t+1`

as the value at day `t`

plus a small amount of random noise. This is a “random walk” model and it can actually work pretty well for predicting things only 1 day ahead.

The parameters are estimated from historical data; in the above example the parameter to estimate is the variance of the random noise.

The first model I want to try is a bit more complicated than that. It is similar to how Forecast Forge works behind the scenes. There are three components:

- Two seasonalities; weekly and annual
- A trend component

The trend component looks a bit similar to the first random walk example:

```
level[t] = level[t-1] + slope[t-1] + Normal(0., level_scale)
slope[t] = slope[t-1] + Normal(0., slope_scale)
```

The main difference is that there is both a `level`

and a `slope`

both random walking around. This means that the model is slower to “change direction” than a random walk; if the slope is large and positive then it will take a few days for it to become negative and start trending the other way.

The weekly seasonality component will estimate an effect for each day. This is the same approach to seasonality that Tom Capper uses in his Google Sheets template on Moz.

The annual seasonality component is designed to be *smoother* than that. This is actually a weakness when forecasting “All I Want for Christmas is You” pageviews because the data is so “spikey” but I have picked this for three reasons:

- I don’t want to estimate 12 different monthly effects because the peak in pageviews is concentrated at one part of December. Adding an average effect for the whole month to whatever weekly seasonality there is won’t capture the 7 days before Christmas very well
- Estimating an effect for each of 365 days slows down the model fit quite a lot and eats RAM like it is going out of fashion. This way of modelling things also ignores any smoothness there might be in the data (e.g. the value for November 23rd is likely to be close to the values for November 22nd and 24th)
- This is most similar to what the Forecast Forge algorithm does which makes the forecasts more comparable

```
def build_model(observed_time_series):
= tfp.sts.LocalLinearTrend(observed_time_series=observed_time_series)
trend = tfp.sts.Seasonal(
weekly =7, observed_time_series=observed_time_series, name="week")
num_seasons= tfp.sts.SmoothSeasonal(
annual =365, frequency_multipliers=[1, 2, 3, 4, 5, 6])
period# The actual prediction is the sum of the above three components
= tfp.sts.Sum([trend, weekly, annual], observed_time_series=observed_time_series)
model return model
```

Let’s forecast for Christmas 2020 and see what it looks like against the actuals.

```
= normalised[normalised.index<"2020-11-09"]["All I Want for Christmas Is You"]
observed = list(observed)
observed
= build_model(observed)
model
# The below is mostly tensorflow voodoo
# The output will be a loss curve plot; we want to see that this converges before proceeding
= tfp.sts.build_factored_surrogate_posterior(
variational_posteriors =model)
model
= 200
num_variational_steps
# Build and optimize the variational loss function.
= tfp.vi.fit_surrogate_posterior(
elbo_loss_curve =model.joint_log_prob(
target_log_prob_fn=observed),
observed_time_series=variational_posteriors,
surrogate_posterior=tf.optimizers.Adam(learning_rate=0.1),
optimizer=num_variational_steps,
num_steps=True)
jit_compile
plt.plot(elbo_loss_curve)
plt.show()
# Draw samples from the variational posterior.
= variational_posteriors.sample(100) q_samples
```

```
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow/python/ops/linalg/linear_operator_composition.py:182: LinearOperator.graph_parents (from tensorflow.python.ops.linalg.linear_operator) is deprecated and will be removed in a future version.
Instructions for updating:
Do not call `graph_parents`.
WARNING:tensorflow:From /opt/conda/lib/python3.8/site-packages/tensorflow_probability/python/distributions/distribution.py:342: MultivariateNormalFullCovariance.__init__ (from tensorflow_probability.python.distributions.mvn_full_covariance) is deprecated and will be removed after 2019-12-01.
Instructions for updating:
`MultivariateNormalFullCovariance` is deprecated, use `MultivariateNormalTriL(loc=loc, scale_tril=tf.linalg.cholesky(covariance_matrix))` instead.
WARNING:tensorflow:From <ipython-input-10-608bf49b6c8f>:15: StructuralTimeSeries.joint_log_prob (from tensorflow_probability.python.sts.structural_time_series) is deprecated and will be removed after 2022-03-01.
Instructions for updating:
Please use `StructuralTimeSeries.joint_distribution(observed_time_series).log_prob`
```

The model converges (yay!) so we can use the parameter samples from the posterior to get the forecast values

```
= tfp.sts.forecast(
forecast_dist
model,=observed,
observed_time_series=q_samples,
parameter_samples=53)
num_steps_forecast
= (
forecast_mean, forecast_scale 0],
forecast_dist.mean().numpy()[..., 0]) forecast_dist.stddev().numpy()[...,
```

Quick inspection of the forecast_mean data. We want to see values somewhere between 4 and 8 later in the series

` forecast_mean`

```
array([0.35945305, 0.48060617, 0.50919384, 0.5111362 , 0.6007148 ,
0.688883 , 0.7947957 , 0.9094128 , 1.0705427 , 1.1377562 ,
1.1766099 , 1.301028 , 1.4216119 , 1.5571762 , 1.698354 ,
1.8826438 , 1.9693264 , 2.023694 , 2.1594274 , 2.2869177 ,
2.424798 , 2.5635533 , 2.740567 , 2.8150437 , 2.8522348 ,
2.96582 , 3.0662289 , 3.1721685 , 3.2742407 , 3.4099848 ,
3.4387949 , 3.4261525 , 3.486002 , 3.5290675 , 3.5743883 ,
3.612922 , 3.6825888 , 3.6431928 , 3.5606399 , 3.5493193 ,
3.5204093 , 3.4934168 , 3.4597697 , 3.4578633 , 3.3479712 ,
3.1964653 , 3.1181912 , 3.02477 , 2.9361305 , 2.8441105 ,
2.787481 , 2.626873 , 2.428981 ], dtype=float32)
```

Hmmm, possible a little low.

Pad these forecasted values out so we can compare them with the actuals in a plot

```
= np.pad(forecast_mean,(len(observed),0),constant_values=(np.NaN,))
padded_forecast
= pd.DataFrame({"forecast":padded_forecast
forecast_df
})"actuals"] = list(normalised["All I Want for Christmas Is You"] [:forecast_df.shape[0]])
forecast_df[= normalised.index[:forecast_df.shape[0]]
forecast_df.index = "Day"
forecast_df.index.name = forecast_df[["actuals", "forecast"]]
forecast_df
=forecast_df) sns.lineplot(data
```

`<AxesSubplot:xlabel='Day'>`

Not bad! It misses the big peaks but that is ok for now because the model doesn’t have anything in it to account for them.

Now back to doing some data-munging in preparation for the lagged regression.

First extend the index on the dataframe into the future - this is so there is “space” for the lagged regressors to go into. There is probably a better way of doing this, LMK if you know what it is.

```
from datetime import date
= pd.date_range(start=date(2015, 7, 1), end=date(2021, 11, 8), freq='D')
old_ix = pd.date_range(start=date(2015, 7, 1), end=date(2021, 12, 31), freq='D')
extended_ix = old_ix
normalised.index = normalised.reindex(extended_ix) normalised
```

The absolute minimum amount of lag we can consider is 53 days; this means the value on November 8th will tell us a bit about what happens on New Year’s Eve. I’ve decided the maximum to look at is 83 days; looking back earlier than October probably won’t give that much extra information.

So now make a dataframe with every page lagged between 53 and 83 days:

```
= None
shifted for i in range(53,83):
= normalised.drop(columns="All I Want for Christmas Is You").shift(periods=i).add_suffix("_"+str(i))
x if shifted is None:
= x
shifted else:
= pd.concat([shifted,x], axis=1)
shifted
shifted.columns
```

```
Index(['Justin Bieber_53', 'Millie Bobby Brown_53', 'Mariah Carey_53',
'Fifth Harmony_53', 'Jenna Dewan_53', 'Love Actually_53',
'Michael Bublé_53', 'Despacito_53', 'Tori Kelly_53',
'Hallelujah (Leonard Cohen song)_53',
...
'Colleen Madden_82',
'List of Tawag ng Tanghalan finalists (season 3)_82',
'List of top 10 singles for 2018 in Australia_82',
'The Masked Singer (Bulgarian season 1)_82',
'List of Billboard Global 200 number ones of 2021_82',
'Ngayong Pasko_82', 'Česko Slovenská SuperStar (season 4)_82',
'Mr. Vocalist X'Mas_82', '24 Hours (Agnes song)_82',
'We Were Born for This_82'],
dtype='object', length=18990)
```

`18990`

is a lot of columns. Let’s reduce that number down by seeing which have a statistically significant correlation with “All I Want for Christmas is You”

```
from scipy.stats import pearsonr
= []
sig_cols for i in shifted.columns:
= np.logical_or(np.isnan(normalised["All I Want for Christmas Is You"].values),
nas "float")))
np.isnan(shifted[i].values.astype(
= pearsonr(normalised["All I Want for Christmas Is You"].values[~nas],
corr, pval ~nas]
shifted[i].values[
)if(pval < 0.05):
"col":i, "corr":corr, "p":pval})
sig_cols.append({
# View the top ten
=lambda x: -abs(x['corr']))
sig_cols.sort(key10] sig_cols[:
```

```
[{'col': 'Tawag ng Tanghalan (season 3, quarter I)_80',
'corr': 0.42718188064663565,
'p': 3.524046455470085e-100},
{'col': 'Tawag ng Tanghalan (season 3)_80',
'corr': 0.41963677515784387,
'p': 2.2040219582815046e-96},
{'col': 'Tawag ng Tanghalan (season 3)_81',
'corr': 0.4151538500141587,
'p': 3.9297134847417186e-94},
{'col': 'Tawag ng Tanghalan (season 3)_82',
'corr': 0.41496413323530373,
'p': 5.348570466557383e-94},
{'col': 'Tawag ng Tanghalan (season 3, quarter I)_81',
'corr': 0.38587368756325663,
'p': 1.6532525769584053e-80},
{'col': 'Tawag ng Tanghalan (season 3, quarter I)_79',
'corr': 0.37752700745584056,
'p': 6.1489592197905055e-77},
{'col': 'Tawag ng Tanghalan (season 3, quarter I)_82',
'corr': 0.36852002750662144,
'p': 4.9533722974120236e-73},
{'col': 'The X Factor: Celebrity_59',
'corr': 0.34727695650499074,
'p': 3.603730969714107e-65},
{'col': 'Tawag ng Tanghalan (season 3)_79',
'corr': 0.344126527727554,
'p': 2.114127129858737e-63},
{'col': 'The X Factor: Celebrity_58',
'corr': 0.31705212117051595,
'p': 4.591941729922087e-54}]
```

“Tawag ng Tanghalan” is a Philippine singing competition which I’m guessing is similar to the X Factor (also listed in the top 10). It seems like these competitions start their TV runs in late Autumn and interest builds through until the final before Christmas - maybe so the winner can have a shot at a Christmas number one.

This means that interest and pageviews to their wikipedia pages start increasing through November so the lagged correlation with “All I Want for Christmas is You” appears strong. I’m guessing that “All I Want for Christmas is You” is also sung in the later stages of these competitions which is why the wikipedia pages are linked.

Worryingly, none of this is saying anything about what Mariah Carey is doing in November and wikipedia pages for new seasons of Tawag ng Tanghalan won’t be included in our data set but let’s plow on regardless.

```
def build_regressor_model(observed_time_series, regressors):
= tfp.sts.LocalLinearTrend(observed_time_series=observed_time_series)
trend = tfp.sts.Seasonal(
weekly =7, observed_time_series=observed_time_series, name="week")
num_seasons= tfp.sts.SmoothSeasonal(
annual =365, frequency_multipliers=[1, 2, 3, 4, 5, 6])
period= tfp.sts.SparseLinearRegression(
regressors =regressors, weights_prior_scale=0.01
design_matrix
)= tfp.sts.Sum([trend, weekly, annual, regressors], observed_time_series=observed_time_series)
model return model
```

```
= shifted[[x['col'] for x in sig_cols]].to_numpy(na_value=0).astype("float32")
training_regressors
= build_regressor_model(observed, training_regressors)
regressor_model
= tfp.sts.build_factored_surrogate_posterior(
variational_posteriors =regressor_model)
model
# Build and optimize the variational loss function.
= tfp.vi.fit_surrogate_posterior(
elbo_loss_curve =regressor_model.joint_log_prob(
target_log_prob_fn=observed),
observed_time_series=variational_posteriors,
surrogate_posterior=tf.optimizers.Adam(learning_rate=0.1),
optimizer=num_variational_steps,
num_steps=True)
jit_compile
plt.plot(elbo_loss_curve)
plt.show()
# Draw samples from the variational posterior.
= variational_posteriors.sample(100) q_samples
```

```
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
```

Now make the forecast

```
= tfp.sts.forecast(
forecast_dist
regressor_model,=observed,
observed_time_series=q_samples,
parameter_samples=53) num_steps_forecast
```

```
= (
forecast_mean, forecast_scale 0],
forecast_dist.mean().numpy()[..., 0]) forecast_dist.stddev().numpy()[...,
```

` forecast_mean`

```
array([0.4021134 , 0.26802546, 0.29915774, 0.35381198, 0.39239565,
0.4913519 , 0.62269664, 0.62460697, 0.6748683 , 0.8460419 ,
0.8314816 , 0.89810306, 0.9707965 , 1.1343668 , 1.271471 ,
1.3178807 , 1.3192046 , 1.4591802 , 1.5402098 , 1.6311107 ,
1.7568297 , 1.9190131 , 1.8498538 , 1.8599926 , 2.0212965 ,
2.163909 , 2.2679136 , 2.2723856 , 2.2320397 , 2.3305006 ,
2.2959707 , 2.3503718 , 2.5391078 , 2.5836945 , 2.656044 ,
2.612377 , 2.48579 , 2.6914563 , 2.6265645 , 2.6099665 ,
2.5055795 , 2.498663 , 2.4052846 , 2.3451033 , 2.2123523 ,
2.3186212 , 2.1818147 , 2.0509038 , 1.9053872 , 1.7598625 ,
1.5765525 , 1.5135591 , 1.4336597 ], dtype=float32)
```

This seems way worse; peak value for the first forecast was `3.68`

compared with `2.69`

here

```
= np.pad(forecast_mean,(len(observed),0),constant_values=(np.NaN,))
padded_forecast 'Lagged regressors'] = padded_forecast
forecast_df[= sns.lineplot(data=forecast_df.tail(730), dashes=False, palette=["blue","green","red"], sizes=[1,10,10])
ax import matplotlib.dates as mdates
=3))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval plt.show()
```

This is textbook overfitting :-(

Regressors that were predictive in December 2019 do not work so well in 2020 and the model has learned to rely more on the regressors than the seasonal components.

Let’s analyse this in a little more detail by looking at the regression coefficients

```
= q_samples['SparseLinearRegression/_weights_noncentered'].numpy().mean(axis=0)
regs from scipy import stats
stats.describe(regs)
```

`DescribeResult(nobs=9766, minmax=(-3.7824707, 5.50638), mean=0.0054672873, variance=0.11094342, skewness=2.445416212081909, kurtosis=42.830982132014604)`

` sns.displot(regs)`

`<seaborn.axisgrid.FacetGrid at 0x7fd35520a340>`

These are not what I expected! It is true that the regression coefficiants are tightly distributed around zero, but I was expecting it to be more obvious which of them were relevant and which weren’t. I think the `SparseLinearRegression`

hasn’t worked the way that I wanted it to.

So to sum up there are two main problems here, they are kind of linked but each could happen alone too:

- I chucked a load of data at an algorithm expecting it to be able to figure out what was and wasn’t important
- The result was overfitted on the training set

Can I do better by manually choosing regressors?

Let’s look at only various lags for the Mariah Carey page.

```
= [col for col in shifted if col.startswith("Mariah Carey_")]
cols = shifted[cols]
shifted_mariah
= shifted_mariah.to_numpy(na_value=0).astype("float32")
training_regressors
= build_regressor_model(observed, training_regressors)
regressor_model
= tfp.sts.build_factored_surrogate_posterior(
variational_posteriors =regressor_model)
model
# Build and optimize the variational loss function.
= tfp.vi.fit_surrogate_posterior(
elbo_loss_curve =regressor_model.joint_log_prob(
target_log_prob_fn=observed),
observed_time_series=variational_posteriors,
surrogate_posterior=tf.optimizers.Adam(learning_rate=0.1),
optimizer=num_variational_steps,
num_steps=True)
jit_compile
plt.plot(elbo_loss_curve)
plt.show()
# Draw samples from the variational posterior.
= variational_posteriors.sample(100) q_samples
```

```
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
```

```
= tfp.sts.forecast(
forecast_dist
regressor_model,=observed,
observed_time_series=q_samples,
parameter_samples=53)
num_steps_forecast
= (
forecast_mean, forecast_scale 0],
forecast_dist.mean().numpy()[..., 0])
forecast_dist.stddev().numpy()[...,
= np.pad(forecast_mean,(len(observed),0),constant_values=(np.NaN,))
padded_forecast 'Mariah regressors'] = padded_forecast
forecast_df[= sns.lineplot(data=forecast_df.tail(730), dashes=False
ax =["blue","green","red","purple"]
, palette=[1,10,10])
, sizesimport matplotlib.dates as mdates
=3))
ax.xaxis.set_major_locator(mdates.MonthLocator(interval plt.show()
```

The result isn’t very different to the first attempt forecast; it looks like pageviews for “Mariah Carey” in November don’t tell you very much about what is going to happen in December.

I’m going to leave it here for the day, I hope this post has demonstrated two things:

- This stuff is hard. Being able to use fancy tensorflow models doesn’t necessarily make things any easier
- I don’t know enough about All I Want for Christmas is You to really drive this forward (or else it is just hard - see point 1)

If you've read this far and understood everything then you might not be in the target audience for the Forecast Forge Google Sheets Addon.

However, you probably have collegues who are! I would like to talk to them so please consider hooking me up with an introduction (fergie@forecastforge.com).

And feel free to email me with any other questions/comments you may have; I'd love to hear from you.