October 27, 2020

# Extending Prophet for fun and profit - Part 2

This post is part of a series on extending the Prophet forecasting model so that it can do other things. You should read part one on the default Prophet forecasting model first.

## What Prophet Can’t Do

For me, one of the most useful features of Prophet is how easily you can add regression features to the model. This enables you to provide information to the machine learning model about things like the dates when email campaigns were sent or when your PPC budget increased; and you can use your marketing calendar to add information about when these things will happen in the future too.

Being able to do this in a spreadsheet is one of the main things that makes Forecast Forge great.

This works great when you can know the future values of your regressors. But what if you don’t? Can you make a forecast for a metric and then use the output of this forecast as a regressor for another forecast? Kind of…

One can also use as a regressor another time series that has been forecasted with a time series model, such as Prophet. For instance, if r(t) is included as a regressor for y(t), Prophet can be used to forecast r(t) and then that forecast can be plugged in as the future values when forecasting y(t). A note of caution around this approach: This will probably not be useful unless r(t) is somehow easier to forecast then y(t). This is because error in the forecast of r(t) will produce error in the forecast of y(t) – The prophet documentation

You can use the point estimate from one forecast as the input to another but there is no way to propagate the uncertainty. This is what I will show you how to do in this post.

Recall the default Stan model from last time:

``````y ~ normal(
trend
.* (1 + X * (beta .* s_m))
+ X * (beta .* s_a),
sigma_obs
);``````

`X` is a matrix of regressor columns; we need to somehow extend this to include the output and uncertainties from another forecast.

To make things clearer, let’s have `y_1` as the metric for the first forecast that doesn’t depend on the results of any other forecast and then `y_2` as the metric which depends on `y_1` as a regressor.

The model for `y_1` is just the same as before:

``````y_1 ~ normal(
trend_1
.* (1 + X_1 * (beta_1 .* s_m_1))
+ X_1 * (beta_1 .* s_a_1),
sigma_obs_1
);``````

Notice how we also have specific `X`, `beta`, `sigma_obs` etc. for this timeseries.

Thanks to the magic of Stan we don’t need to worry about uncertainty or anything like that (Stan will figure that out for us); all that needs to happen is to add `y_1` as an extra column in `X_2`.

``````y_2 ~ normal(
trend_2
.* (1 + append_col(X_2,y_1) * (beta_2 .* s_m_2))
+ append_col(X_2,y_1) * (beta_2 .* s_a_2),
sigma_obs_2
);``````

This means that the variables `beta_2`, `s_m_2` and `s_a_2` all need to be one row larger than they are by default. The extra row in `beta_2` is the regression coefficient for `y_1` on `y_2` and the extra row in `s_m_2` and `s_a_2` tells the model whether this is an additive or multiplicative regression.

The model is the key part; the rest (apart from one part that we’ll get to later) is book keeping. The only tricky part is knowing which parameters need extra rows because there is an extra `y_1` in a matrix multiplication and which do not.

Here is the model:

``````functions {
matrix get_changepoint_matrix(vector t, vector t_change, int T, int S) {
// Assumes t and t_change are sorted.
matrix[T, S] A;
row_vector[S] a_row;
int cp_idx;

A = rep_matrix(0, T, S);
a_row = rep_row_vector(0, S);
cp_idx = 1;

// Fill in each row of A.
for (i in 1:T) {
while ((cp_idx <= S) && (t[i] >= t_change[cp_idx])) {
a_row[cp_idx] = 1;
cp_idx = cp_idx + 1;
}
A[i] = a_row;
}
return A;
}

// Logistic trend functions

vector logistic_gamma(real k, real m, vector delta, vector t_change, int S) {
vector[S] gamma;  // adjusted offsets, for piecewise continuity
vector[S + 1] k_s;  // actual rate in each segment
real m_pr;

// Compute the rate in each segment
k_s = append_row(k, k + cumulative_sum(delta));

// Piecewise offsets
m_pr = m; // The offset in the previous segment
for (i in 1:S) {
gamma[i] = (t_change[i] - m_pr) * (1 - k_s[i] / k_s[i + 1]);
m_pr = m_pr + gamma[i];  // update for the next segment
}
return gamma;
}

vector logistic_trend(
real k,
real m,
vector delta,
vector t,
vector cap,
matrix A,
vector t_change,
int S
) {
vector[S] gamma;

gamma = logistic_gamma(k, m, delta, t_change, S);
return cap .* inv_logit((k + A * delta) .* (t - (m + A * gamma)));
}

// Linear trend function

vector linear_trend(
real k,
real m,
vector delta,
vector t,
matrix A,
vector t_change
) {
return (k + A * delta) .* t + (m + A * (-t_change .* delta));
}

// Flat trend function

vector flat_trend(
real m,
int T
) {
return rep_vector(m, T);
}
}

data {
int T;                  // Number of time periods
int<lower=1> K_1;       // Number of regressors
int<lower=1> K_2;
vector[T] t;            // Time
vector[T] cap_1;        // Capacities for logistic trend
vector[T] cap_2;
vector[T] y_1;          // Time series
vector[T] y_2;
int S_1;                // Number of changepoints
int S_2;
vector[S_1] t_change_1; // Times of trend changepoints
vector[S_2] t_change_2;
matrix[T,K_1] X_1;      // Regressors
matrix[T,K_2] X_2;
vector[K_1] sigmas_1;   // Scale on seasonality prior
vector[K_2+1] sigmas_2;
real<lower=0> tau_1;    // Scale on changepoints prior
real<lower=0> tau_2;
int trend_indicator_1;  // 0 for linear, 1 for logistic, 2 for flat
int trend_indicator_2;
vector[K_1] s_a_1;      // Indicator of additive features
vector[K_2+1] s_a_2;
vector[K_1] s_m_1;      // Indicator of multiplicative features
vector[K_2+1] s_m_2;
}

transformed data {
matrix[T, S_1] A_1;
matrix[T, S_2] A_2;
A_1 = get_changepoint_matrix(t, t_change_1, T, S_1);
A_2 = get_changepoint_matrix(t, t_change_2, T, S_2);
}

parameters {
real k_1;                 // Base trend growth rate
real k_2;
real m_1;                 // Trend offset
real m_2;
vector[S_1] delta_1;      // Trend rate adjustments
vector[S_2] delta_2;
real<lower=0> sigma_obs_1;// Observation noise
real<lower=0> sigma_obs_2;
vector[K_1] beta_1;       // Regressor coefficients
vector[K_2+1] beta_2;
}

transformed parameters {
vector[T] trend_1;
vector[T] trend_2;
if (trend_indicator_1 == 0) {
trend_1 = linear_trend(k_1, m_1, delta_1, t, A_1, t_change_1);
} else if (trend_indicator_1 == 1) {
trend_1 = logistic_trend(k_1, m_1, delta_1, t, cap_1, A_1, t_change_1, S_1);
} else if (trend_indicator_1 == 2) {
trend_1 = flat_trend(m_1, T);
}
// same for trend_2
if (trend_indicator_2 == 0) {
trend_2 = linear_trend(k_2, m_2, delta_2, t, A_2, t_change_2);
} else if (trend_indicator_2 == 1) {
trend_2 = logistic_trend(k_2, m_2, delta_2, t, cap_2, A_2, t_change_2, S_2);
} else if (trend_indicator_1 == 2) {
trend_2 = flat_trend(m_2, T);
}
}

model {
//priors
k_1 ~ normal(0, 5);
m_1 ~ normal(0, 5);
delta_1 ~ double_exponential(0, tau_1);
sigma_obs_1 ~ normal(0, 0.5);
beta_1 ~ normal(0, sigmas_1);

k_2 ~ normal(0, 5);
m_2 ~ normal(0, 5);
delta_2 ~ double_exponential(0, tau_2);
sigma_obs_2 ~ normal(0, 0.5);
beta_2 ~ normal(0, sigmas_2);

// Likelihood
y_1 ~ normal(
trend_1
.* (1 + X_1 * (beta_1 .* s_m_1))
+ X_1 * (beta_1 .* s_a_1),
sigma_obs_1
);

y_2 ~ normal(
trend_2
.* (1 + append_col(X_2,y_1) * (beta_2 .* s_m_2))
+ append_col(X_2,y_1) * (beta_2 .* s_a_2),
sigma_obs_2
);
}``````

This model will compile and fit returning the parameter estimates for both the `y_1` and the `y_2` timeseries. But it won’t actually make the forecast for either of them. Instead Prophet does this bit in Python or R rather than in Stan.

We’ve gone completely off piste with the Stan model here so the python/R code that Prophet wraps around Stan won’t work.

You could take the approach Prophet uses and do the sampling in Python (or R) but I think it is simpler to keep everything in Stan by using a `generated quantities` block. Prophet don’t take this approach by default because of a bug in Rstan which causes it to take a very long time to load predictions for large models back into R. I don’t fit particularly large models and I know Stan a lot better than I know Python so the `generated quantities` approach is better for me.

This means extra variables must be provided to Stan:

• `X_pred` containing the future values of regressors
• `cap_pred` for the future capacities (you can miss this out if you aren’t using a logistic trend in Stan)
• `S_pred` is an upper bound on the number of future changepoints. If you have spaced changepoints evenly though the training data then you can do the same here.
• `T_pred` to tell the model how many predictions you want to make.
• `n_samp` is the number of samples to generate.

Generating the output breaks down into two parts:

1. Find the best guess estimate `y_hat`
2. Generate samples to estimate the uncertainty.

I’m not sure why the original Prophet code doesn’t just do the sampling and then calculate `y_hat` from there. But I’ll follow the same approach here.

As with most things in Stan there is a lot of declaration before you get to the meat of the `generated quantities` block:

``````generated quantities {

vector[T_pred] y_hat_2;
vector[T_pred] trend_hat_2;
matrix[T_pred, S_2] A_pred_2;
matrix[T_pred, n_samp] trend_samples_2;
matrix[T_pred, n_samp] y_pred_2;
vector[S_1 + S_pred_2] t_change_sim_2;
vector[S_1 + S_pred_2] delta_sim_2;
real lambda_2;
matrix[T_pred, S_2 + S_pred_2] A_sim_2;

vector[T_pred] y_hat_1;
vector[T_pred] trend_hat_1;
matrix[T_pred, S_1] A_pred_1;
matrix[T_pred, n_samp] trend_samples_1;
matrix[T_pred, n_samp] y_pred_1;
vector[S_1 + S_pred_1] t_change_sim_1;
vector[S_1 + S_pred_1] delta_sim_1;
real lambda_1;
matrix[T_pred, S_1 + S_pred_1] A_sim_1;

if(T_pred > 0) { // need to make predictions

A_pred_1 = get_changepoint_matrix(t_pred, t_change_1, T_pred, S_1);
trend_hat_1 = get_trend(
k_1, offset_1, delta_1, t_pred, cap_pred_1, A_pred_1,
t_change_1, S_1, trend_indicator_1, T_pred
);

// calculate the y_hat
y_hat_1 = trend_hat_1 .* (1 + X_pred_1 * (beta_1 .* s_m_1))
+ X_pred_1 * (beta_1 .* s_a_1);

// bookeeping for the simulation
for (i in 1:S_1) {
t_change_sim_1[i] = t_change_1[i];
delta_sim_1[i] = delta_1[i];
}

lambda_1 = mean(fabs(delta_1)) + 1e-8;

A_pred_2 = get_changepoint_matrix(t_pred, t_change_2, T_pred, S_2);
trend_hat_2 = get_trend(
k_2, offset_2, delta_2, t_pred, cap_pred_2, A_pred_2,
t_change_2, S_2, trend_indicator_2, T_pred
);

y_hat_2 = trend_hat_2 .* (1 + append_col(X_pred_2,y_hat_1) * (beta_2 .* s_m_2))
+ append_col(X_pred_2,y_hat_1) * (beta_2 .* s_a_2);

for (i in 1:S_2) {
t_change_sim_2[i] = t_change_2[i];
delta_sim_2[i] = delta_2[i];
}

lambda_2 = mean(fabs(delta_2)) + 1e-8;

// Now do the sampling
for (i in 1:n_samp) {

if (S_pred_1 > 0) {
//Sample new changepoints from a Poisson process with rate S
//Sample changepoint deltas from Laplace(lambda)
t_change_sim_1[S_1 + 1] = 1 + exponential_rng(S_1);
for (j in (S_1 + 2):(S_1 + S_pred_1)) {
t_change_sim_1[j] = t_change_sim_1[j - 1] + exponential_rng(S_1);
}
for (j in (S_1 + 1): (S_1 + S_pred_1)) {
delta_sim_1[j] = double_exponential_rng(0, lambda_1);
}
}

// Compute trend with these changepoints
A_sim_1 = get_changepoint_matrix(t_pred, t_change_sim_1, T_pred, S_1 + S_pred_1);
trend_samples_1[:, i] = get_trend(
k_1, offset_1, delta_sim_1, t_pred, cap_pred_1, A_sim_1,
t_change_sim_1, S_1 + S_pred_1,
trend_indicator_1, T_pred
);
y_pred_1[:,i] = trend_samples_1[:, i] .* (1 + X_pred_1 * (beta_1 .* s_m_1))
+ X_pred_1 * (beta_1 .* s_a_1);

if (S_pred_2 > 0) {
//Sample new changepoints from a Poisson process with rate S
//Sample changepoint deltas from Laplace(lambda)
t_change_sim_2[S_2 + 1] = 1 + exponential_rng(S_2);
for (j in (S_2 + 2):(S_2 + S_pred_2)) {
t_change_sim_2[j] = t_change_sim_2[j - 1] + exponential_rng(S_2);
}
for (j in (S_2 + 1): (S_2 + S_pred_2)) {
delta_sim_2[j] = double_exponential_rng(0, lambda_2);
}
}
// Compute trend with these changepoints
A_sim_2 = get_changepoint_matrix(t_pred, t_change_sim_2, T_pred, S_2 + S_pred_2);
trend_samples_2[:, i] = get_trend(
k_2, offset_2, delta_sim_2, t_pred, cap_pred_2, A_sim_2,
t_change_sim_2, S_2 + S_pred_2,
trend_indicator_2, T_pred
);
y_pred_2[:,i] = trend_samples_2[:, i] .* (1 + append_col(X_pred_2,y_pred_1[:,i]) * (beta_2 .* s_m_2))
+ append_col(X_pred_2,y_pred_1[:,i]) * (beta_2 .* s_a_2);

}
}
}``````

This block will generate estimates for `y_hat_1` and `y_hat_2` and return samples as `y_pred_1` and `y_pred_2`.

This method allows you to use the forecast for one timeseries and a regressor for the forecast of another. Next time I will show you how to model functional constraints; for example `y_3 = y_1 + y_2`.