October 27, 2020

Extending Prophet for fun and profit - Part 2

This post is part of a series on extending the Prophet forecasting model so that it can do other things. You should read part one on the default Prophet forecasting model first.

What Prophet Can’t Do

For me, one of the most useful features of Prophet is how easily you can add regression features to the model. This enables you to provide information to the machine learning model about things like the dates when email campaigns were sent or when your PPC budget increased; and you can use your marketing calendar to add information about when these things will happen in the future too.

Being able to do this in a spreadsheet is one of the main things that makes Forecast Forge great.

This works great when you can know the future values of your regressors. But what if you don’t? Can you make a forecast for a metric and then use the output of this forecast as a regressor for another forecast? Kind of…

One can also use as a regressor another time series that has been forecasted with a time series model, such as Prophet. For instance, if r(t) is included as a regressor for y(t), Prophet can be used to forecast r(t) and then that forecast can be plugged in as the future values when forecasting y(t). A note of caution around this approach: This will probably not be useful unless r(t) is somehow easier to forecast then y(t). This is because error in the forecast of r(t) will produce error in the forecast of y(t) – The prophet documentation

You can use the point estimate from one forecast as the input to another but there is no way to propagate the uncertainty. This is what I will show you how to do in this post.

Recall the default Stan model from last time:

y ~ normal(
  trend
  .* (1 + X * (beta .* s_m))
  + X * (beta .* s_a),
  sigma_obs
);

X is a matrix of regressor columns; we need to somehow extend this to include the output and uncertainties from another forecast.

To make things clearer, let’s have y_1 as the metric for the first forecast that doesn’t depend on the results of any other forecast and then y_2 as the metric which depends on y_1 as a regressor.

The model for y_1 is just the same as before:

y_1 ~ normal(
  trend_1
  .* (1 + X_1 * (beta_1 .* s_m_1))
  + X_1 * (beta_1 .* s_a_1),
  sigma_obs_1
);

Notice how we also have specific X, beta, sigma_obs etc. for this timeseries.

Thanks to the magic of Stan we don’t need to worry about uncertainty or anything like that (Stan will figure that out for us); all that needs to happen is to add y_1 as an extra column in X_2.

y_2 ~ normal(
  trend_2
  .* (1 + append_col(X_2,y_1) * (beta_2 .* s_m_2))
  + append_col(X_2,y_1) * (beta_2 .* s_a_2),
  sigma_obs_2
);

This means that the variables beta_2, s_m_2 and s_a_2 all need to be one row larger than they are by default. The extra row in beta_2 is the regression coefficient for y_1 on y_2 and the extra row in s_m_2 and s_a_2 tells the model whether this is an additive or multiplicative regression.

The model is the key part; the rest (apart from one part that we’ll get to later) is book keeping. The only tricky part is knowing which parameters need extra rows because there is an extra y_1 in a matrix multiplication and which do not.

Here is the model:

functions {
  matrix get_changepoint_matrix(vector t, vector t_change, int T, int S) {
    // Assumes t and t_change are sorted.
    matrix[T, S] A;
    row_vector[S] a_row;
    int cp_idx;

    // Start with an empty matrix.
    A = rep_matrix(0, T, S);
    a_row = rep_row_vector(0, S);
    cp_idx = 1;

    // Fill in each row of A.
    for (i in 1:T) {
      while ((cp_idx <= S) && (t[i] >= t_change[cp_idx])) {
        a_row[cp_idx] = 1;
        cp_idx = cp_idx + 1;
      }
      A[i] = a_row;
    }
    return A;
  }

  // Logistic trend functions

  vector logistic_gamma(real k, real m, vector delta, vector t_change, int S) {
    vector[S] gamma;  // adjusted offsets, for piecewise continuity
    vector[S + 1] k_s;  // actual rate in each segment
    real m_pr;

    // Compute the rate in each segment
    k_s = append_row(k, k + cumulative_sum(delta));

    // Piecewise offsets
    m_pr = m; // The offset in the previous segment
    for (i in 1:S) {
      gamma[i] = (t_change[i] - m_pr) * (1 - k_s[i] / k_s[i + 1]);
      m_pr = m_pr + gamma[i];  // update for the next segment
    }
    return gamma;
  }

  vector logistic_trend(
    real k,
    real m,
    vector delta,
    vector t,
    vector cap,
    matrix A,
    vector t_change,
    int S
  ) {
    vector[S] gamma;

    gamma = logistic_gamma(k, m, delta, t_change, S);
    return cap .* inv_logit((k + A * delta) .* (t - (m + A * gamma)));
  }

  // Linear trend function

  vector linear_trend(
    real k,
    real m,
    vector delta,
    vector t,
    matrix A,
    vector t_change
  ) {
    return (k + A * delta) .* t + (m + A * (-t_change .* delta));
  }

  // Flat trend function

  vector flat_trend(
    real m,
    int T
  ) {
    return rep_vector(m, T);
  }
}

data {
  int T;                  // Number of time periods
  int<lower=1> K_1;       // Number of regressors
  int<lower=1> K_2;
  vector[T] t;            // Time
  vector[T] cap_1;        // Capacities for logistic trend
  vector[T] cap_2;
  vector[T] y_1;          // Time series
  vector[T] y_2;
  int S_1;                // Number of changepoints
  int S_2;
  vector[S_1] t_change_1; // Times of trend changepoints
  vector[S_2] t_change_2;
  matrix[T,K_1] X_1;      // Regressors
  matrix[T,K_2] X_2;
  vector[K_1] sigmas_1;   // Scale on seasonality prior
  vector[K_2+1] sigmas_2;
  real<lower=0> tau_1;    // Scale on changepoints prior
  real<lower=0> tau_2;
  int trend_indicator_1;  // 0 for linear, 1 for logistic, 2 for flat
  int trend_indicator_2;
  vector[K_1] s_a_1;      // Indicator of additive features
  vector[K_2+1] s_a_2;
  vector[K_1] s_m_1;      // Indicator of multiplicative features
  vector[K_2+1] s_m_2;
}

transformed data {
  matrix[T, S_1] A_1;
  matrix[T, S_2] A_2;
  A_1 = get_changepoint_matrix(t, t_change_1, T, S_1);
  A_2 = get_changepoint_matrix(t, t_change_2, T, S_2);
}

parameters {
  real k_1;                 // Base trend growth rate
  real k_2;
  real m_1;                 // Trend offset
  real m_2;
  vector[S_1] delta_1;      // Trend rate adjustments
  vector[S_2] delta_2;
  real<lower=0> sigma_obs_1;// Observation noise
  real<lower=0> sigma_obs_2;
  vector[K_1] beta_1;       // Regressor coefficients
  vector[K_2+1] beta_2;
}

transformed parameters {
  vector[T] trend_1;
  vector[T] trend_2;
  if (trend_indicator_1 == 0) {
    trend_1 = linear_trend(k_1, m_1, delta_1, t, A_1, t_change_1);
  } else if (trend_indicator_1 == 1) {
    trend_1 = logistic_trend(k_1, m_1, delta_1, t, cap_1, A_1, t_change_1, S_1);
  } else if (trend_indicator_1 == 2) {
    trend_1 = flat_trend(m_1, T);
  }
  // same for trend_2
  if (trend_indicator_2 == 0) {
    trend_2 = linear_trend(k_2, m_2, delta_2, t, A_2, t_change_2);
  } else if (trend_indicator_2 == 1) {
    trend_2 = logistic_trend(k_2, m_2, delta_2, t, cap_2, A_2, t_change_2, S_2);
  } else if (trend_indicator_1 == 2) {
    trend_2 = flat_trend(m_2, T);
  }
}

model {
  //priors
  k_1 ~ normal(0, 5);
  m_1 ~ normal(0, 5);
  delta_1 ~ double_exponential(0, tau_1);
  sigma_obs_1 ~ normal(0, 0.5);
  beta_1 ~ normal(0, sigmas_1);

  k_2 ~ normal(0, 5);
  m_2 ~ normal(0, 5);
  delta_2 ~ double_exponential(0, tau_2);
  sigma_obs_2 ~ normal(0, 0.5);
  beta_2 ~ normal(0, sigmas_2);

  // Likelihood
  y_1 ~ normal(
  trend_1
  .* (1 + X_1 * (beta_1 .* s_m_1))
  + X_1 * (beta_1 .* s_a_1),
  sigma_obs_1
  );

  y_2 ~ normal(
  trend_2
  .* (1 + append_col(X_2,y_1) * (beta_2 .* s_m_2))
  + append_col(X_2,y_1) * (beta_2 .* s_a_2),
  sigma_obs_2
  );
}

This model will compile and fit returning the parameter estimates for both the y_1 and the y_2 timeseries. But it won’t actually make the forecast for either of them. Instead Prophet does this bit in Python or R rather than in Stan.

We’ve gone completely off piste with the Stan model here so the python/R code that Prophet wraps around Stan won’t work.

Desperately trying to get back on the piste

You could take the approach Prophet uses and do the sampling in Python (or R) but I think it is simpler to keep everything in Stan by using a generated quantities block. Prophet don’t take this approach by default because of a bug in Rstan which causes it to take a very long time to load predictions for large models back into R. I don’t fit particularly large models and I know Stan a lot better than I know Python so the generated quantities approach is better for me.

This means extra variables must be provided to Stan:

X_pred containing the future values of regressors
cap_pred for the future capacities (you can miss this out if you aren’t using a logistic trend in Stan)
S_pred is an upper bound on the number of future changepoints. If you have spaced changepoints evenly though the training data then you can do the same here.
T_pred to tell the model how many predictions you want to make.
n_samp is the number of samples to generate.

Generating the output breaks down into two parts:

Find the best guess estimate y_hat
Generate samples to estimate the uncertainty.

I’m not sure why the original Prophet code doesn’t just do the sampling and then calculate y_hat from there. But I’ll follow the same approach here.

As with most things in Stan there is a lot of declaration before you get to the meat of the generated quantities block:

generated quantities {

vector[T_pred] y_hat_2;
vector[T_pred] trend_hat_2;
matrix[T_pred, S_2] A_pred_2;
matrix[T_pred, n_samp] trend_samples_2;
matrix[T_pred, n_samp] y_pred_2;
vector[S_1 + S_pred_2] t_change_sim_2;
vector[S_1 + S_pred_2] delta_sim_2;
real lambda_2;
matrix[T_pred, S_2 + S_pred_2] A_sim_2;


vector[T_pred] y_hat_1;
vector[T_pred] trend_hat_1;
matrix[T_pred, S_1] A_pred_1;
matrix[T_pred, n_samp] trend_samples_1;
matrix[T_pred, n_samp] y_pred_1;
vector[S_1 + S_pred_1] t_change_sim_1;
vector[S_1 + S_pred_1] delta_sim_1;
real lambda_1;
matrix[T_pred, S_1 + S_pred_1] A_sim_1;


if(T_pred > 0) { // need to make predictions

  A_pred_1 = get_changepoint_matrix(t_pred, t_change_1, T_pred, S_1);
  trend_hat_1 = get_trend(
      k_1, offset_1, delta_1, t_pred, cap_pred_1, A_pred_1,
      t_change_1, S_1, trend_indicator_1, T_pred
      );

  // calculate the y_hat
  y_hat_1 = trend_hat_1 .* (1 + X_pred_1 * (beta_1 .* s_m_1))
      + X_pred_1 * (beta_1 .* s_a_1);

  // bookeeping for the simulation
  for (i in 1:S_1) {
      t_change_sim_1[i] = t_change_1[i];
      delta_sim_1[i] = delta_1[i];
  }

  lambda_1 = mean(fabs(delta_1)) + 1e-8;


  A_pred_2 = get_changepoint_matrix(t_pred, t_change_2, T_pred, S_2);
  trend_hat_2 = get_trend(
      k_2, offset_2, delta_2, t_pred, cap_pred_2, A_pred_2,
      t_change_2, S_2, trend_indicator_2, T_pred
      );

  y_hat_2 = trend_hat_2 .* (1 + append_col(X_pred_2,y_hat_1) * (beta_2 .* s_m_2))
      + append_col(X_pred_2,y_hat_1) * (beta_2 .* s_a_2);

  for (i in 1:S_2) {
      t_change_sim_2[i] = t_change_2[i];
      delta_sim_2[i] = delta_2[i];
  }

  lambda_2 = mean(fabs(delta_2)) + 1e-8;

  // Now do the sampling
  for (i in 1:n_samp) {

    if (S_pred_1 > 0) {
       //Sample new changepoints from a Poisson process with rate S
       //Sample changepoint deltas from Laplace(lambda)
       t_change_sim_1[S_1 + 1] = 1 + exponential_rng(S_1);
       for (j in (S_1 + 2):(S_1 + S_pred_1)) {
            t_change_sim_1[j] = t_change_sim_1[j - 1] + exponential_rng(S_1);
       }
       for (j in (S_1 + 1): (S_1 + S_pred_1)) {
            delta_sim_1[j] = double_exponential_rng(0, lambda_1);
       }
    }

    // Compute trend with these changepoints
    A_sim_1 = get_changepoint_matrix(t_pred, t_change_sim_1, T_pred, S_1 + S_pred_1);
    trend_samples_1[:, i] = get_trend(
          k_1, offset_1, delta_sim_1, t_pred, cap_pred_1, A_sim_1,
          t_change_sim_1, S_1 + S_pred_1,
          trend_indicator_1, T_pred
        );
    y_pred_1[:,i] = trend_samples_1[:, i] .* (1 + X_pred_1 * (beta_1 .* s_m_1))
          + X_pred_1 * (beta_1 .* s_a_1);


    if (S_pred_2 > 0) {
     //Sample new changepoints from a Poisson process with rate S
     //Sample changepoint deltas from Laplace(lambda)
     t_change_sim_2[S_2 + 1] = 1 + exponential_rng(S_2);
     for (j in (S_2 + 2):(S_2 + S_pred_2)) {
            t_change_sim_2[j] = t_change_sim_2[j - 1] + exponential_rng(S_2);
     }
     for (j in (S_2 + 1): (S_2 + S_pred_2)) {
            delta_sim_2[j] = double_exponential_rng(0, lambda_2);
     }
    }
    // Compute trend with these changepoints
    A_sim_2 = get_changepoint_matrix(t_pred, t_change_sim_2, T_pred, S_2 + S_pred_2);
    trend_samples_2[:, i] = get_trend(
          k_2, offset_2, delta_sim_2, t_pred, cap_pred_2, A_sim_2,
          t_change_sim_2, S_2 + S_pred_2,
          trend_indicator_2, T_pred
        );
    y_pred_2[:,i] = trend_samples_2[:, i] .* (1 + append_col(X_pred_2,y_pred_1[:,i]) * (beta_2 .* s_m_2))
          + append_col(X_pred_2,y_pred_1[:,i]) * (beta_2 .* s_a_2);

  }
}
}

This block will generate estimates for y_hat_1 and y_hat_2 and return samples as y_pred_1 and y_pred_2.

You can download the full Stan file here.

This method allows you to use the forecast for one timeseries and a regressor for the forecast of another. Next time I will show you how to model functional constraints; for example y_3 = y_1 + y_2.

If you've read this far and understood everything then you might not be in the target audience for the Forecast Forge Google Sheets Addon.

However, you probably have collegues who are! I would like to talk to them so please consider hooking me up with an introduction (fergie@forecastforge.com).

And feel free to email me with any other questions/comments you may have; I'd love to hear from you.

Extending Prophet for fun and profit - Part 2

What Prophet Can’t Do

Never miss a post