Modeling temporal data with an unknown number of changepoints

Identifying structural breaks in data is an important problem to folks that frequently work with time series data. Some examples of how people have dealt with the problem of a single changepoint can be found here and here.

Generally, the setup looks like this: we have some data \(X_t\) indexed by a discrete time coordinate \(t \in \{1,...,T\}\) and a parametric submodel linking the distribution of \(X\) to another quantity \(\mu_t\) which depends on the temporal coordinate. For the simple case of a linear Gaussian model with a single change point, we have

\[a_1, a_2 \sim N(0, \sigma^2_\mu)\] \[\tau \sim \text{DiscreteUniform}(\{1,...,T\})\] \[\mu_t = \left\{ \begin{array}{l} a_1 \text{ if } t > \tau \\ a_2 \text{ if } t \le \tau \end{array} \right.\] \[X_t \sim N(\mu_t, \sigma_\epsilon)\]

with your scale priors of choice on the variance parameters \(\sigma_\epsilon\) and \(\sigma_\mu\). Now, one of the main conceptual problems with this model is that you need to assume it has a single changepoint. You can relax that assumption by extending this model to include more \(\tau\) and \(a\) parameters, but you’ll still need to specify the number of them ahead of time.

Relaxing the assumption on the number of parameters is, for the most part, a solved problem in the research community (see here and here for a few representative examples). Unfortunately, these require the analyst to implement the inference techniques presented by hand; these are often Gibbs samplers or similar. Wouldn’t it be nice to just be able to use a PPL and write down the forward process instead?

That’s the point of this notebook - we’ll walk through a construction of a changepoint model plus inference in PyMC which is considerably more straightforward than a handwritten sampler.

We’ll start by simulating some data over 50 timesteps; there are 4 changepoints and the model’s likelihood is Gaussian. We will use a standard set of imports for working with PyMC and set the seed for repeatability.

import pymc as pm
import matplotlib.pyplot as plt
import numpy as np
import aesara.tensor as at
from collections import Counter
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')

Simulating a dataset

Since the generative process for this data is simple, the code required to simulate data is relatively short. We begin by sampling the changepoints and then adding offsets for each changepoint to the mean value of the data. We then perturb this mean with normal noise variates to create simulated observations.

T = 50
noise_sd = 0.15
n_changepoints = 4
true_cp = np.sort(np.random.choice(T, size=n_changepoints))
offsets_per_period = np.random.randn(n_changepoints)

noiseless = np.zeros(T)
start_time = 0

for changepoint, offset in zip(true_cp, offsets_per_period):
  noiseless[start_time:changepoint] += offset
  start_time = changepoint

xs = noiseless + np.random.randn(T) * noise_sd

As we can see below, the green changepoints do clearly correspond to changes in the level of the time series. However, not all of them are obvious - the last one, in particular, is a relatively small jump.

plt.plot(xs, marker='o', color='k', label='Observed data')
plt.plot(noiseless, color='g', label='Noise-free mean value')

for i, cp in enumerate(true_cp):
  if i == 0:
    label = 'Change point'
  plt.axvline(cp, color='g', linestyle='--',label=label)




Creating the model

For inference, we’ll assume that we don’t know the number of changepoints. The main trick that we’ll use is to instantiate way more changepoints than we need, and use latent variables to zero out most of them.

The model that we declare looks like the following: \(p_{changepoint} \sim \text{Beta}(2,8)\)

\[\tau_1,...,\tau_{M} \sim \text{DiscreteUniform}(\{1,...,T\})\] \[u_1,...,u_M \sim \text{Bernoulli}(p_{changepoint})\] \[\mu \sim \text{Normal}(0, 1)\] \[\sigma^2_{\delta} \sim \text{HalfNormal}(2)\] \[\sigma^2_{\epsilon} \sim \text{HalfNormal}(1)\] \[\delta_1,...,\delta_M \sim \text{Normal}(0, \sigma^2_{\delta})\]

For convenience in our notation, we assume that \(\tau_1,...,\tau_M\) are ordered. We perform an elementwise multiplication of the \(\delta\) offsets with the latent binary variables \(u_m\) as well as indicator variables \(I\) \(\mu_t\):

\[\mu_t = \left[\left( \begin{array}{ccc} I(t\ge \tau_1) \\ \vdots \\ I(t\ge \tau_M) \end{array} \right) \odot \left( \begin{array}{ccc} \delta_1 \\ \vdots \\ \delta_M \end{array} \right)\right] \left(\begin{array}{ccc} u_1 \cdots u_M \end{array}\right)\] \[X_t \sim N(\mu_t, \sigma^2_\epsilon)\]

Since \(p_{changepoint}\) has a prior encouraging it to be lower, the indicator variables above will be pushed towards zero, thereby deactivating some of the \(\delta\) terms’ contributions towards \(X\).

The code block below implements this model logic, though it uses uniform_except_ends to prevent any \(\tau\) values from occurring in the first two or last two timesteps.

max_cp_inference = 10

tiled_times = np.arange(T)[:, None].repeat(max_cp_inference, axis=1)

# We do this so that we can allow the Categorical prior over the changepoint
# locations to exclude the timesteps at the very beginning and very end. 
# The reason for this is that these data points always benefit from using an 
# extra changepoint just for the first or last data points. 
uniform_except_ends = np.ones(T)
uniform_except_ends[0:2] = 0
uniform_except_ends[-2:] = 0
uniform_except_ends = uniform_except_ends / uniform_except_ends.sum()

with pm.Model() as model:
  # Probability that any of the <max_cp_inference> change points are active
  p_changepoint  = pm.Beta('p_changepoint', alpha=2, beta=8)

  # Sort the changepoints for faster mixing / convergence
  changepoints = pm.Categorical('changepoints', uniform_except_ends, shape=max_cp_inference)
  is_cp_active = pm.Bernoulli('is_cp_active', p_changepoint, shape=max_cp_inference)

  changepoints_sorted = at.sort(changepoints)

  # This will give us a nice posterior estimate of the number of changepoints
  num_active_cp = pm.Deterministic('num_active_cp', pm.math.sum(is_cp_active))

  global_mean = pm.Normal('global_mean', sigma=1)
  cp_sd = pm.HalfNormal('cp_sd', sigma=2)
  noise_sd = pm.HalfNormal('noise_sd', sigma=1)
  changepoint_deltas = pm.Normal('changepoint_deltas', cp_sd, shape=max_cp_inference)

  # Operation involves operations on arrays with shape (T, max_cp_inference)
  # Elementwise operation zeros-out contributions from changepoints which are
  # not active
  is_timestep_past_cp = (tiled_times > changepoints[None, :].repeat(T, axis=0))
  active_deltas = (changepoint_deltas*is_cp_active)
  cp_contrib = pm.Deterministic('cp_contrib',
                                global_mean + pm.math.sum(is_timestep_past_cp * active_deltas, axis=1)

  _ = pm.Normal('likelihood', mu=cp_contrib, sigma=noise_sd, observed=xs)

  trace = pm.sample(draws=8000, tune=8000, chains=2)

100.00% [16000/16000 02:30<00:00 Sampling chain 0, 6,594 divergences]
100.00% [16000/16000 03:03<00:00 Sampling chain 1, 3,278 divergences]
ERROR:pymc:There were 6594 divergences after tuning. Increase `target_accept` or reparameterize.
WARNING:pymc:The acceptance probability does not match the target. It is 0.08247, but should be close to 0.8. Try to increase the number of tuning steps.
ERROR:pymc:There were 9872 divergences after tuning. Increase `target_accept` or reparameterize.
WARNING:pymc:The acceptance probability does not match the target. It is 0.5083, but should be close to 0.8. Try to increase the number of tuning steps.

From a sampling perspective, this is a pretty ugly problem. NUTS isn’t designed to work well in an alternating NUTS / Gibbs sampling scheme, and we get tons of divergences because NUTS is facing a log-posterior landscape that is shifting dramatically on every iteration because of the discrete latent variables.

That said, the \(\hat{R}\) values look good - no warnings are fired off!

Assessing the results from inference

As a basic statistic for the number of changepoints, we can just take the posterior mean of the indicator variables’ sum to see how many parameters were active, on average.

<xarray.DataArray 'num_active_cp' ()>

We can also make a plot of the posterior inferences about the location and parameters of each changepoint as compared against the true values:

top_10_cp = Counter(

plt.plot(noiseless, label='True noiseless values', color='green')

plt.plot(trace.posterior['cp_contrib'].mean(axis=(0,1)), label='Inferred noiseless mean', color='orange')

q10, q90 = np.percentile(trace.posterior['cp_contrib'], [10,90], axis=(0,1))
plt.fill_between(np.arange(T), q10, q90, color='orange', alpha=0.2)
plt.plot(xs, linestyle='', color='k', marker='o', label='Observed data')
for i, cp in enumerate(true_cp):
  if i == 0:
    label = 'True change point'
  plt.axvline(cp, color='g', linestyle='--',label=label)

for i, (t, _) in enumerate(top_10_cp):
  if i == 0:
    label = 'Inferred change point'
  plt.axvline(t, color='orange', linestyle='--', label=label)
plt.legend(loc='upper right');


Here, the green vertical lines are the true changepoints while the orange vertical lines are one of the top 10 most likely changepoints as gleaned from the posterior samples. We can see that the major jumps around timesteps 10 and 20 are clearly captured, while there is more uncertainty from timesteps 20-40. The smaller jump at timestep 45 is also missed completely; this is not very surprising given how small it was.

Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Modeling spatial structure in binary data with an H3 hexagonal coordinate system
  • Surrogate modeling for SEIR dynamics
  • Density estimation for geospatial imagery using autoregressive neural models
  • Multivariate sample size for Markov chains
  • Fast Kronecker matrix-vector product with einsum