ckrapu.github.io

Nonparametric changepoint model in PyMC

2022-11-20T00:00:00-08:00

Identifying structural breaks in data is an important problem to folks that frequently work with time series data. Some examples of how people have dealt with the problem of a single changepoint can be found here and here.

Generally, the setup looks like this: we have some data $X_t$ indexed by a discrete time coordinate $t \in {1,…,T}$ and a parametric submodel linking the distribution of $X$ to another quantity $\mu_t$ which depends on the temporal coordinate. For the simple case of a linear Gaussian model with a single change point, we have

\[a_1, a_2 \sim N(0, \sigma^2_\mu)\] \[\tau \sim \text{DiscreteUniform}(\{1,...,T\})\] \[\mu_t = \left\{ \begin{array}{l} a_1 \text{ if } t > \tau \\ a_2 \text{ if } t \le \tau \end{array} \right.\] \[X_t \sim N(\mu_t, \sigma_\epsilon)\]

with your scale priors of choice on the variance parameters $\sigma_\epsilon$ and $\sigma_\mu$. Now, one of the main conceptual problems with this model is that you need to assume it has a single changepoint. You can relax that assumption by extending this model to include more $\tau$ and $a$ parameters, but you’ll still need to specify the number of them ahead of time.

Relaxing the assumption on the number of parameters is, for the most part, a solved problem in the research community (see here and here for a few representative examples). Unfortunately, these require the analyst to implement the inference techniques presented by hand; these are often Gibbs samplers or similar. Wouldn’t it be nice to just be able to use a PPL and write down the forward process instead?

That’s the point of this notebook - we’ll walk through a construction of a changepoint model plus inference in PyMC which is considerably more straightforward than a handwritten sampler.

We’ll start by simulating some data over 50 timesteps; there are 4 changepoints and the model’s likelihood is Gaussian. We will use a standard set of imports for working with PyMC and set the seed for repeatability.

import pymc as pm
import matplotlib.pyplot as plt
import numpy as np
import aesara.tensor as at
from collections import Counter
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg', 'pdf')
np.random.seed(827)

Simulating a dataset

Since the generative process for this data is simple, the code required to simulate data is relatively short. We begin by sampling the changepoints and then adding offsets for each changepoint to the mean value of the data. We then perturb this mean with normal noise variates to create simulated observations.

T = 50
noise_sd = 0.15
n_changepoints = 4
true_cp = np.sort(np.random.choice(T, size=n_changepoints))
offsets_per_period = np.random.randn(n_changepoints)

noiseless = np.zeros(T)
start_time = 0

for changepoint, offset in zip(true_cp, offsets_per_period):
  noiseless[start_time:changepoint] += offset
  start_time = changepoint

xs = noiseless + np.random.randn(T) * noise_sd

As we can see below, the green changepoints do clearly correspond to changes in the level of the time series. However, not all of them are obvious - the last one, in particular, is a relatively small jump.

plt.figure(figsize=(9,3))
plt.plot(xs, marker='o', color='k', label='Observed data')
plt.plot(noiseless, color='g', label='Noise-free mean value')

for i, cp in enumerate(true_cp):
  if i == 0:
    label = 'Change point'
  else:
    label=None
  plt.axvline(cp, color='g', linestyle='--',label=label)

plt.legend()

plt.xlabel('Timestep',fontsize=12)
plt.ylabel('$X(t)$',fontsize=18);

Creating the model

For inference, we’ll assume that we don’t know the number of changepoints. The main trick that we’ll use is to instantiate way more changepoints than we need, and use latent variables to zero out most of them.

The model that we declare looks like the following: $p_{changepoint} \sim \text{Beta}(2,8)$

\[\tau_1,...,\tau_{M} \sim \text{DiscreteUniform}(\{1,...,T\})\] \[u_1,...,u_M \sim \text{Bernoulli}(p_{changepoint})\] \[\mu \sim \text{Normal}(0, 1)\] \[\sigma^2_{\delta} \sim \text{HalfNormal}(2)\] \[\sigma^2_{\epsilon} \sim \text{HalfNormal}(1)\] \[\delta_1,...,\delta_M \sim \text{Normal}(0, \sigma^2_{\delta})\]

For convenience in our notation, we assume that $\tau_1,…,\tau_M$ are ordered. We perform an elementwise multiplication of the $\delta$ offsets with the latent binary variables $u_m$ as well as indicator variables $I$ $\mu_t$:

\[\mu_t = \left[\left( \begin{array}{ccc} I(t\ge \tau_1) \\ \vdots \\ I(t\ge \tau_M) \end{array} \right) \odot \left( \begin{array}{ccc} \delta_1 \\ \vdots \\ \delta_M \end{array} \right)\right] \left(\begin{array}{ccc} u_1 \cdots u_M \end{array}\right)\] \[X_t \sim N(\mu_t, \sigma^2_\epsilon)\]

Since $p_{changepoint}$ has a prior encouraging it to be lower, the indicator variables above will be pushed towards zero, thereby deactivating some of the $\delta$ terms’ contributions towards $X$.

The code block below implements this model logic, though it uses uniform_except_ends to prevent any $\tau$ values from occurring in the first two or last two timesteps.

max_cp_inference = 10

tiled_times = np.arange(T)[:, None].repeat(max_cp_inference, axis=1)

# We do this so that we can allow the Categorical prior over the changepoint
# locations to exclude the timesteps at the very beginning and very end. 
# The reason for this is that these data points always benefit from using an 
# extra changepoint just for the first or last data points. 
uniform_except_ends = np.ones(T)
uniform_except_ends[0:2] = 0
uniform_except_ends[-2:] = 0
uniform_except_ends = uniform_except_ends / uniform_except_ends.sum()

with pm.Model() as model:
  # Probability that any of the  change points are active
  p_changepoint  = pm.Beta('p_changepoint', alpha=2, beta=8)

  # Sort the changepoints for faster mixing / convergence
  changepoints = pm.Categorical('changepoints', uniform_except_ends, shape=max_cp_inference)
  is_cp_active = pm.Bernoulli('is_cp_active', p_changepoint, shape=max_cp_inference)

  changepoints_sorted = at.sort(changepoints)

  # This will give us a nice posterior estimate of the number of changepoints
  num_active_cp = pm.Deterministic('num_active_cp', pm.math.sum(is_cp_active))

  global_mean = pm.Normal('global_mean', sigma=1)
  cp_sd = pm.HalfNormal('cp_sd', sigma=2)
  noise_sd = pm.HalfNormal('noise_sd', sigma=1)
  changepoint_deltas = pm.Normal('changepoint_deltas', cp_sd, shape=max_cp_inference)

  # Operation involves operations on arrays with shape (T, max_cp_inference)
  # Elementwise operation zeros-out contributions from changepoints which are
  # not active
  is_timestep_past_cp = (tiled_times > changepoints[None, :].repeat(T, axis=0))
  active_deltas = (changepoint_deltas*is_cp_active)
  cp_contrib = pm.Deterministic('cp_contrib',
                                global_mean + pm.math.sum(is_timestep_past_cp * active_deltas, axis=1)
  )

  _ = pm.Normal('likelihood', mu=cp_contrib, sigma=noise_sd, observed=xs)

  trace = pm.sample(draws=8000, tune=8000, chains=2)

100.00% [16000/16000 02:30<00:00 Sampling chain 0, 6,594 divergences]

100.00% [16000/16000 03:03<00:00 Sampling chain 1, 3,278 divergences]

ERROR:pymc:There were 6594 divergences after tuning. Increase `target_accept` or reparameterize.
WARNING:pymc:The acceptance probability does not match the target. It is 0.08247, but should be close to 0.8. Try to increase the number of tuning steps.
ERROR:pymc:There were 9872 divergences after tuning. Increase `target_accept` or reparameterize.
WARNING:pymc:The acceptance probability does not match the target. It is 0.5083, but should be close to 0.8. Try to increase the number of tuning steps.

From a sampling perspective, this is a pretty ugly problem. NUTS isn’t designed to work well in an alternating NUTS / Gibbs sampling scheme, and we get tons of divergences because NUTS is facing a log-posterior landscape that is shifting dramatically on every iteration because of the discrete latent variables.

That said, the $\hat{R}$ values look good - no warnings are fired off!

Assessing the results from inference

As a basic statistic for the number of changepoints, we can just take the posterior mean of the indicator variables’ sum to see how many parameters were active, on average.

trace.posterior['num_active_cp'].mean()


array(3.2660625)

We can also make a plot of the posterior inferences about the location and parameters of each changepoint as compared against the true values:

top_10_cp = Counter(
    trace.posterior['changepoints'].to_numpy().ravel().tolist()
    ).most_common(10)

plt.figure(figsize=(9,4))
plt.plot(noiseless, label='True noiseless values', color='green')

plt.plot(trace.posterior['cp_contrib'].mean(axis=(0,1)), label='Inferred noiseless mean', color='orange')

q10, q90 = np.percentile(trace.posterior['cp_contrib'], [10,90], axis=(0,1))
plt.fill_between(np.arange(T), q10, q90, color='orange', alpha=0.2)
plt.plot(xs, linestyle='', color='k', marker='o', label='Observed data')
for i, cp in enumerate(true_cp):
  if i == 0:
    label = 'True change point'
  else:
    label=None
  plt.axvline(cp, color='g', linestyle='--',label=label)

for i, (t, _) in enumerate(top_10_cp):
  if i == 0:
    label = 'Inferred change point'
  else:
    label=None
  plt.axvline(t, color='orange', linestyle='--', label=label)
plt.xlabel('Timestep',fontsize=12)
plt.ylabel('$X(t)$',fontsize=18)
plt.legend(loc='upper right');

Here, the green vertical lines are the true changepoints while the orange vertical lines are one of the top 10 most likely changepoints as gleaned from the posterior samples. We can see that the major jumps around timesteps 10 and 20 are clearly captured, while there is more uncertainty from timesteps 20-40. The smaller jump at timestep 45 is also missed completely; this is not very surprising given how small it was.

Easy local average with Google Earth Engine’s Python API

2022-03-07T00:00:00-08:00

Across many projects, I’ve needed to analyze remote sensing data and compute, for some points or polygons, a statistical summary of the remotely sensed layer in a neighborhood of those objects. Sometimes, that summary needs to be exactly calculated within the extent of the spatial object (like the boundaries of a field or a county) but other times, simply knowing the average in a circular or square buffer around the feature is good enough.

I wrote this notebook to show what is the least painful way to do it for large numbers of geometries without having to manually retrieve and download data. We do this by making use of the functionality in Google’s Earth Engine via its Python API. In this example, I calculate the average amount of surface water within a 1 km. circular buffer around 100,000 points sampled within the vicinity of Washington, DC.

The imports we require are fairly standard. I’ve used contextily here only to show a basemap comparison later on - you can omit this with no ill effect. Otherwise, we use the ee library for Earth Engine as well as Geopandas and the built-in json library.

import contextily as ctx
import ee
import geopandas  as gpd
import json
import numpy      as np

from shapely   import geometry
from time      import time

The first thing we’ll do is create some fake data. Here, I randomly sample a large number of points as referenced by their latitude / longitude coordinates within a bounding box.

left, lower, right, upper=-77.14,38.81,-76.90,38.99

n     = 100_000
longs = np.random.uniform(left, right, size=n)
lats  = np.random.uniform(lower, upper, size=n)

gdf        = gpd.GeoDataFrame(geometry=[geometry.Point(x,y) for x,y in zip(longs,lats)], 
                crs='epsg:4326')
n_geoms    = len(gdf)

Next, we’ll indicate that the remote sensing image we want to average over is the Global Surface Water binary yes/no water layer which indicates, for each pixel, whether or not water was ever sensed in that pixel over the entire Landsat archive. We also define how big we want our local summary buffer to be, in terms of meters.

image         = ee.Image("JRC/GSW1_3/GlobalSurfaceWater").select("max_extent")
radius_meters = 1000

The most important part of this is the next code block. After instantiating an EarthEngine FeatureCollection corresponding to our points, we convolve the target image with a circular kernel around the points that we’ve supplied. We then ask Earth Engine to directly evaluate these summaries and then put them into our local memory.

def radial_average(ee_geoms, image, radius_meters):
    '''
    Creates EE object for zonal average about each
    point in  and directly evaluates it
    into the local kernel.
    '''

    kernel = ee.Kernel.circle(radius=radius_meters,
                              units='meters',
                              normalize=True);

    smooth = image.convolve(kernel);

    val_list = smooth.reduceRegions(**{
        'collection':ee_geoms,
        'reducer':ee.Reducer.mean()
    }).aggregate_array('mean')

    val_array = ee.Array(val_list).toFloat().getInfo()
    return val_array

This loop takes our GeoDataFrame of points from earlier and splits it into smaller blocks so that we don’t hit the Earth Engine data transfer limit. We use the JSON representation of our GeoDataFrame to make it palatable to Earth Engine.

block_size    = 10000
n_requests    = int(n_geoms/block_size)

blocks = np.array_split(gdf, n_requests)

vals  = []
start = time()

ee.Initialize()

for gdf_partial in blocks:
    js           = json.loads(gdf_partial.to_json())
    ee_geoms     = ee.FeatureCollection(js)
    partial_vals = radial_average(ee_geoms, image, radius_meters)
    
    vals += [partial_vals]
    print(time() - start)
    start = time()
    
vals = np.concatenate(vals)

036774396896362
84600043296814
808760643005371
694471120834351
156326532363892
155316114425659
938997268676758
8641228675842285
573741436004639
0107102394104

Overall, it takes about 80 seconds for this to run. It could certainly be done faster locally with GDAL and/or zonalstats, but that requires extra setup and storing the data locally.

To verify that our results are sensible, we overlay our point summaries with an OpenStreetMap-derived basemap for Washington DC.

fig, axes = plt.subplots(1,2,figsize=(11,5), sharey=True)
axes[0].scatter(longs,lats, c=vals, vmax=0.05, s=0.05)
axes[1].scatter(longs,lats, alpha=0.001)
[ctx.add_basemap(ax, crs='epsg:4326') for ax in axes]

axes[0].set_title('Proximity to water, per point')
axes[1].set_title('OSM basemap')

plt.tight_layout();

Fast Kronecker matrix-vector product with einsum

2021-11-28T00:00:00-08:00

In numerical linear algebra, a common problem that arises in the analysis of large datasets is the product of a dense but structured $N \times N$ matrix $\mathbf{A} = \bigotimes_{j=1}^J \mathbf{A}_j$ with a similarly dense vector $\mathbf{y}$. We’re assuming that $\mathbf{A}$ can be written as the tensor or Kronecker product of $J$ smaller matrices denoted by $\mathbf{A}_j$, each of which has dimension $N_j$.

Our strategy in computing this is to rearrange $\mathbf{y}$ into a multdimensional array and, by contracting indices in an efficient way, avoid an $\mathcal{O}(N^2)$ matrix-vector operation. Some of the commonly used identities of product matrices are available on Wikipedia, and we’ll make use of several of them. We let $\mathbf{Y}$ denote an array formed from the vector $\mathbf{y}$ by reshaping into a form with axis dimensions of $N_1,…,N_J$. By the associative property of the tensor product, $\begin{align} \left(\bigotimes_{j=1}^J \mathbf{A}_j \right) \mathbf{y}&=\mathbf{A}_1\otimes\left(\cdot\cdot\cdot\otimes(\mathbf{A}_J\mathbf{Y})\right)\\ &=u^{(1)}_{k_1,k_2}u^{(2)}_{k_3,k_4}\cdot\cdot\cdot u^{(J)}_{k_{2J-1},k_{2J}}Y_{k_2,k_4,...,k_{2J}} \end{align}$ where we let $u^{(1)}_{k_1k_2}$ refer to the entry of the $k_1$-th row and $k_2$-th column of $\mathbf{A}_1$. The second equation above uses Einstein notation in representing the tensor product and multidimensional array $\mathbf{Y}$. The $k$ indices look a little funky compared to usual tensor notation; in physics we are used to having actual letters such as $i, j, k$ rather than letters with numbers. However, if we have an arbitrary number of Kronecker factors, there may be many, many indices used, so we avoid using any particular letter and instead replace $i, j, k, l, m,…$ with $k_1, k_2, k_3, k_4, k_5,…$.

The rule for Einstein notation is that when an index appears twice, we sum over it, also described as “contraction” over that index. Contracting the repeated indices, the result of the above procedure is an array $\mathbf{Z}$ of the same dimensions as $\mathbf{Y}$ running over indices $k_1, k_3,…,k_{2J-1}$ which has been transformed by the repeated application of the matrix and tensor product operations and which satisfies the equality $Vec(\mathbf{Z})=\left(\bigotimes_{j=1}^J \mathbf{A}_j \right) \mathbf{y}.$ Since each of the $J$ tensor contractions involves a sum involving $N_j$ terms, each of which makes use of all $N$ elements in $\mathbf{Y}$, the overall complexity of this algorithm is $\mathcal{O}(N \cdot \sum_{j=1}^J N_j)$, which compares favorably with the naive $\mathcal{O}(N^2)$.

This entire procedure can be run in a single call to the einsum function available in Numpy.

Here, I should note that this is essentially the same result characterized by Saatchi’s PhD thesis, but this presentation omits the dependence upon permutation indices and transpositions that obscures the essential index operations involved. Alex Williams has an implementation of this calculation in PyTorch, but it does the index juggling by hand and is a bit more complex than automatically contracting the right indices. In the rest of this notebook, I show how to implement this operation in a few lines of Python.

Implementing an efficient matrix-vector product

To begin, we’ll cook up a set of 5 square, symmetric matrices of increasing size. We’ll guarantee they are symmetic and positive semidefinite by squaring them.

import numpy as np
import time

sizes = 3,4,5,6,7
prod_size = np.prod(sizes)

matrices = [np.random.randn(n,n) for n in sizes]
matrices = [X@X.T for X in matrices]

In the end, we want to take the Kronecker / tensor product of these matrices. Since they have increasing dimension, the dimension of their Kronecker product will be 3*4*5*6*7=2520

To see what the Kronecker product looks like, let’s see what the product of two of these matrices looks like:

[A.shape for A in matrices]

[(3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]

As promised, these matrices are invertible as shown by their determinants. Since each of these determinants is nonzero, an inverse exists.

[np.linalg.det(A) for A in matrices]

[19.055143537578502,
041852313010475074,
008158197604522445,
252474950990084,
5649833818011]

We will also instantiate the vector $\mathbf{y}$, though here we create it in the array form and then vectorize it later.

y = np.random.randn(*sizes)

Here, we perform a brute-force calculation of the matrix-vector product by instantiating the full Kronecker product. We do this by iteratively applying the Kronecker product to each of the matrices.

from functools import reduce
big_matrix = reduce(np.kron, matrices)
matrix_product = big_matrix @ y.ravel()

We’ll also do the same using the einsum function. The first argument is a string specification for the tensor contraction. Essentially, it is saying that we have 5 two-dimensional arrays (with indices ij, kl, and so on), and that they are multiplied with a 5-dimensional array to output another 5-dimensional array.

tensors = matrices+[y]
einstein_product = np.einsum('ij,kl,mn,op,qr,ikmoq->jlnpr', *tensors)

Both procedures result in the same values! Note that if you use the elementwise == operator overloaded by Numpy, you will get False due to minor differences due to the floating point representation.

np.allclose(matrix_product, einstein_product.ravel())

True

The next code cell packages up these functions so we can reuse them later to assess the relative runtimes of each.

from string import ascii_lowercase as letters

def mv_kron(matrices, y):
    '''
    Compute product of vector and Kronecker-structured matrix
    via brute-force enumeration of entire Kronecker matrix.
    '''
    A_kron = reduce(np.kron, matrices)
    return A_kron @ y.ravel()

def mv_einstein(matrices, y):
    '''
    Use Einstein summation convention to iteratively
    contract along secondary axes and implement Kronecker 
    matrix-vector product
    '''
    p = len(matrices)
    
    if p > 13:
        raise ValueError('There aren\'t enough letters in the alphabet for this operation :(')
    
    letter_pairs = [letters[2*i]+letters[2*i+1] for i in range(p)]
    matrix_string = ','.join(letter_pairs)
    vec_in_string, vec_out_string = [''.join(s) for s in zip(*letter_pairs)]
    string_spec = f'{matrix_string},{vec_in_string}->{vec_out_string}'
    
    return np.einsum(string_spec, *matrices, y, optimize='greedy').ravel(), string_spec

Comparing runtimes

ein_times  = []
kron_times = []
dimensions = []

for scale in [1, 2, 3, 4, 5, 6, 7, 8]:
    sizes = [2*scale, 4*scale, 8*scale]
    matrices = [np.random.randn(n,n) for n in sizes]
    matrices = [X@X.T for X in matrices]
    y = np.random.randn(*sizes)
    
    start_kron = time.perf_counter()
    mv_kron(matrices, y)
    end_kron = time.perf_counter() - start_kron
    
    start_ein = time.perf_counter()
    _, string_spec = mv_einstein(matrices, y)
    end_ein = time.perf_counter() - start_ein
    
    ein_times  += [end_ein]
    kron_times += [end_kron]
    dimensions += [y.size]
    print('Dimension:', str(y.size).ljust(5), f'Einstein time: {end_ein:.3f} s.', f'Naive time {end_kron:.3f}s.')

Dimension: 64    Einstein time: 0.001 s. Naive time 0.001s.
Dimension: 512   Einstein time: 0.000 s. Naive time 0.016s.
Dimension: 1728  Einstein time: 0.000 s. Naive time 0.039s.
Dimension: 4096  Einstein time: 0.000 s. Naive time 0.190s.
Dimension: 8000  Einstein time: 0.001 s. Naive time 0.921s.
Dimension: 13824 Einstein time: 0.001 s. Naive time 2.632s.
Dimension: 21952 Einstein time: 0.018 s. Naive time 7.073s.
Dimension: 32768 Einstein time: 0.063 s. Naive time 95.083s.

As we can see below, there’s a big disparity in runtime, although I can’t really probe any larger dimension sizes since my laptop has a small amount of memory.

import matplotlib.pyplot as plt

plt.plot(dimensions, ein_times, marker='o', color='b', label='Einstein')
plt.plot(dimensions, kron_times, marker='d', color='m', label='Naive')
plt.yscale('log')
plt.ylabel('$\log_{10}$ runtime'), plt.xlabel('$N$')
plt.savefig('../figures/kmvp_runtime.png', dpi=400)
plt.legend();

Finally, it’s interesting to note that behind-the-scenes, the Numpy implementation of einsum is performing a path optimization to determine which indices should be contracted first. We can check it out by calling einsum_path and examining the results.

[print(string) for string in np.einsum_path(string_spec, *matrices, y)];

['einsum_path', (2, 3), (1, 2), (0, 1)]
  Complete contraction:  ab,cd,ef,ace->bdf
         Naive scaling:  6
     Optimized scaling:  4
      Naive FLOP count:  4.295e+09
  Optimized FLOP count:  7.340e+06
   Theoretical speedup:  585.143
  Largest intermediate:  3.277e+04 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   4                 ace,ef->acf                           ab,cd,acf->bdf
   4                 acf,cd->adf                              ab,adf->bdf
   4                 adf,ab->bdf                                 bdf->bdf

This printout tells us two things: first, the number of floating point operations is nearly 600X smaller for the optimized summation path. Second, the largest array held in memory is only on the order of $10^4$ elements, so it’s much more memory efficient.

Balanced spatial partitioning for point data in 20 lines

2021-11-10T00:00:00-08:00

When working with geospatial data, sometimes a dataset is simply too large in its original form or file to be worked with effectively. I often work with hierarchical statistical models that function effectively when a larger dataset is partitioned into smaller subsets. To preprocess the data, I frequently need to find a way to split up a larger spatial domain into smaller pieces such that the number of objects or data points in each piece is roughly equal.

Unfortunately, I couldn’t find a quick reference online to do this with Python, so this post covers how to do it.

import geopandas         as gpd
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
load_path ='../sample_data.gpkg'
gdf       = gpd.read_file(load_path).to_crs('epsg:2283')

The data we’ll be working with is a set of building footprints for structures in Washington, DC. These have been collected from OpenStreetMap, and there’s roughly 134,000 of them:

print(len(gdf))

As we can see from the plot below, these buildings aren’t already nicely spaced into even subsets. There’s many more in the urban core of Washington, DC. Also, the geometry of the region itself is somewhat irregular.

gdf.plot(figsize=(8,8)), plt.axis('off');

To split up the data, we’re going to use a recursive approach. The main data structure we’re working is going to be a list of tuples containing an x-coordinate, y-coordinate, and feature index, respectively. We will recursively split subsets of features into balanced north-south or east-west halves by bisecting a sorted array. If we denote the number of splits as $M$ and the number of features as $N$, this naive approach has a complexity of $\mathcal{O}(MN \log N)$ and thus scales well to large-ish datasets. We shouldn’t need to sort at each splitting, however, so really, this algorithm should be running in $\mathcal{O}(M+N \log N)$ time. I didn’t take the time to make that modification here since the original version was fast enough.

The top-level function (shown below) initializes the required variables and also post-processes the subsets by repeatedly flattening a nested list-of-lists until only the bottom-level results of the recursion are contained in a single top-level list.

def split(gdf, max_level):
    
    xs = (gdf.centroid.x.values, gdf.centroid.y.values, gdf.index.values)
    xs = list(zip(*xs))
    
    splits = split_recurse(xs, 0, max_level)
    
    for i in range(max_level-1):
        splits = sum(splits, start=[])
        
    indices_only = [[x[2] for x in subset] for subset in splits]
        
    return [gdf.loc[s] for s in indices_only]

The recursive function is defined below. It’s pretty simple - we just sort, split, and continue on with each subset. The parameter max_level controls how many partition cells there are; we do a binary split at each level, resulting in 2**max_level cells by the time we’re finished.

def split_recurse(xs, split_pos, max_level, level=1):
    
    xs.sort(key=lambda x:x[split_pos])
    mid = int(len(xs)/2)

    above = [pair for i, pair in enumerate(xs) if i > mid]
    below = [pair for i, pair in enumerate(xs) if i <= mid]
    
    subsets = [above, below]
        
    if level == max_level:
        return subsets
    else:
        # We flip between using the 0-th position and the 1st position in 
        # our triplets to alternate between x- and y-coordinates for splitting.
        return [split_recurse(subset, 1-split_pos, max_level, level=level+1) for subset in subsets]

Let’s see how the results are distributed in space. The plot below shows the assignment of each building to a partition cell; each color is a different cell.

color_string = color='bgrcmykbgrcmykbgrcmykbgrcmyk'
plt.figure(figsize=(7,7))
gdf_splits = split(gdf, 5)

[subset.plot(ax=plt.gca(), color=color) for color, subset in zip(color_string, gdf_splits)];
plt.axis('off');

We can also see whether or not the subsets are balanced in size:

[len(subset) for subset in gdf_splits]

All of the cells have nearly the same number of points between them!

Modeling spatial structure in binary data with an H3 hexagonal coordinate system

2021-04-29T00:00:00-07:00

We often model geostatistical (i.e. point-referenced data) in order to determine whether or not there are spatial patterns of autocorrelation. The object of interest is frequently an underlying spatial function giving rise to patterns of spatially correlated data. When we work with discrete observational data, a problem arises - we want to study smoothly-varying response surfaces over space, but the data themselves are not continuous and therefore we cannot specify a likelihood which is continuous in both space and response. Consequently, we often choose to reparameterize our model in terms of a latent smooth spatial surface and a link function mapping this spatial surface to the parameters of a likelihood function appropriate for discrete data.

This notebook shows how to analyze binary geospatial point data using a spatially-smoothing conditional autoregression model to test for the existence of clusters of 0 or 1 values. The dataset used in this example is simulated data of preterm births in Washington, DC. While many autoregressive models use square grids, we’re going to use a hexagonal tiling from Uber’s H3 coordinate system library to demarcate our areal units.

Generating simulated data

We begin by importing the requisite libraries and simulating synthetic data of preterm births.

import geopandas as gpd
import h3
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import pymc3 as pm
import shapely

from sklearn.neighbors import BallTree

%matplotlib inline
%config InlineBackend.figure_format='retina'

To ensure reproducibility, it’s a good habit to include a version stamp as shown below.

%load_ext watermark
%watermark -iv

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
h3        : 3.7.2
pymc3     : 3.11.1
shapely   : 1.7.1
geopandas : 0.8.1
numpy     : 1.18.5
pandas    : 1.1.3
matplotlib: 3.3.2
networkx  : 2.5

At several points we will need to use multiple functions to handle geospatial operations such as creating point data, determining the adjacency of vector features, and ensuring that sets of spatial objects are topologically connected. We’ll define them now so we can use them later.

def xy_from_gdf(gdf):
    '''
    Returns Nx2 matrix of X,Y coordinates from a GeoDataFrame
    '''
    return np.stack([gdf.centroid.x, gdf.centroid.y], axis=1)

def lat_lng_to_h3(point, h3_level):
    '''
    Applies H3's geocoding to determine the hexagonal cell
    containing a given point. The h3 level determines the
    size of the hexagonal lattice used.
    '''
    return h3.geo_to_h3(
      point.geometry.centroid.y, point.geometry.centroid.x, h3_level)

def add_geometry(row):
    '''
    Creates a vector feature from the H3 hexagonal coordinates.
    '''
    points = h3.h3_to_geo_boundary(
      row['h3'], True)
    return shapely.geometry.Polygon(points)

def nearest_neighbor_centroid(gdf1, gdf2, k=4):
    '''
    Vectorized operation for identifying the nearest points in gdf2 relative to gdf1.
    '''
    X_proposed, X_base = xy_from_gdf(gdf1), xy_from_gdf(gdf2)        
    nearest = BallTree(X_proposed, leaf_size=2).query(X_base, k=k, return_distance=False)
    return nearest

def sigmoid(x):
    return np.exp(x) / (1 + np.exp(x))

def adjacency_via_buffer(gdf, very_small_distance=0.0003):
    '''
    Uses a spatial buffering and intersection operator to determine
    which features share a boundary in a GeoDataFrame.
    '''
    N = len(gdf)
    W = np.zeros([N,N], dtype=int)

    buffered = gdf.copy()
    buffered.geometry = buffered.buffer(very_small_distance)

    # Find neighbors by buffering and locating non-null overlap
    nearby  = buffered.geometry.apply(lambda x: np.where(buffered.intersection(x).area > 0)[0])
    nearest = nearest_neighbor_centroid(gdf, gdf, k=2)[:, 1:]

    for i, neighbors in enumerate(nearby):
        if len(neighbors) > 1:
            W[i, neighbors] += 1
            W[i,i]          -= 1 # self-neighboring is not allowed
        else:
            W[i, nearest[i]] = 1

    W = W+W.T > 0.
    W = W.astype(int)
    return W

def connect_components(W, geom_series):
    '''
    Iteratively add edges between nodes to the network
    until only a single edge-connected component covers the entire graph. This
    is critical for usage of the CAR model, which can fail if there
    are "islands" disconnected from each other in the network / adjacency matrix.
    '''
    connected = False
    while not connected:

        # Find the largest component and drop it from
        # the list of islands
        G = nx.convert_matrix.from_numpy_matrix(W)
        
        if nx.is_connected(G):
            break
            
        components = list(nx.connected_components(G))
        sizes      = [len(x) for x in components]
        largest    = np.argmax(sizes)
        components.pop(largest)

        # For each island, find the nearest node not on
        # the island and hook it up
        for island in components:
            element_on_island = list(island)[0]
            geom_on_island    = geom_series.iloc[[element_on_island]]
            repeated  = geom_on_island.iloc[np.zeros(geom_series.shape[0])]
            distances = geom_series.geometry.apply(lambda x: geom_on_island.distance(x))
            ordered_by_dist = np.argsort(distances.values[:,0])

            connected_for_island = False
            ctr = 0
            while not connected_for_island:
                proposed = ordered_by_dist[ctr]
                if proposed not in island:
                    W[element_on_island, proposed] = 1
                    W[proposed, element_on_island] = 1
                    print('Match for element {0} is {1}'.format(element_on_island, proposed))
                    connected_for_island = True
                ctr += 1

        G = nx.convert_matrix.from_numpy_matrix(W)
        connected = nx.is_connected(G)
    return W

Next, we use a shapefile of census tract data to determine how to sample birth events over space. We will use the population within each census tract, combined with a national average birth rate to determine how many births will be placed within each tract.

'''
Census tract shapefile taken from https://opendata.arcgis.com/datasets/f33d847161174e81ad59c9ea9c1f5a00_36.zip
'''
census_tract_path = "./data/Preliminary_2020_Census_Tract/Preliminary_2020_Census_Tract.shp"
tract_gdf = gpd.read_file(census_tract_path)

As we can see here, the POP10 field contains census counts.

tract_gdf.head()

	OBJECTID	STATEFP	COUNTYFP	TRACTCE	NAME	TRACTID	TRACTLABEL	POP10	HOUSING10	SHAPEAREA	SHAPELEN	geometry
0	21	11	001	001301	13.01	11001001301	13.01	3955	2156	2.882225e+06	8705.698378	POLYGON ((-77.06943 38.95434, -77.06932 38.954...
1	22	11	001	002001	20.01	11001002001	20.01	2340	1026	6.337953e+05	4198.601803	POLYGON ((-77.04338 38.96146, -77.04329 38.961...
2	23	11	001	003302	33.02	11001003302	33.02	2134	982	2.042153e+05	1915.794576	POLYGON ((-77.01428 38.91506, -77.01275 38.915...
3	24	11	001	008402	84.02	11001008402	84.02	2149	1270	2.741538e+05	2698.287213	POLYGON ((-76.99497 38.89741, -76.99496 38.898...
4	1	11	001	000101	1.01	11001000101	1.01	1384	999	1.993245e+05	2168.618432	POLYGON ((-77.05714 38.91055, -77.05702 38.910...

Our next step is to create a data table in which each row corresponds to a birth event and is associated with geospatial coordinates as well as a year.

base_pregnancy_rate = 11.4 / 1000 # births per thousand people
years = np.arange(2010, 2019)

birth_coords = []
birth_points = []

for year in years:
    
    for i, tract in tract_gdf.iterrows():
        tract_boundary = tract.geometry
        left, lower, right, upper = tract_boundary.bounds
        n_pregnancies = int(base_pregnancy_rate * tract['POP10'])
        
        for j in range(n_pregnancies):
            is_in_bounds = False
            
            while not is_in_bounds:
                coords = np.random.uniform(low=[left,lower],high=[right,upper])
                sample = shapely.geometry.Point(coords)
                is_in_bounds = tract.geometry.contains((sample))
                
            birth_points.append(sample)
            birth_coords.append(list(coords)+[year + np.random.uniform()])
                     
birth_gdf = gpd.GeoDataFrame(data=birth_coords, columns=['lat','lon','year'], geometry=birth_points)
birth_gdf["year_int"] = birth_gdf['year'].astype(int)

birth_gdf.head()

	lat	lon	year	geometry	year_int
0	-77.062823	38.951295	2010.816216	POINT (-77.06282 38.95130)	2010
1	-77.056945	38.949016	2010.555112	POINT (-77.05695 38.94902)	2010
2	-77.066707	38.951606	2010.082630	POINT (-77.06671 38.95161)	2010
3	-77.062031	38.951623	2010.639241	POINT (-77.06203 38.95162)	2010
4	-77.064879	38.955598	2010.950265	POINT (-77.06488 38.95560)	2010

To make this problem more interesting, we’ll simulate preterm births with spatial dependency. Our true generative process will allow for more preterm births in locations which are farther to the east and north.

birth_df   = pd.DataFrame(birth_gdf).drop(['geometry', 'year_int'], axis=1)
scales     = birth_df.std()
means      =  birth_df.mean()
zscore_gdf = (birth_df -means)/scales

# coefs are for lat, lon, and year respectively.
true_coefficients = [0.2, 0.2, 0.0]

# this value was chosen by hand to roughly line up with ~12% preterm births, on average
true_intercept    = -2
logits = zscore_gdf.dot(true_coefficients)+true_intercept

birth_gdf['preterm_prob'] = sigmoid(logits)

birth_gdf['preterm'] = np.random.binomial(1, birth_gdf['preterm_prob'])
birth_gdf['preterm'].mean()

0.12306756134464839

Let’s see the spatial point pattern for the births. The preterm births are marked in red while normal births are marked with blue points.

fig, axes = plt.subplots(2,2, figsize=(20,22), sharex=True, sharey=True)
axes = axes.ravel()

for i, year in enumerate(years[0:4]):
    birth_gdf.query(f"year_int=={year} & preterm==1").plot(markersize=5,
                                            color='red', ax=axes[i],zorder=1)
    
    birth_gdf.query(f"year_int=={year} & preterm==0").plot(markersize=2,
                                            color='blue', ax=axes[i],zorder=1, alpha=0.5)
    
    tract_gdf.plot(ax=axes[i], facecolor='none', edgecolor='k',zorder=2)
    preterm_frac = birth_gdf.query(f"year_int=={year}")['preterm'].mean()
    axes[i].set_title(f'Simulated births for {year}\n(Preterm fraction: {int(preterm_frac*100)}%)', fontsize=24)
    axes[i].axis('off')
    
plt.tight_layout()

A flaw of this simulation is that there are clearly jumps in point density at the interface between high- and low-population census tracts which are not reflective of reality.

Preprocessing spatial adjacency data

Since we don’t want to construct our model directly at the point level, we instead need to aggregate to a larger spatial unit. For this purpose, we’ll use the h3 library to aggregate into hexagonal bins.

h3_level = 9
gdf = birth_gdf
gdf['h3']      = gdf.apply(lat_lng_to_h3, h3_level=h3_level, axis=1)
gdf['hexagon'] = gdf.apply(add_geometry, axis=1)

hex_only = gdf[['hexagon','h3']].drop_duplicates(subset='h3')
hex_only = gpd.GeoDataFrame(geometry=hex_only['hexagon'], data=hex_only['h3'])
hex_only.sort_values(by='h3', inplace=True)

h3_to_int = {code: integer for integer, code in enumerate(np.sort(gdf['h3'].unique()))}
gdf['h3_int'] = gdf['h3'].apply(lambda x: h3_to_int[x])
print(hex_only.shape)

(1703, 2)

Under our model, each of the H3 cells is assumed to have its own free parameter for the probability of preterm birth. However, we will use the CAR prior to allow for pooling information across spatial cells and encouraging spatial smoothness in their estimates.

fig, axes = plt.subplots(1,2, figsize=(10,5), sharey=True)
hex_only.plot(ax=axes[0],edgecolor='k'), axes[0].set_title('H3 spatial cells')
tract_gdf.plot(ax=axes[1],edgecolor='k'), axes[1].set_title('DC census tracts')
plt.tight_layout()

As a final preprocessing step, we need to create the adjacency matrix $W$ and ensure that every node has a path through the adjacency matrix to every other path. Put more formally, we need to ensure there is only a single connected component in $W$ and that it is nontrivial.

# In geographic coordinate system
very_small_distance = 0.0003
W = adjacency_via_buffer(hex_only, very_small_distance=very_small_distance)
W = connect_components(W, hex_only)

To check the correctness of our procedures, we can make sure that every cell has at least neighbor and that no cell has more than six neighbors

neighbors_per_cell = W.sum(axis=0)

assert neighbors_per_cell.min() >= 1 & neighbors_per_cell.max() <= 6

Inference for model parameters

The probabilistic model we use has the following specification:

\[\alpha \sim Uniform(-0.95, 0.95)\\ c \sim Normal^{+}(0, 4)\\ \beta_0 \sim Normal(0, 9)\\ \mathbf{u}\sim CAR(W, \alpha)\\ y_j \sim Binomial(n_j, \sigma(u_j + \beta_0))\]

Here, $Normal^{+}$ refers to the half-normal distribution with a mode at zero and almost all probability mass placed on the positive real line. Then, the CAR prior assumes that $\mathbf{u}$ has a multivariate normal distribution with a spatially-smoothed covariance matrix. The spatial smoothing is informed by the cellwise adjacency matrix $W$ and the spatial correlation parameter $\alpha$. Finally, the number of preterm births within the $i$-th spatial cell is assumed to follow a binomial distribution with its logit specified as the spatial effect plus an intercept.

preterm_counts = gdf.groupby('h3_int')['preterm'].sum()
total_counts   = gdf.groupby('h3_int')['preterm'].count()

n = len(preterm_counts)

with pm.Model() as model:
    # Hyperparameters on spatial correlation, random effect size, and model intercept
    alpha       = pm.Uniform('alpha',lower=-0.95, upper=0.95)
    scale       = pm.HalfNormal('scale', sd=2)
    intercept   = pm.Normal('intercept', sd=3)
    
    # Spatially-smoothing prior on logit of preterm birth probability
    spatial_effect = pm.CAR('spatial_effect', mu=0., W=W, alpha=alpha, tau=1., sparse=True, shape=n)
    likelihood  = pm.Binomial('likelihood', p=pm.math.sigmoid(spatial_effect*scale + intercept),
                             observed=preterm_counts.values, n=total_counts.values)
    
    # Applies Markov chain Monte Carlo to draw from the posterior distribution
    trace = pm.sample(target_accept=0.95, tune=2000)

<>  Cannot convert Type TensorType(float64, matrix) (of Variable Usmm{no_inplace}.0) into Type TensorType(float64, row). You can try to manually convert Usmm{no_inplace}.0 into a TensorType(float64, row). Elemwise{sub,no_inplace}(z, Elemwise{mul,no_inplace}(alpha subject to  at 0x7f5e61173c10>, SparseDot(x, y))) -> Usmm{no_inplace}(Elemwise{neg,no_inplace}(alpha), x, y, z)
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [spatial_effect, intercept, scale, alpha]

100.00% [12000/12000 02:01<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 2_000 tune and 1_000 draw iterations (8_000 + 4_000 draws total) took 122 seconds.
The number of effective samples is smaller than 10% for some parameters.

Our posterior summary, as reported below, indicates strong evidence for spatial autocorrelation.

with model:
    print(pm.summary(trace, var_names=['scale', 'alpha', 'intercept']))

            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  \
scale      0.483  0.036   0.411    0.547      0.001    0.001    1307.0   
alpha      0.947  0.003   0.943    0.950      0.000    0.000    3294.0   
intercept -1.983  0.026  -2.033   -1.936      0.001    0.001     363.0   

           ess_tail  r_hat  
scale        1876.0   1.00  
alpha        1618.0   1.00  
intercept     701.0   1.01  

Our trace plots look good - no multimodality and the samples look uncorrelated. While the posterior for $\alpha$ is piling up near the edge of the boundary, this is fine.

with model:
    pm.plot_trace(trace, var_names=['scale', 'alpha', 'intercept'])

We next generate two plots to visualize the resulting estimates.

sigma_cutoff = 2

estimated_intercept = trace['intercept'].mean()

hex_only['preterm_fraction'] = (preterm_counts / total_counts).values
hex_only['estimate']   = trace['spatial_effect'].mean(axis=0)
hex_only['stdevs']     = np.abs(trace['spatial_effect'].mean(axis=0) / trace['spatial_effect'].std(axis=0))
hex_only['is_sig']     = hex_only['stdevs'] > sigma_cutoff
hex_only['delta_prob'] = sigmoid(hex_only['estimate']+estimated_intercept) - sigmoid(estimated_intercept)

First, we create a plot of the data - the observed ratios of preterm births on a cell-by-cell basis.

fig, ax = plt.subplots(1, 1, figsize=(25,25))
hex_only.plot('preterm_fraction', ax=ax)
handle = tract_gdf.plot(facecolor='none', edgecolor='k',zorder=2,ax=ax, alpha=0.2, legend=True);

row_ctr = 0

for i, row in hex_only.iterrows():
    cent = row['geometry'].centroid 
    ax.text(cent.x, cent.y,f'{preterm_counts.iloc[row_ctr]} / {total_counts.iloc[row_ctr]}',
                 ha='center', va='center',fontsize=2, fontweight='bold', color='w')
    row_ctr += 1
    

Next, we compare against our inferred estimates. Cells for which our estimate of the spatial effect is significant at the $2\sigma$ level are highlighted with a star.

ax = hex_only.plot('estimate',figsize=(25,25))
tract_gdf.plot(facecolor='none', edgecolor='k',zorder=2,ax=ax, alpha=0.2, legend=True)

for i, row in hex_only.iterrows():
    cent = row['geometry'].centroid 
    if row['is_sig']:
        sig_str = '*'
    else:
        sig_str = ''
    se = row['delta_prob']
    ax.text(cent.x, cent.y, sig_str, ha='center', va='center',fontsize=16, fontweight='bold')

axes[0].set_title('Simulated preterm birth ratio', fontsize=24)
axes[1].set_title('Inferred change in probability of preterm birth', fontsize=24);

Creating an emulator for an agent-based model

2021-04-05T00:00:00-07:00

Computers are (still) getting faster every year and it is now commonplace to run simulations in seconds that would have required hours’ or days’ worth of compute time in previous generations. That said, we still often come across cases where our computer models are simply too intricate and/or have too many components to run as often and quickly as we would like. In this scenario, we are frequently forced to choose a limited subset of potential scenarios manifest as parameter settings for which we have the resources to run simulations. I’ve written this notebook to show how to use a statistical emulator to help understand how the outputs of a model’s simulations might vary with parameters.

This is going to be similar in many ways to the paper written by Kennedy and O’Hagan (2001) which is frequently cited on the subject, though our approach will be simpler in some regards.To start us off, I’ve modified an example of an agent-based model for disease spread on a grid which was written by Damien Farrell on his personal site. We’re going to write a statistical emulator in PyMC3 and use it to infer likely values for the date of peak infection without running the simulator exhaustively over the entire parameter space.

TL;DR: we run our simulation for a few combinations of parameter settings and then try to estimate a simulation summary statistic for the entire parameter space.

If you’re interested in reproducing this notebook, you can find the abm_lib.py file at this gist.

from abm_lib import SIR

import time
from tqdm import tqdm

%config InlineBackend.figure_format = 'retina'

Simulating with an ABM

We’ll first need to specify the parameters for the SIR model. This model is fairly rudimentary and is parameterized by:

The number of agents in the simulation
The height and width of the spatial grid
The proportion of infected agents at the beginng
Probability of infecting other agents in the same grid cell
Probability of dying from the infection
Mean + standard deviation of time required to overcome the infection and recover

These parameters, as well as the number of timesteps in the simulation, are all specified in the following cells. I am going to let most of the parameters be fixed as single values - only two parameters will be allowed to vary in our simulations.

fixed_params = {
    "N":20000,
    "width":80,
    "height":30,
    "recovery_sd":4,
    "recovery_days":21,
    "p_infected_initial":0.0002
}

For the probability of transmission and death rate, we’ll randomly sample some values from the domains indicated below.

sample_bounds = {
    "ptrans":[0.0001, 0.2],
    "death_rate":[0.001, 0.1],
}

Here, we iteratively sample new values of the parameters and run the simulation. Since each one takes ~40 seconds, it would take too long to run the simulation at every single parameter value in a dense grid of 1000 or more possible settings.

import numpy as np

n_samples_init = 10
input_dicts    = []

for i in range(n_samples_init):
    d = fixed_params.copy()
    for k,v in sample_bounds.items():
        d[k] = np.random.uniform(*v)
    input_dicts += [d]
    

n_steps=100

simulations = [SIR(n_steps, model_kwargs=d) for d in input_dicts]
all_states  = [x[1] for x in simulations]

100%|██████████| 100/100 [00:57<00:00,  1.75it/s]
100%|██████████| 100/100 [00:36<00:00,  2.78it/s]
100%|██████████| 100/100 [00:52<00:00,  1.91it/s]
100%|██████████| 100/100 [00:35<00:00,  2.80it/s]
100%|██████████| 100/100 [00:37<00:00,  2.64it/s]
100%|██████████| 100/100 [00:40<00:00,  2.48it/s]
100%|██████████| 100/100 [00:40<00:00,  2.47it/s]
100%|██████████| 100/100 [00:40<00:00,  2.44it/s]
100%|██████████| 100/100 [00:39<00:00,  2.50it/s]
100%|██████████| 100/100 [00:23<00:00,  4.32it/s]

Next, we combine all the sampled parameter values into a dataframe. We also add a column for our response variable which presents a summary of the results from the ABM simulation. We’ll use the timestep for which the level of infection was highest as the worst_day column in the dataframe.

import pandas as pd

params_df = pd.DataFrame(input_dicts)

# Add column showing the day with the peak infection rate
params_df['worst_day'] = [np.argmax(x[...,1].sum(axis=(1,2))) for x in all_states]

We can also spit out a few animations to visualize how the model dynamics behave. This can take quite awhile, however.

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

generate_animations = False
figure_directory = './figures/sir-states/'

if generate_animations:
    colors=['r','g','b']
    for k, pair in enumerate(simulations):
        model, state = pair
        for i in tqdm(range(n_steps)):
            fig = plt.figure(constrained_layout=True, figsize=(10,2))
            gs = gridspec.GridSpec(ncols=4, nrows=1, figure=fig)
            ax = fig.add_subplot(gs[2:4])
            im = ax.imshow(state[i,:,:,1].T, vmax=20, vmin=0, cmap='jet')
            ax.set_axis_off()
            divider = make_axes_locatable(ax)
            cax = divider.append_axes("right", size="3%", pad=0.05)
            plt.colorbar(im, cax=cax, label='Number infected')

            ax2 = fig.add_subplot(gs[0:2])
            for j in range(3):
                ax2.plot(state[:,:,:,j].sum(axis=(1,2)), color=colors[j])
                ax2.scatter(i, state[i,:,:,j].sum(), color=colors[j])
            ax2.set_ylabel('Number infected')
            ax2.set_xlabel('Timestep')
            plt.savefig(figure_directory+'frame_{1}_{0}.jpg'.format( str(i).zfill(5),j), bbox_inches='tight', dpi=250)
            plt.close()

        ! cd /Users/v7k/Dropbox\ \(ORNL\)/research/abm-inference/figures/sir-states/; convert *.jpg sir_states{k}.gif; rm *.jpg

Clearly, the model parameters make a major difference in the rate of spread of the virus. In the lower case, the spread requires over 100 timesteps to infect most of the agents.

If we make a plot depicting the date of peak infection as a function of ptrans and death_rate, we’ll get something that looks like the picture below. This is a fairly small set of points and the rest of this notebook will focus on interpolating between them in a way which provides quantified uncertainty.

plt.scatter(params_df.death_rate, params_df.ptrans, c=params_df.worst_day)
plt.xlabel('Death rate'), plt.ylabel('Transmission probability'), plt.colorbar(label="Day / timestep");

Building a simplified Gaussian process emulator

Our probabilistic model for interpolating between ABM parameter points is shown below in the next few code cells. We first rescale the parameter points and the response variable to have unit variance. This makes it a little easier to specify reasonable priors for the parameters of our Gaussian process model.

param_scales  = params_df.std()
params_df_std = params_df / param_scales

input_vars = list(sample_bounds.keys())
n_inputs   = len(input_vars)

We assume that the mean function of our Gaussian process is a constant, and we use fairly standard priors for the remaining GP parameters. In particular, we use a Matern52 covariance kernel which allows the correlation between values of our response variable to be a function of the Euclidean distance between them.

import pymc3 as pm

def sample_emulator_model_basic(X, y, sampler_kwargs={'target_accept':0.95}):
    _, n_inputs = X.shape
    
    with pm.Model() as emulator_model:
        intercept    = pm.Normal('intercept', sd=20)
        length_scale = pm.HalfNormal('length_scale', sd=3)
        variance     = pm.InverseGamma('variance', alpha=0.1, beta=0.1)
        
        cov_func     = variance*pm.gp.cov.Matern52(n_inputs, ls=length_scale)
        mean_func    = pm.gp.mean.Constant(intercept)
        
        gp       = pm.gp.Marginal(mean_func=mean_func, cov_func=cov_func)
        noise    = pm.HalfNormal('noise')
        response = gp.marginal_likelihood('response', X, y, noise)
        trace    = pm.sample(**sampler_kwargs)
        
    return trace, emulator_model, gp

Fitting the model runs fairly quickly since we have only a handful of observed data points. If we had 1000 or more instead of 10, we might need to use a different flavor of Gaussian process model to accommodate the larger set of data.

X = params_df_std[input_vars].values
y = params_df_std['worst_day'].values

trace, emulator_model, gp = sample_emulator_model_basic(X, y)

:17: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
  trace    = pm.sample(**sampler_kwargs)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [noise, variance, length_scale, intercept]

100.00% [8000/8000 00:29<00:00 Sampling 4 chains, 0 divergences]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 30 seconds.
The number of effective samples is smaller than 25% for some parameters.

Predicting at new locations is easy too, once we have our fitted model.

Xnew = np.asarray(np.meshgrid(*[np.linspace(*sample_bounds[k]) for k in input_vars]))
Xnew = np.asarray([Xnew[0].ravel(), Xnew[1].ravel()]).T / param_scales[input_vars].values

with emulator_model:
    pred_mean, pred_var = gp.predict(Xnew, given=trace, diag=True)

The final two cells create plots showing the posterior predictive distribution of the GP over all the values in parameter space for which we have no data. As we can see, it smoothly interpolates between data points.

plt.figure(figsize=(6,4))
plt.scatter(Xnew[:,0], Xnew[:,1], c=pred_mean, alpha=0.7)
plt.scatter(X[:,0], X[:,1], c=y, edgecolor='k')
plt.xlabel('Death rate'), plt.ylabel('Transmission probability'),
plt.colorbar(label='Posterior mean'), plt.title('Posterior mean surface');

(,
 Text(0.5, 1.0, 'Posterior mean surface'))

We also see that the variance in the predictions grows as we move farther and farther away from observed data points.

plt.figure(figsize=(6,4))
plt.scatter(Xnew[:,0], Xnew[:,1], c=pred_var, alpha=0.7)
plt.scatter(X[:,0], X[:,1], color='k', edgecolor='w')
plt.xlabel('Death rate'), plt.ylabel('Transmission probability'),
plt.colorbar(label='Posterior variance'), plt.title('Posterior variance surface');

Density estimation for geospatial imagery using autoregressive neural models

2020-03-30T00:00:00-07:00

Bayesian machine learning is all about learning a good representation of very complicated datasets, leveraging cleverly structured models and effective parameter estimation techniques to create a high-dimensional probability distribution approximating the observed data. A key advantage of posing computer vision research under the umbrella of Bayesian inference is that some tasks become really straightforward with the right choice of model.

In this notebook, I show how to use PixelCNN, a deep generative model of structured data, to perform density estimation on geospatial topographic imagery derived from LiDAR maps of the Earth’s surface. I also highlight how easy this is within TensorFlow Probability, a new open-source project extending the capabilities of Tensorflow into probabilistic programming, i.e. the representation of probability distributions with computer programs in a way that treats random variables as first-class citizens.

Note: To reproduce this notebook, you will need the digital elevation map dataset I used to train the model. It’s too large to be hosted on my Github repository. Email me at ckrapu at gmail.com to get everything you need to reproduce this!

Density estimation

Density estimation is a task which has a common sense interpretation: if our understanding of the world is encoded in a probabilistic model, data points with especially low density are rare according to the model while points with high density are common. Suppose that you are walking down the street and you see a bright, neon blue dog that is as large as a firetruck. This is an instance which would probably receive low density under your subjective model of the world because there is exceedingly low probability of it appearing. Conversely, a smaller brown dog would receive a higher density value because it is more likely under the set of beliefs and assumptions you hold about the world.

Most probability distributions are not as rich or flexible as the set of beliefs that we individually hold about the world. Coming up with extremely flexible and rich distributions is an active area of research. As of right now, a leading approach to generating these distributions is via neural autoregressive models which extend standard time series models such as the autoregressive or ARIMA models to have a neural transition operation rather than a linear, Markovian operation. The PixelCNN architecture is a popular neural autoregressive model currently in use.

Many machine learning models of imagery do not allow for easy density estimation. For example, the variational autoencoder provides a mapping from latent variable $\mathbf{z}$ to observed data point $\mathbf{x}$. Unfortunately, calculating $p(\mathbf{x})$ under the model typically requires approximating the integral $p(\mathbf{x}) = \int_z p(\mathbf{x}\vert \mathbf{z})p(\mathbf{z}) d\mathbf{z}$. Autoregressive models, in their most basic form, just don’t have this latent variable representative and instead parameterize the function $p(x_i \vert x_{i-1},…,x_1)$ where $x_i$ denotes the $i$-th pixel in the image. This admits a decomposition of the image’s probability as $p(\mathbf{x})=\prod_i p(x_i\vert x_{i-1},…,x_1)$. This assumes a total ordering of the pixels in an image; we usually assume the raster scan order (though there are creative solutions which can improve on this!).

The rest of this post shows how to use the PixelCNN distribution from Tensorflow Probability and apply density estimation. The PixelCNN distribution was included with the 0.9 update of tensorflow-probability, so you’ll need to upgrade your installation if you were on 0.8 or earlier.

import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pyplot as plt


from utils import flatten_image_batch
print(f'This script uses Tensorflow {tf.__version__}')
print(f'Tensorflow Probability version: {tfp.__version__}')

This script uses Tensorflow 2.1.0
Tensorflow Probability version: 0.9.0

The dataset that I’m using consists of images with dimension $32\times32\times1$ representing topographical maps of the Earth’s surface in the state of North Dakota. Each pixel’s single channel of data represents the average elevation across several square meters. Features like roads, ditches, rivers and valleys can be seen in these images.

data_numpy = np.load('../data/datasets/training/dem_32_filtered.npy').astype('float32')
dem_as_int = (((data_numpy + 1)/2) * 255).astype(np.uint8)

Currently, the available architectures for PixelCNN work best when the output data is quantized. The image data originally had pixel values within the rage $[-1,1]$ which need to be mapped to ${0,1,…,255}$. Let’s take a look below and see what these images look like:

selected = np.random.choice(np.arange(data_numpy.shape[0]),size=36,replace=False)
images = dem_as_int[selected][0:32]
flat = flatten_image_batch(images.squeeze(),4,8)
plt.figure(figsize=(16,8))
plt.imshow(flat), plt.title('Training data'),plt.gca().axis('off');

Many of the images are of gently sloped or rolling surfaces with a few linear features such as ditches or roads. Many of the images have local regions of high variance corresponding to marshy vegetation which scatters the LiDAR pulses used for elevation estimation.

The PixelCNN model is actually a joint distribution over all the pixels of an image. Thus, it was possible for the developers of the tensorflow-probability package to actually include it as one of their distributions! This makes it really easy to work with and the code below shows how little setup is required to train a PixelCNN with TFP. Much of this code was copied from the TFP documentation.

# Specify inputs and training settings
input_shape = (32, 32, 1)
batch_size = 16
epochs = 3
filters = 96

# Create a Tensorflow Dataset object
train_dataset = tf.data.Dataset.from_tensor_slices(dem_as_int)
train_it = train_dataset.batch(batch_size).shuffle(data_numpy.shape[0])

# Create the PixelCNN using TFP
dist = tfp.distributions.PixelCNN(
    image_shape=input_shape,
    num_resnet=1,
    num_hierarchies=2,
    num_filters=filters,
    num_logistic_mix=5,
    dropout_p=.3,
)

# Define the model input and objective function
image_input = tf.keras.layers.Input(shape=input_shape)
log_prob = dist.log_prob(image_input)

# Specify model inputs and loss function
model = tf.keras.Model(inputs=image_input, outputs=log_prob)
model.add_loss(-tf.reduce_mean(log_prob))

Once the model is specified, we just need to compile it and start training. PixelCNN is an example of an autoregressive model and these are notorious for taking a long time to train. Unfortunately, I only have access to a single GPU currently. Normally, this code would display a progress bar and training metrics. I’ve toggled these off to keep the document short and prevent a large number of warnings from being shown.

# Compile and train the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(.001),
    metrics=[])

history = model.fit(train_it, epochs=epochs, verbose=True)

WARNING:tensorflow:Output tf_op_layer_Reshape_3 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to tf_op_layer_Reshape_3.
Train for 4602 steps
Epoch 1/3
3626/4602 [======================>.......] - ETA: 30:48 - loss: 2357.9549

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



4602/4602 [==============================] - 8735s 2s/step - loss: 1989.1206
Epoch 3/3
 612/4602 [==>...........................] - ETA: 2:06:25 - loss: 1972.0003

Since we’ve created an approximation of a probability distribution, we can sample from it to see examples of points that have high density under the PixelCNN model. As a warning, this sampling procedure can take quite awhile.

samples = dist.sample(36)

Let’s visually compare the sampled values with ground truth data points.

from utils import flatten_image_batch

samples_numpy = samples.numpy().squeeze()
flat_samples = flatten_image_batch(samples_numpy,6,6)
plt.figure(figsize=(6,6)),plt.imshow(flat_samples),plt.title('Simulated images')
plt.gca().axis('off')

selected = np.random.choice(np.arange(data_numpy.shape[0]),size=36,replace=False)
flat_ground_truth = flatten_image_batch(data_numpy[selected].squeeze(),6,6)
plt.figure(figsize=(6,6)),plt.imshow(flat_ground_truth),plt.title('True images')
plt.gca().axis('off');

Both the ground truth and sampled images appear to show winding streams and sloping hillsides, though there are more linear features such as roads and ditches in the true data than the synthetic samples.

In the next cell, I calculate the log density of 1000 ground truth images using the PixelCNN as my probability distribution. I also calculate the ranking of each image with regard to its probability.

subset    = dem_as_int[0:1000]
log_probs = dist.log_prob(subset).numpy()
ranking   = np.argsort(log_probs)

sorted_log_prob = log_probs[ranking]

With these rankings, I can show images which have low, medium, or high density under the PixelCNN model

fig,axes = plt.subplots(3,1,figsize=(10,16))
sorted_by_prob = subset[ranking]

subsets = [sorted_by_prob[0:32],sorted_by_prob[484:516],sorted_by_prob[-32:]]
labels = ['Images with low density', 'Images with medium density','Images with high density']
for i, subset in enumerate(subsets):
    flat = flatten_image_batch(subset.squeeze(),4,8)
    axes[i].imshow(flat), axes[i].set_title(labels[i]),axes[i].axis('off')

These images help us understand the representation that the model has learned. In the top panel, we see that the images with the lowest probability are those with a lot of “fuzziness”; these are images with lots of noisy LiDAR reflections due to water and vegetation. Since this is effectively random noise, it isn’t possible to predict perfectly what these values will be.

Images with high density, on the other hand, show smoothly varying topography and very strong spatial autocorrelations. Again, this isn’t terribly surprising because the model has favored data points for which it can easily yield very good pixel-level predictions. If each pixel differs from its neighbor by only a small amount, it is much easier to construct a predictive model with low error.

I hope that this provided a straightforward and minimal example of how to use Tensorflow Probability for a rather sophisticated machine learning task. I’ve been impressed with the functionality incorporated into the TFP codebase and look forward to using it more in the future!

Multivariate sample size for Markov chains

2020-02-18T00:00:00-08:00

Summary: I show how to calculate a multivariate effective sample size after Vats et al. (2019). In applied statistics, Markov chain Monte Carlo (MCMC) is now widely used to fit statistical models. Suppose we have a statistical model $p_\theta$ of some dataset $\mathcal{D}$ which has a parameter $\theta$. The basic idea behind MCMC is to estimate $\theta$ by generating $N$ random variates $\theta_i,\theta_2,…$ from the posterior distribution $p(\theta\vert\mathcal{D})$ which are (hopefully) distributed around $\theta$ in a predictable way. The Bayesian central limit theorem states that under the right conditions, $\theta_i$ is normally distributed about the true parameter value $\theta$ with some sample variance $\sigma^2$. We might then want to use the mean of these samples $\hat{\theta}=\sum_i^N \theta_i$ as an estimator of $\theta$ since this posterior mean is an optimal estimator in the context of Bayes risk and mean square loss.

This task has a few challenges lurking within. The accuracy of our estimate of $\theta$ is going to be low when we have only a few samples, i.e. $N$ is quite small. We can increase our accuracy by taking more samples. Ideally, our samples $\theta_i$ are all going to be independent so that we can make use of the theory of Monte Carlo estimators to assert that the error in our estimation of $\theta$ decreases at a rate of $1/N$. Thus, to get more accuracy, we draw more samples!

Autocorrelated MCMC draws

Unfortunately, MCMC won’t provide uncorrelated values of $\theta_i$ because of its inherently sequential nature. These samples are going to have some autocorrelation $\rho$ and it’s helpful to think of this autocorrelation as reduced the number of samples from a nominal $N$ to an effective number $N$. Here’s a helpful analogy - suppose that you want to determine the average income within a city. You could pursue two sampling strategies; the first leads you to travel to 10 spots randomly selected on the map and then query a single person. The second approach is that you travel to two neighborhoods and query five people each. The latter method has the downside that you may get grossly misrepresentative numbers if you happen to land in a neighborhood where everyone has similar incomes which are not close to the city-wide average. This is an example of spatial autocorrelation leading to poor estimation. The same underlying mechanism is at play with our MCMC estimator having reduced precision.

The literature on sequential data makes frequent use to autocorrelations $\rho_p$ of lag $p$ meant to capture associations between data points with varying amounts of time or distance between them. We can provide a formula for the effective sample size in terms of these autocorrelations (see here for more) via the following formula: $N_{eff} = \frac{N}{\sum_{t=-\infty}^{\infty}\rho_t}$

We truncate the sum in practice since the autocorrelations typically vanish after a large number of lags. Interestingly, the effective sample size also has another form which also implicitly involves autocorrelations. Suppose that we have a chain of samples $\theta_1,…,\theta_N$ and partition this chain into two batches comprising the samples from $1$ to $N/2$ and from $N/2$ to $N$. If the samples are close to independent, then the per-batch means $T_{(k)}$ should be relatively close to each other. If they aren’t, then the batches contain distinct subpopulations of samples. The key insight here is that if the subpopulations are distinct, then they exhibit high within-batch autocorrelation. Thus, we can attempt to back out the autocorrelations by looking at the differences between batch means! For $a$ batch means, each of size $N/a$, this produces the following quantity:

\[\lambda^2=\frac{N}{a(a-1)}\sum_k (T_{(k)}-\hat{\theta})^2\]

Then, if we take the ratio of this quantity with the overall sample variance $\sigma^2$, we get another formula for the effective sample size:

\[N_{eff} = \frac{n\lambda^2}{\sigma^2}\]

Effective sample size for multivariate draws

The aforementioned equations for the effective sample size are fine for draws of univariate quantities. We also want to know how to obtain an analogous number for vector-valued random processes. Researchers often attempt to do so by simply evaluating the scalar $N_{eff}$ for each individual dimension of a chain of vector samples, but this isn’t very satisying. Fortunately, recent work by Dats et al. (2019) has shown that a the straightforward multivariate generalization of the above formula works perfectly well! We simply have to generalize the quantities $\lambda^2$ and $\sigma^2$ to their matrix counterparts: $\Lambda=\frac{N}{a(a-1)}\sum_k ({\vec{T}}_{(k)}-{\hat{\theta}})^T({\vec{T}}_{(k)}-\hat)$ $\Sigma=\frac{1}{N-1}\sum_i ({\vec{\theta_i}}-{\hat{\theta}})^T({\vec{\theta_i}}^{(k)}-\hat)$ With these quantities, we write out the effective number of samples as before just with the matrix generalizations of all quantities involved. Note that here, $p$ represents the dimension of $\theta_i$.

\[N_{eff}^{multi} = N\left(\frac{\vert\Lambda\vert}{\vert\Sigma\vert}\right)^{1/p}\]

Note that you still need to choose how many batches are used - a rule-of-thumb (there are more technical conditions that are worth reading about, though) is to use a batch size of $\sqrt{N}$, so if you have 256 samples then there would be 16 batches of 16 samples each.

In the code below, I’ll show how to calculate this for a toy example.

import numpy as np

n    = 256 # Number of draws
p    = 10   # Dimension of each draw
cov  = np.eye(p) # True covariance matrix
mean = np.zeros(p) # true mean vector

samples = np.random.multivariate_normal(mean,cov,size=n)

n_batches         = int(n**0.5)
samples_per_batch = int(n / n_batches)

# Split up data into batches and take averages over
# individual batches as well as the whole dataset
batches     = samples.reshape(n_batches,samples_per_batch,p)
batch_means = batches.mean(axis=1)
full_mean   = samples.mean(axis=0)

# Calculate the matrix lam as a sum of vector
# outer products
prefactor       = samples_per_batch / (n_batches-1)
batch_residuals = (batch_means - full_mean)
lam = 0 
for i in range(n_batches):
    lam += prefactor * (batch_residuals[i:i+1,:] * batch_residuals[i:i+1,:].T)

sigma =  np.cov(samples.T)
n_eff = n* (np.linalg.det(lam) / np.linalg.det(sigma))**(1/p)

Let’s see what $N_{eff}$ is for this case:

print('There are {0} effective samples'.format(int(n_eff)))

There are 166 effective samples

Since I used 256 truly independent samples in total, it appears that this statistic is somewhat conservative in reporting the effective sample size. I hope this was useful! Again, you can read more about this method at this Biometrika article by Dootika Vats et al.

A review of recent work for posterior image inpainting

2020-02-10T00:00:00-08:00

Image, text and audio are examples of structured multivariate data where we have a total or partial ordering over the entries of our data points and also may exhibit long-range structure extending over many pixels, words or seconds of speech. As a consequence, it is difficult to model these kinds of data using models that allow for only short-range structure such as HMMs or which can make use of only pairwise dependency structures such as the covariance matrix in a multivariate normal distribution. What if we’d like to build Bayesian models with more sophisticated structure?

There exists a tremendous number of applications in which we might like to quantify our uncertainty regarding missing portions of structured data so that we can understand what the missing completion might look like. For example, an X-ray of a fractured wrist may be partially occluded and we would like to know whether the rest of a partially observed crack is large or small, conditional on the parts of the X-ray that we can actually observe. Ideally, we would be presented with an entire distribution of image completions which exhibit completed structure in proportion to their conditional probability given the observed piece. I call this task posterior inpainting which is a mcore stringent definition of a task already explored somewhat in the literature as pluralistic image completion (Zheng et al., 2019) or probabilistic semantic inpainting (Dupont and Suresha, 2019). This problem is mathematically identical to that of super-resolution; if we consider observed pixels on a regular grid and assume that between every pair of observed pixels is a series of $M$ masked pixels then the observed pixels constitute a downsampled version of the entire image with a resolution equal to $1/M$ of the original. Colorization can also be placed within the same formalism except we treat some of the channels as missing data.

Posterior inpainting: problem setup & formalism

We designate $x$ to be a single observation which is itself a vector with elements $x_1, x_2,…x_D$. For image data, this amounts to describing $x$ as an image with $D$ pixels. Where necessary, we may refer to alternative observations in the larger dataset as $\mathcal{D}={ x^{(1)},x^{(2)},…,x^{(N)}}$. Under several popular generative models of structured data such as variational autoencoders (VAEs) and generative adversarial networks (GANs), it is also assumed each data point $x$ has a latent representation $z$ that encodes the information in $x$ in a compressed, low-dimensional format. Not all state-of-the-art generative models share this assumption, however! Autoregressive models such as PixelCNNs or PixelRNNs may work either with or without the usage of latent variables.

When latent variables are used, we often represent the function linking the latent code $z$ to the observed data $x$ as $f_\theta(z)$ with $\theta$ representing the parameters of the generative model $f$. In the case of $L_2$ pixelwise reconstruction error for image data, we could thus represent the likelihood for $x$ given $z$ as $p_\theta(x\vert z)=MVN(f_\theta(z),\sigma^2_\epsilon I)$. This multivariate normal specification is simply saying that the log-likelihood has the form $\propto \frac{1}{\sigma^2_\epsilon} \vert\vert x-f_\theta(z)\vert\vert_2^2$. Note that $f_\theta$ is a deterministic function (though we could relax this if we had a stochastic generative network such as a Bayesian neural net) so each latent $z$ is mapped to exactly one output image. In the event that we have a partially observed $\tilde{x}$ missing some of its pixels, there may be multiple $\tilde{z}^{(1)},\tilde{z}^{(2)},…,\tilde{z}^{(L)}$ which all yield some $f_\theta(\tilde{z})$ which is a good match for $x$. The central problem I’m addressing in this post is the sampling and computation related to obtaining these $\tilde{z}$ such that they are truly representative of the posterior distribution $p(\tilde{z} \vert \tilde{x})$. The next section is a review of papers which attempt to address this problem. Here’s a glossary of some of the terms that will be used frequently:

$x$ indicates a single image with a partially observed counterpart $\tilde{x}$
$z$ is the corresponding latent code for $x$ and $\tilde{z}$ is one of potentially many latent codes consistent with the partially observed $\tilde{x}$
The generative model linking the latent code $z$ is denoted by $f_\theta$. For a variational autoencoder, this is the decoder network while for GANs, this is the generator. We refer to the weights of these networks as $\theta$.
An estimator of $z$ is written as $\hat{z}$
If a variational encoder network is used, $q_\phi(z\vert x)$ is used to refer to the conditional distribution of the latent code $z$ given image $x$. Note that this is conceptually quite different from a posterior $\propto p_\theta (x\vert z)p_\lambda(z)$ because the latter only requires a generative model $f_\theta$ and a prior on $z$ with parameters $\lambda$!

Review

Semantic image inpainting with deep generative models: Yeh et al. 2017

As soon as deep generative models such as VAEs and GANs started producing visually appealing samples when trained on more sophisticated data, researchers started investigating ways to use them to help solve a range of computer vision tasks including image inpainting. Yeh et al. 2017 presented a very straightforward and common-sense way to tackle image inpainting with a DGM. The basic recipe that they suggested for completing an image $\tilde{x}$ is:

Pick a randomly initialized $\hat{z}_0$
Calculate a loss function $L(\hat{z})=\gamma h(\hat{z}) +\vert\vert f_\theta(\hat{z}) - \tilde{x}\vert\vert$ where $\vert\vert \cdot \vert\vert$ denotes the norm or error function of your choice, $h$ is a prior loss and $\gamma$ is a weighting factor. This study used both $L_1$ and $L_2$ norms.
Use a gradient descent optimization scheme to repeatedly apply the update $\hat{z}_{t+1}=\hat{z}_t + \alpha\nabla_\hat{z}L(\hat{z}_{t})$ with some learning rate $\alpha$
Since $\vert\vert f_\theta(\hat{z}) - \tilde{x}\vert\vert$ is likely going to be nonzero, apply image post-processing to blend together any sharp discontinuities between the completion $f_\theta(\hat{z})$ and the original image $\tilde{x}$.

While this procedure is guaranteed to converge to a local minimum, this paper doesn’t provide a recipe to either escape these minima or try to draw a range of samples. That’s beside the point, though, since the main contribution of this paper was simply to show how to get a single inpainted completion at all.

It’s a shame that the author’s didn’t report on any results with injected noise in step #3 above (e.g. using an update rule $\hat{z}_{t+1}=\hat{z}_t + \alpha\nabla_\hat{z}L +\epsilon$ with $\epsilon$ drawn from an isotropic Gaussian) since this very nearly turns it into a Langevin sampler which I suspect would be a highly effective sampling scheme for this problem.

There’s an application paper by Dupont et al. 2018 which is nearly the exact same method used by Yeh et al. save with a minor modification to a mask applied to the loss function. As the authors of this paper noted:

To the best of our knowledge, creating models that can simultaneously (a) generate realistic images, (b) honor constraints, (c) exhibit high sample diversity is an open problem.

Clearly, the limitations of this approach are noted - getting high sample diversity could be challenging!

Pixel Constrained CNNs: Dupont and Suresha 2019

In an apparent follow-up to the challenge noted in the previous section, Dupont and Suresha attempted to address the major shortcomings of the Dupont et al. (2018) approach by embracing a latent variable-free approach that allowed for straightforward sampling from conditional distributions over images. The basic idea in this paper is to augment a PixelCNN’s predictive distribution over pixels to include information which is outside of the usual raster scan ordering imposed on the sequence of pixels.

We can think of the basic PixelCNN with weights $\theta$ as an autoregressive generative model $f_\theta(x_i\vert x_1,…x_{i-1})$. The general problem that Dupont and Suresha tackle is how to augment the conditioning set of variables with pixels that might be out of raster scan order, yet still observed. Let’s denote the set of observed pixels as $X_c$. Then, the generative model becomes $f_\theta(x_i\vert {x_1,…,x_{i-1}}\cup X_c)$. To implement the PixelCNN constrained to match observations, the authors represent the conditional likelihood of the discretized categories of $x_i$ to be log-linear in two different networks: (1) a standard PixelCNN with little modification, and (2) a fairly standard ConvNet which takes in masked pixels and outputs a logit. The second network also needs to have an extra channel for its inputs to indicate which pixels are masked since, for example, a value of zero in the masked data could correspond to either missing data or an observed value of zero. This has an advantage over latent variable-based approaches in that the samples of the completed image $\hat{x}$ will not need Poisson blending to match the observed pixels - there is no generation of the already-observed pixels in this procedure

I have to say that I am really impressed with the quality and diversity of the samples drawn from the conditional distribution over completions - I think this is a front-runner and current SOTA for posterior image completion.

Pluralistic image completion: Zheng et al. 2019

This paper was published in CVPR and either because of the journal’s format or because the study is heavy on technical details I found it to be very difficult to read. Unfortunately, I am unable to tell what the essence of this work is besides the fact that they pair two generative networks together which are trained on differing tasks. There were many details that would have ideally been given a longer treatment in this which likely contributed to it being relatively difficult to follow. This may have been an unavoidable consequence of the journal length format, however, and I do not intend this to be criticism of the authors’ writing.

A Bayesian perspective on the deep image prior: Cheng et al. 2019

The main contribution of this work is showing that sampled images from a deep generative model prior to training (AKA the deep image prior) are actually draws from a Gaussian process. While this is a neat coincidence, it’s not especially surprising given an abundance of work on relating neural networks and Gaussian processes as two leading forms of universal function approximators. However, the part that interested me the most was in their experiemntal section in which they discuss using the deep image prior for reconstruction as well as other image processing tasks and use Langevin dynamics to draw samples of $\theta$ leading to a posterior distribution of $p(x \vert \tilde{x})=\int_\theta p(x\vert \theta,\tilde{x})p(\theta \vert \tilde{x})d\theta$. Note that in this framework, there’s no mention of distributions over $z$ or $\tilde{z}$ - these are treated as fixed inputs!

I do want to take a minute here to critique the authors’ description, though. They aren’t using stochastic gradient Langevin Dynamics (SGLD) in the way that most people understand it. Let’s take a look at the deep image prior’s weight update equation from section 4 of this paper:

\[\eta_t \sim N(0,\epsilon)\\ \theta_{t+1}=\theta_t +\frac{\epsilon}{2}[\underbrace{\nabla_\theta \log p_\theta(\tilde{x}\vert\theta)}_{\substack{\text{Reconstruction error}\\ \text{for non-masked data}}} +\overbrace{\nabla_\theta p(\theta)]}^{\text{DIP}}+\eta_t\\\]

The “Langevin” part comes about because the behavior of $\theta$ can be thought of as a particle subject to random perturbations (i.e. the isotropic noise $\eta_t$) while also under the influence of a force represented as the gradient of a potential. In this context, the gradient is the sum of a gradient due to reconstruction error and due to the deep image prior (DIP). The “stochastic” part of SGLD refers to using minibatch approximations for the gradient estimator which we are forced to do because a full batch would be too computationally expensive. However, here, note that $\tilde{x}$ isn’t a minibatch - it’s the entire dataset! Within the setup laid out by Cheng et al., the deep generative model $f_\theta$ has parameters $\theta$ which do indeed need to be optimized, but they are only optimized using a single partial image $\tilde{x}$ rather than multiple images $x_1,…,x_N$ as is done with standard VAE and GAN training protocols. Thus, they are really just implementing Langevin dynamics. This is the same thing as MALA with the Metropolis accept/reject step removed.

Since only a single image is used to optimize / sample $\theta$, the model is really only able to capture information from two sources: (1) the inductive bias baked into the deep image prior (i.e. strong spatial covariance in the GP interpretation) and (2) image structures present in $\tilde{x}$ which thus influence $p(x\vert \tilde{x})$). This could have serious downsides - suppose we’d like to compute a posterior distribution of completions for an image of a man with blond hair yet his mouth (and mustache) are cropped out. Since the single image does not have any brown hair in it, it is unlikely that the deep image prior can be used to generate image completions consistent with a brown mustache. Yet, it is possible that in the collection of all training images $x_1,…,x_N$ there exist some pictures of men with blond hair and a brown mustache. This sort of outcome is also unlikely to have a nonnegligible probability under the deep image prior. All in all, this paper raises a number of possible directions for UQ with structured data via Langevin MCMC and also obviates the need to do any training at all!

Bayes by Backprop: Blundell et al. (2015)

Strictly speaking, this paper has nothing to do with image completion and it is focused entirely about treating neural network weights as random variables rather than fixed parameters. However, it’s not hard to see how this might give a possible receipe for posterior inpainting. Suppose we have a procedure $\nu (\theta,\tilde{x})$ that takes in a set of neural network weight parameters $\theta$ as well as a partially completed image $\tilde{x}$ and deterministically returns an estimated completion $\hat{x}$. For example, see [Yeh et al. (2017)](arxiv.org ‘ cs Semantic Image Inpainting with Deep Generative Models) for such a recipe. Then, if we could sample from a posterior distribution $p(\theta\vert \mathcal{D})$ then we could perform ancestral sampling to approximate $p(\hat{x}\vert\mathcal{D})=\int_\theta p(\hat{x}\vert\theta)p(\theta\vert\mathcal{D})d\theta$. Since this paper is about providing $p(\theta\vert\mathcal{D})$, I judge it as highly relevant to the task at hand. The paper works with a similar conceptual framework as the Autoencoding Variational Bayes paper but targets the neural network weights $\theta$ instead of the latent variables $z$ for a variational approximation. I’m going to spend much more time analyzing this paper because I think it provides a really nice template for thinking about Bayesian deep learning.

The stated objective in this work is to pose neural network training as solving the following optimization problem for the variational free energy $\mathcal{F}$ in terms of the variational parameters $\phi$ given dataset $\mathcal{D}$, likelihood $p(\mathcal{D}\vert\theta)$ and weight $p(\theta)$. $\begin{align} \phi^* &=\underset{\phi}{\text{arg min }}\mathcal{F}(\mathcal{D},\theta,\phi)\\ &=\underset{\phi}{\text{arg min }}KL(q(\theta\vert\phi)\vert\vert p(\theta\vert\mathcal{D}))\\ &= \underset{\phi}{\text{arg min }}KL(q(\theta\vert\phi)\vert\vert p(\theta)) - E_{q(\theta\vert\phi)}\left[\log p(\mathcal{D}\vert \theta)\right] \end{align}$ If this notation is opaque or these equations are especially hard to follow, I recommend looking at my earlier post which repeats these calculations ad nauseum. In line with their derivation, we next define the variational free energy as $\mathcal{F}(\theta,\phi)=KL(q(\theta\vert\phi)\vert\vert p(\theta)) - E_{q(\theta\vert\phi)}\left[\log p(\mathcal{D}\vert \theta)\right]$ and then attempt to find a Monte Carlo estimator of its gradient $\nabla_\phi \mathcal{F}(\theta,\phi)$. Unfortunately, this has the form $\nabla_\phi E_{q(\theta\vert\phi)}\left[…\right]$ and we can’t push the gradient operator inside the expectation since the density that we are integrating against itself depends on $\phi$. To solve this, we make use of the reparameterization trick and a deterministic function $t (\epsilon,\phi)$to rewrite $\theta=t(\epsilon,\phi)$. This yields:

\[\begin{align} \nabla_\phi \mathcal{F}(\theta,\phi)&=\nabla_\phi E_{q(\theta\vert\phi)}\left[\log\frac{q(\theta\vert\phi)}{p(\theta)}-\log p(\mathcal{D}\vert \theta)\right]\\ &=\nabla_\phi E_{q(\theta\vert\phi)}\left[\log q(\theta\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)\right]\\ &=\nabla_\phi E_{p(\epsilon)}\left[\log q(t\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)\right]\\ \end{align}\]

At this point we simplify the notation by designating $f(\theta,\phi) = \log q(\theta\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)$, leading to the following: $\begin{align} \nabla_\phi \mathcal{F}(\theta,\phi)&=\nabla_\phi E_{p(\epsilon)}\left[f(t,\phi)\right]\\ &= E_{p(\epsilon)}\left[\nabla_\phi f(t,\phi)\right]\\ \end{align}$ To avoid having to specify in terms of products, I’ll focus on the elementwise derivative as done in the paper: $\begin{align} \frac{\partial}{\partial\phi}\mathcal{F}(\theta,\phi) &= E_{p(\epsilon)}\left[\frac{\partial}{\partial\phi} f(t,\phi)\right]\\ &= E_{p(\epsilon)}\left[\frac{\partial f}{\partial\theta}\frac{\partial \theta}{\partial\phi} + \frac{\partial f}{\partial \phi}\right]\\ \end{align}$ It turns out that this step is really all you need to be able to implement a BBB estimation scheme in a modern deep learning framework, though. See here for a great example from the Gluon developers! We will need to make some more assumptions about the specific parametric form of $t(\phi,\epsilon)$ to make the above gradient more explicit. While we’re free to consider any transformation $t: \epsilon\rightarrow\theta$, one of the simplest is a scale-location transformation where the $i$-th neural network weight is written as $\theta_i = \mu_i + \epsilon_i \cdot \sigma_i$ with $\mu_i$ giving the variational posterior mean of $\theta_i$ and $\sigma_i$ providing the variational posterior standard deviation. The standard deviation is always positive and we’d prefer to perform unconstrained optimization when possible, so Blundell et al. reparameterize $\sigma_i=\log (1+e^{\rho})$ instead.

Since the vector $\phi$ is supposed to include all of the variational parameters and each element of $\theta$ has a variational mean and standard deviation, the vector $\phi$ is going to have double the dimension of $\theta$. Let’s split apart $\phi$ and examine some of the gradients more closely, focusing on the variational mean $\mu$:

\[\begin{align} \frac{\partial\mathcal{F}} {\partial\mu} &=E_{p(\epsilon)}\left[\frac{\partial f}{\partial\theta}\cdot\frac{\partial\theta}{\partial\mu}+\frac{\partial{f}}{\partial\mu}\right]\\ &=E_{p(\epsilon)}\left[\frac{\partial}{\partial\theta}\left[\log q(\theta\vert\phi)-\log p(\theta)-\log p(\mathcal{D}\vert\theta)\right]\cdot\frac{\partial\theta}{\partial\mu}+\frac{\partial{f}}{\partial\mu}\right] \end{align}\]

Addressing each of these terms within $\partial f/\partial \mu$ individually will be more enlightening. The form of the conditional variational density $q(\theta\vert\phi)$ depends on our model assumptions; the default version given in Blundell et al. assumes a multivariate normal with diagonal covariance. Thus, we have $\log q(\theta\vert\phi)\propto \frac{1}{2}(\theta-\mu)^T\Sigma_q^{-1}(\theta-\mu)$. Here, the covariance matrix $\Sigma_q$ has the variances $\sigma_i^2$ on its diagonal, so its inverse will also be diagonal with diagonal entries of $1/\sigma_i^2$. We can see that this is going to push the values of $\mu$ in line with the values of $\theta$.

Next, the prior $p(\theta)$ is going to play a regularization role. We have a couple of options here; using an isotropic Gaussian with sufficiently small variance will induce $L_2$ regularization on the weights with equal strength everywhere. The study authors point out that a prior which allows for some large coefficients but mostly small coefficients can be useful and thereby include a two-component mixture of Gaussians with the idea that one mixture has a small variance (preferring lots of coefficients with small values) and the other mixture has a large variance to allow for large coefficient values to occasionally pop up. The mixture weights would need to be estimated, however, and Blundell et al. simply leave that up to your favorite choice of hyperparameter tuning.

Finally, the log-likelihood $p(\mathcal{D}\vert\theta)$ is straightforward to understand - it’s the error resulting from the mismatch between predictions $\hat{x}$ and true values $x$. In the case of a Gaussian likelihood, we get square error loss and for a Laplace likelihood, we recover absolute error loss.

For all of the terms in $\partial f/\partial \theta$, the gradients come down to gradients of quadratic forms of some type and under the right prior assumptions can even be done analytically.

Back in equation (12), the next term $\partial \theta/\partial\mu$ is just $1$ since $\theta \propto \mu$ in our function $\theta=t(\phi,\epsilon)=t(\mu,\rho,\epsilon)$. Then, the final term $\frac{\partial f}{\partial \mu}$ is much like the first term, except the parts that don’t depend on $\mu$ will drop out. Again, I want to stress that none of these calculations need to be done by hand - autodif software like Torch or Tensorflow will do these automatically. Once $\partial{\mathcal{F}}/\partial \mu$ is calculated with the above steps, it’s easy to apply stochastic gradient descent with a Monte Carlo estimator of $\partial \mathcal{F}/\partial \mu$ in order to do training. A similar recipe can be followed for the scale parameter $\rho$.

Treating the network weights $\theta$ as the random variable is orthogonal in some sense to the methods which treat the latent variable $z$ as the random quantity to be optimized over. Including both sources of uncertainty could be a promising line of future research.

An ELBO timeline

2020-02-07T00:00:00-08:00

In Bayesian machine learning, deep generative models are providing exciting ways to extend our understanding of optimization and flexible parametric forms to more conventional statistical problems while simultaneously lending insight from probabilistic modeling to AI / ML. This is an exciting time to be studying the topic as it is blending results from probability theory, statistical physics, deep learning and information theory in sometimes surprising ways. This post is a short summary of some of the major work on the subject and serves as an annotated bibliography on the most important developments in the subject. It also uses common notation to help smooth over some of the differences in detail between papers.

An initial good resource for a high-level overview of the problem is given in Variational Inference: A Review for Statisticians by David Blei et al. (2017). This paper gives a Bayesian statistical perspective on variational methods which are designed around manipulating the expected lower bound on the evidence (ELBO). From a statistical physics point of view, this can also be viewed as an upper bound on a system’s energy function.

The setup: approximate posterior inference

Blei et al. 2017 starts with a very broad general statement about Bayesian modeling: if we have some known observational data $\boldsymbol{x}=x_1,…,x_N$ and a model with parameters or latent variables $\boldsymbol{z}=z_1,…,z_N$, then our model is specified by a joint probability distribution $p(\boldsymbol{x},\boldsymbol{z})$. The $\boldsymbol{z}$ values could be latent (also described as local) parameters such as the factor scores in factor analysis or they could be global parameters like the coefficients in linear regression. After setting up this model, one of the main tasks is usually to conduct inference and obtain a posterior distribution $p(\boldsymbol{z}\vert \boldsymbol{x})=p(\boldsymbol{x},\boldsymbol{z})p(\boldsymbol{z})/p(\boldsymbol{x})$ where the intractable integral $p(\boldsymbol{x})=\int_z p(\boldsymbol{x}\vert \boldsymbol{z})p(\boldsymbol{z})dz$ prevents straightforward computation of $p(\boldsymbol{z}\vert \boldsymbol{x})$. The application of Markov chain Monte Carlo is intended to compute approximate estimates of exactly $p(\boldsymbol{z}\vert \boldsymbol{x})$ while the variational Bayes strategy is to get exact solutions to an approximate distribution $q(\boldsymbol{z}\vert \boldsymbol{x})$ where $q$ is chosen from some class of distributions that have nice properties for optimization. We typically choose $q$ from a family of parametric densities indexed by parameters $\boldsymbol{\phi}$. Then, the variational objective is to solve the following problem in terms of a loss function $f$, true posterior $p$ and approximate posterior $q$.

\[q_\phi^*(\boldsymbol{z})=\underset{q_{\phi}}{\mathrm{argmin}} \ f (q_\boldsymbol{\phi}(\boldsymbol{z}),p(\boldsymbol{z}\vert \boldsymbol{x}))\]

We are free to choose any $f$ that we want, keeping in mind that our choice of $f$ should intuitively encapsulate notions of closeness or fidelity between two distributions $p,q$. Many different methods can be categorized by their choice of $f$ within this framework. For example, using the asymmetric Kullback-Leibler divergence defined as $KL(p\vert \vert q)=E_p [\log p(z)/q(z)]$ yields either variational Bayes or expectation propagation depending upon whether $KL(p\vert\vert q)$ or $KL(q\vert\vert p)$ is used

We can also frame variational inference in the context of the evidence $p(\boldsymbol{x})$ also referred to as the marginal likelihood. Without loss of generality, we can also assume that the true generative model $p_{\theta}$ of the data has some parameters $\theta$. In this post, I’ll be extremely detailed with the derivations so that they are easy to follow.

\[\begin{align} \underbrace{\log p_{\boldsymbol{\theta}}(\boldsymbol{x})}_{\text{Log evidence}}&= \log \int_z p_{\boldsymbol{\theta}}(\boldsymbol{z,x}) dz\\ &= \log \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})} dz\\ &= \log \int_z q_{\boldsymbol{\phi}}(\boldsymbol{z}\vert \boldsymbol{x})\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})p_{\boldsymbol{\theta}}(\boldsymbol{x})}{q(\boldsymbol{z}\vert \boldsymbol{x})} dz\\ &= \log \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})p_{\boldsymbol{\theta}}(\boldsymbol{x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})} dz\\ &\ge \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})p_{\boldsymbol{\theta}}(\boldsymbol{x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\right] dz = ELBO(q_{\boldsymbol{\phi}})\\ \end{align}\]

We used Bayes’ Rule in (3) and Jensen’s inequality after (5), leading us to the form of the expected lower bound of the model evidence shown in equation 6. With a few more manipulations we get:

\[\begin{align} ELBO(q_\boldsymbol{\phi}) &= \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\right] dz\\ &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log {p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}-\log q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\right] \\ &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log {p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})}-\log q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})-\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \\ &= -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x}))+ E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \\ &= -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x}))+ \log p_{\boldsymbol{\theta}}(\boldsymbol{x}) \end{align}\]

The form shown in (11) is informative - remember that the marginal likelihood $p(\boldsymbol{x})$ is not a function of $\boldsymbol{z}$. If we think of the log marginal likelihood as fixed, then $\log p(\boldsymbol{x})= KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x})) + ELBO(q_\boldsymbol{\phi})$ so that increasing the KL-divergence must decrease the ELBO and vice versa. For the rest of this post, I’ll be reviewing papers that either dissect the ELBO into different representational forms or tweak prior assumptions to squeeze more performance out of models trained with variational Bayes.

Auto-Encoding Variational Bayes: Kingma and Welling (2013)

This paper ignited an enormous amount of interest from the machine learning community in variational methods because it recast approximate inference in a form that has a straightforward interpretation in the context of auto-encoder models. I won’t go into depth about how those work and will instead focus on the main contribution. We do need to know that within the conceptual framework of Kingma & Welling, we have a latent variable model that maps hidden or latent codes $z$ to observed data points $\boldsymbol{x}$ via a generator model $p_{\boldsymbol{\theta}}(\boldsymbol{x}\vert \boldsymbol{z})$. They make the assumption that this generator is a neural network parameterized by weights contained within $\boldsymbol{\theta}$.

Starting with equation (8) from the previous section, the authors made the following observation:

\[\begin{align}ELBO(q_\boldsymbol{\phi}) &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})]\\ &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z}) + \log p_{\boldsymbol{\theta}}(\boldsymbol{z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})]\\ &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] + E_{q(\boldsymbol{z}\vert \boldsymbol{x})}\left[ \log p_{\boldsymbol{\theta}}(\boldsymbol)-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]\\ &= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})]}_{\text{Reconstruction}} -\underbrace{KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z}))}_{\text{Shrinkage}}\\ \end{align}\]

(16) presents a common interpretation of the ELBO in terms of the variational parameters $\boldsymbol{\phi}$ as a tradeoff between maximizing the model likelihood $E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})\right]$ and keeping the learned posterior over $\boldsymbol{z}$ close to a prior distribution $p_{\boldsymbol{\theta}}$. An arbitrary choice of prior which seems to have caught on is to assume that $\boldsymbol{z} \sim N(\boldsymbol{0},\sigma^2 I)$ where $I$ denotes the identity matrix. From a non-Bayesian machine learning perspective, the first term is analogous to reconstruction or denoising error from a normal auto-encoder while the second term is a Bayesian innovation intended to help keep the learned latent space (governed by $\boldsymbol{\phi}$) relatively close to a spherical Gaussian.

ELBO surgery: Hoffman and Johnson (2016)

This is one of my favorite papers because it’s a lucid and compact explanation of an interesting phenomenon in deep generative models. It also has two more rearrangements of the ELBO. The first one is nearly the same expression as (12):

\[\begin{align} ELBO(q_\boldsymbol{\phi}) &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]\\ &= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})\right]}_{\text{Negative expected energy}} -\underbrace{E_{q_\boldsymbol{\phi}}\left[\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]}_{\text{Entropy}}\\ \end{align}\]

The term energy here refers to the convention that in statistical mechanics, the Boltzmann distribution is defined by an exponential dependence between energy and probability, i.e. $p(x)\propto e^{-U/kT}$ where $U$ is an energy function and $kT$ is a normalized temperature. This rewriting of the ELBO highlights how it balances likelihood maximization (equivalent to energy minimization) with keeping most of its probability mass from spreading out and thereby boosting the entropy term.

The second form of the ELBO is the key result of this paper and provides a more detailed breakdown than the previous forms. The setup is a little more involved and requires recasting the ELBO as a function dependent upon not just the variational parameters $\phi$ or the generative model parameters $\theta$ but also the identity of the $n$-th data point being analyzed. The main point of this section is that we should think about information being shared between the identity of the data point (as captured by its index $n$) and the latent code $z_n$.

\[\begin{align} ELBO(q_\boldsymbol{\phi}) &= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})]}_{\text{Reconstruction}} -\underbrace{KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z}))}_{\text{Shrinkage}}\\ &=E_{q_\phi(z\vert x)}\left[\log p_\theta(\boldsymbol{x}\vert\boldsymbol{z})-\log \frac{q_\phi({\boldsymbol{z} \vert\boldsymbol{x})}}{p_\theta(\boldsymbol{z})}\right]\\ &=E_{q_\phi(z\vert x)}\left[ \log \left( \prod_n p_\theta(x_n\vert z_n)\right)-\log \left(\prod_n \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]\\ &=E_{q_\phi(z\vert x)}\left[\sum_n \log p_\theta(x_n\vert z_n)-\sum_n \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right]\\ &=E_{q_\phi(z\vert x)}\left[\sum_n \left(\log p_\theta(x_n\vert z_n)- \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]\\ &=\int_{z_1}\ldots \int_{z_N}\prod_n q_\phi(z_n\vert x_n) \left[\sum_n \left(\log p_\theta(x_n\vert z_n)- \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]dz_1 \ldots dz_n \end{align}\]

The latent variables $z_n$ are specific to each data point so $z_i$ is independent of $z_j$ given $x_i$. This allows us to rewrite the above integral as a sum.

\[\begin{align} ELBO(q_\phi)&=\sum_n \int_{z_n} q_\phi(z_n\vert x_n)\left(\log p_\theta( x_n\vert z_n)- \log \frac{q_\phi({ z_n \vert x_n)}}{p_\theta( z_n)}d z_n\right)\\ &=\sum_n E_{q_\phi(z_n\vert x_n)}\left[\log p_\theta( x_n\vert z_n)\right]- KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))\\ \end{align}\]

This expression can be seen in several other works as well and usually includes a prefactor of $1/N$, implying that the above equation is the term-by-term average reconstruction error minus a per-data point KL divergence. I am not sure why this is done and it doesn’t appear to be consistent with a physical point of view - the ELBO can be viewed as an upper bound on a total system-wide energy and a system’s total energy is a sum of energy functions across particles rather than an across-particle average. In practice, this factor of $1/N$ is unimportant because $N$ is known ahead of time and the optimization strategies resulting from the ELBO reparameterization are unaffected by it. However, to make these derivations consistent with the literature, I will include it here too.

Integrating results across different work in a common notation can be challenging and here we must be very specific in noting that $z_n$ refers to the latent code for a single data point, $\boldsymbol{z}$ refers to the latent codes for all data points and $z$ refers to a latent code which is not indexed by $n$ but which is conceptually linked to a single data point. This is an important distinction moving forward. We continue by defining priors over $n$ which are the probabilities that a given data point is sampled and fed into the ELBO expression. A natural choice is to simply choose them at random so that $p_{sample} = 1/N$ where $N$ is the number of observations in our dataset. We’ll make the same assumption for the accompanying prior under $q$ so that $p(n)=q(n)=1/N$. We also want to express $q_\phi (z_n\vert x_n)$ in terms of the random variable $n$ and not $x_n$ so we have $q_\phi (z\vert n)\triangleq q_\phi(z\vert x_n)$. This is purely notational - the random variable $n$ should be thought of as synonymous with $x_n$.

\[ELBO(q_\phi)=\frac{1}{N}\left(\sum_n E_{q_\phi(z_n\vert x_n)}\left[\log p_\theta( x_n\vert z_n)\right]- KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))\right)\\ \begin{align} \frac{1}{N}\sum_n KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))&=\frac{1}{N}\sum_n KL(q_\phi(z\vert n)\vert\vert p_\theta(z))\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)} \log \frac{q_\phi(z\vert n)}{p_\theta(z)}\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)} \log \frac{q_\phi(n\vert z)q_\phi(z)}{p_\theta(z)q_\phi(n)}\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)} \log \frac{q_\phi(n\vert z)q_\phi(z)}{p_\theta(z)q_\phi(n)}\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n\vert z)}{q_\phi(n)}\right]\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n\vert z)q_\phi(z)}{q_\phi(n)q_\phi(z)}\right]\\ &=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\ &=KL(q_\phi(z)\vert\vert p_\theta(z)) + \frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\ &=KL(q_\phi(z)\vert\vert p_\theta(z)) +\sum_n E_{q_\phi(n,z)}\left[ \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\ &=KL(q_\phi(z)\vert\vert p_\theta(z)) + \mathbb{I}_{q_\phi}(n,z)\\ \end{align}\]

This result rearranges the sum of per-data point KL divergences into an averaged KL divergence and the mutual information $\mathbb{I}$ between the random variables $n$ and $z$. Conceptually, this is a very nice result - it represents the original regularizing term as a divergence between averaged (i.e. non data point specific) prior distributions and information shared acros $q_\phi$ between $n$ and $z$. We can start to think about $q_\phi$ as a communication channel which may perfectly communicate the information in the index $n$ to the latent code $z$, i.e. perfect reconstruction, or it may fail to communicate substantial information and thereby the generative model learns to ignore the latent code $z$! We can use these expressions to rewrite the ELBO in a form identical to an equation from the Hoffman and Johnson paper:

\[ELBO(q) =\underbrace{\left[\frac{1}{N}\sum_n E_{q_\phi(z_n\vert x_n})\left[\log p_\theta(x_n\vert z_n)\right] \right]}_{\text{Expected reconstruction error}} - \underbrace{\mathbb{I}_{q_\phi(n,z)}(n,z)}_\text{Decoded information} - \underbrace{KL(q_\phi(z)\vert\vert p_\theta(z))}_{\text{Marginal regularizer}}\]

In the above expression, the first term on the right hand side represents how well the generative model can reconstruct the data points $x_n$ using the latent codes. If the values of $\theta$ are chosen poorly and the generative model is insufficient, this term will be relatively low. The next term is the mutual information from before and tells us how well the encoder network $q_\phi$ is transmitting information from the identity of the data point $x_n$ into the latent variable $z_n$. Finally, the last term pushes the average distribution of latent codes $z_n$ to be close to the prior $p_\theta(z)$. For many applications, $p_\theta$ is chosen somewhat arbitrarily to be a diagonal or isotropic Gaussian and this form suggests that we may want to choose more carefully in order to obtain more desired behavior from variational methods.

$\beta$-VAE: Higgins et al. (2017)

As researchers began to catch on that the shrinkage term $KL(q_\phi(z\vert x)\vert\vert p_\theta(z))$ may play an important role in favoring certain classes of representations, they developed new modifications to the ELBO to help push the variational objective in different directions.

A conceptually straightforward way to do this is to simply up- or down-weight the shrinkage term in conjunction with the right prior. The intuition behind $\beta$-VAE appears to be that in the $\beta$-modified expression for the ELBO:$ELBO(q_\boldsymbol{\phi}) = E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -\ \beta \cdot KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z}))$, the second term can be tuned to push $q_\phi(z\vert x)$ closer to a desired prior structure. The default prior that had been chosen in many studies up to this point is a simple isotropic Gaussian which promotes cross-factor independence and will naturally push towards a disentangled representation in which the different dimensions of $z$ are uncorrelated in the approximate posterior $q_\phi(z\vert x)$. It’s straightforward to show that for two approximately isotropic Gaussian distributions $q_\phi, p_\theta$, their KL divergence is proportional to

\[\log\frac {\vert\Sigma_{q_\phi}\vert}{\vert\Sigma_{p_\theta}\vert}\propto \log \sigma^2_{q_\phi}-\log \sigma^2_{p_\theta}\]

where $\Sigma_{q_{\phi}}$ and $\Sigma_{p_\theta}$ are the diagonal covariance matrices of $q_\phi$ and $p_\theta$ respectively. As a consequence, we can also view an adjustment to $\beta$ as equivalent to tweaking our latent space prior variance. In statistical physics, $\beta$ is a function of the system temperature so it is unclear to me why the notion of $\beta$ was introduced despite several other identical conceptual frameworks existing which were appropriate for describing this improvement. Perhaps this was indeed the motivation but this fact was omitted from the text.

Regardless, this led to marked improvements on learning disentangled representations and is such an easy computational tweak that it can be implemented into the vast majority of VI workflows.

Empirical Bayes for latent variable priors (VampPrior): Tomczak and Welling (2018)

The $\beta$-VAE paper suggested that tweaking the variational objective’s split across reconstruction error and shrinkage could produce better models and also more disentangled representations. Unfortunately, the discussion of choosing $\beta$ wasn’t linked to a specific choice of prior. In my opinion, the most interesting observation from the unbearably-cheesily-named VampPrior paper was that all latent variable priors can conceptually be ordered by the degree to which they depend on observed data. I’ll reproduce some of their arguments here after introducing some extra notation: $p_\lambda(z)$ is a prior over the latent state $z$ and in past paragraphs I lazily referred to this as $p_\theta(z)$ with the understanding that the vector $\boldsymbol{\theta}$ included not just the weights of the decoder network but also the hyperparameters of the latent space prior. I will be more explicit moving forward.

The paper picks right off at a familiar point: $\begin{align} ELBO(q_\boldsymbol{\phi}) &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\lambda}}(\boldsymbol{z}))\\ &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -E_{q_\phi}\left[\log q_\phi(\boldsymbol{z}\vert \boldsymbol{x})-p_\lambda(\boldsymbol{z})) \right]\\ \end{align}$ If the goal is to maximize the ELBO, then we could simply drive the second term on the RHS of (39) to zero by setting our prior equal to the learned posterior $q_\phi(\boldsymbol{z}\vert \boldsymbol{x})$ and thereby commit a cardinal sin by snooping on the data. However, this would remove any shrinkage effects and not let the prior do its job by restricting the capacity of the model in an effective way. The other extreme is to choose $p_\lambda$ to be very restrictive and not make use of any of the observed data points $x_n$.

The key insight from the Tomczak and Welling paper is that there is an empirical Bayes (EB) middle ground between these two extremes. We can implement this EB prior by expressing $p_\lambda$ as a weakened version of the variational posterior $q_\phi$ via the usage of $K$ pseudo-inputs $ $u_1,…,u_K$ in a variational mixture of posteriors (VAMP):

\[p^{VAMP}(z)=\frac{1}{K}\sum_k q_\phi(\boldsymbol{z}\vert u_k)\]

In the limit where $K\approx N$, the prior and posterior are identical so there is little regularization, but when a good value of $K$ is selected, $p^{VAMP}$ is clearly going to be highly multimodal as a mixture distribution, but it is also going to have less capacity than the full posterior. However, this opens another question regarding how the $u_k$ are selected and generated. In true empirical Bayes fashion, these are treated as additional model parameters amenable to optimization via backprop. The implementation that Tomczak and Welling actually go with for their experiments uses a two layer hierarchical VAMP prior; I would like to comment on this but there was virtually no motivation or discussion of why this multilayer prior would help.