Posts

2022

Nonparametric changepoint model in PyMC

Identifying structural breaks in data is an important problem for anyone who frequently works with time series data. Some examples of how people have dealt with the problem of a single changepoint can be found here and here.

Easy local average with Google Earth Engine’s Python API

Across many projects, I've needed to analyze remote sensing data and compute, for a set of points or polygons, a statistical summary of the remotely sensed layer in a neighborhood of those features. Sometimes that summary needs to be calculated exactly within the extent of the spatial object (like the boundaries of a field or a county), but other times simply knowing the average in a circular or square buffer around the feature is good enough.
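As a taste of the buffer-and-reduce pattern the post covers, here is a minimal sketch using the earthengine-api package; the point location, image collection, band, date range and 500 m buffer radius are placeholder assumptions, not necessarily the post's choices.

```python
import ee

ee.Initialize()

# Hypothetical feature of interest: a single point in the Dakotas.
point = ee.Geometry.Point([-98.5, 45.0])

# Seasonal median composite of Landsat 8 surface reflectance (NIR band).
image = (
    ee.ImageCollection('LANDSAT/LC08/C02/T1_L2')
    .filterDate('2021-06-01', '2021-09-01')
    .filterBounds(point)
    .median()
    .select('SR_B5')
)

# Mean of the band within a 500 m circular buffer around the point.
stats = image.reduceRegion(
    reducer=ee.Reducer.mean(),
    geometry=point.buffer(500),
    scale=30,
)
print(stats.getInfo())
```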

2021

Fast Kronecker matrix-vector product with einsum

In numerical linear algebra, a common problem that arises in the analysis of large datasets is the product of a dense but structured $N \times N$ matrix $\mathbf{A} = \bigotimes_{j=1}^J \mathbf{A}_j$ with a similarly dense vector $\mathbf{y}$. Here, we assume that $\mathbf{A}$ can be written as the tensor or Kronecker product of $J$ smaller matrices $\mathbf{A}_j$, each of dimension $N_j \times N_j$, so that $N = \prod_{j=1}^J N_j$.
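The key identity, for row-major flattening, is that \((\mathbf{A}_1 \otimes \mathbf{A}_2)\,\mathrm{vec}(\mathbf{Y}) = \mathrm{vec}(\mathbf{A}_1 \mathbf{Y} \mathbf{A}_2^T)\), which einsum can express directly. A minimal sketch of the trick in NumPy (not necessarily the post's exact code):

```python
import numpy as np

def kron_matvec(As, y):
    """Evaluate (A_1 ⊗ ... ⊗ A_J) @ y without materializing the N x N matrix.

    Assumes each A_j is square and y uses row-major (C-order) flattening.
    """
    x = y.reshape([A.shape[0] for A in As])
    for A in As:
        # Contract the leading axis with A_j, then rotate that axis to the
        # back so each factor acts exactly once over the J passes.
        x = np.moveaxis(np.tensordot(A, x, axes=(1, 0)), 0, -1)
    return x.ravel()

# Check against the dense Kronecker product for J = 2; the einsum line is
# the two-factor identity (A1 ⊗ A2) vec(Y) = vec(A1 Y A2^T) written directly.
rng = np.random.default_rng(0)
A1, A2 = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
y = rng.normal(size=12)
dense = np.kron(A1, A2) @ y
assert np.allclose(dense, kron_matvec([A1, A2], y))
assert np.allclose(dense, np.einsum('ij,kl,jl->ik', A1, A2, y.reshape(3, 4)).ravel())
```

The loop version never forms the $N \times N$ matrix, so memory and time drop from $O(N^2)$ to roughly $O(N \sum_j N_j)$.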

Balanced spatial partitioning for point data in 20 lines

When working with geospatial data, sometimes a dataset is simply too large in its original form to be worked with effectively. I often work with hierarchical statistical models that perform well when a larger dataset is partitioned into smaller subsets. To preprocess the data, I frequently need to split a large spatial domain into smaller pieces such that the number of objects or data points in each piece is roughly equal.
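One simple way to get such a balanced partition, sketched below under the assumption of 2-D point data, is to recursively split at the median of alternating coordinates, producing \(2^d\) cells with near-equal counts; the post's own approach may differ in detail.

```python
import numpy as np

def balanced_partition(points, depth):
    """Recursively split an (n, 2) point array at the median of alternating
    coordinates, yielding 2**depth pieces with roughly equal point counts."""
    if depth == 0:
        return [points]
    axis = depth % 2                      # alternate x / y splits by level
    median = np.median(points[:, axis])
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    return balanced_partition(left, depth - 1) + balanced_partition(right, depth - 1)

# Example: 10,000 random points into 16 roughly equal pieces.
pts = np.random.default_rng(0).uniform(size=(10_000, 2))
parts = balanced_partition(pts, 4)
print([len(p) for p in parts])
```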

Modeling spatial structure in binary data with an H3 hexagonal coordinate system

We often model geostatistical (i.e., point-referenced) data in order to determine whether there are spatial patterns of autocorrelation. The object of interest is frequently an underlying spatial function giving rise to patterns of spatially correlated data. When we work with discrete observational data, a problem arises: we want to study smoothly varying response surfaces over space, but the data themselves are not continuous, so we cannot specify a likelihood that is continuous in both space and response. Consequently, we often reparameterize the model in terms of a latent smooth spatial surface and a link function mapping that surface to the parameters of a likelihood appropriate for discrete data.
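In PyMC3-style code, that reparameterization might look like the following sketch, with a Gaussian process prior standing in for the latent surface and a logit link mapping it to Bernoulli probabilities; the random coordinates and responses here are placeholders rather than real H3 cell data.

```python
import numpy as np
import pymc3 as pm

# Hypothetical setup: X holds (n, 2) cell-centroid coordinates and z holds
# binary observations at those locations.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 2))
z = rng.integers(0, 2, size=50)

with pm.Model():
    # Smooth latent surface: a GP prior over the cell centroids.
    ls = pm.Gamma("ls", alpha=2, beta=2)
    gp = pm.gp.Latent(cov_func=pm.gp.cov.Matern52(2, ls=ls))
    f = gp.prior("f", X=X)
    # Logit link maps the latent surface to Bernoulli probabilities.
    pm.Bernoulli("z", p=pm.math.invlogit(f), observed=z)
    trace = pm.sample(500, tune=500, target_accept=0.9)
```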

Creating an emulator for an agent-based model

Computers are (still) getting faster every year, and it is now commonplace to run simulations in seconds that would have required hours or days of compute time in previous generations. That said, we still often come across cases where our computer models are simply too intricate or have too many components to run as often and as quickly as we would like. In this scenario, we are frequently forced to choose a limited subset of potential scenarios, manifested as parameter settings, for which we have the resources to run simulations. I've written this notebook to show how to use a statistical emulator to help understand how the outputs of a model's simulations might vary with its parameters.
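To give a flavor of the idea, here is a minimal Gaussian-process emulator sketch with scikit-learn, using a toy function as a stand-in for an expensive agent-based model; the post itself may use a different emulator or library.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Toy stand-in for an expensive agent-based model with two parameters.
def simulator(theta):
    return np.sin(3 * theta[0]) + theta[1] ** 2

# Run the simulator at a small design of parameter settings.
design = np.random.default_rng(0).uniform(0, 1, size=(30, 2))
outputs = np.array([simulator(th) for th in design])

# Fit the emulator; its predictions come with uncertainty estimates,
# so we can see where more simulator runs would be most informative.
gp = GaussianProcessRegressor(
    kernel=ConstantKernel() * RBF(length_scale=[0.2, 0.2]),
    normalize_y=True,
).fit(design, outputs)
mean, std = gp.predict(np.array([[0.5, 0.5]]), return_std=True)
print(mean, std)
```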

2020

Density estimation for geospatial imagery using autoregressive neural models

Bayesian machine learning is all about learning a good representation of very complicated datasets: leveraging cleverly structured models and effective parameter estimation techniques to create a high-dimensional probability distribution that approximates the observed data. A key advantage of posing computer vision research under the umbrella of Bayesian inference is that some tasks become really straightforward with the right choice of model.

Multivariate sample size for Markov chains

Summary: I show how to calculate a multivariate effective sample size following Vats et al. (2019). In applied statistics, Markov chain Monte Carlo (MCMC) is now widely used to fit statistical models. Suppose we have a statistical model $p_\theta$ of some dataset $\mathcal{D}$ with parameter $\theta$. The basic idea behind MCMC is to estimate $\theta$ by generating $N$ random variates $\theta_1, \theta_2, \ldots, \theta_N$ from the posterior distribution $p(\theta\vert\mathcal{D})$ which are (hopefully) distributed around $\theta$ in a predictable way. The Bayesian central limit theorem states that, under the right conditions, $\theta_i$ is normally distributed about the true parameter value $\theta$ with some variance $\sigma^2$. We might then want to use the mean of these samples, $\hat{\theta}=\frac{1}{N}\sum_{i=1}^N \theta_i$, as an estimator of $\theta$, since the posterior mean is an optimal estimator in terms of Bayes risk under squared error loss.
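The estimator from Vats et al. (2019) is \(\widehat{\mathrm{ESS}} = N \left( \det\hat{\Lambda} / \det\hat{\Sigma} \right)^{1/p}\), where \(\hat{\Lambda}\) is the sample covariance of the \(p\)-dimensional chain and \(\hat{\Sigma}\) is a multivariate batch means estimate of the asymptotic covariance. A minimal NumPy sketch follows; the batch size \(\lfloor\sqrt{N}\rfloor\) is one common default, not the only choice.

```python
import numpy as np

def multivariate_ess(chain):
    """Multivariate ESS of Vats, Flegal & Jones (2019), estimated with
    multivariate batch means. `chain` is an (N, p) array of MCMC draws."""
    n, p = chain.shape
    b = int(np.floor(np.sqrt(n)))                 # batch size (common default)
    a = n // b                                    # number of batches
    batch_means = chain[: a * b].reshape(a, b, p).mean(axis=1)
    diffs = batch_means - chain.mean(axis=0)
    # Batch-means estimate of the asymptotic (Markov chain) covariance.
    sigma_hat = b * diffs.T @ diffs / (a - 1)
    lambda_hat = np.cov(chain, rowvar=False)      # ordinary sample covariance
    # ESS = N * (det(Lambda) / det(Sigma))^(1/p), via log-determinants.
    _, logdet_l = np.linalg.slogdet(lambda_hat)
    _, logdet_s = np.linalg.slogdet(sigma_hat)
    return n * np.exp((logdet_l - logdet_s) / p)

# Toy usage: a strongly autocorrelated AR(1)-style chain in p = 3 dimensions.
rng = np.random.default_rng(0)
draws = np.zeros((5000, 3))
for i in range(1, 5000):
    draws[i] = 0.9 * draws[i - 1] + rng.normal(size=3)
print(multivariate_ess(draws))                    # far fewer than 5000
```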

A review of recent work for posterior image inpainting

Images, text and audio are examples of structured multivariate data in which we have a total or partial ordering over the entries of each data point, and which may exhibit long-range structure extending over many pixels, words or seconds of speech. As a consequence, it is difficult to model these kinds of data with models that allow for only short-range structure, such as HMMs, or that can make use of only pairwise dependency structures, such as the covariance matrix of a multivariate normal distribution. What if we'd like to build Bayesian models with more sophisticated structure?

An ELBO timeline

In Bayesian machine learning, deep generative models are providing exciting ways to extend our understanding of optimization and flexible parametric forms to more conventional statistical problems, while simultaneously lending insight from probabilistic modeling to AI/ML. This is an exciting time to be studying the topic, as it blends results from probability theory, statistical physics, deep learning and information theory in sometimes surprising ways. This post is a short summary of some of the major work on the subject and serves as an annotated bibliography of the most important developments. It also uses common notation to help smooth over differences in detail between papers.
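For reference, the object at the center of that common notation is the evidence lower bound itself: for data \(x\), latent variables \(z\) and approximate posterior \(q_\phi(z \vert x)\),

\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \vert x)}\left[\log p_\theta(x, z) - \log q_\phi(z \vert x)\right] = \mathcal{L}(\theta, \phi),
\]

with equality exactly when \(q_\phi(z \vert x) = p_\theta(z \vert x)\).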

2018

Spline regression in PyMC3

Linear regression is a useful tool, but it is limited in the expressivity of the functional relations it can capture. Once we move to nonlinear regression, we have tons of options and a hard question to answer: how do we specify our model?

It’s Hard to Optimize Recurrent Models in PyMC3

During my investigations of gradient-based Markov chain Monte Carlo with nonlinear dynamical systems, I have come across one ugly, glaring fact: inference can be slow. There are real, practical problems we could solve, but they would require running our samplers for weeks, harkening back to an earlier time when computation was not cheap. The code below shows my (unsuccessful) attempts to speed up inference with NUTS for a complicated recurrent model.

Hamiltonian Monte Carlo for Hydrology, Part III

This is the third in a series of posts about my work on statistical hydrology. In posts I and II, I laid out the rationale for implementing a conceptual hydrology model in Theano and PyMC3. Here, I complete the implementation of GR4J, a simple four-parameter hydrology model, in PyMC3 and show that it can accurately recover parameter estimates from simulated data. To recap, I am doing this for several reasons. First, it allows for really, really fast Metropolis-Hastings updates because Theano compiles everything down to machine code. Second, it allows us to apply Hamiltonian Monte Carlo in case there are either lots of parameters or parameters that are highly correlated. Finally, a nice byproduct of this approach is that we can calibrate GR4J using backpropagation, just as in neural networks. For a simple watershed we might not want to do this, but for a model involving dozens of reaches and hundreds of parameters, it might be more effective than existing methods. All of this is possible because I have put GR4J in a framework that allows for automatic differentiation, so the gradient of the model likelihood with respect to its parameters is available. This is a big, big advantage over gradient-free methods like genetic algorithms and similar approaches, since gradient information at least guarantees convergence toward a local optimum of our loss function. I won't cover backpropagation-style calibration here, but I will give examples of applying Metropolis-Hastings and Hamiltonian Monte Carlo to the problem of estimating the parameters from synthetic streamflow data.

Hamiltonian Monte Carlo for Hydrology, Part II

In this post, I'll show how to implement part of a conceptual hydrology model in Theano and use it for efficient sampling in PyMC3. The model I've chosen is GR4J, and you can read about it at this website. It takes four parameters plus equal-length time series of precipitation and evaporation, and simulates streamflow. The model has two main components: generating runoff from rainfall, and routing it via two hydrographs. I've grabbed a diagram from an existing paper and shown it below.

Hamiltonian Monte Carlo for Hydrology, Part I

In a previous post I sketched out how to make a statistical model out of a simple ordinary differential equation. This is the first in a series of posts following my attempts to build a Bayesian version of a very simple conceptual hydrology model, fit within a modern deep learning / Markov chain Monte Carlo (MCMC) framework.

A Statistical ODE Model in PyMC3

A recurring theme in my posts is the power of combined statistical and physical/mechanistic models that are really only possible with modern Markov chain Monte Carlo (MCMC) frameworks. In this post, I'll go over how to make a statistical model out of a simple dynamical system. Let's assume we have a system governed by the following equation: \(\frac{dx}{dt} = -a\cdot x(t) + y(t)\), where \(y\) is a forcing variable that varies over time. Both \(x\) and \(y\) are observed, but \(a\) is unknown. We want to make a statistical inference about the value of \(a\), and we'll employ PyMC3 to do this.
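A minimal sketch of one way to set this up, assuming a forward-Euler discretization of the ODE inside a Theano scan; the post's actual implementation and priors may differ.

```python
import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt

# Simulate data from dx/dt = -a*x + y with forward Euler; a = 0.5 plays the
# role of the unknown parameter we try to recover.
dt, true_a, T = 0.1, 0.5, 100
y = np.sin(np.linspace(0, 10, T))            # forcing variable
x = np.zeros(T)
for i in range(T - 1):
    x[i + 1] = x[i] + dt * (-true_a * x[i] + y[i])
x_obs = x + np.random.default_rng(0).normal(0, 0.01, size=T)

with pm.Model():
    a = pm.HalfNormal("a", sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=0.1)
    # Forward-Euler trajectory as a deterministic function of a.
    x_t, _ = theano.scan(
        fn=lambda y_i, x_prev, a: x_prev + dt * (-a * x_prev + y_i),
        sequences=tt.as_tensor_variable(y[:-1]),
        outputs_info=tt.constant(0.0),
        non_sequences=a,
    )
    pm.Normal("obs", mu=x_t, sigma=sigma, observed=x_obs[1:])
    trace = pm.sample(1000, tune=1000)
```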

Bayesian Changepoint Analysis of PPR Water Levels

One of the most wonderful recent developments in computational statistics is the explosion of probabilistic programming frameworks. These allow for incredibly flexible statistical (and usually Bayesian) models that more accurately fit the systems we study, even when the conceptual model of the system is complicated. Some of the better-known frameworks include Stan, Edward and PyMC3. In this notebook, I use PyMC3 in Python to fit a regression model incorporating an unobserved switchpoint. The general idea is that we know some sort of structural change happened over the course of data collection, and we want to know when/where it occurred, with accompanying uncertainties.
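A minimal sketch of the switchpoint construction in PyMC3, with toy data standing in for the water-level series; the priors and mean structure here are illustrative assumptions.

```python
import numpy as np
import pymc3 as pm

# Toy series with a level shift around t = 60.
t = np.arange(100)
y = np.where(t < 60, 1.0, 2.5) + np.random.default_rng(0).normal(0, 0.3, size=100)

with pm.Model():
    tau = pm.DiscreteUniform("tau", lower=t.min(), upper=t.max())
    mu_early = pm.Normal("mu_early", mu=0.0, sigma=5.0)
    mu_late = pm.Normal("mu_late", mu=0.0, sigma=5.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    # The mean switches from mu_early to mu_late at the unobserved time tau.
    mu = pm.math.switch(tau >= t, mu_early, mu_late)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)
    # PyMC3 assigns a Metropolis step to the discrete tau automatically.
    trace = pm.sample(2000, tune=1000)
```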

Landsat and the Prairie Potholes

For the last year, most of my research has focused on the evolution of nearly a million lakes, ponds and wetlands across the North American interior. Many of these are concentrated in the Prairie Pothole Region (PPR), which extends across the Dakotas, Iowa and Minnesota as well as large parts of southern Canada. These habitats and environments are fairly unique: people have been living and working across them since homesteading days, yet we have yet to achieve a complete understanding of the complex interactions between their climate and ecosystems. Because the defining characteristic of the region is its flatness, it contains multitudes of shallow depressions that are subject to drought-and-deluge cycles of wetland filling and emptying. A wide range of highly specialized plant and animal species has come to inhabit these potholes despite these difficulties.