# It’s Hard to Optimize Recurrent Models in PyMC3

During my investigations of gradient-based Markov Chain Monte Carlo with nonlinear dynamical systems, I have come across one ugly, glaring fact: inference can be slow. There are real, practical problems we could solve, but which would require running our samplers for weeks, a throwback to an earlier time when computation was not cheap. The code below shows my (unsuccessful) attempts to speed up NUTS inference for a complicated recurrent model.

As in earlier posts, I have a simple hydrology model (GR4J) which has been wrapped in the gr4j_theano package; you can find that on my GitHub. I have some synthetic data generated by GR4J, and I want to try to recover the model parameters given the first 50 observations of precipitation, evapotranspiration, and streamflow. I'll be using NUTS, the No-U-Turn Sampler, which is PyMC3's default.

import pandas as pd

# `data` is assumed to already hold the synthetic GR4J series (loading omitted)
T = 50
P = data['P'].values
ET = data['ET'].values
modeled_Q = data['modeled_Q'].values


Below are several modified versions of the model. For each one I'm going to record how long inference takes as well as the samples that are drawn.

execution_times = []
traces = []

import time
import pymc3 as pm
from IPython.display import clear_output

def time_sampling(draws, model):
    """Run NUTS one draw at a time, recording the wall time of each draw."""
    old = time.time()
    times = []
    with model:
        for i, trace in enumerate(pm.iter_sample(draws=draws, step=pm.NUTS())):
            clear_output(wait=True)
            print(i)
            new = time.time()
            times.append(new - old)
            old = new
    return trace, times


### Basic version

In the first iteration, I am going to run inference with no modifications. Later on, I will truncate the gradient used in the Hamiltonian MC proposals and also reduce the number of observations that I use.

import numpy as np
import gr4j_theano as gt

with pm.Model() as basic_model:
    x1 = pm.Uniform('x1', lower=50, upper=2000)
    x2 = pm.Uniform('x2', lower=0.1, upper=10)
    x3 = pm.Uniform('x3', lower=5, upper=100)
    x4 = pm.Uniform('x4', lower=0.5, upper=5)
    x4_limit = 5

    S0 = pm.Uniform('S0', lower=100, upper=2000)
    R0 = pm.Uniform('R0', lower=5, upper=100)
    Pr0 = np.zeros(2 * x4_limit - 1)
    sd = 0.1
    streamflow = gt.GR4J('streamflow', x1=x1, x2=x2, x3=x3, x4=x4,
                         x4_limit=x4_limit, S0=S0, R0=R0, Pr0=Pr0, sd=sd,
                         precipitation=P,
                         evaporation=ET,
                         observed=modeled_Q, truncate=-1)



### Truncated gradient

Here, I simply tell PyMC3 and Theano to stop backpropagating the gradient after 5 steps backwards in time. I expect this to speed up computation dramatically, though at the potential expense of accuracy.

with pm.Model() as truncate_grad_model:
    x1 = pm.Uniform('x1', lower=50, upper=2000)
    x2 = pm.Uniform('x2', lower=0.1, upper=10)
    x3 = pm.Uniform('x3', lower=5, upper=100)
    x4 = pm.Uniform('x4', lower=0.5, upper=5)
    x4_limit = 5

    S0 = pm.Uniform('S0', lower=100, upper=2000)
    R0 = pm.Uniform('R0', lower=5, upper=100)
    Pr0 = np.zeros(2 * x4_limit - 1)
    sd = 0.1
    streamflow = gt.GR4J('streamflow', x1=x1, x2=x2, x3=x3, x4=x4,
                         x4_limit=x4_limit, S0=S0, R0=R0, Pr0=Pr0, sd=sd,
                         precipitation=P,
                         evaporation=ET,
                         observed=modeled_Q, truncate=5)
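Under the hood, gr4j_theano's truncate argument presumably maps onto the truncate_gradient option of Theano's scan. To make the idea concrete, here is a minimal NumPy sketch of truncated backpropagation through time for a toy linear reservoir; the recurrence and the streamflow_grad helper are illustrative inventions, not GR4J's actual equations.

```python
import numpy as np

def streamflow_grad(a, p, truncate=-1):
    """Gradient of the final state of s_t = a*s_{t-1} + p_t with respect to a,
    computed by backpropagation through time. `truncate` limits how many steps
    backwards the gradient flows (-1 means no truncation)."""
    # forward pass: store the whole state trajectory
    s = [0.0]
    for p_t in p:
        s.append(a * s[-1] + p_t)
    # backward pass: accumulate d(s_T)/da, stopping early if truncated
    grad, bar_s = 0.0, 1.0
    steps = len(p) if truncate < 0 else min(truncate, len(p))
    for t in range(len(p), len(p) - steps, -1):
        grad += bar_s * s[t - 1]   # contribution of d(a*s_{t-1})/da at step t
        bar_s *= a                 # propagate the sensitivity one step back
    return grad

p = np.random.rand(50)
full_grad = streamflow_grad(0.9, p)               # full BPTT
trunc_grad = streamflow_grad(0.9, p, truncate=5)  # only the last 5 steps
```

The truncated gradient is cheaper (and needs less of the stored trajectory), but it is a biased estimate of the true gradient, which is exactly the trade-off at issue below.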



### Subsampling

I'll also consider the same model, but using only 10 of the streamflow observations instead of all 50. The subsample_index argument specifies that only the timesteps [0, 5, …, 45] are used.
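As a quick sanity check, this reproduces just the index construction used in the model:

```python
import numpy as np

T, n = 50, 5
# every n-th timestep: 0, 5, ..., 45
subsample_index = np.arange(T)[0::n]
print(subsample_index)
```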

n = 5
with pm.Model() as subsample_model:
    x1 = pm.Uniform('x1', lower=50, upper=2000)
    x2 = pm.Uniform('x2', lower=0.1, upper=10)
    x3 = pm.Uniform('x3', lower=5, upper=100)
    x4 = pm.Uniform('x4', lower=0.5, upper=5)
    x4_limit = 5

    S0 = pm.Uniform('S0', lower=100, upper=2000)
    R0 = pm.Uniform('R0', lower=5, upper=100)
    Pr0 = np.zeros(2 * x4_limit - 1)
    sd = 0.1
    streamflow = gt.GR4J('streamflow', x1=x1, x2=x2, x3=x3, x4=x4,
                         x4_limit=x4_limit, S0=S0, R0=R0, Pr0=Pr0, sd=sd,
                         precipitation=P,
                         evaporation=ET,
                         subsample_index=np.arange(T)[0::n], truncate=-1,
                         observed=modeled_Q)



### Fitting all the models

Each inference run took roughly 10 minutes, for a grand total of around 30 minutes for all 3 cases.

for model in [basic_model, truncate_grad_model, subsample_model]:
    draws, times = time_sampling(1000, model)
    execution_times.append(times)
    traces.append(draws)


999


### Comparison

The first plot I’ve generated is just a display of how long it took each model to finish sampling.

import matplotlib.pyplot as plt
%matplotlib inline
labels = ['Basic', 'Truncate', 'Subsample']
for i in range(3):
    plt.plot(np.cumsum(execution_times[i]), label=labels[i])
plt.legend()
plt.ylabel('Cumulative execution time (s)')
_ = plt.xlabel('Draw')


The results are a little confusing - subsampling the observations actually required more time to finish than either truncating the gradient or doing nothing at all (the “basic” case).

Next, I’ve made plots of the posterior for the GR4J parameter x1 and the overall energy distribution. The true value is 320.11. In the case where I truncated the gradient, the energy plot shows an awful mismatch between the sampler proposals and the overall likelihood function. Unsurprisingly the posterior for that case misses the true parameter value altogether. Both the basic and subsampled case look like they did okay, though the posterior is broader for the subsampled model.

f, ax = plt.subplots(3, 2, figsize=(10, 15))
for i, trace in enumerate(traces):
    pm.plot_posterior(trace[500::], varnames=['x1'], ax=ax[i, 0])
    pm.energyplot(trace[500::], ax=ax[i, 1])
    ax[i, 0].set_ylabel(labels[i], fontsize=24)


One reason the No-U-Turn Sampler gets bogged down is that it is forced to build a large binary tree for local exploration of the state space, and building a large tree is expensive. The plot below shows a histogram of the tree depth for each model.

for i, trace in enumerate(traces):
    plt.hist(trace.get_sampler_stats('depth'), bins=20, alpha=0.5, label=labels[i])
plt.legend()
plt.xlabel('Tree depth')
_ = plt.ylabel('# Samples')


For the subsampled model, the binary tree is typically deeper than for the basic model. This suggests that the posterior is harder to navigate; maybe reducing the amount of data isn’t a good thing, then. The truncated version sometimes reaches the max tree depth of 10 (i.e. it’s making lots and lots of proposals that aren’t very good) which is indicative of poor sampling.
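To put those depths in perspective: NUTS doubles the trajectory at every tree level, so the cost of a single draw grows roughly like 2^depth gradient evaluations. A toy calculation, with made-up depths standing in for trace.get_sampler_stats('depth'), shows how a few deep trees dominate the budget:

```python
import numpy as np

# hypothetical tree depths for five draws (illustrative, not from the traces)
depths = np.array([3, 4, 4, 10, 5])

# each extra tree level doubles the number of leapfrog steps,
# and each leapfrog step needs one gradient evaluation
grad_evals = 2 ** depths
print(grad_evals)        # [   8   16   16 1024   32]
print(grad_evals.sum())  # 1096
```

A single depth-10 draw here costs more than the other four combined, which is why hitting the max depth repeatedly is such bad news for wall-clock time.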

### Conclusion

Neither of the methods I used to reduce the runtime of my sampler worked especially well. It looks like more data and more random variables can actually be good if they open up the posterior, unfolding a pathological, complicated likelihood function into one with less curvature.