<h1 id="creating-an-emulator-for-an-agent-based-model">Creating An Emulator For An Agent-Based Model</h1>
<p>Christopher Krapu &middot; 2021-04-05 &middot; <a href="https://ckrapu.github.io/creating-an-emulator-for-an-agent-based-model">ckrapu.github.io</a></p>
<p>Computers are (still) getting faster every year, and it is now commonplace to run simulations in seconds that would have required hours’ or days’ worth of compute time in previous generations. That said, we still often come across cases where our computer models are simply too intricate and/or have too many components to run as often and as quickly as we would like. In this scenario, we are frequently forced to choose a limited subset of potential scenarios, manifested as parameter settings, for which we have the resources to run simulations. I’ve written this notebook to show how to use a <em>statistical emulator</em> to help understand how the outputs of a model’s simulations might vary with parameters.</p>
<p>This is going to be similar in many ways to the paper by Kennedy and O’Hagan (2001), which is frequently cited on the subject, though our approach will be simpler in some regards. To start us off, I’ve modified an example of an agent-based model for disease spread on a grid which was written by Damien Farrell on <a href="https://dmnfarrell.github.io/bioinformatics/abm-mesa-python">his personal site</a>. We’re going to write a statistical emulator in PyMC3 and use it to infer likely values for the date of peak infection <em>without</em> running the simulator exhaustively over the entire parameter space.</p>
<p><strong>TL;DR</strong>: we run our simulation for a few combinations of parameter settings and then try to estimate a simulation summary statistic for the entire parameter space.</p>
<p>If you’re interested in reproducing this notebook, you can find the <code class="language-plaintext highlighter-rouge">abm_lib.py</code> file at <a href="https://gist.github.com/ckrapu/e2fb8692972ec2b499a1494760ff626e">this gist</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">abm_lib</span> <span class="kn">import</span> <span class="n">SIR</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="o">%</span><span class="n">config</span> <span class="n">InlineBackend</span><span class="p">.</span><span class="n">figure_format</span> <span class="o">=</span> <span class="s">'retina'</span>
</code></pre></div></div>
<h2 id="simulating-with-an-abm">Simulating with an ABM</h2>
<p>We’ll first need to specify the parameters for the SIR model. This model is fairly rudimentary and is parameterized by:</p>
<ul>
<li>The number of agents in the simulation</li>
<li>The height and width of the spatial grid</li>
<li>The proportion of infected agents at the beginning</li>
<li>Probability of infecting other agents in the same grid cell</li>
<li>Probability of dying from the infection</li>
<li>Mean and standard deviation of the time required to overcome the infection and recover</li>
</ul>
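<p>The agent-based implementation lives in <code class="language-plaintext highlighter-rouge">abm_lib</code> and isn’t reproduced here; as a rough intuition for what these parameters control, here is a minimal mean-field sketch of SIR dynamics. The function and its <code class="language-plaintext highlighter-rouge">contacts</code> argument are illustrative inventions, not part of the actual model:</p>

```python
import numpy as np

def toy_sir_step(S, I, R, ptrans, death_rate, recovery_days, contacts=5):
    """One mean-field step of a toy SIR model (not abm_lib's agent-based version)."""
    N = S + I + R
    # Chance that a susceptible agent is infected by at least one of its contacts
    p_infect = 1.0 - (1.0 - ptrans * I / N) ** contacts
    new_infections = S * p_infect
    recoveries = I / recovery_days           # mean duration implies a constant hazard
    deaths = I * death_rate / recovery_days  # a fraction of cases end in death
    return S - new_infections, I + new_infections - recoveries - deaths, R + recoveries

# Run it forward from a small seed of infections
S, I, R = 19996.0, 4.0, 0.0
infected = []
for _ in range(100):
    S, I, R = toy_sir_step(S, I, R, ptrans=0.1, death_rate=0.03, recovery_days=21)
    infected.append(I)
```

With these settings the infected count rises to a peak and then declines as the susceptible pool is exhausted, which is the qualitative behavior we will summarize below with a single "worst day" statistic.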
<p>These parameters, as well as the number of timesteps in the simulation, are all specified in the following cells. I am going to let most of the parameters be fixed as single values - only two parameters will be allowed to vary in our simulations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fixed_params</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"N"</span><span class="p">:</span><span class="mi">20000</span><span class="p">,</span>
<span class="s">"width"</span><span class="p">:</span><span class="mi">80</span><span class="p">,</span>
<span class="s">"height"</span><span class="p">:</span><span class="mi">30</span><span class="p">,</span>
<span class="s">"recovery_sd"</span><span class="p">:</span><span class="mi">4</span><span class="p">,</span>
<span class="s">"recovery_days"</span><span class="p">:</span><span class="mi">21</span><span class="p">,</span>
<span class="s">"p_infected_initial"</span><span class="p">:</span><span class="mf">0.0002</span>
<span class="p">}</span>
</code></pre></div></div>
<p>For the probability of transmission and death rate, we’ll randomly sample some values from the domains indicated below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sample_bounds</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"ptrans"</span><span class="p">:[</span><span class="mf">0.0001</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">],</span>
<span class="s">"death_rate"</span><span class="p">:[</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">],</span>
<span class="p">}</span>
</code></pre></div></div>
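<p>Before sampling, it is worth tallying the cost of an exhaustive sweep. Taking roughly 40 seconds per simulation (the figure observed in the runs below), even a modest dense grid over these two parameters adds up quickly; the grid size here is just an illustrative choice:</p>

```python
seconds_per_run = 40     # approximate wall time for one ABM simulation
grid_points = 32 * 32    # a modest dense grid over ptrans and death_rate

# Total wall time for running the simulator at every grid point
hours_for_sweep = seconds_per_run * grid_points / 3600  # about 11.4 hours
```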
<p>Here, we iteratively sample new values of the parameters and run the simulation. Since each one takes ~40 seconds, it would take too long to run the simulation at every single parameter value in a dense grid of 1000 or more possible settings.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

n_samples_init = 10
input_dicts = []
for i in range(n_samples_init):
    d = fixed_params.copy()
    for k, v in sample_bounds.items():
        d[k] = np.random.uniform(*v)
    input_dicts += [d]

n_steps = 100
simulations = [SIR(n_steps, model_kwargs=d) for d in input_dicts]
all_states = [x[1] for x in simulations]</code></pre></div></div>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100%|██████████| 100/100 [00:57<00:00, 1.75it/s]
100%|██████████| 100/100 [00:36<00:00, 2.78it/s]
100%|██████████| 100/100 [00:52<00:00, 1.91it/s]
100%|██████████| 100/100 [00:35<00:00, 2.80it/s]
100%|██████████| 100/100 [00:37<00:00, 2.64it/s]
100%|██████████| 100/100 [00:40<00:00, 2.48it/s]
100%|██████████| 100/100 [00:40<00:00, 2.47it/s]
100%|██████████| 100/100 [00:40<00:00, 2.44it/s]
100%|██████████| 100/100 [00:39<00:00, 2.50it/s]
100%|██████████| 100/100 [00:23<00:00, 4.32it/s]
</code></pre></div></div>
<p>Next, we combine all the sampled parameter values into a dataframe. We also add a column for our response variable which presents a summary of the results from the ABM simulation. We’ll use the timestep for which the level of infection was highest as the <code class="language-plaintext highlighter-rouge">worst_day</code> column in the dataframe.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">params_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">input_dicts</span><span class="p">)</span>
<span class="c1"># Add column showing the day with the peak infection rate
</span><span class="n">params_df</span><span class="p">[</span><span class="s">'worst_day'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">x</span><span class="p">[...,</span><span class="mi">1</span><span class="p">].</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)))</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">all_states</span><span class="p">]</span>
</code></pre></div></div>
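<p>To make the indexing above concrete: each entry of <code class="language-plaintext highlighter-rouge">all_states</code> is a (timestep, width, height, compartment) array, with channel 1 holding infected counts (an assumption based on how the code above slices it). A toy check of the reduction:</p>

```python
import numpy as np

# Tiny fake state array: 5 timesteps, 2x2 grid, 3 SIR channels
state = np.zeros((5, 2, 2, 3))
state[:, 0, 0, 1] = [1, 4, 9, 6, 2]  # infected counts over time in one cell

# Same reduction as the notebook: total infected per timestep, then argmax
worst_day = int(np.argmax(state[..., 1].sum(axis=(1, 2))))
```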
<p>We can also spit out a few animations to visualize how the model dynamics behave. This can take quite a while, however.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

generate_animations = False
figure_directory = './figures/sir-states/'

if generate_animations:
    colors = ['r', 'g', 'b']
    for k, pair in enumerate(simulations):
        model, state = pair
        for i in tqdm(range(n_steps)):
            fig = plt.figure(constrained_layout=True, figsize=(10, 2))
            gs = gridspec.GridSpec(ncols=4, nrows=1, figure=fig)

            ax = fig.add_subplot(gs[2:4])
            im = ax.imshow(state[i, :, :, 1].T, vmax=20, vmin=0, cmap='jet')
            ax.set_axis_off()
            divider = make_axes_locatable(ax)
            cax = divider.append_axes("right", size="3%", pad=0.05)
            plt.colorbar(im, cax=cax, label='Number infected')

            ax2 = fig.add_subplot(gs[0:2])
            for j in range(3):
                ax2.plot(state[:, :, :, j].sum(axis=(1, 2)), color=colors[j])
                ax2.scatter(i, state[i, :, :, j].sum(), color=colors[j])
            ax2.set_ylabel('Number infected')
            ax2.set_xlabel('Timestep')

            plt.savefig(figure_directory + 'frame_{1}_{0}.jpg'.format(str(i).zfill(5), j),
                        bbox_inches='tight', dpi=250)
            plt.close()
        ! cd /Users/v7k/Dropbox\ \(ORNL\)/research/abm-inference/figures/sir-states/; convert *.jpg sir_states{k}.gif; rm *.jpg</code></pre></div></div>
</code></pre></div></div>
<p><img src="/images/sir_states9.gif" alt="gif" />
<img src="/images/sir_states.gif" alt="gif" /></p>
<p>Clearly, the model parameters make a major difference in the rate of spread of the virus. In the second animation, the spread would need more than the 100 simulated timesteps to infect most of the agents.</p>
<p>If we make a plot depicting the date of peak infection as a function of <code class="language-plaintext highlighter-rouge">ptrans</code> and <code class="language-plaintext highlighter-rouge">death_rate</code>, we’ll get something that looks like the picture below. This is a fairly small set of points and the rest of this notebook will focus on interpolating between them in a way which provides quantified uncertainty.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">params_df</span><span class="p">.</span><span class="n">death_rate</span><span class="p">,</span> <span class="n">params_df</span><span class="p">.</span><span class="n">ptrans</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="n">params_df</span><span class="p">.</span><span class="n">worst_day</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Death rate'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Transmission probability'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">"Day / timestep"</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/images/2021-04-05-spatial-abm-emulator_19_0.png" alt="png" /></p>
<h2 id="building-a-simplified-gaussian-process-emulator">Building a simplified Gaussian process emulator</h2>
<p>Our probabilistic model for interpolating between ABM parameter points is shown below in the next few code cells. We first rescale the parameter points and the response variable to have unit variance. This makes it a little easier to specify reasonable priors for the parameters of our Gaussian process model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">param_scales</span> <span class="o">=</span> <span class="n">params_df</span><span class="p">.</span><span class="n">std</span><span class="p">()</span>
<span class="n">params_df_std</span> <span class="o">=</span> <span class="n">params_df</span> <span class="o">/</span> <span class="n">param_scales</span>
<span class="n">input_vars</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">sample_bounds</span><span class="p">.</span><span class="n">keys</span><span class="p">())</span>
<span class="n">n_inputs</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">input_vars</span><span class="p">)</span>
</code></pre></div></div>
<p>We assume that the mean function of our Gaussian process is a constant, and we use fairly standard priors for the remaining GP parameters. In particular, we use a <code class="language-plaintext highlighter-rouge">Matern52</code> covariance kernel, which makes the correlation between values of our response variable a decreasing function of the Euclidean distance between their parameter settings.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pymc3 as pm

def sample_emulator_model_basic(X, y, sampler_kwargs={'target_accept': 0.95}):
    _, n_inputs = X.shape
    with pm.Model() as emulator_model:
        intercept = pm.Normal('intercept', sd=20)
        length_scale = pm.HalfNormal('length_scale', sd=3)
        variance = pm.InverseGamma('variance', alpha=0.1, beta=0.1)
        cov_func = variance * pm.gp.cov.Matern52(n_inputs, ls=length_scale)
        mean_func = pm.gp.mean.Constant(intercept)
        gp = pm.gp.Marginal(mean_func=mean_func, cov_func=cov_func)
        noise = pm.HalfNormal('noise')
        response = gp.marginal_likelihood('response', X, y, noise)
        trace = pm.sample(**sampler_kwargs)
    return trace, emulator_model, gp</code></pre></div></div>
</code></pre></div></div>
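<p>For reference, the <code class="language-plaintext highlighter-rouge">Matern52</code> kernel used above has a simple closed form in the distance <code class="language-plaintext highlighter-rouge">r</code>. Here is a standalone sketch matching that functional form (up to parameterization details of <code class="language-plaintext highlighter-rouge">pm.gp.cov.Matern52</code>):</p>

```python
import numpy as np

def matern52(r, ls=1.0, variance=1.0):
    """k(r) = variance * (1 + sqrt(5) r/ls + 5 r^2 / (3 ls^2)) * exp(-sqrt(5) r/ls)."""
    z = np.sqrt(5.0) * r / ls
    return variance * (1.0 + z + z**2 / 3.0) * np.exp(-z)
```

The covariance equals the variance at zero distance and decays smoothly toward zero, which is what lets the GP interpolate between nearby parameter settings while reverting to the mean far from any data.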
<p>Fitting the model runs fairly quickly since we have only a handful of observed data points. If we had 1000 or more instead of 10, we might need a different flavor of Gaussian process model, such as a sparse approximation, to accommodate the larger set of data.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">params_df_std</span><span class="p">[</span><span class="n">input_vars</span><span class="p">].</span><span class="n">values</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">params_df_std</span><span class="p">[</span><span class="s">'worst_day'</span><span class="p">].</span><span class="n">values</span>
<span class="n">trace</span><span class="p">,</span> <span class="n">emulator_model</span><span class="p">,</span> <span class="n">gp</span> <span class="o">=</span> <span class="n">sample_emulator_model_basic</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><ipython-input-15-996f43c1af4e>:17: FutureWarning: In v4.0, pm.sample will return an `arviz.InferenceData` object instead of a `MultiTrace` by default. You can pass return_inferencedata=True or return_inferencedata=False to be safe and silence this warning.
trace = pm.sample(**sampler_kwargs)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [noise, variance, length_scale, intercept]
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>100.00% [8000/8000 00:29<00:00 Sampling 4 chains, 0 divergences]</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 30 seconds.
The number of effective samples is smaller than 25% for some parameters.
</code></pre></div></div>
<p>Predicting at new locations is easy too, once we have our fitted model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Xnew = np.asarray(np.meshgrid(*[np.linspace(*sample_bounds[k]) for k in input_vars]))
Xnew = np.asarray([Xnew[0].ravel(), Xnew[1].ravel()]).T / param_scales[input_vars].values

with emulator_model:
    pred_mean, pred_var = gp.predict(Xnew, given=trace, diag=True)</code></pre></div></div>
</code></pre></div></div>
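<p>One caveat: because <code class="language-plaintext highlighter-rouge">worst_day</code> was divided by its standard deviation before fitting, <code class="language-plaintext highlighter-rouge">pred_mean</code> and <code class="language-plaintext highlighter-rouge">pred_var</code> are in standardized units. A small sketch of converting back to timesteps; the arrays and the scale value below are placeholders standing in for the notebook's GP outputs and <code class="language-plaintext highlighter-rouge">param_scales</code>:</p>

```python
import numpy as np

# Placeholder standardized GP outputs; in the notebook these come from
# gp.predict(...) and the scale from params_df.std()['worst_day'].
pred_mean_std = np.array([1.2, 0.8])
pred_var_std = np.array([0.04, 0.09])
worst_day_scale = 25.0

pred_mean_days = pred_mean_std * worst_day_scale        # posterior mean in timesteps
pred_sd_days = np.sqrt(pred_var_std) * worst_day_scale  # posterior sd in timesteps
```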
<p>The final two cells create plots showing the posterior predictive distribution of the GP over all the values in parameter space for which we have no data. As we can see, it smoothly interpolates between data points.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">Xnew</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">Xnew</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">pred_mean</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Death rate'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Transmission probability'</span><span class="p">),</span>
<span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'Posterior mean'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Posterior mean surface'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/images/2021-04-05-spatial-abm-emulator_30_1.png" alt="png" /></p>
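<p>To make the interpolation behavior concrete, here is a minimal one-dimensional analogue in plain NumPy. It is independent of the PyMC3 emulator above: the kernel choice, lengthscale, and observation points here are all arbitrary values picked for illustration.</p>

```python
import numpy as np

def rbf(xa, xb, ell=0.2, eta=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D points."""
    d = xa[:, None] - xb[None, :]
    return eta**2 * np.exp(-0.5 * (d / ell)**2)

# A few "simulation" outputs observed at scattered parameter values
x_obs = np.array([0.1, 0.3, 0.5, 0.9])
y_obs = np.sin(2 * np.pi * x_obs)

# Dense grid of new points at which we want predictions
x_new = np.linspace(0, 1, 101)

sigma_n = 1e-3  # tiny observation noise for numerical stability
K = rbf(x_obs, x_obs) + sigma_n * np.eye(len(x_obs))
K_s = rbf(x_new, x_obs)

# Standard GP conditional: the mean interpolates the data, and the
# variance shrinks near observations and grows far from them
alpha = np.linalg.solve(K, y_obs)
pred_mean = K_s @ alpha
pred_var = rbf(x_new, x_new).diagonal() - np.einsum(
    'ij,ji->i', K_s, np.linalg.solve(K, K_s.T))
```

The same conditioning formula underlies the two-dimensional surface plotted above; PyMC3 additionally infers the kernel hyperparameters rather than fixing them.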
<p>We also see that the variance in the predictions grows as we move farther and farther away from observed data points.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">Xnew</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">Xnew</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">pred_var</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span><span class="o">=</span><span class="s">'k'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'w'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Death rate'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Transmission probability'</span><span class="p">),</span>
<span class="n">plt</span><span class="p">.</span><span class="n">colorbar</span><span class="p">(</span><span class="n">label</span><span class="o">=</span><span class="s">'Posterior variance'</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Posterior variance surface'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/images/2021-04-05-spatial-abm-emulator_32_0.png" alt="png" /></p>

<h2>Density Estimation For Geospatial Imagery Using Autoregressive Models</h2>

<p><em>2020-03-30 · Christopher Krapu</em></p>

<p>Bayesian machine learning is all about learning a good representation of very complicated datasets, leveraging cleverly structured models and effective parameter estimation techniques to create a high-dimensional probability distribution approximating the observed data. A key advantage of posing computer vision research under the umbrella of Bayesian inference is that some tasks become really straightforward with the right choice of model.</p>
<p>In this notebook, I show how to use <strong>PixelCNN</strong>, a deep generative model of structured data, to perform density estimation on geospatial topographic imagery derived from LiDAR maps of the Earth’s surface. I also highlight how easy this is within TensorFlow Probability, a new open-source project extending the capabilities of Tensorflow into <strong>probabilistic programming</strong>, i.e. the representation of probability distributions with computer programs in a way that treats random variables as first-class citizens.</p>
<p><strong>Note</strong>: To reproduce this notebook, you will need the digital elevation map dataset I used to train the model. It’s too large to be hosted on my Github repository. Email me at ckrapu at gmail.com to get everything you need to reproduce this!</p>
<h3 id="density-estimation">Density estimation</h3>
<p>Density estimation is a task which has a common sense interpretation: if our understanding of the world is encoded in a probabilistic model, data points with especially low density are <strong>rare</strong> according to the model while points with high density are <strong>common</strong>. Suppose that you are walking down the street and you see a bright, neon blue dog that is as large as a firetruck. This is an instance which would probably receive low density under your subjective model of the world because there is an exceedingly low probability of it appearing. Conversely, a smaller brown dog would receive a higher density value because it is more likely under the set of beliefs and assumptions you hold about the world.</p>
<p>Most probability distributions are not as rich or flexible as the set of beliefs that we individually hold about the world. Coming up with extremely flexible and rich distributions is an active area of research. As of right now, a leading approach to generating these distributions is via neural autoregressive models which extend standard time series models such as the autoregressive or ARIMA models to have a neural transition operation rather than a linear, Markovian operation. The <a href="https://arxiv.org/abs/1606.05328">PixelCNN architecture</a> is a popular neural autoregressive model currently in use.</p>
<p>Many machine learning models of imagery do not allow for easy density estimation. For example, the variational autoencoder provides a mapping from latent variable $\mathbf{z}$ to observed data point $\mathbf{x}$. Unfortunately, calculating $p(\mathbf{x})$ under the model typically requires approximating the integral $p(\mathbf{x}) = \int_z p(\mathbf{x}\vert \mathbf{z})p(\mathbf{z}) d\mathbf{z}$. Autoregressive models, in their most basic form, just don’t have this latent variable representation and instead parameterize the function $p(x_i \vert x_{i-1},\ldots,x_1)$ where $x_i$ denotes the $i$-th pixel in the image. This admits a decomposition of the image’s probability as $p(\mathbf{x})=\prod_i p(x_i\vert x_{i-1},\ldots,x_1)$. This assumes a total ordering of the pixels in an image; we usually assume the raster scan order (though there are <a href="https://arxiv.org/abs/1712.09763">creative solutions</a> which can improve on this!).</p>
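<p>As a concrete (and deliberately crude) illustration of this chain-rule factorization, the snippet below scores a tiny binary image under a toy model in which each pixel depends only on the single preceding pixel in raster order. The real PixelCNN conditions on <em>all</em> preceding pixels through masked convolutions; only the bookkeeping is the same.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 2, size=(4, 4))  # tiny binary "image"

def p_one_given_prev(prev):
    """Toy conditional p(x_i = 1 | x_{i-1}): pixels tend to copy
    their predecessor. A stand-in for the PixelCNN conditional."""
    return 0.8 if prev == 1 else 0.2

x = img.ravel()      # raster-scan order: rows left-to-right, top-to-bottom
log_p = np.log(0.5)  # uniform p(x_1) for the first pixel
for i in range(1, x.size):
    p1 = p_one_given_prev(x[i - 1])
    log_p += np.log(p1 if x[i] == 1 else 1.0 - p1)
# log p(x) = log p(x_1) + sum over i>1 of log p(x_i | x_{<i})
```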
<p>The rest of this post shows how to use the PixelCNN distribution from Tensorflow Probability and apply density estimation. The PixelCNN distribution was included with the 0.9 update of <code class="language-plaintext highlighter-rouge">tensorflow-probability</code>, so you’ll need to upgrade your installation if you were on 0.8 or earlier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="kn">import</span> <span class="nn">tensorflow_probability</span> <span class="k">as</span> <span class="n">tfp</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">flatten_image_batch</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'This script uses Tensorflow </span><span class="si">{</span><span class="n">tf</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Tensorflow Probability version: </span><span class="si">{</span><span class="n">tfp</span><span class="p">.</span><span class="n">__version__</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>This script uses Tensorflow 2.1.0
Tensorflow Probability version: 0.9.0
</code></pre></div></div>
<p>The dataset that I’m using consists of images with dimension $32\times32\times1$ representing topographical maps of the Earth’s surface in the state of North Dakota. Each pixel’s single channel of data represents the average elevation across several square meters. Features like roads, ditches, rivers and valleys can be seen in these images.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data_numpy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">'../data/datasets/training/dem_32_filtered.npy'</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">dem_as_int</span> <span class="o">=</span> <span class="p">(((</span><span class="n">data_numpy</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="mi">255</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
</code></pre></div></div>
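<p>As a quick sanity check on that rescaling (my addition, not in the original notebook), the endpoints and midpoint of $[-1, 1]$ map where we expect, and the transformation is invertible up to a quantization error of roughly one part in 255:</p>

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)

# Same forward map as in the cell above: [-1, 1] -> {0, ..., 255}
q = (((x + 1) / 2) * 255).astype(np.uint8)  # -> [0, 127, 255]

# Approximate inverse; the round trip loses at most ~1/255 of precision
x_back = (q.astype(np.float32) / 255) * 2 - 1
round_trip_error = np.abs(x_back - x).max()
```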
<p>Currently, the available architectures for PixelCNN work best when the output data is quantized. The image data originally had pixel values within the range $[-1,1]$ which need to be mapped to $\{0, 1, \ldots, 255\}$. Let’s take a look below and see what these images look like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">selected</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">data_numpy</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span><span class="n">size</span><span class="o">=</span><span class="mi">36</span><span class="p">,</span><span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">images</span> <span class="o">=</span> <span class="n">dem_as_int</span><span class="p">[</span><span class="n">selected</span><span class="p">][</span><span class="mi">0</span><span class="p">:</span><span class="mi">32</span><span class="p">]</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">flatten_image_batch</span><span class="p">(</span><span class="n">images</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(),</span><span class="mi">4</span><span class="p">,</span><span class="mi">8</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">flat</span><span class="p">),</span> <span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Training data'</span><span class="p">),</span><span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/images/density-estimation-for-geospatial-imagery-using-autoregressive-models_files/density-estimation-for-geospatial-imagery-using-autoregressive-models_5_0.png" alt="png" /></p>
<p>Many of the images are of gently sloped or rolling surfaces with a few linear features such as ditches or roads. Others have local regions of high variance corresponding to marshy vegetation, which scatters the LiDAR pulses used for elevation estimation.</p>
<p>The PixelCNN model is actually a joint distribution over all the pixels of an image. Thus, the developers of the <code class="language-plaintext highlighter-rouge">tensorflow-probability</code> package were able to include it as one of the library’s distributions! This makes it really easy to work with, and the code below shows how little setup is required to train a PixelCNN with TFP. Much of this code was copied from the TFP documentation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Specify inputs and training settings
</span><span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">filters</span> <span class="o">=</span> <span class="mi">96</span>
<span class="c1"># Create a Tensorflow Dataset object
</span><span class="n">train_dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">from_tensor_slices</span><span class="p">(</span><span class="n">dem_as_int</span><span class="p">)</span>
<span class="n">train_it</span> <span class="o">=</span> <span class="n">train_dataset</span><span class="p">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="p">).</span><span class="n">shuffle</span><span class="p">(</span><span class="n">data_numpy</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="c1"># Create the PixelCNN using TFP
</span><span class="n">dist</span> <span class="o">=</span> <span class="n">tfp</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">PixelCNN</span><span class="p">(</span>
<span class="n">image_shape</span><span class="o">=</span><span class="n">input_shape</span><span class="p">,</span>
<span class="n">num_resnet</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">num_hierarchies</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">num_filters</span><span class="o">=</span><span class="n">filters</span><span class="p">,</span>
<span class="n">num_logistic_mix</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
<span class="n">dropout_p</span><span class="o">=</span><span class="p">.</span><span class="mi">3</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Define the model input and objective function
</span><span class="n">image_input</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Input</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">input_shape</span><span class="p">)</span>
<span class="n">log_prob</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">image_input</span><span class="p">)</span>
<span class="c1"># Specify model inputs and loss function
</span><span class="n">model</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Model</span><span class="p">(</span><span class="n">inputs</span><span class="o">=</span><span class="n">image_input</span><span class="p">,</span> <span class="n">outputs</span><span class="o">=</span><span class="n">log_prob</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">add_loss</span><span class="p">(</span><span class="o">-</span><span class="n">tf</span><span class="p">.</span><span class="n">reduce_mean</span><span class="p">(</span><span class="n">log_prob</span><span class="p">))</span>
</code></pre></div></div>
<p>Once the model is specified, we just need to compile it and start training. PixelCNN is an example of an autoregressive model and these are notorious for taking a long time to train. Unfortunately, I only have access to a single GPU currently. Normally, this code would display a progress bar and training metrics. I’ve toggled these off to keep the document short and prevent a large number of warnings from being shown.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compile and train the model
</span><span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(.</span><span class="mi">001</span><span class="p">),</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[])</span>
<span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_it</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">epochs</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Train for 4602 steps
Epoch 1/3
3626/4602 [======================>.......] - ETA: 30:48 - loss: 2357.9549
4602/4602 [==============================] - 8735s 2s/step - loss: 1989.1206
Epoch 3/3
612/4602 [==>...........................] - ETA: 2:06:25 - loss: 1972.0003
</code></pre></div></div>
<p>Since we’ve created an approximation of a probability distribution, we can sample from it to see examples of points that have high density under the PixelCNN model. As a warning, this sampling procedure can take quite a while.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">samples</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="mi">36</span><span class="p">)</span>
</code></pre></div></div>
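<p>The slowness is intrinsic to autoregressive sampling: every pixel requires its own forward pass through the network, and the pixels must be drawn one at a time in raster order. The sketch below shows the shape of that loop with a made-up stand-in conditional (<code>p_one_given_canvas</code> is my invention, not the PixelCNN network):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 8  # small grid; the real model samples 32 x 32 the same way

def p_one_given_canvas(canvas, i, j):
    """Stand-in for one PixelCNN forward pass: p(x_ij = 1) given all
    already-sampled pixels. Here it just favors copying the left
    neighbor; the real model would evaluate the whole network."""
    if j == 0:
        return 0.5
    return 0.8 if canvas[i, j - 1] == 1 else 0.2

canvas = np.zeros((H, W), dtype=np.int64)
for i in range(H):       # H * W sequential conditional evaluations
    for j in range(W):   # in total -- this is why sampling is slow
        canvas[i, j] = rng.random() < p_one_given_canvas(canvas, i, j)
```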
<p>Let’s visually compare the sampled values with ground truth data points.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">utils</span> <span class="kn">import</span> <span class="n">flatten_image_batch</span>
<span class="n">samples_numpy</span> <span class="o">=</span> <span class="n">samples</span><span class="p">.</span><span class="n">numpy</span><span class="p">().</span><span class="n">squeeze</span><span class="p">()</span>
<span class="n">flat_samples</span> <span class="o">=</span> <span class="n">flatten_image_batch</span><span class="p">(</span><span class="n">samples_numpy</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">)),</span><span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">flat_samples</span><span class="p">),</span><span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Simulated images'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
<span class="n">selected</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">choice</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">data_numpy</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span><span class="n">size</span><span class="o">=</span><span class="mi">36</span><span class="p">,</span><span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="n">flat_ground_truth</span> <span class="o">=</span> <span class="n">flatten_image_batch</span><span class="p">(</span><span class="n">data_numpy</span><span class="p">[</span><span class="n">selected</span><span class="p">].</span><span class="n">squeeze</span><span class="p">(),</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">)),</span><span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">flat_ground_truth</span><span class="p">),</span><span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'True images'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">gca</span><span class="p">().</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
</code></pre></div></div>
<p><img src="/images/density-estimation-for-geospatial-imagery-using-autoregressive-models_files/density-estimation-for-geospatial-imagery-using-autoregressive-models_14_0.png" alt="png" /></p>
<p><img src="/images/density-estimation-for-geospatial-imagery-using-autoregressive-models_files/density-estimation-for-geospatial-imagery-using-autoregressive-models_14_1.png" alt="png" /></p>
<p>Both the ground truth and sampled images appear to show winding streams and sloping hillsides, though there are more linear features such as roads and ditches in the true data than in the synthetic samples.</p>
<p>In the next cell, I calculate the log density of 1000 ground truth images using the PixelCNN as my probability distribution. I also calculate the ranking of each image with regard to its probability.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">subset</span> <span class="o">=</span> <span class="n">dem_as_int</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1000</span><span class="p">]</span>
<span class="n">log_probs</span> <span class="o">=</span> <span class="n">dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">subset</span><span class="p">).</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">ranking</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">log_probs</span><span class="p">)</span>
<span class="n">sorted_log_prob</span> <span class="o">=</span> <span class="n">log_probs</span><span class="p">[</span><span class="n">ranking</span><span class="p">]</span>
</code></pre></div></div>
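<p>Raw log densities like these are hard to interpret on their own. A common normalization in the autoregressive-modeling literature (my addition; the post itself only uses the raw values for ranking) is bits per dimension:</p>

```python
import numpy as np

def bits_per_dim(log_probs, image_shape=(32, 32, 1)):
    """Convert natural-log densities into bits per pixel, the usual
    scale for comparing autoregressive image models."""
    n_dims = np.prod(image_shape)
    return -np.asarray(log_probs) / (n_dims * np.log(2))

# e.g. a log density of -2000 nats on a 32x32 grayscale image
# works out to roughly 2.82 bits per pixel
bpd = bits_per_dim(-2000.0)
```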
<p>With these rankings, I can show images which have low, medium, or high density under the PixelCNN model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig</span><span class="p">,</span><span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">16</span><span class="p">))</span>
<span class="n">sorted_by_prob</span> <span class="o">=</span> <span class="n">subset</span><span class="p">[</span><span class="n">ranking</span><span class="p">]</span>
<span class="n">subsets</span> <span class="o">=</span> <span class="p">[</span><span class="n">sorted_by_prob</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">32</span><span class="p">],</span><span class="n">sorted_by_prob</span><span class="p">[</span><span class="mi">484</span><span class="p">:</span><span class="mi">516</span><span class="p">],</span><span class="n">sorted_by_prob</span><span class="p">[</span><span class="o">-</span><span class="mi">32</span><span class="p">:]]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">'Images with low density'</span><span class="p">,</span> <span class="s">'Images with medium density'</span><span class="p">,</span><span class="s">'Images with high density'</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">subset</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">subsets</span><span class="p">):</span>
<span class="n">flat</span> <span class="o">=</span> <span class="n">flatten_image_batch</span><span class="p">(</span><span class="n">subset</span><span class="p">.</span><span class="n">squeeze</span><span class="p">(),</span><span class="mi">4</span><span class="p">,</span><span class="mi">8</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">imshow</span><span class="p">(</span><span class="n">flat</span><span class="p">),</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="n">labels</span><span class="p">[</span><span class="n">i</span><span class="p">]),</span><span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/images/density-estimation-for-geospatial-imagery-using-autoregressive-models_files/density-estimation-for-geospatial-imagery-using-autoregressive-models_19_0.png" alt="png" /></p>
<p>These images help us understand the representation that the model has learned. In the top panel, we see that the images with the lowest probability are those with a lot of “fuzziness”; these are images with lots of noisy LiDAR reflections due to water and vegetation. Since this is effectively random noise, it isn’t possible to predict perfectly what these values will be.</p>
<p>Images with high density, on the other hand, show smoothly varying topography and very strong spatial autocorrelations. Again, this isn’t terribly surprising because the model has favored data points for which it can easily yield very good pixel-level predictions. If each pixel differs from its neighbor by only a small amount, it is much easier to construct a predictive model with low error.</p>
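<p>This intuition is easy to verify with a toy calculation (mine, not from the original analysis): score a smooth series and a white-noise series under a minimal "predict your neighbor" Gaussian autoregressive model. The names and parameter values below are arbitrary.</p>

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(3)
smooth = np.cumsum(rng.standard_normal(256) * 0.01)  # gently varying "terrain"
noisy = rng.standard_normal(256)                     # marsh-like scatter

def neighbor_logprob(x, sigma=0.1):
    """Toy 1-D autoregressive density: each value is predicted to equal
    its predecessor, with Gaussian error sigma. A stand-in for the far
    richer PixelCNN conditional."""
    return normal_logpdf(x[1:], x[:-1], sigma).sum()

# The smooth series receives vastly higher density than the noisy one,
# for the same reason the smooth DEM tiles rank highest above.
```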
<p>I hope that this provided a straightforward and minimal example of how to use Tensorflow Probability for a rather sophisticated machine learning task. I’ve been impressed with the functionality incorporated into the TFP codebase and look forward to using it more in the future!</p>
<h2>Multivariate Sample Size For Markov Chains</h2>

<p><em>2020-02-18 · Christopher Krapu</em></p>

<p><strong>Summary: I show how to calculate a multivariate effective sample size after <a href="https://academic.oup.com/biomet/article/106/2/321/5426969">Vats et al. (2019)</a></strong>. In applied statistics, Markov chain Monte Carlo (MCMC) is now widely used to fit statistical models. Suppose we have a statistical model $p_\theta$ of some dataset $\mathcal{D}$ which has a parameter $\theta$. The basic idea behind MCMC is to estimate $\theta$ by generating $N$ random variates $\theta_1, \theta_2, \ldots$ from the posterior distribution $p(\theta\vert\mathcal{D})$ which are (hopefully) distributed around $\theta$ in a predictable way. The Bayesian central limit theorem states that under the right conditions, $\theta_i$ is normally distributed about the true parameter value $\theta$ with some sample variance $\sigma^2$. We might then want to use the mean of these samples $\hat{\theta}=\frac{1}{N}\sum_{i=1}^N \theta_i$ as an estimator of $\theta$ since this <em>posterior mean</em> is an optimal estimator in the context of Bayes risk and mean square loss.</p>
<p>This task has a few challenges lurking within. The accuracy of our estimate of $\theta$ is going to be low when we have only a few samples, i.e. $N$ is quite small. We can increase our accuracy by taking more samples. Ideally, our samples $\theta_i$ are all going to be <em>independent</em> so that we can make use of the theory of Monte Carlo estimators to assert that the error in our estimation of $\theta$ decreases at a rate of $1/N$. Thus, to get more accuracy, we draw more samples!</p>
<h2 id="autocorrelated-mcmc-draws">Autocorrelated MCMC draws</h2>
<p>Unfortunately, MCMC won’t provide uncorrelated values of $\theta_i$ because of its inherently sequential nature. These samples are going to have some autocorrelation $\rho$ and it’s helpful to think of this autocorrelation as reducing the number of samples from a nominal $N$ to a smaller effective number $N_{eff}$. Here’s a helpful analogy - suppose that you want to determine the average income within a city. You could pursue two sampling strategies: the first has you travel to 10 spots randomly selected on the map and query a single person at each; the second has you travel to two neighborhoods and query five people in each. The latter method has the downside that you may get grossly misrepresentative numbers if you happen to land in a neighborhood where everyone has similar incomes which are not close to the city-wide average. This is an example of <em>spatial</em> autocorrelation leading to poor estimation. The same underlying mechanism is at play when our MCMC estimator has reduced precision.</p>
<p>The literature on sequential data makes frequent use of autocorrelations $\rho_t$ at lag $t$, meant to capture associations between data points with varying amounts of time or distance between them. We can express the effective sample size in terms of these autocorrelations <a href="https://mc-stan.org/docs/2_22/reference-manual/effective-sample-size-section.html">(see here for more)</a> via the following formula:
\(N_{eff} = \frac{N}{\sum_{t=-\infty}^{\infty}\rho_t}\)</p>
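<p>To make this concrete, here is a minimal sketch (my own illustration, with made-up parameter values) that estimates the autocorrelations of a simulated AR(1) chain and plugs a truncated version of the sum into the formula above. For an AR(1) process with coefficient $\phi$, the true effective sample size is $N(1-\phi)/(1+\phi)$, so we can check the estimate against it.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) chain theta_t = phi*theta_{t-1} + noise, whose lag-t
# autocorrelation is phi**t and whose true ESS is N*(1-phi)/(1+phi).
phi, n = 0.5, 100_000
theta = np.zeros(n)
for t in range(1, n):
    theta[t] = phi * theta[t - 1] + rng.normal()

def ess_from_autocorr(x, max_lag=100):
    """Effective sample size via the truncated autocorrelation sum."""
    x = x - x.mean()
    acov = np.array([x[: len(x) - k] @ x[k:] / len(x) for k in range(max_lag)])
    rho = acov / acov[0]
    # The two-sided sum over t = -inf..inf equals 1 + 2 * (sum over positive lags)
    return len(x) / (1 + 2 * rho[1:].sum())

print(ess_from_autocorr(theta))  # roughly n*(1-phi)/(1+phi) = 33,333
```

Truncating at 100 lags is safe here because the AR(1) autocorrelations decay geometrically; for a real chain the truncation point has to be chosen with more care.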
<p>We truncate the sum in practice since the autocorrelations typically vanish after a large number of lags. Interestingly, the effective sample size has another form which also implicitly involves autocorrelations. Suppose that we have a chain of samples $\theta_1,…,\theta_N$ and partition this chain into two <em>batches</em> comprising the samples from $1$ to $N/2$ and from $N/2$ to $N$. If the samples are close to independent, then the per-batch means $T_{(k)}$ should be relatively close to each other. If they aren’t, then the batches contain distinct subpopulations of samples. The key insight here is that if the subpopulations are distinct, then they exhibit high within-batch autocorrelation. Thus, we can attempt to back out the autocorrelations by looking at the differences between batch means! For $a$ batches, each of size $N/a$, this produces the following quantity:</p>
\[\lambda^2=\frac{N}{a(a-1)}\sum_k (T_{(k)}-\hat{\theta})^2\]
<p>Then, if we take the ratio of the overall sample variance $\sigma^2$ to this quantity, we get another formula for the effective sample size:</p>
\[N_{eff} = \frac{N\sigma^2}{\lambda^2}\]
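<p>A minimal sketch of this scalar batch-means calculation on independent draws; following Vats et al., the batch-means quantity $\lambda^2$ estimates the asymptotic variance and therefore belongs in the denominator, so for an uncorrelated chain the result should come out near the nominal $N$. The sample size and seed are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Scalar batch-means ESS on independent draws: lambda^2 (batch means) estimates
# the asymptotic variance, sigma^2 is the plain sample variance.
n = 10_000
theta = rng.normal(size=n)

a = int(n ** 0.5)                  # number of batches (sqrt(N) rule of thumb)
b = n // a                         # samples per batch
batch_means = theta[: a * b].reshape(a, b).mean(axis=1)
lam2 = b / (a - 1) * ((batch_means - theta.mean()) ** 2).sum()
sigma2 = theta.var(ddof=1)

n_eff = n * sigma2 / lam2          # asymptotic variance in the denominator
print(n_eff)
```

<p>For a positively autocorrelated chain, $\lambda^2$ grows relative to $\sigma^2$ and the estimate shrinks below $N$, as it should.</p>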
<h2 id="effective-sample-size-for-multivariate-draws">Effective sample size for multivariate draws</h2>
<p>The aforementioned equations for the effective sample size are fine for draws of univariate quantities. We also want to know how to obtain an analogous number for vector-valued random processes. Researchers often attempt to do so by simply evaluating the scalar $N_{eff}$ for each individual dimension of a chain of vector samples, but this isn’t very satisfying. Fortunately, recent work by <a href="https://academic.oup.com/biomet/article/106/2/321/5426969">Vats et al. (2019)</a> has shown that the straightforward multivariate generalization of the above formula works perfectly well! We simply have to generalize the quantities $\lambda^2$ and $\sigma^2$ to their matrix counterparts:
\(\Lambda=\frac{N}{a(a-1)}\sum_k (\vec{T}_{(k)}-\hat{\theta})^T(\vec{T}_{(k)}-\hat{\theta})\)
\(\Sigma=\frac{1}{N-1}\sum_i (\vec{\theta}_i-\hat{\theta})^T(\vec{\theta}_i-\hat{\theta})\)
With these quantities, we write out the effective number of samples as before, just with the matrix generalizations of all quantities involved. Note that here, $p$ represents the dimension of $\theta_i$.</p>
\[N_{eff}^{multi} = N\left(\frac{\vert\Sigma\vert}{\vert\Lambda\vert}\right)^{1/p}\]
<p>Note that you still need to choose how many batches are used - a rule-of-thumb (there are more technical conditions that are worth reading about, though) is to use a batch size of $\sqrt{N}$, so if you have 256 samples then there would be 16 batches of 16 samples each.</p>
<p>In the code below, I’ll show how to calculate this for a toy example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">256</span> <span class="c1"># Number of draws
</span><span class="n">p</span> <span class="o">=</span> <span class="mi">10</span> <span class="c1"># Dimension of each draw
</span><span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="c1"># True covariance matrix
</span><span class="n">mean</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="c1"># true mean vector
</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span><span class="n">cov</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="n">n</span><span class="p">)</span>
<span class="n">n_batches</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">n</span><span class="o">**</span><span class="mf">0.5</span><span class="p">)</span>
<span class="n">samples_per_batch</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">n</span> <span class="o">/</span> <span class="n">n_batches</span><span class="p">)</span>
<span class="c1"># Split up data into batches and take averages over
# individual batches as well as the whole dataset
</span><span class="n">batches</span> <span class="o">=</span> <span class="n">samples</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">n_batches</span><span class="p">,</span><span class="n">samples_per_batch</span><span class="p">,</span><span class="n">p</span><span class="p">)</span>
<span class="n">batch_means</span> <span class="o">=</span> <span class="n">batches</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">full_mean</span> <span class="o">=</span> <span class="n">samples</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># Calculate the matrix lam as a sum of vector
# outer products
</span><span class="n">prefactor</span> <span class="o">=</span> <span class="n">samples_per_batch</span> <span class="o">/</span> <span class="p">(</span><span class="n">n_batches</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">batch_residuals</span> <span class="o">=</span> <span class="p">(</span><span class="n">batch_means</span> <span class="o">-</span> <span class="n">full_mean</span><span class="p">)</span>
<span class="n">lam</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_batches</span><span class="p">):</span>
<span class="n">lam</span> <span class="o">+=</span> <span class="n">prefactor</span> <span class="o">*</span> <span class="p">(</span><span class="n">batch_residuals</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,:]</span> <span class="o">*</span> <span class="n">batch_residuals</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,:].</span><span class="n">T</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">cov</span><span class="p">(</span><span class="n">samples</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="n">n_eff</span> <span class="o">=</span> <span class="n">n</span><span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">det</span><span class="p">(</span><span class="n">lam</span><span class="p">))</span><span class="o">**</span><span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">p</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see what $N_{eff}$ is for this case:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'There are {0} effective samples'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">n_eff</span><span class="p">)))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>There are 166 effective samples
</code></pre></div></div>
<p>Since I used 256 truly independent samples in total, the effective sample size should land close to the nominal $N$; the gap between the two reflects the sampling variability of determinant-based estimates when only $\sqrt{N}$ batches are available. I hope this was useful! Again, you can read more about this method at <a href="https://academic.oup.com/biomet/article/106/2/321/5426969">this Biometrika article</a> by Dootika Vats et al.</p>Christopher KrapuPosterior Image Inpainting, Part I - Review of Recent Work2020-02-10T00:00:00-08:002020-02-10T00:00:00-08:00https://ckrapu.github.io/UQ-for-structured-data<p>Image, text and audio are examples of <em>structured</em> multivariate data in which there is a total or partial ordering over the entries of each data point and which may exhibit <em>long-range</em> structure extending over many pixels, words or seconds of speech. As a consequence, it is difficult to model these kinds of data using models that allow only short-range structure, such as HMMs, or that can make use of only pairwise dependency structures, such as the covariance matrix in a multivariate normal distribution.
What if we’d like to build Bayesian models with more sophisticated structure?</p>
<p>There exists a tremendous number of applications in which we might like to quantify our uncertainty regarding missing portions of structured data so that we can understand what the missing completion might look like. For example, an X-ray of a fractured wrist may be partially occluded and we would like to know whether the rest of a partially observed crack is large or small, conditional on the parts of the X-ray that we can actually observe. Ideally, we would be presented with an entire <em>distribution</em> of image completions which exhibit completed structure in proportion to their conditional probability given the observed piece. I call this task <strong>posterior inpainting</strong>, which is a more stringent definition of a task already explored somewhat in the literature as <a href="https://zpascal.net/cvpr2019/Zheng_Pluralistic_Image_Completion_CVPR_2019_paper.pdf">pluralistic image completion (Zheng et al., 2019)</a> or <a href="https://arxiv.org/pdf/1810.03728.pdf">probabilistic semantic inpainting (Dupont and Suresha, 2019)</a>. This problem is mathematically identical to that of super-resolution; if we consider observed pixels on a regular grid and assume that between every pair of observed pixels is a series of $M$ masked pixels then the observed pixels constitute a downsampled version of the entire image with a resolution equal to $1/M$ of the original. Colorization can also be placed within the same formalism except that we treat some of the <em>channels</em> as missing data.</p>
<h2 id="posterior-inpainting-problem-setup--formalism">Posterior inpainting: problem setup & formalism</h2>
<p>We designate $x$ to be a single observation which is itself a vector with elements $x_1, x_2,…x_D$. For image data, this amounts to describing $x$ as an image with $D$ pixels. Where necessary, we may refer to alternative observations in the larger dataset as $\mathcal{D}={ x^{(1)},x^{(2)},…,x^{(N)}}$. Under several popular generative models of structured data such as variational autoencoders (VAEs) and generative adversarial networks (GANs), it is also assumed each data point $x$ has a latent representation $z$ that encodes the information in $x$ in a compressed, low-dimensional format. Not all state-of-the-art generative models share this assumption, however! Autoregressive models such as PixelCNNs or PixelRNNs may work either <a href="https://arxiv.org/abs/1606.05328">with or without the usage of latent variables</a>.</p>
<p>When latent variables are used, we often represent the function linking the latent code $z$ to the observed data $x$ as $f_\theta(z)$ with $\theta$ representing the parameters of the generative model $f$. In the case of $L_2$ pixelwise reconstruction error for image data, we could thus represent the likelihood for $x$ given $z$ as \(p_\theta(x\vert z)=MVN(f_\theta(z),\sigma^2_\epsilon I)\). This multivariate normal specification is simply saying that the log-likelihood has the form \(\propto \frac{1}{\sigma^2_\epsilon} \vert\vert x-f_\theta(z)\vert\vert_2^2\). Note that $f_\theta$ is a deterministic function (though we could relax this if we had a stochastic generative network such as a Bayesian neural net) so each latent $z$ is mapped to exactly one output image. In the event that we have a partially observed $\tilde{x}$ missing some of its pixels, there may be multiple $\tilde{z}^{(1)},\tilde{z}^{(2)},…,\tilde{z}^{(L)}$ which all yield some $f_\theta(\tilde{z})$ which is a good match for $x$. The central problem I’m addressing in this post is the sampling and computation related to obtaining these $\tilde{z}$ such that they are truly representative of the posterior distribution $p(\tilde{z} \vert \tilde{x})$. The next section is a review of papers which attempt to address this problem. Here’s a glossary of some of the terms that will be used frequently:</p>
<ul>
<li>$x$ indicates a single image with a partially observed counterpart $\tilde{x}$</li>
<li>$z$ is the corresponding latent code for $x$ and $\tilde{z}$ is one of potentially many latent codes consistent with the partially observed $\tilde{x}$</li>
<li>The generative model linking the latent code $z$ is denoted by $f_\theta$. For a variational autoencoder, this is the decoder network while for GANs, this is the generator. We refer to the weights of these networks as $\theta$.</li>
<li>An estimator of $z$ is written as $\hat{z}$</li>
<li>If a variational encoder network is used, $q_\phi(z\vert x)$ is used to refer to the conditional distribution of the latent code $z$ given image $x$. Note that this is conceptually quite different from a posterior $\propto p_\theta (x\vert z)p_\lambda(z)$ because the latter only requires a generative model $f_\theta$ and a prior on $z$ with parameters $\lambda$!</li>
</ul>
<h2 id="review">Review</h2>
<h2 id="semantic-image-inpainting-with-deep-generative-models--yeh-et-al-2017">Semantic image inpainting with deep generative models: <a href="https://arxiv.org/pdf/1607.07539.pdf">Yeh et al. 2017</a></h2>
<p>As soon as deep generative models such as VAEs and GANs started producing visually appealing samples when trained on more sophisticated data, researchers started investigating ways to use them to help solve a range of computer vision tasks including image inpainting. Yeh et al. 2017 presented a very straightforward and common-sense way to tackle image inpainting with a DGM. The basic recipe that they suggested for completing an image $\tilde{x}$ is:</p>
<ol>
<li>Pick a randomly initialized $\hat{z}_0$</li>
<li>Calculate a loss function $L(\hat{z})=\gamma h(\hat{z}) +\vert\vert f_\theta(\hat{z}) - \tilde{x}\vert\vert$ where $\vert\vert \cdot \vert\vert$ denotes the norm or error function of your choice, $h$ is a prior loss and $\gamma$ is a weighting factor. This study used both $L_1$ and $L_2$ norms.</li>
<li>Use a gradient descent optimization scheme to repeatedly apply the update \(\hat{z}_{t+1}=\hat{z}_t - \alpha\nabla_{\hat{z}}L(\hat{z}_{t})\) with some learning rate $\alpha$</li>
<li>Since $\vert\vert f_\theta(\hat{z}) - \tilde{x}\vert\vert$ is likely going to be nonzero, apply image post-processing to blend together any sharp discontinuities between the completion $f_\theta(\hat{z})$ and the original image $\tilde{x}$.</li>
</ol>
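<p>The recipe above can be sketched in a few lines with a toy generator. Here a fixed random linear map stands in for the trained $f_\theta$, so the gradient of the loss is available in closed form rather than by backpropagating through a network; the dimensions, prior weight and learning rate are illustrative, not the paper's settings.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the trained generator f_theta: a fixed random linear map.
d_latent, d_image = 8, 32
W = rng.normal(size=(d_image, d_latent))
f = lambda z: W @ z

x_true = f(rng.normal(size=d_latent))   # an "image" generated by the model
mask = rng.random(d_image) < 0.5        # indicator of observed pixels
gamma, alpha = 1e-3, 5e-3               # prior weight and learning rate

z_hat = rng.normal(size=d_latent)       # step 1: random initialization
for _ in range(2_000):
    resid = mask * (f(z_hat) - x_true)          # masked reconstruction error
    # gradient of gamma*||z||^2 + ||mask*(f(z) - x)||_2^2 with respect to z
    grad = 2 * gamma * z_hat + 2 * W.T @ resid
    z_hat -= alpha * grad                       # step 3: descend the loss

print(np.abs(mask * (f(z_hat) - x_true)).max())  # should be small
```

<p>With a deep generator, <code class="language-plaintext highlighter-rouge">grad</code> would come from automatic differentiation, but the structure of the loop is the same.</p>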
<p>While this procedure is guaranteed to converge to a local minimum, this paper doesn’t provide a recipe to either escape these minima or try to draw a range of samples. That’s beside the point, though, since the main contribution of this paper was simply to show how to get a single inpainted completion at all.</p>
<p>It’s a shame that the authors didn’t report on any results with injected noise in step #3 above (e.g. using an update rule \(\hat{z}_{t+1}=\hat{z}_t - \alpha\nabla_{\hat{z}}L +\epsilon\) with $\epsilon$ drawn from an isotropic Gaussian) since this very nearly turns it into a <a href="https://en.wikipedia.org/wiki/Metropolis-adjusted_Langevin_algorithm">Langevin sampler</a> which I suspect would be a highly effective sampling scheme for this problem.</p>
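<p>As a rough sketch of that idea (my own toy illustration, not an experiment from the paper), injecting Gaussian noise into each gradient step yields an unadjusted Langevin sampler whose stationary distribution is proportional to $\exp(-L(z))$; with a quadratic loss $L(z)=z^2$ the target is $N(0, 1/2)$, which makes the behavior easy to verify:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy gradient descent on L(z) = z**2: the stationary distribution of this
# unadjusted Langevin chain is proportional to exp(-L(z)), i.e. N(0, 1/2).
grad_L = lambda z: 2.0 * z
step = 1e-2

z, samples = 0.0, []
for t in range(200_000):
    z = z - step * grad_L(z) + rng.normal(scale=np.sqrt(2 * step))
    if t > 1_000:                 # discard a short burn-in
        samples.append(z)

print(np.var(samples))            # close to the target variance of 0.5
```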
<p>There’s an application paper, <em>Generating Realistic Geology Conditioned on Physical Measurements with Generative Adversarial Networks</em> by Dupont et al. 2018, which uses nearly the exact same method as Yeh et al. save for a minor modification: a mask applied to the loss function. As the authors of this paper noted:</p>
<blockquote>
<p>To the best of our knowledge, creating models that can simultaneously (a) generate realistic images, (b) honor constraints, (c) exhibit high sample diversity is an open problem.</p>
</blockquote>
<p>Clearly, the limitations of this approach are noted - getting high sample diversity could be challenging!</p>
<h2 id="pixel-constrained-cnns--dupont-and-suresha-2019">Pixel Constrained CNNs: <a href="https://arxiv.org/pdf/1810.03728.pdf">Dupont and Suresha 2019</a></h2>
<p>In an apparent follow-up to the challenge noted in the previous section, Dupont and Suresha
attempted to address the major shortcomings of the Dupont et al. (2018) approach by embracing a latent variable-free approach that allowed for straightforward sampling from conditional distributions over images. The basic idea in this paper is to augment a PixelCNN’s predictive distribution over pixels to include information which is outside of the usual raster scan ordering imposed on the sequence of pixels.</p>
<p>We can think of the basic PixelCNN with weights $\theta$ as an autoregressive generative model $f_\theta(x_i\vert x_1,…x_{i-1})$. The general problem that Dupont and Suresha tackle is how to augment the conditioning set of variables with pixels that might be out of raster scan order, yet still observed. Let’s denote the set of observed pixels as $X_c$. Then, the generative model becomes $f_\theta(x_i\vert {x_1,…,x_{i-1}}\cup X_c)$. To implement the PixelCNN constrained to match observations, the authors represent the conditional likelihood of the discretized categories of $x_i$ as log-linear in the outputs of two different networks: (1) a standard PixelCNN with little modification, and (2) a fairly standard ConvNet which takes in masked pixels and outputs a logit. The second network also needs to have an extra channel for its inputs to indicate which pixels are masked since, for example, a value of zero in the masked data could correspond to either missing data or an observed value of zero. This has an advantage over latent variable-based approaches in that the samples of the completed image $\hat{x}$ will not need Poisson blending to match the observed pixels - there is no generation of the already-observed pixels in this procedure.</p>
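<p>A toy sketch of the log-linear combination for a single pixel, with random vectors standing in for the two networks' outputs (none of this is the authors' actual code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-pixel predictive distribution: add the logits of the autoregressive
# prior network and of the conditioning network (which sees the masked image
# plus an extra mask channel), then sample from the resulting softmax.
n_levels = 256
prior_logits = rng.normal(size=n_levels)  # stand-in for the PixelCNN output
cond_logits = rng.normal(size=n_levels)   # stand-in for the masked-image ConvNet

def sample_pixel(prior_logits, cond_logits, rng):
    logits = prior_logits + cond_logits   # log-linear combination
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over intensity levels
    return rng.choice(len(logits), p=probs)

value = sample_pixel(prior_logits, cond_logits, rng)
print(value)                              # an intensity in [0, 255]
```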
<p>I have to say that I am really impressed with the quality and diversity of the samples drawn from the conditional distribution over completions - I think this is a front-runner and current SOTA for posterior image completion.</p>
<h2 id="pluralistic-image-completion-zheng-et-al-2019">Pluralistic image completion: <a href="https://zpascal.net/cvpr2019/Zheng_Pluralistic_Image_Completion_CVPR_2019_paper.pdf">Zheng et al. 2019</a></h2>
<p>This paper was published in CVPR and, either because of the journal’s format or because the study is heavy on technical details, I found it very difficult to read. Unfortunately, I am unable to tell what the essence of this work is besides the fact that they pair two generative networks together which are trained on differing tasks. There were many details that would have ideally been given a longer treatment, which likely contributed to the paper being relatively difficult to follow. This may have been an unavoidable consequence of the journal length format, however, and I do not intend this to be criticism of the authors’ writing.</p>
<h2 id="a-bayesian-perspective-on-the-deep-image-prior-cheng-et-al-2019">A Bayesian perspective on the deep image prior: <a href="https://people.cs.umass.edu/~zezhoucheng/gp-dip/gp-dip.pdf">Cheng et al. 2019</a></h2>
<p>The main contribution of this work is showing that sampled images from a deep generative model prior to training (AKA the <a href="https://arxiv.org/abs/1711.10925">deep image prior</a>) are actually draws from a Gaussian process. While this is a neat coincidence, it’s not especially surprising given an abundance of work on relating neural networks and Gaussian processes as two leading forms of universal function approximators. However, the part that interested me the most was in their experimental section in which they discuss using the deep image prior for reconstruction as well as other image processing tasks and use Langevin dynamics to draw samples of $\theta$ leading to a posterior distribution of $p(x \vert \tilde{x})=\int_\theta p(x\vert \theta,\tilde{x})p(\theta \vert \tilde{x})d\theta$. Note that in this framework, there’s no mention of distributions over $z$ or $\tilde{z}$ - these are treated as fixed inputs!</p>
<p>I do want to take a minute here to critique the authors’ description, though. They aren’t using stochastic gradient Langevin Dynamics (SGLD) in the way that most people understand it. Let’s take a look at the deep image prior’s weight update equation from section 4 of this paper:</p>
\[\eta_t \sim N(0,\epsilon)\\
\theta_{t+1}=\theta_t +\frac{\epsilon}{2}\left[\underbrace{\nabla_\theta \log p(\tilde{x}\vert\theta)}_{\substack{\text{Reconstruction error}\\ \text{for non-masked data}}}
+\overbrace{\nabla_\theta \log p(\theta)}^{\text{DIP}}\right]+\eta_t\]
<p>The “Langevin” part comes about because the behavior of $\theta$ can be thought of as a particle subject to random perturbations (i.e. the isotropic noise $\eta_t$) while also under the influence of a force represented as the gradient of a potential. In this context, the gradient is the sum of a gradient due to reconstruction error and due to the deep image prior (DIP). The “stochastic” part of SGLD refers to using minibatch approximations for the gradient estimator which we are forced to do because a full batch would be too computationally expensive. However, here, note that $\tilde{x}$ isn’t a minibatch - it’s the entire dataset! Within the setup laid out by Cheng et al., the deep generative model $f_\theta$ has parameters $\theta$ which do indeed need to be optimized, but they are only optimized using a single partial image $\tilde{x}$ rather than multiple images $x_1,…,x_N$ as is done with standard VAE and GAN training protocols. Thus, they are really just implementing Langevin dynamics. This is the same thing as <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00123">MALA</a> with the Metropolis accept/reject step removed.</p>
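<p>To see the update in action, here is a minimal sketch on a toy one-parameter model where the posterior is available in closed form; the conjugate-Gaussian setup is my own illustrative choice, not the model from the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Langevin dynamics on a toy model: prior theta ~ N(0, 1) and m noisy
# observations x_i ~ N(theta, 1) standing in for x-tilde. Each update takes
# half a step along the summed gradients plus Gaussian noise of variance eps,
# mirroring the displayed equation.
m = 10
x_obs = rng.normal(loc=1.0, size=m)
grad_loglik = lambda th: np.sum(x_obs - th)   # d/d(theta) log p(x_obs | theta)
grad_logprior = lambda th: -th                # d/d(theta) log p(theta)

eps = 1e-3
theta, draws = 0.0, []
for t in range(100_000):
    eta = rng.normal(scale=np.sqrt(eps))
    theta += eps / 2 * (grad_loglik(theta) + grad_logprior(theta)) + eta
    if t > 1_000:
        draws.append(theta)

# The exact posterior here is N(m * xbar / (m + 1), 1 / (m + 1))
print(np.mean(draws), np.var(draws))
```

<p>The empirical mean and variance of the draws land close to the closed-form posterior, which is exactly the behavior the full-batch Langevin scheme is supposed to deliver.</p>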
<p>Since only a single image is used to optimize / sample $\theta$, the model is really only able to capture information from two sources: (1) the inductive bias baked into the deep image prior (i.e. strong spatial covariance in the GP interpretation) and (2) image structures present in $\tilde{x}$ which thus influence $p(x\vert \tilde{x})$. This could have serious downsides - suppose we’d like to compute a posterior distribution of completions for an image of a man with blond hair yet his mouth (and mustache) are cropped out. Since the single image does not have any brown hair in it, it is unlikely that the deep image prior can be used to generate image completions consistent with a brown mustache. Yet, it is possible that in the collection of all training images $x_1,…,x_N$ there exist some pictures of men with blond hair and a brown mustache. This sort of outcome is also unlikely to have a nonnegligible probability under the deep image prior. All in all, this paper raises a number of possible directions for UQ with structured data via Langevin MCMC and also obviates the need to do any training at all!</p>
<h2 id="bayes-by-backprop-blundell-et-al-2015">Bayes by Backprop: <a href="https://arxiv.org/pdf/1505.05424.pdf">Blundell et al. (2015)</a></h2>
<p>Strictly speaking, this paper has nothing to do with image completion and is focused entirely on treating neural network weights as random variables rather than fixed parameters. However, it’s not hard to see how this might give a possible recipe for posterior inpainting. Suppose we have a procedure $\nu (\theta,\tilde{x})$ that takes in a set of neural network weight parameters $\theta$ as well as a partially completed image $\tilde{x}$ and deterministically returns an estimated completion $\hat{x}$. For example, see <a href="https://arxiv.org/abs/1607.07539">Yeh et al. (2017)</a> for such a recipe. Then, if we could sample from a posterior distribution $p(\theta\vert \mathcal{D})$ then we could perform ancestral sampling to approximate $p(\hat{x}\vert\mathcal{D})=\int_\theta p(\hat{x}\vert\theta)p(\theta\vert\mathcal{D})d\theta$. Since this paper is about providing $p(\theta\vert\mathcal{D})$, I judge it as highly relevant to the task at hand. The paper works with a similar conceptual framework as the <a href="https://arxiv.org/abs/1312.6114">Autoencoding Variational Bayes paper</a> but targets the neural network weights $\theta$ instead of the latent variables $z$ for a variational approximation. I’m going to spend much more time analyzing this paper because I think it provides a really nice template for thinking about Bayesian deep learning.</p>
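<p>A schematic of that ancestral sampling recipe, with a stand-in completion procedure $\nu$ and faked Gaussian draws playing the role of $p(\theta\vert\mathcal{D})$ (both purely illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling for completions: push each draw from p(theta | D) through
# a deterministic completion procedure nu(theta, x_tilde) to get one sample of
# the completed signal. The toy nu below fills the missing entries by scaling
# the observed ones; nan marks the missing entries of x_tilde.
x_tilde = np.array([1.0, 2.0, np.nan, np.nan])

def nu(theta, x_tilde):
    completed = x_tilde.copy()
    completed[np.isnan(x_tilde)] = theta * x_tilde[~np.isnan(x_tilde)]
    return completed

theta_draws = rng.normal(loc=1.0, scale=0.1, size=1_000)  # pretend p(theta | D)
completions = np.stack([nu(th, x_tilde) for th in theta_draws])
print(completions.mean(axis=0))   # observed entries are fixed; missing ones vary
```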
<p>The stated objective in this work is to pose neural network training as solving the following optimization problem for the variational free energy $\mathcal{F}$ in terms of the variational parameters $\phi$ given dataset $\mathcal{D}$, likelihood $p(\mathcal{D}\vert\theta)$ and weight prior $p(\theta)$.
\(\begin{align}
\phi^* &=\underset{\phi}{\text{arg min }}\mathcal{F}(\mathcal{D},\theta,\phi)\\
&=\underset{\phi}{\text{arg min }}KL(q(\theta\vert\phi)\vert\vert p(\theta\vert\mathcal{D}))\\
&= \underset{\phi}{\text{arg min }}KL(q(\theta\vert\phi)\vert\vert p(\theta)) - E_{q(\theta\vert\phi)}\left[\log p(\mathcal{D}\vert \theta)\right]
\end{align}\)
If this notation is opaque or these equations are especially hard to follow, I recommend looking at my <a href="https://ckrapu.github.io/a-timeline-of-the-variational-expected-lower-bound/">earlier post</a> which repeats these calculations ad nauseam. In line with their derivation, we next define the variational free energy as \(\mathcal{F}(\theta,\phi)=KL(q(\theta\vert\phi)\vert\vert p(\theta)) - E_{q(\theta\vert\phi)}\left[\log p(\mathcal{D}\vert \theta)\right]\) and then attempt to find a Monte Carlo estimator of its gradient $\nabla_\phi \mathcal{F}(\theta,\phi)$. Unfortunately, this has the form $\nabla_\phi E_{q(\theta\vert\phi)}\left[…\right]$ and we can’t push the gradient operator inside the expectation since the density that we are integrating against itself depends on $\phi$. To solve this, we make use of the reparameterization trick and a deterministic function $t(\epsilon,\phi)$ to rewrite $\theta=t(\epsilon,\phi)$. This yields:</p>
\[\begin{align}
\nabla_\phi \mathcal{F}(\theta,\phi)&=\nabla_\phi E_{q(\theta\vert\phi)}\left[\log\frac{q(\theta\vert\phi)}{p(\theta)}-\log p(\mathcal{D}\vert \theta)\right]\\
&=\nabla_\phi E_{q(\theta\vert\phi)}\left[\log q(\theta\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)\right]\\
&=\nabla_\phi E_{p(\epsilon)}\left[\log q(t\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)\right]\\
\end{align}\]
<p>At this point we simplify the notation by designating \(f(\theta,\phi) = \log q(\theta\vert\phi) - \log p(\theta)-\log p(\mathcal{D}\vert \theta)\), leading to the following:
\(\begin{align}
\nabla_\phi \mathcal{F}(\theta,\phi)&=\nabla_\phi E_{p(\epsilon)}\left[f(t,\phi)\right]\\
&= E_{p(\epsilon)}\left[\nabla_\phi f(t,\phi)\right]\\
\end{align}\)
To avoid having to write the gradient in terms of Jacobian products, I’ll focus on the elementwise derivative as done in the paper:
\(\begin{align}
\frac{\partial}{\partial\phi}\mathcal{F}(\theta,\phi)
&= E_{p(\epsilon)}\left[\frac{\partial}{\partial\phi} f(t,\phi)\right]\\
&= E_{p(\epsilon)}\left[\frac{\partial f}{\partial\theta}\frac{\partial \theta}{\partial\phi} + \frac{\partial f}{\partial \phi}\right]\\
\end{align}\)
It turns out that this step is really all you need to be able to implement a BBB estimation scheme in a modern deep learning framework, though. <a href="https://gluon.mxnet.io/chapter18_variational-methods-and-uncertainty/bayes-by-backprop.html">See here for a great example from the Gluon developers!</a> We will need to make some more assumptions about the specific parametric form of $t(\phi,\epsilon)$ to make the above gradient more explicit. While we’re free to consider any transformation $t: \epsilon\rightarrow\theta$, one of the simplest is a scale-location transformation where the $i$-th neural network weight is written as $\theta_i = \mu_i + \epsilon_i \cdot \sigma_i$ with $\mu_i$ giving the variational posterior mean of $\theta_i$ and $\sigma_i$ providing the variational posterior standard deviation. The standard deviation is always positive and we’d prefer to perform unconstrained optimization when possible, so Blundell et al. reparameterize $\sigma_i=\log (1+e^{\rho_i})$ instead.</p>
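<p>A minimal numerical sketch of this scale-location reparameterization, with illustrative values for $\mu$ and $\rho$:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# theta = mu + eps * softplus(rho): each weight draw is a deterministic
# function t(eps, phi) of the variational parameters phi = (mu, rho) and
# parameter-free noise eps ~ N(0, I), which is what lets gradients flow
# through the sampling step.
softplus = lambda rho: np.log1p(np.exp(rho))

mu = np.array([0.0, 1.0, -2.0])        # variational posterior means
rho = np.array([-1.0, 0.0, 1.0])       # unconstrained; sigma = softplus(rho)
eps = rng.normal(size=(100_000, 3))    # parameter-free noise
theta = mu + eps * softplus(rho)       # t(eps, phi)

print(theta.mean(axis=0))              # approximately mu
print(theta.std(axis=0))               # approximately softplus(rho)
```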
<p>Since the vector $\phi$ is supposed to include all of the variational parameters and each element of $\theta$ has a variational mean and standard deviation, the vector $\phi$ is going to have double the dimension of $\theta$. Let’s split apart $\phi$ and examine some of the gradients more closely, focusing on the variational mean $\mu$:</p>
\[\begin{align}
\frac{\partial\mathcal{F}}
{\partial\mu}
&=E_{p(\epsilon)}\left[\frac{\partial f}{\partial\theta}\cdot\frac{\partial\theta}{\partial\mu}+\frac{\partial{f}}{\partial\mu}\right]\\
&=E_{p(\epsilon)}\left[\frac{\partial}{\partial\theta}\left[\log q(\theta\vert\phi)-\log p(\theta)-\log p(\mathcal{D}\vert\theta)\right]\cdot\frac{\partial\theta}{\partial\mu}+\frac{\partial{f}}{\partial\mu}\right]
\end{align}\]
<p>Addressing each of these terms within $\partial f/\partial \mu$ individually will be more enlightening. The form of the conditional variational density $q(\theta\vert\phi)$ depends on our model assumptions; the default version given in Blundell et al. assumes a multivariate normal with diagonal covariance. Thus, up to an additive constant, we have \(\log q(\theta\vert\phi) = -\frac{1}{2}(\theta-\mu)^T\Sigma_q^{-1}(\theta-\mu)\). Here, the covariance matrix $\Sigma_q$ has the variances $\sigma_i^2$ on its diagonal, so its inverse will also be diagonal with diagonal entries of $1/\sigma_i^2$. We can see that this is going to push the values of $\mu$ in line with the values of $\theta$.</p>
<p>Next, the prior $p(\theta)$ is going to play a regularization role. We have a couple of options here; using an isotropic Gaussian with sufficiently small variance will induce $L_2$ regularization on the weights with equal strength everywhere. The study authors point out that a prior which allows for some large coefficients but mostly small coefficients can be useful and thereby include a two-component mixture of Gaussians with the idea that one mixture has a small variance (preferring lots of coefficients with small values) and the other mixture has a large variance to allow for large coefficient values to occasionally pop up. The mixture weights would need to be estimated, however, and Blundell et al. simply leave that up to your favorite choice of hyperparameter tuning.</p>
<p>Finally, the log-likelihood $p(\mathcal{D}\vert\theta)$ is straightforward to understand - it’s the error resulting from the mismatch between predictions $\hat{x}$ and true values $x$. In the case of a Gaussian likelihood, we get square error loss and for a Laplace likelihood, we recover absolute error loss.</p>
<p>For all of the terms in $\partial f/\partial \theta$, the gradients come down to gradients of quadratic forms of some type and under the right prior assumptions can even be done analytically.</p>
<p>Back in equation (12), the next term \(\partial \theta/\partial\mu\) is just $1$ since \(\theta=t(\phi,\epsilon)=t(\mu,\rho,\epsilon)=\mu + \sigma\cdot\epsilon\) is linear in $\mu$ with unit coefficient. Then, the final term \(\frac{\partial f}{\partial \mu}\) is much like the first term, except the parts that don’t depend on $\mu$ will drop out. Again, I want to stress that none of these calculations need to be done by hand - autodiff software like Torch or Tensorflow will do these automatically. Once $\partial{\mathcal{F}}/\partial \mu$ is calculated with the above steps, it’s easy to apply stochastic gradient descent with a Monte Carlo estimator of $\partial \mathcal{F}/\partial \mu$ in order to do training. A similar recipe can be followed for the scale parameter $\rho$.</p>
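<p>To make the Monte Carlo estimator of $\partial\mathcal{F}/\partial\mu$ concrete, here is a hand-derived sketch for a toy one-parameter model of my own construction (unit-variance Gaussian prior and likelihood, not an example from the paper). For this model, the $(\theta-\mu)/\sigma^2$ contributions from $\partial f/\partial\theta$ and $\partial f/\partial\mu$ cancel per-sample, and the exact gradient is available for comparison:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=20)     # toy data: x_i ~ N(theta_true, 1)
mu, rho = 0.0, 0.0                    # variational parameters phi
sigma = np.log1p(np.exp(rho))         # softplus scale

def grad_mu_mc(n_samples=200_000):
    """Monte Carlo estimate of dF/dmu = E_eps[df/dtheta * dtheta/dmu + df/dmu]
    for f = log q(theta|phi) - log p(theta) - log p(D|theta), all Gaussian."""
    eps = rng.standard_normal(n_samples)
    theta = mu + sigma * eps                          # t(phi, eps)
    df_dtheta = (-(theta - mu) / sigma**2             # from log q
                 + theta                              # from -log p, N(0,1) prior
                 + (theta[:, None] - x).sum(axis=1))  # from -log likelihood
    df_dmu = (theta - mu) / sigma**2                  # direct mu-dependence of log q
    return np.mean(df_dtheta * 1.0 + df_dmu)          # dtheta/dmu = 1

# for this toy model the gradient reduces to (1 + N) * mu - sum(x_i)
exact = (1 + len(x)) * mu - x.sum()
```

<p>In practice an autodiff framework computes all of these partials for you; the point of the sketch is only that averaging the pathwise gradient over draws of $\epsilon$ recovers the true gradient.</p>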
<p>Treating the network weights $\theta$ as the random variable is orthogonal in some sense to the methods which treat the latent variable $z$ as the random quantity to be optimized over. Including both sources of uncertainty could be a promising line of future research.</p>

<h1 id="an-elbo-timeline">An ELBO Timeline</h1>
<p><em>Christopher Krapu, 2020-02-07, <a href="https://ckrapu.github.io/a-timeline-of-the-variational-expected-lower-bound">https://ckrapu.github.io/a-timeline-of-the-variational-expected-lower-bound</a></em></p>
<p>In Bayesian machine learning, deep generative models are providing exciting ways to extend our understanding of optimization and flexible parametric forms to more conventional statistical problems while simultaneously lending insight from probabilistic modeling to AI / ML. This is an exciting time to be studying the topic as it is blending results from probability theory, statistical physics, deep learning and information theory in sometimes surprising ways. This post is a short summary of some of the major work on the subject and serves as an annotated bibliography of its most important developments. It also uses common notation to help smooth over some of the differences in detail between papers.</p>
<p>An initial good resource for a high-level overview of the problem is given in <a href="https://arxiv.org/abs/1601.00670">Variational Inference: A Review for Statisticians</a> by David Blei et al. (2017). This paper gives a Bayesian statistical perspective on variational methods which are designed around manipulating the <em>evidence lower bound</em> (ELBO). From a statistical physics point of view, the negative ELBO is a variational free energy which upper-bounds the system’s true free energy.</p>
<h2 id="the-setup--approximate-posterior-inference">The setup: approximate posterior inference</h2>
<p>Blei et al. 2017 starts with a very general statement about Bayesian modeling: if we have some known observational data $\boldsymbol{x}=x_1,…,x_N$ and a model with parameters or latent variables $\boldsymbol{z}=z_1,…,z_N$, then our model is specified by a joint probability distribution $p(\boldsymbol{x},\boldsymbol{z})$. The $\boldsymbol{z}$ values could be latent (also described as <em>local</em>) parameters such as the factor scores in factor analysis or they could be global parameters like the coefficients in linear regression. After setting up this model, one of the main tasks is usually to conduct inference and obtain a posterior distribution $p(\boldsymbol{z}\vert \boldsymbol{x})=p(\boldsymbol{x}\vert\boldsymbol{z})p(\boldsymbol{z})/p(\boldsymbol{x})$ where the intractable integral $p(\boldsymbol{x})=\int_z p(\boldsymbol{x}\vert \boldsymbol{z})p(\boldsymbol{z})dz$ prevents straightforward computation of $p(\boldsymbol{z}\vert \boldsymbol{x})$. Markov chain Monte Carlo computes approximate estimates of exactly $p(\boldsymbol{z}\vert \boldsymbol{x})$, while the variational Bayes strategy is to get exact solutions to an approximate distribution $q(\boldsymbol{z}\vert \boldsymbol{x})$ where $q$ is chosen from some class of distributions that have nice properties for optimization. We typically choose $q$ from a family of parametric densities indexed by parameters $\boldsymbol{\phi}$. Then, the variational objective is to solve the following problem in terms of a loss function $f$, true posterior $p$ and approximate posterior $q$.</p>
\[q_\phi^*(\boldsymbol{z})=\underset{q_{\phi}}{\mathrm{argmin}} \ f
(q_\boldsymbol{\phi}(\boldsymbol{z}),p(\boldsymbol{z}\vert \boldsymbol{x}))\]
<p>We are free to choose any $f$ that we want, keeping in mind that our choice of $f$ should intuitively encapsulate notions of closeness or fidelity between two distributions $p,q$. Many different methods can be categorized by their choice of $f$ within this framework. For example, using the asymmetric Kullback-Leibler divergence defined as $KL(p\vert \vert q)=E_p [\log p(z)/q(z)]$ yields either variational Bayes or <a href="https://arxiv.org/abs/1412.4869">expectation propagation</a> depending upon whether $KL(q\vert\vert p)$ or $KL(p\vert\vert q)$ is used.</p>
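<p>The direction of the divergence matters because $KL$ is asymmetric. A quick closed-form check for univariate Gaussians (the helper name is mine):</p>

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

forward = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(p || q)
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(q || p)
# forward != reverse: minimizing one or the other yields different
# approximations (mode-seeking vs. mass-covering behavior).
```
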
<p>We can also frame variational inference in the context of the evidence $p(\boldsymbol{x})$ also referred to as the <em>marginal likelihood</em>. Without loss of generality, we can also assume that the true generative model $p_{\theta}$ of the data has some parameters $\theta$. In this post, I’ll be extremely detailed with the derivations so that they are easy to follow.</p>
\[\begin{align}
\underbrace{\log p_{\boldsymbol{\theta}}(\boldsymbol{x})}_{\text{Log evidence}}&= \log \int_z p_{\boldsymbol{\theta}}(\boldsymbol{z,x}) dz\\
&= \log \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})} dz\\
&= \log \int_z q_{\boldsymbol{\phi}}(\boldsymbol{z}\vert \boldsymbol{x})\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})p_{\boldsymbol{\theta}}(\boldsymbol{x})}{q_{\boldsymbol{\phi}}(\boldsymbol{z}\vert \boldsymbol{x})} dz\\
&\ge \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})p_{\boldsymbol{\theta}}(\boldsymbol{x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\right] dz = ELBO(q_{\boldsymbol{\phi}})\\
\end{align}\]
<p>We used Bayes’ Rule in the third line and Jensen’s inequality in the last step, leading us to the form of the lower bound on the model evidence shown above. With a few more manipulations we get:</p>
\[\begin{align}
ELBO(q_\boldsymbol{\phi}) &= \int_z q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\log \left[\frac{p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\right] dz\\
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log {p_{\boldsymbol{\theta}}(\boldsymbol{z,x})}-\log q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\right] \\
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log {p_{\boldsymbol{\theta}}(\boldsymbol{z\vert x})}-\log q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})+\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \\
&= -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x}))+ E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \\
&= -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x}))+ \log p_{\boldsymbol{\theta}}(\boldsymbol{x})
\end{align}\]
<p>The form shown in (11) is informative - remember that the marginal likelihood $p(\boldsymbol{x})$ is not a function of $\boldsymbol{z}$. If we think of the log marginal likelihood as fixed, then \(\log p(\boldsymbol{x})= KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})\vert \vert p_{\boldsymbol{\theta}}(\boldsymbol{z}\vert \boldsymbol{x})) + ELBO(q_\boldsymbol{\phi})\) so that increasing the KL-divergence must decrease the ELBO and vice versa. For the rest of this post, I’ll be reviewing papers that either dissect the ELBO into different representational forms or tweak prior assumptions to squeeze more performance out of models trained with variational Bayes.</p>
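<p>This identity is easy to verify numerically in a toy conjugate model of my own construction, where the evidence, posterior, ELBO, and KL divergence all have closed forms:</p>

```python
import numpy as np

# Toy conjugate model: theta ~ N(0, 1), x | theta ~ N(theta, 1), one datum x.
# Closed forms: p(x) = N(x; 0, 2) and p(theta | x) = N(x/2, 1/2).
x, m, v = 1.5, 0.3, 0.8     # q(theta) = N(m, v), a deliberately poor guess

def log_normal(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (y - mean) ** 2 / (2 * var)

# ELBO(q) = E_q[log p(x|theta)] + E_q[log p(theta)] + entropy of q
elbo = (-0.5 * np.log(2 * np.pi) - ((x - m) ** 2 + v) / 2
        - 0.5 * np.log(2 * np.pi) - (m ** 2 + v) / 2
        + 0.5 * np.log(2 * np.pi * np.e * v))

# KL(q || p(theta|x)) between two univariate Gaussians, in closed form
kl = 0.5 * (np.log(0.5 / v) + (v + (m - x / 2) ** 2) / 0.5 - 1)

log_evidence = log_normal(x, 0.0, 2.0)
# The identity log p(x) = ELBO(q) + KL(q || p(theta|x)) holds for any q.
```

<p>No matter how badly $q$ is chosen, the two terms trade off exactly so that their sum equals the fixed log evidence.</p>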
<h2 id="auto-encoding-variational-bayes-kingma-and-welling-2013">Auto-Encoding Variational Bayes: <a href="https://arxiv.org/abs/1312.6114">Kingma and Welling (2013)</a></h2>
<p>This paper ignited an enormous amount of interest from the machine learning community in variational methods because it recast approximate inference in a form that has a straightforward interpretation in the context of auto-encoder models. I won’t go into depth about how those work and will instead focus on the main contribution. We do need to know that within the conceptual framework of Kingma & Welling, we have a latent variable model that maps hidden or latent codes $z$ to observed data points $\boldsymbol{x}$ via a generator model \(p_{\boldsymbol{\theta}}(\boldsymbol{x}\vert \boldsymbol{z})\). They make the assumption that this generator is a neural network parameterized by weights contained within $\boldsymbol{\theta}$.</p>
<p>Starting with equation (8) from the previous section, the authors made the following observation:</p>
\[\begin{align}ELBO(q_\boldsymbol{\phi})
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})]\\
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z}) + \log p_{\boldsymbol{\theta}}(\boldsymbol{z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})]\\
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] + E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[ \log p_{\boldsymbol{\theta}}(\boldsymbol{z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]\\
&= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})]}_{\text{Reconstruction}} -\underbrace{KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z}))}_{\text{Shrinkage}}\\
\end{align}\]
<p>(16) presents a common interpretation of the ELBO in terms of the variational parameters $\boldsymbol{\phi}$ as a tradeoff between maximizing the model likelihood $E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})\right]$ and keeping the learned posterior over $\boldsymbol{z}$ close to a prior distribution $p_{\boldsymbol{\theta}}$. An arbitrary choice of prior which seems to have caught on is to assume that $\boldsymbol{z} \sim N(\boldsymbol{0},\sigma^2 I)$ where $I$ denotes the identity matrix. From a non-Bayesian machine learning perspective, the first term is analogous to reconstruction or denoising error from a normal auto-encoder while the second term is a Bayesian innovation intended to help keep the learned latent space (governed by $\boldsymbol{\phi}$) relatively close to a spherical Gaussian.</p>
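<p>In code, the two terms of this decomposition are typically computed per minibatch. A minimal sketch, assuming a Gaussian decoder with unit variance and the $N(\boldsymbol{0}, I)$ prior mentioned above (function and variable names are my own):</p>

```python
import numpy as np

def elbo_terms(x, x_hat, mu_z, logvar_z):
    """Reconstruction and shrinkage terms of the ELBO for one batch.
    Assumes p(x|z) = N(x_hat, I) and p(z) = N(0, I); q(z|x) is a
    diagonal Gaussian with mean mu_z and log-variance logvar_z."""
    recon = -0.5 * np.sum((x - x_hat) ** 2, axis=-1)  # log p(x|z) up to a constant
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z, axis=-1)
    return recon, kl                                  # ELBO ~ mean(recon - kl)
```

<p>The shrinkage term uses the closed-form KL divergence between a diagonal Gaussian and a standard normal, so no sampling is needed for it.</p>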
<h2 id="elbo-surgery-hoffman-and-johnson-2016">ELBO surgery: <a href="http://approximateinference.org/accepted/HoffmanJohnson2016.pdf">Hoffman and Johnson (2016)</a></h2>
<p>This is one of my favorite papers because it’s a lucid and compact explanation of an interesting phenomenon in deep generative models. It also has two more rearrangements of the ELBO. The first one is nearly the same expression as (12):</p>
\[\begin{align}
ELBO(q_\boldsymbol{\phi})
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})-\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]\\
&= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}\left[\log p_{\boldsymbol{\theta}}(\boldsymbol{x,z})\right]}_{\text{Negative expected energy}} -\underbrace{E_{q_\boldsymbol{\phi}}\left[\log q_\boldsymbol{\phi}(\boldsymbol{z\vert x})\right]}_{\text{Entropy}}\\
\end{align}\]
<p>The term <em>energy</em> here refers to the convention that in statistical mechanics, the Boltzmann distribution is defined by an exponential dependence between energy and probability, i.e. $p(x)\propto e^{-U/kT}$ where $U$ is an energy function and $kT$ is a normalized temperature. This rewriting of the ELBO highlights how it balances likelihood maximization (equivalent to energy minimization) with keeping most of its probability mass from spreading out and thereby boosting the entropy term.</p>
<p>The second form of the ELBO is the key result of this paper and provides a more detailed breakdown than the previous forms. The setup is a little more involved and requires recasting the ELBO as a function dependent upon not just the variational parameters $\phi$ or the generative model parameters $\theta$ but also the identity of the $n$-th data point being analyzed. The main point of this section is that we should think about information being shared between the identity of the data point (as captured by its index $n$) and the latent code $z_n$.</p>
\[\begin{align}
ELBO(q_\boldsymbol{\phi}) &= \underbrace{E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})]}_{\text{Reconstruction}} -\underbrace{KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z}))}_{\text{Shrinkage}}\\
&=E_{q_\phi(z\vert x)}\left[\log p_\theta(\boldsymbol{x}\vert\boldsymbol{z})-\log \frac{q_\phi({\boldsymbol{z} \vert\boldsymbol{x})}}{p_\theta(\boldsymbol{z})}\right]\\
&=E_{q_\phi(z\vert x)}\left[ \log \left( \prod_n p_\theta(x_n\vert z_n)\right)-\log \left(\prod_n \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]\\
&=E_{q_\phi(z\vert x)}\left[\sum_n \log p_\theta(x_n\vert z_n)-\sum_n \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right]\\
&=E_{q_\phi(z\vert x)}\left[\sum_n \left(\log p_\theta(x_n\vert z_n)- \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]\\
&=\int_{z_1}\ldots \int_{z_N}\prod_n q_\phi(z_n\vert x_n) \left[\sum_n \left(\log p_\theta(x_n\vert z_n)- \log \frac{q_\phi({z_n \vert x_n)}}{p_\theta(z_n)}\right)\right]dz_1 \ldots dz_n
\end{align}\]
<p>The latent variables $z_n$ are specific to each data point so $z_i$ is independent of $z_j$ given $x_i$. This allows us to rewrite the above integral as a sum.</p>
\[\begin{align}
ELBO(q_\phi)&=\sum_n \int_{z_n} q_\phi(z_n\vert x_n)\left(\log p_\theta( x_n\vert z_n)- \log \frac{q_\phi(z_n \vert x_n)}{p_\theta( z_n)}\right)dz_n\\
&=\sum_n E_{q_\phi(z_n\vert x_n)}\left[\log p_\theta( x_n\vert z_n)\right]- KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))\\
\end{align}\]
<p>This expression can be seen in several other works as well and usually includes a prefactor of $1/N$, implying that the above equation is the term-by-term average reconstruction error minus a per-data point KL divergence. I am not sure why this is done and it doesn’t appear to be consistent with a physical point of view - the ELBO can be viewed as an upper bound on a total system-wide energy and a system’s total energy is a sum of energy functions across particles rather than an across-particle average. In practice, this factor of $1/N$ is unimportant because $N$ is known ahead of time and the optimization strategies resulting from the ELBO reparameterization are unaffected by it. However, to make these derivations consistent with the literature, I will include it here too.</p>
<p>Integrating results across different work in a common notation can be challenging and here we must be very specific in noting that $z_n$ refers to the latent code for a single data point, $\boldsymbol{z}$ refers to the latent codes for all data points and $z$ refers to a latent code which is not indexed by $n$ but which is conceptually linked to a single data point. This is an important distinction moving forward. We continue by defining priors over $n$ which are the probabilities that a given data point is sampled and fed into the ELBO expression. A natural choice is to simply choose them at random so that $p_{sample} = 1/N$ where $N$ is the number of observations in our dataset. We’ll make the same assumption for the accompanying prior under $q$ so that $p(n)=q(n)=1/N$. We also want to express $q_\phi (z_n\vert x_n)$ in terms of the random variable $n$ and not $x_n$ so we have $q_\phi (z\vert n)\triangleq q_\phi(z\vert x_n)$. This is purely notational - the random variable $n$ should be thought of as synonymous with $x_n$.</p>
\[ELBO(q_\phi)=\frac{1}{N}\sum_n\left( E_{q_\phi(z_n\vert x_n)}\left[\log p_\theta( x_n\vert z_n)\right]- KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))\right)\]
\[\begin{align}
\frac{1}{N}\sum_n KL(q_\phi(z_n\vert x_n)\vert\vert p_\theta(z_n))&=\frac{1}{N}\sum_n KL(q_\phi(z\vert n)\vert\vert p_\theta(z))\\
&=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)} \log \frac{q_\phi(z\vert n)}{p_\theta(z)}\\
&=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)} \log \frac{q_\phi(n\vert z)q_\phi(z)}{p_\theta(z)q_\phi(n)}\\
&=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n\vert z)}{q_\phi(n)}\right]\\
&=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n\vert z)q_\phi(z)}{q_\phi(n)q_\phi(z)}\right]\\
&=\frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(z)}{p_\theta(z)} + \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\
&=KL(q_\phi(z)\vert\vert p_\theta(z)) + \frac{1}{N}\sum_n E_{q_\phi(z\vert n)}\left[ \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\
&=KL(q_\phi(z)\vert\vert p_\theta(z)) + E_{q_\phi(n,z)}\left[ \log \frac{q_\phi(n, z)}{q_\phi(n)q_\phi(z)}\right]\\
&=KL(q_\phi(z)\vert\vert p_\theta(z)) + \mathbb{I}_{q_\phi}(n,z)\\
\end{align}\]
<p>This result rearranges the sum of per-data point KL divergences into an averaged KL divergence and the mutual information $\mathbb{I}$ between the random variables $n$ and $z$. Conceptually, this is a very nice result - it represents the original regularizing term as a divergence between averaged (i.e. non data point specific) prior distributions and information shared across $q_\phi$ between $n$ and $z$. We can start to think about $q_\phi$ as a communication channel which may perfectly communicate the information in the index $n$ to the latent code $z$, i.e. perfect reconstruction, or it may fail to communicate substantial information and thereby the generative model learns to ignore the latent code $z$! We can use these expressions to rewrite the ELBO in a form identical to an equation from the Hoffman and Johnson paper:</p>
\[ELBO(q) =\underbrace{\left[\frac{1}{N}\sum_n E_{q_\phi(z_n\vert x_n)}\left[\log p_\theta(x_n\vert z_n)\right] \right]}_{\text{Expected reconstruction error}} - \underbrace{\mathbb{I}_{q_\phi}(n,z)}_\text{Decoded information} - \underbrace{KL(q_\phi(z)\vert\vert p_\theta(z))}_{\text{Marginal regularizer}}\]
<p>In the above expression, the first term on the right hand side represents how well the generative model can reconstruct the data points $x_n$ using the latent codes. If the values of $\theta$ are chosen poorly and the generative model is insufficient, this term will be relatively low. The next term is the mutual information from before and tells us how well the encoder network $q_\phi$ is transmitting information from the identity of the data point $x_n$ into the latent variable $z_n$. Finally, the last term pushes the <em>average</em> distribution of latent codes $z_n$ to be close to the prior $p_\theta(z)$. For many applications, $p_\theta$ is chosen somewhat arbitrarily to be a diagonal or isotropic Gaussian and this form suggests that we may want to choose more carefully in order to obtain more desired behavior from variational methods.</p>
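<p>The surgery decomposition is easy to check numerically. Below is a small Monte Carlo verification of my own construction (not from the paper) using a one-dimensional Gaussian $q_\phi(z\vert n)$ for each of $N=3$ data points: the average per-data-point KL should equal the marginal KL plus the mutual information.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([-2.0, 0.0, 2.0])    # q(z|n) = N(means[n], s^2) for N = 3 points
s, S = 0.5, 300_000                   # posterior std and number of MC samples

def log_normal(z, m, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (z - m) ** 2 / (2 * sd**2)

n = rng.integers(len(means), size=S)          # n ~ q(n) = 1/N
z = rng.normal(means[n], s)                   # z ~ q(z|n)
log_q_cond = log_normal(z, means[n], s)
log_q_marg = np.log(np.mean(np.exp(log_normal(z[:, None], means, s)), axis=1))

# (1/N) sum_n KL(q(z|n) || N(0,1)), available in closed form here
avg_kl = np.mean(0.5 * (s**2 + means**2 - 1.0 - np.log(s**2)))
marg_kl = np.mean(log_q_marg - log_normal(z, 0.0, 1.0))   # KL(q(z) || p(z))
mutual_info = np.mean(log_q_cond - log_q_marg)            # I(n; z)
# Up to Monte Carlo error, avg_kl == marg_kl + mutual_info.
```

<p>With these well-separated components, the mutual information comes out close to its ceiling of $\log N$: the channel from $n$ to $z$ is nearly lossless.</p>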
<h2 id="beta-vae-higgins-et-al-2017">$\beta$-VAE: <a href="https://openreview.net/pdf?id=Sy2fzU9gl">Higgins et al. (2017)</a></h2>
<p>As researchers began to catch on that the shrinkage term $KL(q_\phi(z\vert x)\vert\vert p_\theta(z))$ may play an important role in favoring certain classes of representations, they developed new modifications to the ELBO to help push the variational objective in different directions.</p>
<p>A conceptually straightforward way to do this is to simply up- or down-weight the shrinkage term in conjunction with the right prior. The intuition behind $\beta$-VAE is that in the $\beta$-modified expression for the ELBO,</p>
\[ELBO(q_\boldsymbol{\phi}) = E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -\ \beta \cdot KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\theta}}(\boldsymbol{z})),\]
<p>the second term can be tuned to push $q_\phi(z\vert x)$ closer to a desired prior structure. The default prior that had been chosen in many studies up to this point is a simple isotropic Gaussian which <em>promotes cross-factor independence</em> and will naturally push towards a <em>disentangled</em> representation in which the different dimensions of $z$ are uncorrelated in the approximate posterior $q_\phi(z\vert x)$. It’s straightforward to show that for two Gaussian distributions $q_\phi, p_\theta$ with diagonal covariances, their KL divergence contains a log-determinant term proportional to</p>
\[\log\frac {\vert\Sigma_{q_\phi}\vert}{\vert\Sigma_{p_\theta}\vert}\propto \log \sigma^2_{q_\phi}-\log \sigma^2_{p_\theta}\]
<p>where $\Sigma_{q_{\phi}}$ and $\Sigma_{p_\theta}$ are the diagonal covariance matrices of $q_\phi$ and $p_\theta$ respectively. As a consequence, we can view an adjustment to $\beta$ as roughly equivalent to tweaking our latent space prior variance. In statistical physics, $\beta$ is a function of the system temperature, so it is unclear to me why a new symbol was introduced when several identical conceptual frameworks already existed that were appropriate for describing this improvement. Perhaps this was indeed the motivation, but the connection was omitted from the text.</p>
<p>Regardless, this led to marked improvements on learning disentangled representations and is such an easy computational tweak that it can be implemented into the vast majority of VI workflows.</p>
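<p>To see the shrinkage effect of $\beta$ concretely, consider a hypothetical one-dimensional problem of my own construction (not from the paper) with $p(x\vert z)=N(z,1)$, prior $N(0,1)$, and $q=N(m,v)$. Minimizing the negative $\beta$-weighted objective over $(m, v)$ gives, by a short calculation, $m = x/(1+\beta)$ and $v = \beta/(1+\beta)$, so larger $\beta$ pulls the posterior mean toward the prior mean. A grid search confirms the closed form:</p>

```python
import numpy as np

def beta_objective(m, v, x, beta):
    """Negative beta-ELBO (up to constants) for q = N(m, v),
    p(x|z) = N(z, 1), prior N(0, 1)."""
    recon = 0.5 * ((x - m) ** 2 + v)                 # E_q[-log p(x|z)] + const
    kl = 0.5 * (v + m**2 - 1.0 - np.log(v))          # KL(q || N(0, 1))
    return recon + beta * kl

x, beta = 2.0, 4.0
m_opt, v_opt = x / (1 + beta), beta / (1 + beta)     # claimed closed-form minimizer

# brute-force check that the closed form really is the minimizer
ms = np.linspace(-1.0, 3.0, 401)
vs = np.linspace(0.05, 2.0, 391)
obj = beta_objective(ms[:, None], vs[None, :], x, beta)
i, j = np.unravel_index(np.argmin(obj), obj.shape)
```
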
<h2 id="empirical-bayes-for-latent-variable-priors-vampprior-tomczak-and-welling-2018">Empirical Bayes for latent variable priors (VampPrior): <a href="https://arxiv.org/pdf/1705.07120.pdf">Tomczak and Welling (2018)</a></h2>
<p>The $\beta$-VAE paper suggested that tweaking the variational objective’s split across reconstruction error and shrinkage could produce better models and also more disentangled representations. Unfortunately, the discussion of choosing $\beta$ wasn’t linked to a specific choice of prior. In my opinion, the most interesting observation from the unbearably-cheesily-named VampPrior paper was that all latent variable priors can conceptually be ordered by the degree to which they depend on observed data. I’ll reproduce some of their arguments here after introducing some extra notation: $p_\lambda(z)$ is a prior over the latent state $z$ and in past paragraphs I lazily referred to this as $p_\theta(z)$ with the understanding that the vector $\boldsymbol{\theta}$ included not just the weights of the decoder network but also the hyperparameters of the latent space prior. I will be more explicit moving forward.</p>
<p>The paper picks right off at a familiar point:
\(\begin{align}
ELBO(q_\boldsymbol{\phi}) &= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -KL(q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x}),p_{\boldsymbol{\lambda}}(\boldsymbol{z}))\\
&= E_{q_\boldsymbol{\phi}(\boldsymbol{z}\vert \boldsymbol{x})}[\log p_{\boldsymbol{\theta}}(\boldsymbol{x\vert z})] -E_{q_\phi}\left[\log q_\phi(\boldsymbol{z}\vert \boldsymbol{x})-\log p_\lambda(\boldsymbol{z}) \right]\\
\end{align}\)
If the goal is to maximize the ELBO, then we could simply drive the second term on the RHS of (39) to zero by setting our prior equal to the learned posterior $q_\phi(\boldsymbol{z}\vert \boldsymbol{x})$ and thereby commit a cardinal sin by snooping on the data. However, this would remove any shrinkage effects and not let the prior do its job by restricting the capacity of the model in an effective way. The other extreme is to choose $p_\lambda$ to be very restrictive and not make use of any of the observed data points $x_n$.</p>
<p>The key insight from the Tomczak and Welling paper is that there is an empirical Bayes (EB) middle ground between these two extremes. We can implement this EB prior by expressing $p_\lambda$ as a weakened version of the variational posterior $q_\phi$ via the usage of $K$ <em>pseudo-inputs</em> $u_1,\ldots,u_K$ in a variational mixture of posteriors (VAMP):</p>
\[p^{VAMP}(z)=\frac{1}{K}\sum_k q_\phi(\boldsymbol{z}\vert u_k)\]
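<p>A sketch of evaluating this mixture prior for a single latent vector, assuming the encoder output at each pseudo-input has already been computed (the names <code class="language-plaintext highlighter-rouge">pseudo_means</code> / <code class="language-plaintext highlighter-rouge">pseudo_logvars</code> and the helper are hypothetical):</p>

```python
import numpy as np

def vamp_log_prior(z, pseudo_means, pseudo_logvars):
    """log p_VAMP(z) = log (1/K) sum_k q(z | u_k), where each component is
    the diagonal Gaussian produced by the encoder at pseudo-input u_k."""
    var = np.exp(pseudo_logvars)                               # (K, D)
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (z - pseudo_means) ** 2 / var, axis=-1)
    m = log_comp.max()                                         # stable log-mean-exp
    return m + np.log(np.mean(np.exp(log_comp - m)))

# density of a K=2 mixture at the origin, with two unit-variance components
lp = vamp_log_prior(np.zeros(2), np.array([[-1.0, 0.0], [1.0, 0.0]]), np.zeros((2, 2)))
```

<p>In a real implementation the pseudo-inputs $u_k$ would be trainable parameters fed through the encoder, so gradients flow into them during training.</p>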
<p>In the limit where $K\approx N$ the prior and posterior are nearly identical, so there is little regularization; when a good value of $K$ is selected, $p^{VAMP}$ is clearly going to be highly multimodal as a mixture distribution while still having less capacity than the full posterior. However, this opens another question regarding how the $u_k$ are selected and generated. In true empirical Bayes fashion, these are treated as additional model parameters amenable to optimization via backprop. The implementation that Tomczak and Welling actually go with for their experiments uses a two-layer hierarchical VAMP prior; I would like to comment on this but there was virtually no motivation or discussion of why this multilayer prior would help.</p>