## Motivation

Many real world applications involve observations $o$ and hidden (latent) variables $x$. And often time we are interested in two problems.

1. Given the current model, what are the probability of getting some hidden variable given the current observations? That is, what is $p(x|o)$?
2. Does our model match well with the data? That is, is $p(o)$ reasonably high?

The first problem is an inference problem while the second problem is a model training problem. Typically, our model may be parametrized by some parameter, say $\theta$. And in the second problem, more precisely we try to train the model by solving

$\hat{\theta} = \arg\max_\theta p(o|\theta)=\arg\max_\theta \int p(o|x) p(x|\theta) dx$

Of course, if both conditional probability distributions are Gaussian, we probably can find the integral with a close form solution. However, for most distributions, such a solution does not exist. One may resort to numerical integrations. But Monte Carlo methods will often be simpler and more elegant.

Before discussing further on Monte Carlo methods, let’s also elaborate the first inference problem. Often time, we do not have a particular $x$ in mind that we want to compute $p(x|o)$. But instead  we may want to find the most probable $x$ given $o$, i.e., Maximum A Posteriori (MAP). That is

$\hat{x}= \arg\max_x p(x|o,\theta) = \arg\max_x \frac{p(o|x) p(x|\theta)}{p(o|\theta)}=\arg\max_x p(o|x)p(x|\theta)$

For low dimensional or scalar $x$, we may be able to find the optimum $x$ with brute force. For high dimensional $x$, the inference problem above easily become intractable as well.

## Basic Idea of Monte Carlo Methods

Ultimately, almost all Monte Carlo Methods are doing nothing but estimating the expectation of some function $f(x)$ with $X \sim p(x)$ by

$E_{X\sim p(x)}[f(X)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$,

where the above is guaranteed by the law of large number for sufficiently large $N$.

For example, let’s consider the model training problem, we can approximate

$p(o|\theta) = \int_x p(o|x) p(x|\theta) dx = E_{X\sim p(x|\theta)}[p(o|x)]\approx\sum_{i=1}^N 1(o_i = o)$,

where $o_i$ are sampled observation as we draw $x_i$ from $p(x|\theta)$ and $1(o_i=o)$ is an indicator function that is 1 when argument is satisfied $(o_i=o)$ and 0 otherwise.

For the inference problem, while we are not going to compute the maximum a posteriori directly as an expectation, we can draw samples $x_1,x_2,\cdots$ with $x_i \sim p(x|\theta)$ and $o_i \sim p(o_i|x_i)$, and then record the $x_i$ that maximizes $p(o_i|x_i)p(x_i|\theta)$.

## Basic Sampling Methods

The key question left is how we can sample data from a distribution. For some standard distributions like Gaussian distribution, we can draw sample directly with some well-developed packages. However, there is no direct way of drawing sample for many regular distribution, not to mention those do not even have a standard form. There are ways to sample any arbitary distributions, we will mention a couple simple ones here.

### Rejection Sampling

Probably the simplest sampling method is the rejection sampling. Consider drawing sample $X \sim p(x)$. Select any “sampleable” distribution $q(x)$ such that $c q(x) > p(x), \forall x$ with some constant $c$. Now, instead of drawing sample from $p(x)$, we will draw from $q(x)$ instead. However, in order to have the sample appear to have the same distribution of $p(x)$, we will only keep a fraction $\frac{p(x')}{c q(x')}$ of $x'$ whenever $x'$ is drawn and discard the rest. To do this, we simply sample a $Z$ from $[0,1]$ uniformly after each $x$ sampled. And keep the current $x$ only if $Z<\frac{p(x)}{c q(x)}$.

### Importance Sampling

Rejection sampling is generally very inefficient as a large number of samples can be discarded. If we only care about computing the expection of some function, then we may not need to discard any sample but adjust the weight of each sample instead. Note that

$E_{X\sim p(x)}[f(X)]=\int f(x) p(x) dx = \int \left[f(x)\frac{p(x)}{q(x)}\right] q(x) dx= E_{X\sim q(x)}\left[f(X)\frac{p(X)}{q(X)}\right]$.

Thus, we can estimate the expectation by drawing samples from $q(x)$ and compute weighted average of $f(x)$ instead. And the corresponding weights are $\frac{p(x)}{q(x)}$.

Even though we can now utilize all samples, importance sampling has two concerns.

1. Unlike rejection sampling, we do not directly draw samples with the desired distribution. So we can use those samples to do other things rather than computing expectation. For example, we cannot use that to maximize $p(o|x)$ as described in the inference problem. Or at least, we would anticipate bias from our estimate.
2. The variance of the estimate can be highly sensitive with the choice of $q(\cdot)$. In particular, since the variance of the estimate is $\frac{1}{N}E\left[ f(X)\frac{p(X)}{q(X)}\right]$, the variance will be especially large if a more probable region in $p(x)$ is not covered well by $q(x)$, i.e., we have extensive region where $p(x)$ large but $q(x) \approx 0$.

## Markov Chain Monte Carlo

Instead of sampling from $p(x)$ directly, Markov chain Monte Carlo (MCMC) methods try to sample from a Markov chain (essentially a probabilistic state model). Probably two most important properties of the Markov chain are irreducible and aperiodic. We say a Markov chain is irreducible if we can reach any one state to any other state. And we say a Markov chain is periodic if there exist two states such that one state can reach the other state only in a multiple of $n$ steps with $n>1$. If a Markov chain is not periodic, we say it is aperiodic. Under the above two conditions (aperiodic and irreducible), regardless the initial state, the distribution of states will converge to the steady state probability distribution asymptoically as time goes to infinity.

Consider any two connected states with transition probabilities $T(x|x')$ (from $x'$ to $x$) and $T(x'|x)$ (from $x$ to $x'$). We say the detailed balance equations are satisfied if $p(x)T(x'|x) = p(x')T(x|x')$. Note that if detailed balance are satisfied among all states, $p(x)$ has to be the steady state probability of state $x$. Because the probability “flux” going out from $x$ to $x'$ is exactly canceled by the “flux” coming in from $x'$ to $x$ when the chain reaches this equalibrium.

With the disucssion above, we see that one can sample $X \sim p(x)$ if we can create a Markov chain with $p(x)$ designed to be the steady state probability of the chain (by making sure that detailed balance are satisfied). Note that, however, we have to let the chain to run for a while before the probability distribution converges to the steady state probabilty. Thus, samples before the chain reaching equilibrium are discard. This initial step is known as the “burn-in”. Moreover, since adjacent samples drawn from the chain (even after burn-in) are highly correlated, we may also skip every several samples to reduced correlation.

### Metropolis-Hastings

The most well-known MCMC method is the Metropolis-Hastings algorithm. Just as the discussion above, given the distribution $p(x)$, we want to design Markov chain such that $p(x)$ satisfies the detailed balance for all states $x$. Given the current state $x$, the immediate question is to which state $x'$ we should transit to. Assuming that $x\in \mathbb{R}^N$, a simple possibility is just to do a random walk. Essentially perturb $x$ with a zero-mean Gaussian random noise. The problem is that the transition probability $q(x'|x)$ and the “reverse” transition probability $q(x|x')$ most likely will not satisfy the detailed balance equation $p(x)q(x'|x)=p(x')q(x|x')$. Without loss of generality, we may have $p(x) q(x'|x) > p(x')q(x|x')$. To “balance” the equation, we may reject some of the transition to reduce the probability “flow” on the left hand side. More precisely, let’s denote $A(x\rightarrow x')=\min(1,\frac{p(x')q(x|x')}{p(x)q(x'|x)})$. Now, we will randomly reject $A(x\rightarrow x')$ of the transition by drawing a uniform $Z\in [0,1]$ and only allowing transition if $Z.  And similar adjustment is applied to all transitions (including $x'$ to $x$). This way, the transition probability from $x$ reduces to $p(x) \underset{T(x'|x)}{\underbrace{q(x'|x) A(x\rightarrow x')}}= p(x) q(x'|x)\frac{p(x')q(x|x')}{p(x)q(x'|x)}=p(x')q(x|x')=p(x') \underset{T(x|x')}{\underbrace{q(x|x') A(x'\rightarrow x)}}$ and so the detailed balance equation is satisfied.

When the transition probabilities from $x$ to $x'$ and from $x'$ to $x$ are equal (i.e., $q(x|x')=q(x'|x)$), note that $A(x\rightarrow x')$ just simplifies to $\min\left(1,\frac{p(x')}{p(x)}\right)$.

### Gibbs Sampling

When the state $x$ can be split into multiple component, say $p(x)=p(x_1,x_2,\cdots,x_n)$. We can transit one component at a time while keeping the rest fixed. For example, we can transit from $x_1$ to $x'_1$ with probability $p(x'_1|x_2,\cdots,x_n)$. After updating one component, we can update another component in a similar manner until all components are updated. Then, the same procedure can be repeated starting with $x_1$. As we continue to update the state, the sample drawn will again converge to $p(x)$ as the Markov chain reaches equilibrium. This sampling method is known as the Gibbs sampling and can be shown as a special case of Metropolis-Hastings sampling as follows.

Consider the step when transiting from $x_1$ to $x'_1$ while keeping the rest of components fixed. $A(x_1 \rightarrow x’_1)$ as defined earlier is $\frac{p(x_1,x_2,\cdots,x_n)p(x’_1|x_2,\cdots,x_n)}{p(x’_1,x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}=\frac{p(x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)p(x’_1|x_2,\cdots,x_n)}{p(x_2,\cdots,x_n)p(x’_1|x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}=1$. Thus, Gibbs sampling is really Metropolis-Hasting sampling but with all transitions always be accepted.

### Hermiltonian Monte Carlo (HMC)

The trajectory sampled by Metropolis-Hasting method is essentially random walk like a drunk man. But we have the complete information of $p(x)$ and we know how it looks like. Why don’t we leverage the “geometrical” information of $p(x)$?

#### Power of Physics

We definitely can. And recall the Boltzmann distribution $p(x)=\exp(-E(x))$ (ignoring temperature here) from statistical physics and we can model any distribution with a Boltzmann distribution with appropriate energy function $E(x)=-\log(p(x))$. Further, we expect lower energy states are more likely to happen than higher energy states as expected.

If we think of $E(x)$ as the potential energy, a “particle” naturally moves towards lower energy state and the excessive energy will convert to kinetic energy as we learn in Newtonian mechanics. Let’s write the total energy as $H(x,p)=PE(x)+KE(p)$, where the potential energy $PE(x)$ is just $E(x)$ here. (Sorry for the overloading of symbol $p$ here, $p$ is commonly used to represent momentum in physics and so we are sticking to that convention here. Please don’t be confuse with the $p$ in $p(x)$.) And the kinetic energy $KE(p)=\frac{p^2}{2 m}$ with $p$ and $m$ being the momentum and the mass, respectively. The total energy, $H(x,p)$, also known as the Hamiltonian is supposed to be conserved as $x$ and $p$ naturally vary in the phase space $(x,p)$. Therefore, $\frac{\partial H(x,p)}{\partial t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=0$.

As we know from classical mechanics, $\frac{\partial H}{\partial x}=-\frac{d p}{dt}$ and $\frac{\partial H}{\partial p}=\frac{d x}{dt}$. For example, let’s consider an object just moving vertically and $x$ is the height of the object, then $H(x,p)=mgx+\frac{p^2}{2 m}$, where $g$ is the gravitational force constant.  Then $\frac{\partial H}{\partial x}=mg=-F=-\frac{d p}{dt}$ with $F$ being the gravitational force and $\frac{\partial H}{\partial p}=\frac{p}{m}=\frac{dx}{dt}$ as desired. Moreover, the total energy $H(x,p)$ indeed conserves as $p$ and $v$ changes as $\frac{\partial H(x,p)}{\partial t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=-\frac{dp}{dt}\frac{d x}{dt}+\frac{dx}{dt}\frac{d p}{d t}=0$.

#### Back to Monte Carlo

Why are we talking all these? Because we can draw samples following natural physical trajectories more efficiently than random walks as in Metropolis-Hastings. Given the distribution of interest $p(x)$, we again define the Hamiltonian $H(x,p)=E(x)+KE(p)$. Now, instead of trying to draw samples of $x$, we will draw samples of $(x,p)$ instead. And $p(x,p) \propto \exp(-H(x,p))=\exp(-E(x))\exp(KE(p))=p(x)\exp(KE(p))$. So if we marginalize out the momentum $p$, we get back $p(x)$ as desired.

And as we samples from the phase space $(x,p)$, rather than random walking in the phase space, we can just follow the flow according to Hamiltonian mechanics as described earlier. For example, as we  start from $(x,p)$, we may simulate a short time instace $\Delta t$ to $(x+\Delta x,p+\Delta p)$ with $\Delta x = \frac{dx}{dt}\Delta t = \frac{\partial H}{\partial p}\Delta t=\frac{p}{m}\Delta t$ and $\Delta p = \frac{dp}{dt}\Delta t= -\frac{\partial H}{\partial x}\Delta t=-\frac{\partial E}{\partial x}\Delta t$. One can also update $x$ first and then update $p$. The latter has many different forms and sometimes known as the leapfrog integration. As we apply multiple ($L$) leapfrog steps and reach $(x',p')$, we may decide whether to accept the sample as in Metropolis-Hasting by evaluating $A((x,p)\rightarrow (x',p'))=\min\left(1,\frac{exp(-H(x',p'))}{exp(-H(x,p)}\right)$ as the transition probabilities are assumed to be the same for both forward and reverse directions. Moreover, if the leapfrog integration is perfect, $H(x,p)$ is supposed to the same as $H(x',p')$ because of conservation of energy. So we have a high chance of accepting a sample.

### No-U-Turn Sampling (NUTS)

The main challenge of applying HMC is that the algorithm is quite sensitive to the number of leapfrog step $L$ and the step size $\Delta t$ (more commonly known as $\epsilon$). The goal of NUTS is to automatically select the above two parameters. The main idea is to simulate both forward and backward directions until the algorithm detects reaching a “boundary” and needed to go U-turn. By detecting U-turns, the algorithm can adjust the parameters accordingly. The criterion is simply to check if the momentum starts to turn direction, i.e., $(x^+-x^-)\cdot p^+ <0$ or $(x^+ -x^-)\cdot p^- <0$, where $(x^+,p^+)$ and $(x^-,p^-)$ are the two extremes as we go forward and backward in time.

## Reference:

1. https://en.wikipedia.org/wiki/Hamiltonian_Monte_Carlo
2. Mackay, Information Theory, Inference, and Learning Algorithms
3. Pattern Recognition and Machine Learning by Bishop
4. A Conceptual Introduction to Hamiltonian Monte Carlo by Michael Betancourt
5. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo by Matthew D. HoffmanAndrew Gelman