Markov Chain Monte Carlo (MCMC)

Motivation

While Monte Carlo methods have lots of potential applications, we will just name two here as a motivation:

Data generation. The generated data can be used for simulation or for visualization purposes.
Inference. Infer unknown variables from observations.

For the first application, the use of Monte Carlo is rather straightforward. We will assume the model parameters are known, so we can sample the variables sequentially with Gibbs sampling as described below. If all conditional distributions can be sampled directly (more explanation below), we can use regular Monte Carlo without the need for MCMC.

For the second application, there is an additional layer of complication: for efficiency, we would typically like to sample from the posterior distribution, but often only the likelihood distributions are specified. In that case, we need to sample from a proposal distribution instead, and MCMC is needed.

Basic Idea of Monte Carlo Methods

Ultimately, almost all Monte Carlo Methods are doing nothing but estimating the expectation of some function $f(x)$ with $X \sim p(x)$ by

$E_{X\sim p(x)}[f(X)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i)$ ,

where the approximation is justified by the law of large numbers for sufficiently large $N$ .

For example, in the coin-flipping example in Q1 of HW5, we can estimate the overall probability of heads by conducting the Monte Carlo experiment and counting the number of heads. More precisely, let $i_H(\cdot)$ be an indicator function, where $i_H(y)$ is 1 when $y$ is heads and 0 otherwise. Then

$Pr(Y=H)=E_{Y \sim p(y)}[i_H(Y)] \approx \frac{1}{N} \sum_{i=1}^N i_H(y_i)$ .

The key question here is how we are going to collect samples of $Y$ , $y_1,\cdots,y_N$ . In this coin-flipping setup, the Monte Carlo simulation is trivial. We can conduct the experiment exactly as in a real setup. That is, first draw a coin from the jar; this corresponds to sampling a binary random variable $X$ with the specified probability of getting coin $A$ or $B$ . Then, based on the outcome of $X$ (i.e., which coin was actually picked), we draw another binary random variable $Y$ for each coin flip according to the probability of heads of the chosen coin.
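The two-stage experiment above can be sketched in a few lines of Python. The probabilities used here (`p_coin_a`, `p_head_a`, `p_head_b`) are made-up values for illustration only; they are not the ones from HW5.

```python
import random

def estimate_head_probability(n_samples, p_coin_a=0.5, p_head_a=0.9,
                              p_head_b=0.3, seed=0):
    """Estimate Pr(Y=H) by simulating the two-stage experiment n_samples times."""
    rng = random.Random(seed)
    heads = 0
    for _ in range(n_samples):
        # Step 1: draw a coin from the jar (X = A with probability p_coin_a).
        picked_a = rng.random() < p_coin_a
        # Step 2: flip the chosen coin (Y = H with that coin's head probability).
        p_head = p_head_a if picked_a else p_head_b
        if rng.random() < p_head:
            heads += 1
    # Average of the indicator i_H(y_i) over all samples.
    return heads / n_samples

estimate = estimate_head_probability(100_000)
exact = 0.5 * 0.9 + 0.5 * 0.3  # law of total probability, for comparison
print(f"Monte Carlo estimate: {estimate:.3f}, exact value: {exact:.3f}")
```

With these assumed numbers, the estimate should land close to the exact value, and the gap shrinks as `n_samples` grows.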

In the coin-flipping problem, since we are just drawing binary random variables, we can sample the distribution directly without any issue. But for many continuous distributions, apart from a few special cases, sampling them directly is not possible, and more involved sampling techniques are needed, as described in the following sections.

This sequential sampling, conditioned on previously sampled random variables, is generally known as Gibbs sampling. Strictly speaking, it is a form of MCMC, as we will show later on. But I think it is more appropriate NOT to consider this simple case as MCMC because there is no convergence issue here, which, as you will see, is generally not the case for MCMC.

Basic Sampling Methods

The key question left is how we can sample data from a distribution. For some standard distributions like the Gaussian distribution, we can draw samples directly with well-developed packages. However, there is no direct way of drawing samples for many regular distributions, not to mention those that do not even have a standard form. There are ways to sample from arbitrary distributions, and we will mention a couple of simple ones here.

Rejection Sampling

Probably the simplest sampling method is rejection sampling. Consider drawing a sample $X \sim p(x)$ . Select any “sampleable” distribution $q(x)$ such that $c q(x) \ge p(x), \forall x$ for some constant $c$ . Now, instead of drawing samples from $p(x)$ , we draw from $q(x)$ . However, in order for the samples to appear to follow $p(x)$ , we keep only a fraction $\frac{p(x')}{c q(x')}$ of the draws $x'$ and discard the rest. To do this, we simply sample a $Z$ from $[0,1]$ uniformly after each $x$ is drawn, and keep the current $x$ only if $Z<\frac{p(x)}{c q(x)}$ .

Note that in Q4 of HW5, we are essentially doing some form of rejection sampling: we draw samples from the prior and the likelihood distributions, and then accept only the samples that fit the observations. Rejection sampling is simple, but it can be quite inefficient. For example, if the current estimates of the variables used in sampling are far from the actual values, the sampled outcome most likely will not match the observations well, and we have to reject lots of samples. That’s why other sampling techniques are needed.
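As a minimal sketch of the accept/reject rule, consider a toy target that is not from the homework: $p(x)=6x(1-x)$ on $[0,1]$ , with a uniform proposal $q(x)=1$ and $c=2$ . These choices, and the function names, are assumptions made only for illustration.

```python
import random

def target_pdf(x):
    """Toy target density p(x) = 6 x (1 - x) on [0, 1] (its maximum is 1.5)."""
    return 6.0 * x * (1.0 - x)

def rejection_sample(n_samples, c=2.0, seed=0):
    """Draw n_samples from target_pdf using a Uniform(0, 1) proposal, q(x) = 1."""
    rng = random.Random(seed)
    samples = []
    n_proposed = 0
    while len(samples) < n_samples:
        x = rng.random()  # propose x ~ q
        z = rng.random()  # Z ~ Uniform[0, 1]
        n_proposed += 1
        # Keep x with probability p(x) / (c q(x)); here q(x) = 1.
        if z < target_pdf(x) / c:
            samples.append(x)
    return samples, n_samples / n_proposed

samples, acceptance_rate = rejection_sample(20_000)
mean = sum(samples) / len(samples)
print(f"sample mean: {mean:.3f} (exact mean is 0.5)")
print(f"acceptance rate: {acceptance_rate:.3f} (expected to be about 1/c)")
```

Since both densities integrate to one, the expected acceptance rate is $1/c$ , which makes the inefficiency explicit: a loose bound $c$ means most proposals are thrown away.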

Importance Sampling

Rejection sampling is generally very inefficient, as a large number of samples can be discarded. If we only care about computing the expectation of some function, then we do not need to discard any samples; we can adjust the weight of each sample instead. Note that

$E_{X\sim p(x)}[f(X)]=\int f(x) p(x) dx = \int \left[f(x)\frac{p(x)}{q(x)}\right] q(x) dx= E_{X\sim q(x)}\left[f(X)\frac{p(X)}{q(X)}\right]$ .

Thus, we can estimate the expectation by drawing samples from $q(x)$ and computing a weighted average of $f(x)$ instead, where the corresponding weights are $\frac{p(x)}{q(x)}$ .

Even though we can now utilize all samples, importance sampling has two concerns.

Unlike rejection sampling, we do not directly draw samples with the desired distribution. So we cannot use those samples for anything other than computing expectations.
The variance of the estimate can be highly sensitive to the choice of $q(\cdot)$ . In particular, since the variance of the estimate is $\frac{1}{N}\mathrm{Var}_{X\sim q(x)}\left[ f(X)\frac{p(X)}{q(X)}\right]$ , the variance will be especially large if a high-probability region of $p(x)$ is not covered well by $q(x)$ , i.e., there is an extensive region where $p(x)$ is large but $q(x) \approx 0$ .
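The weighted average can be sketched as follows. As an illustrative assumption, the target $p(x)$ is a standard normal, $f(x)=x^2$ (so the exact expectation is 1), and the proposal $q(x)$ is a wider zero-mean normal with standard deviation `proposal_std`, so that it covers the target well.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n_samples, proposal_std=2.0, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using samples from q = N(0, proposal_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, proposal_std)  # x ~ q
        # Importance weight p(x) / q(x).
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, proposal_std)
        total += f(x) * weight
    return total / n_samples

estimate = importance_estimate(lambda x: x * x, 100_000)
print(f"importance sampling estimate of E[X^2]: {estimate:.3f} (exact value is 1)")
```

Choosing a proposal narrower than the target (for instance `proposal_std=0.5`) would leave the tails of $p(x)$ poorly covered, which is exactly the high-variance situation described in the second concern above.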

Markov Chain Monte Carlo

Instead of sampling from $p(x)$ directly, Markov chain Monte Carlo (MCMC) methods sample from a Markov chain (essentially a probabilistic state model). Probably the two most important properties of a Markov chain are irreducibility and aperiodicity. We say a Markov chain is irreducible if every state can be reached from every other state. We say a Markov chain is periodic if there exist two states such that one can reach the other only in a multiple of $n$ steps with $n>1$ . If a Markov chain is not periodic, we say it is aperiodic. Under these two conditions (aperiodic and irreducible), regardless of the initial state, the distribution of states converges to the steady-state probability distribution asymptotically as time goes to infinity.

Consider any two connected states with transition probabilities $T(x|x')$ (from $x'$ to $x$ ) and $T(x'|x)$ (from $x$ to $x'$ ). We say the detailed balance equations are satisfied if $p(x)T(x'|x) = p(x')T(x|x')$ . Note that if detailed balance is satisfied among all states, $p(x)$ has to be the steady-state probability of state $x$ , because the probability “flux” going out from $x$ to $x'$ is exactly canceled by the “flux” coming in from $x'$ to $x$ when the chain reaches this equilibrium. Formally, summing both sides over $x'$ gives $\sum_{x'} p(x')T(x|x') = p(x)\sum_{x'}T(x'|x) = p(x)$ , so one step of the chain leaves $p$ unchanged.

With the discussion above, we see that one can sample $X \sim p(x)$ if we can create a Markov chain with $p(x)$ designed to be its steady-state probability (by making sure that detailed balance is satisfied). Note, however, that we have to let the chain run for a while before the probability distribution converges to the steady-state probability. Thus, samples drawn before the chain reaches equilibrium are discarded. This initial phase is known as “burn-in”. Moreover, since adjacent samples drawn from the chain (even after burn-in) are highly correlated, we may also keep only every few samples to reduce correlation.
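To make detailed balance and convergence concrete, here is a small sketch with a three-state chain. The target `p` and the transition matrix `T` are hand-picked for illustration (they are assumptions, not part of the notes); `T[i][j]` is the probability of moving from state `i` to state `j`.

```python
# Hand-picked target distribution and transition matrix (illustrative values).
# T[i][j] = Pr(next state = j | current state = i); each row sums to 1.
p = [0.2, 0.3, 0.5]
T = [
    [0.0, 0.5, 0.5],
    [1.0 / 3.0, 1.0 / 6.0, 0.5],
    [0.2, 0.3, 0.5],
]

def satisfies_detailed_balance(p, T, tol=1e-12):
    """Check p[i] * T[i][j] == p[j] * T[j][i] for every pair of states."""
    n = len(p)
    return all(abs(p[i] * T[i][j] - p[j] * T[j][i]) < tol
               for i in range(n) for j in range(n))

def step(dist, T):
    """Propagate a distribution over states through one step of the chain."""
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

def run_chain(dist, T, n_steps):
    """Apply n_steps transitions to an initial distribution."""
    for _ in range(n_steps):
        dist = step(dist, T)
    return dist

print("detailed balance holds:", satisfies_detailed_balance(p, T))
for start in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]):
    final = run_chain(start, T, 50)
    print(start, "->", [round(v, 4) for v in final])
```

Both starting distributions should end up at (approximately) `p`, which is the "regardless of the initial state" property; the first few steps, where the distribution still depends on the start, correspond to burn-in.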

Metropolis-Hastings

The most well-known MCMC method is the Metropolis-Hastings algorithm. As discussed above, given the distribution $p(x)$ , we want to design a Markov chain such that $p(x)$ satisfies detailed balance for all states $x$ . Given the current state $x$ , the immediate question is which state $x'$ we should transit to. Assuming that $x\in \mathbb{R}^N$ , a simple possibility is just to do a random walk, essentially perturbing $x$ with zero-mean Gaussian noise. The problem is that the proposal probability $q(x'|x)$ and the “reverse” proposal probability $q(x|x')$ most likely will not satisfy the detailed balance equation $p(x)q(x'|x)=p(x')q(x|x')$ . Without loss of generality, suppose $p(x) q(x'|x) > p(x')q(x|x')$ . To “balance” the equation, we may reject some of the transitions to reduce the probability “flow” on the left-hand side. More precisely, let’s denote $A(x\rightarrow x')=\min\left(1,\frac{p(x')q(x|x')}{p(x)q(x'|x)}\right)$ . We accept a proposed transition only with probability $A(x\rightarrow x')$ , by drawing a uniform $Z\in [0,1]$ and allowing the transition only if $Z<A(x\rightarrow x')$ ; otherwise the chain stays at $x$ . The same adjustment is applied to all transitions (including $x'$ to $x$ , for which $A(x'\rightarrow x)=1$ under the assumption above). This way, the probability flow from $x$ to $x'$ becomes $p(x) \underset{T(x'|x)}{\underbrace{q(x'|x) A(x\rightarrow x')}}= p(x) q(x'|x)\frac{p(x')q(x|x')}{p(x)q(x'|x)}=p(x')q(x|x')=p(x') \underset{T(x|x')}{\underbrace{q(x|x') A(x'\rightarrow x)}}$ , and so the detailed balance equation is satisfied.

When the transition probabilities from $x$ to $x'$ and from $x'$ to $x$ are equal (i.e., $q(x|x')=q(x'|x)$ ), note that $A(x\rightarrow x')$ just simplifies to $\min\left(1,\frac{p(x')}{p(x)}\right)$ .
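A minimal random-walk sketch of this symmetric case is given below. The target is assumed to be a one-dimensional standard normal, specified only up to a constant through `log_target`; the step size, burn-in length, and starting point are arbitrary illustrative choices.

```python
import math
import random

def log_target(x):
    """Unnormalized log-density of the target; a standard normal is assumed here."""
    return -0.5 * x * x

def metropolis_hastings(n_samples, step_std=1.0, burn_in=1_000, x0=5.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accepted = 0
    for t in range(burn_in + n_samples):
        # Symmetric proposal: q(x'|x) = q(x|x'), so A = min(1, p(x') / p(x)).
        x_new = x + rng.gauss(0.0, step_std)
        log_ratio = log_target(x_new) - log_target(x)
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            x = x_new
            n_accepted += 1
        # A rejected proposal repeats the current state as the next sample.
        if t >= burn_in:
            samples.append(x)
    return samples, n_accepted / (burn_in + n_samples)

samples, acceptance_rate = metropolis_hastings(50_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean: {mean:.3f}, variance: {var:.3f}, acceptance rate: {acceptance_rate:.3f}")
```

Note that only the ratio $p(x')/p(x)$ is needed, so the normalizing constant of $p(x)$ never has to be computed; the sample mean and variance should come out near 0 and 1 for this assumed target.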

Gibbs Sampling

When the state $x$ can be split into multiple components, say $p(x)=p(x_1,x_2,\cdots,x_n)$ , we can transit one component at a time while keeping the rest fixed. For example, we can transit from $x_1$ to $x'_1$ with probability $p(x'_1|x_2,\cdots,x_n)$ . After updating one component, we can update another component in a similar manner until all components are updated. Then, the same procedure can be repeated starting with $x_1$ . As we continue to update the state, the samples drawn will again converge to $p(x)$ as the Markov chain reaches equilibrium. This sampling method is known as Gibbs sampling and can be shown to be a special case of Metropolis-Hastings sampling as follows.

Consider the step of transiting from $x_1$ to $x'_1$ while keeping the rest of the components fixed. The proposal is $q(x'|x)=p(x'_1|x_2,\cdots,x_n)$ , so the ratio inside $A(x_1 \rightarrow x'_1)$ as defined earlier is $\frac{p(x'_1,x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}{p(x_1,x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)}=\frac{p(x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)}{p(x_2,\cdots,x_n)p(x_1|x_2,\cdots,x_n)p(x'_1|x_2,\cdots,x_n)}=1$ , and hence $A(x_1 \rightarrow x'_1)=\min(1,1)=1$ . Thus, Gibbs sampling is really Metropolis-Hastings sampling with every transition accepted.
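As a small sketch of the component-wise updates, assume a two-component target: a zero-mean, unit-variance bivariate normal with correlation `rho`. For this assumed target the conditionals are known in closed form, $x_1|x_2 \sim \mathcal{N}(\rho x_2, 1-\rho^2)$ and likewise for $x_2|x_1$ , so each component can be sampled directly.

```python
import math
import random

def gibbs_bivariate_normal(n_samples, rho=0.8, burn_in=500, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_std = math.sqrt(1.0 - rho * rho)  # std of each conditional distribution
    x1, x2 = 0.0, 0.0
    samples = []
    for t in range(burn_in + n_samples):
        # Update one component at a time, conditioning on the current other one.
        x1 = rng.gauss(rho * x2, cond_std)  # x1 ~ p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_std)  # x2 ~ p(x2 | x1)
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(50_000)
# For zero-mean, unit-variance components, E[x1 * x2] equals the correlation.
corr_estimate = sum(a * b for a, b in samples) / len(samples)
print(f"estimated correlation: {corr_estimate:.3f} (target rho is 0.8)")
```

There is no accept/reject step anywhere in the loop, which mirrors the result above that the acceptance probability is always 1.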

Hamiltonian Monte Carlo (HMC)

The trajectory sampled by the Metropolis-Hastings method is essentially a random walk, like a drunk man. But we have complete information about $p(x)$ and we know what it looks like. Why don’t we leverage the “geometrical” information of $p(x)$ ?

Power of Physics

We definitely can. Recall the Boltzmann distribution $p(x)\propto\exp(-E(x))$ (ignoring temperature here) from statistical physics. We can model any distribution as a Boltzmann distribution with the appropriate energy function $E(x)=-\log(p(x))$ . As expected, lower-energy states are then more likely than higher-energy states.

If we think of $E(x)$ as the potential energy, a “particle” naturally moves towards lower-energy states, and the excess energy is converted to kinetic energy, as we learn in Newtonian mechanics. Let’s write the total energy as $H(x,p)=PE(x)+KE(p)$ , where the potential energy $PE(x)$ is just $E(x)$ here. (Sorry for the overloading of the symbol $p$ here; $p$ is commonly used to represent momentum in physics, so we are sticking to that convention. Please don’t confuse it with the $p$ in $p(x)$ .) The kinetic energy is $KE(p)=\frac{p^2}{2 m}$ , with $p$ and $m$ being the momentum and the mass, respectively. The total energy $H(x,p)$ , also known as the Hamiltonian, is conserved as $x$ and $p$ evolve in the phase space $(x,p)$ . Therefore, $\frac{d H(x,p)}{d t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=0$ .

As we know from classical mechanics, $\frac{\partial H}{\partial x}=-\frac{d p}{dt}$ and $\frac{\partial H}{\partial p}=\frac{d x}{dt}$ . For example, let’s consider an object moving only vertically, with $x$ being the height of the object. Then $H(x,p)=mgx+\frac{p^2}{2 m}$ , where $g$ is the gravitational acceleration. Then $\frac{\partial H}{\partial x}=mg=-F=-\frac{d p}{dt}$ , with $F$ being the gravitational force, and $\frac{\partial H}{\partial p}=\frac{p}{m}=\frac{dx}{dt}$ , as desired. Moreover, the total energy $H(x,p)$ is indeed conserved as $x$ and $p$ change, since $\frac{d H(x,p)}{d t}=\frac{\partial H(x,p)}{\partial x}\frac{d x}{dt}+\frac{\partial H(x,p)}{\partial p}\frac{d p}{d t}=-\frac{dp}{dt}\frac{d x}{dt}+\frac{dx}{dt}\frac{d p}{d t}=0$ .
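The falling-object example can be checked numerically with a short sketch. The mass, initial height, time step, and the simple first-order update scheme below are all assumptions chosen for illustration; with a small time step, the total energy should stay nearly constant while $x$ and $p$ both change.

```python
def hamiltonian(x, p, m=1.0, g=9.8):
    """Total energy H(x, p) = m g x + p^2 / (2 m) of the vertically moving object."""
    return m * g * x + p * p / (2.0 * m)

def simulate(x, p, total_time=1.0, dt=1e-4, m=1.0, g=9.8):
    """Follow Hamilton's equations with small discrete time steps."""
    n_steps = int(round(total_time / dt))
    for _ in range(n_steps):
        p += -m * g * dt   # dp/dt = -dH/dx = -m g
        x += (p / m) * dt  # dx/dt =  dH/dp = p / m
    return x, p

x0, p0 = 10.0, 0.0  # dropped from rest at height 10
x1, p1 = simulate(x0, p0)
h0, h1 = hamiltonian(x0, p0), hamiltonian(x1, p1)
print(f"after 1 s: x = {x1:.4f}, p = {p1:.4f}")
print(f"H before = {h0:.4f}, H after = {h1:.4f}")
```

The small remaining difference between the two energies comes from the finite time step, not from the physics; it shrinks as `dt` is reduced.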

Back to Monte Carlo

Why are we talking about all this? Because we can draw samples more efficiently by following natural physical trajectories than by random walks as in Metropolis-Hastings. Given the distribution of interest $p(x)$ , we again define the Hamiltonian $H(x,p)=E(x)+KE(p)$ . Now, instead of trying to draw samples of $x$ , we draw samples of $(x,p)$ , with $p(x,p) \propto \exp(-H(x,p))=\exp(-E(x))\exp(-KE(p))=p(x)\exp(-KE(p))$ . So if we marginalize out the momentum $p$ , we get back $p(x)$ as desired.

As we sample from the phase space $(x,p)$ , rather than random walking in it, we can just follow the flow given by Hamiltonian mechanics as described earlier. For example, starting from $(x,p)$ , we may simulate a short time interval $\Delta t$ and move to $(x+\Delta x,p+\Delta p)$ with $\Delta x = \frac{dx}{dt}\Delta t = \frac{\partial H}{\partial p}\Delta t=\frac{p}{m}\Delta t$ and $\Delta p = \frac{dp}{dt}\Delta t= -\frac{\partial H}{\partial x}\Delta t=-\frac{\partial E}{\partial x}\Delta t$ . One can also interleave the two updates, for example updating $p$ by half a step, then $x$ by a full step using the new $p$ , and then $p$ by another half step. This interleaved scheme comes in several variants and is known as leapfrog integration. After applying multiple ( $L$ ) leapfrog steps and reaching $(x',p')$ , we decide whether to accept the sample as in Metropolis-Hastings by evaluating $A((x,p)\rightarrow (x',p'))=\min\left(1,\frac{\exp(-H(x',p'))}{\exp(-H(x,p))}\right)$ , since the transition probabilities are assumed to be the same in the forward and reverse directions. Moreover, if the leapfrog integration were perfect, $H(x',p')$ would be the same as $H(x,p)$ because of conservation of energy, so we have a high chance of accepting a sample.
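The procedure can be sketched for a one-dimensional toy problem. The target is assumed to be a standard normal, so $E(x)=x^2/2$ up to a constant; the step size, number of leapfrog steps, and burn-in are arbitrary illustrative values. The sketch also draws a fresh momentum from $\exp(-KE(p))$ at the start of every iteration, which is the usual way to sample the momentum part of $p(x,p)$ and is an assumption beyond what the text spells out.

```python
import math
import random

def energy(x):
    """Potential energy E(x) = -log p(x) up to a constant (standard normal assumed)."""
    return 0.5 * x * x

def grad_energy(x):
    """Derivative dE/dx of the potential energy."""
    return x

def leapfrog(x, p, step_size, n_steps, mass=1.0):
    """Half step in p, alternating full steps in x and p, final half step in p."""
    p -= 0.5 * step_size * grad_energy(x)
    for i in range(n_steps):
        x += step_size * p / mass
        if i < n_steps - 1:
            p -= step_size * grad_energy(x)
    p -= 0.5 * step_size * grad_energy(x)
    return x, p

def hmc(n_samples, step_size=0.2, n_steps=10, mass=1.0, burn_in=200, x0=3.0, seed=0):
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accepted = 0
    for t in range(burn_in + n_samples):
        p = rng.gauss(0.0, math.sqrt(mass))  # momentum ~ exp(-p^2 / (2 m))
        h_old = energy(x) + p * p / (2.0 * mass)
        x_new, p_new = leapfrog(x, p, step_size, n_steps, mass)
        h_new = energy(x_new) + p_new * p_new / (2.0 * mass)
        # Accept with probability min(1, exp(-H(x', p')) / exp(-H(x, p))).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
            n_accepted += 1
        if t >= burn_in:
            samples.append(x)
    return samples, n_accepted / (burn_in + n_samples)

samples, acceptance_rate = hmc(20_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean: {mean:.3f}, variance: {var:.3f}, acceptance rate: {acceptance_rate:.3f}")
```

Because leapfrog nearly conserves $H$ , the acceptance rate should be much higher than for a random-walk proposal that makes moves of comparable size.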

No-U-Turn Sampling (NUTS)

The main challenge of applying HMC is that the algorithm is quite sensitive to the number of leapfrog steps $L$ and the step size $\Delta t$ (more commonly denoted $\epsilon$ ). The goal of NUTS is to select these two parameters automatically. The main idea is to simulate in both the forward and backward directions until the algorithm detects that the trajectory has reached a “boundary” and needs to make a U-turn. By detecting U-turns, the algorithm can adjust the parameters accordingly. The criterion is simply to check whether the momentum starts to turn direction, i.e., $(x^+-x^-)\cdot p^+ <0$ or $(x^+ -x^-)\cdot p^- <0$ , where $(x^+,p^+)$ and $(x^-,p^-)$ are the two extremes as we go forward and backward in time.
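The U-turn criterion itself is easy to write down. The sketch below covers only this check, not the full NUTS algorithm; the function name and the list-based vector representation are choices made here for illustration.

```python
def is_u_turn(x_plus, x_minus, p_plus, p_minus):
    """Return True if either end of the trajectory has started to double back."""
    # Vector from the backward-most point to the forward-most point.
    span = [a - b for a, b in zip(x_plus, x_minus)]
    dot_plus = sum(s * p for s, p in zip(span, p_plus))
    dot_minus = sum(s * p for s, p in zip(span, p_minus))
    return dot_plus < 0 or dot_minus < 0

# Both ends still move apart along the span: no U-turn yet.
print(is_u_turn([1.0, 0.0], [-1.0, 0.0], [0.5, 0.1], [0.5, -0.1]))   # False
# The forward end now moves back towards the backward end: U-turn detected.
print(is_u_turn([1.0, 0.0], [-1.0, 0.0], [-0.5, 0.1], [0.5, -0.1]))  # True
```

Intuitively, a negative dot product means that continuing the simulation at that end would shrink the distance between the two extremes, so further leapfrog steps would be wasted.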