38  Gibbs Sampling

38.1 Computational Bayes

If we move away from conjugate priors, the posterior distribution typically has no closed-form solution. We can, however, approximate it with computational algorithms. This class of algorithms is known as Markov chain Monte Carlo (MCMC) methods: the idea is to construct a Markov chain that has the desired posterior as its equilibrium distribution.

One strategy to approximate the shape of a distribution is through random sampling. If we can draw random samples from a distribution, then as the sample size grows, the empirical sampling frequencies approach the probability density. Properties of the distribution can likewise be estimated from a finite sample, e.g. \(\frac{1}{n}\sum_{i=1}^{n} x_i\to\mathbb{E}(x)\).
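As a quick illustration (a minimal sketch using NumPy; the standard normal here is just a stand-in for whatever distribution we can sample from), the sample mean of the draws approaches the expectation as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw increasingly large samples from a standard normal (E[x] = 0)
# and watch the sample mean approach the true expectation.
for n in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(n)
    print(n, x.mean())
```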

The problem is how to draw samples from a distribution that we cannot sample from directly, for instance because its joint PDF is unknown or intractable. This seems an impossible task. But what if the conditional PDFs are easier to compute?

38.2 Gibbs Sampler for Linear Regression Models

Consider the example in Chapter 35 with unknown variance. The joint posterior \(p(\beta,\sigma^2 \mid Y)\) is hard to come by, but the conditional distributions are easy to characterize:

\[ \begin{aligned} \beta \mid \boldsymbol Y, \sigma^2 &\sim N(\hat\beta, D_\beta) \\ \sigma^2 \mid \boldsymbol Y, \beta &\sim IG(\bar\nu, \bar S) \end{aligned} \]

Suppose we know how to draw random samples from the normal distribution and the inverse-Gamma distribution; then we can generate samples \(\{(\beta,\sigma^2)\}\) with the algorithm below. (In fact, if the CDF is known, a random sample can be generated by inverse-transform sampling: \(y=\text{CDF}^{-1}(x)\), where \(x \sim U(0,1)\).)
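As a side note, here is a minimal sketch of inverse-transform sampling, using the exponential distribution as an example (an illustrative choice; its inverse CDF is \(-\ln(1-u)/\lambda\)):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                      # rate of an Exponential(lam) target distribution

u = rng.uniform(size=100_000)  # u ~ U(0, 1)
y = -np.log(1.0 - u) / lam     # y = CDF^{-1}(u), so y ~ Exponential(lam)

print(y.mean())                # approximately 1 / lam = 0.5
```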

Algorithm (Gibbs Sampling)

Pick some initial values \(\beta^{(0)}=a_0\) and \(\sigma^{2(0)} = b_0\). Repeat the following steps for \(i=1,\dots,N\):

  1. Draw \(\sigma^{2(i)}\sim p(\sigma^2 | Y, \beta^{(i-1)})\) (Inverse-Gamma);

  2. Draw \(\beta^{(i)}\sim p(\beta| Y, \sigma^{2(i)})\) (Normal).

The most efficient approach is to sample \((\beta,\sigma^2)\) directly and independently from the joint posterior distribution. If this is not feasible, the Gibbs sampler is a reasonable alternative: once the chain has reached its stationary distribution, sampling from the conditional distributions is statistically equivalent to sampling from the joint distribution. However, Gibbs sampling can be highly inefficient if the parameters are highly correlated in the posterior, since each conditional draw then moves the chain only a small distance.
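To make the two-block sampler above concrete, here is a minimal sketch in Python. It assumes the flat prior \(p(\beta,\sigma^2)\propto 1/\sigma^2\) rather than the specific prior of Chapter 35, under which the conditionals reduce to \(\beta \mid Y,\sigma^2 \sim N(\hat\beta_{OLS}, \sigma^2 (X'X)^{-1})\) and \(\sigma^2 \mid Y,\beta \sim IG(n/2, (Y-X\beta)'(Y-X\beta)/2)\); the function name and the simulated data are purely illustrative.

```python
import numpy as np

def gibbs_linear_regression(X, Y, n_draws=5000, seed=0):
    """Two-block Gibbs sampler for linear regression under the flat prior
    p(beta, sigma^2) proportional to 1/sigma^2 (an illustrative assumption,
    not necessarily the Chapter 35 prior)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y       # OLS estimate = conditional posterior mean

    beta = beta_hat.copy()             # initial value beta^(0)
    beta_draws = np.empty((n_draws, k))
    sigma2_draws = np.empty(n_draws)

    for i in range(n_draws):
        # Step 1: draw sigma^2 | Y, beta ~ IG(n/2, SSR(beta)/2),
        # via the reciprocal of a Gamma(n/2, rate = SSR/2) draw.
        resid = Y - X @ beta
        ssr = resid @ resid
        sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / ssr)
        # Step 2: draw beta | Y, sigma^2 ~ N(beta_hat, sigma^2 (X'X)^{-1}).
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)

        sigma2_draws[i] = sigma2
        beta_draws[i] = beta

    return beta_draws, sigma2_draws

# Illustrative usage with simulated data (true beta = [1, 2], true sigma^2 = 0.25).
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
Y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=200)
beta_draws, sigma2_draws = gibbs_linear_regression(X, Y)
print(beta_draws[1000:].mean(axis=0), sigma2_draws[1000:].mean())  # discard burn-in
```

The last line discards the first 1000 draws as burn-in, since the early part of the chain still reflects the arbitrary starting values rather than the stationary distribution.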

38.3 Gibbs Sampler for General Models

Here is the more general Gibbs algorithm. Suppose we have a model with \(k\) parameters \(\boldsymbol\theta=(\theta_1,\theta_2,\dots,\theta_k)\). Assume we can derive the exact expressions for each of the conditional distributions: \(p(\theta_i|\theta_1,\dots,\theta_{i-1},\theta_{i+1},\dots,\theta_k, Y)\). Further assume we can generate independent samples from each of them. Then the Gibbs sampler runs as follows.

Algorithm (Gibbs Sampling)

Initialize the algorithm with \((\theta_1^{(0)},\theta_2^{(0)},\dots,\theta_k^{(0)})\). Then repeat the following steps for \(i=1,2,\dots\):

  1. Choose a random ordering of the parameters, e.g. \((\theta_3, \theta_1,\theta_2, \dots)\), and relabel them according to this ordering as \((\theta_1,\theta_2,\theta_3,\dots)\).

  2. Sample from the conditional distribution of each parameter using the most up-to-date values of the other parameters. That is, draw samples sequentially from the following distributions:

    \[ \begin{aligned} & p(\theta_1^{(i)} | \theta_2^{(i-1)}, \theta_3^{(i-1)},\dots,\theta_k^{(i-1)}, Y) \\ & p(\theta_2^{(i)} | \theta_1^{(i)}, \theta_3^{(i-1)},\dots,\theta_k^{(i-1)}, Y) \\ & p(\theta_3^{(i)} | \theta_1^{(i)}, \theta_2^{(i)},\dots,\theta_k^{(i-1)}, Y) \\ & \vdots \\ & p(\theta_k^{(i)} | \theta_1^{(i)}, \theta_2^{(i)},\dots,\theta_{k-1}^{(i)}, Y) \end{aligned} \]

Repeat the process until the chain has reasonably converged to its stationary distribution.
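Below is a minimal sketch of this general scheme, assuming the user supplies one full-conditional sampler per parameter; the function names and the dictionary-based interface are illustrative choices, not part of the text.

```python
import numpy as np

def gibbs_sweep(theta, conditional_samplers, rng):
    """One iteration of the general Gibbs sampler.

    theta                : dict mapping parameter name -> current value
    conditional_samplers : dict mapping parameter name -> function f(rng, theta)
                           that draws that parameter from its full conditional
                           given the current values of all other parameters
    """
    names = list(theta)
    rng.shuffle(names)                 # step 1: random parameter ordering
    for name in names:                 # step 2: sweep through the full conditionals,
        theta[name] = conditional_samplers[name](rng, theta)  # using the latest values
    return theta

def gibbs_sampler(theta_init, conditional_samplers, n_draws=5000, seed=0):
    """Run n_draws Gibbs sweeps from an initial parameter configuration."""
    rng = np.random.default_rng(seed)
    theta = dict(theta_init)
    draws = []
    for _ in range(n_draws):
        theta = gibbs_sweep(theta, conditional_samplers, rng)
        draws.append(dict(theta))      # record the state after each sweep
    return draws
```

The linear regression sampler of Section 38.2 is the special case with two blocks, \(\theta_1=\beta\) and \(\theta_2=\sigma^2\).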