Dr Samuel Livingstone
Recap on Bayesian inference \[ \pi(\theta|y) \propto \pi_0(\theta)f(y|\theta) \]
General state space Markov chains and ergodic averages \[ \begin{aligned} X_{i+1}|X_i \sim &P(X_i,\cdot), ~~~ \lim_{n\to\infty}\| P^n(x,\cdot) - \pi \|_{TV} = 0,\\ &\frac{1}{n}\sum_{i=1}^nf(X_i) \approx \mathbb{E}_\pi[f]. \end{aligned} \]
If a Markov chain \(P\) is \(\pi\)-irreducible, aperiodic and \(\pi\)-invariant, then \(\pi\) is the unique limiting distribution.
\(\pi\)-irreducible: any event \(A\) s.t. \(\pi(A) > 0\) has a chance of being visited, whatever the starting point
Aperiodic means the chain doesn’t cycle through disjoint regions periodically
\(\pi\)-invariant if \(X_i \sim \pi \implies X_{i+1} \sim \pi\)
Key fact: If \(X_n \sim \mu\), then the (marginal) distribution of \(X_{n+1}\) is given by \[ \mathbb{P}(X_{n+1} \in A) = \int P(x,A)\mu(dx) \]
\(\pi\) is invariant for \(P\) if for any \(A \in \mathcal{E}\) \[ \pi(A) = \int P(x,A)\pi(dx) \]
If \(P\) is \(\pi\)-reversible, then it is \(\pi\)-invariant \[ \pi(dx)P(x,dy) = \pi(dy)P(y,dx) \]
Note, however, that Markov chains can be \(\pi\)-invariant without being \(\pi\)-reversible
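To see why reversibility implies invariance, integrate both sides of the detailed balance condition over \(x\) (using the key fact above): \[ \int P(x,A)\pi(dx) = \int\int \mathbb{I}_A(y)\,\pi(dx)P(x,dy) = \int\int \mathbb{I}_A(y)\,\pi(dy)P(y,dx) = \int \mathbb{I}_A(y)\,\pi(dy) = \pi(A), \] since \(\int P(y,dx) = 1\) for every \(y\).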
(Assume \(\pi\) has a density (or mass function in the discrete case))
For \(i \in \{0,...,n-1\}\)

Draw \(Y = X_i + \xi_{i+1}\), where \(\xi_{i+1} \sim N(0,h)\)

Draw \(U \sim \mathcal{U}[0,1]\), compute \[ \alpha(X_i,Y) = \min\left(1, \frac{\pi(Y)}{\pi(X_i)}\right). \]

Set \[ X_{i+1} = \begin{cases} Y \qquad U \leq \alpha(X_i,Y) \\ X_i \qquad \text{otherwise} \end{cases} \]

Output the Markov chain \(\{X_1,...,X_n\}\).
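A minimal R implementation of this algorithm, working on the log scale for numerical stability (the function name `rwm` and its interface are choices made here, not part of the slides):

```r
# Random walk Metropolis. log_pi: log target density (up to an additive
# constant); x0: starting point; n: number of iterations; h: proposal variance.
rwm <- function(log_pi, x0, n, h) {
  d <- length(x0)
  chain <- matrix(NA, nrow = n, ncol = d)
  x <- x0
  accepted <- 0
  for (i in 1:n) {
    y <- x + rnorm(d, mean = 0, sd = sqrt(h))      # propose Y ~ N(X_i, h)
    if (log(runif(1)) <= log_pi(y) - log_pi(x)) {  # accept w.p. min(1, pi(Y)/pi(X_i))
      x <- y
      accepted <- accepted + 1
    }
    chain[i, ] <- x
  }
  list(chain = chain, acc_rate = accepted / n)
}
```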
Question. Why does this work when only \(c\pi\) is known?
A random walk is a very simple Markov chain \[ X_{i+1} = X_i + \xi_{i+1}, \] \(\xi_{i+1} \sim N(0,\sigma^2)\) (for example).
It is well-studied and easy to simulate
The transition density is given by \[ q(x,y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2\sigma^2}(y-x)^2\right\} \]
Note that \(q(x,y) = q(y,x)\)
Questions. 1. How is this Markov chain used in the random walk Metropolis algorithm? 2. What measure is the random walk reversible with respect to?
The acceptance rate has an intuitive interpretation \[ \pi(Y) \geq \pi(X_i) \implies \text{Accept the move!} \]
Uphill moves are always accepted, downhill moves are sometimes rejected
This means that the Markov chain spends more time in regions of higher probability under \(\pi\)
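As a demo, here is the `rwm` sketch above applied to the standard normal target (the seed, step size \(h\) and chain length are illustrative assumptions, so the numbers it prints need not match the outputs quoted later):

```r
set.seed(1)                                    # illustrative seed
log_pi <- function(x) -x^2 / 2                 # standard normal target, up to a constant
out <- rwm(log_pi, x0 = 0, n = 10000, h = 6)   # h is an illustrative choice
cat("The acceptance rate is:", out$acc_rate, "\n")
plot(out$chain[, 1], type = "l")               # trace plot of the chain
```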
Suppose that \(y \neq x\) is proposed and then accepted.
The density associated with this is \[ q(x,y)\min\left( 1, \frac{\pi(y)}{\pi(x)}\right) \] where \(q(x,y)\) is the random walk transition density
The detailed balance equations are \[ \pi(x)q(x,y)\min\left(1,\frac{\pi(y)}{\pi(x)}\right) = \pi(y)q(y,x)\min\left(1,\frac{\pi(x)}{\pi(y)}\right) \]
Here this reduces to \[ \pi(x)\min\left(1,\frac{\pi(y)}{\pi(x)}\right) = \pi(y)\min\left(1, \frac{\pi(x)}{\pi(y)}\right) \]
Both sides are the same and equal to \(\min(\pi(y),\pi(x))\), so detailed balance holds
For example:
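(A line like the following computes it from the demo chain; the object name `out` carries over from the sketch above, and the quoted output below is from the original demo.)

```r
mean(out$chain > -2 & out$chain < 2)  # proportion of chain values in (-2, 2)
```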
## [1] 0.9475
is an approximation for the integral: \[ \int \mathbb{I}_{(-2,2)}(x) \frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx. \]
To approximate \(\mathbb{E}_\pi[X]\) we can compute
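(Again using the chain from the earlier sketch; the quoted output is from the original demo.)

```r
mean(out$chain)  # ergodic average approximating E_pi[X]
```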
## [1] 0.009321799
## The acceptance rate is: 0.4705
Random walk proposals are blind in some sense (they don't use any information about \(\pi\) to construct the candidate move)
So if \(h\) is too large, then it is likely that \(\pi(Y) \ll \pi(X_i)\), meaning \[ \alpha(X_i,Y) = \min\left(1, \frac{\pi(Y)}{\pi(X_i)} \right) \approx 0. \]
This will mean very few proposals are accepted and the chain won’t move much.
## The acceptance rate is: 0.043
If instead \(h\) is too small, then \(\pi(Y) \approx \pi(X_i)\) and \(\alpha(X_i,Y) \approx 1\), so almost every proposal is accepted, but each move is tiny and the chain still explores \(\pi\) slowly.

## The acceptance rate is: 0.9835
What is the difference between the following expressions (\(U\), \(X_i\) and \(Y\) defined as above)? \[ \begin{aligned} (1)& ~ U \leq \min\left( 1, \frac{\pi(Y)}{\pi(X_i)}\right) \\ (2)& ~ U \leq \pi(Y)/\pi(X_i) \\ (3)& ~ \log(U) \leq \log\pi(Y) - \log\pi(X_i) \end{aligned} \]
What is the probability of any of the above expressions (\(U \sim \mathcal{U}[0,1]\), fixed \(X_i\) and \(Y\))? How would you prove it?
Which would we prefer to implement on a computer?
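In exact arithmetic all three events have the same probability, but on a computer \(\pi(Y)\) and \(\pi(X_i)\) can underflow to zero, so form (3) is the safe one to implement. A self-contained sketch (the log density and states here are example values):

```r
log_pi <- function(x) -x^2 / 2                    # example: standard normal log density
x <- 0.5; y <- 1.2                                # example current and proposed states
accept <- log(runif(1)) <= log_pi(y) - log_pi(x)  # form (3): stable under underflow
```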
For \(i \in \{0,...,n-1\}\)
Draw \(Y \sim Q(X_i,\cdot)\)
Draw \(U \sim \mathcal{U}[0,1]\), compute \[ \alpha(X_i,Y) = \min\left(1, \frac{\pi(Y)q(Y,X_i)}{\pi(X_i)q(X_i,Y)}\right). \]
Set \[ X_{i+1} = \begin{cases} Y \qquad U \leq \alpha(X_i,Y) \\ X_i \qquad \text{otherwise} \end{cases} \]
Output the Markov chain \(\{X_1,...,X_n\}\).
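A sketch of the general algorithm in R, mirroring the earlier `rwm` function (the interface, passing a proposal sampler `rq` and log proposal density `log_q`, is an assumption made here):

```r
# Metropolis-Hastings with a general proposal. log_pi: log target density;
# rq(x): draws Y ~ Q(x, .); log_q(x, y): evaluates log q(x, y).
mh <- function(log_pi, rq, log_q, x0, n) {
  chain <- matrix(NA, nrow = n, ncol = length(x0))
  x <- x0
  accepted <- 0
  for (i in 1:n) {
    y <- rq(x)
    log_alpha <- log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
    if (log(runif(1)) <= log_alpha) {  # accept w.p. alpha(X_i, Y)
      x <- y
      accepted <- accepted + 1
    }
    chain[i, ] <- x
  }
  list(chain = chain, acc_rate = accepted / n)
}
```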
Question. How are \(Q\) and \(q\) related?
In the random walk case the acceptance rate has an intuitive interpretation \[ \pi(Y) \geq \pi(X_i) \implies \text{Accept the move!} \]
This is because the proposal is symmetric, meaning \(q(y,x) = q(x,y)\)
In the general case this intuition is lost, but nonetheless choosing a different form of proposal can be advantageous
The following hold for a Markov chain produced using the Metropolis-Hastings algorithm: it is \(\pi\)-reversible (and therefore \(\pi\)-invariant), and provided \(q(x,y) > 0\) for all \(x,y \in E\), it is also \(\pi\)-irreducible and aperiodic.
Suppose that \(y \neq x\) is proposed and then accepted. The density associated with this is \[ q(x,y)\alpha(x,y). \]
Using the above we can show \(\pi\)-reversibility: \[ \begin{aligned} \pi(x)q(x,y)\alpha(x,y) &= \pi(x)q(x,y) \min \left(1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right) \\ &= \min\left( \pi(x)q(x,y), \pi(y)q(y,x) \right) \\ &= \pi(y)q(y,x) \min \left(\frac{\pi(x)q(x,y)}{\pi(y)q(y,x)}, 1 \right) \\ &= \pi(y)q(y,x)\alpha(y,x). \end{aligned} \]
Note that the conditions imply \[ q(x,y)\alpha(x,y) > 0 \] for all \(x,y \in E\).
This means that for any \(A \in \mathcal{E}\) \[ \int_A \pi(y)dy > 0 \implies \int_A q(x,y)\alpha(x,y)dy > 0, \] and hence the chain is \(\pi\)-irreducible.
It also implies aperiodicity: taking \(A\) to be any set containing the current state \(x\) shows the chain can remain in the region it currently occupies, so it cannot cycle through disjoint regions periodically
Remark. Conditions are sufficient but not necessary.
\[ \begin{aligned} Y_i &\sim \text{Poisson}(e^{\sum_{j=1}^d \beta_jx_{ij}}), \\ \beta_j &\stackrel{iid}{\sim} N(0, 1^2). \end{aligned} \]
There are many types of data and phenomena for which this kind of probabilistic model is appropriate, so feel free to have whichever example interests you most in mind!
We have imposed \(N(0,1^2)\) independent priors for each \(\beta_j\)
This will shrink the \(\beta_j\)’s towards zero
But is it the right amount of shrinkage?
To interpret the parameters, write \(\mu_i = e^{\sum_j \beta_j x_{ij}}\) for the Poisson mean. If \(x_{ij,\text{new}} = x_{ij,\text{old}}+1\), then \[ \mu_{i,\text{new}} = e^{\beta_j} \times \mu_{i,\text{old}} \]
The prior therefore implies that plausible multiplicative changes lie roughly between \(e^{-2} \approx 0.14\) and \(e^{2} \approx 7.4\) (within two prior standard deviations of zero for \(\beta_j\)).
Whether or not this is sensible of course depends on your particular setting
Common alternative: \(\beta_j \stackrel{iid}{\sim} N(0,\sigma^2)\) with hyperprior on \(\sigma^2\)
\[ \begin{aligned} \pi(\beta|y) &\propto \prod_i \frac{(e^{\sum_j \beta_j x_{ij}})^{y_i} e^{-e^{\sum_j \beta_j x_{ij}}}}{y_i!} \prod_j \frac{1}{\sqrt{2\pi}}e^{ -\beta_j^2/2 } \\ & \propto \exp \left( \sum_{i,j} \beta_j x_{ij}y_i - \sum_i e^{\sum_{j} \beta_j x_{ij} } - \frac{1}{2} \sum_j \beta_j^2 \right). \end{aligned} \]
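This posterior can be sampled by plugging its log density into the `rwm` sketch from earlier. Below is one way this might look; the simulated design matrix `X`, counts `y`, step size `h` and chain length are all placeholders (the data behind the output quoted below are not given in the slides, so these numbers will not reproduce it):

```r
# Log posterior (up to a constant) for Poisson regression with N(0,1) priors.
log_post <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)                 # eta_i = sum_j beta_j x_ij
  sum(y * eta - exp(eta)) - sum(beta^2) / 2    # log-likelihood + log-prior
}

# Placeholder data, for illustration only.
set.seed(2)
X <- cbind(1, rnorm(100), rnorm(100))
y <- rpois(100, exp(as.vector(X %*% c(0.5, -0.2, 1))))

out <- rwm(function(b) log_post(b, X, y), x0 = rep(0, 3), n = 10000, h = 0.05)
cat("The acceptance rate is:", out$acc_rate, "\n")
colMeans(out$chain)  # posterior mean estimates of each beta_j
```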
## The acceptance rate is: 0.2546
## [1] 0.3944325 -0.2350571 1.1027465