Dr Samuel Livingstone
Markov chain Monte Carlo \[ \{X_t\}_{t\geq 0}, \qquad \mathbb{E}_\pi[f] \approx \frac{1}{n}\sum_{i=1}^n f(X_i) \]
General Metropolis–Hastings algorithm \[ Y \sim Q(X_i,\cdot), \qquad \alpha(X_i,Y) = \min\left(1, \frac{\pi(Y)q(Y,X_i)}{\pi(X_i)q(X_i,Y)} \right). \]
Random walk proposals \[ Y \sim N(X_i, h^2 I_{d \times d} ), \qquad \alpha(X_i,Y) = \min\left( 1, \frac{\pi(Y)}{\pi(X_i)} \right) \]
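As a concrete reference for what follows, here is a minimal R sketch of the random walk Metropolis algorithm above (the function rwm() and its arguments are illustrative names, not from the lecture):

```r
# Minimal random walk Metropolis sketch: Y ~ N(x, h^2 I), accept with prob min(1, pi(Y)/pi(x)).
# log_pi is the log target density; all names here are illustrative.
rwm <- function(log_pi, x0, h, N) {
  d <- length(x0)
  X <- matrix(NA_real_, nrow = N, ncol = d)
  x <- x0
  lp_x <- log_pi(x)
  for (i in 1:N) {
    y <- x + h * rnorm(d)                 # proposal Y ~ N(x, h^2 I)
    lp_y <- log_pi(y)
    if (log(runif(1)) < lp_y - lp_x) {    # Metropolis accept/reject step
      x <- y
      lp_x <- lp_y
    }
    X[i, ] <- x
  }
  X
}

# Example: standard normal target in d = 2
chain <- rwm(function(x) -0.5 * sum(x^2), x0 = c(0, 0), h = 1, N = 5000)
```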
Recall: Interested in computing \(\mathbb{E}_\pi[f]\)
Approach: construct a Markov chain and compute ergodic averages
Statistically, we are constructing a point estimate for \(\mathbb{E}_\pi[f]\) of the form \[ \frac{1}{N}\sum_{i=1}^N f(x_i), \quad \{x_1,...,x_N\} \text{ is algorithm output} \]
We can therefore study the properties of the estimator \[ \tilde{f}_N := \frac{1}{N}\sum_{i=1}^N f(X_i), \quad X_{i+1} \sim P(X_i,\cdot) \]
Writing \(f_0 := \mathbb{E}_\pi[f]\), we can look at the mean squared error \[ L(\tilde{f}_N,f_0) := \mathbb{E} \left[ (\tilde{f}_N - f_0)^2 \right]. \]
Question. Why have I used the letter \(L\) above? What is the name for this kind of function?
This mean squared error can be broken down using the more familiar quantities of bias and variance, since \[ L(\tilde{f}_N,f_0) = \text{Bias}(\tilde{f}_N)^2 + \text{Var}[\tilde{f}_N]. \]
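A quick empirical check of this decomposition can be done with repeated independent runs. The sketch below is illustrative (it reuses the rwm() sketch above): the target is a one-dimensional standard normal and \(f(x) = x^2\), so \(\mathbb{E}_\pi[f] = 1\).

```r
# Empirical check of the bias/variance decomposition (illustrative).
# mse should be approximately bias_sq + variance.
set.seed(1)
f0  <- 1
est <- replicate(200, {
  chain <- rwm(function(x) -0.5 * sum(x^2), x0 = 5, h = 2.4, N = 2000)
  mean(chain[, 1]^2)
})
c(mse      = mean((est - f0)^2),
  bias_sq  = (mean(est) - f0)^2,
  variance = var(est))
```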
First note that \[ |\text{Bias}(\tilde{f}_N)| = \left|\frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f] \right| \]
By the triangle inequality we can bring the absolute value inside the sum: \[ \left| \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f] \right| \leq \frac{1}{N}\sum_{i=1}^N |\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| \]
Then note that since \(0 \leq f(x) \leq 1\) \[ |\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| \leq \sup_{0 \leq f(x) \leq 1}|\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| = \|P^i(x,\cdot) - \pi\|_{TV} \]
Using \(\|P^i(x,\cdot) - \pi\|_{TV} \leq M(x)\rho^i\) gives \[ \text{Bias}(\tilde{f}_N) \leq \frac{1}{N} \sum_{i=1}^N M(x)\rho^i \leq \frac{M(x)}{N(1-\rho)}. \]
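For completeness, the final inequality uses the geometric series bound (valid for \(0 \leq \rho < 1\)): \[ \frac{1}{N}\sum_{i=1}^N M(x)\rho^i \;\leq\; \frac{M(x)}{N}\sum_{i=1}^\infty \rho^i \;=\; \frac{M(x)}{N}\cdot\frac{\rho}{1-\rho} \;\leq\; \frac{M(x)}{N(1-\rho)}. \]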
The above suggests that \[ \text{Bias}(\tilde{f}_N)^2 \propto \frac{1}{N^2}. \]
Typically the variance will decay at a slower rate of \(\frac{1}{N}\)
So if such a bound can be established, then we can focus on the variance term in the mean squared error expression, for sufficiently large \(N\)
Burn-in (or, less frequently, warm-up) just means discarding the initial portion of the chain
This will reduce the bias of the resulting estimator
We can accordingly adapt our MCMC estimator to be \[ \tilde{f}_N = \frac{1}{N} \sum_{t = m+1}^{N+m} f(X_t), \] where the first \(m\) samples are discarded as burn-in
We can often judge how much to discard through trace plots
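For example, a trace plot and burn-in choice might look like the following sketch (illustrative; reuses the rwm() sketch above, started deliberately far from the mode):

```r
# Trace plot and burn-in (illustrative).
chain <- rwm(function(x) -0.5 * sum(x^2), x0 = c(20, 20), h = 1, N = 5000)
plot(chain[, 1], type = "l", xlab = "iteration", ylab = "first coordinate")  # trace plot

m <- 500                              # burn-in chosen by eye from the trace plot
f_tilde <- mean(chain[-(1:m), 1]^2)   # ergodic average using the remaining samples
```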
Question. What is the problem with using trace plots to decide on burn in in high dimensions?
The variance of \(\tilde{f}_N\) is \[ \text{Var}(\tilde{f}_N) = \text{Var} \left( \frac{1}{N} \sum_{i=1}^N f(X_i) \right) = \frac{1}{N^2}\text{Var}\left( \sum_{i=1}^N f(X_i) \right) \]
Because \(\text{Var}(f) = \text{Cov}(f,f)\), we can write \[ \text{Var}(\tilde{f}_N) = \frac{1}{N^2} \text{Cov}\left(\sum_{i=1}^N f(X_i), \sum_{i=1}^N f(X_i) \right) \]
Recalling that \(\text{Cov}(f_1 + f_2,f_3) = \text{Cov}(f_1,f_3) + \text{Cov}(f_2,f_3)\) (and similarly for 2nd argument) we can write this as \[ \frac{1}{N^2} \left( \sum_{i=1}^N\text{Var}(f(X_i)) + 2 \sum_{i<j} \text{Cov}\left( f(X_i), f(X_j) \right) \right) \]
If \(X_1 \sim \pi\) then we can go even further since marginally \(X_i \sim \pi\) for all \(i\), meaning \[ \text{Var}\left( \tilde{f}_N \right) = \frac{1}{N} \left( \text{Var}(f(X_1)) + 2 \sum_{k=2}^N \left( \frac{N-k+1}{N} \right) \text{Cov}\left( f(X_1), f(X_k) \right) \right) \]
Question. How does this relate to the variance under iid sampling from \(\pi\)?
When \(h\) is too small, nearby points will be similar, meaning \[ \text{Cov}(f(X_1),f(X_k)) \gg 0 \] for many values of \(k\).
When \(h\) is too large and the first \(k\) proposals are all rejected, then \[ \text{Cov}(f(X_1),f(X_k)) = \text{Var}(f(X_1)), \] the maximum possible value
Question. What will happen to \(\text{Var}(\tilde{f}_N)\) in either of these two cases?
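The effect can be seen empirically by comparing autocorrelations across step sizes (illustrative sketch; reuses the rwm() sketch above with a one-dimensional standard normal target):

```r
# Lag-1 autocorrelation of the chain for small, moderate and large step sizes.
set.seed(2)
for (h in c(0.1, 2.4, 50)) {
  x <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = h, N = 5000)[, 1]
  cat("h =", h, " lag-1 autocorrelation:", round(acf(x, plot = FALSE)$acf[2], 3), "\n")
}
```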
Note that for the ordinary Monte Carlo estimator \(\hat{f}_N\), we have \[ \text{Var}(\hat{f}_N) = \frac{\text{Var}_\pi(f(X))}{N} \]
This means \[ \frac{\text{Var}(\tilde{f}_N)}{\text{Var}(\hat{f}_N)} = 1 + 2 \sum_{k=2}^N \left( \frac{N-k+1}{N} \right) \text{Corr}\left( f(X_1), f(X_k) \right). \]
We can simplify further by taking the limit as \(N \to \infty\), giving \[ \lim_{N \to \infty} \frac{\text{Var}(\tilde{f}_N)}{\text{Var}(\hat{f}_N)} = 1 + 2 \sum_{k=2}^\infty \text{Corr}\left( f(X_1), f(X_k) \right). \]
This motivates defining the effective sample size \[ N_{\text{eff}} := \frac{N}{1+2\sum_{k=2}^\infty \text{Corr}(f(X_1),f(X_k))}. \]
(This is how many independent samples from \(\pi\) you would need to get an estimator with the same variance as \(\tilde{f}_N\)).
In practice \(N_{\text{eff}}\) must be estimated from the chain output, using heuristics based on estimated autocorrelations; one simple version is sketched below
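The denominator in the definition above is sometimes called the integrated autocorrelation time. A simple plug-in estimate, truncating the sum at the first non-positive estimated autocorrelation, might look like this (illustrative; one heuristic among several, reusing the rwm() sketch above):

```r
# Plug-in estimate of N_eff (illustrative).
x   <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = 2.4, N = 10000)[, 1]
fx  <- x^2                                           # f(X_i) = X_i^2
rho <- acf(fx, lag.max = 200, plot = FALSE)$acf[-1]  # estimated Corr(f(X_1), f(X_{1+k}))
K   <- which(rho <= 0)[1]                            # first non-positive autocorrelation
if (is.na(K)) K <- length(rho) + 1
n_eff <- length(fx) / (1 + 2 * sum(rho[seq_len(K - 1)]))
n_eff
```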
It is common to report \(N_{\text{eff}}\), or even \(N_{\text{eff}}/\text{second}\) (much fairer)
See the effectiveSize() command in the R package CODA
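A minimal usage example (assuming the coda package is installed, and reusing the rwm() sketch above):

```r
library(coda)                                    # install.packages("coda") if needed
x <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = 2.4, N = 10000)[, 1]
effectiveSize(as.mcmc(x^2))                      # estimated N_eff for f(X_i) = X_i^2
```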
Tomorrow we discuss more advanced algorithms!