MCMC 1: Metropolis–Hastings and Peskun–Tierney orderings
UCL Probability reading group
Samuel Livingstone
Goals
Introduce MCMC and Metropolis–Hastings in a probabilistic way
Give a guided tour through “A note on Metropolis–Hastings kernels for general state spaces” by Luke Tierney (1998).
Peskun’s theorem (Peskun 1973, Tierney 1998)
Some extensions (if time)
The integration problem
Compute \[
\mathbb{E}_\pi[f] := \int f(x)\pi(dx),
\] where \(\pi\) is a probability measure defined on some suitable space \((\mathcal{X},\mathcal{B})\) (e.g. \(\mathcal{X} \subset \mathbb{R}^d\), \(\mathbb{Z}^d\) etc.)
\(\mathcal{X}\) might be high-dimensional
Only \(c\pi\) available for some \(c \in(0,\infty)\)
The sampling problem can also be considered: draw \(X \sim \pi\)
Two most famous applications
- Molecular dynamics simulation: \(U\) is constructed from interactions between particles (bond potential, Lennard–Jones etc.), and \(\pi\) is the Boltzmann–Gibbs distribution \[
\pi(dx) \propto e^{-U(x)}dx
\]
- Bayesian computation: \(\pi\) is the posterior distribution given some prior \(\pi_0\), data \(y\) and likelihood \(L(x;y) := f(y|x)\) \[
\pi(x)dx \propto f(y|x)\pi_0(x)dx
\]
Example (Logistic regression)
- Data \(y_i \in \{0,1\}\) for \(i \in \{1,...,N\}\) (response), \(z_i \in \mathbb{R}^p\) (covariates), parameter \(\beta \in \mathbb{R}^p\) (\(\beta = x\))
- Likelihood is just Bernoulli \[
f(y|x) = \prod_{i=1}^N p(z_i,\beta)^{y_i} (1-p(z_i,\beta))^{1-y_i},
\] where \(p(z_i,\beta) := \frac{e^{z_i^T \beta}}{1+e^{z_i^T \beta}}\).
- Often \(\pi_0(d\beta)\) chosen to be \(N(0,\Sigma_0)\).
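A minimal sketch of how this target is evaluated in practice (my own illustration with simulated data; the dimensions, prior covariance and variable names are arbitrary choices, not from the slides). Only the unnormalised log-posterior is needed, which is exactly the \(c\pi\) setting above.

```r
## Unnormalised log-posterior for Bayesian logistic regression:
## log pi(beta | y) = log f(y | beta) + log pi_0(beta) + const.
set.seed(1)
N <- 100; p <- 3
z <- matrix(rnorm(N * p), N, p)                          # covariates z_i
beta_true <- c(1, -0.5, 0.25)
y <- rbinom(N, 1, plogis(as.vector(z %*% beta_true)))    # Bernoulli responses
Sigma0 <- diag(10, p)                                    # prior covariance (illustrative)

log_post <- function(beta) {
  eta <- as.vector(z %*% beta)
  loglik <- sum(y * eta - log1p(exp(eta)))               # Bernoulli log-likelihood
  logprior <- -0.5 * sum(beta * solve(Sigma0, beta))     # N(0, Sigma0) prior, up to a constant
  loglik + logprior
}
log_post(rep(0, p))                                      # evaluate at beta = 0
```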
The MCMC solution
Construct a \(\pi\)-invariant Markov transition kernel \(P\)
Simulate a chain \(\{X_1,X_2,...,X_n\}\)
Compute \[
\hat{f}_n := \frac{1}{n}\sum_{i=1}^n f(X_i)
\]
By the ergodic theorem (under suitable assumptions on \(P\)) \[
\hat{f}_n \xrightarrow{a.s.} \mathbb{E}_\pi[f]
\]
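A tiny hedged illustration of the estimator itself (my example, not from the talk): iid \(N(0,1)\) draws stand in for the chain and \(f(x) = x^2\), so \(\mathbb{E}_\pi[f] = 1\).

```r
## The ergodic average f_hat_n for f(x) = x^2, with iid N(0,1) draws
## standing in for the chain; E_pi[f] = 1.
set.seed(2)
X <- rnorm(1e5)
f_hat <- cumsum(X^2) / seq_along(X)   # running averages f_hat_1, ..., f_hat_n
tail(f_hat, 1)                        # close to 1 for large n
```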
MCMC: first questions
Can such a \(\pi\)-invariant \(P\) be constructed using only \(c\pi\)? And how should it be chosen? The answer to the first question is yes, and in fact it is not difficult. The second is more interesting…
Metropolis–Hastings kernel
- Construct a \(\pi\)-invariant kernel using an accept-reject approach \[
P(x,dx') = \alpha(x,x')Q(x,dx') + \left( \int (1-\alpha(x,y))Q(x,dy) \right)\delta_x(dx').
\]
- \(Q:\mathcal{X} \times \mathcal{B} \to [0,1]\) is the candidate kernel
- \(\alpha : \mathcal{X} \times \mathcal{X} \to [0,1]\) is the acceptance probability
- If the candidate move is rejected then the chain stays put
MH algorithm
Given \(X_1\), for each \(i \in \{1,...,n-1\}\)
- Draw \(Y \sim Q(X_i,\cdot)\)
- Set \[
X_{i+1} =
\begin{cases}
Y \quad \text{w.p. } \alpha(X_i,Y) \\
X_i \quad \text{otherwise.}
\end{cases}
\]
Return \(\{X_1,...,X_n\}\)
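A hedged R sketch of the algorithm above (my code, not from the talk); `log_target`, `rprop` and `dprop` are placeholder names for the user-supplied target and candidate kernel, and `alpha` is the acceptance function discussed later.

```r
## Generic Metropolis-Hastings sampler; log_target, rprop and dprop are
## user-supplied: log pi (up to a constant), a draw from Q(x, .), and the
## log candidate density log q(x, x') respectively. alpha maps r to [0, 1].
mh <- function(log_target, rprop, dprop, x1, n,
               alpha = function(r) pmin(1, r)) {
  X <- matrix(NA_real_, n, length(x1))
  X[1, ] <- x1
  for (i in 1:(n - 1)) {
    x <- X[i, ]
    y <- rprop(x)
    r <- exp(log_target(y) - log_target(x) + dprop(y, x) - dprop(x, y))
    if (runif(1) < alpha(r)) X[i + 1, ] <- y   # accept the candidate
    else X[i + 1, ] <- x                       # reject: the chain stays put
  }
  X
}
```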
MH: \(\pi\)-reversibility
- If \[
\pi(dx)Q(x,dx')\alpha(x,x') = \pi(dx')Q(x',dx)\alpha(x',x),
\] then \(P\) will be \(\pi\)-reversible, and hence \(\pi\)-invariant.
- Define the Radon–Nikodym derivative \[
r(x,x') := \frac{\pi(dx')Q(x',dx)}{\pi(dx)Q(x,dx')}
\] (Tierney shows that this is always well-defined on a suitable subset).
- If \(\alpha(x,x') := \alpha(r)\) the above can be written \[
\alpha(r) = r\alpha(1/r).
\]
Why this choice?
- Assume (for example) a common dominating measure for \(\pi(dx):=\pi(x)m(dx)\) and each \(Q(x,dx') := q(x,x')m(dx')\). Then \[
r = \frac{\pi(x')q(x',x)}{\pi(x)q(x,x')} = \frac{c\pi(x')q(x',x)}{c\pi(x)q(x,x')}.
\] So this works when only \(c\pi\) is known!
- Examples: \(\alpha(r) := \min(1,r)\) (the Metropolis–Hastings choice) and \(\alpha(r) = r/(1+r)\) (Barker's choice); there are whole families of these. A quick check of the condition for both appears below.
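A quick numerical sanity check (my sketch) that both choices satisfy \(\alpha(r) = r\alpha(1/r)\):

```r
## Check alpha(r) = r * alpha(1/r) numerically for both acceptance functions.
alpha_mh     <- function(r) pmin(1, r)   # Metropolis-Hastings
alpha_barker <- function(r) r / (1 + r)  # Barker
r <- exp(seq(-5, 5, length.out = 101))
stopifnot(isTRUE(all.equal(alpha_mh(r),     r * alpha_mh(1 / r))))
stopifnot(isTRUE(all.equal(alpha_barker(r), r * alpha_barker(1 / r))))
```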
Some key foundational questions
1. What is the best choice of \(\alpha\) (for a fixed \(Q\))?
2. What is the best choice of \(Q\)?
3. What do we mean by best?
Question 1 has a clear answer, 3 is a choice from a few options, and 2 is very much open.
We will discuss 3 and 1 now, and 2 later.
But first, a toy example
- Take \(Q(x,\cdot)\) to be \(N(x,\sigma^2)\) (Gaussian random walk)
- Set \[
\alpha(x,x') := \min(1,r(x,x')).
\] (Note that \(r\alpha(1/r) = r \min(1,1/r) = \min(r,1) = \alpha(r)\), and \(\alpha \in [0,1]\).)
- Set \(\pi\) to be \(N(0,1^2)\)
(Show R simulation.)
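A minimal version of the kind of R simulation referred to here (my sketch, not the original script; the values of \(\sigma\) and \(n\) are arbitrary):

```r
## Random walk Metropolis targeting N(0,1), with N(x, sigma^2) candidates.
set.seed(3)
n <- 1e4; sigma <- 2
x <- numeric(n); accepted <- 0
for (i in 1:(n - 1)) {
  y <- rnorm(1, mean = x[i], sd = sigma)
  log_r <- dnorm(y, log = TRUE) - dnorm(x[i], log = TRUE)  # q is symmetric, so it cancels
  if (log(runif(1)) < log_r) { x[i + 1] <- y; accepted <- accepted + 1 }
  else x[i + 1] <- x[i]
}
accepted / (n - 1)       # acceptance rate
c(mean(x), var(x))       # compare with 0 and 1
plot(x, type = "l")      # trace plot
```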
3. What do we mean by best MCMC method?
- Ideally some finite \(n\) comparison, e.g. \[
\text{MSE}(\hat{f}_n) := \mathbb{E}\left[ \left( \hat{f}_n - \mathbb{E}_\pi[f] \right)^2 \right].
\]
- Proxy 1: asymptotic variance \[
\nu(f,P) := \lim_{n\to\infty} n\text{Var}(\hat{f}_n)
\]
- Proxy 2: mixing properties, e.g. \[
\mathcal{D}(\delta_x P^n,\pi)
\] (where \(\delta_x P^n := P^n(x,\cdot)\) and \(\mathcal{D}\) is a metric on \(\mathcal{P}(\mathcal{X})\)).
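Proxy 1 can be estimated from a single chain; a hedged sketch of a batch-means estimator (the function name and default batch count are my choices, not from the talk):

```r
## Batch-means estimate of nu(f, P) from a single chain output f(X_1), ..., f(X_n).
batch_means_var <- function(fx, nbatch = 50) {
  b <- floor(length(fx) / nbatch)        # batch length
  fx <- fx[1:(b * nbatch)]               # drop the remainder
  bm <- colMeans(matrix(fx, nrow = b))   # one mean per batch
  b * var(bm)                            # estimates lim_n n Var(f_hat_n)
}
```

Applied to the random walk chain above, `batch_means_var(x)` gives a rough estimate of \(\nu(f,P)\) for \(f(x) = x\).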
1’. What is the best choice of \(\alpha\) for a fixed \(Q\) in terms of asymptotic variance?
Peskun orderings
- \(P_1\) dominates \(P_2\) in the Peskun sense if for all \(x \in \mathcal{X}\) and all \(A \in \mathcal{B}\) \[
P_1(x,A\cap\{x\}^c) \geq P_2(x,A\cap\{x\}^c).
\]
- A strong requirement, but for a fixed \(Q\), if \[
\alpha_1(x,x') \geq \alpha_2(x,x')
\] for all \((x,x') \in \mathcal{X} \times \mathcal{X}\), then the Metropolis–Hastings kernel using \(\alpha_1\) dominates that using \(\alpha_2\).
Peskun’s theorem
Thm. If \(P_1\) dominates \(P_2\) (and both are \(\pi\)-reversible) then \[
\nu(f,P_1) \leq \nu(f,P_2)
\] for all \(f \in L^2(\pi)\). (A statement about spectral gaps can also be made.)
Peskun (1973) proved this on finite state spaces; Tierney (1998) gave a general state space result
Steps of the proof
- Consider the average kernel \(P(\beta) = \beta P_1 + (1-\beta)P_2\)
- Introduce \(\nu_\lambda(f,P(\beta))\) (regularised version)
- Show that \(\frac{\partial}{\partial \beta} \nu_\lambda(f, P(\beta))\) is of constant sign, for any \(f \in L^2(\pi)\) and any \(\lambda \in [0,1)\)
- Show that \(\nu_\lambda \to \nu\) as \(\lambda \to 1\)
Functional analysis preliminaries
- We can view a \(\pi\)-invariant Markov kernel as an operator \(P: L^2(\pi) \to L^2(\pi)\), defined pointwise via \[
Pf(x) := \int f(y)P(x,dy).
\] Probabilistically, \(Pf(x) = \mathbb{E}[f(X_{i+1})|X_i = x]\)
- Recall the \(L^2(\pi)\) inner product \[
\langle f,g\rangle_\pi := \int f(x)g(x)\pi(dx)
\] WLOG we restrict to \(f \in L^2(\pi)\) with \(\mathbb{E}_\pi[f] = 0\).
- Kernel is \(\pi\)-reversible \(\iff\) operator is self-adjoint, meaning \[
\langle f,P g\rangle_\pi = \langle Pf, g \rangle_\pi
\] (Think \(\pi\)-symmetric matrix).
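On a finite state space everything is concrete: \(P\) is a matrix, \(Pf\) is a matrix-vector product, and self-adjointness is just detailed balance. A small sketch with a toy 3-state chain (my own numbers, purely illustrative):

```r
## Operator view on a 3-state space: Pf(x) = sum_y P(x, y) f(y).
P <- matrix(c(0.5, 0.3, 0.2,
              0.3, 0.4, 0.3,
              0.2, 0.3, 0.5), 3, 3, byrow = TRUE)   # doubly stochastic, so pi is uniform
pivec <- rep(1 / 3, 3)
ip <- function(f, g) sum(f * g * pivec)             # the L^2(pi) inner product
f <- c(-1, 0, 1); g <- c(2, -1, 0)
all.equal(ip(f, as.vector(P %*% g)), ip(as.vector(P %*% f), g))   # self-adjointness
all.equal(diag(pivec) %*% P, t(diag(pivec) %*% P))                # detailed balance
```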
Proof sketch
- The asymptotic variance might not exist!
- If \(P\) is \(\pi\)-reversible then it always exists, but may be \(+\infty\)
- If it exists and is finite then \[
\nu(f,P) = \text{Var}_\pi[f(X_0)] + 2\sum_{j=1}^\infty \text{Cov}_\pi[f(X_0),f(X_j)],
\] with the chain started in stationarity
- Lemma. If it exists and is finite, \(\nu\) can also be written \[
\nu(f,P) = 2\langle f,(I-P)^{-1} f\rangle_\pi - \|f\|^2_\pi.
\]
- So in this case we need only worry about \(\langle f,(I-P(\beta))^{-1} f \rangle_\pi\)
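On a finite state space the lemma can be checked directly against the autocovariance expansion; a self-contained sketch on a toy 2-state reversible chain (my numbers, not from the talk):

```r
## Check nu(f, P) = 2 <f, (I - P)^{-1} f>_pi - ||f||_pi^2 against
## Var_pi[f(X_0)] + 2 sum_j Cov_pi[f(X_0), f(X_j)] for a 2-state reversible chain.
pivec <- c(0.25, 0.75)
P <- matrix(c(0.1, 0.9,
              0.3, 0.7), 2, 2, byrow = TRUE)        # pi-reversible
ip <- function(a, b) sum(a * b * pivec)
f <- c(1, 0); fc <- f - sum(f * pivec)              # centre f so E_pi[f] = 0
Z <- solve(diag(2) - P + outer(rep(1, 2), pivec))   # fundamental matrix (I - P + 1 pi^T)^{-1}
nu_lemma <- 2 * ip(fc, as.vector(Z %*% fc)) - ip(fc, fc)
acov <- sapply(1:100, function(j) {                 # Cov_pi[f(X_0), f(X_j)] = <fc, P^j fc>_pi
  Pj <- diag(2); for (k in 1:j) Pj <- Pj %*% P
  ip(fc, as.vector(Pj %*% fc))
})
nu_series <- ip(fc, fc) + 2 * sum(acov)
all.equal(nu_lemma, nu_series)                      # agree up to truncation
```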
Proof sketch
- For \(\lambda \in [0,1)\) define \[
\nu_\lambda(f,P) := 2\langle f,(I-\lambda P)^{-1} f\rangle_\pi - \|f\|^2_\pi
\]
- This always exists and is finite, because the Neumann series \[
(I-\lambda P)^{-1} = \sum_{k=0}^\infty \lambda^k P^k
\] converges in operator norm when \(\lambda < 1\) (since \(\|P\| \leq 1\)).
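A hedged check of the regularisation on a toy 2-state reversible chain of my own choosing: the truncated Neumann series gives \(\nu_\lambda\), which approaches \(\nu(f,P)\) as \(\lambda \to 1\).

```r
## nu_lambda(f, P) = 2 <f, (I - lambda P)^{-1} f>_pi - ||f||_pi^2, computed via
## the truncated Neumann series sum_k lambda^k P^k f, on a toy 2-state chain.
pivec <- c(0.75, 0.25)
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), 2, 2, byrow = TRUE)        # pi-reversible
ip <- function(a, b) sum(a * b * pivec)
f <- c(1, 0); fc <- f - sum(f * pivec)              # centred f
nu_lambda <- function(lam, K = 500) {
  g <- fc; Pkf <- fc
  for (k in 1:K) { Pkf <- as.vector(P %*% Pkf); g <- g + lam^k * Pkf }
  2 * ip(fc, g) - ip(fc, fc)
}
sapply(c(0.9, 0.99, 0.999), nu_lambda)              # approaches nu(f, P) = 0.75 here
```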
Proof sketch
- The (right-) derivative is given by \[
\frac{\partial}{\partial \beta} \nu_\lambda(f,P(\beta)) = 2\lambda\langle f,(I-\lambda P(\beta))^{-1}(P_1 - P_2)(I-\lambda P(\beta))^{-1}f\rangle_\pi.
\]
- Because \((I-\lambda P(\beta))^{-1}\) is self-adjoint \[
\frac{\partial}{\partial \beta} \nu_\lambda(f,P(\beta)) = 2\lambda \langle g_\beta, (P_1 - P_2) g_\beta \rangle_\pi
\] where \(g_\beta := (I-\lambda P(\beta))^{-1} f\).
- Because \(P_2 - P_1\) is a positive operator whenever \(P_1\) dominates \(P_2\), this derivative is always \(\leq 0\): moving weight from \(P_2\) onto \(P_1\) can only decrease \(\nu_\lambda\).
Final part
- Tierney shows that when \(P\) is self-adjoint, as \(\lambda \to 1\) \[
\nu_\lambda(f,P(\beta)) \to \nu(f,P(\beta))
\]
- Note \(\nu(f,P)\) can still be \(\infty\)
- Uses the spectral representation \(\langle f, P^n f \rangle_\pi = \int l^n \mathcal{E}_{f,P}(dl)\) to show the limit exists
Key conclusion
- First \(\alpha(r) \leq 1\) by definition
- But also \[
\alpha(r) = r\alpha(1/r) \leq r
\] since \(\alpha(1/r) \leq 1\)
- Combining gives \[
\alpha(r) \leq \min(1,r)
\] for any valid choice of acceptance function. So by Peskun's theorem, \(\alpha(r) = \min(1,r)\) is the optimal choice in terms of asymptotic variance! (A numerical illustration follows below.)
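A hedged numerical illustration of the conclusion (my own construction, not from the talk): on a 5-state target with a uniform candidate kernel, build the MH chain with \(\min(1,r)\) and with Barker's \(r/(1+r)\), and compare asymptotic variances using the lemma from the proof sketch.

```r
## Peskun's theorem in action: on a 5-state target with a uniform candidate
## kernel, the Metropolis choice min(1, r) gives no larger asymptotic variance
## than Barker's r / (1 + r).
piv <- c(0.1, 0.3, 0.2, 0.25, 0.15)                  # arbitrary target
Q <- matrix(1 / 5, 5, 5)                             # uniform candidate kernel
build_P <- function(alpha) {
  r <- outer(piv, piv, function(px, py) py / px)     # here r(x, y) = pi(y) / pi(x)
  P <- Q * alpha(r)                                  # off-diagonal acceptance
  diag(P) <- 0; diag(P) <- 1 - rowSums(P)            # rejected mass stays at x
  P
}
nu <- function(P, f) {                               # lemma: 2<f,(I-P)^{-1}f>_pi - ||f||^2_pi
  fc <- f - sum(f * piv)
  Z <- solve(diag(5) - P + outer(rep(1, 5), piv))    # fundamental matrix
  2 * sum(fc * as.vector(Z %*% fc) * piv) - sum(fc^2 * piv)
}
f <- 1:5
nu(build_P(function(r) pmin(1, r)), f)    # Metropolis-Hastings
nu(build_P(function(r) r / (1 + r)), f)   # Barker: at least as large
```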
Generalisations
- There are other orderings etc.
- Conditions stated in terms of Dirichlet forms are more popular now, as they allow us to move beyond 'accept-reject' processes \[
\mathcal{E}(f,P) := \frac{1}{2} \int (f(y)-f(x))^2 \pi(dx)P(x,dy)
\]
- Relaxations of the form \(P_1(x,A\cap\{x\}^c) \geq \epsilon P_2(x,A\cap\{x\}^c)\) also exist and can be useful (I have used them).
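For a finite chain the Dirichlet form is just a weighted double sum; a small self-contained sketch with a toy reversible chain of my own (for reversible kernels, a larger Dirichlet form for every \(f\) corresponds to a smaller asymptotic variance):

```r
## Dirichlet form E(f, P) = 0.5 * sum_{x,y} (f(y) - f(x))^2 pi(x) P(x, y)
## for a toy pi-reversible 3-state chain.
pivec <- c(0.25, 0.5, 0.25)
P <- matrix(c(0.4, 0.4, 0.2,
              0.2, 0.6, 0.2,
              0.2, 0.4, 0.4), 3, 3, byrow = TRUE)
dirichlet_form <- function(f, P, pivec) {
  sq_diff <- outer(f, f, function(fx, fy) (fy - fx)^2)  # (f(y) - f(x))^2
  0.5 * sum(pivec * P * sq_diff)                        # pivec * P has entries pi(x) P(x, y)
}
dirichlet_form(c(-1, 0, 1), P, pivec)
```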
Some of my work here
- Recently some interesting non-reversible MCMC methods have been introduced
- In fact the operators share a common structure. We introduce the condition \[
\langle f, P g \rangle_\pi = \langle QPQf, g\rangle_\pi
\] and call \(P\) \((\pi,Q)\)-self-adjoint. Here \(Q\) is a special type of operator (isometric involution).
- It turns out that to compare such processes one must consider not \(P\) itself but \[
PQ \quad \text{or} \quad QP
\] We refer to both as the \(Q\)-symmetrisation of \(P\). If \(Qf = f\) then the asymptotic variance is ordered one way but if \(Qf = -f\) then it is the other way.
- Our results also extend to continuous time, where the generator \(L\) must be \(Q\)-symmetrised (this can be applied to e.g. PDMPs). It's very strange, as often \(QL\) does not correspond to a process at all.
- (If time show R simulation.)
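In the spirit of the simulation mentioned here, a hedged sketch of one simple non-reversible scheme, a guided walk in the style of Gustafson (1998), targeting \(N(0,1)\); the details follow my reading of that paper rather than the slides, and the tuning values are arbitrary.

```r
## A guided walk in the style of Gustafson (1998), targeting N(0,1): augment
## the state with a direction p in {-1, +1}, always propose in direction p,
## and flip p on rejection. The resulting chain is invariant but not reversible.
set.seed(4)
n <- 1e4; sigma <- 1
x <- numeric(n); p <- 1
for (i in 1:(n - 1)) {
  y <- x[i] + p * abs(rnorm(1, sd = sigma))                # move in direction p
  log_r <- dnorm(y, log = TRUE) - dnorm(x[i], log = TRUE)
  if (log(runif(1)) < log_r) x[i + 1] <- y                 # accept: keep p
  else { x[i + 1] <- x[i]; p <- -p }                       # reject: stay, flip p
}
c(mean(x), var(x))   # compare with 0 and 1
```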
Thanks and see you next time!
References
Andrieu, C., & Livingstone, S. (2019). Peskun-Tierney ordering for Markov chain and process Monte Carlo: beyond the reversible scenario. arXiv preprint arXiv:1906.06197.
Diaconis, P., Holmes, S., & Neal, R. M. (2000). Analysis of a nonreversible Markov chain sampler. Annals of Applied Probability, 726-752.
Gustafson, P. (1998). A guided walk Metropolis algorithm. Statistics and Computing, 8(4), 357-364.
Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60(3), 607-612.
Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Annals of Applied Probability, 8(1), 1-9.