MCMC 1: Metropolis–Hastings and Peskun–Tierney orderings
UCL Probability reading group
Samuel Livingstone
Goals
Introduce MCMC and Metropolis–Hastings in a probabilistic way
Give a guided tour through “A note on Metropolis–Hastings kernels for general state spaces” by Luke Tierney (1998).
Peskun’s theorem (Peskun 1973, Tierney 1998)
Some extensions (if time)
The integration problem
Compute \[
\mathbb{E}_\pi[f] := \int f(x)\pi(dx),
\] where \(\pi\) is a probability measure defined on some suitable space \((\mathcal{X},\mathcal{B})\) (e.g. \(\mathcal{X} \subset \mathbb{R}^d\), \(\mathbb{Z}^d\) etc.)
\(\mathcal{X}\) might be high-dimensional
Only \(c\pi\) available for some \(c \in(0,\infty)\)
The sampling problem can also be considered: draw \(X \sim \pi\)
Two most famous applications
- Molecular dynamics simulation: \(U\) is constructed from interactions between particles (bond potential, Lennard–Jones etc.), and \(\pi\) is the Boltzmann–Gibbs distribution \[
\pi(dx) \propto e^{-U(x)}dx
\]
- Bayesian computation: \(\pi\) is the posterior distribution given some prior \(\pi_0\), data \(y\) and likelihood \(L(x;y) := f(y|x)\) \[
\pi(x)dx \propto f(y|x)\pi_0(x)dx
\]
Example (Logistic regression)
- Data \(y_i \in \{0,1\}\) for \(i \in \{1,...,N\}\) (response), \(z_i \in \mathbb{R}^p\) (covariates), parameter \(\beta \in \mathbb{R}^p\) (\(\beta = x\))
- Likelihood is just Bernoulli \[
f(y|x) = \prod_{i=1}^N p(z_i,\beta)^{y_i} (1-p(z_i,\beta))^{1-y_i},
\] where \(p(z_i,\beta) := \frac{e^{z_i^T \beta}}{1+e^{z_i^T \beta}}\).
- Often \(\pi_0(d\beta)\) chosen to be \(N(0,\Sigma_0)\).
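A minimal sketch of how this target is evaluated in practice (my own illustration with simulated data; the dimensions, prior covariance and variable names are arbitrary choices, not from the slides). Only the unnormalised log-posterior is needed, which is exactly the \(c\pi\) setting above.

```r
## Unnormalised log-posterior for Bayesian logistic regression:
## log pi(beta | y) = log f(y | beta) + log pi_0(beta) + const.
set.seed(1)
N <- 100; p <- 3
z <- matrix(rnorm(N * p), N, p)                          # covariates z_i
beta_true <- c(1, -0.5, 0.25)
y <- rbinom(N, 1, plogis(as.vector(z %*% beta_true)))    # Bernoulli responses
Sigma0 <- diag(10, p)                                    # prior covariance (illustrative)

log_post <- function(beta) {
  eta <- as.vector(z %*% beta)
  loglik <- sum(y * eta - log1p(exp(eta)))               # Bernoulli log-likelihood
  logprior <- -0.5 * sum(beta * solve(Sigma0, beta))     # N(0, Sigma0) prior, up to a constant
  loglik + logprior
}
log_post(rep(0, p))                                      # evaluate at beta = 0
```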
The MCMC solution
Construct a \(\pi\)-invariant Markov transition kernel \(P\)
Simulate a chain \(\{X_1,X_2,...,X_n\}\)
Compute \[
\hat{f}_n := \frac{1}{n}\sum_{i=1}^n f(X_i)
\]
By the ergodic theorem (under suitable assumptions on \(P\)) \[
\hat{f}_n \xrightarrow{a.s.} \mathbb{E}_\pi[f]
\]
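A tiny hedged illustration of the estimator itself (my example, not from the talk): iid \(N(0,1)\) draws stand in for the chain and \(f(x) = x^2\), so \(\mathbb{E}_\pi[f] = 1\).

```r
## The ergodic average f_hat_n for f(x) = x^2, with iid N(0,1) draws
## standing in for the chain; E_pi[f] = 1.
set.seed(2)
X <- rnorm(1e5)
f_hat <- cumsum(X^2) / seq_along(X)   # running averages f_hat_1, ..., f_hat_n
tail(f_hat, 1)                        # close to 1 for large n
```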
MCMC: first questions
Can such a \(\pi\)-invariant \(P\) be constructed using only \(c\pi\)? And how should it be chosen? The answer to the first question is yes, and in fact it is not difficult. The second is more interesting…
Metropolis–Hastings kernel
- Construct a \(\pi\)-invariant kernel using an accept-reject approach \[
P(x,dx') = \alpha(x,x')Q(x,dx') + \left( \int (1-\alpha(x,y))Q(x,dy) \right)\delta_x(dx').
\]
- \(Q:\mathcal{X} \times \mathcal{B} \to [0,1]\) is the candidate kernel
- \(\alpha : \mathcal{X} \times \mathcal{X} \to [0,1]\) is the acceptance probability
- If the candidate move is rejected then the chain stays put
MH algorithm
Given \(X_1\), for each \(i \in \{1,...,n-1\}\)
- Draw \(Y \sim Q(X_i,\cdot)\)
- Set \[
X_{i+1} =
\begin{cases}
Y \quad \text{w.p. } \alpha(X_i,Y) \\
X_i \quad \text{otherwise.}
\end{cases}
\]
Return \(\{X_1,...,X_n\}\)
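A hedged R sketch of the algorithm above (my code, not from the talk); `log_target`, `rprop` and `dprop` are placeholder names for the user-supplied target and candidate kernel, and `alpha` is the acceptance function discussed later.

```r
## Generic Metropolis-Hastings sampler; log_target, rprop and dprop are
## user-supplied: log pi (up to a constant), a draw from Q(x, .), and the
## log candidate density log q(x, x') respectively. alpha maps r to [0, 1].
mh <- function(log_target, rprop, dprop, x1, n,
               alpha = function(r) pmin(1, r)) {
  X <- matrix(NA_real_, n, length(x1))
  X[1, ] <- x1
  for (i in 1:(n - 1)) {
    x <- X[i, ]
    y <- rprop(x)
    r <- exp(log_target(y) - log_target(x) + dprop(y, x) - dprop(x, y))
    if (runif(1) < alpha(r)) X[i + 1, ] <- y   # accept the candidate
    else X[i + 1, ] <- x                       # reject: the chain stays put
  }
  X
}
```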
MH: \(\pi\)-reversibility
- If \[
\pi(dx)Q(x,dx')\alpha(x,x') = \pi(dx')Q(x',dx)\alpha(x',x),
\] then \(P\) will be \(\pi\)-reversible, and hence \(\pi\)-invariant.
- Define the Radon–Nikodym derivative \[
r(x,x') := \frac{\pi(dx')Q(x',dx)}{\pi(dx)Q(x,dx')}
\] (Tierney shows that this is always well-defined on a suitable subset).
- If \(\alpha(x,x') := \alpha(r)\) the above can be written \[
\alpha(r) = r\alpha(1/r).
\]
Why this choice?
- Assume (for example) a common dominating measure for \(\pi(dx):=\pi(x)m(dx)\) and each \(Q(x,dx') := q(x,x')m(dx')\). Then \[
r = \frac{\pi(x')q(x',x)}{\pi(x)q(x,x')} = \frac{c\pi(x')q(x',x)}{c\pi(x)q(x,x')}.
\] So this works when only \(c\pi\) is known!
- Examples: \(\alpha(r) := \min(1,r)\) (the Metropolis–Hastings choice) and \(\alpha(r) = r/(1+r)\) (Barker's choice); there are whole families of these. A quick check of the condition for both appears below.
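A quick numerical sanity check (my sketch) that both choices satisfy \(\alpha(r) = r\alpha(1/r)\):

```r
## Check alpha(r) = r * alpha(1/r) numerically for both acceptance functions.
alpha_mh     <- function(r) pmin(1, r)   # Metropolis-Hastings
alpha_barker <- function(r) r / (1 + r)  # Barker
r <- exp(seq(-5, 5, length.out = 101))
stopifnot(isTRUE(all.equal(alpha_mh(r),     r * alpha_mh(1 / r))))
stopifnot(isTRUE(all.equal(alpha_barker(r), r * alpha_barker(1 / r))))
```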
Some key foundational questions
1. What is the best choice of \(\alpha\) (for a fixed \(Q\))?
2. What is the best choice of \(Q\)?
3. What do we mean by best?
Question 1 has a clear answer, 3 is a choice from a few options, and 2 is very much open.
We will discuss 3 and 1 now, and 2 later.
But first, a toy example
- Take \(Q(x,\cdot)\) to be \(N(x,\sigma^2)\) (Gaussian random walk)
- Set \[
\alpha(x,x') := \min(1,r(x,x')).
\] (Note that \(r\alpha(1/r) = r \min(1,1/r) = \min(r,1) = \alpha(r)\), and \(\alpha \in [0,1]\).)
- Set \(\pi\) to be \(N(0,1^2)\)
(Show R simulation.)
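A minimal version of the kind of R simulation referred to here (my sketch, not the original script; the values of \(\sigma\) and \(n\) are arbitrary):

```r
## Random walk Metropolis targeting N(0,1), with N(x, sigma^2) candidates.
set.seed(3)
n <- 1e4; sigma <- 2
x <- numeric(n); accepted <- 0
for (i in 1:(n - 1)) {
  y <- rnorm(1, mean = x[i], sd = sigma)
  log_r <- dnorm(y, log = TRUE) - dnorm(x[i], log = TRUE)  # q is symmetric, so it cancels
  if (log(runif(1)) < log_r) { x[i + 1] <- y; accepted <- accepted + 1 }
  else x[i + 1] <- x[i]
}
accepted / (n - 1)       # acceptance rate
c(mean(x), var(x))       # compare with 0 and 1
plot(x, type = "l")      # trace plot
```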
3. What do we mean by best MCMC method?
- Ideally some finite \(n\) comparison, e.g. \[
\text{MSE}(\hat{f}_n) := \mathbb{E}\left[ \left( \hat{f}_n - \mathbb{E}_\pi[f] \right)^2 \right].
\]
- Proxy 1: asymptotic variance \[
\nu(f,P) := \lim_{n\to\infty} n\text{Var}(\hat{f}_n)
\]
- Proxy 2: mixing properties, e.g. \[
\mathcal{D}(\delta_x P^n,\pi)
\] (where \(\delta_x P^n := P^n(x,\cdot)\) and \(\mathcal{D}\) is a metric on \(\mathcal{P}(\mathcal{X})\)).
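Proxy 1 can be estimated from a single chain; a hedged sketch of a batch-means estimator (the function name and default batch count are my choices, not from the talk):

```r
## Batch-means estimate of nu(f, P) from a single chain output f(X_1), ..., f(X_n).
batch_means_var <- function(fx, nbatch = 50) {
  b <- floor(length(fx) / nbatch)        # batch length
  fx <- fx[1:(b * nbatch)]               # drop the remainder
  bm <- colMeans(matrix(fx, nrow = b))   # one mean per batch
  b * var(bm)                            # estimates lim_n n Var(f_hat_n)
}
```

Applied to the random walk chain above, `batch_means_var(x)` gives a rough estimate of \(\nu(f,P)\) for \(f(x) = x\).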
1’. What is the best choice of \(\alpha\) for a fixed \(Q\) in terms of asymptotic variance?
Peskun orderings
- \(P_1\) dominates \(P_2\) in the Peskun sense if for all \(x \in \mathcal{X}\) and all \(A \in \mathcal{B}\) \[
P_1(x,A\cap\{x\}^c) \geq P_2(x,A\cap\{x\}^c).
\]
- A strong requirement, but for a fixed \(Q\), if \[
\alpha_1(x,x') \geq \alpha_2(x,x')
\] for all \((x,x') \in \mathcal{X} \times \mathcal{X}\), then the Metropolis–Hastings kernel using \(\alpha_1\) dominates that using \(\alpha_2\).
Peskun’s theorem
Thm. If \(P_1\) dominates \(P_2\) (and both are \(\pi\)-reversible) then \[
\nu(f,P_1) \leq \nu(f,P_2)
\] for all \(f \in L^2(\pi)\). (A statement about spectral gaps can also be made.)
Peskun (1973) proved this on finite state spaces; Tierney (1998) gave a general state space result
Steps of the proof
- Consider the average kernel \(P(\beta) = \beta P_1 + (1-\beta)P_2\)
- Introduce \(\nu_\lambda(f,P(\beta))\) (regularised version)
- Show that \(\frac{\partial}{\partial \beta} \nu_\lambda(f, P(\beta))\) is of constant sign, for any \(f \in L^2(\pi)\) and any \(\lambda \in [0,1)\)
- Show that \(\nu_\lambda \to \nu\) as \(\lambda \to 1\)
Functional analysis preliminaries
- We can view a \(\pi\)-invariant Markov kernel as an operator \(P: L^2(\pi) \to L^2(\pi)\), defined pointwise via \[
Pf(x) := \int f(y)P(x,dy).
\] Probabilistically, \(Pf(x) = \mathbb{E}[f(X_{i+1})|X_i = x]\)
- Recall the \(L^2(\pi)\) inner product \[
\langle f,g\rangle_\pi := \int f(x)g(x)\pi(dx)
\] WLOG we restrict to \(f \in L^2(\pi)\) with \(\mathbb{E}_\pi[f] = 0\).
- Kernel is \(\pi\)-reversible \(\iff\) operator is self-adjoint, meaning \[
\langle f,P g\rangle_\pi = \langle Pf, g \rangle_\pi
\] (Think \(\pi\)-symmetric matrix).
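On a finite state space everything is concrete: \(P\) is a matrix, \(Pf\) is a matrix-vector product, and self-adjointness is just detailed balance. A small sketch with a toy 3-state chain (my own numbers, purely illustrative):

```r
## Operator view on a 3-state space: Pf(x) = sum_y P(x, y) f(y).
P <- matrix(c(0.5, 0.3, 0.2,
              0.3, 0.4, 0.3,
              0.2, 0.3, 0.5), 3, 3, byrow = TRUE)   # doubly stochastic, so pi is uniform
pivec <- rep(1 / 3, 3)
ip <- function(f, g) sum(f * g * pivec)             # the L^2(pi) inner product
f <- c(-1, 0, 1); g <- c(2, -1, 0)
all.equal(ip(f, as.vector(P %*% g)), ip(as.vector(P %*% f), g))   # self-adjointness
all.equal(diag(pivec) %*% P, t(diag(pivec) %*% P))                # detailed balance
```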
Proof sketch
- The asymptotic variance might not exist!
- If \(P\) is \(\pi\)-reversible then it always exists, but may be \(+\infty\)
- If it exists and is finite then \[
\nu(f,P) = \text{Var}_\pi[f(X_0)] + 2\sum_{j=1}^\infty \text{Cov}_\pi[f(X_0),f(X_j)],
\] with the chain started in stationarity
- Lemma. If it exists and is finite, \(\nu\) can also be written \[
\nu(f,P) = 2\langle f,(I-P)^{-1} f\rangle_\pi - \|f\|^2_\pi.
\]
- So in this case we need only worry about \(\langle f,(I-P(\beta))^{-1} f \rangle_\pi\)
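On a finite state space the lemma can be checked directly against the autocovariance expansion; a self-contained sketch on a toy 2-state reversible chain (my numbers, not from the talk):

```r
## Check nu(f, P) = 2 <f, (I - P)^{-1} f>_pi - ||f||_pi^2 against
## Var_pi[f(X_0)] + 2 sum_j Cov_pi[f(X_0), f(X_j)] for a 2-state reversible chain.
pivec <- c(0.25, 0.75)
P <- matrix(c(0.1, 0.9,
              0.3, 0.7), 2, 2, byrow = TRUE)        # pi-reversible
ip <- function(a, b) sum(a * b * pivec)
f <- c(1, 0); fc <- f - sum(f * pivec)              # centre f so E_pi[f] = 0
Z <- solve(diag(2) - P + outer(rep(1, 2), pivec))   # fundamental matrix (I - P + 1 pi^T)^{-1}
nu_lemma <- 2 * ip(fc, as.vector(Z %*% fc)) - ip(fc, fc)
acov <- sapply(1:100, function(j) {                 # Cov_pi[f(X_0), f(X_j)] = <fc, P^j fc>_pi
  Pj <- diag(2); for (k in 1:j) Pj <- Pj %*% P
  ip(fc, as.vector(Pj %*% fc))
})
nu_series <- ip(fc, fc) + 2 * sum(acov)
all.equal(nu_lemma, nu_series)                      # agree up to truncation
```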
Proof sketch
- For \(\lambda \in [0,1)\) define \[
\nu_\lambda(f,P) := 2\langle f,(I-\lambda P)^{-1} f\rangle_\pi - \|f\|^2_\pi
\]
- This always exists and is finite, because the Neumann series \[
(I-\lambda P)^{-1} = \sum_{k=0}^\infty \lambda^k P^k
\] converges in operator norm when \(\lambda < 1\) (since \(\|P\| \leq 1\)).
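A hedged check of the regularisation on a toy 2-state reversible chain of my own choosing: the truncated Neumann series gives \(\nu_\lambda\), which approaches \(\nu(f,P)\) as \(\lambda \to 1\).

```r
## nu_lambda(f, P) = 2 <f, (I - lambda P)^{-1} f>_pi - ||f||_pi^2, computed via
## the truncated Neumann series sum_k lambda^k P^k f, on a toy 2-state chain.
pivec <- c(0.75, 0.25)
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), 2, 2, byrow = TRUE)        # pi-reversible
ip <- function(a, b) sum(a * b * pivec)
f <- c(1, 0); fc <- f - sum(f * pivec)              # centred f
nu_lambda <- function(lam, K = 500) {
  g <- fc; Pkf <- fc
  for (k in 1:K) { Pkf <- as.vector(P %*% Pkf); g <- g + lam^k * Pkf }
  2 * ip(fc, g) - ip(fc, fc)
}
sapply(c(0.9, 0.99, 0.999), nu_lambda)              # approaches nu(f, P) = 0.75 here
```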
Proof sketch
- The (right-) derivative is given by \[
\frac{\partial}{\partial \beta} \nu_\lambda(f,P(\beta)) = 2\lambda\langle f,(I-\lambda P(\beta))^{-1}(P_1 - P_2)(I-\lambda P(\beta))^{-1}f\rangle_\pi.
\]
- Because \((I-\lambda P(\beta))^{-1}\) is self-adjoint \[
\frac{\partial}{\partial \beta} \nu_\lambda(f,P(\beta)) = 2\lambda \langle g_\beta, (P_1 - P_2) g_\beta \rangle_\pi
\] where \(g_\beta := (I-\lambda P(\beta))^{-1} f\).
- Because \(P_2 - P_1\) is a positive operator whenever \(P_1\) dominates \(P_2\), this derivative is always \(\leq 0\): moving weight from \(P_2\) onto \(P_1\) can only decrease \(\nu_\lambda\).
Final part
- Tierney shows that when \(P\) is self-adjoint, as \(\lambda \to 1\) \[
\nu_\lambda(f,P(\beta)) \to \nu(f,P(\beta))
\]
- Note \(\nu(f,P)\) can still be \(\infty\)
- Uses the spectral representation \(\langle f, P^n f \rangle_\pi = \int l^n \mathcal{E}_{f,P}(dl)\) to show the limit exists
Key conclusion
- First \(\alpha(r) \leq 1\) by definition
- But also \[
\alpha(r) = r\alpha(1/r) \leq r
\] since \(\alpha(1/r) \leq 1\)
- Combining gives \[
\alpha(r) \leq \min(1,r)
\] for any valid choice of acceptance function. So by Peskun's theorem, \(\alpha(r) = \min(1,r)\) is the optimal choice in terms of asymptotic variance! (A numerical illustration follows below.)
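A hedged numerical illustration of the conclusion (my own construction, not from the talk): on a 5-state target with a uniform candidate kernel, build the MH chain with \(\min(1,r)\) and with Barker's \(r/(1+r)\), and compare asymptotic variances using the lemma from the proof sketch.

```r
## Peskun's theorem in action: on a 5-state target with a uniform candidate
## kernel, the Metropolis choice min(1, r) gives no larger asymptotic variance
## than Barker's r / (1 + r).
piv <- c(0.1, 0.3, 0.2, 0.25, 0.15)                  # arbitrary target
Q <- matrix(1 / 5, 5, 5)                             # uniform candidate kernel
build_P <- function(alpha) {
  r <- outer(piv, piv, function(px, py) py / px)     # here r(x, y) = pi(y) / pi(x)
  P <- Q * alpha(r)                                  # off-diagonal acceptance
  diag(P) <- 0; diag(P) <- 1 - rowSums(P)            # rejected mass stays at x
  P
}
nu <- function(P, f) {                               # lemma: 2<f,(I-P)^{-1}f>_pi - ||f||^2_pi
  fc <- f - sum(f * piv)
  Z <- solve(diag(5) - P + outer(rep(1, 5), piv))    # fundamental matrix
  2 * sum(fc * as.vector(Z %*% fc) * piv) - sum(fc^2 * piv)
}
f <- 1:5
nu(build_P(function(r) pmin(1, r)), f)    # Metropolis-Hastings
nu(build_P(function(r) r / (1 + r)), f)   # Barker: at least as large
```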
Generalisations
- There are other orderings etc.
- Conditions stated in terms of Dirichlet forms are more popular now, as they allow us to move beyond 'accept-reject' processes \[
\mathcal{E}(f,P) := \frac{1}{2} \int (f(y)-f(x))^2 \pi(dx)P(x,dy)
\]
- Relaxations of the form \(P_1(x,A\cap\{x\}^c) \geq \epsilon P_2(x,A\cap\{x\}^c)\) also exist and can be useful (I have used them).
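For a finite chain the Dirichlet form is just a weighted double sum; a small self-contained sketch with a toy reversible chain of my own (for reversible kernels, a larger Dirichlet form for every \(f\) corresponds to a smaller asymptotic variance):

```r
## Dirichlet form E(f, P) = 0.5 * sum_{x,y} (f(y) - f(x))^2 pi(x) P(x, y)
## for a toy pi-reversible 3-state chain.
pivec <- c(0.25, 0.5, 0.25)
P <- matrix(c(0.4, 0.4, 0.2,
              0.2, 0.6, 0.2,
              0.2, 0.4, 0.4), 3, 3, byrow = TRUE)
dirichlet_form <- function(f, P, pivec) {
  sq_diff <- outer(f, f, function(fx, fy) (fy - fx)^2)  # (f(y) - f(x))^2
  0.5 * sum(pivec * P * sq_diff)                        # pivec * P has entries pi(x) P(x, y)
}
dirichlet_form(c(-1, 0, 1), P, pivec)
```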
Some of my work here
- Recently some interesting non-reversible MCMC methods have been introduced
- In fact the operators share a common structure. We introduce the condition \[
\langle f, P g \rangle_\pi = \langle QPQf, g\rangle_\pi
\] and call \(P\) \((\pi,Q)\)-self-adjoint. Here \(Q\) is a special type of operator (isometric involution).
- It turns out that to compare such processes one must consider not \(P\) itself but \[
PQ \quad \text{or} \quad QP
\] We refer to both as the \(Q\)-symmetrisation of \(P\). If \(Qf = f\) then the asymptotic variance is ordered one way but if \(Qf = -f\) then it is the other way.
- Our results also extend to continuous time, where the generator \(L\) must be \(Q\)-symmetrised (this can be applied to e.g. PDMPs). It's very strange, as often \(QL\) does not correspond to a process at all.
- (If time show R simulation.)
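In the spirit of the simulation mentioned here, a hedged sketch of one simple non-reversible scheme, a guided walk in the style of Gustafson (1998), targeting \(N(0,1)\); the details follow my reading of that paper rather than the slides, and the tuning values are arbitrary.

```r
## A guided walk in the style of Gustafson (1998), targeting N(0,1): augment
## the state with a direction p in {-1, +1}, always propose in direction p,
## and flip p on rejection. The resulting chain is invariant but not reversible.
set.seed(4)
n <- 1e4; sigma <- 1
x <- numeric(n); p <- 1
for (i in 1:(n - 1)) {
  y <- x[i] + p * abs(rnorm(1, sd = sigma))                # move in direction p
  log_r <- dnorm(y, log = TRUE) - dnorm(x[i], log = TRUE)
  if (log(runif(1)) < log_r) x[i + 1] <- y                 # accept: keep p
  else { x[i + 1] <- x[i]; p <- -p }                       # reject: stay, flip p
}
c(mean(x), var(x))   # compare with 0 and 1
```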
Thanks and see you next time!
References
Andrieu, C., & Livingstone, S. (2019). Peskun-Tierney ordering for Markov chain and process Monte Carlo: beyond the reversible scenario. arXiv preprint arXiv:1906.06197.
Diaconis, P., Holmes, S., & Neal, R. M. (2000). Analysis of a nonreversible Markov chain sampler. Annals of Applied Probability, 726-752.
Gustafson, P. (1998). A guided walk Metropolis algorithm. Statistics and Computing, 8(4), 357-364.
Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60(3), 607-612.
Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Annals of Applied Probability, 8(1), 1-9.