If you wish to take the module for certification, then you should
complete questions marked [E] and bring them to the final lecture (you
do not need to get 100%, just to have made a genuine attempt to answer
the questions).
Lecture 1
- [E] A random variable \(X\) on
\(E = \mathbb{R}\) follows an
arcsine distribution if it has CDF \[
F(x) = \begin{cases}
0 & x < 0 \\
\frac{2}{\pi}\arcsin \left( \sqrt{x} \right) & 0 \leq x \leq 1 \\
1 & x > 1
\end{cases}
\]
- Calculate \(F^{-1}\) for \(x \in [0,1]\)
- Describe one way to sample from the distribution, starting from a
uniform random variable and applying a transformation (a numerical
sketch of this approach appears after this lecture's questions)
- Prove that your approach in part (b) will generate a sample from the
arcsine distribution
- The density of the arcsine distribution is \[
f(x) = \frac{1}{\pi\sqrt{x(1-x)}} \mathbb{I}_{[0,1]}(x).
\] Explain why it is not possible to devise a rejection sampler
using a \(U[0,1]\) candidate
distribution here. What features would the candidate distribution need
to have for a rejection sampler to be possible?
- You are provided with a set of \(n\) independent samples from the \(N(0,1)\) distribution. We would instead
like to learn about properties of the standard normal distribution
truncated to lie on \([a,b]\), using importance sampling.
- Derive a simple expression for the importance weight associated with
a sample \(x_i\) from the \(N(0,1)\) distribution as a function of
\(x_i\), and the interval \([a,b]\)
- Derive the naive importance sampling estimator for a given
expectation \(\mathbb{E}[f(X)]\) in
this truncated normal setting, and also the self-normalised importance
sampling estimator
- Consider instead taking your \(n\)
\(N(0,1)\) samples and performing
rejection sampling to generate samples from the truncated Normal
distribution, and then computing the ordinary Monte Carlo estimator for
the expectation \(\mathbb{E}[f(X)]\).
In this specific setting, show that this estimator is actually equal to
the self-normalised importance sampling estimator for the integral \(\mathbb{E}[f(X)]\) (Note that this is
not true in general, but is true in this specific case).
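The inverse-CDF approach asked about in the arcsine question above can be checked numerically. Below is a minimal sketch, assuming the inverse CDF \(F^{-1}(u) = \sin^2(\pi u/2)\) obtained by inverting the stated CDF; the sample size and check points are illustrative only.
```python
# Minimal sketch: inverse-transform sampling from the arcsine distribution.
# Assumes F^{-1}(u) = sin^2(pi * u / 2), obtained by inverting the CDF above.
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)          # U[0,1] draws
x = np.sin(0.5 * np.pi * u) ** 2       # transformed draws

# Crude check: compare the empirical CDF with F at a few points.
F = lambda t: (2 / np.pi) * np.arcsin(np.sqrt(t))
for t in (0.1, 0.5, 0.9):
    print(t, np.mean(x <= t), F(t))
```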
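Relating to the truncated normal question, the following minimal sketch illustrates the claimed equality between the self-normalised importance sampling estimator (whose unnormalised weights reduce to indicators of \([a,b]\)) and the plain Monte Carlo estimator over the retained samples; the interval \([0, 1.5]\) and the test function \(f(x) = x^2\) are assumptions made purely for illustration.
```python
# Sketch: self-normalised IS with indicator weights vs. rejection + plain MC.
# The target is N(0,1) truncated to [a, b]; the proposal is N(0,1).
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 1.5                          # illustrative interval
f = lambda x: x ** 2                     # illustrative test function

x = rng.standard_normal(50_000)          # samples from N(0,1)
w = ((x >= a) & (x <= b)).astype(float)  # unnormalised weights: indicator of [a,b]

snis = np.sum(w * f(x)) / np.sum(w)      # self-normalised IS estimator
mc = np.mean(f(x[w == 1.0]))             # plain MC on the retained samples

print(snis, mc)                          # identical up to floating point
```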
Lecture 2
Consider a state space \(E =
\mathbb{R}^d\) and two probability distributions \(m_1\) and \(m_2\) defined on it. Let \(m_1 = \delta_x\) for some \(x \in \mathbb{R}^d\), i.e. a discrete
distribution. Let \(m_2\) be a
continuous distribution (meaning one that possesses a density). Show
that \(\|m_1 - m_2\|_{TV} = 1\).
For a bonus, consider how your argument can be generalised to show
that the same is true when \(m_1\) is any
discrete distribution.
[E] Consider a Markov chain defined on the state space \(E = \mathbb{R}\) with dynamics given by
\[
X_{i+1} = X_i + \xi_{i+1}
\] where each \(\xi_{i+1} \sim
N(0,\sigma^2)\) for some \(\sigma^2
> 0\).
- For which distributions \(\mu\) is
this chain \(\mu\)-irreducible?
- Can you write down the transition density, which we will denote
\(q(x,y)\)?
- Is there a probability distribution \(\pi\) for which the chain is \(\pi\)-reversible? Explain your answer.
- Consider the modified chain with the following dynamics: \[
X_{i+1} =
\begin{cases}
X_i + \xi_{i+1} & \text{if } X_i+\xi_{i+1} \in [-10,10] \\
X_i & \text{otherwise.}
\end{cases}
\] We can refer to the scenario in which \(X_i + \xi_{i+1} \not\in [-10,10]\) as a
rejection. Show that the (sub-)transition density \(\mathbb{I}_{[-10,10]}(y)q(x,y)\) is \(\pi\)-reversible for \(\pi(x) \propto
\mathbb{I}_{[-10,10]}(x)\)
- Do you think that the chain is aperiodic? What can you conclude
about whether or not the chain converges to equilibrium?
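A minimal simulation sketch of the modified chain above; the step size \(\sigma = 3\) and the chain length are illustrative choices. If the chain does converge to the claimed stationary distribution, long-run moments should be close to those of \(U[-10,10]\) (mean \(0\), variance \(100/3\)).
```python
# Sketch: random walk on R with moves outside [-10,10] rejected.
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 3.0, 200_000                 # illustrative step size and chain length
x = np.empty(n)
x[0] = 0.0
for i in range(n - 1):
    prop = x[i] + sigma * rng.standard_normal()
    x[i + 1] = prop if -10.0 <= prop <= 10.0 else x[i]

# Compare with Uniform[-10,10]: mean 0, variance 100/3 (about 33.3).
print(x.mean(), x.var())
```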
- Consider the total variation distance \(\|\mu - \nu\|_{TV} = \sup_{A \in \mathcal{E}}
|\mu(A) - \nu(A)|\) between two probability distributions \(\mu\) and \(\nu\) defined on \(E = \mathbb{R}^d\). Suppose that both
distributions are continuous with densities \(\mu(x)\) and \(\nu(x)\).
- Explain why the supremum will never be attained at the set \(A = E\) provided that \(\mu \neq \nu\).
- Set \(B := \{ x \in E: \mu(x) > \nu(x)
\}\) (we will take for granted that \(B
\in \mathcal{E}\)). Using the densities of the two distributions,
show that for any \(A \in \mathcal{E}\)
\[
\mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B),
\] and also \[
\mu(A) - \nu(A) \geq \mu(A \cap B^c) - \nu(A \cap B^c) \geq \mu(B^c) -
\nu(B^c).
\] What do you conclude about the possible sets for which the
supremum in the total variation distance is obtained?
- Show further using basic properties of distributions that \[
|\mu(B) - \nu(B)| = |\mu(B^c) - \nu(B^c)|.
\]
- Noting that \(|\mu(B) - \nu(B)| = \int_B
|\mu(x) - \nu(x)|dx\), construct a similar expression for the
\(B^c\) case, and following this show
that an alternative way of writing the total variation distance in this
setting (as mentioned in the lectures) is \[
\|\mu - \nu\|_{TV} = \frac{1}{2} \int_E |\mu(x) - \nu(x)| dx.
\]
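As a quick numerical illustration of the identity just derived, the sketch below approximates \(\frac{1}{2}\int_E |\mu(x)-\nu(x)|\,dx\) on a grid and compares it with \(\mu(B) - \nu(B)\) for \(B = \{x : \mu(x) > \nu(x)\}\); the two Gaussian densities used are illustrative choices only.
```python
# Sketch: total variation distance between two 1-d Gaussian densities,
# computed both as (1/2) * integral |mu - nu| and as mu(B) - nu(B).
import numpy as np

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-20, 20, 400_001)       # fine grid (illustrative truncation)
dx = x[1] - x[0]
mu = gauss(x, 0.0, 1.0)                 # illustrative densities
nu = gauss(x, 1.0, 2.0)

tv_integral = 0.5 * np.sum(np.abs(mu - nu)) * dx
B = mu > nu
tv_set = np.sum(mu[B] - nu[B]) * dx     # mu(B) - nu(B) on the grid

print(tv_integral, tv_set)              # should agree closely
```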
Lecture 3
- We will consider a random walk Metropolis on \(E = \mathbb{R}\) with candidate kernel
\(U[x-\delta,x+\delta]\) if the current
state is \(x\) (i.e. a uniform
distribution with positive density between \(x-\delta\) and \(x+\delta\), for some \(\delta>0\)).
- What is the mean of this candidate kernel? Show that if \(Y \sim U[x-\delta,x+\delta]\) then \(q(y,x)/q(x,y) = 1\)
- Consider the setting \(\pi(x) \propto
(1+x^2)^{-1}\) (Cauchy tails). We can get a good understanding of
how the acceptance rate will behave by considering the behaviour of the
functions \[
\gamma_{+,\delta}(x) := \frac{\pi(x + \delta)}{\pi(x)}
\] and \[
\gamma_{-,\delta}(x) := \frac{\pi(x -\delta)}{\pi(x)}
\] for large values of \(|x|\).
Compute the limits \(\lim_{x
\to\infty}\gamma_{+,\delta}(x)\) and \(\lim_{x \to\infty}\gamma_{-,\delta}(x)\).
What do you conclude about the likely behaviour of the random walk
Metropolis in the tails of this distribution? (These limits are also
checked numerically in a sketch after this lecture's questions.)
- Perform the same calculation but this time using the Laplace target,
meaning \(\pi(x) \propto e^{-|x|}\).
What do you conclude in this case?
- Now compute \(\nabla\log\pi(x)\)
for both the Laplace and Cauchy examples and assess how it behaves in
the limit as \(x \to \infty\). Can you
relate the behaviour of this function to your analysis in parts (b) and
(c)?
- [E] Consider a modified version of the Random Walk Metropolis
algorithm in which \(Y \sim N(x,hI_{d \times
d})\) as usual, but instead of using the typical acceptance rate
\(\alpha(x,y) = \min(1,
\pi(y)/\pi(x))\) we instead use the acceptance rate \[
\kappa(x,y) = \frac{\pi(y)}{\pi(x) + \pi(y)}
\]
- Show that this is a valid approach by showing that for \(y \neq x\) then \[
\pi(x)q(x,y)\kappa(x,y) = \pi(y)q(y,x)\kappa(y,x),
\] where \(q(x,y)\) is the
Gaussian density associated with the \(N(x,hI_{d\times d})\) distribution. What
property of the Markov chain does this show?
- Show further that for any pair \(x, y \in
\mathbb{R}^d\) the inequality \(\kappa(x,y) \leq \alpha(x,y)\) holds. What
does this imply about how the two algorithms compare in terms of the
probability of accepting a candidate move?
- Denoting by \(P_1\) the regular
random walk Metropolis transition kernel and by \(P_2\) the kernel using \(\kappa\) in place of \(\alpha\), show that \(P_1(x,A\cap \{x\}^c) \geq P_2(x,A \cap
\{x\}^c)\) for any \(A \in
\mathcal{E}\), where \(A\cap
\{x\}^c\) is the set \(A\)
excluding the state \(x\). This is
known as a Peskun–Tierney ordering, and we say that \(P_1\) dominates \(P_2\) in the Peskun sense. If such an
ordering can be found then, by Peskun’s theorem, the asymptotic variances satisfy \(\nu(P_1,f) \leq \nu(P_2,f)\) for any \(f\) such that \(\mathbb{E}[f^2]<\infty\).
- More generally let \[
t(x,y) := \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}
\] and consider a Metropolis–Hastings style algorithm with
candidate kernel \(Q\) and acceptance
rate \(g(t(x,y))\), where \(g:\mathbb{R} \to [0,1]\). Find a relation
between \(t(x,y)\) and \(t(y,x)\), and using this and the detailed
balance equations as in part (b) show that the algorithm produces a
\(\pi\)-reversible Markov chain if
\(g \leq 1\) and \[
g(t) = tg(1/t),
\] where \(t = t(x,y)\). Then
show that any such \(g(t) \leq
\min(1,t)\), implying that the usual Metropolis–Hastings
acceptance rate is the largest possible choice among the class of \(g\)’s. (Such \(g\)’s are called balancing
functions. The choice in part (a) is \(g(t) =
t/(1+t)\), also known as Barker’s rule).
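The limits in the first question of this lecture can be checked numerically. A minimal sketch follows, assuming the unnormalised Cauchy and Laplace densities given above and the illustrative choice \(\delta = 1\).
```python
# Sketch: behaviour of pi(x + delta)/pi(x) and pi(x - delta)/pi(x) in the tails.
import numpy as np

cauchy = lambda x: 1.0 / (1.0 + x ** 2)      # unnormalised Cauchy density
laplace = lambda x: np.exp(-np.abs(x))       # unnormalised Laplace density
delta = 1.0                                  # illustrative step

for x in (10.0, 100.0, 500.0):
    print("Cauchy :", cauchy(x + delta) / cauchy(x), cauchy(x - delta) / cauchy(x))
    print("Laplace:", laplace(x + delta) / laplace(x), laplace(x - delta) / laplace(x))
# Cauchy ratios both tend to 1; Laplace ratios tend to exp(-delta) and exp(delta).
```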
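For the modified acceptance rate question, the following hedged sketch compares the usual Metropolis rule \(\alpha(x,y) = \min(1, \pi(y)/\pi(x))\) with Barker's rule \(\kappa(x,y) = \pi(y)/(\pi(x) + \pi(y))\) in a random walk algorithm; the standard normal target and step variance \(h = 1\) are assumptions made only for the demonstration.
```python
# Sketch: random walk Metropolis with the standard (Metropolis) acceptance rule
# versus Barker's rule kappa(x,y) = pi(y)/(pi(x) + pi(y)).
import numpy as np

rng = np.random.default_rng(3)
log_pi = lambda x: -0.5 * x ** 2        # illustrative target: N(0,1) up to a constant
h, n = 1.0, 100_000                     # illustrative step variance and chain length

def run(barker):
    x, accepted = 0.0, 0
    for _ in range(n):
        y = x + np.sqrt(h) * rng.standard_normal()
        t = np.exp(log_pi(y) - log_pi(x))            # pi(y)/pi(x)
        prob = t / (1.0 + t) if barker else min(1.0, t)
        if rng.uniform() < prob:
            x, accepted = y, accepted + 1
    return accepted / n

print("Metropolis acceptance rate:", run(barker=False))
print("Barker acceptance rate    :", run(barker=True))
# Barker's rule never accepts with higher probability (kappa <= alpha pointwise).
```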
Lecture 4
- Consider a Metropolis–Hastings algorithm with candidate transition
kernel chosen to be \(Q(x,\cdot) = Q\)
for any \(x \in E = \mathbb{R}^d\),
where \(Q\) is a fixed probability
measure. In other words, the proposal is chosen independently of the
current state of the chain. This is called the independence
sampler. Denote the associated Markov transition kernel by \(P\). We will assume that the target
distribution \(\pi\) has a density
(also denoted \(\pi\)) and \(Q\) has a density \(q\). Assume further that \(\pi(x) > 0\) and \(q(x) > 0\) for all \(x \in E\), and that there is an \(\epsilon \in (0,1)\) such that \(q(x) \geq \epsilon \pi(x)\) for all \(x \in E\) (this is one way of expressing
the idea that \(Q\) has heavier tails
than \(\pi\)).
- State the acceptance probability for the independence sampler, and
prove that if \(Q = \pi\) then the
independence sampler has acceptance probability 1.
- Show that if \(q(x) \geq \epsilon
\pi(x)\) for all \(x \in E\)
then \(\int_A \alpha(x,y)q(y)dy \geq \epsilon
\pi(A)\) for any \(A \in
\mathcal{E}\) and all \(x \in
E\). You may find it useful to partition \(A\) into \(A \cap
\mathcal{A}(x)\) and \(A \cap
\mathcal{A}(x)^c\), where \(\mathcal{A}(x) := \{y \in E | \alpha(x,y) =
1\}\) is the set of states \(y\)
for which \(\alpha(x,y) = 1\).
- Your result from part (b) implies that the transition kernel \(P\) can be written in the mixture
form \[
P(x,A) = \epsilon\pi(A) + (1-\epsilon)R(x,A)
\] for any \(A \in
\mathcal{E}\), where \(R(x,A) :=
(P(x,A) - \epsilon \pi(A))/(1-\epsilon)\) is called the
residual kernel. You may use without proof the fact that for
any \(x\) the kernel \(R(x,\cdot)\) is a probability distribution.
Use this representation to show that whenever \(x \neq y\) \[
\|P(x,\cdot) - P(y,\cdot)\|_{TV} \leq 1-\epsilon.
\] This essentially shows that \(P\) is a contraction on the space of
probability measures over \((E,\mathcal{E})\), which is one way of
showing that \(P\) is uniformly
ergodic, meaning \(\|P^n(x,\cdot) - \pi\|_{TV}
\leq (1-\epsilon)^n\). (A simple numerical sketch of the independence
sampler appears at the end of this sheet.)
- [E] The preconditioned Crank–Nicolson algorithm (pCN) on \(\mathbb{R}\) is a Metropolis–Hastings
algorithm in which given the current state \(X_i = x\) a proposal \(Y\) is generated using the equation \[
Y = \rho x + \sqrt{1-\rho^2}Z_i, ~~ Z_i \sim N(0,1)
\] for some \(\rho \in (0,1)\).
- Show that if \(\log\pi(x) = \Phi(x) -
x^2/2 + C_\pi\) for some function \(\Phi: \mathbb{R} \to \mathbb{R}\) and
constant \(C_\pi \in \mathbb{R}\), then
the acceptance rate can be written \[
\alpha(x,y) = \min\left(1, \exp \left\{ \Phi(y) - \Phi(x)
\right\}\right).
\]
- If the target distribution \(\pi\)
is a standard Gaussian, then what would the function \(\Phi\) be, and what would \(\alpha(x,y)\) become for any \(x,y \in \mathbb{R}\)?
- The pCN method works well when \(\pi\) is not too far from Gaussian. But
when \(\pi\) has heavier tails it
struggles. Consider the case \[
\Phi(x) = -f(|x|) + \frac{1}{2}x^2,
\] where \(f:[0,\infty) \to
\mathbb{R}\) is a strictly increasing function that satisfies
\(\lim_{r\to\infty}f(r)=\infty\) and
\(f(r) = o(r^2)\). Explain in words,
but using mathematics to help if you wish, how \(\Phi\) will behave as \(|x|\) grows.
- With \(\Phi\) as in part (c) we can
study \(\alpha(x,y)\) when \(|x|\) is large via the function \[
\Phi(\rho x) - \Phi(x),
\] noting that \(\rho x\) is the
proposal mean if the current state is \(x\). Describe in words how this function
will behave as \(|x| \to \infty\), what
this is likely to mean for the acceptance rate of the pCN algorithm, and
how you might diagnose this type of behaviour through algorithm trace
plots. This argument can be made rigorous when \(\Phi\) is twice continuously differentiable
using a Taylor series expansion of \(\Phi(y)\) and controlling the remainder,
since typically \(\Phi''\) is
bounded.
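Finally, two minimal sketches relating to this lecture's questions. The first is an independence sampler in which the target is \(N(0,1)\) and the proposal is \(N(0, s^2)\) with \(s > 1\), an illustrative choice for which \(q \geq \epsilon\pi\) holds with \(\epsilon = 1/s\).
```python
# Sketch: independence sampler with target N(0,1) and proposal N(0, s^2), s > 1.
# Acceptance probability: min(1, pi(y) q(x) / (pi(x) q(y))).
import numpy as np

rng = np.random.default_rng(4)
s = 3.0                                          # illustrative proposal scale (heavier tails)
log_pi = lambda x: -0.5 * x ** 2                 # target density, up to a constant
log_q = lambda x: -0.5 * (x / s) ** 2            # proposal density, up to a constant

n = 100_000
x = np.empty(n)
x[0] = 0.0
for i in range(n - 1):
    y = s * rng.standard_normal()
    log_ratio = (log_pi(y) + log_q(x[i])) - (log_pi(x[i]) + log_q(y))
    x[i + 1] = y if np.log(rng.uniform()) < log_ratio else x[i]

print(x.mean(), x.var())                         # should be close to 0 and 1
```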
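The second sketches the pCN proposal and the acceptance rate \(\min(1, \exp\{\Phi(y) - \Phi(x)\})\) from part (a); the particular \(\Phi\) (corresponding to an illustrative non-Gaussian target) and the value of \(\rho\) are assumptions made only for the demonstration.
```python
# Sketch: preconditioned Crank-Nicolson (pCN) on R with acceptance
# probability min(1, exp(Phi(y) - Phi(x))).
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8                                 # illustrative correlation parameter
Phi = lambda x: -0.1 * x ** 4             # illustrative Phi; log pi(x) = Phi(x) - x^2/2 + C

n = 100_000
x = np.empty(n)
x[0] = 0.0
accepted = 0
for i in range(n - 1):
    y = rho * x[i] + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
    if np.log(rng.uniform()) < Phi(y) - Phi(x[i]):
        x[i + 1] = y
        accepted += 1
    else:
        x[i + 1] = x[i]

print("acceptance rate:", accepted / (n - 1))
```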