If you wish to take the module for certification, then you should
complete questions marked [E] and bring them to the final lecture (you
do not need to get 100%, just to have made a genuine attempt to answer
the questions).
Lecture 1
- [E] A random variable \(X\) on
\(E = \mathbb{R}\) follows an
arcsine distribution if it has CDF \[
F(x) = \begin{cases}
0 & x < 0 \\
\frac{2}{\pi}\arcsin \left( \sqrt{x} \right) & 0 \leq x \leq 1 \\
1 & x > 1
\end{cases}
\]
- Calculate \(F^{-1}\) for \(x \in [0,1]\)
- Describe one way to sample from the distribution, starting from a
uniform random variable and applying a transformation (a numerical
sketch of this approach appears after this lecture's questions)
- Prove that your approach in part (b) will generate a sample from the
arcsine distribution
- The density of the arcsine distribution is \[
f(x) = \frac{1}{\pi\sqrt{x(1-x)}} \mathbb{I}_{[0,1]}(x).
\] Explain why it is not possible to devise a rejection sampler
using a \(U[0,1]\) candidate
distribution here. What features would the candidate distribution need
to have for a rejection sampler to be possible?
- You are provided with a set of \(n\) independent samples from the \(N(0,1)\) distribution. We would instead
like to learn about properties of the standard normal distribution
truncated to lie on \([a,b]\), using importance sampling.
- Derive a simple expression for the importance weight associated with
a sample \(x_i\) from the \(N(0,1)\) distribution as a function of
\(x_i\), and the interval \([a,b]\)
- Derive the naive importance sampling estimator for a given
expectation \(\mathbb{E}[f(X)]\) in
this truncated normal setting, and also the self-normalised importance
sampling estimator
- Consider instead taking your \(n\)
\(N(0,1)\) samples and performing
rejection sampling to generate samples from the truncated Normal
distribution, and then computing the ordinary Monte Carlo estimator for
the expectation \(\mathbb{E}[f(X)]\).
In this specific setting, show that this estimator is actually equal to
the self-normalised importance sampling estimator for the integral \(\mathbb{E}[f(X)]\) (Note that this is
not true in general, but is true in this specific case).
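The inverse-CDF approach asked about in the arcsine question above can be checked numerically. Below is a minimal sketch, assuming the inverse CDF \(F^{-1}(u) = \sin^2(\pi u/2)\) obtained by inverting the stated CDF; the sample size and check points are illustrative only.
```python
# Minimal sketch: inverse-transform sampling from the arcsine distribution.
# Assumes F^{-1}(u) = sin^2(pi * u / 2), obtained by inverting the CDF above.
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)          # U[0,1] draws
x = np.sin(0.5 * np.pi * u) ** 2       # transformed draws

# Crude check: compare the empirical CDF with F at a few points.
F = lambda t: (2 / np.pi) * np.arcsin(np.sqrt(t))
for t in (0.1, 0.5, 0.9):
    print(t, np.mean(x <= t), F(t))
```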
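Relating to the truncated normal question, the following minimal sketch illustrates the claimed equality between the self-normalised importance sampling estimator (whose unnormalised weights reduce to indicators of \([a,b]\)) and the plain Monte Carlo estimator over the retained samples; the interval \([0, 1.5]\) and the test function \(f(x) = x^2\) are assumptions made purely for illustration.
```python
# Sketch: self-normalised IS with indicator weights vs. rejection + plain MC.
# The target is N(0,1) truncated to [a, b]; the proposal is N(0,1).
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 1.5                          # illustrative interval
f = lambda x: x ** 2                     # illustrative test function

x = rng.standard_normal(50_000)          # samples from N(0,1)
w = ((x >= a) & (x <= b)).astype(float)  # unnormalised weights: indicator of [a,b]

snis = np.sum(w * f(x)) / np.sum(w)      # self-normalised IS estimator
mc = np.mean(f(x[w == 1.0]))             # plain MC on the retained samples

print(snis, mc)                          # identical up to floating point
```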
Lecture 2
Consider a state space \(E =
\mathbb{R}^d\) and two probability distributions \(m_1\) and \(m_2\) defined on it. Let \(m_1 = \delta_x\) for some \(x \in \mathbb{R}^d\), i.e. a discrete
distribution. Let \(m_2\) be a
continuous distribution (meaning one that possesses a density). Show
that \(\|m_1 - m_2\|_{TV} = 1\).
For a bonus, consider how your argument can be generalised to show
that the same is true when \(m_1\) is any
discrete distribution.
[E] Consider a Markov chain defined on the state space \(E = \mathbb{R}\) with dynamics given by
\[
X_{i+1} = X_i + \xi_{i+1}
\] where each \(\xi_{i+1} \sim
N(0,\sigma^2)\) for some \(\sigma^2
> 0\).
- For which distributions \(\mu\) is
this chain \(\mu\)-irreducible?
- Can you write down the transition density, which we will denote
\(q(x,y)\)?
- Is there a probability distribution \(\pi\) for which the chain is \(\pi\)-reversible? Explain your answer.
- Consider the modified chain with the following dynamics: \[
X_{i+1} =
\begin{cases}
X_i + \xi_{i+1} & \text{if } X_i+\xi_{i+1} \in [-10,10] \\
X_i & \text{otherwise.}
\end{cases}
\] We can refer to the scenario in which \(X_i + \xi_{i+1} \not\in [-10,10]\) as a
rejection. Show that the (sub-)transition density \(\mathbb{I}_{[-10,10]}(y)q(x,y)\) is \(\pi\)-reversible for \(\pi(x) \propto
\mathbb{I}_{[-10,10]}(x)\)
- Do you think that the chain is aperiodic? What can you conclude
about whether or not the chain converges to equilibrium?
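A minimal simulation sketch of the modified chain above; the step size \(\sigma = 3\) and the chain length are illustrative choices. If the chain does converge to the claimed stationary distribution, long-run moments should be close to those of \(U[-10,10]\) (mean \(0\), variance \(100/3\)).
```python
# Sketch: random walk on R with moves outside [-10,10] rejected.
import numpy as np

rng = np.random.default_rng(2)
sigma, n = 3.0, 200_000                 # illustrative step size and chain length
x = np.empty(n)
x[0] = 0.0
for i in range(n - 1):
    prop = x[i] + sigma * rng.standard_normal()
    x[i + 1] = prop if -10.0 <= prop <= 10.0 else x[i]

# Compare with Uniform[-10,10]: mean 0, variance 100/3 (about 33.3).
print(x.mean(), x.var())
```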
- Consider the total variation distance \(\|\mu - \nu\|_{TV} = \sup_{A \in \mathcal{E}}
|\mu(A) - \nu(A)|\) between two probability distributions \(\mu\) and \(\nu\) defined on \(E = \mathbb{R}^d\). Suppose that both
distributions are continuous with densities \(\mu(x)\) and \(\nu(x)\).
- Explain why the supremum will never be attained at the set \(A = E\) provided that \(\mu \neq \nu\).
- Set \(B := \{ x \in E: \mu(x) > \nu(x)
\}\) (we will take for granted that \(B
\in \mathcal{E}\)). Using the densities of the two distributions,
show that for any \(A \in \mathcal{E}\)
\[
\mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B),
\] and also \[
\mu(A) - \nu(A) \geq \mu(A \cap B^c) - \nu(A \cap B^c) \geq \mu(B^c) -
\nu(B^c).
\] What do you conclude about the possible sets for which the
supremum in the total variation distance is obtained?
- Show further using basic properties of distributions that \[
|\mu(B) - \nu(B)| = |\mu(B^c) - \nu(B^c)|.
\]
- Noting that \(|\mu(B) - \nu(B)| = \int_B
|\mu(x) - \nu(x)|dx\), construct a similar expression for the
\(B^c\) case, and following this show
that an alternative way of writing the total variation distance in this
setting (as mentioned in the lectures) is \[
\|\mu - \nu\|_{TV} = \frac{1}{2} \int_E |\mu(x) - \nu(x)| dx.
\]
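As a quick numerical illustration of the identity just derived, the sketch below approximates \(\frac{1}{2}\int_E |\mu(x)-\nu(x)|\,dx\) on a grid and compares it with \(\mu(B) - \nu(B)\) for \(B = \{x : \mu(x) > \nu(x)\}\); the two Gaussian densities used are illustrative choices only.
```python
# Sketch: total variation distance between two 1-d Gaussian densities,
# computed both as (1/2) * integral |mu - nu| and as mu(B) - nu(B).
import numpy as np

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-20, 20, 400_001)       # fine grid (illustrative truncation)
dx = x[1] - x[0]
mu = gauss(x, 0.0, 1.0)                 # illustrative densities
nu = gauss(x, 1.0, 2.0)

tv_integral = 0.5 * np.sum(np.abs(mu - nu)) * dx
B = mu > nu
tv_set = np.sum(mu[B] - nu[B]) * dx     # mu(B) - nu(B) on the grid

print(tv_integral, tv_set)              # should agree closely
```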
Lecture 3
- We will consider a random walk Metropolis on \(E = \mathbb{R}\) with candidate kernel
\(U[x-\delta,x+\delta]\) if the current
state is \(x\) (i.e. a uniform
distribution with positive density between \(x-\delta\) and \(x+\delta\), for some \(\delta>0\)).
- What is the mean of this candidate kernel? Show that if \(Y \sim U[x-\delta,x+\delta]\) then \(q(y,x)/q(x,y) = 1\)
- Consider the setting \(\pi(x) \propto
(1+x^2)^{-1}\) (Cauchy tails). We can get a good understanding of
how the acceptance rate will behave by considering the behaviour of the
functions \[
\gamma_{+,\delta}(x) := \frac{\pi(x + \delta)}{\pi(x)}
\] and \[
\gamma_{-,\delta}(x) := \frac{\pi(x -\delta)}{\pi(x)}
\] for large values of \(|x|\).
Compute the limits \(\lim_{x
\to\infty}\gamma_{+,\delta}(x)\) and \(\lim_{x \to\infty}\gamma_{-,\delta}(x)\).
What do you conclude about the likely behaviour of the random walk
Metropolis in the tails of this distribution? (These limits are also
checked numerically in a sketch after this lecture's questions.)
- Perform the same calculation but this time using the Laplace target,
meaning \(\pi(x) \propto e^{-|x|}\).
What do you conclude in this case?
- Now compute \(\nabla\log\pi(x)\)
for both the Laplace and Cauchy examples and assess how it behaves in
the limit as \(x \to \infty\). Can you
relate the behaviour of this function to your analysis in parts (b) and
(c)?
- [E] Consider a modified version of the Random Walk Metropolis
algorithm in which \(Y \sim N(x,hI_{d \times
d})\) as usual, but instead of using the typical acceptance rate
\(\alpha(x,y) = \min(1,
\pi(y)/\pi(x))\) we instead use the acceptance rate \[
\kappa(x,y) = \frac{\pi(y)}{\pi(x) + \pi(y)}
\]
- Show that this is a valid approach by showing that for \(y \neq x\) then \[
\pi(x)q(x,y)\kappa(x,y) = \pi(y)q(y,x)\kappa(y,x),
\] where \(q(x,y)\) is the
Gaussian density associated with the \(N(x,hI_{d\times d})\) distribution. What
property of the Markov chain does this show?
- Show further that for any pair \(x, y \in
\mathbb{R}^d\) the inequality \(\kappa(x,y) \leq \alpha(x,y)\) holds. What
does this imply about how the two algorithms compare in terms of the
probability of accepting a candidate move?
- Denoting by \(P_1\) the regular
random walk Metropolis transition kernel and by \(P_2\) the kernel using \(\kappa\) in place of \(\alpha\), show that \(P_1(x,A\cap \{x\}^c) \geq P_2(x,A \cap
\{x\}^c)\) for any \(A \in
\mathcal{E}\), where \(A\cap
\{x\}^c\) is the set \(A\)
excluding the state \(x\). This is
known as a Peskun–Tierney ordering, and we say that \(P_1\) dominates \(P_2\) in the Peskun sense. If such an
ordering can be found then, by Peskun’s theorem, the asymptotic variances satisfy \(\nu(P_1,f) \leq \nu(P_2,f)\) for any \(f\) such that \(\mathbb{E}[f^2]<\infty\).
- More generally let \[
t(x,y) := \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}
\] and consider a Metropolis–Hastings style algorithm with
candidate kernel \(Q\) and acceptance
rate \(g(t(x,y))\), where \(g:\mathbb{R} \to [0,1]\). Find a relation
between \(t(x,y)\) and \(t(y,x)\), and using this and the detailed
balance equations as in part (b) show that the algorithm produces a
\(\pi\)-reversible Markov chain if
\(g \leq 1\) and \[
g(t) = tg(1/t),
\] where \(t = t(x,y)\). Then
show that any such \(g(t) \leq
\min(1,t)\), implying that the usual Metropolis–Hastings
acceptance rate is the largest possible choice among the class of \(g\)’s. (Such \(g\)’s are called balancing
functions. The choice in part (a) is \(g(t) =
t/(1+t)\), also known as Barker’s rule).
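The limits in the first question of this lecture can be checked numerically. A minimal sketch follows, assuming the unnormalised Cauchy and Laplace densities given above and the illustrative choice \(\delta = 1\).
```python
# Sketch: behaviour of pi(x + delta)/pi(x) and pi(x - delta)/pi(x) in the tails.
import numpy as np

cauchy = lambda x: 1.0 / (1.0 + x ** 2)      # unnormalised Cauchy density
laplace = lambda x: np.exp(-np.abs(x))       # unnormalised Laplace density
delta = 1.0                                  # illustrative step

for x in (10.0, 100.0, 500.0):
    print("Cauchy :", cauchy(x + delta) / cauchy(x), cauchy(x - delta) / cauchy(x))
    print("Laplace:", laplace(x + delta) / laplace(x), laplace(x - delta) / laplace(x))
# Cauchy ratios both tend to 1; Laplace ratios tend to exp(-delta) and exp(delta).
```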
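For the modified acceptance rate question, the following hedged sketch compares the usual Metropolis rule \(\alpha(x,y) = \min(1, \pi(y)/\pi(x))\) with Barker's rule \(\kappa(x,y) = \pi(y)/(\pi(x) + \pi(y))\) in a random walk algorithm; the standard normal target and step variance \(h = 1\) are assumptions made only for the demonstration.
```python
# Sketch: random walk Metropolis with the standard (Metropolis) acceptance rule
# versus Barker's rule kappa(x,y) = pi(y)/(pi(x) + pi(y)).
import numpy as np

rng = np.random.default_rng(3)
log_pi = lambda x: -0.5 * x ** 2        # illustrative target: N(0,1) up to a constant
h, n = 1.0, 100_000                     # illustrative step variance and chain length

def run(barker):
    x, accepted = 0.0, 0
    for _ in range(n):
        y = x + np.sqrt(h) * rng.standard_normal()
        t = np.exp(log_pi(y) - log_pi(x))            # pi(y)/pi(x)
        prob = t / (1.0 + t) if barker else min(1.0, t)
        if rng.uniform() < prob:
            x, accepted = y, accepted + 1
    return accepted / n

print("Metropolis acceptance rate:", run(barker=False))
print("Barker acceptance rate    :", run(barker=True))
# Barker's rule never accepts with higher probability (kappa <= alpha pointwise).
```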
Lecture 4
- Consider a Metropolis–Hastings algorithm with candidate transition
kernel chosen to be \(Q(x,\cdot) = Q\)
for any \(x \in E = \mathbb{R}^d\),
where \(Q\) is a fixed probability
measure. In other words, the proposal is chosen independently of the
current state of the chain. This is called the independence
sampler. Denote the associated Markov transition kernel by \(P\). We will assume that the target
distribution \(\pi\) has a density
(also denoted \(\pi\)) and \(Q\) has a density \(q\). Assume further that \(\pi(x) > 0\) and \(q(x) > 0\) for all \(x \in E\), and that there is an \(\epsilon \in (0,1)\) such that \(q(x) \geq \epsilon \pi(x)\) for all \(x \in E\) (this is one way of expressing
the idea that \(Q\) has heavier tails
than \(\pi\)).
- State the acceptance probability for the independence sampler, and
prove that if \(Q = \pi\) then the
independence sampler has acceptance probability 1.
- Show that if \(q(x) \geq \epsilon
\pi(x)\) for all \(x \in E\)
then \(\int_A \alpha(x,y)q(y)dy \geq \epsilon
\pi(A)\) for any \(A \in
\mathcal{E}\) and all \(x \in
E\). You may find it useful to partition \(A\) into \(A \cap
\mathcal{A}(x)\) and \(A \cap
\mathcal{A}(x)^c\), where \(\mathcal{A}(x) := \{y \in E | \alpha(x,y) =
1\}\) is the set of states \(y\)
for which \(\alpha(x,y) = 1\).
- Your result from part (b) implies that the transition kernel \(P\) can be written in the mixture
form \[
P(x,A) = \epsilon\pi(A) + (1-\epsilon)R(x,A)
\] for any \(A \in
\mathcal{E}\), where \(R(x,A) :=
(P(x,A) - \epsilon \pi(A))/(1-\epsilon)\) is called the
residual kernel. You may use without proof the fact that for
any \(x\) the kernel \(R(x,\cdot)\) is a probability distribution.
Use this representation to show that whenever \(x \neq y\) \[
\|P(x,\cdot) - P(y,\cdot)\|_{TV} \leq 1-\epsilon.
\] This essentially shows that \(P\) is a contraction on the space of
probability measures over \((E,\mathcal{E})\), which is one way of
showing that \(P\) is uniformly
ergodic, meaning \(\|P^n(x,\cdot) - \pi\|_{TV}
\leq (1-\epsilon)^n\). (A simple numerical sketch of the independence
sampler appears at the end of this sheet.)
- [E] The preconditioned Crank–Nicolson algorithm (pCN) on \(\mathbb{R}\) is a Metropolis–Hastings
algorithm in which given the current state \(X_i = x\) a proposal \(Y\) is generated using the equation \[
Y = \rho x + \sqrt{1-\rho^2}Z_i, ~~ Z_i \sim N(0,1)
\] for some \(\rho \in (0,1)\).
- Show that if \(\log\pi(x) = \Phi(x) -
x^2/2 + C_\pi\) for some function \(\Phi: \mathbb{R} \to \mathbb{R}\) and
constant \(C_\pi \in \mathbb{R}\), then
the acceptance rate can be written \[
\alpha(x,y) = \min\left(1, \exp \left\{ \Phi(y) - \Phi(x)
\right\}\right).
\]
- If the target distribution \(\pi\)
is a standard Gaussian, then what would the function \(\Phi\) be, and what would \(\alpha(x,y)\) become for any \(x,y \in \mathbb{R}\)?
- The pCN method works well when \(\pi\) is not too far from Gaussian. But
when \(\pi\) has heavier tails it
struggles. Consider the case \[
\Phi(x) = -f(|x|) + \frac{1}{2}x^2,
\] where \(f:[0,\infty) \to
\mathbb{R}\) is a strictly increasing function that satisfies
\(\lim_{r\to\infty}f(r)=\infty\) and
\(f(r) = o(r^2)\). Explain in words,
but using mathematics to help if you wish, how \(\Phi\) will behave as \(|x|\) grows.
- With \(\Phi\) as in part (c) we can
study \(\alpha(x,y)\) when \(|x|\) is large via the function \[
\Phi(\rho x) - \Phi(x),
\] noting that \(\rho x\) is the
proposal mean if the current state is \(x\). Describe in words how this function
will behave as \(|x| \to \infty\), what
this is likely to mean for the acceptance rate of the pCN algorithm, and
how you might diagnose this type of behaviour through algorithm trace
plots. This argument can be made rigorous when \(\Phi\) is twice continuously differentiable
using a Taylor series expansion of \(\Phi(y)\) and controlling the remainder,
since typically \(\Phi''\) is
bounded.
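Finally, two minimal sketches relating to this lecture's questions. The first is an independence sampler in which the target is \(N(0,1)\) and the proposal is \(N(0, s^2)\) with \(s > 1\), an illustrative choice for which \(q \geq \epsilon\pi\) holds with \(\epsilon = 1/s\).
```python
# Sketch: independence sampler with target N(0,1) and proposal N(0, s^2), s > 1.
# Acceptance probability: min(1, pi(y) q(x) / (pi(x) q(y))).
import numpy as np

rng = np.random.default_rng(4)
s = 3.0                                          # illustrative proposal scale (heavier tails)
log_pi = lambda x: -0.5 * x ** 2                 # target density, up to a constant
log_q = lambda x: -0.5 * (x / s) ** 2            # proposal density, up to a constant

n = 100_000
x = np.empty(n)
x[0] = 0.0
for i in range(n - 1):
    y = s * rng.standard_normal()
    log_ratio = (log_pi(y) + log_q(x[i])) - (log_pi(x[i]) + log_q(y))
    x[i + 1] = y if np.log(rng.uniform()) < log_ratio else x[i]

print(x.mean(), x.var())                         # should be close to 0 and 1
```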
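The second sketches the pCN proposal and the acceptance rate \(\min(1, \exp\{\Phi(y) - \Phi(x)\})\) from part (a); the particular \(\Phi\) (corresponding to an illustrative non-Gaussian target) and the value of \(\rho\) are assumptions made only for the demonstration.
```python
# Sketch: preconditioned Crank-Nicolson (pCN) on R with acceptance
# probability min(1, exp(Phi(y) - Phi(x))).
import numpy as np

rng = np.random.default_rng(5)
rho = 0.8                                 # illustrative correlation parameter
Phi = lambda x: -0.1 * x ** 4             # illustrative Phi; log pi(x) = Phi(x) - x^2/2 + C

n = 100_000
x = np.empty(n)
x[0] = 0.0
accepted = 0
for i in range(n - 1):
    y = rho * x[i] + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
    if np.log(rng.uniform()) < Phi(y) - Phi(x[i]):
        x[i + 1] = y
        accepted += 1
    else:
        x[i + 1] = x[i]

print("acceptance rate:", accepted / (n - 1))
```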