Dr Samuel Livingstone
Choose a prior distribution \(\pi_0\) over the parameter space \(\Theta\), expressing your beliefs about what \(\theta_0 \in \Theta\) might be
Update the prior into a posterior distribution using Bayes’ theorem, i.e. \[ \pi(\theta|y) = \frac{\pi_0(\theta)f(y|\theta)}{\int \pi_0(\vartheta)f(y|\vartheta)d\vartheta} \]
Question. What is \(f(y|\theta)\)?
We asked \(10\) students who took STAT0044 (my module at UCL) if they enjoyed it \[ X_i = \begin{cases} 1 ~~~ \text{ if student liked STAT0044} \\ 0 ~~~ \text{ otherwise.} \end{cases} \]
The model for the data is \[ \begin{aligned} X_i | p &\sim \text{Bernoulli}(p), \\ p &\sim U[0,1] \end{aligned} \]
Of course all the students liked the module…
This means \[ \begin{aligned} \pi(p|x) &= \frac{f(x|p)\pi_0(p)}{\int f(x|p)\pi_0(p)dp} \\ &= \frac{(p^n \cdot 1) \mathbb{I}_{[0,1]}(p)}{\int_0^1 {p}^n \cdot 1 dp} \\ &= (n+1)p^n\mathbb{I}_{[0,1]}(p). \end{aligned} \]
Recall that the Beta\((\alpha,\beta)\) distribution has pdf \[ f(p) \propto p^{\alpha - 1}(1-p)^{\beta - 1}\mathbb{I}_{[0,1]}(p) \]
Question: So what is the posterior distribution here?
par(mfrow = c(1,2))
n <- 10
f1 <- function(x) { dbeta(x, shape1 = 1, shape2 = 1) }
f2 <- function(x) { dbeta(x, shape1 = 1+n, shape2 = 1) }
curve(f1, from = 0, to = 1, main = "Prior", xlab = "p", ylab = "Density", ylim = c(0,10))
curve(f2, from = 0, to = 1, main = "Posterior", xlab = "p", ylab = "", ylim = c(0,10))
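A quick sanity check (not on the original slide) that the density derived above, \((n+1)p^n\), really is the Beta\((n+1,1)\) density plotted as the posterior:

```r
# The derived posterior density (n + 1) * p^n should coincide exactly with
# the Beta(n + 1, 1) density, confirming the answer to the question above.
n <- 10
p <- seq(0.01, 0.99, by = 0.01)
manual <- (n + 1) * p^n                       # density derived on the slide
beta_pdf <- dbeta(p, shape1 = n + 1, shape2 = 1)
max(abs(manual - beta_pdf))                    # should be numerically zero
```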
Definition An \(E\)-valued discrete-time stochastic process \(\{X_{t}\}_{t\geq0}\) is called a Markov chain if for any fixed \(t\) and any \(x_{t}\in E\) the random variable \[ X_{t+1}|(X_{t}=x_{t}) \] is independent of \(X_{t-k}\) for all \(t \geq k \geq 1\).
Symbolically this can be written \[ \left(X_{t+1}\perp\!\!\!\perp X_{t-k}\right)|X_{t}. \]
In words the future is independent of the past if we know the present.
The Markov property in terms of probabilities (ignoring measure theoretic subtleties) \[ \mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t},...,X_{0}=x_{0}\right)=\mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t}\right) \]
An initial distribution \(\mu_{0}\), such that \(\mathbb{P}(X_{0}\in A):=\mu_0(A)\) for all \(A\in\mathcal{E}\)
A transition kernel \[ P(x,A) := \mathbb{P}(X_{i} \in A|X_{i-1}=x) \] for all \(A\in\mathcal{E}\) and all \(x\in E\)
Then set \(i=1\), draw \(X_0 \sim \mu_0\) and repeat: draw \(X_i \sim P(X_{i-1},\cdot)\), then set \(i \to i+1\)
We will focus on time-homogeneous chains, meaning \(P\) does not depend on \(t\).
Let \(X_0 = 1\) or \(2\) w.p. \(1/2\), and introduce the matrix
\[ \bar{P} = \left( \begin{array}{cc} 1/3 & 2/3 \\ 2/3 & 1/3 \end{array}\right). \]
Let \(\mathbb{P}(X_{i}=y|X_{i-1}=x)=\bar{P}(x,y)\).
\(\bar{P}\) is called the transition matrix
The transition kernel is \[ P(x,A):=\sum_{y\in A}\bar{P}(x,y). \]
Normally in finite state spaces we just work with the matrix
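A minimal sketch (my own, not from the slides) of simulating this two-state chain in R with `sample()`:

```r
# Simulate the two-state chain with transition matrix Pbar, starting from
# X_0 = 1 or 2 with probability 1/2 each, and look at long-run proportions.
set.seed(1)
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
n_steps <- 10000
x <- numeric(n_steps)
x[1] <- sample(1:2, size = 1)                       # X_0 uniform on {1, 2}
for (i in 2:n_steps) {
  # Row x[i - 1] of Pbar gives the conditional distribution of X_i
  x[i] <- sample(1:2, size = 1, prob = Pbar[x[i - 1], ])
}
table(x) / n_steps   # long-run proportions, close to (1/2, 1/2)
```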
Let \(X_0 \sim N(0,1)\) and introduce the recursion \[ X_{i}=\gamma X_{i-1}+\sqrt{(1-\gamma^{2})}Z_{i}, \] where each \(Z_{i}\stackrel{iid}{\sim} N(0,1)\) and \(\gamma\in\mathbb{R}\) with \(|\gamma|<1\).
Here the kernel is \[ P(x,A)=\frac{1}{\sqrt{2\pi(1-\gamma^{2})}}\int_{y\in A}\exp\left(-\frac{1}{2(1-\gamma^{2})}(y-\gamma x)^{2}\right)dy. \]
We can write \(P(x,A) = \int_A p(x,y)dy\) and call \(p(x,y)\) the transition density.
More generally we can write \(P(x,A) = \int_A P(x,dy)\)
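The AR(1) recursion above is easy to simulate directly; a short sketch (illustrative, with `gamma = 0.8` chosen arbitrarily) checking that the \(N(0,1)\) marginals are preserved:

```r
# Simulate the AR(1) chain X_i = gamma * X_{i-1} + sqrt(1 - gamma^2) * Z_i.
# With X_0 ~ N(0,1), every X_i is exactly N(0,1), so the sample mean and
# variance should stay near 0 and 1 for any |gamma| < 1.
set.seed(2)
gamma <- 0.8
n_steps <- 50000
x <- numeric(n_steps)
x[1] <- rnorm(1)
for (i in 2:n_steps) {
  x[i] <- gamma * x[i - 1] + sqrt(1 - gamma^2) * rnorm(1)
}
c(mean = mean(x), var = var(x))   # approximately (0, 1)
```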
If \(X_0 \sim \mu\) and \(X_{t+1}|(X_t = x) \sim P(x,\cdot)\), then marginally \[ \mathbb{P}(X_1 \in A) = \int P(x,A)\mu(dx). \]
We sometimes define the probability measure \(\mu P\) s.t. for any \(A \in \mathcal{E}\) \[ \mu P(A) := \int \mu(dx)P(x,A) \] (and write \(X_1 \sim \mu P\))
Note that in the finite setting this integral can be re-written \[ \begin{aligned} \mathbb{P}(X_1 \in A) &= \sum_{x \in E} \mu(\{x\})P(x,A) \\ &= \sum_{x \in E} \mathbb{P}(X_0 =x) \mathbb{P}(X_1 \in A |X_0 = x) \end{aligned} \] (i.e. just the law of total probability)
With \(\bar{P}\) as above, recall the Chapman–Kolmogorov equations \[ \mathbb{P}(X_{t+n} = y|X_t = x) = \bar{P}^n(x,y) \]
Examining this matrix for different values of \(n\) gives \[ \begin{aligned} \bar{P} & =\left(\begin{array}{cc} 1/3 & 2/3\\ 2/3 & 1/3 \end{array}\right)\\ \bar{P}^{2} & =\left(\begin{array}{cc} 0.556 & 0.444\\ 0.444 & 0.556 \end{array}\right)\\ \bar{P}^{5} & =\left(\begin{array}{cc} 0.498 & 0.502\\ 0.502 & 0.498 \end{array}\right) \end{aligned} \]
Question. What is happening to each row of \(\bar{P}^{n}\) as \(n\to\infty\)?
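The matrix powers above can be reproduced in R; `matpow` is a small helper of my own (base R has no built-in matrix power):

```r
# Compute powers of Pbar to see each row approach the limit (1/2, 1/2).
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
# Repeated matrix multiplication via Reduce
matpow <- function(P, n) Reduce(`%*%`, replicate(n, P, simplify = FALSE))
matpow(Pbar, 2)
matpow(Pbar, 5)
matpow(Pbar, 20)   # rows essentially equal to (0.5, 0.5)
```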
Many Markov chains have the property that when \(n\) is large the distribution of \(X_{n}\) begins to stabilise
The phenomenon is called convergence to equilibrium or mixing
To define it mathematically we will need a precise notion of distance between distributions
For now we will work with the total variation distance (TV). For distributions \(\mu\) and \(\nu\) on \((E,\mathcal{E})\) \[ \|\mu - \nu\|_{TV} := \sup_{A \in \mathcal{E}}|\mu(A) - \nu(A)| \]
TV is within the family of integral probability metrics, so can also be written \[ \|\mu - \nu\|_{TV} = \sup_{f :E \to [0,1]}|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \]
If \(\mu\) and \(\nu\) have densities then it can also be written (see exercises) \[ \|\mu - \nu\|_{TV} = \frac{1}{2} \int |\mu(x) - \nu(x)|dx \]
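For finite state spaces the density formula becomes half the \(\ell_1\) distance between probability vectors, which makes convergence to equilibrium easy to track numerically (a sketch of my own, reusing the two-state matrix \(\bar{P}\)):

```r
# Total variation distance between row 1 of Pbar^n and the limit (1/2, 1/2),
# using the finite-state version of the formula: half the L1 distance.
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
pi_lim <- c(1/2, 1/2)
tv <- function(mu, nu) 0.5 * sum(abs(mu - nu))
Pn <- Pbar
for (n in 1:5) {
  cat("n =", n, " TV =", tv(Pn[1, ], pi_lim), "\n")   # decays like (1/3)^n / 2
  Pn <- Pn %*% Pbar
}
```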
If a Markov chain with transition kernel \(P\) is \(\pi\)-invariant, \(\pi\)-irreducible and aperiodic, then \(\pi\) is the unique limiting distribution for the chain.
Definition. A finite state space chain with transition matrix \(\bar{P}\) is \(\pi\)-reversible if \[ \pi(x)\bar{P}(x,y) = \pi(y)\bar{P}(y,x) \] for all \(x,y \in E\).
For a general state space chain with a transition kernel \(P\) the condition can be written \[ \pi(dx)P(x,dy) = \pi(dy)P(y,dx) \]
If \(P(x,\cdot)\) has a transition density \(p(x,y)\) for all \(x \in E\) then \(\pi\) has a density and we can write \[ \pi(x)p(x,y) = \pi(y)p(y,x) \]
(‘\(P\) has a transition density \(p\)’ just means \(P(x,A) = \int_A p(x,y)dy\))
If a Markov chain is \(\pi\)-reversible, then it is \(\pi\)-invariant.
Proof. In the general case, we need to show that \(X_n \sim \pi \implies X_{n+1} \sim \pi\), i.e. that \(\mathbb{P}(X_{n+1} \in A) = \pi(A)\) for any \(A \in \mathcal{E}\)
Recall that if \(X_n \sim \pi\) then \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_n \in E}\pi(dx_n)P(x_n, A) = \int_{x_{n+1} \in A} \int_{x_n \in E} \pi(dx_n)P(x_n, dx_{n+1}) \]
Now by \(\pi\)-reversibility \(\pi(dx_n) P(x_n,dx_{n+1}) = \pi(dx_{n+1}) P(x_{n+1},dx_n)\), meaning we can instead write \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_n \in E} \int_{x_{n+1} \in A} \pi(dx_{n+1}) P(x_{n+1},dx_n) \]
Using the Fubini–Tonelli theorem the integrals can be switched around, meaning \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_{n+1} \in A} \left[ \int_{x_n \in E} P(x_{n+1},dx_n) \right] \pi(dx_{n+1}) \]
But since \(P(x_{n+1}, \cdot)\) is a probability measure then \(P(x_{n+1},E) = 1\), meaning \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_{n+1} \in A} \pi(dx_{n+1}) = \pi(A). \]
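The finite-state version of this result can be checked numerically. The sketch below (my own illustration, not from the slides) builds a \(\pi\)-reversible matrix on \(\{1,2,3\}\) via a Metropolis-style construction and verifies both detailed balance and invariance:

```r
# Build a pi-reversible transition matrix on {1, 2, 3} using a uniform
# proposal over the other two states and a Metropolis acceptance ratio,
# then verify that detailed balance implies pi P = pi, as in the proof.
pi_target <- c(0.2, 0.3, 0.5)
P <- matrix(0, 3, 3)
for (x in 1:3) for (y in 1:3) {
  if (x != y) P[x, y] <- 0.5 * min(1, pi_target[y] / pi_target[x])
}
diag(P) <- 1 - rowSums(P)            # remaining mass stays put
# Detailed balance: pi(x) P(x, y) = pi(y) P(y, x), i.e. D is symmetric
D <- diag(pi_target) %*% P
max(abs(D - t(D)))                   # should be numerically zero
# Invariance: pi P = pi
max(abs(pi_target %*% P - pi_target))  # should be numerically zero
```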
If a Markov chain \(\{X_{t}\}_{t\geq0}\) is \(\pi\)-irreducible and \(\pi\)-invariant, and \(f: E \to\mathbb{R}\) satisfies \(\mathbb{E}_{\pi}[|f(X)|]<\infty\), then the ergodic average \[ \hat{f}_{n}:= \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i}) \] satisfies \[ \mathbb{P}\left(\lim_{n\to\infty}\hat{f}_{n}=\mathbb{E}_{\pi}[f(X)]\right)=1. \]
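The ergodic theorem can be illustrated with the AR(1) chain from earlier (a sketch of my own; \(f(x) = x^2\) and `gamma = 0.5` are arbitrary choices):

```r
# Illustrate the ergodic theorem: the ergodic average of f(X) = X^2 along
# the AR(1) chain should converge to E_pi[X^2] = 1 under pi = N(0, 1).
set.seed(3)
gamma <- 0.5
n_steps <- 100000
x <- numeric(n_steps)
x[1] <- rnorm(1)
for (i in 2:n_steps) {
  x[i] <- gamma * x[i - 1] + sqrt(1 - gamma^2) * rnorm(1)
}
mean(x^2)   # should be close to 1
```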
We will find out next time!