Outline

Gradient-based algorithms
- Metropolis-adjusted Langevin algorithm (MALA)
- Unadjusted algorithms & stochastic gradients
- Hamiltonian Monte Carlo
Other approaches
- Parallel tempering
- Adaptive MCMC
(Appendix: case studies)
- (Bayesian variable selection in regression)
- (Modelling radioactivity on Rongelap island via Generalised linear spatial models)

Part 1: Gradient-based algorithms

Motivation for better Metropolis–Hastings

Recall that in Metropolis–Hastings the key design choice is the candidate kernel/proposal distribution \[ Q(x,\cdot) \]
We have seen that simple random walk moves can be surprisingly effective
You can (indeed you should) always run longer chains to ensure the ESS is sufficiently high etc.
But no information about \(\pi\) is used to design \(Q(x,\cdot)\)
Can we do better?

How does the Random Walk Metropolis perform in high dimension?

Consider a simple choice of target \(\pi^{(d)}(x)\) as \(N(0,I_{d \times d})\), for \(d \in \{10,20,30,40,50,60,70,80,90,100\}\)
We implement an optimally tuned Random Walk Metropolis in each case (\(\alpha \approx 0.234\))
Then we study the mixing time, meaning \[ t^*(\epsilon) := \inf\{t \geq 0 : \|P^t(x,\cdot) - \pi\|_{TV} \leq \epsilon \} \]

The mixing time is typically \(O(d)\) (good, but let’s try to do better)

A popular modern strategy for constructing Metropolis–Hastings algorithms

Find a \(\pi\)-invariant stochastic process \((Y_t)_{t\geq 0}\)
- (usually continuous time, usually not possible to simulate exactly)
Approximate the dynamics of the process
- (this is construction of \(Q(x,\cdot)\))
Correct for approximation errors with Metropolis–Hastings step
- (ensures \(\pi\) is equilibrium distribution)

We will focus on \(E = \mathbb{R}^d\), but strategy can be employed on any space if a \(\pi\)-invariant process can be found

Diffusion processes

One example of a continuous-time stochastic process that cannot usually be simulated is a diffusion process
The dynamics are described via a stochastic differential equation (SDE)
- Question. How many people have studied diffusions/SDEs?
Consider diffusion process \((Y_t)_{t \geq 0}\), with corresponding SDE \[ dY_t = b(Y_t)dt + \sigma(Y_t)dW_t, \qquad Y_0 = y_0 \]
This is just notation!
- \(b:\mathbb{R}^d \to \mathbb{R}^d\) drift, \(\sigma:\mathbb{R}^d \to \mathbb{R}^{d \times d}\) volatility
If \(Y_t = y_t\) then for small \(\delta > 0\) \[ \begin{aligned} \mathbb{E}[Y_{t+\delta}] &= y_t + b(y_t)\delta + o(\delta) \\ \text{Cov}(Y_{t+\delta}) &= \sigma(y_t)\sigma(y_t)^T\delta^2 + o(\delta^2) \end{aligned} \]
The process \((W_t)_{t \geq 0}\) is called a standard Wiener process, or Brownian motion
- \(W_0 = 0\), and for any \(t > s > 0\) \[ W_t - W_s \sim N(0,t-s I_{d \times d}) \]

Simulating diffusion processes

Note that the true process satisfies \[ Y_{t+h} = Y_t + \int_t^{t+h} b(Y_s)ds + {\int_t^{t+h}\sigma(Y_s)dW_s}_{\text{Stochastic `Ito' integral}} \]
Assume that \(b(Y_s) \approx b(Y_t)\) over the interval \(s \in [t,t+h]\) (and similarly for \(\sigma(Y_t)\)), for small enough \(h>0\)
This gives the approximation \[ Y_{t+h} \approx Y_t + b(Y_t)\int_t^{t+h}ds + \sigma(Y_t)\int_t^{t+h} dW_s \]
The integrals can now be evaluated, giving \[ Y_t + b(Y_t)h + \sigma(Y_t)(W_{t+h} - W_t) \]
Rearranging and writing \(W_{t+h} - W_t := \sqrt{h}Z_{t+h}\), where \(Z_{t+h} \sim N(0,I)\) gives Euler–Maruyama numerical scheme \[ Y_{t+h} = Y_t + b(Y_t)h + \sqrt{h}\sigma(Y_t)Z_{t+h} \]

A simulation of the diffusion with SDE \(dY_t = dW_t\) for \(t\in [0,1]\) (Euler–Maruyama skeleton + interpolation)

set.seed(5367)
delta <- 1/1000        # time step
dW <- sqrt(delta)*rnorm(1000)
W <- cumsum(dW)
plot((1:1000)/1000,W, type = 'l', xlab = "t",ylab = "W_t")

The (overdamped) Langevin diffusion

Let \((Y_t)_{t\geq 0}\) be a diffusion process, and write \(Y_t \sim \mu_t\)
Our goal is to construct a diffusion process for which as \(t\to\infty\) \[ \|\mu_t - \pi\|_{TV} \to 0 \]
Question. What should we call \(\pi\) if this is the case?
The following diffusion has exactly this property under mild regularity conditions on \(\pi\) \[ dY_t = \frac{1}{2}\nabla\log\pi(Y_t)dt + dW_t \]
It is not hard to show that \(\pi\) is invariant using the Fokker–Planck/Kolmogorov forward equation (I will not cover this)

Using the Langevin diffusion to construct MCMC proposals

The Euler–Maruyama numerical scheme can be used to generate Metropolis–Hastings proposals
If current state \(X_n = x\) we can choose our Metropolis–Hastings proposal as \[ Y = x + \frac{h}{2}\nabla\log\pi(x) + \sqrt{h}Z \] where \(Z \sim N(0,I)\)
This means that candidate kernel \(Q(x,\cdot)\) takes the form \[ Q(x,\cdot) = N\left(x+\frac{h}{2}\nabla\log\pi(x), hI\right) \]
We then accept-reject the proposed move using the Metropolis–Hastings ratio \[ \alpha(x,y) = \min\left(1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \right) \]
This is called the Metropolis-adjusted Langevin algorithm (MALA)
- Question. Does \(q(x,y) = q(y,x)\) here?

Observations about MALA

The proposal combines a gradient step towards the mode of \(\pi\) and some noise
This is an exploration/exploitation trade-off
- smaller noise makes the proposal closer to an optimiser
- smaller gradient step makes it more like a random walk
- Just the right relative size of each gives Langevin
Knowing only \(c\pi\) for unknown \(c\) means we know \[ \log\pi(x) + \log(c) \] so \(\nabla\log\pi(x)\) can be calculated exactly in case of unnormalised \(\pi\)
Proposals are no longer blind (information about \(\pi\) used in their construction)
Question. Can we use MALA when \(E\) is not \(\mathbb{R}^d\)

What is known about MALA

Recall that for Random Walk Metropolis it is asymptotically (in \(d\)) optimal to accept 23.4% of proposals
The same optimal acceptance rate analysis for MALA gives an optimum \[ \alpha^* = 0.57 \]
Intuitively this is good, more proposals should be accepted!
The convergence rate/mixing time typically scales either \(O(d^{1/3})\) or \(O(d^{1/2})\) depending on \(\pi\)
- This is better than RWM \(O(d)\)
Cost per step is larger than RWM (need to calculate gradient)
- How much larger depends on the model
MALA often not much better than RWM for heavy tails (\(\nabla\log\pi(x) \to 0\)) and also struggles with lighter than Gaussian tails (\(\nabla\log\pi(x)\) unstable, changes too quickly)
- Roberts & Tweedie (1996) showed it is geometrically ergodic when the tails are between exponential and Gaussian
- In the lighter than Gaussian case alternative numerical schemes can be used

Empirical comparison vs Random walk

Show output

Stylised example model class

Consider target distributions of the form \[ \pi(x) \propto \exp\left\{ -\frac{x^{\beta}}{\beta} \right\} \mathbb{I}(x>0) \]
Gradients for \(x > 0\) take the form \[ \nabla\log\pi(x) = -x^{\beta - 1} \]
Question. What kind of tails are \(\beta <1\), \(\beta = 1\), \(\beta = 2\), \(\beta > 2\) cases?

MALA for the stylised example class

The MALA proposal for \(x > 0\) takes the form \[ Y = x - \frac{h}{2}x^{\beta - 1} + \sqrt{h}Z \]
Question. What does the proposal become when \(\beta = 1/2\) and \(x \to \infty\)?
Question. What does the proposal become when \(\beta = 4\) and \(x \to \infty\)?

Plots for the stylised example class

Remarks on other gradient-based algorithms

Very much the state of the art for Euclidean state space problems
Hamiltonian Monte Carlo is a popular and more complex alternative
- More accurately MALA is a special case of Hamiltonian Monte Carlo
- HMC can achieve \(O(d^{1/4})\) mixing in the right setting
Simple and more robust alternatives have also been developed (e.g. Barker proposal of L & Zanella (2022), Proximal algorithms (Benko, Chlebicka, Endal & Miasojedow, 2025))
The situation is very different on discrete spaces, where gradients cannot be used

A flavour of Hamiltonian Monte Carlo (HMC) 1

Imagine \(U(x) = -\log\pi(x)\) as a valley
The current state of the chain as a football
To generate a proposal we kick the football with arbitrary force, giving it a random initial momentum
Then we let it roll up and down the valley
- Imagine that there is no friction, so if we leave it, the ball will keep rolling forever
At some point we stop and take the position of the football as the next state of the chain

A flavour of Hamiltonian Monte Carlo (HMC) 2

A flavour of Hamiltonian Monte Carlo (HMC) 3

The ’frictionless ball rolling is actually Hamiltonian dynamics
- Hamilton’s formulation of Newton’s laws of motion
- Hamilton’s equations: system of ODEs to be solved numerically
Key additional tuning parameter: how long to integrate for
I won’t give a full description of the algorithm
Generally mixing is \(O(d^{1/4})\) and optimal acceptance rate 0.651 (Beskos et al., 2013)
Suffers from the same problems as MALA for very heavy/light tails
- Geometric ergodicity/lack of in same settings as MALA (L et al., 2019)
Implemented in many major software packages (Stan, pyMC, rmcmc)

Example implementation

Show animation

Part 2: Unadjusted sampling algorithms

What is an unadjusted algorithm?

MALA is so-called because a numerical approximation to a \(\pi\)-invariant process is ‘adjusted’ via the Metropolis–Hastings step
This is done to ensure that the algorithm still produces a \(\pi\)-invariant Markov chain
Without the MH step \(\pi\) is typically no longer equilibrium distribution, HOWEVER
- numerous authors argue that equilibrium is never reached anyway
- often the equilibrium distribution \(\pi_h\) is not far from \(\pi\) (e.g. in TV)
- Ignoring the MH step saves on computation

The unadjusted Langevin algorithm

The unadjusted Langevin algorithm is defined through the simple recursion \[ X_{n+1} = X_n + \frac{h}{2}\nabla\log\pi(X_n) + \sqrt{h}Z_{n+1} \] where \(Z_{n+1} \sim N(0,I)\)
This is simply the Euler–Maruyama discretisation of the (overdamped) Langevin diffusion
Many results exist showing when \((X_n)_{n \geq 0}\) has an equilibrium distribution \(\pi_h\) and how close this is to \(\pi\)
Also mixing time bounds exists for \(\pi\) itself
- Often we can get within distance \(\epsilon\) of \(\pi\) in finite time, even if we never actually reach it (e.g. Dalalyan (2017), Durmus, Majewski & Miasojedow (2019))
Generally preferred by numerical analysts, also called Langevin Monte Carlo

Stochastic gradient Langevin dynamics (SGLD)

Without the costly Metropolis–Hastings step, we can also save further by approximating \(\nabla\log\pi\)
Imagine \(\pi\) is of the form \[ \pi(\theta|y) \propto \prod_i f(y_i|\theta) \pi_0(\theta) \]
Question. If \(y_i\) are data points, what kind of assumption are we making about them?
If this is the case then \[ \nabla\log\pi(\theta|y) = \nabla\log\pi_0(\theta) + \sum_{i=1}^n \nabla\log f(y_i|\theta) \]
This computation is \(O(n)\). We can approximate through subsampling
- Sample \(N \ll n\) integers between \(1\) and \(n\), store as the set \(I_N\) (with \(|I_N| = N\))
- Approximate gradient with \[ \nabla\log\pi_0(\theta) + \frac{n}{N} \sum_{j \in I_N} \nabla\log f(y_j|\theta) \]
Question. What is the new computational cost?

Remarks on SGLD

Can be implemented with either fixed or decreasing step-size (fixed more common, simpler to analyse)
Often \(N = cn\) size subsample is needed to guarantee efficient sampling, which is a negative
- But the constant \(c\) matters and is often small
- further improvements can be made using control variates (I will not cover this)
- see Welling & Teh (2011) or Nemeth & Fearnhead (2021) for more
SGLD is popular in machine learning, particularly Bayesian deep learning
Generally useful in the ‘tall data scenario’ (\(n \gg d\)), which is not always interesting statistically
- Posterior often looks Gaussian (Bernstein–Von Mises), see Bardenet et al (2017)
- Lots of data per parameter

Part 3: Parallel Tempering + adaptive MCMC

There are many more elaborate and exciting MCMC algorithms
I will explain one, parallel tempering, and then one technique to automate tuning

Multi-modal target distributions

Gradient-based methods do not typically work well if the target distribution has a multi-modal density
- Question. Why do you think this is?
In this instance many alternative strategies are based on tempering the target \[ \pi(x) \to \pi(x)^\beta \] for some \(\beta < 1\)
The name comes from tempering metal (heating it up so it’s easier to work with)
\(1/\beta\) is called the temperature, so \(\beta < 1 \implies\) heating up the target
This typically reduces the level of multi-modality

Tempering algorithms

There are many ways to take this idea and use it to create sampling/optimisation algorithms
- Simulated annealing (optimisation)
- Simulated tempering (sampling)
- Parallel tempering (sampling)
We will focus on the last of these
Parallel tempering algorithm:
- Create \(K\) parallel Markov chains each targeting \(\pi^{\beta_k}\) for some \(\beta_k\), \(1 \leq k \leq K\)
- At pre-specified intervals propose swaps between adjacent chains
- In the hotter chains travel between modes is easier, and then states swap with the cooler chains

Example: mixture of 2 Gaussians, means 8 and -8

Algorithm design for tempering - natural questions

How many parallel chains?
At what temperatures should they be set?
How often should swaps be proposed?
What proportion of swaps should be accepted?

Adaptive MCMC - Motivation

One reason that the Gibbs sampler remains popular is that there are no tuning parameters
E.g. in the Random Walk Metropolis \[ Y \sim N(x, h^2 I) \] we must choose a step-size \(h\)
Often there are many more parameters to choose (e.g. preconditioning matrix \(\Sigma \in \mathbb{R}^{d \times d}\))
It can be hard to find a good choice (through e.g. pilot runs/mathematical calculations)
But good choices are crucial to algorithm performance
Adaptive MCMC is an attempt to solve this issue

Adaptive MCMC - basic idea

In adaptive MCMC we try to learn good choices for tuning parameters ‘on-the-fly’
1. Start with an arbitrary choice
2. Run the algorithm for \(k\) iterations
3. Check some statistics and update our choices accordingly
4. Repeat until sufficiently good choices are found

A very simple adaptive MCMC algorithm

Take the simple 1D random walk Metropolis, with proposal \[ Y \sim N(x, h^2) \]
We might want to choose \(h\) s.t. \(\mathbb{E}[\alpha] \approx 0.44\)
So we can do the following:
1. Start with \(h = h_0\)
2. Run algorithm for \(k=100\) (say) iterations
3. Calculate the average acceptance rate
4. Adapt \(h\) based on how far this is from 0.44
5. Repeat until some criterion is met (e.g. close to 0.44, max iterations etc.)
Question. If the average acceptance rate is larger than 0.44, should we increase or decrease \(h\)?

Code for the simple adaptive MCMC algorithm

RWM_output <- RWM(logpi = logpi_gauss, 
                    nits = adapt_iter,
                    h = h[k],
                    x_curr = start_x[k])

av_accept[k] <- RWM_output$a_rate

# update h based on average_accept_rate
if (av_accept[k] > 0.5) {
  h[k+1] <- h[k] * 1.2
} else if (av_accept[k] < 0.4) {
  h[k+1] <- h[k] * 0.8
} else {
  h[k+1] <- h[k]
}

Results for the simple adaptive MCMC algorithm

## Epoch:  1 h =  100 Average acceptace rate:  0.03 
## Epoch:  2 h =  80 Average acceptace rate:  0.01 
## Epoch:  3 h =  64 Average acceptace rate:  0.02 
## Epoch:  4 h =  51.2 Average acceptace rate:  0.02 
## Epoch:  5 h =  40.96 Average acceptace rate:  0.03 
## Epoch:  6 h =  32.768 Average acceptace rate:  0.01 
## Epoch:  7 h =  26.2144 Average acceptace rate:  0.04 
## Epoch:  8 h =  20.97152 Average acceptace rate:  0.05 
## Epoch:  9 h =  16.77722 Average acceptace rate:  0.06 
## Epoch:  10 h =  13.42177 Average acceptace rate:  0.1 
## Epoch:  11 h =  10.73742 Average acceptace rate:  0.15 
## Epoch:  12 h =  8.589935 Average acceptace rate:  0.12 
## Epoch:  13 h =  6.871948 Average acceptace rate:  0.18 
## Epoch:  14 h =  5.497558 Average acceptace rate:  0.23 
## Epoch:  15 h =  4.398047 Average acceptace rate:  0.25 
## Epoch:  16 h =  3.518437 Average acceptace rate:  0.35 
## Epoch:  17 h =  2.81475 Average acceptace rate:  0.32 
## Epoch:  18 h =  2.2518 Average acceptace rate:  0.48 
## Epoch:  19 h =  2.2518 Average acceptace rate:  0.44 
## Epoch:  20 h =  2.2518 Average acceptace rate:  0.37

A word of caution about adaptation

Naively adapting the tuning parameters of the Markov kernel is dangerous
- We are no longer dealing with a time-homogeneous chain (the kernel changes each iteration)
- The way the kernel changes depends on the past (may no longer be Markov)
To prove results about adaptive MCMC is therefore challenging
In practice the simplest way of avoiding issues is to (safely) adapt during Burn in, and then stop
I just gave a very simple way of adapting, there are far more sophisticated approaches (that I will not cover)
Also lots of open theoretical questions in the area (e.g. which adaptive schemes are better?)

Some further MCMC algorithms that I didn’t mention

Reversible jump MCMC
- sampling between different spaces of models
- aim to choose the best one or average over them
Pseudo-marginal MCMC
- Running an exact MCMC algorithm when the likelihood is intractable
- Typically relies on importance sampling to unbiasedly estimate likelihoods
Interacting particle systems (e.g. Stein variational gradient descent)
- Evolving several particles along a ‘gradient flow’ to reach a particle approximation to \(\pi\)
Non-reversible algorithms
- E.g. piece-wise deterministic Markov processes, which can be exactly simulated in continuous-time

End of module - Do zobaczenia!

I hope you enjoyed the course and am happy to answer any further questions over e-mail!

Appendix: Case studies

I will look at two case examples of MCMC in practice
- Bayesian Variable selection for genomic data analysis
- Generalised linear spatial models for radioactivity on a pacific island

Case Study 1: Bayesian Variable Selection

The Variable selection problem

A very famous problem in probabilistic modelling
Given data \((y_i,x_{i1},...,x_{ip})_{i=1}^n\), which of the covariates (\(x\)’s) should I use to predict \(y\)?
We will consider the setting of linear regression here

Solution 1: Choose the best model

We can use many heuristic model selection procedures
- forward selection
- backwards elimination
We can also use more modern penalized approaches
- e.g. LASSO
These approaches will choose the single best model for the data
If there are two correlated covariates, it will typically choose one
- Question. What do we call two correlated covariates in linear regression?
This is suboptimal if our primary interest is in learning which covariates might be associated with the response

Example: genomic data

A typical genomic dataset contains the covariates relating to many different genes in the body
The response might be whether someone has a disease, or some continuous value that is higher if people have a disease than not (e.g. blood pressure, testosterone etc.)
Scientists want to know all the genes that might be associated with the disease, so that they can investigate those genes further
Typically the number of possible genes (number of predictors \(p\)) is much larger than the sample size \(n\)
This setting is known as high-dimensional statistics

Bayesian variable selection via MCMC

Introduce \(\gamma = \{0,1\}^p\), where \[ \gamma_j = 1 \implies \beta_j \neq 0 \]
This controls whether the \(j\)th covariate is included in the model
The full linear regression model is now: \[ y_i = \alpha + \beta_1 \gamma_1 x_{i1} + ... + \beta_p\gamma_p x_{ip} + \epsilon_i \]
Each particular value for \(\gamma\) is associated with a particular model \(M_\gamma\)
- Question. If only \(\gamma_1\) and \(\gamma_p\) are not zero, what is model \(M_\gamma\)?

Choice of prior distribution

We will make the problem easier and choose conjugate priors where we can
Conditional on a certain value for \(\gamma\), we can choose priors for \(\beta\), \(\alpha\) and \(\sigma^2\) such that \(\pi(\beta, \sigma^2, \alpha|y, X, \gamma)\) is tractable
These can also be chosen in such a way that the marginal posterior \(\pi(\gamma|y,X)\) can be computed
Note: this cannot be done in general, only for linear regression

Example prior specification

One example is called the g-prior. Gives marginal posterior (for positive real \(g\)) \[ \pi(\gamma|y,X) \propto \pi_0(\gamma) \frac{(1+g)^{\frac{n-p_\gamma-1}{2}}}{(1+g(1-R_\gamma^2))^{\frac{n-1}{2}}} \]
Question. What do you think \(R_\gamma^2\) is?
Question. What do you think \(p_\gamma\) is?

MCMC over \(\gamma\) space

Our task is now to perform MCMC over the space \(\{0,1\}^p\)
Question. Mathematically, what is this space?
We will introduce a form of ‘add-delete-swap sampler’ for this

Add-Delete-Swap sampler

At each iteration, given current state \(\gamma \in \{0,1\}^p\), do the following:

Decide whether to ADD/DELETE w.p. \(\lambda\) or SWAP w.p. \(1-\lambda\)
First set proposal \(\gamma^* = \gamma\). Then do the following
1. If ADD/DELETE, draw \(j \sim \mathcal{U}[{1,...,p}]\) and set \(\gamma^*_j = 1-\gamma_j\)
2. If SWAP, choose a \(j\) for which \(\gamma^*_j = 1\) and a \(k\) for which \(\gamma^*_k = 0\) and swap them (both choices uniform)
Set the next state to be \(\gamma^*\) with probability \(\alpha(\gamma, \gamma^*)\), otherwise set it to be \(\gamma\), where \[ \alpha(\gamma,\gamma^*) = \min \left( 1, \frac{\pi(\gamma^*|y,X)}{\pi(\gamma|y,X)} \right). \]

Question. Why do I only need a Metropolis step here, not full Metropolis–Hastings with the \(q(\gamma^*,\gamma)/q(\gamma,\gamma^*)\) part?

Some remarks

The SWAP move is important for mixing when dealing with correlated predictors
This sampler isn’t fool proof and developing alternatives is an active research area

One useful posterior summary

The posterior inclusion probability (or PIP) for predictor \(j\) is \[ \mathbb{P}_\pi(\gamma_j = 1|y, X) \]
This tells us the posterior probability that covariate \(j\) should be included in the model
Any PIP significantly bigger than 0 might be of interest

Examples

See page 178-183 here

Case study 2: GLSMs for radioactivity estimation

Generalised Linear Spatial Models (GLSMs)

Consider observations \((y_i,z_i,x_i)\)
- \(y_i\) treated as realisations of \(Y_i\) following some exponential family distribution
- \(z_i\) spatial location (e.g. in \(\mathbb{R}^2\)) associated with sample \(i\)
- \(x_i\) other covariates associated with sample \(i\)
Generalised linear spatial model (GLSM) takes the form \[ h(\mathbb{E}[Y_i|S(z_i)]) = S(z_i) + x_i^T\beta, \] for some link function \(h\)
\((S(z))_{z \in \mathbb{R}^2}\) is some unknown function defined over \(z\)-space. Represents some latent intensity field
Interest is usually in some expectations from the posterior \[ \pi(S(z)|y,x) \]

Radioactivity at Rongelap island

Rongelap Island forms part of the Republic of the Marshall Islands
Located in the Pacific Ocean ~2500 miles from Hawaii
Experienced contamination following Bikini Atoll nuclear weapons testing programme in 1950s
Former inhabitants lived in self-imposed exile on neighbouring island since 1985.
Marshall Islands National Radiological Survey has examined the current levels of contamination by counting photon emissions at \(n=157\) locations over the island.
Good reasons to model data as Poisson counts with spatial correlation

The Marshall Islands (from Encyclopedia Brittanica)

The dataset (Diggle, Tawn & Moyeed, 1997)

The Posterior

The model here is \[ Y_i |S(z_i) \sim \text{Poisson} (e^{S(z_i)}) \]
The prior on the spatial intensity field \(S(z)\) is a Gaussian process, which means \[ S(z_i) \sim N(0,\sigma^2), \qquad \text{Cov}(S(z_i),S(z_j)) = \sigma^2\exp\left\{\phi^{-1}\|z_i - z_j\|^2 \right\} \]
The parameters \(\sigma^2\) and \(\phi\) will also have a hyperprior \(\pi_0(\sigma^2,\phi)\)
Posterior for \((S(z_1),S(z_2),...,S(z_n),\sigma,\phi)|y\) is therefore 159-dimensional and quite complicated!

MCMC strategy

Common to use a blocking strategy, alternating \[ (\sigma^2,\phi)|S,y, \qquad S|(\sigma^2,\phi,y) \]
Use something like MALA to update \(S|(\sigma^2,\phi,y)\) (high-dimensional, smooth)
Same can be used for \((\sigma^2,\phi)|S,y\), or something more bespoke (low-dimensional, more irregular)

Markov chain Monte Carlo sampling algorithms

Lecture 5

Last time

Outline

Part 1: Gradient-based algorithms

Motivation for better Metropolis–Hastings

How does the Random Walk Metropolis perform in high dimension?

The mixing time is typically \(O(d)\) (good, but let’s try to do better)

A popular modern strategy for constructing Metropolis–Hastings algorithms

Diffusion processes

Simulating diffusion processes

A simulation of the diffusion with SDE \(dY_t = dW_t\) for \(t\in [0,1]\) (Euler–Maruyama skeleton + interpolation)

The (overdamped) Langevin diffusion

Using the Langevin diffusion to construct MCMC proposals

Observations about MALA

What is known about MALA

Empirical comparison vs Random walk

Stylised example model class

MALA for the stylised example class

Plots for the stylised example class

Remarks on other gradient-based algorithms

A flavour of Hamiltonian Monte Carlo (HMC) 1

A flavour of Hamiltonian Monte Carlo (HMC) 2

A flavour of Hamiltonian Monte Carlo (HMC) 3

Example implementation

Part 2: Unadjusted sampling algorithms

What is an unadjusted algorithm?

The unadjusted Langevin algorithm

Stochastic gradient Langevin dynamics (SGLD)

Remarks on SGLD

Part 3: Parallel Tempering + adaptive MCMC

Multi-modal target distributions

Tempering example

Tempering algorithms

Example: mixture of 2 Gaussians, means 8 and -8

Algorithm design for tempering - natural questions

Adaptive MCMC - Motivation

Adaptive MCMC - basic idea

A very simple adaptive MCMC algorithm

Code for the simple adaptive MCMC algorithm

Results for the simple adaptive MCMC algorithm

A word of caution about adaptation

Some further MCMC algorithms that I didn’t mention

End of module - Do zobaczenia!

Appendix: Case studies

Case Study 1: Bayesian Variable Selection

The Variable selection problem

Solution 1: Choose the best model

Example: genomic data

Bayesian variable selection via MCMC

Choice of prior distribution

Example prior specification

MCMC over \(\gamma\) space

Add-Delete-Swap sampler

Some remarks

One useful posterior summary

Examples

Case study 2: GLSMs for radioactivity estimation

Generalised Linear Spatial Models (GLSMs)

Radioactivity at Rongelap island

The Marshall Islands (from Encyclopedia Brittanica)

The dataset (Diggle, Tawn & Moyeed, 1997)

The Posterior

MCMC strategy

The mean posterior intensity field (Diggle, Tawn & Moyeed, 1997)

Areas of with high probability of high contamination (Diggle, Tawn & Moyeed, 1997)