Markov chain Monte Carlo sampling algorithms
Lecture 5
Dr Samuel Livingstone
Last time
- Theory of MCMC
- Bias (via mixing times) \[
\|P^n(x,\cdot) - \pi\|_{TV} \leq M(x) \rho^n
\]
- (Asymptotic) Variance \[
\nu(P,f) := \lim_{N \to \infty} N\text{Var}(\tilde{f}_N)
\]
Outline
- Gradient-based algorithms
- Metropolis-adjusted Langevin algorithm (MALA)
- Unadjusted algorithms & stochastic gradients
- Hamiltonian Monte Carlo
- Other approaches
- Parallel tempering
- Adaptive MCMC
- (Appendix: case studies)
- (Bayesian variable selection in regression)
- (Modelling radioactivity on Rongelap island via Generalised
linear spatial models)
Part 1: Gradient-based algorithms
Motivation for better Metropolis–Hastings
- Recall that in Metropolis–Hastings the key design choice is the
candidate kernel/proposal distribution \[
Q(x,\cdot)
\]
- We have seen that simple random walk moves can be surprisingly
effective
- You can (indeed you should) always run longer chains to
ensure the ESS is sufficiently high etc.
- But no information about \(\pi\) is
used to design \(Q(x,\cdot)\)
- Can we do better?
The mixing time is typically \(O(d)\) (good, but let’s try to do
better)

A popular modern strategy for constructing Metropolis–Hastings
algorithms
- Find a \(\pi\)-invariant stochastic
process \((Y_t)_{t\geq 0}\)
- (usually continuous time, usually not possible to simulate
exactly)
- Approximate the dynamics of the process
- (this is construction of \(Q(x,\cdot)\))
- Correct for approximation errors with Metropolis–Hastings step
- (ensures \(\pi\) is equilibrium
distribution)
- We will focus on \(E =
\mathbb{R}^d\), but strategy can be employed on any space if a
\(\pi\)-invariant process can be
found
Diffusion processes
- One example of a continuous-time stochastic process that cannot
usually be simulated is a diffusion process
- The dynamics are described via a stochastic differential equation
(SDE)
- Question. How many people have studied
diffusions/SDEs?
- Consider diffusion process \((Y_t)_{t \geq
0}\), with corresponding SDE \[
dY_t = b(Y_t)dt + \sigma(Y_t)dW_t, \qquad Y_0 = y_0
\]
- This is just notation!
- \(b:\mathbb{R}^d \to \mathbb{R}^d\)
drift, \(\sigma:\mathbb{R}^d \to
\mathbb{R}^{d \times d}\) volatility
- If \(Y_t = y_t\) then for small
\(\delta > 0\) \[
\begin{aligned}
\mathbb{E}[Y_{t+\delta}]
&=
y_t + b(y_t)\delta + o(\delta)
\\
\text{Cov}(Y_{t+\delta})
&= \sigma(y_t)\sigma(y_t)^T\delta^2 + o(\delta^2)
\end{aligned}
\]
- The process \((W_t)_{t \geq 0}\) is
called a standard Wiener process, or Brownian motion
- \(W_0 = 0\), and for any \(t > s > 0\) \[
W_t - W_s \sim N(0,t-s I_{d \times d})
\]
Simulating diffusion processes
- Note that the true process satisfies \[
Y_{t+h} = Y_t + \int_t^{t+h} b(Y_s)ds +
{\int_t^{t+h}\sigma(Y_s)dW_s}_{\text{Stochastic `Ito' integral}}
\]
- Assume that \(b(Y_s) \approx
b(Y_t)\) over the interval \(s \in
[t,t+h]\) (and similarly for \(\sigma(Y_t)\)), for small enough \(h>0\)
- This gives the approximation \[
Y_{t+h}
\approx
Y_t + b(Y_t)\int_t^{t+h}ds + \sigma(Y_t)\int_t^{t+h} dW_s
\]
- The integrals can now be evaluated, giving \[
Y_t + b(Y_t)h + \sigma(Y_t)(W_{t+h} - W_t)
\]
- Rearranging and writing \(W_{t+h} - W_t :=
\sqrt{h}Z_{t+h}\), where \(Z_{t+h} \sim
N(0,I)\) gives Euler–Maruyama numerical scheme
\[
Y_{t+h} = Y_t + b(Y_t)h + \sqrt{h}\sigma(Y_t)Z_{t+h}
\]
A simulation of the diffusion with SDE \(dY_t = dW_t\) for \(t\in [0,1]\) (Euler–Maruyama skeleton +
interpolation)
set.seed(5367)
delta <- 1/1000 # time step
dW <- sqrt(delta)*rnorm(1000)
W <- cumsum(dW)
plot((1:1000)/1000,W, type = 'l', xlab = "t",ylab = "W_t")

The (overdamped) Langevin diffusion
- Let \((Y_t)_{t\geq 0}\) be a
diffusion process, and write \(Y_t \sim
\mu_t\)
- Our goal is to construct a diffusion process for which as \(t\to\infty\) \[
\|\mu_t - \pi\|_{TV} \to 0
\]
- Question. What should we call \(\pi\) if this is the case?
- The following diffusion has exactly this property under mild
regularity conditions on \(\pi\) \[
dY_t = \frac{1}{2}\nabla\log\pi(Y_t)dt + dW_t
\]
- It is not hard to show that \(\pi\)
is invariant using the Fokker–Planck/Kolmogorov forward equation (I will
not cover this)
Using the Langevin diffusion to construct MCMC proposals
- The Euler–Maruyama numerical scheme can be used to generate
Metropolis–Hastings proposals
- If current state \(X_n = x\) we can
choose our Metropolis–Hastings proposal as \[
Y = x + \frac{h}{2}\nabla\log\pi(x) + \sqrt{h}Z
\] where \(Z \sim N(0,I)\)
- This means that candidate kernel \(Q(x,\cdot)\) takes the form \[
Q(x,\cdot) = N\left(x+\frac{h}{2}\nabla\log\pi(x), hI\right)
\]
- We then accept-reject the proposed move using the
Metropolis–Hastings ratio \[
\alpha(x,y) = \min\left(1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)} \right)
\]
- This is called the Metropolis-adjusted Langevin algorithm
(MALA)
- Question. Does \(q(x,y) =
q(y,x)\) here?
Observations about MALA
- The proposal combines a gradient step towards the mode of
\(\pi\) and some noise
- This is an exploration/exploitation trade-off
- smaller noise makes the proposal closer to an optimiser
- smaller gradient step makes it more like a random walk
- Just the right relative size of each gives Langevin
- Knowing only \(c\pi\) for unknown
\(c\) means we know \[
\log\pi(x) + \log(c)
\] so \(\nabla\log\pi(x)\) can
be calculated exactly in case of unnormalised \(\pi\)
- Proposals are no longer blind (information about \(\pi\) used in their construction)
- Question. Can we use MALA when \(E\) is not \(\mathbb{R}^d\)
What is known about MALA
- Recall that for Random Walk Metropolis it is asymptotically (in
\(d\)) optimal to accept 23.4% of
proposals
- The same optimal acceptance rate analysis for MALA gives an optimum
\[
\alpha^* = 0.57
\]
- Intuitively this is good, more proposals should be accepted!
- The convergence rate/mixing time typically scales either \(O(d^{1/3})\) or \(O(d^{1/2})\) depending on \(\pi\)
- This is better than RWM \(O(d)\)
- Cost per step is larger than RWM (need to calculate gradient)
- How much larger depends on the model
- MALA often not much better than RWM for heavy tails (\(\nabla\log\pi(x) \to 0\)) and also
struggles with lighter than Gaussian tails (\(\nabla\log\pi(x)\) unstable, changes too
quickly)
- Roberts & Tweedie (1996) showed it is geometrically ergodic when
the tails are between exponential and Gaussian
- In the lighter than Gaussian case alternative numerical schemes can
be used
Empirical comparison vs Random walk
Show output
Stylised example model class
- Consider target distributions of the form \[
\pi(x) \propto \exp\left\{ -\frac{x^{\beta}}{\beta} \right\}
\mathbb{I}(x>0)
\]
- Gradients for \(x > 0\) take the
form \[
\nabla\log\pi(x) = -x^{\beta - 1}
\]
- Question. What kind of tails are \(\beta <1\), \(\beta = 1\), \(\beta = 2\), \(\beta > 2\) cases?
MALA for the stylised example class
- The MALA proposal for \(x > 0\)
takes the form \[
Y = x - \frac{h}{2}x^{\beta - 1} + \sqrt{h}Z
\]
- Question. What does the proposal become when \(\beta = 1/2\) and \(x \to \infty\)?
- Question. What does the proposal become when \(\beta = 4\) and \(x \to \infty\)?
Plots for the stylised example class

A flavour of Hamiltonian Monte Carlo (HMC) 1
- Imagine \(U(x) = -\log\pi(x)\) as a
valley
- The current state of the chain as a football
- To generate a proposal we kick the football with arbitrary force,
giving it a random initial momentum
- Then we let it roll up and down the valley
- Imagine that there is no friction, so if we leave it, the ball will
keep rolling forever
- At some point we stop and take the position of the football as the
next state of the chain
A flavour of Hamiltonian Monte Carlo (HMC) 2

A flavour of Hamiltonian Monte Carlo (HMC) 3
- The ’frictionless ball rolling is actually Hamiltonian dynamics
- Hamilton’s formulation of Newton’s laws of motion
- Hamilton’s equations: system of ODEs to be solved numerically
- Key additional tuning parameter: how long to integrate for
- I won’t give a full description of the algorithm
- Generally mixing is \(O(d^{1/4})\)
and optimal acceptance rate 0.651 (Beskos et al., 2013)
- Suffers from the same problems as MALA for very heavy/light tails
- Geometric ergodicity/lack of in same settings as MALA (L et al.,
2019)
- Implemented in many major software packages (Stan, pyMC,
rmcmc
)
Example implementation
Show animation
Part 2: Unadjusted sampling algorithms
What is an unadjusted algorithm?
- MALA is so-called because a numerical approximation to a \(\pi\)-invariant process is ‘adjusted’ via
the Metropolis–Hastings step
- This is done to ensure that the algorithm still produces a \(\pi\)-invariant Markov chain
- Without the MH step \(\pi\) is
typically no longer equilibrium distribution, HOWEVER
- numerous authors argue that equilibrium is never reached anyway
- often the equilibrium distribution \(\pi_h\) is not far from \(\pi\) (e.g. in TV)
- Ignoring the MH step saves on computation
The unadjusted Langevin algorithm
- The unadjusted Langevin algorithm is defined through the
simple recursion \[
X_{n+1} = X_n + \frac{h}{2}\nabla\log\pi(X_n) + \sqrt{h}Z_{n+1}
\] where \(Z_{n+1} \sim
N(0,I)\)
- This is simply the Euler–Maruyama discretisation of the (overdamped)
Langevin diffusion
- Many results exist showing when \((X_n)_{n
\geq 0}\) has an equilibrium distribution \(\pi_h\) and how close this is to \(\pi\)
- Also mixing time bounds exists for \(\pi\) itself
- Often we can get within distance \(\epsilon\) of \(\pi\) in finite time, even if we never
actually reach it (e.g. Dalalyan (2017), Durmus, Majewski &
Miasojedow (2019))
- Generally preferred by numerical analysts, also called Langevin
Monte Carlo
Stochastic gradient Langevin dynamics (SGLD)
- Without the costly Metropolis–Hastings step, we can also save
further by approximating \(\nabla\log\pi\)
- Imagine \(\pi\) is of the form
\[
\pi(\theta|y) \propto \prod_i f(y_i|\theta) \pi_0(\theta)
\]
- Question. If \(y_i\) are data points, what kind of
assumption are we making about them?
- If this is the case then \[
\nabla\log\pi(\theta|y) = \nabla\log\pi_0(\theta) + \sum_{i=1}^n
\nabla\log f(y_i|\theta)
\]
- This computation is \(O(n)\). We
can approximate through subsampling
- Sample \(N \ll n\) integers between
\(1\) and \(n\), store as the set \(I_N\) (with \(|I_N| = N\))
- Approximate gradient with \[
\nabla\log\pi_0(\theta) + \frac{n}{N} \sum_{j \in I_N} \nabla\log
f(y_j|\theta)
\]
- Question. What is the new computational cost?
Part 3: Parallel Tempering + adaptive MCMC
- There are many more elaborate and exciting MCMC algorithms
- I will explain one, parallel tempering, and then one technique to
automate tuning
Multi-modal target distributions
- Gradient-based methods do not typically work well if the target
distribution has a multi-modal density
- Question. Why do you think this is?
- In this instance many alternative strategies are based on
tempering the target \[
\pi(x) \to \pi(x)^\beta
\] for some \(\beta <
1\)
- The name comes from tempering metal (heating it up so it’s easier to
work with)
- \(1/\beta\) is called the
temperature, so \(\beta < 1
\implies\) heating up the target
- This typically reduces the level of multi-modality
Tempering example

Tempering algorithms
- There are many ways to take this idea and use it to create
sampling/optimisation algorithms
- Simulated annealing (optimisation)
- Simulated tempering (sampling)
- Parallel tempering (sampling)
- We will focus on the last of these
- Parallel tempering algorithm:
- Create \(K\) parallel Markov chains
each targeting \(\pi^{\beta_k}\) for
some \(\beta_k\), \(1 \leq k \leq K\)
- At pre-specified intervals propose swaps between
adjacent chains
- In the hotter chains travel between modes is easier, and then states
swap with the cooler chains
Example: mixture of 2 Gaussians, means 8 and -8

Algorithm design for tempering - natural questions
- How many parallel chains?
- At what temperatures should they be set?
- How often should swaps be proposed?
- What proportion of swaps should be accepted?
Adaptive MCMC - Motivation
- One reason that the Gibbs sampler remains popular is that there are
no tuning parameters
- E.g. in the Random Walk Metropolis \[
Y \sim N(x, h^2 I)
\] we must choose a step-size \(h\)
- Often there are many more parameters to choose (e.g. preconditioning
matrix \(\Sigma \in \mathbb{R}^{d \times
d}\))
- It can be hard to find a good choice (through e.g. pilot
runs/mathematical calculations)
- But good choices are crucial to algorithm performance
- Adaptive MCMC is an attempt to solve this
issue
Adaptive MCMC - basic idea
- In adaptive MCMC we try to learn good choices for tuning parameters
‘on-the-fly’
- Start with an arbitrary choice
- Run the algorithm for \(k\)
iterations
- Check some statistics and update our choices accordingly
- Repeat until sufficiently good choices are found
A very simple adaptive MCMC algorithm
- Take the simple 1D random walk Metropolis, with proposal \[
Y \sim N(x, h^2)
\]
- We might want to choose \(h\) s.t.
\(\mathbb{E}[\alpha] \approx
0.44\)
- So we can do the following:
- Start with \(h = h_0\)
- Run algorithm for \(k=100\) (say)
iterations
- Calculate the average acceptance rate
- Adapt \(h\) based on how far this
is from 0.44
- Repeat until some criterion is met (e.g. close to 0.44, max
iterations etc.)
- Question. If the average acceptance rate is
larger than 0.44, should we increase or decrease \(h\)?
Code for the simple adaptive MCMC algorithm
RWM_output <- RWM(logpi = logpi_gauss,
nits = adapt_iter,
h = h[k],
x_curr = start_x[k])
av_accept[k] <- RWM_output$a_rate
# update h based on average_accept_rate
if (av_accept[k] > 0.5) {
h[k+1] <- h[k] * 1.2
} else if (av_accept[k] < 0.4) {
h[k+1] <- h[k] * 0.8
} else {
h[k+1] <- h[k]
}
Results for the simple adaptive MCMC algorithm
## Epoch: 1 h = 100 Average acceptace rate: 0.03
## Epoch: 2 h = 80 Average acceptace rate: 0.01
## Epoch: 3 h = 64 Average acceptace rate: 0.02
## Epoch: 4 h = 51.2 Average acceptace rate: 0.02
## Epoch: 5 h = 40.96 Average acceptace rate: 0.03
## Epoch: 6 h = 32.768 Average acceptace rate: 0.01
## Epoch: 7 h = 26.2144 Average acceptace rate: 0.04
## Epoch: 8 h = 20.97152 Average acceptace rate: 0.05
## Epoch: 9 h = 16.77722 Average acceptace rate: 0.06
## Epoch: 10 h = 13.42177 Average acceptace rate: 0.1
## Epoch: 11 h = 10.73742 Average acceptace rate: 0.15
## Epoch: 12 h = 8.589935 Average acceptace rate: 0.12
## Epoch: 13 h = 6.871948 Average acceptace rate: 0.18
## Epoch: 14 h = 5.497558 Average acceptace rate: 0.23
## Epoch: 15 h = 4.398047 Average acceptace rate: 0.25
## Epoch: 16 h = 3.518437 Average acceptace rate: 0.35
## Epoch: 17 h = 2.81475 Average acceptace rate: 0.32
## Epoch: 18 h = 2.2518 Average acceptace rate: 0.48
## Epoch: 19 h = 2.2518 Average acceptace rate: 0.44
## Epoch: 20 h = 2.2518 Average acceptace rate: 0.37

A word of caution about adaptation
- Naively adapting the tuning parameters of the Markov kernel is
dangerous
- We are no longer dealing with a time-homogeneous chain (the kernel
changes each iteration)
- The way the kernel changes depends on the past (may no longer be
Markov)
- To prove results about adaptive MCMC is therefore challenging
- In practice the simplest way of avoiding issues is to (safely) adapt
during Burn in, and then stop
- I just gave a very simple way of adapting, there are far more
sophisticated approaches (that I will not cover)
- Also lots of open theoretical questions in the area (e.g. which
adaptive schemes are better?)
Some further MCMC algorithms that I didn’t mention
- Reversible jump MCMC
- sampling between different spaces of models
- aim to choose the best one or average over them
- Pseudo-marginal MCMC
- Running an exact MCMC algorithm when the likelihood is
intractable
- Typically relies on importance sampling to unbiasedly estimate
likelihoods
- Interacting particle systems (e.g. Stein variational gradient
descent)
- Evolving several particles along a ‘gradient flow’ to reach a
particle approximation to \(\pi\)
- Non-reversible algorithms
- E.g. piece-wise deterministic Markov processes, which can be exactly
simulated in continuous-time
End of module - Do zobaczenia!
I hope you enjoyed the course and am happy to answer any further
questions over e-mail!
Appendix: Case studies
- I will look at two case examples of MCMC in practice
- Bayesian Variable selection for genomic data analysis
- Generalised linear spatial models for radioactivity on a pacific
island
Case Study 1: Bayesian Variable Selection
The Variable selection problem
A very famous problem in probabilistic modelling
Given data \((y_i,x_{i1},...,x_{ip})_{i=1}^n\), which of
the covariates (\(x\)’s) should I use
to predict \(y\)?
We will consider the setting of linear regression here
Solution 1: Choose the best model
We can use many heuristic model selection procedures
- forward selection
- backwards elimination
We can also use more modern penalized approaches
These approaches will choose the single best model for
the data
If there are two correlated covariates, it will typically choose
one
- Question. What do we call two correlated covariates
in linear regression?
This is suboptimal if our primary interest is in learning which
covariates might be associated with the response
Example: genomic data
A typical genomic dataset contains the covariates relating to
many different genes in the body
The response might be whether someone has a disease, or some
continuous value that is higher if people have a disease than not
(e.g. blood pressure, testosterone etc.)
Scientists want to know all the genes that might be associated
with the disease, so that they can investigate those genes
further
Typically the number of possible genes (number of predictors
\(p\)) is much larger than the sample
size \(n\)
This setting is known as high-dimensional
statistics
Bayesian variable selection via MCMC
Introduce \(\gamma =
\{0,1\}^p\), where \[
\gamma_j = 1 \implies \beta_j \neq 0
\]
This controls whether the \(j\)th covariate is included in the
model
The full linear regression model is now: \[
y_i = \alpha + \beta_1 \gamma_1 x_{i1} + ... + \beta_p\gamma_p x_{ip} +
\epsilon_i
\]
Each particular value for \(\gamma\) is associated with a particular
model \(M_\gamma\)
- Question. If only \(\gamma_1\) and \(\gamma_p\) are not zero, what is model
\(M_\gamma\)?
Choice of prior distribution
We will make the problem easier and choose conjugate
priors where we can
Conditional on a certain value for \(\gamma\), we can choose priors for \(\beta\), \(\alpha\) and \(\sigma^2\) such that \(\pi(\beta, \sigma^2, \alpha|y, X, \gamma)\)
is tractable
These can also be chosen in such a way that the marginal
posterior \(\pi(\gamma|y,X)\) can be
computed
Note: this cannot be done in general, only for linear
regression
Example prior specification
One example is called the g-prior. Gives marginal
posterior (for positive real \(g\))
\[
\pi(\gamma|y,X) \propto \pi_0(\gamma)
\frac{(1+g)^{\frac{n-p_\gamma-1}{2}}}{(1+g(1-R_\gamma^2))^{\frac{n-1}{2}}}
\]
Question. What do you think \(R_\gamma^2\) is?
Question. What do you think \(p_\gamma\) is?
MCMC over \(\gamma\) space
Our task is now to perform MCMC over the space \(\{0,1\}^p\)
Question. Mathematically, what is this
space?
We will introduce a form of ‘add-delete-swap sampler’ for
this
Add-Delete-Swap sampler
At each iteration, given current state \(\gamma \in \{0,1\}^p\), do the
following:
Decide whether to ADD/DELETE w.p. \(\lambda\) or SWAP w.p. \(1-\lambda\)
First set proposal \(\gamma^* =
\gamma\). Then do the following
- If ADD/DELETE, draw \(j \sim
\mathcal{U}[{1,...,p}]\) and set \(\gamma^*_j = 1-\gamma_j\)
- If SWAP, choose a \(j\) for which
\(\gamma^*_j = 1\) and a \(k\) for which \(\gamma^*_k = 0\) and swap them (both
choices uniform)
Set the next state to be \(\gamma^*\) with probability \(\alpha(\gamma, \gamma^*)\), otherwise set
it to be \(\gamma\), where \[
\alpha(\gamma,\gamma^*) = \min \left( 1,
\frac{\pi(\gamma^*|y,X)}{\pi(\gamma|y,X)} \right).
\]
- Question. Why do I only need a Metropolis step
here, not full Metropolis–Hastings with the \(q(\gamma^*,\gamma)/q(\gamma,\gamma^*)\)
part?
One useful posterior summary
The posterior inclusion probability (or PIP) for predictor \(j\) is \[
\mathbb{P}_\pi(\gamma_j = 1|y, X)
\]
This tells us the posterior probability that covariate \(j\) should be included in the
model
Any PIP significantly bigger than 0 might be of interest
Case study 2: GLSMs for radioactivity estimation
Generalised Linear Spatial Models (GLSMs)
- Consider observations \((y_i,z_i,x_i)\)
- \(y_i\) treated as realisations of
\(Y_i\) following some exponential
family distribution
- \(z_i\) spatial location (e.g. in
\(\mathbb{R}^2\)) associated with
sample \(i\)
- \(x_i\) other covariates associated
with sample \(i\)
- Generalised linear spatial model (GLSM) takes the form \[
h(\mathbb{E}[Y_i|S(z_i)]) = S(z_i) + x_i^T\beta,
\] for some link function \(h\)
- \((S(z))_{z \in \mathbb{R}^2}\) is
some unknown function defined over \(z\)-space. Represents some latent intensity
field
- Interest is usually in some expectations from the posterior \[
\pi(S(z)|y,x)
\]
Radioactivity at Rongelap island
- Rongelap Island forms part of the Republic of the Marshall
Islands
- Located in the Pacific Ocean ~2500 miles from Hawaii
- Experienced contamination following Bikini Atoll nuclear weapons
testing programme in 1950s
- Former inhabitants lived in self-imposed exile on neighbouring
island since 1985.
- Marshall Islands National Radiological Survey has examined the
current levels of contamination by counting photon emissions at \(n=157\) locations over the island.
- Good reasons to model data as Poisson counts with spatial
correlation
The Marshall Islands (from Encyclopedia Brittanica)

The dataset (Diggle, Tawn & Moyeed, 1997)

The Posterior
- The model here is \[
Y_i |S(z_i) \sim \text{Poisson} (e^{S(z_i)})
\]
- The prior on the spatial intensity field \(S(z)\) is a Gaussian process, which means
\[
S(z_i) \sim N(0,\sigma^2), \qquad \text{Cov}(S(z_i),S(z_j)) =
\sigma^2\exp\left\{\phi^{-1}\|z_i - z_j\|^2 \right\}
\]
- The parameters \(\sigma^2\) and
\(\phi\) will also have a hyperprior
\(\pi_0(\sigma^2,\phi)\)
- Posterior for \((S(z_1),S(z_2),...,S(z_n),\sigma,\phi)|y\)
is therefore 159-dimensional and quite complicated!
MCMC strategy
- Common to use a blocking strategy, alternating \[
(\sigma^2,\phi)|S,y, \qquad S|(\sigma^2,\phi,y)
\]
- Use something like MALA to update \(S|(\sigma^2,\phi,y)\) (high-dimensional,
smooth)
- Same can be used for \((\sigma^2,\phi)|S,y\), or something more
bespoke (low-dimensional, more irregular)
The mean posterior intensity field (Diggle, Tawn & Moyeed,
1997)

Areas of with high probability of high contamination (Diggle, Tawn
& Moyeed, 1997)
