Preface

This short course provides an introduction to Monte Carlo methods and Markov chain Monte Carlo methods in Statistics, consisting of five two-hour lectures and three computer workshop classes. In the first lecture we introduce the Monte Carlo principle, provide some general motivation and discuss some different methods for simulating random variables. In the second lecture we will introduce Monte Carlo significance/hypothesis testing. The focus of the third, fourth and fifth lectures will be Markov chain Monte Carlo (MCMC) methods for Bayesian inference. We will review some Markov chain theory, introduce the Metropolis–Hastings algorithm, give a small amount of theory and also discuss practical issues. Along the way we will introduce some interesting statistical models in a variety of settings.

Some basic notational conventions that will be used throughout:

  • \(\mathcal{X}\) will denote the sample space of a random variable or state space of a Markov chain.
  • \(\mathcal{B}\) will denote ‘event space’, the set of possible subsets of \(\mathcal{X}\) that we assign a probability to.1
  • The term ‘distribution’ will refer to a function which maps from \(\mathcal{B} \to [0,1]\), which gives the probability of any event \(A \in \mathcal{B}\) according to that distribution. We may also refer to this as the ‘model’ or ‘probability measure’.
  • We will sometimes use the same symbol to denote a distribution and its density, when there is no ambiguity. So we will write \(X \sim \pi(\cdot)\) to mean \(X\) has distribution \(\pi(\cdot)\), and then both \(\mathbb{P}(X\in A) = \pi(A)\) and \(\pi(A) = \int_A \pi(x)dx\).
  • \(X \stackrel{d}{=} Y\) means \(X\) and \(Y\) are equal in distribution, while \(X = Y\) means \(\mathbb{P}(X=Y) = 1\).
  • Many of the results are stated in terms of continuous random variables, densities and integrals. In most cases the analogue for discrete random variables can be found by simply replacing the integral with a sum and \(\pi(x)dx\) with the relevant probability mass function.2

These notes will inevitably contain typographical errors. If you spot any please let me know ().

1 The Monte Carlo principle

1.1 Introduction

The aim of this course is to introduce the basic principle of the Monte Carlo method, which has had a profound effect on the field of Statistics over the past century.3 We will first motivate the principle with some examples from statistics in which it might be useful, and then give a formal definition.

Example (Basic model checking). Suppose we have some data \(x_{1},...,x_{n}\) and believe that it is reasonable to assume that these are samples from a model \(X_{i}\stackrel{\text{iid}}{\sim}f_{\theta}\), with the parameter \(\theta\) known. In order to check that this probabilistic model is appropriate for our data, we may wish to calculate \[ \mathbb{E}[g(X)]=\int g(x)f_{\theta}(x)dx, \] where \(g\) is some known function that summarises a key characteristic of the phenomenon in question, and compare this to \(n^{-1}\sum_{i=1}^{n}g(x_{i})\), the empirical average from the data. One example is to assume that some data follow an exponential distribution, set \(\lambda\) to be the inverse of the sample mean, and then compare the sample variance to \(1/\lambda^2\) to check the appropriateness of the model. But in general this integral may be analytically intractable, or extremely difficult to compute, which presents a problem. We will expand on the notion of intractability below.
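A minimal R sketch of the exponential model check described above; the data here are simulated purely as a stand-in for real observations.

x <- rexp(200, rate = 0.5)         # pretend these are the observed data
lambda_hat <- 1 / mean(x)          # set lambda to the inverse of the sample mean
c(var(x), 1 / lambda_hat^2)        # sample variance vs model-implied variance 1/lambda^2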

Example (Classical significance testing). Suppose that we have a probabilistic model \(f_{\theta}\), now with \(\theta\) unknown. We wish to test the hypothesis \[ H_{0}:\theta=\theta_{0}. \] We collect some data \(x=(x_{1},...,x_{n})\), with each \(x_{i}\in\mathcal{X}\), and define a test statistic \(T:\mathcal{X}^{n}\to[0,\infty)\), such that for any \(y,z\in\mathcal{X}^{n},\) \(T(z)>T(y)\) implies that the hypothesis \(H_{0}\) is more consistent with data \(y\) than data \(z\) (mathematically, \(T\) defines a partial order on \(\mathcal{X}^{n}\)). To construct a \(p\)-value and decide whether or not to reject \(H_{0}\), we then compute \[ \mathbb{P}[T(X)\geq T(x)|H_{0}\text{ true}]=\int_{\{z:T(z) \geq T(x)\}} f_{\theta_{0}}(z_{1})...f_{\theta_{0}}(z_{n})dz_{1}...dz_{n}. \] Often to perform this calculation we derive an analytical expression for the distribution of \(T(X)\) (either exactly or asymptotically as \(n\to\infty\)) under the assumption that \(H_{0}\) is true. But if this is either not possible or not appropriate (if e.g. \(n\) is not large) then this presents an issue.

Example (Bayesian inference). Suppose that we have some data \(x=(x_{1},...,x_{n})\) and are prepared to assume that the data are samples from a model \(f_{\theta}\) with \(\theta\) unknown. We decide to infer \(\theta\) using the Bayesian approach. We construct a prior distribution with density \(\pi_{0}(\theta)\) expressing our beliefs about how likely different values for \(\theta\) are before seeing the data, and then use Bayes’ theorem to update our beliefs, generating a posterior distribution with density \[ \pi(\theta|x)=\frac{f_{\theta}(x)\pi_{0}(\theta)}{\int f_{\theta}(x)\pi_{0}(\theta)d\theta}\propto f_{\theta}(x)\pi_{0}(\theta). \] Computing \(\pi(\theta|x)\) up to a constant of proportionality is therefore straightforward, but computing it exactly requires us to evaluate the integral \(\int f_{\theta}(x)\pi_{0}(\theta)d\theta\), known as the marginal likelihood (because the parameter \(\theta\) is integrated out of the expression to give the marginal distribution of \(x\)). To learn interesting things from our model we typically want to know interesting summaries like posterior means and variances, all of which require taking expectations with respect to \(\pi(\theta|x)\). For a summary function of interest \(g\), such an integral will take the form \[ \mathbb{E}_{\pi}[g(\theta)]=\int g(\theta)\pi(\theta|x)d\theta=\int g(\theta)\frac{f_{\theta}(x)\pi_{0}(\theta)}{\int f_{\vartheta}(x)\pi_{0}(\vartheta)d\vartheta}d\theta. \] There are two integrals in this expression and either or both may be analytically intractable or extremely difficult to calculate, meaning we will struggle to use the approach.

All of the above scenarios share a common theme: we have a statistical question that requires some integration to answer. The difficulty lies in the fact that the integral in question may be intractable.

Definition. A one-dimensional integral \(\int_{a}^{b}f(x)dx\) is said to be intractable if there is no elementary function \(F\) such that \(\int_{a}^{b}f(x)dx=F(b)-F(a)\) for all \(a,b\in\mathbb{R}\).

The above can be generalised to higher dimensions without too much difficulty. An elementary function is a function that can be written as the composition of a finite number of operations such as addition, subtraction, multiplication, division, taking exponential and logarithm, applying trigonometric functions and taking powers or roots. So \(\sin(x)/x\) is an elementary function, but \(\sum_{k=1}^{\infty}\sin^{k}(x)/x^{k}\) is not, since it can only be expressed as an infinite sum. The function \(F\) in the above definition is often called the anti-derivative or indefinite integral of \(f\).

A very simple example of an intractable integral is \[ \int_{a}^{b}e^{-\frac{1}{2}x^{2}}dx. \] This should be very familiar as an unnormalised version of the Gaussian (or Normal) distribution. It is well-known that \(\int_{-\infty}^{\infty}e^{-x^{2}/2}dx=\sqrt{2\pi}\), but in fact there is no analytical expression for \(\int_{a}^{b}e^{-x^{2}/2}dx\) for an arbitrary \(a,b\in\mathbb{R}\), which is why in the previous century statistical tables were produced with numerical approximations. You might think that intractable integrals must be very strange, but in fact the above illustrates that they can look perfectly sensible when plotted.

curve(dnorm, from = -5, to = 5, xlab = "x", ylab = "Gaussian density evaluated at x")

In some cases, a problem may not involve an intractable integral, but still may involve a probability calculation that is extremely difficult. An example is the expectation of a discrete random variable that has a very large sample space. The calculation requires a sum over an extremely large number of elements, and it may be that the amount of computing power required to perform such a task in a reasonable amount of time is not available to us. So although an exact calculation is possible in this case (at least to the level of precision allowed by a computer), we may still consider it to be intractable in practice.

1.2 Intractable integrals, numerical schemes and the curse of dimensionality

As alluded to in the above paragraph, in many cases (particularly very low-dimensional models) intractable integrals are not too hard to deal with using deterministic numerical integration strategies. Recall the definition of a one-dimensional Riemann integral.

Definition. The Riemann integral of a real-valued function \(f\) on the interval \([a,b]\) for \(a,b\in\mathbb{R}\) is the result of setting \(\delta x:=(b-a)/n\) and \(x_{i} := a+i\delta x\) and computing \[ \lim_{n\to\infty}\sum_{i=1}^{n}f(x_{i})\delta x. \]

The limit can also be taken using \(x_{i-1}\) in place of \(x_{i}\). For a finite \(n\) the two resulting quantities are known as right and left Riemann sums. A function is Riemann-integrable if and only if all possible Riemann sums converge to the same quantity as \(n\to\infty\). This essentially means that any choice \(x'_i\) with which to evaluate \(f(x_i')\) in the above sum will give the same result in the limit. The intuition here is that within very small intervals the function \(f\) will not vary much, so can be well approximated by the assumption that it is constant over the interval.

The very definition of a Riemann integral gives a straightforward strategy for approximation: compute one of the Riemann sums for a finite \(n\). This amounts to approximating the integral by a sequence of rectangles. A slightly more sophisticated method is to approximate the integral with a collection of trapezia. The trapezoidal rule approximation of an integral \(\int_{a}^{b}f(x)dx\) is \[ \int_{a}^{b}f(x)dx\approx\sum_{i=1}^{n}\frac{f(x_{i-1})+f(x_i)}{2}\delta x. \] These methods and their more sophisticated relatives are collectively known as quadrature rules. In one dimension this strategy can be very effective. However, there is a serious problem in higher dimensions with such a grid-based approach. If \(n\) grid cells are required in one dimension for a given level of accuracy, then \(n^{d}=e^{d\log(n)}\) grid cells are needed to evaluate a \(d\)-dimensional integral of the same difficulty to the same level of precision. So the cost increases exponentially with dimension! The general phenomenon of calculations becoming exponentially more challenging in higher dimensions is known as the curse of dimensionality in mathematics, and numerical integration is a good example. Fortunately, in our setting there is often a way around the problem.
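A minimal R sketch of the trapezoidal rule, applied to the integrand \(e^{-x^2/2}\) that appears later in this chapter; the grid size \(n\) is illustrative.

trapezoid <- function(f, a, b, n) {
  x <- seq(a, b, length.out = n + 1)            # grid points x_0, ..., x_n
  dx <- (b - a) / n
  sum((f(x[-1]) + f(x[-(n + 1)])) / 2) * dx     # sum of trapezium areas
}
trapezoid(function(x) exp(-x^2 / 2), -1, 1, 1000)
# compare with sqrt(2*pi) * (pnorm(1) - pnorm(-1))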

1.3 Monte Carlo

The Monte Carlo method is based on the following principle:

If an integral can be written as an expectation with respect to some probability distribution, then we can approximate it by drawing samples from the distribution and computing the sample mean.

A simple example is the Gaussian distribution shown above. Suppose that we wish to calculate \((1/\sqrt{2\pi})\int\sin(x)e^{-x^{2}/2}dx\). The Monte Carlo approach is to do the following:

  1. Draw \(n\) samples \(X_{1},...,X_{n}\sim N(0,1)\)
  2. Compute \(\frac{1}{n}\sum_{i=1}^{n}\sin(X_{i})\).

Note that this approach would be identical if the distribution to be sampled from were \(d\)-dimensional, for some \(d>1\). Hence, if we are lucky, the curse of dimensionality will be broken. The general approach for approximation of the integral \(\mathbb{E}_{\pi}[f]:=\int f(x)\pi(x)dx\) is the following:

  1. Draw \(n\) samples \(X_{1},...,X_{n}\sim\pi(\cdot)\)
  2. Compute the average \(\hat{f}_{n}:=n^{-1}\sum_{i=1}^{n}f(X_{i})\).
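A minimal R sketch of this recipe, applied to the sine example above; the sample size is illustrative.

n <- 100000                    # number of samples (illustrative)
x <- rnorm(n)                  # step 1: draw samples from pi = N(0, 1)
f_hat <- mean(sin(x))          # step 2: compute the sample average of sin(X_i)
f_hat                          # the true value is 0, since sin is odd and N(0,1) is symmetric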

1.4 Why does Monte Carlo work?

The two fundamental results of probability theory that ensure the Monte Carlo principle is well-founded are the Law of Large numbers and the Central Limit theorem.

Theorem (Strong Law of Large Numbers).4

Suppose that \(X_{1},...,X_{n}\) are independent samples from a distribution \(\pi(\cdot)\), and set \(\hat{f}_{n}:=n^{-1}\sum_{i=1}^{n}f(X_{i})\) for some function \(f:\mathcal{X}\to\mathbb{R}\). If \(\mathbb{E}_{\pi}|f(X)|<\infty\), then \[ \mathbb{P}\left(\lim_{n\to\infty}\hat{f}_{n}=\mathbb{E}_{\pi}[f(X)]\right)=1. \] The Law of Large Numbers ensures that under very broad conditions the empirical average \(\hat{f}_{n}\) will get arbitrarily close to the expectation \(\mathbb{E}_{\pi}[f]\) as \(n\to\infty\). But it gives no information about the rate at which this happens; the next result, however, does.

Theorem (Central Limit Theorem). Suppose that \(X_{1},...,X_{n}\) are independent samples from a distribution \(\pi(\cdot)\), and set \(\hat{f}_{n}:=n^{-1}\sum_{i=1}^{n}f(X_{i})\) for some function \(f:\mathcal{X}\to\mathbb{R}\). If \(\mathbb{E}_{\pi}[f(X)^{2}]<\infty\) then as \(n\to\infty\) \[ \sqrt{n}\left(\hat{f}_{n}-\mathbb{E}_{\pi}[f(X)]\right)\xrightarrow{d}N\left(0,\text{Var}_{\pi}[f(X)]\right) \]

Informally, the Central Limit Theorem implies that under slightly stronger conditions, for large enough \(n\) the error \(|\hat{f}_{n}-\mathbb{E}_{\pi}[f]|\approx Cn^{-1/2}\), for some \(C<\infty\). This is sometimes referred to as “root \(n\)” or “square root” convergence. This is still an asymptotic result but gives useful practical information. Under even stronger conditions, direct non-asymptotic bounds can often be established (we will not, however, discuss these).
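A minimal R sketch of how the Central Limit Theorem is used in practice: the variance term can be estimated from the same samples to attach a rough error bar, continuing the sine example above.

n <- 100000
f_vals <- sin(rnorm(n))
f_hat <- mean(f_vals)
se_hat <- sd(f_vals) / sqrt(n)                     # estimated standard error, of order n^(-1/2)
c(f_hat - 1.96 * se_hat, f_hat + 1.96 * se_hat)    # approximate 95% interval for E[sin(X)]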

Some other interesting questions:

What if my integral isn’t an expectation?

In fact, in the applications we consider in Statistics, they almost always are. But for a general integral \(\int g(x)dx\) one can write \(\int g(x)dx=\int\frac{g(x)}{\pi(x)}\pi(x)dx=\mathbb{E}_{\pi}[g/\pi]\), provided \(\pi(x)>0\) wherever \(g(x)\neq0\). How sensible this is depends on \(g\) and \(\pi\). We will return to this idea when considering importance sampling later in the course.

How can I compute probabilities/integrals between finite limits?

The trick here is to use the indicator function \[ \begin{aligned} \mathbb{I}_{[a,b]}(x) & :=\begin{cases} 1 & a\leq x\leq b\\ 0 & \text{otherwise.} \end{cases} \end{aligned} \] Using this gives \[ \mathbb{P}[a\leq X\leq b]=\int_{a}^{b}\pi(x)dx=\int\mathbb{I}_{[a,b]}(x)\pi(x)dx=\mathbb{E}_{\pi}[\mathbb{I}_{[a,b]}(X)], \] which can then be evaluated in the usual way. We will also sometimes write \(\mathbb{I}(x>a)\) to mean \(1\) if \(x>a\) and zero otherwise, and other authors use other small variations on this notation, which should be intuitive.
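A minimal R sketch of the indicator trick, estimating \(\mathbb{P}[-1\leq X\leq 2]\) for \(X\sim N(0,1)\); the limits are illustrative.

n <- 100000
x <- rnorm(n)
p_hat <- mean(x >= -1 & x <= 2)    # Monte Carlo average of the indicator values
p_hat                              # compare with pnorm(2) - pnorm(-1)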

How do I draw samples from a probability distribution? What if this is not possible?

This is what we will discuss throughout the course.

2 Sampling from a probability distribution

2.1 The big picture

Generating independent samples from a given probability distribution takes a few steps. Typically, one first generates a sample from the uniform distribution between zero and one. It turns out that doing this is relatively straightforward. After this, in many cases one can directly transform these uniform samples into samples from another distribution of interest. This cannot always be done directly, and when it is not possible then one has to resort to other tricks, such as generating samples from one distribution and either re-weighting them or throwing some away, to end up with samples from another. In this chapter we will cover the basics of all of these approaches.

2.2 Generating uniform (pseudo-)random numbers

The strangest thing about generating samples from a probability distribution using a computer is that they are most likely not random at all. True random numbers are relatively easy to generate in real life: flip a coin, roll a die or take a Statistics exam and see what mark I give you (!) (This is a joke, my marking is not random.) But producing a large number of these quickly is not so easy, which motivates an alternative strategy: pseudo-random numbers.

Definition. A uniform pseudo-random number generator is a function \[ G: \mathcal{D} \times \{1,2,3,...\} \to \mathcal{D} \] such that for any \(x \in \mathcal{D}\), \(G(x,1)=x\) and the sequence of numbers \(G(x,1),G(x,2),G(x,3),...\) is statistically indistinguishable from a sequence of independent samples from the uniform distribution over \(\mathcal{D}\) when evaluated through a pre-specified set of tests.

Exactly which uniform distribution we are comparing to depends on the form of \(G\). In this course we will stick to the \(U[0,1]\) distribution. A popular pre-specified set of tests that is used in practice is known as “Diehard”, together with the later version “Dieharder”; both are available online.

The most crucial thing to note about a pseudo-random number generator is that the sequence of numbers produced is actually completely deterministic! The initial value to input into the function \(G\) is known as the seed. Changing the seed changes the sequence that follows, but using the same seed produces the same sequence. But for our purposes, the point is that successive numbers in the same sequence will not exhibit any noticeable patterns that would enable us to distinguish them from a sequence of truly random independent samples from the uniform distribution on \([0,1]\).

Example (Lehmer random number generator). Set \(m\) to be a prime number (or power of a prime), and \(a\in\mathbb{Z}\) (good choices recommended below). Then \(G\) is defined through the recursion \[ G(x,i+1) = a \cdot G(x,i) \mod m. \] It is recommended to choose \(a\) to be of high multiplicative order modulo \(m\), meaning that the smallest \(k\) such that \(a^{k}\mod m=1\) is large. This is to prevent the sequence from repeating itself too early (think about what would happen if \(x=1\)). Park, Miller and Stockmeyer (1993) recommend setting \(m=2^{31}-1\) (a Mersenne prime) and \(a=48271\). Dividing the sequence by \(m-1\) then gives \(U[0,1]\) samples.
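A minimal R sketch of the Lehmer recursion with the constants above; the seed value is illustrative.

lehmer <- function(seed, N, a = 48271, m = 2^31 - 1) {
  u <- numeric(N)
  state <- seed
  for (i in 1:N) {
    state <- (a * state) %% m    # G(x, i+1) = a * G(x, i) mod m (exact in double precision here)
    u[i] <- state
  }
  u / (m - 1)                    # map to (0, 1]
}
lehmer(seed = 12345, N = 5)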

A popular modern pseudo-random number generator is known as the Mersenne Twister, and this is the default used in the R statistical software. We will not discuss the details here but more information can be found in the recommended reading for the course or in the relevant R help files.

Remark. Note that any sample from the Lehmer generator will be a rational number, in fact some multiple of \(1/(m-1)\). So it is not really a sample from the continuous uniform distribution. Indeed, using a finite amount of computing power it is not possible to generate perfect samples from any continuous probability distribution. Because a number with an infinite number of decimal places cannot be stored, all simulated random numbers are in fact discrete, just (hopefully) extremely close to the samples of continuous random variables that they represent, to the point that calculations based on the two different objects will yield arbitrarily close results. This idea might seem bizarre, but it runs right through statistics: all data are discrete, since we can only ever record things to a finite number of decimal places. Continuous probability models are simply a very convenient way to represent them.

2.3 Direct sampling via transformations

The goal is usually not to sample from the uniform distribution. But we can often transform uniform random variables into others to generate samples from lots of different distributions. The key piece of intuition for why this is particularly easy is that the cumulative distribution function for the \(U[0,1]\) distribution is \[ \mathbb{P}[U\leq u]=F_{U}(u)=u \] for any \(u\in[0,1]\). This of course also means that \(\mathbb{P}[a<U<b]=b-a\) for \(a,b\in[0,1]\) with \(a\leq b\). This allows easy simulation of a wide variety of discrete random variables. The below proposition gives an example.

Proposition (Bernoulli sampling) If \(U\sim U[0,1]\) and we set \[ \begin{aligned} X=\begin{cases} 1 & U\leq p\\ 0 & U>p \end{cases} \end{aligned} \] for some known \(p\in[0,1]\), then \(X\sim\text{Bernoulli}(p)\).

Proof. Note simply that \(\mathbb{P}[X=1]=\mathbb{P}[U\leq p]=F_{U}(p)=p,\) and a similar calculation can be made for \(\mathbb{P}[X=0]\) (try it if unsure!).

Consider a more complex example in which a random variable is defined on the positive integers such that \(\mathbb{P}[X=i]=p_{i}\) for \(i\geq1\). Write \(c_{i}:=\sum_{j=1}^{i}p_{j}\) (with \(c_{0}:=0\)). Then to generate a sample from the distribution we can take a \(U\sim U[0,1]\) and set \(X=i\) if \(c_{i-1}<U\leq c_{i}\).

Exercise. Can you prove that the above procedure will generate a sample from the correct distribution?
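A minimal R sketch of the procedure above, for an illustrative distribution on \(\{1,2,3\}\).

p <- c(0.2, 0.5, 0.3)          # p_1, p_2, p_3 (must sum to 1)
cum_p <- cumsum(p)             # c_1, c_2, c_3
u <- runif(1)
x <- which(u <= cum_p)[1]      # smallest i with u <= c_i, i.e. c_{i-1} < u <= c_i
x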

Sampling from a continuous distribution would seem more difficult. But in fact the same general principle can be applied. Consider a graphical representation of the above example, using the cumulative probability distribution.

What we are really doing is inverting this function, by starting from \(F(x)\) and seeing what \(x\) would have produced it. Technically in the discrete case \(F\) is not invertible, so a precise description of this takes some care. But in the continuous case this leads to a general procedure.

Proposition (Inverse CDF method). If \(U\sim U[0,1]\) and \(F_{X}\) is the cumulative distribution function associated with a random variable \(X\) such that \(F_{X}^{-1}\) is known, then \[ X\stackrel{d}{=}F_{X}^{-1}(U). \]

Proof. Again we show that the cumulative distribution functions are equal. Direct calculation gives \[ \mathbb{P}[F_{X}^{-1}(U)\leq x]=\mathbb{P}[U\leq F_{X}(x)]=F_{X}(x)=\mathbb{P}[X\leq x]. \]

The inverse CDF method, also known as the probability integral transform, gives a general recipe provided that an easy to evaluate analytical expression for \(F_{X}^{-1}\) can be found.

Example. If \(X\sim\text{Exp}(\lambda)\), then \(F_{X}(x)=1-e^{-\lambda x}\) for \(x>0\). Now \[ u=1-e^{-\lambda x}\implies x=-\frac{1}{\lambda}\log(1-u), \] meaning \(F_{X}^{-1}(u)=-\log(1-u)/\lambda\) for \(0<u<1\). Therefore if \(U\sim U[0,1]\), then \(F_{X}^{-1}(U)\sim\text{Exp}(\lambda)\).
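A minimal R sketch of the inverse CDF method for this exponential example; the value of \(\lambda\) is illustrative.

lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda      # F^{-1}(U)
c(mean(x), 1 / lambda)         # sample mean vs true mean 1/lambda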

The inverse CDF method is very powerful, but in some cases it is either not possible or not very easy to apply. So another way to generate samples from a given distribution is to consider whether it can be written as some combination of other random variables that do have invertible CDFs. This can in fact often be done.

Example (Gamma distribution). If \(Y\sim\text{Gamma}(n,\lambda)\), then in fact one can write \(Y\stackrel{d}{=}\sum_{i=1}^{n}X_{i}\), where each \(X_{i}\sim\text{Exp}(\lambda)\) and they are independent of each other.

Example (Chi-squared distribution). If \(U_{n}\sim\chi_{n}^{2}\) then \(U_{n}\stackrel{d}{=}\sum_{i=1}^{n}Z_{i}^{2}\), where each \(Z_{i}\sim N(0,1)\) and they are independent of each other.

Example (Cauchy distribution). If \(T\sim t_{1}\), then \(T\stackrel{d}{=}Z_{1}/Z_{2}\), where \(Z_{1}\) and \(Z_{2}\) are independent \(N(0,1)\) random variables.

Example (Bi-variate Normal distribution, Box–Muller method). If \((Z_{1},Z_{2}) \sim N(0, I_{2\times 2})\), then one can write \[ \begin{aligned} Z_{1} &\stackrel{d}{=}\sqrt{-2\log(U_{1})}\sin(2\pi U_{2}), \\ Z_{2} &\stackrel{d}{=}\sqrt{-2\log(U_{1})}\cos(2\pi U_{2}), \end{aligned} \] where \(U_{1}\) and \(U_{2}\) are independent \(U[0,1]\) random variables.
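A minimal R sketch of the Box–Muller construction.

n <- 5000
u1 <- runif(n)
u2 <- runif(n)
z1 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)
z2 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
# z1 and z2 behave as independent N(0, 1) samples; e.g. compare hist(z1) with dnorm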

Example (Dirichlet process). The Dirichlet process is a distribution over distributions. We write \(H(\cdot) \sim DP(\mu(\cdot), \alpha)\), which means that \(H(\cdot)\) is a ‘random probability measure’. This really means that each sample from a Dirichlet process is a probability distribution, which we can then draw samples from once again. The parameters are therefore \(\mu(\cdot)\), a probability distribution, and \(\alpha \in (0,\infty)\). Any \(H(\cdot)\) will be a distribution over a countably-infinite space. The process is used commonly in cluster analysis, in which some random measure is used to determine the locations of the cluster means. It is very difficult to directly sample a distribution \(H(\cdot)\), but we can directly sample from \(H(\cdot)\) without having to explicitly construct it using the following approach.

Set \(n=1\), \(k=1\), draw \(Y_k \sim \mu(\cdot)\).

  1. For each \(1\leq j \leq k\), set \(X_n = Y_j\) with probability \(n_j/(\alpha + n - 1)\)
  2. Otherwise:
    1. Set \(k = k+1\)
    2. Draw \(Y_k \sim \mu(\cdot)\)
    3. Set \(X_n = Y_k\)
  3. If \(n < N\) set \(n = n+1\) and return to 1, otherwise stop.

Step 2 therefore occurs with probability \(\alpha/(\alpha + n - 1)\). At iteration \(n\) the notation \(n_j = \sum_{i=1}^{n-1} \mathbb{I}(X_i = Y_j)\) denotes the number of samples (so far) that have taken the value \(Y_j\).
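A minimal R sketch of this scheme, assuming (purely for illustration) that the base measure \(\mu(\cdot)\) is \(N(0,1)\) and \(\alpha=1\); for simplicity the first distinct value is drawn inside the loop, since at \(n=1\) the 'new value' branch fires with probability one.

sample_dp_draws <- function(N, alpha = 1, draw_mu = function() rnorm(1)) {
  x <- numeric(N)
  values <- numeric(0)    # distinct values Y_j drawn from mu so far
  counts <- numeric(0)    # n_j: number of previous X_i equal to Y_j
  for (n in 1:N) {
    if (runif(1) < alpha / (alpha + n - 1)) {     # step 2: draw a fresh value from mu
      values <- c(values, draw_mu())
      counts <- c(counts, 0)
      j <- length(values)
    } else {                                      # step 1: reuse Y_j with prob n_j/(alpha+n-1)
      j <- sample(seq_along(values), 1, prob = counts)
    }
    counts[j] <- counts[j] + 1
    x[n] <- values[j]
  }
  x
}
xs <- sample_dp_draws(100)    # repeated values reflect the discreteness of H(.)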

2.4 Indirect sampling

In many cases, sampling directly from a distribution using some kind of transformation is simply not possible. In this case, we can resort to indirect sampling strategies. The first of these is known as rejection sampling. We begin with a simple example.

Example. Consider rolling a regular 6-sided die. If the result is larger than 3, then do not record it. If the result is between 1 and 3, then record it as a sample. What distribution is being sampled from here? And roughly how many die rolls are needed to produce \(n\) samples from this distribution?

The general idea of rejection sampling is to choose some candidate distribution, from which direct ‘candidate’ samples can be drawn, and then reject some of the candidate samples so that the collection of numbers that remain are effectively samples from a different distribution (often called the target). In the above example the candidate distribution is the discrete uniform over \(\{1,2,3,4,5,6\}\), and the target distribution is the uniform over \(\{1,2,3\}\). The general algorithm for a continuous target with density \(\pi(x)\) and candidate with density \(q(x)\) is given below. We will require an upper bound on the ratio of these, meaning an \(M<\infty\) such that \(M\geq\pi(x)/q(x)\) for all \(x\in\mathcal{X}\) (can you see why from the algorithm below?).

Rejection Sampling Algorithm

  • Require \(q(x)\), \(\pi(x)\), \(M\), and a desired number of samples \(n\)
  • Set \(i \leftarrow 0\)
  • While \(i < n\):
    • Draw \(X \sim q(\cdot)\)
    • Draw \(U \sim \mathcal{U}[0,1]\)
    • If \(U \leq \frac{\pi(X)}{Mq(X)}\), accept \(X\) as a sample and set \(i \leftarrow i + 1\)

Proposition. The rejection sampling algorithm produces samples from \(\pi(\cdot)\).

Proof. If \(X\sim q(\cdot)\) then for any event \(A\) \[ \begin{aligned} \mathbb{P}[X\in A|X\text{ accepted}] & =\frac{\mathbb{P}[X\in A,X\text{ accepted}]}{\mathbb{P}[X\text{ accepted}]}\\ & =\frac{\int_{A}q(x)\frac{\pi(x)}{Mq(x)}dx}{\int_{\mathcal{X}}q(x)\frac{\pi(x)}{Mq(x)}dx}\\ & =\frac{\frac{1}{M}\int_{A}\pi(x)dx}{\frac{1}{M}\int_{\mathcal{X}}\pi(x)dx}\\ & =\int_{A}\pi(x)dx. \end{aligned} \]

The proof of the above result also helps us understand how efficient the rejection sampling algorithm is, meaning how many samples will be rejected before we manage to produce a sample from \(\pi(\cdot)\). From above, the probability that a sample from \(q\) will be accepted is \[ \mathbb{P}[\text{accepted}]=\int q(x)\frac{\pi(x)}{Mq(x)}dx=\frac{1}{M}\int\pi(x)dx=\frac{1}{M}. \] So viewing the number of candidates drawn before acceptance as a geometric random variable with parameter \(1/M\), we see that on average \(M\) candidates will need to be drawn before a sample is accepted. Thus, it is desirable to find as small an \(M\) as possible, which in turn means finding a \(q(x)\) that looks as similar to \(\pi(x)\) as possible, in the sense that the ratio \(\pi(x)/q(x)\approx1\). Rejection sampling can work well provided a reasonable bound \(M\) can be found. This typically means that \(q(x)\) should have heavier tails than \(\pi(x)\), so that the ratio \(\pi(x)/q(x)\) does not grow arbitrarily large as \(\|x\|\to\infty\). Some knowledge of \(\pi(\cdot)\) is therefore needed to design a good algorithm.
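A minimal R sketch of rejection sampling, assuming (for illustration) that the target \(\pi\) is the Beta(2,2) distribution and the candidate \(q\) is \(U[0,1]\), for which \(M = 1.5\) bounds the density ratio.

rejection_sample <- function(n, M = 1.5) {
  samples <- numeric(0)
  while (length(samples) < n) {
    x <- runif(1)                                     # candidate draw from q
    u <- runif(1)
    if (u <= dbeta(x, 2, 2) / (M * dunif(x))) {       # accept with probability pi(x)/(M q(x))
      samples <- c(samples, x)
    }
  }
  samples
}
x <- rejection_sample(1000)    # on average M = 1.5 candidates are needed per accepted sample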

A second popular approach for indirect sampling is called importance sampling. In fact it is not really a sampling algorithm, but rather a way to perform Monte Carlo for one distribution by first sampling from another. But it can easily be adapted to become an indirect sampling algorithm.

The basic identity on which the importance sampling method is based is \[ \mathbb{E}_\pi[f(X)]=\int f(x)\pi(x)dx = \int f(x)\frac{\pi(x)}{q(x)}q(x)dx=\mathbb{E}_q\left[f(X)\frac{\pi(X)}{q(X)}\right]. \] Using this identity expectations with respect to \(\pi\) can instead be written as expectations with respect to \(q\). We typically write \[ w(X) := \frac{\pi(X)}{q(X)}, \] and call \(w\) the importance weights. The resulting procedure is outlined below.

Importance Sampling Algorithm

Require \(q(x)\), \(\pi(x)\), function of interest \(f(x)\), desired number of samples \(n\)

  • Draw \(X_i \sim q(\cdot)\) for \(1\leq i\leq n\)
  • For each \(i\) compute the importance weights \(w(X_i):=\pi(X_i)/q(X_i)\)
  • Compute \(\hat{f}^{IS}_n := n^{-1}\sum_{i=1}^n w(X_i)f(X_i)\)
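A minimal R sketch of the algorithm, estimating \(\mathbb{E}_\pi[X^2]\) where \(\pi\) is \(N(0,1)\), using a heavier-tailed \(t_3\) distribution as \(q\) (all choices illustrative).

n <- 10000
x <- rt(n, df = 3)                    # draws from q
w <- dnorm(x) / dt(x, df = 3)         # importance weights w(x) = pi(x)/q(x)
f_hat_is <- mean(w * x^2)             # should be close to E_pi[X^2] = 1
f_hat_is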

Exercise: Prove that the importance sampling estimator \(\hat{f}^{IS}_n\) is unbiased and consistent.

Comparing importance and rejection sampling as ways to perform Monte Carlo, we can see that in importance sampling an upper bound on \(\pi(x)/q(x)\) is not needed, whereas in rejection sampling it is. In addition, rejection sampling may take some time to produce \(n\) accepted samples, and the ‘throwing away’ of candidates seems wasteful; in importance sampling we only need to draw \(n\) samples from \(q(\cdot)\) and none are discarded.

To understand the quality of the importance sampling estimator \(\hat{f}^{IS}_n\), we must consider its variance. Recall that in ordinary Monte Carlo the estimator \(\hat{f}_n\) has variance \[ \begin{aligned} \text{Var}[\hat{f}_n] &= n^{-1}\text{Var}_\pi[f(X)] \\ &= n^{-1}\left( \mathbb{E}_\pi[f(X)^2] - \mathbb{E}_\pi[f(X)]^2 \right). \end{aligned} \] In the importance sampling case this variance is instead \[ \begin{aligned} \text{Var}[\hat{f}_n^{IS}] &= n^{-1}\text{Var}_q [f(X)w(X)] \\ &= n^{-1}\left( \mathbb{E}_q[w(X)^2f(X)^2] - \mathbb{E}_q[w(X)f(X)]^2 \right). \end{aligned} \] This quantity gives some hint that the weights \(w\) play a role in determining the quality of the importance sampling estimator. To be more precise about this issue, we might ask how many independent samples from \(\pi(\cdot)\) we would need to produce a Monte Carlo estimator with variance \(\text{Var}[\hat{f}^{IS}_n]\). Since \(\text{Var}[\hat{f}_n] = \text{Var}_\pi[f]/n\), we can write \(\text{Var}[\hat{f}_n^{IS}]\approx\text{Var}_\pi[f]/n_{eff}\), for some \(n_{eff}\in\mathbb{R}\), which may be smaller or larger than \(n\) (and may no longer be an integer). We refer to \(n_{eff}\) as the effective sample size. Simple algebra gives \[ n_{eff} := n\frac{\text{Var}[\hat{f}_n]}{\text{Var}[\hat{f}_n^{IS}]}. \] A simple approximation to the above quantity (derivation omitted), which is crucially independent of \(f\), is given by \[ n_{eff} \approx \frac{1}{\sum_{i=1}^n \tilde{w}_i^2}, \] where \(\tilde{w}_i := w_i/\sum_j w_j\) are the weights normalised to sum to one (these normalised weights reappear in the re-sampling algorithm below). Intuitively \(n_{eff}\) will be smaller when \(\sum_i \tilde{w}_i^2\) is larger, which occurs when the weights have a high variance. If we study the importance sampling estimate given a set of weighted samples \(\{(x_i,w_i)\}_{i=1}^n\), we have \[ n\hat{f}_n^{IS} = w_1 f(x_1) + ... + w_n f(x_n). \] Consider the extreme case in which \(w_2=...=w_n \approx 0\) and \(w_1 \gg 1\). Then the estimate is only really relying on a single sample \(x_1\). The importance weights are informing us that, relative to all other samples, \(x_1\) is the only choice that truly reflects what samples from \(\pi(\cdot)\) might look like. In this scenario the resulting estimator will typically be of poor quality. Can you think of what the underlying cause of this high variance in the weights is?
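A minimal R sketch of this effective sample size approximation, reusing the \(N(0,1)\) target and \(t_3\) candidate from the sketch above.

n <- 10000
x <- rt(n, df = 3)
w <- dnorm(x) / dt(x, df = 3)
w_tilde <- w / sum(w)                 # weights normalised to sum to one
n_eff <- 1 / sum(w_tilde^2)           # close to n when the weights are roughly equal
n_eff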

The importance sampling algorithm provides a collection of weighted samples \(\{(x_i,w_i)\}_{i=1}^n\). These can be used to draw samples from \(\pi(\cdot)\).

Sampling Importance Re-sampling algorithm

  • Require a collection of importance-weighted samples \(\{(x_i,w_i)\}_{i=1}^n\).
    • Compute the normalised importance weights \(\tilde{w}_i :=w_i/(\sum_i w_i)\) for \(1\leq i \leq n\)
    • Re-sample from the finite set of options \(\{x_1,...,x_n\}\), such that \(\mathbb{P}(\text{Sample is } x_i) = \tilde{w}_i\).
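A minimal R sketch of the re-sampling step, assuming `x` and `w` hold the weighted samples \((x_i, w_i)\) produced by the importance sampling sketch above.

w_tilde <- w / sum(w)                                            # normalised weights
x_resampled <- sample(x, size = length(x), replace = TRUE, prob = w_tilde)
# x_resampled behaves (approximately) like a set of draws from pi(.)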

The quality of this method is strongly linked to the quality of the importance sampling estimator. In particular, again taking an extreme case, if \(\tilde{w_1} \approx 1\), then every re-sampled point is highly likely to be \(x_1\). So drawing 100 samples using this procedure one might expect to get close to 100 \(x_1\)’s. Outside of very special circumstances this is unlikely to be representative of true samples from \(\pi(\cdot)\)! There is also a version of importance sampling in which the weights are normalised in the above manner before constructing an estimator. This has the advantage of being applicable when the density \(\pi(x)\) is only known up to a constant of proportionality (more on this later), and is also typically a lower variance estimator (but slightly biased). There are many other extensions and modifications of the basic importance sampling procedure, which constitute an active area of current research (most notably sequential Monte Carlo/particle filters). We will not discuss this further here.

3 Monte Carlo significance testing

The first clear cut application of Monte Carlo in Statistics that we will explore is testing. Let’s consider a basic test of a simple null hypothesis. We have some data \(\{x_1,...,x_n\}\), and we assume a probabilistic model \(X_i \sim \mu_\theta(\cdot)\), where \(\mu_\theta(\cdot)\) is some distribution parameterised by \(\theta\). Our null hypothesis might be \[ H_0: \theta = \theta_0. \]

A classical way to test the plausibility of this hypothesis using the data is to construct a test statistic that imposes some kind of ordering on the sample space. If we denote the test statistic \(T:\mathcal{X}^n \to [0,\infty)\), where \(n\) is the number of observations in the dataset and \(\mathcal{X}\) is the sample space, then we desire that \[ T(z) > T(y) \implies \text{$y:=\{y_1,...,y_n\}$ is more consistent with $H_0$ than $z:=\{z_1,...,z_n\}$.} \] To make some judgment about the plausibility of \(H_0\), we then evaluate the \(p\)-value \[ \mathbb{P}(T(X) > T(x)|H_0 \text{ true}). \] Small \(p\)-values indicate that \(H_0\) is not very plausible. To evaluate the \(p\)-value we need to know the distribution for \(T(X)\), where \(X:=\{X_1,...,X_n\}\), as we are required to evaluate the integral \[ \mathbb{P}(T(X) > T(x)|H_0 \text{ true}) = \int \mathbb{I}(T(z)>T(x))\prod_i\mu_{\theta_0}(z_i)dz_i. \] Unless we can evaluate this integral, which requires either knowing the distribution and quantiles of \(T(X)\) or being able to compute it by hand, then we cannot compute the \(p\)-value. In the early 20th century a collection of testing scenarios was constructed in which a known distribution for \(T(X)\) can be found, either exactly or asymptotically (e.g. \(z\)-test, \(t\)-test), but to move beyond this framework has proven to be very difficult.

To remedy the problem, we can instead use Monte Carlo. To evaluate how extreme the sample \(T(x)\) would be under \(H_0\), we do the following:

  1. Draw \(N\) pseudo-datasets \(z^{(i)}\) for \(i=1,...,N\), each of size \(n\), by drawing \(N\) sets of \(n\) samples from \(\mu_{\theta_0}(\cdot)\).

  2. For each of these compute \(T(z^{(i)})\)

  3. Compute the Monte Carlo \(p\)-value \[ \mathbb{P}(T(X) > T(x)|H_0 \text{ true}) \approx \frac{1}{N} \sum_{i=1}^N \mathbb{I}(T(z^{(i)}) > T(x)). \]

Crucially, the above test is applicable without needing to know the distribution for \(T(X)\), either exactly or asymptotically, and so is extremely widely applicable, and opens up testing beyond the classical framework.

We consider two examples here of the above procedure. The first is quite simple and mainly for illustrative purposes.

Example 1. You have \(30\) data points \(x :=\{x_1,...,x_{30}\}\) that you assume to be samples from an \(\text{Exp}(\lambda)\) distribution, as they represent delivery times in hours for packages from a certain courier. The courier claims that the average delivery time is 100 hours, so this means

\[ H_0: \lambda=0.01 \]

In this scenario a classical test statistic choice is \[ T(X) := \left| \frac{\sqrt{30}(\bar{X}-100)}{100} \right|. \] If we use the central limit theorem, then as \(n\to\infty\) the asymptotic distribution for \(T(X)\) will be a truncated standard Normal distribution (truncated to be positive). To compute the \(p\)-value one would therefore resort to looking up quantiles of the standard Normal distribution, either using software or tables. In our scenario the asymptotic distribution for \(T(X)\) is likely to be a good approximation, so we will use this to compare with our Monte Carlo procedure.

The empirical distribution for \(T(X)\) in this scenario is given below, using \(N=10,000\) simulated datasets, with the truncated Normal distribution overlaid.

N <- 10000
n <- 30
data <- matrix(nrow = N, ncol = n)
means <- rep(NA, N)

for (i in 1:N) {
  data[i,] <- rexp(n, rate = 0.01)
  means[i] <- mean(data[i,])
}

scaled_means <- sqrt(30)*(means-100)/100
abs_scaled <- abs(scaled_means)
hist(abs_scaled, freq = F)

tnorm <- function(x) { 2*dnorm(x) }

curve(tnorm, from = 0, to = 5, add = T, col = "red")

As can be seen from the figure, the empirical distribution for \(T(X)\) agrees very closely with the asymptotic approximation. This is reassuring as it suggests that the Monte Carlo procedure is a sensible one, and also adds support to our intuition that the asymptotic approximation is a good one here.
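To complete step 3 of the recipe, the Monte Carlo \(p\)-value is simply the proportion of simulated statistics exceeding the observed one; a minimal sketch, reusing `abs_scaled` from the code above and an illustrative observed value `t_obs` standing in for \(T(x)\).

t_obs <- 1.2
p_value <- mean(abs_scaled > t_obs)
p_value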

Example 2. (Testing for spatial randomness). You are given the spatial coordinates of \(n\) objects \(x_i \in S \subset \mathbb{R}^2\) (for example trees in a forest), and asked to test whether or not there is evidence of clustering. If the locations are randomly scattered (i.e. uniform on a two-dimensional rectangle of some sort) then some objects would happen to be grouped together and some spread out. If there is genuine clustering then you would expect objects to be systematically grouped together. A convenient test statistic to choose in this setting is the inverse of the average nearest neighbour distance \[ T(x) := \left( \frac{1}{n} \sum_i \min_{j\neq i} d(x_i,x_j) \right)^{-1}. \] Larger values of \(T\) suggest more evidence of clustering. If we believe that either iid uniform locations or those for which there is some clustering are the two plausible explanations for the data, then a valid null hypothesis is \[ H_0: \text{Locations are iid uniformly distributed on } S. \]

To test the hypothesis we simulate \(N\) sets of \(n\) object locations from a two-dimensional uniform distribution \(z^{(1)},...,z^{(N)}\), and compute the test statistics \(T(z^{(i)})\) for each \(i\in\{1,...,N\}\). The empirical distribution of test statistics is shown below in the case \(n = 20\) and \(S = [0,1]^2\), with \(N = 10,000\).

N <- 10000
n <- 20

xlocs <- matrix(nrow = N, ncol = n)
ylocs <- matrix(nrow = N, ncol = n)
distances <- matrix(nrow = N, ncol = n)
means <- rep(NA, N)

set.seed(123)
for (i in 1:N) {
  xlocs[i,] <- runif(n)
  ylocs[i,] <- runif(n)
}

test_stats <- rep(NA, N)
for (k in 1:N) {
  dmat <- matrix(nrow = n, ncol = n)
  d_min <- rep(Inf, n)
  for (i in 1:n) {
    for (j in 1:n) {
      dmat[i,j] <- sqrt( sum( (xlocs[k,i] - xlocs[k,j])^2 + (ylocs[k,i] - ylocs[k,j])^2 ))
      if(j==i) { dmat[i,j] = Inf }
    }
    d_min[i] <- min(dmat[i,])
  }
  test_stats[k] <- 1/mean(d_min)
}

hist(test_stats, freq = F, main = "", xlab = "Inverse nearest neighbour distance")
abline(v = quantile(test_stats, 0.95), col = "red", lwd = 3, lty = 2)
text(x = quantile(test_stats, 0.95) + 1, y = 0.2, labels = "95% quantile", col = "red")

This gives us a good indication of which values would be extreme under \(H_0\). We can take a look at some clustered data and compute the test statistic in this case.

n <- 20
sigma <- 0.05

set.seed(432)
means_x <- c(rep(0.2, n/2), rep(0.8, n/2))
means_y <- c(rep(0.2, n/2), rep(0.8, n/2))
xlocs <- means_x + rnorm(n, sd = sigma)
ylocs <- means_y + rnorm(n, sd = sigma)

dmat <- matrix(nrow = n, ncol = n)
d_min <- rep(Inf, n)
for (i in 1:n) {
  for (j in 1:n) {
    dmat[i,j] <- sqrt( sum( (xlocs[i] - xlocs[j])^2 + (ylocs[i] - ylocs[j])^2 ))
    if(j==i) { dmat[i,j] = Inf }
  }
  d_min[i] <- min(dmat[i,])
}
test_stat <- 1/mean(d_min)
# value = 34.43

plot(xlocs, ylocs, xlab = "", ylab = "", xlim = c(0,1), ylim = c(0,1),
     main = "Test Statistic = 34.43")

The test statistic value of \(34.43\) is much larger than all of the empirical values of \(T\) we simulated under \(H_0\), suggesting that this would be quite an extreme value to see in independent, uniformly sampled data.

4 MCMC Preliminaries

4.1 Bayesian inference

The goal of Bayesian inference is the same as the goal of any other type of inference: infer some unknown parameters \(\theta_{0}\in\Theta\) in a probabilistic model \(f_{\theta_{0}}(\cdot)\), given some data that are assumed to be samples from \(f_{\theta_{0}}(\cdot)\). So in terms of the big idea of building a probabilistic model for a phenomenon of interest and using some observations to learn the parameters of the model, whichever type of inference procedure you use the objective is the same. The differences are simply in the details.

In the Bayesian approach we begin by constructing a prior distribution. This is a distribution defined over \(\Theta\), the space of possible parameter values, and should summarise our beliefs about what \(\theta_{0}\) might be. In full Bayesian inference the prior distribution must be constructed before looking at the data - letting the data inform the prior choice results in what is known as empirical Bayes inference. Given the prior distribution, with associated density/mass function \(\pi_{0}(\theta)\) for any \(\theta\in\Theta\), we write the probabilistic model, or likelihood for some observed data \(y\) as \(f(y|\theta)\).5

With these two ingredients, it is straightforward to update our beliefs about \(\theta_{0}\) using Bayes’ theorem \[ \pi(\theta|y)=\frac{f(y|\theta)\pi_{0}(\theta)}{\int f(y|\theta)\pi_{0}(\theta)d\theta}. \] The density \(\pi(\theta|y)\) is known as the posterior. If a sensible prior distribution is chosen, then the laws of probability ensure that the posterior distribution is a proper characterisation of how the data should change our beliefs about what \(\theta_{0}\) might be.

There are many advantages and disadvantages to Bayesian inference, as with any other inference paradigm. Some advantages are:

  • It is straightforward to perform prediction whilst incorporating uncertainty about parameter values. Given some data \(y\), our best prediction for some new data \(y'\) is \(p(y'|y)=\int f(y'|\theta)\pi(\theta|y)d\theta\)
  • Similarly, marginal uncertainty about certain components of \(\theta_{0}\) is easy: simply compute the marginal posterior
  • In modern settings, the prior distribution is often used as a regulariser. This introduces some weak identifiability into the parameter space, playing a very similar role to penalisation in the frequentist setting. In complex modelling scenarios this ability to regularise through the prior, combined with the overall stability of the approach, has made Bayesian inference the default choice in many application areas. Regularisation is often also called shrinkage. Most data scientists today, whether Bayesian or not, agree that shrinkage is a good thing!

Some disadvantages are:

  • A prior distribution representing beliefs must be constructed. Often one would prefer not to do this, and some seemingly similar choices can give quite different posteriors if care is not taken
  • Computing the posterior up to a constant (something that does not depend on \(\theta\)) is easy, as \(\pi(\theta|y)\propto f(y|\theta)\pi_{0}(\theta)\), but computing it exactly, or computing expectations with respect to it, requires integration! Outside of special situations (known as conjugacy) the integrals in question are intractable.

The latter of these problems has motivated an entire field of research within statistics, known as Bayesian computation. The most popular approach generated by this field is Markov chain Monte Carlo, and this will be the focus of the rest of the course.

4.2 Unnormalised Monte Carlo in high dimensions

We have of course already covered methods for approximating an expectation \(\mathbb{E}_{\pi}[f(X)]\) based on drawing samples \(X_{i}\sim\pi(\cdot)\) for \(i=1,...,n\) and then computing \(n^{-1}\sum_{i=1}^{n}f(X_{i})\). There are two additional challenges we must now face:

  • We will typically only know \(\pi(\cdot)\) up to some proportionality constant
  • Often the parameter space we are interested in will be high-dimensional

The first of these renders many of the exact sampling approaches discussed earlier in the course impossible. The second problem is a result of Bayesian inference being most popular in more complex modelling settings, in which often inference is desired for many unknowns. Rejection sampling can still be performed provided that a ratio upper bound \(M\) can be found for the unnormalised version of \(\pi(\cdot)\), but this is often challenging. In addition, typically such an \(M\) will grow exponentially fast with the dimension of the space.

The solution to these issues is often to use a different Monte Carlo method. The challenge of direct independent sampling from \(\pi\) leads to an interesting relaxation: instead of drawing direct independent samples from \(\pi(\cdot)\), we draw dependent samples that are not exactly from \(\pi(\cdot)\), but in such a way that the samples will be close to samples from \(\pi\) after a suitable period of time. This sounds like a very strange way to approach the Bayesian computation problem, but in fact with these relaxations we can design methods for which the computational cost only grows very slowly with dimension. To introduce the dependence structures that we need, we will review Markov chains in the next subsection. Following this we will introduce the core family of algorithms used in Markov chain Monte Carlo, Metropolis–Hastings, and then discuss some practical issues before reviewing several example applications and briefly touching on further topics of study for those interested.

5 Markov chains (for MCMC)

A Markov chain is an example of a discrete-time stochastic process, so we first recall some general definitions before specialising to the processes we are interested in here.

Definition. An \(\mathcal{X}\)-valued discrete-time stochastic process is a collection of random variables \(\{X_{t}\}_{t\geq0}\), such that \(X_{t}\) is an \(\mathcal{X}\)-valued random variable for any integer \(t\geq0\). The space \(\mathcal{X}\) is referred to as the state space.

Example. The collection of random variables \(\{X_{t}\}_{t\geq0}\) described using the definitions \(\mathbb{P}(X_{0}=0)=\mathbb{P}(X_{0}=1)=1/2\), \(X_{t}=-2X_{t-1}\) is a \(\mathbb{Z}\)-valued discrete-time stochastic process. In this case the process is completely deterministic conditional on knowing the initial state \(X_{0}\).

Example. A non-homogeneous Bernoulli process \(\{X_{t}\}_{t\geq0}\) is a \(\{0,1\}\)-valued stochastic process such that \(X_{t}\sim\text{Bernoulli}(p_{t})\) for some sequence of probabilities \(\{p_{t}\}_{t\geq0}\). The sequence of probabilities themselves can be considered as a stochastic process (what would the state space be?).

A discrete-time stochastic process is a very general object, and as is often the case in mathematics, if we would like to say something interesting about a mathematical object we need to impose some structure. The goal is typically to find a very small amount of structure that we can impose, which will then allow us to say lots of interesting things about how the process might behave. Markovian structure serves just this kind of purpose.

Definition. An \(\mathcal{X}\)-valued discrete-time stochastic process \(\{X_{t}\}_{t\geq0}\) is called a Markov chain if for any fixed \(t\) and any \(x_{t}\in\mathcal{X}\) the random variable \(X_{t+1}|(X_{t}=x_{t})\) is independent of \(X_{t-k}\) for all \(1\leq k \leq t\). Symbolically this can be written \[ \left(X_{t+1}\perp\!\!\!\perp X_{t-k}\right)|X_{t}. \] In words we say that the future is independent of the past conditional on the present. The Markov property can also be written in terms of probabilities as \[ \mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t},X_{t-1}=x_{t-1},...,X_{0}=x_{0}\right)=\mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t}\right), \] provided that such conditional probabilities can be defined. In discrete settings this is usually straightforward, but conditioning on a set of zero probability, as in the event \(\{X=x\}\) when \(X\) is continuous, actually takes some care.6 The notation \(X\perp\!\!\!\perp Y\) means that \(X\) and \(Y\) are independent, i.e. that for any events \(A\) and \(B\) we can write \(\mathbb{P}(X\in A,Y\in B)=\mathbb{P}(X\in A)\mathbb{P}(Y\in B)\).

The Markovian structure permits us to write a recipe for producing a stochastic process. Informally we can write \[ \mathbb{P}\left(X_{0}=x_{0},X_{1}=x_1,...,X_{n}=x_{n}\right)=\mathbb{P}(X_{0}=x_{0})\prod_{i=1}^{n}\mathbb{P}(X_{i}=x_{i}|X_{i-1}=x_{i-1}). \] So to generate a Markov chain, we need only two things:

  • An initial distribution \(\mu_{0}\), such that \(\mathbb{P}(X_{0}\in A):=\mu_{0}(A)\) for all \(A\in\mathcal{B}\)
  • A transition kernel \(P(x,\cdot)\), such that \(\mathbb{P}(X_{i}\in A|X_{i-1}=x):=P(x,A)\) for all \(A\in\mathcal{B}\) and all \(x\in\mathcal{X}\)

The initial distribution tells us how to generate the first point in the chain. The transition kernel tells us how to generate the next point, given knowledge of the current point. So a realisation of the Markov chain \(\{X_{0},...,X_{n}\}\) can be produced by doing the following:

  1. Set \(i=1\)
  2. Sample \(X_{0}\sim\mu_{0}(\cdot)\)
  3. Sample \(X_{i}|X_{i-1}\sim P(X_{i-1},\cdot)\)
  4. If \(i<n\), set \(i\to i+1\) and return to step 3.

This may be a slightly more general construction than that given in previous courses, as it is designed to cover both discrete and continuous state space Markov chains. All Markov chains we will consider in this course will be time-homogeneous.

Definition. A Markov chain is called time-homogeneous if the transition kernel \(P\) does not depend on \(t\).

Another sometimes convenient way to define a Markov chain is by replacing the transition kernel with a recursion, i.e. an equation showing how to turn \(X_{i}\) into \(X_{i+1}\). Often it makes sense to write

\[ X_{i+1}=f(X_{i},\xi_{i+1}), \] where \(f\) is some suitable function and \(\xi_{i+1}\) is some random noise that is independent of everything else. This representation is sometimes called the non-linear state space model form of a Markov chain. It can be particularly useful when thinking about how to generate a Markov chain sequentially using an algorithm.

Example. Let \(X_{0}\sim\text{Bernoulli}(1/2)\) and introduce the matrix

\[ \bar{P} = \left( \begin{array}{cc} 1/3 & 2/3 \\ 2/3 & 1/3 \end{array}\right). \]

Let \(\mathbb{P}(X_{i}=1|X_{i-1}=0)=\bar{P}(1,2)=2/3\), \(\mathbb{P}(X_{i}=0|X_{i-1}=0)=\bar{P}(1,1)=1/3\), \(\mathbb{P}(X_{i}=0|X_{i-1}=1)=\bar{P}(2,1)=2/3\), and \(\mathbb{P}(X_{i}=1|X_{i-1}=1)=\bar{P}(2,2)=1/3\). The matrix \(\bar{P}\) is called the transition matrix. The transition kernel for any \(A \in \mathcal{B}\) is \[ P(x,A):=\sum_{y\in A}\bar{P}(x,y). \] The resulting stochastic process \(\{X_{0},X_{1},X_{2},...\}\) is a Markov chain, with state space \(\mathcal{X}=\{0,1\}\), initial distribution \(\mu_{0}(A)=\sum_{x\in A}(1/2)^{x}(1/2)^{1-x}\) and transition kernel \(P\).

Example. Let \(X_{0}\sim N(0,1)\) and introduce the recursion \[ X_{i}=\gamma X_{i-1}+\sqrt{(1-\gamma^{2})}Z_{i}, \] where \(Z_{i}\sim N(0,1)\) for all \(i\geq1\) and \(\gamma\in\mathbb{R}\) with \(|\gamma|<1\). Then \(\{X_{0},X_{1},X_{2},...\}\) is a Markov chain with state space \(\mathcal{X}=\mathbb{R}\) and since \(X_{i}|(X_{i-1}=x)\sim N(\gamma x,1-\gamma^{2})\) it has transition kernel

\[ P(x,A)=\frac{1}{\sqrt{2\pi(1-\gamma^{2})}}\int_{y\in A}\exp\left(-\frac{1}{2(1-\gamma^{2})}(y-\gamma x)^{2}\right)dy. \] This chain is conveniently represented using the non-linear state space model form as \(X_{i+1}=f(X_{i},Z_{i+1})\) with \(f(x,z):=\gamma x+\sqrt{1-\gamma^{2}}z\).
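A minimal R sketch simulating this chain, assuming \(\gamma = 0.9\) and starting from \(\mu_0 = N(0,1)\).

gamma <- 0.9
n <- 1000
x <- numeric(n + 1)
x[1] <- rnorm(1)                                              # X_0 ~ mu_0 = N(0, 1)
for (i in 1:n) {
  x[i + 1] <- gamma * x[i] + sqrt(1 - gamma^2) * rnorm(1)     # one step of the recursion
}
plot(x, type = "l", xlab = "i", ylab = "X_i")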

Markov chains are fascinating objects of study and have many interesting features. But the most relevant feature for our purposes is their long-term behaviour. Consider the following example.

Example (re-visited). If we take the matrix \(\bar{P}\) from above, then the \(i\)th row gives the probability distribution for the next state, given that the current state is the \(i\)th element of \(\mathcal{X}\). In fact, the \(i\)th row of the matrix \(\bar{P}^{n}\) gives the distribution for the \((t+n)\)th state, given that the \(t\)th state is the \(i\)th element of \(\mathcal{X}\) (check this). Examining this matrix for different values of \(n\) gives \[ \begin{aligned} \bar{P} & =\left(\begin{array}{cc} 1/3 & 2/3\\ 2/3 & 1/3 \end{array}\right)\\ \bar{P}^{2} & =\left(\begin{array}{cc} 0.556 & 0.444\\ 0.444 & 0.556 \end{array}\right)\\ \bar{P}^{5} & =\left(\begin{array}{cc} 0.498 & 0.502\\ 0.502 & 0.498 \end{array}\right) \end{aligned} \] What is happening to each row of \(\bar{P}^{n}\) as \(n\to\infty\)? What does this mean in terms of \(\mathbb{P}\left(X_{n}\in A|X_{0}=x\right)\)?
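This convergence of the rows can be checked numerically; a minimal R sketch using the matrix above.

P_bar <- matrix(c(1/3, 2/3,
                  2/3, 1/3), nrow = 2, byrow = TRUE)
P2 <- P_bar %*% P_bar
P5 <- P2 %*% P2 %*% P_bar
round(P2, 3)    # rows (0.556, 0.444) and (0.444, 0.556)
round(P5, 3)    # rows very close to (0.5, 0.5)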

The above example is colloquially referred to as the Markov chain forgetting its initial state. Many Markov chains have the property that regardless of the starting point, when \(n\) is large the distribution of the random variable \(X_{n}\) begins to resemble that of a fixed distribution. The phenomenon is called convergence to equilibrium. To be precise about what exactly we mean by this, we first need some more definitions.

Definition. The \(n\)-step transition kernel for a Markov chain \(P^{n}\) is such that for any \(A\in\mathcal{B}\) and any \(x\in\mathcal{X}\), \(P^{n}(x,A)\) denotes the probability that \(X_{n}\in A\) given that \(X_{0}=x\). We write \[ X_{n}|(X_{0}=x)\sim P^{n}(x,\cdot), \] provided that this quantity is well-defined. In words, \(P^{n}(x,\cdot)\) denotes the probability distribution for the \((t+n)\)th state, given that the \(t\)th state is \(x\).

To discuss the long run behaviour of the distribution \(P^{n}(x,\cdot)\) we need to introduce some notion of distance between two probability distributions. There are several interesting choices but we will keep things classical here.

Definition. An integral distance between two probability distributions \(\mu(\cdot)\) and \(\nu(\cdot)\) defined on \((\mathcal{X},\mathcal{B})\) is given by \[ \mathcal{D}_\mathcal{F}(\mu(\cdot),\nu(\cdot)):=\sup_{f \in \mathcal{F}}|\mathbb{E}_\mu f - \mathbb{E}_\nu f|, \] where \(\mathcal{F}\) denotes some representative collection of real-valued functions. A collection \(\mathcal{F}\) is called representative if \(\mathcal{D}_\mathcal{F}(\mu(\cdot),\nu(\cdot)) = 0 \iff \mu(\cdot) = \nu(\cdot)\).

In words, this denotes the largest difference between the expectations of a function in the class \(\mathcal{F}\) calculated using the distributions \(\mu(\cdot)\) and \(\nu(\cdot)\). In order to be a proper distance metric, the collection \(\mathcal{F}\) of functions used to define \(\mathcal{D}_\mathcal{F}\) must be sufficiently rich that \(\mathcal{D}_\mathcal{F}(\mu(\cdot),\nu(\cdot))=0\) if and only if \(\mu(\cdot)=\nu(\cdot)\). It might seem as if \(\mathcal{F}\) must contain all possible functions, but this is not the case. Two simple examples of collections of functions for which this property would hold are:

  • \(\mathcal{F}^* := \{ \mathbb{I}_A:\mathcal{X} \to \mathbb{R} : A \in \mathcal{B} \}\), the collection of indicator functions for elements of event space. Since \(\mathbb{E}_\mu[\mathbb{I}_A(X)]=\mathbb{P}(X\in A)\), the resulting distance \(\mathcal{D}_{\mathcal{F}^*}\) measures differences in probabilities
  • \(\mathcal{F}':=\{f_s(x):= e^{s^T x} : s \in \mathbb{R}^d \}\), which would result in \(\mathbb{E}[f_s(X)]\) being the moment generating function for the distribution, evaluated at \(s\) (this only makes sense if the moment generating functions are well-defined).

We now come to a more rigorous mathematical definition of convergence in the sense discussed above.

Definition. A Markov chain with transition kernel \(P\) has a unique limiting distribution \(\pi(\cdot)\) if for (\(\pi-\)almost)7 all starting points \(x\in\mathcal{X}\), it holds that \[ \lim_{n\to\infty}\mathcal{D}_{\mathcal{F}^*}\left( P^{n}(x,\cdot), \pi(\cdot) \right)=0. \] The limiting distribution in the above example is simply \((1/2,1/2)\). The formal definition is, however, not very practical. We would like to also have some notion of how to determine whether or not a certain \(\pi(\cdot)\) is the limiting distribution for a chain with kernel \(P\). Again we need a couple of extra definitions first.

Definition. A Markov chain with transition kernel \(P\) is called

  1. \(\pi\)-irreducible if for every \(A\in\mathcal{B}\) and every \(x\in\mathcal{X}\) there is an \(n<\infty\) such that

\[ \pi(A)>0\implies P^{n}(x,A)>0. \]

  2. Aperiodic if there is no collection of disjoint events \(A_{0},...,A_{k-1}\in\mathcal{B}\) such that

\[ x\in A_{i}\implies P(x,A_{(i+1)\mod k})=1. \]

(If this is not the case the chain is called periodic).

  3. \(\pi\)-invariant if for all \(n\geq0\)

\[ X_{t}\sim\pi(\cdot)\implies X_{t+n}\sim\pi(\cdot). \]

In words, a chain is \(\pi\)-irreducible if any event that can happen under \(\pi(\cdot)\) can eventually also happen during the simulation of the chain. Informally we generally drop the \(\pi\) and just talk about irreducibility. The general concept refers to a chain being able to explore different regions of the state space. If we desire the limiting distribution of the chain to be a certain \(\pi(\cdot)\), then clearly it must be that any events that can occur under \(\pi(\cdot)\) must be able to happen under the chain dynamics. A periodic chain cycles through distinct regions of the state space in a deterministic fashion. Periodicity will not be that relevant in our case, as it is not a common occurrence for the Markov chains we will study. If \(\pi(\cdot)\) is the limiting distribution for the chain then it must be an invariant distribution, but the reverse is not always true.

Crucially, however, \(\pi\)-invariance, irreducibility and aperiodicity are often easy to check in practice, meaning the theorem given below (after a few examples) is very useful for our purposes.

Example (\(\pi\)-irreducibility). The chain with transition \[ X_{i+1} = \gamma X_i + \sqrt{1-\gamma^2} Z_{i+1} \] (defined fully in the next example) is \(\pi\)-irreducible whenever \(\pi(\cdot)\) is a continuous distribution on \(\mathbb{R}\), since the Gaussian noise ensures \(P(x, A) > 0\) for any \(x \in \mathbb{R}\) and any interval \(A := [a,b]\) with \(b > a\).

Example (\(\pi\)-invariance). Take the continuous example \[ X_{i+1} = \gamma X_i + \sqrt{1-\gamma^2}Z_{i+1}, \] for \(|\gamma|<1\) and \(Z_{i+1} \sim N(0,1)\), independent of \(X_i\). If \(X_i \sim N(0,1)\), then \(X_{i+1}\) is a linear combination of two independent Gaussians, so also Gaussian (how would you prove this?). It is easy to see that \(\mathbb{E}[X_{i+1}] = 0\). To compute the variance we have \[ \begin{aligned} \text{Var}[X_{i+1}] &= \gamma^2 \text{Var}[X_i] + (1-\gamma^2)\text{Var}[Z_{i+1}] \\ &= \gamma^2 + (1 - \gamma^2) \\ &= 1. \end{aligned} \] So \(N(0,1)\) is an invariant distribution (but is it the only one?).
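A quick numerical check of this invariance property can be done in R (an illustrative sketch only; the value of \(\gamma\) is an arbitrary choice): start from a draw from \(N(0,1)\), take one step of the chain, and verify that the marginal mean and variance are still approximately 0 and 1.

set.seed(1)
gamma <- 0.7
nreps <- 1e5
x <- rnorm(nreps)                                        # X_i ~ N(0,1)
x_next <- gamma * x + sqrt(1 - gamma^2) * rnorm(nreps)   # one step of the chain
c(mean(x_next), var(x_next))                             # should be close to (0, 1)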

Example (\(\pi\)-invariance but lack of \(\pi\)-irreducibility). The chain with transition matrix \[ \bar{K} = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) \] is \(\pi\)-invariant for any \(\pi(\cdot)\) defined on \(\{1,2\}\), but is not \(\pi\)-irreducible for any \(\pi(\cdot)\) that puts positive mass on both states, as the states that the chain is able to reach depend entirely on the starting point.

Theorem. If a Markov chain with transition kernel \(P\) is \(\pi\)-invariant, \(\pi\)-irreducible and aperiodic, then \(\pi(\cdot)\) is the unique invariant distribution, and is also the limiting distribution for the chain.

In fact there is a very easy way to check \(\pi\)-invariance of a Markov chain if a little bit more structure is imposed. The Markov chains on which the vast majority of Markov chain Monte Carlo algorithms are based (although this is changing) are reversible. This means that they satisfy what are known as the detailed balance equations. For chains defined on finite state spaces a chain is \(\pi\)-reversible if the transition matrix satisfies \[ \pi(x)\bar{P}(x,y) = \pi(y)\bar{P}(y,x) \] for all \(x,y \in \mathcal{X}\). If the chain is defined on a continuous space but has a transition density \(p\), then the same equations can be expressed in terms of densities as \[ \pi(x)p(x,y) = \pi(y)p(y,x). \] There is a more general definition in terms of measures that we omit. The reason reversibility is such a useful property, however, is that it gives an easy way to check whether a Markov chain is \(\pi\)-invariant.

Theorem. If a Markov chain is \(\pi\)-reversible, then it is \(\pi\)-invariant.

Proof. Consider the continuous case. If \(X_n \sim \pi(\cdot)\), then for any \(A \in \mathcal{B}\) \[ \begin{aligned} \mathbb{P}[X_{n+1} \in A] &= \int_{x \in\mathcal{X}} \int_{y \in A} \pi(x)p(x,y)dxdy \\ &= \int_{x \in\mathcal{X}} \int_{y \in A} \pi(y)p(y,x)dxdy \\ &= \int_{y \in A} \left[ \int_{x \in\mathcal{X}} p(y,x)dx \right] \pi(y)dy \\ &= \int_{y \in A} \pi(y)dy \\ &= \pi(A), \end{aligned} \] where the fourth line follows because \(p(y,\cdot)\) is a probability density and so the bracketed integral equals one. In the finite case the same argument can be constructed with sums in place of integrals.

Example. Consider yet again the chain with transition matrix \[ \bar{P} = \left( \begin{array}{cc} 1/3 & 2/3 \\ 2/3 & 1/3 \end{array}\right). \] Here \[ \pi(1)\bar{P}(1,2) = \frac{1}{2}\cdot\frac{2}{3} = \pi(2)\bar{P}(2,1), \] so the chain is reversible with respect to the uniform distribution on \(\{1,2\}\), which is therefore an invariant distribution.

The last piece of Markov chain theory we will introduce in the course before looking at examples is the Markov chain equivalent of the strong law of large numbers. It is very important for Markov chain Monte Carlo.

Theorem (Markov chain ergodic theorem). If a Markov chain \(\{X_{t}\}_{t\geq0}\) is \(\pi\)-irreducible and \(\pi\)-invariant, and \(f:\mathcal{X}\to\mathbb{R}\) satisfies \(\mathbb{E}_{\pi}[|f(X)|]<\infty\), then the ergodic average \(\hat{f}_{n}:=n^{-1}\sum_{i=0}^{n-1}f(X_{i})\) satisfies \[ \mathbb{P}\left(\lim_{n\to\infty}\hat{f}_{n}=\mathbb{E}_{\pi}[f(X)]\right)=1. \] Note that the above result does not require aperiodicity, and so the conditions are slightly weaker than those needed for convergence to equilibrium. This fact has not really been exploited in Markov chain Monte Carlo up until the time of writing, but is nonetheless true.

6 Markov chain Monte Carlo

The basic idea of Markov chain Monte Carlo is to construct a Markov chain in such a way that the limiting distribution is something from which we desire expectations. In Bayesian inference this is the posterior distribution.8

In fact, the vast majority of approaches to doing this over the past 70 years have all stemmed from the same dominating algorithm, of which many specific flavours exist. The focus of this chapter is to understand how it works.

6.1 The Metropolis–Hastings algorithm

Almost all MCMC methods in use today are either a particular case of the Metropolis–Hastings algorithm or can be derived from it in some way. A specific version was invented by Metropolis et al. (1953) at the Los Alamos National Laboratory. The general approach was introduced by Hastings (1970). Interestingly, however, MCMC only really became popular in the 1990s, when computer processing power reached a level that meant algorithms of this kind could be routinely executed on a personal computer. Widespread popularisation of the technique led to a revolution of sorts in Statistics, as Bayesian inference suddenly became a workable and accessible part of the data scientist’s toolkit.

The Metropolis-Hastings algorithm is based loosely on the familiar accept-reject principle. The idea is to introduce some candidate transition kernel \(Q:\mathcal{X} \times \mathcal{B} \to [0,1]\), and at each iteration of the algorithm generate a candidate next position in the Markov chain using \(Q\). This candidate (also called the proposal) is then either accepted, or rejected, in which case the next point in the Markov chain is set to be the same as the current point. Remarkably, if the acceptance rate is chosen appropriately, it is not difficult to show that such a procedure will produce a \(\pi\)-invariant Markov chain, and often also not hard to show that \(\pi(\cdot)\) will be the limiting distribution.

We will begin by looking at a simple variant, known as the Random Walk Metropolis.

Random Walk Metropolis algorithm

Require: Starting point \(X_0=x\), target \(\pi(\cdot)\), number of iterations \(n\), step-size \(h>0\).

For \(i \in \{0,...,n-1\}\)

  • Draw \(\xi_i \sim N(0,1)\) and set \(Y = X_i + h\xi_i\)
  • Draw \(U \sim \mathcal{U}[0,1]\)
  • If \(U \leq \alpha(X_i,Y)\), set \(X_{i+1} = Y\), otherwise set \(X_{i+1} = X_i\)

Output the Markov chain \(\{X_1,...,X_n\}\).

The acceptance rate in the above procedure is \(\alpha:\mathcal{X} \times \mathcal{X} \to[0,1]\). There are in fact many choices of \(\alpha\) that are valid, in the sense that the resulting Markov chain would be \(\pi\)-invariant, but the preferred choice is \[ \alpha(x,y) = \min\left(1, \frac{\pi(y)}{\pi(x)}\right). \] The third step of the algorithm serves to accept the proposal \(Y\) with probability \(\alpha(X_i,Y)\), and reject it (and hence stay put) with probability \(1-\alpha(X_i,Y)\). The name ‘random walk’ comes from the fact that without the accept-reject step the algorithm would produce a realisation of a simple random walk, in which the next state of the Markov chain is simply a symmetric mean-zero perturbation of the current state.

Notice that the only way in which the density \(\pi\) is used in the above algorithm is through the ratio \(\pi(Y)/\pi(X_{i})\) in the acceptance rate. This is precisely what makes the Metropolis–Hastings algorithm applicable when only \(c\pi\) is known (for example in the Bayesian inference setting), since for any \(y,x \in \mathcal{X}\) for which the ratio is well-defined it is trivial to see that for any \(c > 0\) \[ \frac{c\pi(y)}{c\pi(x)} = \frac{\pi(y)}{\pi(x)}. \]

Let’s run this algorithm to produce a realisation of a Markov chain when \(\pi(\cdot)\) is a standard Gaussian distribution and \(X_0 = 0\).

set.seed(759)
logpi_gauss <- function(x) { dnorm(x, log = T) }
nits <- 2000

# create a 1d random walk metropolis sampling function
RWM_1d <- function(logpi, nits, h, x_curr) {
  logpi_curr <- logpi(x_curr)
  accepted <- 0
  x_store <- rep(NA, nits)
  
  for (i in 1:nits) {
    # propose a candidate move
    x_prop <- x_curr + h*rnorm(1)
    logpi_prop <- logpi(x_prop)
  
    # accept-reject
    loga <- logpi_prop - logpi_curr
    u <- runif(1)
    if (log(u) < loga) {
      x_curr <- x_prop
      logpi_curr <- logpi_prop
      accepted <- accepted + 1
    }
    x_store[i] <- x_curr
  }
  
  return(list(x_store = x_store, a_rate = accepted/nits))
}

# run the function to generate a Markov chain
mc <- RWM_1d(logpi = logpi_gauss, nits = nits, h = 2.4, x_curr = 0)

# plot the chain and output the acceptance rate
plot(mc$x_store, type = 'l', xlab = "t", ylab = "X(t)")

cat("The acceptance rate is: ",mc$a_rate)
## The acceptance rate is:  0.4705

The above figure is called a trace plot for the chain. On the x-axis is time \(t\), and on the y-axis is the state of the chain at time \(t\). Most of the samples are concentrated between -2 and 2, as would be expected of samples from the standard Normal distribution, which is reassuring. In fact, we can use the samples to estimate the probability that a standard normal random variable will be between -2 and 2.

sum((-2 < mc$x_store)*(mc$x_store < 2)*1)/nits
## [1] 0.9475

This is close to the true value of approximately 0.954. Note that we have just used the algorithm to construct an MCMC estimate for the integral \[ \int \mathbb{I}_{(-2,2)}(x) \frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx. \] To estimate the mean of the target distribution we simply average the samples.

mean(mc$x_store)
## [1] 0.009321799

Again this is very close to the true value. Of course, the quality of this approximation varies depending on how many iterations of the algorithm have been performed, or equivalently how long the chain is. We can plot the running mean of the samples to see when the estimate becomes stable.

plot(cumsum(mc$x_store)/(1:nits), type = 'l', xlab = "t", ylab = "Mean at time t")
abline(h = 0, lty = 2, lwd = 2, col = "blue")

Another important indicator of algorithm performance is the average acceptance rate for proposals. In the above scenario, the main quantity governing this is the step-size \(h\). If the step-size is too large, then proposed moves will be wildly different to the current state. Presuming that the current state is a reasonable sample from \(\pi(\cdot)\), then moving very far away from this region in a blind manner does not seem like the best idea, and so it is likely that most proposals will be rejected. In this instance the chain tends to get ‘stuck’ in particular places for a long time.

bigstep <- RWM_1d(logpi = logpi_gauss, nits = nits, h = 24, x_curr = 0)
plot(bigstep$x_store, type = 'l', xlab = "t", ylab = "X(t)")

cat("The acceptance rate is: ", bigstep$a_rate)
## The acceptance rate is:  0.043

Conversely, when the step-size is too small then most proposals will be very close to the current point, such that the ratio \(\pi(Y)/\pi(X) \approx 1\). This will result in a very high acceptance rate. But, there will be strong correlation between samples, and so it will take the chain a very long time to explore all of the parts of \(\mathcal{X}\) that have an appreciable probability under \(\pi(\cdot)\). So this strategy tends to produce poor estimators too.

smallstep <- RWM_1d(logpi = logpi_gauss, nits = nits, h = 0.1, x_curr = 0)
plot(smallstep$x_store, type = 'l', xlab = "t", ylab = "X(t)")

cat("The acceptance rate is: ", smallstep$a_rate)
## The acceptance rate is:  0.9835

There is in fact a sweet spot, in which the step-size is chosen such that some proposals are accepted and some are rejected. The intuitive idea is to make the step-size large enough that the chain can explore the state space quickly, but not so large that too many proposals are rejected. In the case of the random walk Metropolis in one dimension, the optimal acceptance rate is typically around 44%. In the high-dimensional setting the optimal acceptance rate is around 23%. These guidelines are based on some very interesting probabilistic limiting arguments, in which the Markov chains produced by the algorithm are related to a diffusion process and the speed of this diffusion is then optimised. It is often the case in mathematics that limiting objects of this nature possess more structure and are therefore easier to study, and so this approach is a common one. We will not explore it further here, but you will learn about diffusion processes in the second half of the module.

Users of Markov chain Monte Carlo methods become very skilled in the art of assessing how well the algorithm has performed by looking at traceplots and various other diagnostics of this kind, which we will study in more depth.

Exercise: Can you write a Random Walk Metropolis algorithm to generate samples from a \(N(5,10)\) distribution?

The more general version of the Metropolis-Hastings algorithm is given below.

Metropolis-Hastings algorithm

Require: Starting point \(X_0=x\), target \(\pi(\cdot)\), proposal kernel \(Q\), number of iterations \(n\).

  • For \(i \in \{0,...,n-1\}\)
    • Draw \(Y \sim Q(X_i,\cdot)\)
    • Draw \(U \sim \mathcal{U}[0,1]\)
    • If \(U \leq \alpha(X_i,Y)\), set \(X_{i+1} = Y\), otherwise set \(X_{i+1} = X_i\)
  • Output the Markov chain \(\{X_1,...,X_n\}\).

To preserve \(\pi\)-invariance in the above procedure, the preferred acceptance rate is now \[ \alpha(x,y) = \min\left(1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right). \] Here \(q(x,y)\) is the transition density corresponding to \(Q\), meaning for any \(A \in \mathcal{B}\) \[ Q(x,A) = \int_A q(x,y)dy. \] Note that in the random walk Metropolis case above \[ q(x,y) = \frac{1}{\sqrt{2\pi h^2}}\exp\left\{ -\frac{1}{2h^2}(y-x)^2 \right\} = q(y,x), \] meaning that the ratio \(q(x,y)/q(y,x)=1\).

In the random walk case the acceptance rate has an intuitive interpretation: if the proposal has higher density than the current point under \(\pi(\cdot)\) the move is certain to be accepted, and otherwise it is accepted with probability \(\pi(y)/\pi(x)\). In the general case this intuition is lost, but nonetheless choosing a different form of proposal can be advantageous. Some other popular choices for proposals in Metropolis-Hastings are:

  • \(Y \sim Q(\cdot)\), meaning the proposal distribution \(Q(\cdot)\) does not depend on the current point \(X_i\). The resulting method is called the independence sampler. This algorithm is simple to implement and analyse, but often performs poorly in high-dimensional settings.

  • \(Y = X_i + h^2\nabla\log\pi(X_i)/2 + h\xi_i\), meaning the candidate kernel is a Gaussian that is not centred at the current state, but instead moved in the direction of the gradient of \(\log\pi\). The intuition is that if \(h\) is well-chosen then \(\pi(Y)\) will hopefully be larger than \(\pi(X_i)\), or at least not much smaller than it. This approach is called the Metropolis-adjusted Langevin algorithm, or MALA (a small one-dimensional sketch is given after this list). It tends to work better than the Random Walk Metropolis in high-dimensional settings, but can be more sensitive to the step-size choice.9

  • The state-of-the-art in Markov chain Monte Carlo to fit many probabilistic models of interest is considered by many to be Hamiltonian Monte Carlo, in which Metropolis-Hastings proposals are constructed by numerically solving a Hamiltonian dynamical system. This sounds very complicated but is actually in some sense a generalisation of MALA, and follows the same principle in which gradients are used to guide proposals. We will not discuss the details here but note that the software Stan provides an open source implementation that is growing in popularity among practitioners. As with MALA the method performs extremely well in high-dimensional settings provided that the tuning parameters are well-chosen.

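The notes do not include code for MALA, but a minimal one-dimensional sketch is given below, under the assumption that the target is a standard Gaussian, for which \(\nabla\log\pi(x) = -x\). The function name, step-size and number of iterations are illustrative choices only.

# a sketch of 1-d MALA for a standard Gaussian target (illustrative only)
set.seed(314)
logpi <- function(x) dnorm(x, log = TRUE)
grad_logpi <- function(x) -x          # gradient of log pi for the N(0,1) target

MALA_1d <- function(logpi, grad_logpi, nits, h, x_curr) {
  x_store <- rep(NA, nits)
  accepted <- 0
  for (i in 1:nits) {
    # Langevin proposal: drift towards higher density, plus Gaussian noise
    mu_curr <- x_curr + 0.5 * h^2 * grad_logpi(x_curr)
    x_prop <- mu_curr + h * rnorm(1)
    mu_prop <- x_prop + 0.5 * h^2 * grad_logpi(x_prop)
    # the proposal is not symmetric, so the ratio of proposal densities appears
    loga <- logpi(x_prop) - logpi(x_curr) +
      dnorm(x_curr, mean = mu_prop, sd = h, log = TRUE) -
      dnorm(x_prop, mean = mu_curr, sd = h, log = TRUE)
    if (log(runif(1)) < loga) {
      x_curr <- x_prop
      accepted <- accepted + 1
    }
    x_store[i] <- x_curr
  }
  list(x_store = x_store, a_rate = accepted / nits)
}

mala <- MALA_1d(logpi, grad_logpi, nits = 2000, h = 1, x_curr = 0)

Note how the log acceptance ratio now includes the two proposal densities \(q(y,x)\) and \(q(x,y)\), which no longer cancel as they did in the random walk case.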
The key mathematical result of the chapter is a proof that the Metropolis-Hastings algorithm does indeed produce a \(\pi\)-invariant Markov chain, together with some example conditions (stronger than strictly necessary) under which \(\pi(\cdot)\) is the limiting distribution for the chain.

Theorem. The following hold for a Markov chain produced using the Metropolis-Hastings algorithm:

  1. The chain is \(\pi\)-invariant.
  2. If there exists \(M<\infty\) such that for all \(x,y \in \mathcal{X}\), \(0 < \pi(y) \leq M\) and \(q(x,y) > 0\), then \(\pi(\cdot)\) is the unique limiting distribution for the chain.

Proof. For part 1, we sketch the argument. Consider first the case in which \(y \neq x\) is proposed and then accepted. The density associated with this event is \[ q(x,y)\alpha(x,y). \] We can show detailed balance holds for \(\pi(\cdot)\) here. Using the above expression for the density associated with moving from \(x\) to \(y\), we have

\[ \begin{aligned} \pi(x)q(x,y)\alpha(x,y) &= \pi(x)q(x,y) \min \left(1, \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right) \\ &= \min\left( \pi(x)q(x,y), \pi(y)q(y,x) \right) \\ &= \pi(y)q(y,x) \min \left(1, \frac{\pi(x)q(x,y)}{\pi(y)q(y,x)}\right) \\ &= \pi(y)q(y,x)\alpha(y,x). \end{aligned} \] The alternative scenario is that in which the proposal is rejected. In that instance it is very easy to see that the detailed balance equations hold, as \(y=x\). For part 2, we simply note that the conditions imply that \[ q(x,y)\alpha(x,y) > 0 \] for all \(x,y \in \mathcal{X}\) such that \(\pi(y) > 0\), meaning that for any set \(A \in \mathcal{B}\) \[ \int_A \pi(y)dy > 0 \implies \int_A q(x,y)\alpha(x,y)dy > 0, \] and hence the chain is \(\pi\)-irreducible. Because there exist proposals for which \(\alpha < 1\), there is a non-zero probability of rejection, meaning that the chain is also aperiodic.

Remark. Note that condition (2) is sufficient but not necessary for \(\pi(\cdot)\) to be the limiting distribution; much weaker conditions are possible.

6.2 A case study

Here we illustrate a basic example of using MCMC for a practical inference problem, a Bayesian Poisson regression. The model and priors are as follows

\[ \begin{aligned} Y_i &\sim \text{Poisson}(e^{\sum_{j=1}^d \beta_jx_{ij}}), \\ \beta_j &\stackrel{iid}{\sim} N(0, 1^2). \end{aligned} \] The data will be \(\{(y_i, x_{i1},...,x_{id})\}_{i=1}^n\). Each \(y_i\) represents the observed response, typically a count of some kind, and each \(x_{ij}\) will be the value of some covariate, or feature, of the \(i\)th sample in the dataset. The response \(y_{i}\) could represent the number of times a student visited the university library in a year, with the covariates giving information about the degree programme and educational background of the student. The response could also be the infant mortality rate in a geographical region, with the covariates representing socio-economic characteristics related to the region. Equally \(y_i\) could be the number of times a stock has reduced in price by more than 10% in a certain time period, with the \(x_{ij}\)’s giving information about the industrial sector and history of the company. There are many types of data and phenomena for which this kind of probabilistic model will be appropriate, so feel free to impose whichever is of most interest to you onto the problem!

In this example we have imposed \(N(0,1^2)\) independent priors for each \(\beta_j\). This is of course something that the user is free to choose. The choice here reflects that a priori we have no reason to assume that different \(\beta\)’s will be dependent (they may still be a posteriori thanks to the data), and we are regularising the \(\beta_j\)’s to be not too far from zero. The practical interpretation is that if \(x_{ij}\) increases by 1, the mean of the Poisson random variable \(Y_i\) will increase by a multiplicative factor \(e^{\beta_j}\), meaning \(\mu_{i,\text{new}} = e^{\beta_j} \mu_{i,\text{old}}\). By saying that there is an approximately 95% probability a priori that \(\beta_j \in (-2,2)\) we are effectively saying that we believe that plausible effects of a 1 unit increase in any \(x_{ij}\) are that the mean \(\mu_i\) would change by a factor of between \(e^{-2}\) and \(e^2\). Whether or not this is sensible of course depends on your particular setting. One common approach to remove this level of prescription from the model is to take each prior to instead be \(\beta_j \sim N(0,\sigma^2)\) and impose a hyperprior on \(\sigma^2\). Here we avoid doing this for simplicity.

We will simulate some data directly from this model and then fit the model using MCMC. This is a handy tool since we will exactly know the true values of each \(\beta_j\) that we are trying to fit, so can see how well the MCMC is recovering these parameters.

# simulate 200 data points from the Poisson model.
# We will include three \beta_j parameters, an intercept and two covariates x2 and x3

set.seed(101)

# fix number of data points, number of parameters and parameter values
n <- 200
beta <- c(0.5,-0.25,1)
d <- length(beta)

# generate covariates (arbitrary)
s <- 1
x2 <- rnorm(n, sd = s)
x3 <- rnorm(n, sd = s)

# generate means of poisson and then sample poisson data
means <- exp(beta[1] + beta[2]*x2 + beta[3]*x3)
y <- rpois(n, lambda = means)

# take a look at the data
barplot(y)

Having now simulated the data we will construct the posterior and use our random walk sampler to compute posterior expectations. We will need to modify our random walk metropolis function to work with more than one dimension.

The (unnormalised) posterior distribution in this case can be written

\[ \begin{aligned} \pi(\beta|y) &\propto \prod_i \frac{(e^{\sum_j \beta_j x_{ij}})^{y_i} e^{-e^{\sum_j \beta_j x_{ij}}}}{y_i!} \prod_j \frac{1}{\sqrt{2\pi}}e^{ -\beta_j^2/2 } \\ & \propto \exp \left( \sum_{i,j} \beta_j x_{ij}y_i - \sum_i e^{\sum_{j} \beta_j x_{ij} } - \frac{1}{2} \sum_j \beta_j^2 \right). \end{aligned} \] A function to evaluate the log-posterior is given below.

# posterior
logpi_poisson <- function(beta) { 
  sum(dnorm(beta, log = T)) + sum(dpois(y, lambda = exp(beta[1] + beta[2]*x2 + beta[3]*x3), log = T))
}

To fit the model we use a multi-dimensional version of the same random walk Metropolis algorithm that has already been introduced above. We will look more carefully at the details of this modification in the computer workshops. We run the algorithm for 10,000 iterations, and choose a step-size \(h\) so that around 25% of samples are accepted, in line with the guidelines given above.
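The exact multi-dimensional sampler used here is not reproduced in these notes; the sketch below indicates one way such a function might look, re-using the structure of RWM_1d. The function name RWM_multi, the starting point and the step-size are illustrative assumptions (in practice \(h\) is tuned by trial and error until roughly 25% of proposals are accepted).

# a sketch of a multi-dimensional random walk Metropolis (illustrative only)
RWM_multi <- function(logpi, nits, h, x_curr) {
  d <- length(x_curr)
  logpi_curr <- logpi(x_curr)
  accepted <- 0
  x_store <- matrix(NA, nrow = nits, ncol = d)
  for (i in 1:nits) {
    # propose a Gaussian perturbation of all d coordinates at once
    x_prop <- x_curr + h * rnorm(d)
    logpi_prop <- logpi(x_prop)
    if (log(runif(1)) < logpi_prop - logpi_curr) {
      x_curr <- x_prop
      logpi_curr <- logpi_prop
      accepted <- accepted + 1
    }
    x_store[i, ] <- x_curr
  }
  list(x_store = x_store, a_rate = accepted / nits)
}

# e.g. mc2 <- RWM_multi(logpi = logpi_poisson, nits = 10000, h = 0.15, x_curr = rep(0, 3))
# (the step-size here is a placeholder; tune it until the acceptance rate is near 25%)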

## The acceptance rate is:  0.2546

As can be seen, the three components of the chain seem to converge relatively quickly to what look like sensible regions of the parameter space, given the true values for the \(\beta\) parameter that were used to generate the data. We can inspect the output to see this in more detail.

hist(mc2$x_store[,1], main = "", col = "black", xlab = "", 
     xlim = c(-0.5, 1.5), ylim = c(0, 5000))
hist(mc2$x_store[,2], main = "Beta 2", col = "red", xlab = "", add = T)
hist(mc2$x_store[,3], main = "Beta 3", col = "blue", xlab = "", add = T)

colMeans(mc2$x_store)
## [1]  0.3944325 -0.2350571  1.1027465

The histograms show the shape of each marginal posterior distribution. We can use the samples to estimate many posterior expectations. The posterior mean estimates are broadly in line with the true values. It is not that surprising that 2 out of 3 are shrunken slightly towards zero due to the prior regularisation.

Of course, this is not a challenging example: the dimension of the state space is only 3 and the posterior distribution is fairly regular. But it illustrates how to use MCMC on a practical example. We will explore more ways in which the output can be used to understand the quality of estimators (and also look at some more challenging examples) in the next chapter.

7 MCMC in practice

7.1 Improving the algorithm

There are numerous ways in which the basic random walk Metropolis can be improved upon, in terms of generating a Markov chain that produces better quality approximations to integrals of interest in some sense (we will make this precise in the second part of this chapter). We will discuss three of these here:

  • Pre-conditioning
  • Component-wise updates
  • Non-normal proposals and re-parametrisation

The first of these is designed to deal with target distributions that exhibit heterogeneity, correlation, or both. The second is another way to deal with heterogeneity, and can sometimes help simplify the algorithm. The third can help deal with both of the above and also problematic tail behaviour.

7.1.1 Pre-conditioning

Consider the example from the previous workshop in which the target distribution was a \(d\)-dimensional Gaussian with non-identity covariance, \[ \pi(x) = (2\pi)^{-d/2}|\Sigma|^{-1/2} \exp\left( -\frac{1}{2}x^T\Sigma^{-1} x \right), \] where \(\Sigma \in \mathbb{R}^{d \times d}\) is the covariance matrix. If \(\Sigma\) is very far from the identity, then using a random walk proposal with covariance \(h^2 I_{d \times d}\) can be inefficient. An example in the case \(d = 2\) and \[ \Sigma = \left( \begin{array}{cc} 20 & 0 \\ 0 & 1 \end{array} \right), \] is shown below.

We can see that while the second coordinate is well explored, the first coordinate, which has the larger marginal variance, is not. The issue is that the proposal scales don’t match those of the target. We can rectify this using what is called pre-conditioning. In our context this means choosing the proposal covariance to be \(h^2\hat{\Sigma}\), rather than \(h^2 I_{d \times d}\) as in the standard random walk algorithm, where \(\hat{\Sigma}\) is some suitable estimate of the covariance of \(\pi(\cdot)\). Note that any estimate is acceptable here, as it does not affect the validity of the method, only how quickly the chain converges and the quality of integral approximations. In other words, you don’t ever have to pre-condition your algorithm, but it might work a lot faster if you do, and how much faster depends on how good an estimate \(\hat{\Sigma}\) is of \(\Sigma\).

Pre-conditioning can also help when \(\pi(\cdot)\) exhibits correlation, as mimicking the correlation structure within the proposal has the same outcome: typically the chain will converge more rapidly and integral estimates are of better quality. This intuition can be made rigorous in various ways, but we will not consider it here.
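As a small illustration, a pre-conditioned random walk proposal can be generated using a Cholesky factor of the estimated covariance. The names and the particular \(\hat{\Sigma}\) below are illustrative assumptions; since the proposal remains symmetric, the acceptance rate \(\min(1,\pi(y)/\pi(x))\) is unchanged.

# sketch: proposing with covariance h^2 * Sigma_hat instead of h^2 * I
Sigma_hat <- matrix(c(20, 0,
                       0, 1), 2, 2)       # e.g. an estimate from a pilot run
L <- t(chol(Sigma_hat))                   # lower-triangular factor, L %*% t(L) = Sigma_hat
h <- 1
x_curr <- c(0, 0)
x_prop <- x_curr + h * as.vector(L %*% rnorm(2))   # proposal with covariance h^2 * Sigma_hat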

7.1.2 Component-wise updates

Another popular and simple way to manage a scenario in which different proposal scales are appropriate in different directions is through component-wise updates. The idea is essentially to update a single coordinate (or group of coordinates) of the state vector \(x\) at each iteration, rather than the entire \(d\)-dimensional object. Doing this means that different proposals and different step-sizes can be chosen for each (group of) coordinate(s). This can be beneficial.

We explain the component-wise update for a 3-dimensional example, with target density \(\pi(x(1),x(2),x(3))\). Let \(X_i\) be the current state of the Markov chain. The component-wise random walk Metropolis is the following.

  • Set \(Y(1) = X_i(1) + h_1\xi_i\) with \(\xi_i \sim N(0,1)\)
  • Set \(Y := (Y(1),X_i(2),X_i(3))^T\) and \[ X_{i+1} = \begin{cases} Y \qquad \text{w.p. } \alpha(X_i,Y) \\ X_i \qquad \text{otherwise.} \end{cases} \]
  • Set \(Y(2) = X_{i+1}(2) + h_2\xi_{i+1}\) with \(\xi_{i+1} \sim N(0,1)\)
  • Set \(Y := (X_{i+1}(1),Y(2),X_{i+1}(3))^T\) and \[ X_{i+2} = \begin{cases} Y \qquad \text{w.p. } \alpha(X_{i+1},Y) \\ X_{i+1} \qquad \text{otherwise.} \end{cases} \]
  • Set \(Y(3) = X_{i+2}(3) + h_3\xi_{i+2}\) with \(\xi_{i+2} \sim N(0,1)\)
  • Set \(Y := (X_{i+2}(1),X_{i+2}(2),Y(3))^T\) and \[ X_{i+3} = \begin{cases} Y \qquad \text{w.p. } \alpha(X_{i+2},Y) \\ X_{i+2} \qquad \text{otherwise.} \end{cases} \]

In the \(d\)-dimensional case of course one would cycle through \(d\) components, rather than always 3. The version of component-wise sampling described above is called a deterministic scan sampler. There is also a random scan version, in which a component is chosen randomly at each iteration, rather than each component from \(\{1,...,d\}\) being cycled through in a deterministic fashion.

Note two key points from the above:

  • A separate step size is used for the 1st, 2nd and 3rd component updates
  • The iteration counter moves forward after every component-wise update

The second bullet is important for maintaining \(\pi\)-invariance of the chain. The first bullet is what can allow us to handle heterogeneity, since larger step-sizes can be chosen for components with larger marginal variances. A simple example of using a component-wise Metropolis sampler on the same heterogeneous two-dimensional example is given below.
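The exact code and figure for this example are not reproduced here; the following is an illustrative sketch of a deterministic-scan component-wise random walk Metropolis for the two-dimensional Gaussian target above, with the step-sizes chosen (arbitrarily) to roughly match the two marginal scales.

# sketch: component-wise (deterministic scan) random walk Metropolis
set.seed(42)
Sigma <- matrix(c(20, 0, 0, 1), 2, 2)
logpi_2d <- function(x) -0.5 * sum(x^2 / diag(Sigma))   # log density up to a constant

cw_RWM <- function(logpi, nits, h, x_curr) {
  d <- length(x_curr)
  x_store <- matrix(NA, nrow = nits, ncol = d)
  logpi_curr <- logpi(x_curr)
  for (i in 1:nits) {
    j <- ((i - 1) %% d) + 1                       # cycle through coordinates 1, 2, 1, 2, ...
    x_prop <- x_curr
    x_prop[j] <- x_curr[j] + h[j] * rnorm(1)      # perturb only coordinate j
    logpi_prop <- logpi(x_prop)
    if (log(runif(1)) < logpi_prop - logpi_curr) {
      x_curr <- x_prop
      logpi_curr <- logpi_prop
    }
    x_store[i, ] <- x_curr
  }
  x_store
}

# larger step-size for the coordinate with the larger marginal variance
out <- cw_RWM(logpi_2d, nits = 2000, h = c(10, 2.4), x_curr = c(0, 0))
plot(out[, 1], out[, 2], type = "l", xlab = "x(1)", ylab = "x(2)")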

Notice the zig-zag pattern made by the sampler, since only one coordinate is updated at each iteration while the other remains fixed. The component-wise scheme is also sometimes called Metropolis-within-Gibbs, because of its similarity with another MCMC scheme known as the Gibbs sampler. We will not cover this here, but it is covered in STAT0031 for example.

7.1.3 Re-parametrisations and heavy-tailed proposals

Consider a target distribution of the form \[ \pi(x) \propto \frac{1}{x^2}\mathbb{I}(x > 1). \] Of course the setting is simple and this kind of distribution would not be a problem to sample from in practice, but it serves as a sensible model problem for our purposes here. The distribution is called heavy-tailed. There are many different ways to define a distribution in this manner. We will consider a continuous distribution to be heavy-tailed if there is no \(\delta >0\) for which its density \(\pi(x)\) satisfies \[ \int e^{\delta\|x\|}\pi(x)dx < \infty, \] where \(\|x\| := \sqrt{\sum_i x_i^2}\) is the usual Euclidean norm. Intuitively, for this integral to be finite the density must decay to zero as \(\|x\| \to \infty\) at a faster rate than \(e^{-\delta\|x\|}\). If the density does not decay so quickly (e.g. it decays like \(\|x\|^{-p}\) for some \(p>1\)) then it will be heavy-tailed. The more colloquial way to think about the problem is that heavy-tailed distributions are those for which the density decays at a slower than exponential rate.

Exercise. Check that the above form of \(\pi(x)\) does indeed constitute a heavy-tailed distribution.

As we have already seen in the workshops, heavy-tailed distributions are a challenge for the random walk Metropolis. We can see why this is by examining the acceptance rate as a function of the current position \(x\). Writing \(\xi \sim N(0,1)\) then this is \[ \min \left( 1, \frac{\pi(x+h\xi)}{\pi(x)} \right). \] In the model scenario, the ratio within this minimum is essentially \[ \frac{x^2}{(h\xi+x)^2}. \] Since \(\xi \sim N(0,1)\), roughly 95% of proposals will satisfy \(h|\xi| \leq 2h\). So for typical proposals as \(|x| \to\infty\) this ratio will be approximately 1, whether \(\xi\) is positive or negative. We would like it to be the case that the acceptance rate steers the Markov chain towards high-probability regions of the state space, meaning when \(x \gg 0\) we typically accept when \(\xi < 0\) and typically reject when \(\xi > 0\); in the heavy-tailed setting this is not the case. This is what causes the meandering behaviour of the random walk Metropolis in this setting that was seen in the previous workshop.

To rectify the problem one simple solution is to re-parametrise the model. Instead of defining \(\pi(\cdot)\) in terms of \(x\), we can instead work with \(y:= \log(x)\). Then using the change of variables formula we have \[ \pi'(y) = \pi(x)\left|\frac{dx}{dy}\right|. \] Since \(y = \log(x)\) then \(x = e^y\) and \(dx/dy = e^y\) too, so \[ \pi'(y) \propto \frac{1}{(e^y)^2}\mathbb{I}(y > 0) \left| e^y \right| = e^{-y} \mathbb{I}(y > 0). \] And now, as if by magic, the new variable \(y\) no longer has a heavy-tailed density!
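For illustration, the re-parametrised target can be sampled by re-using the RWM_1d function from earlier on \(y\) and then transforming the samples back to \(x = e^y\). The step-size and starting point below are arbitrary illustrative choices.

# sketch: random walk Metropolis on y = log(x), then transform back
logpi_y <- function(y) { if (y > 0) -y else -Inf }   # log of exp(-y) on y > 0
mc_y <- RWM_1d(logpi = logpi_y, nits = 2000, h = 2, x_curr = 1)
x_samples <- exp(mc_y$x_store)                        # samples of the original x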

Re-parametrisations such as these are not always easy to find but can often be helpful once a little is known about the problem at hand. As well as controlling heavy tails they can also be useful for removing correlation between different components of the target distribution, or more generally removing dependencies.

In the heavy-tailed setting another solution is to choose heavier tails for the \(\xi\) variable, meaning that rather than setting \(\xi \sim N(0,1)\) we set \(\xi \sim t_5\) or similar. This has the effect that typical proposals are larger in size and that surprisingly large proposals occur more frequently, which can be advantageous. The intuition is the same as in the case of choosing the proposal covariance to match that of the target.

7.2 Assessing the output

Now that we have developed a basic understanding of how to implement Markov chain Monte Carlo, the fundamental question of interest is:

How do I know if my MCMC algorithm works?

The question has both a theoretical and practical answer, but we will focus mainly on the practical side here. The Metropolis-Hastings algorithm allows us to approximate the integral \(\mathbb{E}_\pi[f]\) by constructing estimators of the form \[ \tilde{f}_n := \frac{1}{n}\sum_{i=1}^n f(X_i), \] where \(\{X_1,...,X_n\}\) is a Markov chain with limiting distribution \(\pi(\cdot)\). Judging whether or not the algorithm ‘works’ can be interpreted as assessing the quality of the estimator \(\tilde{f}_n\). There are many loss functions with which to assess the quality of an estimator, but we will restrict ourselves to the simple mean squared error criterion here. Writing \(f_0 := \mathbb{E}_\pi[f]\) so as not to be inundated with expectations, and dropping the \(n\) subscript, we have \[ L(\tilde{f},f_0) := \mathbb{E} \left[ (\tilde{f} - f_0)^2 \right]. \] This can be broken down using the more familiar quantities of bias and variance of an estimator, since \[ \begin{aligned} \mathbb{E} \left[ (\tilde{f} - f_0)^2 \right] &= \mathbb{E} \left[ (\tilde{f} - \mathbb{E}[\tilde{f}] + \mathbb{E}[\tilde{f}] - f_0)^2 \right] \\ &= \mathbb{E}[ (\tilde{f}-\mathbb{E}[\tilde{f}])^2 ] + 2\mathbb{E}[(\tilde{f}-\mathbb{E}[\tilde{f}])(\mathbb{E}[\tilde{f}] - f_0)] + (\mathbb{E}[\tilde{f}]-f_0)^2. \end{aligned} \] The middle expectation is \[ \mathbb{E}[(\tilde{f}-\mathbb{E}[\tilde{f}])(\mathbb{E}[\tilde{f}] - f_0)] = \mathbb{E}[\tilde{f}]^2 - \mathbb{E}[\tilde{f}]f_0-\mathbb{E}[\tilde{f}]^2+f_0\mathbb{E}[\tilde{f}] = 0. \] Recalling that \(\text{Bias}(\tilde{f}) := \mathbb{E}[\tilde{f}] - f_0\), the mean squared error can therefore be written \[ L(\tilde{f},f_0) = \text{Bias}(\tilde{f})^2 + \text{Var}[\tilde{f}]. \] Understanding the quality of an MCMC algorithm can therefore be done by assessing the bias and variance of estimators for each quantity of interest.

7.2.1 Assessing variance

The variance of \(\tilde{f}_n\) is \[ \begin{aligned} \text{Var}(\tilde{f}_n) &= \frac{1}{n^2}\text{Var}\left( \sum_{i=1}^n f(X_i) \right), \\ &= \frac{1}{n^2} \text{Cov}\left(\sum_{i=1}^n f(X_i), \sum_{i=1}^n f(X_i) \right), \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n\text{Var}(f(X_i)) + 2 \sum_{i<j} \text{Cov}\left( f(X_i), f(X_j) \right) \right). \end{aligned} \] To properly understand the above derivation it may help to recall that the covariance function is both symmetric and distributive in its arguments, meaning \(\text{Cov}(X,Y) = \text{Cov}(Y,X)\) and \(\text{Cov}(X,Y+Z) = \text{Cov}(X,Y) + \text{Cov}(X,Z)\). If we consider the setting in which equilibrium has been reached, meaning \(X_1 \sim \pi(\cdot)\), then by \(\pi\)-invariance \(X_i \sim \pi(\cdot)\) for all \(i \geq 1\), and \(\text{Cov}(f(X_i),f(X_j))\) depends only on the lag \(j-i\). Counting the \(n-k+1\) pairs at lag \(k-1\), this means that \[ \text{Var}\left( \tilde{f}_n \right) = \frac{1}{n} \left( \text{Var}(f(X_1)) + 2 \sum_{k=2}^n \left( \frac{n-k+1}{n} \right) \text{Cov}\left( f(X_1), f(X_k) \right) \right). \] The main point here is to note that because of the correlation between different states in the chain, this variance could be much larger than under independent sampling.

We can now return to some of the earlier intuitive examples of MCMC not working well, and relate them more precisely to the variance of an MCMC estimator for some integral \(\int f(x)\pi(x)dx\) of interest.

  • When the step-size in a random walk Metropolis is too small, nearby points in the chain will be very similar, meaning \(\text{Cov}(f(X_1),f(X_k))\) will be large for many values of \(k\). This will have the effect of inflating the variance of the MCMC estimator.

  • When the step-size is too large and the first \(k\) proposals are all rejected then \(\text{Cov}(f(X_1),f(X_k)) = \text{Var}(f(X_1))\), which is the maximum possible value (using the Cauchy-Schwarz inequality). So again this will have the effect of producing estimators with a very large variance.

This is where the Goldilocks principle intuition comes from. If we choose a step-size that is not too big and not too small then we can minimise the sum on the right-hand side of the variance equation.

Remark. You may note at this point that covariance can be negative as well as positive, and therefore wonder whether or not it is possible to achieve a variance for an MCMC estimator that is smaller than that produced by iid sampling. The answer is that yes, such estimators are possible, and the corresponding Markov chains are known as antithetic. It is, however, typically the case that a user is interested in computing estimates for more than just a single integral \(\int f(x)\pi(x)dx\), but rather several different integrals using the same chain. In the Bayesian setting, for example, there could be interest in various posterior means, variances and quantiles. It is very difficult in general to design Markov chains that are antithetic for all integrals of interest, and so we tend to still refer to independent samples as a gold standard of sorts.

In practice we would of course like to estimate the variance of an estimator using the output of the chain. There are a few ways to do this. One simple approach is called the method of batch means, in which the basic idea is to divide the Markov chain into batches and then treat the batch means as approximately independent. If care is taken when choosing the batch size then this is a sensible approach. Another popular approach is based on spectral methods from time series analysis. We will not discuss the details of these here, but simply note that variances of MCMC estimators can be computed straightforwardly using R packages, and we will explore these further in the next workshop.
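To make the batch means idea concrete, a minimal sketch is given below, applied to the chain mc generated earlier. The function name and the number of batches are arbitrary illustrative choices.

# sketch: batch means estimate of Var(f_tilde_n) for f(x) = x
batch_means_var <- function(x, n_batches = 50) {
  n <- length(x)
  b <- floor(n / n_batches)                 # batch size
  x <- x[1:(b * n_batches)]                 # drop any leftover iterations
  bm <- colMeans(matrix(x, nrow = b))       # one mean per batch (columns are batches)
  # treating batch means as approximately iid: Var(overall mean) ~ var(bm) / n_batches
  var(bm) / n_batches
}

batch_means_var(mc$x_store)                 # estimated variance of the MCMC mean estimate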

Another way to discuss the quality of an estimator is to construct an effective sample size, as in the case of importance sampling. Note that in the case of the ordinary Monte Carlo estimator \(\hat{f}_n\), we have \(\text{Var}(\hat{f}_n) = \text{Var}_\pi(f(X))/n\), meaning that \[ \frac{\text{Var}(\tilde{f}_n)}{\text{Var}(\hat{f}_n)} = 1 + 2 \sum_{k=2}^n \left( \frac{n-k+1}{n} \right) \text{Corr}\left( f(X_1), f(X_k) \right). \] We can simplify this expression even further by taking the limit as \(n \to \infty\), noting that for any fixed \(k\) the ratio \((n-k+1)/n \to 1\). The resulting expression is \[ \lim_{n \to \infty} \frac{\text{Var}(\tilde{f}_n)}{\text{Var}(\hat{f}_n)} = 1 + 2 \sum_{k=2}^\infty \text{Corr}\left( f(X_1), f(X_k) \right). \] If we examine this expression we can see an intuitive appeal. It suggests that the more correlated samples are with each other, which can be related to the scenario in which the chain moves very slowly through the state space, the larger the variance of estimators will be. This adds some justification to the discussion in the previous chapter about desiring a chain that moves quickly around \(\mathcal{X}\).

To develop a notion of effective sample size, we solve the equation \(\text{Var}(\tilde{f}_n) = \text{Var}_\pi(f(X))/n_{eff}\), which leads to the following expression.

Definition. The effective sample size of a Markov chain Monte Carlo estimator \(\tilde{f}_n\) is defined to be \[ n_{eff} := \frac{n}{1+2\sum_{k=2}^\infty \text{Corr}(f(X_1),f(X_k))}. \]

Of course, the above may well seem unsatisfactory, for two reasons:

  • It involves an infinite sum
  • We don’t know what each correlation actually is

The standard approach for the second of these is to estimate the correlations from the chain. Of course this introduces some error, but hopefully not too much. For the first, there are different ways of dealing with the infinite sum. One approach is to simply truncate the sum as soon as two successive estimated terms are negative. Another is to fit an auto-regressive process model to the Markov chain and compute the infinite sum using spectral methods. We will not explore the details of these here, but they are implemented in many different software packages.

Generally when using Markov chain Monte Carlo it is standard to report effective sample sizes alongside estimates for each quantity of interest. But it is also common to plot the estimated correlations from the chain, as this gives a good visual indication of how quickly they relax to zero, which can be related to the quality of the estimator using the above expressions.
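As an illustration, the coda package (assuming it is installed) provides an estimate of the effective sample size, and the base R acf function plots the estimated autocorrelations; here applied to the chain mc from earlier, for the identity function \(f(x)=x\).

# sketch: effective sample size and autocorrelation plot for the earlier chain
library(coda)
effectiveSize(mcmc(mc$x_store))                        # estimated n_eff
acf(mc$x_store, main = "Estimated autocorrelations of the chain")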

We end this section with a note of caution: in the Bayesian inference setting it is important not to get confused between ‘the variance of the MCMC estimator’, the subject of this section, and ‘the posterior variance of the parameter’. The former describes how well an MCMC algorithm approximates a given posterior expectation. The latter describes our uncertainty about some parameter of interest given a particular model and dataset. Just because my MCMC algorithm performs well does not mean that the resulting probabilistic model is a good one!

7.2.2 Assessing bias

The ‘bias’ of an MCMC estimator originates from the fact that the Markov chain has not reached equilibrium, meaning that \(P^n(x,\cdot)\), the distribution of \(X_n\), is not exactly \(\pi(\cdot)\) in general. The theoretical approach to understanding this problem is to assess \[ \mathcal{D}_\mathcal{F} \left( P^n(x,\cdot), \pi(\cdot) \right) \] for a finite \(n\), or at least give some indication as to how this quantity will decay to zero as a function of \(n\). One common approach is to find a bound on the distance that decays geometrically with \(n\). If such a bound can be found we say the Markov chain is geometrically ergodic in the relevant distance metric. Establishing geometric ergodicity, or more generally the study of what is called mixing for Markov chains (meaning quantifying how far a Markov chain is from its limiting distribution), is a popular area of current research, relying on many interesting mathematical tools from both probability theory and mathematical analysis. The following proposition gives some intuition of why a geometric bound is beneficial for the bias problem.

Proposition. If \(f\in \mathcal{F}\) and \(\mathcal{D}_\mathcal{F}(P^n(x,\cdot),\pi(\cdot)) \leq C\rho^n\) for some \(C<\infty\) and \(\rho < 1\), then \[ |\text{Bias}(\tilde{f}_n)| \leq \frac{C}{n(1-\rho)}. \] Proof. Direct calculation gives \[ \begin{aligned} |\mathbb{E}[\tilde{f}_n] - \mathbb{E}_\pi[f(X)]| &\leq n^{-1}\sum_{i=1}^n | \mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f(X)] | \\ &\leq Cn^{-1} \sum_{i=1}^n \rho^i \\ &\leq \frac{C}{n(1-\rho)}. \end{aligned} \]

The above suggests that \(\text{Bias}(\tilde{f}_n)^2\) is some constant multiplied by \(1/n^2\). Typically the variance, as seen above, will decay at a rate of \(1/n\) (a slower rate). This means that provided such a bound can be found, it is likely that the squared bias term is small in comparison to the variance term in the mean squared error expression, for sufficiently large \(n\).10

There are also much more practical approaches to the same problem. The simplest of these is simply to look at the trace plot of a chain. Consider the below example.

The limiting distribution for the chain is \(N(0,1)\), but the starting state is \(X_0 = 100\). This is an extremely unlikely sample from a standard Normal distribution. So it is not surprising that the chain quickly moves into a more probable region. From studying the above, it seems like the chain begins to stabilise, and hence could be considered to have mixed, after around 120 iterations. Among practitioners, the section of the chain up until this point is called burn in (or less frequently warm up), and is typically discarded. The purpose of discarding this initial collection of samples is to reduce the bias of the resulting estimator. One can equivalently adapt our MCMC estimator to be \[ \tilde{f}_n = \frac{1}{n} \sum_{t = m+1}^{n+m} f(X_t), \] where the first \(m\) samples are discarded as burn in.

Of course, in the general setting we don’t know exactly what the probable region of the target distribution is, and particularly in a high-dimensional setting we cannot assess the trace plot for every single parameter of interest. But there are several pragmatic things we can do to assess convergence in general, such as:

  • Have some kind of blanket rule, such as ‘discard the first 50% of samples and use the remaining samples’
  • Examine trace plots of some suitable summaries, such as the logarithm of the target distribution \(\log\pi(X_t)\)
  • Plot the running averages for each expectation of interest, and check that each of these has stabilised

There are many other such ‘rules of thumb’ that are common among practitioners, and the reality is that although it is impossible to be sure of things without supporting theory, in many cases a combination of these is a good practical way to assess convergence to equilibrium.

The final method we will consider in this section is a slightly more sophisticated approach to controlling bias, known as the Gelman-Rubin diagnostic, pioneered by Andrew Gelman and Donald Rubin in 1992. The basic idea is to initialise several different Markov chains targeting the same distribution from different starting points, and monitor each to see whether or not they converge onto the same region of parameter space. Once this happens we can conclude that the chains are close to equilibrium, in the sense that you cannot tell them apart, and hence it would seem that the chains have forgotten their starting points.

The basic approach to this method is to run \(m\) Markov chains, each for \(n\) iterations, from different (and preferably very well-dispersed) starting points. We compute a summary of within-chain variation and a summary of between-chain variation, and compare these quantities. If the chains are not yet in the same region, then we would expect the between-chain variation to be larger than the variation within each chain. Once equilibrium is nearly reached, we would expect the two quantities to be comparable.

Mathematically we write \(f_{ij}:= f(X_{ij})\), where \(f\) is the function for which expectations are desired, and \(X_{ij}\) denotes the \(i\)th state of the \(j\)th Markov chain. Write \[ \bar{f}_{.j}:= \frac{1}{n} \sum_i f_{ij}, \qquad \bar{f} = \frac{1}{nm} \sum_{i,j} f_{ij}. \] Then the ‘between chains’ estimator is \[ B := \frac{n}{m-1} \sum_{j=1}^m (\bar{f}_{.j} - \bar{f})^2, \] and the ‘within chains’ estimator is \[ W:= \frac{1}{m} \sum_j W_j, \qquad W_j := \frac{1}{n-1} \sum_{i=1}^n (f_{ij} - \bar{f}_{.j})^2. \] The Gelman-Rubin diagnostic is known as \(\hat{R}\) and defined as \[ \hat{R} := \sqrt{ \frac{\frac{n-1}{n}W + \frac{1}{n}B}{W} }. \] Intuitively, if the chains are in different regions and hence have not mixed sufficiently, then \(B\) will typically be much larger than \(W\), meaning that \(\hat{R}\) will be noticeably larger than \(1\). Once the different chains look sufficiently similar then \(\hat{R} \approx 1\). This is by no means a foolproof diagnostic, but nonetheless it is widely used in practice. A general rule-of-thumb guideline is to ensure that \(\hat{R}<1.1\), although this is based largely on empirical evidence; there is little theory to support the practice.
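The formulas above translate directly into R. The sketch below is illustrative only; the function name and the example starting points are my own choices, and it assumes the chains are stored as the columns of a matrix, one column per chain.

# sketch: computing the Gelman-Rubin diagnostic from m chains of length n
rhat <- function(chains) {
  n <- nrow(chains); m <- ncol(chains)
  chain_means <- colMeans(chains)
  f_bar <- mean(chains)
  B <- n / (m - 1) * sum((chain_means - f_bar)^2)   # between-chain estimator
  W <- mean(apply(chains, 2, var))                  # within-chain estimator (1/(n-1) variances)
  sqrt(((n - 1) / n * W + B / n) / W)
}

# e.g. with four chains started from well-dispersed points:
# chains <- sapply(c(-50, -10, 10, 50),
#                  function(x0) RWM_1d(logpi_gauss, nits = 2000, h = 2.4, x_curr = x0)$x_store)
# rhat(chains)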

There are many variations on \(\hat{R}\) and alternative diagnostics that are in use, but we will not explore these further here. Needless to say that developing good MCMC diagnostic tools and supporting theory is very much an active area of research.

It is a good idea in a statistical analysis involving MCMC to report some kind of summary table, as is done for other models like linear regression. In the MCMC setting such a table may look like the one below.

Quantity of interest                 Estimate    MCMC S.E.    ESS     Rhat    n
\(\mathbb{E}_{\pi}[\theta]\)         20.25       0.02         4533    1.02    10000
\(\text{Var}_{\pi}[\theta]\)         1.52        0.04         3270    1.03    10000
\(\mathbb{P}_{\pi}[\theta < 0]\)     0.00        0.01         8638    1.01    10000

Here S.E. stands for standard error, ESS stands for effective sample size, and \(n\) is the number of iterations in the Markov chain. Of course there may be many other quantities of interest than those shown in the table in a particular application; this is just an example.


  1. Those who know some measure theory will recognise this as the Borel \(\sigma\)-algebra, but the course is designed to be approachable to students without such background.↩︎

  2. Again those familiar with measure theory will note that the integrals can be kept in the discrete case by simply interpreting the reference measure \(dx\) as the relevant counting measure rather than the Lebesgue measure on \(\mathbb{R}^d\).↩︎

  3. We focus in this course on applications of Monte Carlo in Statistics, but the principle is applied in many other areas, and was first introduced by statistical physicists working on the Manhattan project.↩︎

  4. There is also a weak law of large numbers, which states that for every \(\epsilon > 0\), \(\lim_{n\to\infty}\mathbb{P}(|\hat{f}_{n}-\mathbb{E}_{\pi}[f(X)]| > \epsilon)=0\) under the same conditions. Intuitively the results mean the same thing, but there are differences between the precise statements.↩︎

  5. Note that we have switched from \(f_{\theta}(\cdot)\) to \(f(\cdot|\theta)\). In the Bayesian paradigm we treat \(\theta\) as a random variable and so conditioning on this is natural. Bayesians still, however, tend to believe that there is an unknown fixed true parameter value \(\theta_{0}\).↩︎

  6. The details of the construction will not be discussed here, but interested readers can look up the Borel-Kolmogorov paradox and a regular conditional probability measure for further reading. Be warned that some measure theory is needed to understand these.↩︎

  7. This means that the set \(A\) of all starting states for which the limit isn’t \(\pi(\cdot)\) satisfies \(\pi(A)=0\). We will not dwell on these technical points during the course, but the interested reader can check the references for further information.↩︎

  8. Note, however, that there are many other application areas for MCMC, inside and outside of Statistics and Data Science. In molecular dynamics simulation (a branch of statistical physics), MCMC is also widely used. In this context the distribution of interest is the Boltzmann–Gibbs distribution for some system of interacting particles.↩︎

  9. There is theoretical justification for both of these statements, which is currently being honed as an active area of research.↩︎

  10. Technically, the above is actually called uniformly ergodic, which is a strong form of geometric ergodicity. In general \(C\) can be a function of the starting point \(x\) to have a geometrically ergodic chain. The above is the case in which \(C(x) \leq C\) for all \(x\).↩︎