Dr Samuel Livingstone
This course is motivated by two seemingly abstract problems:
The Sampling Problem. Given a finite measure \(\pi_u\) on \((E,\mathcal{E})\), generate a sample \[ X \sim \pi \] where \(\pi(\cdot) := \pi_u(\cdot)/\pi_u(E)\) is the normalised version of \(\pi_u\).
The Integration Problem. Given the same \(\pi_u\) and a function \(f \in L^1(\pi_u)\), compute \[ \mathbb{E}_\pi[f] = \int f(x)\pi(dx), \] with \(\pi\) as above, to a desired degree of accuracy.
These questions are ubiquitous in computational science (statistical physics, theoretical computer science, optimization, applied mathematics…)
A one-dimensional integral \[ \int_{a}^{b}f(x)dx \] is said to be intractable if there is no elementary function \(F\) such that \[ \int_{a}^{b}f(x)dx=F(b)-F(a) \] for all \(a,b\in\mathbb{R}\)
A function is called elementary if it can be written as the composition of a finite number of addition, subtraction, multiplication, division, exponential, logarithm, trigonometric, power or root operations
An intractable integral is well-defined, just not easy to compute. Since Riemann and Lebesgue integrals agree whenever both are defined, if one is intractable then so is the other!
(The extension to higher dimensions is straightforward)
A very simple example of an intractable integral is \[ \int_{a}^{b}e^{-\frac{1}{2}x^{2}}dx. \]
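Although the integrand has no elementary antiderivative, the integral is still perfectly computable numerically in one dimension, e.g. with R's built-in integrate function (a minimal illustration; the limits 0 and 1 below are arbitrary):

# Numerical quadrature of the intractable Gaussian integral on [0, 1]
integrand <- function(x) exp(-x^2 / 2)
integrate(integrand, lower = 0, upper = 1)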
Question: What is \(\int_{-\infty}^\infty e^{-x^2/2}dx\)?
In other cases we may also have a tractable but expensive integral or sum that we would rather avoid: we call this computationally intractable
E.g. imagine for large \(n\) computing \[ \sum_{i=1}^{n!} a_i \] for some sequence of real numbers \(a_1,a_2,...\).
If an integral can be written as an expectation with respect to some probability distribution, we can approximate it by sampling from the distribution and computing the empirical average.
Suppose we wish to calculate \[ \mathbb{E}_{\pi}[f]:=\int f(x)\pi(dx). \]
The Monte Carlo method: draw \(X_{1},...,X_{N}\) independently from \(\pi(\cdot)\), then estimate \(\mathbb{E}_{\pi}[f]\) by the empirical average \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\).
Note that this approach is identical in \(d\) dimensions. So, if we are lucky, the curse of dimensionality will be broken.
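As a concrete sketch (the choices \(\pi = \mathcal{N}(0,1)\) and \(f(x) = x^2\), with true value \(\mathbb{E}_\pi[f] = 1\), are purely illustrative):

# Ordinary Monte Carlo: estimate E_pi[f] by an empirical average of iid draws
N <- 10000
x <- rnorm(N)        # X_1, ..., X_N iid from pi = N(0, 1)
f_hat <- mean(x^2)   # hat{f}_N, which approximates E_pi[X^2] = 1
f_hat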
Monte Carlo works because of two fundamental results from probability theory
Theorem (Strong Law of Large Numbers). Suppose that \(X_{1},...,X_{N}\) are iid samples from \(\pi(\cdot)\), and set \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\). If \(\mathbb{E}_{\pi}|f(X)|<\infty\), then \[ \mathbb{P}\left(\lim_{N\to\infty}\hat{f}_{N}=\mathbb{E}_{\pi}[f]\right)=1. \]
Theorem (Central Limit Theorem). Suppose \(X_{1},...,X_{N}\) are iid from \(\pi(\cdot)\), and set \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\). If \(\mathbb{E}_{\pi}[f(X)^{2}]<\infty\) then as \(N\to\infty\) \[ \sqrt{N}\left(\hat{f}_{N}-\mathbb{E}_{\pi}[f]\right)\xrightarrow{d} \mathcal{N} \left(0,\text{Var}_{\pi}[f] \right) \]
The SLLN tells us that, given enough samples, Monte Carlo will eventually work in many problems.
The CLT gives information about rates of convergence, and the shape of fluctuations in \(\hat{f}_N\) as \(N\to\infty\).
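A quick empirical check of this rate (again taking \(\pi = \mathcal{N}(0,1)\) and \(f(x) = x^2\) purely for illustration): repeating the estimator many times, the spread of \(\hat{f}_N\) should roughly halve when \(N\) is quadrupled.

# Standard deviation of hat{f}_N across repeated experiments, for two sample sizes
sd_of_estimator <- function(N, reps = 1000) {
  sd(replicate(reps, mean(rnorm(N)^2)))
}
sd_of_estimator(100)   # roughly sqrt(Var_pi[f] / 100)
sd_of_estimator(400)   # roughly half the value above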
# Inverse transform sampling: if U ~ Unif(0, 1) then -log(1 - U) ~ Exp(1)
uniform_samples <- runif(1000)
transformed_samples <- -log(1 - uniform_samples)
hist(transformed_samples, xlab = "", ylab = "")
Roll a regular six-sided die and label the result \(X\)
If \(X \leq 3\), keep it
Otherwise throw it away
Questions.
\[ \begin{aligned} \pi &= (1/2,1/4,1/4,0,0,0) \\ q &= (1/6,1/6,1/6,1/6,1/6,1/6) \end{aligned} \]
Question: How would the problem change if \(q = (1/3,1/3,1/3)\) instead?
Set \(i \leftarrow 0\)
While \(i < N\)
  Draw \(Y \sim q(\cdot)\) and \(U \sim \text{Unif}(0,1)\)
  If \(U \leq \pi(Y)/(Mq(Y))\), set \(X_{i+1} \leftarrow Y\) and \(i \leftarrow i+1\)
EndWhile
Questions:
\(\pi = (1/2,1/4,1/4,0,0,0)\)
\(q = (1/6,1/6,1/6,1/6,1/6,1/6)\)
\(M = 3\).
This means the acceptance probabilities are \[ \frac{\pi(1)}{Mq(1)} = 1, ~ \frac{\pi(2)}{Mq(2)} = 1/2, ~ \frac{\pi(3)}{Mq(3)} = 1/2, \] with all others being \(0\).
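A minimal R sketch of this rejection sampler for the discrete example above (the sample size N is arbitrary):

# Rejection sampling: propose from q (a fair die), accept with probability pi(Y)/(M q(Y))
target <- c(1/2, 1/4, 1/4, 0, 0, 0)   # pi
M <- 3
N <- 10000
samples <- numeric(N)
i <- 0
while (i < N) {
  y <- sample(1:6, size = 1)                   # proposal Y ~ q = Unif{1, ..., 6}
  if (runif(1) <= target[y] / (M * (1/6))) {   # accept with probability pi(y)/(M q(y))
    i <- i + 1
    samples[i] <- y
  }
}
table(samples) / N   # empirical frequencies, close to (1/2, 1/4, 1/4)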
The basic identity on which the importance sampling method is based is \[ \mathbb{E}_\pi[f(X)]=\int f(x)\pi(dx) = \int f(x)\frac{\pi(dx)}{q(dx)}q(dx)=\mathbb{E}_q\left[f(X)\frac{d\pi}{dq}(X)\right]. \]
We typically write \[ w(X) := \frac{d\pi}{dq}(X), \] and call \(w\) the importance weight
Question. When \(\pi\) and \(q\) both have densities wrt Lebesgue measure what do the weights become?
The naive IS estimator \(\hat{f}_N^{IS-n} := N^{-1}\sum_{i=1}^N w(X_i)f(X_i)\), where \(X_1,...,X_N\) are iid samples from \(q(\cdot)\), is unbiased (proof straightforward, hence omitted).
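A minimal sketch of this estimator, assuming for illustration the target \(\pi = \mathcal{N}(0,1)\), proposal \(q = \mathcal{N}(0,2^2)\) and \(f(x) = x^2\), so the true value is \(1\):

# Naive importance sampling: draw from q, reweight by w(x) = pi(x)/q(x)
N <- 10000
x <- rnorm(N, mean = 0, sd = 2)                              # X_1, ..., X_N iid from q
w <- dnorm(x, mean = 0, sd = 1) / dnorm(x, mean = 0, sd = 2) # importance weights w(X_i)
f_hat_is <- mean(w * x^2)                                    # estimates E_pi[X^2] = 1
f_hat_is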
Recall that in ordinary Monte Carlo the estimator \(\hat{f}_N\) has variance \[ \begin{aligned} \text{Var}[\hat{f}_N] &= \frac{1}{N}\text{Var}_\pi[f(X)] \\ &= \frac{1}{N}\left( \mathbb{E}_\pi[f(X)^2] - \mathbb{E}_\pi[f(X)]^2 \right). \end{aligned} \]
In the naive IS case this variance is instead \[ \begin{aligned} \text{Var}[\hat{f}_N^{IS-n}] &= \frac{1}{N}\text{Var}_q [f(X)w(X)] \\ &= \frac{1}{N}\left( \mathbb{E}_q[w(X)^2f(X)^2] - \mathbb{E}_q[w(X)f(X)]^2 \right). \end{aligned} \]
When \(w\) is an unbounded function of \(x\), this can be infinite for many choices of \(f\)
Question. How should the tails of \(\pi\) and \(q\) behave for \(w(x)\) to be bounded/unbounded?
How many independent samples from \(\pi\) do we need to produce a Monte Carlo estimator with variance \(\text{Var}[\hat{f}^{IS}_N]\)?
Recall \(\text{Var}[\hat{f}_N] = \frac{1}{N}\text{Var}_\pi[f]\). We can set \[ \text{Var}[\hat{f}_N^{IS}] := \frac{1}{N_\text{eff}}\text{Var}_\pi[f] \] for some \(N_{\text{eff}} > 0\), which we call the effective sample size.
A simple approximation (derivation omitted) is given by \[ N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^N \tilde{w}_i^2}, \] where \(\tilde{w}_i := w(X_i)/\sum_{j=1}^N w(X_j)\) are the normalised importance weights.
This is independent of \(f\), and gives a useful heuristic for measuring the quality of an IS estimator.
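Continuing the importance sampling sketch above (the weights w are those computed there), the effective sample size can be estimated as follows:

# Effective sample size from the normalised importance weights
w_tilde <- w / sum(w)         # normalised weights, summing to 1
n_eff <- 1 / sum(w_tilde^2)
n_eff                         # close to N when the weights are nearly uniform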