Dr Samuel Livingstone
This course is motivated by two seemingly abstract problems:
The Sampling Problem. Given a finite measure \(\pi_u\) on \((E,\mathcal{E})\), generate a sample \[ X \sim \pi \] where \(\pi(\cdot) := \pi_u(\cdot)/\pi_u(E)\) is the normalised version of \(\pi_u\).
The Integration Problem. Given the same \(\pi_u\) and a function \(f \in L^1(\pi_u)\), compute \[ \mathbb{E}_\pi[f] = \int f(x)\pi(dx), \] with \(\pi\) as above, to a desired degree of accuracy.
These questions are ubiquitous in computational science (statistical physics, theoretical computer science, optimization, applied mathematics…)
A one-dimensional integral \[ \int_{a}^{b}f(x)dx \] is said to be intractable if there is no elementary function \(F\) such that \[ \int_{a}^{b}f(x)dx=F(b)-F(a) \] for all \(a,b\in\mathbb{R}\)
A function is called elementary if it can be written as the composition of a finite number of addition, subtraction, multiplication, division, exponential, logarithm, trigonometric, power or root operations
An intractable integral is well-defined, just not easy to compute. Since Riemann and Lebesgue integrals agree whenever both are defined, if one is intractable then so is the other!
(The extension to higher dimensions is straightforward)
A very simple example of an intractable integral is \[ \int_{a}^{b}e^{-\frac{1}{2}x^{2}}dx. \]
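Although the integrand has no elementary antiderivative, the integral is still perfectly computable numerically in one dimension, e.g. with R's built-in integrate function (a minimal illustration; the limits 0 and 1 below are arbitrary):

# Numerical quadrature of the intractable Gaussian integral on [0, 1]
integrand <- function(x) exp(-x^2 / 2)
integrate(integrand, lower = 0, upper = 1)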
Question: What is \(\int_{-\infty}^\infty e^{-x^2/2}dx\)?
In other cases we may also have a tractable but expensive integral or sum that we would rather avoid: we call this computationally intractable
E.g. imagine for large \(n\) computing \[ \sum_{i=1}^{n!} a_i \] for some sequence of real numbers \(a_1,a_2,...\).
If an integral can be written as an expectation with respect to some probability distribution, we can approximate it by sampling from the distribution and computing the empirical average.
Suppose we wish to calculate \[ \mathbb{E}_{\pi}[f]:=\int f(x)\pi(dx). \]
The Monte Carlo method: draw \(X_{1},...,X_{N}\) independently from \(\pi(\cdot)\), then estimate \(\mathbb{E}_{\pi}[f]\) by the empirical average \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\).
Note that this approach is identical in \(d\) dimensions. So, if we are lucky, the curse of dimensionality will be broken.
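As a concrete sketch (the choices \(\pi = \mathcal{N}(0,1)\) and \(f(x) = x^2\), with true value \(\mathbb{E}_\pi[f] = 1\), are purely illustrative):

# Ordinary Monte Carlo: estimate E_pi[f] by an empirical average of iid draws
N <- 10000
x <- rnorm(N)        # X_1, ..., X_N iid from pi = N(0, 1)
f_hat <- mean(x^2)   # hat{f}_N, which approximates E_pi[X^2] = 1
f_hat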
Monte Carlo works because of two fundamental results from probability theory
Theorem (Strong Law of Large Numbers). Suppose that \(X_{1},...,X_{N}\) are iid samples from \(\pi(\cdot)\), and set \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\). If \(\mathbb{E}_{\pi}|f(X)|<\infty\), then \[ \mathbb{P}\left(\lim_{N\to\infty}\hat{f}_{N}=\mathbb{E}_{\pi}[f]\right)=1. \]
Theorem (Central Limit Theorem). Suppose \(X_{1},...,X_{N}\) are iid from \(\pi(\cdot)\), and set \(\hat{f}_{N}:=N^{-1}\sum_{i=1}^{N}f(X_{i})\). If \(\mathbb{E}_{\pi}[f(X)^{2}]<\infty\) then as \(N\to\infty\) \[ \sqrt{N}\left(\hat{f}_{N}-\mathbb{E}_{\pi}[f]\right)\xrightarrow{d} \mathcal{N} \left(0,\text{Var}_{\pi}[f] \right) \]
The SLLN tells us that, given enough samples, Monte Carlo will eventually work in many problems.
The CLT gives information about rates of convergence, and the shape of fluctuations in \(\hat{f}_N\) as \(N\to\infty\).
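A quick empirical check of this rate (again taking \(\pi = \mathcal{N}(0,1)\) and \(f(x) = x^2\) purely for illustration): repeating the estimator many times, the spread of \(\hat{f}_N\) should roughly halve when \(N\) is quadrupled.

# Standard deviation of hat{f}_N across repeated experiments, for two sample sizes
sd_of_estimator <- function(N, reps = 1000) {
  sd(replicate(reps, mean(rnorm(N)^2)))
}
sd_of_estimator(100)   # roughly sqrt(Var_pi[f] / 100)
sd_of_estimator(400)   # roughly half the value above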
# Inverse transform sampling: if U ~ Unif(0, 1) then -log(1 - U) ~ Exp(1)
uniform_samples <- runif(1000)
transformed_samples <- -log(1 - uniform_samples)
hist(transformed_samples, xlab = "", ylab = "")
Roll a regular six-sided die and label the result \(X\)
If \(X \leq 3\), keep it
Otherwise throw it away
Questions.
\[ \begin{aligned} \pi &= (1/2,1/4,1/4,0,0,0) \\ q &= (1/6,1/6,1/6,1/6,1/6,1/6) \end{aligned} \]
Question: How would the problem change if \(q = (1/3,1/3,1/3)\) instead?
Set \(i \leftarrow 0\)
While \(i < N\)
  Draw \(Y \sim q(\cdot)\) and \(U \sim \text{Unif}(0,1)\)
  If \(U \leq \pi(Y)/(Mq(Y))\), set \(X_{i+1} \leftarrow Y\) and \(i \leftarrow i+1\)
EndWhile
Questions:
\(\pi = (1/2,1/4,1/4,0,0,0)\)
\(q = (1/6,1/6,1/6,1/6,1/6,1/6)\)
\(M = 3\).
This means the acceptance probabilities are \[ \frac{\pi(1)}{Mq(1)} = 1, ~ \frac{\pi(2)}{Mq(2)} = 1/2, ~ \frac{\pi(3)}{Mq(3)} = 1/2, \] with all others being \(0\).
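A minimal R sketch of this rejection sampler for the discrete example above (the sample size N is arbitrary):

# Rejection sampling: propose from q (a fair die), accept with probability pi(Y)/(M q(Y))
target <- c(1/2, 1/4, 1/4, 0, 0, 0)   # pi
M <- 3
N <- 10000
samples <- numeric(N)
i <- 0
while (i < N) {
  y <- sample(1:6, size = 1)                   # proposal Y ~ q = Unif{1, ..., 6}
  if (runif(1) <= target[y] / (M * (1/6))) {   # accept with probability pi(y)/(M q(y))
    i <- i + 1
    samples[i] <- y
  }
}
table(samples) / N   # empirical frequencies, close to (1/2, 1/4, 1/4)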
The basic identity on which the importance sampling method is based is \[ \mathbb{E}_\pi[f(X)]=\int f(x)\pi(dx) = \int f(x)\frac{\pi(dx)}{q(dx)}q(dx)=\mathbb{E}_q\left[f(X)\frac{d\pi}{dq}(X)\right]. \]
We typically write \[ w(X) := \frac{d\pi}{dq}(X), \] and call \(w\) the importance weight
Question. When \(\pi\) and \(q\) both have densities wrt Lebesgue measure what do the weights become?
The naive IS estimator \(\hat{f}_N^{IS-n} := N^{-1}\sum_{i=1}^N w(X_i)f(X_i)\), where \(X_1,...,X_N\) are iid samples from \(q(\cdot)\), is unbiased (proof straightforward, hence omitted).
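A minimal sketch of this estimator, assuming for illustration the target \(\pi = \mathcal{N}(0,1)\), proposal \(q = \mathcal{N}(0,2^2)\) and \(f(x) = x^2\), so the true value is \(1\):

# Naive importance sampling: draw from q, reweight by w(x) = pi(x)/q(x)
N <- 10000
x <- rnorm(N, mean = 0, sd = 2)                              # X_1, ..., X_N iid from q
w <- dnorm(x, mean = 0, sd = 1) / dnorm(x, mean = 0, sd = 2) # importance weights w(X_i)
f_hat_is <- mean(w * x^2)                                    # estimates E_pi[X^2] = 1
f_hat_is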
Recall that in ordinary Monte Carlo the estimator \(\hat{f}_N\) has variance \[ \begin{aligned} \text{Var}[\hat{f}_N] &= \frac{1}{N}\text{Var}_\pi[f(X)] \\ &= \frac{1}{N}\left( \mathbb{E}_\pi[f(X)^2] - \mathbb{E}_\pi[f(X)]^2 \right). \end{aligned} \]
In the naive IS case this variance is instead \[ \begin{aligned} \text{Var}[\hat{f}_N^{IS-n}] &= \frac{1}{N}\text{Var}_q [f(X)w(X)] \\ &= \frac{1}{N}\left( \mathbb{E}_q[w(X)^2f(X)^2] - \mathbb{E}_q[w(X)f(X)]^2 \right). \end{aligned} \]
When \(w\) is an unbounded function of \(x\), this can be infinite for many choices of \(f\)
Question. How should the tails of \(\pi\) and \(q\) behave for \(w(x)\) to be bounded/unbounded?
How many independent samples from \(\pi\) do we need to produce a Monte Carlo estimator with variance \(\text{Var}[\hat{f}^{IS}_N]\)?
Recall \(\text{Var}[\hat{f}_N] = \frac{1}{N}\text{Var}_\pi[f]\). We can set \[ \text{Var}[\hat{f}_N^{IS}] := \frac{1}{N_\text{eff}}\text{Var}_\pi[f] \] for some \(N_{\text{eff}} > 0\), which we call the effective sample size.
A simple approximation (derivation omitted) is given by \[ N_{\text{eff}} \approx \frac{1}{\sum_{i=1}^N \tilde{w}_i^2}, \] where \(\tilde{w}_i := w(X_i)/\sum_{j=1}^N w(X_j)\) are the normalised importance weights.
This is independent of \(f\), and gives a useful heuristic for measuring the quality of an IS estimator.
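Continuing the importance sampling sketch above (the weights w are those computed there), the effective sample size can be estimated as follows:

# Effective sample size from the normalised importance weights
w_tilde <- w / sum(w)         # normalised weights, summing to 1
n_eff <- 1 / sum(w_tilde^2)
n_eff                         # close to N when the weights are nearly uniform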