Dr Samuel Livingstone
Markov chain Monte Carlo \[ \{X_t\}_{t\geq 0}, \qquad \mathbb{E}_\pi[f] \approx \frac{1}{n}\sum_{i=1}^n f(X_i) \]
General Metropolis–Hastings algorithm \[ Y \sim Q(X_i,\cdot), \qquad \alpha(X_i,Y) = \min\left(1, \frac{\pi(Y)q(Y,X_i)}{\pi(X_i)q(X_i,Y)} \right). \]
Random walk proposals \[ Y \sim N(X_i, h^2 I_{d \times d} ), \qquad \alpha(X_i,Y) = \min\left( 1, \frac{\pi(Y)}{\pi(X_i)} \right) \]
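As a concrete reference for what follows, here is a minimal R sketch of the random walk Metropolis algorithm above (the function rwm() and its arguments are illustrative names, not from the lecture):

```r
# Minimal random walk Metropolis sketch: Y ~ N(x, h^2 I), accept with prob min(1, pi(Y)/pi(x)).
# log_pi is the log target density; all names here are illustrative.
rwm <- function(log_pi, x0, h, N) {
  d <- length(x0)
  X <- matrix(NA_real_, nrow = N, ncol = d)
  x <- x0
  lp_x <- log_pi(x)
  for (i in 1:N) {
    y <- x + h * rnorm(d)                 # proposal Y ~ N(x, h^2 I)
    lp_y <- log_pi(y)
    if (log(runif(1)) < lp_y - lp_x) {    # Metropolis accept/reject step
      x <- y
      lp_x <- lp_y
    }
    X[i, ] <- x
  }
  X
}

# Example: standard normal target in d = 2
chain <- rwm(function(x) -0.5 * sum(x^2), x0 = c(0, 0), h = 1, N = 5000)
```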
Recall: Interested in computing \(\mathbb{E}_\pi[f]\)
Approach: construct a Markov chain and compute ergodic averages
Statistically, we are constructing a point estimate for \(\mathbb{E}_\pi[f]\) of the form \[ \frac{1}{N}\sum_{i=1}^N f(x_i), \quad \{x_1,...,x_N\} \text{ is algorithm output} \]
We can therefore study the properties of the estimator \[ \tilde{f}_N := \frac{1}{N}\sum_{i=1}^N f(X_i), \quad X_{i+1} \sim P(X_i,\cdot) \]
Writing \(f_0 := \mathbb{E}_\pi[f]\), we can look at the mean squared error \[ L(\tilde{f}_N,f_0) := \mathbb{E} \left[ (\tilde{f}_N - f_0)^2 \right]. \]
Question. Why have I used the letter \(L\) above? What is the name for this kind of function?
This mean squared error can be broken down using the more familiar quantities of bias and variance, since \[ L(\tilde{f}_N,f_0) = \text{Bias}(\tilde{f}_N)^2 + \text{Var}[\tilde{f}_N]. \]
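A quick empirical check of this decomposition can be done with repeated independent runs. The sketch below is illustrative (it reuses the rwm() sketch above): the target is a one-dimensional standard normal and \(f(x) = x^2\), so \(\mathbb{E}_\pi[f] = 1\).

```r
# Empirical check of the bias/variance decomposition (illustrative).
# mse should be approximately bias_sq + variance.
set.seed(1)
f0  <- 1
est <- replicate(200, {
  chain <- rwm(function(x) -0.5 * sum(x^2), x0 = 5, h = 2.4, N = 2000)
  mean(chain[, 1]^2)
})
c(mse      = mean((est - f0)^2),
  bias_sq  = (mean(est) - f0)^2,
  variance = var(est))
```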
First note that \[ |\text{Bias}(\tilde{f}_N)| = \left|\frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f] \right| \]
By the triangle inequality we can bring the absolute value inside the sum: \[ \left| \frac{1}{N}\sum_{i=1}^N \mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f] \right| \leq \frac{1}{N}\sum_{i=1}^N |\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| \]
Then note that since \(0 \leq f(x) \leq 1\) \[ |\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| \leq \sup_{0 \leq f(x) \leq 1}|\mathbb{E}[f(X_i)] - \mathbb{E}_\pi[f]| = \|P^i(x,\cdot) - \pi\|_{TV} \]
Using \(\|P^i(x,\cdot) - \pi\|_{TV} \leq M(x)\rho^i\) gives \[ \text{Bias}(\tilde{f}_N) \leq \frac{1}{N} \sum_{i=1}^N M(x)\rho^i \leq \frac{M(x)}{N(1-\rho)}. \]
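For completeness, the final inequality uses the geometric series bound (valid for \(0 \leq \rho < 1\)): \[ \frac{1}{N}\sum_{i=1}^N M(x)\rho^i \;\leq\; \frac{M(x)}{N}\sum_{i=1}^\infty \rho^i \;=\; \frac{M(x)}{N}\cdot\frac{\rho}{1-\rho} \;\leq\; \frac{M(x)}{N(1-\rho)}. \]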
The above suggests that \[ \text{Bias}(\tilde{f}_N)^2 \propto \frac{1}{N^2}. \]
Typically the variance will decay at a slower rate of \(\frac{1}{N}\)
So if such a bound can be established, then we can focus on the variance term in the mean squared error expression, for sufficiently large \(N\)
Burn-in (or, less frequently, warm-up) just means discarding the initial portion of the chain
This will reduce the bias of the resulting estimator
We can accordingly adapt our MCMC estimator to be \[ \tilde{f}_N = \frac{1}{N} \sum_{t = m+1}^{N+m} f(X_t), \] where the first \(m\) samples are discarded as burn-in
We can often judge how much to discard through trace plots
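For example, a trace plot and burn-in choice might look like the following sketch (illustrative; reuses the rwm() sketch above, started deliberately far from the mode):

```r
# Trace plot and burn-in (illustrative).
chain <- rwm(function(x) -0.5 * sum(x^2), x0 = c(20, 20), h = 1, N = 5000)
plot(chain[, 1], type = "l", xlab = "iteration", ylab = "first coordinate")  # trace plot

m <- 500                              # burn-in chosen by eye from the trace plot
f_tilde <- mean(chain[-(1:m), 1]^2)   # ergodic average using the remaining samples
```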
Question. What is the problem with using trace plots to decide on burn in in high dimensions?
The variance of \(\tilde{f}_N\) is \[ \text{Var}(\tilde{f}_N) = \text{Var} \left( \frac{1}{N} \sum_{i=1}^N f(X_i) \right) = \frac{1}{N^2}\text{Var}\left( \sum_{i=1}^N f(X_i) \right) \]
Because \(\text{Var}(f) = \text{Cov}(f,f)\), we can write \[ \text{Var}(\tilde{f}_N) = \frac{1}{N^2} \text{Cov}\left(\sum_{i=1}^N f(X_i), \sum_{i=1}^N f(X_i) \right) \]
Recalling that \(\text{Cov}(f_1 + f_2,f_3) = \text{Cov}(f_1,f_3) + \text{Cov}(f_2,f_3)\) (and similarly for 2nd argument) we can write this as \[ \frac{1}{N^2} \left( \sum_{i=1}^N\text{Var}(f(X_i)) + 2 \sum_{i<j} \text{Cov}\left( f(X_i), f(X_j) \right) \right) \]
If \(X_1 \sim \pi\) then we can go even further since marginally \(X_i \sim \pi\) for all \(i\), meaning \[ \text{Var}\left( \tilde{f}_N \right) = \frac{1}{N} \left( \text{Var}(f(X_1)) + 2 \sum_{k=2}^N \left( \frac{N-k+1}{N} \right) \text{Cov}\left( f(X_1), f(X_k) \right) \right) \]
Question. How does this relate to the variance under iid sampling from \(\pi\)?
When \(h\) is too small, nearby points will be similar, meaning \[ \text{Cov}(f(X_1),f(X_k)) \gg 0 \] for many values of \(k\).
When \(h\) is too large and the first \(k\) proposals are all rejected, then \[ \text{Cov}(f(X_1),f(X_k)) = \text{Var}(f(X_1)), \] the maximum possible value
Question. What will happen to \(\text{Var}(\tilde{f}_N)\) in either of these two cases?
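The effect can be seen empirically by comparing autocorrelations across step sizes (illustrative sketch; reuses the rwm() sketch above with a one-dimensional standard normal target):

```r
# Lag-1 autocorrelation of the chain for small, moderate and large step sizes.
set.seed(2)
for (h in c(0.1, 2.4, 50)) {
  x <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = h, N = 5000)[, 1]
  cat("h =", h, " lag-1 autocorrelation:", round(acf(x, plot = FALSE)$acf[2], 3), "\n")
}
```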
Note that for the ordinary Monte Carlo estimator \(\hat{f}_N\), we have \[ \text{Var}(\hat{f}_N) = \frac{\text{Var}_\pi(f(X))}{N} \]
This means \[ \frac{\text{Var}(\tilde{f}_N)}{\text{Var}(\hat{f}_N)} = 1 + 2 \sum_{k=2}^N \left( \frac{N-k+1}{N} \right) \text{Corr}\left( f(X_1), f(X_k) \right). \]
We can simplify further by taking the limit as \(N \to \infty\), giving \[ \lim_{N \to \infty} \frac{\text{Var}(\tilde{f}_N)}{\text{Var}(\hat{f}_N)} = 1 + 2 \sum_{k=2}^\infty \text{Corr}\left( f(X_1), f(X_k) \right). \]
This motivates defining the effective sample size \[ N_{\text{eff}} := \frac{N}{1+2\sum_{k=2}^\infty \text{Corr}(f(X_1),f(X_k))}. \]
(This is how many independent samples from \(\pi\) you would need to get an estimator with the same variance as \(\tilde{f}_N\)).
In practice \(N_{\text{eff}}\) must be estimated from the chain output, using heuristics based on estimated autocorrelations; one simple version is sketched below
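The denominator in the definition above is sometimes called the integrated autocorrelation time. A simple plug-in estimate, truncating the sum at the first non-positive estimated autocorrelation, might look like this (illustrative; one heuristic among several, reusing the rwm() sketch above):

```r
# Plug-in estimate of N_eff (illustrative).
x   <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = 2.4, N = 10000)[, 1]
fx  <- x^2                                           # f(X_i) = X_i^2
rho <- acf(fx, lag.max = 200, plot = FALSE)$acf[-1]  # estimated Corr(f(X_1), f(X_{1+k}))
K   <- which(rho <= 0)[1]                            # first non-positive autocorrelation
if (is.na(K)) K <- length(rho) + 1
n_eff <- length(fx) / (1 + 2 * sum(rho[seq_len(K - 1)]))
n_eff
```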
It is common to report \(N_{\text{eff}}\), or even \(N_{\text{eff}}/\text{second}\) (much fairer)
See the effectiveSize() command in the R package CODA
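A minimal usage example (assuming the coda package is installed, and reusing the rwm() sketch above):

```r
library(coda)                                    # install.packages("coda") if needed
x <- rwm(function(x) -0.5 * sum(x^2), x0 = 0, h = 2.4, N = 10000)[, 1]
effectiveSize(as.mcmc(x^2))                      # estimated N_eff for f(X_i) = X_i^2
```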
Tomorrow we discuss more advanced algorithms!