Dr Samuel Livingstone
Choose a prior distribution \(\pi_0\) over the parameter space \(\Theta\), expressing your beliefs about what \(\theta_0 \in \Theta\) might be
Update the prior into a posterior distribution using Bayes’ theorem, i.e. \[ \pi(\theta|y) = \frac{\pi_0(\theta)f(y|\theta)}{\int \pi_0(\vartheta)f(y|\vartheta)d\vartheta} \]
Question. What is \(f(y|\theta)\)?
We asked \(10\) students who took STAT0044 (my module at UCL) if they enjoyed it \[ X_i = \begin{cases} 1 ~~~ \text{ if student liked STAT0044} \\ 0 ~~~ \text{ otherwise.} \end{cases} \]
The model for the data is \[ \begin{aligned} X_i | p &\sim \text{Bernoulli}(p), \\ p &\sim U[0,1] \end{aligned} \]
Of course all the students liked the module…
This means \[ \begin{aligned} \pi(p|x) &= \frac{f(x|p)\pi_0(p)}{\int f(x|p)\pi_0(p)dp} \\ &= \frac{(p^n \cdot 1) \mathbb{I}_{[0,1]}(p)}{\int_0^1 {p}^n \cdot 1 dp} \\ &= (n+1)p^n\mathbb{I}_{[0,1]}(p). \end{aligned} \]
Recall that the Beta\((\alpha,\beta)\) distribution has pdf \[ f(p) \propto p^{\alpha - 1}(1-p)^{\beta - 1}\mathbb{I}_{[0,1]}(p) \]
Question: So what is the posterior distribution here?
par(mfrow = c(1,2))
n <- 10
f1 <- function(x) { dbeta(x, shape1 = 1, shape2 = 1) }
f2 <- function(x) { dbeta(x, shape1 = 1+n, shape2 = 1) }
curve(f1, from = 0, to = 1, main = "Prior", xlab = "p", ylab = "Density", ylim = c(0,10))
curve(f2, from = 0, to = 1, main = "Posterior", xlab = "p", ylab = "", ylim = c(0,10))
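A quick sanity check (not on the original slide) that the density derived above, \((n+1)p^n\), really is the Beta\((n+1,1)\) density plotted as the posterior:

```r
# The derived posterior density (n + 1) * p^n should coincide exactly with
# the Beta(n + 1, 1) density, confirming the answer to the question above.
n <- 10
p <- seq(0.01, 0.99, by = 0.01)
manual <- (n + 1) * p^n                       # density derived on the slide
beta_pdf <- dbeta(p, shape1 = n + 1, shape2 = 1)
max(abs(manual - beta_pdf))                    # should be numerically zero
```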
Definition An \(E\)-valued discrete-time stochastic process \(\{X_{t}\}_{t\geq0}\) is called a Markov chain if for any fixed \(t\) and any \(x_{t}\in E\) the random variable \[ X_{t+1}|(X_{t}=x_{t}) \] is independent of \(X_{t-k}\) for all \(t \geq k \geq 1\).
Symbolically this can be written \[ \left(X_{t+1}\perp\!\!\!\perp X_{t-k}\right)|X_{t}. \]
In words the future is independent of the past if we know the present.
The Markov property in terms of probabilities (ignoring measure theoretic subtleties) \[ \mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t},...,X_{0}=x_{0}\right)=\mathbb{P}\left(X_{t+1}\in A|X_{t}=x_{t}\right) \]
An initial distribution \(\mu_{0}\), such that \(\mathbb{P}(X_{0}\in A):=\mu_0(A)\) for all \(A\in\mathcal{E}\)
A transition kernel \[ P(x,A) := \mathbb{P}(X_{i} \in A|X_{i-1}=x) \] for all \(A\in\mathcal{E}\) and all \(x\in E\)
Then set \(i=1\), draw \(X_0 \sim \mu_0\) and repeat: draw \(X_i \sim P(X_{i-1},\cdot)\), then set \(i \to i+1\)
We will focus on time-homogeneous chains, meaning \(P\) does not depend on \(t\).
Let \(X_0 = 1\) or \(2\) w.p. \(1/2\), and introduce the matrix
\[ \bar{P} = \left( \begin{array}{cc} 1/3 & 2/3 \\ 2/3 & 1/3 \end{array}\right). \]
Let \(\mathbb{P}(X_{i}=y|X_{i-1}=x)=\bar{P}(x,y)\).
\(\bar{P}\) is called the transition matrix
The transition kernel is \[ P(x,A):=\sum_{y\in A}\bar{P}(x,y). \]
Normally in finite state spaces we just work with the matrix
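A minimal sketch (my own, not from the slides) of simulating this two-state chain in R with `sample()`:

```r
# Simulate the two-state chain with transition matrix Pbar, starting from
# X_0 = 1 or 2 with probability 1/2 each, and look at long-run proportions.
set.seed(1)
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
n_steps <- 10000
x <- numeric(n_steps)
x[1] <- sample(1:2, size = 1)                       # X_0 uniform on {1, 2}
for (i in 2:n_steps) {
  # Row x[i - 1] of Pbar gives the conditional distribution of X_i
  x[i] <- sample(1:2, size = 1, prob = Pbar[x[i - 1], ])
}
table(x) / n_steps   # long-run proportions, close to (1/2, 1/2)
```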
Let \(X_0 \sim N(0,1)\) and introduce the recursion \[ X_{i}=\gamma X_{i-1}+\sqrt{(1-\gamma^{2})}Z_{i}, \] where each \(Z_{i}\stackrel{iid}{\sim} N(0,1)\) and \(\gamma\in\mathbb{R}\) with \(|\gamma|<1\).
Here the kernel is \[ P(x,A)=\frac{1}{\sqrt{2\pi(1-\gamma^{2})}}\int_{y\in A}\exp\left(-\frac{1}{2(1-\gamma^{2})}(y-\gamma x)^{2}\right)dy. \]
We can write \(P(x,A) = \int_A p(x,y)dy\) and call \(p(x,y)\) the transition density.
More generally we can write \(P(x,A) = \int_A P(x,dy)\)
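The AR(1) recursion above is easy to simulate directly; a short sketch (illustrative, with `gamma = 0.8` chosen arbitrarily) checking that the \(N(0,1)\) marginals are preserved:

```r
# Simulate the AR(1) chain X_i = gamma * X_{i-1} + sqrt(1 - gamma^2) * Z_i.
# With X_0 ~ N(0,1), every X_i is exactly N(0,1), so the sample mean and
# variance should stay near 0 and 1 for any |gamma| < 1.
set.seed(2)
gamma <- 0.8
n_steps <- 50000
x <- numeric(n_steps)
x[1] <- rnorm(1)
for (i in 2:n_steps) {
  x[i] <- gamma * x[i - 1] + sqrt(1 - gamma^2) * rnorm(1)
}
c(mean = mean(x), var = var(x))   # approximately (0, 1)
```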
If \(X_0 \sim \mu\) and \(X_{t+1}|(X_t = x) \sim P(x,\cdot)\), then marginally \[ \mathbb{P}(X_1 \in A) = \int P(x,A)\mu(dx). \]
We sometimes define the probability measure \(\mu P\) s.t. for any \(A \in \mathcal{E}\) \[ \mu P(A) := \int \mu(dx)P(x,A) \] (and write \(X_1 \sim \mu P\))
Note that in the finite setting this integral can be re-written \[ \begin{aligned} \mathbb{P}(X_1 \in A) &= \sum_{x \in E} \mu(\{x\})P(x,A) \\ &= \sum_{x \in E} \mathbb{P}(X_0 =x) \mathbb{P}(X_1 \in A |X_0 = x) \end{aligned} \] (i.e. just the law of total probability)
With \(\bar{P}\) as above, recall the Chapman–Kolmogorov equations \[ \mathbb{P}(X_{t+n} = y|X_t = x) = \bar{P}^n(x,y) \]
Examining this matrix for different values of \(n\) gives \[ \begin{aligned} \bar{P} & =\left(\begin{array}{cc} 1/3 & 2/3\\ 2/3 & 1/3 \end{array}\right)\\ \bar{P}^{2} & =\left(\begin{array}{cc} 0.556 & 0.444\\ 0.444 & 0.556 \end{array}\right)\\ \bar{P}^{5} & =\left(\begin{array}{cc} 0.498 & 0.502\\ 0.502 & 0.498 \end{array}\right) \end{aligned} \]
Question. What is happening to each row of \(\bar{P}^{n}\) as \(n\to\infty\)?
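The matrix powers above can be reproduced in R; `matpow` is a small helper of my own (base R has no built-in matrix power):

```r
# Compute powers of Pbar to see each row approach the limit (1/2, 1/2).
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
# Repeated matrix multiplication via Reduce
matpow <- function(P, n) Reduce(`%*%`, replicate(n, P, simplify = FALSE))
matpow(Pbar, 2)
matpow(Pbar, 5)
matpow(Pbar, 20)   # rows essentially equal to (0.5, 0.5)
```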
Many Markov chains have the property that when \(n\) is large the distribution of \(X_{n}\) begins to stabilise
The phenomenon is called convergence to equilibrium or mixing
To define it mathematically we will need a precise notion of distance between distributions
For now we will work with the total variation distance (TV). For distributions \(\mu\) and \(\nu\) on \((E,\mathcal{E})\) \[ \|\mu - \nu\|_{TV} := \sup_{A \in \mathcal{E}}|\mu(A) - \nu(A)| \]
TV is within the family of integral probability metrics, so can also be written \[ \|\mu - \nu\|_{TV} = \sup_{f :E \to [0,1]}|\mathbb{E}_\mu[f] - \mathbb{E}_\nu[f]| \]
If \(\mu\) and \(\nu\) have densities then it can also be written (see exercises) \[ \|\mu - \nu\|_{TV} = \frac{1}{2} \int |\mu(x) - \nu(x)|dx \]
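For finite state spaces the density formula becomes half the \(\ell_1\) distance between probability vectors, which makes convergence to equilibrium easy to track numerically (a sketch of my own, reusing the two-state matrix \(\bar{P}\)):

```r
# Total variation distance between row 1 of Pbar^n and the limit (1/2, 1/2),
# using the finite-state version of the formula: half the L1 distance.
Pbar <- matrix(c(1/3, 2/3,
                 2/3, 1/3), nrow = 2, byrow = TRUE)
pi_lim <- c(1/2, 1/2)
tv <- function(mu, nu) 0.5 * sum(abs(mu - nu))
Pn <- Pbar
for (n in 1:5) {
  cat("n =", n, " TV =", tv(Pn[1, ], pi_lim), "\n")   # decays like (1/3)^n / 2
  Pn <- Pn %*% Pbar
}
```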
If a Markov chain with transition kernel \(P\) is \(\pi\)-invariant, \(\pi\)-irreducible and aperiodic, then \(\pi\) is the unique limiting distribution for the chain.
Definition. A finite state space chain with transition matrix \(\bar{P}\) is \(\pi\)-reversible if \[ \pi(x)\bar{P}(x,y) = \pi(y)\bar{P}(y,x) \] for all \(x,y \in E\).
For a general state space chain with a transition kernel \(P\) the condition can be written \[ \pi(dx)P(x,dy) = \pi(dy)P(y,dx) \]
If \(P(x,\cdot)\) has a transition density \(p(x,y)\) for all \(x \in E\) then \(\pi\) has a density and we can write \[ \pi(x)p(x,y) = \pi(y)p(y,x) \]
(‘\(P\) has a transition density \(p\)’ just means \(P(x,A) = \int_A p(x,y)dy\))
If a Markov chain is \(\pi\)-reversible, then it is \(\pi\)-invariant.
Proof. In the general case, we need to show that \(X_n \sim \pi \implies X_{n+1} \sim \pi\), i.e. that \(\mathbb{P}(X_{n+1} \in A) = \pi(A)\) for any \(A \in \mathcal{E}\)
Recall that if \(X_n \sim \pi\) then \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_n \in E}\pi(dx_n)P(x_n, A) = \int_{x_{n+1} \in A} \int_{x_n \in E} \pi(dx_n)P(x_n, dx_{n+1}) \]
Now by \(\pi\)-reversibility \(\pi(dx_n) P(x_n,dx_{n+1}) = \pi(dx_{n+1}) P(x_{n+1},dx_n)\), meaning we can instead write \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_n \in E} \int_{x_{n+1} \in A} \pi(dx_{n+1}) P(x_{n+1},dx_n) \]
Using the Fubini–Tonelli theorem the integrals can be switched around, meaning \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_{n+1} \in A} \left[ \int_{x_n \in E} P(x_{n+1},dx_n) \right] \pi(dx_{n+1}) \]
But since \(P(x_{n+1}, \cdot)\) is a probability measure then \(P(x_{n+1},E) = 1\), meaning \[ \mathbb{P}(X_{n+1} \in A) = \int_{x_{n+1} \in A} \pi(dx_{n+1}) = \pi(A). \]
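The finite-state version of this result can be checked numerically. The sketch below (my own illustration, not from the slides) builds a \(\pi\)-reversible matrix on \(\{1,2,3\}\) via a Metropolis-style construction and verifies both detailed balance and invariance:

```r
# Build a pi-reversible transition matrix on {1, 2, 3} using a uniform
# proposal over the other two states and a Metropolis acceptance ratio,
# then verify that detailed balance implies pi P = pi, as in the proof.
pi_target <- c(0.2, 0.3, 0.5)
P <- matrix(0, 3, 3)
for (x in 1:3) for (y in 1:3) {
  if (x != y) P[x, y] <- 0.5 * min(1, pi_target[y] / pi_target[x])
}
diag(P) <- 1 - rowSums(P)            # remaining mass stays put
# Detailed balance: pi(x) P(x, y) = pi(y) P(y, x), i.e. D is symmetric
D <- diag(pi_target) %*% P
max(abs(D - t(D)))                   # should be numerically zero
# Invariance: pi P = pi
max(abs(pi_target %*% P - pi_target))  # should be numerically zero
```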
If a Markov chain \(\{X_{t}\}_{t\geq0}\) is \(\pi\)-irreducible and \(\pi\)-invariant, and \(f: E \to\mathbb{R}\) satisfies \(\mathbb{E}_{\pi}[|f(X)|]<\infty\), then the ergodic average \[ \hat{f}_{n}:= \frac{1}{n}\sum_{i=0}^{n-1}f(X_{i}) \] satisfies \[ \mathbb{P}\left(\lim_{n\to\infty}\hat{f}_{n}=\mathbb{E}_{\pi}[f(X)]\right)=1. \]
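The ergodic theorem can be illustrated with the AR(1) chain from earlier (a sketch of my own; \(f(x) = x^2\) and `gamma = 0.5` are arbitrary choices):

```r
# Illustrate the ergodic theorem: the ergodic average of f(X) = X^2 along
# the AR(1) chain should converge to E_pi[X^2] = 1 under pi = N(0, 1).
set.seed(3)
gamma <- 0.5
n_steps <- 100000
x <- numeric(n_steps)
x[1] <- rnorm(1)
for (i in 2:n_steps) {
  x[i] <- gamma * x[i - 1] + sqrt(1 - gamma^2) * rnorm(1)
}
mean(x^2)   # should be close to 1
```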
We will find out next time!