Stuff I Always Look Up ...

1 Purpose

This is a living, annotated bibliography of stuff I need to use on a semi-regular basis, always have to look up, and – thanks to my chaotic file system – can never find. It also documents some embarrassing truths – stuff like “Is variance the square root of standard deviation, or the other way round?” … Of course, it’s the other way round: the standard deviation is the square root of the variance.

So, point number one: If \(\sigma_X\) is the standard deviation of \(X\) then: \[ \mathrm{Var}(X) = \sigma^2_X \]
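A quick NumPy check, just to make the point stick (a throwaway sketch, not anything from the references):

```python
import numpy as np

# Variance is the *square* of the standard deviation, not the square root.
x = np.random.default_rng(1).normal(size=10_000)
print(np.allclose(np.var(x), np.std(x) ** 2))      # True
print(np.allclose(np.std(x), np.sqrt(np.var(x))))  # True
```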

2 Posterior Predictive Distributions

I spent a week trying to find a derivation of equation 1.4 on p. 7 of (Gelman et al. 2014), the posterior predictive distribution (Rubin 1984) of new data \(\tilde{y}\) given previously observed data \(y\) and a model with parameters \(\theta\): \[\begin{equation} p( \tilde{y} | y ) = \int p\left( \tilde{y} | \theta \right) p\left( \theta | y \right) d\theta \tag{2.1} \label{eqn:finalPPD} \end{equation}\]

There are explanations floating around on the internet, but none that I could follow – they skipped steps and left me confused.

We need a few basic laws and definitions from probability theory as follows (Blitzstein and Hwang 2019):

2.1 Conditional Probability

For two variables \(a\) and \(b\): \[\begin{equation} p(a | b) = \frac{p(a,b)}{p(b)} \tag{2.2} \end{equation}\]

Or re-arranged: \[\begin{equation} p(a,b) = p(a | b) p(b) \tag{2.3} \end{equation}\]

And for two variables \(a\) and \(b\), conditioned on \(c\): \[\begin{equation} p(a,b | c) = \frac{p(a,b,c)}{p(c)} \tag{2.4} \end{equation}\]

and the re-arrangement: \[\begin{equation} p(a,b,c) = p(a,b | c) p(c) \end{equation}\]

Note that \(p(a,b,c)\) can also be factorised as any of the following (depending on what we want to achieve): \[\begin{align} p(a,b,c) &= p(b,c|a) p(a) \tag{2.5} \\ &= p(a,c|b) p(b) \tag{2.6} \\ &= p(a,b|c) p(c) \tag{2.7} \end{align}\]
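I find these identities easiest to believe when spelled out on a small discrete joint distribution stored as an array. A minimal NumPy sketch (the joint is random and purely illustrative; sums play the role of integrals throughout):

```python
import numpy as np

# A random discrete joint p(a, b, c) stored as a (2, 3, 4) array, used to
# spell out equations (2.2)-(2.7) in array form.
rng = np.random.default_rng(0)
p_abc = rng.random((2, 3, 4))            # axes: [a, b, c]
p_abc /= p_abc.sum()                     # normalise so the joint sums to 1

p_ab = p_abc.sum(axis=2)                 # p(a,b)
p_a = p_abc.sum(axis=(1, 2))             # p(a)
p_b = p_abc.sum(axis=(0, 2))             # p(b)
p_c = p_abc.sum(axis=(0, 1))             # p(c)

# (2.2)/(2.3): p(a|b) = p(a,b)/p(b) and p(a,b) = p(a|b) p(b)
p_a_given_b = p_ab / p_b
assert np.allclose(p_ab, p_a_given_b * p_b)

# (2.5)-(2.7): the three factorisations of p(a,b,c)
p_bc_given_a = p_abc / p_a[:, None, None]
p_ac_given_b = p_abc / p_b[None, :, None]
p_ab_given_c = p_abc / p_c[None, None, :]
assert np.allclose(p_abc, p_bc_given_a * p_a[:, None, None])
assert np.allclose(p_abc, p_ac_given_b * p_b[None, :, None])
assert np.allclose(p_abc, p_ab_given_c * p_c[None, None, :])
```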

2.2 Law of Total Probability

From the joint probability \(p(a,b)\), the marginal \(p(a)\) is: \[\begin{equation} p(a) = \int p \left( a, b \right) db \end{equation}\]

And the continuous law of total probability is (Blitzstein and Hwang 2019) p. 289: \[\begin{align} p(a) &= \int p \left( a,b \right) db \\ &= \int p \left( a|b \right) p( b ) db \tag{2.8} \end{align}\]

where we’ve used equation (2.3) to re-write \(p \left( a,b \right)\) as \(p \left( a|b \right) p( b )\)

Adding conditioning on \(c\), we obtain (Blitzstein and Hwang 2019) p. 54: \[\begin{align} \label{eqn:lotp_cont} p \left( a|c \right) &= \int p \left( a,b | c \right) db \\ &= \int p \left( a|b,c \right) p( b | c ) db \tag{2.9} \end{align}\]
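The same kind of throwaway check for the law of total probability, plain and with extra conditioning:

```python
import numpy as np

# Check (2.8) and (2.9) on a random discrete joint,
# with sums standing in for the integrals.
rng = np.random.default_rng(0)
p_abc = rng.random((2, 3, 4))            # axes: [a, b, c]
p_abc /= p_abc.sum()

p_ab = p_abc.sum(axis=2)                 # p(a,b)
p_bc = p_abc.sum(axis=0)                 # p(b,c)
p_b = p_abc.sum(axis=(0, 2))             # p(b)
p_c = p_abc.sum(axis=(0, 1))             # p(c)

# (2.8): p(a) = sum_b p(a|b) p(b)
p_a = p_abc.sum(axis=(1, 2))
assert np.allclose(p_a, ((p_ab / p_b) * p_b).sum(axis=1))

# (2.9): p(a|c) = sum_b p(a|b,c) p(b|c)
p_a_given_c = p_abc.sum(axis=1) / p_c
p_a_given_bc = p_abc / p_bc
p_b_given_c = p_bc / p_c
assert np.allclose(p_a_given_c, (p_a_given_bc * p_b_given_c).sum(axis=1))
```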

2.3 Conditional Independence

Two variables \(a\) and \(c\) are conditionally independent given a third variable \(b\) (Blitzstein and Hwang 2019) p. 58: \[\begin{equation} ( a \perp\!\!\!\perp c ) | b \iff p(a,c | b) = p( a | b ) p( c | b) \tag{2.10} \end{equation}\]

2.4 Bayes Theorem

For two variables \(a\) and \(b\): \[\begin{equation} p(a|b) = \frac{p(b|a)p(a)}{p(b)} \tag{2.11} \end{equation}\]

2.5 Bayes with Extra Conditioning

I frequently have to remind myself how to rewrite this form: \(p(a | b, c)\) – this is covered in (Blitzstein and Hwang 2019) Theorem 2.4.2 on pp. 54-56.

There are a few useful ways to re-write.

Using Conditional Probability

Assume it makes sense (in the problem we’re trying to solve) to view \(b\) and \(c\) as “one thing together”; then, using the formula for conditional probability, we get: \[\begin{equation} p( a | b, c ) = \frac{p(a,b,c)}{p(b,c)} \tag{2.12} \end{equation}\]

Applying conditional probability again – equation (2.4) and the different factorisations in (2.5) through (2.7) – to rewrite \(p(a,b,c)\): \[\begin{align} p(a,b,c) &= p(b,c|a) p(a) \\ &= p(a,c|b) p(b) \\ &= p(a,b|c) p(c) \end{align}\]

Choosing the RHS best suited to the problem – here we take \(p(b,c|a) p(a)\), since we are treating \(b\) and \(c\) as “one event” (we want to keep them together) – and substituting into (2.12): \[\begin{equation} p( a | b, c ) = \frac{ p(b,c|a) p(a) }{ p( b,c ) } \tag{2.13} \end{equation}\]

Now we need an expression for the denominator \(p( b,c )\); one option is another application of conditional probability, \(p( b, c ) = p(b|c) p(c)\), resulting in: \[\begin{equation} p( a | b, c ) = \frac{ p(b,c|a) p(a) }{ p(b|c) p(c) } \end{equation}\]

Using the Chain Rule of Probability

Start, as before, with: \[\begin{equation} p( a | b, c ) = \frac{p(a,b,c)}{p(b,c)} \tag{2.14} \end{equation}\]

This time, decompose \(p(a,b,c)\) differently, using the chain rule: \[\begin{align} p(a,b,c) &= p(a|b,c)p(b,c) \\ &= p(b|a,c)p(a,c) \\ &= p(c|a,b)p(a,b) \end{align}\] Obviously, the first re-write gets us nowhere – it merely restates \(p(a|b,c)\). Let’s say the second factorisation, \(p(b|a,c)p(a,c)\), is the helpful one for our problem; then we substitute into (2.14):

\[\begin{equation} p( a | b, c ) = \frac{p(b|a,c)p(a,c)}{p(b,c)} \tag{2.15} \end{equation}\]

We now have to deal with \(p(a,c)\) in the numerator and \(p(b,c)\) in the denominator. Starting with the numerator, apply conditional probability again: \[\begin{equation} p( a, c ) = p(a|c)p(c) \end{equation}\]

For the denominator: \[\begin{equation} p( b, c ) = p(b|c)p(c) \end{equation}\]

What we end up with is Bayes theorem with extra conditioning on \(c\).

Bayes Theorem with Extra Conditioning on \(c\)

Substitute both into (2.15): \[\begin{align} p( a | b, c ) &= \frac{p(b|a,c) p(a|c)p(c)} {p(b|c)p(c)} \\ &= \frac{p(b|a,c) p(a|c)} {p(b|c)} \end{align}\]

Bayes Theorem with Extra Conditioning on \(b\)

Had we chosen, instead, to use \(p(c|a,b)p(a,b)\), we would end up with: \[\begin{align} p( a | b, c ) &= \frac{p(a,b,c)}{p(b,c)} \\ &= \frac{p(c|a,b)p(a,b)}{p(b,c)} \end{align}\]

with \(p(a,b) = p(a|b)p(b)\): \[\begin{equation} p( a | b, c ) = \frac{p(c|a,b)p(a|b)p(b)}{p(b,c)} \end{equation}\]

Of course, \(p(b,c) = p(c,b) = p(c|b)p(b)\) in the denominator: \[\begin{align} p( a | b, c ) &= \frac{p(c|a,b)p(a|b)p(b)}{p(c|b)p(b)} \\ &= \frac{p(c|a,b)p(a|b)}{p(c|b)} \end{align}\]

We choose between conditioning on \(c\) or on \(b\) depending on the problem we are trying to solve (i.e. does it make sense to consider everything as being conditioned on \(c\), or on \(b\)?).
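Both forms are easy to mangle when written from memory, so here is a numerical check of the two identities on an arbitrary discrete joint (again, sums in place of integrals):

```python
import numpy as np

# Numerical check of both "Bayes with extra conditioning" identities on an
# arbitrary discrete joint p(a, b, c).
rng = np.random.default_rng(0)
p_abc = rng.random((2, 3, 4))          # axes: [a, b, c]
p_abc /= p_abc.sum()

p_ab = p_abc.sum(axis=2)               # p(a,b)
p_bc = p_abc.sum(axis=0)               # p(b,c)
p_ac = p_abc.sum(axis=1)               # p(a,c)
p_b = p_abc.sum(axis=(0, 2))           # p(b)
p_c = p_abc.sum(axis=(0, 1))           # p(c)

p_a_given_bc = p_abc / p_bc            # p(a|b,c), the quantity we want

# Conditioning on c:  p(a|b,c) = p(b|a,c) p(a|c) / p(b|c)
p_b_given_ac = p_abc / p_ac[:, None, :]
p_a_given_c = p_ac / p_c
p_b_given_c = p_bc / p_c
rhs_c = p_b_given_ac * p_a_given_c[:, None, :] / p_b_given_c[None, :, :]
assert np.allclose(p_a_given_bc, rhs_c)

# Conditioning on b:  p(a|b,c) = p(c|a,b) p(a|b) / p(c|b)
p_c_given_ab = p_abc / p_ab[:, :, None]
p_a_given_b = p_ab / p_b
p_c_given_b = p_bc / p_b[:, None]
rhs_b = p_c_given_ab * p_a_given_b[:, :, None] / p_c_given_b[None, :, :]
assert np.allclose(p_a_given_bc, rhs_b)

print("both identities hold")
```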

2.6 Deriving the PPD

Starting with the fundamental Bayesian modelling framework:

  1. before observing the data, \(y\), the prior distribution of the parameters is \(p(\theta)\)
  2. we have a sampling distribution of the data given parameters \(p(y|\theta)\)
  3. the joint distribution of \(\theta\) and \(y\) is \(p(\theta, y) = p(\theta)p(y|\theta)\)
  4. we then obtain the posterior distribution of the parameters of the model given the observed data:

\[\begin{equation} p(\theta|y) = \frac{p(\theta,y)}{p(y)} = \frac{p(\theta)p(y|\theta)}{p(y)} \end{equation}\]

So, we can think of the posterior distribution \(p(\theta|y)\) as the ‘output’ of Bayesian model estimation.
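As a concrete instance of steps 1–4, a conjugate Beta-Binomial model keeps everything in closed form. The numbers below are made up purely for illustration (a sketch assuming SciPy is available):

```python
from scipy import stats

# Hypothetical worked instance of the framework: Beta prior, Binomial
# sampling distribution; conjugacy gives the posterior in closed form.
a0, b0 = 2, 2                       # prior p(theta) = Beta(2, 2)
n, y = 20, 14                       # observed data: 14 successes in 20 trials

# p(theta|y) is proportional to p(theta) p(y|theta)  =>  Beta(a0 + y, b0 + n - y)
posterior = stats.beta(a0 + y, b0 + n - y)
print(posterior.mean(), posterior.interval(0.95))
```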

Step 1

We want to obtain a distribution for future values \(\tilde{y}\) given the observed (and modelled) data \(y\) which is \(p(\tilde{y}|y)\) using what we know about the posterior distribution arising from parameter estimation \(p(\theta | y)\).

The first step is to write \(p(\tilde{y}|y)\) using the law of total probability with conditioning, equation (2.9): \[\begin{equation} p( \tilde{y} | y ) = \int p\left(\tilde{y},\theta | y \right) d\theta \tag{2.16} \end{equation}\]

Step 2

Re-write the integrand \(p\left(\tilde{y},\theta | y \right)\) using equation (2.4):

\[\begin{equation} p( \tilde{y} | y ) = \int \frac{p(\tilde{y},\theta,y)}{p(y)} d\theta \tag{2.17} \end{equation}\]

Step 3

We assert that \(\tilde{y}\) is conditionally independent of \(y\) given the model parameters \(\theta\): \[\begin{equation} ( \tilde{y} \perp\!\!\!\perp y ) | \theta \iff p( \tilde{y}, y | \theta) = p( \tilde{y} | \theta ) p( y | \theta) \tag{2.18} \end{equation}\]

To make use of the conditional independence – that is, to introduce the term \(p( \tilde{y}, y | \theta)\) – we have to factorise \(p(\tilde{y},\theta,y)\) in equation (2.17).

Let \(a = \tilde{y}\), \(b = \theta\) and \(c = y\); we are seeking a factorisation of \(p(a,b,c)\) as \(p(a,c|b) p(b)\), and inspecting the factorisations (2.5) through (2.7) we find that (2.6) matches:

\[\begin{align} p(a,b,c) &= p(a,c|b) p(b) \\ p( \tilde{y}, \theta, y) &= p( \tilde{y}, y | \theta) p(\theta) \tag{2.19} \end{align}\]

Substituting (2.19) into (2.17), we have:

\[\begin{equation} p(\tilde{y} | y ) = \int \frac{p(\tilde{y},y | \theta) p(\theta)}{p(y)} d \theta \tag{2.20} \end{equation}\]

We make use of the equality from equation (2.18) i.e. that \(p( \tilde{y}, y | \theta) = p( \tilde{y} | \theta ) p( y | \theta)\) and substitute into (2.20):

\[\begin{equation} p(\tilde{y} | y ) = \int \frac{ p( \tilde{y} | \theta ) p( y | \theta) p(\theta)}{p(y)} d \theta \tag{2.21} \end{equation}\]

Step 4

Recall that we want to make use of the ‘output’ of parameter estimation, the posterior distribution of the model parameters given the observed data \(p(\theta|y)\), and in equation (2.21) we see the term \(p(y|\theta)\). All we need to do is re-write \(p(y|\theta)\) using Bayes rule, equation (2.11):

\[\begin{equation} p(y|\theta) = \frac{p(\theta|y) p(y)}{p(\theta)} \end{equation}\]

And substitute into equation (2.21): \[\begin{align} p(\tilde{y} | y ) &= \int \frac{ p( \tilde{y} | \theta ) p(\theta|y) p(y) p(\theta)}{p(y)p(\theta)} d \theta \\ &= \int p( \tilde{y} | \theta ) p(\theta|y) d \theta \tag{2.22} \\ \end{align}\]

… and there we have it, the expression for the PPD, equation (2.1).
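To convince myself that the integral really does what equation (2.1) says, here is a numerical check for the hypothetical Beta-Binomial example above: the integrand \(p(\tilde{y}|\theta) p(\theta|y)\) is integrated by quadrature and compared against the known closed-form Beta-Binomial predictive (a sketch assuming SciPy):

```python
from scipy import stats
from scipy.integrate import quad

# Evaluate p(y_new | y) = integral of p(y_new | theta) p(theta | y) d(theta)
# for the Beta-Binomial example, and compare with the closed-form predictive.
a0, b0, n, y = 2, 2, 20, 14                    # same made-up numbers as above
posterior = stats.beta(a0 + y, b0 + n - y)     # p(theta | y)
n_new = 5                                      # size of a new experiment

for y_new in range(n_new + 1):
    integrand = lambda th: stats.binom.pmf(y_new, n_new, th) * posterior.pdf(th)
    ppd_numeric, _ = quad(integrand, 0.0, 1.0)
    ppd_exact = stats.betabinom.pmf(y_new, n_new, a0 + y, b0 + n - y)
    print(y_new, round(ppd_numeric, 4), round(ppd_exact, 4))
```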

3 Sampling and Integration

This one came from Chad Scherrer, who posted a tweet about the relationship between integration and sampling; I thought it was a really helpful heuristic.

When I work through examples of Bayesian problems, my first thought is “how will I code this?”, and Chad’s tweet comes to mind. Helpfully, it also follows nicely from the posterior predictive distribution example above.

To paraphrase Chad’s tweet (a toy code sketch follows the list):

  • To sample from \(p(y|x)p(x)\)
    1. Sample \(x\)
    2. Use the sample of \(x\) to sample \(y\)
    3. Return \((x,y)\)
  • To sample from \(\int p(y|x)p(x) dx\)
    1. Sample \(x\)
    2. Use the sample of \(x\) to sample \(y\)
    3. Discard \(x\)
    4. Return \(y\)
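Here is the heuristic written out for a toy model of my own (not from the tweet): \(x \sim \textrm{Normal}(0,1)\) and \(y|x \sim \textrm{Normal}(x,1)\), so the marginal \(\int p(y|x)p(x)dx\) is \(\textrm{Normal}(0,\sqrt{2})\):

```python
import numpy as np

# Toy model: x ~ Normal(0, 1), y|x ~ Normal(x, 1).
rng = np.random.default_rng(42)

def sample_joint():
    """Sample from p(y|x) p(x): keep both x and y."""
    x = rng.normal(0.0, 1.0)
    y = rng.normal(x, 1.0)
    return x, y

def sample_marginal():
    """Sample from the integral of p(y|x) p(x) dx: same two steps, but discard x."""
    x = rng.normal(0.0, 1.0)
    y = rng.normal(x, 1.0)
    return y

print(sample_joint())   # one (x, y) pair from the joint
ys = np.array([sample_marginal() for _ in range(50_000)])
print(ys.var())         # about 2.0, the variance of the marginal Normal(0, sqrt(2))
```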

In practice, then, we want to obtain samples from the posterior predictive distribution \(p(\tilde{y}|y)\), e.g. predictions at some new or unseen input \(\tilde{x}\).

\[\begin{equation} p(\tilde{y} | y ) = \int p( \tilde{y} | \theta ) p(\theta|y) d \theta \end{equation}\]

Here’s how to proceed. We’ve used some Bayesian parameter estimation method (e.g. MCMC or similar) to obtain samples from \(p(\theta|y)\) and stored them as \(\Theta_S\).

We have a function that returns a value \(\tilde{y}\) given an input \(\tilde{x}\) and parameters \(\theta\), such that \(\tilde{y} = f(\tilde{x}; \theta)\).

For each sample \(\theta^{s} \in \Theta_S\) (one draw from \(p(\theta|y)\)):

  1. Compute \(\tilde{y}^s = f(\tilde{x}; \theta^{s})\) – a sample from \(p(\tilde{y}|\theta)\)

  2. Store \(\tilde{y}^s\) and throw away \(\theta^{s}\)

The resulting collection of \(\tilde{y}^s\) values is a set of samples from the PPD, from which we can then compute expected values, quantiles, etc.
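A sketch of the recipe in code, assuming a simple Normal regression \(\tilde{y} = \alpha + \beta \tilde{x} + \varepsilon\) and pretending that an earlier MCMC run has already handed us the posterior draws \(\Theta_S\); the draws below are fabricated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4000

# Pretend these came from MCMC: S draws from p(theta | y) for a model
# y ~ Normal(alpha + beta * x, sigma). The numbers are fabricated.
Theta_S = {
    "alpha": rng.normal(1.0, 0.1, size=S),
    "beta":  rng.normal(2.0, 0.2, size=S),
    "sigma": np.abs(rng.normal(0.5, 0.05, size=S)),
}

def f(x_new, alpha, beta, sigma):
    """One draw of y_new given x_new and a single posterior draw of theta."""
    return rng.normal(alpha + beta * x_new, sigma)

x_new = 3.0
y_new_samples = np.array([
    f(x_new, Theta_S["alpha"][s], Theta_S["beta"][s], Theta_S["sigma"][s])
    for s in range(S)   # theta^s is used once, then effectively discarded
])

# Summaries of the PPD at x_new: expected value and a 90% interval.
print(y_new_samples.mean(), np.quantile(y_new_samples, [0.05, 0.95]))
```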

4 Logits, Probabilities and Odds

I can never remember these relations, so I’ve written them down explicitly for future reference.

4.1 Odds

If \(p_A\) is the probability of event \(A\) then:

  1. \(\textrm{odds}_A = \frac{p_A}{(1-p_A)}\)
  2. given the odds, we recover the probability as \(p_A = \frac{\textrm{odds}_A}{1+\textrm{odds}_A}\)
  3. the log odds are given by \(\ln(\textrm{odds}_A) = \ln \left( \frac{p_A}{1-p_A} \right) = \textrm{logit}(p_A) = \ln(p_A) - \ln(1-p_A)\)

So, odds of 1.0 correspond to a probability \(p_A = 0.5\) – the probability of the event occurring is at chance level.
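The three relations as small functions (a sketch; the function names are mine):

```python
import numpy as np

def odds(p):
    """Odds from a probability: p / (1 - p)."""
    return p / (1.0 - p)

def prob_from_odds(o):
    """Probability from odds: o / (1 + o)."""
    return o / (1.0 + o)

def logit(p):
    """Log odds: ln(p) - ln(1 - p)."""
    return np.log(p) - np.log1p(-p)

print(odds(0.5))            # 1.0 -> at chance level
print(prob_from_odds(3.0))  # 0.75
print(logit(0.75))          # 1.0986... = ln(3)
```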

4.2 Odds Ratios

For two events, \(A\) and \(B\) with probabilities \(p_A\) and \(p_B\):

  1. \(\textrm{odds}_A = \frac{p_A}{(1-p_A)}\) and \(\textrm{odds}_B = \frac{p_B}{(1-p_B)}\)
  2. the odds ratio \(\textrm{OR}_{AB} = \frac{\textrm{odds}_A}{\textrm{odds}_B} = \frac{p_A(1-p_B)}{p_B(1-p_A)}\)
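For example (reusing the odds written out above):

```python
# Odds ratio for two made-up probabilities.
p_A, p_B = 0.8, 0.5
OR_AB = (p_A / (1 - p_A)) / (p_B / (1 - p_B))
print(OR_AB)   # 4.0: the odds of A are four times the odds of B
```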

4.3 Logistic Regression

In logistic regression applied to clinical situations, we are usually interested in a single “thing” associated with two discrete and mutually exclusive events (e.g. \(A\) = “dying” or \(B\) = “not dying”) – the thing either occurs or it does not. In these circumstances \(p_B = 1 - p_A\).

To convert to the language of regression, denote an outcome \(y\) (for example, death, experiencing a side effect, obtaining a positive response to a treatment) then:

  1. the probability that \(y\) occurred is \(p_y\) (equivalent to \(p_A\) in the example above)
  2. the corresponding probability \(y\) does not occur (\(¬y\)) is \(1-p_y\) (equivalent to \(p_B\) in the example above)
  3. the odds of \(y\) occurring are \(\textrm{odds}_{y} = \frac{p_y}{(1-p_y)}\) and the odds of \(¬y\) are \(\textrm{odds}_{¬y} = \frac{1-p_y}{p_y} = \frac{1}{\textrm{odds}_{y}}\)
  4. the odds ratio of \(y\) and \(¬y\) is then \(\textrm{OR}_{(y,¬y)} = \frac{\textrm{odds}_{y}}{\textrm{odds}_{¬y}} = \left( \frac{p_y}{1-p_y} \right)^2\), and the log odds of \(y\), \(\textrm{logit}(p_y) = \ln \left( \frac{p_y}{1-p_y} \right)\), is the quantity modelled by the linear predictor in logistic regression

Further reading: Chapter 13 of (Gelman, Hill, and Vehtari 2020) walks through all this with examples and code in R. On the web, Jay Rotella has a nice PDF walk-through of the same material, with elaboration and graphical examples in R.

References

Blitzstein, Joseph K., and Jessica Hwang. 2019. Introduction to Probability. 2nd ed. Chapman & Hall/CRC.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. 3rd ed. Chapman & Hall/CRC.
Rubin, Donald B. 1984. “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician.” The Annals of Statistics 12 (4): 1151–72.