# Posterior Predictive Inference
One of the primary reasons we fit models is to make predictions about
the future. More specifically, we want to observe some data $y$ and
use it to predict future data $\tilde{y}$. Even more specifically,
we'd like to understand the probability distribution of the future
data $\tilde{y}$ given the observed data $y$.
We are going to assume that we are still working relative to a model
whose sampling distribution has a density $p(y \mid \theta)$.^[From
now on, we'll be dropping random variable subscripts on probability
functions. The convention in applied statistics is to choose names
for bound variables that allow the random variables to be determined
by context. For example, we will write $p(\theta \mid y)$ for a
posterior, taking it to mean $p_{\Theta \mid Y}(\theta \mid y)$.]
Thus if we knew the value of $\theta$,^[We are also overloading
lowercase variables like $\theta$ to mean their random variable
counterpart $\Theta$ when necessary, as it is here, where we should
properly be saying that the random variable $\Theta$ takes on some
known value $\theta$.] the distribution of $\tilde{y}$ would be given
by $p(\tilde{y} \mid \theta)$. Here, we are assuming that the
sampling distribution takes the same form for the original data,
$p(y \mid \theta)$, as for the predictive data, $p(\tilde{y} \mid \theta)$.
Unfortunately, we don't know the true value of $\theta$. All we have
to go on are the inferences we can make about $\theta$ given our model
and observed data $y$. As we saw in the last chapter, this knowledge
is encapsulated in a posterior distribution with density $p(\theta
\mid y)$. So rather than making predictions $p(\tilde{y} \mid
\theta)$ based on a single estimated value of $\theta$, we are going
to create a weighted average of predictions for every possible
$\theta$ with weights determined by the posterior $p(\theta \mid y)$.
Because $\theta$ is continuous, the averaging must proceed by
integration. The result is the *posterior predictive distribution*,
the density for which is defined by
$$
p(\tilde{y} \mid y)
\ = \
\displaystyle
\int_{\Theta}
p(\tilde{y} \mid \theta)
\times
p(\theta \mid y)
\,
\mathrm{d}\theta.
$$
The variable $\Theta$ is now doing extra duty as the set of possible
values for $\theta$, that is, the range of values over which
$\theta$ is averaged.
When estimating the probability of new data, there are two forms of
uncertainty that need to be taken into account. The first is
estimation uncertainty arising from not knowing the exact value of
$\theta$. This comes into play by averaging over the posterior
$p(\theta \mid y)$. The second form of uncertainty is introduced by
the sampling distribution $p(\tilde{y} \mid \theta)$. Even if we knew
the precise value of $\theta$, we would still not know the value of
$\tilde{y}$ because it is not a deterministic function of $\theta$.
Let's write our formula again highlighting the two sources of
uncertainty.
$$
p(\tilde{y} \mid y)
\ = \
\int_{\Theta}
\underbrace{p(\tilde{y} \mid \theta)}_{\mbox{sampling uncertainty}}
\times
\underbrace{p(\theta \mid y)}_{\mbox{estimation uncertainty}}
\,
\mathrm{d}\theta.
$$
## Calculation via simulation
The posterior predictive density (or probability mass function, in the
discrete case) is defined by an integral that averages over the
posterior, so it can be estimated using simulated draws
$\theta^{(1)}, \ldots, \theta^{(M)}$ from the
posterior $p(\theta \mid y)$ as
$$
p(\tilde{y} \mid y)
\ \approx \
\frac{1}{M} \sum_{m=1}^M p(\tilde{y} \mid \theta^{(m)}).
$$
That is, we just take the average prediction over our sample of
simulations.
Continuing our example from the previous chapter, we will put
everything together and show how to compute posterior predictive
densities. To review, we have $y$ boys born out of $N$ births, under
the sampling distribution $y \sim \mbox{binomial}(N, \theta)$ and
prior $\theta \sim \mbox{uniform}(0, 1)$. We have shown in the
previous chapter how to take simulated draws $\theta^{(1)}, \ldots,
\theta^{(M)}$ from the posterior distribution with density $p(\theta
\mid y, N)$.
Now suppose we have new data $\tilde{y}$ from a trial of size
$\tilde{N}$. We can estimate $p(\tilde{y} \mid y, \tilde{N})$ by
instantiating the general formula above,
$$
p(\tilde{y} \mid y, \tilde{N})
\ \approx \
\frac{1}{M} \sum_{m = 1}^M
\mbox{binomial}(\tilde{y} \mid \tilde{N}, \theta^{(m)}).
$$
If we treat $\mbox{binomial}$ as evaluating elementwise,^[A function
$f : \mathbb{R} \rightarrow \mathbb{R}$ can be extended *elementwise*
to a function on sequences $f : \mathbb{R}^N \rightarrow \mathbb{R}^N$
by defining $$f(x)[n] = f(x[n]).$$ For example, elementwise
exponentiation satisfies $$\exp((u, v, w)) = (\exp(u), \exp(v),
\exp(w)).$$] so that a collection of $\theta$ values as input produces a
collection as output, then we can write this using the
elementwise definition of the binomial and a function to calculate
the mean of a collection as
$$
\begin{array}{rcl}
p\!\left(\tilde{y} \mid y\right)
& \approx &
\mbox{mean}\!\left(\mbox{binomial}\!\left(\tilde{y} \mid \tilde{N}, \theta\right)\right)
\\[6pt]
& \approx &
\mbox{mean}\!\left(\mbox{binomial}\!\left(\tilde{y} \mid \tilde{N},
\left(\theta^{(1)}, \ldots, \theta^{(M)}\right)\right)\right)
\\[6pt]
& = &
\mbox{mean}\!\left(
\left( \mbox{binomial}\!\left(\tilde{y} \mid \tilde{N}, \theta^{(1)}\right),
\ldots,
\mbox{binomial}\!\left(\tilde{y} \mid \tilde{N}, \theta^{(M)}\right)
\right)
\right)
\\[6pt]
& = &
\frac{1}{M} \sum_{m=1}^M
\mbox{binomial}\!\left(\tilde{y} \mid \tilde{N}, \theta^{(m)}\right).
\end{array}
$$
The second and third lines illustrate how the elementwise definition is
expanded, and the final line shows how the mean function
is applied.^[The definition of the mean function is $$\mbox{mean}\!\left( (u_1, \ldots, u_N)
\right) \ = \ \frac{1}{N} \sum_{n=1}^N u_n.$$]
The pseudocode translates the mathematical definition directly.
```
for (m in 1:M)
  draw theta(m) from posterior p(theta | y, N)
  p[m] = binomial(y_pred | N_pred, theta(m))
print 'estimated p(y_pred | y) = ' mean(p)
```
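For concreteness, here is a minimal R rendering of this loop (not part of
the original text). It assumes the conjugate
$\mbox{beta}(y + 1, N - y + 1)$ posterior from the previous chapter for
the draws, and $\tilde{y} = 6$ is an arbitrary choice of prediction target.

```
# Minimal R sketch of the loop, assuming the beta(y + 1, N - y + 1)
# posterior from the previous chapter; y_pred = 6 is an arbitrary choice.
M <- 1000
y <- 3;  N <- 10             # observed data (the running example)
y_pred <- 6;  N_pred <- 20   # prediction target and future trial size
p <- rep(NA, M)
for (m in 1:M) {
  theta_m <- rbeta(1, y + 1, N - y + 1)    # draw theta(m) from the posterior
  p[m] <- dbinom(y_pred, N_pred, theta_m)  # binomial(y_pred | N_pred, theta(m))
}
mean(p)                                    # estimated p(y_pred | y)
```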
With an elementwise version of the binomial probability mass function,
this simplifies to the following.^[Simple can mean several
things---we're measuring code simplicity, not necessarily how simple
it is to understand. Code simplicity involves several things, among
them shallower nesting of statements and fewer indexed expressions, as
in this example. As another example, the calculation of the mean could
be done manually, but it is simpler to use a library function. Even if
there's no library function, it simplifies code to break complex
operations down into well-named functions. Building up a personal
library of such functions is the first step toward becoming a
developer.]
```
for (m in 1:M)
  draw theta(m) from posterior p(theta | y, N)
p = binomial(y_pred | N_pred, theta)
print 'estimated p(y_pred | y) = ' mean(p)
```
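In R, `dbinom()` already evaluates elementwise over its probability
argument, so the vectorized version is roughly a three-line sketch,
reusing the quantities defined in the loop version above.

```
theta <- rbeta(M, y + 1, N - y + 1)   # M posterior draws at once
p <- dbinom(y_pred, N_pred, theta)    # evaluates elementwise over theta
mean(p)                               # estimated p(y_pred | y)
```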
Let's continue with the small data example, where $y = 3$ and $N =
10$. We'll take $\tilde{N} = 20$ in order to make a prediction over
the next twenty observations and calculate the probability for each
possible $\tilde{y} \in 0:\tilde{N}$. Then we'll plot them in a bar
chart to inspect the distribution.
```{r fig.cap = "Probability mass function for the posterior predictive distribution of the number of boys $\\tilde{y}$ in a subsequent group of $\\tilde{N} = 20$ births based on observing 3 boys in 10 births with no prior information. Although it is peaked near the 30% level observed in the data, it would not be that surprising to see more boys than girls in the next 20 births."}
set.seed(1234)
log_sum_exp <- function(u) max(u) + log(sum(exp(u - max(u))))
M <- 1000
y <- 3
N <- 10
N_pred <- 20
lp <- rep(NA, M)
log_pred <- rep(NA, N_pred + 1)
theta <- rbeta(M, y + 1, N - y + 1)
for (y_pred in 0:N_pred) {
  for (m in 1:M) {
    lp[m] <- dbinom(y_pred, N_pred, theta[m], log = TRUE)
  }
  log_pred[y_pred + 1] <- log_sum_exp(lp) - log(M)
}
pp3_10_df <- data.frame(y_pred = 0:20, log_p = exp(log_pred))
pp1_plot <-
  ggplot(pp3_10_df, aes(y_pred, log_p)) +
  geom_col(color = 'black', fill = '#ffffe8', size = 0.25) +
  xlab(expression(widetilde(y))) +
  ylab(expression(paste("p(", widetilde(y), " | ", y, ")"))) +
  ggtheme_tufte()
pp1_plot
```
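As an optional sanity check (not part of the original presentation), any
single entry of this distribution can also be computed by one-dimensional
numerical integration of the defining integral, using the
$\mbox{beta}(y + 1, N - y + 1)$ posterior density from the previous
chapter; here $\tilde{y} = 6$ is an arbitrary choice.

```
# p(y_pred = 6 | y) by numerical integration of
#   binomial(6 | 20, theta) * beta(theta | y + 1, N - y + 1)
integrate(function(theta) dbinom(6, 20, theta) * dbeta(theta, 3 + 1, 10 - 3 + 1),
          lower = 0, upper = 1)$value
exp(log_pred[6 + 1])   # the simulation-based estimate from the chunk above
```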
Because we have only observed $N$ outcomes, the posterior is not very
concentrated---it is consistent with a wide range of possible
outcomes. Contrast this situation to using the binomial directly to do
prediction by setting $\theta = 0.3$, the observed proportion of
boys.^[Inference with a single point estimate of one or more
parameters is common in live, real-time applications, where the cost
of averaging over the posterior may be prohibitive.]
```{r fig.cap = "Probability mass function for predictions made by plugging the proportion of boy births observed in the data, $\\theta^* = \\frac{y}{N} = 0.3$, into the sampling distribution, to yield $\\mbox{binomial}(\\tilde{y} \\mid \\tilde{N}, \\theta^*)$. Compared to the full posterior taking into account estimation uncertainty in $\\theta$, this plug-in estimate makes it seem very unlikely there will be more boys than girls born in the next 20 births."}
binom_df <- data.frame(y_pred = 0:20, log_p = dbinom(0:20, 20, 0.3))
binom_plot <-
  ggplot(binom_df, aes(y_pred, log_p)) +
  geom_col(color = 'black', fill = '#ffffe8', size = 0.25) +
  xlab(expression(widetilde(y))) +
  ylab(expression(paste("binom(", widetilde(y), " |", widetilde(N), ", 0.3)"))) +
  ggtheme_tufte()
binom_plot
```
The probability mass in the predictions is much more concentrated
around the observed proportion of boys. Ignoring the estimation
uncertainty in $\theta$ captured by the full posterior leads to
inferences that are overly concentrated compared to the full
probabilistic conditioning on observed data. As we will see in
subsequent chapters, such predictions are not well calibrated for
future data, even assuming the model is correct---they place too much
certainty in the observed proportion of boys in a small sample.
The full posterior automatically adjusts for the size of the sample.
Consider the following plot, in which the same proportion of boys is
provided (30%), but the sample size continues to grow (quadrupling
each time).
```{r out.width = "100%", fig.width = 9, fig.asp=0.4, fig.cap = "Illustration of convergence of posterior to binomial prediction based on proportion of boys as number of observations $y$ and $N$ grows, with proportion $\\frac{y}{N}$ fixed. The central limit theorem tells us that each quadrupling of the data cuts the uncertainty in parameter estimation in half until all that is left is the sampling uncertainty in the binomial, as represented in the final plot, where $\\theta = 0.3$ is fixed."}
M <- 1000
y <- 12
N <- 40
N_pred <- 20
lp <- rep(NA, M)
log_pred <- rep(NA, N_pred + 1)
theta <- rbeta(M, y + 1, N - y + 1)
for (y_pred in 0:N_pred) {
  for (m in 1:M) {
    lp[m] <- dbinom(y_pred, N_pred, theta[m], log = TRUE)
  }
  log_pred[y_pred + 1] <- log_sum_exp(lp) - log(M)
}
pp12_40_df <- data.frame(y_pred = 0:20, log_p = exp(log_pred))

y <- 48
N <- 160
N_pred <- 20
lp <- rep(NA, M)
log_pred <- rep(NA, N_pred + 1)
theta <- rbeta(M, y + 1, N - y + 1)
for (y_pred in 0:N_pred) {
  for (m in 1:M) {
    lp[m] <- dbinom(y_pred, N_pred, theta[m], log = TRUE)
  }
  log_pred[y_pred + 1] <- log_sum_exp(lp) - log(M)
}
pp48_160_df <- data.frame(y_pred = 0:20, log_p = exp(log_pred))

pp_full_df <- data.frame(y_pred = c(), log_p = c(), model = c())
pp_full_df <- rbind(pp_full_df,
  data.frame(y_pred = pp3_10_df$y_pred, log_p = pp3_10_df$log_p,
             model = rep('y = 3, N = 10', 21)))
pp_full_df <- rbind(pp_full_df,
  data.frame(y_pred = pp12_40_df$y_pred, log_p = pp12_40_df$log_p,
             model = rep('y = 12, N = 40', 21)))
pp_full_df <- rbind(pp_full_df,
  data.frame(y_pred = pp48_160_df$y_pred, log_p = pp48_160_df$log_p,
             model = rep('y = 48, N = 160', 21)))
pp_full_df <- rbind(pp_full_df,
  data.frame(y_pred = binom_df$y_pred, log_p = binom_df$log_p,
             model = rep('fix theta = 0.3', 21)))
# order the facet panels by growing sample size, with the plug-in estimate last
pp_full_df$model <- factor(pp_full_df$model,
  levels = c('y = 3, N = 10', 'y = 12, N = 40',
             'y = 48, N = 160', 'fix theta = 0.3'))
pp_full_plot <-
  ggplot(pp_full_df, aes(y_pred, log_p)) +
  facet_grid(cols = vars(model)) +
  # facet_wrap( ~ model) +
  geom_col(color = 'black', fill = '#ffffe8', size = 0.1) +
  xlab(expression(widetilde(y))) +
  ylab('probability') +
  ggtheme_tufte() +
  theme(panel.spacing.x = unit(1, "lines"))
pp_full_plot
```
As the data size grows, the posterior predictive distribution
approaches the predictions derived from just plugging the proportion
of boys observed (30%) into the sampling distribution. After 160
observations, the posterior predictive distribution is only a percent
or two away from the predictions that do not take into account
estimation uncertainty. Whether that matters or not will depend on the
application.^[If leveraged bets or lives are at stake, a 1%
discrepancy in predictions can have enormous consequences.]
## Numerically stable expectations on the log scale
We often run into the problem of underflow when computing densities or
probabilities.^[Underflow is the result of operations that produce
nonzero real numbers $\epsilon$ smaller in magnitude than the smallest
positive number that can be represented.] To get around that problem we compute on the log
scale, using $\log p(y \mid \theta)$ rather than doing calculations on
the original scale $p(y \mid \theta)$.
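For instance, the following R sketch (the specific numbers are arbitrary)
shows a binomial probability that underflows to zero on the natural scale
yet remains perfectly representable as a log probability.

```
dbinom(0, 3000, 0.5)               # 0: the true value, 2^-3000, underflows
dbinom(0, 3000, 0.5, log = TRUE)   # about -2079.4: finite on the log scale
```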
But we have to be careful when averaging, because the logarithm is a
non-linear operation and does not distribute through the sums used in
simulation-based estimates. For most $u$ and $v$,^[An exception is $u = v = 2$.]
$$
\log u + \log v \neq \log (u + v).
$$
As a result, the log of an average is not equal to the average of a
log,
$$
\frac{1}{M} \sum_{m=1}^M \log u_m
\neq
\log \left( \frac{1}{M} \sum_{m = 1}^M u_m \right).
$$
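A two-number example in R (with arbitrary values) makes the gap concrete.

```
u <- c(1e-3, 1)
log(mean(u))    # about -0.69
mean(log(u))    # about -3.45
```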
All is not lost, however. We will rewrite our desired result as
$$
\begin{array}{rcl}
\log \frac{1}{M} \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)})
& = &
\log \frac{1}{M} + \log \sum_{m = 1}^M p(\tilde{y} \mid \theta^{(m)})
\\[4pt]
& = &
- \log M
+ \log \sum_{m=1}^M \exp
\left( \log p(\tilde{y} \mid \theta^{(m)}) \right)
\\[4pt]
& = &
\mbox{log_sum_exp}(\log p(\tilde{y} \mid \theta)) - \log M.
\end{array}
$$
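The payoff is numerical: if the individual probabilities underflow, the
naive log-of-mean is useless, whereas the log-sum-exp form is fine. Here
is a small R sketch with made-up log values, reusing the `log_sum_exp()`
helper defined in the plotting code above.

```
log_sum_exp <- function(u) max(u) + log(sum(exp(u - max(u))))
lp <- c(-1000, -1001, -1002)        # log p(y_pred | theta(m)) for three draws
log(mean(exp(lp)))                  # -Inf: exp() underflows to zero
log_sum_exp(lp) - log(length(lp))   # about -1000.7: the same quantity, stably
```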
Extending our example, we can work on the log scale to stabilize
calculations with larger $N$ by calculating
$$
\begin{array}{rcl}
\log p(\tilde{y} \mid y)
& \approx &
\log \sum_{m=1}^M
\exp\left(
\log \mbox{binomial}(\tilde{y} \mid \tilde{N}, \theta^{(m)})
\right)
- \log M
\\[8pt]
& = &
\mbox{log_sum_exp}(\log \mbox{binomial}(\tilde{y} \mid \tilde{N}, \theta)) - \log M.
\end{array}
$$
As with the algorithm on the original scale, the pseudocode is
straightforward given a means to draw $\theta^{(m)}$ from the
posterior $p(\theta \mid y, N)$.^[Typically, software will have the
binomial implemented on the log scale directly, so it won't be
necessary to take the log of the standard scale version.]
```
for (m in 1:M)
  draw theta(m) from posterior p(theta | y, N)
  lp[m] = log(binomial(y_pred | N_pred, theta(m)))
print 'log p(y_pred | y) = ' log_sum_exp(lp) - log(M)
```
Thus we can calculate posterior predictive densities on the log scale
using log scale density calculations throughout to prevent overflow
and underflow in intermediate calculations or in the final result.
## Posterior predictive densities as expectations
Working with expectations and conditional expectations is natural for
posterior inference, but initially requires some mental gymnastics to
interpret all the implicit bindings. Using expectation notation, the
posterior predictive distribution can be defined as
$$
p(\tilde{y} \mid y)
\ = \
\displaystyle
\mathbb{E}\!\left[ p(\tilde{y} \mid \theta) \mid y \right].
$$
We are now overloading lower case variables to do double duty as their
upper-case counterparts. Rendered with full random variable
indexing, the expectation and its definition are
$$
\begin{array}{rcl}
p_{\tilde{Y} \mid Y}(\tilde{y} \mid y)
& = &
\displaystyle
\mathbb{E}\!\left[
p_{\tilde{Y}\mid\Theta}(\tilde{y} \mid \Theta) \mid Y = y
\right]
\\[8pt]
& = &
\displaystyle
\int_T
p_{\tilde{Y}\mid\Theta}(\tilde{y} \mid \theta)
\times p_{\Theta \mid Y}(\theta \mid y)
\, \mathrm{d} \theta,
\end{array}
$$
where $T$ is the domain of integration for $\theta$, i.e., possible values
for $\Theta$. The twist is that the function whose expectation is
being taken is now a density $p(\tilde{y} \mid \theta)$ in which
$\theta$ shows up as a conditioning variable for the data $\tilde{y}$
being predicted.
The trick to understanding expectations is in understanding which
variables are bound---all other random variables are averaged out to
define the expectation. The values of both the observed data $y$ and
the data $\tilde{y}$ for which we are computing predictions are fixed
by the function definition and are not free variables in the
expectation. The random variable $\theta$ is not bound anywhere, and
hence it is averaged over in the calculation.