# Conjugate Priors
## Uniform prior and binomial likelihood
Suppose we have a uniform prior for parameter $\theta \in (0, 1)$,
$$
\theta \sim \textrm{uniform}(0, 1),
$$
and combine it with a likelihood for data $y \in 0:N$,
$$
y \sim \textrm{binomial}(N, \theta).
$$
We know from Bayes's rule that the posterior is proportional to the
likelihood times the prior. Writing this out in symbols,
$$
\begin{array}{rcl}
p(\theta \mid y, N)
& \propto &
\displaystyle
\frac{p(y \mid \theta, N) \cdot p(\theta)}
{p(y \mid N)}
\\[6pt]
& \propto &
p(y \mid \theta) \cdot p(\theta)
\\[4pt]
& = & \textrm{binomial}(y \mid N, \theta) \cdot \textrm{uniform}(\theta \mid 0,
1)
\\[4pt]
& = &
\displaystyle
\binom{N}{y} \cdot \theta^y \cdot (1 - \theta)^{N - y} \cdot 1
\\[4pt]
& \propto &
\displaystyle
\theta^y \cdot (1 - \theta)^{N - y}.
\end{array}
$$
Now that we know $p(\theta \mid y) \propto \theta^y \cdot (1 -
\theta)^{N - y},$ we can follow Laplace in applying Euler's beta
function to normalize,^[Euler's beta function is defined by
$$
\begin{array}{rcl}
\textrm{B}(\alpha, \beta)
& = &
\displaystyle
\int_0^1 u^{\alpha - 1} \cdot (1 - u)^{\beta - 1} \mathrm{d}u
\\[4pt]
& = &
\displaystyle
\frac{\Gamma(\alpha) \cdot \Gamma(\beta)}
{\Gamma(\alpha + \beta)},
\end{array}
$$
where the gamma function is defined by
$$
\textstyle \Gamma(\gamma) = \int_0^{\infty} u^{\gamma - 1} \cdot \exp(-u) \, \textrm{d}u.
$$
The gamma function generalizes the integer factorial function, satisfying
$$
\Gamma(u + 1) = u!
$$
for all integers $u \in \mathbb{N}.$
]
$$
\begin{array}{rcl}
p(\theta \mid y, N)
& = &
\displaystyle
\frac{\theta^y \cdot (1 - \theta)^{N - y}}
{\int_0^1 \theta^y \cdot (1 - \theta)^{N - y} \, \textrm{d}\theta}
\\[6pt]
& = &
\displaystyle
\frac{1}
     {\textrm{B}(y + 1, \ N - y + 1)}
\cdot \theta^y \cdot (1 - \theta)^{N - y}
\\[6pt]
& = &
\displaystyle
\frac{\Gamma(N + 2)}
     {\Gamma(y + 1) \cdot \Gamma(N - y + 1)}
\cdot \theta^y \cdot (1 - \theta)^{N - y}.
\end{array}
$$
Another way of arriving at the same result is to just work through
Bayes's rule by brute force,
$$
p(\theta \mid y)
= \frac{p(y \mid \theta) \cdot p(\theta)}
{\int_0^1 p(y \mid \theta) \cdot p(\theta) \, \textrm{d}\theta}.
$$
The integral in the denominator is just Euler's beta function evaluated at $(y + 1, \ N - y + 1)$.
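As a quick numerical sanity check, we can normalize the unnormalized
posterior $\theta^y \cdot (1 - \theta)^{N - y}$ on a grid and compare
it with R's built-in beta density; the values of $y$ and $N$ below are
arbitrary.
```{r}
# Numerical check: normalizing theta^y * (1 - theta)^(N - y) on a grid
# should recover the beta(y + 1, N - y + 1) density under a uniform prior.
y <- 3
N <- 10                                    # arbitrary example data
theta <- seq(0.0005, 0.9995, by = 0.001)   # midpoint grid over (0, 1)
unnorm <- theta^y * (1 - theta)^(N - y)
grid_post <- unnorm / sum(unnorm * 0.001)  # normalize by grid integration
analytic <- dbeta(theta, y + 1, N - y + 1) # analytic posterior density
max(abs(grid_post - analytic))             # should be close to zero
```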
## Beta prior and binomial likelihood produces a beta posterior
A $\textrm{beta}(1, 1)$ distribution is identical to a $\textrm{uniform}(0,
1)$ distribution, because
$$
\begin{array}{rcl}
\textrm{beta}(\theta \mid 1, 1)
& \propto &
\theta^{1 - 1} \cdot (1 - \theta)^{1 - 1}
\\[4pt]
& = &
1
\\[4pt]
& = &
\textrm{uniform}(\theta \mid 0, 1).
\end{array}
$$
Now suppose we assume a beta prior with parameters $\alpha, \beta >
0,$
$$
\theta \sim \textrm{beta}(\alpha, \beta).
$$
When combined with a binomial likelihood,
$$
y \sim \textrm{binomial}(N, \theta),
$$
the result is a posterior of the following form
$$
\begin{array}{rcl}
p(\theta \mid y, N)
& \propto &
p(y \mid N, \theta) \cdot p(\theta)
\\[4pt]
& = &
\textrm{binomial}(y \mid N, \theta) \cdot \textrm{beta}(\theta \mid
\alpha, \beta)
\\[4pt]
& \propto &
\left( \theta^y \cdot (1 - \theta)^{N - y} \right)
\cdot
\left( \theta^{\alpha - 1} \cdot (1 - \theta)^{\beta - 1} \right)
\\[8pt]
& = &
\theta^{y + \alpha - 1} \cdot (1 - \theta)^{N - y + \beta - 1}
\\[6pt]
& \propto &
\textrm{beta}(\theta \mid y + \alpha, \ N - y + \beta).
\end{array}
$$
The rearrangement of terms in the penultimate step^[This uses the rule
of exponents from algebra,
$$\theta^u \cdot \theta^v = \theta^{u + v}.$$] lets us collect a
result that matches the kernel of a beta distribution.
The takeaway message is that if the prior is a beta distribution and
the likelihood is binomial, then the posterior is also a beta
distribution.
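The same check works for a general beta prior: multiplying the
binomial likelihood by the beta prior and normalizing on a grid
recovers the $\textrm{beta}(y + \alpha, N - y + \beta)$ density. The
prior parameters and data below are again arbitrary.
```{r}
# Numerical check of beta-binomial conjugacy with arbitrary values:
# prior beta(a, b), likelihood binomial(N, theta), y observed successes.
a <- 2
b <- 5                                        # arbitrary prior parameters
y <- 7
N <- 20                                       # arbitrary data
theta <- seq(0.0005, 0.9995, by = 0.001)
unnorm <- dbinom(y, N, theta) * dbeta(theta, a, b)
grid_post <- unnorm / sum(unnorm * 0.001)     # grid-normalized posterior
analytic <- dbeta(theta, y + a, N - y + b)    # conjugate posterior density
max(abs(grid_post - analytic))                # should be close to zero
```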
## Conjugate priors
In general, a family $\mathcal{F}$ of priors is said to be *conjugate*
to a family $\mathcal{G}$ of likelihood functions, if $p(\theta) \in
\mathcal{F}$ and $p(y \mid \theta) \in \mathcal{G}$ imply that
$p(\theta \mid y) \in \mathcal{F}$. We have already seen one example
of conjugacy, with the family of beta priors and binomial likelihoods,
$$
\begin{array}{rcl}
\mathcal{F} & = &
\{ \, \textrm{beta}(\alpha, \beta) \mid \alpha, \beta > 0 \, \}
\\[6pt]
\mathcal{G} & = &
\{ \, \textrm{binomial}(N, \theta) \mid N \in \mathbb{N}, \
\theta \in (0, 1) \, \}.
\end{array}
$$
It is no coincidence that the likelihood and prior in our conjugate
example have matching forms,
$$
\begin{array}{r|rllll|l}
\textrm{prior}
& \textrm{beta}(\theta \mid \alpha, \beta)
& \propto &
\theta^{\alpha - 1}
& \cdot &
(1 - \theta)^{\beta - 1}
& p(\theta)
\\[6pt]
\textrm{likelihood}
&
\textrm{binomial}(y \mid N, \theta)
& \propto &
\theta^y
& \cdot &
(1 - \theta)^{N - y}
& p(y \mid \theta)
\\[4pt] \hline
\textrm{posterior}
&
\textrm{beta}(\theta \mid y + \alpha, \ N - y + \beta)
& \propto &
\theta^{y + \alpha - 1}
& \cdot &
(1 - \theta)^{N - y + \beta - 1}
& p(\theta \mid y)
\end{array}
$$
Thinking of the exponents $y$ and $N - y$ as success and failure
counts respectively, we can think of the exponents $\alpha - 1$ and
$\beta - 1$ as the prior number of successes and failures.^[The prior
counts are $\alpha - 1$ and $\beta - 1$ for a total prior count of
$\alpha + \beta - 2$. The reason for the subtraction is that
$\textrm{beta}(1, 1)$ is the uniform distribution, so $\alpha = 1, \beta
= 1$ corresponds to a prior count of zero.] We then just add the
prior successes and the observed successes to get $y + \alpha - 1$
posterior successes; the failures work similarly, with $\beta -
1$ prior failures and $N - y$ observed failures producing a posterior
failure count of $N - y + \beta - 1$.
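To make the counting interpretation concrete, here is a small
illustration with arbitrary prior and data counts. It uses the fact
that a $\textrm{beta}(a, b)$ distribution has mean $a / (a + b)$, so
the posterior mean simply pools the prior counts with the observed
counts.
```{r}
# Prior-as-counts illustration with arbitrary numbers.
# A beta(a, b) distribution has mean a / (a + b).
a0 <- 3
b0 <- 7                  # prior: 2 prior successes, 6 prior failures
y <- 28
N <- 40                  # data: 28 successes in 40 trials
a1 <- a0 + y             # posterior shape parameters
b1 <- b0 + N - y
a0 / (a0 + b0)           # prior mean: 0.3
a1 / (a1 + b1)           # posterior mean: between 0.3 and y / N = 0.7
```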
## Beta-Bernoulli conjugacy
The Bernoulli distribution is just a single-trial binomial
distribution. For a single binary observation $y \in \{ 0, 1 \}$ and
chance-of-success parameter $\theta,$
$$
\textrm{bernoulli}(y \mid \theta)
\ = \
\textrm{binomial}(y \mid 1, \theta).
$$
Therefore, the beta distribution must also be conjugate to the Bernoulli
distribution---if the prior is a beta distribution and the likelihood
is a Bernoulli distribution, then the posterior is also a beta
distribution. This is evident if we recast the Bernoulli definition
in the same form as the beta and binomial distributions,
$$
\textrm{bernoulli}(y \mid \theta)
\ = \ \theta^y \cdot (1 - \theta)^{1 - y}.
$$
Working through the algebra, if $p(\theta) = \textrm{beta}(\alpha, \beta)$
and the likelihood is $p(y \mid \theta) = \textrm{bernoulli}(y \mid \theta)$, then the
posterior is $p(\theta \mid y) = \textrm{beta}(\alpha + y, \beta + 1 -
y).$
Said more simply, suppose the prior is $\textrm{beta}(\alpha, \beta)$ and we
condition on a single binary observation $y$. If $y = 1$, the updated
distribution is $\textrm{beta}(\alpha + 1, \beta);$ if $y = 0$, the
updated distribution is $\textrm{beta}(\alpha, \beta + 1).$
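Expressed in code, the single-observation update is just a pair of
additions. The helper name `update_beta` below is ours, not standard;
this is only a sketch.
```{r}
# Single-observation beta-Bernoulli update: given prior parameters (a, b)
# and one binary observation y, return the posterior parameters.
update_beta <- function(a, b, y) {
  c(a + y, b + 1 - y)
}
update_beta(1, 1, 1)   # uniform prior, observe a success: beta(2, 1)
update_beta(1, 1, 0)   # uniform prior, observe a failure: beta(1, 2)
```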
## Chaining repeated observations
Suppose we do not have a single binary observation, but a whole
sequence $y = y_1, \ldots, y_N$, where $y_n \in \{ 0, 1 \}.$ Now
suppose we start with a prior $p(\theta) = \textrm{beta}(\theta \mid
\alpha, \beta)$. Given no observations, our knowledge of the
parameter $\theta$ is determined by the prior,
$$
p(\theta) = \textrm{beta}(\theta \mid \alpha, \beta).
$$
After the first observation, $y_1$, we can update our knowledge of
$\theta$ conditioned on that observation to
$$
p(\theta \mid y_1)
\ = \
\textrm{beta}(\theta \mid \alpha + y_1, \beta + 1 - y_1).
$$
What is our knowledge of $\theta$ after two observations, $y_1, y_2$?
We can apply Bayes's rule, to see that
$$
p(\theta \mid y_1, y_2)
\propto
p(\theta \mid y_1) \cdot p(y_2 \mid \theta).
$$
Displayed this way, the distribution $p(\theta \mid y_1)$ is acting
like a prior with respect to the likelihood for the single observation
$y_2$. Substituting in the actual distributions, we have
$$
\begin{array}{rcl}
p(\theta \mid y_1, y_2)
& \propto &
p(\theta \mid y_1) \cdot p(y_2 \mid \theta)
\\[4pt]
& = &
\textrm{beta}(\theta \mid \alpha + y_1, \beta + 1 - y_1)
\cdot
\textrm{bernoulli}(y_2 \mid \theta)
\\[4pt]
& \propto &
\textrm{beta}(\theta \mid \alpha + y_1 + y_2, \beta + 1 - y_1 + 1 - y_2)
\\[4pt]
& = &
\textrm{beta}(\theta \mid \alpha + (y_1 + y_2), \ \beta + 2 - (y_1 + y_2)).
\end{array}
$$
We can visualize these chained updates as follows.
```{r, engine='tikz', out.width = "100%", fig.ext="pdf", fig.cap="Progress of streaming updates with conjugate priors. There is an initial prior $\\textrm{beta}(\\alpha_0, \\beta_0)$ and a stream of data $y_1, y_2, \\ldots, y_N.$ After each data point $y_n$ is observed, the prior parameters $\\alpha_{n-1}, \\beta_{n-1}$ are updated to the posterior parameters $\\alpha_{n}, \\beta_n,$ which then act as the prior for subsequent data."}
\begin{tikzpicture}[->, auto, node distance=3cm, font=\normalsize]
\node[rectangle,draw,semithick, label = below:$p(\theta)$] (A) {$\textrm{beta}(\alpha_0, \beta_0)$};
\node[rectangle,draw,semithick, label = below:$p(\theta \mid y_1)$] (B) [right of = A] {$\textrm{beta}(\alpha_1, \beta_1)$};
\node[rectangle,draw,semithick, label = below:$p(\theta \mid y_{1:2})$] (C) [right of = B] {$\textrm{beta}(\alpha_2, \beta_2)$};
\node[draw=none, fill=none] (D) [right of = C] {$\cdots$};
\node[rectangle,draw,semithick, label = below:$p(\theta \mid y_{1:N})$] (E) [right of = D] {$\textrm{beta}(\alpha_N, \beta_N)$};
\path(A) edge [ ] node {$y_1$} (B);
\path(B) edge [ ] node {$y_2$} (C);
\path(C) edge [ ] node {$y_3$} (D);
\path(D) edge [ ] node {$y_N$} (E);
\end{tikzpicture}
```
Given a prior $\textrm{beta}(\alpha_0, \beta_0)$, we will let
$\textrm{beta}(\alpha_n, \beta_n)$ be the distribution for $\theta$
conditioned on observations
$$
y_{1:n} = y_1, \ldots, y_n.
$$
The values $(\alpha_n, \beta_n)$ are defined inductively. As a base
case, $(\alpha_0, \beta_0)$ are our initial prior parameters. Then,
inductively, after observing $y_{n + 1},$ we have
$$
(\alpha_{n + 1}, \beta_{n + 1})
\ = \
(\alpha_n + y_{n + 1}, \beta_n + 1 - y_{n + 1}).
$$
Because $y_n$ is binary, we can expand this definition by cases and
also write
$$
(\alpha_{n + 1}, \beta_{n + 1})
\ = \
\begin{cases}
(\alpha_n + 1, \beta_n)
& \textrm{if} \ y_{n + 1} = 1, \ \textrm{and}
\\[4pt]
(\alpha_n, \beta_n + 1)
& \textrm{if} \ y_{n + 1} = 0.
\end{cases}
$$
All that is happening is counting---we just add one to the success
count if $y_{n + 1} = 1$ and one to the failure count if $y_{n + 1} = 0.$
By the time we get to the end and have observed $y_{1:N},$ we have
$$
(\alpha_N, \beta_N)
\ = \
(\alpha_0 + \textrm{sum}(y), \
\beta_0 + N - \textrm{sum}(y)).
$$
This is, not coincidentally, the same result we would have achieved
had we started with the prior $\textrm{beta}(\alpha_0, \beta_0)$ and
observed all $y_1, \ldots, y_N$ simultaneously. In that case, we
could use the equivalence with the binomial,
$$
\textrm{binomial}(\textrm{sum}(y) \mid N, \theta)
\ \propto \
\theta^{\textrm{sum}(y)} \cdot (1 - \theta)^{N - \textrm{sum}(y)}.
$$
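This batch-versus-streaming equivalence is easy to verify
numerically. The sketch below reuses the `update_beta` helper defined
earlier, folds simulated observations in one at a time, and compares
the result with the closed-form batch update.
```{r}
# Streaming versus batch update: folding in observations one at a time
# should give (alpha_0 + sum(y), beta_0 + N - sum(y)).
set.seed(1234)
N <- 50
y <- rbinom(N, 1, 0.3)             # simulated binary observations
ab <- c(2, 2)                      # arbitrary prior (alpha_0, beta_0)
for (n in 1:N) {
  ab <- update_beta(ab[1], ab[2], y[n])
}
ab                                 # streaming result
c(2 + sum(y), 2 + N - sum(y))      # batch result; should match exactly
```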
Another way of deriving this result is by direct expansion,
$$
\begin{array}{rcl}
p(\theta \mid y_1, \ldots, y_N)
& \propto &
p(\theta) \cdot p(y_1, \ldots, y_N \mid \theta)
\\[4pt]
& = &
p(\theta) \cdot p(y_1 \mid \theta) \cdots p(y_N \mid \theta)
\\[4pt]
& = &
\textrm{beta}(\theta \mid \alpha_0, \beta_0)
\cdot
\prod_{n=1}^N \textrm{bernoulli}(y_n \mid \theta)
\\[4pt]
& = &
\theta^{\alpha_0 - 1} \cdot (1 - \theta)^{\beta_0 - 1}
\cdot
\prod_{n=1}^N \theta^{y_n} \cdot (1 - \theta)^{1 - y_n}
\\[4pt]
& = &
\theta^{\alpha_0 - 1} \cdot (1 - \theta)^{\beta_0 - 1}
\cdot
\theta^{\sum_{n=1}^N y_n}
\cdot
(1 - \theta)^{\sum_{n=1}^N 1 - y_n}
\\[4pt]
& = &
\left( \theta^{\alpha_0 - 1} \cdot (1 - \theta)^{\beta_0 - 1} \right)
\cdot
\left( \theta^{\textrm{sum}(y)} \cdot (1 - \theta)^{N - \textrm{sum}(y)}
\right)
\\[4pt]
& = &
\theta^{\alpha_0 + \textrm{sum}(y) - 1} \cdot (1 - \theta)^{\beta_0 + N
- \textrm{sum}(y) - 1}
\\[4pt]
& \propto &
\textrm{beta}(\theta
\mid \alpha_0 + \textrm{sum}(y), \beta_0 + N - \textrm{sum}(y)).
\end{array}
$$
## Gamma-Poisson conjugacy
Following the same line of reasoning that led from the binomial
likelihood to a prior that behaves like prior data, we can reason
backward from the Poisson distribution to a conjugate prior. Recall
that for a count $y \in \mathbb{N}$ and rate $\lambda \in (0, \infty)$,
$$
\textrm{poisson}(y \mid \lambda)
\ = \
\frac{1}{y!}
\cdot \lambda^y
\cdot \exp(-\lambda).
$$
Given this form, the gamma distribution provides a family of
conjugate priors, parameterized by a shape $\alpha$ and rate (inverse
scale) $\beta$, as
$$
\textrm{gamma}(\lambda \mid \alpha, \beta)
\ = \
\frac{\beta^{\alpha}}{\Gamma(\alpha)}
\cdot
\lambda^{\alpha - 1}
\cdot
\exp(-\beta \cdot \lambda).
$$
Now let's work out the conjugacy. Suppose we have a Poisson
likelihood with rate parameter $\lambda$,
$$
\begin{array}{rcl}
p(y \mid \lambda)
& = & \textrm{poisson}(y \mid \lambda)
\\[4pt]
& = &
\frac{1}{y!}
\cdot \lambda^y
\cdot \exp(-\lambda)
\end{array}
$$
with a gamma prior on the rate parameter,
$$
\begin{array}{rcl}
p(\lambda \mid \alpha, \beta)
& = & \textrm{gamma}(\lambda \mid \alpha, \beta)
\\[4pt]
& = &
\frac{\beta^{\alpha}}{\Gamma(\alpha)}
\cdot
\lambda^{\alpha - 1}
\cdot
\exp(-\beta \cdot \lambda).
\end{array}
$$
The posterior is proportional to the product of the likelihood and
the prior,
$$
\begin{array}{rcl}
p(\lambda \mid y, \alpha, \beta)
& \propto & p(y \mid \lambda) \cdot p(\lambda \mid \alpha, \beta)
\\[4pt]
& = &
\textrm{poisson}(y \mid \lambda)
\cdot \textrm{gamma}(\lambda \mid \alpha, \beta)
\\[4pt]
& = &
\left(
\frac{1}{y!}
\cdot \lambda^y
\cdot \exp(-\lambda)
\right)
\cdot
\left(
\frac{\beta^{\alpha}}{\Gamma(\alpha)}
\cdot
\lambda^{\alpha - 1}
\cdot
\exp(-\beta \cdot \lambda)
\right)
\\[4pt]
& \propto &
\left( \lambda^y \cdot \exp(-\lambda) \right)
\cdot
\left( \lambda^{\alpha - 1} \cdot \exp(-\beta \cdot \lambda) \right)
\\[4pt]
& = &
\lambda^{y + \alpha - 1} \cdot \exp(-(\beta + 1) \cdot \lambda)
\\[4pt]
& \propto &
\textrm{gamma}(\lambda \mid \alpha + y, \ \beta + 1).
\end{array}
$$
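As with the beta-binomial case, this can be sanity checked numerically
by normalizing the product of the Poisson likelihood and gamma prior
on a grid; the values of $y$, $\alpha$, and $\beta$ below are
arbitrary.
```{r}
# Numerical check of gamma-Poisson conjugacy with arbitrary values.
alpha <- 2
beta <- 0.5                            # arbitrary prior shape and rate
y <- 6                                 # arbitrary observed count
lambda <- seq(0.005, 40, by = 0.01)    # grid over the rate parameter
unnorm <- dpois(y, lambda) * dgamma(lambda, shape = alpha, rate = beta)
grid_post <- unnorm / sum(unnorm * 0.01)
analytic <- dgamma(lambda, shape = alpha + y, rate = beta + 1)
max(abs(grid_post - analytic))         # should be close to zero
```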
Now suppose we have a sequence of count observations $y_1, \ldots, y_N
\in \mathbb{N}.$ Assuming a $\textrm{poisson}(\lambda)$ likelihood
with prior $\textrm{gamma}(\lambda \mid \alpha_0, \beta_0)$, we can
update with $y_1$ to produce a posterior
$$
p(\lambda \mid y_1, \alpha_0, \beta_0)
\ = \
\textrm{gamma}(\lambda \mid \alpha_0 + y_1, \ \beta_0 + 1).
$$
Using this as a prior for $y_2$, the posterior becomes
$$
p(\lambda \mid y_1, y_2, \alpha_0, \beta_0)
\ = \
\textrm{gamma}(\lambda \mid \alpha_0 + y_1 + y_2, \ \beta_0 + 2).
$$
Extending this logic, we can take a whole batch of observations $y_1,
\ldots, y_N$ at once to produce a posterior,
$$
p(\lambda \mid y_1, \ldots, y_N, \alpha_0, \beta_0)
\ = \
\textrm{gamma}(\lambda \mid \alpha_0 + \textrm{sum}(y), \ \beta_0 + N).
$$
Worked out one step at a time, the posterior after observing $y_1,
\ldots, y_n$ is
$$
p(\lambda \mid y_1, \ldots, y_n, \alpha_0, \beta_0)
\ = \
\textrm{gamma}(\lambda \mid \alpha_n, \beta_n),
$$
where $(\alpha_0, \beta_0)$ are the initial prior parameters, and
$$
(\alpha_{n + 1}, \beta_{n + 1})
\ = \
(\alpha_n + y_{n+1}, \beta_n + 1).
$$
Thinking of the gamma prior in terms of Poisson data,
$\textrm{gamma}(\alpha, \beta)$ represents a total of $\beta$ prior
observations whose counts sum to $\alpha$. For example, a single prior
observation $y$ corresponds to a $\textrm{gamma}(y, 1)$ prior, and
three prior observations $12, 7, 9$ would be represented by a
$\textrm{gamma}(28, 3)$ prior.
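A minimal streaming sketch (with our own helper name `update_gamma`)
makes this concrete: folding the three counts $12, 7, 9$ into a
$\textrm{gamma}(\alpha_0, \beta_0)$ prior adds their sum $28$ to the
shape and their number $3$ to the rate.
```{r}
# Streaming gamma-Poisson update: each count y adds y to the shape and
# 1 to the rate, so the prior behaves like previously observed data.
update_gamma <- function(a, b, y) {
  c(a + y, b + 1)
}
ab <- c(1, 0.1)                      # arbitrary initial prior (alpha_0, beta_0)
for (y in c(12, 7, 9)) {
  ab <- update_gamma(ab[1], ab[2], y)
}
ab                                   # (alpha_0 + 28, beta_0 + 3) = (29, 3.1)
```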