# Statistical Inference and Inverse Problems
*Deductive inference* works from facts toward conclusions
deterministically. For example, if I tell you that all men are mortal
and that Socrates is a man, you can deductively conclude that Socrates
is mortal.  *Inductive inference*, on the other hand, is a bit more
slippery to define, as it works from observations back to facts. That
is, if we think of the facts as governing or generating the
observations, then induction is a kind of inverse inference.
*Statistical inference* is a kind of inductive inference that is
specifically formulated as an inverse problem.
## Laplace's birth ratio model
The roots of statistical inference lie not in games of chance, but in
the realm of public health. Pierre-Simon Laplace was investigating
the rate of child births by sex in France in an attempt to predict
future population sizes.^[Pierre-Simon Laplace. 1812. *Essai
philosophique sur les probabilités*. H. Remy. p. lvi of the
Introduction. Annotated English translation of the 1825 Fifth
Edition: Andrew I. Dale, 1995. *Philosophical Essay on
Probabilities*. Springer-Verlag.] Laplace reports the following
numbers of live births, gathered from thirty departments of France
between 1800 and 1802.
$$
\begin{array}{r|r}
\mbox{sex} & \mbox{live births}
\\ \hline
\mbox{male} & 110\,312
\\
\mbox{female} & 105\,287
\end{array}
$$
Laplace assumed each birth is independent and each has probability
$\Theta \in [0, 1]$ of being a boy. Letting $Y$ be the number of male
births and $N$ be the total number of births, Laplace assumed the
model
$$
Y \sim \mbox{binomial}(N, \Theta).
$$
In other words, his data-generating distribution had the probability
mass function^[The constant $N$ that appears in the full binomial
notation is suppressed in the density notation $p_{Y \mid
\Theta}$---it is common to suppress constants in the notation to make
the relationship between the modeled data $Y$ and parameters $\Theta$
easier to scan.]
$$
p_{Y \mid \Theta}(y \mid \theta)
\ = \
\mbox{binomial}(y \mid N, \theta).
$$
Because it employs a binomial distribution, this model assumes that
the sex of each baby is independent, with probability $\theta$ of
being a boy. This may or may not be a good approximation to reality.
Part of our job is going to be to check assumptions like this that are
built into our models.
We know how to generate $Y$ given values for the parameter $\Theta$,
but we are now faced with the *inverse problem* of drawing inferences
about $\Theta$ based on observations about $Y$.
## What is a model?
We say that this simple formula is a *model* in the sense that it is
not the actual birth process, but rather a mathematical construct
meant to reflect properties of the birth process. In this sense, it's
like Isaac Newton's model of the planetary motions using differential
equations.^[Isaac Newton. 1687. *Philosophiae Naturalis Principia
Mathematica*. Translated as I. Bernard Cohen and Anne
Whitman. 1999. *The Principia: Mathematical Principles of Natural
Philosophy.* University of California Press.] The equations are not
the planets, just descriptions of how they move in response to
gravitational and other forces.
Models like Newton's allow us to predict certain things, such as the
motion of the planets, the tides, and balls dropped from towers. But
they typically only approximate the full process being modeled in some
way. Even Newton's model, which is fabulously accurate at predictions
at observable scales, is only an approximation to the finer-grained
theory of motion and gravity introduced by Albert Einstein,^[Einstein,
Albert. 1907. On the relativity principle and the conclusions drawn
from it. *Jahrbuch der Radioaktivität und Elektronik* 4:411--462.]
which itself was only a special case of the more general theory of
relativity.^[Einstein, Albert. 1916. The foundation of the general
theory of relativity. *Annalen der Physik* 49:769--822.] Each successive
model is better than the last in that it's better at prediction, more
general, or more elegant---science does not progress based on a single
criterion for improving models.
The reproductive process is complex, and many factors may impinge on
the sex of a baby being born. Part of our job as scientists is to
check the assumptions of our models and refine them as necessary.
This needs to be done relative to the goal of the model. If the goal
of this simple reproductive model is only to predict the prevalence of
male births at a national scale, then a simple, direct prevalence
model with a single parameter like the one we have introduced may be
sufficient.
To conclude, when we say "model", all we have in mind is some
mathematical construct taken to represent some aspect of reality.
Whether a model is useful is a pragmatic question which must be judged
relative to its intended application.
## What is a random variable?
As in all statistical modeling, Laplace treated the observed number of
male births $Y$ as a random variable. This assumes a form of
counterfactual reasoning whereby we assume the world might have been
some other way than it actually turned out to be.
As in most statistical models, Laplace treated $N$ as a constant. In
many cases, the denominator of binary events is not itself a constant,
but is itself a random variable determined by factors of the
environment. For instance, the number of attempts an athlete on a
sports team gets depends on the ability of that athlete, and the number
of reviews a movie receives depends on its popularity.
As originally formulated by Thomas Bayes,^[Bayes, T., 1763. LII. An
essay towards solving a problem in the doctrine of chances. By the
late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to
John Canton, AMFRS. *Philosophical Transactions of the Royal Society*,
pp. 370--418.] Laplace also treated $\Theta$ as a random variable.
That is, Laplace wanted to infer, based on observation and
measurement, the probability that $\Theta$'s value lies in a certain
range.  Specifically, Laplace was curious whether the male birth rate
is higher than the female birth rate, which can be expressed in
probabilistic terms as the event probability $\mbox{Pr}[\Theta >
0.5]$.
## Laplace's inverse problem
Given a total of $N$ births, we have introduced random variables for
* the observed data of $Y$ male births, and
* the probability $\Theta$ that a live birth will result in a boy.
We also have the actual observed number of male births, $y$. That is,
we know the value of the random variable $Y$. Given our observed
data, we can ask two obvious questions, namely
* What is the probability of a boy being born?
* Is it more likely that a boy is born than a girl?
Given that $\Theta$ is the male birth rate, the first question is
asking about the value of $\Theta$. To provide a probabilistic
answer, we want to look at the distribution of $\Theta$ given that we
observe the actual data $Y = y$, which has the density $p_{\Theta \mid
Y}(\theta \mid y)$. We can summarize this distribution
probabilistically using intervals, for instance by reporting a central
95% interval $(l, u)$, where the bounds $l$ and $u$ are the 2.5% and
97.5% quantiles of the posterior, so that
$$
\mbox{Pr}\left[ l \leq \Theta \leq u
\ \Big| \
Y = y
\right]
\ = \ 0.95.
$$
The second question, namely whether boys are more likely to be born,
is true if $\Theta > \frac{1}{2}$. The probability of this event is
$$
\mbox{Pr}\left[ \Theta > \frac{1}{2}
\ \Bigg| \
Y = y
\right].
$$
If we can estimate this event probability, we can answer Laplace's
second question.^[The quality of the answer will be determined by the
quality of the data and the quality of the model.]
## Bayes's rule to solve the inverse problem
The model we have is a *generative model*^[Also known as a *forward
model* or a *mechanistic model* by scientists.]---it works from a
parameter value $\theta$ to the observed data $y$ through a *sampling
distribution* with probability function $p_{Y \mid \Theta}(y \mid
\theta).$ What we need to solve our inference problems is the
*posterior density* $p_{\Theta \mid Y}(\theta \mid y)$. Bayes
realized that the posterior could be defined in terms of the sampling
distribution as
$$
\begin{array}{rcl}
p_{\Theta \mid Y}(\theta \mid y)
& = &
\frac{\displaystyle
p_{Y \mid \Theta}(y \mid \theta)
\times
p_{\Theta}(\theta)}
{\displaystyle
p_Y(y)}
\\[6pt]
& \propto &
p_{Y \mid \Theta}(y \mid \theta)
\times
p_{\Theta}(\theta).
\end{array}
$$
All of our sampling algorithms will work with densities known only up
to a proportion.
## The prior distribution
This still leaves the not inconsequential matter of how to determine
$p_{\Theta}(\theta)$, the density of the so-called *prior
distribution* of $\Theta$. The prior distribution encapsulates what
we know about the parameters $\Theta$ before observing the actual data
$y$. This prior knowledge may be derived in many different ways.
* We may have prior knowledge from physical constraints. For example,
if we are modeling the concentration of a molecule in a solution,
the concentration must be positive.
* We may have prior knowledge of the basic scale of the answer from
prior scientific knowledge. For example, if we are modeling human
growth, we know that heights above two meters are rare and heights
above three meters are unheard of.
* We may have prior knowledge from directly related prior experiments.
For example, if we are doing a Phase II drug trial, we will have
data from the Phase I trial, or we may have data from Europe if we
are doing a trial in the United States.
* We may have prior knowledge from indirectly related experiments.  For
example, if we are modeling football player abilities, we have years
and years of data from prior seasons. We know nobody is going to average 10 points
a game---it's just not done.
Because we are working probabilistically, our prior knowledge will
itself be modeled with a probability distribution, say with density
$p_{\Theta}(\theta)$. The prior distribution may depend on
parameters, which may be constants or may themselves be unknown. This
may seem like an awfully strong imposition to have to express prior
knowledge as a density. If we can express our knowledge well and
sharply in a distribution, we will have an *informative prior.*
Luckily, because we are only building approximate models of reality,
the prior knowledge model does not need to be perfect. We usually err
on the side of underpowering the prior a bit compared to what we
really know, imposing only *weakly informative priors*, such as those
that determine scales, but not exact boundaries of parameters.^[The
notion of a truly *uninformative prior* is much trickier, because to be
truly uninformative, a prior must be scale free.]
We will have a lot to say about prior knowledge later in the book, but
for now we can follow Laplace in adopting a uniform prior for the rate
of male births,
$$
\Theta \sim \mbox{uniform}(0, 1).
$$
In other words, we assume the prior density is given by
$$
p_{\Theta}(\theta) = \mbox{uniform}(\theta \mid 0, 1).
$$
Here, the bounds zero and one, expressed as constant parameters of the
uniform distribution, are logical constraints imposed by the fact that
the random variable $\Theta$ denotes a probability.
Other than the logical bounds, this uniform prior distribution says
that a value in the range 0.01 to 0.05 is as likely as one in the range
0.48 to 0.52.  This is a very weak prior indeed compared to what we know
about births. Nevertheless, it will suffice for this first analysis.
## The proportional posterior
With a prior and likelihood,^[Remember, the likelihood is just the
sampling distribution $p_{Y \mid \Theta}(y \mid \theta)$ viewed as a
function of $\theta$ for fixed $y$.] we have our full joint model in
hand,
$$
p_{Y \mid \Theta}(y \mid N, \theta) \times p_{\Theta}(\theta)
\ = \
\mbox{binomial}(y \mid N, \theta)
\times \mbox{uniform}(\theta \mid 0, 1).
$$
We have carried along the constant $N$ so we don't forget it, but it
simply appears on the right of the conditioning bar on both sides of
the equation.
The sampling distribution $p(y \mid \theta)$ is considered as a
density for $y$ given a value of $\theta$. If we instead fix $y$ and
view $p(y \mid \theta)$ as a function of $\theta$, it is called the
*likelihood function*. As a function, the likelihood function is not
itself a density. Nevertheless, it is crucial in posterior inference.
With Bayes's rule, we know the posterior is proportional to the prior
times the likelihood,
$$
\underbrace{p_{\Theta \mid Y}(\theta \mid y)}_{\text{posterior}}
\ \propto \
\underbrace{p_{Y \mid \Theta}(y \mid \theta)}_{\text{likelihood}}
\ \times \
\underbrace{p_{\Theta}(\theta)}_{\text{prior}}.
$$
Given the definitions of the relevant
probability functions,^[
For reference, these are the likelihood
$$
\mbox{binomial}(y \mid N, \theta)
\ \propto \
\theta^y \times (1 - \theta)^{N - y}
$$
and the prior
$$
\mbox{uniform}(\theta \mid 0, 1)
\ = \
1.
$$
]
we have
$$
\begin{array}{rcl}
p_{\Theta \mid Y}(\theta \mid y, N)
& \propto &
\mbox{binomial}(y \mid N, \theta)
\times \mbox{uniform}(\theta \mid 0, 1)
\\[4pt]
& \propto &
\theta^y \times (1 - \theta)^{N - y}
\end{array}
$$
To summarize, we know the posterior $p_{\Theta \mid Y}$ up to a
proportion, but are still missing the normalizing constant so that it
integrates to one.^[We return to the normalizer later when we discuss
the beta distribution.]
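To see concretely what *up to a proportion* means, here is a minimal
sketch that evaluates $\theta^y \times (1 - \theta)^{N - y}$ on a grid
and normalizes it numerically, using for illustration the toy values
$y = 3$ and $N = 10$ that reappear in the simulations below.
```
# unnormalized posterior on a grid of theta values
N <- 10
y <- 3
theta_grid <- seq(0.001, 0.999, by = 0.001)
post_unnorm <- theta_grid^y * (1 - theta_grid)^(N - y)
# approximate the normalizing constant by numerical integration
norm_const <- sum(post_unnorm) * 0.001
post <- post_unnorm / norm_const
sum(post) * 0.001    # check: approximately 1
```
Dividing by the constant rescales the curve but leaves its shape, and
hence every ratio of posterior densities, unchanged.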
## Sampling from the posterior
Now that we have a formula for the posterior up to a proportion, we
are in business for sampling from the posterior. All of the sampling
algorithms in common use require the density only up to a proportion.
For now, we will simply assume a method exists to draw a sample
$\theta^{(1)}, \cdots, \theta^{(M)}$ where each $\theta^{(m)}$ is
drawn from the posterior $p_{\Theta \mid Y}(\theta \mid y)$ given the
observed data $y$.
When we do begin to employ general samplers, they are going to require
specifications of our models that are exact enough to be programmed.
Rather than relying on narrative explanation, we'll use a pseudocode
for models that can be easily translated for an assortment of
posterior samplers.^[This specification is sufficient for coding a
sampler in BUGS, Edward, emcee, Greta, JAGS, NIMBLE, PyMC, Pyro, or
Stan.]
$$
\begin{array}{r|lr}
\mbox{Name} & \mbox{simple binomial}
\\ \hline
\mbox{Data}
& N \in \mathbb{N}
\\
& y \in 0:N
\\ \hline
\mbox{Parameters}
& \theta \in (0, 1)
\\ \hline
\mbox{Prior}
& \theta \sim \mbox{uniform}(0, 1)
\\ \hline
\mbox{Likelihood}
& y \sim \mbox{binomial}(N, \theta)
\end{array}
$$
## Simulating data
Rather than starting with Laplace's data, which will present
computational problems, we will start with some simulated data. We
simulate data for a model by simulating the parameters from the prior,
then simulating the data from the parameters. That is, we run the
model in the forward direction from prior to parameters to data. This
is usually how we construct the models in the first place, so this
should be a natural step. In pseudocode, this is a two-liner.
```
theta = uniform_rng(0, 1)
y = binomial_rng(N, theta)
print 'theta = ' theta '; y = ' y
```
Before we can actually simulate, we need to set the constants, because
they don't have priors. Here, we'll just take $N = 10$ for
pedagogical convenience. Let's run it a few times and see what we
get.
```{r}
set.seed(123)
N = 10
for (m in 1:5) {
theta = runif(1, 0, 1)
y = rbinom(1, N, theta)
printf('theta = %3.2f; y = %d\n', theta, y)
}
```
The values simulated for $\theta$ are not round numbers, so we know
that we won't satisfy $y = N \times \theta$, the expected value of a
random variable $Y$ such that $Y \sim \mbox{binomial}(N, \theta)$.
From an estimation perspective, we won't have $\theta = y / N$,
either. So the question becomes: what are reasonable values for
$\theta$ based on our observation of $y$? That's precisely the
posterior, so let's proceed to sampling from that. We'll just assume
we have a function that samples from the posterior of a model with a
given name when passed the data for the model. Here, the data
consists of the values of $y$ and $N$, and we will run $M = 1\,000$
iterations.
```
N = 10
y = 3
theta[1:M] = posterior_sample('simple binomial', y, N)
print 'theta = ' theta[1:10] '...'
```
Let's run that and see what a few posterior draws look like.  For this
particular model, the posterior has a known closed form---the beta
distribution we return to below when discussing Laplace's data---so we
can draw from it directly with `rbeta`.
```{r}
M <- 1000
N <- 10
y <- 3
# posterior under a uniform prior and binomial likelihood: beta(y + 1, N - y + 1)
theta <- rbeta(M, y + 1, N - y + 1)
printf('theta = ')
for (n in 1:10) printf('%3.2f ', theta[n])
```
It's hard to glean much from the draws.  What they do tell us is that
the posterior is in the range we expect it to be in---near 0.3, because
the data was $y = 3$ boys in $N = 10$ births. The first thing we want
to do with any posterior is check that it's reasonable.
For visualizing draws of a single variable, such as the proportion of
boy births $\theta$, histograms are handy.
```{r fig.cap = 'Histogram of one thousand draws from the posterior $p(\\theta \\mid y)$. With thirty bins, the histogram appears ragged, but conveys the rough shape and location of the posterior.'}
binom_post_df <- data.frame(theta = theta)
binomial_post_plot <-
ggplot(binom_post_df, aes(x = theta)) +
geom_histogram(color = 'black', fill = '#ffffe6', size = 0.25) +
scale_x_continuous(lim = c(0, 1),
breaks = c(0, 0.25, 0.5, 0.75, 1)) +
xlab(expression(theta)) +
ylab("posterior draw proportion") +
ggtheme_tufte() +
theme(axis.text.y = element_blank())
binomial_post_plot
```
Let's up $M$ to $1\,000\,000$ and double the number of bins to get a
better look at the posterior density. ^[A sample size $M > 100$ is
rarely necessary for calculating estimates, event probabilities, or
other expectations conditioned on data. For histograms, many draws are
required to ensure low relative error in every bin so that the
resulting histogram is smooth.]
```{r fig.cap = 'Histogram of one million draws from the posterior $p(\\theta \\mid y)$. A *much* larger $M$ is required to get a fine-grained view of the whole posterior distribution than is required for an accurate summary statistic.'}
set.seed(1234)
M <- 1e6
theta <- rbeta(M, y + 1, N - y + 1)
binom_post_df2 <- data.frame(theta = theta)
binomial_post_plot2 <-
ggplot(binom_post_df2, aes(x = theta)) +
geom_histogram(color = 'black', fill = '#ffffe6',
size = 0.25, bins = 60) +
scale_x_continuous(lim = c(0, 1),
breaks = c(0, 0.25, 0.5, 0.75, 1)) +
xlab(expression(theta)) +
ylab("posterior draw proportion") +
ggtheme_tufte() +
theme(axis.text.y = element_blank())
binomial_post_plot2
```
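The footnote's claim about sample size is easy to check: an estimate of
the posterior mean based on only 100 draws is already accurate to
within a few hundredths, while the smooth histogram above required a
million draws.  A minimal sketch, assuming the same beta-form posterior
used in the surrounding chunks:
```
# posterior mean of theta for y = 3, N = 10, from few vs. many draws
set.seed(99)
mean(rbeta(100, 3 + 1, 10 - 3 + 1))   # M = 100: close to the exact value 1/3
mean(rbeta(1e6, 3 + 1, 10 - 3 + 1))   # M = 1,000,000: very close to 1/3
```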
Histograms have their limitations. The distribution is slightly
asymmetric, with a longer tail to the right than to the left, but
asymmetry can be difficult to detect visually until it is more extreme
than here. Asymmetric distributions are said to be *skewed*, either to
the right or left, depending on which tail is longer.^[The formal
measurement of the *skew* of a random variable $Y$ is just another
expectation that may be estimated via simulation, $$\mbox{skew}[Y] =
\mathbb{E}\left[\left(\frac{Y -
\mathbb{E}[Y]}{\mbox{sd}[Y]}\right)^3\right].$$] It's also hard to
tell the exact location of the posterior mean and median visually.
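The skew defined in the footnote is itself just a posterior
expectation, so it can be estimated from the same draws; a minimal
sketch, assuming the `theta` draws from the previous chunk are still in
scope:
```
# estimate skew[Theta | Y = y] as the third moment of the standardized draws
z <- (theta - mean(theta)) / sd(theta)
mean(z^3)   # positive, indicating a longer right tail
```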
## Posterior summary statistics
We often want to look at summaries of the posterior, the posterior
mean, standard deviation, and quantiles being the most commonly used
in practice. These are all easily calculated based on the sample
draws.
Calculating the posterior mean and standard deviation is as
simple as calling built-in mean and standard deviation functions,
```
print 'estimated posterior mean = ' mean(theta)
print 'estimated posterior sd = ' sd(theta)
```
Let's see what we get.
```{r}
printf('estimated posterior mean = %3.2f\n', mean(theta))
printf('estimated posterior sd = %3.2f\n', sd(theta))
```
The posterior mean and standard deviation are excellent marginal
summary statistics for posterior quantities that have a roughly normal
distribution.^[Most posterior distributions we will consider approach
normality as more data is observed.] If the posterior distribution
has very broad or narrow tails or is highly skewed, standard deviation
and mean are less useful.
We can estimate quantiles just as easily, assuming we have built-in
functions to compute quantiles.
```
print 'estimated posterior median = ' quantile(theta, 0.5)
print 'estimated posterior central 90 pct interval = '
quantiles(theta, { 0.05, 0.95 })
```
Running this produces the following.^[The median is slightly lower
than the mean, as it will be in right-skewed distributions.]
```{r}
printf('estimated posterior median = %3.2f\n',
quantile(theta, 0.5))
printf('estimated posterior central 90 pct interval = (%3.2f, %3.2f)\n',
quantile(theta, 0.05), quantile(theta, 0.95))
```
The posterior simulations and summaries answer Laplace's question
about the value of $\theta$, i.e., the proportion of boys born, at
least relative to this tiny data set.
We have reported a central 90% interval here. It is a 90% interval in
the sense that it is 90% probable to contain the value (relative to
the model, as always). We have located that interval centrally in the
sense that it runs from the 5% quantile to the 95% quantile.
There is nothing privileged about the width or location of a posterior
interval. A value is as likely to be in a posterior interval from the
1% quantile to the 91% quantile, or from the 10% quantile to the 100%
quantile. The width is chosen to be convenient to reason about. With
a 90% interval, we know roughly nine out of ten values will fall
within it, and choosing a central interval gives us an idea of the
central part of the distribution.
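As a quick check of this claim, here is a minimal sketch, assuming the
`theta` draws from above are still in scope, comparing the coverage of
the central 90% interval with the off-center interval running from the
1% to the 91% quantile:
```
# two different 90% posterior intervals
central <- quantile(theta, c(0.05, 0.95))   # 5% to 95% quantiles
shifted <- quantile(theta, c(0.01, 0.91))   # 1% to 91% quantiles
mean(theta >= central[1] & theta <= central[2])   # roughly 0.90
mean(theta >= shifted[1] & theta <= shifted[2])   # also roughly 0.90
```
Both contain 90% of the draws; the central one is just easier to
reason about.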
## Estimating event probabilities
To answer the question about whether boys are more prevalent than
girls, we need to estimate $\mbox{Pr}[\theta > 0.5]$, which is
straightforward with simulation. As usual, we just count the number of
times that the simulated value $\theta^{(m)} > 0.5$ and divide by the
number of simulations $M$,
```
print 'estimated Pr[theta > 0.5] = ' sum(theta > 0.5) / M
```
Running this, we see that with 3 boys in 10 births, the probability
boys represent more than 50% of the live births is estimated, relative
to the model, to be
```{r}
printf('estimated Pr[theta > 0.5] = %3.2f\n', sum(theta > 0.5) / M)
```
Now let's overlay the median and central 90% interval.
```{r fig.cap = "Histogram of $1\\,000\\,000$ draws from the posterior $$p(\\theta \\mid y, N) \\propto \\mbox{binomial}(y \\mid N, \\theta),$$ given $N = 10, y = 3$. The median (50% quantile) is indicated with a dashed line and the boundaries of the central 90% interval (5% and 95% quantiles) are picked out with dotted lines. The proportion of the total area shaded to the right of 0.5 represents the posterior probability that $\\theta > 0.5,$ which is about 11%."}
q05 <- quantile(binom_post_df2$theta, 0.05)
q50 <- quantile(binom_post_df2$theta, 0.50)
q95 <- quantile(binom_post_df2$theta, 0.95)
binomial_post_plot3 <-
ggplot(binom_post_df2, aes(x = theta)) +
geom_histogram(color = 'black', fill = '#ffffe6',
size = 0.25, binwidth = 0.015, boundary = 0.5) +
geom_histogram(data = subset(binom_post_df2, theta > 0.5),
color = 'black', size = 0.25,
fill = '#ccccb6', binwidth = 0.015, boundary = 0.5) +
geom_vline(xintercept = q50,
linetype = "dashed", size = 0.5) +
geom_vline(xintercept = q05,
linetype = "dotted", size = 0.5) +
geom_vline(xintercept = q95,
linetype = "dotted", size = 0.5) +
annotate("text", x = q05, y = 50000, label = "5%",
family = "Palatino") +
annotate("text", x = q50, y = 50000, label = "50%",
family = "Palatino") +
annotate("text", x = q95, y = 50000, label = "95%",
family = "Palatino") +
annotate("text", x = q95, y = 50000, label = "95%",
family = "Palatino") +
annotate("text", x = 0.75, y = 50000, label = "quantiles",
family = "Palatino") +
scale_x_continuous(lim = c(0, 1),
breaks = c(0, 0.25, 0.5, 0.75, 1)) +
xlab(expression(theta)) +
ylab("posterior draw proportion") +
ggtheme_tufte() +
theme(axis.text.y = element_blank())
binomial_post_plot3
```
## Laplace's data
What happens if, rather than our small simulated data set, we use
Laplace's data, which had roughly 110 thousand male births and 105
thousand female births?
Let's take some draws from the posterior $p(\theta \mid y, N)$ where
$y = 110\,312$ boys out of $N = 110\,312 + 105\,287$ total births.
We'll take $M = 1\,000\,000$ simulations $\theta^{(1)}, \ldots,
\theta^{(M)}$ here because they are cheap and we would like low
sampling error.
```{r}
set.seed(1234)
M <- 1000000
boys <- 110312
girls <- 105287
theta <- rbeta(M, boys + 1, girls + 1)
laplace_df <- data.frame(theta = theta)
laplace_plot <-
ggplot(laplace_df, aes(x = theta)) +
geom_histogram(bins = 50,
color = 'black', size = 0.25, fill = '#ccccb6') +
scale_x_continuous(lim = c(0.5075, 0.5175),
breaks = c(0.508, 0.510, 0.512, 0.514, 0.516),
labels = c(".508", ".510", ".512", ".514", ".516")) +
xlab(expression(theta)) +
ylab("posterior draws") +
ggtheme_tufte()
laplace_plot
```
The mean of the posterior sample is approximately 0.511, or slightly
more than a 51% chance of a male birth. The central 90% posterior
interval calculated from quantiles of the sample is $(0.510, 0.513)$.
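These summaries can be read off the draws from the chunk above; a
minimal sketch, assuming the `theta` vector and the `printf` helper
used throughout are in scope:
```
printf('estimated posterior mean = %5.4f\n', mean(theta))
printf('estimated central 90 pct interval = (%5.4f, %5.4f)\n',
       quantile(theta, 0.05), quantile(theta, 0.95))
```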
What about the event probability that boy births are more likely than
girl births, i.e., $\mbox{Pr}[\theta > 0.5]$? If we make our usual
calculation, taking draws $\theta^{(1)}, \ldots, \theta^{(M)}$ from
the posterior and look at the proportion for which $\theta^{(m)} >
0.5$, the result is 1. No decimal places, just 1. If we look at the
draws, the minimum value of $\theta^{(m)}$ in $1\,000\,000$ draws was
approximately 0.506. The proportion of draws for which $\theta^{(m)}
> 0.5$ is thus 100%, which forms our estimate for $\mbox{Pr}[\theta >
0.5]$.
As we have seen before, simulation-based estimates provide
probabilistic guarantees about absolute tolerances. With $1\,000\,000$
draws, we are sure that the answer is 1.0000 to within plus or minus
0.0001 or less.^[Tolerances can be calculated using the central limit
theorem, which we will define properly when we introduce the normal
distribution later.] We know the answer must be strictly less
than one. Using some analytic techniques,^[Specifically, the cumulative
distribution function of the posterior, which is known to be the beta
distribution $$p(\theta \mid y, N) = \mbox{beta}(\theta \mid y + 1, N
- y + 1).$$] the true value to within 27 decimal places is
$$
\mbox{Pr}[\theta > 0.5] = 1 - 10^{-27}.
$$
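As a quick numerical check, a minimal sketch in R, assuming the
beta-form posterior given in the footnote above:
```
# Pr[Theta <= 0.5] under the beta(y + 1, N - y + 1) posterior
boys <- 110312
girls <- 105287
pbeta(0.5, boys + 1, girls + 1)   # on the order of 1e-27
```
Subtracting the result from one recovers the value quoted above.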
Thus Laplace was certain that the probability of a boy being born was
higher than that of a girl being born.
## Inference for and comparison of multiple variables
The first example of Laplace's is simple in that it has only a single
parameter of interest, $\theta$, the probability of a male birth. Now
we will consider a very similar model with two variables, so that we
can do some posterior comparisons. We will consider some simple
review data for two New York City-based Mexican restaurants.  The
first contender, Downtown Bakery II, an East Village Mexican
restaurant, has $Y_1 = 114$ 5-star reviews out of $N_1 = 235$ reviews
on Yelp; the second, La Delicias Mexicanas, in Spanish Harlem, has
$Y_2 = 24$ 5-star reviews out of $N_2 = 51$.  Our question is, which
is more likely to deliver a 5-star experience?  In terms of the
proportion of 5-star reviews, they are close, with Downtown Bakery
garnering 49% 5-star reviews and La Delicias only 47%. Knowing how
noisy binomial data is, this is too close to call.
We'll model each restaurant independently for $n \in 1:2$ as
$$
Y_n \sim \mbox{binomial}(N_n, \Theta_n)
$$
with independent uniform priors for $n \in 1:2$ as
$$
\Theta_n \sim \mbox{uniform}(0, 1).
$$
We can now draw $\theta^{(1)}, \ldots, \theta^{(M)}$ simulations from
the posterior $p_{\Theta \mid Y, N}(\theta \mid y, N)$ as usual.
The main event is whether $\theta_1 > \theta_2$---we want to know if
the probability of getting a five-star review is higher at Downtown
Bakery than La Delicias. All we need to do is look at the posterior
mean of the indicator function $\mathrm{I}[\theta_1 > \theta_2]$. The
calculus gets more complicated---a double integral is now required
because there are two variables. The simulation-based estimate, on
the other hand, proceeds as before, counting the proportion of draws in
which the event is simulated to occur.
$$
\begin{array}{rcl}
\mbox{Pr}[\theta_1 > \theta_2 \mid y, N]
& = &
\int_0^1 \int_0^1
\, \mathrm{I}[\theta_1 > \theta_2] \times p(\theta_1, \theta_2 \mid y, N)
\, \mathrm{d} \theta_1 \, \mathrm{d} \theta_2
\\[8pt]
& \approx &
\frac{1}{M} \sum_{m = 1}^M \mathrm{I}[\theta_1^{(m)} >
\theta_2^{(m)}].
\end{array}
$$
In pseudocode, this is just
```
success = 0
for (m in 1:M)
draw theta(m) from posterior p(theta | y, N)
if (theta(m)[1] > theta(m)[2])
success += 1
print 'Pr[theta[1] > theta[2] | y, N] = ' success / M
```
Let's run that with $M = 10\,000$ simulations and see what we get:
```{r}
M <- 10000
y <- c(114, 24)
N <- c(235, 51)
theta1 <- rbeta(M, y[1] + 1, N[1] - y[1] + 1)
theta2 <- rbeta(M, y[2] + 1, N[2] - y[2] + 1)
printf('Pr[theta[1] > theta[2] | y, N] = %3.2f\n',
sum(theta1 > theta2) / M)
```
Only about a `r as.integer(100 * sum(theta1 > theta2) / M)`% chance
that Downtown Bakery is the better bet.^[As much as this diner loves
Downtown Bakery, the nod for food, ambience, and the existence of beer
goes to La Delicias Mexicanas.]
To get a sense of the posterior, we can construct a histogram of
posterior draws of $\Delta = \Theta_1 - \Theta_2$.
```{r fig.cap = 'Histogram of posterior differences between the probability of Downtown Bakery getting a 5-star review ($\\theta_1$) and that of La Delicias Mexicanas getting one ($\\theta_2$). The draws for which $\\delta > 0$ (equivalently, $\\theta_1 > \\theta_2$) are shaded darker. The area of the darker region divided by the total area is the estimate of the probability that Downtown Bakery is more likely to get a 5-star review than La Delicias Mexicanas.'}
delta <- theta1 - theta2
delicias_df <- data.frame(delta = delta)
delicias_plot <-
ggplot(delicias_df, aes(x = delta)) +
geom_histogram(bins = 30, boundary = 0, binwidth = 0.01,
color = 'black', size = 0.25, fill = '#ffffe6') +
geom_histogram(data = subset(delicias_df, delta > 0),
binwidth = 0.01, boundary = 0,
color = 'black', size = 0.25, fill = '#ccccb6') +
xlab(expression(paste(delta, " = ", theta[1] - theta[2]))) +
ylab("posterior draws") +
ggtheme_tufte()
delicias_plot
```
There is substantial uncertainty, and only about 57% of the draws lie
to the right of zero. That is,
$$\mbox{Pr}[\theta_1 > \theta_2]
\ = \ \mbox{Pr}[\delta > 0]
\ \approx \ 0.57.$$