# Normal Distribution
## Limit of binomials
In 1733, Abraham de Moivre noticed that as the number of trials in a
binomial distribution grew, the resulting distribution became nearly
symmetric and bell-shaped.^[De Moivre, A., 1733. *Approximatio ad
summam terminorum binomii*.] With 10 trials and a probability of
success of 0.9 in each trial, the distribution is clearly asymmetric
(i.e., skewed).
```{r fig.cap = 'Probability of number of successes in $N = 10$ independent trials, each with a 90 percent chance of success. The result is $\\mbox{binomial}(y \\mid 10, 0.9)$ by construction. With only ten trials, the distribution is highly asymmetric, with skew (longer tails) to the left.'}
N <- 10
x <- 0:10
y <- dbinom(x, N, 0.9)
binomial_limit_plot <-
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_bar(stat = "identity", color = 'black', fill = '#ffffe6', size = 0.2) +
xlab('y') +
scale_x_continuous(breaks = c(0, 2, 4, 6, 8, 10)) +
ylab('binomial(y | 10, 0.9)') +
ggtheme_tufte()
binomial_limit_plot
```
But when we increase the number of trials by a factor of 100 to
$N = 1\,000$ without changing the 0.9 probability of success, the result is a nearly symmetric bell shape.
```{r fig.cap = 'Probability of number of successes in $N = 1000$ independent trials, each with a 90 percent chance of success. The familiar bell-shaped curve arises as $N$ grows.'}
x <- 860:940
y <- dbinom(x, 1000, 0.9)
binomial_limit_plot <-
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_bar(stat = "identity", color = 'black', fill = '#ffffe6', size = 0.1) +
xlab('y') +
ylab('binomial(y | 1000, 0.9)') +
ggtheme_tufte()
binomial_limit_plot
```
De Moivre found that the normal density function (to be developed
below), despite coming from a continuous distribution, provides a
tight approximation to the binomial probability mass function as $N$
becomes large.^[This is because the interval between integers is 1, so
that
$$\begin{array}{rcl}
1 & = & \int_{-\infty}^{\infty} p(x) \mathrm{d}x
\\[2pt]
& \approx & \sum_{x = -\infty}^{\infty} p(x).
\end{array}
$$]
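As a quick numerical check of de Moivre's observation, we can compare
the binomial probability mass function with the normal density whose
mean is $N \times \theta$ and whose standard deviation is
$\sqrt{N \times \theta \times (1 - \theta)}$, the binomial's moments,
which we take as given here.
```{r}
# Compare the binomial pmf to the normal density with matching moments.
N_trials <- 1000
theta <- 0.9
x_vals <- 860:940
binom_pmf <- dbinom(x_vals, N_trials, theta)
normal_approx <- dnorm(x_vals,
                       mean = N_trials * theta,
                       sd = sqrt(N_trials * theta * (1 - theta)))
max(abs(binom_pmf - normal_approx))  # worst-case absolute error is tiny
```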
## The normal motivated by least squares
In 1809, Carl Friedrich Gauss published his treatise on the motion of
planets, in which he derived the normal distribution from the method
of least squares.^[Gauss, Carolo Friderico, 1809. *Theoria Motus
Corporum Coelestium in sectionibus conicis solem ambientium.* Sumtibus
Frid. Perthes et IH Besser, Hamburgi. English translation: Theory of
Motion of the Celestial Bodies Moving in Conic Sections Around the
Sun.] Gauss was faced with a sequence of noisy measurements $y_1,
\ldots, y_N$ from some distribution and wanted to combine them by
taking their average,
$$
\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n.
$$
Gauss realized that the average $\bar{y}$ minimizes the sum of
squared differences from the observed values,^[The expression
$\mbox{arg min}_y \, f(y)$ returns the value of $y$ that minimizes
$f(y)$, e.g., $$\mbox{arg min}_y \ (y - 3)^2 = 3$$.]
$$
\bar{y} = \mbox{arg min}_y \, \sum_{n=1}^N \left( y_n - y \right)^2.
$$
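This claim is easy to check numerically. The following is a minimal
sketch with made-up measurements, minimizing the sum of squared
differences with R's built-in `optimize()` and comparing the result to
the sample mean.
```{r}
set.seed(1234)
y_obs <- rnorm(20, mean = 3, sd = 2)        # made-up measurements
sum_sq <- function(b) sum((y_obs - b)^2)    # sum of squared differences
opt <- optimize(sum_sq, interval = range(y_obs))
c(minimizer = opt$minimum, mean = mean(y_obs))  # agree up to tolerance
```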
Gauss worked backward to the normal distribution, reasoning that for
the average to be a good estimator, the distribution of the errors
$y_n - y$ must have this same quadratic property. Working on the log
scale, Gauss was looking for a distribution whose density function
would have roughly the property that
$$
\log p(y) = -y^2 + \mbox{const.}
$$
We have to work on the log scale here in order for the resulting
density function to be normalizable.^[Normalizability requires
$$
\begin{array}{rcl}
\int_{-\infty}^{\infty} p(y) \, \mathrm{d}y
& = &
\int_{-\infty}^{\infty}
\exp(\mbox{const}) \times \exp(-y^2) \, \mathrm{d}y
\\[4pt]
& = &
\exp(\mbox{const})
\times
\int_{-\infty}^{\infty} \exp(-y^2) \, \mathrm{d}y
\end{array}
$$
to be finite.]
In order for the standard deviation and variance to work out to unity,
it is convenient to work with half the squared error. Therefore, the
standard normal distribution is defined on the log scale up to an
additive constant that doesn't depend on $y$ by
$$
\log \mbox{normal}(y) = -\frac{1}{2} y^2 + \mbox{const.}
$$
This ensures that if $Z \sim \mbox{normal}()$, then $\mathbb{E}[Z] =
0$ and $\mbox{var}[Z] = \mbox{sd}[Z] = 1$.
Converting back to the linear scale gives us the kernel of the normal
distribution,
$$
\mbox{normal}(y) \propto \exp \left( -\frac{1}{2} y^2 \right).
$$
The *kernel* of a probability function defines it as a function of
parameters and variate outcome up to a proportion. Most of the algebra
in statistics only requires probability functions up to a proportion,
so it's usually simpler to drop the normalizing constants.^[It's also
faster computationally to drop normalizing constants, which often
involve complicated and expensive special-function evaluations.]
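To see the kernel idea concretely, here is a small check, using R's
built-in `dnorm()`, that the standard normal log density differs from
$-\frac{1}{2} y^2$ only by a constant that does not depend on $y$.
```{r}
z_vals <- c(-2, -0.5, 0, 1, 3)
dnorm(z_vals, log = TRUE) - (-0.5 * z_vals^2)  # constant for every z
log(1 / sqrt(2 * pi))                          # the constant itself
```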
Including the normalizing constants, the normal distribution is^[The
constant factor is defined by $$ \int_{-\infty}^{\infty} \exp\left(
-\frac{1}{2} y^2 \right) \mathrm{d}y \ = \ \sqrt{2 \pi}. $$]
$$
\mbox{normal}(y)
\ = \
\frac{1}{\sqrt{2 \pi}} \exp \left( -\frac{1}{2} y^2 \right).
$$
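The value of the normalizing constant can be checked numerically with
R's built-in `integrate()`.
```{r}
# Numerically integrate the unnormalized kernel over the real line.
integrate(function(y) exp(-0.5 * y^2), lower = -Inf, upper = Inf)$value
sqrt(2 * pi)
```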
Now we have a distribution where the value $y$ that maximizes the
joint density of the independent error terms $y - y_n$ is the one that
minimizes squared error,
$$
\begin{array}{rcl}
\mbox{arg max}_y \prod_{n=1}^N \mbox{normal}(y - y_n)
& = &
\mbox{arg max}_y \log \prod_{n=1}^N \mbox{normal}(y - y_n)
\\[6pt]
& = & \mbox{arg max}_y \sum_{n=1}^N \log \mbox{normal}(y - y_n)
\\[6pt]
& = & \mbox{arg max}_y \sum_{n=1}^N -\frac{1}{2} \left( y - y_n \right)^2
\\[6pt]
& = & \mbox{arg max}_y -\frac{1}{2} \sum_{n=1}^N \left( y - y_n \right)^2
\\[6pt]
& = & \mbox{arg min}_y \frac{1}{2} \sum_{n=1}^N \left( y - y_n \right)^2
\\[6pt]
& = & \mbox{arg min}_y \sum_{n=1}^N \left( y - y_n \right)^2
\\[6pt]
& = & \bar{y}.
\end{array}
$$
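The same conclusion can be verified numerically. The following sketch,
again with made-up measurements, maximizes the sum of standard normal
log densities of the errors and recovers the sample mean.
```{r}
set.seed(1234)
y_obs <- rnorm(25, mean = 10, sd = 3)               # made-up measurements
log_lik <- function(y) sum(dnorm(y - y_obs, log = TRUE))
fit <- optimize(log_lik, interval = range(y_obs), maximum = TRUE)
c(maximizer = fit$maximum, mean = mean(y_obs))      # agree up to tolerance
```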
## Adding location and scale parameters
The standard normal distribution has an expectation of zero and a
standard deviation of one. We can add a scale parameter $\sigma > 0$ to multiply
the standard deviation and a location parameter $\mu$ so that if $Y
\sim \mbox{normal}(\mu, \sigma)$, then $\mathbb{E}[Y] = \mu$ and
$\mbox{sd}[Y] = \sigma$.
Starting with a standard normal variate,
$$
Z \sim \mbox{normal}(),
$$
we can scale it by $\sigma$ and shift by $\mu$ to get a new variable
$$
Y = \mu + \sigma \times Z,
$$
for which
$$
Y \sim \mbox{normal}(\mu, \sigma).
$$
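A quick simulation illustrates the scale-and-shift construction; the
sample mean and standard deviation of the transformed draws land near
$\mu$ and $\sigma$.
```{r}
set.seed(1234)
mu <- 5
sigma <- 2
z_draws <- rnorm(1e5)               # standard normal draws
y_draws <- mu + sigma * z_draws     # scale by sigma, shift by mu
c(mean = mean(y_draws), sd = sd(y_draws))  # close to mu = 5 and sigma = 2
```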
Dealing with the required change of variables for the inverse
transform^[This inverse transform has its own name, the *z-transform*,
and in general may be written as $$Z = \frac{Y -
\mathbb{E}[Y]}{\mbox{sd}[Y]}$$ and used to standardize any random
variable $Y$ with a finite expectation and variance. The resulting
variable $Z$ has $\mathbb{E}[Z] = 0$ and $\mbox{sd}[Z] = 1$ by
construction.]
$$
Z
\ = \
\frac{Y - \mu}{\sigma},
$$
lets us derive the general normal density function for location
parameter $\mu$ and scale parameter $\sigma > 0$ as^[It's a standard
Jacobian calculation for the transform
$$
Y = f(Z) = \mu + \sigma \times Z,
$$
$$
Z = f^{-1}(Y) = \frac{Y - \mu}{\sigma},
$$
and
$$
p_Z(z) = \mbox{normal}(),
$$
from which the Jacobian calculation gives us
$$
\begin{array}{rcl}
p_Y(y)
& = &
\mbox{normal}(f^{-1}(y))
\times
\left|
\frac{\mathrm{d}}{\mathrm{d}y'} f^{-1}(y') \Big|_{y' = y}
\right|
\\[8pt]
& = &
\mbox{normal}\left( \frac{y - \mu}{\sigma} \right)
\times \frac{1}{\sigma}
\\[8pt]
& = & \frac{1}{\sqrt{2\pi}} \, \frac{1}{\sigma} \,
\exp \left( -\frac{1}{2} \left( \frac{y - \mu}{\sigma} \right)^2 \right).
\end{array}
$$
]
$$
\mbox{normal}(y \mid \mu, \sigma)
\ = \
\frac{1}{\sqrt{2 \pi}}
\,
\frac{1}{\sigma}
\,
\exp \! \left(
-\frac{1}{2}
\left( \frac{y - \mu}{\sigma} \right)^2
\right).
$$
Presented this way, the formula makes clear that the normal density
function is a product of three factors:
* a factor deriving from the kernel of the standard normal distribution,
$\exp(-\frac{1}{2} z^2)$, applied to the inverse of
the location-scale transform, $\frac{y - \mu}{\sigma}$,
* a normalizing factor of $\frac{1}{\sigma}$ that depends on the
scale, and
* a constant normalizing factor of $\frac{1}{\sqrt{2 \pi}}$.
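As a check, this formula reproduces R's built-in `dnorm()`; the values
of $y$, $\mu$, and $\sigma$ below are arbitrary.
```{r}
# The general normal density formula, written directly.
normal_density <- function(y, mu, sigma) {
  1 / sqrt(2 * pi) * (1 / sigma) * exp(-0.5 * ((y - mu) / sigma)^2)
}
y_test <- c(-3, 0, 1.7, 6)
max(abs(normal_density(y_test, mu = 1, sigma = 2)
        - dnorm(y_test, mean = 1, sd = 2)))
```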
## Central limit theorem
Laplace proved the central limit theorem in 1812, which established
the sense in which the normal distribution is normal. The theorem
tells us that if we take a sequence of independent random variables
with finite expectations and variances and average them, the
distribution of the average approaches a normal distribution as the
number of variables grows. It goes even further and tells us which
normal distribution, based on the expectations and variances.
Let's state it mathematically as a proper theorem.
*Central Limit Theorem* (Laplace 1812).
Suppose
$$
Y = Y_1, Y_2, \ldots, Y_N
$$
is a sequence of independent, identically distributed random variables
with
$$
\mathbb{E}[Y_n] = \mu
$$
and
$$
\mbox{sd}[Y_n] = \sigma.
$$
Define the average of $Y$ as a new random variable
$$
Z = \frac{1}{N} \sum_{n=1}^N Y_n.
$$
As $N \rightarrow \infty$, the average approaches a normal distribution
with the same location as the $Y_n$ and a scale reduced by a factor of
$\frac{1}{\sqrt{N}}$,
$$
\textstyle
p_Z(z)
\rightarrow
\mbox{normal}\left(
z
\ \Bigg| \
\mu, \,
\frac{1}{\sqrt{N}} \times \sigma
\right).
$$
The theorem can be generalized to variables that do not have the same
distribution as long as they are independent and have finite
expectations and variances.
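A small simulation illustrates the theorem; here we use
uniform$(0, 1)$ summands, an arbitrary non-normal choice, for which
$\mu = \frac{1}{2}$ and $\sigma = \sqrt{\frac{1}{12}}$.
```{r}
set.seed(1234)
N_draws <- 50                                 # summands per average
avgs <- replicate(1e4, mean(runif(N_draws)))  # 10,000 simulated averages
mu_unif <- 0.5                                # E[uniform(0, 1)]
sigma_unif <- sqrt(1 / 12)                    # sd[uniform(0, 1)]
c(mean = mean(avgs),
  sd = sd(avgs),
  predicted_sd = sigma_unif / sqrt(N_draws))
```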
Many natural phenomena, such as adult human height in either sex, are
the result of a number of relatively small, additive effects, the
combination of which leads to normality no matter how the additive
effects are distributed. Practically speaking, this is why the normal
distribution makes sense as a representation of many natural
phenomena.
## Lognormal distribution
The normal distribution arises naturally as the distribution of a
random variable resulting from a sum of effects. What if the effects are
multiplicative rather than additive, so that the result is the product
$$
Y = V_1 \times \cdots \times V_N
$$
of positive effects $V_n > 0$? Because the effects $V_n$ are positive,
we can work on the log scale, where
$$
\log Y = \log V_1 + \cdots + \log V_N.
$$
In this case, $\log Y$ should have a roughly normal distribution,
being the sum of $N$ additive terms. If $\log Y \sim
\mbox{normal}(\mu, \sigma)$, then $Y$ has what is called a *lognormal
distribution*. The lognormal density function can be calculated by
accounting for the change of variables,^[To calculate the Jacobian for
the change of variables, note that if $Z \sim \mbox{normal}(\mu,
\sigma)$ and $Y = \exp(Z)$, then
$$
\begin{array}{rcl}
p_Y(y \mid \mu, \sigma)
& = &
p_Z(\exp^{-1}(y) \mid \mu, \sigma)
\times
\left|
\frac{\mathrm{d}}{\mathrm{d}y'} \exp^{-1}(y')
\Bigg|_{y' = y}
\right|
\\[6pt]
& = &
\mbox{normal}(\log y \mid \mu, \sigma)
\times
\left|
\frac{\mathrm{d}}{\mathrm{d}y'} \log y'
\Bigg|_{y' = y}
\right|
\\[6pt]
& = &
\mbox{normal}(\log y \mid \mu, \sigma) \times \frac{1}{y}.
\end{array}
$$
]
so that
$$
\begin{array}{rcl}
\displaystyle
p_Y(y \mid \mu, \sigma)
& = &
\displaystyle
\mbox{lognormal}(y \mid \mu, \sigma)
\\[6pt]
& = &
\displaystyle
\frac{1}{y} \times \mbox{normal}(\log y \mid \mu, \sigma)
\\[4pt]
& = &
\displaystyle
\frac{1}{y}
\, \frac{1}{\sqrt{2\pi}}
\, \frac{1}{\sigma}
\, \exp \left(
- \frac{1}{2} \left( \frac{\log y - \mu}{\sigma} \right)^2 \right).
\end{array}
$$
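As a check, this formula matches R's built-in `dlnorm()`; the values
below are arbitrary.
```{r}
# The derived density: (1 / y) times the normal density of log(y).
lognormal_density <- function(y, mu, sigma) {
  (1 / y) * dnorm(log(y), mean = mu, sd = sigma)
}
y_test <- c(0.1, 0.5, 1, 2, 10)
max(abs(lognormal_density(y_test, mu = 0.3, sigma = 1.5)
        - dlnorm(y_test, meanlog = 0.3, sdlog = 1.5)))
```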
To see the heart of what's going on without the location and scale
complicating matters, the kernel of the lognormal distribution
derived from the standard normal is just
$$
\mbox{lognormal}(y \mid 0, 1)
\ \propto \
\frac{1}{y} \, \exp \left( -\frac{1}{2} \left( \log y \right)^2 \right).
$$
## Simulating from a normal distribution
To simulate values of a normally distributed random variable
$$
Y \sim \mbox{normal}(\mu, \sigma),
$$
it suffices to simulate a standard normal
variate
$$
Z \sim \mbox{normal}(0, 1)
$$
and let
$$
Y = \mu + \sigma \times Z.
$$
The simplest way to approximately generate draws from the normal
distribution is to invoke the central limit theorem and add
enough draws from a uniform distribution that the result is roughly
normal. This approach is only approximate and requires many uniform
draws per normal draw, so it is neither exact nor
fast.^[It does correspond to Francis Galton's *quincunx*, a physical
machine into which marbles are dropped to fall onto a grid of pegs
which push them randomly to the right or the left. This random walk
produces distances after a number of steps that are roughly normally
distributed. The approximation improves as the number of random moves
is increased.]
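A classic instance of this approach, sketched below, sums twelve
uniform$(0, 1)$ draws and subtracts 6; the sum has expectation
$12 \times \frac{1}{2} = 6$ and variance $12 \times \frac{1}{12} = 1$,
so the result is approximately standard normal.
```{r}
set.seed(1234)
# Each draw: sum of 12 uniform(0, 1) variates, centered at its expectation 6.
approx_std_normal <- replicate(1e4, sum(runif(12)) - 6)
c(mean = mean(approx_std_normal), sd = sd(approx_std_normal))  # near 0 and 1
```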
A clever and efficient way to generate standard normal variates relies
on solving a seemingly harder problem, generating two independent
standard normal variates. By working with two independent normal
variates $(X, Y)$, we can work in polar coordinates and generate an
angle and radius $(\Theta, R)$, where
$$
X = R \times \cos \Theta
$$
and
$$
Y = R \times \sin \Theta.
$$
The other way around,
$$
R = \sqrt{X^2 + Y^2}
$$
and
$$
\Theta = \arctan \left( \frac{Y}{X} \right).
$$
Our strategy is to generate the polar coordinates $(\Theta, R)$ and
transform them to $(X, Y)$ with independent standard normal
distributions. We can generate a random angle uniformly in radians
with
$$
\Theta \sim \mbox{uniform}(0, 2\pi).
$$
The tricky part is generating the radius, which we will do by
simulating a uniform variate
$$
U \sim \mbox{uniform}(0, 1)
$$
and transforming it into
$$
R = \sqrt{-2 \log U}.
$$
This relies on a distribution and properties of sums of squared
standard normal variables we have not yet introduced.^[The technique
hinges on the fact that $$X^2 + Y^2 \sim \mbox{chi\_squared}(2)$$ and
that if $$U \sim \mbox{uniform}(0, 1),$$ then $$-2 \log U \sim
\mbox{chi\_squared}(2),$$ so that setting $$R = \sqrt{-2 \log U}$$
gives the radius the correct distribution.]
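Putting the pieces together gives the polar construction known as the
Box-Muller method; the following is a minimal sketch that returns a
pair of independent standard normal draws from two uniform draws.
```{r}
set.seed(1234)
# One pair of independent standard normal draws from two uniforms.
polar_pair <- function() {
  theta <- runif(1, 0, 2 * pi)   # random angle
  u <- runif(1)                  # uniform draw for the radius
  r <- sqrt(-2 * log(u))         # radius with the required distribution
  c(r * cos(theta), r * sin(theta))
}
z_pairs <- replicate(5000, polar_pair())   # 2 x 5000 matrix of draws
c(mean = mean(z_pairs), sd = sd(z_pairs))  # near 0 and 1
```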