forked from DataScienceSpecialization/courses
Showing 20 changed files with 4,192 additions and 3,816 deletions.
Binary file modified: 06_StatisticalInference/03_01_TwoGroupIntervals/fig/unnamed-chunk-3.png, -28.4 KB (15%)
372 changes: 186 additions & 186 deletions
06_StatisticalInference/03_01_TwoGroupIntervals/index.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,186 +1,186 @@ | ||
--- | ||
title : Two group intervals | ||
subtitle : Statistical Inference | ||
author : Brian Caffo, Jeff Leek, Roger Peng | ||
job : Johns Hopkins Bloomberg School of Public Health | ||
logo : bloomberg_shield.png | ||
framework : io2012 # {io2012, html5slides, shower, dzslides, ...} | ||
highlighter : highlight.js # {highlight.js, prettify, highlight} | ||
hitheme : tomorrow # | ||
url: | ||
lib: ../../libraries | ||
assets: ../../assets | ||
widgets : [mathjax] # {mathjax, quiz, bootstrap} | ||
mode : selfcontained # {standalone, draft} | ||
--- | ||
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} | ||
# make this an external chunk that can be included in any file | ||
options(width = 100) | ||
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') | ||
options(xtable.type = 'html') | ||
knit_hooks$set(inline = function(x) { | ||
if(is.numeric(x)) { | ||
round(x, getOption('digits')) | ||
} else { | ||
paste(as.character(x), collapse = ', ') | ||
} | ||
}) | ||
knit_hooks$set(plot = knitr:::hook_plot_html) | ||
runif(1) | ||
``` | ||
|
||
## Independent group $t$ confidence intervals | ||
|
||
- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial; those who received the treatment to those who received a placebo | ||
- We cannot use the paired t test because the groups are independent and may have different sample sizes | ||
- We now present methods for comparing independent groups | ||
|
||
--- | ||
|
||
## Notation | ||
|
||
- Let $X_1,\ldots,X_{n_x}$ be iid $N(\mu_x,\sigma^2)$ | ||
- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ | ||
- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations | ||
- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ | ||
- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ | ||
|
||
--- | ||
|
||
## Note | ||
|
||
- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size | ||
- If the sample sizes are the same the pooled variance estimate is the average of the group variances | ||
- The pooled estimator is unbiased | ||
$$ | ||
\begin{eqnarray*} | ||
E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ | ||
& = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} | ||
\end{eqnarray*} | ||
$$ | ||
- The pooled variance estimate is independent of $\bar Y - \bar X$ since $S_x$ is independent of $\bar X$ and $S_y$ is independent of $\bar Y$ and the groups are independent | ||
|
||
--- | ||
|
||
## Result | ||
|
||
- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands | ||
- Therefore | ||
$$ | ||
\begin{eqnarray*} | ||
(n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ | ||
& = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ | ||
& = & \chi^2_{n_x + n_y - 2} | ||
\end{eqnarray*} | ||
$$ | ||
|
||
--- | ||
|
||
## Putting this all together | ||
|
||
- The statistic | ||
$$ | ||
\frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% | ||
{\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} | ||
= \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} | ||
$$ | ||
is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom | ||
- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom | ||
- Notice the form is (estimator - true value) / SE | ||
|
||
--- | ||
|
||
## Confidence interval | ||
|
||
- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is | ||
$$ | ||
\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} | ||
$$ | ||
- Remember this interval is assuming a constant variance across the two groups | ||
- If there is some doubt, assume a different variance per group, which we will discuss later | ||
|
||
--- | ||
|
||
|
||
## Example | ||
### Based on Rosner, Fundamentals of Biostatistics | ||
|
||
- Comparing SBP for 8 oral contraceptive users versus 21 controls | ||
- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg | ||
- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg | ||
- Pooled variance estimate | ||
```{r} | ||
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) | ||
132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 | ||
``` | ||
|
||
--- | ||
```{r} | ||
data(sleep) | ||
x1 <- sleep$extra[sleep$group == 1] | ||
x2 <- sleep$extra[sleep$group == 2] | ||
n1 <- length(x1) | ||
n2 <- length(x2) | ||
sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) | ||
md <- mean(x1) - mean(x2) | ||
semd <- sp * sqrt(1 / n1 + 1/n2) | ||
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd | ||
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf | ||
t.test(x1, x2, paired = TRUE)$conf | ||
``` | ||
|
||
--- | ||
## Ignoring pairing | ||
```{r, echo = FALSE, fig.width=5, fig.height=5} | ||
plot(c(0.5, 2.5), range(x1, x2), type = "n", frame = FALSE, xlab = "group", ylab = "Extra", axes = FALSE) | ||
axis(2) | ||
axis(1, at = 1 : 2, labels = c("Group 1", "Group 2")) | ||
for (i in 1 : n1) lines(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "red") | ||
for (i in 1 : n1) points(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "black", bg = "salmon", pch = 21, cex = 3) | ||
``` | ||
|
||
--- | ||
|
||
## Unequal variances | ||
|
||
- Under unequal variances | ||
$$ | ||
\bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) | ||
$$ | ||
- The statistic | ||
$$ | ||
\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right)^{1/2}} | ||
$$ | ||
approximately follows Gosset's $t$ distribution with degrees of freedom equal to | ||
$$ | ||
\frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} | ||
{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + | ||
\left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} | ||
$$ | ||
|
||
--- | ||
|
||
## Example | ||
|
||
- Comparing SBP for 8 oral contraceptive users versus 21 controls | ||
- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg | ||
- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg | ||
- $df=15.04$, $t_{15.04, .975} = 2.13$ | ||
- Interval | ||
$$ | ||
132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} | ||
= [-8.91, 19.75] | ||
$$ | ||
- In R, `t.test(..., var.equal = FALSE)` | ||
|
||
--- | ||
## Comparing other kinds of data | ||
* For binomial data, there's lots of ways to compare two groups | ||
* Relative risk, risk difference, odds ratio. | ||
* Chi-squared tests, normal approximations, exact tests. | ||
* For count data, there's also Chi-squared tests and exact tests. | ||
* We'll leave the discussions for comparing groups of data for binary | ||
and count data until covering glms in the regression class. | ||
* In addition, Mathematical Biostatistics Boot Camp 2 covers many special | ||
cases relevant to biostatistics. | ||
--- | ||
title : Two group intervals | ||
subtitle : Statistical Inference | ||
author : Brian Caffo, Jeff Leek, Roger Peng | ||
job : Johns Hopkins Bloomberg School of Public Health | ||
logo : bloomberg_shield.png | ||
framework : io2012 # {io2012, html5slides, shower, dzslides, ...} | ||
highlighter : highlight.js # {highlight.js, prettify, highlight} | ||
hitheme : tomorrow # | ||
url: | ||
lib: ../../librariesNew | ||
assets: ../../assets | ||
widgets : [mathjax] # {mathjax, quiz, bootstrap} | ||
mode : selfcontained # {standalone, draft} | ||
--- | ||
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'} | ||
# make this an external chunk that can be included in any file | ||
options(width = 100) | ||
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/') | ||
options(xtable.type = 'html') | ||
knit_hooks$set(inline = function(x) { | ||
if(is.numeric(x)) { | ||
round(x, getOption('digits')) | ||
} else { | ||
paste(as.character(x), collapse = ', ') | ||
} | ||
}) | ||
knit_hooks$set(plot = knitr:::hook_plot_html) | ||
runif(1) | ||
``` | ||
|
||
## Independent group $t$ confidence intervals | ||
|
||
- Suppose that we want to compare the mean blood pressure between two groups in a randomized trial: those who received the treatment and those who received a placebo | ||
- We cannot use the paired t test because the groups are independent and may have different sample sizes | ||
- We now present methods for comparing independent groups | ||
|
||
--- | ||
|
||
## Notation | ||
|
||
- Let $X_1,\ldots,X_{n_x}$ be iid $N(\mu_x,\sigma^2)$ | ||
- Let $Y_1,\ldots,Y_{n_y}$ be iid $N(\mu_y, \sigma^2)$ | ||
- Let $\bar X$, $\bar Y$, $S_x$, $S_y$ be the means and standard deviations | ||
- Using the fact that linear combinations of normals are again normal, we know that $\bar Y - \bar X$ is also normal with mean $\mu_y - \mu_x$ and variance $\sigma^2 (\frac{1}{n_x} + \frac{1}{n_y})$ | ||
- The pooled variance estimator $$S_p^2 = \{(n_x - 1) S_x^2 + (n_y - 1) S_y^2\}/(n_x + n_y - 2)$$ is a good estimator of $\sigma^2$ | ||
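The pooled variance formula can be sketched as a small R function. This is only an illustration: `pooled_var` and its argument names are hypothetical and do not appear in the slides, and the input values are made up.

```r
# Pooled variance from group sizes and sample standard deviations.
# pooled_var is a hypothetical helper name, not part of the slides.
pooled_var <- function(nx, sx, ny, sy) {
  ((nx - 1) * sx^2 + (ny - 1) * sy^2) / (nx + ny - 2)
}

# Equal sample sizes: the result is the plain average of the two variances.
equal_n <- pooled_var(10, 2, 10, 4)   # (4 + 16) / 2 = 10
equal_n

# Unequal sample sizes: the variance of the larger group gets more weight.
unequal_n <- pooled_var(5, 2, 50, 4)
unequal_n
```

With made-up standard deviations of 2 and 4, the second call lands much closer to 16 than to 4 because the second group contributes 49 of the 53 degrees of freedom.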
|
||
--- | ||
|
||
## Note | ||
|
||
- The pooled estimator is a mixture of the group variances, placing greater weight on whichever has a larger sample size | ||
- If the sample sizes are the same the pooled variance estimate is the average of the group variances | ||
- The pooled estimator is unbiased | ||
$$ | ||
\begin{eqnarray*} | ||
E[S_p^2] & = & \frac{(n_x - 1) E[S_x^2] + (n_y - 1) E[S_y^2]}{n_x + n_y - 2}\\ | ||
& = & \frac{(n_x - 1)\sigma^2 + (n_y - 1)\sigma^2}{n_x + n_y - 2} \\ | ||
& = & \sigma^2 | ||
\end{eqnarray*} | ||
$$ | ||
- The pooled variance estimate is independent of $\bar Y - \bar X$, since $S_x$ is independent of $\bar X$, $S_y$ is independent of $\bar Y$, and the two groups are independent of each other | ||
|
||
--- | ||
|
||
## Result | ||
|
||
- The sum of two independent Chi-squared random variables is Chi-squared with degrees of freedom equal to the sum of the degrees of freedom of the summands | ||
- Therefore | ||
$$ | ||
\begin{eqnarray*} | ||
(n_x + n_y - 2) S_p^2 / \sigma^2 & = & (n_x - 1)S_x^2 /\sigma^2 + (n_y - 1)S_y^2/\sigma^2 \\ \\ | ||
& = & \chi^2_{n_x - 1} + \chi^2_{n_y-1} \\ \\ | ||
& = & \chi^2_{n_x + n_y - 2} | ||
\end{eqnarray*} | ||
$$ | ||
|
||
--- | ||
|
||
## Putting this all together | ||
|
||
- The statistic | ||
$$ | ||
\frac{\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\sigma \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}}}% | ||
{\sqrt{\frac{(n_x + n_y - 2) S_p^2}{(n_x + n_y - 2)\sigma^2}}} | ||
= \frac{\bar Y - \bar X - (\mu_y - \mu_x)}{S_p \left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2}} | ||
$$ | ||
is a standard normal divided by the square root of an independent Chi-squared divided by its degrees of freedom | ||
- Therefore this statistic follows Gosset's $t$ distribution with $n_x + n_y - 2$ degrees of freedom | ||
- Notice the form is (estimator - true value) / SE | ||
|
||
--- | ||
|
||
## Confidence interval | ||
|
||
- Therefore a $(1 - \alpha)\times 100\%$ confidence interval for $\mu_y - \mu_x$ is | ||
$$ | ||
\bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} | ||
$$ | ||
- Remember that this interval assumes a constant variance across the two groups | ||
- If there is some doubt, assume a different variance per group, which we will discuss later | ||
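The interval formula on this slide can be sketched as an R function that works from summary statistics. The name `pooled_t_interval` and its arguments are hypothetical, and the data are simulated purely so that the result can be compared with `t.test(..., var.equal = TRUE)`.

```r
# Equal-variance t interval for mu_y - mu_x from summary statistics.
# pooled_t_interval and its argument names are hypothetical.
pooled_t_interval <- function(ybar, sy, ny, xbar, sx, nx, alpha = 0.05) {
  df <- nx + ny - 2
  sp <- sqrt(((nx - 1) * sx^2 + (ny - 1) * sy^2) / df)  # pooled SD
  se <- sp * sqrt(1 / nx + 1 / ny)                      # SE of ybar - xbar
  (ybar - xbar) + c(-1, 1) * qt(1 - alpha / 2, df) * se
}

# Simulated data, invented for illustration only.
set.seed(1)
x <- rnorm(12, mean = 10, sd = 2)
y <- rnorm(15, mean = 11, sd = 2)

ci <- pooled_t_interval(mean(y), sd(y), length(y), mean(x), sd(x), length(x))
ci

# t.test(y, x) estimates mean(y) - mean(x), so the two intervals should agree.
all.equal(ci, as.numeric(t.test(y, x, var.equal = TRUE)$conf.int))
```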
|
||
--- | ||
|
||
|
||
## Example | ||
### Based on Rosner, Fundamentals of Biostatistics | ||
|
||
- Comparing SBP for 8 oral contraceptive users versus 21 controls | ||
- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg | ||
- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg | ||
- Pooled standard deviation and the resulting 95% interval | ||
```{r} | ||
sp <- sqrt((7 * 15.34^2 + 20 * 18.23^2) / (8 + 21 - 2)) | ||
132.86 - 127.44 + c(-1, 1) * qt(.975, 27) * sp * (1 / 8 + 1 / 21)^.5 | ||
``` | ||
|
||
--- | ||
```{r} | ||
data(sleep) | ||
x1 <- sleep$extra[sleep$group == 1] | ||
x2 <- sleep$extra[sleep$group == 2] | ||
n1 <- length(x1) | ||
n2 <- length(x2) | ||
sp <- sqrt( ((n1 - 1) * sd(x1)^2 + (n2-1) * sd(x2)^2) / (n1 + n2-2)) | ||
md <- mean(x1) - mean(x2) | ||
semd <- sp * sqrt(1 / n1 + 1/n2) | ||
md + c(-1, 1) * qt(.975, n1 + n2 - 2) * semd | ||
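# independent-groups interval (ignores the pairing); matches the manual calculation above | ||
t.test(x1, x2, paired = FALSE, var.equal = TRUE)$conf.int | ||
# paired interval, appropriate here because the same subjects appear in both groups | ||
t.test(x1, x2, paired = TRUE)$conf.int | ||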
``` | ||
|
||
--- | ||
## Ignoring pairing | ||
```{r, echo = FALSE, fig.width=5, fig.height=5} | ||
plot(c(0.5, 2.5), range(x1, x2), type = "n", frame = FALSE, xlab = "group", ylab = "Extra", axes = FALSE) | ||
axis(2) | ||
axis(1, at = 1 : 2, labels = c("Group 1", "Group 2")) | ||
for (i in 1 : n1) lines(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "red") | ||
for (i in 1 : n1) points(c(1, 2), c(x1[i], x2[i]), lwd = 2, col = "black", bg = "salmon", pch = 21, cex = 3) | ||
``` | ||
|
||
--- | ||
|
||
## Unequal variances | ||
|
||
- Under unequal variances | ||
$$ | ||
\bar Y - \bar X \sim N\left(\mu_y - \mu_x, \frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}\right) | ||
$$ | ||
- The statistic | ||
$$ | ||
\frac{\bar Y - \bar X - (\mu_y - \mu_x)}{\left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)^{1/2}} | ||
$$ | ||
approximately follows Gosset's $t$ distribution with degrees of freedom equal to | ||
$$ | ||
\frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} | ||
{\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + | ||
\left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} | ||
$$ | ||
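The degrees-of-freedom expression above can be evaluated directly from the two sample standard deviations and sample sizes. A minimal sketch follows; `welch_df` is a hypothetical helper name, and the inputs are made up.

```r
# Approximate degrees of freedom for the unequal-variance t statistic.
# welch_df is a hypothetical helper name, not part of the slides.
welch_df <- function(sx, nx, sy, ny) {
  vx <- sx^2 / nx   # estimated variance of the mean of group x
  vy <- sy^2 / ny   # estimated variance of the mean of group y
  (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
}

# Equal SDs and equal sample sizes recover the pooled value nx + ny - 2.
df_equal <- welch_df(2, 10, 2, 10)    # 18
df_equal

# Very different SDs pull the value down toward the smaller group's n - 1.
df_unequal <- welch_df(1, 10, 10, 10)
df_unequal
```

The result always lies between the smaller of $n_x - 1$ and $n_y - 1$ and the pooled value $n_x + n_y - 2$, and it is generally not an integer.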
|
||
--- | ||
|
||
## Example | ||
|
||
- Comparing SBP for 8 oral contraceptive users versus 21 controls | ||
- $\bar X_{OC} = 132.86$ mmHg with $s_{OC} = 15.34$ mmHg | ||
- $\bar X_{C} = 127.44$ mmHg with $s_{C} = 18.23$ mmHg | ||
- $df=15.04$, $t_{15.04, .975} = 2.13$ | ||
- Interval | ||
$$ | ||
132.86 - 127.44 \pm 2.13 \left(\frac{15.34^2}{8} + \frac{18.23^2}{21} \right)^{1/2} | ||
= [-8.91, 19.75] | ||
$$ | ||
- In R, `t.test(..., var.equal = FALSE)` | ||
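Because only summary statistics are given for this example, `t.test` cannot be called on raw data. The sketch below recomputes the degrees of freedom and the interval from the numbers on the slide; the variable names are hypothetical.

```r
# Summary statistics from the slide (SBP in mmHg).
n_oc <- 8;  m_oc <- 132.86; s_oc <- 15.34   # oral contraceptive users
n_c  <- 21; m_c  <- 127.44; s_c  <- 18.23   # controls

# Estimated variances of the two sample means.
v_oc <- s_oc^2 / n_oc
v_c  <- s_c^2 / n_c

# Approximate degrees of freedom for the unequal-variance interval.
df <- (v_oc + v_c)^2 / (v_oc^2 / (n_oc - 1) + v_c^2 / (n_c - 1))

# 95% interval for the difference in means (OC users minus controls).
ci <- (m_oc - m_c) + c(-1, 1) * qt(0.975, df) * sqrt(v_oc + v_c)

round(df, 2)
round(ci, 2)
```

The output should agree with the slide's values up to rounding, since the slide rounds the $t$ quantile to 2.13 before computing the interval.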
|
||
--- | ||
## Comparing other kinds of data | ||
* For binomial data, there are many ways to compare two groups | ||
* Relative risk, risk difference, odds ratio. | ||
* Chi-squared tests, normal approximations, exact tests. | ||
* For count data, there are also Chi-squared tests and exact tests. | ||
* We'll defer comparing groups for binary | ||
and count data until we cover GLMs in the regression class. | ||
* In addition, Mathematical Biostatistics Boot Camp 2 covers many special | ||
cases relevant to biostatistics. |
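As a rough illustration of the binomial comparisons listed above, the sketch below computes the three effect measures and runs the tests named on the slide using base R. The counts are hypothetical and invented for illustration; none of this comes from the course materials.

```r
# Hypothetical counts, invented for illustration only.
events_trt <- 15; n_trt <- 100   # events and group size, treatment
events_ctl <- 30; n_ctl <- 100   # events and group size, control

p_trt <- events_trt / n_trt
p_ctl <- events_ctl / n_ctl

risk_difference <- p_trt - p_ctl
relative_risk   <- p_trt / p_ctl
odds_ratio      <- (p_trt / (1 - p_trt)) / (p_ctl / (1 - p_ctl))
c(rd = risk_difference, rr = relative_risk, or = odds_ratio)

# The same data as a 2 x 2 table of events and non-events.
tab <- matrix(c(events_trt, n_trt - events_trt,
                events_ctl, n_ctl - events_ctl),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("treatment", "control"),
                              outcome = c("event", "no event")))

p_chisq  <- chisq.test(tab)$p.value    # Chi-squared test
p_exact  <- fisher.test(tab)$p.value   # exact test
p_normal <- prop.test(c(events_trt, events_ctl),
                      c(n_trt, n_ctl))$p.value  # normal approximation
c(chisq = p_chisq, exact = p_exact, normal = p_normal)
```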