Commit: Final formatting and fixing any index abnormalities
OpenIntroOrg committed Jul 3, 2015
1 parent b4904ee commit ba0dc91
Showing 15 changed files with 178 additions and 149 deletions.
17 changes: 11 additions & 6 deletions ch_distributions/TeX/ch_distributions.tex
@@ -27,14 +27,14 @@ \subsection{Normal distribution model}

\begin{figure}[hht]
\centering
-\includegraphics[width=0.9\textwidth]{ch_distributions/figures/twoSampleNormals/twoSampleNormals}
+\includegraphics[width=0.85\textwidth]{ch_distributions/figures/twoSampleNormals/twoSampleNormals}
\caption{Both curves represent the normal distribution; however, they differ in their center and spread. The normal distribution with mean 0 and standard deviation 1 is called the \term{standard normal distribution}.}
\label{twoSampleNormals}
\end{figure}

\begin{figure}[hht]
\centering
-\includegraphics[width=0.65\textwidth]{ch_distributions/figures/twoSampleNormalsStacked/twoSampleNormalsStacked}
+\includegraphics[width=0.6\textwidth]{ch_distributions/figures/twoSampleNormalsStacked/twoSampleNormalsStacked}
\caption{The normal models shown in Figure~\ref{twoSampleNormals} but plotted together and on the same scale.}
\label{twoSampleNormalsStacked}
\end{figure}
@@ -50,7 +50,12 @@ \subsection{Normal distribution model}
Because the mean and standard deviation describe a normal distribution exactly, they are called the distribution's \termsub{parameters}{parameter}.

\begin{exercise}
-Write down the short-hand for a normal distribution with (a)~mean~5 and standard deviation~3, (b)~mean~-100 and standard deviation~10, and (c)~mean~2 and standard deviation~9.\footnote{(a)~$N(\mu=5,\sigma=3)$. (b)~$N(\mu=-100, \sigma=10)$. (c)~$N(\mu=2, \sigma=9)$.}
+Write down the short-hand for a normal distribution with\footnote{(a)~$N(\mu=5,\sigma=3)$. (b)~$N(\mu=-100, \sigma=10)$. (c)~$N(\mu=2, \sigma=9)$.}
+\begin{parts}
+\item mean~5 and standard deviation~3,
+\item mean~-100 and standard deviation~10, and
+\item mean~2 and standard deviation~9.
+\end{parts}
\end{exercise}
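Since the mean and standard deviation describe a normal distribution exactly, each answer in the footnote can be written down as a distribution object. A minimal Python sketch (standard library only; not part of the commit's files):

```python
from statistics import NormalDist

# The three distributions from the exercise, as (mu, sigma) parameter pairs
answers = {
    "a": NormalDist(mu=5, sigma=3),
    "b": NormalDist(mu=-100, sigma=10),
    "c": NormalDist(mu=2, sigma=9),
}

# The two parameters determine everything else; for example, every normal
# distribution places probability 0.5 at or below its mean.
half_masses = [d.cdf(d.mean) for d in answers.values()]
```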

\subsection{Standardizing with Z-scores}
@@ -340,7 +345,7 @@ \section{Evaluating the normal approximation}

Example~\ref{normalExam40Perc} suggests the distribution of heights of US males is well approximated by the normal model. We are interested in proceeding under the assumption that the data are normally distributed, but first we must check to see if this is reasonable.

-There are two visual methods for checking the assumption of normality, which can be implemented and interpreted quickly. The first is a simple histogram with the best fitting normal curve overlaid on the plot, as shown in the left panel of Figure~\ref{fcidMHeights}. The sample mean $\bar{x}$ and standard deviation $s$ are used as the parameters of the best fitting normal curve. The closer this curve fits the histogram, the more reasonable the normal model assumption. Another more common method is examining a \term{normal probability plot}.\footnote{Also commonly called a \term{quantile-quantile plot}.}, shown in the right panel of Figure~\ref{fcidMHeights}. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
+There are two visual methods for checking the assumption of normality, which can be implemented and interpreted quickly. The first is a simple histogram with the best fitting normal curve overlaid on the plot, as shown in the left panel of Figure~\ref{fcidMHeights}. The sample mean $\bar{x}$ and standard deviation $s$ are used as the parameters of the best fitting normal curve. The closer this curve fits the histogram, the more reasonable the normal model assumption. Another more common method is examining a \term{normal probability plot},\footnote{Also commonly called a \term{quantile-quantile plot}.} shown in the right panel of Figure~\ref{fcidMHeights}. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
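Both checks can also be carried out numerically. The Python sketch below (made-up height data standing in for the FCID sample) computes the best-fitting normal parameters and the coordinates a normal probability plot would display:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
# Hypothetical sample standing in for the male height data (inches)
sample = sorted(random.gauss(70.0, 3.3) for _ in range(100))

# Best-fitting normal curve: parameters are the sample mean and sample sd
x_bar, s = mean(sample), stdev(sample)
fit = NormalDist(x_bar, s)

# Normal probability plot coordinates: the i-th ordered observation is paired
# with the normal quantile at probability (i - 0.5) / n
n = len(sample)
theoretical = [fit.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
pairs = list(zip(theoretical, sample))  # near-linear pattern => normal model is reasonable
```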

\begin{figure}
\centering
@@ -577,7 +582,7 @@ \section{Binomial distribution (special topic)}
\end{example}

\begin{exercise}
-Verify that the scenario where Brittany is the only one to refuse to give the most severe shock has probability $(0.35)^1(0.65)^3$.\footnote{$P(A=\text{\resp{shock}},\text{ }B=\text{\resp{refuse}},\text{ }C=\text{\resp{shock}},\text{ }D=\text{\resp{shock}}) = (0.65)(0.35)(0.65)(0.65) = (0.35)^1(0.65)^3$.}
+Verify that the scenario where Brittany is the only one to refuse to give the most severe shock has probability $(0.35)^1(0.65)^3$.~\footnote{$P(A=\text{\resp{shock}},\text{ }B=\text{\resp{refuse}},\text{ }C=\text{\resp{shock}},\text{ }D=\text{\resp{shock}}) = (0.65)(0.35)(0.65)(0.65) = (0.35)^1(0.65)^3$.}
\end{exercise}
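The footnote's multiplication can be checked directly; a short Python sketch using the probabilities from the text (0.35 refuse, 0.65 shock):

```python
# Probabilities from the text: refuse with 0.35, shock with 0.65
p_refuse, p_shock = 0.35, 0.65

# Ordered outcome (A=shock, B=refuse, C=shock, D=shock); the four trials are
# independent, so the probabilities multiply
prob = p_shock * p_refuse * p_shock * p_shock  # equals (0.35)^1 (0.65)^3
```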

\textC{\newpage}
@@ -947,7 +952,7 @@ \subsection{Poisson distribution}
\index{distribution!Poisson|(}

\begin{example}{There are about 8 million individuals in New York City. How many individuals might we expect to be hospitalized for acute myocardial infarction (AMI), i.e. a heart attack, each day? According to historical records, the average number is about 4.4 individuals. However, we would also like to know the approximate distribution of counts. What would a histogram of the number of AMI occurrences each day look like if we recorded the daily counts over an entire year?} \label{amiIncidencesEachDayOver1YearInNYCExample}
-A histogram of the number of occurrences of AMI on 365 days\footnote{These data are simulated. In practice, we should check for an association between successive days.} for NYC is shown in Figure~\ref{amiIncidencesOver100Days}. The sample mean (4.38) is similar to the historical average of 4.4. The sample standard deviation is about 2, and the histogram indicates that about 70\% of the data fall between 2.4 and 6.4. The distribution's shape is unimodal and skewed to the right.
+A histogram of the number of occurrences of AMI on 365 days for NYC is shown in Figure~\ref{amiIncidencesOver100Days}.\footnote{These data are simulated. In practice, we should check for an association between successive days.} The sample mean (4.38) is similar to the historical average of 4.4. The sample standard deviation is about 2, and the histogram indicates that about 70\% of the data fall between 2.4 and 6.4. The distribution's shape is unimodal and skewed to the right.
\end{example}
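The example's simulated year of counts can be reproduced in miniature. A Python sketch drawing 365 daily counts at rate 4.4, using Knuth's sampler since the standard library has no Poisson generator:

```python
import math
import random
from statistics import mean, stdev

random.seed(7)

def poisson_draw(lam):
    # Knuth's method: multiply uniforms until the product drops below e^-lam;
    # the number of multiplications (minus one) is a Poisson(lam) draw
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p < threshold:
            return k
        k += 1

# One simulated year of daily AMI counts with rate 4.4 per day
daily = [poisson_draw(4.4) for _ in range(365)]
m, s = mean(daily), stdev(daily)  # expect mean near 4.4, sd near sqrt(4.4) ~ 2.1
```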

\begin{figure}[h]
2 changes: 1 addition & 1 deletion ch_distributions/figures/satAbove1630/satAbove1630.R
@@ -1,7 +1,7 @@
library(openintro)
data(COL)

-myPDF("satAbove1630.pdf", 2.5, 1.2,
+myPDF("satAbove1630.pdf", 3, 1.4,
mar = c(1.2, 0, 0, 0),
mgp = c(3, 0.17, 0))
normTail(1500, 300,
Binary file modified ch_distributions/figures/satAbove1630/satAbove1630.pdf
12 changes: 7 additions & 5 deletions ch_inference_for_means/TeX/ch_inference_for_means.tex
@@ -49,7 +49,8 @@ \subsection{The normality condition}
\subsection{Introducing the $t$-distribution}
\label{introducingTheTDistribution}

-\index{$t$-distribution|(}
+\index{t-distribution|(}
+\index{distribution!$t$|(}

In the cases where we will use a small sample to calculate the standard error, it will be useful to rely on a new distribution for inference calculations: the $t$-distribution. A $t$-distribution, shown as a solid line in Figure~\ref{tDistCompareToNormalDist}, has a bell shape. However, its tails are thicker than the normal model's. This means observations are more likely to fall beyond two standard deviations from the mean than under the normal distribution.\footnote{The standard deviation of the $t$-distribution is actually a little more than 1. However, it is useful to always think of the $t$-distribution as having a standard deviation of 1 in all of our applications.} While our estimate of the standard error will be a little less accurate when we are analyzing a small data set, these extra thick tails of the $t$-distribution are exactly the correction we need to resolve the problem of a poorly estimated standard error.

@@ -134,7 +135,8 @@ \subsection{Introducing the $t$-distribution}
\begin{exercise}
What proportion of the $t$-distribution with 19 degrees of freedom falls above -1.79 units?\footnote{We find the shaded area \emph{above} -1.79 (we leave the picture to you). The small left tail is between 0.025 and 0.05, so the larger upper region must have an area between 0.95 and 0.975.}

-\index{$t$-distribution|)}
+\index{distribution!$t$|)}
+\index{t-distribution|)}

\end{exercise}
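The exercise's bracketed answer can be checked by simulation. A Python sketch that builds $t$ draws from their definition, a standard normal over the root of a scaled chi-square, rather than from a $t$ table:

```python
import math
import random

random.seed(3)

def t_draw(df):
    # A t random variable is Z / sqrt(chi-square_df / df), assembled here
    # from standard normal draws
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

n = 50_000
draws = [t_draw(19) for _ in range(n)]

# Proportion of the t distribution with 19 df above -1.79; the exercise's
# answer says this should land between 0.95 and 0.975
prop_above = sum(d > -1.79 for d in draws) / n
```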

@@ -481,10 +483,10 @@ \subsection{Confidence interval for a difference of means}
%\end{termBox}

\index{point estimate!difference of means|)}
-\index{standard error!difference in means}
+\index{standard error (SE)!difference in means}

We can quantify the variability in the point estimate, $\bar{x}_{esc} - \bar{x}_{control}$, using the following formula for its standard error:
-\index{standard error!difference in means}
+\index{standard error (SE)!difference in means}
\begin{eqnarray*}
SE_{\bar{x}_{esc} - \bar{x}_{control}} = \sqrt{\frac{\sigma_{esc}^2}{n_{esc}} + \frac{\sigma_{control}^2}{n_{control}}}
\end{eqnarray*}
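Plugging hypothetical summary statistics into this formula (the values below are made up, not the ESC study's):

```python
import math

# Hypothetical group summaries: sample sd and sample size for each group
s1, n1 = 4.0, 30   # treatment-style group (made-up values)
s2, n2 = 3.0, 25   # control-style group (made-up values)

# Standard error of the difference in sample means: add the two
# squared per-group standard errors, then take the square root
se_diff = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
```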
@@ -1099,7 +1101,7 @@ \subsection{Is batting performance related to player position in MLB?}

\subsection{Analysis of variance (ANOVA) and the F test}

-The method of analysis of variance in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone? This question is different from earlier testing procedures since we will \emph{simultaneously} consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation. We call this variability the \term{mean square between groups ($MSG$)}, and it has an associated degrees of freedom, $df_{G}=k-1$ when there are $k$ groups. The $MSG$ can be thought of as a scaled variance formula for means. If the null hypothesis is true, any variation in the sample means is due to chance and shouldn't be too large. Details of $MSG$ calculations are provided in the footnote,\footnote{Let $\bar{x}$ represent the mean of outcomes across all groups. Then the mean square between groups is computed as
+The method of analysis of variance in this context focuses on answering one question: is the variability in the sample means so large that it seems unlikely to be from chance alone? This question is different from earlier testing procedures since we will \emph{simultaneously} consider many groups, and evaluate whether their sample means differ more than we would expect from natural variation. We call this variability the \term{mean square between groups ($MSG$)}, and it has an associated degrees of freedom, $df_{G}=k-1$ when there are $k$ groups.\index{degrees of freedom (df)!ANOVA} The $MSG$ can be thought of as a scaled variance formula for means. If the null hypothesis is true, any variation in the sample means is due to chance and shouldn't be too large. Details of $MSG$ calculations are provided in the footnote,\footnote{Let $\bar{x}$ represent the mean of outcomes across all groups. Then the mean square between groups is computed as
\begin{align*}
MSG = \frac{1}{df_{G}}SSG = \frac{1}{k-1}\sum_{i=1}^{k} n_{i}\left(\bar{x}_{i} - \bar{x}\right)^2
\end{align*}
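The footnote's formula in miniature: a Python sketch computing $MSG$ for three made-up groups (not the MLB data):

```python
from statistics import mean

# Toy data: k = 3 groups of observations (hypothetical values)
groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 7.0, 9.0]]

k = len(groups)
all_obs = [x for g in groups for x in g]
grand_mean = mean(all_obs)  # x-bar across every group

# Sum of squares between groups, weighting each group by its size
ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Mean square between groups: MSG = SSG / df_G with df_G = k - 1
df_g = k - 1
msg = ssg / df_g
```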
6 changes: 3 additions & 3 deletions ch_inference_for_props/TeX/ch_inference_for_props.tex
@@ -38,7 +38,7 @@ \subsection{Identifying when the sample proportion is nearly normal}
\item we expected to see at least 10 successes and 10 failures in our sample, i.e. $np\geq10$ and $n(1-p)\geq10$. This is called the \term{success-failure condition}.
\end{enumerate}
If these conditions are met, then the sampling distribution of $\hat{p}$ is nearly normal with mean $p$ and standard error
-\index{standard error!single proportion}
+\index{standard error (SE)!single proportion}
\begin{eqnarray}
SE_{\hat{p}} = \sqrt{\frac{\ p(1-p)\ }{n}}
\label{seOfPHat}
@@ -232,7 +232,7 @@ \subsection{Sample distribution of the difference of two proportions}
\item the two samples are independent of each other.
\end{itemize}
The standard error of the difference in sample proportions is
-\index{standard error!difference in proportions}
+\index{standard error (SE)!difference in proportions}
\begin{eqnarray}
SE_{\hat{p}_1 - \hat{p}_2}
= \sqrt{SE_{\hat{p}_1}^2 + SE_{\hat{p}_2}^2}
@@ -253,7 +253,7 @@ \subsection{Confidence intervals for $p_1 -p_2$}

In the setting of confidence intervals for a difference of two proportions, the two sample proportions are used to verify the success-failure condition and also compute the standard error, just as was the case with a single proportion.

-\begin{example}{The way a question is phrased can influence a person's response. For example, Pew Research Center conducted a survey with the following question:\footnote{\oiRedirect{textbook-health_care_bill_2012}{www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate}{www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate/}. Sample sizes for each polling group are approximate.}
+\begin{example}{The way a question is phrased can influence a person's response. For example, Pew Research Center conducted a survey with the following question:\footnote{\oiRedirect{textbook-health_care_bill_2012}{www.people-press.org/2012/03/26/public-remains-split-on-health-care-bill-opposed-to-mandate}. Sample sizes for each polling group are approximate.}
\begin{quote}
As you may know, by 2014 nearly all Americans will be required to have health insurance. [People who do not buy insurance will pay a penalty] while [People who cannot afford it will receive financial help from the government]. Do you approve or disapprove of this policy?
\end{quote}
4 changes: 2 additions & 2 deletions ch_inference_foundations/TeX/ch_inference_foundations.tex
@@ -210,7 +210,7 @@ \subsection{Standard error of the mean}
\label{seOfXBar}
\end{eqnarray}\vspace{-3mm}%

-A reliable method to ensure sample observations are independent is to conduct a simple random sample consisting of less than 10\% of the population.\index{standard error!single mean}
+A reliable method to ensure sample observations are independent is to conduct a simple random sample consisting of less than 10\% of the population.\index{standard error (SE)!single mean}
}
\end{termBox}

@@ -283,7 +283,7 @@ \subsection{An approximate 95\% confidence interval}
\end{figure}

\begin{exercise}
-In Figure~\ref{95PercentConfidenceInterval}, one interval does not contain 3.90 minutes. Does this imply that the mean cannot be 3.90? \footnote{Just as some observations occur more than 2 standard deviations from the mean, some point estimates will be more than 2 standard errors from the parameter. A confidence interval only provides a plausible range of values for a parameter. While we might say other values are implausible based on the data, this does not mean they are impossible.}
+In Figure~\ref{95PercentConfidenceInterval}, one interval does not contain 3.90 minutes. Does this imply that the mean cannot be 3.90?\footnote{Just as some observations occur more than 2 standard deviations from the mean, some point estimates will be more than 2 standard errors from the parameter. A confidence interval only provides a plausible range of values for a parameter. While we might say other values are implausible based on the data, this does not mean they are impossible.}
\end{exercise}

The rule where about 95\% of observations are within 2 standard deviations of the mean is only approximately true. However, it holds very well for the normal distribution. As we will soon see, the mean tends to be normally distributed when the sample size is sufficiently large.
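The "point estimate plus or minus 2 standard errors" recipe can be sketched in Python with a made-up sample of 36 wait times (not the text's data):

```python
import math
from statistics import mean, stdev

# Hypothetical sample of 36 wait times in minutes (made-up values)
times = [3.1, 4.2, 3.8, 4.0, 3.5, 4.4, 3.9, 4.1, 3.6, 4.3,
         3.7, 4.0, 3.9, 3.8, 4.2, 3.6, 4.1, 3.9, 4.0, 3.7,
         4.5, 3.4, 3.8, 4.0, 4.2, 3.9, 3.7, 4.1, 3.8, 4.0,
         3.9, 4.3, 3.6, 4.0, 3.8, 4.1]

x_bar = mean(times)
se = stdev(times) / math.sqrt(len(times))  # standard error of the mean

# Approximate 95% confidence interval: point estimate +/- 2 standard errors
ci = (x_bar - 2 * se, x_bar + 2 * se)
```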
4 changes: 2 additions & 2 deletions ch_inference_foundations/TeX/ch_inference_foundations_ex.tex
@@ -711,8 +711,8 @@ \subsection{Hypothesis testing}
\item If the alternative hypothesis is true, then the probability of making a
Type~2 Error and the power of a test add up to 1.
\item With large sample sizes, even small differences between the null value and
-the true value of the parameter, a difference often called the effect size\index{
-effect size}, will be identified as statistically significant.
+the true value of the parameter, a difference often called the effect size
+\index{effect size}, will be identified as statistically significant.
\end{parts}
}{}

8 changes: 5 additions & 3 deletions ch_intro_to_data/TeX/ch_intro_to_data.tex
@@ -848,7 +848,7 @@ \subsection{Box plots, quartiles, and the median}
Examination of data for possible outliers serves many useful purposes, including\vspace{-2mm}
\begin{enumerate}
\setlength{\itemsep}{0mm}
-\item Identifying \indexthis{strong skew}{skew!strong skew} in the distribution.
+\item Identifying \indexthis{strong skew}{skew!example: strong} in the distribution.
\item Identifying data collection or entry errors. For instance, we re-examined the email purported to have 64,401 characters to ensure this value was accurate.
\item Providing insight into interesting properties of the data.\vspace{0.5mm}
\end{enumerate}}
@@ -1182,7 +1182,7 @@ \subsection{Segmented bar and mosaic plots}
\label{emailSpamNumberMosaicPlot}
\end{figure}

-A \term{mosaic plot} is a graphical display of contingency table information that is similar to a bar plot for one variable or a segmented bar plot when using two variables. Figure~\ref{emailNumberMosaic} shows a mosaic plot for the \var{number} variable. Each column represents a level of \var{number}, and the column widths correspond to the proportion of emails of each number type. For instance, there are fewer emails with no numbers than emails with only small numbers, so the no number email column is slimmer. In general, mosaic plots use box \emph{areas} to represent the number of observations that box represents.
+A \term{mosaic plot} is a graphical display of contingency table information that is similar to a bar plot for one variable or a segmented bar plot when using two variables. Figure~\ref{emailNumberMosaic} shows a mosaic plot for the \var{number} variable. Each column represents a level of \var{number}, and the column widths correspond to the proportion of emails for each number~type. For~instance, there are fewer emails with no numbers than emails with only small numbers, so the no number email column is slimmer. In general, mosaic plots use box \emph{areas} to represent the number of observations that box represents.

\begin{figure}
\centering
@@ -1367,7 +1367,7 @@ \subsection{Checking for independence}

\begin{figure}[ht]
\centering
-\includegraphics[width=0.7\textwidth]{ch_intro_to_data/figures/discRandDotPlot/discRandDotPlot}
+\includegraphics[width=0.85\textwidth]{ch_intro_to_data/figures/discRandDotPlot/discRandDotPlot}
\caption{A stacked dot plot of differences from 100 simulations produced under the independence model, $H_0$, where \var{gender\_\hspace{0.3mm}sim} and \var{decision} are independent. Two of the 100 simulations had a difference of at least 29.2\%, the difference observed in the study.}
\label{discRandDotPlot}
\end{figure}
@@ -1378,6 +1378,8 @@ \subsection{Checking for independence}
It appears that a difference of at least 29.2\% due to chance alone would only happen about 2\% of the time according to Figure~\ref{discRandDotPlot}. Such a low probability indicates a rare event.
\end{example}
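The 100 simulations under $H_0$ can be sketched in a few lines of Python. The 29.2\% threshold and the 24/24 gender split come from the text; the 35-promoted / 13-not totals are an assumption about the underlying study, made up here for illustration:

```python
import random

random.seed(11)

# 48 personnel files: assumed 35 promoted (1) and 13 not promoted (0)
outcomes = [1] * 35 + [0] * 13

def simulated_difference():
    # Independence model: shuffle promotions across the 24 "male" and
    # 24 "female" labels, then recompute the difference in proportions
    random.shuffle(outcomes)
    male, female = outcomes[:24], outcomes[24:]
    return sum(male) / 24 - sum(female) / 24

diffs = [simulated_difference() for _ in range(100)]
extreme = sum(d >= 0.292 for d in diffs)  # simulations at least as large as observed
```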

+\textC{\newpage}

The difference of 29.2\% being a rare event suggests two possible interpretations of the results of the study:
\begin{itemize}
\setlength{\itemsep}{0mm}
3 changes: 2 additions & 1 deletion ch_intro_to_data/TeX/ch_intro_to_data_ex.tex
@@ -1384,6 +1384,7 @@ \subsection{Case study: gender discrimination}
\begin{parts}
\item Based on the mosaic plot, is survival independent of whether or not the
patient got a transplant? Explain your reasoning.
+\textC{\\\textbf{(See the next page for additional parts to this question.)}}
\item What do the box plots below suggest about the efficacy (effectiveness) of the heart transplant treatment?
\item What proportion of patients in the treatment group and what proportion of
patients in the control group died?
@@ -1413,6 +1414,6 @@ \subsection{Case study: gender discrimination}
\end{subparts}
\end{parts}
\begin{center}
-\includegraphics[width= 0.6\textwidth]{ch_intro_to_data/figures/eoce/randomization_heart_transplants/randomization_heart_transplants_rando.pdf}
+\includegraphics[width= 0.65\textwidth]{ch_intro_to_data/figures/eoce/randomization_heart_transplants/randomization_heart_transplants_rando.pdf}
\end{center}
}{}