Merge branch 'master' of github.com:hadley/r4ds
# Conflicts:
#	factors.Rmd
hadley committed Nov 10, 2016
2 parents bb7e185 + 8e45acf commit e772065
Showing 22 changed files with 72 additions and 75 deletions.
8 changes: 4 additions & 4 deletions EDA.Rmd
@@ -59,12 +59,12 @@ The rest of this chapter will look at these two questions. I'll explain what var
"cell", each variable in its own column, and each observation in its own
row.

So far, all the data you've seen so far has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].
So far, all of the data that you've seen has been tidy. In real-life, most data isn't tidy, so we'll come back to these ideas again in [tidy data].

## Variation

**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of variable's values.
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable's values.

### Visualising distributions

@@ -96,7 +96,7 @@ diamonds %>%
count(cut_width(carat, 0.5))
```

A histogram divides the x-axis into equally spaced bins and then uses the height of bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.

You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
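
A hedged sketch of that zoom, assuming the tidyverse is attached so that `diamonds`, `filter()`, and `ggplot()` are available:

```{r}
# Keep only the smaller diamonds, then rebin more finely
smaller <- diamonds %>%
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
```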

@@ -153,7 +153,7 @@ Clusters of similar values suggest that subgroups exist in your data. To underst

* Why might the appearance of clusters be misleading?

The histogram shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.

```{r}
ggplot(data = faithful, mapping = aes(x = eruptions)) +
8 changes: 3 additions & 5 deletions README.md
@@ -1,16 +1,14 @@
# R packages
# R for Data Science

This is code and text behind the [R for data science](http://r4ds.had.co.nz)
This is code and text behind the [R for Data Science](http://r4ds.had.co.nz)
book.

The site is built using [bookdown](https://github.com/rstudio/bookdown)

The R packages used in this book can be installed via

```{r}
devtools::install_github("hadley/r4ds")
```

The site is built using [bookdown package](https://github.com/rstudio/bookdown).
To create the site, you also need:

* [pandoc](http://johnmacfarlane.net/pandoc/)
2 changes: 1 addition & 1 deletion communicate-plots.Rmd
@@ -338,7 +338,7 @@ ggplot(mpg, aes(displ, hwy)) +

Instead of just tweaking the details a little, you can replace the scale altogether. There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.

It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices][diamond-prices] it's easier to see the precise relationship between `carat` and `price` if we log transform them:
It's very useful to plot transformations of your variable. For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:

```{r, fig.align = "default", out.width = "50%"}
ggplot(diamonds, aes(carat, price)) +
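  # hedged sketch: the geom here is an assumption, not the original chunk's code
  geom_bin2d()

# a log-transformed variant, also as a hedged sketch
ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_bin2d()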
6 changes: 3 additions & 3 deletions datetimes.Rmd
@@ -182,7 +182,7 @@ Now that you know how to get date-time data into R's date-time data structures,
### Getting components
You can pull out individual parts of the date with the acccessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
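# A hedged sketch of the accessors applied to this value (outputs as comments)
year(datetime)   # 2016
month(datetime)  # 7
mday(datetime)   # 8
yday(datetime)   # 190
wday(datetime)   # day of the week as a number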
@@ -477,7 +477,7 @@ To find out how many periods fall into an interval, you need to use integer divi

How do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
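
A hedged sketch of the contrast, assuming lubridate is attached: `years()` is a period (human time), while `dyears()` is a duration (a fixed number of seconds).

```{r}
ymd("2016-01-01") + years(1)   # period: the same calendar date one year later
ymd("2016-01-01") + dyears(1)  # duration: fixed physical time, so leap-year 2016
                               # keeps this from landing exactly on 2017-01-01
```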

Figure \@(ref:dt-algebra) summarises permitted arithmetic operations between the different data types.
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.

```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
@@ -503,7 +503,7 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")

Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.

The first challange is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you're American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent). Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".

You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades' worth of time zone rules. In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behaviour, but also the complete history. For example, there are time zones for both "America/New_York" and "America/Detroit". These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
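
A small sketch of how to see these names from R (base R functions, no extra packages needed):

```{r}
Sys.timezone()        # the IANA name R believes the current session uses
length(OlsonNames())  # how many "<area>/<location>" names are available
head(OlsonNames())
```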

2 changes: 1 addition & 1 deletion factors.Rmd
@@ -64,7 +64,7 @@ y2 <- factor(x2, levels = month_levels)
y2
```

If you want an error, you can use `readr::parse_factor()`:
If you want a warning, you can use `readr::parse_factor()`:

```{r}
y2 <- parse_factor(x2, levels = month_levels)
2 changes: 1 addition & 1 deletion import.Rmd
@@ -356,7 +356,7 @@ Time
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware of abbreviations:
if you're American, note that "EST" is a Canadian time zone that does not
have daylight savings time. It is \emph{not} Eastern Standard Time! We'll
have daylight savings time. It is _not_ Eastern Standard Time! We'll
  come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
4 changes: 2 additions & 2 deletions intro.Rmd
@@ -101,7 +101,7 @@ There are four things you need to run the code in this book: R, RStudio, a colle

### R

To download R, go to CRAN, the **comprehensive** **R** **a**rchive **network**. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.
To download R, go to CRAN, the **c**omprehensive **R** **a**rchive **n**etwork. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don't try and pick a mirror that's close to you: instead use the cloud mirror, <https://cloud.r-project.org>, which automatically figures it out for you.

A new major version of R comes out once a year, and there are 2-3 minor releases each year. It's a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.

@@ -141,7 +141,7 @@ Packages in the tidyverse change fairly frequently. You can see if updates are a

### Other packages

There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, are or designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn't make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.

In this book we'll use three data packages from outside the tidyverse:

4 changes: 2 additions & 2 deletions iteration.Rmd
@@ -658,7 +658,7 @@ str(safe_log(10))
str(safe_log("a"))
```

When the function succeeds the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.
When the function succeeds, the `result` element contains the result and the `error` element is `NULL`. When the function fails, the `result` element is `NULL` and the `error` element contains an error object.

`safely()` is designed to work with map:
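
A hedged sketch of that combination, assuming purrr is attached:

```{r}
x <- list(1, 10, "a")
y <- map(x, safely(log))
str(y[[3]])  # $result is NULL; $error holds the error object from log("a")
```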

@@ -914,7 +914,7 @@ x %>%

### Reduce and accumulate

Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:
Sometimes you have a complex list that you want to reduce to a simple list by repeatedly applying a function that reduces a pair to a singleton. This is useful if you want to apply a two-table dplyr verb to multiple tables. For example, you might have a list of data frames, and you want to reduce to a single data frame by joining the elements together:

```{r}
dfs <- list(
6 changes: 3 additions & 3 deletions model-basics.Rmd
@@ -192,7 +192,7 @@ sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
```

These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model by in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.
These are exactly the same values we got with `optim()`! Behind the scenes `lm()` doesn't use `optim()` but instead takes advantage of the mathematical structure of linear models. Using some connections between geometry, calculus, and linear algebra, `lm()` actually finds the closest model in a single step, using a sophisticated algorithm. This approach is both faster, and guarantees that there is a global minimum.

### Exercises

@@ -488,7 +488,7 @@ Note my use of `seq_range()` inside `data_grid()`. Instead of using every unique
```
* `trim = 0.1` will trim off 10% of the tail values. This is useful if the
variables has an long tailed distribution and you want to focus on generating
variables have a long tailed distribution and you want to focus on generating
values near the center:
```{r}
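# A hedged sketch with a heavy-tailed sample; seq_range() is from modelr
x1 <- rcauchy(100)
seq_range(x1, n = 5)               # spans the full range, dominated by extreme values
seq_range(x1, n = 5, trim = 0.10)  # trims 10% of the tail values before spacing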
@@ -552,7 +552,7 @@ model_matrix(df, y ~ x^2 + x)
model_matrix(df, y ~ I(x^2) + x)
```

Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a linear to get arbitrary close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:
Transformations are useful because you can use them to approximate non-linear functions. If you've taken a calculus class, you may have heard of Taylor's theorem which says you can approximate any smooth function with an infinite sum of polynomials. That means you can use a polynomial function to get arbitrarily close to a smooth function by fitting an equation like `y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3`. Typing that sequence by hand is tedious, so R provides a helper function: `poly()`:

```{r}
model_matrix(df, y ~ poly(x, 2))
4 changes: 2 additions & 2 deletions model-building.Rmd
@@ -154,7 +154,7 @@ diamonds2 %>%
arrange(price)
```

Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are a errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.
Nothing really jumps out at me here, but it's probably worth spending time considering if this indicates a problem with our model, or if there are errors in the data. If there are mistakes in the data, this could be an opportunity to buy diamonds that have been priced low incorrectly.

### Exercises

@@ -385,7 +385,7 @@ Either approach is reasonable. Making the transformed variable explicit is usefu

### Time of year: an alternative approach

In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using making our knowledge explicit in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:
In the previous section we used our domain knowledge (how the US school term affects travel) to improve the model. An alternative to using our knowledge explicitly in the model is to give the data more room to speak. We could use a more flexible model and allow that to capture the pattern we're interested in. A simple linear trend isn't adequate, so we could try using a natural spline to fit a smooth curve across the year:

```{r}
library(splines)
6 changes: 3 additions & 3 deletions model-many.Rmd
@@ -13,7 +13,7 @@ In this chapter you're going to learn three powerful ideas that help you to work
1. Using the __broom__ package, by David Robinson, to turn models into tidy
data. This is a powerful technique for working with large numbers of models
because once you have tidy data, you can apply all of the techniques that
you've learned about in earlier in the book.
you've learned about earlier in the book.

We'll start by diving into a motivating example using data about life expectancy around the world. It's a small dataset but it illustrates how important modelling can be for improving your visualisations. We'll use a large number of simple models to partition out some of the strongest signal so we can see the subtler signals that remain. We'll also see how model summaries can help us pick out outliers and unusual trends.

@@ -133,7 +133,7 @@ And we want to apply it to every data frame. The data frames are in a list, so w
models <- map(by_country$data, country_model)
```

However, rather than leaving leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?
However, rather than leaving the list of models as a free-floating object, I think it's better to store it as a column in the `by_country` data frame. Storing related objects in columns is a key part of the value of data frames, and why I think list-columns are such a good idea. In the course of working with these countries, we are going to have lots of lists where we have one element per country. So why not store them all together in one data frame?

In other words, instead of creating a new object in the global environment, we're going to create a new variable in the `by_country` data frame. That's a job for `dplyr::mutate()`:
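
A hedged sketch of that step, reusing the `by_country` and `country_model` objects defined above:

```{r}
by_country <- by_country %>%
  mutate(model = map(data, country_model))
```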

@@ -194,7 +194,7 @@ resids %>%
facet_wrap(~continent)
```

It looks like we've missed some mild pattern. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.
It looks like we've missed some mild patterns. There's also something interesting going on in Africa: we see some very large residuals which suggests our model isn't fitting so well there. We'll explore that more in the next section, attacking it from a slightly different angle.

### Model quality
