
Commit

Tweak figures throughout book
hadley committed Jul 18, 2016
1 parent 061e233 commit 11294f5
Showing 8 changed files with 41 additions and 48 deletions.
7 changes: 6 additions & 1 deletion _common.R
@@ -4,7 +4,12 @@ options(digits = 3)
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE,
-  cache = TRUE
+  cache = TRUE,
+  out.width = "70%",
+  fig.align = 'center',
+  fig.width = 6,
+  fig.asp = 0.618, # 1 / phi
+  fig.show = "hold"
)

options(dplyr.print_min = 6, dplyr.print_max = 6)
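
These knitr defaults now apply book-wide; any individual chunk can still override them, which is what the per-chunk tweaks in the files below do. A minimal editorial sketch of such an override (hypothetical chunk, not part of the commit):

```{r, fig.asp = 1, out.width = "30%", fig.width = 3}
# Chunk-local options take precedence over the opts_chunk$set() defaults above.
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) + geom_point()
```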
2 changes: 1 addition & 1 deletion explore.Rmd
@@ -23,7 +23,7 @@ circle %>%

While we may stumble over raw data, we can easily process visual information. Within your mind is a powerful visual processing system fine-tuned by millions of years of evolution. As a result, often the quickest way to understand your data is to visualize it. Once you plot your data, you can instantly see the relationships between values. Here, we see that the values fall on a circle.

- ```{r echo=FALSE, dependson=data}
+ ```{r echo=FALSE, dependson = data, fig.asp = 1, out.width = "30%", fig.width = 3}
ggplot(circle, aes(x, y)) +
geom_point() +
coord_fixed()
8 changes: 4 additions & 4 deletions intro.Rmd
@@ -69,7 +69,7 @@ There are some important topics that this book doesn't cover. We believe it's im

This book proudly focuses on small, in-memory datasets. This is the right place to start because you can't tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care you can typically use them to work with 1-2 Gb of data. If you're routinely working with larger data (10-100 Gb, say), you should learn more about [data.table](https://github.com/Rdatatable/data.table). This book doesn't teach data.table because its very concise interface offers fewer linguistic cues, which makes it harder to learn. But if you're working with large data, the performance payoff is worth the extra effort required to learn it.
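
For a sense of that trade-off, here is an editorial sketch of the same grouped summary in dplyr and data.table (hypothetical data frame `df`, not from the book):

```r
library(dplyr)
library(data.table)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# dplyr: explicit verbs, one step per operation
df %>% group_by(g) %>% summarise(mean_x = mean(x))

# data.table: the same operation in one concise call
dt <- as.data.table(df)
dt[, .(mean_x = mean(x)), by = g]
```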

- If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [[Data transformation]].
+ If your data is bigger than this, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration. We'll touch on this idea in [data transformation].

Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop) that allows you to send different datasets to different computers for processing. Once you've figured out how to answer the question for a single subset using the tools described in this book, you can use packages like sparklyr, rhipe, and ddr to solve it for the full dataset.
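
The pattern is easy to sketch in base R, assuming each subset fits in memory (editorial illustration; `mtcars` stands in for a much larger dataset):

```r
# The "many small problems" setup: fit the same model independently to
# each subset of a dataset. Each fit touches only one subset at a time.
models <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
lapply(models, coef)
```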

@@ -101,7 +101,7 @@ The complement of hypothesis generation is hypothesis confirmation. Hypothesis c
This means to do hypothesis confirmation you need to "preregister"
(write out in advance) your analysis plan, and not deviate from it
even when you have seen the data. We'll talk a little about some
- strategies you can use to make this easier in [[model assessment]].
+ strategies you can use to make this easier in [model assessment].

It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation. But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation. The key difference is how often you look at each observation: if you look only once, it's confirmation; if you look more than once, it's exploration.

@@ -131,7 +131,7 @@ To run the code in this book, you will need to install both R and the RStudio ID

RStudio is an integrated development environment, or IDE, for R programming. There are three key regions:

- ```{r echo = FALSE}
+ ```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/intro-rstudio.png")
```

@@ -151,7 +151,7 @@ If you want to see a list of all keyboard shortcuts, use the meta shortcut Alt +

We strongly recommend making two changes to the default RStudio options:

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "75%"}
knitr::include_graphics("screenshots/rstudio-workspace.png")
```

16 changes: 2 additions & 14 deletions model-basics.Rmd
@@ -1,15 +1,3 @@
- ```{r include=FALSE, cache=FALSE}
- set.seed(1014)
- options(digits = 3)
- knitr::opts_chunk$set(
-   comment = "#>",
-   collapse = TRUE,
-   cache = TRUE
- )
- options(dplyr.print_min = 6, dplyr.print_max = 6)
- ```
# Model

The goal of a fitted model is to provide a simple low-dimensional summary of a dataset. Ideally, the fitted model will capture true "signals" (i.e. patterns generated by the phenomenon of interest), and ignore "noise" (i.e. random variation that you're not interested in).
@@ -667,7 +655,7 @@ One way to do this is to use `condvis::visualweight()`.
### Transformations
- ```{r}
+ ```{r, dev = "png"}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
@@ -700,7 +688,7 @@ Iteratively re-fit the model down-weighting outlying points (points with high re

### Additive models

- ```{r}
+ ```{r, dev = "png"}
library(mgcv)
gam(income ~ s(education), data = heights)
2 changes: 1 addition & 1 deletion model-many.Rmd
@@ -77,7 +77,7 @@ One way is to use the same approach as in the last chapter: there's a strong sig

You already know how to do that if we had a single country:

```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.show = "hold"}
```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.align='default'}
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +
18 changes: 9 additions & 9 deletions tidy.Rmd
@@ -61,7 +61,7 @@ R follows a set of conventions that makes one layout of tabular data much easier

Data that satisfies these rules is known as *tidy data*. Notice that `table1` is tidy data.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-1.png")
```

@@ -75,7 +75,7 @@ Tidy data works well with R because it takes advantage of R's traits as a vector

Tidy data arranges values so that the relationships between variables in a dataset will parallel the relationship between vectors in R's storage objects. R stores tabular data as a data frame, a list of atomic vectors arranged to look like a table. Each column in the table is an atomic vector in the list. In tidy data, each variable in the dataset is assigned to its own column, i.e., its own vector in the data frame.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-2.png")
```
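
A quick editorial illustration of that claim (not from the diff):

```r
# A data frame is a list of equal-length atomic vectors, one per column.
df <- data.frame(x = c(1, 2), y = c("a", "b"))
is.list(df)
#> [1] TRUE
vapply(df, is.atomic, logical(1))
#>    x    y
#> TRUE TRUE
```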

@@ -110,7 +110,7 @@ table1$population / table1$cases

To create the output, R applies the function in element-wise fashion: R first applies the function (or operation) to the first elements of each vector involved. Then R applies the function (or operation) to the second elements of each vector involved, and so on until R reaches the end of the vectors. If one vector is shorter than the others, R will recycle its values as needed (according to a set of recycling rules).

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-3.png")
```
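
For illustration, element-wise application and recycling look like this (editorial sketch):

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20)
x + y  # y is recycled to c(10, 20, 10, 20)
#> [1] 11 22 13 24
```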

@@ -130,7 +130,7 @@ If you use basic R syntax, your calculations will look like the code below. If y

#### Dataset one

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-4.png")
```

Expand All @@ -143,7 +143,7 @@ table1$cases / table1$population * 10000

#### Dataset two

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-5.png")
```

Expand All @@ -160,7 +160,7 @@ table2$value[case_rows] / table2$value[pop_rows] * 10000

#### Dataset three

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-6.png")
```

Expand All @@ -173,7 +173,7 @@ Dataset three combines the values of cases and population into the same cells. I

#### Dataset four

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-7.png")
```

Expand Down Expand Up @@ -257,7 +257,7 @@ spread(table2, key, value)

`spread()` returns a copy of your dataset that has had the key and value columns removed. In their place, `spread()` adds a new column for each unique key in the key column. These unique keys will form the column names of the new columns. `spread()` distributes the cells of the former value column across the cells of the new columns and truncates any non-key, non-value columns in a way that prevents duplication.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-8.png")
```
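
A self-contained editorial sketch of that behaviour, using a hypothetical table with the same shape as `table2`:

```r
library(tidyr)

df <- data.frame(
  country = c("A", "A", "B", "B"),
  key     = c("cases", "population", "cases", "population"),
  value   = c(10, 1000, 20, 2000)
)

spread(df, key, value)
#>   country cases population
#> 1       A    10       1000
#> 2       B    20       2000
```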

Expand Down Expand Up @@ -291,7 +291,7 @@ gather(table4, "year", "cases", 2:3)

We've placed "key" in quotation marks because you will usually use `gather()` to create tidy data. In this case, the "key" column will contain values, not keys. The values will only be keys in the sense that they were formally in the column names, a place where keys belong.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-9.png")
```
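
And the mirror-image editorial sketch for `gather()`, with a hypothetical table shaped like `table4` (year values stored as column names):

```r
library(tidyr)

df <- data.frame(country = c("A", "B"),
                 `1999` = c(10, 20),
                 `2000` = c(12, 22),
                 check.names = FALSE)

gather(df, "year", "cases", 2:3)
#>   country year cases
#> 1       A 1999    10
#> 2       B 1999    20
#> 3       A 2000    12
#> 4       B 2000    22
```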

14 changes: 7 additions & 7 deletions variation.Rmd
@@ -254,7 +254,7 @@ If you've encountered unusual values in your dataset, and simply want to move on
ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but does warn that they've been removed:
- ```{r}
+ ```{r, dev = "png"}
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
```
@@ -336,7 +336,7 @@ Another alternative to display the distribution of a continuous variable broken
* A line (or whisker) that extends from each end of the box and goes to the
farthest non-outlier point in the distribution.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/EDA-boxplot.png")
```
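
The corresponding code, in the style the chapter uses elsewhere (editorial sketch):

```r
library(ggplot2)

# One box per category: the box spans the IQR, the line marks the median,
# and the whiskers extend as described above.
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
```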

@@ -441,14 +441,14 @@ If the categorical variables are unordered, you might want to use the seriation

You've already seen one great way to visualise the covariation between two continuous variables: draw a scatterplot with `geom_point()`. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.

- ```{r}
+ ```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price))
```

Scatterplots become less useful as the size of your dataset grows, because points begin to pile up into areas of uniform black (as above). This problem is known as __overplotting__. A similar problem arises when you use a scatterplot to show the distribution of price by cut:

- ```{r}
+ ```{r, dev = "png"}
ggplot(data = diamonds, mapping = aes(x = price, y = cut)) +
geom_point()
```
@@ -457,7 +457,7 @@ And we can fix it in the same way: by using binning. Previously you used `geom_h

`geom_bin2d()` and `geom_hex()` divide the coordinate plane into two dimensional bins and then use a fill color to display how many points fall into each bin. `geom_bin2d()` creates rectangular bins. `geom_hex()` creates hexagonal bins. You will need to install the hexbin package to use `geom_hex()`.

- ```{r fig.show='hold', fig.asp = 1, out.width = "50%"}
+ ```{r, fig.asp = 1, out.width = "50%", fig.align = "default"}
ggplot(data = smaller) +
geom_bin2d(aes(x = carat, y = price))
@@ -502,7 +502,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
unusual combination of $x$ and $y$ values, which makes the points outliers
even though their $x$ and $y$ values appear normal when examined separately.

- ```{r}
+ ```{r, dev = "png"}
ggplot(data = diamonds) +
geom_point(aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
@@ -535,7 +535,7 @@ Patterns provide one of the most useful tools for data scientists because they r

Models are a rich tool for extracting patterns out of data. For example, consider the diamonds data. It's hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It's possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.

- ```{r}
+ ```{r, dev = "png"}
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
22 changes: 11 additions & 11 deletions visualize.Rmd
@@ -212,7 +212,7 @@ If you get an odd result, double check that you are calling the aesthetic as its
How are these two plots similar?
- ```{r echo = FALSE, out.width = "50%"}
+ ```{r echo = FALSE, out.width = "50%", fig.align="default"}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
@@ -263,7 +263,7 @@ Next to each geom is a visual representation of the geom. Beneath the geom is a

To learn more about any single geom, open its help page in R by running the command `?` followed by the name of the geom function, e.g. `?geom_smooth`.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-geoms-1.png")
knitr::include_graphics("images/visualization-geoms-2.png")
knitr::include_graphics("images/visualization-geoms-3.png")
@@ -274,7 +274,7 @@ Many geoms use a single object to describe all of the data. For example, `geom_s

In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the `linetype` example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

- ```{r, fig.show='hold', fig.height = 2.5, fig.width = 2.5, out.width = "33%"}
+ ```{r, fig.asp = 1, fig.width = 2.5, fig.align = 'default', out.width = "33%"}
ggplot(diamonds) +
geom_smooth(aes(x = carat, y = price))
@@ -518,13 +518,13 @@ Some graphs, like scatterplots, plot the raw values of your dataset. Other graph

ggplot2 calls the algorithm that a graph uses to calculate new values a _stat_, which is short for statistical transformation. Each geom in ggplot2 is associated with a default stat that it uses to calculate values to plot. The figure below describes how this process works with `geom_bar()`.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stat-bar.png")
```

A few geoms, like `geom_point()`, plot your raw data as it is. These geoms also apply a transformation to your data, the identity transformation, which returns the data in its original state. Now we can say that _every_ geom uses a stat.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stat-point.png")
```
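
An editorial sketch of what "every geom uses a stat" means in practice (`stat_identity()`'s default geom is a point):

```r
library(ggplot2)

# geom_point() pairs with stat_identity(), which passes the data through
# unchanged -- so these two calls draw the same plot.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + stat_identity(mapping = aes(x = displ, y = hwy))
```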

@@ -575,15 +575,15 @@ For `geom_count()`, the `..prop..` variable does not do anything useful until yo

ggplot2 provides over 20 stats for you to use. Each stat is saved as a function, which provides a convenient way to access a stat's help page, e.g. `?stat_identity`. The table below describes each stat in ggplot2 and lists the parameters that the stat takes, as well as the variables that the stat makes.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-stats.png")
```

## Coordinate systems

Let's leave the Cartesian coordinate system and examine the polar coordinate system. We will begin with a riddle: how is a bar chart similar to a coxcomb plot, like the one below?

- ```{r echo = FALSE, fig.show='hold', fig.width=3, fig.height=4, out.width = "50%"}
+ ```{r echo = FALSE, fig.width=3, fig.height=4, out.width = "50%", fig.align = "default"}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
ggplot(data = diamonds) +
@@ -620,7 +620,7 @@ ggplot2 comes with eight coordinate functions that you can use in the same way a

You can learn more about each coordinate system by opening its help page in R, e.g. `?coord_cartesian`, `?coord_fixed`, `?coord_flip`, `?coord_map`, `?coord_polar`, and `?coord_trans`.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-coordinate-systems.png")
```
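
For example, `coord_flip()` swaps the x and y axes, which is handy for horizontal boxplots (editorial sketch in the chapter's style):

```r
library(ggplot2)

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()
```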

@@ -693,19 +693,19 @@ The seven parameters in the template compose the grammar of graphics, a formal s

To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-1.png")
```

Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-2.png")
```

You'd then select a coordinate system to place the geoms into. You'd use the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (facetting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.

- ```{r, echo = FALSE}
+ ```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/visualization-grammar-3.png")
```


