Switch from I to we
hadley committed Aug 9, 2022
1 parent c6b1f50 commit 1d0902c
Showing 22 changed files with 145 additions and 152 deletions.
24 changes: 10 additions & 14 deletions EDA.qmd
@@ -65,7 +65,7 @@ You can loosely word these questions as:
2. What type of covariation occurs between my variables?

The rest of this chapter will look at these two questions.
I'll explain what variation and covariation are, and I'll show you several ways to answer each question.
We'll explain what variation and covariation are, and we'll show you several ways to answer each question.
To make the discussion easier, let's define some terms:

- A **variable** is a quantity, quality, or property that you can measure.
@@ -75,7 +75,7 @@ To make the discussion easier, let's define some terms:

- An **observation** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object).
An observation will contain several values, each associated with a different variable.
I'll sometimes refer to an observation as a data point.
We'll sometimes refer to an observation as a data point.

- **Tabular data** is a set of values, each associated with a variable and an observation.
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
@@ -166,7 +166,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
```

If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`.
If you wish to overlay multiple histograms in the same plot, we recommend using `geom_freqpoly()` instead of `geom_histogram()`.
`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, it uses lines.
It's much easier to understand overlapping lines than bars.
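
As a minimal sketch of this idea, the overlay might look like the following. It assumes that `smaller` is the subset of `diamonds` with `carat < 3` (its definition is not shown here) and that `cut` is the grouping variable; both are assumptions made for illustration.

```r
library(ggplot2)

# Assumption: `smaller` keeps only diamonds under 3 carats.
smaller <- subset(diamonds, carat < 3)

# One frequency polygon per cut, drawn as coloured lines in a single panel.
p <- ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)
p
```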

@@ -190,7 +190,7 @@ There are a few challenges with this type of plot, which we will come back to in

Now that you can visualize variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
We've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information.
The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).

### Typical values
@@ -354,10 +354,10 @@ If you've encountered unusual values in your dataset, and simply want to move on
filter(between(y, 3, 20))
```
I don't recommend this option: just because one measurement is invalid doesn't mean all the measurements are.
We don't recommend this option: just because one measurement is invalid doesn't mean all the measurements are.
Additionally, if you have low quality data, by the time that you've applied this approach to every variable you might find that you don't have any data left!
2. Instead, I recommend replacing the unusual values with missing values.
2. Instead, we recommend replacing the unusual values with missing values.
The easiest way to do this is to use `mutate()` to replace the variable with a modified copy.
You can use the `if_else()` function to replace unusual values with `NA`:
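
A minimal sketch of that replacement, assuming the `y` column of `diamonds` and the same 3 to 20 range used in the `filter()` call above. The name `diamonds2` is hypothetical.

```r
library(dplyr)
library(ggplot2) # only needed for the diamonds dataset

# Keep every row, but turn implausible y values into missing values.
# NA_real_ matches the type of y (a double) in any version of dplyr.
diamonds2 <- diamonds |>
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))
```

Unlike filtering, this keeps the other measurements in each affected row available for later analysis.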
@@ -936,7 +936,7 @@ ggplot(faithful, aes(eruptions)) +

Sometimes we'll turn the end of a pipeline of data transformation into a plot.
Watch for the transition from `|>` to `+`.
I wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.

```{r}
#| eval: false
@@ -955,11 +955,7 @@ diamonds |>

## Learning more

If you want to learn more about the mechanics of ggplot2, I'd highly recommend reading the [ggplot2 book](https://ggplot2-book.org).
It's been recently updated and has much more space to explore all the facets of visualization.
If you want to learn more about the mechanics of ggplot2, we highly recommend reading the [ggplot2 book](https://ggplot2-book.org).
Another useful resource is the [*R Graphics Cookbook*](https://r-graphics.org) by Winston Chang.

Another useful resource is the [*R Graphics Cookbook*](https://www.amazon.com/Graphics-Cookbook-Practical-Recipes-Visualizing/dp/1449316956) by Winston Chang.
Much of the contents are available online at <http://www.cookbook-r.com/Graphs/>.

I also recommend [*Graphical Data Analysis with R*](https://www.amazon.com/Graphical-Data-Analysis-Chapman-Hall/dp/1498715230), by Antony Unwin.
This is a book-length treatment similar to the material covered in this chapter, but has the space to go into much greater depth.
<!--# TODO: add Claus + Kieran books -->
34 changes: 18 additions & 16 deletions communicate-plots.qmd
@@ -19,9 +19,9 @@ To help others quickly build up a good mental model of the data, you will need t
In this chapter, you'll learn some of the tools that ggplot2 provides to do so.

This chapter focuses on the tools you need to create good graphics.
I assume that you know what you want, and just need to know how to do it.
For that reason, I highly recommend pairing this chapter with a good general visualisation book.
I particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo.
We assume that you know what you want, and just need to know how to do it.
For that reason, we highly recommend pairing this chapter with a good general visualisation book.
We particularly like [*The Truthful Art*](https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo.
It doesn't teach the mechanics of creating visualisations, but instead focuses on what you need to think about in order to create effective graphics.

### Prerequisites
@@ -165,7 +165,7 @@ ggplot(mpg, aes(displ, hwy)) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
```

Note another handy technique used here: I added a second layer of large, hollow points to highlight the points that I've labelled.
Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.

You can sometimes use the same idea to replace the legend with labels placed directly on the plot.
It's not wonderful for this plot, but it isn't too bad.
@@ -221,7 +221,7 @@ ggplot(mpg, aes(displ, hwy)) +
geom_text(aes(label = label), data = label, vjust = "top", hjust = "right")
```

In these examples, I manually broke the label up into lines using `"\n"`.
In these examples, we manually broke the label up into lines using `"\n"`.
Another approach is to use `stringr::str_wrap()` to automatically add line breaks, given the number of characters you want per line:

```{r}
@@ -263,7 +263,7 @@ Remember, in addition to `geom_text()`, you have many other geoms in ggplot2 ava
A few ideas:

- Use `geom_hline()` and `geom_vline()` to add reference lines.
I often make them thick (`size = 2`) and white (`colour = "white"`), and draw them underneath the primary data layer.
We often make them thick (`size = 2`) and white (`colour = "white"`), and draw them underneath the primary data layer.
That makes them easy to see, without drawing attention away from the data.

- Use `geom_rect()` to draw a rectangle around points of interest.
@@ -699,7 +699,7 @@ file.remove("my-plot.pdf")
If you don't specify the `width` and `height` they will be taken from the dimensions of the current plotting device.
For reproducible code, you'll want to specify them.

Generally, however, I think you should be assembling your final reports using R Markdown, so I want to focus on the important code chunk options that you should know about for graphics.
Generally, however, we recommend that you assemble your final reports using R Markdown, so we focus on the important code chunk options that you should know about for graphics.
You can learn more about `ggsave()` in the documentation.

### Figure sizing
@@ -710,18 +710,20 @@ The biggest challenge of graphics in R Markdown is getting your figures the righ
There are five main options that control figure sizing: `fig.width`, `fig.height`, `fig.asp`, `out.width` and `out.height`.
Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).

I only ever use three of the five options:
<!-- TODO: https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/ -->

- I find it most aesthetically pleasing for plots to have a consistent width.
To enforce this, I set `fig.width = 6` (6") and `fig.asp = 0.618` (the golden ratio) in the defaults.
Then in individual chunks, I only adjust `fig.asp`.
We recommend three of the five options:

- I control the output size with `out.width` and set it to a percentage of the line width.
I default to `out.width = "70%"` and `fig.align = "center"`.
- Plots tend to be more aesthetically pleasing if they have consistent width.
To enforce this, set `fig.width = 6` (6") and `fig.asp = 0.618` (the golden ratio) in the defaults.
Then in individual chunks, only adjust `fig.asp`.

- Control the output size with `out.width` and set it to a percentage of the line width.
We suggest `out.width = "70%"` and `fig.align = "center"`.
That gives plots room to breathe, without taking up too much space.

- To put multiple plots in a single row I set the `out.width` to `50%` for two plots, `33%` for 3 plots, or `25%` for 4 plots, and set `fig.align = "default"`.
Depending on what I'm trying to illustrate (e.g. show data or show plot variations), I'll also tweak `fig.width`, as discussed below.
- To put multiple plots in a single row, set the `out.width` to `50%` for two plots, `33%` for 3 plots, or `25%` for 4 plots, and set `fig.align = "default"`.
Depending on what you're trying to illustrate (e.g. show data or show plot variations), you might also tweak `fig.width`, as discussed below.

If you find that you're having to squint to read the text in your plot, you need to tweak `fig.width`.
If `fig.width` is larger than the size the figure is rendered in the final doc, the text will be too small; if `fig.width` is smaller, the text will be too big.
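
The defaults suggested in the list above could be set once in a setup chunk. This is a sketch using `knitr::opts_chunk$set()`; where the call is placed in your document is an assumption, and individual chunks can still override any of these values.

```r
# Document-wide figure defaults: consistent width, golden-ratio aspect,
# and output scaled to 70% of the line width, centred.
knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = 0.618,
  out.width = "70%",
  fig.align = "center"
)
```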
@@ -760,7 +762,7 @@ For example, if your default `fig.width` is 6 and `out.width` is 0.7, when you s

### Other important options

When mingling code and text, like I do in this book, I recommend setting `fig.show = "hold"` so that plots are shown after the code.
When mingling code and text, like in this book, you can set `fig.show = "hold"` so that plots are shown after the code.
This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.

To add a caption to the plot, use `fig.cap`.
10 changes: 5 additions & 5 deletions data-visualize.qmd
@@ -15,7 +15,7 @@ R has several systems for making graphs, but ggplot2 is one of the most elegant
ggplot2 implements the **grammar of graphics**, a coherent system for describing and building graphs.
With ggplot2, you can do more faster by learning one system and applying it in many places.

If you'd like to learn more about the theoretical underpinnings of ggplot2, I recommend reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>.
If you'd like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading "The Layered Grammar of Graphics", <http://vita.had.co.nz/papers/layered-grammar.pdf>, the scientific paper that describes them.

### Prerequisites

@@ -91,7 +91,7 @@ Does this confirm or refute your hypothesis about fuel efficiency and engine siz
With ggplot2, you begin a plot with the function `ggplot()`.
`ggplot()` creates a coordinate system that you can add layers to.
The first argument of `ggplot()` is the dataset to use in the graph.
So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so I'm not going to show it here.
So `ggplot(data = mpg)` creates an empty graph, but it's not very interesting so we won't show it here.

You complete your graph by adding one or more layers to `ggplot()`.
The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot.
@@ -364,7 +364,7 @@ ggplot(shapes, aes(x, y)) +
As you start to run R code, you're likely to run into problems.
Don't worry --- it happens to everyone.
I have been writing R code for years, and every day I still write code that doesn't work!
We have all been writing R code for years, but every day we still write code that doesn't work!
Start by carefully comparing the code that you're running to the code in the book.
R is extremely picky, and a misplaced character can make all the difference.
@@ -728,7 +728,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
3. What does `show.legend = FALSE` do?
What happens if you remove it?\
Why do you think I used it earlier in the chapter?
Why do you think we used it earlier in the chapter?
4. What does the `se` argument to `geom_smooth()` do?
@@ -862,7 +862,7 @@ This means that you can typically use geoms without worrying about the underlyin
However, there are three reasons why you might need to use a stat explicitly:

1. You might want to override the default stat.
In the code below, I change the stat of `geom_bar()` from count (the default) to identity.
In the code below, we change the stat of `geom_bar()` from count (the default) to identity.
This lets us map the height of the bars to the raw values of a $y$ variable.
Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.
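
As a minimal sketch of overriding the default stat, assume a small data frame in which the bar heights are already computed. The data frame `demo` and its values are hypothetical and exist only for illustration.

```r
library(ggplot2)

# Hypothetical pre-computed counts: one row per bar, height already known.
demo <- data.frame(
  group = c("a", "b", "c"),
  n = c(10, 25, 5)
)

# stat = "identity" uses the y values as given instead of counting rows.
p <- ggplot(data = demo, mapping = aes(x = group, y = n)) +
  geom_bar(stat = "identity")
p
```

`geom_col()` gives the same result, because it uses the identity stat by default.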

10 changes: 5 additions & 5 deletions databases.qmd
@@ -130,7 +130,7 @@ dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)
```

If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
If you're using duckdb in a real project, we highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.

## DBI basics
@@ -159,7 +159,7 @@ con |>
as_tibble()
```

`dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.
`dbReadTable()` returns a `data.frame` so we use `as_tibble()` to convert it into a tibble so that it prints nicely.

In real life, it's rare that you'll use `dbReadTable()` because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.
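
One way to bring back only a subset is to send a query with `DBI::dbGetQuery()`. The sketch below assumes an in-memory duckdb connection holding a copy of `ggplot2::mpg`; the query and the result name `efficient` are illustrative, not part of the chapter.

```r
library(DBI)

# Assumption: a temporary in-memory duckdb database with the mpg table.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "mpg", ggplot2::mpg)

# Only three columns, and only the rows that match, are returned to R.
sql <- "SELECT manufacturer, model, hwy FROM mpg WHERE hwy > 35"
efficient <- dbGetQuery(con, sql)

dbDisconnect(con, shutdown = TRUE)
```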

@@ -255,7 +255,7 @@ Then, once you're ready to analyse the data with functions that are unique to R,
## SQL

The rest of the chapter will teach you a little SQL through the lens of dbplyr.
It's a rather non-traditional introduction to SQL but I hope it will get you quickly up to speed with the basics.
It's a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics.
Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.

We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
@@ -446,7 +446,7 @@ flights |>
summarise(delay = mean(arr_delay))
```

If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.

In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:

@@ -674,7 +674,7 @@ dbplyr's translations are certainly not perfect, and there are many R functions
### Learning more
If you've finished this chapter and would like to learn more about SQL,
I have two recommendations:
we have two recommendations:

- [*SQL for Data Scientists*](https://sqlfordatascientists.com) by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
- [*Practical SQL*](https://www.practicalsql.com) by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
8 changes: 4 additions & 4 deletions datetimes.qmd
@@ -19,7 +19,7 @@ To warm up, try these three seemingly simple questions:
- Does every day have 24 hours?
- Does every minute have 60 seconds?

I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
We're sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
(It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25.
You might not have known that some minutes have 61 seconds: every now and then, leap seconds are added because the Earth's rotation is gradually slowing down.

@@ -53,7 +53,7 @@ There are three types of date/time data that refer to an instant in time:

- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second).
Tibbles print this as `<dttm>`.
Elsewhere in R these are called POSIXct, but I don't think that's a very useful name.
Elsewhere in R these are called POSIXct, but that's not a very useful name.

In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times.
If you need one, you can use the **hms** package.
@@ -135,7 +135,7 @@ flights |>

Let's do the same thing for each of the four time columns in `flights`.
The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components.
Once I've created the date-time variables, I focus in on the variables we'll explore in the rest of the chapter.
Once we've created the date-time variables, we focus in on the variables we'll explore in the rest of the chapter.
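
To make the modulus arithmetic concrete, here is a small sketch that assumes a time stored as a number in HHMM form. The value `517` is a made-up example.

```r
# A departure time stored as a number in HHMM form (hypothetical value).
time <- 517

hour <- time %/% 100   # integer division keeps the leading digits: 5
minute <- time %% 100  # the remainder keeps the last two digits: 17
```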

```{r}
make_datetime_100 <- function(year, month, day, time) {
@@ -155,7 +155,7 @@ flights_dt <- flights |>
flights_dt
```

With this data, I can visualise the distribution of departure times across the year:
With this data, we can visualise the distribution of departure times across the year:

```{r}
flights_dt |>