TR feedback 2 (hadley#1318)
hadley authored Mar 1, 2023
1 parent bf07203 commit 7cd6215
Showing 8 changed files with 81 additions and 58 deletions.
2 changes: 1 addition & 1 deletion arrow.qmd
@@ -122,7 +122,7 @@ Thanks to arrow, this code will work regardless of how large the underlying data
But it's currently rather slow: on Hadley's computer, it took \~10s to run.
That's not terrible given how much data we have, but we can make it much faster by switching to a better format.

## The parquet format
## The parquet format {#sec-parquet}

To make this data easier to work with, let's switch to the parquet file format and split it up into multiple files.
The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.
23 changes: 17 additions & 6 deletions datetimes.qmd
@@ -174,6 +174,15 @@ You can also force the creation of a date-time from a date by supplying a timezo
ymd("2017-01-31", tz = "UTC")
```

Here I use the UTC[^datetimes-3] timezone, which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude[^datetimes-4].
It doesn't use daylight saving time, making it a bit easier to compute with.

[^datetimes-3]: You might wonder what UTC stands for.
It's a compromise between the English "Coordinated Universal Time" and French "Temps Universel Coordonné".

[^datetimes-4]: No prizes for guessing which country came up with the longitude system.
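
A quick sketch (ours, not the book's) of why the absence of DST makes UTC easier to compute with: adding a 24-hour duration across the US spring-forward boundary changes the clock reading in New York but not in UTC.

```{r}
# Add a 24-hour duration across the US spring-forward on 2023-03-12
ymd_hms("2023-03-11 12:00:00", tz = "UTC") + dhours(24)
ymd_hms("2023-03-11 12:00:00", tz = "America/New_York") + dhours(24)
# The UTC result still reads 12:00; the New York result reads 13:00,
# because daylight saving time skipped an hour overnight
```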

### From individual components

Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns.
@@ -300,6 +309,7 @@ The next section will look at how arithmetic works with date-times.
### Getting components
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
These are effectively the opposites of `make_datetime()`.
```{r}
datetime <- ymd_hms("2026-07-08 12:34:56")
@@ -629,8 +639,8 @@ We can fix this by adding `days(1)` to the arrival time of each overnight flight
flights_dt <- flights_dt |>
mutate(
overnight = arr_time < dep_time,
arr_time = arr_time + days(if_else(overnight, 0, 1)),
sched_arr_time = sched_arr_time + days(overnight * 1)
arr_time = arr_time + days(!overnight),
sched_arr_time = sched_arr_time + days(overnight)
)
```
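
The fix above leans on one small coercion rule, sketched here: when a logical is used where a number is expected, `TRUE` becomes 1 and `FALSE` becomes 0.

```{r}
# TRUE coerces to 1 and FALSE to 0, so days() turns a logical
# into a period of either one day or zero days
days(TRUE)
days(FALSE)
```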

@@ -643,9 +653,10 @@ flights_dt |>

### Intervals {#sec-intervals}

It's obvious what `dyears(1) / ddays(365)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365 days worth of seconds.
What does `dyears(1) / ddays(365)` return?
It's not quite one, because `dyears()` is defined as the number of seconds per average year, which is 365.25 days.

What should `years(1) / days(1)` return?
What does `years(1) / days(1)` return?
Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366!
There's not quite enough information for lubridate to give a single clear answer.
What it does instead is give an estimate:
@@ -676,8 +687,8 @@ y2024 / days(1)
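
The reason a division like `y2024 / days(1)` can be exact is that an interval records its actual start and end dates. A minimal sketch (ours), using a non-leap year:

```{r}
# An interval pins the span to real dates, so division is exact
y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
y2023 / days(1)
```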

### Exercises

1. Explain `days(overnight * 1)` to someone who has just started learning R.
How does it work?
1. Explain `days(!overnight)` and `days(overnight)` to someone who has just started learning R.
What is the key fact you need to know?

2. Create a vector of dates giving the first day of every month in 2015.
Create a vector of dates giving the first day of every month in the *current* year.
43 changes: 25 additions & 18 deletions functions.qmd
@@ -19,6 +19,8 @@ Writing a function has three big advantages over using copy-and-paste:

3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.

A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
In this chapter, you'll learn about three useful types of functions:

@@ -327,21 +329,20 @@ Once you start writing functions, there are two RStudio shortcuts that are super
3. Given a vector of birthdates, write a function to compute the age in years.
4. Write your own functions to compute the variance and skewness of a numeric vector.
Variance is defined as $$
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
$$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
Skewness is defined as $$
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
$$
You can look up the definitions on Wikipedia or elsewhere.
5. Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
6. Read the documentation to figure out what the following functions do.
Why are they useful even though they are so short?
```{r}
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
is_directory <- function(x) {
file.info(x)$isdir
}
is_readable <- function(x) {
file.access(x, 4) == 0
}
```
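
The variance and skewness definitions in exercise 4 translate almost directly into code; here is a sketch (our own function names, not checked for edge cases like `NA`s):

```{r}
variance <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}
skewness <- function(x) {
  sum((x - mean(x))^3) / (length(x) - 2) / variance(x)^(3 / 2)
}
variance(c(1, 2, 3, 4))
var(c(1, 2, 3, 4))  # base R's var() agrees
```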
## Data frame functions
@@ -484,7 +485,8 @@ count_prop <- function(df, var, sort = FALSE) {
diamonds |> count_prop(clarity)
```

This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in ``.
This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables.
Note that we use a default value for `sort` so that if the user doesn't supply their own value it will default to `FALSE`.

Or maybe you want to find the sorted unique values of a variable for a subset of the data.
Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
@@ -499,8 +501,6 @@ unique_where <- function(df, condition, var) {
# Find all the destinations in December
flights |> unique_where(month == 12, dest)
# Which months did plane N14228 fly in?
flights |> unique_where(tailnum == "N14228", month)
```

Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
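
To see both embraced arguments at work, here is one more hypothetical call (the destination is our choice):

```{r}
# Which carriers fly to Houston Intercontinental?
flights |> unique_where(dest == "IAH", carrier)
```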
@@ -509,7 +509,7 @@ We've made all these examples to take a data frame as the first argument, but if
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.

```{r}
flights_sub <- function(rows, cols) {
subset_flights <- function(rows, cols) {
flights |>
filter({{ rows }}) |>
select(time_hour, carrier, flight, {{ cols }})
@@ -527,7 +530,10 @@ You might try writing something like:
count_missing <- function(df, group_vars, x_var) {
df |>
group_by({{ group_vars }}) |>
summarize(n_miss = sum(is.na({{ x_var }})))
summarize(
n_miss = sum(is.na({{ x_var }})),
.groups = "drop"
)
}
flights |>
@@ -541,7 +544,10 @@ We can work around that problem by using the handy `pick()` function, which allo
count_missing <- function(df, group_vars, x_var) {
df |>
group_by(pick({{ group_vars }})) |>
summarize(n_miss = sum(is.na({{ x_var }})))
summarize(
n_miss = sum(is.na({{ x_var }})),
.groups = "drop"
)
}
flights |>
@@ -605,7 +611,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r}
#| eval: false
weather |> standardise_time(sched_dep_time)
weather |> standardize_time(sched_dep_time)
```
2. For each of the following functions, list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
@@ -697,9 +703,9 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
diamonds |> hex_plot(carat, price, depth)
```

### Combining with dplyr
### Combining with other tidyverse

Some of the most useful helpers combine a dash of dplyr with ggplot2.
Some of the most useful helpers combine a dash of data manipulation with ggplot2.
For example, you might want to make a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:

@@ -839,7 +845,7 @@ This makes it very obvious that something unusual is happening.

```{r}
f1 <- function(string, prefix) {
substr(string, 1, nchar(prefix)) == prefix
str_sub(string, 1, str_length(prefix)) == prefix
}
f3 <- function(x, y) {
@@ -851,6 +857,7 @@ This makes it very obvious that something unusual is happening.
3. Make a case for why `norm_r()`, `norm_d()` etc. would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
How could you make the names even clearer?
## Summary
30 changes: 16 additions & 14 deletions iteration.qmd
@@ -144,7 +144,7 @@ Let's motivate this problem with a simple example: what happens if we have some

```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
sample(c(rnorm(n - n_na, mean = mean, sd = 1), rep(NA, n_na)))
sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}
df_miss <- tibble(
@@ -397,22 +397,21 @@ If needed, you could `pivot_wider()` this back to the original form.

### Exercises

1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
1. Practice your `across()` skills by:

2. Compute the mean of every column in `mtcars`.
1. Computing the number of unique values in each column of `palmerpenguins::penguins`.

3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric column.
2. Computing the mean of every column in `mtcars`.

4. What happens if you use a list of functions, but don't name them?
How is the output named?
3. Grouping `diamonds` by `cut`, `clarity`, and `color` then counting the number of observations and computing the mean of each numeric column.

5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`.
Can you explain why?
2. What happens if you use a list of functions in `across()`, but don't name them?
How is the output named?

6. Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
3. Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
Do you need to embrace any arguments?

7. Explain what each step of the pipeline in this function does.
4. Explain what each step of the pipeline in this function does.
What special feature of `where()` are we taking advantage of?

```{r}
@@ -656,6 +655,7 @@ write_csv(gapminder, "gapminder.csv")
```

Now when you come back to this problem in the future, you can read in a single csv file.
For larger and richer datasets, using parquet might be a better choice than `.csv`, as discussed in @sec-parquet.
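
A sketch of that alternative, assuming the arrow package is installed:

```{r}
arrow::write_parquet(gapminder, "gapminder.parquet")
gapminder <- arrow::read_parquet("gapminder.parquet")
```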

```{r}
#| include: false
@@ -733,7 +733,9 @@ files <- paths |>
```

Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills.
One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
One way to do so is with this handy `df_types` function[^iteration-6] that returns a tibble with one row for each column:

[^iteration-6]: We're not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.

```{r}
df_types <- function(df) {
@@ -744,7 +746,7 @@ df_types <- function(df) {
)
}
df_types(starwars)
df_types(gapminder)
```

You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are.
@@ -952,9 +954,9 @@ carat_histogram <- function(df) {
carat_histogram(by_clarity$data[[1]])
```

Now we can use `map()` to create a list of many plots[^iteration-6] and their eventual file paths:
Now we can use `map()` to create a list of many plots[^iteration-7] and their eventual file paths:

[^iteration-6]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
[^iteration-7]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
NOTE: this didn't happen for me.

```{r}
32 changes: 17 additions & 15 deletions joins.qmd
@@ -200,8 +200,7 @@ Surrogate keys can be particularly useful when communicating to other humans: it's
## Basic joins {#sec-mutating-joins}

Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `semi_join()`, `anti_join(), and full_join()`.
They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`. They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
The order of the rows and columns in the output is primarily determined by `x`.

In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
@@ -305,6 +304,10 @@ In older code you might see a different way of specifying the join keys, using a

Now that it exists, we prefer `join_by()` since it provides a clearer and more flexible specification.

`inner_join()`, `right_join()`, and `full_join()` have the same interface as `left_join()`.
The difference is which rows they keep: the left join keeps all rows in `x`, the right join keeps all rows in `y`, the full join keeps all rows in either `x` or `y`, and the inner join keeps only rows that occur in both `x` and `y`.
We'll come back to these in more detail later.
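
A toy sketch (ours) of that difference, using two tiny tibbles:

```{r}
df1 <- tibble(key = c(1, 2), x = c("x1", "x2"))
df2 <- tibble(key = c(2, 3), y = c("y2", "y3"))

df1 |> inner_join(df2, join_by(key))  # key 2 only
df1 |> left_join(df2, join_by(key))   # keys 1 and 2
df1 |> right_join(df2, join_by(key))  # keys 2 and 3
df1 |> full_join(df2, join_by(key))   # keys 1, 2, and 3
```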

### Filtering joins

As you might guess, the primary action of a **filtering join** is to filter the rows.
@@ -464,9 +467,6 @@ knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)

In an actual join, matches will be indicated with dots, as in @fig-join-inner.
The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
The join shown here is a so-called **equi** **inner join**, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both `x` and `y`.
Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just call it an inner join.
We'll come back to non-equi joins in @sec-non-equi-joins.

```{r}
#| label: fig-join-inner
@@ -572,6 +572,10 @@ However, this is not a great representation because while it might jog your memo
knitr::include_graphics("diagrams/join/venn.png", dpi = 270)
```

The joins shown here are the so-called **equi joins**, where rows match if the keys are equal.
Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just say "inner join" rather than "equi inner join".
We'll come back to non-equi joins in @sec-non-equi-joins.

### Row matching

So far we've explored what happens if a row in `x` matches zero or one rows in `y`.
@@ -620,8 +624,6 @@ df1 |>
inner_join(df2, join_by(key))
```

This is one reason we like `left_join()` --- if it runs without warning, you know that each row of the output matches the row in the same position in `x`.

You can gain further control over row matching with two arguments:

- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
Expand Down Expand Up @@ -850,7 +852,7 @@ That leads to the following party days:
```{r}
parties <- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)
```

@@ -859,7 +861,7 @@ Now imagine that you have a table of employee birthdays:
```{r}
employees <- tibble(
name = sample(babynames::babynames$name, 100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
```
@@ -896,9 +898,9 @@ So it might be better to be explicit about the date ranges that each party sp
```{r}
parties <- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
)
parties
```
@@ -917,9 +919,9 @@ Oops, there is an overlap, so let's fix that problem and continue:
```{r}
parties <- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)
```
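
With explicit `start` and `end` columns in place, the non-equi join this setup is building toward can be sketched like this:

```{r}
# Match each employee to the party in their quarter
employees |>
  inner_join(parties, join_by(between(birthday, start, end)))
```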

2 changes: 1 addition & 1 deletion logicals.qmd
@@ -544,7 +544,7 @@ if_else(TRUE, "a", 1)
case_when(
x < -1 ~ TRUE,
x > 0 ~ lubridate::now()
x > 0 ~ now()
)
```

