TR review feedback for logicals-factors (hadley#1310)
hadley authored Feb 27, 2023
1 parent b03248a commit c0f0375
Showing 5 changed files with 104 additions and 154 deletions.
31 changes: 14 additions & 17 deletions factors.qmd
@@ -12,10 +12,11 @@ status("complete")
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.

We'll start by motivating why factors are needed for data analysis and how you can create them with `factor()`.
We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
We'll start by motivating why factors are needed for data analysis[^factors-1] and how you can create them with `factor()`. We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
You'll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

[^factors-1]: They're also really important for modelling.

### Prerequisites

Base R provides some basic tools for creating and manipulating factors.
@@ -77,7 +78,7 @@ y2 <- factor(x2, levels = month_levels)
y2
```

This seems risky, so you might want to use `fct()` instead:
This seems risky, so you might want to use `forcats::fct()` instead:

```{r}
#| error: true
@@ -90,21 +91,17 @@ If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```

Sometimes you'd prefer that the order of the levels matches the order of the first appearance in the data.
You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:
Sorting alphabetically is slightly risky because not every computer will sort strings in the same way.
So `forcats::fct()` orders by first appearance:

```{r}
f1 <- factor(x1, levels = unique(x1))
f1
f2 <- x1 |> factor() |> fct_inorder()
f2
fct(x1)
```

If you ever need to access the set of valid levels directly, you can do so with `levels()`:

```{r}
levels(f2)
levels(y2)
```

You can also create a factor when reading your data with readr with `col_factor()`:
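For instance, a minimal sketch (assuming the tidyverse is loaded and `month_levels` is the vector defined above):

```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```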
@@ -169,7 +166,6 @@ For example, imagine you want to explore the average number of hours spent watch
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
@@ -223,7 +219,6 @@ rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
@@ -274,19 +269,21 @@ This makes the plot easier to read because the colors of the line at the far rig
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(na.rm = TRUE)
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(color = "marital")
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")
```

Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
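For example, a quick sketch with `gss_cat` (pairing `fct_infreq()` with `fct_rev()` so the most frequent level ends up last):

```{r}
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```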
60 changes: 28 additions & 32 deletions logicals.qmd
@@ -137,14 +137,14 @@ NA == NA
It's easiest to understand why this is true if we artificially supply a little more context:

```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# We don't know how old Mary is
age_mary <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# We don't know how old John is
age_john <- NA
# Are John and Mary the same age?
x == y
age_mary == age_john
# We don't know!
```

@@ -191,13 +191,14 @@ We'll come back to cover missing values in more depth in @sec-missing-values.

### Exercises

1. How does `dplyr::near()` work? Type `near` to see the source code.
1. How does `dplyr::near()` work? Type `near` to see the source code. Is `sqrt(2)^2` near 2?
2. Use `mutate()`, `is.na()`, and `count()` together to describe how the missing values in `dep_time`, `sched_dep_time`, and `dep_delay` are connected.

## Boolean algebra

Once you have multiple logical vectors, you can combine them together using Boolean algebra.
In R, `&` is "and", `|` is "or", `!` is "not", and `xor()` is exclusive or[^logicals-2].
For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing and `df |> filter(x < -10 | x > 0)` finds all rows where `x` is smaller than -10 or bigger than 0.
@fig-bool-ops shows the complete set of Boolean operations and how they work.

[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
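To see each operator in action, here's a small sketch over every combination of `TRUE` and `FALSE`:

```{r}
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x & y     # TRUE only where both are TRUE
x | y     # TRUE where either is TRUE
xor(x, y) # TRUE where exactly one is TRUE
!x        # flips TRUE and FALSE
```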
@@ -331,14 +332,15 @@ There are two main logical summaries: `any()` and `all()`.
`all(x)` is the equivalent of `&`; it'll return `TRUE` only if all values of `x` are `TRUE`s.
Like all summary functions, they'll return `NA` if there are any missing values present, and as usual you can make the missing values go away with `na.rm = TRUE`.

For example, we could use `all()` to find out if there were days where every flight was delayed:
For example, we could use `all()` and `any()` to find out if every flight was delayed by less than an hour or if any flight was delayed by over 5 hours.
And using `group_by()` allows us to do that by day:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(arr_delay >= 0, na.rm = TRUE),
    any_delayed = any(arr_delay >= 0, na.rm = TRUE),
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
```
@@ -349,36 +351,18 @@ That leads us to the numeric summaries.
### Numeric summaries of logical vectors {#sec-numeric-summaries-of-logicals}

When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` the proportion of `TRUE`s.
That lets us see the distribution of delays across the days of the year as shown in @fig-prop-delayed-dist
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` gives the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s (because `mean()` is just `sum()` divided by `length()`).

```{r}
#| label: fig-prop-delayed-dist
#| fig-cap: >
#| A histogram showing the proportion of delayed flights each day.
#| fig-alt: >
#| The distribution is unimodal and mildly right skewed. The distribution
#| peaks around 30% delayed flights.
flights |>
  group_by(year, month, day) |>
  summarize(
    prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
    .groups = "drop"
  ) |>
  ggplot(aes(x = prop_delayed)) +
  geom_histogram(binwidth = 0.05)
```

Or we could ask: "How many flights left before 5am?", which are often flights that were delayed from the previous day:
That, for example, allows us to see the proportion of flights that were delayed by less than 60 minutes and the number of flights that were delayed by over 5 hours:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    n_early = sum(dep_time < 500, na.rm = TRUE),
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(n_early))
  )
```

### Logical subsetting
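The core idea, as a minimal sketch with a toy vector: a logical vector inside `[ ]` keeps only the elements where it's `TRUE`.

```{r}
x <- c(1, -3, 5, -7, NA)
x[x > 0 & !is.na(x)]  # keep values that are positive and not missing
```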
Expand Down Expand Up @@ -574,6 +558,18 @@ Here are the most important cases that are compatible:

We don't expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.
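As an illustration (a sketch using dplyr's `if_else()`, which enforces these compatibility rules):

```{r}
# Numbers and NA are compatible:
if_else(c(TRUE, FALSE, NA), 1, 2)

# Numbers and strings are not, so this would error if uncommented:
# if_else(c(TRUE, FALSE), 1, "a")
```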

### Exercises

1. A number is even if it's divisible by two, which in R you can find out with `x %% 2 == 0`.
Use this fact and `if_else()` to determine whether each number between 0 and 20 is even or odd.

2. Given a vector of days like `x <- c("Monday", "Saturday", "Wednesday")`, use an `ifelse()` statement to label them as weekends or weekdays.

3. Use `ifelse()` to compute the absolute value of a numeric vector called `x`.

4. Write a `case_when()` statement that uses the `month` and `day` columns from `flights` to label a selection of important US holidays (e.g. New Year's Day, 4th of July, Thanksgiving, and Christmas).
First create a logical column that is either `TRUE` or `FALSE`, and then create a character column that either gives the name of the holiday or is `NA`.

## Summary

The definition of a logical vector is simple because each value must be either `TRUE`, `FALSE`, or `NA`.
Expand Down
59 changes: 26 additions & 33 deletions numbers.qmd
@@ -91,7 +91,7 @@ This means that it only works inside dplyr verbs:
n()
```

There are a couple of variants of `n()` that you might find useful:
There are a couple of variants of `n()` and `count()` that you might find useful:

- `n_distinct(x)` counts the number of distinct (unique) values of one or more variables.
For example, we could figure out which destinations are served by the most carriers:
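    A sketch of that query (assuming the `flights` data from nycflights13):

    ```{r}
    flights |>
      group_by(dest) |>
      summarize(carriers = n_distinct(carrier)) |>
      arrange(desc(carriers))
    ```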
@@ -216,7 +216,7 @@ df |>

### Modular arithmetic

Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder.
Modular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder.
In R, `%/%` does integer division and `%%` computes the remainder:

```{r}
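# For example (a small sketch): how many whole 3s fit into each of 1:10,
# and what's left over.
1:10 %/% 3
1:10 %% 3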
@@ -326,7 +326,7 @@ round(x / 0.25) * 0.25

### Cutting numbers into ranges

Use `cut()`[^numbers-1] to break up a numeric vector into discrete buckets:
Use `cut()`[^numbers-1] to break up (aka bin) a numeric vector into discrete buckets:

[^numbers-1]: ggplot2 provides some helpers for common cases in `cut_interval()`, `cut_number()`, and `cut_width()`.
ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.
@@ -395,6 +395,8 @@ If you need more complex rolling or sliding aggregates, try the [slider](https:/
Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).
4. Round `dep_time` and `arr_time` to the nearest five minutes.
## General transformations
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
@@ -436,13 +438,13 @@ In this case, it'll give the number of the "current" row.
When combined with `%%` or `%/%` this can be a useful tool for dividing data into similarly sized groups:

```{r}
df <- tibble(x = runif(10))
df <- tibble(id = 1:10)
df |>
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3,
    three_in_each_group = row0 %/% 3
  )
```

@@ -474,8 +476,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
### Consecutive identifiers
Sometimes you want to start a new group every time some event occurs.
For example, when you're looking at website data, it's common to want to break up events into sessions, where a session is defined as a gap of more than x minutes since the last activity.
For example, when you're looking at website data, it's common to want to break up events into sessions, where you begin a new session after a gap of more than `x` minutes since the last activity.
For example, imagine you have the times when someone visited a website:
```{r}
Expand All @@ -485,23 +486,23 @@ events <- tibble(
```

And you've the time lag between the events, and figured out if there's a gap that's big enough to qualify:
And you've computed the time between each event, and figured out if there's a gap that's big enough to qualify:

```{r}
events <- events |>
  mutate(
    diff = time - lag(time, default = first(time)),
    gap = diff >= 5
    has_gap = diff >= 5
  )
events
```

But how do we go from that logical vector to something that we can `group_by()`?
`cumsum()` from @sec-cumulative-and-rolling-aggregates comes to the rescue as each occurring gap, i.e. `gap` is `TRUE`, increments `group` by one (see @sec-numeric-summaries-of-logicals on the numerical interpretation of logicals):
`cumsum()`, from @sec-cumulative-and-rolling-aggregates, comes to the rescue: each gap, i.e. each `TRUE` in `has_gap`, increments `group` by one (see @sec-numeric-summaries-of-logicals for the numerical interpretation of logicals):

```{r}
events |> mutate(
  group = cumsum(gap)
  group = cumsum(has_gap)
)
```
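From there, a sketch of the grouping itself (the summary column names here are just illustrative):

```{r}
events |>
  group_by(group = cumsum(has_gap)) |>
  summarize(n_events = n(), start = first(time))
```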

@@ -513,11 +514,9 @@ df <- tibble(
x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)
df
```

You want to keep the first row from each repeated `x`.
That's easier to express with a combination of `consecutive_id()` and `slice_head()`:
If you want to keep the first row from each repeated `x`, you could use `group_by()`, `consecutive_id()`, and `slice_head()`:

```{r}
df |>
@@ -720,28 +719,24 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu

### Positions

There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position with `first(x)`, `last(x)`, and `nth(x, n)`.

For example, we can find the first and last departure for each day:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time),
    fifth_dep = nth(dep_time, 5),
    last_dep = last(dep_time)
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE)
  )
```

(These functions currently lack an `na.rm` argument but will hopefully be fixed by the time you read this book: <https://github.com/tidyverse/dplyr/issues/6242>).
(NB: Because dplyr functions use `_` to separate components of function and argument names, these functions use `na_rm` instead of `na.rm`.)

If you're familiar with `[`, you might wonder if you ever need these functions.
There are two main reasons: the `default` argument and the `order_by` argument.
`default` allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group.
`order_by` lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by `order_by()`.
If you're familiar with `[`, which we'll come back to in @sec-subset-many, you might wonder if you ever need these functions.
There are three reasons: the `default` argument allows you to provide a default if the specified position doesn't exist, the `order_by` argument allows you to locally override the order of the rows, and the `na_rm` argument allows you to drop missing values.
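A quick sketch of `default` and `order_by` with a toy vector:

```{r}
x <- c(10, 20)
nth(x, 3)                     # position 3 doesn't exist, so NA
nth(x, 3, default = 0)        # ...or a value you choose
first(x, order_by = c(2, 1))  # "first" after reordering by another vector
```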

Extracting values at positions is complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
@@ -761,19 +756,17 @@ For example:

- `x / sum(x)` calculates the proportion of a total.
- `(x - mean(x)) / sd(x)` computes a Z-score (standardized to mean 0 and sd 1).
- `(x - min(x)) / (max(x) - min(x))` standardizes to range \[0, 1\].
- `x / first(x)` computes an index based on the first observation.
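For instance, a sketch with a toy vector:

```{r}
x <- c(100, 105, 120, 95)
x / sum(x)                       # proportion of total
(x - mean(x)) / sd(x)            # Z-score
(x - min(x)) / (max(x) - min(x)) # rescaled to [0, 1]
x / first(x)                     # index relative to the first observation
```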

### Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
Consider the following scenarios:

- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it's 2 hours late.

Which do you think is more important: arrival delay or departure delay?
When is `mean()` useful?
When is `median()` useful?
When might you want to use something else?
Should you use arrival delay or departure delay?
Why might you want to use data from `planes`?

2. Which destinations show the greatest variation in air speed?
