TR review feedback for logicals-factors (hadley#1310)
hadley authored Feb 27, 2023
1 parent b03248a commit c0f0375
Showing 5 changed files with 104 additions and 154 deletions.
31 changes: 14 additions & 17 deletions factors.qmd
@@ -12,10 +12,11 @@ status("complete")
Factors are used for categorical variables, variables that have a fixed and known set of possible values.
They are also useful when you want to display character vectors in a non-alphabetical order.

We'll start by motivating why factors are needed for data analysis and how you can create them with `factor()`.
We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
We'll start by motivating why factors are needed for data analysis[^factors-1] and how you can create them with `factor()`. We'll then introduce you to the `gss_cat` dataset which contains a bunch of categorical variables to experiment with.
You'll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

[^factors-1]: They're also really important for modelling.

### Prerequisites

Base R provides some basic tools for creating and manipulating factors.
@@ -77,7 +78,7 @@ y2 <- factor(x2, levels = month_levels)
y2
```

This seems risky, so you might want to use `fct()` instead:
This seems risky, so you might want to use `forcats::fct()` instead:

```{r}
#| error: true
@@ -90,21 +91,17 @@ If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```

Sometimes you'd prefer that the order of the levels matches the order of the first appearance in the data.
You can do that when creating the factor by setting levels to `unique(x)`, or after the fact, with `fct_inorder()`:
Sorting alphabetically is slightly risky because not every computer will sort strings in the same way.
So `forcats::fct()` orders by first appearance:

```{r}
f1 <- factor(x1, levels = unique(x1))
f1
f2 <- x1 |> factor() |> fct_inorder()
f2
fct(x1)
```

If you ever need to access the set of valid levels directly, you can do so with `levels()`:

```{r}
levels(f2)
levels(y2)
```

You can also create a factor when reading your data with readr with `col_factor()`:
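For instance, a minimal sketch (assuming the tidyverse is loaded and `month_levels` is the vector defined above):

```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"

df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```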
@@ -169,7 +166,6 @@ For example, imagine you want to explore the average number of hours spent watch
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
@@ -223,7 +219,6 @@ rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
@@ -274,19 +269,21 @@ This makes the plot easier to read because the colors of the line at the far rig
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(na.rm = TRUE)
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1")

ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(color = "marital")
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(color = "marital")
```

Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables.
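For example, a quick sketch with `gss_cat` (pairing `fct_infreq()` with `fct_rev()` so the most frequent level ends up last):

```{r}
gss_cat |>
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```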
60 changes: 28 additions & 32 deletions logicals.qmd
@@ -137,14 +137,14 @@ NA == NA
It's easiest to understand why this is true if we artificially supply a little more context:

```{r}
# Let x be Mary's age. We don't know how old she is.
x <- NA
# We don't know how old Mary is
age_mary <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# We don't know how old John is
age_john <- NA
# Are John and Mary the same age?
x == y
age_mary == age_john
# We don't know!
```

@@ -191,13 +191,14 @@ We'll come back to cover missing values in more depth in @sec-missing-values.

### Exercises

1. How does `dplyr::near()` work? Type `near` to see the source code.
1. How does `dplyr::near()` work? Type `near` to see the source code. Is `sqrt(2)^2` near 2?
2. Use `mutate()`, `is.na()`, and `count()` together to describe how the missing values in `dep_time`, `sched_dep_time`, and `dep_delay` are connected.

## Boolean algebra

Once you have multiple logical vectors, you can combine them together using Boolean algebra.
In R, `&` is "and", `|` is "or", `!` is "not", and `xor()` is exclusive or[^logicals-2].
For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing and `df |> filter(x < -10 | x > 0)` finds all rows where `x` is smaller than -10 or bigger than 0.
@fig-bool-ops shows the complete set of Boolean operations and how they work.

[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
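To see each operator in action, here's a small sketch over every combination of `TRUE` and `FALSE`:

```{r}
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x & y     # TRUE only where both are TRUE
x | y     # TRUE where either is TRUE
xor(x, y) # TRUE where exactly one is TRUE
!x        # flips TRUE and FALSE
```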
@@ -331,14 +332,15 @@ There are two main logical summaries: `any()` and `all()`.
`all(x)` is the equivalent of `&`; it'll return `TRUE` only if all values of `x` are `TRUE`s.
Like all summary functions, they'll return `NA` if there are any missing values present, and as usual you can make the missing values go away with `na.rm = TRUE`.

For example, we could use `all()` to find out if there were days where every flight was delayed:
For example, we could use `all()` and `any()` to find out if every flight was delayed by less than an hour or if any flight was delayed by over 5 hours.
And using `group_by()` allows us to do that by day:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    all_delayed = all(arr_delay >= 0, na.rm = TRUE),
    any_delayed = any(arr_delay >= 0, na.rm = TRUE),
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )
```
@@ -349,36 +351,18 @@ That leads us to the numeric summaries.
### Numeric summaries of logical vectors {#sec-numeric-summaries-of-logicals}

When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` the proportion of `TRUE`s.
That lets us see the distribution of delays across the days of the year as shown in @fig-prop-delayed-dist
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` gives the number of `TRUE`s and `mean(x)` gives the proportion of `TRUE`s (because `mean()` is just `sum()` divided by `length()`).

```{r}
#| label: fig-prop-delayed-dist
#| fig-cap: >
#| A histogram showing the proportion of delayed flights each day.
#| fig-alt: >
#| The distribution is unimodal and mildly right skewed. The distribution
#| peaks around 30% delayed flights.
flights |>
  group_by(year, month, day) |>
  summarize(
    prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
    .groups = "drop"
  ) |>
  ggplot(aes(x = prop_delayed)) +
  geom_histogram(binwidth = 0.05)
```

Or we could ask: "How many flights left before 5am?", which are often flights that were delayed from the previous day:
That, for example, allows us to see the proportion of flights that were delayed by less than 60 minutes and the number of flights that were delayed by over 5 hours:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    n_early = sum(dep_time < 500, na.rm = TRUE),
    all_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(n_early))
  )
```

### Logical subsetting
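The core idea, as a minimal sketch with a toy vector: a logical vector inside `[ ]` keeps only the elements where it's `TRUE`.

```{r}
x <- c(1, -3, 5, -7, NA)
x[x > 0 & !is.na(x)]  # keep values that are positive and not missing
```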
Expand Down Expand Up @@ -574,6 +558,18 @@ Here are the most important cases that are compatible:

We don't expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.
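As an illustration (a sketch using dplyr's `if_else()`, which enforces these compatibility rules):

```{r}
# Numbers and NA are compatible:
if_else(c(TRUE, FALSE, NA), 1, 2)

# Numbers and strings are not, so this would error if uncommented:
# if_else(c(TRUE, FALSE), 1, "a")
```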

### Exercises

1. A number is even if it's divisible by two, which in R you can find out with `x %% 2 == 0`.
Use this fact and `if_else()` to determine whether each number between 0 and 20 is even or odd.

2. Given a vector of days like `x <- c("Monday", "Saturday", "Wednesday")`, use an `ifelse()` statement to label them as weekends or weekdays.

3. Use `ifelse()` to compute the absolute value of a numeric vector called `x`.

4. Write a `case_when()` statement that uses the `month` and `day` columns from `flights` to label a selection of important US holidays (e.g. New Year's Day, 4th of July, Thanksgiving, and Christmas).
First create a logical column that is either `TRUE` or `FALSE`, and then create a character column that either gives the name of the holiday or is `NA`.

## Summary

The definition of a logical vector is simple because each value must be either `TRUE`, `FALSE`, or `NA`.
Expand Down
59 changes: 26 additions & 33 deletions numbers.qmd
@@ -91,7 +91,7 @@ This means that it only works inside dplyr verbs:
n()
```

There are a couple of variants of `n()` that you might find useful:
There are a couple of variants of `n()` and `count()` that you might find useful:

- `n_distinct(x)` counts the number of distinct (unique) values of one or more variables.
For example, we could figure out which destinations are served by the most carriers:
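    A sketch of that query (assuming the `flights` data from nycflights13):

    ```{r}
    flights |>
      group_by(dest) |>
      summarize(carriers = n_distinct(carrier)) |>
      arrange(desc(carriers))
    ```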
@@ -216,7 +216,7 @@ df |>

### Modular arithmetic

Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder.
Modular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder.
In R, `%/%` does integer division and `%%` computes the remainder:

```{r}
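# For example (a small sketch): how many whole 3s fit into each of 1:10,
# and what's left over.
1:10 %/% 3
1:10 %% 3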
@@ -326,7 +326,7 @@ round(x / 0.25) * 0.25

### Cutting numbers into ranges

Use `cut()`[^numbers-1] to break up a numeric vector into discrete buckets:
Use `cut()`[^numbers-1] to break up (aka bin) a numeric vector into discrete buckets:

[^numbers-1]: ggplot2 provides some helpers for common cases in `cut_interval()`, `cut_number()`, and `cut_width()`.
ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.
@@ -395,6 +395,8 @@ If you need more complex rolling or sliding aggregates, try the [slider](https:/
Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).
4. Round `dep_time` and `arr_time` to the nearest five minutes.
## General transformations
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
@@ -436,13 +438,13 @@ In this case, it'll give the number of the "current" row.
When combined with `%%` or `%/%` this can be a useful tool for dividing data into similarly sized groups:

```{r}
df <- tibble(x = runif(10))
df <- tibble(id = 1:10)
df |>
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3,
    three_in_each_group = row0 %/% 3
  )
```

@@ -474,8 +476,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
### Consecutive identifiers
Sometimes you want to start a new group every time some event occurs.
For example, when you're looking at website data, it's common to want to break up events into sessions, where a session is defined as a gap of more than x minutes since the last activity.
For example, when you're looking at website data, it's common to want to break up events into sessions, where you begin a new session after a gap of more than `x` minutes since the last activity.
For example, imagine you have the times when someone visited a website:
```{r}
Expand All @@ -485,23 +486,23 @@ events <- tibble(
```

And you've the time lag between the events, and figured out if there's a gap that's big enough to qualify:
And you've computed the time between each event, and figured out if there's a gap that's big enough to qualify:

```{r}
events <- events |>
  mutate(
    diff = time - lag(time, default = first(time)),
    gap = diff >= 5
    has_gap = diff >= 5
  )
events
```

But how do we go from that logical vector to something that we can `group_by()`?
`cumsum()` from @sec-cumulative-and-rolling-aggregates comes to the rescue as each occurring gap, i.e. `gap` is `TRUE`, increments `group` by one (see @sec-numeric-summaries-of-logicals on the numerical interpretation of logicals):
`cumsum()`, from @sec-cumulative-and-rolling-aggregates, comes to the rescue: each gap, i.e. each `TRUE` in `has_gap`, increments `group` by one (see @sec-numeric-summaries-of-logicals for the numerical interpretation of logicals):

```{r}
events |> mutate(
  group = cumsum(gap)
  group = cumsum(has_gap)
)
```
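From there, a sketch of the grouping itself (the summary column names here are just illustrative):

```{r}
events |>
  group_by(group = cumsum(has_gap)) |>
  summarize(n_events = n(), start = first(time))
```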

@@ -513,11 +514,9 @@ df <- tibble(
x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)
df
```

You want to keep the first row from each repeated `x`.
That's easier to express with a combination of `consecutive_id()` and `slice_head()`:
If you want to keep the first row from each repeated `x`, you could use `group_by()`, `consecutive_id()`, and `slice_head()`:

```{r}
df |>
@@ -720,28 +719,24 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu

### Positions

There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position with `first(x)`, `last(x)`, and `nth(x, n)`.

For example, we can find the first and last departure for each day:

```{r}
flights |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time),
    fifth_dep = nth(dep_time, 5),
    last_dep = last(dep_time)
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE)
  )
```

(These functions currently lack an `na.rm` argument but will hopefully be fixed by the time you read this book: <https://github.com/tidyverse/dplyr/issues/6242>).
(NB: Because dplyr functions use `_` to separate components of function and argument names, these functions use `na_rm` instead of `na.rm`.)

If you're familiar with `[`, you might wonder if you ever need these functions.
There are two main reasons: the `default` argument and the `order_by` argument.
`default` allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group.
`order_by` lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by `order_by()`.
If you're familiar with `[`, which we'll come back to in @sec-subset-many, you might wonder if you ever need these functions.
There are three reasons: the `default` argument allows you to provide a default if the specified position doesn't exist, the `order_by` argument allows you to locally override the order of the rows, and the `na_rm` argument allows you to drop missing values.
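A quick sketch of `default` and `order_by` with a toy vector:

```{r}
x <- c(10, 20)
nth(x, 3)                     # position 3 doesn't exist, so NA
nth(x, 3, default = 0)        # ...or a value you choose
first(x, order_by = c(2, 1))  # "first" after reordering by another vector
```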

Extracting values at positions is complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
@@ -761,19 +756,17 @@ For example:

- `x / sum(x)` calculates the proportion of a total.
- `(x - mean(x)) / sd(x)` computes a Z-score (standardized to mean 0 and sd 1).
- `(x - min(x)) / (max(x) - min(x))` standardizes to range \[0, 1\].
- `x / first(x)` computes an index based on the first observation.
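For instance, a sketch with a toy vector:

```{r}
x <- c(100, 105, 120, 95)
x / sum(x)                       # proportion of total
(x - mean(x)) / sd(x)            # Z-score
(x - min(x)) / (max(x) - min(x)) # rescaled to [0, 1]
x / first(x)                     # index relative to the first observation
```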

### Exercises

1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
Consider the following scenarios:

- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it's 2 hours late.

Which do you think is more important: arrival delay or departure delay?
When is `mean()` useful?
When is `median()` useful?
When might you want to use something else?
Should you use arrival delay or departure delay?
Why might you want to use data from `planes`?

2. Which destinations show the greatest variation in air speed?
