Improve data tidying

Fixes hadley#1322
qiuwei · Mar 7, 2023 · 424665c
1 parent 810b9f6
Showing 1 changed file with 6 additions and 4 deletions.
diff --git a/data-tidy.qmd b/data-tidy.qmd
@@ -176,9 +176,11 @@ billboard
 
 In this dataset, each observation is a song.
 The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
-Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
+Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
 Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
 
+[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.
+
 To tidy this data, we'll use `pivot_longer()`:
 
 ```{r, R.options=list(pillar.print_min = 10)}
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
 What happens if a song is in the top 100 for less than 76 weeks?
 Take 2 Pac's "Baby Don't Cry", for example.
 The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
 
-[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
+[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.
 
 ```{r}
 billboard |> 
@@ -216,7 +218,7 @@ billboard |>
   )
 ```
 
-The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
+The number of rows is now much lower, indicating that many rows with `NA`s were dropped.
 
 You might also wonder what happens if a song is in the top 100 for more than 76 weeks.
 We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
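
As a small-scale check of the `values_drop_na = TRUE` behaviour described in the changed text, here is a minimal sketch on a hypothetical toy data frame (`toy`, with made-up tracks `"A"` and `"B"` and only three `wk` columns) shaped like `billboard`. It is not part of the commit and assumes the `tidyr` and `tibble` packages are installed.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data in the same shape as `billboard`: one row per song,
# one column per week; NA means the song was not in the chart that week.
toy <- tribble(
  ~track, ~wk1, ~wk2, ~wk3,
  "A",      87,   82,   NA,
  "B",      91,   NA,   NA
)

# Default behaviour: every song/week combination becomes a row, NAs included.
long_all <- toy |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# With values_drop_na = TRUE, rows whose rank is NA are dropped.
long_dropped <- toy |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

nrow(long_all)      # 2 songs x 3 week columns = 6 rows
nrow(long_dropped)  # 3 rows: only the non-missing ranks remain
```

Because `cols = starts_with("wk")` selects by prefix, the same call would also pick up extra columns such as `wk77` if a dataset had them, without any change to the code.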