Improve data tidying
hadley committed Mar 7, 2023
1 parent 810b9f6 commit 424665c
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions data-tidy.qmd
@@ -176,9 +176,11 @@ billboard

In this dataset, each observation is a song.
The first three columns (`artist`, `track` and `date.entered`) are variables that describe the song.
-Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
+Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week[^data-tidy-1].
Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).

+[^data-tidy-1]: The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.

To tidy this data, we'll use `pivot_longer()`:

```{r, R.options=list(pillar.print_min = 10)}
@@ -202,9 +204,9 @@ Now let's turn our attention to the resulting, longer data frame.
What happens if a song is in the top 100 for less than 76 weeks?
Take 2 Pac's "Baby Don't Cry", for example.
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-2], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:

-[^data-tidy-1]: We'll come back to this idea in @sec-missing-values.
+[^data-tidy-2]: We'll come back to this idea in @sec-missing-values.

```{r}
billboard |>
@@ -216,7 +218,7 @@ billboard |>
)
```

-The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
+The number of rows is now much lower, indicating that many rows with `NA`s were dropped.

You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
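
For readers of this diff, here is a small sketch of the `values_drop_na` behaviour the edited prose describes. It is not part of the commit. The `songs` tibble, its track names, and its ranks are invented for illustration and only mimic the shape of `billboard`. It assumes the tidyr and tibble packages are installed and R 4.1 or later for the `|>` pipe.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data shaped like `billboard`: one row per song,
# one column per week, NA once the song has left the chart.
songs <- tribble(
  ~track,   ~wk1, ~wk2, ~wk3,
  "Song A",   87,   82,   NA,
  "Song B",   91,   NA,   NA
)

# Default behaviour: every song/week combination becomes a row,
# including the combinations whose rank is NA.
all_rows <- songs |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# With values_drop_na = TRUE, rows whose rank is NA are dropped.
charted <- songs |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

nrow(all_rows)  # 2 songs x 3 week columns
nrow(charted)   # only the non-missing ranks remain
```

In this toy case the row count falls from six to three, which is the same effect the revised sentence describes for the real dataset.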