Review edits (hadley#1330)
* Cut case study down a bit

* Intro feedback

* Visualize feedback

* Update data-transform.qmd

Co-authored-by: Hadley Wickham <[email protected]>

* Update data-visualize.qmd

Co-authored-by: Hadley Wickham <[email protected]>

* Update data-visualize.qmd

Co-authored-by: Hadley Wickham <[email protected]>

* Update intro.qmd

Co-authored-by: Hadley Wickham <[email protected]>

* Update intro.qmd

Co-authored-by: Hadley Wickham <[email protected]>

* Incorporate review feedback

---------

Co-authored-by: Hadley Wickham <[email protected]>
mine-cetinkaya-rundel and hadley authored Mar 2, 2023
1 parent fc631a4 commit 70687bf
Showing 3 changed files with 93 additions and 149 deletions.
91 changes: 19 additions & 72 deletions data-transform.qmd
@@ -3,6 +3,7 @@
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@@ -753,107 +754,53 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
```{r}
#| fig-alt: >
#| A frequency polygon showing the distribution of average arrival delays
#| per plane. The distribution is unimodal, with a large spike around 0, and
#| asymmetric: very few planes have an average delay below -30 minutes,
#| but some are delayed, on average, by up to 5 hours.
delays <- flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |>
group_by(tailnum) |>
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(delays, aes(x = delay)) +
geom_freqpoly(binwidth = 10)
```

Wow, there are some planes that have an *average* delay of 5 hours (300 minutes)!
That seems pretty surprising, so let's draw a scatterplot of the number of flights vs. average delay:

```{r}
#| fig-alt: >
#| A scatterplot showing number of flights versus average arrival delay. Delays
#| for planes with a very small number of flights have very high variability
#| (from -50 to ~300), but the variability rapidly decreases as the
#| number of flights increases.
ggplot(delays, aes(x = delay, y = n)) +
geom_point(alpha = 1/10)
```

Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane.
The shape of this plot is very characteristic: whenever you plot a mean (or another summary statistic) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].

[^data-transform-4]: \*cough\* the law of large numbers \*cough\*.
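
You can also see this effect in a quick simulation: draw many groups from the same distribution at a few different sizes and compare how much the group means vary. (The group sizes and spread below are arbitrary choices, purely for illustration.)

```{r}
# A minimal simulation: many groups drawn from the same distribution,
# with arbitrary group sizes and spread chosen purely for illustration.
# (tidyverse is already loaded at the start of the chapter.)
set.seed(1014)

sim <- tibble(size = rep(c(10, 100, 1000), each = 100)) |>
  mutate(group_mean = map_dbl(size, \(size) mean(rnorm(size, mean = 0, sd = 50))))

sim |>
  group_by(size) |>
  summarize(sd_of_means = sd(group_mean))
```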

When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:

```{r}
#| warning: false
#| fig-alt: >
#| Scatterplot of the number of flights for a given plane vs. the average delay
#| of those flights, for planes with more than 25 flights. As average delay
#| increases from -20 to 10, the number of flights also increases. For
#| larger average delays, the number of flights decreases.
delays |>
filter(n > 25) |>
ggplot(aes(x = delay, y = n)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)
```

Note the handy pattern for combining ggplot2 and dplyr.
It's a bit annoying that you have to switch from `|>` to `+`, but it's not too much of a hassle once you get the hang of it.

There's another common variation on this pattern that we can see in some data about baseball players.
The following code uses data from the **Lahman** package to compare what proportion of times a player gets a hit vs. the number of times they try to put the ball in play:
We'll demonstrate this with some baseball data from the **Lahman** package.
Specifically, we will compare what proportion of times a player gets a hit vs. the number of times they try to put the ball in play:
```{r}
batters <- Lahman::Batting |>
group_by(playerID) |>
summarize(
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
batters
```

When we plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
When we plot the skill of the batter (measured by the batting average, `performance`) against the number of opportunities to hit the ball (measured by times at bat, `n`), you see two patterns:

1. As above, the variation in our aggregate decreases as we get more data points.
1. The variation in our aggregate decreases as we get more data points.
The shape of this plot is very characteristic: whenever you plot a mean (or another summary statistic) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].

2. There's a positive correlation between skill (`perf`) and opportunities to hit the ball (`n`) because obviously teams want to give their best batters the most opportunities to hit the ball.
2. There's a positive correlation between skill (`performance`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.

[^data-transform-4]: \*cough\* the law of large numbers \*cough\*.

```{r}
#| warning: false
#| fig-alt: >
#| A scatterplot of number of batting opportunities vs. batting performance
#| A scatterplot of batting performance vs. number of batting opportunities
#| overlaid with a smoothed line. Average performance increases sharply
#| from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance
#| continues to increase linearly at a much shallower slope reaching
#| ~0.3 when n is ~15,000.
batters |>
filter(n > 100) |>
ggplot(aes(x = n, y = perf)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)
ggplot(aes(x = n, y = performance)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)
```

Note the handy pattern for combining ggplot2 and dplyr.
It's a bit annoying that you have to switch from `|>` to `+`, but it's not too much of a hassle once you get the hang of it.

This also has important implications for ranking.
If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
If you naively sort on `desc(performance)`, the people with the best batting averages are clearly lucky, not skilled:

```{r}
batters |>
arrange(desc(perf))
arrange(desc(performance))
```

You can find a good explanation of this problem and how to overcome it at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <https://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
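
To get a flavor of the fix, one option is to pull every player's average towards a prior guess before ranking. The pseudo-counts and the name `shrunk_performance` below are just illustrative placeholders; the posts linked above explain how to estimate the prior from the data itself.

```{r}
# A rough sketch of the shrinkage idea: combine each player's record with
# some made-up "prior" hits and at-bats, so players with little data get
# pulled towards a typical batting average (here 70 / 250 = 0.28).
prior_hits <- 70
prior_at_bats <- 250

batters |>
  mutate(
    shrunk_performance = (performance * n + prior_hits) / (n + prior_at_bats)
  ) |>
  arrange(desc(shrunk_performance))
```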
