- Installation
- Usage
- Example 1: replicate
nflscrapR
withfast_scraper
- Example 2: scrape a batch of games very quickly with
fast_scraper
and parallel processing - Example 3: completion percentage over expected (CPOE)
- Example 4: using drive information
- Example 5: scrape rosters with
fast_scraper_roster
- Example 6: scrape highlight clips with
fast_scraper_clips
- Example 7: Plot offensive and defensive EPA per play for a given season
- Example 1: replicate
- More information
- About
nflfastR
is a set of functions to efficiently scrape NFL play-by-play
and roster data. nflfastR
expands upon the features of nflscrapR:
- By incorporating the NFL’s RS feed, the package currently supports full play-by-play back to 2000
- As suggested by the package name, it scrapes games much faster
- Includes completion probability (
cp
) and completion percentage over expected (cpoe
) in play-by-play going back to 2006 - The default RS feed includes drive information, including drive starting position and drive result
- Includes fast functions for roster and highlight scraping
We owe a debt of gratitude to the original
nflscrapR
team, Maksim
Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions
and inspiration this package would not exist.
You can load and install nflfastR from GitHub with:
# If 'devtools' isn't installed run
# install.packages("devtools")
# If 'nflscrapR' isn't installed run
# devtools::install_github("maksimhorowitz/nflscrapR")
devtools::install_github("mrcaseb/nflfastR")
The functionality of nflscrapR
can be duplicated by using
fast_scraper
with the ‘gc’ (for Gamecenter) option specified. This
scrapes from the same source as nflscrapR
but much more quickly.
Reasons to use the source = "gc"
option include (a) duplicating the
output of nflscrapR
or (b) when scraping a live or recently-completed
game: Gamecenter updates live and the RS feed does not. For scraping old
seasons, we recommend not specifying a source option and letting the
scraper default to the RS feed (see Example 2 below).
This example also uses the built-in function clean_pbp
to create a
"name’ column for the primary player involved (the QB on pass play or
ball-carrier on run play).
library(nflfastR)
library(tidyverse)
library(nflscrapR)
gameId <- 2019111100
nflscrapR::scrape_json_play_by_play(gameId) %>%
select(desc, play_type, epa, home_wp) %>% head(5) %>%
knitr::kable(digits = 3)
desc | play_type | epa | home_wp |
---|---|---|---|
J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.000 | NA |
(15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | -0.606 | 0.500 |
(14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | -1.146 | 0.485 |
(13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | 3.223 | 0.453 |
(12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 0.774 | 0.551 |
#The 'gc' option specifies scraping gamecenter like nflscrapR does, as opposed to 'rs'
fast_scraper(gameId, source = "gc") %>%
clean_pbp() %>%
select(desc, play_type, epa, home_wp, name) %>% head(5) %>%
knitr::kable(digits = 3)
desc | play_type | epa | home_wp | name |
---|---|---|---|---|
J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.000 | NA | NA |
(15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | -0.606 | 0.500 | T.Coleman |
(14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | -1.146 | 0.485 | T.Coleman |
(13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | 3.223 | 0.453 | J.Garoppolo |
(12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 0.774 | 0.551 | NA |
#get list of some games from 2019
games_2019 <- fast_scraper_schedules(2019) %>% filter(game_type == 'REG') %>% head(16) %>% pull(game_id)
tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:'))
f <- fast_scraper(games_2019, pp = TRUE)
tictoc::toc()
#> 16 games with nflfastR:: 13.88 sec elapsed
tictoc::tic(glue::glue('{length(games_2019)} games with nflscrapR:'))
n <- map_df(games_2019, nflscrapR::scrape_json_play_by_play)
tictoc::toc()
#> 16 games with nflscrapR:: 535.89 sec elapsed
Let’s look at CPOE leaders from the 2009 regular season.
games <- fast_scraper_schedules(2009) %>% filter(game_type == 'REG') %>% pull(game_id)
tictoc::tic('scraping all 256 games from 2009')
games_2009 <- fast_scraper(games, pp = TRUE)
tictoc::toc()
#> scraping all 256 games from 2009: 150.526 sec elapsed
games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>%
summarize(cpoe = mean(cpoe), Atts=n()) %>%
filter(Atts > 200) %>%
arrange(-cpoe) %>%
head(5) %>%
knitr::kable(digits = 1)
passer_player_name | cpoe | Atts |
---|---|---|
D.Brees | 9.3 | 509 |
P.Manning | 7.4 | 569 |
B.Favre | 6.6 | 526 |
P.Rivers | 6.4 | 474 |
B.Roethlisberger | 5.8 | 503 |
When scraping from the default RS feed, drive results are automatically included. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2006.
nflfastR
has a data repository for old seasons, so there’s no need to
actually scrape them. Let’s use that here (the below reads .rds files,
but .csv is also available).
games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds'))
games_2015 <-readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds'))
pbp <- rbind(games_2000, games_2015)
pbp %>% filter(game_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
mutate(drive_score = if_else(drive_how_ended %in% c("Touchdown", "Field_Goal"), 1, 0)) %>%
group_by(season) %>%
summarize(drive_score = mean(drive_score)) %>%
knitr::kable(digits = 3)
season | drive_score |
---|---|
2000 | 0.233 |
2015 | 0.305 |
So about 23% of 1st & 10 plays from teams’ own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has implications for EPA models (see below).
# Roster of Steelers and Seahawks in 2016 & 2019 using parallel processing
# teams_colors_logos is included in the package
team_ids <- teams_colors_logos %>% filter(team_abbr %in% c("SEA", "PIT")) %>% pull(team_id)
fast_scraper_roster(team_ids, c("2016", "2019"), pp = TRUE) %>%
select(2,9:13) %>% head() %>%
knitr::kable()
teamPlayers.displayName | teamPlayers.position | teamPlayers.nflId | teamPlayers.esbId | teamPlayers.gsisId | teamPlayers.birthDate |
---|---|---|---|---|---|
Shamarko Thomas | SS | 2539937 | THO379701 | 00-0030412 | 02/23/1991 |
Sean Davis | SS | 2555386 | DAV746549 | 00-0033053 | 10/23/1993 |
Javon Hargrave | NT | 2555239 | HAR143881 | 00-0033109 | 02/07/1993 |
Mike Hilton | DB | 2556559 | HIL796239 | 00-0032521 | 03/09/1994 |
Shaquille Riddick | LB | 2552584 | RID186261 | 00-0032111 | 03/12/1993 |
Ricardo Mathews | DE | 1037901 | MAT188704 | 00-0027829 | 07/30/1987 |
#use same week 1 games from above
vids <- fast_scraper_clips(games_2019)
vids %>% select(highlight_video_url) %>% head(2) %>% knitr::kable()
Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The logo urls of the espn logos are integrated into the ‘team_colors_logos’ data frame which is delivered with the package.
Let’s also use the included helper function clean_pbp
, which creates
“rush” and “pass” columns that (a) properly count sacks and scrambles
as pass plays and (b) properly include plays with penalties. Using this,
we can keep only rush or pass plays.
library(ggimage)
pbp <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds')) %>%
filter(game_type == 'REG') %>% clean_pbp() %>% filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>% group_by(posteam) %>% summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>% group_by(defteam) %>% summarise(def_epa = mean(epa, na.rm = TRUE))
logos <- teams_colors_logos %>% select(team_abbr, team_logo_espn)
offense %>%
inner_join(defense, by = c("posteam" = "defteam")) %>%
inner_join(logos, by = c("posteam" = "team_abbr")) %>%
ggplot(aes(x = off_epa, y = def_epa)) +
geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
geom_hline(aes(yintercept = mean(off_epa)), color = "red", linetype = "dashed") +
geom_vline(aes(xintercept = mean(def_epa)), color = "red", linetype = "dashed") +
geom_image(aes(image = team_logo_espn), size = 0.05, asp = 16 / 9) +
labs(
x = "Offense EPA/play",
y = "Defense EPA/play",
caption = "Data: @nflfastR | EPA model: @nflscrapR",
title = "2005 NFL Offensive and Defensive EPA per Play"
) +
theme_bw() +
theme(
aspect.ratio = 9 / 16,
plot.title = element_text(size = 12, hjust = 0.5, face = "bold")
) +
scale_y_reverse()
nflfastR
scrapes NFL Gamecenter or RS feeds, defaulting to the RS
feed. Live games are only available from Gamecenter (we think) so when
scraping ongoing or recent games, use source = 'gc'
. Columns that
exist in both GC and RS are consistent across the two scrapers (e.g.,
player_id, play_id, etc.) but there are some columns in RS that do not
exist in GC (drive_how_ended, roof_type, game_time_eastern, etc.).
nflfastR
uses the Expected Points and Win Probability models developed
by the nflscrapR
team and provided by the nflscrapR
package. For a
description of the models, please see the paper
here. When using EP or WP from
this package, please cite nflscrapR
as it is their work behind the
models (see the example in the caption of the figure above). Because
these models were trained on more recent seasons, they should be used
with caution for games in the early 2000s (note the means being not
centered at zero in the figure above). If you would like to help us
extend the EPA model to work better in the early 2000s, we are very open
to contributions from others.
Even though nflfastR
is very fast, for completed seasons we recommend
downloading the data from
here as in Examples 4 and
7. These data sets include play-by-play data of complete seasons going
back to 2000 and we will update them in 2020 once the season starts. The
files contain both regular season and postseason data, and one can use
game_type or week to figure out which games occurred in the postseason.
Data are available as either .csv or .rds, but if you’re using R, the
.rds files are much smaller and thus faster to download.
fast_scraper
can also scrape the 1999 season. However, several games
of the 1999 season are missing play-by-play data completely. nflfastR
will point this out when trying to scrape this season and specify the
missing games.
nflfastR
was developed by Sebastian
Carl and Ben
Baldwin.
Special thanks to Florian Schmitt for the logo design!