forked from leerichardson/game_simulation
Commit
added season 2013 MSE and MAE results as well as pulled the season win totals from ESPN
1 parent a999260 · commit be71bca
Showing 11 changed files with 23,993 additions and 0 deletions.
@@ -0,0 +1,52 @@
#Load the XML library (provides readHTMLTable)
library(XML)

###### URLs
#Player index pages (a-z); constructed here but not used below
url <- paste0("http://www.basketball-reference.com/players/", letters, "/")

#Set the years for which data is available
years <- 1950:2014

#Set the URLs for the data we want, e.g.
#http://www.basketball-reference.com/leagues/NBA_1950_totals.html
totals_url <- paste0("http://www.basketball-reference.com/leagues/NBA_", years, "_totals.html")
advanced_url <- paste0("http://www.basketball-reference.com/leagues/NBA_", years, "_advanced.html")

#Set the number of season pages we are scraping
totals_len <- length(totals_url)
advanced_len <- length(advanced_url)

#Initialize both the totals and advanced tables with the 1950 season
totals_table <- readHTMLTable(totals_url[1])[[1]]
totals_table$year <- 1950

advanced_table <- readHTMLTable(advanced_url[1])[[1]]
advanced_table$year <- 1950

#Append together all of the totals data
for (i in 2:totals_len) {
  #Create a temporary table, then append a variable that contains the year
  #(index i = 2 corresponds to the 1951 season, hence i + 1949)
  temp_table <- readHTMLTable(totals_url[i])[[1]]
  temp_table$year <- i + 1949
  totals_table <- rbind(totals_table, temp_table)
}

#Append together all of the advanced data
for (i in 2:advanced_len) {
  #Create a temporary table, then append a variable that contains the year
  temp_table <- readHTMLTable(advanced_url[i])[[1]]
  temp_table$year <- i + 1949
  advanced_table <- rbind(advanced_table, temp_table)
}

#Combine the two tables and drop the repeated header rows
#(the site repeats the column header as a row with Player == "Player")
all_table <- as.data.frame(subset(cbind(totals_table, advanced_table), Player != "Player"))
#Drop the duplicate columns, which carry a ".1" suffix after the cbind
all_table_revised <- all_table[, !grepl("\\.1$", names(all_table))]

#Write the polished data frame to a CSV
write.csv(all_table_revised, file = "data/bball_ref_data.csv")
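Because the loops above issue one request per season with no error handling, a single failed page aborts the whole scrape. Below is a minimal hardening sketch, not part of the committed script; the scrape_season helper name is an illustration.

#Sketch only: skip seasons whose pages fail to download or parse,
#and pause between requests
library(XML)

scrape_season <- function(url, year) {
  tab <- tryCatch(readHTMLTable(url)[[1]], error = function(e) NULL)
  if (is.null(tab)) return(NULL)   #skip this season on failure
  tab$year <- year
  Sys.sleep(1)                     #be polite between requests
  tab
}

#Usage with the vectors defined above:
#totals_list <- Map(scrape_season, totals_url, years)
#totals_table <- do.call(rbind, Filter(Negate(is.null), totals_list))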
@@ -0,0 +1,46 @@
## Purpose: Pull actual standings from different seasons in order to
## compare our simulated seasons against them and compute RMSE

## SET WORKING DIRECTORY ##
setwd("C:/Users/leeri_000/basketball_stats/game_simulation")

## Load the library packages (DBI/RSQLite are needed for the database calls below)
library(XML)
library(DBI)
library(RSQLite)

# Set up the connection to our database
con <- dbConnect(SQLite(), dbname = "nba_rRegression_chi/nba.db")
teams <- dbGetQuery(con, 'SELECT * FROM teams')

## Set up years
years <- 2013:2014

## Set up the initial data frame
win_vector <- as.data.frame(matrix(0, nrow = 0, ncol = 3))

for (j in years) {
  ## Set up the URL for extraction
  url <- paste0("http://espn.go.com/nba/standings/_/year/", j)

  ## Keep the name and wins columns for the 30 team rows,
  ## skipping the interleaved header rows
  standings <- readHTMLTable(url, header = TRUE)[[1]][c(2:9, 11:17, 20:27, 29:35), c(2:3)]
  colnames(standings) <- c("name", "wins")
  standings[, 1] <- as.character(standings[, 1])

  ## Strip the playoff-clinch markers ("x - ", "y - ", "z - ") from team names
  standings$name <- gsub("^[xyz] - ", "", standings$name)

  ## Merge these standings with the teams table by aligning the sorted names
  standings <- standings[order(standings$name), ]
  teams <- teams[order(teams$fullName), ]
  team_wins <- cbind(standings, teams)
  ## Store the season's starting year (the ESPN URL uses the ending year)
  team_wins$year <- j - 1
  team_wins <- team_wins[, c(2, 3, 5)]
  win_vector <- rbind(win_vector, team_wins)
}

## Clean and save
rownames(win_vector) <- NULL
write.csv(win_vector, "data/espn_data/team_wins.csv")
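The commit message mentions season 2013 MSE and MAE results. Here is a minimal sketch of how those errors could be computed against the win totals saved above; the simulated-results file name, its wins_sim column, and the shared team/year key are hypothetical placeholders, since the actual results files are not rendered in this diff.

#Sketch only: file and column names below are illustrative assumptions
actual <- read.csv("data/espn_data/team_wins.csv")  #written by the script above
sim <- read.csv("data/sim_wins_2013.csv")           #hypothetical simulation output

#Join on a shared team/year key (called "team" and "year" here for illustration)
both <- merge(actual, sim, by = c("team", "year"))
err <- both$wins - both$wins_sim

mse <- mean(err^2)        #mean squared error
mae <- mean(abs(err))     #mean absolute error
rmse <- sqrt(mse)         #RMSE, as in the script's purpose comment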
@@ -0,0 +1,27 @@
Data Preparation

For this project, since we do not have a ready-made dataset, we put effort into gathering the data from the available resources. This basically covers the selection of data sources and the steps of the ETL process.

Data Source

A very popular resource for NBA statistics is the ESPN NBA website (http://espn.go.com/nba/). Among all of the information about NBA games, we are specifically interested in the following statistics:

Game score: the dataset of each game between 2009 and 2014 to date. It includes the home team, the visiting team, the game score for each team, as well as the game date and game season. Game score will be the target value that we try to model and predict in later work.

Game detail: for each game we are interested not only in the final score; it is also helpful to have the statistics for each player in that game, including minutes played, points, shots attempted / made, rebounds, steals, fouls, etc.

Player detail: for each player, we also have a list of 50 features per season. All standard features listed on the ESPN NBA website for players are included.

From the ESPN NBA data we eventually obtained information on 7,139 games and detailed statistics for 5,511 player-seasons across 30 teams.

Apart from the standard statistics from the ESPN NBA website, it is worth mentioning that we also retrieve a set of newly released statistics from the Basketball Reference website (http://www.basketball-reference.com/). There we can find not only most of the conventional statistics, but also the Real Plus-Minus (RPM) statistics; specifically, the RPM statistics include ORPM (offense), DRPM (defense), overall RPM, etc. These features will be essential for our later experiments, since RPM has recently become the state of the art among features for capturing a player's performance.
ETL

Extraction using web crawlers

For the ESPN NBA statistics we wrote crawlers in Python, using the BeautifulSoup package to parse the HTML pages. For the Basketball Reference website we used R packages for similar tasks (see the scraping script above).

Transformation: merging the two data sources

Since we have two different data sources, we need to treat the data carefully so that the two datasets merge successfully. Within the ESPN dataset, joining the tables is generally unproblematic, since we already have match ids and player ids for each game or player; Basketball Reference, however, provides no such ids, so for player information we join on player names. We therefore manually checked all of the name mismatches and fixed the spelling inconsistencies.

Another common data issue is missing values. For most statistics we simply impute the average value across all records for that feature. (A sketch of the name-keyed merge and the imputation follows below.)
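As an illustration of the name-keyed join and the mean imputation described above, here is a minimal R sketch; the toy frames, their column names, and the rpm feature are illustrative stand-ins, not the project's actual schema, and the manual spelling fixes mentioned above would happen before the join.

#Toy frames standing in for the real ESPN and Basketball Reference tables
espn_players <- data.frame(player_id = 1:2,
                           Name = c("LeBron James", "Tim Duncan "),
                           stringsAsFactors = FALSE)
bref_players <- data.frame(Name = c("lebron james", "tim duncan"),
                           rpm = c(8.9, NA),
                           stringsAsFactors = FALSE)

#Normalize names before joining: trim whitespace and unify case
norm_name <- function(x) tolower(trimws(x))
espn_players$join_key <- norm_name(espn_players$Name)
bref_players$join_key <- norm_name(bref_players$Name)

#Join on the normalized player name
merged <- merge(espn_players, bref_players[, c("join_key", "rpm")],
                by = "join_key", all.x = TRUE)

#Impute missing values with the feature's average across all records
merged$rpm[is.na(merged$rpm)] <- mean(merged$rpm, na.rm = TRUE)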
Loading the data into a database

We use a SQLite database to store all of the cleaned-up data. The design of the database follows third normal form to ensure there is no redundancy, and indexes were built on frequently used keys to speed up queries. The reason we use a database is that we frequently try different formats of the feature tables, so the query and join operations on the primary datasets need to rest on a solid and trustworthy basis.
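A minimal sketch of the loading step using the DBI/RSQLite packages; the table name, columns, and the in-memory database here are illustrative stand-ins for the project's actual schema and nba.db file.

library(DBI)
library(RSQLite)

#Illustrative cleaned-up frame; the project's real schema is not shown here
player_season <- data.frame(player_id = 1:2, year = 2013, pts = c(27.5, 17.8))

con <- dbConnect(SQLite(), dbname = ":memory:")  #in practice: "nba_rRegression_chi/nba.db"

#Load the frame into its own table
dbWriteTable(con, "player_season", player_season, overwrite = TRUE)

#Index the keys that the feature-table joins hit most often
dbExecute(con, "CREATE INDEX idx_player_season ON player_season (player_id, year)")

dbDisconnect(con)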
Binary file not shown.