Commit

added season 2013 MSE and MAE results as well as pulled the season win totals from ESPN
leerichardson committed Nov 29, 2014
1 parent a999260 commit be71bca
Showing 11 changed files with 23,993 additions and 0 deletions.
22,801 changes: 22,801 additions & 0 deletions data/bball_ref/individual/bball_ref_data.csv


52 changes: 52 additions & 0 deletions data/bball_ref/individual/scrape_bball_ref.R
@@ -0,0 +1,52 @@
# Load the XML library (for readHTMLTable)
library(XML)

###### URLs
# Player index pages, one per letter of the alphabet (not used below)
url <- paste0("http://www.basketball-reference.com/players/", letters, "/")

# Seasons for which data is available
years <- 1950:2014

# URLs for the season totals and advanced statistics tables
totals_url <- paste0("http://www.basketball-reference.com/leagues/NBA_", years, "_totals.html")
advanced_url <- paste0("http://www.basketball-reference.com/leagues/NBA_", years, "_advanced.html")

# Number of season tables to scrape
totals_len <- length(totals_url)
advanced_len <- length(advanced_url)

#Initialize both the totals and advanced tables
totals_table <- readHTMLTable(totals_url[1])[[1]]
totals_table$year <- 1950

advanced_table <- readHTMLTable(advanced_url[1])[[1]]
advanced_table$year <- 1950

# Append together all of the totals data
for (i in 2:totals_len) {
  # Read the i-th season's table, then append a variable containing the year
  temp_table <- readHTMLTable(totals_url[i])[[1]]
  temp_table$year <- i + 1949
  totals_table <- rbind(totals_table, temp_table)
}

# Append together all of the advanced data
for (i in 2:advanced_len) {
  # Read the i-th season's table, then append a variable containing the year
  temp_table <- readHTMLTable(advanced_url[i])[[1]]
  temp_table$year <- i + 1949
  advanced_table <- rbind(advanced_table, temp_table)
}

# Combine the two tables and drop the repeated header rows that
# Basketball Reference embeds in the table (rows where Player == "Player")
all_table <- as.data.frame(subset(cbind(totals_table, advanced_table), Player != "Player"))
# Drop the duplicate columns that cbind() suffixed with ".1"; the regex is
# anchored so it matches only that suffix, not any character followed by "1"
all_table_revised <- all_table[, !grepl("\\.1$", names(all_table))]

#Write the polished data frame to a CSV
write.csv(all_table_revised, file = "data/bball_ref_data.csv")
46 changes: 46 additions & 0 deletions data/bball_ref/teams/team_wins.R
@@ -0,0 +1,46 @@
## Purpose: Pull the actual standings from different seasons so that we can
## compare our simulated seasons against them and compute RMSE

## SET WORKING DIRECTORY ##
setwd("C:/Users/leeri_000/basketball_stats/game_simulation")

## Load the required packages: XML for scraping, RSQLite for the database
library(XML)
library(RSQLite)

# Set up a connection to our database
con <- dbConnect(SQLite(), dbname="nba_rRegression_chi/nba.db")
teams <- dbGetQuery(con, 'SELECT * FROM teams')

## Set up years
years <- 2013:2014

## Initialize an empty data frame to accumulate team wins across seasons
win_vector <- as.data.frame(matrix(0, nrow=0, ncol=3))

for(j in years){
  ## Build the ESPN standings URL for season j
  url <- paste("http://espn.go.com/nba/standings/_/year/", j, sep="")
  ## Keep only the team rows (the skipped indices are division header rows
  ## in the scraped table) and the team name and wins columns
  standings <- readHTMLTable(url, header=T)[[1]][c(2:9, 11:17, 20:27, 29:35), c(2:3)]
  colnames(standings) <- c("name", "wins")
  standings[,1] <- as.character(standings[,1])

  ## Strip the playoff-clinch prefixes ("x - ", "y - ", "z - ") from team names
  for(i in 1:nrow(standings)){
    standings[i,1] <- gsub("x - ", "", standings[i,1])
    standings[i,1] <- gsub("y - ", "", standings[i,1])
    standings[i,1] <- gsub("z - ", "", standings[i,1])
  }

  ## Sort both tables by team name so that cbind() aligns each team's rows
  standings <- standings[order(standings$name),]
  teams <- teams[order(teams$fullName),]
  team_wins <- cbind(standings, teams)
  team_wins$year <- j - 1  # label each season by its starting year
  team_wins <- team_wins[, c(2, 3, 5)]
  win_vector <- rbind(win_vector, team_wins)
}

## Clean and save
rownames(win_vector) <- NULL
write.csv(win_vector, "data/espn_data/team_wins.csv")


27 changes: 27 additions & 0 deletions dataPreparation.txt
@@ -0,0 +1,27 @@
Data Preparation
Since no ready-made dataset exists for this project, we gathered the data ourselves from the available resources. This involved selecting the data sources and carrying out the ETL (extract, transform, load) steps.

Data Source
A very popular resource for NBA statistics is the ESPN NBA website (http://espn.go.com/nba/). Of everything available there about NBA games, we are interested in the following statistics:
Game scores: one record for each game from 2009 through 2014 to date, including the home team, the visiting team, the final score for each team, the game date, and the season.
The game score is the target value that our models will learn to predict in later work.
Game details: beyond the final score, we also collect per-player statistics for each game, including minutes played, points, shots attempted and made, rebounds, steals, fouls, etc.
Player details: for each player we have a list of 50 per-season features, covering all of the standard player statistics listed on the ESPN NBA website.
In total, the ESPN NBA data covers 7,139 games and detailed statistics for 5,511 player-seasons across the 30 teams.

Apart from the standard statistics on the ESPN NBA website, it is worth mentioning that we also retrieve a set of recently introduced statistics from the Basketball Reference website (http://www.basketball-reference.com/).
There we find not only most of the conventional statistics but also the Real Plus-Minus (RPM) family of statistics, which includes ORPM (offense), DRPM (defense), and overall RPM.
These features will be essential for our later experiments, since RPM has recently become the state of the art for capturing an individual player's performance.


ETL
Extraction using web crawlers
For the ESPN NBA statistics, we wrote crawlers in Python and used the BeautifulSoup package to parse the HTML pages. For the Basketball Reference website, we used R packages for the same task.
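As a rough sketch of the R side, the snippet below mirrors the approach used in scrape_bball_ref.R in this commit: XML::readHTMLTable parses a season table directly from a Basketball Reference URL. The year chosen is illustrative.

# Pull one season's totals table from Basketball Reference (illustrative year)
library(XML)
year <- 2014
totals_url <- paste0("http://www.basketball-reference.com/leagues/NBA_", year, "_totals.html")
totals <- readHTMLTable(totals_url)[[1]]
# The site repeats its header row inside the table body; drop those rows
totals <- subset(totals, Player != "Player")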
Transforming: merging two data sources
Because the two data sources are independent, each must be prepared separately before the datasets can be merged successfully. Within the ESPN dataset, joining the tables is generally straightforward, since every game and player already carries a match id or player id; Basketball Reference provides no such ids, so for player information we join on player names instead. We therefore manually checked all of the name mismatches and corrected the spelling differences.
Another frequent issue is missing values: for most statistics, we impute the mean value over all records of that feature.
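A minimal sketch of the name-based join and the mean imputation, assuming hypothetical data frames espn_players and bref_players that each carry a name column, plus a numeric rpm column with some missing entries:

# Normalize names before joining, since the two sources format them differently
normalize_name <- function(x) tolower(gsub("\\s+", " ", x))
espn_players$join_name <- normalize_name(espn_players$name)
bref_players$join_name <- normalize_name(bref_players$name)
merged <- merge(espn_players, bref_players, by = "join_name")
# Mean imputation: replace missing values with the feature's average over all records
merged$rpm[is.na(merged$rpm)] <- mean(merged$rpm, na.rm = TRUE)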
Loading data to database
We use a SQLite database to store all of the cleaned data. The schema follows third normal form to avoid redundancy, and indexes were built on frequently used keys to speed up queries.
We use a database because we frequently experiment with different formats of the feature tables, so the query and join operations on the primary datasets need to rest on a solid, trustworthy foundation.
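As a sketch of the load step, assuming the RSQLite package and a hypothetical cleaned data frame team_wins keyed by a team_id column:

# Write a cleaned table into SQLite and index a frequently queried key
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "nba.db")  # illustrative database file
dbWriteTable(con, "team_wins", team_wins, overwrite = TRUE)
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_team_wins_team ON team_wins (team_id)")
dbDisconnect(con)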
Binary file added poster_final_paper/WebofSciencetop100.xlsx
Binary file not shown.