This project investigates the relationship between NBA player salary and player features, including per-game performance metrics (e.g., points, rebounds, and shooting performance) and non-performance statistics (e.g., age and position). Specifically, I extracted eight years of NBA player data, from 2012 to 2020, and then applied linear regression models to determine which features play a vital role in player salary.
- To set up the model, I fit simple linear regression (SLR) and multiple linear regression (MLR) models on the individual performance-salary data for each year.
- Because the performance statistics are correlated with one another, I applied ridge regression to reduce the effect of multicollinearity and lasso regression to reduce the number of predictors.
- Using the subset of predictors obtained from lasso regression, I set up a generalized additive model (GAM) for each year's performance-salary data.
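To illustrate why ridge shrinks coefficients while lasso can remove predictors entirely, here is a minimal sketch for the single-predictor case, where both estimators have closed forms. The function names and the toy data are made up for illustration, and `lam` here is not on the same scale as the lambda values reported by the R fits.

```python
def _centered_sums(x, y):
    """Return (Sxx, Sxy) after centering x and y."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxx, sxy

def ols_slope(x, y):
    sxx, sxy = _centered_sums(x, y)
    return sxy / sxx

def ridge_slope(x, y, lam):
    # Minimizes sum((y - b*x)^2) + lam * b^2: the slope shrinks but never hits zero.
    sxx, sxy = _centered_sums(x, y)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    # Minimizes 0.5 * sum((y - b*x)^2) + lam * |b|: soft-thresholding can give exactly zero.
    sxx, sxy = _centered_sums(x, y)
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

# Made-up toy data, for illustration only.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.2, 0.1, 0.9, 2.1]
print(ols_slope(x, y), ridge_slope(x, y, 5.0), lasso_slope(x, y, 20.0))
```

With a large enough penalty the lasso slope becomes exactly zero, which is the behavior that lets lasso act as a predictor-selection step, whereas the ridge slope only moves toward zero.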
I obtained the player performance data from the 2013 season to the 2020 season using https://www.basketball-reference.com/leagues/NBA_{YEAR}_per_game.html.
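As a small sketch of how the `{YEAR}` placeholder in the URL above could be expanded for each season, assuming the 2013 through 2020 range stated in the text (the helper name `season_urls` is hypothetical and is not taken from the project code):

```python
# URL template copied from the text above; {YEAR} is the season's ending year.
URL_TEMPLATE = "https://www.basketball-reference.com/leagues/NBA_{YEAR}_per_game.html"

def season_urls(first_season=2013, last_season=2020):
    """Return a {season: url} mapping for every season in the inclusive range."""
    return {
        year: URL_TEMPLATE.format(YEAR=year)
        for year in range(first_season, last_season + 1)
    }

urls = season_urls()
for year, url in urls.items():
    print(year, url)
```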
I obtained the player salary data from https://www.basketball-reference.com/contracts/players.html. Unfortunately, Basketball Reference only hosts the contracts for the current year; thus, to examine previous years' contracts, I entered the link above into the Wayback Machine to access past snapshots of the URL. This is why I processed only eight years of data: Basketball Reference had no contract data prior to the 2012-2013 NBA season.
I cleaned the data using preprocess.py, which processes the salary data and matches each salary to the correct player on the correct team in the correct year. I also set baseline requirements that a player's performance statistics must meet to be included in this study.
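The matching step can be pictured as a join on player, team, and year. The following is only a sketch of that idea, not the contents of preprocess.py: the function name, the column names (`Player`, `Tm`, `Year`, `Salary`), and the sample rows are all assumptions made up for illustration.

```python
import csv
import io

def merge_salaries(stats_rows, salary_rows):
    """Attach a salary to each stats row by matching on (Player, Tm, Year).

    Stats rows without a matching salary row are dropped.
    """
    salary_by_key = {
        (r["Player"], r["Tm"], r["Year"]): r["Salary"] for r in salary_rows
    }
    merged = []
    for row in stats_rows:
        key = (row["Player"], row["Tm"], row["Year"])
        if key in salary_by_key:
            merged.append(dict(row, Salary=salary_by_key[key]))
    return merged

# Made-up placeholder rows, for illustration only.
stats_csv = "Player,Tm,Year,PTS\nPlayer A,AAA,2020,10.5\nPlayer B,BBB,2020,4.2\n"
salary_csv = "Player,Tm,Year,Salary\nPlayer A,AAA,2020,1000000\n"

stats = list(csv.DictReader(io.StringIO(stats_csv)))
salaries = list(csv.DictReader(io.StringIO(salary_csv)))
merged = merge_salaries(stats, salaries)
print(merged)
```

Keying on all three fields keeps a player who changed teams mid-season, or who appears in several years, from being matched to the wrong contract.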
Baseline Requirements |
---|
|
The processed CSV files are located in the stats_sal directory. All box score data is averaged over a season for each individual player.
Basic Info | Box Score |
---|---|
| |
Histograms of Salary, Minutes Played, Age, and Points over 8 years:
The relationship between salary and MP/TOV/FT% from different years:
2015: Salary vs. MP | 2020: Salary vs. TOV | 2018: Salary vs. FT% |
---|---|---|
![]() | ![]() | ![]() |
Working with data from each season individually, I set up a simple linear regression (SLR) model relating a single performance statistic to the log-transformed salary data. Thus, for each of the eight years and for each of the 26 performance statistics (excluding position), I generated the SLR model plot and the diagnostic plot for that model. These figures can be found in each year's directory under the specified performance statistic.
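For reference, a single SLR fit of this kind can be sketched in plain Python. The project itself fit these models in R; the function name and sample values below are made up for illustration, and the natural log is an assumption, since the base of the log transform is not stated here.

```python
import math

def fit_slr(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, adjusted R-squared).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    # One predictor, so the residual degrees of freedom are n - 2.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adj_r2

# Made-up values: salaries constructed so that log(salary) is exactly linear in minutes.
minutes = [10.0, 20.0, 30.0, 40.0]
salaries = [math.exp(13.0 + 0.05 * m) for m in minutes]
log_salaries = [math.log(s) for s in salaries]

slope, intercept, adj_r2 = fit_slr(minutes, log_salaries)
print(slope, intercept, adj_r2)
```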
I judged whether each SLR model was a satisfactory fit using the quartiles of the adjusted R-squared values across all the models. Models with green SLR lines were in the top quartile (at or above the 75th percentile) of adjusted R-squared values, models with orange SLR lines were in the interquartile range, and models with red SLR lines were below the 25th percentile.
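The color coding can be sketched as a quartile lookup over the adjusted R-squared values. This is an illustration under assumptions rather than the project's code: the percentile method (linear interpolation), the handling of values that fall exactly on a boundary, the function names, and the sample values are all made up here.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of a pre-sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def color_by_quartile(adj_r2_by_stat):
    """Map each statistic to green, orange, or red by its adjusted R-squared."""
    values = sorted(adj_r2_by_stat.values())
    q1 = percentile(values, 0.25)
    q3 = percentile(values, 0.75)
    colors = {}
    for stat, r2 in adj_r2_by_stat.items():
        if r2 >= q3:
            colors[stat] = "green"   # top quartile
        elif r2 < q1:
            colors[stat] = "red"     # bottom quartile
        else:
            colors[stat] = "orange"  # interquartile range
    return colors

# Made-up adjusted R-squared values, for illustration only.
example = {"MP": 0.40, "PTS": 0.35, "AST": 0.20, "TOV": 0.15, "FT%": 0.01}
colors = color_by_quartile(example)
print(colors)
```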
MLR: lm output and diagnostic plot
2020: MLR lm output | 2020: MLR Diagnostic plot |
---|---|
![]() | ![]() |
The predictors with p-values less than 0.05 are PosC, Age, MP, FG%, eFG%, and BLK. This output is interesting because it contradicts the hypothesis raised in the SLR analysis: although they are percentage statistics, FG% and eFG% are deemed significant. Furthermore, Age, a predictor that was deemed insignificant in the SLR model, has the lowest p-value of any predictor. Looking through past years, Age is consistently the predictor with the lowest p-value, and many of the significant predictors are percentage statistics.
Ridge regression estimates without choosing lambda by cross-validation
2020: Ridge regression lambda output | 2020: Ridge regression plot |
---|---|
![]() | ![]() |
Ridge regression estimates with lambda chosen by cross-validation
2020: Ridge regression lambda output | 2020: Ridge regression plot |
---|---|
![]() | ![]() |
Optimized lasso regression model for the 2020 data
2020: Lasso regression lambda output | 2020: Lasso regression plot |
---|---|
![]() | ![]() |
Generalized additive model (GAM) for the 2020 data
2020: GAM on Age | 2020: GAM on X2PA |
---|---|
![]() | ![]() |
In-depth analysis of the graphs provided above can be found in STAT410 Final.pdf.
Project is created with:
- Python 3.6, R
- csv: Python module for processing CSV files
- glmnet: R package for ridge and lasso regression