Results.ipynb takes a look at the checkins from a few different angles:
- It models them as a networkx bipartite graph, with Users as one partition, Beers as the other, and ratings as edges. The networkx methods available on the resulting graph, such as `diameter()`, take too long to run at this graph size, so I used my own graph representation to calculate the sparsity of the User-Beer-Rating graph: it's very sparse (see the first sketch after this list).
- It attempts to use the Surprise library to apply SVD and K-Nearest Neighbors to the ratings in order to predict unseen ratings. Both approaches fail to beat an educated baseline estimate, calculated by simply adding the user's mean bias to the beer's mean overall rating (sketched after this list). The KNN model in particular underperforms, which suggests the graph is simply too sparse for that kind of analysis.
- It attempts to calculate Pearson similarities for pairs of users and to adjust the educated baseline predictions slightly, where justified (see the third sketch below). After much parameter tuning, any tiny improvement on the baseline appears to be just overfitting the parameters to the test batch.
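Below is a minimal sketch of the bipartite model and the sparsity calculation. The `checkins` DataFrame and its column names are assumptions standing in for the notebook's actual data:

```python
import networkx as nx
import pandas as pd

# Toy stand-in for the real check-in data; column names are assumptions.
checkins = pd.DataFrame({
    "user":   ["u1", "u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "beer":   ["b1", "b2", "b3", "b1", "b2", "b2", "b3", "b1"],
    "rating": [4.0, 3.5, 4.25, 4.5, 3.0, 3.75, 4.0, 4.25],
})

B = nx.Graph()
B.add_nodes_from(checkins["user"].unique(), bipartite=0)  # User partition
B.add_nodes_from(checkins["beer"].unique(), bipartite=1)  # Beer partition
B.add_weighted_edges_from(checkins[["user", "beer", "rating"]].itertuples(index=False))

# Sparsity: observed edges over all possible User-Beer pairs. On the real
# data this ratio is tiny, which is what "very sparse" means above.
density = B.number_of_edges() / (checkins["user"].nunique() * checkins["beer"].nunique())
print(f"density = {density:.5f}")
```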
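The educated baseline can be computed directly from the check-ins. A hedged sketch, reusing the toy `checkins` above, with the Surprise models it is compared against (the rating scale is an assumption):

```python
from surprise import SVD, KNNBasic, Dataset, Reader
from surprise.model_selection import cross_validate

# Baseline: the beer's mean rating plus the user's mean deviation from the
# global mean, falling back to the global mean for unseen users/beers.
global_mean = checkins["rating"].mean()
beer_mean = checkins.groupby("beer")["rating"].mean()
user_bias = checkins.groupby("user")["rating"].mean() - global_mean

def baseline(user, beer):
    return beer_mean.get(beer, global_mean) + user_bias.get(user, 0.0)

# The SVD and KNN models that fail to beat the baseline above.
data = Dataset.load_from_df(checkins[["user", "beer", "rating"]],
                            Reader(rating_scale=(0, 5)))  # assumed scale
for algo in (SVD(), KNNBasic()):
    cross_validate(algo, data, measures=["RMSE"], cv=2, verbose=True)
```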
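And a sketch of the Pearson adjustment, assuming `checkins` and `baseline()` from the sketches above; `min_common` and `alpha` are hypothetical stand-ins for the parameters the notebook tunes:

```python
import numpy as np

def pearson(u, v, min_common=3):
    # Similarity over the beers both users have rated.
    ru = checkins[checkins["user"] == u].set_index("beer")["rating"]
    rv = checkins[checkins["user"] == v].set_index("beer")["rating"]
    common = ru.index.intersection(rv.index)
    if len(common) < min_common:
        return 0.0
    x, y = ru[common], rv[common]
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def adjusted_prediction(user, beer, neighbors, alpha=0.1):
    # Nudge the baseline toward similar users' deviations on this beer.
    nudge = 0.0
    for v in neighbors:
        sim = pearson(user, v)
        rv = checkins[(checkins["user"] == v) & (checkins["beer"] == beer)]["rating"]
        if sim and not rv.empty:
            nudge += sim * (rv.iloc[0] - baseline(v, beer))
    return baseline(user, beer) + alpha * nudge
```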
Results_2.ipynb uses sklearn's GradientBoostingRegressor to try to beat the baseline predictions, but the only features that turn out to have any importance are alcohol content and the beer's global rating.
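A minimal sketch of that feature-importance check, on synthetic stand-in data (the real notebook builds its features from the check-ins joined with per-beer metadata; the feature names here are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "abv": rng.uniform(3, 12, n),               # alcohol content
    "global_rating": rng.uniform(2.5, 4.5, n),  # the beer's overall rating
    "ibu": rng.uniform(5, 100, n),              # an example extra feature
})
# Synthetic target loosely tied to the first two features, for illustration.
y = X["global_rating"] + 0.02 * X["abv"] + rng.normal(0, 0.3, n)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
for name, imp in zip(X.columns, gbr.feature_importances_):
    print(f"{name}: {imp:.3f}")
```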
DescriptionNLP.ipynb uses sklearn's SGDRegressor to find correlations between the terms used in beer descriptions and ratings. The learned weights can help predict global ratings for new beers. For models trained on individual users, however, the known global rating of the beer remains the best predictor until the user has rated over 150 beers, at which point the descriptions can begin to help the predictions. FrequentRaters.ipynb shows how: the rank accuracy of the recommendations begins to beat the baseline accuracy.
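A hedged sketch of the description regression: vectorize the description text and fit a linear model with SGD. The TF-IDF step and the toy data are assumptions; the notebook may vectorize differently:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the beers' description text and ratings.
descriptions = [
    "citrusy hop forward ipa with pine notes",
    "roasty imperial stout aged in bourbon barrels",
    "crisp light lager with mild bitterness",
    "juicy hazy ipa bursting with tropical hops",
]
ratings = [4.1, 4.4, 3.2, 4.3]

model = make_pipeline(TfidfVectorizer(), SGDRegressor(random_state=0))
model.fit(descriptions, ratings)

# The terms with the largest learned weights correlate with higher ratings.
vec = model.named_steps["tfidfvectorizer"]
sgd = model.named_steps["sgdregressor"]
top = sgd.coef_.argsort()[::-1][:5]
print(vec.get_feature_names_out()[top])
```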
Mile_2.ipynb revisits Pearson similarities between pairs of users and finds that, given enough commonly-rated beers, user similarity can improve the recommendations. This notebook updates every Pearson coefficient online: a large dictionary stores the components of the Pearson calculation needed to update each coefficient whenever one of the two users rates a new beer (sketched below). Mile3.ipynb is mostly the same, except that it uses scipy.stats' pearsonr function to calculate each similarity.
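A minimal sketch of that online update: for each user pair, store the six running components of the Pearson formula, so each newly co-rated beer updates the similarity in O(1). The names here are assumptions:

```python
from collections import defaultdict
from math import sqrt

# (user_a, user_b) -> [n, sum_x, sum_y, sum_xx, sum_yy, sum_xy]
components = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])

def update(pair, x, y):
    # Called when both users in `pair` have now rated the same beer,
    # with ratings x and y respectively.
    c = components[pair]
    c[0] += 1
    c[1] += x
    c[2] += y
    c[3] += x * x
    c[4] += y * y
    c[5] += x * y

def pearson(pair):
    n, sx, sy, sxx, syy, sxy = components[pair]
    if n < 2:
        return 0.0
    num = n * sxy - sx * sy
    den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
    return num / den if den else 0.0

# Example: three beers co-rated by one pair of users.
for x, y in [(4.0, 4.5), (3.5, 3.0), (4.25, 4.0)]:
    update(("u1", "u2"), x, y)
print(pearson(("u1", "u2")))  # agrees with scipy.stats.pearsonr on the same vectors
```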
The other notebooks in this directory are used for reports that incorporate the above notebooks and make up pieces of the overall project.