
Commit

add mike's changes
jacobBaumbach committed Jan 19, 2018
1 parent 15459ce commit 47e6252
Showing 4 changed files with 130 additions and 13 deletions.
143 changes: 130 additions & 13 deletions Week_03.ipynb
@@ -61,6 +61,15 @@
"There is a method implemented in Scikit that splits the dataset randomly for us called [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split). We can use this method twice to perform a train-validation-test split done below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![a](images/dataset.png)\n",
"![a](images/testtrainvalidation.png)\n",
"[source](https://cdn-images-1.medium.com/max/948/1*4G__SV580CxFj78o9yUXuQ.png)"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -85,10 +94,13 @@
"validation_test_size = validation_size + test_size\n",
"test_size_adjusted = test_size / validation_test_size\n",
"\n",
"# perform the first split which gets us the train data and the validation/test data that\n",
"# we must split one more time\n",
"X_train, X_validation_test, y_train, y_validation_test = train_test_split(X, y,\\\n",
" test_size = validation_test_size,\\\n",
" random_state = random_state)\n",
"\n",
"# perform the second split which splits the validation/test data into two distinct datasets\n",
"X_validation, X_test, y_validation, y_test = train_test_split(X_validation_test, y_validation_test,\\\n",
" test_size = test_size_adjusted,\\\n",
" random_state = random_state)"
@@ -165,6 +177,15 @@
"There are two kinds of Grid Search, exhaustive and random."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![a](images/gridsearch.png)\n",
"\n",
"[source](https://cdn-images-1.medium.com/max/1920/1*Uxo81NjcpqNXYJCeqnK1Pw.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -196,11 +217,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A random search for parameter values uses a generating function (typically a selected distribution, i.e. rbf/beta/gamma with user-input parameters) to produce candidate value sets for the hyperparameters. This has two main benefits over an exhaustive search:\n",
"\n",
" 1) A budget can be chosen independent of the number of parameters and possible values. Thus the user only has one parameter to handle.\n",
"A random search for parameter values uses a generating function (typically a selected distribution, i.e. rbf/beta/gamma with user-input parameters) to produce candidate value sets for the hyperparameters. This has one main benefits over an exhaustive search:\n",
"\n",
" 2) Adding parameters that do not influence the performance does not decrease efficiency, contrary to a standard grid search in that manual selections of a specifed parameter may result in very little influence to the tuning."
" - A budget can be chosen independent of the number of parameters and possible values. Thus the user only has one parameter to handle.\n"
]
},
{
@@ -253,17 +272,20 @@
"from sklearn.datasets import load_digits\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# get some data\n",
"# load digit dataset\n",
"digits = load_digits()\n",
"# split data into inputs and output\n",
"X, y = digits.data, digits.target\n",
"\n",
"# build a classifier\n",
"# build a random forest classifier\n",
"clf = RandomForestClassifier(n_estimators=20)\n",
"\n",
"\n",
"# Utility function to report best scores\n",
"def report(grid_scores, n_top=3):\n",
" # sort scores based on metric so we can grab the n_top models\n",
" top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]\n",
" # iterate over the n_top models\n",
" for i in range(n_top):\n",
" print(\"Model with rank: {0}\".format(i + 1))\n",
" print(\"Mean validation score: {0:.3f} (std: {1:.3f})\".format(\n",
Expand All @@ -274,43 +296,54 @@
"\n",
"\n",
"# specify parameters and distributions to sample from - \n",
"# what methods might we consider that would improve these estimates\n",
"# \n",
"# what methods might we consider that would improve these estimates?\n",
"param_dist = {\"max_depth\": [3, None],\n",
" \"max_features\": sp_randint(1, 11),\n",
" \"min_samples_split\": sp_randint(2, 11),\n",
" \"min_samples_leaf\": sp_randint(1, 11),\n",
" \"bootstrap\": [True, False],\n",
" \"criterion\": [\"gini\", \"entropy\"]}\n",
"\n",
"# run randomized search\n",
"# number of models we are going to train\n",
"n_iter_search = 20\n",
"# create our randomized gridsearch classifier\n",
"# clf, is the model we are performing the search on\n",
"# param_dist, is a dictionary of paramater distributions that we will sample over\n",
"# n_iter_search, number of models we are going to train\n",
"# True, the scores from our training for each model will be returned when we perform the gridsearch\n",
"random_search = RandomizedSearchCV(clf, param_distributions=param_dist,\n",
" n_iter=n_iter_search, return_train_score=True)\n",
"\n",
"# start a timer so we know how long the random gridsearch took\n",
"start = time()\n",
"# perform the random gridsearch\n",
"random_search.fit(X, y)\n",
"print(\"RandomizedSearchCV took %.2f seconds for %d candidates\"\n",
" \" parameter settings.\" % ((time() - start), n_iter_search))\n",
"# print the top 3 model outputs from the random gridsearch\n",
"report(random_search.cv_results_)\n",
"\n",
"# use a full grid over all parameters. \n",
"# The grid search will generate parameter sets for each and every one of these\n",
"# \n",
"param_grid = {\"max_depth\": [3, None],\n",
" \"max_features\": [1, 3, 10],\n",
" \"min_samples_split\": [2,3,10],\n",
" \"min_samples_leaf\": [1, 3, 10],\n",
" \"bootstrap\": [True, False],\n",
" \"criterion\": [\"gini\", \"entropy\"]}\n",
"\n",
"# run grid search\n",
"# create an exhaustive gridsearch object\n",
"# clf, is the model we are performing the search on\n",
"# param_grid dictionary with the parameter settings the search will try\n",
"# True, the scores from our training for each model will be returned when we perform the gridsearch\n",
"grid_search = GridSearchCV(clf, param_grid=param_grid, return_train_score=True)\n",
"# start a timer so we know how long the exhaustive gridsearch took\n",
"start = time()\n",
"# perform the exhaustive gridsearch\n",
"grid_search.fit(X, y)\n",
"\n",
"print(\"GridSearchCV took %.2f seconds for %d candidate parameter settings.\"\n",
" % (time() - start, len(grid_search.cv_results_)))\n",
"# print the top 3 model outputs from the exhaustive gridsearch\n",
"report(grid_search.cv_results_)"
]
},
@@ -476,7 +509,91 @@
"collapsed": true
},
"source": [
"Create your own pipeline where you take the embeddings we created in Lecture 2 and feed them into XGBoost, that we learned about in Lecture 1."
"Now you will create your own pipeline where you take the embeddings we created in Lecture 2 and feed them into XGBoost, that we learned about in Lecture 1."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1) Setup a pipeline where embeddings we created in Lecture 2 are fed into XGBoost"
]
},
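{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a pipeline, assuming the Lecture 2 Doc2Vec embeddings are available as `doc_vectors` with matching `labels` (both names are hypothetical) and that the `xgboost` package is installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from xgboost import XGBClassifier\n",
"\n",
"# stack the per-document embeddings into a feature matrix\n",
"X = np.array(doc_vectors)\n",
"y = np.array(labels)\n",
"\n",
"# hold out a test set, then fit XGBoost on the embeddings\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
"model = XGBClassifier(n_estimators=100)\n",
"model.fit(X_train, y_train)\n",
"print(\"test accuracy: {0:.3f}\".format(model.score(X_test, y_test)))"
]
},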
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2) How did your first iteration of the pipeline do?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3) How could we improve the performance of the pipeline?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4) What parameters are important to tune for the [embedding process?](https://radimrehurek.com/gensim/models/doc2vec.html)"
]
},
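{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged illustration for question 4 (argument names follow recent gensim releases, which renamed `size`/`iter` to `vector_size`/`epochs`; `tagged_docs` stands in for the Lecture 2 corpus):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models.doc2vec import Doc2Vec\n",
"\n",
"# tagged_docs is assumed to be a list of TaggedDocument objects from Lecture 2\n",
"model = Doc2Vec(tagged_docs,\n",
"                vector_size=100,  # dimensionality of the embeddings\n",
"                window=5,         # context window around each word\n",
"                min_count=2,      # ignore words rarer than this\n",
"                epochs=20)        # training passes over the corpus"
]
},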
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5) What parameters are important to tune for [XGBoost?](http://xgboost.readthedocs.io/en/latest/python/python_api.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6) Now that you know what parameters are important to both processes in the pipeline, hypertune both models."
]
},
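{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to approach question 6, sketched under the assumption that `X_train` and `y_train` come from the pipeline sketch above, is to reuse this lecture's randomized search on the XGBoost step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import randint, uniform\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from xgboost import XGBClassifier\n",
"\n",
"# distributions to sample XGBoost hyperparameters from\n",
"param_dist = {\"max_depth\": randint(2, 8),\n",
"              \"n_estimators\": randint(50, 300),\n",
"              \"learning_rate\": uniform(0.01, 0.3)}\n",
"\n",
"search = RandomizedSearchCV(XGBClassifier(), param_distributions=param_dist,\n",
"                            n_iter=10, return_train_score=True)\n",
"search.fit(X_train, y_train)\n",
"print(search.best_params_)"
]
},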
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7) Are there any sources of information leakage? Explain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8) Is the data balanced? How do we know the balances of the data?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9) If the data is imbalanced what can we do to make our pipeline robust to the imbalances?"
]
},
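{
"cell_type": "markdown",
"metadata": {},
"source": [
"One hedged option for question 9 is to reweight training examples by inverse class frequency instead of resampling (assumes `model`, `X_train`, and `y_train` from the sketches above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.utils.class_weight import compute_sample_weight\n",
"\n",
"# weight each example inversely to its class frequency\n",
"sample_weight = compute_sample_weight(class_weight=\"balanced\", y=y_train)\n",
"model.fit(X_train, y_train, sample_weight=sample_weight)"
]
},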
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10) Should our test set be balanced or not? Explain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"11) Based on the data we have, should we perform KFold Cross Validation and/or a train-validation-test split?"
]
},
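{
"cell_type": "markdown",
"metadata": {},
"source": [
"For question 11, a minimal sketch of K-Fold cross validation to compare against the fixed split (assumes `model`, `X`, and `y` from the sketches above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# 5-fold cross validation: every example serves in both training and validation folds\n",
"scores = cross_val_score(model, X, y, cv=5)\n",
"print(\"mean: {0:.3f}, std: {1:.3f}\".format(scores.mean(), scores.std()))"
]
},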
{
"cell_type": "markdown",
"metadata": {},
"source": [
"12) If time permits, write some code so that we can have balanced classes."
]
}
],
Binary file added images/dataset.png
Binary file added images/gridsearch.png
Binary file added images/testtrainvalidation.png
