
Commit

add mike's changes
jacobBaumbach committed Jan 19, 2018
1 parent 15459ce commit 47e6252
Showing 4 changed files with 130 additions and 13 deletions.
143 changes: 130 additions & 13 deletions Week_03.ipynb
@@ -61,6 +61,15 @@
"There is a method implemented in Scikit that splits the dataset randomly for us called [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split). We can use this method twice to perform a train-validation-test split done below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![a](images/dataset.png)\n",
"![a](images/testtrainvalidation.png)\n",
"[source](https://cdn-images-1.medium.com/max/948/1*4G__SV580CxFj78o9yUXuQ.png)"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -85,10 +94,13 @@
"validation_test_size = validation_size + test_size\n",
"test_size_adjusted = test_size / validation_test_size\n",
"\n",
"# perform the first split which gets us the train data and the validation/test data that\n",
"# we must split one more time\n",
"X_train, X_validation_test, y_train, y_validation_test = train_test_split(X, y,\\\n",
" test_size = validation_test_size,\\\n",
" random_state = random_state)\n",
"\n",
"# perform the second split which splits the validation/test data into two distinct datasets\n",
"X_validation, X_test, y_validation, y_test = train_test_split(X_validation_test, y_validation_test,\\\n",
" test_size = test_size_adjusted,\\\n",
" random_state = random_state)"
@@ -165,6 +177,15 @@
"There are two kinds of Grid Search, exhaustive and random."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![a](images/gridsearch.png)\n",
"\n",
"[source](https://cdn-images-1.medium.com/max/1920/1*Uxo81NjcpqNXYJCeqnK1Pw.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -196,11 +217,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A random search for parameter values uses a generating function (typically a selected distribution, i.e. rbf/beta/gamma with user-input parameters) to produce candidate value sets for the hyperparameters. This has two main benefits over an exhaustive search:\n",
"\n",
" 1) A budget can be chosen independent of the number of parameters and possible values. Thus the user only has one parameter to handle.\n",
"A random search for parameter values uses a generating function (typically a selected distribution, i.e. rbf/beta/gamma with user-input parameters) to produce candidate value sets for the hyperparameters. This has one main benefits over an exhaustive search:\n",
"\n",
" 2) Adding parameters that do not influence the performance does not decrease efficiency, contrary to a standard grid search in that manual selections of a specifed parameter may result in very little influence to the tuning."
" - A budget can be chosen independent of the number of parameters and possible values. Thus the user only has one parameter to handle.\n"
]
},
{
@@ -253,17 +272,20 @@
"from sklearn.datasets import load_digits\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# get some data\n",
"# load digit dataset\n",
"digits = load_digits()\n",
"# split data into inputs and output\n",
"X, y = digits.data, digits.target\n",
"\n",
"# build a classifier\n",
"# build a random forest classifier\n",
"clf = RandomForestClassifier(n_estimators=20)\n",
"\n",
"\n",
"# Utility function to report best scores\n",
"def report(grid_scores, n_top=3):\n",
" # sort scores based on metric so we can grab the n_top models\n",
" top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]\n",
" # iterate over the n_top models\n",
" for i in range(n_top):\n",
" print(\"Model with rank: {0}\".format(i + 1))\n",
" print(\"Mean validation score: {0:.3f} (std: {1:.3f})\".format(\n",
Expand All @@ -274,43 +296,54 @@
"\n",
"\n",
"# specify parameters and distributions to sample from - \n",
"# what methods might we consider that would improve these estimates\n",
"# \n",
"# what methods might we consider that would improve these estimates?\n",
"param_dist = {\"max_depth\": [3, None],\n",
" \"max_features\": sp_randint(1, 11),\n",
" \"min_samples_split\": sp_randint(2, 11),\n",
" \"min_samples_leaf\": sp_randint(1, 11),\n",
" \"bootstrap\": [True, False],\n",
" \"criterion\": [\"gini\", \"entropy\"]}\n",
"\n",
"# run randomized search\n",
"# number of models we are going to train\n",
"n_iter_search = 20\n",
"# create our randomized gridsearch classifier\n",
"# clf, is the model we are performing the search on\n",
"# param_dist, is a dictionary of paramater distributions that we will sample over\n",
"# n_iter_search, number of models we are going to train\n",
"# True, the scores from our training for each model will be returned when we perform the gridsearch\n",
"random_search = RandomizedSearchCV(clf, param_distributions=param_dist,\n",
" n_iter=n_iter_search, return_train_score=True)\n",
"\n",
"# start a timer so we know how long the random gridsearch took\n",
"start = time()\n",
"# perform the random gridsearch\n",
"random_search.fit(X, y)\n",
"print(\"RandomizedSearchCV took %.2f seconds for %d candidates\"\n",
" \" parameter settings.\" % ((time() - start), n_iter_search))\n",
"# print the top 3 model outputs from the random gridsearch\n",
"report(random_search.cv_results_)\n",
"\n",
"# use a full grid over all parameters. \n",
"# The grid search will generate parameter sets for each and every one of these\n",
"# \n",
"param_grid = {\"max_depth\": [3, None],\n",
" \"max_features\": [1, 3, 10],\n",
" \"min_samples_split\": [2,3,10],\n",
" \"min_samples_leaf\": [1, 3, 10],\n",
" \"bootstrap\": [True, False],\n",
" \"criterion\": [\"gini\", \"entropy\"]}\n",
"\n",
"# run grid search\n",
"# create an exhaustive gridsearch object\n",
"# clf, is the model we are performing the search on\n",
"# param_grid dictionary with the parameter settings the search will try\n",
"# True, the scores from our training for each model will be returned when we perform the gridsearch\n",
"grid_search = GridSearchCV(clf, param_grid=param_grid, return_train_score=True)\n",
"# start a timer so we know how long the exhaustive gridsearch took\n",
"start = time()\n",
"# perform the exhaustive gridsearch\n",
"grid_search.fit(X, y)\n",
"\n",
"print(\"GridSearchCV took %.2f seconds for %d candidate parameter settings.\"\n",
" % (time() - start, len(grid_search.cv_results_)))\n",
"# print the top 3 model outputs from the exhaustive gridsearch\n",
"report(grid_search.cv_results_)"
]
},
@@ -476,7 +509,91 @@
"collapsed": true
},
"source": [
"Create your own pipeline where you take the embeddings we created in Lecture 2 and feed them into XGBoost, that we learned about in Lecture 1."
"Now you will create your own pipeline where you take the embeddings we created in Lecture 2 and feed them into XGBoost, that we learned about in Lecture 1."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1) Setup a pipeline where embeddings we created in Lecture 2 are fed into XGBoost"
]
},
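{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a pipeline, assuming the Lecture 2 Doc2Vec embeddings are available as `doc_vectors` with matching `labels` (both names are hypothetical) and that the `xgboost` package is installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from xgboost import XGBClassifier\n",
"\n",
"# stack the per-document embeddings into a feature matrix\n",
"X = np.array(doc_vectors)\n",
"y = np.array(labels)\n",
"\n",
"# hold out a test set, then fit XGBoost on the embeddings\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n",
"model = XGBClassifier(n_estimators=100)\n",
"model.fit(X_train, y_train)\n",
"print(\"test accuracy: {0:.3f}\".format(model.score(X_test, y_test)))"
]
},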
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2) How did your first iteration of the pipeline do?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3) How could we improve the performance of the pipeline?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4) What parameters are important to tune for the [embedding process?](https://radimrehurek.com/gensim/models/doc2vec.html)"
]
},
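{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged illustration for question 4 (argument names follow recent gensim releases, which renamed `size`/`iter` to `vector_size`/`epochs`; `tagged_docs` stands in for the Lecture 2 corpus):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gensim.models.doc2vec import Doc2Vec\n",
"\n",
"# tagged_docs is assumed to be a list of TaggedDocument objects from Lecture 2\n",
"model = Doc2Vec(tagged_docs,\n",
"                vector_size=100,  # dimensionality of the embeddings\n",
"                window=5,         # context window around each word\n",
"                min_count=2,      # ignore words rarer than this\n",
"                epochs=20)        # training passes over the corpus"
]
},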
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5) What parameters are important to tune for [XGBoost?](http://xgboost.readthedocs.io/en/latest/python/python_api.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6) Now that you know what parameters are important to both processes in the pipeline, hypertune both models."
]
},
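{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to approach question 6, sketched under the assumption that `X_train` and `y_train` come from the pipeline sketch above, is to reuse this lecture's randomized search on the XGBoost step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import randint, uniform\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"from xgboost import XGBClassifier\n",
"\n",
"# distributions to sample XGBoost hyperparameters from\n",
"param_dist = {\"max_depth\": randint(2, 8),\n",
"              \"n_estimators\": randint(50, 300),\n",
"              \"learning_rate\": uniform(0.01, 0.3)}\n",
"\n",
"search = RandomizedSearchCV(XGBClassifier(), param_distributions=param_dist,\n",
"                            n_iter=10, return_train_score=True)\n",
"search.fit(X_train, y_train)\n",
"print(search.best_params_)"
]
},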
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7) Are there any sources of information leakage? Explain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"8) Is the data balanced? How do we know the balances of the data?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9) If the data is imbalanced what can we do to make our pipeline robust to the imbalances?"
]
},
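{
"cell_type": "markdown",
"metadata": {},
"source": [
"One hedged option for question 9 is to reweight training examples by inverse class frequency instead of resampling (assumes `model`, `X_train`, and `y_train` from the sketches above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.utils.class_weight import compute_sample_weight\n",
"\n",
"# weight each example inversely to its class frequency\n",
"sample_weight = compute_sample_weight(class_weight=\"balanced\", y=y_train)\n",
"model.fit(X_train, y_train, sample_weight=sample_weight)"
]
},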
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10) Should our test set be balanced or not? Explain."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"11) Based on the data we have, should we perform KFold Cross Validation and/or a train-validation-test split?"
]
},
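{
"cell_type": "markdown",
"metadata": {},
"source": [
"For question 11, a minimal sketch of K-Fold cross validation to compare against the fixed split (assumes `model`, `X`, and `y` from the sketches above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# 5-fold cross validation: every example serves in both training and validation folds\n",
"scores = cross_val_score(model, X, y, cv=5)\n",
"print(\"mean: {0:.3f}, std: {1:.3f}\".format(scores.mean(), scores.std()))"
]
},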
{
"cell_type": "markdown",
"metadata": {},
"source": [
"12) If time permits, write some code so that we can have balanced classes."
]
}
],
Binary file added images/dataset.png
Binary file added images/gridsearch.png
Binary file added images/testtrainvalidation.png
