Commit

Custom metric
Norbert Kozlowski committed Jul 5, 2016
1 parent 0aa4159 commit bb8c13d
Showing 4 changed files with 575 additions and 211 deletions.
168 changes: 22 additions & 146 deletions notebooks/3. Going deeper/3.1 Spotting Most Important Features.ipynb

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions notebooks/3. Going deeper/3.2 Bias-variance tradeoff.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,12 @@
"source": [
"## Bias/variance trade-off\n",
"\n",
"The following notebook presents visual explanation about how to deal with bias/variance trade-off, which is common machine learning problem\n",
"The following notebook presents a visual explanation of how to deal with the bias/variance trade-off, which is a common machine learning problem.\n",
"\n",
"- Bias and variance\n",
"- Under- and over-fitting\n",
"- How to detect it (plot, and interpret)\n",
"- What we can do?"
"**What you will learn**:\n",
"\n",
"- What bias and variance mean in terms of an ML problem\n",
"- How to detect and deal with under- and over-fitting"
]
},
{
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,21 +12,25 @@
"metadata": {},
"source": [
"## Hyper-parameter tuning with grid search\n",
"\n",
"As you know, there are plenty of tunable parameters. Each combination results in a different output, and the question is which one yields the best result.\n",
"\n",
"The following notebook will show you how to configure the Scikit-learn grid search module to figure out the best parameters for your XGBoost model.\n",
"\n",
"**What you will learn:** Finding best hyper-parameters for your dataset"
"**What you will learn:**\n",
"- Finding best hyper-parameters for your dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's begin with loading all required libraries."
"Let's begin by loading all required libraries in one place and setting the seed for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 1,
"metadata": {
"collapsed": true
},
Expand All @@ -49,12 +53,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate test dataset"
"Generate an artificial dataset"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 2,
"metadata": {
"collapsed": false
},
Expand All @@ -67,12 +71,12 @@
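The dataset-generation cell is collapsed in this diff. A minimal sketch of what it might contain, using scikit-learn's `make_classification` with hypothetical sizes (the notebook's exact arguments are not shown):

```python
import numpy as np
from sklearn.datasets import make_classification

seed = 123

# hypothetical sizes -- the notebook's exact call is collapsed in the diff
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, n_classes=2,
                           random_state=seed)
print(X.shape, y.shape)  # (1000, 20) (1000,)
```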
"cell_type": "markdown",
"metadata": {},
"source": [
"Define cross-validation strategy for testing"
"Define a cross-validation strategy for testing. Let's use `StratifiedKFold`, which guarantees that the target label is equally distributed across each fold."
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 3,
"metadata": {
"collapsed": true
},
Expand All @@ -85,12 +89,12 @@
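A sketch of such a strategy using the modern scikit-learn API (this 2016 notebook used the old `sklearn.cross_validation.StratifiedKFold(y, n_folds=10)` form); the perfectly balanced toy labels are illustrative:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 1] * 50)  # balanced toy labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

# every test fold keeps the 50/50 class balance of the full target
for _, test_idx in cv.split(np.zeros((len(y), 1)), y):
    counts = np.bincount(y[test_idx])
    assert counts[0] == counts[1]
```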
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a dictionary holding possible parameter values."
"Define a dictionary holding possible parameter values we want to test."
]
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 17,
"metadata": {
"collapsed": false
},
Expand All @@ -99,20 +103,20 @@
"params_grid = {\n",
" 'max_depth': [1, 2, 3],\n",
" 'n_estimators': [5, 10, 25, 50],\n",
" 'learning_rate': [0.1, 0.5, 1.0]\n",
" 'learning_rate': np.linspace(1e-16, 1, 3)\n",
"}"
]
},
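Note what `np.linspace(1e-16, 1, 3)` actually produces: three evenly spaced values, the first of which is effectively zero.

```python
import numpy as np

rates = np.linspace(1e-16, 1, 3)
print(rates)  # approximately [1e-16, 0.5, 1.0]

# a learning rate of ~1e-16 effectively disables learning, which is why
# those parameter combinations score around 0.5 (chance level) in the results
```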
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And those for which we want values to be fixed"
"And a dictionary for fixed parameters."
]
},
{
"cell_type": "code",
"execution_count": 42,
"execution_count": 12,
"metadata": {
"collapsed": true
},
Expand All @@ -124,9 +128,16 @@
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a `GridSearchCV` estimator. We will be looking for the combination that gives the best accuracy."
]
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 18,
"metadata": {
"collapsed": false
},
Expand All @@ -144,12 +155,12 @@
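The construction cell is collapsed here. A self-contained sketch of the same pattern — `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so the snippet runs without XGBoost, and all sizes are illustrative, not the notebook's:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for xgb.XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, n_classes=2, random_state=123)

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10],
    'learning_rate': [0.1, 0.5, 1.0],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

bst_grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=123),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy',
)
bst_grid.fit(X, y)
print(bst_grid.best_score_, bst_grid.best_params_)
```

With `refit=True` (the default), the best combination is automatically retrained on the whole dataset, so the fitted object can be used for prediction directly.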
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the calcuations notice that $3*4*3*10=360$ models will be created for testing all combinations."
"Before running the calculations, notice that $3*4*3*10=360$ models will be created to test all combinations. You should always have a rough estimate of how much computation a run will require."
]
},
{
"cell_type": "code",
"execution_count": 44,
"execution_count": 19,
"metadata": {
"collapsed": false
},
Expand All @@ -165,11 +176,11 @@
" objective='binary:logistic', reg_alpha=0, reg_lambda=1,\n",
" scale_pos_weight=1, seed=123, silent=1, subsample=1),\n",
" fit_params={}, iid=True, n_jobs=1,\n",
" param_grid={'learning_rate': [0.1, 0.5, 1.0], 'n_estimators': [5, 10, 25, 50], 'max_depth': [1, 2, 3]},\n",
" param_grid={'learning_rate': array([ 1.00000e-16, 5.00000e-01, 1.00000e+00]), 'max_depth': [2, 3, 4], 'n_estimators': [5, 10, 25, 50]},\n",
" pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)"
]
},
"execution_count": 44,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -182,58 +193,58 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Show all obtained scores"
"Now we can look at all obtained scores and try to see manually which parameters matter and which do not. A quick glance suggests that the larger `n_estimators` is, the higher the accuracy."
]
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[mean: 0.74800, std: 0.03682, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 1},\n",
" mean: 0.78200, std: 0.02960, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 1},\n",
" mean: 0.83300, std: 0.03494, params: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 1},\n",
" mean: 0.86900, std: 0.02982, params: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 1},\n",
" mean: 0.80600, std: 0.02691, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 2},\n",
" mean: 0.84300, std: 0.02052, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 2},\n",
" mean: 0.88900, std: 0.02587, params: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 2},\n",
" mean: 0.91000, std: 0.03098, params: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 2},\n",
" mean: 0.86000, std: 0.02793, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 3},\n",
" mean: 0.88000, std: 0.01265, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 3},\n",
" mean: 0.90500, std: 0.02247, params: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 3},\n",
" mean: 0.92300, std: 0.01900, params: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 3},\n",
" mean: 0.81500, std: 0.03294, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 1},\n",
" mean: 0.85700, std: 0.03900, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 1},\n",
" mean: 0.88900, std: 0.02948, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 1},\n",
" mean: 0.89400, std: 0.03169, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 1},\n",
" mean: 0.88000, std: 0.02530, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 2},\n",
" mean: 0.90600, std: 0.01855, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 2},\n",
" mean: 0.91600, std: 0.02245, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 2},\n",
" mean: 0.92400, std: 0.02107, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 2},\n",
" mean: 0.90000, std: 0.02530, params: {'learning_rate': 0.5, 'n_estimators': 5, 'max_depth': 3},\n",
" mean: 0.91300, std: 0.02492, params: {'learning_rate': 0.5, 'n_estimators': 10, 'max_depth': 3},\n",
" mean: 0.92500, std: 0.01857, params: {'learning_rate': 0.5, 'n_estimators': 25, 'max_depth': 3},\n",
" mean: 0.93100, std: 0.01513, params: {'learning_rate': 0.5, 'n_estimators': 50, 'max_depth': 3},\n",
" mean: 0.83000, std: 0.02646, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 1},\n",
" mean: 0.86400, std: 0.02800, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 1},\n",
" mean: 0.87700, std: 0.02492, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 1},\n",
" mean: 0.88100, std: 0.02625, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 1},\n",
" mean: 0.87500, std: 0.02802, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 2},\n",
" mean: 0.89000, std: 0.02236, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 2},\n",
" mean: 0.90600, std: 0.03382, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 2},\n",
" mean: 0.90600, std: 0.02245, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 2},\n",
" mean: 0.89800, std: 0.02821, params: {'learning_rate': 1.0, 'n_estimators': 5, 'max_depth': 3},\n",
" mean: 0.90200, std: 0.03187, params: {'learning_rate': 1.0, 'n_estimators': 10, 'max_depth': 3},\n",
" mean: 0.91800, std: 0.02358, params: {'learning_rate': 1.0, 'n_estimators': 25, 'max_depth': 3},\n",
" mean: 0.92400, std: 0.02154, params: {'learning_rate': 1.0, 'n_estimators': 50, 'max_depth': 3}]"
"[mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 2, 'n_estimators': 5},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 2, 'n_estimators': 10},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 2, 'n_estimators': 25},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 2, 'n_estimators': 50},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 3, 'n_estimators': 5},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 3, 'n_estimators': 10},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 3, 'n_estimators': 25},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 3, 'n_estimators': 50},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 4, 'n_estimators': 5},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 4, 'n_estimators': 10},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 4, 'n_estimators': 25},\n",
" mean: 0.50000, std: 0.00000, params: {'learning_rate': 9.9999999999999998e-17, 'max_depth': 4, 'n_estimators': 50},\n",
" mean: 0.88000, std: 0.02530, params: {'learning_rate': 0.5, 'max_depth': 2, 'n_estimators': 5},\n",
" mean: 0.90600, std: 0.01855, params: {'learning_rate': 0.5, 'max_depth': 2, 'n_estimators': 10},\n",
" mean: 0.91600, std: 0.02245, params: {'learning_rate': 0.5, 'max_depth': 2, 'n_estimators': 25},\n",
" mean: 0.92400, std: 0.02107, params: {'learning_rate': 0.5, 'max_depth': 2, 'n_estimators': 50},\n",
" mean: 0.90000, std: 0.02530, params: {'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 5},\n",
" mean: 0.91300, std: 0.02492, params: {'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 10},\n",
" mean: 0.92500, std: 0.01857, params: {'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 25},\n",
" mean: 0.93100, std: 0.01513, params: {'learning_rate': 0.5, 'max_depth': 3, 'n_estimators': 50},\n",
" mean: 0.90300, std: 0.02900, params: {'learning_rate': 0.5, 'max_depth': 4, 'n_estimators': 5},\n",
" mean: 0.91500, std: 0.02655, params: {'learning_rate': 0.5, 'max_depth': 4, 'n_estimators': 10},\n",
" mean: 0.92000, std: 0.02490, params: {'learning_rate': 0.5, 'max_depth': 4, 'n_estimators': 25},\n",
" mean: 0.91900, std: 0.02343, params: {'learning_rate': 0.5, 'max_depth': 4, 'n_estimators': 50},\n",
" mean: 0.87500, std: 0.02802, params: {'learning_rate': 1.0, 'max_depth': 2, 'n_estimators': 5},\n",
" mean: 0.89000, std: 0.02236, params: {'learning_rate': 1.0, 'max_depth': 2, 'n_estimators': 10},\n",
" mean: 0.90600, std: 0.03382, params: {'learning_rate': 1.0, 'max_depth': 2, 'n_estimators': 25},\n",
" mean: 0.90600, std: 0.02245, params: {'learning_rate': 1.0, 'max_depth': 2, 'n_estimators': 50},\n",
" mean: 0.89800, std: 0.02821, params: {'learning_rate': 1.0, 'max_depth': 3, 'n_estimators': 5},\n",
" mean: 0.90200, std: 0.03187, params: {'learning_rate': 1.0, 'max_depth': 3, 'n_estimators': 10},\n",
" mean: 0.91800, std: 0.02358, params: {'learning_rate': 1.0, 'max_depth': 3, 'n_estimators': 25},\n",
" mean: 0.92400, std: 0.02154, params: {'learning_rate': 1.0, 'max_depth': 3, 'n_estimators': 50},\n",
" mean: 0.90400, std: 0.03105, params: {'learning_rate': 1.0, 'max_depth': 4, 'n_estimators': 5},\n",
" mean: 0.90600, std: 0.03292, params: {'learning_rate': 1.0, 'max_depth': 4, 'n_estimators': 10},\n",
" mean: 0.92000, std: 0.02490, params: {'learning_rate': 1.0, 'max_depth': 4, 'n_estimators': 25},\n",
" mean: 0.92000, std: 0.02683, params: {'learning_rate': 1.0, 'max_depth': 4, 'n_estimators': 50}]"
]
},
"execution_count": 45,
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
Expand All @@ -246,12 +257,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Show best combinations"
"If there are many results, we can sort or filter them manually, or simply ask for the best combination."
]
},
{
"cell_type": "code",
"execution_count": 46,
"execution_count": 21,
"metadata": {
"collapsed": false
},
Expand All @@ -263,8 +274,8 @@
"Best accuracy obtained: 0.931\n",
"Parameters:\n",
"\tlearning_rate: 0.5\n",
"\tn_estimators: 50\n",
"\tmax_depth: 3\n"
"\tmax_depth: 3\n",
"\tn_estimators: 50\n"
]
}
],
Expand All @@ -274,6 +285,13 @@
"for key, value in bst_grid.best_params_.items():\n",
" print(\"\\t{}: {}\".format(key, value))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking for the best parameters is an iterative process. You should start with coarse granularity and then move to more fine-grained values."
]
}
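One hypothetical way to iterate: take the winner of a coarse pass over orders of magnitude and search a denser grid around it (all values here are illustrative):

```python
import numpy as np

# coarse pass over orders of magnitude
coarse_rates = [0.01, 0.1, 1.0]
best_coarse = 0.1  # suppose the coarse grid search picked this value

# fine pass: a denser grid centred on the coarse winner
fine_rates = np.linspace(best_coarse / 2, best_coarse * 2, 5)
print(fine_rates)  # [0.05, 0.0875, 0.125, 0.1625, 0.2]
```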
],
"metadata": {
Expand Down
