Completed part 1
git-GB committed Jul 16, 2022
1 parent 7962464 commit a77b130
Showing 10 changed files with 3,271 additions and 186 deletions.
Binary file added .DS_Store
129 changes: 0 additions & 129 deletions .gitignore

This file was deleted.

21 changes: 0 additions & 21 deletions LICENSE

This file was deleted.

Binary file added datasets/.DS_Store
3,001 changes: 3,001 additions & 0 deletions datasets/data.csv


1 change: 1 addition & 0 deletions feature-engineering-guide-part1.ipynb
@@ -0,0 +1 @@
{"cells":[{"cell_type":"markdown","metadata":{"id":"3zY0IZEDB78W"},"source":["# Feature Engineering: Beginners Guide Part 1\n","---\n","#### Techniques to process Numerical and Categorical Data in Python\n","\n","## Introduction \n","This Notebook is Supplimant to the [Feature engineering in python: The Basics.(Free Guide)](https://www.theblublog.com/feature-engineering-in-python-a-free-guide). The Notebooks aims to provide starter code and examples of Engineering Numerical and Categorical features.\n","\n","To learn More on this or other data science topics visit [The Blu Blog](https://www.theblublog.com). Learn data science with 100% Free Guides and Interactive Notebooks.\n"]},{"cell_type":"markdown","metadata":{"id":"pciB5ScjjwKy"},"source":["## Engineering features for Numerical data"]},{"cell_type":"markdown","metadata":{"id":"MZcM8o2Wj52S"},"source":["### Rescaling Numeric features\n","Rescaling is a common preprocessing task in machine learning. There are several rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale values to within a range. \n","The Scikit-learn 'MinMaxScaler' offers two options to rescale a feature. One option is to use fit to calculate the minimum and maximum values of the feature, then use transform to rescale the feature. The second option is to use 'fit_transform()' to do both operations at once. There is no mathematical difference between the two options, but it may sometimes be useful to perform the functions separately on different data. Following is an example with code."]},{"cell_type":"markdown","metadata":{"id":"1KgHauN7xfHB"},"source":["Let's start by importing necessary libraries"]},{"cell_type":"code","execution_count":null,"metadata":{"executionInfo":{"elapsed":3,"status":"ok","timestamp":1657870390649,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"cis-lsFvB7Cc"},"outputs":[],"source":["#Load Libraries\n","import numpy as np\n","from sklearn import preprocessing\n","import random\n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"id":"FSfDrZmUxLJB"},"source":["Let's create a randomized dataset called income with the help of random Library"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":588,"status":"ok","timestamp":1657214969089,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"f8L3GnwTxc5l","outputId":"c2f8f5a5-4485-4170-ab3f-00e4a36fb670"},"outputs":[],"source":["#Creating a dataset\n","sales = np.array([[-200],[-10],[50],[1000],[15],[20],[30],[50],[100],[200],[10000],[-12000],[150000],[160000]])\n","\n","#for x in range(50):\n","# sales.append(random.randint(-100000000,1000000000))\n","print(sales)"]},{"cell_type":"markdown","metadata":{"id":"0_ykCvhDz2vb"},"source":["Now let's create a Scaler and scale the sales"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":621},"executionInfo":{"elapsed":370,"status":"error","timestamp":1657204749681,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"V3bVGzSOxxm0","outputId":"270a501b-cea3-4262-fa2c-add1100655b2"},"outputs":[],"source":["# Create a scaler\n","minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))\n","\n","#Scale feature\n","scaled_sales = minmax_scale.fit_transform(sales)\n","\n","#Show 
feature\n","scaled_sales"]},{"cell_type":"markdown","metadata":{"id":"YHcQTIGsjb5T"},"source":["### Standardizing features\n","The scaling of features to be roughly standard and normally distributed is a common substitute for the min-max scaling. To accomplish this, we standardize the data so that it has a mean, of 0, and a standard deviation of 1.\n","The transformed feature shows how far the original value deviates from the mean value of the feature by standard deviations (also called a z-score in statistics). Standardization is frequently chosen over min-max scaling as the preferred scaling technique for machine learning preprocessing, in my experience. It is, however, subject to the learning algorithm. For instance, standardization frequently improves the performance of principal component analysis, and min-max scaling is typically advised for neural networks.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Vh02lde8jb5U","outputId":"f732dfe0-963f-4da6-e0bc-11160f07fdff"},"outputs":[],"source":["#Create a scaler\n","std_scaler = preprocessing.StandardScaler()\n","std_sales = std_scaler.fit_transform(sales)\n","\n","# Show feature standardized\n","std_sales"]},{"cell_type":"markdown","metadata":{"id":"eP5HACxGjb5U"},"source":["### Normalizing\n","One method for feature scaling is normalization. We use normalization most often when the data is not skewed along either axis or when it does not follow the Gaussian distribution. By converting data features with different scales to a single scale during normalization, we further simplify the processing of the data for modeling. As a result, each data feature (variable) tends to have a similar impact on the final model."]},{"cell_type":"markdown","metadata":{"id":"2m9_dWohkLxI"},"source":["Let's import the Normalizer Library from scikit learn"]},{"cell_type":"code","execution_count":null,"metadata":{"executionInfo":{"elapsed":343,"status":"ok","timestamp":1657922793491,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"bnIeHjWMjb5V"},"outputs":[],"source":["from sklearn.preprocessing import Normalizer"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":4,"status":"ok","timestamp":1657922794130,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"T_xGe-eKjb5V","outputId":"7dce5aa1-bc45-4b7d-ccff-4f252d02ea2e"},"outputs":[],"source":["# Create feature matrix\n","x = np.array([[2.5, 1.5],[2.1, 3.4], [1.5, 10.2], [4.63, 34.4], [10.9, 3.3], [17.5,0.8], [15.4, 0.7]])\n","\n","# Create normalizer\n","normalizer = Normalizer(norm=\"l2\")\n","\n","# Transform feature matrix normalizer.transform(features)\n","normalizer.transform(x)"]},{"cell_type":"markdown","metadata":{"id":"XHcE5vI4Lw11"},"source":["## Engineering Features for Catogorical data"]},{"cell_type":"markdown","metadata":{"id":"f1ABlG8V2HKq"},"source":["### Encoding for Ordinal\n","Encoding is the process of converting ordinal data into a numeric format so that the Machine learning algorithm can make sense of it. For transforming ordinal data into numeric data, we usually convert each class into a number. For example cold, average, is mapped to 1, 2, and 3 respectively. Let’s see how we can do this easily. 
"]},{"cell_type":"markdown","metadata":{"id":"Vf-2iddBnkF7"},"source":["Let's start by importing pandas and creating a data set."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":376,"status":"ok","timestamp":1657918009144,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"gKOme2uaL8Hn","outputId":"57496a22-1e90-44a0-9c18-74561142ba55"},"outputs":[],"source":["#Importing libraries\n","import pandas as pd\n","\n","#Creating the data\n","data = pd.DataFrame({\"Temprature\":[\"Very Cold\", \"Cold\", \"Warm\",\"Hot\", \"Very Hot\"]})\n","\n","print(data)\n"]},{"cell_type":"markdown","metadata":{"id":"rFCom1sEqPEZ"},"source":["Now Let's map the data to numerical values."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":0},"executionInfo":{"elapsed":7,"status":"ok","timestamp":1657918047948,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"AEhSG-Irm57w","outputId":"b4db66e9-a79e-48b0-9467-45582155cad8"},"outputs":[],"source":["#Mapping to numerical data\n","scale_map = {\"Very Cold\": -3,\n"," \"Cold\": -1,\n"," \"Warm\": 0,\n"," \"Hot\" : 1,\n"," \"Very Hot\" : 3}\n","\n","#Replacing with mapped values\n","data_mapped = data[\"Temprature\"].replace(scale_map)\n","data[\"encoded_temp\"] = data_mapped\n","data"]},{"cell_type":"markdown","metadata":{"id":"loulhiJqg1e5"},"source":[]},{"cell_type":"markdown","metadata":{"id":"t7KfqUZqqHHE"},"source":["### Nominal Data\n","In one hot encoding, we convert each class of nominal data into its own feature and we assign a binary value of 1 or 0 to tell whether the feature is true or false. Let’s see how this can be done using the MultiLibraryBinarizer in scikit learn.\n"]},{"cell_type":"markdown","metadata":{"id":"D0ar7f2M2N-F"},"source":["importing data and creating a data frame."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":347,"status":"ok","timestamp":1657922705480,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"Uouh2nPqnfaf","outputId":"836c387e-93dd-4b0a-a660-427b55e63f71"},"outputs":[],"source":["#Import libraries\n","import numpy as np\n","import pandas as pd\n","from sklearn.preprocessing import LabelBinarizer\n","# Create the dataset\n","color_data = {\"itemid\": [\"A1\",\"B1\",\"C2\", \"D4\",\"E9\"],\n"," \"color\" : [\"red\",\"blue\",\"green\",\"yellow\",\"pink\"]}\n","\n","color_data = pd.DataFrame(color_data)\n","color_data\n"]},{"cell_type":"markdown","metadata":{"id":"vJUNxkr72XCp"},"source":["One hot encoding with Label Binarizer"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":358,"status":"ok","timestamp":1657922567103,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"Zl90Ow4FsO-Q","outputId":"81643f75-4f95-410e-c8f2-72c5d61649c3"},"outputs":[],"source":["# Creating one-hot encoder\n","one_hot = LabelBinarizer() \n","\n","# One-hot encode the data and assign to a var\n","color_encoding = one_hot.fit_transform(color_data.color)\n","\n","# feature classes\n","color_new = one_hot.classes_\n","\n","#creating new Data Frame with encoded values \n","encoded = pd.DataFrame(color_encoding)\n","encoded.columns = color_new\n","\n","#Deleting color column and 
### Encoding Nominal Data
In one-hot encoding, we convert each class of nominal data into its own feature and assign a binary value of 1 or 0 to indicate whether that class is present. Let's see how this can be done using `LabelBinarizer` in scikit-learn.

Import the libraries and create a data frame:

```python
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Create the dataset
color_data = {"itemid": ["A1", "B1", "C2", "D4", "E9"],
              "color": ["red", "blue", "green", "yellow", "pink"]}

color_data = pd.DataFrame(color_data)
color_data
```

One-hot encoding with `LabelBinarizer`:

```python
# Create the one-hot encoder
one_hot = LabelBinarizer()

# One-hot encode the data
color_encoding = one_hot.fit_transform(color_data.color)

# The classes found by the encoder become the new column names
encoded = pd.DataFrame(color_encoding, columns=one_hot.classes_)

# Drop the color column and merge in the encoded values
color_data_new = color_data.drop("color", axis=1)
color_data_new = pd.concat([color_data_new, encoded], axis=1)

# View the new data
print(color_data_new)
```

One-hot encoding with pandas:

```python
# Create the encoded DataFrame
encoded_pd = pd.get_dummies(color_data.color)

# Drop the color column and merge in the encoded values
color_data_pd = color_data.drop("color", axis=1)
color_data_pd = pd.concat([color_data_pd, encoded_pd], axis=1)

# View the new data
print(color_data_pd)
```

It's good practice to drop one of the features after one-hot encoding to reduce linear dependence among the new columns:

```python
# Drop one of the encoded columns
color_data_pd.drop("yellow", axis=1, inplace=True)
color_data_pd
```
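Pandas can also drop a column for you at encoding time. A minimal sketch, reusing the `color_data` frame from above; note that `drop_first=True` drops the first category alphabetically rather than one of your choosing, and scikit-learn's `OneHotEncoder(drop="first")` behaves similarly:

```python
# get_dummies can drop the first category directly
encoded_df = pd.get_dummies(color_data.color, drop_first=True)
print(pd.concat([color_data.drop("color", axis=1), encoded_df], axis=1))
```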
