Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 10 changed files with 3,271 additions and 186 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"cells":[{"cell_type":"markdown","metadata":{"id":"3zY0IZEDB78W"},"source":["# Feature Engineering: Beginners Guide Part 1\n","---\n","#### Techniques to process Numerical and Categorical Data in Python\n","\n","## Introduction \n","This notebook is a supplement to the [Feature engineering in python: The Basics.(Free Guide)](https://www.theblublog.com/feature-engineering-in-python-a-free-guide). The notebook aims to provide starter code and examples of engineering numerical and categorical features.\n","\n","To learn more about this or other data science topics, visit [The Blu Blog](https://www.theblublog.com). Learn data science with 100% Free Guides and Interactive Notebooks.\n"]},{"cell_type":"markdown","metadata":{"id":"pciB5ScjjwKy"},"source":["## Engineering features for Numerical data"]},{"cell_type":"markdown","metadata":{"id":"MZcM8o2Wj52S"},"source":["### Rescaling Numeric features\n","Rescaling is a common preprocessing task in machine learning. There are several rescaling techniques, but one of the simplest is called min-max scaling. Min-max scaling uses the minimum and maximum values of a feature to rescale its values to within a given range. \n","The scikit-learn 'MinMaxScaler' offers two ways to rescale a feature. One option is to use 'fit()' to calculate the minimum and maximum values of the feature, then use 'transform()' to rescale the feature. The second option is to use 'fit_transform()' to do both operations at once. When both steps are applied to the same data there is no mathematical difference between the two options, but it is sometimes useful to perform the steps separately, for example to fit on one set of data and then transform different data. 
An example with code follows."]},{"cell_type":"markdown","metadata":{"id":"1KgHauN7xfHB"},"source":["Let's start by importing the necessary libraries"]},{"cell_type":"code","execution_count":null,"metadata":{"executionInfo":{"elapsed":3,"status":"ok","timestamp":1657870390649,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"cis-lsFvB7Cc"},"outputs":[],"source":["#Load Libraries\n","import numpy as np\n","from sklearn import preprocessing\n","import random\n","import matplotlib.pyplot as plt"]},{"cell_type":"markdown","metadata":{"id":"FSfDrZmUxLJB"},"source":["Let's create a small dataset called sales (the commented-out lines sketch generating random values with the random library instead)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":588,"status":"ok","timestamp":1657214969089,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"f8L3GnwTxc5l","outputId":"c2f8f5a5-4485-4170-ab3f-00e4a36fb670"},"outputs":[],"source":["#Creating a dataset\n","sales = np.array([[-200],[-10],[50],[1000],[15],[20],[30],[50],[100],[200],[10000],[-12000],[150000],[160000]])\n","\n","#for x in range(50):\n","# sales.append(random.randint(-100000000,1000000000))\n","print(sales)"]},{"cell_type":"markdown","metadata":{"id":"0_ykCvhDz2vb"},"source":["Now let's create a scaler and scale the sales"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":621},"executionInfo":{"elapsed":370,"status":"error","timestamp":1657204749681,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"V3bVGzSOxxm0","outputId":"270a501b-cea3-4262-fa2c-add1100655b2"},"outputs":[],"source":["# Create a scaler\n","minmax_scale = preprocessing.MinMaxScaler(feature_range=(0,1))\n","\n","#Scale feature\n","scaled_sales = minmax_scale.fit_transform(sales)\n","\n","#Show 
feature\n","scaled_sales"]},{"cell_type":"markdown","metadata":{"id":"YHcQTIGsjb5T"},"source":["### Standardizing features\n","A common alternative to min-max scaling is to rescale a feature so that it is roughly standard normally distributed. To accomplish this, we standardize the data so that it has a mean of 0 and a standard deviation of 1.\n","The transformed feature shows how many standard deviations the original value lies from the mean value of the feature (also called a z-score in statistics). In my experience, standardization is frequently preferred over min-max scaling for machine learning preprocessing. The choice does, however, depend on the learning algorithm. For instance, standardization frequently improves the performance of principal component analysis, and min-max scaling is typically advised for neural networks.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Vh02lde8jb5U","outputId":"f732dfe0-963f-4da6-e0bc-11160f07fdff"},"outputs":[],"source":["#Create a scaler\n","std_scaler = preprocessing.StandardScaler()\n","std_sales = std_scaler.fit_transform(sales)\n","\n","# Show feature standardized\n","std_sales"]},{"cell_type":"markdown","metadata":{"id":"eP5HACxGjb5U"},"source":["### Normalizing\n","Normalization is another method for feature scaling. We use normalization most often when the data is not skewed along either axis or when it does not follow a Gaussian distribution. By converting features with different scales to a single scale, normalization further simplifies the processing of the data for modeling. 
As a result, each data feature (variable) tends to have a similar impact on the final model. Note that the scikit-learn 'Normalizer' used below rescales each row (observation) to unit norm rather than rescaling each column."]},{"cell_type":"markdown","metadata":{"id":"2m9_dWohkLxI"},"source":["Let's import the Normalizer class from scikit-learn"]},{"cell_type":"code","execution_count":null,"metadata":{"executionInfo":{"elapsed":343,"status":"ok","timestamp":1657922793491,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"bnIeHjWMjb5V"},"outputs":[],"source":["from sklearn.preprocessing import Normalizer"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":4,"status":"ok","timestamp":1657922794130,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"T_xGe-eKjb5V","outputId":"7dce5aa1-bc45-4b7d-ccff-4f252d02ea2e"},"outputs":[],"source":["# Create feature matrix\n","x = np.array([[2.5, 1.5],[2.1, 3.4], [1.5, 10.2], [4.63, 34.4], [10.9, 3.3], [17.5,0.8], [15.4, 0.7]])\n","\n","# Create normalizer\n","normalizer = Normalizer(norm=\"l2\")\n","\n","# Transform the feature matrix (each row is rescaled to unit L2 norm)\n","normalizer.transform(x)"]},{"cell_type":"markdown","metadata":{"id":"XHcE5vI4Lw11"},"source":["## Engineering Features for Categorical data"]},{"cell_type":"markdown","metadata":{"id":"f1ABlG8V2HKq"},"source":["### Encoding for Ordinal\n","Encoding is the process of converting ordinal data into a numeric format so that a machine learning algorithm can make sense of it. To transform ordinal data into numeric data, we usually map each class to a number. For example, cold, average, and hot are mapped to 1, 2, and 3 respectively. Let’s see how we can do this easily. 
"]},{"cell_type":"markdown","metadata":{"id":"Vf-2iddBnkF7"},"source":["Let's start by importing pandas and creating a data set."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":376,"status":"ok","timestamp":1657918009144,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"gKOme2uaL8Hn","outputId":"57496a22-1e90-44a0-9c18-74561142ba55"},"outputs":[],"source":["#Importing libraries\n","import pandas as pd\n","\n","#Creating the data\n","data = pd.DataFrame({\"Temprature\":[\"Very Cold\", \"Cold\", \"Warm\",\"Hot\", \"Very Hot\"]})\n","\n","print(data)\n"]},{"cell_type":"markdown","metadata":{"id":"rFCom1sEqPEZ"},"source":["Now let's map the data to numerical values."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":0},"executionInfo":{"elapsed":7,"status":"ok","timestamp":1657918047948,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"AEhSG-Irm57w","outputId":"b4db66e9-a79e-48b0-9467-45582155cad8"},"outputs":[],"source":["#Mapping to numerical data\n","scale_map = {\"Very Cold\": -3,\n"," \"Cold\": -1,\n"," \"Warm\": 0,\n"," \"Hot\" : 1,\n"," \"Very Hot\" : 3}\n","\n","#Replacing with mapped values\n","data_mapped = data[\"Temprature\"].replace(scale_map)\n","data[\"encoded_temp\"] = data_mapped\n","data"]},{"cell_type":"markdown","metadata":{"id":"loulhiJqg1e5"},"source":[]},{"cell_type":"markdown","metadata":{"id":"t7KfqUZqqHHE"},"source":["### Nominal Data\n","In one-hot encoding, we convert each class of nominal data into its own feature and assign a binary value of 1 or 0 to indicate whether the observation belongs to that class. 
Let’s see how this can be done using the LabelBinarizer in scikit-learn.\n"]},{"cell_type":"markdown","metadata":{"id":"D0ar7f2M2N-F"},"source":["Importing libraries and creating a data frame."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":347,"status":"ok","timestamp":1657922705480,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"Uouh2nPqnfaf","outputId":"836c387e-93dd-4b0a-a660-427b55e63f71"},"outputs":[],"source":["#Import libraries\n","import numpy as np\n","import pandas as pd\n","from sklearn.preprocessing import LabelBinarizer\n","# Create the dataset\n","color_data = {\"itemid\": [\"A1\",\"B1\",\"C2\", \"D4\",\"E9\"],\n"," \"color\" : [\"red\",\"blue\",\"green\",\"yellow\",\"pink\"]}\n","\n","color_data = pd.DataFrame(color_data)\n","color_data\n"]},{"cell_type":"markdown","metadata":{"id":"vJUNxkr72XCp"},"source":["One hot encoding with Label Binarizer"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":358,"status":"ok","timestamp":1657922567103,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"Zl90Ow4FsO-Q","outputId":"81643f75-4f95-410e-c8f2-72c5d61649c3"},"outputs":[],"source":["# Creating one-hot encoder\n","one_hot = LabelBinarizer() \n","\n","# One-hot encode the data and assign to a var\n","color_encoding = one_hot.fit_transform(color_data.color)\n","\n","# feature classes\n","color_new = one_hot.classes_\n","\n","#creating new Data Frame with encoded values \n","encoded = pd.DataFrame(color_encoding)\n","encoded.columns = color_new\n","\n","#Deleting color column and merging with encoded values\n","color_data_new = color_data.drop(\"color\",axis = 1)\n","color_data_new = pd.concat([color_data_new,encoded],axis = 1)\n","\n","#Viewing new 
data\n","print(color_data_new)"]},{"cell_type":"markdown","metadata":{"id":"AIgDL4G52i7N"},"source":["One hot encoding with Pandas \n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":552,"status":"ok","timestamp":1657922574298,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"6NaWrhzwshfY","outputId":"76986786-e525-451d-b63a-5d7cde940d36"},"outputs":[],"source":["#Creating encoded df\n","encoded_pd = pd.get_dummies(color_data.color)\n","\n","#Deleting color column and merging with encoded values\n","color_data_pd = color_data.drop(\"color\", axis = 1)\n","color_data_pd = pd.concat([color_data_pd,encoded_pd],axis = 1)\n","\n","#Viewing new data\n","print(color_data_pd)"]},{"cell_type":"markdown","metadata":{"id":"H-gevZWL5T0h"},"source":["It’s good practice to drop one of the encoded features after one-hot encoding, because the remaining columns already determine it; this avoids perfect linear dependence among the features.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":353},"executionInfo":{"elapsed":349,"status":"error","timestamp":1657922745658,"user":{"displayName":"Govind Bhat","userId":"11500974554390449652"},"user_tz":-330},"id":"pA8NSV6n4QcH","outputId":"9f6ecbbc-3e01-4d52-98eb-c8106d6a4d47"},"outputs":[],"source":["#Dropping final column\n","color_data_pd.drop(\"yellow\",axis =1, inplace = True)\n","color_data_pd"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"FSbmN-_U5Hyq"},"outputs":[],"source":[]}],"metadata":{"colab":{"collapsed_sections":["0_ykCvhDz2vb","eP5HACxGjb5U","2m9_dWohkLxI","f1ABlG8V2HKq"],"name":"feature_engineering_part1-guide.ipynb","provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3.9.12 
('base')","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.12"},"vscode":{"interpreter":{"hash":"73b821250e16fc16900c2a3e52b4ecb495c96a009b33fe27669e30a795cf2d76"}}},"nbformat":4,"nbformat_minor":0} |
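
The rescaling section of the notebook notes that `fit` and `transform` can be called separately on different data. A minimal sketch of that pattern follows. It is not part of the commit: the split of the sales values into `train_sales` and `new_sales`, and the value 2000, are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split (an assumption for illustration): values used to fit
# the scaler, and later values that must be scaled the same way.
train_sales = np.array([[-200], [-10], [50], [1000], [15], [20], [30]], dtype=float)
new_sales = np.array([[50], [100], [200], [2000]], dtype=float)

# fit() only learns the minimum and maximum of the data it is given
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train_sales)

# transform() reuses the learned minimum and maximum on any data
scaled_train = scaler.transform(train_sales)
scaled_new = scaler.transform(new_sales)

print(scaler.data_min_, scaler.data_max_)
print(scaled_new.ravel())
```

Because the minimum and maximum come from `train_sales` only, values in `new_sales` that lie outside that range (here 2000) are mapped outside the interval from 0 to 1. That is expected behavior when the scaler is fitted on one dataset and applied to another.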