This lab is about data preparation. Using IBM Data Refinery for shaping operations clean, organize, fix, and validate data.
In the life-cycle of Data Science, data preparation is one of the most important stages. Data scientists spend 80% of their time cleansing, shaping and formatting data before doing any analysis. IBM Data Refinery, an intuitive cloud-based data preparation service, where you can quickly source, shape and share your data sets. This lab is a short introductory for data wrangling; it will introduce you to IBM Data Refinery's capabilities and how can you utilize it to prepare your data.
The use-case of this lab is Titanic data set. It has 12 columns of type integer, double and string. Some columns need shaping or cleaning operarions to make use of the data to the full extent. Mostly we will fill missing values with different approaches.
- Acquire the skills of data engineering/cleaning
- How to fomat and shape your data according to your project requirement.
If you have poor data you will have poor results.
The lab takes approximately 10 minutes long.
-
IBM Watson Studio: Analyze data using RStudio, Jupyter and Machine Learning Flow in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
-
IBM Cloud Object Storage: An IBM Cloud service that provides an unstructured cloud data store to build and deliver cost effective apps and services with high reliability and fast speed to market. This code pattern uses Cloud Object Storage.
-
IBM Data Refinery: A part of the Watson Data Platform integrated environment and is a self-service data preparation client for data scientists, data engineers, and business analysts. It can transform large amounts of raw data into consumable, quality information that’s ready for analytics. IBM Data Refinery makes it easy to explore, prepare, and deliver data.
- IBM Cloud account: An account must exist to use the platform.
- Watson Studio service instance: A service instance must exist to be able to use IBM Data Refinery.
If you do not have an IBM Cloud account, create an account here
- A Lite account, which is a free of charge. Make sure to set the region to US South.
If you don't have a watson Studio instance, do the following:
- Select Catalog found at the top right of the page.
- Click on Watson from the menu on the left, which you can find under Platform services.
- Select Browse Services under Watson Services.
- Choose watson Studio instance
- Once the main page of the service appears, click on Get Started. This will redirect your browser to the Watson Studio platform. It might ask to confirm IBM Cloud organization and space information.
From the Get Started page, select Creat a Project
Then Choose a Standard plan
- Make sure a cloud storage instance exists, or add a new IBM Cloud Object Storage instance by clicking on Add under Select storage service.
The operations will be done using Titanic dataset which can be downloaded from , save the csv file to apply the following steps.
Under the Asset tab in the project, choose this icon on the right to upload the dataset to the platform.
- Click browse to navigate your folders where the dataset set can be found, select file train.csv.
- After it's uploaded, it will be listed in the Data assets.
To start the process, press the Action Menu (triple dot) in the right side of the train.csv bar to open Refine.
All the columns initially are of type string, for better shaping, convert those integer values columns from string to integer. From the Action menu that appears in the right side of each column, select Convert Column type and choose the type. In Titanic's use case, the columns that are converted to integer are Survived, PClass, Sibsp and Parch. The columns are converted to decimanl are Age and Fare.
The columns that have missing values in Titanic dataset are Age, Cabin and Embarked. The methods to fulfill the missing values are different for each attribute depending on the purpose of the attribute.
So, to fill the missing values in Embarked attribute, we only fill it with 'S' knwong that the passengers actually embarked at Southampton.
For Cabin attribute, since tracing the actual cabin for each passenger is impossible. Handling this attribute by creating additional column that has 1 for a passenger who's cabin exists and 0 if it does not exist. Relating to the accident, known passenger's cabin indicate that they survived. To do that follow the below steps:
- Press the Actions menu in the Cabin column.
- Select Conditional Replace under the Organize Category
- Add two conditions in Cabin's column, if value empty replace it with 0, if not empty replace with 1.
- Choose Cabin column, in the option of Replace put 1
For age attribute, calculate the mean of the column values and placing it in the null values. To replace missing values by the mean of the column, do the following:
- From the Actions menu choose Filter condition.
- Select column Age and Is not empty under operator for the Filteration condition.
- From the Operation Bar, select Summarize operator.
- Fill in the operation command the required variables like this Summarize(newVarName=operator(column))
summarize(newAge= mean(``Age``))
- Copy the geneated value to use after, and undo last two action from the backward arrow above. Since the filteration and the new summerized value is now useless.
- Select Age column again, from the Action menu choose Replace missing values
Insert the mean value to be replaced with and press Apply button.
Titanic data set does not have sensitive information that should be unique except for the passenger ID. Simply select the Action menu in Passenger Id column and choose Remove duplicates.
We have learned the first stage of the data science, where the outcome of this stage decide the success of the futher stages.IBM Data Refinery helps developer get rhough data cleaning in a fast approach with no coding requirements.