This Jupyter notebook introduces some basic principles of data exploration and visualization in Python. You will learn different ways to explore data visually, using Pandas plotting, Matplotlib, and Seaborn to create the visualizations.
To run this notebook you need the packages listed below. If they are not already available in your Python environment, install them first. From a command prompt on your computer, run the following commands. If no errors occur, the packages are installed.
pip install seaborn
pip install pandas
pip install matplotlib
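If you want to confirm that the installation worked before opening the notebook, a minimal sketch like the following can help. The helper name `check_packages` is made up for this illustration; it only checks whether each package can be found by Python and does not import or modify anything.

```python
from importlib.util import find_spec

# Packages this notebook relies on (names as used in `import` statements).
REQUIRED = ["pandas", "matplotlib", "seaborn"]


def check_packages(names):
    """Return a dict mapping each package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, installed in status.items():
    message = "installed" if installed else f"missing, run: pip install {name}"
    print(f"{name}: {message}")
```

Any package reported as missing can be installed with the corresponding `pip install` command above.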
Fork the repository to run the Jupyter notebook on your own computer.
A bit of experience with Python, Pandas and Jupyter Notebook is sufficient. If you are a beginner then you can follow along with:
The dataset used for exploration is the Pokemon dataset.
This dataset contains information on all 802 Pokemon from all seven generations of Pokemon. The information contained in this dataset includes:
Features | Description |
---|---|
name | The English name of the Pokemon |
japanese_name | The Original Japanese name of the Pokemon |
pokedex_number | The entry number of the Pokemon in the National Pokedex |
percentage_male | The percentage of the species that are male. Blank if the Pokemon is genderless |
type1 | The Primary Type of the Pokemon |
type2 | The Secondary Type of the Pokemon |
classification | The Classification of the Pokemon as described by the Sun and Moon Pokedex |
height_m | Height of the Pokemon in metres |
weight_kg | The Weight of the Pokemon in kilograms |
capture_rate | Capture Rate of the Pokemon |
base_egg_steps | The number of steps required to hatch an egg of the Pokemon |
abilities | A stringified list of abilities that the Pokemon is capable of having |
experience_growth | The Experience Growth of the Pokemon |
base_happiness | Base Happiness of the Pokemon |
against_? | Eighteen features that denote the amount of damage taken against an attack of a particular type |
hp | The Base HP of the Pokemon |
attack | The Base Attack of the Pokemon |
defense | The Base Defense of the Pokemon |
sp_attack | The Base Special Attack of the Pokemon |
sp_defense | The Base Special Defense of the Pokemon |
speed | The Base Speed of the Pokemon |
generation | The number of the generation in which the Pokemon was first introduced |
is_legendary | Denotes if the Pokemon is legendary |
You can download the dataset from Kaggle
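As a first look at the columns described above, the following sketch shows three common inspection steps: counting missing values, parsing the stringified `abilities` column, and collecting the `against_?` columns. The rows in `toy` are invented placeholders, not values from the real dataset, and the file name `pokemon.csv` in the comment is an assumption about how you saved the download.

```python
import ast

import pandas as pd

# Hypothetical stand-in rows: names and numbers are invented for illustration
# and are NOT taken from the real dataset. Column names follow the table above.
toy = pd.DataFrame({
    "name": ["mon_a", "mon_b", "mon_c"],
    "type1": ["grass", "fire", "water"],
    "type2": ["poison", None, None],          # no secondary type
    "percentage_male": [88.1, 50.0, None],    # blank means genderless
    "abilities": ["['ability_x', 'ability_y']", "['ability_z']", "['ability_x']"],
    "against_fire": [2.0, 0.5, 0.5],
    "against_water": [0.5, 2.0, 0.5],
})

# With the real file you would instead load it, for example:
# pokemon = pd.read_csv("pokemon.csv")  # assumed file name
pokemon = toy.copy()

# Count missing values per column (blank percentage_male, missing type2).
missing = pokemon.isna().sum()

# Turn the stringified lists into real Python lists.
pokemon["abilities"] = pokemon["abilities"].apply(ast.literal_eval)

# Collect the damage-multiplier columns by their shared prefix.
against_cols = [col for col in pokemon.columns if col.startswith("against_")]

print(missing)
print(pokemon["abilities"].tolist())
print(against_cols)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is the safer choice for text read from a file.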
“Visualization gives you answers to questions you didn’t know you had.” – Ben Shneiderman
Visualization is an essential method in any data scientist's toolbox and a key first step in the exploration of most datasets. This process of exploring data visually and with simple summary statistics is known as Exploratory Data Analysis (EDA). As a general rule, you should never start creating models until you understand the relationships in your data. Visualization is also a powerful tool for presenting results and for finding the sources of problems in an analysis.
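The simple summary statistics mentioned above can be computed directly with Pandas before any plot is drawn. The sketch below uses a tiny invented table (the numbers are placeholders, not real Pokemon stats) to show per-column summaries, a grouped mean, and a correlation.

```python
import pandas as pd

# Invented values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "type1": ["fire", "fire", "water", "water"],
    "attack": [40, 60, 80, 100],
    "defense": [50, 70, 90, 110],
})

# Count, mean, standard deviation, and quartiles for each numeric column.
summary = toy.describe()

# Mean attack within each primary type.
mean_attack_by_type = toy.groupby("type1")["attack"].mean()

# Pairwise linear correlation between the numeric columns.
correlation = toy[["attack", "defense"]].corr()

print(summary)
print(mean_attack_by_type)
print(correlation)
```

Numbers like these give a first impression, but they can hide structure that a plot reveals, which is why the two are used together in EDA.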
The concept of exploring a dataset visually was pioneered by John Tukey in the 1960s and 1970s.
The key goal of exploratory data analysis (EDA), or visual exploration of data, is to understand the relationships in the dataset. Specifically, when you approach a new dataset, you can use visualization to:
- Explore complex datasets, using visualization to develop understanding of the inherent relationships.
- Use different chart types to create multiple views of data to highlight different aspects of the inherent relationships.
- Use plot aesthetics to project multiple dimensions.
- Apply conditioning methods to project multiple dimensions.
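The last two points can be sketched with plain Matplotlib. In the example below, plot aesthetics (marker color and size) carry two extra variables, and conditioning splits the data into one panel per group with shared axes. The data values are invented for illustration, and treating `is_legendary` as a 0/1 flag is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "attack":       [45, 60, 80, 100, 55, 70, 120, 90],
    "defense":      [50, 65, 75, 95, 40, 85, 110, 60],
    "hp":           [40, 60, 80, 100, 50, 70, 105, 65],
    "is_legendary": [0, 0, 0, 1, 0, 0, 1, 0],   # assumed 0/1 encoding
    "generation":   [1, 1, 1, 1, 2, 2, 2, 2],
})

generations = sorted(toy["generation"].unique())
colors = {0: "tab:blue", 1: "tab:red"}

# Conditioning: one panel per generation, with shared axes so panels are comparable.
fig, axes = plt.subplots(1, len(generations), sharex=True, sharey=True, figsize=(8, 3.5))
for ax, gen in zip(axes, generations):
    subset = toy[toy["generation"] == gen]
    # Aesthetics: color encodes is_legendary, marker size encodes hp.
    ax.scatter(
        subset["attack"].to_numpy(),
        subset["defense"].to_numpy(),
        c=subset["is_legendary"].map(colors).tolist(),
        s=(subset["hp"] * 2).to_numpy(),
        alpha=0.7,
    )
    ax.set_title(f"generation = {gen}")
    ax.set_xlabel("attack")
axes[0].set_ylabel("defense")
fig.tight_layout()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

Each panel shows four variables at once: two on the axes and two through aesthetics, with a fifth (generation) expressed by the panel layout.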
In these exercises, you will use Pandas plotting, Matplotlib, and Seaborn. We assume you have at least a bit of experience using Pandas and Jupyter notebooks.
There are many chart types that are used for data exploration. Some of them are explained below.
- Scatter plot : Scatter plots show the relationship between two variables as dots on the plot. In simple terms, the values of one variable along the horizontal axis are plotted against the values of another along the vertical axis.
- Line plot : Line plots are similar to scatter plots, except that the discrete points are connected by lines.
- Bar plot : Bar plots are used to display the counts of unique values of a categorical variable. The height of the bar represents the count for each unique category of the variable.
- Histogram : Histograms are related to bar plots but are used for numeric variables. Whereas a bar plot shows the counts of unique categories, a histogram shows the number of data values that fall within each bin. The bins divide the range of the variable into equal segments. The vertical axis of the histogram shows the count of data values within each bin.
- Box plot : Box plots, also known as box and whisker plots, were introduced by John Tukey in 1970. Box plots are another way to visualize the distribution of data values. In this respect, box plots are comparable to histograms, but are quite different in presentation. On a box plot the median value is shown with a dark bar. The middle two quartiles of data values are contained within the 'box'. The 'whiskers' enclose the majority of the data: by default in Matplotlib and Seaborn, they extend to the most extreme values within 1.5 times the interquartile range of the box edges. Outliers are shown by symbols beyond the whiskers. Several box plots can be placed side by side along an axis for comparison. The data are divided using a 'group by' operation, and the box plots for each group are drawn next to each other. In this way, the box plot allows you to display two dimensions of your dataset.
- Kernel Density Estimation (KDE) plot : Kernel density plots are similar in concept to a histogram. A kernel density plot displays the values of a smoothed density curve of the data values. In other words, the kernel density plot is a smoothed version of a histogram.
- Violin plot : A violin plot combines attributes of boxplots and a kernel density estimation plot. Like a box plot, the violin plots can be stacked, with a 'group by' operation. Additionally, the violin plot provides a kernel density estimate for each group. As with the box plot, violin plots allow you to display two dimensions of your dataset.
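To make the chart types above concrete, here is a minimal sketch that draws most of them side by side with Matplotlib. The data are randomly generated with a fixed seed and only borrow column names from the dataset, so the shapes say nothing about real Pokemon. A standalone KDE curve is left out to keep the example short; the violin panel shows the per-group kernel density estimate instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data with a fixed seed; illustrative only, not the real dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "attack": rng.normal(75, 20, n).clip(5, 180),
    "type1": rng.choice(["fire", "grass", "water"], n),
    "generation": rng.integers(1, 8, n),
})
toy["defense"] = (0.6 * toy["attack"] + rng.normal(30, 15, n)).clip(5, 200)

# 'Group by' step used by the box and violin plots: attack values per type.
types = sorted(toy["type1"].unique())
groups = [toy.loc[toy["type1"] == t, "attack"].to_numpy() for t in types]
positions = list(range(1, len(types) + 1))

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
(ax_scatter, ax_line, ax_bar), (ax_hist, ax_box, ax_violin) = axes

ax_scatter.scatter(toy["attack"], toy["defense"], s=10)
ax_scatter.set_title("Scatter plot")

# Line plot: discrete points (mean attack per generation) connected by lines.
mean_attack = toy.groupby("generation")["attack"].mean()
ax_line.plot(mean_attack.index, mean_attack.to_numpy(), marker="o")
ax_line.set_title("Line plot")

# Bar plot: count of each unique category.
counts = toy["type1"].value_counts().sort_index()
ax_bar.bar(counts.index, counts.to_numpy())
ax_bar.set_title("Bar plot")

# Histogram: count of numeric values falling in each of 20 equal-width bins.
ax_hist.hist(toy["attack"], bins=20)
ax_hist.set_title("Histogram")

box = ax_box.boxplot(groups)
ax_box.set_xticks(positions)
ax_box.set_xticklabels(types)
ax_box.set_title("Box plot")

violin = ax_violin.violinplot(groups, showmedians=True)
ax_violin.set_xticks(positions)
ax_violin.set_xticklabels(types)
ax_violin.set_title("Violin plot")

fig.tight_layout()
# In a notebook the figure renders at the end of the cell; in a script, call plt.show().
```

Seaborn and Pandas plotting offer shorter calls for the same charts; plain Matplotlib is used here only to keep the sketch to a single dependency for drawing.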
These lessons were prepared by Praneet Nigam, who is currently working as a Machine Learning Facilitator for the Google Machine Learning Crash Course. To get in touch with him, use the contact details and social media links listed below.
- Email : [email protected]
Some of the past projects of Praneet Nigam
You can buy me a cup of coffee. Even a small contribution goes a long way. Please Donate Here
In this tutorial we will work with powerful Python packages like Pandas, Matplotlib and Seaborn. These packages have extensive online documentation. There is an extensive tutorial on Visualization with Pandas. The Seaborn tutorial contains many examples of data visualization. The Matplotlib website has additional resources for learning plotting with Python tools.