Skip to content

Presenting my data science tool kit that might help students better navigate in the Machine Learning seas!

Notifications You must be signed in to change notification settings

hervelao/my-data-science-tool-belt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

my-data-science-tool-belt

NOTICE: This repository may help students better navigate in the Machine Learning seas 🌊!

In the near future, I also intend to add a collection of minimal and clean implementations of machine learning algorithms (so that people would be able to easily play with the libraries), and add resources to help people better review their data science knowledge and prepare them for data science questions.

To install some of my packages, you can use pip:

pip install -r requirements.txt

Don't hesitate to contribute to the repo 🔥🔥🔥

Summary

The Zen of Python

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one -- and preferably only one -- obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Keyboard's shortcuts (Mac)

Sublime Text

open a file in current project             # cmd + P (or cmd + T)
move a line upwards in file                # ctrl + cmd + ↑
move a line downwards in file              # ctrl + cmd + ↓
(un)comment lines                          # cmd + shift + / (or cmd + /)
search in file                             # cmd + F
search in project                          # cmd + shift + F
add package                                # cmd + shift + P
add indentation                            # tab
remove indentation                         # shift + tab
reach next word                            # alt + →
reach previous word                        # alt + ←
reach end of line                          # cmd + →
reach beginning of line                    # cmd + ←
close tab                                  # cmd + w
reopen last tab closed                     # cmd + shift + T
navigate to tab on the right               # cmd + alt + →
navigate to tab on the left                # cmd + alt + ←
duplicate line                             # cmd + shift + D
split view into two columns                # cmd + alt + 2
(un)toggle sidebar                         # cmd + K, cmd + B
select similar words                       # cmd + shift + G
put cursor in several places               # cmd + click

Terminal

open a new tab                             # cmd + T
close tab                                  # cmd + W
clear window (keeping history)             # ctrl + L
clear window (losing history)              # cmd + K
reach next word                            # alt + →
reach previous word                        # alt + ←
reach beginning of line                    # ctrl + A
reach end of line                          # ctrl + E
erase the whole line                       # ctrl + U
navigate to tab on the right               # cmd + shift + →
navigate to tab on the left                # cmd + shift + ←

Chrome

open developer tools (elements)            # cmd + alt + I
open developer tools (console)             # cmd + alt + J
navigate to tab on the right               # cmd + alt + →
navigate to tab on the left                # cmd + alt + ←
open a new tab                             # cmd + t
open a new incognito window                # cmd + shift + N
Show downloads                             # cmd + shift + J
focus on address bar                       # cmd + L
close tab                                  # cmd + W
reopen last tab closed                     # cmd + shift + T
hard refresh (clear cache)                 # cmd + shift + R

Jupyter Notebook

BOTH MODES:
run selected cell                          # ctrl + enter
run cell and insert below                  # alt + enter
run cell and select below                  # shift + enter
save and checkpoint                        # cmd + S

COMMAND MODE:
enter command mode (blue):                 # esc (or click anywhere outsite the cell)
add new cell above the current one         # A
add new cell below the current one         # B
delete selected cells                      # D, D (press the key twice)
open command palette                       # P
show all shortcuts                         # H
selected cells above                       # shift + ↑
selected cells below                       # shift + ↓
cut selected cells                         # X
copy selected cells                        # C
paste cells below                          # V
undo cell deletion                         # shift + V
change the cell type to Code               # Y
change the cell type to Markdown           # M
scroll notebook up                         # shift + space
scroll notebook down                       # space

EDIT MODE:
enter edit mode (green):                   # enter (or click in a cell)
(un)comment lines                          # cmd + ! (or cmd + /)
put cursor in several places               # cmd + click
open command palette                       # cmd + shift + P
indent                                     # tab
open tooltip                               # shift + tab

🚀 For more keyboard shortcuts, Help -> Keyboard Shortcuts from Menu bar.

♥️ nbextensions

pip install jupyter_contrib_nbextensions
pip install jupyter_nbextensions_configurator
jupyter contrib nbextension install --user
pip install rise

Jupyter Lab

BOTH MODES:
run selected cell                          # ctrl + enter
run cell and insert below                  # alt + enter
run cell and select below                  # shift + enter
save notebook                              # cmd + S
save notebook as                           # cmd + shift + S

COMMAND MODE:
enter command mode (blue):                 # esc (or click anywhere outsite the cell)
add new cell above the current one         # A
add new cell below the current one         # B
delete selected cells                      # D, D (press the key twice)
selected cells above                       # shift + ↑
selected cells below                       # shift + ↓
cut selected cells                         # X
copy selected cells                        # C
paste cells below                          # V
change the cell type to Code               # Y
change the cell type to Markdown           # M

EDIT MODE:
enter edit mode (green):                   # enter (or click in a cell)
(un)comment lines                          # cmd + ! (or cmd + /)
put cursor in several places               # cmd + click
indent                                     # tab

Magic functions

compute time for your code to run          # %%time
show your plots (Jupyter Lab)              # %matplotlib inline

reload the kernel every time..             # %load_ext autoreload
..you change something in your code        # %autoreload 2

show the variables in your environment     # %who_ls

Mastering git

2 master rules:

  • Never commit directly to master. Use feature branches;
  • Always make sure git status is clean pull, checkout or merge.

⚠️ Don't forget to sweep or delete your branches when they've become unnecessary! git branch -d branchname

How to use features branches

git:(master)                               git status
# Make sure it is clean

git:(master)                               git checkout -b my-new-feature
git:(my-feature)                           git branch

We've created a new local branch called my-new-feature. Therefore, any new commit will only apply to this branch.

How to push a branch to GitHub

git:(my-feature)                           git add .
git:(my-feature)                           git commit -m 'clear and meaningful commit'
git:(my-feature)                           git push origin my-feature

git:(my-feature)                           git status
# Make sure this is clean

git:(my-feature)                           hub browse
# You can now create a pull request on GitHub

git:(my-feature)                           git checkout master
git:(master)                               git pull origin master
git:(master)                               git sweep

git:(master)                               git checkout -b my-new-feature

How to merge your branch to the master branch

git:(my-new-feature)                       git add .
git:(my-new-feature)                       git commit -m 'clear and meaningful commit'

git:(my-new-feature)                       git status
# Make sure this is clean

git:(my-new-feature)                       git checkout master
git:(master)                               git pull origin master

git:(master)                               git checkout my-new-feature
git:(my-new-feature)                       git merge master

How to solve your conflicts

When you can't merge because of conflicts in an unmergeable_branch, follow this process:

git:(unmergeable_branch)                   git status
# Make sure this is clean

git:(unmergeable_branch)                   git checkout master

git:(master)                               git pull origin master

git:(unmergeable_branch)                   git checkout unmergeable_branch

git:(unmergeable_branch)                   git merge master

# Now you've fetched all conflicts locally, time to solve them 🕵️‍♀️
# Open your code editor and solve conflicts (locate them with cmd + shift + F)

# After that, you can commit your code
git:(unmergeable_branch)                   git add .
git:(unmergeable_branch)                   git commit -m "conflict solving"
git:(unmergeable_branch)                   git push origin unmergeable_branch

git:(unmergeable_branch)                   hub browse
# You can now create a pull request on GitHub

git:(unmergeable_branch)                   git checkout master
git:(master)                               git pull origin master
git:(master)                               git sweep

git:(master)                               git checkout -b my-new-feature

How to fix an accidental commit to master

When you committed some changes to master and wanted to commit them to a new branch, follow this process:

# First, create a new branch from the current state of master
git:(master)                               git checkout -b some-new-branch-name

# Remove the commit from the master branch
git:(some-new-branch-name)                 git checkout master
git:(master)                               git reset HEAD~ --hard

git:(some-new-branch-name)                 git checkout some-new-branch-name
# Your commit lives in this branch now

Mastering Pandas

Below a small demonstration of main Pandas methods

# For printed results in Jupyter notebooks
pd.set_option("display.precision", 2)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# Basic methods
df = pd.read_csv('../data/file.csv')
df.head()
df.shape()
df.columns()
df.info()
df.describe() # for non-numerical features, add include=['object', 'bool']
df['Age'].value_counts() # to calculate fractions, add normalize=True

# Data types: bool, int64, float64 or object
df['Height'] = df['Height'].astype('int64')

# Sorting
df.sort_values(by='Age', ascending=False)
df.sort_values(by=['Age', 'Height], ascending=[True, False])

# Indexing and retrieving data
df['Height'].mean()
df[df['Age'] == 25].mean()
df[(df['Age'] == 25) & (df['Height'] < 185)]['Total minutes played'].mean()
df.loc[0:4, 'Age':'City'] # indexing by name, 4 and 'City' are inclusive
df.iloc[0:4, 0:5] # indexing by number, 4 and 5 are inclusive
df[-1:] # last line of the dataframe

# Applying and mapping functions
df.apply(np.max)
df[df['City'].apply(lambda city: city[0] == 'P')].head() # cities starting with P
df['column_name'] = df['column_name'].map({old_value: new_value})
SAME AS
df['column_name'] = df.replace({'column_name': {old_value: new_value}})

# Grouping
df.groupby(['Team'])[columns_to_show].describe(percentiles=[])
SAME AS
df.groupby(['Team'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])

# Summary tables
pd.crosstab(df['Team'], df['Height']) # may add normalize=True
df.pivot_table(['Age', 'Height'], ['Team'], aggfunc='mean')

# Transformations
df['X_total'] = df['X1'] + df['X2']
df['X_total'] = df.apply(lambda row: row.X1 + row.X2, axis=1)
df["Player_height_mean"] = df.groupby('Player_ID')["Height"].transform('mean')
df.drop(['X1', 'X2'], axis=1, inplace=True) # got rid of columns
df.drop([1, 2]).head() # got rid of rows

Mastering Data Visualization

Leads you to the most appropriate graph for your data. It links to the code to build it and lists common caveats you should avoid.

2 interfaces for plotting:

  • MATLAB style plotting using pyplot (METHOD 1);
  • Object Oriented Interface (METHOD 2).

METHOD 1: MATLAB style plotting (doc)

plt.figure(figsize=(12, 8))
plt.plot(x, y**2, c='black', lw=3, ls='--', label='quadratic')
plt.plot(x, y**3, c='pink', lw=3, ls='--', label='cubic')
plt.title('title 1')
plt.xlabel('axe x')
plt.ylabel('axe y')
plt.legend()
plt.show()
plt.savefig('my_fig.png') # vector pdf image might be interesting

# To divide the figure in a 2x1 grid
plt.subplot(2,1,1)
plt.plot(x, y, c='orange')
plt.subplot(2,1,2)
plt.plot(x, y, c='blue')

METHOD 2: Object Oriented Interface (doc)

# Get an empty figure, a fig object can contain one or more axes objects
fig = plt.figure()

# Get the axes instance at 1st location in 1x1 grid, one axes represents one plot inside fig
# fig.add_subplot(2,2,1) would generate a grid of 2x2 subplots and and get for 1st location
ax = fig.add_subplot(1,1,1)

# Generate the plot
ax.plot(x, y)

# Set labels for x and y axis
ax.set_xlabel('X label')
ax.set_ylabel('Y label')

# Set title for the plot
ax.set_title('Title')

# Display the figure
plt.show()

# Another way to do it could be: fig, ax = plt.subplots(2, 1, sharex=True)

Matplotlib & seaborn demonstration

Below a small demonstration of the main graphics and tables that can be used with matplotlib and seaborn (example gallery using seaborn)

# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

# Change the display settings
%config InlineBackend.figure_format = 'retina' # Graphics are more sharp. 'svg'/'png'
plt.rcParams['figure.figsize'] = 8, 5
plt.rcParams['image.cmap'] = 'viridis'

# Histograms and density plots
df[features].hist(figsize=(10, 4));
df[features].plot(kind='density', subplots=True, layout=(1, 2),
                sharex=False, figsize=(10, 4)); # see 'kind='bar', or 'kde', 'scatter', etc
sns.distplot(df['feature']);

# Box plot
sns.boxplot(x='num_feature', y='cat_feature', data=df, orient='h');

# Violin plot
_, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4))
sns.boxplot(data=df['feature'], ax=axes[0]);
sns.violinplot(data=df['feature'], ax=axes[1]);

# Bar plot
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
sns.countplot(x='cat_feat_1', data=df, ax=axes[0]); # can add "hue" for another cat_feat
sns.countplot(x='cat_feat_2', data=df, ax=axes[1]);

# Correlation matrix
corr_matrix = df[num_features].corr()
sns.heatmap(corr_matrix, annot=True, fmt=".1f", linewidths=.5);
# heatmap allows to view the distribution of a num over two cat

# Scatter plot and Scatter plot matrix
plt.scatter(df['X1'], df['X2']);
sns.jointplot(x='X1', y='X2', data=df, kind='scatter');
sns.jointplot('X1', 'X2', data=df, kind="kde", color="g");
sns.pairplot(df[numerical]);

# Plot with categories
sns.lmplot('X1', 'X2', data=df, hue='cat_feat', fit_reg=False);
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(10, 7))
for idx, feat in enumerate(numerical):
    ax = axes[int(idx / 4), idx % 4]
    sns.boxplot(x='cat', y=feat, data=df, ax=ax)
    ax.set_xlabel('')
    ax.set_ylabel(feat)
fig.tight_layout();
_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
sns.boxplot(x='cat_feat', y='num_feat', data=df, ax=axes[0]);
sns.violinplot(x='cat_feat', y='num_feat', data=df, ax=axes[1]);
sns.catplot(x='cat_feat_1', y='num_feat', col='cat_feat_2',
               data=df[df['cat_feat_2'] < 3], kind="box",
               col_wrap=4, height=3, aspect=.8);

Plotly demonstration

The plotly library can also be used, that allows creation of interactive plots within a Jupyter notebook without having to use Javascript (doc)

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly
import plotly.graph_objs as go

init_notebook_mode(connected=True)

# For line plots
trace0 = go.Scatter(  # Create a line (trace) for the GDP in France
    x=df.index,
    y=df['France GDP'],
    name='France GDP'
)

trace1 = go.Scatter( # Create a line (trace) for the GDP in Germany
    x=df.index,
    y=df['Germany GDP'],
    name='Germany GDP'
) # To use bar chart, you could've replaced 'go.Scatter' by 'go.Bar' in both

data = [trace0, trace1] # Define the data array
layout = {'title': 'France vs Germany GDPs'} # Set the title

fig = go.Figure(data=data, layout=layout) # Create a Figure and plot it
iplot(fig, show_link=False) # Visualize

plotly.offline.plot(fig, filename='gdp_stats.html', show_link=False); # Save the plot in an html file

# For box plots
data = []
for country in df.Country.unique():
    data.append(
        go.Box(y=df[df.Country == country].height, name=country)
    )
iplot(data, show_link=False) # Visualize

Mastering SQL

CRUD

### CREATE
INSERT INTO table_name (column1, column2) VALUES (value1, value2);
INSERT INTO table_name VALUES (value1, value2 …);

### READ
# SELECT: used to select data from a database
SELECT * FROM table_name;
# DISTINCT: filters away duplicate values and returns rows of specified column
SELECT DISTINCT column_name;
# WHERE: used to filter records/rows
SELECT column1, column2 FROM table_name WHERE condition;
# ORDER BY: used to sort the result-set in ascending or descending order
SELECT * FROM table_name ORDER BY column1 ASC, column2;
# LIMIT: used to specify the number of records to return from top of table
LIMIT 10;
# LIKE: operator used in a WHERE clause to search for a specific pattern in a column
LIKE ‘Harry_Potter_%_Azkaban’ # underscore represents a single character
# GROUP BY: statement often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns
GROUP BY column_name1 ORDER BY COUNT(column_name2) DESC;
# HAVING: this clause was added to SQL because the WHERE keyword could not be used with aggregate functions
GROUP BY column_name2 HAVING COUNT(column_name1) > 42;
# WITH: often used for retrieving hierarchical data or re-using temp result set several times in a query. Also referred to as "Common Table Expression"
WITH trips_by_day AS (
  SELECT c0.* FROM categories AS c0 WHERE EXTRACT(YEAR FROM start_date) = 2020
)
SELECT *,
SUM(num_trips)
    OVER (
         PARTITION BY id
         ORDER BY trip_date
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS cumulative_trips
FROM trips_by_day

MIN() (or MAX()) # Returns the minimum (or maximum) of input values
AVG() (or SUM()) # Returns the average (or sum) of input values
COUNT() # Returns the number of rows in the input

FIRST_VALUE() (or LAST_VALUE()) # Returns the first (or last) value in the input
LEAD() (and LAG()) # Returns the value on a subsequent (or preceding) row

ROW_NUMBER() # Returns the order in which rows appear in the input (starting with 1)
RANK() # All rows with the same value in the ordering column receive the same rank value, where the next row receives a rank value which increments by the number of rows with the previous rank value.

### UPDATE
UPDATE table_name SET column1 = value1, column2 = value2 WHERE condition;
UPDATE table_name SET column_name = value;

### DELETE
DELETE FROM table_name WHERE condition;
DELETE * FROM table_name;

Prerequisite warmup

Kaggle Learn, DataQuest [Part1/Part2/Part3], CodeAcademy

Learn the basics of programming languages (Python, SQL)

Basics of calculus, linear algebra, optimization and statistics are essential to understand the toolset you’re going to use. (Disclosure: I know the guy, and he's gold.)

Some combination of math and entertainment. The goal is for explanations to be driven by animations and for difficult problems to be made simple with changes in perspective. BIG MVP!!!

Spoilers: Solve real company problems using data.

Interesting posts

Funny satire.

Shows you the power of RNNs.

Webcomics of romance, sarcasm, science, and language by Randall Munroe and Zach Weinersmith.

Useful resources

The daily email newsletter covering the latest news from Wall St. to Silicon Valley. Informative, witty, and everything you need to start your day.

Visualize data like you have never seen it before.

Level up your coding skills, and achieve code mastery through challenge.

Quickly Build and Deploy a Dashboard.

Unlock the power of your data with interactive dashboards.

Going further

Basic Machine Learning is covered here. Wonderfully well explained.

The world's largest data science community with powerful tools and resources to help you achieve your data science goals. Learn from all previous competitions 🙏

About

Presenting my data science tool kit that might help students better navigate in the Machine Learning seas!

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published