Commit

Various minor style improvements (skrub-data#431)

* Fix style in examples

* Format CHANGES and CONTRIBUTING

LilianBoulard authored Jan 6, 2023
1 parent 455a672 commit 06516a6

Showing 6 changed files with 97 additions and 88 deletions.
43 changes: 28 additions & 15 deletions CHANGES.rst
@@ -7,16 +7,21 @@ Major changes
-------------

* New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
on string similarities and the nearest neighbors matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download any indicator from the World Bank Open Data platform. It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
* The MinHashEncoder now supports a `n_jobs` parameter to parallelize the hashes computation. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
on string similarities and the nearest neighbors matches are found for each category.
:pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator`
that can be used to download any indicator from the World Bank Open Data platform.
It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* Unnecessary API has been made private: everything (files, functions, classes)
starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
* The :class:`MinHashEncoder` now supports a `n_jobs` parameter to parallelize
the hashes computation. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
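The :func:`fuzzy_join` entry above can be illustrated without skrub itself: approximate key matching on string similarity boils down to vectorizing both key columns with character n-grams and taking the nearest neighbor for each left-hand key. A minimal sketch of that idea (the table and column names are made up, and this is not the library's actual implementation):

```python
# Illustrative sketch of approximate key matching in the spirit of
# fuzzy_join: character n-gram vectors + nearest-neighbor search.
# Not skrub's implementation; tables and column names are invented.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

left = pd.DataFrame({"country": ["Franc", "Germeny", "Itali"]})
right = pd.DataFrame({"country": ["France", "Germany", "Italy", "Spain"]})

# Vectorize both key columns with the same character n-gram vocabulary
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
right_vecs = vec.fit_transform(right["country"])
left_vecs = vec.transform(left["country"])

# For each left-hand key, find the closest right-hand key
nn = NearestNeighbors(n_neighbors=1).fit(right_vecs)
_, idx = nn.kneighbors(left_vecs)
matched = right.iloc[idx.ravel()].reset_index(drop=True)

# Assemble the joined table, suffixing the matched columns
joined = pd.concat([left, matched.add_suffix("_right")], axis=1)
```

Despite the misspelled keys on the left, each row pairs up with its intended counterpart, which is the point of joining on similarity rather than on exact equality.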

Minor changes
-------------
* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :class:`MinHashEncoder`'s `minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Fetching functions now have an additional argument ``directory``,
  which can be used to specify where to save datasets and load them from.
@@ -25,7 +30,7 @@ Minor changes
Bug fixes
---------

* :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
* The :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`

Release 0.3.0
@@ -34,7 +39,9 @@ Release 0.3.0
Major changes
-------------

* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns
(year, month, day, hour, minute, second, ...). It is now the default transformer used
in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
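The expansion the :class:`DatetimeEncoder` entry describes can be sketched with plain pandas; the exact set of features skrub extracts may differ from this illustration:

```python
# Illustrative only: expand a datetime column into several numeric
# columns, as a DatetimeEncoder-style transformer would.
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2021-03-15 14:30:00", "2022-07-01 09:05:00"], name="hired")
)
expanded = pd.DataFrame(
    {
        "year": dates.dt.year,
        "month": dates.dt.month,
        "day": dates.dt.day,
        "hour": dates.dt.hour,
        "minute": dates.dt.minute,
    }
)
```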

* The :class:`SuperVectorizer` has seen some major improvements and bug fixes:

@@ -52,19 +59,21 @@ Major changes
* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`

* Bumped minimum dependencies:

- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`

* Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,

- The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,
and the `similarity` parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`

Notes
-----

* The ``transformers_`` attribute of the SuperVectorizer now contains column
* The ``transformers_`` attribute of the :class:`SuperVectorizer` now contains column
names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`


@@ -74,8 +83,8 @@ Release 0.2.2
Bug fixes
---------

* Fixed a bug in the :class:`SuperVectorizer` causing a `FutureWarning`
when using the `get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`
* Fixed a bug in the :class:`SuperVectorizer` causing a :class:`FutureWarning`
when using the :func:`get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`


Release 0.2.1
@@ -85,14 +94,18 @@ Major changes
-------------

* Improvements to the :class:`SuperVectorizer`

- Type detection works better: handles dates, numeric columns encoded as strings, or numeric columns containing strings for missing values.

:pr:`238` by :user:`Leo Grinsztajn <LeoGrin>`

* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API. `get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`
* :func:`get_feature_names` becomes :func:`get_feature_names_out`, following changes in the scikit-learn API.
:func:`get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`

* Improvements to the :class:`MinHashEncoder`
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method, on multiple columns.
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`.
Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` function,
on multiple columns.

:pr:`243` by :user:`Jovan Stojanovic <jovan-stojanovic>`

@@ -104,7 +117,7 @@ Bug-fixes

* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`

* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
* Fixed :class:`DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
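For context on the `LooseVersion` fix above: `distutils.version` is deprecated, and `packaging.version` is the usual replacement (the diff shown here does not reveal which substitute the PR chose):

```python
# distutils.version.LooseVersion emits a DeprecationWarning on modern
# Pythons; packaging.version is the standard PEP 440-compliant replacement.
from packaging.version import Version, parse

old = parse("0.23.2")
new = Version("1.0")
assert old < new  # PEP 440 ordering
```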

Notes
-----
@@ -113,7 +126,7 @@ Notes

* Fix typos and update links for website.

* Documentation of the SuperVectorizer and the :class:`SimilarityEncoder` improved.
* Documentation of the :class:`SuperVectorizer` and the :class:`SimilarityEncoder` improved.

Release 0.2.0
=============
2 changes: 1 addition & 1 deletion CONTRIBUTING.rst
@@ -10,7 +10,7 @@ The following is a set of guidelines for contributing to
.. contents::
:local:

|
I don’t want to read the whole thing I just have a question
53 changes: 26 additions & 27 deletions examples/01_dirty_categories.py
@@ -85,8 +85,7 @@
X.dropna(subset=["gender"], inplace=True)
y = y[~mask]

# %%
#
# #############################################################################
# Assembling a machine-learning pipeline that encodes the data
# ------------------------------------------------------------
#
@@ -96,7 +95,7 @@
# To build a learning pipeline, we need to assemble encoders for each
# column, and apply a supervised learning model on top.

# %%
###############################################################################
# The categorical encoders
# ........................
#
@@ -106,7 +105,7 @@

one_hot = OneHotEncoder(handle_unknown="ignore", sparse=False)

# %%
###############################################################################
# We assemble these to apply them to the relevant columns.
# The |ColumnTransformer| is created by specifying a set of transformers
# alongside with the column names on which each must be applied:
@@ -121,7 +120,7 @@
remainder="drop",
)

# %%
###############################################################################
# Pipelining an encoder with a learner
# ....................................
#
@@ -139,11 +138,11 @@

pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

# %%
###############################################################################
# The pipeline can be readily applied to the dataframe for prediction:
pipeline.fit(X, y)

# %%
###############################################################################
# Dirty-category encoding
# -----------------------
#
@@ -153,7 +152,7 @@

np.unique(y)

# %%
###############################################################################
# We will now experiment with encoders specially made for handling
# dirty columns:

@@ -172,7 +171,7 @@
"gap": GapEncoder(n_components=100),
}

# %%
###############################################################################
# We now loop over the different encoding methods,
# instantiate a new |Pipeline| each time, fit it
# and store the returned cross-validation score:
@@ -196,7 +195,7 @@
print(f"r2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
all_scores[name] = scores

# %%
###############################################################################
# Plotting the results
# ....................
#
@@ -212,7 +211,7 @@
plt.yticks(size=20)
plt.tight_layout()

# %%
###############################################################################
# The clear trend is that encoders grasping similarities between categories
# (|SE|, |MinHash|, and |Gap|) perform better than those that discard them.
#
@@ -225,7 +224,7 @@
# |
#

# %%
###############################################################################
# .. _example_super_vectorizer:
#
# A simpler way: automatic vectorization
@@ -239,20 +238,20 @@
X = employee_salaries.X
y = employee_salaries.y

# %%
###############################################################################
# We'll drop the 'date_first_hired' column as it's redundant with
# 'year_first_hired'.
X = X.drop(["date_first_hired"], axis=1)

# %%
###############################################################################
# We still have a complex and heterogeneous dataframe:
X

# %%
# The |SV| can turn this dataframe into a form suited for
# machine learning.

# %%
###############################################################################
# Using the SuperVectorizer in a supervised-learning pipeline
# -----------------------------------------------------------
#
@@ -269,7 +268,7 @@
SuperVectorizer(auto_cast=True), HistGradientBoostingRegressor()
)

# %%
###############################################################################
# Let's perform a cross-validation to see how well this model predicts:

from sklearn.model_selection import cross_val_score
@@ -280,12 +279,12 @@
print(f"mean={np.mean(scores)}")
print(f"std={np.std(scores)}")

# %%
###############################################################################
# The prediction performed here is pretty much as good as above,
# but the code here is much simpler as it does not involve specifying
# columns manually.

# %%
###############################################################################
# Analyzing the features created
# ------------------------------
#
@@ -304,15 +303,15 @@
X_train_enc = sup_vec.fit_transform(X_train, y_train)
X_test_enc = sup_vec.transform(X_test)

# %%
###############################################################################
# The encoded data, X_train_enc and X_test_enc, are numerical arrays:
X_train_enc

# %%
# They have more columns than the original dataframe, but not much more:
X_train.shape, X_train_enc.shape

# %%
###############################################################################
# Inspecting the features created
# ...............................
#
@@ -322,7 +321,7 @@

pprint(sup_vec.transformers_)

# %%
###############################################################################
# This is what is being passed to the |ColumnTransformer| under the hood.
# If you're familiar with how the latter works, it should be very intuitive.
# We can notice it classified the columns 'gender' and 'assignment_category'
@@ -338,12 +337,12 @@
# Before encoding:
X.columns.to_list()

# %%
###############################################################################
# After encoding (we only plot the first 8 feature names):
feature_names = sup_vec.get_feature_names_out()
feature_names[:8]

# %%
###############################################################################
# As we can see, it gave us interpretable columns.
# This is because we used the |Gap| on the column 'division',
# which was classified as a high cardinality string variable.
Expand All @@ -353,7 +352,7 @@
len(feature_names)


# %%
###############################################################################
# Feature importances in the statistical model
# --------------------------------------------
#
Expand All @@ -372,7 +371,7 @@
regressor = RandomForestRegressor()
regressor.fit(X_train_enc, y_train)

# %%
###############################################################################
# Retrieving the feature importances:

importances = regressor.feature_importances_
@@ -381,7 +380,7 @@
# Sort from least to most
indices = list(reversed(indices))

# %%
###############################################################################
# Plotting the results:

import matplotlib.pyplot as plt
@@ -396,7 +395,7 @@
plt.tight_layout(pad=1)
plt.show()

# %%
###############################################################################
# We can deduce from this data that the three factors that most influence
# the salary are: being hired for a long time, being a manager, and
# having a permanent, full-time job :)