Commit

Various minor style improvements (skrub-data#431)

* Fix style in examples

* Format CHANGES and CONTRIBUTING

LilianBoulard authored Jan 6, 2023
1 parent 455a672 commit 06516a6

Showing 6 changed files with 97 additions and 88 deletions.
43 changes: 28 additions & 15 deletions CHANGES.rst
@@ -7,16 +7,21 @@ Major changes
-------------

* New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
on string similarities and the nearest neighbors matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download any indicator from the World Bank Open Data platform. It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* Unnecessary API has been made private: everything (files, functions, classes) starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
* The MinHashEncoder now supports a `n_jobs` parameter to parallelize the hashes computation. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
on string similarities and the nearest neighbors matches are found for each category.
:pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator`
that can be used to download any indicator from the World Bank Open Data platform.
It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* Unnecessary API has been made private: everything (files, functions, classes)
starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
* The :class:`MinHashEncoder` now supports a `n_jobs` parameter to parallelize
the hashes computation. :pr:`267` by :user:`Leo Grinsztajn <LeoGrin>` and :user:`Lilian Boulard <LilianBoulard>`.
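The :func:`fuzzy_join` entry above can be illustrated without skrub itself: approximate key matching on string similarity boils down to vectorizing both key columns with character n-grams and taking the nearest neighbor for each left-hand key. A minimal sketch of that idea (the table and column names are made up, and this is not the library's actual implementation):

```python
# Illustrative sketch of approximate key matching in the spirit of
# fuzzy_join: character n-gram vectors + nearest-neighbor search.
# Not skrub's implementation; tables and column names are invented.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

left = pd.DataFrame({"country": ["Franc", "Germeny", "Itali"]})
right = pd.DataFrame({"country": ["France", "Germany", "Italy", "Spain"]})

# Vectorize both key columns with the same character n-gram vocabulary
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
right_vecs = vec.fit_transform(right["country"])
left_vecs = vec.transform(left["country"])

# For each left-hand key, find the closest right-hand key
nn = NearestNeighbors(n_neighbors=1).fit(right_vecs)
_, idx = nn.kneighbors(left_vecs)
matched = right.iloc[idx.ravel()].reset_index(drop=True)

# Assemble the joined table, suffixing the matched columns
joined = pd.concat([left, matched.add_suffix("_right")], axis=1)
```

Despite the misspelled keys on the left, each row pairs up with its intended counterpart, which is the point of joining on similarity rather than on exact equality.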

Minor changes
-------------
* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :class:`MinHashEncoder`'s `minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* :class:`MinHashEncoder`'s :func:`minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* Fetching functions now have an additional argument ``directory``,
  which can be used to specify where to save datasets and load them from.
@@ -25,7 +30,7 @@ Minor changes
Bug fixes
---------

* :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
* The :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`

Release 0.3.0
@@ -34,7 +39,9 @@ Release 0.3.0
Major changes
-------------

* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns
(year, month, day, hour, minute, second, ...). It is now the default transformer used
in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
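The expansion the :class:`DatetimeEncoder` entry describes can be sketched with plain pandas; the exact set of features skrub extracts may differ from this illustration:

```python
# Illustrative only: expand a datetime column into several numeric
# columns, as a DatetimeEncoder-style transformer would.
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["2021-03-15 14:30:00", "2022-07-01 09:05:00"], name="hired")
)
expanded = pd.DataFrame(
    {
        "year": dates.dt.year,
        "month": dates.dt.month,
        "day": dates.dt.day,
        "hour": dates.dt.hour,
        "minute": dates.dt.minute,
    }
)
```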

* The :class:`SuperVectorizer` has seen some major improvements and bug fixes:

@@ -52,19 +59,21 @@ Major changes
* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`

* Bumped minimum dependencies:

- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`

* Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,

- The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,
and the `similarity` parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`

Notes
-----

* The ``transformers_`` attribute of the SuperVectorizer now contains column
* The ``transformers_`` attribute of the :class:`SuperVectorizer` now contains column
names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`


@@ -74,8 +83,8 @@ Release 0.2.2
Bug fixes
---------

* Fixed a bug in the :class:`SuperVectorizer` causing a `FutureWarning`
when using the `get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`
* Fixed a bug in the :class:`SuperVectorizer` causing a :class:`FutureWarning`
when using the :func:`get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`


Release 0.2.1
@@ -85,14 +94,18 @@ Major changes
-------------

* Improvements to the :class:`SuperVectorizer`

- Type detection works better: handles dates, numeric columns encoded as strings, or numeric columns containing strings for missing values.

:pr:`238` by :user:`Leo Grinsztajn <LeoGrin>`

* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API. `get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`
* :func:`get_feature_names` becomes :func:`get_feature_names_out`, following changes in the scikit-learn API.
:func:`get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`

* Improvements to the :class:`MinHashEncoder`
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method, on multiple columns.
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`.
Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` function,
on multiple columns.

:pr:`243` by :user:`Jovan Stojanovic <jovan-stojanovic>`

@@ -104,7 +117,7 @@ Bug-fixes

* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`

* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
* Fixed :class:`DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
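For context on the `LooseVersion` fix above: `distutils.version` is deprecated, and `packaging.version` is the usual replacement (the diff shown here does not reveal which substitute the PR chose):

```python
# distutils.version.LooseVersion emits a DeprecationWarning on modern
# Pythons; packaging.version is the standard PEP 440-compliant replacement.
from packaging.version import Version, parse

old = parse("0.23.2")
new = Version("1.0")
assert old < new  # PEP 440 ordering
```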

Notes
-----
@@ -113,7 +126,7 @@ Notes

* Fix typos and update links for website.

* Documentation of the SuperVectorizer and the :class:`SimilarityEncoder` improved.
* Documentation of the :class:`SuperVectorizer` and the :class:`SimilarityEncoder` improved.

Release 0.2.0
=============
2 changes: 1 addition & 1 deletion CONTRIBUTING.rst
@@ -10,7 +10,7 @@ The following is a set of guidelines for contributing to
.. contents::
:local:

|
I don’t want to read the whole thing I just have a question
53 changes: 26 additions & 27 deletions examples/01_dirty_categories.py
@@ -85,8 +85,7 @@
X.dropna(subset=["gender"], inplace=True)
y = y[~mask]

# %%
#
# #############################################################################
# Assembling a machine-learning pipeline that encodes the data
# ------------------------------------------------------------
#
@@ -96,7 +95,7 @@
# To build a learning pipeline, we need to assemble encoders for each
# column, and apply a supervised learning model on top.

# %%
###############################################################################
# The categorical encoders
# ........................
#
@@ -106,7 +105,7 @@

one_hot = OneHotEncoder(handle_unknown="ignore", sparse=False)

# %%
###############################################################################
# We assemble these to apply them to the relevant columns.
# The |ColumnTransformer| is created by specifying a set of transformers
# alongside with the column names on which each must be applied:
@@ -121,7 +120,7 @@
remainder="drop",
)

# %%
###############################################################################
# Pipelining an encoder with a learner
# ....................................
#
@@ -139,11 +138,11 @@

pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())

# %%
###############################################################################
# The pipeline can be readily applied to the dataframe for prediction:
pipeline.fit(X, y)

# %%
###############################################################################
# Dirty-category encoding
# -----------------------
#
@@ -153,7 +152,7 @@

np.unique(y)

# %%
###############################################################################
# We will now experiment with encoders specially made for handling
# dirty columns:

@@ -172,7 +171,7 @@
"gap": GapEncoder(n_components=100),
}

# %%
###############################################################################
# We now loop over the different encoding methods,
# instantiate a new |Pipeline| each time, fit it
# and store the returned cross-validation score:
@@ -196,7 +195,7 @@
print(f"r2 score: mean: {np.mean(scores):.3f}; std: {np.std(scores):.3f}\n")
all_scores[name] = scores

# %%
###############################################################################
# Plotting the results
# ....................
#
@@ -212,7 +211,7 @@
plt.yticks(size=20)
plt.tight_layout()

# %%
###############################################################################
# The clear trend is that encoders grasping similarities between categories
# (|SE|, |MinHash|, and |Gap|) perform better than those that discard them.
#
@@ -225,7 +224,7 @@
# |
#

# %%
###############################################################################
# .. _example_super_vectorizer:
#
# A simpler way: automatic vectorization
@@ -239,20 +238,20 @@
X = employee_salaries.X
y = employee_salaries.y

# %%
###############################################################################
# We'll drop the 'date_first_hired' column as it's redundant with
# 'year_first_hired'.
X = X.drop(["date_first_hired"], axis=1)

# %%
###############################################################################
# We still have a complex and heterogeneous dataframe:
X

# %%
# The |SV| can turn this dataframe into a form suited for
# machine learning.

# %%
###############################################################################
# Using the SuperVectorizer in a supervised-learning pipeline
# -----------------------------------------------------------
#
@@ -269,7 +268,7 @@
SuperVectorizer(auto_cast=True), HistGradientBoostingRegressor()
)

# %%
###############################################################################
# Let's perform a cross-validation to see how well this model predicts:

from sklearn.model_selection import cross_val_score
@@ -280,12 +279,12 @@
print(f"mean={np.mean(scores)}")
print(f"std={np.std(scores)}")

# %%
###############################################################################
# The prediction performed here is pretty much as good as above,
# but the code here is much simpler as it does not involve specifying
# columns manually.

# %%
###############################################################################
# Analyzing the features created
# ------------------------------
#
@@ -304,15 +303,15 @@
X_train_enc = sup_vec.fit_transform(X_train, y_train)
X_test_enc = sup_vec.transform(X_test)

# %%
###############################################################################
# The encoded data, X_train_enc and X_test_enc, are numerical arrays:
X_train_enc

# %%
# They have more columns than the original dataframe, but not much more:
X_train.shape, X_train_enc.shape

# %%
###############################################################################
# Inspecting the features created
# ...............................
#
@@ -322,7 +321,7 @@

pprint(sup_vec.transformers_)

# %%
###############################################################################
# This is what is being passed to the |ColumnTransformer| under the hood.
# If you're familiar with how the latter works, it should be very intuitive.
# We can notice it classified the columns 'gender' and 'assignment_category'
@@ -338,12 +337,12 @@
# Before encoding:
X.columns.to_list()

# %%
###############################################################################
# After encoding (we only plot the first 8 feature names):
feature_names = sup_vec.get_feature_names_out()
feature_names[:8]

# %%
###############################################################################
# As we can see, it gave us interpretable columns.
# This is because we used the |Gap| on the column 'division',
# which was classified as a high cardinality string variable.
Expand All @@ -353,7 +352,7 @@
len(feature_names)


# %%
###############################################################################
# Feature importances in the statistical model
# --------------------------------------------
#
Expand All @@ -372,7 +371,7 @@
regressor = RandomForestRegressor()
regressor.fit(X_train_enc, y_train)

# %%
###############################################################################
# Retrieving the feature importances:

importances = regressor.feature_importances_
@@ -381,7 +380,7 @@
# Sort from least to most
indices = list(reversed(indices))

# %%
###############################################################################
# Plotting the results:

import matplotlib.pyplot as plt
@@ -396,7 +395,7 @@
plt.tight_layout(pad=1)
plt.show()

# %%
###############################################################################
# We can deduce from this data that the three factors that most influence
# the salary are: being hired for a long time, being a manager, and
# having a permanent, full-time job :)