add a github workflow to check if CHANGES.rst has been changed (skrub-data#405)
fcas · Dec 13, 2022 · 6601dd3
1 parent ece35a6
commit 6601dd3
Showing 6 changed files with 350 additions and 48 deletions.
diff --git a/.github/workflows/changelog.yml b/.github/workflows/changelog.yml
@@ -0,0 +1,68 @@
+name: Check Changelog
+# This check makes sure that the changelog is properly updated
+# when a PR introduces a change in a test file.
+# To bypass this check, label the PR with "No Changelog Needed".
+on:
+  pull_request:
+    types: [opened, edited, labeled, unlabeled, synchronize]
+
+jobs:
+  check:
+    name: A reviewer will let you know if it is required or can be bypassed
+    runs-on: ubuntu-latest
+    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 }}
+    steps:
+      - name: Get PR number and milestone
+        run: |
+          echo "PR_NUMBER=${{ github.event.pull_request.number }}" >> $GITHUB_ENV
+          echo "TAGGED_MILESTONE=${{ github.event.pull_request.milestone.title }}" >> $GITHUB_ENV
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: '0'
+      - name: Check the changelog entry
+        run: |
+          set -xe
+          changed_files=$(git diff --name-only origin/main)
+          # Changelog should be updated only if tests have been modified
+          if [[ ! "$changed_files" =~ tests ]]
+          then
+            exit 0
+          fi
+          all_changelogs=$(cat ./CHANGES.rst)
+          if [[ "$all_changelogs" =~ :pr:\`$PR_NUMBER\` ]]
+          then
+            echo "Changelog has been updated."
+            # If the pull request is milestoned, check the corresponding changelog
+            if [[ -f ./CHANGES.rst${TAGGED_MILESTONE:0:4}.rst ]]
+            then
+              expected_changelog=$(cat ./CHANGES.rst${TAGGED_MILESTONE:0:4}.rst)
+              if [[ "$expected_changelog" =~ :pr:\`$PR_NUMBER\` ]]
+              then
+                echo "Changelog and milestone correspond."
+              else
+                echo "Changelog and milestone do not correspond."
+                echo "If you see this error make sure that the tagged milestone for the PR"
+                echo "and the edited changelog filename properly match."
+                exit 1
+              fi
+            fi
+          else
+            echo "A Changelog entry is missing."
+            echo ""
+            echo "Please add an entry to the changelog at 'CHANGES.rst'"
+            echo "to document your change assuming that the PR will be merged"
+            echo "in time for the next release of dirty-cat."
+            echo ""
+            echo "Look at other entries in that file for inspiration and please"
+            echo "reference this pull request using the ':pr:' directive and"
+            echo "credit yourself (and other contributors if applicable) with"
+            echo "the ':user:' directive, for instance :pr:\`453\` by :user:\`Jo Blib <JoBlib>\`."
+            echo ""
+            echo "If you see this error and there is already a changelog entry,"
+            echo "check that the PR number is correct."
+            echo ""
+            echo "If you believe that this PR does not warrant a changelog"
+            echo "entry, say so in a comment so that a maintainer will label"
+            echo "the PR with 'No Changelog Needed' to bypass this check."
+            exit 1
+          fi
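
For readers who want to try the rule locally, the core decision made by the "Check the changelog entry" step can be sketched in Python. This is only an illustrative approximation, not part of the commit: the function name `changelog_check` and its arguments are hypothetical, and the milestone comparison is left out.

```python
import re


def changelog_check(changed_files, changelog_text, pr_number):
    """Return (ok, message), loosely mirroring the workflow's shell logic.

    Hypothetical helper for illustration only; it is not part of the commit.
    """
    # The workflow only asks for an entry when some changed path contains "tests".
    if not any("tests" in path for path in changed_files):
        return True, "No test files changed; changelog entry not required."
    # An entry counts when the changelog references the PR as :pr:`<number>`.
    pattern = re.compile(r":pr:`" + re.escape(str(pr_number)) + r"`")
    if pattern.search(changelog_text):
        return True, "Changelog has been updated."
    return False, "A Changelog entry is missing."


if __name__ == "__main__":
    entry = "* Fixed a bug. :pr:`405` by :user:`Jo Blib <JoBlib>`"
    print(changelog_check(["dirty_cat/tests/test_example.py"], entry, 405))
```

Matching on the closing backtick means that a reference to a different PR whose number merely starts with the same digits does not satisfy the check, which is the same behavior as the regular expression used in the shell step.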
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -7,66 +7,61 @@ Major changes
 -------------
 
 * New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
- on string similarities and the nearest neighbors matches are found for each category.
-* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be
- used to download any indicator from the World Bank Open Data platform. It only needs the
- indicator ID that can be found on the website.
+ on string similarities and the nearest neighbors matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
+* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download any indicator from the World Bank Open Data platform. It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
 * Unnecessary API has been made private: everything (files, functions, classes)
- starting with an underscore shouldn't be imported in your code.
+ starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`
 
 Minor changes
 -------------
-* Removed example `Fitting scalable, non-linear models on data with dirty categories`.
+* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`
 
-* :class:`MinHashEncoder`'s `minhash` method is no longer public.
+* :class:`MinHashEncoder`'s `minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`
 
 Bug fixes
 ---------
 
 * :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
- than raising an error.
+ than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`
 
 Release 0.3.0
 =============
 
 Major changes
 -------------
 
-* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical
- columns (year, month, day, hour, minute, second, ...). It is now the default transformer used
- in the :class:`SuperVectorizer` for datetime columns.
+* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`
 
 * The :class:`SuperVectorizer` has seen some major improvements and bug fixes:
+
  - Fixes the automatic casting logic in ``transform``.
- - To avoid dimensionality explosion when a feature has two unique values,
- the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two
- vectors (see parameter `drop="if_binary"`).
- - ``fit_transform`` and ``transform`` can now return unencoded features,
- like the :class:`~sklearn.compose.ColumnTransformer`'s behavior.
- Previously, a ``RuntimeError`` was raised.
+ - To avoid dimensionality explosion when a feature has two unique values, the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two vectors (see parameter `drop="if_binary"`).
+ - ``fit_transform`` and ``transform`` can now return unencoded features, like the :class:`~sklearn.compose.ColumnTransformer`'s behavior. Previously, a ``RuntimeError`` was raised.
+
+ :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`
 
 * **Backward-incompatible change in the SuperVectorizer**:
  To apply ``remainder`` to features (with the ``*_transformer`` parameters),
  the value ``'remainder'`` must be passed, instead of ``None`` in previous versions.
- ``None`` now indicates that we want to use the default transformer.
+ ``None`` now indicates that we want to use the default transformer. :pr:`303` by :user:`Lilian Boulard <LilianBoulard>`
 
-* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required.
+* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`
 
 * Bumped minimum dependencies:
  - scikit-learn>=0.23
  - scipy>=1.4.0
  - numpy>=1.17.3
- - pandas>=1.2.0
+ - pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`
 
 * Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
  The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,
- and the `similarity` parameter is deprecated. It will be removed in 0.5.
+ and the `similarity` parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`
 
 Notes
 -----
 
 * The ``transformers_`` attribute of the SuperVectorizer now contains column
- names instead of column indices for the "remainder" columns.
+ names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`
 
 
 Release 0.2.2
@@ -76,7 +71,7 @@ Bug fixes
 ---------
 
 * Fixed a bug in the :class:`SuperVectorizer` causing a `FutureWarning`
- when using the `get_feature_names_out` method.
+ when using the `get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`
 
 
 Release 0.2.1
@@ -86,27 +81,26 @@ Major changes
 -------------
 
 * Improvements to the :class:`SuperVectorizer`
+ - Type detection works better: handles dates, numeric columns encoded as strings, or numeric columns containing strings for missing values.
 
- - Type detection works better: handles dates, numerics columns encoded as strings,
- or numeric columns containing strings for missing values.
+ :pr:`238` by :user:`Leo Grinsztajn <LeoGrin>`
 
-* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API.
- `get_feature_names` is deprecated in scikit-learn > 1.0.
+* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API. `get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`
 
 * Improvements to the :class:`MinHashEncoder`
- - It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`.
-  Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method,
-  on multiple columns.
+ - It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method, on multiple columns.
+
+ :pr:`243` by :user:`Jovan Stojanovic <jovan-stojanovic>`
 
 
 Bug-fixes
 ---------
 
-* Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument.
+* Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument. :pr:`242` by :user:`Jovan Stojanovic <jovan-stojanovic>`
 
-* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists.
+* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`
 
-* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`
+* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`
 
 Notes
 -----
@@ -127,8 +121,8 @@ Major changes
 
 * Bump minimum dependencies:
 
- - scikit-learn (>=0.21.0)
- - pandas (>=1.1.5) **! NEW REQUIREMENT !**
+ - scikit-learn (>=0.21.0) :pr:`202` by :user:`Lilian Boulard <LilianBoulard>`
+ - pandas (>=1.1.5) **! NEW REQUIREMENT !** :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
 
 * **datasets.fetching** - backward-incompatible changes to the example
  datasets fetchers:
@@ -139,17 +133,21 @@ Major changes
  but their return values were modified in favor of a more Pythonic interface.
  Refer to the docstrings of functions `dirty_cat.datasets.fetch_*`
  for more information.
- - The example notebooks were updated to reflect these changes.
+ - The example notebooks were updated to reflect these changes. :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`
 
 * **Backward incompatible change to** :class:`MinHashEncoder`: The :class:`MinHashEncoder` now
  only supports two dimensional inputs of shape (N_samples, 1).
+ :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
 
 * Update `handle_missing` parameters:
+
  - :class:`GapEncoder`: the default value "zero_impute" becomes "empty_impute" (see doc).
  - :class:`MinHashEncoder`: the default value "" becomes "zero_impute" (see doc).
+
+ :pr:`210` by :user:`Alexis Cvetkov <alexis-cvetkov>`.
 
 * Add a method "get_feature_names_out" for the :class:`GapEncoder` and the :class:`SuperVectorizer`,
- since `get_feature_names` will be depreciated in scikit-learn 1.2 (#216).
+ since `get_feature_names` will be deprecated in scikit-learn 1.2. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
 
 Notes
 -----
@@ -163,6 +161,8 @@ Notes
  - Type casting and per-column imputation are now learnt during fitting
  - Several bugfixes
 
+ :pr:`201` by :user:`Lilian Boulard <LilianBoulard>`
+
 Release 0.2.0a1
 ===============
 
@@ -189,18 +189,19 @@ Major changes
  :class:`SuperVectorizer` class. It transforms
  columns automatically based on their type. It provides a replacement
  for scikit-learn's :class:`~sklearn.compose.ColumnTransformer` simpler to use on heterogeneous
- pandas DataFrame.
+ pandas DataFrame. :pr:`167` by :user:`Lilian Boulard <LilianBoulard>`
 
 * **Backward incompatible change to** :class:`GapEncoder`: The :class:`GapEncoder` now only
  supports two-dimensional inputs of shape (n_samples, n_features).
  Internally, features are encoded by independent :class:`GapEncoder` models,
  and are then concatenated into a single matrix.
+ :pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.
 
 
 Bug-fixes
 ---------
 
-* Fix `get_feature_names` for scikit-learn > 0.21
+* Fix `get_feature_names` for scikit-learn > 0.21. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`
 
 
 Release 0.1.1
@@ -212,7 +213,7 @@ Major changes
 Bug-fixes
 ---------
 
-* RuntimeWarnings due to overflow in :class:`GapEncoder` (#161)
+* RuntimeWarnings due to overflow in :class:`GapEncoder`. :pr:`161` by :user:`Alexis Cvetkov <alexis-cvetkov>`
 
 
 Release 0.1.0
@@ -224,12 +225,12 @@ Major changes
 * :class:`GapEncoder`: Added online Gamma-Poisson factorization through the
  :class:`GapEncoder` class. This method discovers latent categories formed
  via combinations of substrings, and encodes string data as combinations of
- these categories. To be used if interpretability is important.
+ these categories. To be used if interpretability is important. :pr:`153` by :user:`Alexis Cvetkov <alexis-cvetkov>`
 
 Bug-fixes
 ---------
 
-* Multiprocessing exception in notebook (#154)
+* Multiprocessing exception in notebook. :pr:`154` by :user:`Lilian Boulard <LilianBoulard>`
 
 
 Release 0.0.7

diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
@@ -10,6 +10,7 @@ The following is a set of guidelines for contributing to
 
 .. contents::
  :local:
+
 |
 
 I don’t want to read the whole thing I just have a question
@@ -146,29 +147,29 @@ So, first step: create your environment.
 
 For this example, we’ll use conda:
 
-.. code:: commandline
+.. code:: console
 
  conda create python=3.10 --name dirty_cat
  conda activate dirty_cat
 
 Secondly, clone the repository (you’ll need to have ``git`` installed -
 it is already on most linux distributions).
 
-.. code:: commandline
+.. code:: console
 
  git clone https://github.com/dirty-cat/dirty_cat
 
 Next, install the project dependencies. They are listed in ``setup.cfg``.
 
-.. code:: commandline
+.. code:: console
 
  pip install -e .[dev]
 
 Code-formatting and linting is automatically done via
 ```pre-commit`` <https://github.com/pre-commit/pre-commit>`__. You
 install this setup using:
 
-.. code:: commandline
+.. code:: console
 
  pip install pre-commit
  pre-commit install
@@ -178,7 +179,7 @@ ignored by ``git blame`` and IDE integrations. The revisions to be
 ignored are listed in ``.git-blame-ignore-revs``, which can be set in
 your local repository with:
 
-.. code:: commandline
+.. code:: console
 
  git config blame.ignoreRevsFile .git-blame-ignore-revs
 
@@ -216,7 +217,7 @@ It is advised to create a new branch every time you work on a new issue,
 to avoid confusion.
 Use the following command to create a branch:
 
-.. code:: commandline
+.. code:: console
 
  git checkout -b branch_name
 

diff --git a/doc/conf.py b/doc/conf.py
@@ -19,6 +19,13 @@
 import os
 import shutil
 from datetime import datetime
+import sys
+# If extensions (or modules to document with autodoc) are in another
+# directory, add these directories to sys.path here. If the directory
+# is relative to the documentation root, use os.path.abspath to make it
+# absolute, like shown here.
+sys.path.insert(0, os.path.abspath("sphinxext"))
+
 
 # -- Copy files for docs --------------------------------------------------
 #
@@ -46,6 +53,7 @@
  "sphinx.ext.viewcode",
  "sphinx.ext.githubpages",
  "numpydoc",
+ "sphinx_issues",
  "sphinx.ext.autodoc.typehints",
  "sphinx_gallery.gen_gallery",
 ]
@@ -294,3 +302,8 @@
 
 # -- The javascript to highlight the toc as we scroll ----------------------
 html_js_files = ["scrolltoc.js"]
+
+# -- github links --------------------------------------
+
+# we use the issues path for PRs since the issues URL will forward
+issues_github_path = "dirty-cat/dirty_cat"
diff --git a/doc/sphinxext/MANIFEST.in b/doc/sphinxext/MANIFEST.in
@@ -0,0 +1,2 @@
+recursive-include tests *.py
+include *.txt