add a github workflow to check if CHANGES.rst has been changed (skrub…
LeoGrin authored Dec 13, 2022
1 parent ece35a6 commit 6601dd3
Showing 6 changed files with 350 additions and 48 deletions.
68 changes: 68 additions & 0 deletions .github/workflows/changelog.yml
@@ -0,0 +1,68 @@
name: Check Changelog
# This check makes sure that the changelog is properly updated
# when a PR introduces a change in a test file.
# To bypass this check, label the PR with "No Changelog Needed".
on:
  pull_request:
    types: [opened, edited, labeled, unlabeled, synchronize]

jobs:
  check:
    name: A reviewer will let you know if it is required or can be bypassed
    runs-on: ubuntu-latest
    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 }}
    steps:
      - name: Get PR number and milestone
        run: |
          echo "PR_NUMBER=${{ github.event.pull_request.number }}" >> $GITHUB_ENV
          echo "TAGGED_MILESTONE=${{ github.event.pull_request.milestone.title }}" >> $GITHUB_ENV
      - uses: actions/checkout@v3
        with:
          fetch-depth: '0'
      - name: Check the changelog entry
        run: |
          set -xe
          changed_files=$(git diff --name-only origin/main)
          # Changelog should be updated only if tests have been modified
          if [[ ! "$changed_files" =~ tests ]]
          then
            exit 0
          fi
          all_changelogs=$(cat ./CHANGES.rst)
          if [[ "$all_changelogs" =~ :pr:\`$PR_NUMBER\` ]]
          then
            echo "Changelog has been updated."
            # If the pull request is milestoned, check the corresponding changelog
            if [[ -f ./CHANGES.rst${TAGGED_MILESTONE:0:4}.rst ]]
            then
              expected_changelog=$(cat ./CHANGES.rst${TAGGED_MILESTONE:0:4}.rst)
              if [[ "$expected_changelog" =~ :pr:\`$PR_NUMBER\` ]]
              then
                echo "Changelog and milestone correspond."
              else
                echo "Changelog and milestone do not correspond."
                echo "If you see this error make sure that the tagged milestone for the PR"
                echo "and the edited changelog filename properly match."
                exit 1
              fi
            fi
          else
            echo "A Changelog entry is missing."
            echo ""
            echo "Please add an entry to the changelog at 'CHANGES.rst'"
            echo "to document your change assuming that the PR will be merged"
            echo "in time for the next release of dirty-cat."
            echo ""
            echo "Look at other entries in that file for inspiration and please"
            echo "reference this pull request using the ':pr:' directive and"
            echo "credit yourself (and other contributors if applicable) with"
            echo "the ':user:' directive, for instance :pr:\`453\` by :user:\`Jo Blib <JoBlib>\`."
            echo ""
            echo "If you see this error and there is already a changelog entry,"
            echo "check that the PR number is correct."
            echo ""
            echo "If you believe that this PR does not warrant a changelog"
            echo "entry, say so in a comment so that a maintainer will label"
            echo "the PR with 'No Changelog Needed' to bypass this check."
            exit 1
          fi
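
The decision logic of the "Check the changelog entry" step can be modelled in a few lines of Python. This is only a sketch for reasoning about the check and is not part of the commit: the function name ``changelog_check`` and its arguments are hypothetical, and the milestone comparison is deliberately left out.

```python
import re


def changelog_check(changed_files, changelog_text, pr_number):
    """Return (passed, message), loosely mirroring the workflow's shell logic.

    Hypothetical helper for illustration; the real check runs in bash.
    """
    # The workflow only requires an entry when a changed path mentions "tests".
    if not any("tests" in path for path in changed_files):
        return True, "No test files changed; changelog entry not required."
    # Look for the literal role reference, for example :pr:`453`.
    pattern = re.escape(f":pr:`{pr_number}`")
    if re.search(pattern, changelog_text):
        return True, "Changelog has been updated."
    return False, "A Changelog entry is missing."
```

Because the match is on the full role text including both backticks, a PR numbered 45 is not satisfied by an entry that references PR 453.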
85 changes: 43 additions & 42 deletions CHANGES.rst
@@ -7,66 +7,61 @@ Major changes
-------------

* New experimental feature: joining tables using :func:`fuzzy_join` by approximate key matching. Matches are based
on string similarities and the nearest neighbors matches are found for each category.
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be
used to download any indicator from the World Bank Open Data platform. It only needs the
indicator ID that can be found on the website.
on string similarities and the nearest neighbors matches are found for each category. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>` and :user:`Leo Grinsztajn <LeoGrin>`
* **datasets.fetching**: contains a new function :func:`fetch_world_bank_indicator` that can be used to download any indicator from the World Bank Open Data platform. It only needs the indicator ID that can be found on the website. :pr:`291` by :user:`Jovan Stojanovic <jovan-stojanovic>`
* Unnecessary API has been made private: everything (files, functions, classes)
starting with an underscore shouldn't be imported in your code.
starting with an underscore shouldn't be imported in your code. :pr:`331` by :user:`Lilian Boulard <LilianBoulard>`

Minor changes
-------------
* Removed example `Fitting scalable, non-linear models on data with dirty categories`.
* Removed example `Fitting scalable, non-linear models on data with dirty categories`. :pr:`386` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :class:`MinHashEncoder`'s `minhash` method is no longer public.
* :class:`MinHashEncoder`'s `minhash` method is no longer public. :pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`

Bug fixes
---------

* :class:`MinHashEncoder` now considers `None` and empty strings as missing values, rather
than raising an error.
than raising an error. :pr:`378` by :user:`Gael Varoquaux <GaelVaroquaux>`

Release 0.3.0
=============

Major changes
-------------

* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical
columns (year, month, day, hour, minute, second, ...). It is now the default transformer used
in the :class:`SuperVectorizer` for datetime columns.
* New encoder: :class:`DatetimeEncoder` can transform a datetime column into several numerical columns (year, month, day, hour, minute, second, ...). It is now the default transformer used in the :class:`SuperVectorizer` for datetime columns. :pr:`239` by :user:`Leo Grinsztajn <LeoGrin>`

* The :class:`SuperVectorizer` has seen some major improvements and bug fixes:

- Fixes the automatic casting logic in ``transform``.
- To avoid dimensionality explosion when a feature has two unique values,
the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two
vectors (see parameter `drop="if_binary"`).
- ``fit_transform`` and ``transform`` can now return unencoded features,
like the :class:`~sklearn.compose.ColumnTransformer`'s behavior.
Previously, a ``RuntimeError`` was raised.
- To avoid dimensionality explosion when a feature has two unique values, the default encoder (:class:`~sklearn.preprocessing.OneHotEncoder`) now drops one of the two vectors (see parameter `drop="if_binary"`).
- ``fit_transform`` and ``transform`` can now return unencoded features, like the :class:`~sklearn.compose.ColumnTransformer`'s behavior. Previously, a ``RuntimeError`` was raised.

:pr:`300` by :user:`Lilian Boulard <LilianBoulard>`

* **Backward-incompatible change in the SuperVectorizer**:
To apply ``remainder`` to features (with the ``*_transformer`` parameters),
the value ``'remainder'`` must be passed, instead of ``None`` in previous versions.
``None`` now indicates that we want to use the default transformer.
``None`` now indicates that we want to use the default transformer. :pr:`303` by :user:`Lilian Boulard <LilianBoulard>`

* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required.
* Support for Python 3.6 and 3.7 has been dropped. Python >= 3.8 is now required. :pr:`289` by :user:`Lilian Boulard <LilianBoulard>`

* Bumped minimum dependencies:
- scikit-learn>=0.23
- scipy>=1.4.0
- numpy>=1.17.3
- pandas>=1.2.0
- pandas>=1.2.0 :pr:`299` and :pr:`300` by :user:`Lilian Boulard <LilianBoulard>`

* Dropped support for Jaro, Jaro-Winkler and Levenshtein distances.
The :class:`SimilarityEncoder` now exclusively uses ``ngram`` for similarities,
and the `similarity` parameter is deprecated. It will be removed in 0.5.
and the `similarity` parameter is deprecated. It will be removed in 0.5. :pr:`282` by :user:`Lilian Boulard <LilianBoulard>`

Notes
-----

* The ``transformers_`` attribute of the SuperVectorizer now contains column
names instead of column indices for the "remainder" columns.
names instead of column indices for the "remainder" columns. :pr:`266` by :user:`Leo Grinsztajn <LeoGrin>`


Release 0.2.2
@@ -76,7 +71,7 @@ Bug fixes
---------

* Fixed a bug in the :class:`SuperVectorizer` causing a `FutureWarning`
when using the `get_feature_names_out` method.
when using the `get_feature_names_out` method. :pr:`262` by :user:`Lilian Boulard <LilianBoulard>`


Release 0.2.1
@@ -86,27 +81,26 @@ Major changes
-------------

* Improvements to the :class:`SuperVectorizer`
- Type detection works better: handles dates, numerics columns encoded as strings, or numeric columns containing strings for missing values.

- Type detection works better: handles dates, numerics columns encoded as strings,
or numeric columns containing strings for missing values.
:pr:`238` by :user:`Leo Grinsztajn <LeoGrin>`

* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API.
`get_feature_names` is deprecated in scikit-learn > 1.0.
* `get_feature_names` becomes `get_feature_names_out`, following changes in the scikit-learn API. `get_feature_names` is deprecated in scikit-learn > 1.0. :pr:`241` by :user:`Gael Varoquaux <GaelVaroquaux>`

* Improvements to the :class:`MinHashEncoder`
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`.
Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method,
on multiple columns.
- It is now possible to fit multiple columns simultaneously with the :class:`MinHashEncoder`. Very useful when using for instance the :func:`~sklearn.compose.make_column_transformer` method, on multiple columns.

:pr:`243` by :user:`Jovan Stojanovic <jovan-stojanovic>`


Bug-fixes
---------

* Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument.
* Fixed a bug that resulted in the :class:`GapEncoder` ignoring the analyzer argument. :pr:`242` by :user:`Jovan Stojanovic <jovan-stojanovic>`

* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists.
* :class:`GapEncoder`'s `get_feature_names_out` now accepts all iterators, not just lists. :pr:`255` by :user:`Lilian Boulard <LilianBoulard>`

* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`
* Fixed `DeprecationWarning` raised by the usage of `distutils.version.LooseVersion`. :pr:`261` by :user:`Lilian Boulard <LilianBoulard>`

Notes
-----
@@ -127,8 +121,8 @@ Major changes

* Bump minimum dependencies:

- scikit-learn (>=0.21.0)
- pandas (>=1.1.5) **! NEW REQUIREMENT !**
- scikit-learn (>=0.21.0) :pr:`202` by :user:`Lilian Boulard <LilianBoulard>`
- pandas (>=1.1.5) **! NEW REQUIREMENT !** :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`

* **datasets.fetching** - backward-incompatible changes to the example
datasets fetchers:
@@ -139,17 +133,21 @@ Major changes
but their return values were modified in favor of a more Pythonic interface.
Refer to the docstrings of functions `dirty_cat.datasets.fetch_*`
for more information.
- The example notebooks were updated to reflect these changes.
- The example notebooks were updated to reflect these changes. :pr:`155` by :user:`Lilian Boulard <LilianBoulard>`

* **Backward incompatible change to** :class:`MinHashEncoder`: The :class:`MinHashEncoder` now
only supports two dimensional inputs of shape (N_samples, 1).
:pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.

* Update `handle_missing` parameters:

- :class:`GapEncoder`: the default value "zero_impute" becomes "empty_impute" (see doc).
- :class:`MinHashEncoder`: the default value "" becomes "zero_impute" (see doc).

:pr:`210` by :user:`Alexis Cvetkov <alexis-cvetkov>`.

* Add a method "get_feature_names_out" for the :class:`GapEncoder` and the :class:`SuperVectorizer`,
since `get_feature_names` will be deprecated in scikit-learn 1.2 (#216).
since `get_feature_names` will be deprecated in scikit-learn 1.2. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`

Notes
-----
@@ -163,6 +161,8 @@ Notes
- Type casting and per-column imputation are now learnt during fitting
- Several bugfixes

:pr:`201` by :user:`Lilian Boulard <LilianBoulard>`

Release 0.2.0a1
===============

@@ -189,18 +189,19 @@ Major changes
:class:`SuperVectorizer` class. It transforms
columns automatically based on their type. It provides a replacement
for scikit-learn's :class:`~sklearn.compose.ColumnTransformer` simpler to use on heterogeneous
pandas DataFrame.
pandas DataFrame. :pr:`167` by :user:`Lilian Boulard <LilianBoulard>`

* **Backward incompatible change to** :class:`GapEncoder`: The :class:`GapEncoder` now only
supports two-dimensional inputs of shape (n_samples, n_features).
Internally, features are encoded by independent :class:`GapEncoder` models,
and are then concatenated into a single matrix.
:pr:`185` by :user:`Lilian Boulard <LilianBoulard>` and :user:`Alexis Cvetkov <alexis-cvetkov>`.


Bug-fixes
---------

* Fix `get_feature_names` for scikit-learn > 0.21
* Fix `get_feature_names` for scikit-learn > 0.21. :pr:`216` by :user:`Alexis Cvetkov <alexis-cvetkov>`


Release 0.1.1
@@ -212,7 +213,7 @@ Major changes
Bug-fixes
---------

* RuntimeWarnings due to overflow in :class:`GapEncoder` (#161)
* RuntimeWarnings due to overflow in :class:`GapEncoder`. :pr:`161` by :user:`Alexis Cvetkov <alexis-cvetkov>`


Release 0.1.0
@@ -224,12 +225,12 @@ Major changes
* :class:`GapEncoder`: Added online Gamma-Poisson factorization through the
:class:`GapEncoder` class. This method discovers latent categories formed
via combinations of substrings, and encodes string data as combinations of
these categories. To be used if interpretability is important.
these categories. To be used if interpretability is important. :pr:`153` by :user:`Alexis Cvetkov <alexis-cvetkov>`

Bug-fixes
---------

* Multiprocessing exception in notebook (#154)
* Multiprocessing exception in notebook. :pr:`154` by :user:`Lilian Boulard <LilianBoulard>`


Release 0.0.7
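
The credit convention added throughout ``CHANGES.rst`` (a ``:pr:`` role followed by one or more ``:user:`` roles) is regular enough to parse mechanically. The following is a minimal sketch, assuming entries follow the pattern seen in this diff; the helper name ``parse_credits`` and both regular expressions are illustrative and do not come from the repository.

```python
import re

# :pr:`379` -> captures the PR number
PR_ROLE = re.compile(r":pr:`(\d+)`")
# :user:`Display Name <login>` -> captures the display name and the login
USER_ROLE = re.compile(r":user:`([^<`]+?)\s*<([^>`]+)>`")


def parse_credits(entry):
    """Extract PR numbers and (display name, login) pairs from one entry."""
    prs = [int(number) for number in PR_ROLE.findall(entry)]
    users = USER_ROLE.findall(entry)
    return prs, users


entry = (
    "* :class:`MinHashEncoder`'s `minhash` method is no longer public. "
    ":pr:`379` by :user:`Jovan Stojanovic <jovan-stojanovic>`"
)
prs, users = parse_credits(entry)
```

A parser like this could be used to spot entries that lack a credit, although the workflow in this commit only checks for the presence of the current PR number.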
13 changes: 7 additions & 6 deletions CONTRIBUTING.rst
@@ -10,6 +10,7 @@ The following is a set of guidelines for contributing to
.. contents::
:local:

|
I don’t want to read the whole thing I just have a question
@@ -146,29 +147,29 @@ So, first step: create your environment.

For this example, we’ll use conda:

.. code:: commandline
.. code:: console
conda create python=3.10 --name dirty_cat
conda activate dirty_cat
Secondly, clone the repository (you’ll need to have ``git`` installed -
it is already on most linux distributions).

.. code:: commandline
.. code:: console
git clone https://github.com/dirty-cat/dirty_cat
Next, install the project dependencies. They are listed in ``setup.cfg``.

.. code:: commandline
.. code:: console
pip install -e .[dev]
Code formatting and linting are done automatically via
`pre-commit <https://github.com/pre-commit/pre-commit>`__. You can
install this setup using:

.. code:: commandline
.. code:: console
pip install pre-commit
pre-commit install
@@ -178,7 +179,7 @@ ignored by ``git blame`` and IDE integrations. The revisions to be
ignored are listed in ``.git-blame-ignore-revs``, which can be set in
your local repository with:

.. code:: commandline
.. code:: console
git config blame.ignoreRevsFile .git-blame-ignore-revs
@@ -216,7 +217,7 @@ It is advised to create a new branch every time you work on a new issue,
to avoid confusion.
Use the following command to create a branch:

.. code:: commandline
.. code:: console
git checkout -b branch_name
13 changes: 13 additions & 0 deletions doc/conf.py
@@ -19,6 +19,13 @@
import os
import shutil
from datetime import datetime
import sys
# If extensions (or modules to document with autodoc) are in another
# directory, add these directories to sys.path here. If the directory
# is relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.
sys.path.insert(0, os.path.abspath("sphinxext"))


# -- Copy files for docs --------------------------------------------------
#
@@ -46,6 +53,7 @@
"sphinx.ext.viewcode",
"sphinx.ext.githubpages",
"numpydoc",
"sphinx_issues",
"sphinx.ext.autodoc.typehints",
"sphinx_gallery.gen_gallery",
]
@@ -294,3 +302,8 @@

# -- The javascript to highlight the toc as we scroll ----------------------
html_js_files = ["scrolltoc.js"]

# -- github links --------------------------------------

# we use the issues path for PRs since the issues URL will forward
issues_github_path = "dirty-cat/dirty_cat"
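
With ``sphinx_issues`` enabled and ``issues_github_path`` set, the ``:pr:`` and ``:user:`` roles used in ``CHANGES.rst`` are rendered as GitHub links. The sketch below models the resulting URLs in plain Python, assuming the extension's default GitHub URL layout; the helper names ``pr_url`` and ``user_url`` are hypothetical and are not part of ``sphinx_issues``.

```python
# Value configured in doc/conf.py by this commit.
ISSUES_GITHUB_PATH = "dirty-cat/dirty_cat"


def pr_url(number, github_path=ISSUES_GITHUB_PATH):
    """URL a :pr:`<number>` reference is assumed to point to."""
    return f"https://github.com/{github_path}/pull/{number}"


def user_url(login):
    """URL a :user:`Name <login>` reference is assumed to point to."""
    return f"https://github.com/{login}"
```

For example, ``pr_url(291)`` gives the address of pull request 291 in the ``dirty-cat/dirty_cat`` repository, and ``user_url("LeoGrin")`` gives the matching profile page.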
2 changes: 2 additions & 0 deletions doc/sphinxext/MANIFEST.in
@@ -0,0 +1,2 @@
recursive-include tests *.py
include *.txt