Add random_seed and cross_validation

valteresj2 · Jul 27, 2020 · f543609
1 parent e41011d
commit f543609
Showing 4 changed files with 88 additions and 38 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # ppscore - a Python implementation of the Predictive Power Score (PPS)
 
-### From the makers of [bamboolib](https://bamboolib.com)
+### From the makers of [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)
 
 
 __If you don't know yet what the Predictive Power Score is, please read the following blog post:__
@@ -81,7 +81,7 @@ sns.barplot(data=df_predictors, x="x", y="ppscore")
 
 ## API
 
-### ppscore.score(df, x, y, sample=5000)
+### ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=None)
 
 Calculate the Predictive Power Score (PPS) for "x predicts y"
 
@@ -105,7 +105,12 @@ Calculate the Predictive Power Score (PPS) for "x predicts y"
 - __sample__ : int or ``None``
     - Number of rows for sampling. The sampling decreases the calculation time of the PPS.
     If ``None`` there will be no sampling.
-
+- __cross_validation__ : int
+    - Number of iterations during cross-validation. This has the following implications:
+    For example, if the number is 4, a pattern can only be detected when the same observation occurs at least 4 times. If the number is increased, the required minimum number of observations also increases. This matters because below this limit sklearn throws an error and the PPS cannot be calculated.
+- __random_seed__ : int or ``None``
+    - Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
+    If the value is set, the results will be reproducible. If the value is ``None`` a new random number is drawn at the start of each calculation.
 #### Returns
 
 - __Dict__:
@@ -127,7 +132,7 @@ Calculate the Predictive Power Score (PPS) for all columns in the dataframe agai
 - __sorted__ : bool
     - Whether or not to sort the output dataframe/list
 - __kwargs__ :
-    - Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__
+    - Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__, __cross_validation__, or __random_seed__
 
 #### Returns
 
@@ -146,7 +151,7 @@ Calculate the Predictive Power Score (PPS) matrix for all columns in the datafra
 - __output__ : str - potential values: "df", "dict"
     - Control the type of the output. Either return a df or a dict with all the PPS dicts arranged by the target column
 - __kwargs__ :
-    - Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__
+    - Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__, __cross_validation__, or __random_seed__
 
 #### Returns
 
@@ -161,10 +166,11 @@ Calculate the Predictive Power Score (PPS) matrix for all columns in the datafra
 There are multiple ways how you can calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:
 
 - The score is calculated using only 1 feature trying to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
-- The score is calculated on the test sets of a 4-fold crossvalidation (number is adjustable via `ppscore.CV_ITERATIONS`). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
+- The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via `cross_validation`). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
 - All rows which have a missing value in the feature or the target column are dropped
-- In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (`ppscore.RANDOM_SEED`). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar
+- In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via the API (`sample`). However, in most scenarios the results will be very similar
 - There is no grid search for optimal model parameters
+- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`).
 
 
 ### Learning algorithm
@@ -231,9 +237,9 @@ If the task is a classification, we compute the weighted F1 score (wF1) as the u
 #### Special tasks
 
 The special tasks all have predefined PPS scores. Those tasks exist for implementation reasons in order to communicate special cases and save computation time.
-- __predict_id__ has a score of 0 because an ID column cannot be predicted by any other column as part of a crossvalidation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
+- __predict_id__ has a score of 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
 - __predict_constant__ has a score of 1 because any column and baseline can perfectly predict a column that only has a single value. It could be argued that the score should be 0 because the model is not better than the naive predictor. However, (so far) we chose to set a value of 1 in order to communicate that there is perfect predictive power.
 - __predict_itself__ means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value.
 
 ## About
-ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore`, please check out our other project [bamboolib](https://bamboolib.com)
+ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore` you might want to check out our other project [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)
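Review note on the README change above: the new bullet says results can differ between runs because rows are sampled and shuffled, and that `random_seed` makes them reproducible. A minimal sketch of that idea follows, using only the standard library. The helper `sample_and_shuffle` is hypothetical and merely stands in for the pandas-based sampling and shuffling in the package; it is not the package's code.

```python
import random


def sample_and_shuffle(rows, sample=5_000, random_seed=None):
    """Hypothetical stand-in for the sampling and shuffling steps."""
    # A dedicated generator; with random_seed=None it is seeded unpredictably.
    rng = random.Random(random_seed)
    if sample and len(rows) > sample:
        # Draw `sample` rows without replacement.
        rows = rng.sample(rows, sample)
    else:
        rows = list(rows)
    # Shuffle the rows before they would be split into folds.
    rng.shuffle(rows)
    return rows


rows = list(range(10_000))
first = sample_and_shuffle(rows, sample=100, random_seed=1)
second = sample_and_shuffle(rows, sample=100, random_seed=1)
other = sample_and_shuffle(rows, sample=100, random_seed=2)

print(first == second)  # same seed, same rows in the same order
print(first == other)   # different seed, different draw
```

With the same seed, both the selected subset and its order repeat, which is why every downstream number derived from them repeats as well.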
diff --git a/src/ppscore/__init__.py b/src/ppscore/__init__.py
@@ -11,4 +11,4 @@
     del get_distribution, DistributionNotFound
 
 
-from ppscore.calculation import score, predictors, matrix, CV_ITERATIONS, RANDOM_SEED
+from ppscore.calculation import score, predictors, matrix
diff --git a/src/ppscore/calculation.py b/src/ppscore/calculation.py
@@ -14,26 +14,26 @@
     is_timedelta64_dtype,
 )
 
-# if the number is 4, then it is possible to detect patterns when there are at least 4 times the same observation. If the limit is increased, the minimum observations also increase. This is important, because this is the limit when sklearn will throw an error which will lead to a score of 0 if we catch it
-CV_ITERATIONS = 4
 
-RANDOM_SEED = 587136
+NOT_SUPPORTED_ANYMORE = "NOT_SUPPORTED_ANYMORE"
 
 
-def _calculate_model_cv_score_(df, target, feature, task, **kwargs):
+def _calculate_model_cv_score_(
+    df, target, feature, task, cross_validation, random_seed, **kwargs
+):
     "Calculates the mean model score based on cross-validation"
     # Sources about the used methods:
     # https://scikit-learn.org/stable/modules/tree.html
     # https://scikit-learn.org/stable/modules/cross_validation.html
     # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
     metric = task["metric_key"]
     model = task["model"]
-    # shuffle the rows - this is important for crossvalidation
-    # because the crossvalidation just takes the first n lines
+    # shuffle the rows - this is important for cross-validation
+    # because the cross-validation just takes the first n lines
     # if there is a strong pattern in the rows eg 0,0,0,0,1,1,1,1
     # then this will lead to problems because the first cv sees mostly 0 and the later 1
     # this approach might be wrong for timeseries because it might leak information
-    df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)
+    df = df.sample(frac=1, random_state=random_seed, replace=False)
 
     # preprocess target
     if task["type"] == "classification":
@@ -53,10 +53,10 @@ def _calculate_model_cv_score_(df, target, feature, task, **kwargs):
         # reshaping needed because there is only 1 feature
         feature_input = df[feature].values.reshape(-1, 1)
 
-    # Crossvalidation is stratifiedKFold for classification, KFold for regression
+    # Cross-validation is stratifiedKFold for classification, KFold for regression
     # CV on one core (n_job=1; default) has shown to be fastest
     scores = cross_val_score(
-        model, feature_input, target_series, cv=CV_ITERATIONS, scoring=metric
+        model, feature_input, target_series, cv=cross_validation, scoring=metric
     )
 
     return scores.mean()
@@ -74,7 +74,7 @@ def _normalized_mae_score(model_mae, naive_mae):
         return 1 - (model_mae / naive_mae)
 
 
-def _mae_normalizer(df, y, model_score):
+def _mae_normalizer(df, y, model_score, **kwargs):
     "In case of MAE, calculates the baseline score for y and derives the PPS."
     df["naive"] = df[y].median()
     baseline_score = mean_absolute_error(df[y], df["naive"])  # true, pred
@@ -98,12 +98,12 @@ def _normalized_f1_score(model_f1, baseline_f1):
         return f1_diff / scale_range  # 0.1/0.3 = 0.33
 
 
-def _f1_normalizer(df, y, model_score):
+def _f1_normalizer(df, y, model_score, random_seed):
     "In case of F1, calculates the baseline score for y and derives the PPS."
     label_encoder = preprocessing.LabelEncoder()
     df["truth"] = label_encoder.fit_transform(df[y])
     df["most_common_value"] = df["truth"].value_counts().index[0]
-    random = df["truth"].sample(frac=1)
+    random = df["truth"].sample(frac=1, random_state=random_seed)
 
     baseline_score = max(
         f1_score(df["truth"], df["most_common_value"], average="weighted"),
@@ -201,7 +201,7 @@ def _feature_is_id(df, x):
     return category_count == len(df[x])
 
 
-def _maybe_sample(df, sample):
+def _maybe_sample(df, sample, random_seed=None):
     """
     Maybe samples the rows of the given df to have at most ``sample`` rows
     If sample is ``None`` or falsy, there will be no sampling.
@@ -213,6 +213,8 @@ def _maybe_sample(df, sample):
         Dataframe that might be sampled
     sample : int or ``None``
         Number of rows to be sampled
+    random_seed : int or ``None``
+        Random seed that is forwarded to pandas.DataFrame.sample as ``random_state``
 
     Returns
     -------
@@ -222,11 +224,19 @@ def _maybe_sample(df, sample):
     if sample and len(df) > sample:
         # this is a problem if x or y have more than sample=5000 categories
         # TODO: dont sample when the problem occurs and show warning
-        df = df.sample(sample, random_state=RANDOM_SEED, replace=False)
+        df = df.sample(sample, random_state=random_seed, replace=False)
     return df
 
 
-def score(df, x, y, task=None, sample=5000):
+def score(
+    df,
+    x,
+    y,
+    task=NOT_SUPPORTED_ANYMORE,
+    sample=5_000,
+    cross_validation=4,
+    random_seed=None,
+):
     """
     Calculate the Predictive Power Score (PPS) for "x predicts y"
     The score always ranges from 0 to 1 and is data-type agnostic.
@@ -246,6 +256,12 @@ def score(df, x, y, task=None, sample=5000):
     sample : int or ``None``
         Number of rows for sampling. The sampling decreases the calculation time of the PPS.
         If ``None`` there will be no sampling.
+    cross_validation : int
+        Number of iterations during cross-validation. This has the following implications:
+        For example, if the number is 4, a pattern can only be detected when the same observation occurs at least 4 times. If the number is increased, the required minimum number of observations also increases. This matters because below this limit sklearn throws an error and the PPS cannot be calculated.
+    random_seed : int or ``None``
+        Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
+        If the value is set, the results will be reproducible. If the value is ``None`` a new random number is drawn at the start of each calculation.
 
     Returns
     -------
@@ -274,11 +290,16 @@ def score(df, x, y, task=None, sample=5000):
         raise AssertionError(
             f"The dataframe has {len(df[[y]].columns)} columns with the same column name {y}\nPlease adjust the dataframe and make sure that only 1 column has the name {y}"
         )
-    if task is not None:
+    if task is not NOT_SUPPORTED_ANYMORE:
         raise AttributeError(
             "The attribute 'task' is no longer supported because it led to confusion and inconsistencies.\nThe task of the model is now determined based on the data types of the columns. If you want to change the task please adjust the data type of the column.\nFor more details, please refer to the README"
         )
 
+    if random_seed is None:
+        from random import random
+
+        random_seed = int(random() * 1000)
+
     if x == y:
         task_name = "predict_itself"
     else:
@@ -288,12 +309,8 @@ def score(df, x, y, task=None, sample=5000):
             raise Exception(
                 "After dropping missing values, there are no valid rows left"
             )
-        df = _maybe_sample(df, sample)
-
-        if task is None:
-            task_name = _infer_task(df, x, y)
-        else:
-            task_name = task
+        df = _maybe_sample(df, sample, random_seed=random_seed)
+        task_name = _infer_task(df, x, y)
 
     task = TASKS[task_name]
 
@@ -310,9 +327,19 @@ def score(df, x, y, task=None, sample=5000):
         ppscore = 0
         baseline_score = 0
     else:
-
-        model_score = _calculate_model_cv_score_(df, target=y, feature=x, task=task)
-        ppscore, baseline_score = task["score_normalizer"](df, y, model_score)
+        model_score = _calculate_model_cv_score_(
+            df,
+            target=y,
+            feature=x,
+            task=task,
+            cross_validation=cross_validation,
+            random_seed=random_seed,
+        )
+        # IDEA: the baseline_scores do sometimes change significantly, e.g. for F1 and thus change the PPS
+        # we might want to calculate the baseline_score 10 times and use the mean in order to have less variance
+        ppscore, baseline_score = task["score_normalizer"](
+            df, y, model_score, random_seed=random_seed
+        )
 
     return {
         "x": x,
@@ -343,7 +370,8 @@ def predictors(df, y, output="df", sorted=True, **kwargs):
     sorted: bool
         Whether or not to sort the output dataframe/list
     kwargs:
-        Other key-word arguments that shall be forwarded to the pps.score method
+        Other key-word arguments that shall be forwarded to the pps.score method,
+        e.g. ``sample``, ``cross_validation``, or ``random_seed``
 
     Returns
     -------
@@ -404,7 +432,8 @@ def matrix(df, output="df", **kwargs):
     output: str - potential values: "df", "dict"
         Control the type of the output. Either return a df or a dict with all the PPS dicts arranged by the target column
     kwargs:
-        Other key-word arguments that shall be forwarded to the pps.score method
+        Other key-word arguments that shall be forwarded to the pps.score method,
+        e.g. ``sample``, ``cross_validation``, or ``random_seed``
 
     Returns
     -------
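
Review note on the `score` changes above: the argument handling combines two small patterns, a sentinel default that rejects any explicitly passed `task`, and a seed that is drawn once when `random_seed` is `None` and then forwarded to every random step. The sketch below isolates both. The helper name `resolve_arguments` and the shortened error message are assumptions made for illustration; only the sentinel name and the seed expression come from the diff.

```python
from random import random

# Same sentinel idea as in the diff.
NOT_SUPPORTED_ANYMORE = "NOT_SUPPORTED_ANYMORE"


def resolve_arguments(task=NOT_SUPPORTED_ANYMORE, random_seed=None):
    """Hypothetical helper mirroring the checks at the top of score()."""
    # Identity check against the sentinel: an explicit value such as
    # task=None or task="regression" is rejected.
    if task is not NOT_SUPPORTED_ANYMORE:
        raise AttributeError("The attribute 'task' is no longer supported")
    # Draw the seed once so sampling, shuffling and the baseline share it.
    if random_seed is None:
        random_seed = int(random() * 1000)  # integer between 0 and 999
    return random_seed


print(resolve_arguments(random_seed=42))  # 42, passed through unchanged
```

One design remark, offered tentatively: because the sentinel is a plain string, a caller who passes an equal string literal may end up passing the very same object in CPython and slip past the identity check. A bare `object()` is a common alternative when that matters.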

diff --git a/tests/test_calculation.py b/tests/test_calculation.py
@@ -135,7 +135,22 @@ def test_score():
             duplicate_column_names_df, "unique_column_name", "duplicate_column_name"
         )
 
-    # check tasks
+    # check cross_validation
+    # if more folds than data, there is an error
+    with pytest.raises(ValueError):
+        pps.score(df, "x", "y", cross_validation=2000)
+
+    # check random_seed
+    assert pps.score(df, "x", "y", random_seed=1) == pps.score(
+        df, "x", "y", random_seed=1
+    )
+    assert pps.score(df, "x", "y", random_seed=1) != pps.score(
+        df, "x", "y", random_seed=2
+    )
+    # the automatically drawn random seed is always smaller than 1000
+    assert pps.score(df, "x", "y") != pps.score(df, "x", "y", random_seed=123_456)
+
+    # check task inference
     assert pps.score(df, "x", "y")["task"] == "regression"
     assert pps.score(df, "x", "x_greater_0_string")["task"] == "classification"
     assert pps.score(df, "x", "constant")["task"] == "predict_constant"
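
Review note on the new `cross_validation` test: it expects a `ValueError` when more folds are requested than there are rows. The sketch below shows why such a split cannot be formed. `kfold_indices` is a simplified, hypothetical splitter written for this note; it is not sklearn's implementation and its error message is made up.

```python
def kfold_indices(n_rows, n_splits):
    """Simplified k-fold splitter returning one list of row indices per fold."""
    if n_splits > n_rows:
        # Every fold needs at least one row, so this request is impossible.
        raise ValueError(
            f"n_splits={n_splits} is larger than the number of rows ({n_rows})"
        )
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_rows, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


print([len(fold) for fold in kfold_indices(10, 4)])  # [3, 3, 2, 2]
```

Calling `kfold_indices(10, 2000)` raises `ValueError`, which is the same kind of failure the test provokes with `cross_validation=2000`.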