Add random_seed and cross_validation
FlorianWetschoreck committed Jul 27, 2020
1 parent e41011d commit f543609
Showing 4 changed files with 88 additions and 38 deletions.
24 changes: 15 additions & 9 deletions README.md
@@ -1,6 +1,6 @@
# ppscore - a Python implementation of the Predictive Power Score (PPS)

### From the makers of [bamboolib](https://bamboolib.com)
### From the makers of [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)


__If you don't know yet what the Predictive Power Score is, please read the following blog post:__
@@ -81,7 +81,7 @@ sns.barplot(data=df_predictors, x="x", y="ppscore")

## API

### ppscore.score(df, x, y, sample=5000)
### ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=None)

Calculate the Predictive Power Score (PPS) for "x predicts y"

@@ -105,7 +105,12 @@ Calculate the Predictive Power Score (PPS) for "x predicts y"
- __sample__ : int or ``None``
- Number of rows for sampling. The sampling decreases the calculation time of the PPS.
If ``None`` there will be no sampling.

- __cross_validation__ : int
- Number of folds used during cross-validation. The value also sets a lower bound on the data:
with 4 folds, a pattern can only be detected if the same observation occurs at least 4 times. Increasing the value increases the required minimum number of observations. Below that limit sklearn raises an error and the PPS cannot be calculated.
- __random_seed__ : int or ``None``
- Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
If the value is set, the results will be reproducible. If the value is ``None`` a new random number is drawn at the start of each calculation.
#### Returns

- __Dict__:
@@ -127,7 +132,7 @@ Calculate the Predictive Power Score (PPS) for all columns in the dataframe agai
- __sorted__ : bool
- Whether or not to sort the output dataframe/list
- __kwargs__ :
- Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__
- Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__, __cross_validation__, or __random_seed__

#### Returns

@@ -146,7 +151,7 @@ Calculate the Predictive Power Score (PPS) matrix for all columns in the datafra
- __output__ : str - potential values: "df", "dict"
- Control the type of the output. Either return a df or a dict with all the PPS dicts arranged by the target column
- __kwargs__ :
- Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__
- Other key-word arguments that shall be forwarded to the pps.score method, e.g. __sample__, __cross_validation__, or __random_seed__

#### Returns

@@ -161,10 +166,11 @@ Calculate the Predictive Power Score (PPS) matrix for all columns in the datafra
There are multiple ways how you can calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

- The score is calculated using only 1 feature trying to predict the target column. This means there are no interaction effects between the scores of various features. Note that this is in contrast to feature importance
- The score is calculated on the test sets of a 4-fold crossvalidation (number is adjustable via `ppscore.CV_ITERATIONS`). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
- The score is calculated on the test sets of a 4-fold cross-validation (number is adjustable via `cross_validation`). For classification, stratifiedKFold is used. For regression, normal KFold. Please note that this sampling might not be valid for time series data sets
- All rows which have a missing value in the feature or the target column are dropped
- In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows with a fixed random seed (`ppscore.RANDOM_SEED`). You can adjust the number of rows or skip this sampling via the API. However, in most scenarios the results will be very similar
- In case that the dataset has more than 5,000 rows the score is only calculated on a random subset of 5,000 rows. You can adjust the number of rows or skip this sampling via the API (`sample`). However, in most scenarios the results will be very similar
- There is no grid search for optimal model parameters
- The result might change between calculations because the calculation contains random elements, e.g. the sampling of the rows or the shuffling of the rows before cross-validation. If you want to make sure that your results are reproducible you can set the random seed (`random_seed`).
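
The effect of a fixed seed on the row shuffling can be illustrated with a small standard-library sketch. This is not ppscore's code (the library shuffles with `pandas.DataFrame.sample`); the helper name `shuffle_rows` is hypothetical and only demonstrates why passing the same `random_seed` gives the same row order, and therefore the same cross-validation folds.

```python
import random


def shuffle_rows(rows, random_seed=None):
    """Return a shuffled copy of rows (hypothetical helper, not part of ppscore).

    A fixed seed always yields the same order; None lets Python pick
    a fresh seed, so the order can differ between calls.
    """
    rng = random.Random(random_seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = list(range(20))
first = shuffle_rows(rows, random_seed=1)
second = shuffle_rows(rows, random_seed=1)
print(first == second)  # same seed, same order
```

Because the folds are cut from the shuffled rows, an identical shuffle is what makes two runs with the same seed comparable.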


### Learning algorithm
@@ -231,9 +237,9 @@ If the task is a classification, we compute the weighted F1 score (wF1) as the u
#### Special tasks

The special tasks all have predefined PPS scores. Those tasks exist for implementation reasons in order to communicate special cases and save computation time.
- __predict_id__ has a score of 0 because an ID column cannot be predicted by any other column as part of a crossvalidation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
- __predict_id__ has a score of 0 because an ID column cannot be predicted by any other column as part of a cross-validation. There still might be a 1 to 1 relationship but this is not detectable by the current implementation of the PPS.
- __predict_constant__ has a score of 1 because any column and baseline can perfectly predict a column that only has a single value. It could be argued that the score should be 0 because the model is not better than the naive predictor. However, (so far) we chose to set a value of 1 in order to communicate that there is perfect predictive power.
- __predict_itself__ means that the feature and target columns are the same and thus the PPS is 1 because a column can always perfectly predict its own value.

## About
ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore`, please check out our other project [bamboolib](https://bamboolib.com)
ppscore is developed by [8080 Labs](https://8080labs.com) - we create tools for Python Data Scientists. If you like `ppscore` you might want to check out our other project [bamboolib - a GUI for pandas DataFrames](https://bamboolib.com)
2 changes: 1 addition & 1 deletion src/ppscore/__init__.py
@@ -11,4 +11,4 @@
del get_distribution, DistributionNotFound


from ppscore.calculation import score, predictors, matrix, CV_ITERATIONS, RANDOM_SEED
from ppscore.calculation import score, predictors, matrix
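
Since `CV_ITERATIONS` and `RANDOM_SEED` are no longer exported, callers configure each call through keyword arguments, which `predictors` and `matrix` forward to `score` via `**kwargs`. The following is a minimal sketch of that forwarding pattern only; `score_sketch` and `predictors_sketch` are hypothetical stand-ins that echo their settings and do not compute a PPS.

```python
def score_sketch(x, y, sample=5_000, cross_validation=4, random_seed=None):
    """Hypothetical stand-in for pps.score: echoes the settings it received."""
    return {
        "x": x,
        "y": y,
        "sample": sample,
        "cross_validation": cross_validation,
        "random_seed": random_seed,
    }


def predictors_sketch(columns, y, **kwargs):
    """Hypothetical stand-in for pps.predictors: forwards kwargs per feature."""
    return [score_sketch(x, y, **kwargs) for x in columns if x != y]


results = predictors_sketch(["a", "b", "y"], "y", cross_validation=5, random_seed=42)
for result in results:
    print(result)
```

Settings that are not passed keep the defaults of the inner function, so existing calls without the new arguments keep working.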
83 changes: 56 additions & 27 deletions src/ppscore/calculation.py
@@ -14,26 +14,26 @@
is_timedelta64_dtype,
)

# if the number is 4, then it is possible to detect patterns when there are at least 4 times the same observation. If the limit is increased, the minimum observations also increase. This is important, because this is the limit when sklearn will throw an error which will lead to a score of 0 if we catch it
CV_ITERATIONS = 4

RANDOM_SEED = 587136
NOT_SUPPORTED_ANYMORE = "NOT_SUPPORTED_ANYMORE"


def _calculate_model_cv_score_(df, target, feature, task, **kwargs):
def _calculate_model_cv_score_(
df, target, feature, task, cross_validation, random_seed, **kwargs
):
"Calculates the mean model score based on cross-validation"
# Sources about the used methods:
# https://scikit-learn.org/stable/modules/tree.html
# https://scikit-learn.org/stable/modules/cross_validation.html
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
metric = task["metric_key"]
model = task["model"]
# shuffle the rows - this is important for crossvalidation
# because the crossvalidation just takes the first n lines
# shuffle the rows - this is important for cross-validation
# because the cross-validation just takes the first n lines
# if there is a strong pattern in the rows eg 0,0,0,0,1,1,1,1
# then this will lead to problems because the first cv sees mostly 0 and the later 1
# this approach might be wrong for timeseries because it might leak information
df = df.sample(frac=1, random_state=RANDOM_SEED, replace=False)
df = df.sample(frac=1, random_state=random_seed, replace=False)

# preprocess target
if task["type"] == "classification":
@@ -53,10 +53,10 @@ def _calculate_model_cv_score_(df, target, feature, task, **kwargs):
# reshaping needed because there is only 1 feature
feature_input = df[feature].values.reshape(-1, 1)

# Crossvalidation is stratifiedKFold for classification, KFold for regression
# Cross-validation is stratifiedKFold for classification, KFold for regression
# CV on one core (n_jobs=1, the effective default) has shown to be fastest
scores = cross_val_score(
model, feature_input, target_series, cv=CV_ITERATIONS, scoring=metric
model, feature_input, target_series, cv=cross_validation, scoring=metric
)

return scores.mean()
@@ -74,7 +74,7 @@ def _normalized_mae_score(model_mae, naive_mae):
return 1 - (model_mae / naive_mae)


def _mae_normalizer(df, y, model_score):
def _mae_normalizer(df, y, model_score, **kwargs):
"In case of MAE, calculates the baseline score for y and derives the PPS."
df["naive"] = df[y].median()
baseline_score = mean_absolute_error(df[y], df["naive"]) # true, pred
@@ -98,12 +98,12 @@ def _normalized_f1_score(model_f1, baseline_f1):
return f1_diff / scale_range # 0.1/0.3 = 0.33


def _f1_normalizer(df, y, model_score):
def _f1_normalizer(df, y, model_score, random_seed):
"In case of F1, calculates the baseline score for y and derives the PPS."
label_encoder = preprocessing.LabelEncoder()
df["truth"] = label_encoder.fit_transform(df[y])
df["most_common_value"] = df["truth"].value_counts().index[0]
random = df["truth"].sample(frac=1)
random = df["truth"].sample(frac=1, random_state=random_seed)

baseline_score = max(
f1_score(df["truth"], df["most_common_value"], average="weighted"),
@@ -201,7 +201,7 @@ def _feature_is_id(df, x):
return category_count == len(df[x])


def _maybe_sample(df, sample):
def _maybe_sample(df, sample, random_seed=None):
"""
Maybe samples the rows of the given df to have at most ``sample`` rows
If sample is ``None`` or falsy, there will be no sampling.
@@ -213,6 +213,8 @@ def _maybe_sample(df, sample):
Dataframe that might be sampled
sample : int or ``None``
Number of rows to be sampled
random_seed : int or ``None``
Random seed that is forwarded to pandas.DataFrame.sample as ``random_state``
Returns
-------
@@ -222,11 +224,19 @@ def _maybe_sample(df, sample):
if sample and len(df) > sample:
# this is a problem if x or y have more than sample=5000 categories
# TODO: dont sample when the problem occurs and show warning
df = df.sample(sample, random_state=RANDOM_SEED, replace=False)
df = df.sample(sample, random_state=random_seed, replace=False)
return df


def score(df, x, y, task=None, sample=5000):
def score(
df,
x,
y,
task=NOT_SUPPORTED_ANYMORE,
sample=5_000,
cross_validation=4,
random_seed=None,
):
"""
Calculate the Predictive Power Score (PPS) for "x predicts y"
The score always ranges from 0 to 1 and is data-type agnostic.
@@ -246,6 +256,12 @@ def score(df, x, y, task=None, sample=5000):
sample : int or ``None``
Number of rows for sampling. The sampling decreases the calculation time of the PPS.
If ``None`` there will be no sampling.
cross_validation : int
Number of folds used during cross-validation. The value also sets a lower bound on the data:
with 4 folds, a pattern can only be detected if the same observation occurs at least 4 times. Increasing the value increases the required minimum number of observations. Below that limit sklearn raises an error and the PPS cannot be calculated.
random_seed : int or ``None``
Random seed for the parts of the calculation that require random numbers, e.g. shuffling or sampling.
If the value is set, the results will be reproducible. If the value is ``None`` a new random number is drawn at the start of each calculation.
Returns
-------
@@ -274,11 +290,16 @@ def score(df, x, y, task=None, sample=5000):
raise AssertionError(
f"The dataframe has {len(df[[y]].columns)} columns with the same column name {y}\nPlease adjust the dataframe and make sure that only 1 column has the name {y}"
)
if task is not None:
if task is not NOT_SUPPORTED_ANYMORE:
raise AttributeError(
"The attribute 'task' is no longer supported because it led to confusion and inconsistencies.\nThe task of the model is now determined based on the data types of the columns. If you want to change the task please adjust the data type of the column.\nFor more details, please refer to the README"
)

if random_seed is None:
from random import random

random_seed = int(random() * 1000)

if x == y:
task_name = "predict_itself"
else:
@@ -288,12 +309,8 @@ def score(df, x, y, task=None, sample=5000):
raise Exception(
"After dropping missing values, there are no valid rows left"
)
df = _maybe_sample(df, sample)

if task is None:
task_name = _infer_task(df, x, y)
else:
task_name = task
df = _maybe_sample(df, sample, random_seed=random_seed)
task_name = _infer_task(df, x, y)

task = TASKS[task_name]

@@ -310,9 +327,19 @@ def score(df, x, y, task=None, sample=5000):
ppscore = 0
baseline_score = 0
else:

model_score = _calculate_model_cv_score_(df, target=y, feature=x, task=task)
ppscore, baseline_score = task["score_normalizer"](df, y, model_score)
model_score = _calculate_model_cv_score_(
df,
target=y,
feature=x,
task=task,
cross_validation=cross_validation,
random_seed=random_seed,
)
# IDEA: the baseline_scores do sometimes change significantly, e.g. for F1 and thus change the PPS
# we might want to calculate the baseline_score 10 times and use the mean in order to have less variance
ppscore, baseline_score = task["score_normalizer"](
df, y, model_score, random_seed=random_seed
)

return {
"x": x,
@@ -343,7 +370,8 @@ def predictors(df, y, output="df", sorted=True, **kwargs):
sorted: bool
Whether or not to sort the output dataframe/list
kwargs:
Other key-word arguments that shall be forwarded to the pps.score method
Other key-word arguments that shall be forwarded to the pps.score method,
e.g. ``sample``, ``cross_validation``, or ``random_seed``
Returns
-------
@@ -404,7 +432,8 @@ def matrix(df, output="df", **kwargs):
output: str - potential values: "df", "dict"
Control the type of the output. Either return a df or a dict with all the PPS dicts arranged by the target column
kwargs:
Other key-word arguments that shall be forwarded to the pps.score method
Other key-word arguments that shall be forwarded to the pps.score method,
e.g. ``sample``, ``cross_validation``, or ``random_seed``
Returns
-------
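
The seed handling added to `score` can be read in isolation as follows. This is a sketch that mirrors the `if random_seed is None` branch shown in the diff above; the wrapper name `resolve_seed` is hypothetical and does not exist in ppscore.

```python
from random import random


def resolve_seed(random_seed=None):
    """Return the given seed, or draw one in the range 0..999 if it is None.

    Hypothetical wrapper around the logic in score(): random() returns a
    float in [0, 1), so int(random() * 1000) is always between 0 and 999.
    """
    if random_seed is None:
        random_seed = int(random() * 1000)
    return random_seed


print(resolve_seed(7))  # an explicit seed is passed through unchanged
print(resolve_seed())   # some integer between 0 and 999
```

Resolving the seed once at the start of the call means the same value reaches the sampling, the shuffling, and the F1 baseline, so all random steps of one calculation share a single seed.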
17 changes: 16 additions & 1 deletion tests/test_calculation.py
@@ -135,7 +135,22 @@ def test_score():
duplicate_column_names_df, "unique_column_name", "duplicate_column_name"
)

# check tasks
# check cross_validation
# if more folds than data, there is an error
with pytest.raises(ValueError):
assert pps.score(df, "x", "y", cross_validation=2000)

# check random_seed
assert pps.score(df, "x", "y", random_seed=1) == pps.score(
df, "x", "y", random_seed=1
)
assert pps.score(df, "x", "y", random_seed=1) != pps.score(
df, "x", "y", random_seed=2
)
# the random seed that is drawn automatically is smaller than 1000
assert pps.score(df, "x", "y") != pps.score(df, "x", "y", random_seed=123_456)

# check task inference
assert pps.score(df, "x", "y")["task"] == "regression"
assert pps.score(df, "x", "x_greater_0_string")["task"] == "classification"
assert pps.score(df, "x", "constant")["task"] == "predict_constant"
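
The `cross_validation=2000` test relies on the fact that a k-fold split needs at least as many rows as folds. The sketch below imitates that constraint with plain Python; `kfold_test_indices` is a hypothetical helper, not sklearn's implementation, and it assumes unshuffled, contiguous folds.

```python
def kfold_test_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous test folds.

    Hypothetical helper: raises ValueError when there are more folds
    than samples, which is the situation the test above provokes.
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_samples:
        raise ValueError(
            f"Cannot have n_splits={n_splits} greater than n_samples={n_samples}"
        )
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        # the first `extra` folds receive one additional row
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = kfold_test_indices(10, 4)
print(folds)  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

For classification the constraint is stricter in practice, because a stratified split needs enough rows per class, which is the minimum-observations limit described for `cross_validation`.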