Skip to content

Commit

Permalink
added README for feature selection
Browse files Browse the repository at this point in the history
  • Loading branch information
fakenewssyria authored Jun 7, 2021
1 parent eb1c053 commit d87f1df
Showing 1 changed file with 73 additions and 0 deletions.
73 changes: 73 additions & 0 deletions feature_selection/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Feature Selection ([feature_selection_code.py](https://github.com/fakenewssyria/fake_news_detection/blob/master/feature_selection/feature_selection_code.py))
This Repository contains the necessary codes for applying feature selection.

Modify lines 17/18 in order to control the number of features to be selected in each feature selection method


## Feature Selection Methods:

* ```drop_zero_std()```: drops columns that have 0 standard deviation. (Actually will not drop but show the
columns that must be dropped)

* ```drop_low_var()```: drops columns that have low variance. (Actually will not drop but show the
columns that must be dropped)

* ```drop_high_correlation()```: drops columns that have high correlation. (Actually will not drop but show the
columns that must be dropped)

* ```feature_importance(xg_boost=True, extra_trees=False)```: Applies feature importance to the data.
* ```xg_boost=True, extra_trees=False```: will perform feature importance using XG Boost only
* ```xg_boost=False, extra_trees=True```: will perform feature importance using Extra Trees only
* ```xg_boost=True, extra_trees=True```: will perform feature importance using both XG Boost and Extra Trees
* ```xg_boost=False, extra_trees=False```: Nothing will happen. Avoid this if you want to use feature selection.
* **Default Behavior:** i.e. if we do: feature_importance() it will do only XG Boost. As: xg_boost=False, extra_trees=True

* ```univariate()```: Applies Univariate Feature Selection with NUM_FEATURES_UNIVARIATE being selected (specified in line 17 in feature_selection_code.py). Raises Vlaue Error in this is greater than the total number of input features.

* ```rfe()```: Applies Recursive Feature Elimination with NUM_FEATURES_RFE being selected (specified in line 18 in feature_selection_code.py). Raises Vlaue Error in this is greater than the total number of input features.

## Scaling
* ```scale=True, scale_input=True, scale_output=True```: Will scale both the input and the output columns.
* ```scale=True, scale_input=True, scale_output=False```: Will scale only the input columns.
* ```scale=True, scale_input=False, scale_output=True```: Will scale only the output column.
* ```scale=True, scale_input=False, scale_output=False```: Will not scale any columns, although scale=True, but either scale_input or scale_output must be True
* ```scale=False, scale_input=True, scale_output=True```: Will not scale any columns, although both scale_input=True and scale_output=True, but scale must be True as well in order to perform any scaling exercise.
* ```scale=False, scale_input=True, scale_output=False```: Will not scale any columns, although scale_input=True, but scale must be True as well in order to perform any scaling exercise.
* ```scale=False, scale_input=False, scale_output=True```: Will not scale any columns, although scale_output=True, but scale must be True as well in order to perform any scaling exercise.
* ```scale=False, scale_input=False, scale_output=False```: Will not scale any columns.

## Indexing Input Columns in order to Scale
If ```scale=True and scale_input=True```:
* ```input_zscore=(start_index_1, end_index_1)```: will apply Z-score scaling for the input columns starting at index start_index_1 and ending at end_index_1 (eclusive). By defualt, None. If None, no z-score scaling to any of the input columns is applied.
* ```input_minmax=(start_index_2, end_index_2)```: will apply min-max scaling for the input columns starting at index start_index_2 and ending at end_index_2 (eclusive). By defualt, None. If None, no min-max scaling to any of the input columns is applied.
* ```input_box=(start_index_3, end_index_3)```: will apply box-cox transformation for the input columns starting at index start_index_3 and ending at end_index_3 (eclusive). By defualt, None. If None, no box-cox transformation to any of the input columns is applied.
* ```input_log=(start_index_4, end_index_4)```: will apply log transformation for the input columns starting at index start_index_4 and ending at end_index_4 (eclusive). By defualt, None. If None, no log transformation to any of the input columns is applied.


## Specifying Scaling Type to the Output Column
If ```scale=True and scale_output=True```:
* ```output_zscore```: Boolean, by default, False. If True, Z-score scaling for the output column will be applied.
* ```output_minmax```: Boolean, by default, False. If True, min-max scaling for the output column will be applied.
* ```output_box```: Boolean, by default, False. If True, box-cox transformation for the output column will be applied.
* ```output_log```: Boolean, by default, False. If True, log transformation for the output column will be applied.

Note: Either one of the above must be True, and all others must be False because we have to apply only one kind of scaling for the output column.

## Saving Feature Selection Plots
* ```output_folder```: the path to the output folder that will be holding several modeling plots. If the path specified does not exist, it will be created dynamically at runtime.

## Regression vs. Classification

If Regression, then specify ```regression=True```. If Classification, then specify ```regression=False```


## Columns

* ```cols_drop```: list containing the names of the columns the user wants to drop from the data. By default, None. If None, no columns will be dropped from the data.
* ```target_variable```: name of the column holding the target variable (this will be the output column)


## Raises
ValueError
* If ```NUM_FEATURES_UNIVARIATE``` is greater than the total number of input features.
* If ```NUM_FEATURES_RFE``` is greater than the total number of input features.

0 comments on commit d87f1df

Please sign in to comment.