RedditCommentTextClassification

πŸ’¬ Classification of Reddit comments to the subreddit they were posted in

Directory Structure

.
β”œβ”€β”€ data
β”‚   β”œβ”€β”€ processed_data
β”‚   β”‚   β”œβ”€β”€ LEMMA_test_clean.csv
β”‚   β”‚   β”œβ”€β”€ LEMMA_train_clean.csv
β”‚   β”‚   β”œβ”€β”€ STEM_test_clean.csv
β”‚   β”‚   └── STEM_train_clean.csv
β”‚   └── raw_data
β”‚       β”œβ”€β”€ reddit_test.csv
β”‚       └── reddit_train.csv
β”‚
β”œβ”€β”€ results
β”‚   β”œβ”€β”€ predictions.csv
β”‚   β”œβ”€β”€ results.txt
β”‚   └── STEM_BINARY_DT_confusion.png
β”‚
└── src
    β”œβ”€β”€ main.py
    β”œβ”€β”€ config.py
    β”‚
    β”œβ”€β”€ create_vocabularies.py
    β”œβ”€β”€ validation_pipeline.py
    β”œβ”€β”€ generate_kaggle_results.py
    β”‚
    β”œβ”€β”€ Data_Analysis.ipynb
    β”‚
    β”œβ”€β”€ data_processing
    β”‚   └── vocabulary.py
    β”‚
    β”œβ”€β”€ models
    β”‚   β”œβ”€β”€ LazyNaiveBayes.py
    β”‚   β”œβ”€β”€ Model.py
    β”‚   └── NaiveBayes.py
    β”‚
    └── utils
        β”œβ”€β”€ factory.py
        └── utils.py

The data/ folder contains all raw and processed csv files used for training the models and generating predictions for the Kaggle competition.

  • The data/raw_data/ folder contains the raw data downloaded from Kaggle containing all Reddit comments (one training file and one test file).
  • The data/processed_data/ folder contains the csv files for the processed version of the raw data, with all words either stemmed or lemmatized and custom regex filters applied to further reduce the feature space. A sketch of this kind of preprocessing is shown below.
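
For concreteness, here is a minimal sketch of how such a processed file could be produced from the raw one. The column name (comments) and the exact regex filters are assumptions for illustration and do not necessarily match what create_vocabularies.py actually does.

```python
# Illustrative preprocessing sketch, NOT the project's actual pipeline.
# Assumes the raw csv has a "comments" column; adjust the name as needed.
import re

import pandas as pd
from nltk.stem import WordNetLemmatizer  # requires nltk and its 'wordnet' data

lemmatizer = WordNetLemmatizer()

def clean_comment(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return " ".join(lemmatizer.lemmatize(token) for token in text.split())

raw = pd.read_csv("data/raw_data/reddit_train.csv")
raw["comments"] = raw["comments"].astype(str).apply(clean_comment)
raw.to_csv("data/processed_data/LEMMA_train_clean.csv", index=False)
```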

The results/ folder is the default folder where all scripts dump their results and figures.

  • The results.txt file contains a detailed report of the accuracy of each model run on each configuration.
  • The predictions.csv file contains the predictions generated to submit to Kaggle.
  • The *_confusion.png files are images of the confusion matrices of each model run on each configuration (see the sketch below for how such a figure can be produced).
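
As an illustration only (the real validation_pipeline.py may differ), a confusion-matrix figure and an accuracy line can be written to results/ with scikit-learn and matplotlib along these lines; the helper name save_results and its arguments are hypothetical.

```python
# Hypothetical helper showing how a *_confusion.png and a results.txt entry
# could be produced; not the repository's actual code.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

def save_results(name, y_true, y_pred, results_dir="results"):
    # Append the accuracy of this model/configuration to results.txt
    accuracy = accuracy_score(y_true, y_pred)
    with open(f"{results_dir}/results.txt", "a") as f:
        f.write(f"{name}: accuracy = {accuracy:.4f}\n")

    # Save the confusion matrix as e.g. results/STEM_BINARY_DT_confusion.png
    display = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, xticks_rotation="vertical"
    )
    display.figure_.savefig(f"{results_dir}/{name}_confusion.png", bbox_inches="tight")
    plt.close(display.figure_)
```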

The src/ folder contains all of the .py and .ipynb files.

  • The config.py file is the most important file: it defines all the configurations and models to be run, as well as the file paths of the raw data and the results folder (a hypothetical sketch follows this list).
  • The create_vocabularies.py file is the script that preprocesses the raw data by lemmatizing and stemming all words. All custom regex filters are applied as well to reduce the feature space. Once the raw data is processed, it is saved into new csv files in the data/processed_data/ folder.
  • The validation_pipeline.py file is a script that runs all the different configurations and models defined in the config.py file, calculates the accuracies and confusion matrices, and saves all of this data in the results/ folder.
  • The generate_kaggle_results.py file is a script that, based on the submission configuration and model defined in the config.py file, predicts the test data and generates a predictions.csv file in the results/ folder to be submitted to Kaggle.
  • The main.py script runs the three previously described scripts in order: create_vocabularies, validation_pipeline, generate_kaggle_results.
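
To make the config-driven flow concrete, here is a hypothetical sketch of what config.py might contain; the actual constant names, configuration format, and values in the repository may differ.

```python
# config.py -- hypothetical illustration of the configuration-driven setup.

# File paths for raw data and outputs
RAW_TRAIN_PATH = "data/raw_data/reddit_train.csv"
RAW_TEST_PATH = "data/raw_data/reddit_test.csv"
PROCESSED_DATA_DIR = "data/processed_data/"
RESULTS_DIR = "results/"

# (vocabulary, vectorization, model) combinations run by validation_pipeline.py
CONFIGURATIONS = [
    ("LEMMA", "TFIDF", "NB"),   # lemmatized vocabulary, tf-idf features, Naive Bayes
    ("STEM", "BINARY", "DT"),   # stemmed vocabulary, binary features, decision tree
]

# Single configuration used by generate_kaggle_results.py for the submission
SUBMISSION_CONFIGURATION = ("LEMMA", "TFIDF", "NB")
```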
