Post type classification

This Python script classifies Hacker News posts into one of four post types: ask_hn, show_hn, story, or poll. It achieves over 80% accuracy on a test set of 5000 posts. The classifier was tested on a Hacker News dataset fetched from Kaggle.

Information about the dataset:

Hacker News posts from 2018 to 2019
Each post includes the following columns: Object ID | Title | Post Type | Author | Created At | URL | Points | Number of Comments | year

Classifier specifications:

Builds a probabilistic model from the training set using a Naive Bayes classifier.
Posts whose "Created At" value falls in 2018 form the training set.
Posts whose "Created At" value falls in 2019 form the testing set.
Posts are tokenized and the resulting word set is used as the vocabulary.
For each word in the vocabulary, its frequency and its conditional probability for each post type are calculated, with a smoothing value of 0.5.
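
The specification above can be sketched roughly as follows. This is a minimal illustration, not the code in classifier.py or vocab.py: the function names, the regex tokenizer, the choice to tokenize only the "Title" column, and the assumption that "Created At" starts with a four-digit year are all hypothetical. Only the column names and the smoothing value of 0.5 come from this README.

```python
import re
from collections import Counter, defaultdict

SMOOTHING = 0.5  # smoothing value stated in the specification above


def split_by_year(rows):
    """Split rows into (training, testing) sets by year.

    Assumption: "Created At" is a string starting with a four-digit year.
    """
    training = [r for r in rows if r["Created At"].startswith("2018")]
    testing = [r for r in rows if r["Created At"].startswith("2019")]
    return training, testing


def tokenize(title):
    """Lowercase a title and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", title.lower())


def train(rows):
    """Return (vocabulary, priors, cond_probs) estimated from training rows."""
    word_counts = defaultdict(Counter)  # post type -> word -> frequency
    class_counts = Counter()            # post type -> number of posts
    for row in rows:
        post_type = row["Post Type"]
        class_counts[post_type] += 1
        word_counts[post_type].update(tokenize(row["Title"]))

    vocabulary = sorted({w for counts in word_counts.values() for w in counts})
    total_posts = sum(class_counts.values())
    priors = {c: n / total_posts for c, n in class_counts.items()}

    cond_probs = {}
    for c in class_counts:
        # Additive smoothing: add 0.5 to every word count in this class.
        denom = sum(word_counts[c].values()) + SMOOTHING * len(vocabulary)
        cond_probs[c] = {
            w: (word_counts[c][w] + SMOOTHING) / denom for w in vocabulary
        }
    return vocabulary, priors, cond_probs
```

With this form of smoothing, a word that never appears in a post type still receives a small non-zero probability, so a single unseen word does not force a post type's score to zero.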

Classifier experiments:

Baseline:
Accesses the data and calculates the score of each post for story, ask_hn, show_hn, and poll.
Selects the post type based on the scores.
Generates a label to indicate whether the assessment is correct.
Example output line:
Student's Guide poll 0.002 0.03 0.007 0.12 story wrong
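
A hedged sketch of the scoring step is shown below. It assumes each score is the sum of the log10 prior and the log10 conditional probabilities of the words in the post, which is a common Naive Bayes formulation but is not confirmed by this README. The function names and the toy probabilities are hypothetical.

```python
import math


def score_post(tokens, priors, cond_probs):
    """Return {post_type: score} for one tokenized post."""
    scores = {}
    for post_type, prior in priors.items():
        score = math.log10(prior)
        for word in tokens:
            prob = cond_probs[post_type].get(word)
            if prob is not None:  # words outside the vocabulary are skipped
                score += math.log10(prob)
        scores[post_type] = score
    return scores


def classify(tokens, true_type, priors, cond_probs):
    """Pick the highest-scoring post type and label the result."""
    scores = score_post(tokens, priors, cond_probs)
    predicted = max(scores, key=scores.get)
    label = "right" if predicted == true_type else "wrong"
    return predicted, scores, label


# Toy model with two post types, for illustration only.
toy_priors = {"story": 0.5, "ask_hn": 0.5}
toy_cond_probs = {
    "story": {"ask": 0.1, "release": 0.9},
    "ask_hn": {"ask": 0.8, "release": 0.2},
}
predicted, scores, label = classify(["ask"], "story", toy_priors, toy_cond_probs)
print(predicted, label)
```

Summing logarithms instead of multiplying raw probabilities avoids numeric underflow on long posts and does not change which post type ranks highest.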
Stop-word Filtering:
Removes from the vocabulary the words listed in stopwords.txt.
Word Length Filtering:
Removes all words with length ≤ 2 and all words with length ≥ 9.
Infrequent Word Filtering:
Starting from the baseline experiment, gradually removes from the vocabulary the words with frequency = 1, frequency ≤ 5, frequency ≤ 10, frequency ≤ 15, and frequency ≤ 20. Then gradually removes the top 5%, 10%, 15%, 20%, and 25% most frequent words. The performance of the resulting classifiers is plotted, for both series, against the number of words left in the vocabulary.
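
The three vocabulary filters above could be sketched as follows. This is an illustration under assumptions: the vocabulary is represented as a collections.Counter mapping each word to its frequency, and all function names are hypothetical rather than taken from the repository.

```python
from collections import Counter


def remove_stop_words(freqs, stop_words):
    """Drop every word that appears in the stop-word set."""
    return Counter({w: n for w, n in freqs.items() if w not in stop_words})


def filter_by_length(freqs):
    """Drop words with length <= 2 or length >= 9."""
    return Counter({w: n for w, n in freqs.items() if 2 < len(w) < 9})


def remove_infrequent(freqs, max_freq):
    """Drop words whose frequency is <= max_freq (1, 5, 10, 15, 20)."""
    return Counter({w: n for w, n in freqs.items() if n > max_freq})


def remove_most_frequent(freqs, fraction):
    """Drop the top `fraction` of words by frequency (0.05 up to 0.25)."""
    cutoff = int(len(freqs) * fraction)
    top = {w for w, _ in freqs.most_common(cutoff)}
    return Counter({w: n for w, n in freqs.items() if w not in top})


# Toy vocabulary, for illustration only.
vocab = Counter({
    "the": 50, "hn": 20, "show": 15, "ask": 12, "python": 10,
    "new": 9, "ai": 7, "classifier": 3, "release": 2, "laptop": 1,
})
print(len(remove_infrequent(vocab, 5)), len(remove_most_frequent(vocab, 0.2)))
```

Each function returns a new Counter, so every experiment can start again from the unfiltered baseline vocabulary, and the size of the returned Counter gives the "number of words left" for the plot.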

Steps to run the program:

-- To change the CSV file names, change the first two variables in const.py
-- To disable a predictor that increases performance, change 'HEURSTIC' to 'False' in const.py

Install all the libraries
Navigate to the folder where main.py exists
Open the command line prompt
Type 'py main.py' in the command line
Instructions are printed in the terminal
To run all the experiments, enter 0
To run the baseline experiment, enter 1
To run the stopword experiment, enter 2
To run the word length filter experiment, enter 3
To run the infrequency filter experiment, enter 4
Output text files are saved in 'txtOutput'

Libraries used (json, string, math, and re ship with Python; the others must be installed):

pandas
matplotlib
sklearn
nltk
json
string
math
re

mohhef/Machine-learning-post-classification
