Step 1: The first thing is to load the dataset so that it can be cleaned further.
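A minimal sketch of this loading step, assuming the data is a CSV file with the hypothetical name dataset.csv:

import pandas as pd

# Load the raw dataset (file name is an assumption for illustration)
df = pd.read_csv("dataset.csv")
print(df.shape)    # quick sanity check: number of rows and columns
print(df.head())   # preview the first few records before cleaning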
Step 2: After loading, I clean the dataset using the NLTK library (Natural Language Toolkit) by removing
the items listed below, so that the machine can process the dataset efficiently.
Step 3: Text Preprocessing
I used:
• Tokenization (splitting the text into tokens)
• Normalization (reducing each word to its stem or base morpheme)
List of Text Preprocessing Steps
1. Converting the dataset to lowercase
2. Removing numbers
3. Removing punctuation
4. Removing accent marks
5. Removing tags (e.g., HTML)
6. Removing special characters and digits
7. Removing extra whitespace
8. Removing stopwords
9. Lemmatization
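A rough sketch of how these cleaning steps could be implemented with NLTK; the DataFrame df and the column names "text" and "clean_text" are assumptions carried over from the loading step:

import re
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)                  # 5. remove tags
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")  # 4. strip accent marks
    text = re.sub(r"[^a-z\s]", " ", text)                 # 2, 3, 6. drop numbers, punctuation, special characters
    text = re.sub(r"\s+", " ", text).strip()              # 7. collapse extra whitespace
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # 8. remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 9. lemmatization
    return " ".join(tokens)

df["clean_text"] = df["text"].apply(clean_text)  # "text" column name is an assumption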
This step includes:
• word frequency analysis,
• sentence length analysis, etc.
• visualization with n-grams (unigrams, bigrams, trigrams, four-grams, ...)
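A short sketch of this analysis, reusing the assumed clean_text column; bigrams are shown as an example (change n to 3 or 4 for tri-/four-grams):

from collections import Counter
from nltk import ngrams

# Flatten all cleaned documents into one token list
tokens = " ".join(df["clean_text"]).split()

# Word frequency analysis
word_freq = Counter(tokens)
print(word_freq.most_common(20))

# Sentence/document length analysis
df["length"] = df["clean_text"].str.split().str.len()
print(df["length"].describe())

# N-gram frequencies (n = 2 here; use 1, 3, 4, ... for other n-grams)
bigram_freq = Counter(ngrams(tokens, 2))
print(bigram_freq.most_common(10))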
Step 1: We need to split the data into training and test sets.
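A minimal sketch of the split with scikit-learn; the "label" column and the CountVectorizer bag-of-words features are assumptions, not necessarily the exact setup used here:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Turn the cleaned text into numeric features (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]  # "label" column name is an assumption

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)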
Step 2: I applied the Naive Bayes model, which showed better accuracy than the other models.
Step 3: Here, I fit Naive Bayes to the training set.
Step 4: Then I predict the test set results.
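A sketch of Steps 2-4 with scikit-learn, reusing the split above; MultinomialNB is a common Naive Bayes variant for text counts, though the exact variant used here is an assumption:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb = MultinomialNB()
nb.fit(X_train, y_train)        # Step 3: fit Naive Bayes to the training set
y_pred = nb.predict(X_test)     # Step 4: predict the test set results
print("Naive Bayes accuracy:", accuracy_score(y_test, y_pred))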
Other algorithms tried to compare model accuracy:
1. Decision Tree (I have not worked with this model before)
2. Random Forest (it worked, but did not give the best accuracy)
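A sketch of fitting these two classifiers on the same train/test split for comparison; the default hyperparameters are an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    model.fit(X_train, y_train)                          # train on the same training set
    acc = accuracy_score(y_test, model.predict(X_test))  # evaluate on the same test set
    print(name, "accuracy:", acc)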
1. One idea that came to my mind is to apply multiple algorithms to a single dataset, because
some algorithms are better suited to a particular type of dataset than others. I tried this
on one dataset by applying Random Forest and Naive Bayes, and found that Random Forest
worked better on that dataset (see the sketch after this list).
2. Another idea that came to my mind is to add more data to the dataset. I have not tried this,
but I heard it from one of the professors at the ACM Winter School.
Reason: more data allows the "data to speak for itself," instead of relying on assumptions and
weak correlations.
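A sketch of idea 1, comparing two algorithms on a single dataset with 5-fold cross-validation; the fold count and the two models shown are illustrative choices, not the exact runs described above:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

for name, model in [("Naive Bayes", MultinomialNB()),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # X, y from the split step above
    print(name, "mean accuracy:", scores.mean())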
(I got stuck for a little while working with Naive Bayes in a Google Colab notebook; I am still learning and working on my mistakes.)