MIDAS@IIITD

TASK-3: NLP

♦️ Show how you would clean and process the data

 Step 1: The first thing is to load my dataset so I can clean it further.

 Step 2: After loading, I'll clean my dataset using the NLTK library (Natural Language Toolkit) by removing
 the items listed below, so that the machine can process the dataset efficiently.

 Step 3: Text Preprocessing

I used
•	Tokenization (splitting the text into tokens)
•	Normalization (reducing each word to its stem or root morpheme)

List of Text Preprocessing Steps

     1.	Converting the dataset to lowercase
     2.	Removing numbers and digits
     3.	Removing punctuation
     4.	Removing accent marks
     5.	Removing HTML tags
     6.	Removing special characters
     7.	Removing extra white space
     8.	Removing stopwords
     9.	Lemmatization
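The steps above can be sketched end to end. This is a minimal, standard-library-only version (the real pipeline would use NLTK's `word_tokenize`, its `stopwords` corpus, and `WordNetLemmatizer`; the tiny `STOPWORDS` set here is just illustrative, and lemmatization is omitted because it needs NLTK's WordNet data):

```python
import re
import unicodedata

# Tiny illustrative stopword list; NLTK's stopwords.words("english") is far larger.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "are"}

def preprocess(text: str) -> list[str]:
    """Apply the cleaning steps above and return a list of clean tokens."""
    text = text.lower()                                # 1. lowercase
    text = unicodedata.normalize("NFKD", text)         # 4. strip accent marks
    text = text.encode("ascii", "ignore").decode()
    text = re.sub(r"<[^>]+>", " ", text)               # 5. remove HTML tags
    text = re.sub(r"\d+", " ", text)                   # 2. remove numbers/digits
    text = re.sub(r"[^a-z\s]", " ", text)              # 3/6. punctuation, specials
    text = re.sub(r"\s+", " ", text).strip()           # 7. collapse extra spaces
    tokens = text.split()                              # tokenization
    return [t for t in tokens if t not in STOPWORDS]   # 8. drop stopwords

print(preprocess("<p>The 3 Cafés are OPEN!</p>"))  # ['cafes', 'open']
```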

♦️ Show how you would visualize this data

   This step includes:

     •	word-frequency analysis,

     •	sentence-length analysis, etc.

     •	visualizing n-grams (unigrams, bigrams, trigrams, four-grams, …)
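Word-frequency and n-gram counts are straightforward with `collections.Counter`; the counts below would typically feed a bar chart or word cloud (e.g. with matplotlib), which I omit here. A small sketch:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing makes language processing fun".split()

word_freq = Counter(tokens)               # word-frequency analysis
bigram_freq = Counter(ngrams(tokens, 2))  # bigram analysis

print(word_freq.most_common(2))    # [('language', 2), ('processing', 2)]
print(bigram_freq.most_common(1))  # [(('language', 'processing'), 2)]
```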

♦️ Show how you would measure the accuracy of the model

  Step 1: We need to split our data into train and test sets.

  Step 2: I've applied the Naive Bayes model, which gave better accuracy than the other models I tried.

  Step 3: Here, I fit Naive Bayes to the training set.

  Step 4: Then I predict the test-set results and compare them with the true labels to compute accuracy.
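A concrete version of these four steps. The notebook presumably uses scikit-learn (`train_test_split`, `MultinomialNB`, `accuracy_score`); to keep this sketch self-contained I hand-split a made-up toy dataset and use a tiny from-scratch multinomial Naive Bayes with Laplace smoothing:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs: list[str], labels: list[str]) -> "TinyNaiveBayes":
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc.split())
        self.vocab = {w for counter in self.counts.values() for w in counter}
        return self

    def predict(self, doc: str) -> str:
        def log_posterior(c: str) -> float:
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total)  # Laplace smoothing
                for w in doc.split())
        return max(self.classes, key=log_posterior)

# Step 1: split into train and test (hand-split toy data for illustration).
train = [("good great fun", "pos"), ("great movie good", "pos"),
         ("bad awful boring", "neg"), ("awful bad plot", "neg")]
test = [("great fun movie", "pos"), ("boring awful", "neg")]

# Steps 2-3: fit Naive Bayes to the training set.
model = TinyNaiveBayes().fit([d for d, _ in train], [y for _, y in train])

# Step 4: predict the test set and measure accuracy.
preds = [model.predict(d) for d, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
print(accuracy)  # 1.0
```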

  Other algorithms for comparing the model's accuracy:
        1. Decision Tree (haven't worked with this model yet)
        2. Random Forest (tried it, but didn't find it the best)

♦️ What ideas do you have to improve the accuracy of the model in NLP

   1. One idea that comes to mind is to apply multiple algorithms to a single dataset,
      because some algorithms are better suited to a particular type of dataset than others.
      I tried this on a single dataset, applying Random Forest and Naive Bayes, and found
      that the Random Forest algorithm worked better on that dataset.

   2. Another idea is to add more data to the dataset. I haven't tried this myself, but
      I heard it from one of the professors at ACM Winter School.
      Reason: it allows the “data to speak for itself,” instead of relying on assumptions and
      weak correlations.
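Idea 1 amounts to evaluating each candidate algorithm on the same held-out test set and keeping the winner. A library-free sketch of that comparison loop — the two "algorithms" here (a majority-class baseline and a one-word rule) are illustrative stand-ins for what would really be scikit-learn's `RandomForestClassifier` and `MultinomialNB`, and the data is made up:

```python
def evaluate(make_model, train, test):
    """Fit a model on train, return its accuracy on test."""
    predict = make_model(train)
    return sum(predict(x) == y for x, y in test) / len(test)

def majority_baseline(train):
    # Always predict the most common training label.
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner

def keyword_rule(train):
    # Toy one-rule classifier keyed on a single indicative word.
    return lambda x: "pos" if "good" in x else "neg"

train = [("good film", "pos"), ("good fun", "pos"), ("bad film", "neg")]
test = [("good plot", "pos"), ("bad plot", "neg")]

scores = {name: evaluate(fn, train, test)
          for name, fn in [("majority baseline", majority_baseline),
                           ("keyword rule", keyword_rule)]}
best = max(scores, key=scores.get)
print(scores, "->", best)  # {'majority baseline': 0.5, 'keyword rule': 1.0} -> keyword rule
```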

(I got stuck for a little while working with Naive Bayes in a Google Colab notebook; I'm still learning and working on my mistakes.)

THANK YOU, SIR:exclamation:



✨ I have also given an online session, "Introduction to Natural Language Processing", providing general information to B.Tech and MCA students!✨

👉 https://youtu.be/mb7qbHCNTKE
