Pem14604/Clustering

Text Clustering

Steps

1. Data preprocessing: remove punctuation, stopwords, special characters, digits, and any other words that are unwanted for the use case.
2. Text-to-vector conversion: TF-IDF, Word2Vec, GloVe, fastText.
3. Clustering: K-means, Gaussian mixture, hierarchical/agglomerative, Lingo.
4. Representation: try different graph tools such as FoamTree, NetworkX, or Neo4j.
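As a rough illustration of the preprocessing step, here is a minimal sketch using only the Python standard library. The `clean_text` function, the tiny `STOPWORDS` set, and the `extra_stopwords` parameter are hypothetical names made up for this example, not the code uploaded in this repo; a real project would likely use a fuller stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# real projects usually rely on a fuller list from an NLP library.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "and", "in", "for", "on"}

def clean_text(text, extra_stopwords=()):
    """Lowercase the text, strip non-letters, and drop stopwords."""
    text = text.lower()
    # Anything that is not an ASCII letter (punctuation, digits,
    # special characters) becomes a space.
    text = re.sub(r"[^a-z]+", " ", text)
    # extra_stopwords holds use-case-specific words we do not want.
    blocked = STOPWORDS | set(extra_stopwords)
    return [tok for tok in text.split() if tok not in blocked]

tokens = clean_text("The 3 servers crashed in 2021: disk-full errors!",
                    extra_stopwords=["errors"])
print(tokens)
```

Replacing unwanted characters with a space (rather than deleting them) keeps hyphenated words such as "disk-full" from being glued into a single token.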

Text to vector conversion

I tried various techniques, because the clustering result depends on the vectors, that is, on how well the text is represented as a vector. The best results came from TF-IDF and fastText. I also tried several tweaks to TF-IDF: keeping only nouns and verbs in the feature vector, setting a threshold to cap the vector size, and various combinations of max_df and min_df. What works depends on the data, so you need to test different combinations of these settings to get the best results.
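To make the effect of these settings concrete, here is a small pure-Python sketch of document-frequency pruning followed by TF-IDF weighting. It is not the repo's code: the `tfidf` function is hypothetical, it treats `min_df` as an absolute document count and `max_df` as a fraction of documents, and it uses the plain `log(N / df) + 1` weighting, which is only one of several common IDF variants. scikit-learn's `TfidfVectorizer` exposes parameters with the same names (`min_df`, `max_df`, `max_features`), though its exact weighting differs.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1, max_df=1.0, max_features=None):
    """docs is a list of token lists. Returns (vocabulary, list of {term: weight})."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Drop terms that are too rare (min_df) or too common (max_df).
    kept = [t for t, c in df.items() if c >= min_df and c / n_docs <= max_df]
    if max_features is not None:
        # Cap the vector size: keep the terms with the highest corpus counts.
        totals = Counter(term for doc in docs for term in doc)
        kept = sorted(kept, key=lambda t: (-totals[t], t))[:max_features]
    vocab = sorted(kept)
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in idf)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vocab, vectors

docs = [
    ["disk", "full", "error", "server"],
    ["disk", "failure", "server"],
    ["login", "error", "server"],
    ["login", "timeout", "server"],
]
vocab, vectors = tfidf(docs, min_df=2, max_df=0.75)
print(vocab)  # "server" is in every document, so max_df removes it
```

In this toy corpus, `min_df=2` removes words that occur in a single document and `max_df=0.75` removes "server", which appears everywhere and so carries no information for separating clusters.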

Code is uploaded for the following steps:

  • Data cleaning: punctuation removal, stopwords, digits, special characters, keeping only English words
  • Text to vector or word embeddings: fastText, TF-IDF, Word2Vec, TF-IDF tweaking by setting a threshold
  • fastText transfer learning, or training on top of an existing model
  • Sentence vector
  • Algorithms: K-means, agglomerative
  • Dataframe
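The sentence-vector and K-means steps listed above can be sketched roughly as follows. This is a toy illustration, not the uploaded code: the 2-dimensional `WORD_VECTORS` are made up (real ones would come from a trained fastText or Word2Vec model), averaging word vectors is only one way to build a sentence vector, and `kmeans` is a bare-bones Lloyd's algorithm seeded with the first `k` points so that the run is deterministic. It needs Python 3.8+ for `math.dist`.

```python
import math

# Made-up 2-d word vectors for illustration only.
WORD_VECTORS = {
    "disk": (1.0, 0.1), "server": (0.9, 0.2), "crash": (0.8, 0.0),
    "login": (0.1, 1.0), "password": (0.0, 0.9), "timeout": (0.2, 0.8),
}

def sentence_vector(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return (0.0, 0.0)
    return tuple(sum(dim) / len(known) for dim in zip(*known))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids

sentences = [
    ["disk", "crash"],
    ["login", "timeout"],
    ["server", "crash"],
    ["password", "login"],
]
vectors = [sentence_vector(s) for s in sentences]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

On real data a library implementation with several random restarts is the more usual choice, since a fixed seeding like this one can settle on a poor local optimum.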
