This project aims to reduce the need for complex pre-processing of Persian text by training word vectors with fastText (skip-gram method). The embeddings were trained on 3 million comments from the Digikala website and can be used for a wide range of problems. They can help analyze Persian user comments with high accuracy, especially in the digital goods section. No complicated pre-processing was used to create the embeddings; the text was cleaned with regular expressions only. Using these word vectors for analysis therefore does not require complex pre-processing specific to the Persian language. We used the generated word vectors for sentiment analysis of Digikala comments and obtained AUC = 0.9944 and F-score = 0.9288.

We first extracted 3 million comments from the digital goods section of the Digikala website using web-mining techniques. To protect Digikala's rights, the full dataset is not published, but a sample of 10,000 comments is available in this repository as "sample_dataset_10000.rar". The dataset was pre-processed with regular expressions to prepare it for training the word embeddings. The pre-processing function is available in "preprocess.py".

Next, the word vectors were trained with fastText on the 3 million comments; the resulting file is available as "DigiKalaEmbeddingVevtors.rar". The "test_fasttext.ipynb" notebook examines the generated embeddings: given some input words, it outputs the words with the nearest vectors.
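As an illustration of the regex-only cleaning described above, here is a minimal sketch. It is not the contents of "preprocess.py". The function name `clean_comment`, the character mappings, and the choice to keep only characters from the Arabic Unicode block are all assumptions made for this example.

```python
import re

# Assumed normalisation: map Arabic yeh (U+064A) and kaf (U+0643)
# to their Persian forms (U+06CC, U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Assumed rule: keep characters in the Arabic block (U+0600-U+06FF),
# the zero-width non-joiner (U+200C) and whitespace; replace everything else.
NOT_KEPT = re.compile(r"[^\u0600-\u06FF\u200c\s]")
WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Normalise a raw comment using only character mapping and regexes."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = NOT_KEPT.sub(" ", text)              # drop Latin letters, ASCII digits, punctuation
    return WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace
```

A cleaned corpus with one comment per line could then be passed to fastText for skip-gram training. The actual rules used in this repository may differ.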
Finally, the "modeling.ipynb" notebook describes the steps for data balancing, pseudo-labelling, and building and using the CNN model for sentiment analysis of Digikala comments.
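The pseudo-labelling step mentioned above can be sketched as follows. This shows a common confidence-threshold variant, which may differ from what "modeling.ipynb" does. The names `pseudo_label` and `predict_proba` and the 0.9 threshold are hypothetical. The scorer is assumed to return the probability that a comment is positive.

```python
from typing import Callable, Iterable, List, Tuple


def pseudo_label(
    unlabeled: Iterable[str],
    predict_proba: Callable[[str], float],
    threshold: float = 0.9,
) -> List[Tuple[str, int]]:
    """Return (comment, label) pairs for comments the model is confident about.

    Comments scored at or above `threshold` get label 1 (positive), comments
    scored at or below `1 - threshold` get label 0 (negative), and the rest
    are left out of the pseudo-labelled set.
    """
    selected = []
    for text in unlabeled:
        p = predict_proba(text)
        if p >= threshold:
            selected.append((text, 1))
        elif p <= 1.0 - threshold:
            selected.append((text, 0))
    return selected
```

The returned pairs would typically be merged with the manually labelled comments before retraining the classifier.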
mosiomohsen/persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling
Sentiment analysis in Persian language using fastText word embeddings and Convolutional Neural Network.