
Sentiment analysis in Persian language using fastText word embeddings and Convolutional Neural Network.


mosiomohsen/persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling

Latest commit: 3a799d0 · May 11, 2021 · 14 commits


persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling

This project aims to reduce the need for complex Persian-specific pre-processing by training word vectors with fastText (skip-gram method). The word embeddings were trained on 3 million comments from the Digikala website and can be reused for a wide range of problems, in particular for analyzing Persian user comments about digital goods. No complicated pre-processing was used to create the embeddings; the text was cleaned with regular expressions only. Using these word vectors for downstream analysis therefore does not require complex pre-processing specific to the Persian language. We used the word vectors for sentiment analysis of Digikala comments and obtained high accuracy (AUC = 0.9944, F-score = 0.9288).

We first extracted 3 million comments from the digital goods section of the Digikala website using web-mining techniques. To protect Digikala's rights, the dataset is not fully published, but a sample of 10,000 comments is available in this repository as "sample_dataset_10000.rar". The dataset was pre-processed with regular expressions to prepare it for training word embeddings; the pre-processing function is available in "preprocess.py". The word vectors were then trained with fastText on the 3 million comments and are available as "DigiKalaEmbeddingVevtors.rar". The "test_fasttext.ipynb" notebook examines the trained embeddings: given some input words, it returns the words with the nearest vectors.
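As a rough illustration of what regex-only cleaning for Persian comments can look like, here is a minimal sketch. It is not the code in "preprocess.py"; the function name `normalize_comment` and every rule in it (unifying Arabic and Persian letter variants, dropping diacritics, keeping only Arabic-script characters and the zero-width non-joiner) are assumptions chosen for the example.

```python
import re

# Hypothetical cleaning rules for illustration; the repository's
# preprocess.py may use different ones.

# Map Arabic yeh (U+064A) and kaf (U+0643) to their Persian forms.
_ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})
# Arabic diacritics (fathatan through sukun).
_DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Anything outside the Arabic-script block, the zero-width non-joiner, or whitespace.
_NON_PERSIAN = re.compile(r"[^\u0600-\u06FF\u200c\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_comment(text: str) -> str:
    """Clean one comment using character mapping and regular expressions only."""
    text = text.translate(_ARABIC_TO_PERSIAN)  # unify letter variants
    text = _DIACRITICS.sub("", text)           # drop diacritics
    text = _NON_PERSIAN.sub(" ", text)         # drop Latin letters, digits, symbols
    return _WHITESPACE.sub(" ", text).strip()  # collapse runs of whitespace
```

The output of a function like this, one cleaned comment per line, is the kind of plain-text corpus that fastText's unsupervised training expects as input.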
Finally, the "modeling.ipynb" notebook describes the steps for data balancing, pseudo-labelling, and building and using the CNN model for sentiment analysis of the Digikala comments.
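The balancing and pseudo-labelling steps mentioned above can be sketched in a few lines. This is only an outline under stated assumptions, not the procedure in "modeling.ipynb": the source does not say how the classes are balanced or which confidence threshold is used, so random undersampling and a 0.95 threshold are placeholders, and `predict_positive` is a hypothetical stand-in for the trained CNN's probability output.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (comment text, label: 1 = positive, 0 = negative)


def undersample(examples: List[Example], seed: int = 0) -> List[Example]:
    """Balance the two classes by randomly dropping examples from the larger one.

    Assumption: the notebook may balance the data differently.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for example in examples:
        by_label[example[1]].append(example)
    n = min(len(by_label[0]), len(by_label[1]))
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


def pseudo_label(
    unlabeled: List[str],
    predict_positive: Callable[[str], float],  # hypothetical model wrapper
    threshold: float = 0.95,                   # assumed confidence cut-off
) -> List[Example]:
    """Assign labels only to comments the model is confident about."""
    confident: List[Example] = []
    for text in unlabeled:
        p = predict_positive(text)
        if p >= threshold:
            confident.append((text, 1))
        elif p <= 1.0 - threshold:
            confident.append((text, 0))
    return confident
```

In a typical pseudo-labelling loop, the confident examples returned by `pseudo_label` are appended to the labelled set, the set is re-balanced, and the model is retrained.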
