persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling

This project aims to reduce the need for complex Persian-specific pre-processing by building word vectors with fastText (skip-gram method). The word embeddings were trained on 3 million comments from the Digikala website and can be reused for a wide range of problems. They are particularly useful for analyzing Persian user comments, especially those about digital goods. Only regular expressions were used to clean the text before training, so using these word vectors does not require complex pre-processing specific to the Persian language. We applied the word vectors to sentiment analysis of Digikala comments and obtained AUC = 0.9944 and F-score = 0.9288.

We first extracted 3 million comments from the digital goods section of the Digikala website using web-mining techniques. To protect Digikala's rights, the full dataset is not published, but a sample of 10,000 comments is available in this repository as “sample_dataset_10000.rar”. The dataset was pre-processed with regular expressions to prepare it for training the word embeddings; the pre-processing function is in "preprocess.py". The word vectors were then trained with fastText on the 3 million comments and are available as "DigiKalaEmbeddingVevtors.rar". The notebook "test_fasttext.ipynb" inspects the resulting embeddings: given some input words, it returns the words with the nearest vectors.
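As a rough illustration of what regex-only cleaning of Persian comments can look like, here is a minimal sketch. It is not the contents of "preprocess.py": the function name `clean_comment` and every rule in it (character normalization, URL and diacritic removal, the set of kept characters) are assumptions chosen for the example.

```python
import re

# Hypothetical cleaning rules for illustration; the repository's
# preprocess.py may use different ones.

# Map Arabic yeh and kaf to their Persian counterparts.
_ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
_URL = re.compile(r"https?://\S+|www\.\S+")
# Arabic short-vowel and related diacritic marks.
_DIACRITICS = re.compile(r"[\u064b-\u0652]")
# Keep the Arabic Unicode block (which covers Persian letters), Latin
# letters, ASCII digits, the zero-width non-joiner, and whitespace.
_NOT_ALLOWED = re.compile(r"[^\u0600-\u06ffa-zA-Z0-9\u200c\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Normalize one comment using only regular expressions."""
    text = text.translate(_ARABIC_TO_PERSIAN)
    text = _URL.sub(" ", text)
    text = _DIACRITICS.sub("", text)
    text = _NOT_ALLOWED.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()
```

Writing one cleaned comment per line to a text file would give a corpus in the plain-text form that fastText's unsupervised training reads.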
Finally, the "modeling.ipynb" file describes the steps for data balancing, pseudo-labelling, and building and using the CNN model for sentiment analysis of the Digikala website’s comments.
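The general idea of pseudo-labelling can be sketched as follows: train on the labelled data, predict the unlabelled data, move confident predictions into the training set, and repeat. This sketch deliberately swaps the CNN for a toy nearest-centroid classifier so that it stays self-contained; the function names, the `threshold` and `rounds` parameters, and the confidence measure are all assumptions and are not taken from "modeling.ipynb".

```python
from math import dist, exp


def centroids(X, y):
    """Return the mean vector of each class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict_proba(cents, x):
    """Softmax over negative distances to each class centroid."""
    scores = {c: exp(-dist(x, m)) for c, m in cents.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    """Grow the labelled set with confident predictions on unlabelled data.

    threshold and rounds are hypothetical settings for this sketch.
    """
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        cents = centroids(X, y)  # "retrain" on the current labelled set
        remaining = []
        for x in pool:
            proba = predict_proba(cents, x)
            label, conf = max(proba.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                X.append(x)      # accept the prediction as a pseudo-label
                y.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break                # nothing new was labelled; stop early
        pool = remaining
    return X, y, pool
```

With a neural model, the same loop would use the model's predicted class probabilities in place of `predict_proba`, and the examples left in the returned pool are the ones the model never labelled confidently.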

Latest commit: mosiomohsen, "Add files via upload", May 11, 2021 (3a799d0); 14 commits

Name	Last commit message	Last commit date
DK.part01.rar	Add files via upload	May 11, 2021
DK.part02.rar	Add files via upload	May 11, 2021
DK.part03.rar	Add files via upload	May 11, 2021
DK.part04.rar	Add files via upload	May 11, 2021
DK.part05.rar	Add files via upload	May 11, 2021
DigiKalaEmbeddingVevtors.rar	Add files via upload	Oct 31, 2020
NewDigi.rar	Add files via upload	May 11, 2021
README.md	Update README.md	Nov 3, 2020
modeling-lstm.ipynb	Add files via upload	Jan 31, 2021
modeling.ipynb	Add files via upload	Nov 3, 2020
preprocess.py	Add files via upload	Nov 3, 2020
sample_dataset_10000.rar	Add files via upload	Oct 31, 2020
test_fasttext.ipynb	Add files via upload	Oct 29, 2020

mosiomohsen/persian-sentiment-analysis-using-fastText-word-embedding-and-pseudo-labeling
