ADD Sentiment Analysis implementation

IonesioJunior · Aug 3, 2018 · 1e44824 · 1e44824
1 parent f4351bc
commit 1e44824
Show file tree

Hide file tree

Showing 2 changed files with 58,476 additions and 0 deletions.
diff --git a/Sentimental Analysis.ipynb b/Sentimental Analysis.ipynb
@@ -0,0 +1,378 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h1>Sentimental Analysis using Twitter messages</h1>\n",
+ "This notebook tries to explain step-by-step about the development of an algorithm that tries to classify the sentimental characteristics of some phrase using NaiveBayes concept"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "__author__ = \"Ionésio Junior\"\n",
+ "\n",
+ "import pandas as pd\n",
+ "import nltk, re\n",
+ "from collections import defaultdict\n",
+ "from string import punctuation as punct\n",
+ "from collections import OrderedDict\n",
+ "from nltk.classify.util import accuracy as eval_accuracy\n",
+ "from nltk.classify import NaiveBayesClassifier \n",
+ "from nltk.metrics import (BigramAssocMeasures, precision as eval_precision,\n",
+ " recall as eval_recall, f_measure as eval_f_measure)\n",
+ "from math import floor"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Data pre-processing</h2>\n",
+ "This snippet implements some functions to load and filter our data set.We need to remove links/hashtags and punctuation of our data before train some model.After that, we'll get token words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "stopwords = nltk.corpus.stopwords.words('portuguese')\n",
+ "\n",
+ "def filter_by_stopwords(word):\n",
+ " if word not in stopwords and word not in punct:\n",
+ " return True\n",
+ " else:\n",
+ " return False\n",
+ "\n",
+ "\n",
+ "def filter_dataset(data_text):\n",
+ " # Remove URLS / Hashtags / links\n",
+ " data_text = re.sub(r'@\\S+', '', data_text)\n",
+ " data_text = re.sub(r'http\\S+', '', data_text)\n",
+ " data_text = re.sub(r'#\\S+', '', data_text)\n",
+ "\n",
+ " # Filter stop words and extract tokens\n",
+ " tokens = list( filter( lambda word: filter_by_stopwords(word), nltk.word_tokenize( data_text.lower() ) ))\n",
+ " return tokens"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Structuring our data set</h2>\n",
+ "Now, we need to put and organize our data set in a \"bag of words\" structure and label the bags.After that, we'll separate dataset by label (pos / neg) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def build_bag_of_words(tweet_text):\n",
+ " ''' \n",
+ " Construct an abstraction of concept \"bag of words\" to each tweet\n",
+ " Args:\n",
+ " Tweet_text(String) : text of tweet message\n",
+ " Return:\n",
+ " {Word:Boolean} : Bag of words\n",
+ " '''\n",
+ " return { word:True for word in filter_dataset(tweet_text) }\n",
+ "\n",
+ "def extract_labels(dataset):\n",
+ " '''\n",
+ " Extract labels and filter dataset\n",
+ " \n",
+ " Params:\n",
+ " DataSet(DataFrame) : Set of tweets previously labeled\n",
+ " Return:\n",
+ " (RotuloPositivo, RotuloNegativo) : Separated/filtered labels \n",
+ " '''\n",
+ " positive_label = dataset[dataset.sentiment == 1].text\n",
+ " negative_label = dataset[dataset.sentiment == 0].text\n",
+ " filtered_positive_label = [ (build_bag_of_words(tweet),\"pos\") for tweet in positive_label ]\n",
+ " filtered_negative_label = [ (build_bag_of_words(tweet),\"neg\") for tweet in negative_label ]\n",
+ " return (filtered_positive_label, filtered_negative_label)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Training</h2>\n",
+ "After filtering,structuring and labeling our dataset, we can train our classifier. But, for test principles we'll divide our data (70% to train and 30% to test)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def train_model(dataset):\n",
+ " '''\n",
+ " Build and train a model of classfier\n",
+ " \n",
+ " Args:\n",
+ " DataSet(DataFrame) : data set to be used by classifier\n",
+ " Return:\n",
+ " classifier : trained classfier\n",
+ " '''\n",
+ " # Extracting filtered and labeled text data\n",
+ " positive_label, negative_label = extract_labels( dataset )\n",
+ " \n",
+ " # Spliting data set (70% train / 30% test)\n",
+ " dataset_size = len(positive_label)\n",
+ " train_set = positive_label[:floor(dataset_size * 0.7)] + negative_label[:floor(dataset_size * 0.7)]\n",
+ " test_set = positive_label[floor(dataset_size * 0.7):] + negative_label[floor(dataset_size * 0.7):]\n",
+ " \n",
+ " # Training our model\n",
+ " classifier = NaiveBayesClassifier.train(train_set)\n",
+ " return (classifier, test_set)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Evaluate Classifier</h2>\n",
+ "Now, we need to develop some function to measure accuracy and others metrics of our classifier using test set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def evaluate(test_set, classifier=None, accuracy=True, f_measure=True,\n",
+ " precision=True, recall=True, verbose=False):\n",
+ " \"\"\"\n",
+ " Evaluate and print classifier performance on the test set.\n",
+ "\n",
+ " :param test_set: A list of (tokens, label) tuples to use as gold set.\n",
+ " :param classifier: a classifier instance (previously trained).\n",
+ " :param accuracy: if `True`, evaluate classifier accuracy.\n",
+ " :param f_measure: if `True`, evaluate classifier f_measure.\n",
+ " :param precision: if `True`, evaluate classifier precision.\n",
+ " :param recall: if `True`, evaluate classifier recall.\n",
+ " :return: evaluation results.\n",
+ " :rtype: dict(str): float\n",
+ " \"\"\"\n",
+ " if classifier is None:\n",
+ " classifier = classifier\n",
+ " print(\"=== Evaluating {0} results... ===\".format(type(classifier).__name__))\n",
+ " metrics_results = {}\n",
+ " if accuracy == True:\n",
+ " accuracy_score = eval_accuracy(classifier, test_set)\n",
+ " metrics_results['Accuracy'] = accuracy_score\n",
+ "\n",
+ " gold_results = defaultdict(set)\n",
+ " test_results = defaultdict(set)\n",
+ " labels = set()\n",
+ " for i, (feats, label) in enumerate(test_set):\n",
+ " labels.add(label)\n",
+ " gold_results[label].add(i)\n",
+ " observed = classifier.classify(feats)\n",
+ " test_results[observed].add(i)\n",
+ "\n",
+ " for label in labels:\n",
+ " if precision == True:\n",
+ " precision_score = eval_precision(gold_results[label],\n",
+ " test_results[label])\n",
+ " metrics_results['Precision [{0}]'.format(label)] = precision_score\n",
+ " if recall == True:\n",
+ " recall_score = eval_recall(gold_results[label],\n",
+ " test_results[label])\n",
+ " metrics_results['Recall [{0}]'.format(label)] = recall_score\n",
+ " if f_measure == True:\n",
+ " f_measure_score = eval_f_measure(gold_results[label],\n",
+ " test_results[label])\n",
+ " metrics_results['F-measure [{0}]'.format(label)] = f_measure_score\n",
+ "\n",
+ " # Print evaluation results (in alphabetical order)\n",
+ " if verbose == True:\n",
+ " for result in sorted(metrics_results):\n",
+ " print('{0}: {1}'.format(result, metrics_results[result]))\n",
+ "\n",
+ " return metrics_results"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Apply trained classifier on other text</h2>\n",
+ "Here, we'll get a new text input and produce a positive/negative label using our trained classifier"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def classify(classifier, text):\n",
+ " probabilities = classifier.prob_classify(build_bag_of_words(text))\n",
+ " predicted = probabilities.max()\n",
+ " print(text + \"[ \" + \"{:.2f}\".format(probabilities.prob(predicted)*100)+ \"% \" + predicted.capitalize() + \"]\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Testing with simple inputs</h2>\n",
+ "This snippet tests the classifier using simple texts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 89,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def test_classifier_on_simple_texts(classifier):\n",
+ " print(\"=== Applying the classifier on other phrases ===\")\n",
+ " classify(classifier, \"Hoje é um dia triste!\")\n",
+ " classify(classifier, \"Sorrir faz bem para a alma!\")\n",
+ " classify(classifier, \"Dançar é muito bom!\")\n",
+ " classify(classifier, \"Adoro correr!\")\n",
+ " classify(classifier, \"Odeio tomar café frio\")\n",
+ " classify(classifier, \"Sorria, você está sendo filmado!\")\n",
+ " print(\"\\n== Outlier ==\")\n",
+ " classify(classifier, \"Não grite comigo!\")\n",
+ " classify(classifier, \"Adoro ver os outros perderem!\")\n",
+ " classify(classifier, \"Vamos destruir o preconceito!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Using the classifier on another tweets</h2>\n",
+ "This snippet label ***#MeuPaiNãoSabeMas*** posts on twitter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 90,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def test_classifier_on_tweets(classifier):\n",
+ " print(\"=== Applying the classifier on tweets ===\")\n",
+ " classify(classifier, \"#MeuPaiNãoSabeMas eu sinto saudade de ele me chamar de princesa\")\n",
+ " classify(classifier, \"#MeuPaiNaoSabeMas quando ele grita comigo eu choro\")\n",
+ " classify(classifier, \"#MeuPaiNaoSabeMas amo ele e tenho muito orgulho de ser filho dele\")\n",
+ " classify(classifier, \"#MeuPaiNãoSabeMas a falta dele ainda machuca muito\")\n",
+ " classify(classifier, \"#MeuPaiNãoSabeMas quando ele pedia pra pegar mais uma cerveja na geladeira eu dava um gole so pra saber qual era o gosto\")\n",
+ " classify(classifier, \"#MeuPaiNãoSabeMas ele é minha maior inspiração\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Let's put it all together</h2>"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 91,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "=== Evaluating NaiveBayesClassifier results... ===\n",
+ "Accuracy:0.74\n",
+ "Precision [neg]:0.69\n",
+ "Recall [neg]:0.74\n",
+ "F-measure [neg]:0.72\n",
+ "Precision [pos]:0.78\n",
+ "Recall [pos]:0.74\n",
+ "F-measure [pos]:0.76\n",
+ "\n",
+ "\n",
+ "\n",
+ "=== Applying the classifier on other phrases ===\n",
+ "Hoje é um dia triste![ 82.22% Neg]\n",
+ "Sorrir faz bem para a alma![ 87.24% Pos]\n",
+ "Dançar é muito bom![ 84.48% Pos]\n",
+ "Adoro correr![ 93.07% Pos]\n",
+ "Odeio tomar café frio[ 81.79% Neg]\n",
+ "Sorria, você está sendo filmado![ 86.51% Pos]\n",
+ "\n",
+ "== Outlier ==\n",
+ "Não grite comigo![ 78.16% Pos]\n",
+ "Adoro ver os outros perderem![ 95.51% Pos]\n",
+ "Vamos destruir o preconceito![ 94.10% Neg]\n",
+ "\n",
+ "\n",
+ "\n",
+ "#MeuPaiNãoSabeMas eu sinto saudade de ele me chamar de princesa[ 97.63% Neg]\n",
+ "#MeuPaiNaoSabeMas quando ele grita comigo eu choro[ 83.00% Neg]\n",
+ "#MeuPaiNaoSabeMas amo ele e tenho muito orgulho de ser filho dele[ 64.00% Pos]\n",
+ "#MeuPaiNãoSabeMas a falta dele ainda machuca muito[ 78.77% Neg]\n",
+ "#MeuPaiNãoSabeMas quando ele pedia pra pegar mais uma cerveja na geladeira eu dava um gole so pra saber qual era o gosto[ 65.18% Pos]\n",
+ "#MeuPaiNãoSabeMas ele é minha maior inspiração[ 77.65% Pos]\n"
+ ]
+ }
+ ],
+ "source": [
+ "if __name__ == \"__main__\":\n",
+ " # Load data set\n",
+ " dataset = pd.read_csv('database/db.csv',encoding='utf-8', sep='\\t')\n",
+ " # Train data set\n",
+ " classifier, test_set = train_model(dataset)\n",
+ " #Test data set\n",
+ " test_results = evaluate(test_set,classifier)\n",
+ " [ print(str(key) + \":\" + \"%.2f\" % test_results[key]) for key in test_results.keys() ]\n",
+ " print(\"\\n\\n\")\n",
+ " #Apply classifier on other phrases\n",
+ " test_classifier_on_simple_texts(classifier)\n",
+ " print(\"\\n\\n\")\n",
+ " #Apply classifier on another tweets\n",
+ " test_classifier_on_tweets(classifier);"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "<h2>Analysis of results</h2>\n",
+ "The model was relatively accurate in our manual queries. However, as we can see from generating the evaluation metrics using the test set, we still have relatively low ones. Possible suggestions for improving the classifier would be to increase the data set and perform more advanced preprocessing."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}