Streaming tweets with Kafka running on a Databricks Spark cluster, performing basic sentiment classification on the tweets, and visualizing the results with Kibana on top of Elasticsearch hosted on an AWS EC2 instance.
Everything runs on a combination of Databricks and an AWS EC2 instance. Now let's dive into the details.
- The Databricks platform is used to set up Kafka on Spark clusters.
- The ZooKeeper instance and the Kafka broker run in two separate notebooks.
- One notebook is used for the producer, where the scraper runs.
- Another notebook is used for the consumer, where sentiment analysis is performed on the tweets.
- The scraper collects tweets using the Tweepy API and sends them to Kafka for analytics.
- Tweets are collected in real time for particular hashtags; for example, all tweets with #trump or #coronavirus.
- After filtering, they are sent to Kafka.
- The scraper program runs indefinitely and takes the hashtag as an input parameter (a minimal producer sketch is shown after this overview).
- In Spark Streaming, we create a Kafka consumer and periodically collect the filtered tweets from the scraper.
- This is done on the Databricks cluster.
- For each hashtag, we perform basic sentiment analysis to classify the tweets as neutral, negative, or positive (a consumer sketch is shown after this overview).
- Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker.
- For example,
  - “President Donald Trump approaches his first big test this week from a position of unusual weakness.” - has positive sentiment.
  - “Trump has the lowest standing in public opinion of any new president in modern history.” - has neutral sentiment.
  - “Trump has displayed little interest in the policy itself, casting it as a thankless chore to be done before getting to tax-cut legislation he values more.” - has negative sentiment.
- We need to install and run Elasticsearch to store the tweets and their sentiment information for later visualization.
- Elasticsearch is installed on an AWS EC2 instance and accessed over its public IP for saving the sentiment information.
- Kibana is a visualization tool that can explore the data stored in Elasticsearch.
- Kibana is installed on the same AWS EC2 instance.
- Instead of printing the results directly, we use the visualization tool to show the tweet sentiment classification results in real time.
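- To make the producer side concrete, here is a minimal sketch of the scraper notebook, assuming Tweepy 3.x and kafka-python; the Twitter credentials, the topic name "tweets", the broker address, and the default hashtag are placeholder assumptions, not values taken from this repo.

# Producer: stream tweets for a hashtag and forward the raw JSON to Kafka.
import sys
import tweepy
from kafka import KafkaProducer

CONSUMER_KEY = "..."        # placeholder Twitter API credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

producer = KafkaProducer(bootstrap_servers="localhost:9092")

class TweetListener(tweepy.StreamListener):
    def on_data(self, data):
        # each tweet arrives as a JSON string; push it to the Kafka topic as-is
        producer.send("tweets", data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        # returning False disconnects the stream on errors such as rate limiting
        return False

if __name__ == "__main__":
    hashtag = sys.argv[1] if len(sys.argv) > 1 else "#coronavirus"
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    # filter() blocks and runs indefinitely, matching the note above
    tweepy.Stream(auth, TweetListener()).filter(track=[hashtag])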
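- On the consumer side, here is a hedged sketch of the Databricks notebook: read the Kafka topic with Spark Structured Streaming, label each tweet with TextBlob polarity (a simple stand-in classifier, not necessarily the exact one used here), and index the result into Elasticsearch over plain HTTP. The topic name "tweets", the index/type names, and the EC2 public IP are placeholders; `spark` is the SparkSession that Databricks notebooks provide.

# Consumer: Kafka -> basic sentiment label -> Elasticsearch.
import json
import requests
from textblob import TextBlob

ES_HOST = "http://<ec2-public-ip>:9200"   # placeholder public address of the EC2 instance

def classify(text):
    # polarity > 0 -> positive, < 0 -> negative, otherwise neutral
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "tweets")
       .option("startingOffsets", "latest")
       .load())

tweets = raw.selectExpr("CAST(value AS STRING) AS value")

def index_batch(batch_df, batch_id):
    # runs on the driver for each micro-batch; fine for a demo-sized stream
    for row in batch_df.collect():
        tweet = json.loads(row["value"])
        text = tweet.get("text", "")
        doc = {"text": text, "sentiment": classify(text)}
        # POST to /<index>/<type> auto-generates a document id (Elasticsearch 1.x API)
        requests.post(ES_HOST + "/tweets/tweet", json=doc)

query = tweets.writeStream.foreachBatch(index_batch).start()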
- Kafka on Databricks
- Set up an older 6.x Spark cluster on Databricks (the 7.x runtimes do not support these Kafka instances and lead to fatal errors, sadly).
- Create a ZooKeeper instance:
%sh
# Check Java, then download and unpack Kafka 0.10.2.2 (ZooKeeper is bundled with it)
java -version
ls -ltr ./
wget https://archive.apache.org/dist/kafka/0.10.2.2/kafka_2.10-0.10.2.2.tgz
tar -xzf kafka_2.10-0.10.2.2.tgz
# Start ZooKeeper; this cell keeps running, so start the Kafka broker from a separate notebook
kafka_2.10-0.10.2.2/bin/zookeeper-server-start.sh kafka_2.10-0.10.2.2/config/zookeeper.properties
- Create a Kafka server instance:
%sh kafka_2.10-0.10.2.2/bin/kafka-server-start.sh kafka_2.10-0.10.2.2/config/server.properties
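- With ZooKeeper and the broker running, you can optionally create the topic and confirm the broker answers from a Python notebook on the same cluster. This is a hedged sketch using kafka-python; the topic name "tweets" is an assumption, and this Kafka version will also auto-create topics on first use by default.

# Optional sanity check: create the "tweets" topic and list what the broker knows about.
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
try:
    admin.create_topics([NewTopic(name="tweets", num_partitions=1, replication_factor=1)])
except Exception as exc:
    # the topic may already exist, e.g. auto-created by an earlier producer run
    print("create_topics:", exc)

print(KafkaConsumer(bootstrap_servers="localhost:9092").topics())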
- AWS EC2 instance with Elasticsearch and Kibana.
- Video Tutorial : https://www.youtube.com/watch?v=SbeleFxyvu8&feature=emb_title
Step 1: Install Java and its dependencies
- SSH into the instance
java -version
sudo yum -y install java-1.8.0-openjdk
sudo yum -y remove java-1.7.0-openjdk
java -version
Step 2: Install Elasticsearch
sudo su
yum update -y
cd /root
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.2.noarch.rpm
yum install elasticsearch-1.7.2.noarch.rpm -y
rm -f elasticsearch-1.7.2.noarch.rpm
Step 3: Start the server
service elasticsearch start
Step 4: Start Elasticsearch automatically on boot
- Use the chkconfig command to configure Elasticsearch to start automatically when the system boots up
sudo chkconfig --add elasticsearch
Step 5: Configure the bind address so Elasticsearch is reachable on the public IP
- Dirty hack to make it work; don't do this in production.
echo "network.host: 0.0.0.0" >> /etc/elasticsearch/elasticsearch.yml
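- After changing the bind address, restart Elasticsearch (service elasticsearch restart) so the setting takes effect, and make sure port 9200 is open in the instance's security group. Below is a small hedged check from any machine; the public IP and the "tweets" index/type are placeholders.

# Confirm Elasticsearch answers on the public IP and try indexing one document.
import requests

ES_HOST = "http://<ec2-public-ip>:9200"

info = requests.get(ES_HOST)
print(info.status_code, info.json())   # cluster name and version if reachable

# Elasticsearch 1.x API: POST /<index>/<type> auto-generates a document id
resp = requests.post(ES_HOST + "/tweets/tweet",
                     json={"text": "hello from the pipeline", "sentiment": "neutral"})
print(resp.json())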
Step 6: Install plugins
- Dirty hack to make it work; don't do this in production.
cd /usr/share/elasticsearch/
./bin/plugin -install mobz/elasticsearch-head
./bin/plugin -install lukas-vlcek/bigdesk
./bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.7.1
./bin/plugin -install lmenezes/elasticsearch-kopf/1.5.7
Step 7: Install Kibana
- Dirty hack to make it work; don't do this in production.
sudo su
yum update -y
cd /root
wget https://download.elastic.co/kibana/kibana/kibana-4.1.2-linux-x64.tar.gz
tar xzf kibana-4.1.2-linux-x64.tar.gz
rm -f kibana-4.1.2-linux-x64.tar.gz
cd kibana-4.1.2-linux-x64
# Edit config/kibana.yml: set the host Kibana listens on and the Elasticsearch URL it should connect to
nano config/kibana.yml
nohup ./bin/kibana &
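- Kibana listens on port 5601 by default, so open that port in the security group as well and browse to http://<ec2-public-ip>:5601 to build dashboards. A tiny hedged check that it is serving:

# Kibana reachability check; the public IP is a placeholder.
import requests
print(requests.get("http://<ec2-public-ip>:5601").status_code)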
- https://github.com/soumilshah1995/AWS-Elastic-Search-and-kibana-Deploy
- https://docs.databricks.com/data/data-sources/elasticsearch.html
- https://www.elastic.co/guide/en/elasticsearch/hadoop/current/float.html
- https://stackoverflow.com/questions/40766301/run-elastic-search-as-root-user
- https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html