Big-Data-final-project

About Me 👩

Link for Databricks published notebook 📓

Source of text used

Process the text using Databricks Community Edition and PySpark.

Using PySpark for natural language processing (a compact end-to-end sketch follows this list):

  • Cleaning
  • Processing
  • Charting
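
A compact sketch of the whole pipeline, assuming a Databricks notebook where sc and dbutils are already defined; each step is explained in detail in the sections below.

import re
import urllib.request
from pyspark.ml.feature import StopWordsRemover

# Download the text and move it into DBFS (same steps as below)
urllib.request.urlretrieve("https://www.gutenberg.org/files/6130/6130-0.txt", "/tmp/gutenberg.txt")
dbutils.fs.mv("file:/tmp/gutenberg.txt", "dbfs:/data/gutenberg.txt")

stopWords = set(StopWordsRemover().getStopWords())

top20 = (sc.textFile("dbfs:/data/gutenberg.txt")
           .flatMap(lambda line: line.lower().strip().split(" "))   # lines -> words
           .map(lambda w: re.sub(r'[^a-zA-Z]', '', w))              # keep letters only
           .filter(lambda w: w and w not in stopWords)              # drop empties and stopwords
           .map(lambda w: (w, 1))                                   # intermediate (word, 1) pairs
           .reduceByKey(lambda a, b: a + b)                         # (word, count)
           .takeOrdered(20, key=lambda wc: -wc[1]))                 # top 20 by count
print(top20)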

Cleaning

  • NLP stopword removal
  • Remove all non-letters
  • Fetch the text from the URL
import urllib.request
urllib.request.urlretrieve("https://www.gutenberg.org/files/6130/6130-0.txt", "/tmp/gutenberg.txt")

  • Use dbutils.fs.mv to move the file into DBFS, storing it as gutenberg.txt
dbutils.fs.mv("file:/tmp/gutenberg.txt", "dbfs:/data/gutenberg.txt")

# Read the text from DBFS as an RDD of lines
rawRDD = sc.textFile("dbfs:/data/gutenberg.txt")
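
A quick optional sanity check, using the same path and rawRDD defined above, to confirm the file is in DBFS and the lines loaded:

display(dbutils.fs.ls("dbfs:/data/"))   # gutenberg.txt should be listed
print(rawRDD.count())                   # total number of lines
print(rawRDD.take(5))                   # preview the first few lines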

Processing

  • flatMap() from one to many (one line of text to many words)
  • Cast everything to lowercase
  • Strip whitespace from the beginning and end, then split on our delimiter (a space)
  • map() into intermediate key-value pairs
  • filter() to keep only the records we want
  • reduceByKey() to get the count
  • flatMap each line to words:
wordsRDD = rawRDD.flatMap(lambda line : line.lower().strip().split(" "))

  • Remove punctuation and any other non-letter characters
  • Filter out stopwords and empty strings
  • map() each word to a (word, 1) intermediate key-value pair:
import re
from pyspark.ml.feature import StopWordsRemover

# Strip every character that is not a letter
cleanTokenRDD = wordsRDD.map(lambda w: re.sub(r'[^a-zA-Z]', '', w))

# Filter out English stopwords and the empty strings left over after cleaning
remover = StopWordsRemover()
stopWords = remover.getStopWords()
cleanWordsRDD = cleanTokenRDD.filter(lambda w: w not in stopWords)
FinalcleanWordsRDD = cleanWordsRDD.filter(lambda x: x != "")

# Map each word to an intermediate (word, 1) key-value pair
IKVPairsRDD = FinalcleanWordsRDD.map(lambda word: (word, 1))
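
An optional spot check of the cleaned tokens before counting, using the RDDs defined just above:

print(FinalcleanWordsRDD.take(10))   # a handful of cleaned, lowercase words
print(IKVPairsRDD.take(5))           # (word, 1) pairs ready for reduceByKey()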

  • reduceByKey() to get (word, count) results
wordsCountRDD = IKVPairsRDD.reduceByKey(lambda acc, value: acc+value)

# Swap to (count, word) so sortByKey() sorts by count (descending), then take the top 20
results = wordsCountRDD.map(lambda x: (x[1], x[0])).sortByKey(False).take(20)
print(results)

  • collect() action to bring the full (word, count) list back to the driver as a Python list
finalresults = wordsCountRDD.collect()
print(finalresults)

Charting 📊

  • The chart below is built with matplotlib and seaborn from the top-20 results.
  • Each bar is one of the top 20 words, ordered from most to least frequent.
  • The words appear on the Y-axis.
  • The count (number of occurrences) appears on the X-axis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

source = 'The Project Gutenberg eBook of The Iliad, by Homer'
title = 'Top 20 Words in ' + source
xlabel = 'Count (no. of occurrences)'
ylabel = 'Words'

# create Pandas dataframe from list of tuples
df = pd.DataFrame.from_records(results, columns =[xlabel, ylabel]) 
print(df)

# create plot (using matplotlib)
plt.figure(figsize=(15,5))
sns.barplot(x=xlabel, y=ylabel, data=df, palette="rocket").set_title(title)
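
One possible way to finish and render the figure; plt.show() works in most notebooks, and in Databricks the figure can also be passed to display():

plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.tight_layout()
plt.show()   # or: display(plt.gcf()) in a Databricks notebook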

Results

  • Results for the top 20 words
  • Results as a bar graph
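
One possible way to render the top-20 table inside Databricks, assuming the results list produced above (each entry is a (count, word) tuple) and the notebook-provided spark session and display():

# Build a small Spark DataFrame from the collected top-20 list and show it as a table
topWordsDF = spark.createDataFrame(results, ["count", "word"])
display(topWordsDF)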

References
