GitHub - irinaplacintagit/toyota-spark-streaming: Spark application doing stats on the IMDB dataset

Overview

The project computes some statistics on the IMDB dataset and is built using Spark.

As the data is static, Spark batch mode has been used.

To convert this into a streaming project, the input folders could be treated as streaming sources and the reads changed to use streams instead. This was attempted: a timestamp column was added to the datasets to simulate a real-time streaming application. Answering the first requirement, however, needs a stream-stream join; the complexity grew from that point and, after double-checking with Paul, I left this as a batch project.
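For illustration only, the streaming variant described above could be sketched roughly as follows. This is not the code that was attempted in this repository. It assumes the ratings and title files were each extended with a trailing timestamp column, here hypothetically named eventTime, and it joins ratings to titles on tconst purely as an example, since the actual first requirement is not spelled out here. The watermark delays and the one-hour join window are arbitrary placeholder values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types._

object StreamingJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("imdb-streaming-sketch")
      .master("local[*]")
      .getOrCreate()

    // File-based streams need an explicit schema. The eventTime column is an
    // assumption: it stands for the timestamp column added to the datasets.
    val ratingsSchema = new StructType()
      .add("tconst", StringType)
      .add("averageRating", DoubleType)
      .add("numVotes", IntegerType)
      .add("eventTime", TimestampType)

    val basicsSchema = new StructType()
      .add("tconst", StringType)
      .add("titleType", StringType)
      .add("primaryTitle", StringType)
      .add("originalTitle", StringType)
      .add("isAdult", StringType)
      .add("startYear", StringType)
      .add("endYear", StringType)
      .add("runtimeMinutes", StringType)
      .add("genres", StringType)
      .add("eventTime", TimestampType)

    // Each folder is watched for new files instead of being read once.
    val ratings = spark.readStream
      .schema(ratingsSchema)
      .option("sep", "\t")
      .option("header", "true")
      .csv("src/main/resources/title.ratings/")
      .withColumnRenamed("tconst", "ratingTconst")
      .withColumnRenamed("eventTime", "ratingTime")
      .withWatermark("ratingTime", "10 minutes")

    val titles = spark.readStream
      .schema(basicsSchema)
      .option("sep", "\t")
      .option("header", "true")
      .csv("src/main/resources/title.basics/")
      .withColumnRenamed("tconst", "titleTconst")
      .withColumnRenamed("eventTime", "titleTime")
      .withWatermark("titleTime", "10 minutes")

    // Stream-stream inner join: the key match plus a time-range condition,
    // which together with the watermarks lets Spark discard old join state.
    val joined = ratings.join(
      titles,
      expr("""
        ratingTconst = titleTconst AND
        ratingTime >= titleTime - interval 1 hour AND
        ratingTime <= titleTime + interval 1 hour
      """)
    )

    val query = joined.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Both sides have to be buffered as streaming state until the watermark allows rows to be dropped, and any aggregation layered on top of the join adds further constraints, which is one source of the extra complexity mentioned above.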

How to run

Download the IMDB dataset, unzip the following files, and place them as follows:

name.basics.tsv into src/main/resources/name.basics/
title.basics.tsv into src/main/resources/title.basics/
title.principals.tsv into src/main/resources/title.principals/
title.ratings.tsv into src/main/resources/title.ratings/
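Once the files are in place, a batch read of one of these folders might look like the sketch below. It is a minimal example under stated assumptions, not necessarily what SparkMain.scala does: the object name is hypothetical, and the options reflect the IMDB file format (tab-separated, with a header row and `\N` marking missing values).

```scala
import org.apache.spark.sql.SparkSession

object BatchReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("imdb-batch-read-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read every TSV file in the folder as one static DataFrame.
    val ratings = spark.read
      .option("sep", "\t")           // IMDB files are tab-separated
      .option("header", "true")      // first line holds the column names
      .option("nullValue", "\\N")    // IMDB writes missing values as \N
      .option("inferSchema", "true") // costs an extra pass over the data
      .csv("src/main/resources/title.ratings/")

    ratings.printSchema()
    ratings.show(5, truncate = false)

    spark.stop()
  }
}
```

Keeping each file in its own folder means the same path can be handed to a batch read or, later, to a streaming read without further changes.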

Prerequisite

sbt
IntelliJ

Instructions

Import the project into IntelliJ or your IDE of choice
Run sbt compile
Run SparkMain.scala

