Sparkify is a new online music streaming app. They have collected user activity data and want to analyze which songs their users are listening to by running analytical queries against that data. At the moment the information is stored in JSON log files and JSON metadata files, and there is no easy way to query it.
As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.
The solution designed for this problem is an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables, allowing the analytics team to continue finding insights into what songs their users are listening to.
- AWS
- Python 3.6 with the pandas and PySpark (Apache Spark) libraries
- Jupyter Notebooks
- SQL
- Python
- JSON
- Song Dataset: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
- Log Dataset: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
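For orientation, the files in the two datasets are typically laid out under prefixes like the ones sketched below; the bucket name and exact paths are assumptions for illustration, not values taken from this project's configuration:

```text
song_data/A/B/C/TRABCEI128F424C983.json    # one song/artist metadata record per file
log_data/2018/11/2018-11-12-events.json    # one activity event per line, grouped by date
```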
- Create an AWS IAM user with programmatic access and attach a policy granting S3 read and write permissions
- Run the `Run From Udacity.ipynb` Jupyter notebook
- Download the files
- Modify `dl.cfg` with your AWS access key ID and secret access key (see the configuration sketch after this list)
- Run `etl.py`
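A minimal sketch of what `dl.cfg` might contain, assuming a standard INI layout readable with `configparser` (the section and key names below are assumptions, not copied from the project):

```ini
[AWS]
AWS_ACCESS_KEY_ID = <your access key id>
AWS_SECRET_ACCESS_KEY = <your secret access key>
```

Inside `etl.py`, the credentials can then be loaded and exported as environment variables so that Spark's S3 connector can pick them up, for example:

```python
import os
import configparser

# Read AWS credentials from dl.cfg (section/key names assumed above)
config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']
```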
- The database is designed as a star schema with one fact table and several dimension tables.
- The program first writes the data into the `songs`, `artists`, `users`, and `time` dimension tables as parquet files, and then queries these tables to populate the `songplays` fact table parquet files (see the sketch at the end of this section).
- The S3 paths of the source files are kept in the configuration file.
- To read the JSON files containing song records: `df = spark.read.json(song_data)`
- To read the log JSON files (one JSON object per line): `df = spark.read.json(log_data)`
- After transforming the song and log JSON data into PySpark DataFrames, the tables are written out to S3 as parquet files.
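As a rough illustration of the flow described above, here is a minimal sketch of how the dimension tables and the `songplays` fact table might be built and written to S3 with PySpark. The input/output paths, column selections, and join keys are assumptions based on the schema described in this README, not the exact contents of `etl.py`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Paths are placeholders; in the project they come from the configuration file
song_data = "s3a://<input-bucket>/song_data/*/*/*/*.json"
log_data = "s3a://<input-bucket>/log_data/*/*/*.json"
output_data = "s3a://<output-bucket>/"

# Dimension tables derived from the song metadata
song_df = spark.read.json(song_data)
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs")

artists_table = song_df.select("artist_id", "artist_name", "artist_location").dropDuplicates()
artists_table.write.mode("overwrite").parquet(output_data + "artists")

# Dimension tables derived from the activity logs (keep only actual song plays)
log_df = spark.read.json(log_data).filter(F.col("page") == "NextSong")
users_table = log_df.select("userId", "firstName", "lastName", "gender", "level").dropDuplicates()
users_table.write.mode("overwrite").parquet(output_data + "users")

log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
time_table = log_df.select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
).dropDuplicates()
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time")

# Fact table: join the logs back to the song metadata to resolve song_id and artist_id
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
        "left",
    )
    .select(
        "start_time", "userId", "level", "song_id", "artist_id",
        "sessionId", "location", "userAgent",
        F.year("start_time").alias("year"),
        F.month("start_time").alias("month"),
    )
)
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays")
```

Partitioning `songs`, `time`, and `songplays` by year (and month or artist) keeps the parquet output organized for the kinds of analytical queries the team wants to run.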