Sparkify is a new online music streaming app. They have collected user activity data and want to analyze which songs their users are listening to by running analytical queries against that data. At the moment the information is stored in JSON log files and JSON metadata files, and there is no easy way to query it.
As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.
The solution designed for this problem is an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables, allowing the analytics team to continue finding insights into what songs their users are listening to.
- AWS
- Python 3.6 with the pandas and PySpark (Apache Spark) libraries
- Jupyter Notebooks
- SQL
- Python
- JSON
- Song Dataset: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
- Log Dataset: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
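For orientation, the files in the two datasets are typically laid out under prefixes like the ones sketched below; the bucket name and exact paths are assumptions for illustration, not values taken from this project's configuration:

```text
song_data/A/B/C/TRABCEI128F424C983.json    # one song/artist metadata record per file
log_data/2018/11/2018-11-12-events.json    # one activity event per line, grouped by date
```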
- Create an AWS IAM user with programmatic access and attach a policy granting S3 read and write permissions
- Run the `Run From Udacity.ipynb` Jupyter notebook
- Download the files
- Modify `dl.cfg` with your AWS access key ID and secret access key (see the configuration sketch after this list)
- Run `etl.py`
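A minimal sketch of what `dl.cfg` might contain, assuming a standard INI layout readable with `configparser` (the section and key names below are assumptions, not copied from the project):

```ini
[AWS]
AWS_ACCESS_KEY_ID = <your access key id>
AWS_SECRET_ACCESS_KEY = <your secret access key>
```

Inside `etl.py`, the credentials can then be loaded and exported as environment variables so that Spark's S3 connector can pick them up, for example:

```python
import os
import configparser

# Read AWS credentials from dl.cfg (section/key names assumed above)
config = configparser.ConfigParser()
config.read('dl.cfg')

os.environ['AWS_ACCESS_KEY_ID'] = config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = config['AWS']['AWS_SECRET_ACCESS_KEY']
```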
- The database is designed as a star schema with one fact table and several dimension tables.
- The program first writes the data into the `songs`, `artists`, `users`, and `time` dimension tables as parquet files, and then queries these tables to populate the `songplays` fact table parquet files (see the sketch at the end of this section).
- The S3 paths of the source files are kept in the configuration file.
- To read the JSON files containing song records: `df = spark.read.json(song_data)`
- To read the log JSON files (one JSON object per line): `df = spark.read.json(log_data)`
- After transforming the song and log JSON data into PySpark DataFrames, the tables are written out to S3 as parquet files.
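As a rough illustration of the flow described above, here is a minimal sketch of how the dimension tables and the `songplays` fact table might be built and written to S3 with PySpark. The input/output paths, column selections, and join keys are assumptions based on the schema described in this README, not the exact contents of `etl.py`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Paths are placeholders; in the project they come from the configuration file
song_data = "s3a://<input-bucket>/song_data/*/*/*/*.json"
log_data = "s3a://<input-bucket>/log_data/*/*/*.json"
output_data = "s3a://<output-bucket>/"

# Dimension tables derived from the song metadata
song_df = spark.read.json(song_data)
songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates()
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet(output_data + "songs")

artists_table = song_df.select("artist_id", "artist_name", "artist_location").dropDuplicates()
artists_table.write.mode("overwrite").parquet(output_data + "artists")

# Dimension tables derived from the activity logs (keep only actual song plays)
log_df = spark.read.json(log_data).filter(F.col("page") == "NextSong")
users_table = log_df.select("userId", "firstName", "lastName", "gender", "level").dropDuplicates()
users_table.write.mode("overwrite").parquet(output_data + "users")

log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
time_table = log_df.select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
).dropDuplicates()
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time")

# Fact table: join the logs back to the song metadata to resolve song_id and artist_id
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
        "left",
    )
    .select(
        "start_time", "userId", "level", "song_id", "artist_id",
        "sessionId", "location", "userAgent",
        F.year("start_time").alias("year"),
        F.month("start_time").alias("month"),
    )
)
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays")
```

Partitioning `songs`, `time`, and `songplays` by year (and month or artist) keeps the parquet output organized for the kinds of analytical queries the team wants to run.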