GitHub - EddieAmaitum/NYC-Yellow-Taxi-DataOps-with-AWS-Analyzing-TLC-Datasets: Performed business operations using Big Data technologies: AWS EMR, AWS RDS (MySQL), Hadoop, Apache Sqoop, Apache HBase, MapReduce

Unlocking Insights with NYC Yellow Taxi Data using Big Data Technologies 🚖

In this repository, we leverage the power of Big Data technologies to perform data-driven business operations on the NYC Yellow Taxi dataset. Our toolkit includes industry-standard tools and services such as:

AWS EMR: Harness the scalability of Amazon Elastic MapReduce for efficient data processing and analysis.

AWS RDS (MySQL): Store and manage structured data seamlessly with Amazon RDS, a reliable and high-performance database service.

Hadoop: Utilize the robust Hadoop ecosystem for distributed data storage and processing.

Apache Sqoop: Streamline data ingestion between Hadoop and relational databases.

Apache HBase: Leverage the NoSQL capabilities of Apache HBase for high-speed, random read/write access to your data.

MapReduce: Implement MapReduce algorithms to extract valuable insights from massive datasets.

Data

We use the New York City Taxi and Limousine Commission (TLC) yellow taxi data set for the year 2017.
The data dictionary can be found here.

Approach

The project was broken down into the following 4 tasks.
Please refer to the attached files for detailed explanations with code samples and screenshots.

Task 1: Setting up the environment and loading data

I created an RDS (Relational Database Service) instance in my AWS account and uploaded the data to it.
- I created an appropriate schema for the data sets so they could be uploaded to RDS.
I created an AWS EMR cluster with the services listed above.
- I used the m4.xlarge instance type with ample storage, since we are working with a large data set.
- I used a single master node instead of a multi-node cluster to limit my AWS credit consumption.
I then connected RDS to the EMR instance.
I then logged into RDS through the EMR instance.
I created the "yellow_taxi" database, followed by the "trip_records" table.
I then downloaded the data files onto the EMR cluster using the wget "url" command.
To load the data into the MySQL table, I logged in and ran the appropriate SQL commands.
I confirmed that the data was loaded into the table by running simple SQL queries and checking the output.
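
To make the table creation and loading steps more concrete, here is a minimal sketch that generates SQL of the kind those steps call for. It is not the code from the attached PDF. The column names follow the 2017 TLC yellow taxi data dictionary, but the MySQL types, the helper names (`create_table_sql`, `load_data_sql`), and the CSV path are assumptions made for illustration.

```python
# Column names come from the 2017 TLC yellow taxi data dictionary.
# The MySQL types are assumptions, not the schema used in this project.
TRIP_COLUMNS = [
    ("VendorID", "INT"),
    ("tpep_pickup_datetime", "DATETIME"),
    ("tpep_dropoff_datetime", "DATETIME"),
    ("passenger_count", "INT"),
    ("trip_distance", "DECIMAL(10,2)"),
    ("RatecodeID", "INT"),
    ("store_and_fwd_flag", "CHAR(1)"),
    ("PULocationID", "INT"),
    ("DOLocationID", "INT"),
    ("payment_type", "INT"),
    ("fare_amount", "DECIMAL(10,2)"),
    ("extra", "DECIMAL(10,2)"),
    ("mta_tax", "DECIMAL(10,2)"),
    ("tip_amount", "DECIMAL(10,2)"),
    ("tolls_amount", "DECIMAL(10,2)"),
    ("improvement_surcharge", "DECIMAL(10,2)"),
    ("total_amount", "DECIMAL(10,2)"),
]


def create_table_sql(table="trip_records"):
    """Build a CREATE TABLE statement from TRIP_COLUMNS."""
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in TRIP_COLUMNS)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n);"


def load_data_sql(csv_path, table="trip_records"):
    """Build a LOAD DATA statement for one downloaded CSV file.

    LOCAL loading must be enabled on both the client and the server;
    adjust the IGNORE count if a file has extra header or blank lines.
    """
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}'\n"
        f"INTO TABLE {table}\n"
        "FIELDS TERMINATED BY ','\n"
        "LINES TERMINATED BY '\\n'\n"
        "IGNORE 1 LINES;"
    )


if __name__ == "__main__":
    print("CREATE DATABASE IF NOT EXISTS yellow_taxi;")
    print("USE yellow_taxi;")
    print(create_table_sql())
    # Hypothetical path to a file fetched with wget.
    print(load_data_sql("/home/hadoop/yellow_tripdata_2017-01.csv"))
    # A simple check of the kind described above.
    print("SELECT COUNT(*) FROM trip_records;")
```

The printed statements can be pasted into a MySQL session connected to the RDS instance, or saved to a file and run from there.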

Task 2: Ingesting data from RDS into the HBase table using Sqoop

First, I logged in to the EMR instance and completed the initial setup steps.
- I installed the MySQL connector jar file by running the appropriate step to extract the MySQL connector tar file.
- I then went to the MySQL connector directory and copied the jar file to the Sqoop library to complete the installation.
Having installed the MySQL connector, I set up MySQL on the EMR cluster and proceeded.
I ran the appropriate commands to ingest data from the MySQL RDS instance into the HBase table.
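
For readers who want a starting point, the sketch below assembles a Sqoop import command that targets HBase. It is an illustration under stated assumptions, not the command used in this project: the RDS endpoint placeholder, the HBase table name, the column family, and the row key column (`id`) are all hypothetical. The Sqoop options themselves (`--connect`, `--hbase-table`, `--column-family`, `--hbase-row-key`, `--hbase-create-table`, `-m`) are standard `sqoop import` arguments.

```python
import shlex


def build_sqoop_import(jdbc_url, username, table, hbase_table,
                       column_family, row_key, num_mappers=1):
    """Return the argument list for a Sqoop import from MySQL into HBase."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                          # prompt for the password at run time
        "--table", table,              # source MySQL table
        "--hbase-table", hbase_table,  # destination HBase table
        "--column-family", column_family,
        "--hbase-row-key", row_key,    # source column used as the HBase row key
        "--hbase-create-table",        # create the HBase table if it is missing
        "-m", str(num_mappers),        # number of parallel map tasks
    ]


if __name__ == "__main__":
    cmd = build_sqoop_import(
        jdbc_url="jdbc:mysql://<rds-endpoint>:3306/yellow_taxi",  # placeholder
        username="admin",                  # hypothetical user name
        table="trip_records",
        hbase_table="trip_records_hbase",  # hypothetical HBase table name
        column_family="cf",                # hypothetical column family
        row_key="id",                      # hypothetical key column
    )
    # Print a shell-safe version of the command for review before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Using a single mapper avoids the need for a split column when the source table has no primary key, at the cost of a slower import.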

Task 3: Bulk import subsequent files to HBase table

I bulk imported the data from the subsequent files in the dataset on the EMR cluster into the HBase table.
See the Python code used (batch_ingest.py).
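
The general shape of a batched HBase import can be sketched as follows. This is not the content of batch_ingest.py. It assumes the HappyBase client library and an HBase Thrift server on the cluster, and the table name, column family, and sequential row-key scheme are hypothetical choices made for illustration.

```python
import csv


def row_to_put(row, row_number, column_family="cf"):
    """Convert one CSV record (a dict) into an HBase row key and cell mapping."""
    # Zero-padded sequential key; an assumption, not the project's key design.
    row_key = f"{row_number:010d}".encode()
    cells = {
        f"{column_family}:{name}".encode(): (value or "").encode()
        for name, value in row.items()
    }
    return row_key, cells


def batch_ingest(lines, table, batch_size=1000, start_row=0):
    """Write CSV lines to an HBase table in batches and return the row count.

    `table` is expected to behave like a happybase.Table: its batch() method
    returns a context manager whose put() buffers rows and sends them to
    HBase every `batch_size` puts and once more on exit.
    """
    reader = csv.DictReader(lines)
    count = 0
    with table.batch(batch_size=batch_size) as batch:
        for count, row in enumerate(reader, start=1):
            key, cells = row_to_put(row, start_row + count)
            batch.put(key, cells)
    return count


def ingest_file(path, host="localhost", table_name="trip_records_hbase"):
    """Open a CSV file and ingest it; requires happybase and a Thrift server."""
    import happybase  # imported here so the helpers above work without it

    connection = happybase.Connection(host)
    try:
        with open(path, newline="") as handle:
            return batch_ingest(handle, connection.table(table_name))
    finally:
        connection.close()
```

Batching the puts keeps the number of round trips to HBase low, which matters when each monthly file holds millions of trips.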

Task 4: Using MapReduce to perform data analysis on files downloaded to the EMR instance

Please refer to the MapReduceTasks.pdf file for a detailed approach with screenshots.
Please refer to the corresponding mrtask_#.py files for the Python code used.
The following business questions were explored:
- mrtask_a) Which vendor has the most trips, and what is the total revenue generated by that vendor?
- mrtask_b) Which pickup location generates the most revenue?
- mrtask_c) What are the different payment types used by customers, and how many trips used each?
- mrtask_d) What is the average trip time for different pickup locations?
- mrtask_e) What is the average tip-to-revenue ratio of the drivers for different pickup locations, in sorted order?
- mrtask_f) How does revenue vary over time? Calculate the average trip revenue per month, analyzed by hour of the day (day vs night) and day of the week (weekday vs weekend).
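
To show how questions like these map onto the MapReduce model, here is a small in-memory sketch of questions c and d. It is not the code in the mrtask_#.py files and does not use a Hadoop framework: `run_mapreduce` is a hypothetical stand-in for the map, shuffle, and reduce phases, and the sample rows are made up for illustration. Only the column names come from the TLC data dictionary.

```python
import csv
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    # reduce: collapse each key's values into a single result
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def payment_mapper(record):
    """Question c: emit one count per trip, keyed by payment type."""
    yield record["payment_type"], 1


def count_reducer(key, values):
    return sum(values)


def trip_time_mapper(record):
    """Question d: emit the trip duration in minutes, keyed by pickup location."""
    pickup = datetime.strptime(record["tpep_pickup_datetime"], TIME_FORMAT)
    dropoff = datetime.strptime(record["tpep_dropoff_datetime"], TIME_FORMAT)
    yield record["PULocationID"], (dropoff - pickup).total_seconds() / 60.0


def mean_reducer(key, values):
    # Values arrive as a list here; a real reducer usually receives an
    # iterator and would keep a running sum and count instead.
    return sum(values) / len(values)


# Made-up sample rows, not taken from the real dataset.
SAMPLE_ROWS = [
    "payment_type,PULocationID,tpep_pickup_datetime,tpep_dropoff_datetime",
    "1,161,2017-01-01 00:00:00,2017-01-01 00:10:00",
    "2,161,2017-01-01 01:00:00,2017-01-01 01:20:00",
    "1,237,2017-01-01 02:00:00,2017-01-01 02:05:00",
]

if __name__ == "__main__":
    records = list(csv.DictReader(SAMPLE_ROWS))
    print(run_mapreduce(records, payment_mapper, count_reducer))
    print(run_mapreduce(records, trip_time_mapper, mean_reducer))
```

The same mapper and reducer logic carries over to a framework that runs on EMR; only the plumbing that feeds records in and collects results changes.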

Repository files:
- MapReduceTasks.pdf
- NYC Yellow Taxi photo.png
- README.md
- Task 1_RDS and EMR setup.pdf
- Task 2_Data Ingestion.pdf
- batch_ingest.py
- mrtask_a.py
- mrtask_b.py
- mrtask_c.py
- mrtask_d.py
- mrtask_e.py
- mrtask_f.py
