This repository uses Big Data technologies to run data-driven business analyses on the NYC Yellow Taxi dataset. The toolkit includes:
- AWS EMR: Amazon Elastic MapReduce, used for scalable data processing and analysis.
- AWS RDS (MySQL): a managed relational database service, used to store and manage the structured data.
- Hadoop: distributed data storage and processing.
- Apache Sqoop: data transfer between Hadoop and relational databases.
- Apache HBase: a NoSQL store for fast, random read/write access to the data.
- MapReduce: algorithms for extracting insights from large datasets.
- We use the New York City TLC yellow taxi data set for the year 2017
- The data dictionary can be found here
- The project was broken down into the following 4 tasks
- Please refer to attached files for detailed explanations with code samples and screenshots
- I created an RDS (Relational Database Service) instance on my AWS account and uploaded data to the RDS instance
  - I created an appropriate schema for the data sets to upload them to RDS
- I created an AWS EMR instance with the above services.
  - I used the m4.xlarge instance type with ample storage, since we are working with a huge data set
  - I used a single master node instead of a multi-node cluster to limit my AWS credit consumption
- I then proceeded to connect RDS with the EMR instance
- I then logged into RDS through the EMR instance
- I created the "yellow_taxi" database followed by the table "trip_records"
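
As a rough illustration of the database and table step, the sketch below builds the two SQL statements as strings. The column names follow the 2017 TLC yellow taxi data dictionary; the MySQL column types and the helper names `create_database_sql` and `create_table_sql` are assumptions made for this example, not necessarily what was used in the project.

```python
# Column names follow the 2017 TLC yellow taxi data dictionary.
# The MySQL types are assumptions chosen for illustration only.
COLUMNS = [
    ("VendorID", "INT"),
    ("tpep_pickup_datetime", "DATETIME"),
    ("tpep_dropoff_datetime", "DATETIME"),
    ("passenger_count", "INT"),
    ("trip_distance", "FLOAT"),
    ("RatecodeID", "INT"),
    ("store_and_fwd_flag", "CHAR(1)"),
    ("PULocationID", "INT"),
    ("DOLocationID", "INT"),
    ("payment_type", "INT"),
    ("fare_amount", "FLOAT"),
    ("extra", "FLOAT"),
    ("mta_tax", "FLOAT"),
    ("tip_amount", "FLOAT"),
    ("tolls_amount", "FLOAT"),
    ("improvement_surcharge", "FLOAT"),
    ("total_amount", "FLOAT"),
]


def create_database_sql(database="yellow_taxi"):
    """Return the statement that creates the database if it is missing."""
    return f"CREATE DATABASE IF NOT EXISTS {database};"


def create_table_sql(table="trip_records"):
    """Return a CREATE TABLE statement with one line per column."""
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in COLUMNS)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n);"


print(create_database_sql())
print(create_table_sql())
```

The printed statements can be pasted into the MySQL prompt after logging into RDS from the EMR instance.
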
- I then downloaded the data files onto the EMR cluster using the wget "url" command
- To load the data into the MySQL table, I logged in and ran the appropriate SQL commands
- I confirmed the data was loaded into the table by running simple SQL queries and observing the outputs
- First, I logged into the EMR instance and completed the initial setup steps
- Next, I installed the MySQL connector JAR file, then ran the appropriate step to extract the MySQL connector tar file
- I then went to the MySQL connector directory and copied the connector to the Sqoop library to complete the installation
- Having installed the MySQL connector, I set up MySQL on the EMR cluster and proceeded
- I ran the appropriate commands to ingest data from MySQL RDS into the HBase table
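
For orientation, a Sqoop import from MySQL into HBase might look roughly like the command assembled below. This is a sketch under assumptions: the RDS endpoint, user name, column family, and row key are hypothetical placeholders, and the exact options used in this project may differ. The flags themselves (`--connect`, `--hbase-table`, `--column-family`, `--hbase-row-key`, `--hbase-create-table`, `-P`, `-m`) are standard Sqoop 1 import options.

```python
import shlex


def sqoop_import_command(host, database, table, user,
                         hbase_table, column_family, row_key):
    """Build the argument list for a Sqoop MySQL-to-HBase import."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", user,
        "-P",                          # prompt for the password at run time
        "--table", table,
        "--hbase-table", hbase_table,
        "--column-family", column_family,
        "--hbase-row-key", row_key,
        "--hbase-create-table",        # create the HBase table if missing
        "-m", "1",                     # a single mapper, so no split column is needed
    ]


# All values below are hypothetical placeholders.
cmd = sqoop_import_command(
    host="<rds-endpoint>:3306",
    database="yellow_taxi",
    table="trip_records",
    user="<user>",
    hbase_table="trip_records",
    column_family="trip_info",
    row_key="<row-key-column>",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each option and its value together and avoids quoting mistakes when the command is later passed to `subprocess.run`.
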
- I bulk imported data from the subsequent files in the dataset on the EMR cluster into the HBase table using the relevant code
- See the Python code (batch_ingest.py) used
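
The general shape of a batched HBase ingest can be sketched as follows. This is not the contents of batch_ingest.py; it is a minimal illustration that assumes a happybase-style table object (one whose `batch(batch_size=...)` returns a context manager with a `put(row_key, data)` method). The column family name `trip_info`, the sequential row keys, and the `InMemoryTable` stand-in are all assumptions made so the example runs without a cluster.

```python
import csv

COLUMN_FAMILY = "trip_info"  # hypothetical column family name


def row_to_put(row_number, record):
    """Turn one CSV record (a dict) into an HBase row key and column map."""
    row_key = str(row_number).encode()
    data = {
        f"{COLUMN_FAMILY}:{name}".encode(): (value or "").encode()
        for name, value in record.items()
    }
    return row_key, data


def batch_ingest(table, csv_lines, start_row=1, batch_size=1000):
    """Write CSV records to `table` in batches; return the number written."""
    reader = csv.DictReader(csv_lines)
    count = 0
    with table.batch(batch_size=batch_size) as batch:
        for offset, record in enumerate(reader):
            row_key, data = row_to_put(start_row + offset, record)
            batch.put(row_key, data)
            count += 1
    return count


class _InMemoryBatch:
    """Minimal batch object: stores puts directly in the parent table."""

    def __init__(self, table):
        self.table = table

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

    def put(self, row_key, data):
        self.table.rows[row_key] = data


class InMemoryTable:
    """Stand-in for a happybase Table so the sketch runs locally."""

    def __init__(self):
        self.rows = {}

    def batch(self, batch_size=None):
        return _InMemoryBatch(self)


sample = [
    "VendorID,PULocationID,total_amount",
    "1,161,12.30",
    "2,237,8.80",
]
table = InMemoryTable()
written = batch_ingest(table, sample)
print(written, "rows written")
```

Against a real cluster, `table` would instead come from something like `happybase.Connection(host).table(name)`, and `csv_lines` would be an open file handle for one of the downloaded CSV files.
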
- Please refer to the MapReduceTasks pdf file for a detailed approach with screenshots
- Please refer to the corresponding mrtask_#.py files for the Python code used
- The following business questions were explored:
- mrtask_a) Which vendor has the most trips, and what is the total revenue generated by that vendor?
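
To make the map and reduce phases for this question concrete, here is a framework-free sketch. It is not the code in the mrtask files; it assumes that revenue means the `total_amount` column and uses a small local driver (`run_job`, a name invented here) in place of a real MapReduce runtime.

```python
from collections import defaultdict


def mapper(record):
    """Emit (VendorID, (1 trip, revenue)) for one trip record."""
    yield record["VendorID"], (1, float(record["total_amount"]))


def reducer(vendor, values):
    """Sum trip counts and revenue for one vendor."""
    trips = 0
    revenue = 0.0
    for count, amount in values:
        trips += count
        revenue += amount
    yield vendor, (trips, revenue)


def run_job(records):
    """Tiny local driver: map, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


records = [
    {"VendorID": "1", "total_amount": "10.00"},
    {"VendorID": "2", "total_amount": "7.50"},
    {"VendorID": "2", "total_amount": "12.50"},
]
totals = run_job(records)
top_vendor = max(totals, key=lambda vendor: totals[vendor][0])
print(top_vendor, totals[top_vendor])
```

Because the reducer only adds counts and amounts, the same function could also serve as a combiner in a real job.
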
- mrtask_b) Which pickup location generates the most revenue?
- mrtask_c) What are the different payment types used by customers and their count?
- mrtask_d) What is the average trip time for different pickup locations?
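
The average trip time question can be sketched in the same framework-free style. The example assumes trip time is the difference between `tpep_dropoff_datetime` and `tpep_pickup_datetime`, reported in minutes, and that timestamps use the `YYYY-MM-DD HH:MM:SS` layout; the function names are illustrative rather than taken from the mrtask files.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_minutes(record):
    """Return the trip duration in minutes for one record."""
    pickup = datetime.strptime(record["tpep_pickup_datetime"], TIME_FORMAT)
    dropoff = datetime.strptime(record["tpep_dropoff_datetime"], TIME_FORMAT)
    return (dropoff - pickup).total_seconds() / 60.0


def mapper(record):
    """Emit (PULocationID, duration in minutes)."""
    yield record["PULocationID"], trip_minutes(record)


def reducer(location, durations):
    """Average the durations collected for one pickup location."""
    durations = list(durations)
    yield location, sum(durations) / len(durations)


def average_trip_minutes(records):
    """Local driver: map, group by pickup location, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    averages = {}
    for key, values in grouped.items():
        for location, average in reducer(key, values):
            averages[location] = average
    return averages


records = [
    {"PULocationID": "161",
     "tpep_pickup_datetime": "2017-01-01 00:00:00",
     "tpep_dropoff_datetime": "2017-01-01 00:10:00"},
    {"PULocationID": "161",
     "tpep_pickup_datetime": "2017-01-01 08:00:00",
     "tpep_dropoff_datetime": "2017-01-01 08:20:00"},
    {"PULocationID": "237",
     "tpep_pickup_datetime": "2017-01-02 23:50:00",
     "tpep_dropoff_datetime": "2017-01-03 00:20:00"},
]
averages = average_trip_minutes(records)
print(averages)
```

On a real cluster with a combiner, the mapper would usually emit (sum, count) pairs instead, since averaging partial averages gives the wrong result.
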
- mrtask_e) Calculate the average tip-to-revenue ratio of the drivers for different pickup locations, in sorted order
- mrtask_f) How does revenue vary over time? Calculate the average trip revenue per month, analyzing it by hour of the day (day vs night) and by day of the week (weekday vs weekend)
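
For the time-based question, most of the work is choosing the grouping key. The sketch below is one possible approach, not the project's actual code: it assumes that day means pickup hours 06:00 to 17:59, that weekend means Saturday and Sunday, and that revenue is `total_amount`. The names `time_bucket` and `average_revenue_by_bucket` are invented for this example.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
DAY_START_HOUR, DAY_END_HOUR = 6, 18  # assumed day/night boundary


def time_bucket(pickup_text):
    """Map a pickup timestamp to (month, day/night, weekday/weekend)."""
    pickup = datetime.strptime(pickup_text, TIME_FORMAT)
    month = pickup.strftime("%Y-%m")
    is_day = DAY_START_HOUR <= pickup.hour < DAY_END_HOUR
    part_of_day = "day" if is_day else "night"
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    day_type = "weekend" if pickup.weekday() >= 5 else "weekday"
    return month, part_of_day, day_type


def mapper(record):
    """Emit (time bucket, trip revenue)."""
    bucket = time_bucket(record["tpep_pickup_datetime"])
    yield bucket, float(record["total_amount"])


def reducer(bucket, amounts):
    """Average the revenue of all trips in one bucket."""
    amounts = list(amounts)
    yield bucket, sum(amounts) / len(amounts)


def average_revenue_by_bucket(records):
    """Local driver: map, group by bucket, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    averages = {}
    for key, values in grouped.items():
        for bucket, average in reducer(key, values):
            averages[bucket] = average
    return averages


records = [
    {"tpep_pickup_datetime": "2017-01-02 09:00:00", "total_amount": "10.00"},
    {"tpep_pickup_datetime": "2017-01-02 14:30:00", "total_amount": "20.00"},
    {"tpep_pickup_datetime": "2017-01-07 23:15:00", "total_amount": "30.00"},
]
by_bucket = average_revenue_by_bucket(records)
for bucket, average in sorted(by_bucket.items()):
    print(bucket, average)
```

Using a composite key keeps the job to a single pass; separate per-month, per-hour, or per-weekday views can then be derived from the same output.
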