The Yelp Business Review Data Pipeline is a comprehensive solution for processing and analyzing Yelp business review data. This end-to-end data pipeline integrates various technologies, including Python, Kafka, AWS services (DynamoDB, S3, Redshift), PySpark, AWS Lambda, and Power BI. The project enables real-time streaming, change data capture (CDC), daily batch processing, and data visualization to gain insights into customer sentiment, business performance, and industry trends.
Python
Kafka
AWS services (DynamoDB, S3, Redshift)
PySpark
AWS Lambda
Power BI
Our big data project revolves around processing and analyzing Yelp business review data. The pipeline handles real-time streaming, CDC, daily batch processing, and data visualization. The key data sources include three main tables from Yelp's dataset: Business, Reviews, and Users, providing valuable information about businesses, customer reviews, and user details.
Used for real-time data streaming. Producers ingest data from JSON and CSV files. Topics created for each table (Business, Reviews, Users) for organized data streams.
Python scripts serve as producers for streaming data into Kafka topics. Continuous ingestion of Yelp dataset into the Kafka cluster.
Implemented using PySpark to process and transform incoming data from Kafka. Enriches data and prepares it for storage in DynamoDB.
NoSQL database for real-time storage of processed Yelp data. AWS Lambda Triggers in DynamoDB configured for CDC processes.
Functions trigger CDC processes in DynamoDB on data changes. Captures and propagates changes to an S3 bucket for daily batch processing.
Daily batch processing on data stored in S3 for aggregation, cleaning, and transformation. Processed data made ready for loading into the Redshift data warehouse.
Central data warehouse for storing Yelp data in a structured and optimized format. Daily batches of processed data from S3 loaded for analytical queries.
Utilized for data visualization, creating interactive dashboards and reports. Direct queries to Redshift for real-time visualization and analysis.
Yelp data continuously streamed in real-time from JSON and CSV files into Kafka topics.
PySpark jobs in the Kafka consumer process and transform incoming data before uploading it to DynamoDB.
AWS Lambda Triggers in DynamoDB capture changes in real-time, ensuring data consistency.
Daily batches of data from DynamoDB processed and transformed in S3 for optimal analytics.
Processed data loaded daily into Redshift for historical and analytical queries.
Power BI connected to Redshift for real-time queries, enabling the creation of interactive dashboards and reports.
Leveraged Kafka and PySpark for real-time data streaming and processing, providing immediate insights into Yelp data.
AWS services (DynamoDB, S3, Redshift) offered scalability and flexibility to handle growing data volumes and analytical needs.
Implemented CDC in DynamoDB using Lambda Triggers for capturing and propagating real-time data changes.
Utilized S3 for efficient daily batch processing, ensuring data is optimized for Redshift.
Power BI employed for creating visually appealing and interactive dashboards for easy data exploration and analysis.