This project is a fully containerized ELT (Extract, Load, Transform) pipeline that automates the extraction of top posts from a specific Reddit subreddit. The pipeline leverages Docker, Apache Airflow for orchestration, Apache Spark for data transformation, and MinIO as an S3-compatible object storage solution.
The data pipeline workflow is as follows:
- Extract: Use the Reddit API to fetch top posts from a specified subreddit.
- Load: Store the raw data in a MinIO bucket, organized in the `raw_data` folder.
- Transform: Use Spark to clean and structure the data, saving the results to the `transformed_data` folder.
- Archive: Move the processed data from `raw_data` to the `archived_data` folder in MinIO (a sketch of this step follows the list).
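The archive step amounts to a copy-then-delete against object storage. Here is a minimal sketch using `boto3` pointed at MinIO's S3 endpoint; the endpoint, credentials, bucket, and prefixes below are assumptions for illustration, not the project's exact values:

```python
import boto3

# MinIO speaks the S3 API, so boto3 works with a custom endpoint.
# Endpoint, credentials, and bucket/prefix names are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

BUCKET = "reddit"

def archive_object(key: str) -> None:
    """Copy an object from raw_data/ to archived_data/, then delete the original."""
    archived_key = key.replace("raw_data/", "archived_data/", 1)
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": key},
        Key=archived_key,
    )
    s3.delete_object(Bucket=BUCKET, Key=key)

# Archive everything currently under raw_data/.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="raw_data/").get("Contents", []):
    archive_object(obj["Key"])
```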
The entire pipeline is managed by Airflow and uses a Spark cluster for scalable data processing. The project runs on Docker Compose, making setup and deployment straightforward.
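In Airflow terms, the workflow is a short linear DAG. A minimal sketch of the wiring, where the DAG id, schedule, helper modules, and script paths are illustrative assumptions rather than the project's actual names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

from scripts.extract import extract_top_posts  # hypothetical helper modules
from scripts.archive import archive_raw_data

with DAG(
    dag_id="reddit_elt",  # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull top posts from Reddit and drop the raw JSON into MinIO.
    extract_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_top_posts,
    )

    # Run the Spark job that cleans the raw data and writes Parquet.
    transform = SparkSubmitOperator(
        task_id="transform",
        application="/opt/airflow/jobs/transform.py",  # illustrative path
        conn_id="spark_default",
    )

    # Move processed objects from raw_data/ to archived_data/.
    archive = PythonOperator(
        task_id="archive",
        python_callable=archive_raw_data,
    )

    extract_load >> transform >> archive
```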
- Airflow: Manages the ELT pipeline workflow and scheduling.
- Spark: Handles data transformations.
- MinIO: S3-compatible storage for holding raw, transformed, and archived data.
- Docker & Docker Compose: Containerizes and orchestrates all services, allowing easy deployment.
- Reddit Data Extraction: A script uses the Reddit API to pull top posts from a specified subreddit (a sketch follows this list).
- Raw Data Loading: Extracted data is saved to the MinIO `raw_data` folder.
- Data Transformation: Spark transforms the raw data (e.g., cleaning, normalization) and stores the result as Parquet, for optimized storage and compression, in the `transformed_data` folder.
- Data Archiving: Raw data is moved from `raw_data` to `archived_data` in MinIO to free up space and maintain a historical record.
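A condensed sketch of the extract-and-load step, assuming PRAW for the Reddit API and boto3 for MinIO; the subreddit, bucket, and environment-variable names are placeholders:

```python
import json
import os
from datetime import datetime, timezone

import boto3
import praw

# Reddit API client; credentials come from the .env file (variable names are illustrative).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="reddit_data_pipeline",
)

# Fetch today's top posts from a subreddit of interest.
posts = [
    {
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
    }
    for post in reddit.subreddit("dataengineering").top(time_filter="day", limit=100)
]

# Load the raw JSON into the raw_data folder of the MinIO bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
key = f"raw_data/top_posts_{datetime.now(timezone.utc):%Y%m%d}.json"
s3.put_object(Bucket="reddit", Key=key, Body=json.dumps(posts))
```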
- Clone the repository:

  ```bash
  git clone https://github.com/phnguynmkhoi/reddit_data_pipeline.git
  cd reddit_data_pipeline
  ```
- Create a `.env` file and add your Reddit API and MinIO credentials (a sample layout is sketched after these steps).
- Build the Docker images:

  ```bash
  docker build -f Dockerfile.airflow -t airflow-default .
  docker build -f Dockerfile.spark -t spark-cluster:3.1.3 .
  ```
- Run the entire pipeline:

  ```bash
  docker-compose up
  ```
- The pipeline will start running based on the Airflow scheduler. You can monitor the tasks through the Airflow UI at http://localhost:8080.
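For reference, a sample `.env` layout; the variable names here are assumptions and should match whatever the extraction script and Docker Compose file actually read:

```
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
MINIO_ACCESS_KEY=your_minio_access_key
MINIO_SECRET_KEY=your_minio_secret_key
```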
- MinIO Console: http://localhost:10001
- Use the credentials in your `.env` or Docker Compose file to log in.
- Remember to create an access key in the console and paste it into your `.env` file.
Airflow’s web server is accessible at http://localhost:8080, where you can monitor, manage, and troubleshoot DAG runs.
- Raw Data: Available in MinIO under the `raw_data` folder.
- Transformed Data: Available in MinIO under the `transformed_data` folder (a Spark sketch follows this list).
- Archived Data: Processed data is moved from `raw_data` to `archived_data` once the pipeline completes.
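A minimal sketch of the Spark transformation: read the raw JSON from MinIO over `s3a://`, clean it, and write Parquet back. The endpoint, credentials, paths, and column names are illustrative, and the `hadoop-aws`/`aws-java-sdk-bundle` jars noted below must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The s3a settings let Spark talk to MinIO; all values are illustrative.
spark = (
    SparkSession.builder.appName("reddit_transform")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Raw files are assumed to be JSON arrays, hence the multiLine option.
raw = spark.read.option("multiLine", "true").json("s3a://reddit/raw_data/")

# Basic cleaning: drop duplicate posts, require a title, add a proper timestamp.
transformed = (
    raw.dropDuplicates(["id"])
    .filter(F.col("title").isNotNull())
    .withColumn("created_ts", F.from_unixtime("created_utc").cast("timestamp"))
)

# Parquet gives columnar storage with built-in compression.
transformed.write.mode("overwrite").parquet("s3a://reddit/transformed_data/")
```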
- You must download the following jar files into the `./jars` folder before running the project: `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar`.
- Ensure that the MinIO and Reddit API credentials are configured before running the pipeline.
- You can create additional Spark jobs for more advanced data transformations and aggregations based on the extracted data.
- This project can be scaled and extended to handle multiple subreddits or different data sources.
- Data Validation: Implement data validation checks in the transformation step to improve data quality.
- Pipeline Split: Split the project into two parts by adding a trigger mechanism in MinIO. When new data lands in MinIO, a bucket-event notification can prompt Airflow to run the Spark transformation job automatically, providing more flexibility than the current fixed schedule (a sketch follows this list).
- Notification System: Add a notification system to report the status of each pipeline stage.
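As a sketch of the pipeline-split idea: MinIO can publish bucket events to a webhook, and a small receiver can then trigger the transformation DAG through Airflow's stable REST API. Everything here (service hostnames, DAG id, credentials, and the `mc` commands in the comments) is an assumption for illustration, not a tested setup:

```python
import requests
from flask import Flask, request

app = Flask(__name__)

AIRFLOW_API = "http://airflow-webserver:8080/api/v1"
DAG_ID = "reddit_elt"  # illustrative DAG id

@app.route("/minio-event", methods=["POST"])
def handle_minio_event():
    """Receive a MinIO bucket notification and trigger the transform DAG."""
    event = request.get_json(force=True)
    resp = requests.post(
        f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
        auth=("airflow", "airflow"),            # basic-auth credentials are placeholders
        json={"conf": {"minio_event": event}},  # pass the event through to the DAG run
    )
    resp.raise_for_status()
    return "", 204

# MinIO could be pointed at this receiver with something like:
#   mc admin config set local notify_webhook:1 endpoint="http://trigger:5000/minio-event"
#   mc event add local/reddit arn:minio:sqs::1:webhook --event put --prefix raw_data/
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```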