This project is a fully containerized ELT (Extract, Load, Transform) pipeline that automates the extraction of top posts from a specific Reddit subreddit. The pipeline leverages Docker, Apache Airflow for orchestration, Apache Spark for data transformation, and MinIO as an S3-compatible object storage solution.
The data pipeline workflow is as follows:
- Extract: Use the Reddit API to fetch top posts from a specified subreddit.
- Load: Store the raw data in a MinIO bucket, organized in the `raw_data` folder.
- Transform: Use Spark to clean and structure the data, saving the results to the `transformed_data` folder.
- Archive: Move the processed data from `raw_data` to the `archived_data` folder in MinIO (a sketch of this step follows the list).
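The archive step amounts to a copy-then-delete against object storage. Here is a minimal sketch using `boto3` pointed at MinIO's S3 endpoint; the endpoint, credentials, bucket, and prefixes below are assumptions for illustration, not the project's exact values:

```python
import boto3

# MinIO speaks the S3 API, so boto3 works with a custom endpoint.
# Endpoint, credentials, and bucket/prefix names are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

BUCKET = "reddit"

def archive_object(key: str) -> None:
    """Copy an object from raw_data/ to archived_data/, then delete the original."""
    archived_key = key.replace("raw_data/", "archived_data/", 1)
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": key},
        Key=archived_key,
    )
    s3.delete_object(Bucket=BUCKET, Key=key)

# Archive everything currently under raw_data/.
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="raw_data/").get("Contents", []):
    archive_object(obj["Key"])
```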
The entire pipeline is managed by Airflow and uses a Spark cluster for scalable data processing. The project runs on Docker Compose, making setup and deployment straightforward.
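In Airflow terms, the workflow is a short linear DAG. A minimal sketch of the wiring, where the DAG id, schedule, helper modules, and script paths are illustrative assumptions rather than the project's actual names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

from scripts.extract import extract_top_posts  # hypothetical helper modules
from scripts.archive import archive_raw_data

with DAG(
    dag_id="reddit_elt",  # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull top posts from Reddit and drop the raw JSON into MinIO.
    extract_load = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_top_posts,
    )

    # Run the Spark job that cleans the raw data and writes Parquet.
    transform = SparkSubmitOperator(
        task_id="transform",
        application="/opt/airflow/jobs/transform.py",  # illustrative path
        conn_id="spark_default",
    )

    # Move processed objects from raw_data/ to archived_data/.
    archive = PythonOperator(
        task_id="archive",
        python_callable=archive_raw_data,
    )

    extract_load >> transform >> archive
```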
- Airflow: Manages the ELT pipeline workflow and scheduling.
- Spark: Handles data transformations.
- MinIO: S3-compatible storage for holding raw, transformed, and archived data.
- Docker & Docker Compose: Containerizes and orchestrates all services, allowing easy deployment.
- Reddit Data Extraction: A script uses the Reddit API to pull top posts from a specified subreddit (a sketch follows this list).
- Raw Data Loading: Extracted data is saved to the MinIO `raw_data` folder.
- Data Transformation: Spark transforms the raw data (e.g., cleaning, normalization) and stores the result as Parquet, for optimized storage and compression, in the `transformed_data` folder.
- Data Archiving: Raw data is moved from `raw_data` to `archived_data` in MinIO to free up space and maintain a historical record.
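A condensed sketch of the extract-and-load step, assuming PRAW for the Reddit API and boto3 for MinIO; the subreddit, bucket, and environment-variable names are placeholders:

```python
import json
import os
from datetime import datetime, timezone

import boto3
import praw

# Reddit API client; credentials come from the .env file (variable names are illustrative).
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="reddit_data_pipeline",
)

# Fetch today's top posts from a subreddit of interest.
posts = [
    {
        "id": post.id,
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "created_utc": post.created_utc,
    }
    for post in reddit.subreddit("dataengineering").top(time_filter="day", limit=100)
]

# Load the raw JSON into the raw_data folder of the MinIO bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
    aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
)
key = f"raw_data/top_posts_{datetime.now(timezone.utc):%Y%m%d}.json"
s3.put_object(Bucket="reddit", Key=key, Body=json.dumps(posts))
```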
- Clone the repository:

  ```bash
  git clone https://github.com/phnguynmkhoi/reddit_data_pipeline.git
  cd reddit_data_pipeline
  ```
- Create a `.env` file and add your Reddit API and MinIO credentials (a sample layout is sketched after these steps).
- Build the Docker images:

  ```bash
  docker build -f Dockerfile.airflow -t airflow-default .
  docker build -f Dockerfile.spark -t spark-cluster:3.1.3 .
  ```
- Run the entire pipeline:

  ```bash
  docker-compose up
  ```
- The pipeline will start running based on the Airflow scheduler. You can monitor the tasks through the Airflow UI at http://localhost:8080.
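For reference, a sample `.env` layout; the variable names here are assumptions and should match whatever the extraction script and Docker Compose file actually read:

```
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
MINIO_ACCESS_KEY=your_minio_access_key
MINIO_SECRET_KEY=your_minio_secret_key
```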
- MinIO Console: http://localhost:10001
- Use the credentials in your `.env` or Docker Compose file to log in.
- Remember to create an access key in the console and paste it into your `.env` file.
Airflow’s web server is accessible at http://localhost:8080, where you can monitor, manage, and troubleshoot DAG runs.
- Raw Data: Available in MinIO under the `raw_data` folder.
- Transformed Data: Available in MinIO under the `transformed_data` folder (a Spark sketch follows this list).
- Archived Data: Processed data is moved from `raw_data` to `archived_data` once the pipeline completes.
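A minimal sketch of the Spark transformation: read the raw JSON from MinIO over `s3a://`, clean it, and write Parquet back. The endpoint, credentials, paths, and column names are illustrative, and the `hadoop-aws`/`aws-java-sdk-bundle` jars noted below must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The s3a settings let Spark talk to MinIO; all values are illustrative.
spark = (
    SparkSession.builder.appName("reddit_transform")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key")
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Raw files are assumed to be JSON arrays, hence the multiLine option.
raw = spark.read.option("multiLine", "true").json("s3a://reddit/raw_data/")

# Basic cleaning: drop duplicate posts, require a title, add a proper timestamp.
transformed = (
    raw.dropDuplicates(["id"])
    .filter(F.col("title").isNotNull())
    .withColumn("created_ts", F.from_unixtime("created_utc").cast("timestamp"))
)

# Parquet gives columnar storage with built-in compression.
transformed.write.mode("overwrite").parquet("s3a://reddit/transformed_data/")
```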
- You must download the following jar files into the `./jars` folder before running the project: `hadoop-aws-3.2.0.jar` and `aws-java-sdk-bundle-1.11.375.jar`.
- Ensure that the MinIO and Reddit API credentials are configured before running the pipeline.
- You can create additional Spark jobs for more advanced data transformations and aggregations based on the extracted data.
- This project can be scaled and extended to handle multiple subreddits or different data sources.
- Data Validation: Implement data validation checks in the transformation step to improve data quality.
- Pipeline Split: Split the project into two parts by adding a trigger mechanism in MinIO. When new data lands in MinIO, a bucket-event notification can prompt Airflow to run the Spark transformation job automatically, providing more flexibility than the current fixed schedule (a sketch follows this list).
- Notification System: Add a notification system to report the status of each pipeline stage.
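As a sketch of the pipeline-split idea: MinIO can publish bucket events to a webhook, and a small receiver can then trigger the transformation DAG through Airflow's stable REST API. Everything here (service hostnames, DAG id, credentials, and the `mc` commands in the comments) is an assumption for illustration, not a tested setup:

```python
import requests
from flask import Flask, request

app = Flask(__name__)

AIRFLOW_API = "http://airflow-webserver:8080/api/v1"
DAG_ID = "reddit_elt"  # illustrative DAG id

@app.route("/minio-event", methods=["POST"])
def handle_minio_event():
    """Receive a MinIO bucket notification and trigger the transform DAG."""
    event = request.get_json(force=True)
    resp = requests.post(
        f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
        auth=("airflow", "airflow"),            # basic-auth credentials are placeholders
        json={"conf": {"minio_event": event}},  # pass the event through to the DAG run
    )
    resp.raise_for_status()
    return "", 204

# MinIO could be pointed at this receiver with something like:
#   mc admin config set local notify_webhook:1 endpoint="http://trigger:5000/minio-event"
#   mc event add local/reddit arn:minio:sqs::1:webhook --event put --prefix raw_data/
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```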