This repository automates the deployment of a scalable data pipeline on Kubernetes. It integrates Kafka, Spark, HDFS, and Apache Atlas for data lineage tracking, alongside the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized monitoring and logging.
- Data Processing: Stream and batch processing with Apache Spark.
- Data Lineage: Apache Atlas integration.
- Monitoring: ELK Stack with Filebeat for log aggregation.
- Scalability: Kubernetes for container orchestration.
- Automation: Scheduled Spark jobs via Kubernetes CronJobs (see the sketch after this list).
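The scheduled automation relies on Kubernetes CronJobs that periodically launch Spark jobs against the cluster. Below is a minimal sketch of what such a CronJob can look like; the job name, schedule, image tag, and script path are illustrative assumptions, not values taken from this repository's manifests.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-batch-job                 # assumed name
spec:
  schedule: "0 * * * *"                 # assumed hourly schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: spark-submit
              image: apache/spark:3.5.0         # assumed Spark image
              command:
                - /opt/spark/bin/spark-submit
                - --master
                - k8s://https://kubernetes.default.svc
                - --deploy-mode
                - cluster
                - --conf
                - spark.kubernetes.container.image=apache/spark:3.5.0
                - local:///opt/spark/work-dir/job.py   # assumed job script
```

In cluster deploy mode the submitting pod also needs a service account with permission to create driver and executor pods, which is typically granted through an RBAC manifest alongside the CronJob.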
To get started:

- Clone the repository:

  git clone https://github.com/kardesyazilim/data-pipeline-k8s.git
  cd data-pipeline-k8s
- Deploy the infrastructure components (a sketch of the kind of manifest this applies is shown after these steps):

  kubectl apply -f manifests/
- Monitor logs and metrics in Kibana; a sample Filebeat configuration for shipping container logs into the ELK Stack is sketched after these steps.
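The manifests/ directory holds the Kubernetes resources applied in the deploy step. As a rough illustration of their shape, the sketch below defines a Service and a single-replica Deployment for a Kafka broker; the names, image, tag, and port are assumptions for illustration, and a real broker needs additional configuration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kafka                     # assumed service name
spec:
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
      targetPort: 9092
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: bitnami/kafka:3.6        # assumed image and tag
          ports:
            - containerPort: 9092
```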
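For the monitoring step, Filebeat typically runs as a DaemonSet and ships container logs to Logstash, which indexes them into Elasticsearch for querying in Kibana. The filebeat.yml fragment below is a minimal sketch of that flow; the log path, node variable, and Logstash host are assumptions, not this repository's actual config.

```yaml
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log        # assumed node-level container log path
    processors:
      - add_kubernetes_metadata:         # enrich events with pod and namespace labels
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.logstash:
  hosts: ["logstash:5044"]               # assumed Logstash service name and Beats port
```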
Project structure:

data-pipeline-k8s/
├── README.md
├── helm-charts/
├── manifests/
├── spark-scripts/
├── config/
├── LICENSE
└── .gitignore