DoEKS is a tool to build, deploy and scale Data Platforms on Amazon EKS

hethkar/data-on-eks

 
 

Data on Amazon EKS (DoEKS)

💥 Welcome to Data on Amazon EKS (DoEKS) 💥

Data on Amazon EKS (DoEKS) is a tool for building AWS-managed and self-managed scalable data platforms on Amazon EKS. This repo provides Infrastructure as Code (IaC) templates (e.g., Terraform, AWS CDK), sample Apache Spark/ML jobs, references to AWS Data blogs, performance benchmark reports, and best practices for deploying data solutions on Amazon EKS.

Note: Data on EKS is under active development for a number of patterns. Please refer to the issues section to see the work-in-progress features.

🏗️ Architecture

The diagram displays the open source data tools, Kubernetes operators, and frameworks that run on Kubernetes and are covered in DoEKS. It also shows how AWS Data Analytics managed services integrate with the Data on EKS OSS tools.

image

🌟 Features

The Data on EKS (DoEKS) solution is organized into the following areas:

🎯 Data Analytics on EKS

🎯 AI/ML on EKS

🎯 Distributed Databases on EKS

🎯 Streaming Platforms on EKS

🎯 Scheduler Workflow Platforms on EKS

🏃‍♀️ Getting Started

In this repository you will find multiple deployment examples for bootstrapping data platforms with an Amazon EKS cluster and Kubernetes add-ons.

🚀 EMR on EKS with Apache YuniKorn - This template deploys an EMR on EKS cluster and uses Apache YuniKorn for custom batch scheduling.

🚀 EMR on EKS with Karpenter - Start here if you are new to EMR on EKS. This template deploys an EMR on EKS cluster and uses Karpenter to scale Spark jobs.

🚀 Spark Operator on EKS - This template deploys an EKS cluster and uses Spark Operator and Apache YuniKorn to run self-managed Spark jobs.

🚀 Amazon Managed Workflows for Apache Airflow (MWAA) - This template deploys an EMR on EKS cluster and uses Amazon Managed Workflows for Apache Airflow (MWAA) to run Spark jobs.

🚀 Self-managed Airflow on EKS - This template deploys self-managed Apache Airflow with best practices on an Amazon EKS cluster.

🚀 Ray on EKS - This template deploys the Ray Operator on EKS with sample scripts.

🗂️ Documentation

Check out the DoEKS website for instructions on deploying the Data on EKS patterns and running sample tests.

🏆 Motivation

Kubernetes is the most widely known system for large-scale orchestration of containerized software. It became more mature for running stateful workloads with the introduction of several storage options in version 1.19. In addition, the introduction of Spark on Kubernetes and the flexibility that Kubernetes offers have motivated many users to migrate their existing Hadoop-based clusters to Kubernetes.

Deploying and managing Kubernetes clusters and scaling data workloads is still challenging for many users because they are expected to be familiar with both Kubernetes and data workloads. To address this, we launched the Data on EKS (DoEKS) tool to simplify the journey for users who want to run Spark on EKS, Kubeflow, MLflow, Airflow, Presto, Kafka, Cassandra, or any other data workloads.

🤝 Support & Feedback

Data on EKS (DoEKS) is maintained by AWS Solution Architects. It is not part of an AWS service, and support is provided on a best-effort basis by the Data on EKS Blueprints community.

Please use the Issues section of this GitHub repository to post feedback, submit feature ideas, or report bugs.

🔐 Security

See CONTRIBUTING for more information.

💼 License

This library is licensed under the Apache 2.0 License.

🙌 Community

We invite everyone who is passionate about data on Kubernetes to join this initiative.

Built with ❤️ at AWS.
