
Inference workload deployment sample with optional bin-packing

The aws-do-inference repository contains an end-to-end example for running model inference locally on Docker or at scale on EKS. It supports CPU, GPU, and Inferentia processors and can pack multiple models into a single processor core for improved cost efficiency. While this example focuses on one processor target at a time, iterating over the steps below for CPU/GPU and Inferentia enables hybrid deployments where the best processor/accelerator is used to serve each model depending on its resource consumption profile. In this sample repository we use a bert-base NLP model from huggingface.co; however, the project structure and workflow are generic and can be adapted for use with other models.


Fig. 1 - Sample Amazon EKS cluster infrastructure for deploying, running and testing ML Inference workloads

The ML inference workloads in this sample project are deployed on the CPU, GPU, or Inferentia nodes as shown in Fig. 1. The control scripts run in any location that has access to the cluster API. To eliminate latency concerns related to cluster ingress, load tests run in a pod within the cluster and send requests to the models directly through the cluster pod network.

1. The Amazon EKS cluster has several node groups, with one EC2 instance family per node group. Node groups can be backed by different instance types, such as CPU (c5, c6i, c7g), GPU (g4dn), or AWS Inferentia (inf2), and multiple models can be packed per EKS node to maximize the number of ML models served by each node group. Model bin packing is used to maximize compute and memory utilization of the EC2 compute nodes in the cluster node groups.
2. The model serving application, built around an open-source natural language processing (NLP) PyTorch model from [huggingface.co](https://huggingface.co/), is built by users together with its ML framework dependencies as container images using the project automation framework and uploaded to Amazon Elastic Container Registry ([Amazon ECR](https://aws.amazon.com/ecr/)).
3. Using the project automation framework, model container images are pulled from ECR and deployed to the [Amazon EKS cluster](https://aws.amazon.com/eks/) with generated Deployment and Service manifests, applied through the Kubernetes API that is exposed via an Elastic Load Balancer (ELB). Model deployments are customized for each target EKS compute node instance type via settings in the central configuration file.
4. Following the best practice of separating model data from the containers that run it, the model microservice design allows scaling out to a large number of models. In this project, model containers pull data from Amazon Simple Storage Service ([Amazon S3](https://aws.amazon.com/s3/)) and other public model data sources each time they are initialized (a minimal sketch of this pattern follows the list below).
5. Using the project automation framework, test container images are pulled from ECR and deployed to the Amazon EKS cluster with generated Deployment and Service manifests applied through the Kubernetes API. Test deployments are customized for each target EKS compute node architecture via settings in the central configuration file. Load/scale testing is performed by sending simultaneous requests to the model service pool; performance test metrics are collected, recorded, and aggregated.
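
As a minimal illustration of item 4, the sketch below shows how a model container might fetch its artifacts from S3 at startup. The bucket name, object key, and local path are hypothetical placeholders, not values used by this project.

```python
# Minimal sketch: pull a model artifact from S3 at container startup.
# Bucket, key, and local path are hypothetical placeholders.
import boto3

def fetch_model_artifact(bucket="my-model-bucket",
                         key="bert-base/model.pt",
                         local_path="/app/model.pt"):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    return local_path

if __name__ == "__main__":
    print(f"Model artifact saved to {fetch_model_artifact()}")
```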



Fig. 2 - aws-do-inference video walkthrough

See an end-to-end accelerated video walkthrough (7 min) or follow the instructions below to build and run your own inference solution.

Prerequisites

It is assumed that an EKS cluster exists and contains node groups of the desired target instance types. In addition, it is assumed that the following basic tools are present: docker, kubectl, envsubst, kubetail, and bc.
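
If you would like a quick sanity check before starting, a short script such as the sketch below can verify that the required tools are on your PATH. It is illustrative only and not part of the project.

```python
# Optional sanity check: verify required CLI tools are on PATH.
# This helper is illustrative and not part of the project.
import shutil

REQUIRED_TOOLS = ["docker", "kubectl", "envsubst", "kubetail", "bc"]

missing = [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All required tools are available.")
```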

Operation

The project is operated through a set of action scripts as described below. To complete a full cycle from beginning to end, first configure the project, then follow steps 1 through 5, executing the corresponding action scripts. Each action script has a help screen, which can be invoked by passing "help" as an argument: <script>.sh help

Configure

./config.sh

A centralized configuration file, config.properties, contains all settings that are customizable for the project. This file comes pre-configured with reasonable defaults that work out of the box. To set the processor target or any other setting, edit the config file or execute the config.sh script. Configuration changes take effect immediately upon execution of the next action script.

1. Build

./build.sh

This step builds a base container for the selected processor. A base container is required for any of the subsequent steps. This step can be executed on any instance type, regardless of processor target.

Optionally, if you'd like to push the base image to a container registry, execute ./build.sh push. Pushing the base image to a container registry is required if you are planning to run the test step against models deployed to Kubernetes. If you are using a private registry and you need to log in before pushing, execute ./login.sh. This script logs in to AWS ECR; other private registry implementations can be added to the script as needed.

2. Trace

./trace.sh

Compiles the model into a TorchScript serialized graph file (.pt). This step requires the model to run on the target processor; therefore, it must be executed on an instance that has the target processor available.

Upon successful compilation, the model will be saved in a local folder named trace-{model_name}.

Note

It is recommended to use the AWS Deep Learning AMI to launch the instance where your model will be traced.
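
For reference, tracing a bert-base model from Hugging Face to TorchScript on a CPU target generally looks like the sketch below. The model variant, sample input, sequence length, and output path are assumptions; trace.sh may use different settings and compiles for GPU or Inferentia when those are the configured targets.

```python
# Illustrative CPU-only TorchScript trace of a bert-base model.
# Model name, sample text, sequence length, and output path are
# assumptions; the project's trace.sh may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torchscript=True
)
model.eval()

# Build example inputs with a fixed sequence length for tracing.
example = tokenizer(
    "This is a sample sentence.",
    return_tensors="pt",
    padding="max_length",
    max_length=128,
    truncation=True,
)

traced = torch.jit.trace(
    model, (example["input_ids"], example["attention_mask"])
)
torch.jit.save(traced, "traced_model.pt")
```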

3. Pack

./pack.sh

Packs the model in a container with FastAPI, also allowing multiple models to be packed within the same container. FastAPI is used as an example here for simplicity and performance; however, it can be interchanged with any other model server. For the purpose of this project we pack several instances of the same model in the container, but a natural extension of the same concept is to pack different models in the same container.

To push the model container image to a registry, execute ./pack.sh push. The model container must be pushed to a registry if you are deploying your models to Kubernetes.
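
To give a rough idea of what a packed model server looks like, the sketch below serves a traced TorchScript model behind FastAPI. The file paths, route name, and request schema are assumptions for illustration, not this project's actual API.

```python
# Minimal FastAPI server for a traced TorchScript model.
# Paths, route name, and request schema are illustrative assumptions.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical
model = torch.jit.load("traced_model.pt")  # produced by the trace step
model.eval()

class InferenceRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: InferenceRequest):
    inputs = tokenizer(
        req.text,
        return_tensors="pt",
        padding="max_length",
        max_length=128,
        truncation=True,
    )
    with torch.no_grad():
        outputs = model(inputs["input_ids"], inputs["attention_mask"])
    # Traced models return tuples; the logits are the first element.
    return {"logits": outputs[0].tolist()}
```

Packing several models into one container then amounts to loading several traced files and exposing one route per model instance.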

4. Deploy

./deploy.sh

This script runs your models on the configured runtime. The project has built-in support for both local Docker runtimes and Kubernetes. The deploy script also has several sub-commands that facilitate the management of the full lifecycle of your model server containers.

  • ./deploy.sh run - (default) runs model server containers
  • ./deploy.sh status [number] - show container / pod / service status. Optionally show only specified instance number
  • ./deploy.sh logs [number] - tail container logs. Optionally tail only specified instance number
  • ./deploy.sh exec <number> - open bash into model server container with the specified instance number
  • ./deploy.sh stop - stop and remove deployed model containers from the runtime
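
Once containers are running, a quick way to confirm that a model server responds is to send it a single request, as in the sketch below. The host, port, and /predict route follow the hypothetical FastAPI example from the pack step and are not this project's documented endpoints.

```python
# Illustrative smoke test against a locally running model server.
# Host, port, and route are assumptions carried over from the
# hypothetical FastAPI sketch in the pack step.
import requests

resp = requests.post(
    "http://localhost:8080/predict",
    json={"text": "This is a sample sentence."},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```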

5. Test

./test.sh

The test script runs a number of tests against the model servers deployed in your runtime environment; a simplified, illustrative load-test sketch follows the command list below.

  • ./test.sh build - build test container image
  • ./test.sh push - push test image to container registry
  • ./test.sh pull - pull the current test image from the container registry if one exists
  • ./test.sh run - run a test client container instance for advanced testing and exploration
  • ./test.sh exec - open shell in test container
  • ./test.sh status - show status of test container
  • ./test.sh stop - stop test container
  • ./test.sh help - list the available test commands
  • ./test.sh run seq - run sequential test. One request at a time submitted to each model server and model in sequential order.
  • ./test.sh run rnd - run random test. One request at a time submitted to a randomly selected server and model at a preset frequency.
  • ./test.sh run bmk - run benchmark test client to measure throughput and latency under load with random requests
  • ./test.sh run bma - run benchmark analysis - aggregate and average stats from logs of all completed benchmark containers
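
To give a sense of what the benchmark mode measures, the sketch below sends concurrent random requests and reports rough throughput and latency figures. The endpoint pool and request payload are assumptions; the project's actual test client is more elaborate.

```python
# Simplified, illustrative benchmark client: concurrent random requests
# with rough throughput and latency stats. Endpoints and payload are
# hypothetical; the project's test containers implement the real logic.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINTS = ["http://localhost:8080/predict"]  # hypothetical target pool
PAYLOAD = {"text": "This is a sample sentence."}
NUM_REQUESTS = 200
CONCURRENCY = 16

def one_request(_):
    start = time.perf_counter()
    requests.post(random.choice(ENDPOINTS), json=PAYLOAD, timeout=30)
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - start

print(f"throughput: {NUM_REQUESTS / elapsed:.1f} req/s")
print(f"p50 latency: {statistics.median(latencies) * 1000:.1f} ms")
```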

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
