If you would like to run a machine learning inference benchmark to evaluate the performance of running multiple pods on one GPU, follow the steps below to get the benchmark results:
- Set environment variables with your AWS account ID and default region:
export ACCOUNT_ID=123456789012
export AWS_DEFAULT_REGION=us-west-2
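If you prefer not to hard-code the account ID, you can look it up from your current AWS CLI credentials instead (assuming the CLI is already configured for the target account):
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)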
- Build the tensorflow-benchmark image:
$ cat <<EOF | docker build -t ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2 -
# Intermediate stage: clone the TensorFlow benchmark scripts
FROM alpine as intermediate
LABEL stage=intermediate
RUN apk update && \
    apk add --update git && \
    git clone https://github.com/tensorflow/benchmarks.git && \
    cd benchmarks && \
    git checkout cnn_tf_v1.15_compatible

# Choose the base image for our final image
FROM tensorflow/tensorflow:1.15.2-gpu

# Copy the benchmark scripts from the intermediate stage
RUN mkdir /opt/benchmarks
COPY --from=intermediate /benchmarks /opt/benchmarks
EOF
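Before pushing, you can optionally confirm that the image was built and tagged as expected:
docker images ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark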
- Create an ECR repository, log in to ECR, and push the image:
aws ecr create-repository --repository-name tensorflow-benchmark
$(aws ecr get-login --no-include-email)
docker push ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2
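Note that aws ecr get-login exists only in AWS CLI v1; if you are using AWS CLI v2, you can log in with get-login-password instead:
aws ecr get-login-password --region ${AWS_DEFAULT_REGION} | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com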
- Run TensorFlow benchmark jobs in parallel. You may change the model name to resnet101, inception3, or vgg16; refer to the [benchmarks](https://github.com/tensorflow/benchmarks) repository for more information about the parameters.
$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: tf-benchmark
spec:
  # Run 4 benchmark pods at the same time
  completions: 4
  parallelism: 4
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      # Share the host IPC namespace and the MPS pipe directory (/tmp/nvidia-mps)
      # so the containers can talk to the NVIDIA MPS daemon on the node
      hostIPC: true
      containers:
        - name: tf-benchmark
          image: ${ACCOUNT_ID}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/tensorflow-benchmark:v1.15.2
          args:
            - "python3"
            - "/opt/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"
            - "--data_name=imagenet"
            - "--model=resnet50"
            - "--num_batches=100"
            - "--batch_size=4"
            - "--num_gpus=1"
            # Cap each worker at 20% of the GPU memory
            - "--gpu_memory_frac_for_testing=0.2"
          resources:
            limits:
              k8s.amazonaws.com/vgpu: 2
          volumeMounts:
            - name: nvidia-mps
              mountPath: /tmp/nvidia-mps
      volumes:
        - name: nvidia-mps
          hostPath:
            path: /tmp/nvidia-mps
EOF
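The benchmark pods will only be scheduled on nodes that advertise the k8s.amazonaws.com/vgpu extended resource. You can check node capacity and watch the pods start with, for example:
kubectl describe nodes | grep k8s.amazonaws.com/vgpu
kubectl get pods -l job-name=tf-benchmark --watch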
- Wait for the benchmark job to complete:
kubectl wait --for=condition=complete --timeout=30m job/tf-benchmark > /dev/null
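If a pod fails (for example, because the image cannot be pulled), this wait will eventually time out with an error instead of returning; you can inspect progress from another terminal with:
kubectl get job tf-benchmark
kubectl get pods -l job-name=tf-benchmark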
- Get the result of each benchmark pod:
$ for podName in $(kubectl get pods -l job-name=tf-benchmark --no-headers -o custom-columns=":metadata.name")
do
  score=$(kubectl logs $podName | grep 'total images/sec: ' | sed -E 's/total\ images\/sec\:\ (.*)/\1/g')
  echo $score
done
- You may get benchmark results like the following; each value is the total images/sec reported by one pod, i.e. how many images that job processed per second.
15.47
15.46
16.00
15.58
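These are per-pod numbers; if you also want the aggregate throughput across all four pods, you can sum the scores, for example:
kubectl get pods -l job-name=tf-benchmark --no-headers -o custom-columns=":metadata.name" \
  | xargs -I{} kubectl logs {} \
  | grep 'total images/sec: ' \
  | awk '{sum += $NF} END {print "aggregate images/sec:", sum}'
When you are finished, you can clean up with kubectl delete job tf-benchmark.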