Add Triton Inference Server deployment (WongKinYiu#346)
* Add client code
* Add README.md

Co-authored-by: Philipp Schmidt <[email protected]>
1 parent a7c0029 · commit 8eee99f
Showing 8 changed files with 772 additions and 0 deletions.
@@ -0,0 +1,161 @@
# YOLOv7 on Triton Inference Server

Instructions to deploy YOLOv7 as a TensorRT engine to [Triton Inference Server](https://github.com/NVIDIA/triton-inference-server).

Triton Inference Server takes care of model deployment with many out-of-the-box benefits, such as a gRPC and HTTP interface, automatic scheduling on multiple GPUs, shared memory (even on GPU), dynamic server-side batching, health metrics, and memory resource management.

No additional dependencies are needed to run this deployment, apart from a working Docker daemon with GPU support.
## Export TensorRT

See https://github.com/WongKinYiu/yolov7#export for more info.

```bash
# PyTorch YOLOv7 -> ONNX with grid, EfficientNMS plugin and dynamic batch size
python export.py --weights ./yolov7.pt --grid --end2end --dynamic-batch --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640
# ONNX -> TensorRT with trtexec and docker
docker run -it --rm --gpus=all nvcr.io/nvidia/tensorrt:22.06-py3
# Copy onnx -> container: docker cp yolov7.onnx <container-id>:/workspace/
# Export with FP16 precision, min batch 1, opt batch 8 and max batch 8
./tensorrt/bin/trtexec --onnx=yolov7.onnx --minShapes=images:1x3x640x640 --optShapes=images:8x3x640x640 --maxShapes=images:8x3x640x640 --fp16 --workspace=4096 --saveEngine=yolov7-fp16-1x8x8.engine --timingCacheFile=timing.cache
# Test engine
./tensorrt/bin/trtexec --loadEngine=yolov7-fp16-1x8x8.engine
# Copy engine -> host: docker cp <container-id>:/workspace/yolov7-fp16-1x8x8.engine .
```
Example output of the engine test on an RTX 3090:

```
[I] === Performance summary ===
[I] Throughput: 73.4985 qps
[I] Latency: min = 14.8578 ms, max = 15.8344 ms, mean = 15.07 ms, median = 15.0422 ms, percentile(99%) = 15.7443 ms
[I] End-to-End Host Latency: min = 25.8715 ms, max = 28.4102 ms, mean = 26.672 ms, median = 26.6082 ms, percentile(99%) = 27.8314 ms
[I] Enqueue Time: min = 0.793701 ms, max = 1.47144 ms, mean = 1.2008 ms, median = 1.28644 ms, percentile(99%) = 1.38965 ms
[I] H2D Latency: min = 1.50073 ms, max = 1.52454 ms, mean = 1.51225 ms, median = 1.51404 ms, percentile(99%) = 1.51941 ms
[I] GPU Compute Time: min = 13.3386 ms, max = 14.3186 ms, mean = 13.5448 ms, median = 13.5178 ms, percentile(99%) = 14.2151 ms
[I] D2H Latency: min = 0.00878906 ms, max = 0.0172729 ms, mean = 0.0128844 ms, median = 0.0125732 ms, percentile(99%) = 0.0166016 ms
[I] Total Host Walltime: 3.04768 s
[I] Total GPU Compute Time: 3.03404 s
[I] Explanations of the performance metrics are printed in the verbose logs.
```

Note: 73.5 qps x batch size 8 = 588 fps at ~15 ms latency.
## Model Repository

See [Triton Model Repository Documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#model-repository) for more info.

```bash
# Create folder structure
mkdir -p triton-deploy/models/yolov7/1/
touch triton-deploy/models/yolov7/config.pbtxt
# Place model
mv yolov7-fp16-1x8x8.engine triton-deploy/models/yolov7/1/model.plan
```
## Model Configuration

See [Triton Model Configuration Documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#model-configuration) for more info.

Minimal configuration for `triton-deploy/models/yolov7/config.pbtxt`:

```
name: "yolov7"
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching { }
```
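The dynamic batcher and instance placement can also be tuned explicitly. Below is a sketch that extends the minimal file; the extra values (preferred batch sizes, queue delay, one GPU instance) are illustrative and not part of the deployment above. Input and output tensors can stay omitted because they are auto-completed from the TensorRT plan when the server runs with `--strict-model-config=false`, as in the start command further down.

```
name: "yolov7"
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```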
Example repository:

```bash
$ tree triton-deploy/
triton-deploy/
└── models
    └── yolov7
        ├── 1
        │   └── model.plan
        └── config.pbtxt

3 directories, 2 files
```
## Start Triton Inference Server

```
docker run --gpus all --rm --ipc=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/triton-deploy/models:/models nvcr.io/nvidia/tritonserver:22.06-py3 tritonserver --model-repository=/models --strict-model-config=false --log-verbose 1
```

In the log you should see:

```
+--------+---------+--------+
| Model  | Version | Status |
+--------+---------+--------+
| yolov7 | 1       | READY  |
+--------+---------+--------+
```
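You can also check readiness programmatically. A minimal sketch with the Python `tritonclient` package (installed in the client section further below), assuming the default gRPC port 8001 mapped by the `docker run` command above:

```python
import tritonclient.grpc as grpcclient

# Connect to the gRPC endpoint exposed by the container (-p8001:8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("yolov7"))
```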
## Performance with Model Analyzer

See [Triton Model Analyzer Documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_analyzer.md#model-analyzer) for more info.

Performance numbers @ RTX 3090 + AMD Ryzen 9 5950X.

Example test with 16 concurrent clients using shared memory, each sending requests with batch size 1:

```bash
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:22.06-py3-sdk /bin/bash

./install/bin/perf_analyzer -m yolov7 -u 127.0.0.1:8001 -i grpc --shared-memory system --concurrency-range 16

# Result (truncated)
Concurrency: 16, throughput: 590.119 infer/sec, latency 27080 usec
```
Throughput for 16 clients with batch size 1 is the same as for a single thread running the engine locally at batch size 16, thanks to Triton's [Dynamic Batching Strategy](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#dynamic-batcher). The result without dynamic batching (disabled in the model configuration) is considerably worse:

```bash
# Result (truncated)
Concurrency: 16, throughput: 335.587 infer/sec, latency 47616 usec
```
## How to run the model in your code

An example client can be found in client.py. It can run dummy input, images, and videos.

```bash
pip3 install tritonclient[all] opencv-python
python3 client.py image data/dog.jpg
```

![exemplary output result](data/dog_result.jpg)
```
$ python3 client.py --help
usage: client.py [-h] [-m MODEL] [--width WIDTH] [--height HEIGHT] [-u URL] [-o OUT] [-f FPS] [-i] [-v] [-t CLIENT_TIMEOUT] [-s] [-r ROOT_CERTIFICATES] [-p PRIVATE_KEY] [-x CERTIFICATE_CHAIN] {dummy,image,video} [input]

positional arguments:
  {dummy,image,video}   Run mode. 'dummy' will send an empty buffer to the server to test if inference works. 'image' will process an image. 'video' will process a video.
  input                 Input file to load from in image or video mode

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Inference model name, default yolov7
  --width WIDTH         Inference model input width, default 640
  --height HEIGHT       Inference model input height, default 640
  -u URL, --url URL     Inference server URL, default localhost:8001
  -o OUT, --out OUT     Write output into file instead of displaying it
  -f FPS, --fps FPS     Video output fps, default 24.0 FPS
  -i, --model-info      Print model status, configuration and statistics
  -v, --verbose         Enable verbose client output
  -t CLIENT_TIMEOUT, --client-timeout CLIENT_TIMEOUT
                        Client timeout in seconds, default no timeout
  -s, --ssl             Enable SSL encrypted channel to the server
  -r ROOT_CERTIFICATES, --root-certificates ROOT_CERTIFICATES
                        File holding PEM-encoded root certificates, default none
  -p PRIVATE_KEY, --private-key PRIVATE_KEY
                        File holding PEM-encoded private key, default is none
  -x CERTIFICATE_CHAIN, --certificate-chain CERTIFICATE_CHAIN
                        File holding PEM-encoded certificate chain, default is none
```
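If you prefer to call Triton directly from your own code instead of going through client.py, the sketch below sends one image over gRPC with the Python `tritonclient` package. It assumes the tensor names commonly produced by the end2end export above (`images` as input; `num_dets`, `det_boxes`, `det_scores`, `det_classes` as outputs; check your ONNX/engine if they differ) and uses a naive resize rather than replicating client.py's preprocessing, so treat it as a starting point, not a drop-in replacement.

```python
import cv2
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Naive preprocessing: resize to 640x640, BGR -> RGB, HWC -> CHW, float32 in [0, 1].
img = cv2.imread("data/dog.jpg")
inp = cv2.cvtColor(cv2.resize(img, (640, 640)), cv2.COLOR_BGR2RGB)
inp = inp.transpose(2, 0, 1).astype(np.float32) / 255.0
inp = np.expand_dims(inp, axis=0)  # add batch dimension -> 1x3x640x640

inputs = [grpcclient.InferInput("images", list(inp.shape), "FP32")]
inputs[0].set_data_from_numpy(inp)
outputs = [grpcclient.InferRequestedOutput(name)
           for name in ("num_dets", "det_boxes", "det_scores", "det_classes")]

result = client.infer(model_name="yolov7", inputs=inputs, outputs=outputs)
num_dets = int(result.as_numpy("num_dets")[0][0])
boxes = result.as_numpy("det_boxes")[0][:num_dets]     # xyxy in input pixels
scores = result.as_numpy("det_scores")[0][:num_dets]
classes = result.as_numpy("det_classes")[0][:num_dets]
print(f"{num_dets} detections, top score {scores[0]:.2f}" if num_dets else "no detections")
```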
@@ -0,0 +1,33 @@
class BoundingBox:
    def __init__(self, classID, confidence, x1, x2, y1, y2, image_width, image_height):
        self.classID = classID
        self.confidence = confidence
        self.x1 = x1
        self.x2 = x2
        self.y1 = y1
        self.y2 = y2
        self.u1 = x1 / image_width
        self.u2 = x2 / image_width
        self.v1 = y1 / image_height
        self.v2 = y2 / image_height

    def box(self):
        return (self.x1, self.y1, self.x2, self.y2)

    def width(self):
        return self.x2 - self.x1

    def height(self):
        return self.y2 - self.y1

    def center_absolute(self):
        return (0.5 * (self.x1 + self.x2), 0.5 * (self.y1 + self.y2))

    def center_normalized(self):
        return (0.5 * (self.u1 + self.u2), 0.5 * (self.v1 + self.v2))

    def size_absolute(self):
        return (self.x2 - self.x1, self.y2 - self.y1)

    def size_normalized(self):
        return (self.u2 - self.u1, self.v2 - self.v1)
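A hypothetical usage sketch for this class follows; the detection values and file names are made up for illustration, and the class is assumed to be importable from wherever this file lands in the repo.

```python
import cv2

# Illustrative detection on a 1280x720 frame (values are made up);
# BoundingBox from the file above is assumed to be in scope.
box = BoundingBox(classID=16, confidence=0.91,
                  x1=100, x2=300, y1=50, y2=400,
                  image_width=1280, image_height=720)

print(box.box())                # (100, 50, 300, 400)
print(box.size_absolute())      # (200, 350)
print(box.center_normalized())  # (0.15625, 0.3125)

# Draw the box the way a client typically would.
frame = cv2.imread("data/dog.jpg")
frame = cv2.resize(frame, (1280, 720))
cv2.rectangle(frame, (box.x1, box.y1), (box.x2, box.y2), (0, 255, 0), 2)
cv2.imwrite("out.jpg", frame)
```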