Inference Serving Best Practice for Transducer ASR based on Icefall

In this tutorial, we'll go through how to run non-streaming (offline) and streaming ASR transducer models trained with Icefall on GPUs, and how to deploy them as a service with NVIDIA Triton Inference Server.


Preparation

First of all, we need to get the environment and models ready.

Prepare Environment

Clone the repository:

# Clone Sherpa repo
git clone https://github.com/k2-fsa/sherpa.git
cd sherpa
export SHERPA_SRC=$PWD

We highly recommend using Docker containers to simplify the setup.

Build the server docker image:

cd $SHERPA_SRC/triton
docker build . -f Dockerfile/Dockerfile.server -t sherpa_triton_server:latest --network host

Alternatively, you can pull the pre-built image, which is based on the official tritonserver image:

docker pull soar97/triton-k2:24.07

Start the docker container:

docker run --gpus all -v $SHERPA_SRC:/workspace/sherpa --name sherpa_server --net host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -it soar97/triton-k2:24.07

You should now be inside the container.
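As a quick sanity check, you can confirm that the GPUs passed via --gpus all are visible inside the container:

# The GPUs attached to the container should be listed here
nvidia-smi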

Deploy on Triton Inference Server

In this part, we'll go through how to deploy the model on Triton.

The model repositories are provided in the model_repo_offline and model_repo_streaming directories. In each of them you can find a directory for every component of the pipeline, plus a transducer directory that assembles all the components into a single ensemble pipeline. Each component directory contains a config.pbtxt configuration file and a version directory containing the model file.
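For reference, the offline repository looks roughly like the sketch below; the exact set of component directories differs between the offline and streaming recipes, so treat the names as illustrative:

model_repo_offline/
├── encoder/              # one directory per component
│   ├── config.pbtxt      # Triton model configuration
│   └── 1/                # version directory holding the model file
├── decoder/
├── joiner/
├── feature_extractor/
├── scorer/
└── transducer/           # ensemble that chains the components into one pipeline
    └── config.pbtxt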

Quick Start

Now start server:

# Inside the docker container
# If you want to use greedy search decoding
cd /Your_SHERPA_SRC/triton/
apt-get install git-lfs
pip3 install -r ./requirements.txt
export CUDA_VISIBLE_DEVICES="your_gpu_id"

bash scripts/build_wenetspeech_zipformer_offline_trt.sh
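Once the script has built the models and launched the server, you can check that Triton is ready to accept requests (assuming the default HTTP port 8000):

# Returns HTTP 200 once all models are loaded and the server is ready
curl -v localhost:8000/v2/health/ready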

Benchmark using a Dataset

You can benchmark the deployed service end-to-end with the Triton-ASR-Client tool:

git clone https://github.com/yuekaizhang/Triton-ASR-Client.git
cd Triton-ASR-Client
pip3 install -r requirements.txt
num_task=16
python3 client.py \
    --server-addr localhost \
    --model-name whisper \
    --num-tasks $num_task \
    --whisper-prompt "<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>" \
    --manifest-dir ./datasets/aishell1_test

Using TensorRT acceleration

TRT Quick start

You can directly use the following script to export the TRT engine and start the Triton server for the offline Conformer model:

bash scripts/build_librispeech_pruned_transducer_stateless3_offline_trt.sh

Export to TensorRT

If you want to build TensorRT engines for your own service, you can try the following steps:

Model export

You first have to export your models to ONNX format by following the export instructions here. Assuming you have put the exported ONNX model in the $model_dir directory, run the following commands:

# First, use polygraphy to simplify the onnx model.
polygraphy surgeon sanitize $model_dir/encoder.onnx --fold-constants -o $model_dir/encoder.onnx
# Then build the TRT engine with the /usr/src/tensorrt/bin/trtexec tool shipped in the tritonserver docker image.
bash scripts/build_trt.sh 16 $model_dir/encoder.onnx model_repo_offline/encoder/1/encoder.trt

The generated TRT engine will be saved to model_repo_offline/encoder/1/encoder.trt. After that, you can start the Triton server.
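For reference, scripts/build_trt.sh wraps trtexec; a minimal hand-written equivalent is sketched below. The input names x/x_lens, the 80-dimensional features, and the shape ranges are assumptions and must be adapted to your exported encoder:

# Hedged sketch of a trtexec invocation; adjust input names and shapes to your model
/usr/src/tensorrt/bin/trtexec \
    --onnx=$model_dir/encoder.onnx \
    --minShapes=x:1x100x80,x_lens:1 \
    --optShapes=x:16x1000x80,x_lens:16 \
    --maxShapes=x:16x3000x80,x_lens:16 \
    --fp16 \
    --saveEngine=model_repo_offline/encoder/1/encoder.trt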

Benchmark for Conformer TRT encoder vs ONNX

Model   Batch size   Avg latency (ms)   QPS
ONNX    1            7.44               134.48
ONNX    8            14.92              536.09
ONNX    16           22.84              700.67
ONNX    32           41.62              768.84
ONNX    64           80.48              795.27
ONNX    128          171.97             744.32
TRT     1            5.21834            193.93
TRT     8            11.7826            703.49
TRT     16           20.4444            815.79
TRT     32           37.583             893.56
TRT     64           69.8312            965.40
TRT     128          139.702            964.57
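If you want to collect similar numbers for your own deployment, Triton's perf_analyzer tool is one option. The sketch below assumes the encoder model name, an input named x with 80-dimensional features, and a 1000-frame sequence length; adjust these to match your configuration:

# Hedged sketch: measure latency/throughput of the encoder at batch size 16
perf_analyzer -m encoder -b 16 --shape x:1000,80 --concurrency-range 1:4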