Our latest Chinese-CLIP code supports the conversion of Pytorch models of all scales into ONNX or TensorRT formats, thereby improving the inference speed of feature calculation compared with the original Pytorch models without affecting the downstream task effect of feature extraction. Below we give the whole process of preparing the FP16 Chinese-CLIP models in ONNX and TensorRT formats on GPU (and also give the download links of Chinese-CLIP pretraining TensorRT models), and attach the comparison of model effect and inference speed, so that you can take advantage of the advantages of ONNX and TensorRT library in inference performance.
- GPU hardware requirements: Please prepare Nvidia GPUs with Volta architecture and above (equipped with FP16 Tensor Core). Please refer to this document for the corresponding GPUs of each Nvidia architecture. Here we take T4 GPU as an example.
- CUDA: CUDA version 11.6 and above is recommended. We take version 11.6 as an example.
- CUDNN: CUDNN version 8.6.0 and above is recommended. We take version 8.6.0 as an example. Please note that TensorRT and CUDNN have version correspondence, e.g. TensorRT 8.5.x must correspond to CUDNN 8.6.0, see the TensorRT version requirements for details.
- ONNX: Please run
pip install onnx onnxruntime-gpu onnxmltools
to install. Note that when we convert the TensorRT model, we will follow the steps Pytorch → ONNX → TensorRT, so preparing the TensorRT model also requires installing the ONNX library first. Here we take onnx version 1.13.0, onnxruntime-gpu version 1.13.1, and onnxmltools version 1.11.1 as examples. - TensorRT: The recommended TensorRT version is 8.5.x. We use 8.5.2.2 as an example, using pip to install
pip install tensorrt==8.5.2.2
. For the CUDNN version corresponding to each TensorRT version, please refer to the "NVIDIA TensorRT Support Matrix" from this documentation page. - Pytorch: Pytorch version 1.12.1 and above is recommended. We take version 1.12.1 as an example. (It is recommended to directly pip install 1.12.1 + cu116, and try not to use conda to install cudatoolkit, avoiding TensorRT errors due to CUDNN version changes. )
- Other dependencies as required in requirements.txt.
For code to convert a Pytorch checkpoint to ONNX format, see cn_clip/deploy/pytorch_to_onnx.py
. Let's take the example of converting a ViT-B-16 size Chinese-CLIP pretrained model. You can run the following code (Please refer to the Code Organization section of Readme to set ${DATAPATH}
and replace the script content below, using relative paths where possible. ) :
cd Chinese-CLIP/
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip
# Please refer to the 'Code Organization' section in Readme to set `${DATAPATH}` and replace the script content below, using relative paths where possible: https://github.com/OFA-Sys/Chinese-CLIP/blob/master/README_En.md#code-organization
checkpoint_path=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # Specify the full path to the ckpt to be converted
mkdir -p ${DATAPATH}/deploy/ # Create output folders for ONNX models
python cn_clip/deploy/pytorch_to_onnx.py \
--model-arch ViT-B-16 \
--pytorch-ckpt-path ${checkpoint_path} \
--save-onnx-path ${DATAPATH}/deploy/vit-b-16 \
--convert-text --convert-vision
Each configuration item is defined as follows:
model-arch
: Model size, options include["RN50", "ViT-B-16", "ViT-L-14", "ViT-L-14-336", "ViT-H-14"]
, details of each model can be found in Readme.pytorch-ckpt-path
: Specify the Pytorch model ckpt path, which in the code example above we specified as the pretrained ckpt path, or you can specify the location of your finetune ckpt. The parameters in ckpt need to correspond to the model size specified bymodel-arch
.save-onnx-path
: Specifies the path (prefix) to the output ONNX format model. After the conversion is completed, the code will output the text-side and image-side ONNX format encoded model files, one for FP32 and one for FP16, which specifies the path prefix of the above output files.convert-text
andconvert-vision
: Specify whether to convert text-side and image-side models.context-length
(Optional): Specify the sequence length of the input received by the text-side ONNX model, defaulting to the 52 used in our pretraining ckpt.download-root
(Optional): Ifpytorch-ckpt-path
is not specified, the code will automatically download the official Chinese-CLIP pretraining ckpt for conversion according tomodel-arch
, which is stored in the specified directorydownload-root
.
After running, the following log output will be obtained:
Finished PyTorch to ONNX conversion...
>>> The text FP32 ONNX model is saved at ${DATAPATH}/deploy/vit-b-16.txt.fp32.onnx
>>> The text FP16 ONNX model is saved at ${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx with extra file ${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx.extra_file
>>> The vision FP32 ONNX model is saved at ${DATAPATH}/deploy/vit-b-16.img.fp32.onnx
>>> The vision FP16 ONNX model is saved at ${DATAPATH}/deploy/vit-b-16.img.fp16.onnx with extra file ${DATAPATH}/deploy/vit-b-16.img.fp16.onnx.extra_file
After the above code is executed, we get the ViT-B-16 size Chinese-CLIP text-side and image-side ONNX models, which can be used to extract image and text features respectively. The paths to the output ONNX models are all prefixed with save-onnx-path
at the time of running the script, followed by .img
/.txt
, .fp16
/.fp32
, and .onnx
. We will subsequently use mainly ONNX models vit-b-16.txt.fp16.onnx
and vit-b-16.img.fp16.onnx
in FP16 format.
Notice that some of the ONNX model files also come with an extra_file, which is also part of the corresponding ONNX model. When using these ONNX models, the path to the extra_file is stored in the .onnx
file (e.g. ${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx.extra_file
) and the extra_file will be loaded according to this path, so please do not change the path and use relative path for ${DATAPATH}
when converting (e.g. ../datapath
) to avoid errors at runtime when the extra_file is not found.
We use the following sample code in the Chinese-CLIP/
directory to read the converted ViT-B-16 size ONNX image-side model vit-b-16.img.fp16.onnx
and extract the features of the Pikachu image in Readme. Note that the converted ONNX model only accepts inputs with a batch size of 1, i.e. only one input image is processed in one call.
# Complete the necessary import (omitted below)
import onnxruntime
from PIL import Image
import numpy as np
import torch
import argparse
import cn_clip.clip as clip
from clip import load_from_name, available_models
from clip.utils import _MODELS, _MODEL_INFO, _download, available_models, create_model, image_transform
# Load ONNX image-side model(**Please replace ${DATAPATH} with the actual path**)
img_sess_options = onnxruntime.SessionOptions()
img_run_options = onnxruntime.RunOptions()
img_run_options.log_severity_level = 2
img_onnx_model_path="${DATAPATH}/deploy/vit-b-16.img.fp16.onnx"
img_session = onnxruntime.InferenceSession(img_onnx_model_path,
sess_options=img_sess_options,
providers=["CUDAExecutionProvider"])
# Preprocess images
model_arch = "ViT-B-16" # Here we use the ViT-B-16 size, other sizes please modify accordingly
preprocess = image_transform(_MODEL_INFO[model_arch]['input_resolution'])
# Example Pikachu image, Torch Tensor of [1, 3, resolution, resolution] size after preprocessing
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0)
# Calculate image-side features with ONNX model
image_features = img_session.run(["unnorm_image_features"], {"image": image.cpu().numpy()})[0] # Unnormalized image features
image_features = torch.tensor(image_features)
image_features /= image_features.norm(dim=-1, keepdim=True) # Normalized Chinese-CLIP image features for downstream tasks
print(image_features.shape) # Torch Tensor shape: [1, feature dimension]
Similarly, we use the following code to complete the loading and feature calculation of the text-side ONNX model. As with the image-side model, the text-side ONNX model only accepts inputs with a batch size of 1, i.e., only one input text is processed in a single call.
# Load ONNX text-side model(**Please replace ${DATAPATH} with the actual path**)
txt_sess_options = onnxruntime.SessionOptions()
txt_run_options = onnxruntime.RunOptions()
txt_run_options.log_severity_level = 2
txt_onnx_model_path="${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx"
txt_session = onnxruntime.InferenceSession(txt_onnx_model_path,
sess_options=txt_sess_options,
providers=["CUDAExecutionProvider"])
# Tokenize the 4 input texts. The sequence length is specified to 52, which is the same as when converting the ONNX model (see the context_length in the conversion process).
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"], context_length=52)
# Calculate text-side features sequentially with ONNX model
text_features = []
for i in range(len(text)):
one_text = np.expand_dims(text[i].cpu().numpy(),axis=0)
text_feature = txt_session.run(["unnorm_text_features"], {"text":one_text})[0] # Unnormalized image features
text_feature = torch.tensor(text_feature)
text_features.append(text_feature)
text_features = torch.squeeze(torch.stack(text_features),dim=1) # 4 feature vectors stacked together
text_features = text_features / text_features.norm(dim=1, keepdim=True) # Normalized Chinese-CLIP text features for downstream tasks
print(text_features.shape) # Torch Tensor shape: [1, feature dimension]
Similarities are calculated from the normalized image and text features produced by the ONNX model after inner product and softmax (need to consider logit_scale
), which is identical to the original Pytorch models, see the following code:
# Inner product followed by softmax
# Note that in the inner product calculation, due to the concept of temperature during contrast learning training, you need to multiply the model logit_scale.exp(), our pretraining model logit_scale is 4.6052, so here multiply by 100.
# For your own ckpt, please load it using torch.load and then check ckpt['state_dict']['module.logit_scale'] or ckpt['state_dict']['logit_scale'].
logits_per_image = 100 * image_features @ text_features.t()
print(logits_per_image.softmax(dim=-1)) # Image & text similarity probabilities: [[1.2252e-03, 5.2874e-02, 6.7116e-04, 9.4523e-01]]
We can see that the output similarities given by the ONNX model are largely consistent with the results calculated in API Use Case section of Readme based on the same model in Pytorch, proving the correctness of the feature calculation of the ONNX model. However, the feature calculation speed of ONNX model is more advantageous than that of Pytorch (see [below](#Speed Comparison Results) for details).
Compared with ONNX models, TensorRT models have faster inference speed, and we provide converted Chinese-CLIP pretrained TensorRT image-side and text-side models (download links). As mentioned above, we prepare the TensorRT format model, which is further transformed using the ONNX format model we just obtained. For the code to convert ONNX to TensorRT format, see cn_clip/deploy/onnx_to_tensorrt.py
. Still using the ViT-B-16 size Chinese-CLIP as an example, using the ONNX models vit-b-16.txt.fp16.onnx
and vit-b-16.img.fp16.onnx
, run the following code in Chinese-CLIP/
:
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip
# As above, please replace ${DATAPATH} according to the actual situation
python cn_clip/deploy/onnx_to_tensorrt.py \
--model-arch ViT-B-16 \
--convert-text \
--text-onnx-path ${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx \
--convert-vision \
--vision-onnx-path ${DATAPATH}/deploy/vit-b-16.img.fp16.onnx \
--save-tensorrt-path ${DATAPATH}/deploy/vit-b-16 \
--fp16
Each configuration item is defined as follows:
model-arch
: Model size, options include["RN50", "ViT-B-16", "ViT-L-14", "ViT-L-14-336", "ViT-H-14"]
, details of each model can be found in Readme.convert-text
andconvert-vision
: Specify whether to convert text-side and image-side models.text-onnx-path
: Specify the text-side ONNX model file path, which needs to correspond to the model size specified bymodel-arch
.vision-onnx-path
: Specify the image-side ONNX model file path, which needs to correspond to the model size specified bymodel-arch
.save-tensorrt-path
: Specifies the path (prefix) to the output TensorRT format model. After the conversion, the code will output the model files encoded in TensorRT format for the text side and image side respectively, and this parameter specifies the path prefix of the above output files.fp16
: Specify models converted to FP16 precision TensorRT format.
The process takes a few minutes to ten minutes, depending on the model's size. After running, the following log output will be obtained:
Finished ONNX to TensorRT conversion...
>>> The text FP16 TensorRT model is saved at ${DATAPATH}/deploy/vit-b-16.txt.fp16.trt
>>> The vision FP16 TensorRT model is saved at ${DATAPATH}/deploy/vit-b-16.img.fp16.trt
After the execution of the above sample code, we obtained ViT-B-16 size Chinese-CLIP text-side and image-side TensorRT format models using ONNX models, which can be used to extract image & text features. The paths to the output TensorRT models are all prefixed with save-tensorrt-path
when running the script, followed by .img
/.txt
, .fp16
, and .trt
. We use two output files vit-b-16.txt.fp16.trt
and vit-b-16.img.fp16.trt
.
For each size of Chinese-CLIP pretrained models, we provide converted TensorRT image-side and text-side models (based on TensorRT version 8.5.2.2), which can be downloaded as follows:
Model Size | TensorRT Image-side Model | TensorRT Text-side Model |
CN-CLIPRN50 | Download | Download |
CN-CLIPViT-B/16 | Download | Download |
CN-CLIPViT-L/14 | Download | Download |
CN-CLIPViT-L/14@336px | Download | Download |
CN-CLIPViT-H/14 | Download | Download |
Just download it and place it directly under directory ${DATAPATH}/deploy/
.
When running the TensorRT model, if the conversion and running are not in the same environment, please note that the TensorRT library version of the environment running the model is consistent with the conversion to avoid errors.
Similar to the process of ONNX models, we use the following sample code in the Chinese-CLIP/
directory to read the converted ViT-B-16 size TensorRT image-side model vit-b-16.img.fp16.trt
and extract the features of the Pikachu image in Readme. The converted TensorRT model here also accepts only inputs with a batch size of 1, i.e. only one input image is processed in one call.
# Complete the necessary import (omitted below)
from cn_clip.deploy.tensorrt_utils import TensorRTModel
from PIL import Image
import numpy as np
import torch
import argparse
import cn_clip.clip as clip
from clip import load_from_name, available_models
from clip.utils import _MODELS, _MODEL_INFO, _download, available_models, create_model, image_transform
# Load ONNX image-side model(**Please replace ${DATAPATH} with the actual path**)
img_trt_model_path="${DATAPATH}/deploy/vit-b-16.img.fp16.trt"
img_trt_model = TensorRTModel(img_trt_model_path)
# Preprocess images
model_arch = "ViT-B-16" # Here we use the ViT-B-16 size, other sizes please modify accordingly
preprocess = image_transform(_MODEL_INFO[model_arch]['input_resolution'])
# Example Pikachu image, Torch Tensor of [1, 3, resolution, resolution] size after preprocessing
image = preprocess(Image.open("examples/pokemon.jpeg")).unsqueeze(0).cuda()
# Calculate image-side features with TensorRT model
image_features = img_trt_model(inputs={'image': image})['unnorm_image_features'] # Unnormalized image features
image_features /= image_features.norm(dim=-1, keepdim=True) # Normalized Chinese-CLIP image features for downstream tasks
print(image_features.shape) # Torch Tensor shape: [1, feature dimension]
Similarly, we use the following code to complete the loading and feature calculation of the text-side TensorRT model. As with the image-side model, the text-side TensorRT model only accepts inputs with a batch size of 1, i.e., only one input text is processed in a single call. The text sequence length accepted by TensorRT is consistent with the used ONNX model, see the context-length parameter during ONNX conversion.
# Load TensorRT text-side model(**Please replace ${DATAPATH} with the actual path**)
txt_trt_model_path="${DATAPATH}/deploy/vit-b-16.txt.fp16.trt"
txt_trt_model = TensorRTModel(txt_trt_model_path)
# Tokenize the 4 input texts. The sequence length is specified to 52, which is the same as when converting the ONNX model (see the context_length in the ONNX conversion process).
text = clip.tokenize(["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"], context_length=52).cuda()
# Calculate text-side features sequentially with TensorRT model
text_features = []
for i in range(len(text)):
# Unnormalized image features
text_feature = txt_trt_model(inputs={'text': torch.unsqueeze(text[i], dim=0)})['unnorm_text_features']
text_features.append(text_feature)
text_features = torch.squeeze(torch.stack(text_features), dim=1) # 4 feature vectors stacked together
text_features = text_features / text_features.norm(dim=1, keepdim=True) # Normalized Chinese-CLIP text features for downstream tasks
print(text_features.shape) # Torch Tensor shape: [1, feature dimension]
Similarities are calculated from the normalized image and text features produced by the TensorRT model after inner product and softmax (need to consider logit_scale
), which is identical to the original Pytorch models and ONNX models, see the following code:
# Inner product followed by softmax
# Note that in the inner product calculation, due to the concept of temperature during contrast learning training, you need to multiply the model logit_scale.exp(), our pretraining model logit_scale is 4.6052, so here multiply by 100.
# For your own ckpt, please load it using torch.load and then check ckpt['state_dict']['module.logit_scale'] or ckpt['state_dict']['logit_scale'].
logits_per_image = 100 * image_features @ text_features.t()
print(logits_per_image.softmax(dim=-1)) # Image & text similarity probabilities: [[1.2475e-03, 5.3037e-02, 6.7583e-04, 9.4504e-01]]
We can see that the output similarities given by the TensorRT model are largely consistent with the results calculated in API Use Case section of Readme based on the same model in Pytorch and the results calculated by ONNX models above, proving the correctness of the feature calculation of the TensorRT model. However, the feature calculation speed of TensorRT model is superior to both of the previous two (see [below](#Speed Comparison Results) for details).
Our experiments are conducted on a single T4 GPU (16GB memory) machine with 16 Intel Xeon (Skylake) Platinum 8163 (2.5GHz) CPU cores and 64GB memory. We use the above sample image and one of the candidate texts and perform 100 times of image and text feature extraction for both Pytorch, ONNX, and TensorRT models, taking the average time (ms). Taking the speed measurement of ViT-B-16 size model as an example, the code executed under Chinese-CLIP/
is as follows:
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip
# Please replace ${DATAPATH} according to the actual situation
python3 cn_clip/deploy/speed_benchmark.py \
--model-arch ViT-B-16 \
--pytorch-ckpt ${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt \
--pytorch-precision fp16 \
--onnx-image-model ${DATAPATH}/deploy/vit-b-16.img.fp16.onnx \
--onnx-text-model ${DATAPATH}/deploy/vit-b-16.txt.fp16.onnx \
--tensorrt-image-model ${DATAPATH}/deploy/vit-b-16.img.fp16.trt \
--tensorrt-text-model ${DATAPATH}/deploy/vit-b-16.txt.fp16.trt
The following lines will be printed in the log output:
[Pytorch image inference speed (batch-size: 1):] mean=11.12ms, sd=0.05ms, min=11.00ms, max=11.32ms, median=11.11ms, 95p=11.20ms, 99p=11.30ms
[ONNX image inference speed (batch-size: 1):] mean=4.92ms, sd=0.04ms, min=4.82ms, max=5.01ms, median=4.92ms, 95p=4.98ms, 99p=5.00ms
[TensorRT image inference speed (batch-size: 1):] mean=3.58ms, sd=0.08ms, min=3.30ms, max=3.72ms, median=3.58ms, 95p=3.70ms, 99p=3.72ms
[Pytorch text inference speed (batch-size: 1):] mean=12.47ms, sd=0.07ms, min=12.32ms, max=12.64ms, median=12.48ms, 95p=12.57ms, 99p=12.61ms
[ONNX text inference speed (batch-size: 1):] mean=3.42ms, sd=0.44ms, min=2.96ms, max=3.89ms, median=3.45ms, 95p=3.87ms, 99p=3.88ms
[TensorRT text inference speed (batch-size: 1):] mean=1.54ms, sd=0.01ms, min=1.51ms, max=1.57ms, median=1.54ms, 95p=1.56ms, 99p=1.56ms
We present a comparison of the FP16 precision inference time for each size of Pytorch, ONNX, and TensorRT models for an inference batch size of 1, and we can see that TensorRT has a particularly significant speedup for small-scale models.
Unit: ms/sample | Image Feature Extraction | Text Feature Extraction | ||||
---|---|---|---|---|---|---|
Models | Pytorch | ONNX | TensorRT | Pytorch | ONNX | TensorRT |
CN-CLIPRN50 | 12.93 | 5.04 | 1.36 | 3.64 | 0.95 | 0.58 |
CN-CLIPViT-B/16 | 11.12 | 4.92 | 3.58 | 12.47 | 3.42 | 1.54 |
CN-CLIPViT-L/14 | 21.19 | 17.10 | 13.08 | 12.45 | 3.48 | 1.52 |
CN-CLIPViT-L/14@336px | 47.11 | 48.40 | 31.59 | 12.24 | 3.25 | 1.54 |
CN-CLIPViT-H/14 | 35.10 | 34.00 | 26.98 | 23.98 | 6.01 | 3.89 |
We observe the zero-shot performance of the Pytorch, ONNX, and TensorRT FP16 models in the MUGE text-to-image retrieval task involved in the Chinese-CLIP experiments. As described in Inference and Evaluation section of Readme, the results of MUGE image & text retrieval evaluation are divided into 3 steps: image & text feature extraction, KNN retrieval, and Recall calculation. The image & text feature extraction scripts for ONNX and TensorRT models, please see cn_clip/eval/extract_features_onnx.py
and cn_clip/eval/extract_features_tensorrt.py
respectively, compared with extract_features.py
used for original Pytorch feature extraction, only minor changes have been made. The scripts and processes used for the subsequent KNN and Recall calculations remain exactly the same.
The results of ViT-B-16 and ViT-H-14 scales of Chinese-CLIP are compared as follows:
Setup | ViT-B-16 Zero-shot | ViT-H-14 Zero-shot | ||||||
---|---|---|---|---|---|---|---|---|
Metric | R@1 | R@5 | R@10 | MR | R@1 | R@5 | R@10 | MR |
Pytorch FP16 | 52.1 | 76.7 | 84.4 | 71.1 | 63.0 | 84.1 | 89.2 | 78.8 |
ONNX FP16 | 52.0 | 76.8 | 84.3 | 71.1 | 63.1 | 84.1 | 89.0 | 78.8 |
TensorRT FP16 | 52.0 | 76.8 | 84.2 | 71.0 | 63.1 | 84.2 | 89.1 | 78.8 |
The results are basically the same, with a difference of ±0.2 within an acceptable range (the magnitude of error that can be caused by changing a machine), which proves the correctness of the conversion of the ONNX and TensorRT models.