
Much lower frame rate has been observed on the 1080ti #16

Open · aiLover2 opened this issue May 27, 2022 · 6 comments

Comments
aiLover2 commented May 27, 2022

Hello, I've tested the repo on a 1080 Ti with the configuration below, and the elapsed time is about 600 ms per frame, which is extremely high. I'm wondering whether I'm doing something wrong.

#define STR1(x) #x      // stringification helpers (STR2 expands the macro, STR1 stringifies it)
#define STR2(x) STR1(x)
// #define USE_FP16     // uncomment to build the engine with FP16 precision
#define CONF_THRESH 0.5
#define BATCH_SIZE 1
#define BILINEAR true
// stuff we know about the network and the input/output blobs
static const int INPUT_H = 512;
static const int INPUT_W = 512;
static const int OUTPUT_SIZE = 512 * 512;   // one float per output pixel (1 x 512 x 512)

const char *INPUT_BLOB_NAME = "data";
const char *OUTPUT_BLOB_NAME = "prob";

I downloaded the weights from this link and converted them. I compared the md5sum output against yours and it matched. The inference log is attached:

678ms
1
662ms
1
658ms
1
658ms
1
665ms
1
664ms
1
664ms
1
657ms
1
661ms
1
663ms
1
716ms
1
654ms
1
657ms
1
667ms
1

Process finished with exit code 0
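For reference, the usual way to rule out measurement artifacts in numbers like these is to warm up first and then average over many runs. Below is a minimal sketch of that pattern; the doInference() stub is a hypothetical placeholder for this repo's actual inference call (tensorrtx-style code typically wraps context->enqueue() plus the host/device copies in such a helper, but the name and signature here are assumptions):

#include <chrono>
#include <iostream>

// Hypothetical stand-in for the real TensorRT inference call
// (enqueue + cudaMemcpyAsync + stream synchronize).
void doInference() { /* replace with the repo's actual call */ }

int main() {
    // Warm-up: the first runs pay one-time CUDA context / engine
    // initialization costs and should be excluded from timing.
    for (int i = 0; i < 10; ++i) doInference();

    constexpr int kRuns = 100;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kRuns; ++i) doInference();
    auto stop = std::chrono::steady_clock::now();

    double totalMs = std::chrono::duration<double, std::milli>(stop - start).count();
    std::cout << "average latency: " << totalMs / kRuns << " ms/frame" << std::endl;
    return 0;
}

The ~660 ms readings above are consistent across runs, so warm-up alone doesn't explain them, but averaging this way keeps the comparison with PyTorch below apples-to-apples.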


Test in PyTorch

The same configuration was tested in PyTorch; the logs are attached below.

input to net -> torch.Size([1, 3, 512, 512])
output from net -> {} torch.Size([1, 1, 512, 512])
Predicted in 40.77601432800293 milliseconds

input to net -> torch.Size([1, 3, 512, 512])
output from net -> {} torch.Size([1, 1, 512, 512])
Predicted in 39.133310317993164 milliseconds

input to net -> torch.Size([1, 3, 512, 512])
output from net -> {} torch.Size([1, 1, 512, 512])
Predicted in 38.28907012939453 milliseconds

Could you please let me know why it takes so long? Thank you in advance.

Saeed

aiLover2 changed the title from "Different elapsed time has been seen" to "Different elapsed times have been observed" on May 27, 2022
aiLover2 changed the title from "Different elapsed times have been observed" to "Much lower frame rate has been observed on the 1080ti" on May 28, 2022
aiLover2 (Author) commented Jun 1, 2022

According to the test below, the speed was fine with the standard ONNX-to-TensorRT conversion path:

Average on 10 runs - GPU latency: 7.42174 ms - Host latency: 7.56389 ms (end to end 13.8005 ms, enqueue 0.470346 ms)
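For context, the "GPU latency" figure trtexec prints is measured on the device with CUDA events, separately from host-side wall-clock time. A minimal sketch of that technique (the dummy kernel is a placeholder, not this repo's workload):

#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel standing in for the real inference workload.
__global__ void dummyKernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);        // enqueue start marker on the stream
    dummyKernel<<<1, 1>>>();       // enqueue the work being timed
    cudaEventRecord(stop);         // enqueue stop marker
    cudaEventSynchronize(stop);    // wait until the stop marker completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("GPU time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}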

I would be glad to help you find the problem, or to help update the code as follows (these are probably simple changes for you, but you may not have the time):

  • Detect the SM architecture automatically (see the sketch after this list)
  • Create a CMake module for locating TensorRT
  • Update and refine the code for the latest TensorRT version
  • ...

Please let me know via [email protected] if that works for you.
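On the first item: one way to detect the SM architecture (a sketch for illustration, not code from this repo) is to query the device's compute capability at runtime with cudaGetDeviceProperties and feed the result into the -gencode flags; CMake 3.24+ can also do this at configure time via CMAKE_CUDA_ARCHITECTURES=native.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
        std::fprintf(stderr, "failed to query CUDA device 0\n");
        return 1;
    }
    // A 1080 Ti (Pascal) reports 6.1, i.e. compute_61 / sm_61.
    std::printf("-gencode;arch=compute_%d%d;code=sm_%d%d\n",
                prop.major, prop.minor, prop.major, prop.minor);
    return 0;
}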

YuzhouPeng (Owner) commented Jun 2, 2022

I tested images on a 3090; the average inference speed is 160 ms per image (1918x1280, batch size 1). Some settings in CMakeLists.txt (the SM version) may influence inference performance. I suggest commenting out these lines in CMakeLists.txt:

#option(CUDA_USE_STATIC_CUDA_RUNTIME OFF)
#set(CMAKE_CXX_STANDARD 11)
#set(CMAKE_BUILD_TYPE Debug)
# note: -g;-G compile device code in debug mode with kernel optimization disabled,
# and "arch=compute_30;code=sm_85" mixes two different architectures
#set(CUDA_NVCC_PLAGS ${CUDA_NVCC_PLAGS};-std=c++11;-g;-G;-gencode;arch=compute_30;code=sm_85)
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall -Ofast -Wfatal-errors -D_MWAITXINTRIN_H_INCLUDED")
#add_definitions(-O2 -pthread)

aiLover2 (Author) commented Jun 5, 2022

Please note that I had already changed the SM version to match the 1080 Ti, as follows:

set(CUDA_NVCC_PLAGS ${CUDA_NVCC_PLAGS};-std=c++11;-gencode;arch=compute_61;code=sm_61)

It got worse after I commented out these lines: inference time went from 600 ms to 1400 ms. (Plausibly because removing the CMAKE_CXX_FLAGS and add_definitions lines also drops -Ofast/-O2 for the host code.)

YuzhouPeng (Owner) commented Jun 6, 2022

I cannot test 1080 Ti performance myself, but I searched and found a similar issue: NVIDIA/TensorRT#1221

Maybe using a different cuDNN version can help.

YSUN-coder commented

> According to the test below, the speed was fine with the standard ONNX-to-TensorRT conversion path:
>
> Average on 10 runs - GPU latency: 7.42174 ms - Host latency: 7.56389 ms (end to end 13.8005 ms, enqueue 0.470346 ms)
> […]

Hi, could you let me know how you got that test result? I tried ./unet -d ../samples on a Jetson Nano, and the result is similar to yours: 767 ms per frame.

YuzhouPeng (Owner) commented

Please use the newest repo, https://github.com/wang-xinyu/tensorrtx/tree/master/unet, for testing; this old repo is no longer maintained.
