TensorFlow stuck in infinite loop when executed with NVBit. #100

Open
JueonPark opened this issue Sep 14, 2022 · 0 comments

I am using NVBit 1.5.3 to collect a trace of a TensorFlow application. Without NVBit, the application runs fine. With NVBit, however, it gets stuck in autotuning.

The command I run is:
LD_PRELOAD=$TRACER_TOOL python run.py --batch_size 1024 --num_examples 2000
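
To make a hang like this easier to reproduce and report, the command above could be wrapped with a timeout. This is only a minimal sketch using the Python standard library; the helper name `run_with_timeout` and the 600-second limit are hypothetical and are not part of NVBit or TensorFlow.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd and return (finished, returncode).

    finished is False if the process was still running after timeout_s
    seconds; subprocess.run kills the child in that case.
    """
    env = dict(os.environ)
    env.update(extra_env or {})  # e.g. {"LD_PRELOAD": "/path/to/tool.so"}
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


# Hypothetical usage mirroring the command above (assumes TRACER_TOOL is set):
# cmd = [sys.executable, "run.py", "--batch_size", "1024", "--num_examples", "2000"]
# finished, code = run_with_timeout(
#     cmd, timeout_s=600, extra_env={"LD_PRELOAD": os.environ["TRACER_TOOL"]})
# print("finished" if finished else "timed out (possible hang)", code)
```

Running the same wrapper once with and once without `LD_PRELOAD` would show whether only the instrumented run exceeds the limit.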

The log output ends as follows:

2022-09-14 03:50:07.595278: I tensorflow/core/common_runtime/executor.cc:813] Synchronous kernel done: 814 step -8366710006080086661 {{node decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/output/_805}} = Merge[N=2, T=DT_FLOAT, _XlaHasReferenceVars=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Func/decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/then/_797/input/_855, decoder/basic_decoder/decoder/while/body/_10/decoder/basic_decoder/decoder/while/cond/else/_798/decoder/basic_decoder/decoder/while/cond/TensorArrayV2Read/TensorListGetItem) device: /job:localhost/replica:0/task:0/device:GPU:0
2022-09-14 03:50:12.991962: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:50:12.992094: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:50:12.992137: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 20480 ms.
2022-09-14 03:50:33.472384: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:50:33.472532: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:50:33.472568: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 40960 ms.
2022-09-14 03:51:14.432780: I tensorflow/core/framework/model.cc:1513] Starting optimization of tunable parameters with HillClimb
2022-09-14 03:51:14.432929: I tensorflow/core/framework/model.cc:1563] Number of tunable parameters: 0
2022-09-14 03:51:14.432963: I tensorflow/core/kernels/data/model_dataset_op.cc:200] Waiting for 60000 ms.
...

The application appears to be stuck on parameter autotuning. Even with parameter autotuning disabled, the application still hangs and never terminates.
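
The waits reported in the log (20480 ms, 40960 ms, 60000 ms) look like a doubling back-off with a 60000 ms cap. The sketch below only reproduces that pattern. The schedule is inferred from the three values in the log, not taken from the TensorFlow source, and `wait_schedule_ms` is a hypothetical helper.

```python
def wait_schedule_ms(start_ms, cap_ms, steps):
    """Return `steps` wait times that double each round, clamped to cap_ms."""
    waits = []
    wait = start_ms
    for _ in range(steps):
        waits.append(wait)
        wait = min(wait * 2, cap_ms)  # assumed: double, then clamp to the cap
    return waits


# Starting from the first wait seen in the log:
print(wait_schedule_ms(20480, 60000, 4))  # [20480, 40960, 60000, 60000]
```

If this reading is right, the same three log lines would keep repeating about once a minute for as long as the process is alive, so they may not by themselves show where the application is actually blocked.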

I am using CUDA 10.1 with cuDNN 7.6.4, on an RTX 2080 Ti GPU with an Intel Xeon Gold CPU.

I am sorry for submitting an issue that is specific to TensorFlow. Thank you.
