We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thanks for your excellent work. I run into the following mistake when I excecute the train.py.
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15927, OpType=ALLREDUCE, Timeout(ms)=5400000) ran for 5403515 milliseconds before timing out.
This happens after the log information "begin synchronizing". Do you know how to solve this problem?
The text was updated successfully, but these errors were encountered:
I meet same question
Sorry, something went wrong.
Do you have solve it? Could you share your method?
I still cannot solve this problem.
No branches or pull requests
Thanks for your excellent work. I run into the following mistake when I excecute the train.py.
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=15927, OpType=ALLREDUCE, Timeout(ms)=5400000) ran for 5403515 milliseconds before timing out.
This happens after the log information "begin synchronizing". Do you know how to solve this problem?
The text was updated successfully, but these errors were encountered: