PyTorch Docker image fails to train MNIST with multiple GPUs #165
Comments
Hello @nowei, would you confirm that you can run EVAL_TYPE=benchmark with multiple GPUs? If so, we can narrow the problem down to the MNIST example.
Yeah, it's running in the first screenshot for ten iterations, so it can run EVAL_TYPE=benchmark with multiple GPUs.
Would you set NCCL_DEBUG=INFO and run again? You may also set BYTEPS_LOG_LEVEL=INFO or even BYTEPS_LOG_LEVEL=TRACE. Then paste us the logs (they may be very long if you set BYTEPS_LOG_LEVEL=TRACE). Thanks.
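For reference, a minimal sketch of one way to set the requested variables, assuming they are set inside the worker's Python entry point; exporting them in the shell before launching the worker works just as well, and the values shown are simply the ones suggested above.

```python
# Hedged sketch: set the debug variables in the worker process itself.
# If set in Python, they must be set before BytePS/NCCL initialize.
import os

os.environ["NCCL_DEBUG"] = "INFO"         # NCCL prints topology/error details
os.environ["BYTEPS_LOG_LEVEL"] = "TRACE"  # or "INFO"/"DEBUG" for less output

import byteps.torch as bps  # imported after the variables are set

bps.init()
```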
Yeah, sorry for the late reply. It takes a while to set up each time and I was a bit busy over the last few days.
When I ran it with BYTEPS_LOG_LEVEL=INFO, it didn't give anything new. I'll make a separate post with the TRACE log since it's pretty long.
Here's the log with both NCCL_DEBUG=INFO and BYTEPS_LOG_LEVEL=TRACE:
@nowei Thank you. You are right, INFO does not give anything new; the useful level is DEBUG. However, TRACE includes everything that DEBUG outputs, so what you have is good enough for us. We'll look into this.
@nowei If you repeat this multiple times with TRACE logs, does it always die on key 1048576?
I ran it a few more times and they all died on key 1048576.
Thanks. This is very helpful. So it's a deterministic bug; there has to be something special about this tensor.
@nowei Would you do one more favor? Comment out this line and try again: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L109
It ended up training for one epoch and then it crashed again. It ended with something like this:
Here's the last few lines from the debug info. It seems like it died again on key 1048576.
Okay. It now dies because of this: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L157 It seems that your K80 GPUs have problems dealing with tensors that are not inside the model. I suspect that these tensors are placed on the CPU by PyTorch, and the K80, as an older GPU, cannot properly map the CPU memory address into the GPU memory address space. Consequently, NCCL complains that the given memory address is invalid. We'll see how to address this. There is one simple way to verify this -- add .cuda() to that tensor before it is averaged.
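As an illustration of that verification step, here is a minimal sketch, assuming the test metrics are averaged through a Horovod-style helper built on bps.push_pull; the helper name and the exact call site are assumptions for illustration, not quoted from train_mnist_byteps.py.

```python
# Hedged sketch: move the metric onto the GPU before handing it to the
# collective, so NCCL only ever sees a device memory address.
import torch
import byteps.torch as bps

def metric_average(val, name):
    # val is a plain Python float (e.g. an accumulated test loss); wrap it
    # in a tensor and place it on the local GPU before the push_pull.
    tensor = torch.tensor(val).cuda()
    avg_tensor = bps.push_pull(tensor, name=name)
    return avg_tensor.item()
```

Keeping the metric on the GPU for the collective sidesteps the host-memory mapping that the comment above suspects the K80 cannot handle.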
I think test_loss is a float at that point because it was a float initially, since I'm getting an error when I try to call .cuda() on it.
@nowei You are right. You can do it here: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L132
It's still dying there:
I also tried sending the tensor to CUDA and it still got the same error. Wait, I had to save the tensor back to itself. I think it's running further now; I'm turning off the debug info to see if it actually makes it past one epoch.
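The "save the tensor back to itself" part is the crux, so here is a minimal sketch of that pitfall; the variable name is illustrative.

```python
# .cuda() returns a new tensor on the GPU; it does not modify the original,
# so the result has to be assigned back to the same variable.
import torch

t = torch.tensor(1.234)  # lives in host (CPU) memory

t.cuda()                 # no effect: the CUDA copy is created and discarded
print(t.device)          # still cpu

t = t.cuda()             # save the tensor back to itself
print(t.device)          # now cuda:0
```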
Oh, hey, it's training! Thanks!
@nowei Thanks. Apparently, the Pascal architecture has some improvements over Kepler in CPU/GPU memory management (https://devblogs.nvidia.com/unified-memory-cuda-beginners/), so this particular problem does not appear in our own environment. We will improve the code's robustness.
Describe the bug
Running on multiple GPUs fails in the Docker image on an AWS EC2 Spot instance. Specifically, I was running a p2.8xlarge instance.
To Reproduce
Steps to reproduce the behavior:
I tried to follow https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md as closely as possible. I'm pretty sure nvidia changed nvidia-docker to docker --gpus, and I ended up using nvidia-driver-430. So instead of nvidia-docker I had to use something like:
docker run -it --net=host --gpus all --shm-size=32768m bytepsimage/worker_pytorch bash
Anyways, I go through the rest of the step-by-step guide and it works out okay, and I can run it with EVAL_TYPE=benchmark, but when I switch to EVAL_TYPE=mnist, it breaks. When I run it with export NVIDIA_VISIBLE_DEVICES=0, it seems to work out fine except for the odd accuracies, which can be fixed by commenting out a line inside of example/pytorch/train_mnist_byteps.py.
Expected behavior
I guess I expected it to run like it did in the one-GPU case, but I may have been too optimistic.
Screenshots
Picture of it working on the benchmark:
![image](https://user-images.githubusercontent.com/32493749/69708563-b4b0de00-10b0-11ea-96f8-dd2a2a03358b.png)
Picture of it dying on MNIST:
![image](https://user-images.githubusercontent.com/32493749/69708682-e7f36d00-10b0-11ea-81f2-0fc225af3cb6.png)
Picture of it running with one GPU:
![image](https://user-images.githubusercontent.com/32493749/69708939-5801f300-10b1-11ea-803f-3962502e0d50.png)
Really janky accuracy:
![image](https://user-images.githubusercontent.com/32493749/69708963-64864b80-10b1-11ea-859b-96aae2b7fab6.png)
Environment (please complete the following information):
I don't think the AWS stuff is relevant, but I thought I should include it just in case.
OS:
GCC version:
CUDA and NCCL version:
Framework (TF, PyTorch, MXNet): PyTorch
Additional context
I tried something similar with the Docker image for TensorFlow and it seemed to work with multiple GPUs, but it never printed any accuracies because I don't think the code had any accuracy statements in it. Also, for that image you either had to install everything with python3 or change the shell file to execute using python2.
Any help would be appreciated!