PyTorch Docker image fails to train MNIST with multiple GPUs #165
Comments
Hello @nowei, would you confirm that you can run EVAL_TYPE=benchmark with multiple GPUs? If so, we can narrow the problem down to the MNIST example.
Yeah, it's running in the first screenshot for ten iterations, so it can run EVAL_TYPE=benchmark with multiple GPUs.
Would you set NCCL_DEBUG=INFO and run again? You may also set BYTEPS_LOG_LEVEL=INFO or even BYTEPS_LOG_LEVEL=TRACE. Then paste us the logs (they may be very long if you set BYTEPS_LOG_LEVEL=TRACE). Thanks.
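For reference, a minimal sketch of one way to set the requested variables, assuming they are set inside the worker's Python entry point; exporting them in the shell before launching the worker works just as well, and the values shown are simply the ones suggested above.

```python
# Hedged sketch: set the debug variables in the worker process itself.
# If set in Python, they must be set before BytePS/NCCL initialize.
import os

os.environ["NCCL_DEBUG"] = "INFO"         # NCCL prints topology/error details
os.environ["BYTEPS_LOG_LEVEL"] = "TRACE"  # or "INFO"/"DEBUG" for less output

import byteps.torch as bps  # imported after the variables are set

bps.init()
```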
Yeah, sorry for the late reply. It takes a while to set up each time and I was a bit busy over the last few days.
When I ran it with BYTEPS_LOG_LEVEL=INFO, it didn't give anything new. I'll make a separate post with the TRACE log since it's pretty long.
Here's the log with both NCCL_DEBUG=INFO and BYTEPS_LOG_LEVEL=TRACE:
@nowei Thank you. You are right, INFO does not give anything new; the useful level is DEBUG. However, TRACE includes everything that DEBUG outputs, so what you have is good enough for us. We'll look into this.
@nowei If you repeat this multiple times with TRACE logs, does it always die on key 1048576?
I ran it a few more times and they all died on key 1048576.
Thanks. This is very helpful. So it's a deterministic bug; there has to be something special about this tensor.
@nowei Would you do one more favor? Comment out this line and try again: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L109
It ended up training for one epoch and then it crashed again. It ended with something like this:
Here's the last few lines from the debug info. It seems like it died again on key 1048576.
Okay. It now dies because of this: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L157 It seems that your K80 GPUs have problems dealing with tensors that are not inside the model. I suspect that these tensors are placed on the CPU by PyTorch, and the K80, as an older GPU, cannot properly map the CPU memory address into the GPU memory address space. Consequently, NCCL complains that the given memory address is invalid. We'll see how to address this. There is one simple way to verify this -- add .cuda() to that tensor before it is averaged.
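As an illustration of that verification step, here is a minimal sketch, assuming the test metrics are averaged through a Horovod-style helper built on bps.push_pull; the helper name and the exact call site are assumptions for illustration, not quoted from train_mnist_byteps.py.

```python
# Hedged sketch: move the metric onto the GPU before handing it to the
# collective, so NCCL only ever sees a device memory address.
import torch
import byteps.torch as bps

def metric_average(val, name):
    # val is a plain Python float (e.g. an accumulated test loss); wrap it
    # in a tensor and place it on the local GPU before the push_pull.
    tensor = torch.tensor(val).cuda()
    avg_tensor = bps.push_pull(tensor, name=name)
    return avg_tensor.item()
```

Keeping the metric on the GPU for the collective sidesteps the host-memory mapping that the comment above suspects the K80 cannot handle.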
I think test_loss is a float at that point because it was a float initially, since I'm getting an error when I try to call .cuda() on it.
@nowei You are right. You can do it here: https://github.com/bytedance/byteps/blob/master/example/pytorch/train_mnist_byteps.py#L132
It's still dying there:
I also tried sending the tensor to CUDA and it still got the same error. Wait, I had to save the tensor back to itself. I think it's running further now; I'm turning off the debug info to see if it actually makes it past one epoch.
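The "save the tensor back to itself" part is the crux, so here is a minimal sketch of that pitfall; the variable name is illustrative.

```python
# .cuda() returns a new tensor on the GPU; it does not modify the original,
# so the result has to be assigned back to the same variable.
import torch

t = torch.tensor(1.234)  # lives in host (CPU) memory

t.cuda()                 # no effect: the CUDA copy is created and discarded
print(t.device)          # still cpu

t = t.cuda()             # save the tensor back to itself
print(t.device)          # now cuda:0
```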
Oh, hey, it's training! Thanks!
@nowei Thanks. Apparently, the Pascal architecture has some improvements over Kepler in CPU/GPU memory management (https://devblogs.nvidia.com/unified-memory-cuda-beginners/), so this particular problem does not appear in our own environment. We will improve the code's robustness.
Describe the bug
Running on multiple GPUs fails in the Docker image on an AWS EC2 Spot instance. Specifically, I was running a p2.8xlarge instance.
To Reproduce
Steps to reproduce the behavior:
I tried to follow https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md as closely as possible. I'm pretty sure nvidia changed nvidia-docker to docker --gpus, and I ended up using nvidia-driver-430. So instead of nvidia-docker I had to use something like:
docker run -it --net=host --gpus all --shm-size=32768m bytepsimage/worker_pytorch bash
Anyways, I go through the rest of the step-by-step guide and it works out okay, and I can run it with EVAL_TYPE=benchmark, but when I switch to EVAL_TYPE=mnist, it breaks. When I run it with export NVIDIA_VISIBLE_DEVICES=0, it seems to work out fine except for the odd accuracies, which can be fixed by commenting out a line inside of example/pytorch/train_mnist_byteps.py.
Expected behavior
I guess I expected it to run like it did in the one-GPU case, but I may have been too optimistic.
Screenshots
Picture of it working on the benchmark:
![image](https://user-images.githubusercontent.com/32493749/69708563-b4b0de00-10b0-11ea-96f8-dd2a2a03358b.png)
Picture of it dying on MNIST:
![image](https://user-images.githubusercontent.com/32493749/69708682-e7f36d00-10b0-11ea-81f2-0fc225af3cb6.png)
Picture of it running with one GPU:
![image](https://user-images.githubusercontent.com/32493749/69708939-5801f300-10b1-11ea-803f-3962502e0d50.png)
Really janky accuracy:
![image](https://user-images.githubusercontent.com/32493749/69708963-64864b80-10b1-11ea-859b-96aae2b7fab6.png)
Environment (please complete the following information):
I don't think the AWS stuff is relevant, but I thought I should include it just in case.
OS:
GCC version:
CUDA and NCCL version:
Framework (TF, PyTorch, MXNet): PyTorch
Additional context
I tried something similar with the Docker image for TensorFlow and it seemed to work with multiple GPUs, but it never printed any accuracies because I don't think the code had any accuracy statements in it. Also, for that image you either had to install everything with python3 or change the shell file to execute using python2.
Any help would be appreciated!