I modified the example training script to include an intentional error:
sym = net.get_symbol(**vars(args))
# train
import adfsadsfdasf # <------- this line will fail
fit.fit(args, sym, data.get_rec_iter)
When run normally, it prints the expected error message:
root@ip-172-31-4-79:/usr/local/byteps/example/mxnet# python3 /usr/local/byteps/launcher/launch.py bash /usr/local/byteps/example/mxnet/start_mxnet_byteps.sh
BytePS launching worker
[2019-10-23 05:34:02.617091: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-10-23 05:34:02.617202: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_0
[2019-10-23 05:34:02.617233: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_0
[2019-10-23 05:34:02.617287: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:34:02.617295: D byteps/common/global.cc:99] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-10-23 05:34:02.617304: D byteps/common/global.cc:122] Number of worker=1, launching distributed job
[2019-10-23 05:34:02.617336: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:34:02.687281: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-10-23 05:34:02.687301: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-10-23 05:34:02.687307: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-10-23 05:34:02.687312: D byteps/common/nccl_manager.cc:160] nccl_num_rings set to 1
[2019-10-23 05:34:02.687372: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_nccl0
[2019-10-23 05:34:02.687399: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_nccl0
[2019-10-23 05:34:02.687442: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:34:02.687449: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-10-23 05:34:02.687552: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:34:05.239510: D byteps/common/nccl_manager.cc:104] root nccl_id is 5693710654875303938
[2019-10-23 05:34:05.450266: D byteps/common/global.cc:181] Create schedule queue 0
[2019-10-23 05:34:05.450296: D byteps/common/global.cc:181] Create schedule queue 1
[2019-10-23 05:34:05.450302: D byteps/common/global.cc:181] Create schedule queue 2
[2019-10-23 05:34:05.450307: D byteps/common/global.cc:181] Create schedule queue 3
[2019-10-23 05:34:05.450312: D byteps/common/global.cc:181] Create schedule queue 4
[2019-10-23 05:34:05.450333: D byteps/common/global.cc:181] Create schedule queue 5
[2019-10-23 05:34:05.450338: D byteps/common/global.cc:181] Create schedule queue 6
[2019-10-23 05:34:05.450343: D byteps/common/global.cc:181] Create schedule queue 7
[2019-10-23 05:34:05.450347: D byteps/common/global.cc:181] Create schedule queue 8
[2019-10-23 05:34:05.450352: D byteps/common/global.cc:181] Create schedule queue 9
[2019-10-23 05:34:05.450361: D byteps/common/global.cc:187] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-10-23 05:34:05.450593: D byteps/common/global.cc:219] Started 6 background threads. rank=0
Traceback (most recent call last):
File "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", line 66, in <module>
import adfs
ImportError: No module named 'adfs'
[2019-10-23 05:34:05.457955: D byteps/common/shared_memory.h:45] Clear BytePSSharedMemory: All BytePS shared memory released/unregistered.
However, if I redirect stderr to a file, the ImportError appears neither on screen nor in err_log:
root@ip-172-31-4-79:/usr/local/byteps/example/mxnet# python3 /usr/local/byteps/launcher/launch.py bash /usr/local/byteps/example/mxnet/start_mxnet_byteps.sh 2> err_log
BytePS launching worker
[2019-10-23 05:37:23.615065: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-10-23 05:37:23.615170: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_0
[2019-10-23 05:37:23.615217: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_0
[2019-10-23 05:37:23.615282: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:37:23.615291: D byteps/common/global.cc:99] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-10-23 05:37:23.615300: D byteps/common/global.cc:122] Number of worker=1, launching distributed job
[2019-10-23 05:37:23.615399: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:37:23.684944: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-10-23 05:37:23.684969: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-10-23 05:37:23.684975: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-10-23 05:37:23.684980: D byteps/common/nccl_manager.cc:160] nccl_num_rings set to 1
[2019-10-23 05:37:23.685025: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_nccl0
[2019-10-23 05:37:23.685078: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_nccl0
[2019-10-23 05:37:23.685136: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:37:23.685143: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-10-23 05:37:23.685254: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:37:26.243052: D byteps/common/nccl_manager.cc:104] root nccl_id is 5693710654005575682
[2019-10-23 05:37:26.456053: D byteps/common/global.cc:181] Create schedule queue 0
[2019-10-23 05:37:26.456077: D byteps/common/global.cc:181] Create schedule queue 1
[2019-10-23 05:37:26.456083: D byteps/common/global.cc:181] Create schedule queue 2
[2019-10-23 05:37:26.456088: D byteps/common/global.cc:181] Create schedule queue 3
[2019-10-23 05:37:26.456093: D byteps/common/global.cc:181] Create schedule queue 4
[2019-10-23 05:37:26.456098: D byteps/common/global.cc:181] Create schedule queue 5
[2019-10-23 05:37:26.456103: D byteps/common/global.cc:181] Create schedule queue 6
[2019-10-23 05:37:26.456108: D byteps/common/global.cc:181] Create schedule queue 7
[2019-10-23 05:37:26.456113: D byteps/common/global.cc:181] Create schedule queue 8
[2019-10-23 05:37:26.456118: D byteps/common/global.cc:181] Create schedule queue 9
[2019-10-23 05:37:26.456123: D byteps/common/global.cc:187] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-10-23 05:37:26.456338: D byteps/common/global.cc:219] Started 6 background threads. rank=0
[2019-10-23 05:37:26.463837: D byteps/common/shared_memory.h:45] Clear BytePSSharedMemory: All BytePS shared memory released/unregistered.
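To narrow this down, it may help to confirm that a failing child process's traceback reaches a redirected stderr at all on this machine. The following is a minimal sketch, not BytePS's launcher code: the helper name `run_with_stderr_redirect` and the module name `adfs_hypothetical_missing_module` are made up for illustration, and a temporary file stands in for `err_log`.

```python
import subprocess
import sys
import tempfile


def run_with_stderr_redirect(cmd):
    """Run cmd with stderr sent to a temporary file (a stand-in for `2> err_log`).

    Returns (returncode, text the child wrote to stderr).
    """
    with tempfile.TemporaryFile(mode="w+") as err_log:
        # The child inherits the file descriptor, just as with a shell redirect.
        proc = subprocess.run(cmd, stderr=err_log)
        err_log.seek(0)
        return proc.returncode, err_log.read()


if __name__ == "__main__":
    # Hypothetical stand-in for the training script: a child that fails on import.
    failing_child = [sys.executable, "-c", "import adfs_hypothetical_missing_module"]
    returncode, stderr_text = run_with_stderr_redirect(failing_child)
    print("exit code:", returncode)
    print("captured stderr:")
    print(stderr_text)
```

If the traceback shows up here, plain file-descriptor inheritance works as expected, which would suggest looking at how the launcher or the start script handles the worker's stderr. That is only a guess and is not confirmed by the logs above.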