Redirect stderr and stdout with byteps launcher #133

Open
eric-haibin-lin opened this issue Oct 23, 2019 · 1 comment
Labels
bug Something isn't working

Comments

@eric-haibin-lin
Contributor

I modified the example training script to introduce an intentional error:

    sym = net.get_symbol(**vars(args))

    # train
    import adfsadsfdasf  # <------- this line will fail
    fit.fit(args, sym, data.get_rec_iter)

When I run it through the launcher without any redirection, I get the expected error message:

root@ip-172-31-4-79:/usr/local/byteps/example/mxnet# python3 /usr/local/byteps/launcher/launch.py  bash /usr/local/byteps/example/mxnet/start_mxnet_byteps.sh
BytePS launching worker
[2019-10-23 05:34:02.617091: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-10-23 05:34:02.617202: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_0
[2019-10-23 05:34:02.617233: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_0
[2019-10-23 05:34:02.617287: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:34:02.617295: D byteps/common/global.cc:99] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-10-23 05:34:02.617304: D byteps/common/global.cc:122] Number of worker=1, launching distributed job
[2019-10-23 05:34:02.617336: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:34:02.687281: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-10-23 05:34:02.687301: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-10-23 05:34:02.687307: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-10-23 05:34:02.687312: D byteps/common/nccl_manager.cc:160] nccl_num_rings set to 1
[2019-10-23 05:34:02.687372: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_nccl0
[2019-10-23 05:34:02.687399: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_nccl0
[2019-10-23 05:34:02.687442: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:34:02.687449: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-10-23 05:34:02.687552: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:34:05.239510: D byteps/common/nccl_manager.cc:104] root nccl_id is 5693710654875303938
[2019-10-23 05:34:05.450266: D byteps/common/global.cc:181] Create schedule queue 0
[2019-10-23 05:34:05.450296: D byteps/common/global.cc:181] Create schedule queue 1
[2019-10-23 05:34:05.450302: D byteps/common/global.cc:181] Create schedule queue 2
[2019-10-23 05:34:05.450307: D byteps/common/global.cc:181] Create schedule queue 3
[2019-10-23 05:34:05.450312: D byteps/common/global.cc:181] Create schedule queue 4
[2019-10-23 05:34:05.450333: D byteps/common/global.cc:181] Create schedule queue 5
[2019-10-23 05:34:05.450338: D byteps/common/global.cc:181] Create schedule queue 6
[2019-10-23 05:34:05.450343: D byteps/common/global.cc:181] Create schedule queue 7
[2019-10-23 05:34:05.450347: D byteps/common/global.cc:181] Create schedule queue 8
[2019-10-23 05:34:05.450352: D byteps/common/global.cc:181] Create schedule queue 9
[2019-10-23 05:34:05.450361: D byteps/common/global.cc:187] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-10-23 05:34:05.450593: D byteps/common/global.cc:219] Started 6 background threads. rank=0
Traceback (most recent call last):
  File "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", line 66, in <module>
    import adfs
ImportError: No module named 'adfs'
[2019-10-23 05:34:05.457955: D byteps/common/shared_memory.h:45] Clear BytePSSharedMemory: All BytePS shared memory released/unregistered.

However, if I redirect stderr to a file, the ImportError appears neither on screen nor in err_log:

root@ip-172-31-4-79:/usr/local/byteps/example/mxnet# python3 /usr/local/byteps/launcher/launch.py  bash /usr/local/byteps/example/mxnet/start_mxnet_byteps.sh 2> err_log
BytePS launching worker
[2019-10-23 05:37:23.615065: D byteps/common/communicator.cc:63] Using Communicator=Socket
[2019-10-23 05:37:23.615170: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_0
[2019-10-23 05:37:23.615217: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_0
[2019-10-23 05:37:23.615282: D byteps/common/communicator.cc:121] This is ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:37:23.615291: D byteps/common/global.cc:99] Partition bound set to 4096000 bytes, aligned to 4096000 bytes
[2019-10-23 05:37:23.615300: D byteps/common/global.cc:122] Number of worker=1, launching distributed job
[2019-10-23 05:37:23.615399: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:37:23.684944: D byteps/common/nccl_manager.cc:133] nccl_group_size set to 4
[2019-10-23 05:37:23.684969: D byteps/common/nccl_manager.cc:152] nccl_pcie_size set to 1
[2019-10-23 05:37:23.684975: D byteps/common/nccl_manager.cc:154] nccl_pcie_num set to 1
[2019-10-23 05:37:23.684980: D byteps/common/nccl_manager.cc:160] nccl_num_rings set to 1
[2019-10-23 05:37:23.685025: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_send_nccl0
[2019-10-23 05:37:23.685078: D byteps/common/communicator.cc:151] Init socket at /tmp/socket_recv_nccl0
[2019-10-23 05:37:23.685136: D byteps/common/communicator.cc:55] This is nccl ROOT device, rank=0, all sockets create successfully
[2019-10-23 05:37:23.685143: D byteps/common/nccl_manager.cc:85] Constructing NCCL communicators. 0
[2019-10-23 05:37:23.685254: D byteps/common/communicator.cc:158] Listening on socket 0
[2019-10-23 05:37:26.243052: D byteps/common/nccl_manager.cc:104] root nccl_id is 5693710654005575682
[2019-10-23 05:37:26.456053: D byteps/common/global.cc:181] Create schedule queue 0
[2019-10-23 05:37:26.456077: D byteps/common/global.cc:181] Create schedule queue 1
[2019-10-23 05:37:26.456083: D byteps/common/global.cc:181] Create schedule queue 2
[2019-10-23 05:37:26.456088: D byteps/common/global.cc:181] Create schedule queue 3
[2019-10-23 05:37:26.456093: D byteps/common/global.cc:181] Create schedule queue 4
[2019-10-23 05:37:26.456098: D byteps/common/global.cc:181] Create schedule queue 5
[2019-10-23 05:37:26.456103: D byteps/common/global.cc:181] Create schedule queue 6
[2019-10-23 05:37:26.456108: D byteps/common/global.cc:181] Create schedule queue 7
[2019-10-23 05:37:26.456113: D byteps/common/global.cc:181] Create schedule queue 8
[2019-10-23 05:37:26.456118: D byteps/common/global.cc:181] Create schedule queue 9
[2019-10-23 05:37:26.456123: D byteps/common/global.cc:187] Inited rank=0 local_rank=0 size=1 local_size=1 worker_id=0
[2019-10-23 05:37:26.456338: D byteps/common/global.cc:219] Started 6 background threads. rank=0
[2019-10-23 05:37:26.463837: D byteps/common/shared_memory.h:45] Clear BytePSSharedMemory: All BytePS shared memory released/unregistered.
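
For comparison, here is a minimal sketch of the behavior I would expect, independent of BytePS. It is not the actual launch.py: `LAUNCHER` and `FAILING_CHILD` are hypothetical stand-ins, and the sketch assumes the launcher starts the worker without overriding its stdout/stderr, so the worker inherits the launcher's streams. Under that assumption, redirecting the launcher's stderr to a file (the equivalent of `2> err_log`) should capture the worker's traceback as well.

```python
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the training script: it fails on a bad import.
FAILING_CHILD = "import adfs"

# Hypothetical stand-in for a launcher (NOT the real launch.py): it starts
# the child without overriding stdout/stderr, so the child inherits the
# launcher's own streams, then propagates the child's exit code.
LAUNCHER = (
    "import subprocess, sys\n"
    "print('launching worker', flush=True)\n"
    "rc = subprocess.call([sys.executable, '-c', sys.argv[1]])\n"
    "sys.exit(rc)\n"
)


def run_with_stderr_redirect(child_code):
    """Run the launcher with stderr sent to a file, like `2> err_log`."""
    with tempfile.TemporaryFile(mode="w+") as err_log:
        proc = subprocess.run(
            [sys.executable, "-c", LAUNCHER, child_code],
            stdout=subprocess.PIPE,
            stderr=err_log,
            text=True,
        )
        # Rewind and read back whatever the launcher and child wrote.
        err_log.seek(0)
        return proc.returncode, proc.stdout, err_log.read()


if __name__ == "__main__":
    code, out, err = run_with_stderr_redirect(FAILING_CHILD)
    print("exit code:", code)
    print("stdout:", out.strip())
    print("err_log contains the traceback:", "adfs" in err)
```

If the real launcher behaves differently from this sketch, the difference may point to where the worker's stderr is being dropped.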

@ymjiang
Member

ymjiang commented Oct 23, 2019

Thanks for the feedback. We will try to reproduce the case.

ymjiang added the bug label Oct 23, 2019