Do I still need to set kv_store when using MXNet? Why? #80

Open
ZHAIXINGZHAIYUE opened this issue Aug 8, 2019 · 9 comments

Comments

@ZHAIXINGZHAIYUE

Do I still need to set kv_store when using MXNet with BytePS? If not, why?

@ymjiang
Member

ymjiang commented Aug 8, 2019

You don't need to. MXNet-BytePS's implementation bypasses kvstore.
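A minimal sketch of what this looks like in a Gluon training script, assuming the `byteps.mxnet` module with its `init()` and `DistributedTrainer` entry points. The helper names `trainer_kwargs` and `build_byteps_trainer`, and the learning rate and momentum values, are hypothetical and only for illustration. The point is that no `kvstore` argument appears anywhere.

```python
def trainer_kwargs(learning_rate, momentum=0.9):
    """Optimizer settings only (hypothetical helper); note there is no 'kvstore' entry."""
    return {"learning_rate": learning_rate, "momentum": momentum}


def build_byteps_trainer(net, learning_rate=0.01):
    """Create a BytePS trainer for a Gluon network (illustrative sketch)."""
    # Imported lazily so this file can be loaded without MXNet/BytePS installed.
    import byteps.mxnet as bps

    bps.init()  # start BytePS communication; no kvstore is created or passed
    params = net.collect_params()
    # Native MXNet would instead use something like:
    #   mx.gluon.Trainer(params, "sgd", {...}, kvstore="dist_sync")
    return bps.DistributedTrainer(params, "sgd", trainer_kwargs(learning_rate))
```

With native MXNet the `kvstore` argument selects the parameter-server backend. In this sketch BytePS takes over gradient communication, so the argument is simply left out.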

@ZHAIXINGZHAIYUE
Author

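Hello. Suppose I also set up separate scheduler, server, and worker nodes when doing distributed training with native MXNet, each running on its own machine. In that case, will there be a difference in efficiency between training with BytePS MXNet and training with native MXNet? If so, where does the difference mainly come from? Thanks.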

@ymjiang
Member

ymjiang commented Aug 8, 2019

There will be a performance difference even with the setup you describe. We made many performance optimizations in BytePS. For example, compared to native MXNet, BytePS-MXNet eliminates some extra copies. BytePS also supports RDMA, which is faster than the TCP transport used by native MXNet. We plan to publish a technical report on these optimizations in the future.

@ZHAIXINGZHAIYUE
Author

thank you

@bobzhuyb
Member

bobzhuyb commented Aug 8, 2019

Below are some numbers. The experiments were run on a public cloud with a 20 Gbps network. Each machine has 8 Tesla V100 16GB GPUs (NVLink-enabled). The batch size is 32 per GPU, and we train in fp32. The metric is total images per second.

(Image of the benchmark results; not reproduced here.)
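For readers who want to reproduce the metric, here is a small sketch of how "total images per second" could be computed from a measured step time. The function name and the 0.25 s step time are hypothetical. Only the 8 GPUs per machine and the batch size of 32 per GPU come from the setup described above.

```python
def total_images_per_second(num_machines, gpus_per_machine, batch_per_gpu, seconds_per_step):
    """Aggregate training throughput across all GPUs in the job."""
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    # Every GPU processes one batch per synchronized training step.
    images_per_step = num_machines * gpus_per_machine * batch_per_gpu
    return images_per_step / seconds_per_step


# One machine with 8 GPUs, batch size 32 per GPU, hypothetical 0.25 s per step.
print(total_images_per_second(1, 8, 32, 0.25))  # 1024.0
```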

@ZHAIXINGZHAIYUE
Author

ZHAIXINGZHAIYUE commented Aug 9, 2019

@bobzhuyb Is the number of servers the same for mxnet-native and mxnet-byteps?

@bobzhuyb
Member

bobzhuyb commented Aug 9, 2019

Yes. You can try them yourself. The original ps-lite implementation is pretty poor -- it is slower than Horovod, let alone BytePS.

@ZHAIXINGZHAIYUE
Author

Hello, I have one more question. When running distributed training with the original MXNet, I occasionally hit Check failed: (my_node_.port) != (-1) bind failed. Does this problem still exist in BytePS?

@ymjiang
Member

ymjiang commented Dec 7, 2019

@ZHAIXINGZHAIYUE I believe you won't have that problem if you configure BytePS correctly. We have never encountered it when using BytePS.
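The error text reads like a node failing to bind a network port, although the thread does not confirm the cause. Under that assumption, a quick diagnostic is to check whether the port you intend to use is free on the machine before launching. The helper name `can_bind` and the port number 9000 are hypothetical, and this is a generic socket check, not part of MXNet or BytePS.

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True


if __name__ == "__main__":
    port = 9000  # hypothetical port you plan to give the scheduler
    print("port", port, "is free" if can_bind(port) else "cannot be bound")
```

A free port at check time does not guarantee it stays free, so treat the result as a hint, not proof.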
