Before using RPC and distributed autograd primitives, initialization must take
place. To initialize the RPC framework we need to use
:meth:`~torch.distributed.rpc.init_rpc`, which initializes the RPC
- framework, RRef framework and distributed autograd. By default, this will also
- initialize the ``ProcessGroup`` (:meth:`~torch.distributed.init_process_group`)
- backend for RPC communication. The ``ProcessGroup`` backend internally uses Gloo
- for communication.
+ framework, RRef framework and distributed autograd.

.. automodule:: torch.distributed.rpc
.. autofunction:: init_rpc
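For instance, a minimal two-worker sketch (the worker names, address, port and
tensor arguments below are only illustrative) could look like::

    >>> import os
    >>> import torch
    >>> from torch.distributed import rpc
    >>> os.environ['MASTER_ADDR'] = 'localhost'
    >>> os.environ['MASTER_PORT'] = '29500'
    >>>
    >>> # On worker 0: initialize RPC, issue a blocking remote call to worker 1,
    >>> # then wait for all outstanding work to finish before tearing down.
    >>> rpc.init_rpc("worker0", rank=0, world_size=2)
    >>> ret = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), 3))
    >>> rpc.shutdown()
    >>>
    >>> # On worker 1: initialize RPC and serve requests until shutdown.
    >>> rpc.init_rpc("worker1", rank=1, world_size=2)
    >>> rpc.shutdown()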
@@ -109,9 +106,6 @@ and move it to the desired devices on the callee if necessary.
.. autofunction:: shutdown
.. autoclass:: WorkerInfo
    :members:
- .. autoclass:: ProcessGroupRpcBackendOptions
-     :members:
-     :inherited-members:

The RPC package also provides decorators which allow applications to specify
@@ -122,8 +116,124 @@ how a given function should be treated on the callee side.
.. autofunction:: torch.distributed.rpc.functions.async_execution
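As a minimal sketch (the worker names and the ``async_add_chained`` helper below
are illustrative, not part of the API), a function decorated with
``@rpc.functions.async_execution`` returns a ``torch.futures.Future`` on the
callee, and the result is sent back to the caller only once that future
completes::

    >>> import torch
    >>> from torch.distributed import rpc
    >>>
    >>> @rpc.functions.async_execution
    >>> def async_add_chained(to, x, y, z):
    >>>     # Runs on the callee: returns a Future right away instead of
    >>>     # blocking an RPC thread while the nested RPC is in flight.
    >>>     return rpc.rpc_async(to, torch.add, args=(x, y)).then(
    >>>         lambda fut: fut.wait() + z
    >>>     )
    >>>
    >>> # On the caller, after init_rpc:
    >>> ret = rpc.rpc_sync(
    >>>     "worker1", async_add_chained, args=("worker2", torch.ones(2), 1, 1)
    >>> )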

- .. _rref:

+ .. _rpc-backends:
+
+ Backends
+ ^^^^^^^^
+
+ The RPC module can leverage different backends to perform the communication
+ between the nodes. The backend to be used can be specified in the
+ :func:`~torch.distributed.rpc.init_rpc` function, by passing a value of the
+ :class:`~torch.distributed.rpc.BackendType` enum. Regardless of what backend
+ is used, the rest of the RPC API won't change. Each backend also defines its own
+ subclass of the :class:`~torch.distributed.rpc.RpcBackendOptions` class, an
+ instance of which can also be passed to :func:`~torch.distributed.rpc.init_rpc`
+ to configure the backend's behavior.
+
+ .. autoclass:: BackendType
+
+ .. autoclass:: RpcBackendOptions
+     :members:
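+
+ For instance, a minimal sketch (worker name, rank and world size are
+ illustrative) that selects the default Process Group backend explicitly and
+ overrides one of its options could look like; the per-backend sections below
+ show fuller examples::
+
+     >>> from torch.distributed import rpc
+     >>> rpc.init_rpc(
+     >>>     "worker0",
+     >>>     rank=0,
+     >>>     world_size=2,
+     >>>     backend=rpc.BackendType.PROCESS_GROUP,
+     >>>     rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(rpc_timeout=60),
+     >>> )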
+
+
+ Process Group Backend
+ """""""""""""""""""""
+
+ The Process Group agent, which is the default, instantiates a process group from
+ the :mod:`~torch.distributed` module and utilizes its point-to-point
+ communication capabilities to send RPC messages across. Internally, the process
+ group uses `the Gloo library <https://github.com/facebookincubator/gloo/>`_.
+
+ Gloo has been hardened by years of extensive use in PyTorch and is thus very
+ reliable. However, as it was designed to perform collective communication, it
+ may not always be the best fit for RPC. For example, each networking operation
+ is synchronous and blocking, which means that it cannot be run in parallel with
+ others. Moreover, it opens a connection between all pairs of nodes, and brings
+ down all of them when one fails, thus reducing the resiliency and the elasticity
+ of the system.
+
+ Example::
+
+     >>> import os
+     >>> from torch.distributed import rpc
+     >>> os.environ['MASTER_ADDR'] = 'localhost'
+     >>> os.environ['MASTER_PORT'] = '29500'
+     >>>
+     >>> rpc.init_rpc(
+     >>>     "worker1",
+     >>>     rank=0,
+     >>>     world_size=2,
+     >>>     rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
+     >>>         num_send_recv_threads=16,
+     >>>         rpc_timeout=20  # 20 second timeout
+     >>>     )
+     >>> )
+     >>>
+     >>> # omitting init_rpc invocation on worker2
+
+
+ .. autoclass:: ProcessGroupRpcBackendOptions
+     :members:
+     :inherited-members:
+
+
+ TensorPipe Backend
+ """"""""""""""""""
+
+ .. warning::
+     The TensorPipe backend is a **beta feature**.
+
+ The TensorPipe agent leverages `the TensorPipe library
+ <https://github.com/pytorch/tensorpipe>`_, which provides a natively
+ point-to-point communication primitive specifically suited for machine learning
+ that fundamentally addresses some of the limitations of Gloo. Compared to Gloo,
+ it has the advantage of being asynchronous, which allows a large number of
+ transfers to occur simultaneously, each at its own speed, without blocking
+ each other. It will only open pipes between pairs of nodes when needed, on
+ demand, and when one node fails only its incident pipes will be closed, while
+ all other ones will keep working as normal. In addition, it is able to support
+ multiple different transports (TCP, of course, but also shared memory, NVLink,
+ InfiniBand, ...) and can automatically detect their availability and negotiate
+ the best transport to use for each pipe.
+
+ The TensorPipe backend was introduced in PyTorch v1.6 and is being actively
+ developed. At the moment, it only supports CPU tensors, with GPU support coming
+ soon. It comes with a TCP-based transport, just like Gloo. It is also able to
+ automatically chunk and multiplex large tensors over multiple sockets and
+ threads in order to achieve very high bandwidths. In addition to that, it packs
+ two Linux-specific transports for communication between processes on the same
+ machine (one based on ring buffers stored in shared memory, the other on the
+ cross-memory attach syscalls) which can achieve lower latencies than TCP.
+ The agent will be able to pick the best transport on its own, with no
+ intervention required.
+
+ Example::
+
+     >>> import os
+     >>> from torch.distributed import rpc
+     >>> os.environ['MASTER_ADDR'] = 'localhost'
+     >>> os.environ['MASTER_PORT'] = '29500'
+     >>>
+     >>> rpc.init_rpc(
+     >>>     "worker1",
+     >>>     rank=0,
+     >>>     world_size=2,
+     >>>     backend=rpc.BackendType.TENSORPIPE,
+     >>>     rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
+     >>>         num_worker_threads=8,
+     >>>         rpc_timeout=20  # 20 second timeout
+     >>>     )
+     >>> )
+     >>>
+     >>> # omitting init_rpc invocation on worker2
+
+ .. autoclass:: TensorPipeRpcBackendOptions
+     :members:
+     :inherited-members:
+
+
+ .. _rref:

RRef
----