forked from horovod/horovod
Add LSF and jsrun support to horovodrun (horovod#1805)
Example to run on an LSF cluster (e.g. Summit): horovodrun python train.py
Performs CPU/memory process binding to get the best performance.
Contributors: @bethune-bryant @nvcastet
Signed-off-by: Nicolas V Castet <[email protected]>
Showing 12 changed files with 669 additions and 239 deletions.
@@ -119,6 +119,8 @@ Guides

   spark_include

   lsf_include

   tensor-fusion_include

   timeline_include

@@ -0,0 +1,32 @@
.. inclusion-marker-start-do-not-remove

Horovod in LSF
==============

This page includes examples for running Horovod in an LSF cluster.
``horovodrun`` automatically detects the host names and GPUs of your LSF job.
If the LSF cluster supports ``jsrun``, ``horovodrun`` uses it as the launcher;
otherwise, it defaults to ``mpirun``.
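
The launcher choice described above can be pictured with a small shell sketch. This is only an illustration of the documented behaviour, not Horovod's implementation: ``pick_launcher`` and ``in_lsf_job`` are hypothetical helpers, and the sketch assumes that "supports ``jsrun``" means the ``jsrun`` executable is on ``PATH`` and that LSF exports ``LSB_JOBID`` inside a job.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; these helpers are hypothetical, not part of Horovod.

# Prefer jsrun when it is available on PATH, otherwise fall back to mpirun.
pick_launcher() {
    if command -v jsrun >/dev/null 2>&1; then
        echo "jsrun"
    else
        echo "mpirun"
    fi
}

# Assumption: LSF sets LSB_JOBID for processes running inside a job.
in_lsf_job() {
    [ -n "${LSB_JOBID:-}" ]
}

if in_lsf_job; then
    echo "LSF job ${LSB_JOBID}: launcher would be $(pick_launcher)"
else
    echo "Not inside an LSF job"
fi
```

Running this on a login node and again inside a job is a quick way to see which launcher is likely to be selected before starting a long training run.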

Inside an LSF batch file or in an interactive session, you only need to run:

.. code-block:: bash

    horovodrun python train.py

Here, Horovod starts one process per GPU on all the hosts of the LSF job.
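
For context, a minimal LSF batch file wrapping this command might look like the sketch below. The ``#BSUB`` values (job name, wall time, node count, output file) are placeholder assumptions, not recommendations. ``-nnodes`` is the node-request form used on ``jsrun``-based systems such as Summit; other LSF sites typically request slots with ``-n`` instead, and may require further options such as a project or queue.

```shell
#!/bin/bash
# Sketch of an LSF batch file; every #BSUB value below is a placeholder assumption.
#BSUB -J horovod-train           # job name
#BSUB -W 30                      # wall-clock limit in minutes
#BSUB -nnodes 2                  # node count (jsrun-based systems; elsewhere use -n)
#BSUB -o horovod-train.%J.out    # output file; %J expands to the job ID

# horovodrun reads the hosts and GPUs from the LSF job, so no host list is passed.
horovodrun python train.py
```

The file is submitted with ``bsub``. Once the job starts, ``horovodrun`` picks up the allocation on its own.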

You can also limit the run to a subset of the job resources. For example, to use only 6 GPUs:

.. code-block:: bash

    horovodrun -np 6 python train.py

You can still pass extra arguments to ``horovodrun``. For example, to enable CUDA-aware MPI:

.. code-block:: bash

    horovodrun --mpi-args="-gpu" python train.py

.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
.. include:: ./lsf.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove