The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of GPipe by Huang et al. Note that we replaced K in the paper with n.
The benchmark cares of only training speed rather than the model's accuracy. The batch size is adjusted to achieve higher throughput without any large batch training tricks. This example also doesn't feed the actual ImageNet dataset. Instead, fake 3×224×224 tensors over 1000 labels are used to eliminate data loading overhead.
Every experiment setting is optimized for Tesla P40 GPUs.
Experiment | Throughput | torchgpipe | Huang et al. |
---|---|---|---|
n=2, m=1 | 26.733/s | 1× | 1× |
n=2, m=4 | 41.133/s | 1.546× | 1.07× |
n=2, m=32 | 47.386/s | 1.780× | 1.21× |
n=4, m=1 | 26.827/s | 1.006× | 1.13× |
n=4, m=4 | 44.543/s | 1.680× | 1.26× |
n=4, m=32 | 72.412/s | 2.711× | 1.84× |
n=8, m=1 | 24.918/s | 0.932× | 1.38× |
n=8, m=4 | 70.065/s | 2.625× | 1.72× |
n=8, m=32 | 132.413/s | 4.966× | 3.48× |
(n: number of partitions, m: number of micro-batches)
- torchgpipe 0.0.5
- Python 3.6.9
- PyTorch 1.3.0
- CUDA 10.1.243
- 8 Tesla P40 GPUs
- 8+ Intel E5-2650 v4 CPUs
First, resolve the dependencies. We highly recommend to use a separate virtual environment only for this benchmark:
$ pip install -r requirements.txt
Then, run each benchmark:
$ python main.py n2m1
$ python main.py n2m4
$ python main.py n2m32
$ python main.py n4m1
$ python main.py n4m4
$ python main.py n4m32
$ python main.py n8m1
$ python main.py n8m4
$ python main.py n8m32