This repository contains the source code for the paper "Reducing Memory Footprint in Deep Network Training by Gradient Space Reutilization," which proposes reusing the memory occupied by the oldest gradient to store intermediate variables once that gradient is no longer needed. We apply this method to several mainstream optimizers to obtain memory-reduced variants, named Adam_R, Adan_R, and Lion_R, respectively.
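As a rough illustration of the idea, the sketch below shows a simplified Adam-style step in PyTorch in which the gradient's storage is reused for the intermediate update tensor once the gradient values have been folded into the moment estimates. This is a hypothetical sketch of the general technique, not the repository's actual implementation; bias correction and weight decay are omitted for brevity.

```python
import torch

@torch.no_grad()
def adam_r_like_step(param, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Simplified Adam-style step illustrating gradient-space reuse (sketch only).

    Assumes state['exp_avg'] and state['exp_avg_sq'] are pre-initialized
    tensors with the same shape as `param`.
    """
    beta1, beta2 = betas
    grad = param.grad
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

    # Fold the gradient into the first and second moment estimates.
    # After these two lines the gradient values are no longer needed.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Reuse the gradient's memory for the intermediate update instead of
    # allocating a fresh buffer: update = exp_avg / (sqrt(exp_avg_sq) + eps).
    update = grad  # same storage, new role
    torch.sqrt(exp_avg_sq, out=update)
    update.add_(eps).reciprocal_().mul_(exp_avg)

    param.add_(update, alpha=-lr)
```

The released Adam_R, Adan_R, and Lion_R apply this principle inside full optimizer implementations, so no extra buffer has to be allocated for the intermediate variables.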
If you find this work useful, please cite the paper and star this repository. Thanks!
```bibtex
@inproceedings{dong2024reducing,
  title={Reducing Memory Footprint in Deep Network Training by Gradient Space Reutilization},
  author={Dong, Yiming and Lin, Zhouchen},
  booktitle={Chinese Conference on Pattern Recognition and Computer Vision (PRCV)},
  pages={376--390},
  year={2024},
  organization={Springer}
}
```
Our paper won the PRCV Best Paper Award!
The experimental results demonstrate the efficacy of our memory-reduction strategies across various model architectures. The tables below summarize peak GPU memory usage (in MB) and the savings achieved by the memory-reduced variants of AdamW, Adan, and Lion; the ZeRO column indicates whether ZeRO was enabled in each configuration.
Model | # Params | AdamW (MB) | AdamW-R (MB) | Savings (%) | ZeRO |
---|---|---|---|---|---|
ViT-S | 22.9M | 526 | 417 | 20.71 | β |
ViT-B | 88.2M | 2007 | 1629 | 18.81 | β |
ViT-L | 305.5M | 6367 | 5046 | 20.75 | β |
ViT-H | 630.8M | 13336 | 10777 | 19.19 | β |
ViT-G | 1.0B | 21542 | 17408 | 19.19 | β |
ConvNeXt-T | 28.6M | 684 | 621 | 9.20 | β |
ConvNeXt-S | 50.2M | 1177 | 1009 | 14.26 | β |
ConvNeXt-B | 88.6M | 1894 | 1629 | 13.95 | β |
ConvNeXt-L | 197.8M | 4387 | 3706 | 15.54 | β |
ConvNeXt-XL | 350.2M | 7218 | 6004 | 16.82 | β |
BLOOM-560M | 559.2M | 15531 | 13822 | 11.00 | β |
BLOOM-560M | 559.2M | 5339 | 5011 | 6.15 | β |
BLOOM-3B | 3.0B | 23477 | 21964 | 6.45 | β |
BLOOM-7B | 7.1B | 44826 | 41296 | 7.87 | β |
Phi-1.5 | 1.4B | 36650 | 36008 | 1.75 | β |
Phi-1.5 | 1.4B | 18616 | 17949 | 3.59 | β |
Phi-2 | 2.8B | 27581 | 26132 | 5.26 | β |
Qwen-0.5B | 464.0M | 12581 | 11272 | 10.40 | β |
Qwen-0.5B | 464.0M | 4897 | 4837 | 1.23 | β |
Qwen-1.8B | 1.8B | 46410 | 38986 | 16.00 | β |
Qwen-1.8B | 1.8B | 12756 | 11902 | 6.69 | β |
LLaMA-2-7B | 6.7B | 32325 | 29002 | 10.28 | β |
LLaMA-2-13B | 13.0B | 49103 | 45768 | 6.79 | β |
Gemma-2B | 2.5B | 19609 | 18365 | 6.35 | β |
Gemma-7B | 8.5B | 47029 | 42841 | 8.90 | β |
Vicuna-7B | 6.7B | 32351 | 28993 | 10.38 | β |
Vicuna-13B | 13.0B | 49327 | 46089 | 6.57 | β |
ChatGLM3-6B | 6.2B | 31491 | 28369 | 9.92 | β |
Falcon-7B | 6.9B | 33643 | 30168 | 10.33 | β |
Model | # Params | Adan (MB) | Adan-R (MB) | Savings (%) | ZeRO |
---|---|---|---|---|---|
ViT-S | 22.9M | 711 | 621 | 12.68 | β |
ViT-B | 88.2M | 2806 | 2407 | 14.20 | β |
ViT-L | 305.5M | 8812 | 7491 | 14.99 | β |
ViT-H | 630.8M | 18639 | 16110 | 13.57 | β |
ViT-G | 1.0B | 30130 | 25910 | 14.00 | β |
ConvNeXt-T | 28.6M | 927 | 864 | 6.78 | β |
ConvNeXt-S | 50.2M | 1634 | 1466 | 10.27 | β |
ConvNeXt-B | 88.6M | 2632 | 2355 | 10.52 | β |
ConvNeXt-L | 197.8M | 6078 | 5417 | 10.87 | β |
ConvNeXt-XL | 350.2M | 10008 | 8823 | 11.84 | β |
BLOOM-560M | 559.2M | 20005 | 18296 | 8.55 | β |
BLOOM-560M | 559.2M | 5859 | 5544 | 5.38 | β |
BLOOM-3B | 3.0B | 26472 | 24965 | 5.69 | β |
BLOOM-7B | 7.1B | 48355 | 48184 | 0.35 | β |
Phi-1.5 | 1.4B | 20098 | 19370 | 3.62 | β |
Phi-2 | 2.8B | 30301 | 28907 | 4.59 | β |
Qwen-0.5B | 464.0M | 16437 | 15129 | 7.96 | β |
Qwen-0.5B | 464.0M | 5509 | 5491 | 0.33 | β |
Qwen-1.8B | 1.8B | 14691 | 13673 | 6.93 | β |
LLaMA-2-7B | 6.7B | 39115 | 35713 | 8.70 | β |
Gemma-2B | 2.5B | 22118 | 20870 | 5.64 | β |
Gemma-7B | 8.5B | 49424 | 48484 | 1.91 | β |
Vicuna-7B | 6.7B | 32351 | 28993 | 10.38 | β |
ChatGLM3-6B | 6.2B | 37670 | 34614 | 8.11 | β |
Falcon-7B | 6.9B | 40548 | 37099 | 8.51 | β |
Model | # Params | Lion (MB) | Lion-R (MB) | Savings (%) | ZeRO |
---|---|---|---|---|---|
ViT-S | 22.9M | 415 | 327 | 21.21 | β |
ViT-B | 88.2M | 1629 | 1231 | 24.45 | β |
ViT-L | 305.5M | 5144 | 3827 | 25.60 | β |
ViT-H | 630.8M | 10687 | 8087 | 24.33 | β |
ViT-G | 1.0B | 17226 | 13189 | 23.43 | β |
ConvNeXt-T | 28.6M | 552 | 489 | 11.41 | β |
ConvNeXt-S | 50.2M | 958 | 791 | 17.51 | β |
ConvNeXt-B | 88.6M | 1529 | 1281 | 16.19 | β |
ConvNeXt-L | 197.8M | 3521 | 2861 | 18.77 | β |
ConvNeXt-XL | 350.2M | 5862 | 4618 | 21.22 | β |
BLOOM-560M | 559.2M | 13294 | 11996 | 9.76 | β |
BLOOM-560M | 559.2M | 4513 | 4508 | 0.12 | β |
BLOOM-3B | 3.0B | 21957 | 20462 | 6.81 | β |
BLOOM-7B | 7.1B | 41306 | 37761 | 8.58 | β |
Phi-1.5 | 1.4B | 17950 | 17273 | 3.77 | β |
Phi-2 | 2.8B | 26159 | 24809 | 5.15 | β |
Qwen-0.5B | 464.0M | 10614 | 9666 | 8.93 | β |
Qwen-0.5B | 464.0M | 4897 | 4855 | 0.86 | β |
Qwen-1.8B | 1.8B | 38986 | 31562 | 19.04 | β |
Qwen-1.8B | 1.8B | 11913 | 10945 | 8.13 | β |
LLaMA-2-7B | 6.7B | 29007 | 25618 | 11.68 | β |
LLaMA-2-13B | 13.0B | 47297 | 39249 | 17.02 | β |
Gemma-2B | 2.5B | 18347 | 17123 | 6.67 | β |
Gemma-7B | 8.5B | 48279 | 39416 | 8.08 | β |
Vicuna-7B | 6.7B | 28978 | 25596 | 11.67 | β |
Vicuna-13B | 13.0B | 47596 | 39514 | 16.98 | β |
ChatGLM3-6B | 6.2B | 28302 | 25180 | 11.03 | β |
Falcon-7B | 6.9B | 30187 | 26719 | 11.49 | β |
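The tables above report peak GPU memory in MB. As an illustration of how such a number can be obtained with PyTorch's CUDA memory statistics, the sketch below measures the peak allocation over a few training steps; it is an assumption about the measurement setup (models, batch sizes, precision, and ZeRO configuration are defined by the paper's experiments), not the exact protocol behind the tables.

```python
import torch

def measure_peak_memory_mb(model, optimizer, batches, loss_fn, device="cuda"):
    """Measure peak allocated GPU memory (in MB) over a few training steps.

    Illustrative only: the exact experimental configuration used for the
    tables above is described in the paper.
    """
    model.to(device).train()
    torch.cuda.reset_peak_memory_stats(device)

    for inputs, targets in batches:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    return torch.cuda.max_memory_allocated(device) / (1024 ** 2)
```

The reported savings then follow as (baseline − reduced) / baseline.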
The memory-reduced variants AdamW-R and Adan-R maintain exactly the same training dynamics as their original counterparts when initialized with the same random seed, as shown in Table 1 of our paper. Lion-R slightly changes the order of computation due to variable substitution, but it remains theoretically equivalent to the original Lion optimizer, and its impact on the overall optimization outcome is minimal.
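Because of this equivalence, the -R optimizers can be used as drop-in replacements and checked against their originals. The snippet below sketches such a sanity check; the import path `optimizers` and the class name `AdamW_R` are placeholders for whatever this repository actually exposes, and the hyperparameters are passed explicitly so both optimizers run with the same settings.

```python
import torch
import torch.nn as nn

# Hypothetical import: adjust to the module and class name exposed by this repo.
from optimizers import AdamW_R


def run(opt_cls, seed=0, steps=5):
    """Train a tiny model for a few steps and return the final parameters."""
    torch.manual_seed(seed)
    model = nn.Linear(16, 4)
    opt = opt_cls(model.parameters(), lr=1e-3, weight_decay=1e-2)
    x, y = torch.randn(32, 16), torch.randn(32, 4)
    for _ in range(steps):
        opt.zero_grad(set_to_none=True)
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return [p.detach().clone() for p in model.parameters()]


# With the same seed, the memory-reduced variant should reproduce the
# reference AdamW trajectory.
reference = run(torch.optim.AdamW)
reduced = run(AdamW_R)
assert all(torch.allclose(a, b) for a, b in zip(reference, reduced))
```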