# Reducing Memory Footprint in Deep Network Training by Gradient Space Reutilization

This repository contains the source code for the paper "Reducing Memory Footprint in Deep Network Training by Gradient Space Reutilization," which proposes reusing the oldest gradient's storage to hold intermediate variables once that gradient is no longer needed. We apply this method to several mainstream optimizers to obtain memory-reduced variants, named Adam_R, Adan_R, and Lion_R, respectively.
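To illustrate the idea, below is a minimal, self-contained sketch of an AdamW-style step that writes its intermediate update direction into the gradient's storage once the gradient has been folded into the moment estimates. It assumes a PyTorch-style single-tensor update; the function and variable names are illustrative and do not reflect this repository's actual implementation.

```python
# Illustrative sketch only (NOT the repository's implementation): a simplified
# AdamW-style step, single tensor, FP32, no bias correction. The names
# (reuse_grad_adamw_step, etc.) are hypothetical.
import torch

@torch.no_grad()
def reuse_grad_adamw_step(p, exp_avg, exp_avg_sq, lr=1e-3,
                          betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    beta1, beta2 = betas
    g = p.grad

    # Fold the gradient into the moving averages first; after these two lines
    # the gradient's values are no longer needed for this step.
    exp_avg.mul_(beta1).add_(g, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(g, g, value=1 - beta2)

    # Reuse the gradient's storage for the intermediate update direction
    # instead of allocating a fresh buffer of the same size. (The denominator
    # below still creates a small temporary; the point of the sketch is that
    # the full-sized update buffer lives in g's storage.)
    torch.div(exp_avg, exp_avg_sq.sqrt().add_(eps), out=g)

    # Decoupled weight decay, then apply the update held in the reused buffer.
    p.mul_(1 - lr * weight_decay)
    p.add_(g, alpha=-lr)
```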

## Citation

If you find this work useful, please cite our paper and star this repository. Thanks!

```bibtex
@inproceedings{dong2024reducing,
  title={Reducing Memory Footprint in Deep Network Training by Gradient Space Reutilization},
  author={Dong, Yiming and Lin, Zhouchen},
  booktitle={Chinese Conference on Pattern Recognition and Computer Vision (PRCV)},
  pages={376--390},
  year={2024},
  organization={Springer}
}
```

## News

Our paper won the PRCV Best Paper Award! 🎉🎉🎉

## Experimental Results

The experimental results demonstrate the efficacy of our memory-reduction strategies across various model architectures. The tables below summarize the peak memory usage and the savings achieved by the memory-reduced variants of each optimizer.
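For reference, peak memory figures of this kind can be recorded by querying the CUDA caching allocator around a single training step. The snippet below is a generic sketch assuming PyTorch on a single GPU; the exact measurement protocol used for the tables may differ.

```python
# Generic sketch for recording peak GPU memory over one training step.
# The measurement protocol used for the paper's tables may differ.
import torch

def peak_memory_mb(model, optimizer, batch, loss_fn, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    inputs, targets = (t.to(device) for t in batch)
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.synchronize(device)
    # Peak bytes allocated since the last reset, converted to MB.
    return torch.cuda.max_memory_allocated(device) / 2**20
```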

### Memory Usage and Savings for AdamW and AdamW-R

| Model | # Params | AdamW (MB) | AdamW-R (MB) | Savings (%) | ZeRO |
|---|---|---|---|---|---|
| ViT-S | 22.9M | 526 | 417 | 20.71 | ✗ |
| ViT-B | 88.2M | 2007 | 1629 | 18.81 | ✗ |
| ViT-L | 305.5M | 6367 | 5046 | 20.75 | ✗ |
| ViT-H | 630.8M | 13336 | 10777 | 19.19 | ✗ |
| ViT-G | 1.0B | 21542 | 17408 | 19.19 | ✗ |
| ConvNeXt-T | 28.6M | 684 | 621 | 9.20 | ✗ |
| ConvNeXt-S | 50.2M | 1177 | 1009 | 14.26 | ✗ |
| ConvNeXt-B | 88.6M | 1894 | 1629 | 13.95 | ✗ |
| ConvNeXt-L | 197.8M | 4387 | 3706 | 15.54 | ✗ |
| ConvNeXt-XL | 350.2M | 7218 | 6004 | 16.82 | ✗ |
| BLOOM-560M | 559.2M | 15531 | 13822 | 11.00 | ✗ |
| BLOOM-560M | 559.2M | 5339 | 5011 | 6.15 | ✓ |
| BLOOM-3B | 3.0B | 23477 | 21964 | 6.45 | ✓ |
| BLOOM-7B | 7.1B | 44826 | 41296 | 7.87 | ✓ |
| Phi-1.5 | 1.4B | 36650 | 36008 | 1.75 | ✗ |
| Phi-1.5 | 1.4B | 18616 | 17949 | 3.59 | ✓ |
| Phi-2 | 2.8B | 27581 | 26132 | 5.26 | ✓ |
| Qwen-0.5B | 464.0M | 12581 | 11272 | 10.40 | ✗ |
| Qwen-0.5B | 464.0M | 4897 | 4837 | 1.23 | ✓ |
| Qwen-1.8B | 1.8B | 46410 | 38986 | 16.00 | ✗ |
| Qwen-1.8B | 1.8B | 12756 | 11902 | 6.69 | ✓ |
| LLaMA-2-7B | 6.7B | 32325 | 29002 | 10.28 | ✓ |
| LLaMA-2-13B | 13.0B | 49103 | 45768 | 6.79 | ✓ |
| Gemma-2B | 2.5B | 19609 | 18365 | 6.35 | ✓ |
| Gemma-7B | 8.5B | 47029 | 42841 | 8.90 | ✓ |
| Vicuna-7B | 6.7B | 32351 | 28993 | 10.38 | ✓ |
| Vicuna-13B | 13.0B | 49327 | 46089 | 6.57 | ✓ |
| ChatGLM3-6B | 6.2B | 31491 | 28369 | 9.92 | ✓ |
| Falcon-7B | 6.9B | 33643 | 30168 | 10.33 | ✓ |

### Memory Usage and Savings for Adan and Adan-R

| Model | # Params | Adan (MB) | Adan-R (MB) | Savings (%) | ZeRO |
|---|---|---|---|---|---|
| ViT-S | 22.9M | 711 | 621 | 12.68 | ✗ |
| ViT-B | 88.2M | 2806 | 2407 | 14.20 | ✗ |
| ViT-L | 305.5M | 8812 | 7491 | 14.99 | ✗ |
| ViT-H | 630.8M | 18639 | 16110 | 13.57 | ✗ |
| ViT-G | 1.0B | 30130 | 25910 | 14.00 | ✗ |
| ConvNeXt-T | 28.6M | 927 | 864 | 6.78 | ✗ |
| ConvNeXt-S | 50.2M | 1634 | 1466 | 10.27 | ✗ |
| ConvNeXt-B | 88.6M | 2632 | 2355 | 10.52 | ✗ |
| ConvNeXt-L | 197.8M | 6078 | 5417 | 10.87 | ✗ |
| ConvNeXt-XL | 350.2M | 10008 | 8823 | 11.84 | ✗ |
| BLOOM-560M | 559.2M | 20005 | 18296 | 8.55 | ✗ |
| BLOOM-560M | 559.2M | 5859 | 5544 | 5.38 | ✓ |
| BLOOM-3B | 3.0B | 26472 | 24965 | 5.69 | ✓ |
| BLOOM-7B | 7.1B | 48355 | 48184 | 0.35 | ✓ |
| Phi-1.5 | 1.4B | 20098 | 19370 | 3.62 | ✓ |
| Phi-2 | 2.8B | 30301 | 28907 | 4.59 | ✓ |
| Qwen-0.5B | 464.0M | 16437 | 15129 | 7.96 | ✗ |
| Qwen-0.5B | 464.0M | 5509 | 5491 | 0.33 | ✓ |
| Qwen-1.8B | 1.8B | 14691 | 13673 | 6.93 | ✓ |
| LLaMA-2-7B | 6.7B | 39115 | 35713 | 8.70 | ✓ |
| Gemma-2B | 2.5B | 22118 | 20870 | 5.64 | ✓ |
| Gemma-7B | 8.5B | 49424 | 48484 | 1.91 | ✓ |
| Vicuna-7B | 6.7B | 32351 | 28993 | 10.38 | ✓ |
| ChatGLM3-6B | 6.2B | 37670 | 34614 | 8.11 | ✓ |
| Falcon-7B | 6.9B | 40548 | 37099 | 8.51 | ✓ |

### Memory Usage and Savings for Lion and Lion-R

| Model | # Params | Lion (MB) | Lion-R (MB) | Savings (%) | ZeRO |
|---|---|---|---|---|---|
| ViT-S | 22.9M | 415 | 327 | 21.21 | ✗ |
| ViT-B | 88.2M | 1629 | 1231 | 24.45 | ✗ |
| ViT-L | 305.5M | 5144 | 3827 | 25.60 | ✗ |
| ViT-H | 630.8M | 10687 | 8087 | 24.33 | ✗ |
| ViT-G | 1.0B | 17226 | 13189 | 23.43 | ✗ |
| ConvNeXt-T | 28.6M | 552 | 489 | 11.41 | ✗ |
| ConvNeXt-S | 50.2M | 958 | 791 | 17.51 | ✗ |
| ConvNeXt-B | 88.6M | 1529 | 1281 | 16.19 | ✗ |
| ConvNeXt-L | 197.8M | 3521 | 2861 | 18.77 | ✗ |
| ConvNeXt-XL | 350.2M | 5862 | 4618 | 21.22 | ✗ |
| BLOOM-560M | 559.2M | 13294 | 11996 | 9.76 | ✗ |
| BLOOM-560M | 559.2M | 4513 | 4508 | 0.12 | ✓ |
| BLOOM-3B | 3.0B | 21957 | 20462 | 6.81 | ✓ |
| BLOOM-7B | 7.1B | 41306 | 37761 | 8.58 | ✓ |
| Phi-1.5 | 1.4B | 17950 | 17273 | 3.77 | ✓ |
| Phi-2 | 2.8B | 26159 | 24809 | 5.15 | ✓ |
| Qwen-0.5B | 464.0M | 10614 | 9666 | 8.93 | ✗ |
| Qwen-0.5B | 464.0M | 4897 | 4855 | 0.86 | ✓ |
| Qwen-1.8B | 1.8B | 38986 | 31562 | 19.04 | ✗ |
| Qwen-1.8B | 1.8B | 11913 | 10945 | 8.13 | ✓ |
| LLaMA-2-7B | 6.7B | 29007 | 25618 | 11.68 | ✓ |
| LLaMA-2-13B | 13.0B | 47297 | 39249 | 17.02 | ✓ |
| Gemma-2B | 2.5B | 18347 | 17123 | 6.67 | ✓ |
| Gemma-7B | 8.5B | 48279 | 39416 | 8.08 | ✓ |
| Vicuna-7B | 6.7B | 28978 | 25596 | 11.67 | ✓ |
| Vicuna-13B | 13.0B | 47596 | 39514 | 16.98 | ✓ |
| ChatGLM3-6B | 6.2B | 28302 | 25180 | 11.03 | ✓ |
| Falcon-7B | 6.9B | 30187 | 26719 | 11.49 | ✓ |

## Equivalence to the Original Algorithms

The memory-reduced variants AdamW-R and Adan-R produce training dynamics identical to those of their original counterparts when initialized with the same random seed, as shown in Table 1 of our paper. Lion-R introduces a minor change in the computational sequence due to variable substitution, but it remains theoretically equivalent to the original Lion optimizer and has a minimal impact on the overall optimization outcome.
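This claim can be sanity-checked locally by running an optimizer and its memory-reduced variant on identical copies of a small model and comparing the resulting parameters. The harness below is a rough sketch; the import path of the memory-reduced optimizer is an assumption and should be adjusted to this repository's actual module name.

```python
# Sanity check of the equivalence claim: run two optimizer classes on identical
# copies of a small model with the same data and compare parameters.
import copy
import torch

def max_param_diff(opt_cls_a, opt_cls_b, steps=50, seed=0):
    torch.manual_seed(seed)
    model_a = torch.nn.Linear(64, 64)
    model_b = copy.deepcopy(model_a)  # identical initial weights
    opt_a = opt_cls_a(model_a.parameters(), lr=1e-3)
    opt_b = opt_cls_b(model_b.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(32, 64)  # same batch fed to both models
        for model, opt in ((model_a, opt_a), (model_b, opt_b)):
            opt.zero_grad()
            model(x).pow(2).mean().backward()
            opt.step()
    return max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))

# Example usage (the import below is hypothetical; adjust to the actual module):
# from memory_reduced_optimizer import AdamW_R
# print(max_param_diff(torch.optim.AdamW, AdamW_R))  # expect 0.0 for AdamW-R
```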
