Release 0.3.3

workingloong released this 25 Jan 02:28

Features:

  • Support Python > 3.10.
  • Support restarting the training process on Ascend NPU.
  • Support asynchronously saving the Megatron-LM distributed optimizer checkpoint to storage.
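
For readers unfamiliar with the idea, asynchronous checkpoint saving can be sketched roughly as follows: the training loop blocks only while the state is copied in memory, and a background thread writes the copy to storage. This is a generic illustration using the Python standard library, not the implementation shipped in this release; the class `AsyncCheckpointSaver`, its methods, and the `step_<n>.pkl` file naming are all hypothetical.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointSaver:
    """Hypothetical helper: snapshot state in memory, persist it in a background thread."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._thread = None

    def _path(self, step):
        # Assumed naming scheme, chosen only for this sketch.
        return os.path.join(self.directory, f"step_{step}.pkl")

    def save(self, step, state):
        # Wait for any previous write, then take an in-memory snapshot.
        # Only this part blocks the caller; the disk write runs in the background.
        self.wait()
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(target=self._persist, args=(step, snapshot))
        self._thread.start()

    def _persist(self, step, snapshot):
        # Write to a temporary file first, then rename it into place, so a
        # reader never sees a partially written checkpoint.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, self._path(step))

    def wait(self):
        # Block until the pending background write (if any) has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def load(self, step):
        with open(self._path(step), "rb") as f:
            return pickle.load(f)
```

In this sketch, a caller would invoke `save(step, state)` inside the training loop and `wait()` before exiting, so that the last checkpoint is fully written. Because the snapshot is a deep copy, later updates to the live state do not affect the checkpoint being written.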

BugFix:

  • Fix the checkpoint shard inconsistency across ranks.
  • Fix a bug in asynchronously saving the Megatron-LM checkpoint for jobs running on multiple nodes with multiple GPUs.
  • Fix a bug in loading the Megatron-LM checkpoint.