Release 0.3.3

workingloong released this 25 Jan 02:28

654240d

Features:

Support Python > 3.10.
Support restarting the training process on Ascend NPU.
Support asynchronously saving Megatron-LM distributed optimizer checkpoints to storage.
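
As an illustration of what asynchronous checkpoint saving generally involves, here is a minimal sketch using only the Python standard library. It is not this project's API or implementation: `AsyncCheckpointSaver`, `load_checkpoint`, `demo`, and the `step_<n>.pkl` file layout are hypothetical names chosen for the example. The idea shown is to snapshot the state in the training thread, then persist it in a background thread so training does not block on storage I/O.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointSaver:
    """Hypothetical helper: snapshot in the caller, persist in a thread."""

    def __init__(self, directory):
        self.directory = directory
        self._thread = None

    def save(self, step, state):
        # Wait for the previous write so at most one save is in flight.
        self.wait()
        # Snapshot synchronously so later training updates cannot
        # change what gets written.
        snapshot = copy.deepcopy(state)
        self._thread = threading.Thread(
            target=self._write, args=(step, snapshot)
        )
        self._thread.start()

    def _write(self, step, snapshot):
        path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the write completes, so readers never
        # see a partially written checkpoint file.
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the in-flight write, if any, has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def load_checkpoint(directory, step):
    with open(os.path.join(directory, f"step_{step}.pkl"), "rb") as f:
        return pickle.load(f)


def demo():
    directory = tempfile.mkdtemp()
    saver = AsyncCheckpointSaver(directory)
    state = {"step": 10, "weights": [1.0, 2.0]}
    saver.save(10, state)
    state["weights"][0] = 99.0  # training continues while the save runs
    saver.wait()
    return load_checkpoint(directory, 10)
```

Calling `demo()` returns the state as it was when `save` was called, because the snapshot is taken before the background write starts. A real trainer would typically copy tensors to host or shared memory instead of using `copy.deepcopy`.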

Bug fixes:

Fix inconsistent checkpoint shards across ranks.
Fix a bug in asynchronously saving the Megatron-LM checkpoint for jobs running with multiple GPUs on multiple nodes.
Fix a bug in loading the Megatron-LM checkpoint.