Release 0.3.3
Features:
- Support Python versions newer than 3.10.
- Support restarting the training process on Ascend NPU.
- Support asynchronously saving Megatron-LM distributed optimizer checkpoints to storage.
Bug fixes:
- Fix inconsistent checkpoint shards across ranks.
- Fix a bug in asynchronously saving the Megatron-LM checkpoint for jobs that use multiple GPUs on multiple nodes.
- Fix a bug in loading the Megatron-LM checkpoint.