Skip to content

NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.

License

Notifications You must be signed in to change notification settings

okoge-kaz/nvidia-resiliency-ext

 
 

Repository files navigation

NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

Core Components and Capabilities

  • Fault Tolerance

    • Detection of hung ranks.
    • Restarting training in-job, without the need to reallocate SLURM nodes.
  • In-Process Restarting

    • Detecting failures and enabling quick recovery.
  • Async Checkpointing

    • Providing an efficient framework for asynchronous checkpointing.
  • Local Checkpointing

    • Providing an efficient framework for local checkpointing.
  • Straggler Detection

    • Monitoring GPU and CPU performance of ranks.
    • Identifying slower ranks that may impede overall training efficiency.
  • PyTorch Lightning Callbacks

    • Facilitating seamless NVRx integration with PyTorch Lightning.

Installation

From sources

  • git clone https://github.com/NVIDIA/nvidia-resiliency-ext
  • cd nvidia-resiliency-ext
  • pip install .

From PyPI wheel

  • pip install nvidia-resiliency-ext

Platform Support

Category Supported Versions / Requirements
Architecture x86_64
Operating System Ubuntu 22.04
Python Version >= 3.10, < 3.13
PyTorch Version 2.3+
CUDA & CUDA Toolkit 12.5+
NVML Driver 550 or later
NCCL Version 2.21.5+

Note: The package is designed to support Python >= 3.10, CUDA >= 11.8, PyTorch >= 2.0 and Ubuntu 20.04, but the recommended and tested environment for production is Python >= 3.10, < 3.13, CUDA 12.5+, and Ubuntu 22.04.

Usage

For detailed documentation and usage information about each component, please refer to the https://nvidia.github.io/nvidia-resiliency-ext/.

About

NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to failures and interruptions.

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 96.8%
  • Shell 1.9%
  • C++ 1.3%