Starred repositories
Pipeline Parallelism Emulation and Visualization
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Enabling PyTorch on XLA Devices (e.g. Google TPU)
Qwen3 is a series of large language models developed by the Qwen team at Alibaba Cloud.
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves effective training time by minimizing downtime due to failures.
Official Repo for Open-Reasoner-Zero
verl: Volcano Engine Reinforcement Learning for LLMs
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Use PEFT or full-parameter training to run CPT/SFT/DPO/GRPO on 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4…
An easy-to-use, scalable, and high-performance RLHF framework based on Ray (PPO & GRPO & REINFORCE++ & vLLM & RFT & Dynamic Sampling & Async Agent RL)
Finetune Qwen3, Llama 4, TTS, DeepSeek-R1 & Gemma 3 LLMs 2x faster with 70% less memory! 🦥
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Open source platform for the machine learning lifecycle
DLRover: An Automatic Distributed Deep Learning System
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
GoogleTest - Google Testing and Mocking Framework