
Starred repositories
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels, FA2, HGEMM via MMA and CuTe (~99% TFLOPS of cuBLAS/FA2 🎉).
SGLang is a fast serving framework for large language models and vision language models.
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Additional utils and helpers to extend TensorFlow when building recommendation systems, contributed and maintained by SIG Recommenders.
A machine learning compiler for GPUs, CPUs, and ML accelerators
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…
How to learn PyTorch and OneFlow
How to optimize some algorithms in CUDA.
Fast and memory-efficient exact attention
Awesome-LLM: a curated list of Large Language Models
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models. ICML 2021
Alluxio, data orchestration for analytics and machine learning in the cloud
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.