Stars
CUDA Python: Performance meets Productivity
Distributed Triton for Parallel Systems
A Datacenter Scale Distributed Inference Serving Framework
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM. DeeperGEMM: crazy optimized version
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
MoBA: Mixture of Block Attention for Long-Context LLMs
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
verl: Volcano Engine Reinforcement Learning for LLMs
Universal LLM Deployment Engine with ML Compilation
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
Tile primitives for speedy kernels
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate the attention with dynamic sparse computation, which reduces inference latency by up to 10x for pre-filling on an … (a rough sketch of the idea follows below)
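
The dynamic sparse attention idea behind that last entry can be shown with a short, self-contained sketch. This is not that repository's implementation: the mean-pooling heuristic, the function name sparse_prefill_attention, and the block_size/topk parameters are all illustrative assumptions about how block-level top-k selection can approximate full attention during pre-filling.

# Minimal sketch of dynamic block-sparse attention for pre-filling.
# Illustrative only; block pooling, top-k selection, and all names here
# are assumptions, not the starred repository's actual algorithm.
import torch

def sparse_prefill_attention(q, k, v, block_size=64, topk=4):
    """q, k, v: [seq_len, head_dim]. Returns an approximate attention output."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # Pool queries and keys per block to cheaply estimate block importance.
    q_blk = q[: n_blocks * block_size].view(n_blocks, block_size, dim).mean(dim=1)
    k_blk = k[: n_blocks * block_size].view(n_blocks, block_size, dim).mean(dim=1)
    scores = q_blk @ k_blk.T / dim ** 0.5                    # [n_blocks, n_blocks]

    # Causal mask at block granularity, then keep the top-k key blocks
    # per query block (the "dynamic sparse" pattern).
    causal = torch.tril(torch.ones(n_blocks, n_blocks, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    keep = torch.topk(scores, k=min(topk, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for qb in range(n_blocks):
        q_rows = q[qb * block_size : (qb + 1) * block_size]
        # Drop any selected block that lies in the future of this query block.
        kb_idx = keep[qb][keep[qb] <= qb]
        k_sel = torch.cat([k[i * block_size : (i + 1) * block_size] for i in kb_idx])
        v_sel = torch.cat([v[i * block_size : (i + 1) * block_size] for i in kb_idx])
        # Per-token causal masking inside the diagonal block is omitted for brevity.
        attn = torch.softmax(q_rows @ k_sel.T / dim ** 0.5, dim=-1)
        out[qb * block_size : (qb + 1) * block_size] = attn @ v_sel
    return out

# Example: 1k tokens, 64-dim head; only about topk/n_blocks of the attention
# matrix is ever materialized during pre-filling.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
print(sparse_prefill_attention(q, k, v).shape)  # torch.Size([1024, 64])

The point of the sketch is the cost structure, not the exact selection rule: block-level scoring costs O(n_blocks^2) instead of O(seq_len^2), and the full score matrix is only computed for the few key blocks each query block keeps.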