Stars
CUDA Python: Performance meets Productivity
Distributed Triton for Parallel Systems
A Datacenter Scale Distributed Inference Serving Framework
ademeure / DeeperGEMM
Forked from deepseek-ai/DeepGEMM. DeeperGEMM: crazy optimized version
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
MoBA: Mixture of Block Attention for Long-Context LLMs
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
verl: Volcano Engine Reinforcement Learning for LLMs
Universal LLM Deployment Engine with ML Compilation
An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>
Tile primitives for speedy kernels
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate the attention with dynamic sparse computation, which reduces inference latency by up to 10x for pre-filling on an … (a rough sketch of the idea follows below)
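
The dynamic sparse attention idea behind that last entry can be shown with a short, self-contained sketch. This is not that repository's implementation: the mean-pooling heuristic, the function name sparse_prefill_attention, and the block_size/topk parameters are all illustrative assumptions about how block-level top-k selection can approximate full attention during pre-filling.

# Minimal sketch of dynamic block-sparse attention for pre-filling.
# Illustrative only; block pooling, top-k selection, and all names here
# are assumptions, not the starred repository's actual algorithm.
import torch

def sparse_prefill_attention(q, k, v, block_size=64, topk=4):
    """q, k, v: [seq_len, head_dim]. Returns an approximate attention output."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # Pool queries and keys per block to cheaply estimate block importance.
    q_blk = q[: n_blocks * block_size].view(n_blocks, block_size, dim).mean(dim=1)
    k_blk = k[: n_blocks * block_size].view(n_blocks, block_size, dim).mean(dim=1)
    scores = q_blk @ k_blk.T / dim ** 0.5                    # [n_blocks, n_blocks]

    # Causal mask at block granularity, then keep the top-k key blocks
    # per query block (the "dynamic sparse" pattern).
    causal = torch.tril(torch.ones(n_blocks, n_blocks, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    keep = torch.topk(scores, k=min(topk, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for qb in range(n_blocks):
        q_rows = q[qb * block_size : (qb + 1) * block_size]
        # Drop any selected block that lies in the future of this query block.
        kb_idx = keep[qb][keep[qb] <= qb]
        k_sel = torch.cat([k[i * block_size : (i + 1) * block_size] for i in kb_idx])
        v_sel = torch.cat([v[i * block_size : (i + 1) * block_size] for i in kb_idx])
        # Per-token causal masking inside the diagonal block is omitted for brevity.
        attn = torch.softmax(q_rows @ k_sel.T / dim ** 0.5, dim=-1)
        out[qb * block_size : (qb + 1) * block_size] = attn @ v_sel
    return out

# Example: 1k tokens, 64-dim head; only about topk/n_blocks of the attention
# matrix is ever materialized during pre-filling.
q = torch.randn(1024, 64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
print(sparse_prefill_attention(q, k, v).shape)  # torch.Size([1024, 64])

The point of the sketch is the cost structure, not the exact selection rule: block-level scoring costs O(n_blocks^2) instead of O(seq_len^2), and the full score matrix is only computed for the few key blocks each query block keeps.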