Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling (a toy sketch of block scaling follows this list)
DeepEP: an efficient expert-parallel communication library
Differentiable fast wavelet transforms in PyTorch with GPU support (round-trip example after this list).
Numerical integration in arbitrary dimensions on the GPU using PyTorch, TensorFlow, or JAX (usage example after this list)
Minimal reproduction of DeepSeek R1-Zero
FlashInfer: Kernel Library for LLM Serving
Fully open reproduction of DeepSeek-R1
My learning notes and code for ML systems (MLSys).
Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.
Acode - a powerful text/code editor for Android
📚 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, Parallelism, MLA, etc.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance across a range of hardware, including CUDA GPUs, x86, and ARMv9.
Tile primitives for speedy kernels
SGLang is a fast serving framework for large language models and vision language models (frontend example after this list).
This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution (see the im2col sketch after this list).
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
Hackable and optimized Transformers building blocks, supporting composable construction.
An easy-to-understand TensorOp matmul tutorial
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
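
The "fine-grained scaling" in the DeepGEMM entry refers to giving each small block of a matrix its own scale factor instead of one scale per tensor, which keeps a single outlier from wrecking precision everywhere. Below is a toy PyTorch sketch of per-128-element block scaling with the FP8 e4m3 range; the block size, helper names, and clamp value are illustrative assumptions, not DeepGEMM's kernels or API.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8 e4m3

def quantize_per_block(x: torch.Tensor, block: int = 128):
    """One scale per (row, `block`-column) tile of a 2-D tensor.

    Fine-grained scales keep an outlier in one tile from destroying
    the quantization precision of every other tile.
    """
    rows, cols = x.shape
    assert cols % block == 0, "illustrative sketch: assume cols divide evenly"
    tiles = x.view(rows, cols // block, block)
    scales = tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (tiles / scales).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    q = q.to(torch.float8_e4m3fn)  # real FP8 storage (requires PyTorch >= 2.1)
    return q, scales

def dequantize_per_block(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scales).reshape(q.shape[0], -1)

x = torch.randn(4, 256)
x[0, 3] = 500.0  # outlier only degrades its own 128-wide tile
q, s = quantize_per_block(x)
print((dequantize_per_block(q, s) - x).abs().max())  # small quantization error
```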
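The differentiable-wavelet entry is the ptwt package, which mirrors pywt's API on PyTorch tensors. A minimal round-trip sketch, assuming its documented `wavedec`/`waverec` entry points:

```python
import pywt
import torch
import ptwt  # pip install ptwt

device = "cuda" if torch.cuda.is_available() else "cpu"
signal = torch.randn(8, 1024, device=device, requires_grad=True)
wavelet = pywt.Wavelet("db4")

# Multi-level 1-D decomposition; the coefficients stay on the GPU and in
# the autograd graph, so losses defined on them are differentiable.
coeffs = ptwt.wavedec(signal, wavelet, level=3)
recon = ptwt.waverec(coeffs, wavelet)
recon = recon[..., : signal.shape[-1]]  # guard against boundary padding

loss = (recon - signal.detach()).abs().mean()
loss.backward()  # gradients flow back through the transform
print(recon.shape, signal.grad.shape)
```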
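The GPU integration entry is torchquad. A minimal Monte Carlo sketch, assuming its documented `set_up_backend` and `MonteCarlo.integrate` API:

```python
import torch
from torchquad import MonteCarlo, set_up_backend

# Select the PyTorch backend (uses CUDA when available); TensorFlow,
# JAX, and NumPy backends are chosen the same way.
set_up_backend("torch", data_type="float32")

def f(x):
    # x has shape (n_points, dim); the integrand is evaluated in batch
    return torch.sin(x[:, 0]) * torch.exp(x[:, 1])

mc = MonteCarlo()
result = mc.integrate(
    f,
    dim=2,
    N=1_000_000,
    integration_domain=[[0.0, 1.0], [0.0, 1.0]],
)
print(result)  # analytic value: (1 - cos 1)(e - 1) ~= 0.790
```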
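For SGLang, here is a sketch of its frontend language against a locally launched server; the model path and port are placeholders, and the `@sgl.function` DSL shown reflects the frontend API as documented in recent releases:

```python
# Assumes a server was launched separately, e.g.:
#   python -m sglang.launch_server --model-path <your-model> --port 30000
import sglang as sgl

@sgl.function
def answer(s, question):
    # The DSL accumulates the prompt; sgl.gen marks where the model generates.
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("reply", max_tokens=128, temperature=0.0))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = answer.run(question="What does paged attention do?")
print(state["reply"])
```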
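The implicit-GEMM idea from the convolution-optimization entry can be shown at a high level in pure PyTorch: lower the input with im2col (`F.unfold`) so the whole convolution becomes a single matrix multiply. This is a reference-level sketch of the formulation, not that repo's GPU kernels:

```python
import torch
import torch.nn.functional as F

def conv2d_as_gemm(x, w, stride=1, padding=0):
    """Express Conv2d as one matrix multiply: the im2col / implicit-GEMM
    formulation that GPU convolution kernels build on."""
    n, c, h, wd = x.shape
    k, _, r, s = w.shape
    cols = F.unfold(x, (r, s), stride=stride, padding=padding)  # (n, c*r*s, L)
    out = w.view(k, -1) @ cols                                  # (n, k, L)
    oh = (h + 2 * padding - r) // stride + 1
    ow = (wd + 2 * padding - s) // stride + 1
    return out.view(n, k, oh, ow)

x = torch.randn(2, 3, 16, 16)
w = torch.randn(8, 3, 3, 3)
ref = F.conv2d(x, w, stride=1, padding=1)
print(torch.allclose(conv2d_as_gemm(x, w, stride=1, padding=1), ref, atol=1e-4))
```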