Stars
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
How to optimize some algorithms in CUDA.
FlashInfer: Kernel Library for LLM Serving
FSA/FST algorithms, differentiable, with PyTorch compatibility.
This is a series of GPU optimization topics covering how to optimize CUDA kernels in detail, including several basic kernel optimizations: elementwise, reduce, s…
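For flavor, here is a minimal block-level sum reduction in CUDA, the kind of reduce kernel such a series typically walks through; the kernel name and block size are illustrative, not taken from the series itself:

```cuda
#include <cuda_runtime.h>

// Launch with BLOCK_SIZE threads per block; each block reduces BLOCK_SIZE
// elements of `in` into one partial sum written to `out[blockIdx.x]`.
constexpr int BLOCK_SIZE = 256;

__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load one element per thread (0 for out-of-range threads).
    smem[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory, halving the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = smem[0];
}
```

A second pass (or an atomicAdd on the partial sums) combines the per-block results into the final value.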
A simple high performance CUDA GEMM implementation.
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
This is a Tensor Train based library for compressing sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed th…
Tutorials for writing high-performance GPU operators in AI frameworks.
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
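As a reference point for the WMMA-based projects above, here is a bare single-tile tensor-core MMA sketch using the CUDA WMMA API (one warp computing a 16x16x16 half-precision tile); this is an illustrative minimum, not the optimized HGEMM from any of these repositories:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: D = A * B.
// A is a row-major 16x16 half matrix, B is col-major 16x16 half, D is float 16x16.
__global__ void wmma_tile_gemm(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);                // load A tile (ld = 16)
    wmma::load_matrix_sync(b_frag, B, 16);                // load B tile (ld = 16)
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

Launch with a single warp (e.g. `<<<1, 32>>>`) and compile for sm_70 or newer; the real HGEMM kernels tile this across shared memory and many warps, which is where the cuBLAS-level performance comes from.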
FP8 flash attention on the Ada architecture, implemented with the cutlass repository.
A stripped-down flash-attention implementation using cutlass, intended for teaching.
imoneoi / cutlass_grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
Programming Massively Parallel Processors 4th edition codes