Stars
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
FlashInfer: Kernel Library for LLM Serving
How to optimize common algorithms in CUDA.
FSA/FST algorithms, differentiable, with PyTorch compatibility.
A series of GPU optimization topics introducing, in detail, how to optimize CUDA kernels, covering several basic kernel optimizations including elementwise, reduce, s…
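The reduce step mentioned above is the classic example in such tutorials. A minimal sketch of an optimized sum reduction (assumed here, not taken from that repo) combines a grid-stride load with warp-shuffle reduction:

```cuda
#include <cuda_runtime.h>

// Sum-reduce n floats into *out. Each thread accumulates a grid-stride
// partial sum, each warp reduces it with shuffles, and one atomic per
// warp merges the results — a common pattern in reduce-kernel tutorials.
__global__ void reduce_sum(const float *in, float *out, int n) {
  float v = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    v += in[i];                                      // grid-stride load
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffff, v, offset);    // intra-warp tree reduce
  if ((threadIdx.x & 31) == 0)
    atomicAdd(out, v);                               // one atomic per warp
}
```

Launched as, e.g., `reduce_sum<<<64, 256>>>(d_in, d_out, n)` with `*d_out` zeroed beforehand.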
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
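For context, the WMMA API referenced here drives Tensor Cores at warp granularity. A minimal sketch (an illustrative kernel, not code from the repo) in which one warp computes a 16x16 tile of C = A * B with half inputs and float accumulation:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp per 16x16 output tile; A is MxK, B is KxN, both row-major,
// K a multiple of 16. Launch with one warp per block, grid (N/16, M/16).
__global__ void wmma_hgemm_tile(const half *A, const half *B, float *C,
                                int M, int N, int K) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);

  int tile_m = blockIdx.y * 16;   // row offset of this warp's C tile
  int tile_n = blockIdx.x * 16;   // column offset of this warp's C tile

  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // 16x16x16 Tensor Core MMA
  }
  wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N,
                          wmma::mem_row_major);
}
```

The optimized variants in such repos layer shared-memory tiling, double buffering, and MMA PTX on top of this basic loop.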
A simple high performance CUDA GEMM implementation.
A Tensor-Train-based library for compressing sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed th…
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
Tutorials for writing high-performance GPU operators in AI frameworks.
FP8 flash attention implemented on the Ada architecture using the cutlass repository.
A simplified flash-attention implementation using cutlass, intended for teaching purposes.
imoneoi / cutlass_grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
Code for Programming Massively Parallel Processors, 4th edition.