Stars
Examples of CUDA implementations using CUTLASS CuTe
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Cores kernels, HGEMM, FA-2 MMA, etc.🔥
Open standard for machine learning interoperability
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Fast and memory-efficient exact attention
Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
How to optimize algorithms in CUDA.
FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS
FlashAttention-2 implementation with the Tensor Core WMMA API
A minimal-code FlashAttention-2 implementation for learning.
A FlashAttention-2 extension for Stable Diffusion WebUI in Linux PyTorch-ROCm environments.
A simplified flash-attention implementation using CUTLASS, intended for teaching.
OpenPPL / CuAssembler
Forked from cloudcores/CuAssembler. An unofficial CUDA assembler, for all generations of SASS, hopefully :)
FlashAttention in raw CUDA C, beating PyTorch
A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.
Codes & examples for "CUDA - From Correctness to Performance"
LLVM/MLIR based compiler instrumentation of AMD GPU kernels
CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…
HIP: C++ Heterogeneous-Compute Interface for Portability
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…