Stars
Fast and memory-efficient exact attention
Performance benchmarks of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
How to optimize algorithms in CUDA.
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS.
FlashAttention2 implementation with TensorCore WMMA API
A minimal FlashAttention v2 implementation, written to be easy to learn from.
A FlashAttention-2 extension for the Stable Diffusion web UI in Linux PyTorch-ROCm environments.
A simplified flash-attention implementation using CUTLASS, intended for teaching.
OpenPPL / CuAssembler
Forked from cloudcores/CuAssembler. An unofficial CUDA assembler, for all generations of SASS, hopefully :)
Flash Attention in raw CUDA C, beating PyTorch.
A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.
Code & examples for "CUDA - From Correctness to Performance"
LLVM/MLIR based compiler instrumentation of AMD GPU kernels
CuPBoP-AMD is a CUDA translator that translates CUDA programs at the NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…
HIP: C++ Heterogeneous-Compute Interface for Portability
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
A collection of pre-trained, state-of-the-art models in the ONNX format
Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.
Protocol Buffers - Google's data interchange format