Examples of CUDA implementations by Cutlass CuTe

Makefile 182 24 Updated Feb 2, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥

Cuda 4,428 467 Updated May 17, 2025

llvm slides and books and other

45 3 Updated Feb 2, 2025

Open standard for machine learning interoperability

Python 18,982 3,740 Updated May 22, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 414 79 Updated Sep 8, 2024

AMD's Machine Intelligence Library

Assembly 1,147 252 Updated May 23, 2025

Fast and memory-efficient exact attention

Python 17,471 1,693 Updated May 22, 2025

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

C++ 36 5 Updated Feb 27, 2025

ROCm BLAS marshalling library

C++ 142 83 Updated May 22, 2025

How to optimize some algorithms in CUDA.

Cuda 2,200 192 Updated May 23, 2025

flash attention tutorial written in python, triton, cuda, cutlass

Cuda 362 37 Updated May 14, 2025

FlashAttention2 implementation with TensorCore WMMA API

Cuda 3 Updated Apr 8, 2024

Implement FlashAttention v2 with minimal code to learn.

Cuda 11 1 Updated Jun 12, 2024

A flash attention2 extension for stable diffusion webui in Linux pytorch-rocm environments.

Cuda 7 1 Updated Jul 9, 2024

A simplified implementation of flash-attention using cutlass, intended for teaching purposes

Cuda 41 5 Updated Aug 12, 2024

An unofficial cuda assembler, for all generations of SASS, hopefully :)

Python 83 10 Updated Mar 20, 2023

Flash Attention in raw Cuda C beating PyTorch

Cuda 21 3 Updated May 14, 2024

A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.

C++ 127 15 Updated Jan 3, 2025

Codes & examples for "CUDA - From Correctness to Performance"

C++ 98 21 Updated Oct 24, 2024

LLVM/MLIR based compiler instrumentation of AMD GPU kernels

C++ 18 5 Updated Apr 29, 2025

CUDA on non-NVIDIA GPUs

Rust 11,358 726 Updated May 20, 2025

AMD ROCm™ Software - GitHub Home

Shell 5,290 432 Updated May 22, 2025

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…

LLVM 3 Updated Nov 10, 2023

HIP: C++ Heterogeneous-Compute Interface for Portability

C++ 4,019 553 Updated May 23, 2025

Implementation of a simple CNN using CUDA

Cuda 68 21 Updated May 2, 2017

CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs.

LLVM 36 4 Updated Nov 19, 2023

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…

Python 2,428 424 Updated May 22, 2025