Stars
Examples of CUDA implementations using CUTLASS CuTe
📚LeetCUDA: Modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Cores kernels, HGEMM, FA-2 MMA, etc.🔥
Open standard for machine learning interoperability
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Fast and memory-efficient exact attention
Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
How to optimize algorithms in CUDA.
FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS
FlashAttention-2 implementation with the Tensor Core WMMA API
A minimal-code FlashAttention-2 implementation for learning.
A FlashAttention-2 extension for Stable Diffusion WebUI in Linux PyTorch-ROCm environments.
A simplified flash-attention implementation using CUTLASS, intended for teaching.
OpenPPL / CuAssembler
Forked from cloudcores/CuAssembler. An unofficial CUDA assembler, for all generations of SASS, hopefully :)
FlashAttention in raw CUDA C, beating PyTorch
A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.
Codes & examples for "CUDA - From Correctness to Performance"
LLVM/MLIR based compiler instrumentation of AMD GPU kernels
CuPBoP-AMD is a CUDA translator that translates CUDA programs at NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the…
HIP: C++ Heterogeneous-Compute Interface for Portability
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…