Ma-Dan's starred repositories — 19 stars written in Cuda

LLM training in simple, raw C/CUDA

Cuda · 26,651 stars · 3,063 forks · Updated May 10, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥

Cuda · 4,428 stars · 467 forks · Updated May 17, 2025

Fast parallel CTC.

Cuda · 4,078 stars · 1,037 forks · Updated Mar 4, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda · 3,019 stars · 310 forks · Updated May 22, 2025

How to optimize some algorithms in CUDA.

Cuda · 2,200 stars · 192 forks · Updated May 23, 2025

FSA/FST algorithms, differentiable, with PyTorch compatibility.

Cuda · 1,198 stars · 224 forks · Updated May 22, 2025

GPU database engine

Cuda · 1,172 stars · 120 forks · Updated Jan 30, 2017

This is a series of GPU optimization topics. Here we introduce in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including: elementwise, reduce, s…

Cuda · 1,049 stars · 155 forks · Updated Jul 29, 2023
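The reduce optimization mentioned in that entry usually starts from a shared-memory tree reduction. A minimal sketch of that starting point (illustrative only, not taken from the repository):

```cuda
// Block-level sum reduction with sequential addressing, a standard
// first optimization step (avoids shared-memory bank conflicts).
// Each block writes one partial sum; a second launch (or atomicAdd)
// combines the per-block results.
__global__ void reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Halve the active thread count each step; stride stays contiguous.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}
```

Launched as `reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)`; further steps in such tutorials typically move on to warp-shuffle reductions and vectorized loads.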

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda · 414 stars · 79 forks · Updated Sep 8, 2024

A simple high performance CUDA GEMM implementation.

Cuda · 370 stars · 41 forks · Updated Jan 4, 2024
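A first step toward a high-performance GEMM is usually shared-memory tiling. A minimal illustrative kernel (not the repository's code; assumes row-major matrices with dimensions divisible by the tile size):

```cuda
#define TILE 16

// Shared-memory tiled SGEMM: C = A * B, all row-major.
// Assumes M, N, K are multiples of TILE for brevity.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    // March tiles of A and B along the K dimension.
    for (int t = 0; t < K / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

High-performance implementations build on this with register blocking, double buffering, and vectorized global loads.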

This is a Tensor Train based compression library to compress sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed th…

Cuda · 194 stars · 27 forks · Updated Jul 20, 2022

Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)

Cuda · 134 stars · 20 forks · Updated Aug 18, 2020
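The WMMA API referred to here exposes Tensor Cores through per-warp fragment operations. A minimal sketch of a single 16×16×16 half-precision multiply-accumulate (illustrative only; requires compute capability 7.0+):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + C for a single 16x16x16 tile.
// a is row-major, b is col-major, both with leading dimension 16.
__global__ void wmma_16x16x16(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);   // collective per-warp load
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

Full HGEMM kernels tile many such fragments per warp and stage data through shared memory; the MMA PTX instructions mentioned in the earlier entry give finer-grained control than this fragment API.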

Tutorials for writing high-performance GPU operators in AI frameworks.

Cuda · 130 stars · 16 forks · Updated Aug 12, 2023

Playing with GEMM using TVM.

Cuda · 91 stars · 10 forks · Updated Jul 22, 2023

FP8 flash attention for the Ada architecture, implemented with the CUTLASS library.

Cuda · 66 stars · 4 forks · Updated Aug 12, 2024

A stripped-down flash-attention implementation built with CUTLASS, intended for teaching.

Cuda · 41 stars · 5 forks · Updated Aug 12, 2024

Cuda · 13 stars · 2 forks · Updated Aug 14, 2024

PyTorch bindings for CUTLASS grouped GEMM.

Cuda · 7 stars · 1 fork · Updated Dec 27, 2023

Code for Programming Massively Parallel Processors, 4th edition

Cuda · 6 stars · 1 fork · Updated Jun 10, 2024