19 starred repositories written in CUDA

LLM training in simple, raw C/CUDA

Cuda · 25,022 stars · 2,852 forks · Updated Oct 2, 2024

Fast parallel CTC.

Cuda · 4,070 stars · 1,041 forks · Updated Mar 4, 2024

📚 150+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma and ⚡️HGEMM with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).

Cuda · 1,956 stars · 206 forks · Updated Jan 13, 2025

How to optimize various algorithms in CUDA.

Cuda · 1,820 stars · 151 forks · Updated Jan 11, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 1,771 stars · 179 forks · Updated Jan 9, 2025

GPU database engine

Cuda · 1,170 stars · 120 forks · Updated Jan 30, 2017

FSA/FST algorithms, differentiable, with PyTorch compatibility.

Cuda · 1,147 stars · 217 forks · Updated Jan 3, 2025

This is a series of GPU optimization topics, introducing in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including elementwise, reduce, s… (a minimal reduce sketch follows this entry).

Cuda · 886 stars · 141 forks · Updated Jul 29, 2023
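For orientation, the "reduce" topic in series like this one usually starts from the classic shared-memory tree reduction. The sketch below is illustrative only, not code from the repository; the kernel name and block size are my own choices.

```cuda
// Minimal block-level sum reduction sketch: each block reduces up to
// BLOCK elements of `in` into one partial sum in `out[blockIdx.x]`.
// Illustrative baseline, not the repository's code.
#define BLOCK 256

__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[tid] = (i < n) ? in[i] : 0.0f;  // load, zero-padding the tail
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0];  // one partial sum per block
}
```

A second pass (or atomics) then combines the per-block partial sums; the optimization topics in such series typically refine this baseline with warp shuffles and vectorized loads.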

A simple, high-performance CUDA GEMM implementation (a tiled-GEMM sketch follows this entry).

Cuda · 342 stars · 37 forks · Updated Jan 4, 2024
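As context for what such GEMM projects optimize, here is a generic shared-memory-tiled SGEMM baseline, assuming row-major matrices. This is a sketch of the common starting point, not this repository's implementation.

```cuda
// Tiled SGEMM sketch: C = A * B for row-major A (M x K), B (K x N).
// One TILE x TILE thread block computes one TILE x TILE tile of C.
// Generic baseline, not this repository's code.
#define TILE 16

__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and one tile of B in shared memory,
        // zero-padding out-of-range elements at the edges.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}
```

High-performance implementations build on this with register blocking, double buffering, and wider loads.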

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions (a WMMA sketch follows this entry).

Cuda · 329 stars · 68 forks · Updated Sep 8, 2024
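The WMMA API referenced here drives Tensor Cores through 16x16x16 fragments. Below is a minimal sketch, assuming row-major half-precision inputs with M, N, K all multiples of 16 and one warp per output tile; it is illustrative, not this repository's code, and requires sm_70 or newer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Minimal WMMA HGEMM sketch: one warp computes one 16x16 tile of
// C = A * B with half inputs and float accumulators.
// Launch with blockDim.x a multiple of warpSize (32).
__global__ void wmma_hgemm(const half* A, const half* B, float* C,
                           int M, int N, int K) {
    // Tile coordinates of this warp, in units of 16 rows/columns.
    int warpM = blockIdx.y * blockDim.y + threadIdx.y;
    int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load 16x16 tiles of A and B, then issue one Tensor Core MMA.
        wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + warpN * 16, N);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```

The MMA PTX path mentioned in the description replaces these fragment calls with inline `mma.sync` instructions for finer control over register layout.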

This is a Tensor-Train-based compression library for the sparse embedding tables used in large-scale machine-learning models such as recommendation and natural-language-processing systems. We showed th…

Cuda · 193 stars · 27 forks · Updated Jul 20, 2022

Tutorials for writing high-performance GPU operators in AI frameworks.

Cuda · 126 stars · 16 forks · Updated Aug 12, 2023

Matrix multiply-accumulate with CUDA and WMMA (Tensor Cores).

Cuda · 122 stars · 19 forks · Updated Aug 18, 2020

Playing with GEMM in TVM.

Cuda · 85 stars · 10 forks · Updated Jul 22, 2023

FP8 flash attention for the Ada architecture, implemented with the CUTLASS repository.

Cuda · 52 stars · 3 forks · Updated Aug 12, 2024

A pared-down flash-attention implementation using CUTLASS, intended to be instructive.

Cuda · 34 stars · 4 forks · Updated Aug 12, 2024

Cuda · 7 stars · 2 forks · Updated Aug 14, 2024

PyTorch bindings for CUTLASS grouped GEMM.

Cuda · 5 stars · 1 fork · Updated Dec 27, 2023

Code for Programming Massively Parallel Processors, 4th edition.

Cuda · 4 stars · Updated Jun 10, 2024