19 starred repositories written in CUDA

LLM training in simple, raw C/CUDA

Cuda · 25,939 stars · 2,970 forks · Updated Oct 2, 2024

Fast parallel CTC.

Cuda · 4,071 stars · 1,039 forks · Updated Mar 4, 2024

📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ HGEMM with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉).

Cuda · 2,733 stars · 283 forks · Updated Mar 4, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 2,298 stars · 240 forks · Updated Mar 6, 2025

How to optimize various algorithms in CUDA.

Cuda · 1,951 stars · 173 forks · Updated Mar 5, 2025
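As a taste of what such optimization guides cover, here is a minimal sketch (not this repository's code) of the grid-stride-loop idiom, the usual starting point for elementwise CUDA kernels; the kernel name and sizes are illustrative assumptions:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles multiple elements, so one launch
// configuration covers any input size and keeps all SMs busy.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<256, 256>>>(n, 3.0f, x, y);  // illustrative launch shape
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```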

GPU database engine

Cuda · 1,171 stars · 120 forks · Updated Jan 30, 2017

FSA/FST algorithms, differentiable, with PyTorch compatibility.

Cuda · 1,170 stars · 222 forks · Updated Mar 5, 2025

A series of GPU optimization topics showing how to optimize CUDA kernels in detail, covering several basic kernel optimizations, including: elementwise, reduce, s…

Cuda · 936 stars · 149 forks · Updated Jul 29, 2023
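For the reduce case, a common pattern is a warp-shuffle reduction. The sketch below is a minimal illustration of that pattern under assumed names and sizes, not code from the repository:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum using shuffle intrinsics: no shared memory or
// __syncthreads() needed within a warp.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Block reduce: each warp reduces its values, then the first warp
// reduces the per-warp partial sums staged in shared memory.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float partial[32];
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];

    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31) == 0) partial[threadIdx.x >> 5] = v;
    __syncthreads();

    if (threadIdx.x < 32) {
        v = (threadIdx.x < blockDim.x / 32) ? partial[threadIdx.x] : 0.0f;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    reduce_sum<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f\n", *out);  // expect 1048576.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```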

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.

Cuda · 360 stars · 73 forks · Updated Sep 8, 2024
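As a hedged illustration of the WMMA API mentioned above (not the repository's code), the sketch below has one warp compute a single 16×16×16 HGEMM tile on Tensor Cores; it assumes an sm_70+ GPU, and the buffer setup is illustrative:

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile of C = A * B using the
// WMMA API (requires a Tensor Core capable GPU, sm_70 or newer).
__global__ void wmma_hgemm_16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // fc += fa * fb on Tensor Cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b;
    float *c;
    cudaMallocManaged(&a, 256 * sizeof(half));
    cudaMallocManaged(&b, 256 * sizeof(half));
    cudaMallocManaged(&c, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    wmma_hgemm_16x16<<<1, 32>>>(a, b, c);  // exactly one warp
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);  // expect 16.0
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
```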

A simple high-performance CUDA GEMM implementation.

Cuda · 350 stars · 40 forks · Updated Jan 4, 2024
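A typical baseline for such a GEMM is the shared-memory tiled kernel. The following sketch (illustrative names and sizes, not the repository's implementation) shows that classic structure for square matrices whose size is a multiple of the tile width:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

// Classic shared-memory tiled SGEMM: each block computes one
// TILE x TILE tile of C, streaming matching tiles of A and B
// through shared memory to cut global-memory traffic.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 256;  // assumed multiple of TILE
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    sgemm_tiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);  // expect 256.0
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
```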

A Tensor-Train-based compression library for the sparse embedding tables used in large-scale machine learning models such as recommendation and natural language processing. We showed th…

Cuda · 193 stars · 27 forks · Updated Jul 20, 2022

Tutorials for writing high-performance GPU operators in AI frameworks.

Cuda · 129 stars · 16 forks · Updated Aug 12, 2023

Matrix multiply-accumulate with CUDA and WMMA (Tensor Cores).

Cuda · 126 stars · 19 forks · Updated Aug 18, 2020

Playing with GEMM in TVM.

Cuda · 89 stars · 10 forks · Updated Jul 22, 2023

FP8 flash attention implemented on the Ada architecture using the CUTLASS library.

Cuda · 56 stars · 3 forks · Updated Aug 12, 2024

A pared-down flash-attention implementation using CUTLASS, intended to be instructive.

Cuda · 37 stars · 4 forks · Updated Aug 12, 2024
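The core trick both flash-attention entries rely on is the online (streaming) softmax, which keeps a running max and denominator and rescales them on the fly, so softmax can be computed in one pass over tiles. The sketch below shows just that numerical trick in isolation (one thread per row, illustrative names; real flash-attention kernels parallelize this and fuse it with the attention matmuls):

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Online softmax: single pass to get the row max and denominator,
// rescaling the running denominator whenever the max grows. This is
// the rescaling step flash attention applies per tile.
__global__ void online_softmax(const float* x, float* y, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    float m = -INFINITY;  // running max
    float d = 0.0f;       // running denominator
    for (int j = 0; j < cols; ++j) {
        float v = x[r * cols + j];
        float m_new = fmaxf(m, v);
        d = d * expf(m - m_new) + expf(v - m_new);  // rescale old sum
        m = m_new;
    }
    for (int j = 0; j < cols; ++j)
        y[r * cols + j] = expf(x[r * cols + j] - m) / d;
}

int main() {
    const int rows = 4, cols = 8;
    float *x, *y;
    cudaMallocManaged(&x, rows * cols * sizeof(float));
    cudaMallocManaged(&y, rows * cols * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) x[i] = (float)(i % cols);

    online_softmax<<<1, rows>>>(x, y, rows, cols);
    cudaDeviceSynchronize();
    printf("y[0..1] = %f %f\n", y[0], y[1]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```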

Cuda · 8 stars · 2 forks · Updated Aug 14, 2024

PyTorch bindings for CUTLASS grouped GEMM.

Cuda · 6 stars · 1 fork · Updated Dec 27, 2023

Code samples for Programming Massively Parallel Processors, 4th edition.

Cuda · 5 stars · Updated Jun 10, 2024