🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
High-Resolution 3D Asset Generation with Large-Scale Hunyuan3D Diffusion Models.
Handwritten GEMM using Intel AMX (Advanced Matrix Extensions)
[WIP] The all-in-one inference optimization solution for ComfyUI: universal, flexible, and fast.
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA.
Triton implementation of bi-directional (non-causal) linear attention (see the reference sketch after this list)
Framework to reduce autotune overhead to zero for well-known deployments.
📖 A curated list of Awesome Diffusion Inference Papers with code, covering sampling, caching, multi-GPU inference, etc. 🎉🎉
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
An Open Large Reasoning Model for Real-World Solutions
TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.
A highly optimized LLM inference acceleration engine for Llama and its variants.
HunyuanVideo: A Systematic Framework For Large Video Generation Models
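
For reference, the bi-directional (non-causal) linear attention mentioned above admits a very compact formulation: with no causal mask, the key/value summary can be computed once and reused for every query, giving O(N·d^2) work instead of O(N^2·d). Below is a minimal PyTorch sketch of that computation, not the repository's Triton kernel; the function name bidirectional_linear_attention and the elu+1 feature map are illustrative assumptions.

import torch
import torch.nn.functional as F

def bidirectional_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq_len, head_dim) tensors.
    # phi(x) = elu(x) + 1 is a common positive feature map (an assumption here,
    # not necessarily the one used by the repository above).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0

    # Non-causal case: every query attends to every key, so the key/value
    # summary sum_j phi(k_j) v_j^T is computed once and shared by all queries.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)   # (batch, heads, d, d)

    # Normalizer: phi(q_i) . sum_j phi(k_j)
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps

    # Numerator: phi(q_i)^T (sum_j phi(k_j) v_j^T), then normalize per query.
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    return out / z.unsqueeze(-1)

# Tiny shape check: (1, 2, 16, 8) in -> (1, 2, 16, 8) out.
q = torch.randn(1, 2, 16, 8)
k = torch.randn(1, 2, 16, 8)
v = torch.randn(1, 2, 16, 8)
print(bidirectional_linear_attention(q, k, v).shape)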