luliyucoordinate
🏅
Focusing
  • hangzhou

Organizations

@llcv


🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.

193 24 Updated Feb 2, 2025
C++ 7 Updated Feb 3, 2025

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C++ 933 147 Updated Dec 16, 2024
Python 24 2 Updated Jan 23, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 186 14 Updated Feb 2, 2025

[WIP] Better (FP8) attention for Hopper

C++ 18 Updated Jan 29, 2025

High-Resolution 3D Assets Generation with Large Scale Hunyuan3D Diffusion Models.

Python 5,470 383 Updated Feb 1, 2025

Kernel Library Wheel for SGLang

HTML 7 1 Updated Jan 30, 2025
Python 2,030 134 Updated Jan 16, 2025

Handwritten GEMM using Intel AMX (Advanced Matrix Extension)

C 4 Updated Jan 11, 2025

[WIP] The all-in-one inference optimization solution for ComfyUI: universal, flexible, and fast.

Python 682 22 Updated Feb 2, 2025

📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.

Cuda 75 4 Updated Feb 3, 2025

Triton implementation of bi-directional (non-causal) linear attention

Python 39 1 Updated Jan 13, 2025

Fastest kernels written from scratch

Cuda 132 17 Updated Nov 30, 2024
C++ 39 9 Updated Jan 23, 2025

Framework to reduce autotune overhead to zero for well-known deployments.

Python 59 9 Updated Jan 28, 2025

📖A curated list of Awesome Diffusion Inference Papers with codes, such as Sampling, Caching, Multi-GPUs, etc. 🎉🎉

175 11 Updated Jan 16, 2025

FlashRNN - Fast RNN Kernels with I/O Awareness

Python 71 1 Updated Dec 12, 2024
Python 22 4 Updated Dec 21, 2024

Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O

C++ 215 19 Updated Jan 15, 2025

An Open Large Reasoning Model for Real-World Solutions

Python 1,417 75 Updated Nov 28, 2024

C++ interfaces for RDMA access

C++ 65 4 Updated Jan 20, 2025

TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.

Cuda 45 5 Updated Jan 28, 2025

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 842 100 Updated Jan 24, 2025
Cuda 58 5 Updated Dec 27, 2024

HunyuanVideo: A Systematic Framework For Large Video Generation Model

Python 8,009 650 Updated Jan 24, 2025