🔥🔥🔥 A collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
High-Resolution 3D Asset Generation with Large-Scale Hunyuan3D Diffusion Models.
Handwritten GEMM using Intel AMX (Advanced Matrix Extensions)
[WIP] The all-in-one inference optimization solution for ComfyUI: universal, flexible, and fast.
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉 faster vs SDPA EA.
Triton implementation of bi-directional (non-causal) linear attention (see the reference sketch after this list)
Framework to reduce autotune overhead to zero for well-known deployments.
📖 A curated list of Awesome Diffusion Inference Papers with code, covering sampling, caching, multi-GPU inference, etc. 🎉🎉
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
An Open Large Reasoning Model for Real-World Solutions
TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.
A highly optimized LLM inference acceleration engine for Llama and its variants.
HunyuanVideo: A Systematic Framework For Large Video Generation Models
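
For reference, the bi-directional (non-causal) linear attention mentioned above admits a very compact formulation: with no causal mask, the key/value summary can be computed once and reused for every query, giving O(N·d^2) work instead of O(N^2·d). Below is a minimal PyTorch sketch of that computation, not the repository's Triton kernel; the function name bidirectional_linear_attention and the elu+1 feature map are illustrative assumptions.

import torch
import torch.nn.functional as F

def bidirectional_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq_len, head_dim) tensors.
    # phi(x) = elu(x) + 1 is a common positive feature map (an assumption here,
    # not necessarily the one used by the repository above).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0

    # Non-causal case: every query attends to every key, so the key/value
    # summary sum_j phi(k_j) v_j^T is computed once and shared by all queries.
    kv = torch.einsum("bhnd,bhne->bhde", k, v)   # (batch, heads, d, d)

    # Normalizer: phi(q_i) . sum_j phi(k_j)
    z = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps

    # Numerator: phi(q_i)^T (sum_j phi(k_j) v_j^T), then normalize per query.
    out = torch.einsum("bhnd,bhde->bhne", q, kv)
    return out / z.unsqueeze(-1)

# Tiny shape check: (1, 2, 16, 8) in -> (1, 2, 16, 8) out.
q = torch.randn(1, 2, 16, 8)
k = torch.randn(1, 2, 16, 8)
v = torch.randn(1, 2, 16, 8)
print(bidirectional_linear_attention(q, k, v).shape)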