Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Efficient Triton Kernels for LLM Training
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
Adaptive Caching for Faster Video Generation with Diffusion Transformers
High-speed Large Language Model Serving for Local Deployment
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input
Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
Code for paper "Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System"
Sample codes for my CUDA programming book
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions (see the WMMA sketch after this list).
A self-learning tutorial for CUDA High Performance Programming.
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA'24)
Dynamic Memory Management for Serving LLMs without PagedAttention
A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc.
Must-read papers and blogs on LLM-based Long Context Modeling.
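
As a concrete taste of what the HGEMM entry above covers, here is a minimal sketch of a tensor-core matrix-multiply tile written against CUDA's WMMA API. It is not code from that repository: the kernel name `wmma_hgemm_tile`, the one-warp-per-output-tile launch layout, and the assumptions that M, N, and K are multiples of 16 with A row-major and B column-major are all illustrative choices.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Minimal WMMA HGEMM sketch: each warp computes one 16x16 tile of
// C = A * B, with A row-major (M x K) and B column-major (K x N).
// Assumes M, N, K are multiples of 16 and the grid exactly covers C.
__global__ void wmma_hgemm_tile(const half *A, const half *B, half *C,
                                int M, int N, int K) {
    // Map each warp to one output tile: warpM indexes row tiles,
    // warpN indexes column tiles.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // March along K in 16-wide steps, accumulating into the fragment.
    for (int k = 0; k < K; k += 16) {
        // Row-major A: tile origin (warpM*16, k), leading dimension K.
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        // Column-major B: tile origin (k, warpN*16), leading dimension K.
        wmma::load_matrix_sync(b_frag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Row-major C: tile origin (warpM*16, warpN*16), leading dimension N.
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

Each warp owns a single 16x16 tile of C and walks the K dimension in 16-wide steps; per its description, the repository layers its optimization methods (including raw MMA PTX variants) on top of primitives like this one.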