
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Python 54 2 Updated Mar 6, 2025

LLM serving cluster simulator

Jupyter Notebook 93 8 Updated Apr 25, 2024

Efficient Triton Kernels for LLM Training

Python 4,597 278 Updated Mar 8, 2025

Compression for Foundation Models

Jupyter Notebook 27 3 Updated Feb 14, 2025
Python 7 Updated Jan 16, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,783 171 Updated Mar 7, 2025

A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems

Python 148 9 Updated Oct 15, 2024

The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.

Jupyter Notebook 58 4 Updated Jan 25, 2025

Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling

Python 10 3 Updated Mar 7, 2024

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Python 143 6 Updated Nov 5, 2024

High-speed Large Language Model Serving for Local Deployment

C++ 8,143 424 Updated Feb 19, 2025

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python 21,750 2,389 Updated Aug 12, 2024

Official implementation of Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input

Python 63 Updated Aug 30, 2024

Microsoft Azure Traces

Jupyter Notebook 895 152 Updated Feb 25, 2025

Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".

Python 78 6 Updated Mar 5, 2025

Code for paper "Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System"

Python 52 4 Updated Nov 14, 2024

Sample codes for my CUDA programming book

Cuda 1,660 338 Updated Feb 15, 2025

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

Python 237 8 Updated Dec 26, 2024

CUDA Templates for Linear Algebra Subroutines

C++ 7,019 1,149 Updated Mar 10, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 360 73 Updated Sep 8, 2024

A self-learning tutorial for CUDA High Performance Programming.

JavaScript 422 51 Updated Mar 6, 2025

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Python 97 3 Updated Dec 23, 2024

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA'24)

Python 13 1 Updated Jul 4, 2024

Dynamic Memory Management for Serving LLMs without PagedAttention

C 306 23 Updated Feb 20, 2025

📖 A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉

197 12 Updated Jan 16, 2025

LLM inference in C/C++

C++ 76,208 11,025 Updated Mar 10, 2025

📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥

1,308 45 Updated Mar 10, 2025
Jupyter Notebook 85 6 Updated Nov 11, 2024