
University of Science and Technology of China, Hefei, Anhui
Starred repositories
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Official repository for VisionZip (CVPR 2025)
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.
allRank is a framework for training learning-to-rank neural models based on PyTorch.
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an LLM (with low latency overhead!)
My learning notes/codes for ML SYS.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
Community maintained hardware plugin for vLLM on Ascend
Course materials for MIT6.5940: TinyML and Efficient Deep Learning Computing
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
Awesome-LLM: a curated list of Large Language Model resources
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. TensorR…
Large Language Model (LLM) Systems Paper List
A throughput-oriented high-performance serving framework for LLMs
📰 Must-read papers and blogs on Speculative Decoding ⚡️
SGLang is a fast serving framework for large language models and vision language models.
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
A high-throughput and memory-efficient inference and serving engine for LLMs
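Many of the repositories above (PEARL, DuoDecoding, EAGLE, TriForce, Medusa) build on speculative decoding: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, keeping the longest correct prefix. A minimal greedy sketch of that loop, using hypothetical toy `draft_model` / `target_model` functions in place of real LLMs:

```python
# Toy sketch of greedy speculative decoding. draft_model and
# target_model are stand-ins for real models (both hypothetical).

def draft_model(prefix):
    # Cheap proxy: repeat the last token, or start with 1.
    return prefix[-1] if prefix else 1

def target_model(prefix):
    # "Expensive" model: emits the sequence 1, 2, 3, ...
    return len(prefix) + 1

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then verify them against the target.

    Accept the longest matching prefix of the draft; on the first
    mismatch, substitute the target's token. Each step yields at
    least one token, so decoding always makes progress.
    """
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_model(ctx)
        draft.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in draft:
        correct = target_model(ctx)  # in practice: one batched forward pass
        if token == correct:
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(correct)  # correction token, then stop
            break
    return accepted

out = []
while len(out) < 6:
    out.extend(speculative_step(out))
print(out[:6])  # → [1, 2, 3, 4, 5, 6]
```

Real systems verify all k draft tokens in a single batched target forward pass, so the speedup comes from amortizing the target model's cost across accepted tokens; lossless variants (as in EAGLE or TriForce) sample so the output distribution matches the target model exactly.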