University of Washington - Bellevue
https://orcid.org/0009-0007-8680-7030
Starred repositories
PerFlow-AI is a programmable performance analysis, modeling, and prediction tool for AI systems.
The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
My learning notes and code for ML systems (MLSys).
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
A simple, performant, and scalable JAX LLM!
Large Language Model (LLM) Systems Paper List
A PyTorch native library for large model training
Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
verl: Volcano Engine Reinforcement Learning for LLMs
📰 Must-read papers and blogs on Speculative Decoding ⚡️
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more (see the minimal JAX sketch after this list)
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Hackable and optimized Transformers building blocks, supporting a composable construction.
2025 AI/ML internship & new graduate job list updated daily
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Open CS Application | 开源CS申请 (open-source CS applications)
FlashInfer: Kernel Library for LLM Serving
A throughput-oriented high-performance serving framework for LLMs
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Disaggregated serving system for Large Language Models (LLMs).
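
The JAX entry above advertises three composable transformations: differentiate, vectorize, and JIT-compile. Below is a minimal sketch of how they compose, using only the public jax.grad / jax.vmap / jax.jit API; the loss function is a made-up example, not code from any listed repository.

```python
# Minimal sketch of the three core JAX transformations named above.
# Assumes `jax` is installed; `loss` is a hypothetical example function.
import jax
import jax.numpy as jnp

def loss(w, x):
    # Squared error of a linear model y = w * x against a target of 1.0.
    return (w * x - 1.0) ** 2

# differentiate: gradient of the loss with respect to w (the first argument)
grad_loss = jax.grad(loss)

# vectorize: map the gradient over a batch of inputs x without a Python loop
batched_grad = jax.vmap(grad_loss, in_axes=(None, 0))

# JIT to GPU/TPU (or CPU): compile the batched gradient with XLA
fast_batched_grad = jax.jit(batched_grad)

x = jnp.linspace(0.0, 1.0, 4)
print(fast_batched_grad(2.0, x))  # one gradient per batch element
```

The same pattern, grad inside vmap inside jit, is the foundation the JAX LLM entry above builds on at scale.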