- University of California, Berkeley
- Berkeley, CA
- https://andy-yang-1.github.io/
Stars
- A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders
- Sky-T1: Train your own O1 preview model within $450
- [ICLR 2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
- StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
- Puzzles for learning Triton; play with minimal environment configuration!
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
- HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
- An acceleration library that supports arbitrary bit-width combinatorial quantization operations
- Awesome synthetic (text) datasets
- A fast communication-overlapping library for tensor parallelism on GPUs.
- A throughput-oriented high-performance serving framework for LLMs
- FlashInfer: Kernel Library for LLM Serving
- SGLang is a fast serving framework for large language models and vision language models.
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- 📖 A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
- Latency and Memory Analysis of Transformer Models for Training and Inference
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- A list of awesome compiler projects and papers for tensor computation and deep learning.
- TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
- A curated list for Efficient Large Language Models
- Fast and memory-efficient exact attention