
A WebUI for Side-by-Side Comparison of Media (Images/Videos) Across Multiple Folders

Python · 19 stars · 1 fork · Updated Jan 24, 2025

Sky-T1: Train your own O1 preview model within $450

Python · 2,478 stars · 269 forks · Updated Feb 12, 2025

[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Cuda · 638 stars · 39 forks · Updated Feb 11, 2025

StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation

Python · 9,986 stars · 739 forks · Updated Dec 4, 2024

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

Python · 3,312 stars · 198 forks · Updated Feb 10, 2025

Puzzles for learning Triton; play with them using minimal environment configuration!

Python · 220 stars · 15 forks · Updated Dec 3, 2024

[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Python · 422 stars · 26 forks · Updated Feb 10, 2025

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Python · 414 stars · 19 forks · Updated Oct 16, 2024

Materials for learning SGLang

232 stars · 15 forks · Updated Feb 6, 2025

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ · 214 stars · 21 forks · Updated Sep 30, 2024

awesome synthetic (text) datasets

Jupyter Notebook · 257 stars · 11 forks · Updated Oct 29, 2024

A fast communication-overlapping library for tensor parallelism on GPUs.

C++ · 292 stars · 25 forks · Updated Oct 30, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda · 729 stars · 28 forks · Updated Sep 21, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda · 1,979 stars · 201 forks · Updated Feb 11, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python · 9,304 stars · 887 forks · Updated Feb 12, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda · 292 stars · 26 forks · Updated Jul 2, 2024

CUDA Kernel Benchmarking Library

Cuda · 557 stars · 71 forks · Updated Nov 20, 2024

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

Python · 941 stars · 105 forks · Updated Jan 2, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python · 2,720 stars · 229 forks · Updated Feb 11, 2025

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python · 1,189 stars · 72 forks · Updated Oct 14, 2024

📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉

3,392 stars · 234 forks · Updated Jan 31, 2025

Latency and Memory Analysis of Transformer Models for Training and Inference

Python · 385 stars · 44 forks · Updated Nov 13, 2024

Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"

Python · 296 stars · 24 forks · Updated Dec 20, 2023

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python · 1,786 stars · 102 forks · Updated Jan 21, 2024

A list of awesome compiler projects and papers for tensor computation and deep learning.

2,483 stars · 307 forks · Updated Oct 19, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.

C++ · 9,376 stars · 1,099 forks · Updated Feb 11, 2025

Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers

Python · 205 stars · 9 forks · Updated Aug 19, 2024

A curated list for Efficient Large Language Models

Python · 1,434 stars · 106 forks · Updated Feb 10, 2025

Fast and memory-efficient exact attention

Python · 15,415 stars · 1,452 forks · Updated Feb 11, 2025