Stars
Official Implementation of "Pay Attention to What You Need"
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP 2024)
The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
Source code of DRAGIN, an ACL 2024 main-conference long paper
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, attention is computed approximately and with dynamic sparsity, reducing pre-filling inference latency by up to 10x on an …
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting"
[ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Awesome LLM compression research papers and tools.
A collection of AWESOME things about mixture-of-experts
PyTorch-UVM on super-large language models.
Library for faster pinned CPU <-> GPU transfer in PyTorch
PyTorch library for cost-effective, fast and easy serving of MoE models.