Showing repository results for oujieww

Official Implementation of "Pay Attention to What You Need"

Python 41 10 Updated Feb 22, 2025

CR-LT KGQA Dataset Repository

Python 8 Updated Dec 18, 2024

TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Python 60 2 Updated Feb 25, 2025
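
This repo implements token-level compression of chain-of-thought traces with a controllable compression ratio. As a rough illustration of the idea (not TokenSkip's actual scoring or training procedure), a sketch that keeps only the most important fraction of CoT tokens:

```python
# A minimal, hypothetical sketch of controllable CoT compression: rank
# chain-of-thought tokens by an importance score and keep only the top
# fraction, preserving order. The scoring function is a stand-in.
from typing import Callable, List

def compress_cot(tokens: List[str],
                 importance: Callable[[str], float],
                 keep_ratio: float = 0.6) -> List[str]:
    """Keep the `keep_ratio` most important tokens, in original order."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: importance(tokens[i]),
                    reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(ranked)]

# Toy importance: longer tokens (often content words) score higher.
cot = "First , we note that 12 * 7 = 84 , so the answer is 84 .".split()
print(" ".join(compress_cot(cot, importance=len, keep_ratio=0.5)))
```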

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference

Python 442 32 Updated Feb 19, 2025

An implementation of the paper "Searching for Best Practices in Retrieval-Augmented Generation" (EMNLP 2024)

Python 291 18 Updated Dec 21, 2024
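
The paper benchmarks concrete choices for each stage of a RAG pipeline. A minimal sketch of the modular retrieve -> rerank -> generate structure being compared, with hypothetical stand-in components:

```python
# A minimal, modular RAG pipeline sketch. The components are injected
# callables standing in for real choices (e.g. BM25 vs. dense retrieval,
# cross-encoder reranking, any LLM backend).
from typing import Callable, List

def rag_answer(query: str,
               retrieve: Callable[[str, int], List[str]],
               rerank: Callable[[str, List[str]], List[str]],
               generate: Callable[[str], str],
               top_k: int = 20, keep: int = 5) -> str:
    candidates = retrieve(query, top_k)          # recall-oriented first stage
    context = rerank(query, candidates)[:keep]   # precision-oriented second stage
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)                      # final LLM call

# Toy usage with trivial stand-ins.
print(rag_answer("Who wrote Hamlet?",
                 retrieve=lambda q, k: ["Hamlet is a play by William Shakespeare."],
                 rerank=lambda q, docs: docs,
                 generate=lambda prompt: "William Shakespeare"))
```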

The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction"

Python 43 Updated Oct 18, 2024
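
SimLayerKV drops KV cache entries for whole layers it identifies as "lazy". A toy sketch of the idea, with a simplified laziness test standing in for the paper's criterion:

```python
# Toy layer-level KV cache reduction: if a layer's attention mass is
# concentrated on a few initial and recent tokens, keep only those tokens'
# KV entries for that layer. The test below is a simplified stand-in.
import torch

def trim_layer_kv(keys, values, attn_probs, n_init=4, n_recent=128, thresh=0.9):
    """keys/values: [seq, d]; attn_probs: last query's attention over seq."""
    seq = keys.shape[0]
    anchor = torch.zeros(seq, dtype=torch.bool)
    anchor[:n_init] = True
    anchor[max(0, seq - n_recent):] = True
    lazy = attn_probs[anchor].sum() >= thresh   # most mass on anchor tokens?
    if lazy and seq > n_init + n_recent:
        return keys[anchor], values[anchor]     # drop the middle of the cache
    return keys, values                         # full cache for non-lazy layers

k, v = torch.randn(1024, 64), torch.randn(1024, 64)
p = torch.softmax(torch.randn(1024), dim=0)
k2, v2 = trim_layer_kv(k, v, p)
print(k.shape, "->", k2.shape)
```
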
Python 4 4 Updated Aug 20, 2024
Python 23 4 Updated Jun 26, 2024

Source code of DRAGIN (ACL 2024 main conference long paper)

Python 123 17 Updated Feb 21, 2025
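
DRAGIN triggers retrieval dynamically, based on the model's real-time information needs during generation. A simplified sketch using next-token entropy as the trigger (the paper's trigger also incorporates attention and token semantics):

```python
# Sketch of uncertainty-triggered retrieval: monitor the next-token
# distribution during generation and fetch external evidence only when
# the model is uncertain. A toy stand-in for DRAGIN's actual trigger.
import math
from typing import List

def entropy(probs: List[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_retrieval(next_token_probs: List[float], threshold: float = 2.0) -> bool:
    """High entropy -> the model is uncertain -> retrieve before continuing."""
    return entropy(next_token_probs) > threshold

print(needs_retrieval([0.97, 0.01, 0.01, 0.01]))  # confident -> False
print(needs_retrieval([1 / 50] * 50))             # uncertain -> True
```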

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

C++ 232 8 Updated Feb 23, 2025

Low-bit LLM inference on CPU with lookup table

C++ 690 53 Updated Jan 9, 2025
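
The two repos above share one idea: store weights as small integer codes and resolve them through a lookup table inside the matmul kernel. A toy NumPy illustration of the representation (real kernels fuse the lookup into the matmul rather than materializing the dequantized matrix):

```python
# Lookup-table-quantized matvec, illustrated naively: 4-bit weight codes
# index a 16-entry codebook, then an ordinary matvec follows.
import numpy as np

rng = np.random.default_rng(0)
lut = np.linspace(-1.0, 1.0, 16).astype(np.float32)            # 4-bit codebook
codes = rng.integers(0, 16, size=(256, 512), dtype=np.uint8)   # weight codes
x = rng.standard_normal(512).astype(np.float32)

w = lut[codes]   # dequantize: one table lookup per weight
y = w @ x        # then a standard matvec
print(y.shape)   # (256,)
```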

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Python 103 16 Updated Oct 15, 2024
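
ShiftAddLLM reparameterizes weights so that multiplications become shifts and adds. A toy sketch that rounds weights to signed powers of two, so each product reduces to an exponent shift (the paper's post-training optimization is far more involved):

```python
# Multiplication-less matvec sketch: approximate each weight as
# sign * 2**exponent, so the multiply by the weight magnitude becomes an
# exponent shift (ldexp); only the +-1 sign flip remains.
import numpy as np

def to_shift_form(w):
    """Return (sign, exponent) with w ~= sign * 2**exponent."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12)).astype(np.int32)
    return sign, exp

def shift_matvec(sign, exp, x):
    # ldexp(x, e) computes x * 2**e via exponent manipulation.
    return (sign * np.ldexp(x[None, :], exp)).sum(axis=1)

w = np.random.randn(4, 8)
x = np.random.randn(8)
s, e = to_shift_form(w)
# Error here comes only from rounding weights to powers of two.
print(np.abs(shift_matvec(s, e, x) - w @ x).max())
```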

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

C++ 9,597 1,126 Updated Mar 3, 2025
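
A minimal usage sketch following the high-level LLM API pattern from the project's quickstart; the model name here is an assumption, and the exact API surface may differ across releases:

```python
# Sketch of TensorRT-LLM's high-level LLM API (quickstart pattern).
# The model choice is hypothetical; engine building happens under the
# hood on first use.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumed model id
params = SamplingParams(temperature=0.8, max_tokens=64)

for out in llm.generate(["What is sequence parallelism?"], params):
    print(out.outputs[0].text)
```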

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

Python 335 30 Updated Apr 20, 2024
Python 1 Updated Sep 2, 2024

[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, attention is computed approximately and with dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.

Python 924 46 Updated Feb 25, 2025
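
A toy version of the kind of sparse prefill attention this line of work exploits: attend only to a few initial "sink" tokens plus a local window instead of the full O(n^2) mask. The real system selects dynamic per-head patterns; this static "A-shape" mask is a simplification:

```python
# Sparse prefill attention with a static "A-shape" mask: causal attention
# restricted to the first n_init tokens plus a sliding local window.
import torch

def a_shape_mask(seq, n_init=4, window=64):
    i = torch.arange(seq)[:, None]
    j = torch.arange(seq)[None, :]
    causal = j <= i
    keep = (j < n_init) | ((i - j) < window)
    return causal & keep

q = k = v = torch.randn(512, 64)
scores = (q @ k.T) / 64 ** 0.5
scores = scores.masked_fill(~a_shape_mask(512), float("-inf"))
out = torch.softmax(scores, dim=-1) @ v
print(out.shape, a_shape_mask(512).float().mean())  # output + mask density
```
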
Python 41 Updated Feb 19, 2024
Python 10 Updated Sep 2, 2024
Python 4 Updated Jul 5, 2024
Python 12 Updated Jun 4, 2024
Python 49 1 Updated May 13, 2024
Python 259 76 Updated Feb 11, 2025

[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting"

Python 50 8 Updated Jun 26, 2024
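
Kangaroo accelerates decoding by drafting tokens with an early-exited copy of the model and verifying them with the full model. A generic, greedy speculative-decoding sketch with toy callables (Kangaroo's double early exiting itself is not modeled):

```python
# Generic greedy speculative decoding: a cheap draft model proposes k
# tokens; the target model accepts the longest agreeing prefix and emits
# its own token at the first disagreement, so the result matches plain
# greedy decoding by the target (lossless). A real implementation verifies
# all drafted tokens in a single batched forward pass; we loop for clarity.
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    drafted, ctx = [], list(prefix)
    for _ in range(k):                 # 1) draft k tokens cheaply
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafted:                  # 2) verify against the target
        expect = target_next(ctx)
        if expect != t:
            accepted.append(expect)    # replace the first mismatch
            return accepted
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy models: the draft is a sometimes-wrong copy of the target.
target = lambda ctx: (sum(ctx) + len(ctx)) % 7
draft = lambda ctx: target(ctx) if len(ctx) % 3 else 0
print(speculative_step([1, 2, 3], draft, target))
```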

[ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models

Python 84 9 Updated May 24, 2024
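
A toy mixture-of-experts forward pass with inference-time expert skipping: experts whose renormalized gate weight falls below a threshold are not executed. This illustrates the general idea, not the paper's exact pruning criterion:

```python
# MoE layer with top-k gating plus a skip threshold: low-weight experts
# contribute little, so their forward passes are skipped entirely.
import torch
import torch.nn as nn

class SkippingMoE(nn.Module):
    def __init__(self, d=32, n_experts=8, top_k=2, skip_below=0.2):
        super().__init__()
        self.gate = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.top_k, self.skip_below = top_k, skip_below

    def forward(self, x):                       # x: [d]
        weights = torch.softmax(self.gate(x), dim=-1)
        topw, topi = weights.topk(self.top_k)
        topw = topw / topw.sum()                # renormalize over top-k
        out = torch.zeros_like(x)
        for w, i in zip(topw, topi):
            if w >= self.skip_below:            # skip low-weight experts
                out = out + w * self.experts[int(i)](x)
        return out

moe = SkippingMoE()
print(moe(torch.randn(32)).shape)
```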

Awesome LLM compression research papers and tools.

1,397 89 Updated Mar 4, 2025

A collection of AWESOME things about mixture-of-experts

1,059 79 Updated Dec 8, 2024

10x faster matrix and vector operations

C++ 2,480 172 Updated Oct 12, 2022

PyTorch-UVM (CUDA unified virtual memory) applied to super-large language models.

Python 15 5 Updated Dec 21, 2020

Library for faster pinned CPU <-> GPU transfers in PyTorch

Python 685 39 Updated Feb 21, 2020
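
The standard stock-PyTorch pattern for the problem this library targets: page-locked (pinned) host memory enables asynchronous host-to-device copies that can overlap with computation:

```python
# Pinned-memory transfer with stock PyTorch APIs (not this library's own):
# a page-locked host buffer allows a truly asynchronous H2D copy.
import torch

if torch.cuda.is_available():
    host = torch.randn(4096, 4096).pin_memory()      # page-locked host buffer
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        dev = host.to("cuda", non_blocking=True)     # async copy on side stream
    torch.cuda.current_stream().wait_stream(stream)  # sync before first use
    print(dev.sum())
```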

PyTorch library for cost-effective, fast and easy serving of MoE models.

Python 140 12 Updated Feb 27, 2025