Stars
Official Implementation of "Pay Attention to What You Need"
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
This is an implementation of the paper: Searching for Best Practices in Retrieval-Augmented Generation (EMNLP 2024)
The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
Source code of DRAGIN, an ACL 2024 main-conference long paper
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, attention is computed approximately and with dynamic sparsity, reducing pre-filling inference latency by up to 10x on an …
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting"
[ACL 2024] Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Awesome LLM compression research papers and tools.
A collection of AWESOME things about mixture-of-experts
PyTorch-UVM on super-large language models.
Library for faster pinned CPU <-> GPU transfer in PyTorch
PyTorch library for cost-effective, fast and easy serving of MoE models.