Stars
Minimalistic 4D-parallelism distributed training framework for educational purposes
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
When it comes to optimizers, it's always better to be safe than sorry
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, and other large language models.
Survey of Small Language Models from Penn State, ...
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
On-device AI across mobile, embedded and edge for PyTorch
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Python Intelligence Config Manager. A superset of hydra+pydantic+lsp
FlagGems is an operator library for large language models implemented in Triton Language.
Odysseus: Playground of LLM Sequence Parallelism
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality
A family of compressed models obtained via pruning and knowledge distillation
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
The official evaluation suite and dynamic data release for MixEval.
Source code of the paper "On the Hallucination in Simultaneous Machine Translation"
[NeurIPS 2024] Source code for xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token