Stars
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
Code for "Recurrent Transformers Trade-off Parallelism for Length Generalization on Regular Languages"
Fast inference from large language models via speculative decoding
Bringing BERT into modernity via both architecture changes and scaling
Minimalistic 4D-parallelism distributed training framework for educational purposes
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense f…
Unified, efficient fine-tuning of RAG retrieval models, including Embedding, ColBERT, and ReRanker.
Text-to-image search with OpenCLIP, Docker, Flask, Faiss, etc. and a basic front-end.
Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Scalable and robust tree-based speculative decoding algorithm
📰 Must-read papers and blogs on Speculative Decoding ⚡️
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
SGLang is a fast serving framework for large language models and vision language models.
Natural Language Reinforcement Learning
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
A debugging and profiling tool that can trace and visualize Python code execution
Hackable and optimized Transformers building blocks, supporting a composable construction.
Open source platform for the machine learning lifecycle
Official code for the paper "Parallel Speculative Decoding with Adaptive Draft Length".