Stars
Transformer neural network components coded piece by piece
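For flavor, here is a minimal scaled dot-product attention block in PyTorch, the core component such piece-by-piece walkthroughs usually start from. This is an illustrative sketch, not code from the repo:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # Similarity scores, scaled by sqrt(head_dim) to keep softmax well-behaved.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # attention-weighted sum of values
```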
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).
A highly optimized LLM inference acceleration engine for Llama and its variants.
Lightweight, standalone C++ inference engine for Google's Gemma models.
Disaggregated serving system for Large Language Models (LLMs).
SGLang is a fast serving framework for large language models and vision language models.
Model Compression Toolbox for Large Language Models and Diffusion Models
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Video+code lecture on building nanoGPT from scratch
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT
Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.
haileyschoelkopf / vllm
Forked from vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Adlik / smoothquantplus
Forked from mit-han-lab/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
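The core SmoothQuant trick is easy to show in isolation: migrate activation outliers into the weights with per-channel scales s_j = max|X_j|^α / max|W_j|^(1−α), leaving the layer output unchanged. A minimal numpy sketch of that idea; the function name and the small epsilon are mine, not from the repo:

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    act_max = np.abs(X).max(axis=0)   # activation range per input channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    return act_max**alpha / (w_max**(1 - alpha) + 1e-8)

# Difficulty migration: quantize X' = X/s and W' = s*W instead of X and W;
# the product is unchanged, but activation outliers are flattened.
X = np.random.randn(8, 4) * np.array([1.0, 10.0, 1.0, 1.0])  # channel 1 is an outlier
W = np.random.randn(4, 3)                                    # (in_features, out_features)
s = smooth_scales(X, W)
assert np.allclose((X / s) @ (s[:, None] * W), X @ W)
```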
TinyChatEngine: On-Device LLM Inference Library
This project shares the technical principles behind large language models along with hands-on experience (productionizing LLMs and deploying LLM applications).
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
Fast inference from large language models via speculative decoding
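The draft-and-verify loop behind speculative decoding fits in a few lines. Below is a toy greedy variant; the published algorithm uses rejection sampling to preserve the target model's sampling distribution, and `target`/`draft` here are assumed stand-ins for real models that map token ids to logits:

```python
import torch

@torch.no_grad()
def speculative_decode_greedy(target, draft, ids, k=4, max_new=64):
    # `target` and `draft` map token ids (1, T) to logits (1, T, vocab).
    new = 0
    while new < max_new:
        T = ids.size(1)
        # 1) The cheap draft model proposes k tokens autoregressively.
        prop = ids
        for _ in range(k):
            prop = torch.cat([prop, draft(prop)[:, -1].argmax(-1, keepdim=True)], -1)
        # 2) One target forward pass scores all k drafted positions at once.
        logits = target(prop)
        preds = logits[:, T - 1 : T + k - 1].argmax(-1)  # target's greedy picks
        drafted = prop[:, T:]
        # 3) Accept the longest agreeing prefix plus one corrected/bonus token.
        agree = int((preds == drafted).long().cumprod(-1).sum())
        bonus = logits[:, T + agree - 1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, drafted[:, :agree], bonus], -1)
        new += agree + 1
    return ids
```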
An annotated implementation of the Transformer paper.
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
REST: Retrieval-Based Speculative Decoding, NAACL 2024
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
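Medusa's core addition is a set of extra decoding heads that each guess a token a fixed number of steps ahead from the same last hidden state. A minimal sketch under that assumption; the module structure below is illustrative, and the real framework also builds candidate trees and verifies them with tree attention:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    # Head k guesses the token k+1 steps ahead from the last hidden state.
    def __init__(self, hidden, vocab, num_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU()) for _ in range(num_heads)
        )
        self.lm_heads = nn.ModuleList(
            nn.Linear(hidden, vocab, bias=False) for _ in range(num_heads)
        )

    def forward(self, h_last):  # h_last: (batch, hidden)
        # One logit vector per lookahead offset, all computed in parallel.
        return [lm(h_last + blk(h_last)) for blk, lm in zip(self.blocks, self.lm_heads)]

heads = MedusaHeads(hidden=64, vocab=100)
guesses = [logits.argmax(-1) for logits in heads(torch.randn(2, 64))]  # 4 speculative tokens per sequence
```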