Stars
SGLang is a fast serving framework for large language models and vision language models.
Large Language Model Text Generation Inference
📖 A curated list of Awesome LLM Inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
Large Language Model (LLM) Systems Paper List
FlashInfer: Kernel Library for LLM Serving
Sample codes for my CUDA programming book
Foundational Models for State-of-the-Art Speech and Text Translation
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing etc.
Robust Speech Recognition via Large-Scale Weak Supervision
Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI's Whisper.
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
Source code for "On the Relationship between Self-Attention and Convolutional Layers"
buger / goreplay
Forked from taboola/goreplay. GoReplay is an open-source tool for capturing and replaying live HTTP traffic into a test environment in order to continuously test your system with real data. It can be used to increase confidence…
The simplest, fastest repository for training/finetuning medium-sized GPTs.
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
Development repository for the Triton language and compiler
Distributed LLM and StableDiffusion inference for mobile, desktop and server.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Doing simple retrieval from LLMs at various context lengths to measure accuracy
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
A high-throughput and memory-efficient inference and serving engine for LLMs
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
Spising: ⚡️Open-source AI LangChain-like RAG (Retrieval-Augmented Generation) knowledge database with web UI and Enterprise SSO⚡️, supports OpenAI, Azure, LLaMA, Google Gemini, HuggingFace, Claude,…
Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.