DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Helpful tools and examples for working with flex-attention
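A minimal sketch of the underlying FlexAttention API, assuming PyTorch >= 2.5 (where `torch.nn.attention.flex_attention` ships); shapes are illustrative:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions by sending their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)
```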
How to optimize algorithms in CUDA.
A Python implementation of global optimization with Gaussian processes.
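A minimal sketch using the `bayesian-optimization` package; the toy objective and bounds here are illustrative:

```python
from bayes_opt import BayesianOptimization

def black_box(x, y):
    # Toy objective with its maximum at (2, 1).
    return -(x - 2) ** 2 - (y - 1) ** 2

optimizer = BayesianOptimization(
    f=black_box,
    pbounds={"x": (-5, 5), "y": (-5, 5)},
    random_state=1,
)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best parameters and target found so far
```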
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Disaggregated serving system for Large Language Models (LLMs).
A large-scale simulation framework for LLM inference
A low-latency & high-throughput serving engine for LLMs
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
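A minimal LMDeploy pipeline sketch; the model name is illustrative:

```python
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2-chat-7b")
responses = pipe(["What is LLM serving?"])
print(responses[0].text)
```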
A Data Streaming Library for Efficient Neural Network Training
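A hedged sketch, assuming this is mosaicml/streaming; the remote and cache paths are illustrative:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Streams pre-sharded samples from remote storage, caching them locally.
dataset = StreamingDataset(
    remote="s3://my-bucket/mds",
    local="/tmp/mds-cache",
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32)
```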
A guidance language for controlling large language models.
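A short guidance sketch (assuming the guidance >= 0.1 API); the model choice is illustrative:

```python
from guidance import models, gen, select

lm = models.Transformers("microsoft/Phi-3-mini-4k-instruct")
# Constrain generation: force a choice, then bounded free-form text.
lm += "Is Python compiled or interpreted? " + select(["compiled", "interpreted"])
lm += "\nBriefly why: " + gen(max_tokens=30, stop=".")
print(str(lm))
```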
SGLang is a fast serving framework for large language models and vision language models.
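A minimal SGLang frontend sketch; it assumes a local SGLang server is already running (e.g. via `python -m sglang.launch_server --model-path <model> --port 30000`):

```python
import sglang as sgl

@sgl.function
def qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is radix attention?")
print(state["answer"])
```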
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
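A hedged sketch of the high-level LLM API (assuming a recent TensorRT-LLM release that ships it); the model name is illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```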
Ring attention implementation with flash attention
Code and documentation to train Stanford's Alpaca models, and generate the data.
Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
FlashInfer: Kernel Library for LLM Serving
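A hedged FlashInfer sketch of single-request decode attention (requires CUDA tensors in half precision; shapes are illustrative):

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 128, 1024
q = torch.randn(num_qo_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_kv_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(kv_len, num_kv_heads, head_dim, device="cuda", dtype=torch.float16)
# Attention for one new query token over the cached KV.
out = flashinfer.single_decode_with_kv_cache(q, k, v)
```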
This is a place for various problem detectors running on Kubernetes nodes.
A tool for bandwidth measurements on NVIDIA GPUs.
MinIO is a high-performance, S3 compatible object store, open sourced under GNU AGPLv3 license.
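A minimal sketch using the official MinIO Python SDK (`pip install minio`); the endpoint, credentials, and paths are illustrative:

```python
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)
if not client.bucket_exists("checkpoints"):
    client.make_bucket("checkpoints")
client.fput_object("checkpoints", "model.bin", "/tmp/model.bin")
```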
Some reference and example networking plugins, maintained by the CNI team.
Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.