Organizations

@microsoft @BaguaSys @MicrosoftCopilot

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.

C 221 22 Updated Jan 24, 2025

Material for gpu-mode lectures

Jupyter Notebook 3,576 362 Updated Jan 6, 2025

Puzzles for learning Triton

Jupyter Notebook 1,341 98 Updated Nov 18, 2024

Helpful tools and examples for working with flex-attention

Python 603 33 Updated Jan 26, 2025

How to optimize some algorithms in CUDA.

Cuda 1,850 154 Updated Jan 26, 2025

A Python implementation of global optimization with gaussian processes.

Python 8,046 1,555 Updated Jan 2, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,447 145 Updated Jan 24, 2025

Microsoft Azure Traces

Jupyter Notebook 872 149 Updated Dec 12, 2024

Disaggregated serving system for Large Language Models (LLMs).

Jupyter Notebook 453 50 Updated Aug 19, 2024

A large-scale simulation framework for LLM inference

Python 316 55 Updated Nov 19, 2024

A low-latency & high-throughput serving engine for LLMs

Python 301 38 Updated Sep 12, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 5,377 474 Updated Jan 27, 2025

A Data Streaming Library for Efficient Neural Network Training

Python 1,205 153 Updated Jan 27, 2025

A guidance language for controlling large language models.

Jupyter Notebook 19,545 1,065 Updated Jan 29, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 8,213 797 Updated Jan 31, 2025

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,183 71 Updated Oct 14, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 9,270 1,086 Updated Jan 31, 2025

Ring attention implementation with flash attention

Python 661 57 Updated Dec 19, 2024

Large Context Attention

Python 678 53 Updated Jan 24, 2025

Code and documentation to train Stanford's Alpaca models, and generate the data.

Python 29,767 4,056 Updated Jul 17, 2024

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)

Python 11,721 1,037 Updated Jan 29, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 1,896 189 Updated Jan 31, 2025

RDC

C++ 26 11 Updated Jan 31, 2025

This is a place for various problem detectors running on the Kubernetes nodes.

Go 3,068 641 Updated Jan 27, 2025

Multi-GPU CUDA stress test

C++ 1,528 305 Updated Aug 20, 2024

A tool for bandwidth measurements on NVIDIA GPUs.

C++ 344 30 Updated Oct 18, 2024

MinIO is a high-performance, S3 compatible object store, open sourced under GNU AGPLv3 license.

Go 49,816 5,636 Updated Jan 29, 2025

Some reference and example networking plugins, maintained by the CNI team.

Go 2,268 798 Updated Jan 29, 2025

Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.

Shell 982 100 Updated Jul 29, 2024