:octocat:
Busy with learning :>)
  • AMD Inc.
  • Beijing, China


Starred repositories


Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models

243 stars · 7 forks · Updated Jan 7, 2025

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ · 850 stars · 102 forks · Updated Jan 24, 2025

[ICLR2025] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

CUDA · 635 stars · 38 forks · Updated Feb 4, 2025

Efficient LLM Inference over Long Sequences

Python · 353 stars · 17 forks · Updated Dec 28, 2024

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

CUDA · 2,256 stars · 239 forks · Updated Feb 7, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python · 9,105 stars · 875 forks · Updated Feb 10, 2025

A code repository for a PyTorch C++ (LibTorch) tutorial.

C++ · 760 stars · 126 forks · Updated Nov 2, 2021

BS::thread_pool: a fast, lightweight, modern, and easy-to-use C++17 / C++20 / C++23 thread pool library

C++ · 2,360 stars · 268 forks · Updated Dec 20, 2024
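The entry above names the pattern rather than explaining it. A minimal sketch of the submit-and-drain design behind such libraries, in Python for brevity — class and method names here are illustrative, not BS::thread_pool's actual API:

```python
import queue
import threading

class MiniThreadPool:
    """Minimal fixed-size pool: workers drain a shared task queue."""

    def __init__(self, n_workers=4):
        self._tasks = queue.Queue()
        for _ in range(n_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn, args, out = self._tasks.get()
            try:
                out["result"] = fn(*args)  # store the result for the caller
            finally:
                self._tasks.task_done()

    def submit(self, fn, *args):
        out = {}  # filled in by a worker; a stand-in for a real future
        self._tasks.put((fn, args, out))
        return out

    def wait(self):
        self._tasks.join()  # block until every submitted task has finished
```

Real implementations add futures, priorities, pausing, and clean shutdown; this sketch leans on daemon threads exiting with the process.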

Lightning fast C++/CUDA neural network framework

C++ · 3,867 stars · 478 forks · Updated Jan 27, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python · 901 stars · 53 forks · Updated Feb 10, 2025

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …

Python · 6,210 stars · 515 forks · Updated Feb 8, 2025

Model Compression Toolbox for Large Language Models and Diffusion Models

Python · 320 stars · 23 forks · Updated Feb 1, 2025

An Attention Superoptimizer

C++ · 21 stars · Updated Jan 20, 2025

FlashInfer: Kernel Library for LLM Serving

CUDA · 1,970 stars · 198 forks · Updated Feb 9, 2025

Stable Diffusion and Flux in pure C/C++

C++ · 3,786 stars · 339 forks · Updated Feb 5, 2025

Python Interface to HIP and hiprtc Library

Python · 9 stars · 5 forks · Updated Nov 19, 2023

Kernel Tuner

Python · 307 stars · 52 forks · Updated Feb 9, 2025

Official code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

Python · 134 stars · 6 forks · Updated Oct 30, 2024

DrugAssist: A Large Language Model for Molecule Optimization

Python · 120 stars · 10 forks · Updated Jan 16, 2025

High-speed Large Language Model Serving for Local Deployment

C++ · 8,083 stars · 421 forks · Updated Jan 28, 2025

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Python · 5,777 stars · 527 forks · Updated Dec 14, 2024

Implementation of the Llama architecture with RLHF + Q-learning

Python · 160 stars · 7 forks · Updated Feb 1, 2025

Implementation of super-fast C++-styled namedtuple, for compile-time reflection.

C++ · 5 stars · Updated Feb 5, 2023

Subprocessing with modern C++

C++ · 463 stars · 92 forks · Updated Feb 28, 2024

📦 CMake's missing package manager. A small CMake script for setup-free, cross-platform, reproducible dependency management.

CMake · 3,214 stars · 190 forks · Updated Dec 29, 2024

A C++11 foundation library

C++ · 999 stars · 456 forks · Updated Nov 18, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 6,787 stars · 377 forks · Updated Jul 11, 2024
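The idea behind attention sinks reduces to a KV-cache retention policy: always keep the first few tokens (which soak up attention mass) plus a sliding window of recent tokens, and evict everything in between. A toy sketch — function name and defaults are illustrative, not the repository's API:

```python
def streaming_keep(positions, n_sink=4, window=8):
    """Cache positions retained under a sink + sliding-window policy:
    the first n_sink tokens anchor attention, the last `window` tokens
    carry recent context, and everything between them is evicted."""
    if len(positions) <= n_sink + window:
        return list(positions)  # cache still fits; evict nothing
    return list(positions[:n_sink]) + list(positions[-window:])
```

In the actual method this policy is applied to the key/value tensors at every decoding step, which is what lets the model stream past its training length with bounded memory.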

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Python · 1,934 stars · 238 forks · Updated Jan 20, 2025
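As context for the entry above, here is plain groupwise 4-bit weight quantization in pure Python. AWQ itself additionally rescales salient channels using activation statistics before quantizing, which this sketch omits; all names are illustrative:

```python
def quantize_group_int4(weights, group_size=8):
    """Asymmetric per-group quantization to 4-bit codes (0..15).
    Each group stores uint4 codes plus one (scale, zero-point) pair."""
    out = []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15.0 if hi > lo else 1.0
        codes = [min(15, max(0, round((x - lo) / scale))) for x in g]
        out.append((codes, scale, lo))
    return out

def dequantize(groups):
    """Reconstruct approximate weights from codes: w ≈ code * scale + lo."""
    return [c * scale + lo for codes, scale, lo in groups for c in codes]
```

The round-trip error is bounded by half a quantization step per group, which is why smaller group sizes trade metadata overhead for accuracy.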

Fast inference from large language models via speculative decoding

Python · 647 stars · 66 forks · Updated Aug 22, 2024
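The greedy form of the technique above can be sketched in a few lines: a cheap draft model proposes a block of tokens, and the target model keeps the longest prefix it agrees with plus its own correction. A toy, sequential sketch — real systems verify the whole block in one batched forward pass, and all names here are illustrative:

```python
def speculative_step(draft, target, seq, k=4):
    """One speculative-decoding step with greedy accept/reject.
    `draft` and `target` map a token sequence to the next token."""
    # Draft proposes k tokens autoregressively.
    proposal, ctx = [], list(seq)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target verifies: accept the longest agreeing prefix, then emit
    # its own token at the first disagreement.
    out = list(seq)
    for t in proposal:
        expected = target(out)
        if expected == t:
            out.append(t)
        else:
            out.append(expected)  # correction replaces the rejected token
            break
    else:
        out.append(target(out))  # bonus token when all k are accepted
    return out
```

The payoff is that each step advances by up to k+1 tokens for one (parallel) target pass, while provably emitting exactly what greedy decoding with the target alone would.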