Starred repositories
Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models
A highly optimized LLM inference acceleration engine for Llama and its variants.
[ICLR2025] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Efficient LLM Inference over Long Sequences
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
SGLang is a fast serving framework for large language models and vision language models.
A code repository for a PyTorch C++ (LibTorch) tutorial.
BS::thread_pool: a fast, lightweight, modern, and easy-to-use C++17 / C++20 / C++23 thread pool library
Lightning fast C++/CUDA neural network framework
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
Model Compression Toolbox for Large Language Models and Diffusion Models
FlashInfer: Kernel Library for LLM Serving
Stable Diffusion and Flux in pure C/C++
Official Code for GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation
DrugAssist: A Large Language Model for Molecule Optimization
High-speed Large Language Model Serving for Local Deployment
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Implementation of the Llama architecture with RLHF + Q-learning
Implementation of super-fast C++-styled namedtuple, for compile-time reflection.
📦 CMake's missing package manager. A small CMake script for setup-free, cross-platform, reproducible dependency management.
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Fast inference from large language models via speculative decoding