
Code the components of a Transformer neural network piece by piece

Jupyter Notebook 330 175 Updated May 1, 2023

JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).

Python 288 36 Updated Mar 3, 2025

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 868 102 Updated Mar 3, 2025

A lightweight, standalone C++ inference engine for Google's Gemma models.

C++ 6,143 524 Updated Mar 4, 2025

Disaggregated serving system for Large Language Models (LLMs).

Jupyter Notebook 474 51 Updated Aug 19, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python 11,262 1,130 Updated Mar 4, 2025

Model Compression Toolbox for Large Language Models and Diffusion Models

Python 354 26 Updated Feb 21, 2025

Implement a ChatGPT-like LLM in PyTorch from scratch, step by step

Jupyter Notebook 41,307 5,569 Updated Mar 2, 2025
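
This kind of from-scratch walkthrough assembles standard GPT components by hand. As a rough illustration (not the repository's code), a minimal causal self-attention layer of the kind such a build constructs might look like this in PyTorch:

```python
# Minimal sketch, not the repository's code: a causal self-attention layer
# of the kind a from-scratch GPT implementation builds up step by step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, max_len: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        # Lower-triangular mask so each position attends only to the past.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, T, C) -> (B, num_heads, T, head_dim)
        shape = (B, T, self.num_heads, C // self.num_heads)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        out = F.softmax(att, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```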

Video+code lecture on building nanoGPT from scratch

Python 3,927 576 Updated Aug 13, 2024

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA GPUs, x86, and ARMv9.

C 236 24 Updated Feb 13, 2025

Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT

C++ 916 259 Updated Jul 13, 2024

Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and a Chrome extension.

Python 5,997 534 Updated Jan 24, 2025

LLM training in simple, raw C/CUDA

Cuda 25,892 2,971 Updated Oct 2, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 5 1 Updated Mar 5, 2024

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 20 Updated Mar 15, 2024
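
SmoothQuant's central trick is to migrate activation outliers into the weights with a per-channel scale before quantizing. Below is a minimal sketch of that rescaling, assuming a linear layer with weight shape (out_features, in_features); it is illustrative only, not the official code:

```python
# Illustrative sketch of SmoothQuant's smoothing step (not the official code):
# per input channel j, s_j = max|X_j|^alpha / max|W_j|^(1-alpha), so that
# (X / s) @ (W * s).T equals X @ W.T but with activation outliers reduced.
import torch

def smooth_scales(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_max: per-input-channel max abs activation, shape (in_features,)
    # weight:  linear layer weight, shape (out_features, in_features)
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)        # per-channel weight max
    return (act_max.clamp(min=1e-5) ** alpha) / (w_max ** (1 - alpha))

def apply_smoothing(x: torch.Tensor, weight: torch.Tensor, scales: torch.Tensor):
    # Divide activations and multiply weights by the same per-channel scale;
    # the product x @ weight.T is unchanged, but activations become easier to quantize.
    return x / scales, weight * scales
```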

TinyChatEngine: On-Device LLM Inference Library

C++ 816 82 Updated Jul 4, 2024

Design pattern demo code

C++ 1,085 272 Updated Apr 17, 2024

This project aims to share the technical principles behind large language models along with hands-on experience (LLM engineering and real-world LLM application deployment).

HTML 14,837 1,715 Updated Mar 2, 2025

Inference Llama 2 in one file of pure C

C 18,113 2,205 Updated Aug 6, 2024

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …

Python 6,765 555 Updated Mar 3, 2025

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,789 232 Updated Mar 3, 2025
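
AWQ is a weight-only scheme: it searches per-channel scales from activation statistics to protect salient weight channels, then quantizes the weights in small groups. The sketch below is illustrative only (not the AWQ implementation) and shows just the symmetric group-wise 4-bit quantize/dequantize step such methods rest on:

```python
# Illustrative sketch (not the AWQ code): symmetric group-wise 4-bit weight
# quantization of the kind weight-only methods target; AWQ additionally
# applies activation-aware per-channel scaling before this step.
import torch

def quantize_groupwise_int4(w: torch.Tensor, group_size: int = 128):
    # w: (out_features, in_features); in_features assumed divisible by group_size
    out_f, in_f = w.shape
    w_g = w.view(out_f, in_f // group_size, group_size)
    scale = w_g.abs().amax(dim=-1, keepdim=True) / 7.0      # int4 range [-7, 7]
    q = torch.clamp(torch.round(w_g / scale), -7, 7)
    return q.to(torch.int8), scale                           # bit-packing omitted

def dequantize_groupwise(q: torch.Tensor, scale: torch.Tensor, shape):
    return (q.float() * scale).view(shape)
```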

An easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.

Python 4,730 509 Updated Jan 21, 2025

📖 A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉

3,563 246 Updated Mar 4, 2025

Fast inference from large language models via speculative decoding

Python 670 67 Updated Aug 22, 2024
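
Speculative decoding lets a small draft model propose several tokens that the large target model then verifies in a single forward pass. Here is a simplified, greedy-acceptance sketch (the full algorithm uses a rejection-sampling acceptance rule); `draft_model` and `target_model` are placeholder callables returning per-position logits, not a specific API:

```python
# Simplified sketch of speculative decoding with greedy acceptance.
# Assumes each model maps a 1D tensor of token ids to logits of shape (len, vocab).
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft = tokens.clone()
    for _ in range(k):
        next_id = draft_model(draft)[-1].argmax()
        draft = torch.cat([draft, next_id.view(1)])

    # 2) Target model scores every drafted position in one forward pass.
    target_logits = target_model(draft)

    # 3) Accept the longest prefix where the target agrees with the draft.
    accepted = tokens
    for i in range(k):
        pos = tokens.numel() + i
        target_choice = target_logits[pos - 1].argmax()
        if target_choice == draft[pos]:
            accepted = draft[: pos + 1]
        else:
            # First disagreement: take the target's own token and stop.
            accepted = torch.cat([accepted, target_choice.view(1)])
            break
    return accepted
```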

An annotated implementation of the Transformer paper.

Jupyter Notebook 6,045 1,280 Updated Apr 7, 2024

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

Python 992 109 Updated Feb 18, 2025

REST: Retrieval-Based Speculative Decoding, NAACL 2024

C 194 12 Updated Dec 2, 2024

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,444 168 Updated Jun 25, 2024
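
Medusa attaches several lightweight decoding heads to the base model's final hidden state so that head k proposes the token k+1 positions ahead; the base model then verifies the proposed continuations. The sketch below is illustrative only, not the Medusa implementation (which uses residual head blocks and tree-structured verification):

```python
# Illustrative sketch of Medusa-style extra decoding heads (not the official code):
# each head reads the final hidden state and predicts a token further in the future,
# so several candidate tokens can be proposed per decoding step.
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor):
        # last_hidden: (batch, hidden_size) for the most recent position.
        # Returns one logits tensor per lookahead offset; verification of the
        # resulting candidate continuations is left to the base model.
        return [head(last_hidden) for head in self.heads]
```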