Stars
Transformer neural network components coded piece by piece
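For flavor, here is a minimal scaled dot-product attention block in PyTorch, the core component such piece-by-piece walkthroughs usually start from. This is an illustrative sketch, not code from the repo:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # Similarity scores, scaled by sqrt(head_dim) to keep softmax well-behaved.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # attention-weighted sum of values
```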
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future -- PRs welcome).
A highly optimized LLM inference acceleration engine for Llama and its variants.
Lightweight, standalone C++ inference engine for Google's Gemma models.
Disaggregated serving system for Large Language Models (LLMs).
SGLang is a fast serving framework for large language models and vision language models.
Model Compression Toolbox for Large Language Models and Diffusion Models
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Video+code lecture on building nanoGPT from scratch
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT
Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.
haileyschoelkopf / vllm
Forked from vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Adlik / smoothquantplus
Forked from mit-han-lab/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
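The core SmoothQuant trick is easy to show in isolation: migrate activation outliers into the weights with per-channel scales s_j = max|X_j|^α / max|W_j|^(1−α), leaving the layer output unchanged. A minimal numpy sketch of that idea; the function name and the small epsilon are mine, not from the repo:

```python
import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # Per-input-channel scales: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    act_max = np.abs(X).max(axis=0)   # activation range per input channel
    w_max = np.abs(W).max(axis=1)     # weight range per input channel
    return act_max**alpha / (w_max**(1 - alpha) + 1e-8)

# Difficulty migration: quantize X' = X/s and W' = s*W instead of X and W;
# the product is unchanged, but activation outliers are flattened.
X = np.random.randn(8, 4) * np.array([1.0, 10.0, 1.0, 1.0])  # channel 1 is an outlier
W = np.random.randn(4, 3)                                    # (in_features, out_features)
s = smooth_scales(X, W)
assert np.allclose((X / s) @ (s[:, None] * W), X @ W)
```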
TinyChatEngine: On-Device LLM Inference Library
This project shares the technical principles behind large language models along with hands-on experience (productionizing LLMs and deploying LLM applications).
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
Fast inference from large language models via speculative decoding
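The draft-and-verify loop behind speculative decoding fits in a few lines. Below is a toy greedy variant; the published algorithm uses rejection sampling to preserve the target model's sampling distribution, and `target`/`draft` here are assumed stand-ins for real models that map token ids to logits:

```python
import torch

@torch.no_grad()
def speculative_decode_greedy(target, draft, ids, k=4, max_new=64):
    # `target` and `draft` map token ids (1, T) to logits (1, T, vocab).
    new = 0
    while new < max_new:
        T = ids.size(1)
        # 1) The cheap draft model proposes k tokens autoregressively.
        prop = ids
        for _ in range(k):
            prop = torch.cat([prop, draft(prop)[:, -1].argmax(-1, keepdim=True)], -1)
        # 2) One target forward pass scores all k drafted positions at once.
        logits = target(prop)
        preds = logits[:, T - 1 : T + k - 1].argmax(-1)  # target's greedy picks
        drafted = prop[:, T:]
        # 3) Accept the longest agreeing prefix plus one corrected/bonus token.
        agree = int((preds == drafted).long().cumprod(-1).sum())
        bonus = logits[:, T + agree - 1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, drafted[:, :agree], bonus], -1)
        new += agree + 1
    return ids
```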
An annotated implementation of the Transformer paper.
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
REST: Retrieval-Based Speculative Decoding, NAACL 2024
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
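Medusa's core addition is a set of extra decoding heads that each guess a token a fixed number of steps ahead from the same last hidden state. A minimal sketch under that assumption; the module structure below is illustrative, and the real framework also builds candidate trees and verifies them with tree attention:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    # Head k guesses the token k+1 steps ahead from the last hidden state.
    def __init__(self, hidden, vocab, num_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU()) for _ in range(num_heads)
        )
        self.lm_heads = nn.ModuleList(
            nn.Linear(hidden, vocab, bias=False) for _ in range(num_heads)
        )

    def forward(self, h_last):  # h_last: (batch, hidden)
        # One logit vector per lookahead offset, all computed in parallel.
        return [lm(h_last + blk(h_last)) for blk, lm in zip(self.blocks, self.lm_heads)]

heads = MedusaHeads(hidden=64, vocab=100)
guesses = [logits.argmax(-1) for logits in heads(torch.randn(2, 64))]  # 4 speculative tokens per sequence
```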