Han-fangyuan's repositories

A low-latency & high-throughput serving engine for LLMs

Python 301 37 Updated Sep 12, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,879 184 Updated Jan 28, 2025

Machine Learning Engineering Open Book

Python 12,545 769 Updated Jan 28, 2025

📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥

1,186 43 Updated Jan 17, 2025

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

Jupyter Notebook 2,393 161 Updated Jun 25, 2024

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24)

Python 102 21 Updated Jul 10, 2024

Inference code for LLaMA models

Python 113 26 Updated Aug 13, 2023

This is a Chinese translation of the CUDA programming guide

1,391 215 Updated Nov 13, 2024

LLM inference in C/C++

C++ 71,699 10,363 Updated Jan 28, 2025

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python 38,865 4,769 Updated Jan 28, 2025

Fast and memory-efficient exact attention

Python 15,206 1,436 Updated Jan 18, 2025

Curated collection of papers in MoE model inference

41 3 Updated Jan 21, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 7,999 779 Updated Jan 28, 2025

Code for the paper Fine-Tuning Language Models from Human Preferences

Python 1,272 164 Updated Jul 25, 2023

Fast Multimodal LLM on Mobile Devices

C++ 671 75 Updated Jan 25, 2025

Material for GPU MODE lectures

Jupyter Notebook 3,567 358 Updated Jan 6, 2025

Efficient and easy multi-instance LLM serving

Python 284 18 Updated Jan 23, 2025

Code based on vLLM for the paper “Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention”.

Python 6 1 Updated Sep 19, 2024

This project shares the technical principles behind large language models, along with hands-on experience (LLM engineering and real-world application deployment).

HTML 13,406 1,510 Updated Jan 15, 2025

Transformer-related optimization, including BERT and GPT

C++ 5,997 897 Updated Mar 27, 2024

High-speed Large Language Model Serving for Local Deployment

C++ 8,065 418 Updated Sep 6, 2024

A self-learning tutorial for CUDA high-performance programming.

JavaScript 336 38 Updated Dec 17, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 5,340 469 Updated Jan 27, 2025

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 9,251 1,084 Updated Jan 24, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 606 54 Updated Jan 21, 2025