Stars
SGLang is a fast serving framework for large language models and vision language models.
This project covers convolution operator optimization on GPUs, including GEMM-based (implicit GEMM) convolution.
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Hackable and optimized Transformers building blocks, supporting a composable construction.
An easy-to-understand TensorOp Matmul tutorial
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A high-throughput and memory-efficient inference and serving engine for LLMs
Samples for CUDA developers demonstrating features in the CUDA Toolkit
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
A modern, C++-native, test framework for unit-tests, TDD and BDD - using C++14, C++17 and later (C++11 support is in v2.x branch, and C++03 on the Catch1.x branch)
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Free downloads of English-language magazines such as The Economist (with audio), The New Yorker, The Guardian, Wired, and The Atlantic; available in epub, mobi, and pdf formats, updated weekly.
A self-learning tutorial for CUDA high-performance programming.
Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.
A profiler to disclose and quantify hardware features on GPUs.
An innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification.
[ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation