CUDA Python: Performance meets Productivity

Python · 2,301 stars · 138 forks · Updated Apr 17, 2025
Python · 54 stars · 1 fork · Updated Apr 12, 2025
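As background for this entry, a minimal sketch of driver-level access through Python bindings of this kind follows. It assumes the low-level bindings are importable as "from cuda import cuda" and follow the documented tuple-return convention; the import path has moved between releases, so treat the names here as illustrative rather than authoritative.

# Illustrative device query through low-level CUDA driver bindings.
# Import path and return conventions are assumptions; consult the
# cuda-python documentation for the release you have installed.
from cuda import cuda

(err,) = cuda.cuInit(0)                        # initialize the driver API
assert err == cuda.CUresult.CUDA_SUCCESS

err, count = cuda.cuDeviceGetCount()
err, device = cuda.cuDeviceGet(0)
err, name = cuda.cuDeviceGetName(128, device)  # returns NUL-padded bytes
err, total_mem = cuda.cuDeviceTotalMem(device)

print(f"{count} device(s); device 0: {name.decode().rstrip(chr(0))}, "
      f"{total_mem / 2**30:.1f} GiB")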

Distributed Triton for Parallel Systems

MLIR · 433 stars · 23 forks · Updated Apr 8, 2025
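The distributed extension builds on Triton's standard tile-based programming model. For context, a plain single-GPU Triton vector-add kernel is sketched below; nothing here is specific to the distributed project, and it assumes triton and torch are installed with a CUDA device available.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)       # one program per tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
print(torch.allclose(add(x, y), x + y))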

Expert Parallelism Load Balancer

Python · 1,143 stars · 187 forks · Updated Mar 24, 2025
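The problem such a balancer addresses can be illustrated with a generic greedy packing heuristic: repeatedly place the heaviest remaining expert on the least-loaded GPU. This is a simplification for illustration only, not the repository's actual algorithm.

import heapq

def balance_experts(expert_loads, num_gpus):
    # Greedy longest-processing-time packing: the heaviest expert goes to
    # the currently least-loaded GPU. A generic heuristic, not EPLB itself.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

loads = {f"expert_{i}": w for i, w in enumerate([9, 7, 6, 5, 4, 4, 3, 2])}
print(balance_experts(loads, num_gpus=4))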

A Datacenter Scale Distributed Inference Serving Framework

Rust · 3,750 stars · 300 forks · Updated Apr 17, 2025

DeeperGEMM: a heavily optimized version of DeepGEMM

Cuda · 66 stars · Updated Apr 3, 2025

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …

Cuda · 869 stars · 205 forks · Updated Apr 17, 2025

Fastest kernels written from scratch

Cuda · 226 stars · 31 forks · Updated Apr 3, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 885 stars · 56 forks · Updated Apr 15, 2025

CUDA Templates for Linear Algebra Subroutines

C++ · 7,298 stars · 1,197 forks · Updated Apr 10, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python · 2,725 stars · 289 forks · Updated Mar 10, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Python · 5,222 stars · 563 forks · Updated Apr 16, 2025
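A rough NumPy reference of the fine-grained (block-scaled) idea behind FP8 GEMMs of this kind: every 1 x 128 tile of an operand carries its own scale, and each K-block's partial product is rescaled by the two tiles' scales during accumulation. FP8 storage is only simulated by clipping (no rounding), both operands use the same 1 x 128 granularity for simplicity, and none of this reflects the library's actual kernels.

import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value of the e4m3 format
BLOCK = 128            # scaling granularity along the K dimension

def quantize_blockwise(x, block=BLOCK):
    # Per-(row, K-block) scaling; FP8 is simulated by clipping to the
    # e4m3 range rather than actually rounding to 8-bit values.
    rows, k = x.shape
    nblk = (k + block - 1) // block
    q = np.zeros_like(x, dtype=np.float32)
    scales = np.zeros((rows, nblk), dtype=np.float32)
    for b in range(nblk):
        tile = x[:, b * block:(b + 1) * block]
        s = np.abs(tile).max(axis=1, keepdims=True) / FP8_E4M3_MAX + 1e-12
        scales[:, b] = s[:, 0]
        q[:, b * block:(b + 1) * block] = np.clip(tile / s, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def gemm_block_scaled(qa, sa, qb, sb, block=BLOCK):
    # C = A @ B.T, accumulating one K-block at a time and rescaling each
    # partial product by the corresponding row/column scales.
    m, n = qa.shape[0], qb.shape[0]
    c = np.zeros((m, n), dtype=np.float32)
    for b in range(sa.shape[1]):
        sl = slice(b * block, (b + 1) * block)
        c += (qa[:, sl] @ qb[:, sl].T) * np.outer(sa[:, b], sb[:, b])
    return c

a = np.random.randn(64, 256).astype(np.float32)
b = np.random.randn(32, 256).astype(np.float32)
qa, sa = quantize_blockwise(a)
qb, sb = quantize_blockwise(b)
print(np.abs(gemm_block_scaled(qa, sa, qb, sb) - a @ b.T).max())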

DeepEP: an efficient expert-parallel communication library

Cuda · 7,443 stars · 713 forks · Updated Apr 16, 2025
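The dispatch/combine pattern that an expert-parallel communication library accelerates can be sketched on a single process in plain NumPy. The routing tensors and the stand-in "expert" below are invented for illustration; the real cost in such a system is the all-to-all GPU communication that this sketch omits entirely.

import numpy as np

def dispatch_and_combine(tokens, topk_ids, topk_weights, num_experts):
    # Dispatch: gather the tokens routed to each expert.
    # Combine: scatter-add the weighted expert outputs back per token.
    outputs = np.zeros_like(tokens)
    for e in range(num_experts):
        token_idx, slot = np.nonzero(topk_ids == e)
        if token_idx.size == 0:
            continue
        expert_out = tokens[token_idx] * (e + 1)        # stand-in "expert"
        w = topk_weights[token_idx, slot][:, None]
        np.add.at(outputs, token_idx, w * expert_out)   # combine
    return outputs

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4)).astype(np.float32)
topk_ids = rng.integers(0, 4, size=(8, 2))
topk_weights = np.full((8, 2), 0.5, dtype=np.float32)
print(dispatch_and_combine(tokens, topk_ids, topk_weights, num_experts=4).shape)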

FlashMLA: Efficient MLA decoding kernels

C++ · 11,441 stars · 822 forks · Updated Mar 1, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python · 1,741 stars · 104 forks · Updated Apr 3, 2025
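A rough, non-causal, single-head sketch of the block-attention idea that the title describes: score key blocks against each query, keep only the top-k blocks, and attend over just those keys and values. The gating used here (mean-pooled block keys) and the omission of causal masking are simplifications; the paper's actual method differs in the details.

import numpy as np

def block_sparse_attention(q, k, v, block_size=4, topk=2):
    n, d = k.shape
    nblk = n // block_size
    kb = k[:nblk * block_size].reshape(nblk, block_size, d)
    vb = v[:nblk * block_size].reshape(nblk, block_size, d)
    pooled = kb.mean(axis=1)                # one descriptor per key block
    out = np.zeros_like(q)
    for i, qi in enumerate(q):
        gate = pooled @ qi                  # score every block for this query
        sel = np.argsort(gate)[-topk:]      # keep only the top-k blocks
        ks = kb[sel].reshape(-1, d)
        vs = vb[sel].reshape(-1, d)
        logits = ks @ qi / np.sqrt(d)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[i] = w @ vs                     # attention over selected blocks only
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((s, 8)) for s in (6, 16, 16))
print(block_sparse_attention(q, k, v).shape)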

CUDA Core Compute Libraries

C++ · 1,599 stars · 208 forks · Updated Apr 17, 2025
C++ · 54 stars · 11 forks · Updated Jan 18, 2025

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3.

Python · 1,179 stars · 129 forks · Updated Apr 14, 2025
C++ · 81 stars · 6 forks · Updated Mar 26, 2025
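These methods plug a learned draft model into the standard speculative-sampling verification step; a sketch of that generic acceptance test follows. The drafting model itself (the EAGLE-specific part) is not shown, and the distributions below are toy values.

import numpy as np

def verify_draft(p_target, q_draft, draft_token, rng):
    # Standard speculative-sampling test: accept the drafted token with
    # probability min(1, p/q); otherwise resample from the normalized
    # residual max(p - q, 0). This step is generic, not EAGLE-specific.
    p, q = p_target[draft_token], q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p_target), p=residual)), False

rng = np.random.default_rng(0)
p_target = np.array([0.1, 0.6, 0.3])   # toy target-model distribution
q_draft = np.array([0.3, 0.4, 0.3])    # toy draft-model distribution
draft_token = int(rng.choice(3, p=q_draft))
print(verify_draft(p_target, q_draft, draft_token, rng))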

verl: Volcano Engine Reinforcement Learning for LLMs

Python · 6,771 stars · 731 forks · Updated Apr 17, 2025

CUTLASS and CuTe Examples

Cuda · 48 stars · 7 forks · Updated Jan 4, 2025

Universal LLM Deployment Engine with ML Compilation

Python · 20,422 stars · 1,712 forks · Updated Apr 6, 2025

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention & RFT)

Python · 6,305 stars · 620 forks · Updated Apr 17, 2025
Jupyter Notebook · 95 stars · 8 forks · Updated Nov 11, 2024

[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Python · 452 stars · 28 forks · Updated Feb 10, 2025

A guide to writing CUDA operators by hand and to CUDA interview questions

Cuda · 308 stars · 31 forks · Updated Jan 15, 2025

The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression"

Python · 123 stars · 6 forks · Updated Dec 6, 2024

Tile primitives for speedy kernels

Cuda · 2,265 stars · 134 forks · Updated Apr 17, 2025

A llama model inference framework implemented in CUDA C++

Cuda · 49 stars · 5 forks · Updated Nov 8, 2024

[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, attention is computed with approximate, dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an …

Python · 971 stars · 47 forks · Updated Apr 16, 2025