Starred repositories
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Analyze computation-communication overlap in V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
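The "fine-grained scaling" in DeepGEMM's description refers to keeping one scale factor per small block of a matrix rather than one per tensor, so each scale only has to cover a narrow dynamic range. A minimal NumPy sketch of the idea (a crude simulation, not DeepGEMM's actual FP8 kernels; the block size and rounding grid are illustrative assumptions):

```python
import numpy as np

def quantize_blockwise(a, block=128):
    """Simulate FP8-style quantization with one scale per `block` columns.
    Small blocks mean each scale covers less dynamic range, so quantization
    error stays low. The 1/8-step rounding grid is a crude stand-in for an
    e4m3 mantissa, not a faithful FP8 encoding."""
    fp8_max = 448.0  # largest finite e4m3 value
    rows, cols = a.shape
    pad = (-cols) % block
    ap = np.pad(a, ((0, 0), (0, pad))).reshape(rows, -1, block)
    # one scale per (row, block): map the block's max magnitude to fp8_max
    scales = np.abs(ap).max(axis=2, keepdims=True) / fp8_max
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(ap / scales * 8) / 8, -fp8_max, fp8_max)
    deq = (q * scales).reshape(rows, -1)[:, :cols]
    return deq, scales.squeeze(2)
```

Dequantizing recovers the input to within a small fraction of each block's maximum, which is the property the fine-grained scheme buys over a single per-tensor scale.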
DeepEP: an efficient expert-parallel communication library
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
MoBA: Mixture of Block Attention for Long-Context LLMs
NVIDIA Linux open GPU with P2P support
Doing simple retrieval from LLMs at various context lengths to measure accuracy
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
How to optimize some algorithms in CUDA.
Large Language Model Text Generation Inference
A high-throughput and memory-efficient inference and serving engine for LLMs
Transformer-related optimizations, including BERT and GPT
Grasper: A High Performance Distributed System for OLAP on Property Graphs.
A solver for subgraph isomorphism problems, based upon a series of papers by subsets of McCreesh, Prosser, and Trimble.
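For reference, the problem such a solver tackles can be stated as a naive brute-force check (my own illustrative sketch, exponential and only usable on tiny graphs; real solvers like the one described above prune the search aggressively):

```python
from itertools import permutations

def subgraph_isomorphic(pattern, target):
    """Naive non-induced subgraph isomorphism: does `target` contain a copy
    of `pattern`? Graphs are dicts mapping a vertex to its neighbour set.
    Tries every injective mapping of pattern vertices into target vertices
    and checks that every pattern edge is preserved."""
    pv = list(pattern)
    for chosen in permutations(list(target), len(pv)):
        mapping = dict(zip(pv, chosen))
        if all(mapping[v] in target[mapping[u]]
               for u in pattern for v in pattern[u]):
            return True
    return False
```

For example, a triangle pattern is found in K4 but not in a 4-vertex path.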
CP 2015 subgraph isomorphism experiments, data and paper
Open-source graph database, tuned for dynamic analytics environments. Easy to adopt, scale and own.