Stars
Profiling Tools Interfaces for GPU (PTI for GPU) is a set of getting-started documentation and a tools library for easily starting performance analysis on Intel(R) Processor Graphics.
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc.) on CPU and GPU.
Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.
Compile Time Regular Expression in C++
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
Fast and memory-efficient exact attention
Port of OpenAI's Whisper model in C/C++
Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (Third Edition)
Solutions to Programming Massively Parallel Processors
An MLIR-based compiler framework that bridges DSLs (domain-specific languages) to DSAs (domain-specific architectures).
AKG (Auto Kernel Generator) is an optimizer for operators in Deep Learning Networks, which provides the ability to automatically fuse ops with specific patterns.
LightSeq: A High Performance Library for Sequence Processing and Generation
Machine learning compiler based on MLIR for Sophgo TPU.
A minimal GPU design in Verilog to learn how GPUs work from the ground up
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
Fast OS-level support for GPU checkpoint and restore
How to optimize some algorithms in CUDA.
My learning notes and code for ML SYS.
A highly optimized LLM inference acceleration engine for Llama and its variants.
A lightweight C++20 serialization and RPC library