Profiling Tools Interfaces for GPU (PTI for GPU) is a set of getting-started documentation and a tools library that make it easy to start performance analysis on Intel(R) Processor Graphics.

C++ 218 56 Updated Feb 26, 2025

deepseek-ai / 3FS

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 4,866 391 Updated Mar 1, 2025

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 4,458 389 Updated Feb 28, 2025

Tencent / TurboTransformers

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT2, decoders, etc.) on CPU and GPU.

C++ 1,509 200 Updated Jun 12, 2023

google / ml-compiler-opt

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.

Python 660 95 Updated Mar 1, 2025

hanickadot / compile-time-regular-expressions

Compile-time regular expressions in C++

C++ 3,484 190 Updated Feb 25, 2025

deepseek-ai / FlashMLA

FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs

C++ 10,790 709 Updated Mar 1, 2025

Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Python 15,996 1,507 Updated Mar 1, 2025

ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

C++ 38,130 3,960 Updated Feb 28, 2025

ggml-org / ggml

Tensor library for machine learning

C++ 11,991 1,150 Updated Feb 28, 2025

Lancern / mlir-gccjit

MLIR dialect for libgccjit

C++ 21 Updated Dec 3, 2024

nvixnu / pmpp__programming_massively_parallel_processors

Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu (Third Edition)

Cuda 64 17 Updated Jan 21, 2021

guanrenyang / Programming-Massively-Parallel-Processors

Solutions to Programming Massively Parallel Processors

C++ 41 5 Updated Jan 15, 2024

buddy-compiler / buddy-mlir

An MLIR-based compiler framework that bridges DSLs (domain-specific languages) to DSAs (domain-specific architectures).

C++ 565 182 Updated Feb 26, 2025

ROCm / AMDMIGraphX

AMD's graph optimization engine.

C++ 209 94 Updated Mar 1, 2025

mindspore-ai / akg

AKG (Auto Kernel Generator) is an optimizer for operators in Deep Learning Networks, which provides the ability to automatically fuse ops with specific patterns.

Python 219 38 Updated Mar 21, 2024

bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

C++ 3,253 332 Updated May 16, 2023

Tony-Tan / CUDA_Freshman

Cuda 2,331 454 Updated Jan 16, 2024

sophgo / tpu-mlir

Machine learning compiler based on MLIR for Sophgo TPU.

C++ 675 168 Updated Feb 24, 2025

adam-maj / tiny-gpu

A minimal GPU design in Verilog to learn how GPUs work from the ground up

SystemVerilog 7,901 603 Updated Aug 18, 2024

gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…

C++ 1,240 539 Updated Feb 15, 2025