Profiling Tools Interfaces for GPU (PTI for GPU) is a set of getting-started documentation, tools, and a library for starting performance analysis on Intel(R) Processor Graphics easily

C++ 218 56 Updated Feb 26, 2025

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 4,688 374 Updated Mar 1, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 4,438 385 Updated Feb 28, 2025

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.

C++ 1,509 200 Updated Jun 12, 2023

Infrastructure for Machine Learning Guided Optimization (MLGO) in LLVM.

Python 660 95 Updated Mar 1, 2025

Compile Time Regular Expression in C++

C++ 3,484 190 Updated Feb 25, 2025

FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs

C++ 10,765 706 Updated Feb 27, 2025

Fast and memory-efficient exact attention

Python 15,994 1,505 Updated Mar 1, 2025

Port of OpenAI's Whisper model in C/C++

C++ 38,128 3,960 Updated Feb 28, 2025

Tensor library for machine learning

C++ 11,991 1,150 Updated Feb 28, 2025

MLIR dialect for libgccjit

C++ 21 Updated Dec 3, 2024

Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (Third Edition)

Cuda 64 17 Updated Jan 21, 2021

Solution of Programming Massively Parallel Processors

C++ 41 5 Updated Jan 15, 2024

An MLIR-based compiler framework that bridges DSLs (domain-specific languages) to DSAs (domain-specific architectures).

C++ 565 182 Updated Feb 26, 2025

AMD's graph optimization engine.

C++ 209 94 Updated Mar 1, 2025

AKG (Auto Kernel Generator) is an optimizer for operators in Deep Learning Networks, which provides the ability to automatically fuse ops with specific patterns.

Python 219 38 Updated Mar 21, 2024

LightSeq: A High Performance Library for Sequence Processing and Generation

C++ 3,253 332 Updated May 16, 2023

Machine learning compiler based on MLIR for Sophgo TPU.

C++ 675 168 Updated Feb 24, 2025

A minimal GPU design in Verilog to learn how GPUs work from the ground up

SystemVerilog 7,901 603 Updated Aug 18, 2024

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as…

C++ 1,240 539 Updated Feb 15, 2025

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

C 7,016 1,940 Updated Feb 26, 2025

Fast OS-level support for GPU checkpoint and restore

C++ 156 13 Updated Feb 25, 2025

C++ 426 19 Updated Feb 28, 2025

CUDA on non-NVIDIA GPUs

Rust 10,803 695 Updated Feb 24, 2025

How to optimize some algorithms in CUDA.

Cuda 1,926 172 Updated Feb 26, 2025

My learning notes and code for ML systems.

Python 1,155 55 Updated Mar 1, 2025

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ 868 102 Updated Feb 28, 2025

C++ 3 Updated Feb 17, 2025

A lightweight C++20 serialization and RPC library

C++ 808 60 Updated Feb 24, 2025