TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

YangWang92/TiledCUDA

TiledCUDA

Introduction

TiledCUDA is a kernel template library designed for high efficiency. It wraps CUTLASS CuTe to simplify the implementation of complex fused kernels that use tensor core GEMM.

TiledCUDA utilizes PyTorch as its runtime environment and leverages the Tensor class of PyTorch for convenient testing.
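To make the idea of "processing tiles" concrete, here is a minimal host-side sketch of a tiled GEMM in plain C++. It is not TiledCUDA's API and does not use CuTe or tensor cores; the function name `tiled_gemm` and the `TILE` parameter are hypothetical and chosen only for illustration. It shows the general pattern of splitting a matrix product into small blocks that are processed one at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of TiledCUDA: computes C += A * B for
// row-major matrices (A is m x k, B is k x n, C is m x n), one
// TILE x TILE block at a time.
template <std::size_t TILE>
void tiled_gemm(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t m, std::size_t n,
                std::size_t k) {
    static_assert(TILE > 0, "tile size must be positive");
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);

    for (std::size_t i0 = 0; i0 < m; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < n; j0 += TILE) {
            for (std::size_t p0 = 0; p0 < k; p0 += TILE) {
                // Clamp the block so that edge tiles stay in bounds.
                const std::size_t i1 = std::min(i0 + TILE, m);
                const std::size_t j1 = std::min(j0 + TILE, n);
                const std::size_t p1 = std::min(p0 + TILE, k);

                // Multiply one tile of A by one tile of B and
                // accumulate the result into the matching tile of C.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t p = p0; p < p1; ++p) {
                        const float a_ip = a[i * k + p];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Calling `tiled_gemm<2>(a, b, c, 3, 3, 2)` on a zero-initialized `c` accumulates the full 3 x 3 product. On a GPU the same decomposition would map tiles to thread blocks and warps, which this sketch does not attempt to model.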

Quick Start

Download

git clone git@github.com:TiledTensor/TiledCUDA.git
cd TiledCUDA && git submodule update --init --recursive

Installation

TiledCUDA requires CUDA 12.0 or later and a C++20 host compiler; GCC 10.0 or higher is needed for C++20 support.
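One way to check these requirements before building is to ask the compiler what it reports about itself. The sketch below is an assumption about how such a check could look, not a script shipped with TiledCUDA; `at_least`, `has_cxx20`, and `report_toolchain` are hypothetical names. It relies only on standard predefined macros: `__cplusplus`, `__GNUC__` and `__GNUC_MINOR__` for GCC, and `__CUDACC_VER_MAJOR__` and `__CUDACC_VER_MINOR__`, which nvcc defines.

```cpp
#include <cstdio>

// Hypothetical helper (not part of TiledCUDA): true when version
// major.minor is at least req_major.req_minor.
constexpr bool at_least(int major, int minor, int req_major, int req_minor) {
    return major > req_major || (major == req_major && minor >= req_minor);
}

// True when this translation unit is compiled in C++20 mode or later.
constexpr bool has_cxx20() { return __cplusplus >= 202002L; }

// Prints what the compiler reports and returns true when every
// requirement that can be checked in this build is met.
inline bool report_toolchain() {
    bool ok = has_cxx20();
    std::printf("__cplusplus = %ld\n", static_cast<long>(__cplusplus));
#if defined(__GNUC__) && !defined(__clang__)
    // Clang also defines __GNUC__, so it is excluded here.
    std::printf("GCC %d.%d\n", __GNUC__, __GNUC_MINOR__);
    ok = ok && at_least(__GNUC__, __GNUC_MINOR__, 10, 0);
#endif
#if defined(__CUDACC_VER_MAJOR__)
    // These macros exist only when the file is compiled by nvcc.
    std::printf("nvcc %d.%d\n", __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__);
    ok = ok && at_least(__CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, 12, 0);
#endif
    return ok;
}
```

Calling `report_toolchain()` from a small `main` compiled with the same flags as the project (for example `-std=c++20`) shows whether the build environment matches the stated minimums.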

Unit Test

  • Run a single unit test: make unit_test UNIT_TEST=test_scatter_nd.py
  • Run all unit tests: ./scripts/unittests/python.sh
  • Run a single C++ unit test: make unit_test_cpp CPP_UT=test_copy
  • Run all C++ unit tests: make unit_test_cpps

Features

  • Implemented __device__ function wrappers that enable static and dynamic copying between different levels of the memory hierarchy.
  • Implemented __device__ function wrappers for CUDA micro kernels, such as copy_async and tensor core operations.
  • Implemented template wrappers for CuTe to simplify its usage.
  • Implemented fused kernels such as GEMM, Back2Back GEMM, Batched GEMM, and LSTM Cell.
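The first feature above distinguishes static from dynamic copying. One plausible reading, which is an assumption here and not confirmed by this README, is that a static copy knows the tile shape at compile time while a dynamic copy receives it at run time. The host-side sketch below illustrates only that distinction; `copy_tile_static` and `copy_tile_dynamic` are hypothetical names, and the real wrappers are `__device__` functions that move data between GPU memory levels.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Static variant (hypothetical): the tile shape is a template parameter,
// so the destination is a fixed-size array and the loop bounds are
// compile-time constants that the compiler can unroll.
template <std::size_t ROWS, std::size_t COLS>
std::array<float, ROWS * COLS> copy_tile_static(const std::vector<float>& src,
                                                std::size_t ld,
                                                std::size_t row0,
                                                std::size_t col0) {
    assert(col0 + COLS <= ld);
    assert((row0 + ROWS) * ld <= src.size());
    std::array<float, ROWS * COLS> tile{};
    for (std::size_t r = 0; r < ROWS; ++r) {
        for (std::size_t c = 0; c < COLS; ++c) {
            tile[r * COLS + c] = src[(row0 + r) * ld + col0 + c];
        }
    }
    return tile;
}

// Dynamic variant (hypothetical): the tile shape is known only at run
// time, so the destination is sized when the function is called.
inline std::vector<float> copy_tile_dynamic(const std::vector<float>& src,
                                            std::size_t ld, std::size_t row0,
                                            std::size_t col0, std::size_t rows,
                                            std::size_t cols) {
    assert(col0 + cols <= ld);
    assert((row0 + rows) * ld <= src.size());
    std::vector<float> tile(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            tile[r * cols + c] = src[(row0 + r) * ld + col0 + c];
        }
    }
    return tile;
}
```

Here `src` is a row-major matrix with leading dimension `ld`, and both functions copy the block whose top-left element sits at (`row0`, `col0`). The two variants return the same values for the same block; the difference is only in when the shape is fixed.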

