Stars
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
How to optimize some algorithms in CUDA.
Learn CUDA Programming, published by Packt
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
A set of hands-on tutorials for CUDA programming
CUDA Matrix Multiplication Optimization
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on h…
Benchmark tests supporting the TiledCUDA library.