Stars
Instant neural graphics primitives: lightning fast NeRF and more
How to optimize some algorithms in CUDA.
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
Deformable ConvNets V2 (DCNv2) in PyTorch
[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Efficient GPU kernels for block-sparse matrix multiplication and convolution
A series on GPU optimization topics that explains in detail how to optimize CUDA kernels. It covers several basic kernel optimizations, including: elementwise, reduce, s…
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
A simple GPU hash table implemented in CUDA using lock-free techniques
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
PyTorch-Based Fast and Efficient Processing for Various Machine Learning Applications with Diverse Sparsity
Parallel CUDA FloodFill algorithm working on 2D and 3D arrays with obstacles