Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Cuda 924 55 Updated Jan 30, 2025

andyzeng / tsdf-fusion

Fuse multiple depth frames into a TSDF voxel volume.

Cuda 746 135 Updated May 7, 2019

princeton-vl / lietorch

Cuda 713 55 Updated Oct 20, 2023

tspeterkim / flash-attention-minimal

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 694 61 Updated Dec 30, 2024

olcf / cuda-training-series

Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)

Cuda 682 250 Updated Aug 19, 2024

19reborn / NeuS2

[ICCV 2023] Official code for NeuS2

Cuda 655 44 Updated Mar 22, 2024

siboehm / SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Cuda 614 79 Updated Dec 28, 2023

clu0 / unet.cu

UNet diffusion model in pure CUDA

Cuda 597 27 Updated Jun 28, 2024

vincentfpgarcia / kNN-CUDA

Fast k nearest neighbor search using GPU

Cuda 516 109 Updated Aug 6, 2018

creiser / kilonerf

Code for KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

Cuda 483 54 Updated Jun 16, 2021

MegviiRobot / MegBA

MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment

Cuda 459 61 Updated Jun 3, 2024

graphdeco-inria / nerfshop

NeRFshop: Interactive Editing of Neural Radiance Fields

Cuda 456 24 Updated Mar 27, 2023

cudpp / cudpp

CUDA Data Parallel Primitives Library

Cuda 426 96 Updated Nov 9, 2018

ashawkey / diff-gaussian-rasterization

Cuda 392 36 Updated Jul 24, 2024

kwea123 / pytorch-cppcuda-tutorial

Tutorial for writing custom PyTorch C++/CUDA kernels, applied to volume rendering (NeRF)

Cuda 390 37 Updated Apr 17, 2023

facebookresearch / DABA

Official implementation of "Decentralization and Acceleration Enables Large-Scale Bundle Adjustment"

Cuda 354 32 Updated Jul 28, 2023

vchoutas / torch-mesh-isect

Cuda 316 76 Updated Oct 5, 2022

unlimblue / KNN_CUDA

PyTorch kNN [CUDA version]

Cuda 298 38 Updated Dec 14, 2021

MarvinChung / Orbeez-SLAM

Cuda 270 29 Updated Oct 9, 2023

b0nes164 / GPUSorting

State-of-the-art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.

Hyeontae Son countywest

Starred repositories

Algorithm

3D

differentiable-rendering