This repository contains supplemental code to accompany my blog post on optimizing inference for Diffusion Policy. Special thanks to Cheng Chi and the team at TRI/Columbia for their clean code release, which has been instrumental for pedagogical purposes.
-
Part 3 - Profiling a Pytorch Forward Pass
diffusion_inference.py
: Code to run an end-to-end evaluation of Diffusion Policy with a 2D Push-T environment, including coarse/fine profiling of program run-time.log/diffusion/unet_prof.pt.trace.json
: Pytorch profile trace for U-Net forward pass. Can be viewed usingchrome://tracing
.hta.ipynb
: A Jupyter notebook demonstrating the use of Meta's Holistic Trace Analysis tool for detailed U-Net GPU utilization and kernel-level performance metrics analysis.
-
Part 4 - 1D Convolution in CUDA (Naive)
conv1d_naive.cu
: Standalone version of the naive 1D convolution kernel.conv1d_naive.ncu-rep
: NCU profile of the naive 1D convolution kernel's performance.
-
Part 5 - 1D Convolution in CUDA (Optimized)
conv1d_optimized.cu
: Standalone version of the optimized 1D convolution kernel discussed in the blog post.conv1d_optimized.ncu-rep
: NCU profile of the optimized 1D convolution kernel's performance.
-
Part 6 - Kernel Fusion in CUDA
gnm.cu
: Standalone version of the kernel fusion example discussed in the blog post.
-
Part 7 - A Dive Into DDPMs & a CUDA kernel for Denoising
denoise_kernel.cu
: Standalone version of the denoising kernel.
-
Part 8 - Integrating a Custom CUDA Kernel & CUDA Graphs in Pytorch
conv1d.cpp
: C++ file with Python binding & CUDA kernel wrapper for the 1D Convolution kernel.conv1d_kernel.cu
: CUDA file with the Conv1D kernel and driver function.cuda_graph_example.py
: Script demonstrating how to integrate a custom CUDA kernel into Pytorch, including the use of CUDA graphs to reduce CPU overhead.