* Fused neighborhood attention (FNA) kernels (forward pass only for now)
  * 1D, 2D and 3D Neighborhood Attention are supported,
  * Causal neighborhood attention is implemented (see the reference sketch after this list),
  * Window (kernel) size, dilation, and causality can be defined *per-axis*,
  * All GPU architectures since Maxwell (SM50) are supported,
    * SM50 up to SM70 are SIMT-only, but support both FP16 and FP32,
    * SM70 and SM75 target Tensor Cores in FP16, and SIMT-style in FP32,
    * SM80 and above target Tensor Cores in FP16, BF16, and FP32.
  * Relative positional biases are implemented (not defined for causal masking yet),
  * Memory layout in FNA is different from existing kernels (`[B, *, heads, dim]` instead of `[B, heads, *, dim]`),
    * Eventually this layout can skip over the permute/explicit reshape step in the attention module following the QKV projection (see the layout sketch after this list).
* Naive kernels now implement and allow causal masking,
* Naive kernels (CPU and CUDA) now allow varying parameters (window size, dilation, causal) across axes,
* Major bug fix in Volta GEMM kernels
  * The epilogue was different for Volta, and it slipped through unit tests,
  * Tests are now more aggressive, and the issue has been fixed.
* Minor torch bug fixed
  * Streams were not being selected correctly if users set a tensor to a device other than cuda:0. Thanks to @AdityaKane2001 for discovering it.
* Documentation (finally):
  * Better late than never, but finally added more documentation and reorganized docs under docs/ instead of shoving everything into the readme.
* So much more that I forgot (in part due to lack of documentation).
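
A minimal, non-fused reference sketch of what causal vs. non-causal neighborhood attention means along a single axis, using the `[B, *, heads, dim]` layout mentioned above. This is plain PyTorch for illustration only, not NATTEN's API or kernels; the function name and the exact border-handling conventions are assumptions based on the description in this list.

```python
import torch

def naive_na1d(q, k, v, kernel_size, causal=False):
    # Reference (non-fused) 1D neighborhood attention over one axis.
    # q, k, v: [B, L, heads, dim] -- illustrative sketch, not NATTEN's implementation.
    B, L, heads, dim = q.shape
    assert L >= kernel_size
    scale = dim ** -0.5
    out = torch.empty_like(q)
    for i in range(L):
        if causal:
            # Causal: attend only to token i and up to kernel_size - 1 tokens before it.
            start, end = max(i - kernel_size + 1, 0), i + 1
        else:
            # Non-causal: a window of exactly kernel_size tokens centered on i,
            # shifted (not shrunk) near the borders.
            start = min(max(i - kernel_size // 2, 0), L - kernel_size)
            end = start + kernel_size
        attn = torch.einsum("bhd,bjhd->bhj", q[:, i] * scale, k[:, start:end])
        out[:, i] = torch.einsum("bhj,bjhd->bhd", attn.softmax(dim=-1), v[:, start:end])
    return out

# Per-axis parameters in 2D/3D mean each axis gets its own kernel size, dilation,
# and causal flag, e.g. causal along time but non-causal along height and width.
```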
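
And a small sketch of why the `[B, *, heads, dim]` layout can eventually drop the permute following the QKV projection (shapes are illustrative; a fused QKV projection over a 2D feature map of size H x W is assumed):

```python
import torch

B, H, W, heads, dim = 2, 14, 14, 4, 32
# Output of a fused QKV projection on a 2D feature map.
x = torch.randn(B, H, W, 3 * heads * dim)

# Existing (non-fused) kernels expect [B, heads, H, W, dim], which requires a permute:
q, k, v = x.view(B, H, W, 3, heads, dim).permute(3, 0, 4, 1, 2, 5).unbind(0)
# each: [B, heads, H, W, dim]

# FNA's [B, H, W, heads, dim] layout only needs a view -- no permute / explicit reshape:
q, k, v = x.view(B, H, W, 3, heads, dim).unbind(3)
# each: [B, H, W, heads, dim]
```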