- Torch 2.0 support
- Optional RPB.
- Python 3.10 and 3.11 wheels!
- Only for supported torch versions.
- Support torch 1.13.
- Tiled NA2D for 3x3 kernels.
- Minor changes to the setup script to fix
pip install natten
.
- CPU support!
- CPP backend for CPU computation.
- CPU-only builds now supported.
- Note we only have naive kernels for CPU at the moment. Feel free to open a PR!
- Refactored the CPP/CUDA backend.
- Unit tests for NA1D and NA2D
- Gradcheck tests in slow and fast mode
- Gradcheck tests for CPU backend
- Allclose tests between CPU and CUDA outputs and gradients
- NATTEN is now available through PyPI and comes with pre-compiled wheels.
- Wheel links and more information available on the project: shi-labs.com/natten.
- NATTEN now exists as a separate repository: SHI-Labs/NATTEN.
- Refactored the CPP/CUDA backend.
- Dropped LegacyNeighborhoodAttention (pure torch implementation) in favor of the upcoming CPU implementation.
- Added dilation support to all kernels, including tiled NA kernels.
- Renamed
NeighborhoodAttention
toNeighborhoodAttention2D
in the python interface.NeighborhoodAttention
is now deprecated and will be removed in future versions.
- Renamed
NeighborhoodAttention
toNeighborhoodAttention2D
in the python interface.NeighborhoodAttention
is now deprecated and will be removed in future versions.
- Fixed the race condition in K-Backward and V-Backward kernels.
- This was handled previously with atomic adds (non-deterministic).
- Now the kernels compute inverse neighbors and compute gradients without a race condition.
- "Tiled" Neighborhood Attention kernels for 5x5, 7x7, 9x9, 11x11, and 13x13 window sizes.
- Applies only to QKRPB-Forward and A-Backward.
- Only supports dim per head = 32 for now.
- Try to keep your channels a multiple of 32.
- Improved FP16 support.
- Specific kernels for FP16 that use
half2
addressing. - Different threads per block for FP16.
- Specific kernels for FP16 that use
- New 1D NA kernels
- Slightly more efficient.
- With FP16 support.
- Arbitrary kernel sizes supported (there's no upper bound of 13x13 like the 2D version).
- Window size templating, and some other common bits are factored out into a
commons
header file.- Makes extending support and editing easier.
- Gradchecks now run in fast mode (faster and less memory usage) unless instructed otherwise.
- 1D Neighborhood Attention
- Classification and downstream kernels combined.
- Refactored cuda extension to
natten/
. - Minor speed improvements.
Initial public release.