Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.0.27] - TBD

Added

fMHA: PagedBlockDiagonalGappyKeysMask
fMHA: heterogeneous queries in triton_splitk
fMHA: support for paged attention in flash
backwards pass for merge_attentions
fMHA: Added torch.compile support for 2 biases (LowerTriangularMask and LowerTriangularMaskWithTensorBias)
fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
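As a rough illustration of the Straight Through Estimator entry above, the sketch below applies a mask in the forward pass and lets the gradient flow through unchanged in the backward pass, with an optional rescale for kept and masked-out values. It is plain Python, not the xformers API: the function names and the `scale_kept` / `scale_masked` parameters are hypothetical and chosen only for this example.

```python
def ste_forward(x, mask):
    """Forward pass: keep entries where mask is True, zero out the rest."""
    return [xi if keep else 0.0 for xi, keep in zip(x, mask)]


def ste_backward(grad_out, mask, scale_kept=1.0, scale_masked=1.0):
    """Backward pass: pass the gradient straight through the masking op.

    scale_kept / scale_masked are hypothetical knobs that rescale the
    gradient of kept and masked-out entries respectively.
    """
    return [
        g * (scale_kept if keep else scale_masked)
        for g, keep in zip(grad_out, mask)
    ]


x = [0.5, -2.0, 0.1, 3.0]
mask = [False, True, False, True]

y = ste_forward(x, mask)
# Masked-out entries still receive a gradient, here scaled by 0.5.
grad_x = ste_backward([1.0, 1.0, 1.0, 1.0], mask, scale_masked=0.5)
```

Without the straight-through rule, the masked-out entries would receive a zero gradient and could never re-enter the kept set during training.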

Improved

fMHA: Fixed out-of-bounds reading for Split-K triton implementation
Profiler: fix bug with modules that take a single tuple as argument
Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

Removed support for PyTorch versions older than 2.2.0

[0.0.26] - 2024-04-29

Added

[2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
[2:4 sparsity] sparsify24_like now supports the cuSparseLt backend, and the STE gradient
Basic support for torch.compile for the memory_efficient_attention operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.

Improved

merge_attentions no longer needs inputs to be stacked.
fMHA: triton_splitk now supports additive bias
fMHA: benchmark cleanup

[0.0.25.post1] - 2024-03-29

Pre-built binary wheels require PyTorch 2.2.2

[0.0.25] - 2024-03-14

Pre-built binary wheels require PyTorch 2.2.1

Added

New merge_attentions function
fMHA: New gappy attention biases.

Improved

fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
fMHA: triton_splitk changed and expanded. Now amalgamates using LSE. Can autotune, supports causal with a small number of queries - not just 1. Experimental support for paged attention.
rope_padded: Fixed CUDA error with many queries (more than 65k)
rmsnorm: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)
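The merge_attentions and split-K entries above rely on combining partial attention results through their log-sum-exp (LSE). The sketch below shows the underlying arithmetic for a single query in plain Python. It is not the xformers implementation: `partial_attention` and `merge_partials` are hypothetical helper names, and the usual 1/sqrt(d) scaling is left out for brevity.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a chunk of keys.

    Returns (output, lse), where lse is the log-sum-exp of the scores.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(partials):
    """Combine (output, lse) pairs computed on disjoint key chunks."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    # Each chunk is weighted by exp(lse), i.e. its share of the softmax mass.
    weights = [math.exp(lse - m) for lse in lses]
    total = sum(weights)
    dim = len(partials[0][0])
    merged = [
        sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
        for d in range(dim)
    ]
    return merged, m + math.log(total)


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full_out, full_lse = partial_attention(q, keys, values)
merged_out, merged_lse = merge_partials([
    partial_attention(q, keys[:2], values[:2]),
    partial_attention(q, keys[2:], values[2:]),
])
```

Up to floating-point rounding, the merged result matches attention computed over all keys at once, which is what allows the key dimension to be split across blocks.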

Removed

fMHA: Removed triton operator (fmha.triton.*, xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp, xformers.ops.TritonFlashAttentionOp), as it has correctness issues under some conditions, and is slower than other implementations.

[0.0.24] - 2024-01-31

Pre-built binary wheels require PyTorch 2.2.0

Added

Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free.
Added kernels for training models with 2:4-sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see xformers.ops.sparsify24.
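For readers unfamiliar with the 2:4 pattern mentioned above: at most 2 values out of every 4 consecutive ones are non-zero. The sketch below is a plain-Python illustration for one row, assuming the simplest selection rule (keep the two largest magnitudes in each group). The actual xformers kernel may choose the kept values differently, and `sparsify_2_4` is a hypothetical name.

```python
def sparsify_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4, zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


dense = [0.1, -3.0, 2.0, 0.5, 4.0, 0.0, -0.2, 1.0]
sparse = sparsify_2_4(dense)
```

Each group of four in `sparse` holds at most two non-zero entries, which is the layout that sparse tensor cores can exploit.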

Improved

Make selective activation checkpointing be compatible with torch.compile.

Removed

Triton kernels now require a GPU with compute capability of at least 8.0 (A100 or newer). This is due to newer versions of triton not supporting older GPUs correctly.
Removed support for PyTorch versions older than 2.1.0

[0.0.23] - 2023-12-05

Pre-built binary wheels require PyTorch 2.1.1 (xFormers 0.0.23) or PyTorch 2.1.2 (xFormers 0.0.23.post1).

Fixed

fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with length%64 == 1
fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

fMHA: Added LocalAttentionFromBottomRightMask (local)
fMHA: Added LowerTriangularFromBottomRightMask (causal)
fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
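The "from bottom right" causal masks above differ from a standard lower-triangular mask only when the number of queries and keys differ. The sketch below builds both variants as boolean matrices in plain Python, assuming that "bottom right" means the diagonal is shifted so that the last query can attend to the last key. The function names are hypothetical and this is not the xformers API.

```python
def causal_from_top_left(num_queries, num_keys):
    """Standard causal mask: query i may attend to keys 0..i."""
    return [[j <= i for j in range(num_keys)] for i in range(num_queries)]


def causal_from_bottom_right(num_queries, num_keys):
    """Causal mask aligned so the last query sees the last key."""
    shift = num_keys - num_queries
    return [[j <= i + shift for j in range(num_keys)] for i in range(num_queries)]


# 2 new queries attending to 4 keys (e.g. 2 cached tokens plus the 2 new ones).
top_left = causal_from_top_left(2, 4)
bottom_right = causal_from_bottom_right(2, 4)

for row in bottom_right:
    print("".join("x" if allowed else "." for allowed in row))
```

With the bottom-right alignment the last query can see every key, which is the natural choice when the queries are the most recent tokens of a longer sequence.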

Removed

Removed xformers.triton.sum_strided

[0.0.22] - 2023-09-27

Fixed

fMHA: Backward pass now works in PyTorch deterministic mode (although slower)

Added

fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention, see the documentation for more details
fMHA: Added experimental support for Local Attention biases to memory_efficient_attention
Added an example of efficient LLaMa decoding using xformers operators
Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
Added an efficient rope implementation in triton, to be used in LLM decoding
Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
xformers.info now indicates the Flash-Attention version used

Removed

fMHA: Removed smallK backend support for CPU. memory_efficient_attention only works for CUDA/GPU tensors now
DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue facebookresearch#848)

[0.0.21] - 2023-08-18

Improved

fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available

Bug fixes

fMHA/cutlass: Fix potential race condition in the FW/BW passes
fMHA/cutlass: Fix attn_bias stride overflow for very long sequences (>32k)
LowerTriangularMask is now backward compatible with older xformers versions

Breaking changes

memory_efficient_attention now expects the attn_bias argument to have a head dimension
memory_efficient_attention no longer broadcasts the batch/head dimensions of attn_bias. Please use .expand if you need to broadcast the bias
Remove causal_diagonal argument from BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

Binary wheels on pypi/conda now contain H100 kernels
fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery

NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

[0.0.20] - 2023-05-23

Improved

fMHA/cutlass (backward): Massive performance improvements when batch_size * num_heads is low (10x+)
fMHA/cutlass: Further performance improvements for both the forward & backward kernels
fMHA (backward): Now dispatching to cutlass when embed_dim>64
fMHA: Updated Flash-Attention to v1.0.5

Added

fMHA now runs on H100 (support is experimental)

[0.0.19] - 2023-04-28

Added

Display nvcc version used to compile xformers in python -m xformers.info

Fixed

Fixed performance regression with nvcc>11.6 (facebookresearch#712)
fMHA/cutlass: Fixed nan in the output when using a torch.Tensor with -inf prefixes as attn_bias (facebookresearch#722)
fMHA/cutlass: Fixed nan in the output when the sequence length is larger than 2 ** 15 (facebookresearch#719)
fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
fMHA/cutlass: The kernels are now deterministic
fMHA/cutlass: Fixed backward pass correctness when using dropout (facebookresearch#724)

[0.0.18] - 2023-03-31

Added

Added xformers.ops.index_select_cat and xformers.ops.scaled_index_add - those are experimental functions that only work with a few shapes, and can be used to write efficient stochastic depth in transformer architectures for instance

Fixed

fMHA: memory_efficient_attention now accepts torch.Tensor as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see facebookresearch#683)

[0.0.17] - 2023-03-28

Fixed

fMHA: Fixed BW pass on Sm86/Sm89 GPUs when K > 64 (RTX 3090, RTX 4090, A6000, ..) [facebookresearch#631]

Added

fMHA/CUTLASS: Added tensor attn bias support [facebookresearch#587] - contribution from @jfc4050
fMHA/CUTLASS: Added tensor attn bias grad support [facebookresearch#587] - contribution from @jfc4050
fMHA/CUTLASS: Added dropout support [facebookresearch#587] - contribution from @jfc4050
fMHA: Added support for varying sequence lengths [facebookresearch#500]

[0.0.16] - 2023-01-31

Fixed

Updated triton dependency [facebookresearch#418]
Stripped lineinfo from binaries, reducing the binary size [facebookresearch#549]
Added support for pip wheels [facebookresearch#588, facebookresearch#573, facebookresearch#534, facebookresearch#523, ...] big thanks to @AbdBarho!
Fixed compatibility with Python 3.7 [facebookresearch#541] - thanks to @susumuota
fMHA: Fixed strides for QKV gradients for cutlass attention [facebookresearch#535]
fMHA: Stricter inputs validation to avoid CUDA errors for unsupported inputs [facebookresearch#592]
fMHA/Flash-Attention: Updated to https://github.com/HazyResearch/flash-attention/commit/a1f49a2b92b6fa022379bbebafed9d7f5e96a675 with multiple changes from @TriDao that make the operator up to 20% faster
fMHA/Flash-Attention: Fixed backward pass wrapper, where non-contiguous gradients could give the wrong result [facebookresearch#548]
fMHA: Separate each operator into forward and backward operators. It's now possible to use any combination of forward+backward (for instance Triton forward and Flash-Attention backward) [facebookresearch#560]

Added

fMHA: Added Triton operator for forward pass from Flash-Attention authored by @TriDao, will be automatically used on A100 when compatible
fMHA: Added xformers.ops.memory_efficient_attention_forward, xformers.ops.memory_efficient_attention_forward_requires_grad, xformers.ops.memory_efficient_attention_backward for power-users who write custom autograd functions [facebookresearch#560]
fMHA: Support for custom scaling for the CUTLASS-based kernel [facebookresearch#530] - contribution from @comaniac

[0.0.15] - Skipped

[0.0.14] - 2022-11-10

Fixed

fMHA/CUTLASS: The current CUDA stream is now used by the kernel [facebookresearch#491]
fMHA/CUTLASS: Improve overall performance

Added

SwiGLU: Added xformers.ops.SwiGLU and its functional counterpart (xformers.ops.swiglu) [facebookresearch#490]
fMHA: Possible to combine CUTLASS's forward with flash-attention's backward pass [facebookresearch#469] - improves performance on A100 for K = 128
fMHA: Add custom xformers.ops.unbind operator to avoid a cat in the attention block [facebookresearch#458]
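For context on the SwiGLU entry above, the sketch below spells out the usual SwiGLU feed-forward computation for a single input vector in plain Python: two linear projections, a SiLU gate applied to the first, an elementwise product, then an output projection. It is a reference for the math only. `swiglu_reference` and its argument layout are assumptions made for this example and do not describe the fused xformers operator.

```python
import math


def linear(x, weight, bias):
    """y = weight @ x + bias for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_reference(x, w1, b1, w2, b2, w3, b3):
    gate = [silu(v) for v in linear(x, w1, b1)]    # gated branch
    value = linear(x, w2, b2)                      # linear branch
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise product
    return linear(hidden, w3, b3)                  # output projection


x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0, 1.0]
w3 = [[1.0, 1.0, 1.0]]
b3 = [0.0]

y = swiglu_reference(x, w1, b1, w2, b2, w3, b3)
```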

[0.0.13] - 2022-09-26

Added

fMHA: Added CUTLASS-based kernel for xformers.ops.memory_efficient_attention. This kernel is automatically used depending on the inputs, and works on any GPU after P100 [facebookresearch#362]

[0.0.12] - 2022-08-08

Fixed

Removed duplicated biases in the FusedMLP layers [facebookresearch#317]
Rotary embeddings respecting input types [facebookresearch#326]
Poolformer style instantiating useless projection layers [facebookresearch#349]
Fix layer position not being properly tracked, causing extra layernorms for programmatic xformers [facebookresearch#348]
Pass use_triton flag to LayerNorm module [facebookresearch#336]

Added

Four blocksparsity layouts from DeepSpeed [facebookresearch#320]
Support several initialization options [facebookresearch#312]
Conv2DFeedforward feedforward part [facebookresearch#321]
VisualAttention [facebookresearch#329]
Automatic blocksparse for causal attention [facebookresearch#334]
Better hierarchical transformer generation [facebookresearch#345]
Fused operations with AOTAutograd/NVFuser, integration into MLP [facebookresearch#357]
Refactor LRA code to use Pytorch Lightning [facebookresearch#343]

[0.0.11] - 2022-05-30

Fixed

Fix some torchscriptability [facebookresearch#246]
Fix FourierMix being compatible with AMP [facebookresearch#258]
Better asserts on QKV dimensions [facebookresearch#264]
Better perfs for FusedMLP and FusedLinearLayer [facebookresearch#283]
Deepnorm init missing self-attention [facebookresearch#284]

Added

Simplicial Embeddings [facebookresearch#259]
Mem efficient attention, FW pass [facebookresearch#267]
MHA benchmark
MLP benchmark
Move all triton kernels to triton v2 [facebookresearch#272]
Mem efficient attention, BW pass [facebookresearch#281]
Metaformer support [facebookresearch#294]

[0.0.10] - 2022-03-14

Fixed

Expose bias flag for feedforwards, same default as Timm [facebookresearch#220]
Update eps value for layernorm, same default as torch [facebookresearch#221]
PreNorm bugfix, only one input was normalized [facebookresearch#233]
Fix bug where embedding dimensions that did not match model dim would lead to a crash [facebookresearch#244]

Added

Add DeepNet (DeepNorm) residual path and init [facebookresearch#227]

[0.0.9] - 2022-02-09

Added

Compositional Attention [facebookresearch#41]
Experimental Ragged attention [facebookresearch#189]
Mixture of Experts [facebookresearch#181]
BlockSparseTensor [facebookresearch#202]
Nd-tensor support for triton softmax [facebookresearch#210]

Fixed

Bugfix Favor, single feature map [facebookresearch#183]
Sanity check blocksparse settings [facebookresearch#207]
Fixed some picklability [facebookresearch#204]

[0.0.8] - 2022-01-07

Fixed

Much faster fused dropout [facebookresearch#164]
Fused dropout repeatability [facebookresearch#173]

Added

Embedding weight tying option [facebookresearch#172]

[0.0.7] - 2021-11-30

Fixed

Dropout setting not properly passed in many attentions [facebookresearch#123]

[0.0.6] - 2021-11-24

Fixed

Fix self attention optimization not being triggered, broken residual path [facebookresearch#119]
Improve speed by not using contiguous Tensors when not needed [facebookresearch#119]

Added

Attention mask wrapper [facebookresearch#113]
ViT comparison benchmark [facebookresearch#117]

[0.0.4] - 2021-11-16

Fixed

Homogenizing the masks, additive or bool [facebookresearch#79][facebookresearch#85][facebookresearch#86]
Fix causality flag not being respected [facebookresearch#103]
Enabling FusedLayerNorm by default in the factory if Triton is available
Fixing Favor with fp16
Fixing Favor trainability
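To illustrate how additive and boolean masks relate (see the mask homogenization entry above), the sketch below converts a boolean mask into an additive one in plain Python. It assumes the convention that True means "may attend", which maps to adding 0, while False maps to adding -inf before the softmax. The helper names are hypothetical.

```python
import math


def bool_to_additive(mask):
    """Map True -> 0.0 (keep) and False -> -inf (mask out)."""
    return [[0.0 if keep else -math.inf for keep in row] for row in mask]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


bool_mask = [[True, False, False], [True, True, False]]
additive = bool_to_additive(bool_mask)

# Attention probabilities for the second query: the masked key gets weight 0.
scores = [0.3, 1.2, -0.7]
probs = softmax([s + a for s, a in zip(scores, additive[1])])
```

Because both forms end up as a term added to the attention scores, a single code path can handle either kind of mask once the boolean one is converted.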

Added

Fused dropout/bias/activation layer [facebookresearch#58]
Fused layernorm used by default in the factory [facebookresearch#92]

[0.0.3] - 2021-11-01

Fixed

Nystrom causal attention [facebookresearch#75]

[0.0.2] - 2021-11-01

Fixed

More robust blocksparse [facebookresearch#24]

Added

Rotary embeddings [facebookresearch#32]
More flexible layernorm [facebookresearch#50]
