Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.0.27] - TBD

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 2 biases (LowerTriangularMask and LowerTriangularMaskWithTensorBias)
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: fix bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch versions older than 2.2.0

[0.0.26] - 2024-04-29

Added

  • [2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend, and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup

[0.0.25.post1] - 2024-03-29

Pre-built binary wheels require PyTorch 2.2.2

[0.0.25] - 2024-03-14

Pre-built binary wheels require PyTorch 2.2.1

Added

  • New merge_attentions function
  • fMHA: New gappy attention biases.

Improved

  • fMHA: Updated Flash-Attention to v2.5.6: this has a performance improvement for multiquery.
  • fMHA: triton_splitk reworked and expanded. It now merges partial results using LSE, can autotune, and supports causal masking with a small number of queries (not just 1). Experimental support for paged attention.
  • rope_padded: Fixed CUDA error with many queries (more than 65k)
  • rmsnorm: Fixed CUDA error with large inputs (enables 512k+ sequence length on Llama2 70B)
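The LSE-based merging mentioned above for triton_splitk (and exposed through merge_attentions) rests on a standard identity: attention computed over disjoint key chunks can be recombined exactly if each chunk also returns the log-sum-exp (LSE) of its scores. The pure-Python sketch below illustrates only that math for a single query; the helper names attend and merge_partials are hypothetical and are not the xFormers API or its kernels.

```python
import math

def attend(q, keys, values):
    """Single-query softmax attention; returns (output, logsumexp of scores)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    weights = [math.exp(s - lse) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, lse

def merge_partials(partials):
    """Combine (output, lse) pairs computed over disjoint key chunks."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    total_lse = m + math.log(sum(math.exp(l - m) for l in lses))
    dim = len(partials[0][0])
    merged = [0.0] * dim
    for out, lse in partials:
        w = math.exp(lse - total_lse)  # share of the softmax mass held by this chunk
        for d in range(dim):
            merged[d] += w * out[d]
    return merged, total_lse

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Attention over all keys at once versus two chunks merged afterwards.
full_out, full_lse = attend(q, keys, values)
merged_out, merged_lse = merge_partials([
    attend(q, keys[:2], values[:2]),
    attend(q, keys[2:], values[2:]),
])
assert all(math.isclose(a, b) for a, b in zip(full_out, merged_out))
assert math.isclose(full_lse, merged_lse)
```

Because the merged result also carries its own LSE, the same step can be applied again to fold in further chunks.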

Removed

  • fMHA: Removed triton operator (fmha.triton.*, xformers.ops.MemoryEfficientAttentionTritonFwdFlashBwOp, xformers.ops.TritonFlashAttentionOp), as it has correctness issues under some conditions, and is slower than other implementations.

[0.0.24] - 2024-01-31

Pre-built binary wheels require PyTorch 2.2.0

Added

  • Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron Column&RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, thus making the communication effectively free. Read more
  • Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile, see xformers.ops.sparsify24.
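For readers unfamiliar with the 2:4 pattern, it means that at most two values in every contiguous group of four are non-zero. The sketch below shows the generic magnitude-based version of that pattern in plain Python. The function name sparsify_2_4 is hypothetical, and the selection rule is an assumption for illustration; the actual xformers.ops.sparsify24 kernel may choose which values to keep differently.

```python
def sparsify_2_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out

dense = [0.1, -2.0, 0.5, 3.0, 1.0, 1.5, -0.2, 0.0]
print(sparsify_2_4(dense))
# prints [0.0, -2.0, 0.0, 3.0, 1.0, 1.5, 0.0, 0.0]
```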

Improved

  • Made selective activation checkpointing compatible with torch.compile.

Removed

  • Triton kernels now require a GPU with compute capability 8.0 or higher (A100 or newer). This is because newer versions of Triton do not support older GPUs correctly
  • Removed support for PyTorch versions older than 2.1.0

[0.0.23] - 2023-12-05

Pre-built binary wheels require PyTorch 2.1.1 (xFormers 0.0.23) or PyTorch 2.1.2 (xFormers 0.0.23.post1).

Fixed

  • fMHA: Fixed a bug in cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the BW pass. This would happen with MQA when one sequence has a query with length%64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal)
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
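The "from bottom right" naming refers to how the causal triangle is aligned when the number of queries differs from the number of keys: the mask is shifted so that the last query can attend to the last key. The sketch below illustrates that alignment in plain Python, assuming this reading of the semantics; causal_from_bottom_right is a hypothetical helper, not an xFormers function, and the fmha.attn_bias documentation remains the reference for exact behavior.

```python
def causal_from_bottom_right(num_queries, num_keys):
    """Boolean mask: True where query i may attend to key j."""
    # Shift the diagonal so the last query lines up with the last key.
    shift = num_keys - num_queries
    return [[j <= i + shift for j in range(num_keys)] for i in range(num_queries)]

for row in causal_from_bottom_right(2, 4):
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# xxx.
# xxxx
```

With equal numbers of queries and keys the shift is zero and the result is the ordinary lower-triangular causal mask.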

Removed

  • Removed xformers.triton.sum_strided

[0.0.22] - 2023-09-27

Fixed

  • fMHA: Backward pass now works in PyTorch deterministic mode (although slower)

Added

  • fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention, see the documentation for more details
  • fMHA: Added experimental support for Local Attention biases to memory_efficient_attention
  • Added an example of efficient LLaMa decoding using xformers operators
  • Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
  • Added an efficient rope implementation in triton, to be used in LLM decoding
  • Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
  • xformers.info now indicates the Flash-Attention version used

Removed

  • fMHA: Removed smallK backend support for CPU. memory_efficient_attention only works for CUDA/GPU tensors now
  • DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue facebookresearch#848)

[0.0.21] - 2023-08-18

Improved

  • fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available

Bug fixes

  • fMHA/cutlass: Fix potential race condition in the FW/BW passes
  • fMHA/cutlass: Fix attn_bias stride overflow for very long sequences (>32k)
  • LowerTriangularMask is now backward compatible with older xformers versions

Breaking changes

  • memory_efficient_attention now expects the attn_bias argument to have a head dimension
  • memory_efficient_attention no longer broadcasts the batch/head dimensions of attn_bias. Please use .expand if you need to broadcast the bias
  • Remove causal_diagonal argument from BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • Binary wheels on pypi/conda now contain H100 kernels
  • fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery

NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

[0.0.20] - 2023-05-23

Improved

  • fMHA/cutlass (backward): Massive performance improvements when batch_size * num_heads is low (10x+)
  • fMHA/cutlass: Further performance improvements for both the forward & backward kernels
  • fMHA (backward): Now dispatching to cutlass when embed_dim>64
  • fMHA: Updated Flash-Attention to v1.0.5

Added

  • fMHA now runs on H100 (support is experimental)

[0.0.19] - 2023-04-28

Added

  • Display nvcc version used to compile xformers in python -m xformers.info

Fixed

  • Fixed performance regression with nvcc>11.6 (facebookresearch#712)
  • fMHA/cutlass: Fixed nan in the output when using a torch.Tensor with -inf prefixes as attn_bias (facebookresearch#722)
  • fMHA/cutlass: Fixed nan in the output when the sequence length is larger than 2 ** 15 (facebookresearch#719)
  • fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
  • fMHA/cutlass: The kernels are now deterministic
  • fMHA/cutlass: Fixed backward pass correctness when using dropout (facebookresearch#724)

[0.0.18] - 2023-03-31

Added

  • Added xformers.ops.index_select_cat and xformers.ops.scaled_index_add. These are experimental functions that only work with a few shapes and can be used, for instance, to write efficient stochastic depth in transformer architectures

Fixed

  • fMHA: memory_efficient_attention now accepts torch.Tensor as attention bias for any seqlen, although there are still requirements on the alignment of the bias tensor (see facebookresearch#683)

[0.0.17] - 2023-03-28

Fixed

  • fMHA: Fixed BW pass on Sm86/Sm89 GPUs when K > 64 (RTX 3090, RTX 4090, A6000, ..) [facebookresearch#631]

Added

[0.0.16] - 2023-01-31

Fixed

Added

[0.0.15] - Skipped

[0.0.14] - 2022-11-10

Fixed

  • fMHA/CUTLASS: The current CUDA stream is now used by the kernel [facebookresearch#491]
  • fMHA/CUTLASS: Improve overall performance

Added

  • SwiGLU: Added xformers.ops.SwiGLU and its functional counterpart (xformers.ops.swiglu) [facebookresearch#490]
  • fMHA: Possible to combine CUTLASS's forward with flash-attention's backward pass [facebookresearch#469] - improves performance on A100 for K = 128
  • fMHA: Add custom xformers.ops.unbind operator to avoid a cat in the attention block [facebookresearch#458]

[0.0.13] - 2022-09-26

Added

  • fMHA: Added CUTLASS-based kernel for xformers.ops.memory_efficient_attention. This kernel is selected automatically depending on the inputs, and works on any GPU after P100 [facebookresearch#362]

[0.0.12] - 2022-08-08

Fixed

Added

[0.0.11] - 2022-05-30

Fixed

Added

[0.0.10] - 2022-03-14

Fixed

Added

[0.0.9] - 2022-02-09

Added

Fixed

[0.0.8] - 2022-01-07

Fixed

Added

[0.0.7] - 2021-11-30

Fixed

[0.0.6] - 2021-11-24

Fixed

Added

[0.0.4] - 2021-11-16

Fixed

Added

[0.0.3] - 2021-11-01

Fixed

[0.0.2] - 2021-11-01

Fixed

Added