MoBA: Mixture of Block Attention for Long-Context LLMs

Full Report

🚀 Introducing MoBA --- Mixture of Block Attention

  • Trainable Block Sparse Attention: The full context is divided into blocks, and each query token learns to attend to the most relevant KV blocks, enabling efficient processing of long sequences.
  • Parameter-less Gating Mechanism: A novel parameter-less top-k gating mechanism selects the most relevant blocks for each query token, ensuring that the model focuses only on the most informative blocks (see the sketch after this list).
  • Seamless Transition between Full and Sparse Attention: MoBA is designed as a flexible substitute for full attention, allowing seamless transitions between full and sparse attention modes.
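The gating step is simple enough to sketch in a few lines. Below is a minimal, self-contained PyTorch illustration of block top-k selection, not the repository's optimized kernel: each KV block is summarized by its mean-pooled key, every query scores the blocks, and only the top-k blocks are kept. The real method additionally enforces causality (a query always attends to its own current block and never to future blocks), which this sketch omits.

import torch

def select_topk_blocks(q, k, block_size=4, top_k=2):
    """Return a boolean mask [num_q, num_blocks] marking which KV blocks
    each query token attends to (illustrative sketch only)."""
    num_q, dim = q.shape
    num_kv = k.shape[0]
    assert num_kv % block_size == 0, "sketch assumes the context splits into full blocks"
    num_blocks = num_kv // block_size

    # One representative per KV block: the mean-pooled key (no learned parameters).
    block_repr = k.view(num_blocks, block_size, dim).mean(dim=1)

    # Affinity between every query token and every block representative.
    scores = q @ block_repr.T                      # [num_q, num_blocks]

    # Each query keeps only its top-k highest-scoring blocks.
    topk_idx = scores.topk(top_k, dim=-1).indices  # [num_q, top_k]
    mask = torch.zeros(num_q, num_blocks, dtype=torch.bool)
    mask[torch.arange(num_q).unsqueeze(1), topk_idx] = True
    return mask

if __name__ == "__main__":
    q = torch.randn(8, 16)   # 8 query tokens, head dim 16
    k = torch.randn(32, 16)  # 32 key tokens -> 8 blocks of size 4
    print(select_topk_blocks(q, k, block_size=4, top_k=2))

Regular attention is then computed only over the selected blocks, which is where the efficiency gain over full attention comes from.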

Abstract

Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the “less structure” principle, allowing the model to autonomously determine where to attend, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at MoonshotAI/MoBA.

Evaluation with 1M context length

Environment Setup

Note that the current kernel implementation relies on flash-attn==2.6.3 and torch >= 2.1.0.

conda create -n moba python=3.10
conda activate moba
pip install .

Quick Start

We provide a transformers-friendly implementation of MoBA.

Choose the attention backend with the --attn flag: either moba or moba_naive.

python3 examples/llama.py --model meta-llama/Llama-3.1-8B --attn moba
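For reference, the snippet below sketches how the transformers-friendly path above might be wired up directly in Python. The helper names register_moba and MoBAConfig and the parameters moba_chunk_size and moba_topk are assumptions here, not a documented API; consult examples/llama.py for the actual entry points and defaults.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from moba import register_moba, MoBAConfig  # assumed API; see examples/llama.py

# Register the MoBA attention backend before loading the model (assumed call;
# block size and top-k values are illustrative).
register_moba(MoBAConfig(moba_chunk_size=4096, moba_topk=12))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="moba",  # or "moba_naive"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

inputs = tokenizer("MoBA keeps long-context attention sparse.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))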

Unit Tests

pytest tests/test_moba_attn.py

Citation

If you find MoBA useful or want to use it in your projects, please cite our paper:

@article{MoonshotMoBA,
  author = {Lu, Enzhe and Jiang, Zhejun and Liu, Jingyuan and Du, Yulun and Jiang, Tao and Hong, Chao and Liu, Shaowei and He, Weiran and Yuan, Enming and Wang, Yuzhi and Huang, Zhiqi and Yuan, Huan and Xu, Suting and Xu, Xinran and Lai, Guokun and Chen, Yanru and Zheng, Huabin and Yan, Junjie and Su, Jianlin and Wu, Yuxin and Zhang, Neo Y. and Yang, Zhilin and Zhou, Xinyu and Zhang, Mingxing and Qiu, Jiezhong},
  title = {MoBA: Mixture of Block Attention for Long-Context LLMs},
  year = {2025},
}
