State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform Transformers on long-context understanding tasks. To address this shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels become the key bottleneck as the input context lengthens: when the input length far exceeds the training sequence length, global channels fail to adaptively extend their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing unimportant tokens from accumulating in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering so that only those critical tokens are accumulated. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training.
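To make the token-filtering idea concrete, below is a minimal, illustrative sketch of the recurrence for a single hidden channel. The function name `filtered_scan`, the scalar single-channel form, and the `keep_mask` importance criterion are our own illustrative assumptions, not the paper's implementation; in LongMamba the filtering is applied only to the identified global channels and is presumably integrated with Mamba's efficient scan kernels rather than a Python loop.

```python
import torch

def filtered_scan(a, bx, keep_mask):
    """Recurrence for one hidden channel with token filtering (illustrative sketch).

    a:         (T,) per-token decay factors in (0, 1], e.g. exp(delta_t * A) with A < 0
    bx:        (T,) per-token state inputs, e.g. delta_t * B_t * x_t
    keep_mask: (T,) bool, True for tokens judged critical for this channel
    """
    h = torch.zeros(())               # scalar hidden state for this channel
    outputs = []
    for t in range(a.shape[0]):
        if keep_mask[t]:
            h = a[t] * h + bx[t]      # standard SSM recurrence for critical tokens
        # else: skip the update entirely, so unimportant tokens neither
        # decay the state nor write to it, preserving earlier memory
        outputs.append(h.clone())
    return torch.stack(outputs)       # (T,) per-step hidden states

# Toy usage with an arbitrary placeholder importance criterion
# (not the paper's actual token-selection rule):
T = 8
a = torch.full((T,), 0.9)
bx = torch.randn(T)
keep = bx.abs() >= bx.abs().topk(4).values.min()  # keep the 4 largest-magnitude inputs
states = filtered_scan(a, bx, keep)
```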
Figure 1: Visualization of the attention maps (log scale) of 5 sampled channels in the Mamba-130M model under (a) the training sequence length (2,000 tokens) and (b) an extended sequence length (16,000 tokens).
🔍 Finding 1: Hidden state channels in SSMs can be categorized into two classes based on their receptive field lengths at the training sequence length (Fig. 1(a)): local channels, whose receptive fields cover only nearby tokens, and global channels, whose receptive fields span (nearly) the entire training sequence (one illustrative way to compute this split is sketched after Finding 2).
🔍 Finding 2: Global channels struggle to maintain global attention coverage (i.e., a global receptive field) at extended context lengths, as shown in Fig. 1(b), which can limit SSMs' ability to capture and understand long-context information.
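As a rough illustration of how such a local/global split could be computed, the sketch below labels a channel as global if information injected at the first token still survives after a training-length window of decays. The function name `classify_channels`, the calibration-based `decay` input, and the `eps` threshold are illustrative assumptions rather than the paper's exact criterion.

```python
import torch

def classify_channels(decay, train_len, eps=1e-3):
    """Label each hidden channel as global (True) or local (False).

    decay:     (T, C) per-token, per-channel decay factors in (0, 1],
               e.g. exp(delta_t * A_c) collected on a calibration sequence
    train_len: the model's training sequence length (e.g. 2,000 for Mamba-130M)
    eps:       survival threshold (illustrative choice)
    """
    # Cumulative decay measures how much of the first token's contribution
    # survives after each step within the training-length window.
    cum_decay = torch.cumprod(decay[:train_len], dim=0)   # (train_len, C)
    # Channels whose memory of the first token survives the whole window
    # have an (approximately) global receptive field.
    return cum_decay[-1] > eps                             # (C,) bool mask
```

Under this view, Finding 2 corresponds to the cumulative decay of global channels collapsing once the input length far exceeds `train_len`, which is precisely what LongMamba's token filtering counteracts.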
Figure 2: (a) The challenge of directly applying an SSM to a sequence (length denoted as \(S\)) longer than the training sequence length (denoted as \(L\)); (b) the proposed LongMamba framework, which enlarges the receptive fields of the global channels using the two-step pipeline detailed below.
Length | Method | S1 | S2 | S3 | MK1 | MV | MQ | VT | CWE | FWE | QA1 | QA2 | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16k | Vanilla | 30.00 | 11.00 | 7.00 | 6.00 | 8.00 | 0.00 | 0.20 | 0.10 | 13.67 | 0.00 | 1.00 | 7.00 |
16k | LongMamba | 79.00 | 92.00 | 31.00 | 23.00 | 58.00 | 49.25 | 0.20 | 2.30 | 0.67 | 1.00 | 11.00 | 31.58 |
24k | Vanilla | 43.00 | 9.00 | 0.00 | 8.00 | 7.50 | 0.25 | 0.00 | 0.00 | 1.67 | 1.00 | 7.00 | 7.04 |
24k | LongMamba | 56.00 | 79.00 | 29.00 | 25.00 | 20.50 | 18.00 | 7.00 | 0.30 | 1.67 | 2.00 | 6.00 | 22.22 |
32k | Vanilla | 26.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.80 | 0.10 | 1.00 | 0.00 | 1.00 | 2.72 |
32k | LongMamba | 34.00 | 73.00 | 16.00 | 17.00 | 0.80 | 0.50 | 4.00 | 0.20 | 0.67 | 2.00 | 4.00 | 13.83 |
Table 1: Per-task accuracy (%) of the vanilla and LongMamba-enhanced Zamba2-1.2B models under different sequence lengths on RULER.
Model | Method | PC | PR | GR | MN | MQA | QA | 2WM | HQA | SS | TR | TQA | LCC | RB | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mamba-1.4B | Vanilla | 1.41 | 4.59 | 7.41 | 8.19 | 7.97 | 3.91 | 7.40 | 4.43 | 4.72 | 15.67 | 17.29 | 16.39 | 9.45 | 8.37 |
Mamba-1.4B | DeciMamba | 0.80 | 5.44 | 9.17 | 10.39 | 8.82 | 4.01 | 6.99 | 5.39 | 8.60 | 14.33 | 29.70 | 39.02 | 31.32 | 13.38 |
Mamba-1.4B | LongMamba | 1.00 | 4.77 | 11.74 | 8.93 | 10.60 | 2.92 | 8.73 | 6.41 | 7.00 | 37.00 | 40.63 | 46.29 | 39.26 | 17.33 |
Mamba2-1.3B | Vanilla | 1.29 | 0.81 | 7.67 | 5.84 | 11.45 | 2.19 | 2.31 | 2.88 | 4.69 | 14.67 | 10.08 | 22.94 | 19.89 | 8.21 |
Mamba2-1.3B | LongMamba | 2.02 | 3.51 | 14.33 | 10.28 | 14.73 | 5.14 | 5.73 | 5.52 | 14.00 | 21.67 | 48.74 | 42.99 | 36.75 | 17.34 |
Zamba2-1.2B | Vanilla | 5.72 | 1.75 | 9.31 | 5.39 | 4.30 | 4.53 | 4.42 | 6.29 | 9.80 | 33.33 | 28.14 | 15.56 | 20.07 | 11.43 |
Zamba2-1.2B | LongMamba | 3.33 | 3.67 | 10.42 | 8.67 | 8.92 | 5.85 | 6.64 | 8.29 | 25.79 | 39.00 | 63.02 | 22.39 | 25.63 | 17.82 |
Table 2: Per-task accuracy (%) of the vanilla, DeciMamba-enhanced (Mamba-1.4B only), and LongMamba-enhanced Mamba-1.4B, Mamba2-1.3B, and Zamba2-1.2B models on LongBench-E.
📊 Effectiveness: As presented in Tab. 1, the proposed LongMamba framework consistently surpasses the vanilla Zamba2-1.2B model in average accuracy across all evaluated sequence lengths on the RULER benchmark, with improvements of up to 24.58% (from 7.00% to 31.58% at 16k tokens). This advantage is further corroborated in Tab. 2, where LongMamba raises average accuracy on LongBench-E by up to 9.13% (from 8.21% to 17.34% for Mamba2-1.3B).
🧩 General Applicability: Tab. 2 further highlights the broad applicability of LongMamba: the method improves long-context performance for both pure state space models (Mamba-1.4B and Mamba2-1.3B) and a hybrid Transformer-SSM model (Zamba2-1.2B), demonstrating that LongMamba generalizes across diverse model architectures.
⚡ Minimal Overhead: Despite its substantial gains in accuracy, LongMamba introduces only a marginal increase in inference latency (less than 4% averaged across 8k, 16k, 32k, and 64k sequence lengths). These results are obtained on an NVIDIA A100 GPU for three representative models: Mamba-1.4B, Mamba2-1.3B, and Zamba2-1.2B.
For a deeper dive into our methodology and full benchmarking results, please refer to our paper on arXiv.
@inproceedings{ye2025longmamba,
  title={LongMamba: Enhancing Mamba's Long-Context Capabilities via Training-Free Receptive Field Enlargement},
  author={Zhifan Ye and Kejing Xia and Yonggan Fu and Xin Dong and Jihoon Hong and Xiangchi Yuan and Shizhe Diao and Jan Kautz and Pavlo Molchanov and Yingyan Celine Lin},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=fMbLszVO1H}
}