LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement

1Georgia Institute of Technology, 2NVIDIA
ICLR 2025

*Indicates Equal Contribution

TL;DR: A training-free method for extending the context length of SSMs (State Space Models).
  • Effectiveness: Up to 24.58% average accuracy improvement on RULER;
  • Generic applicability: Applicable to both SSMs and hybrid Transformer-SSM models;
  • Minimal overhead: Less than 4% average latency overhead on A100.

Abstract

State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers on long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training.

Key Findings and Analysis


Figure 1: Visualization of the attention maps (log scale) of 5 sampled channels in the Mamba-130M model under (a) the training sequence length (2,000 tokens) and (b) an extended sequence length (16,000 tokens).

🔍 Finding 1: Hidden state channels in SSMs can be categorized into two classes, based on their receptive field lengths at training sequence length (Fig. 1(a)):

  • Local channels: Channels whose receptive fields are significantly shorter than the training sequence length, suggesting that they function like a convolution layer or a sliding-window attention that captures localized information.
  • Global channels: Channels whose receptive fields are comparable to the entire training sequence length, meaning that they learn to capture information from the whole sequence. (A sketch of how these per-channel receptive fields might be estimated is given right after this list.)
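
To make Finding 1 concrete, the snippet below sketches one way per-channel receptive field lengths could be estimated from a channel's per-token state-decay factors. The tensor layout, the visibility threshold, and the 90% coverage criterion for labeling a channel as global are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def estimate_receptive_fields(decay_log: torch.Tensor,
                              threshold: float = 1e-3) -> torch.Tensor:
    """Estimate per-channel receptive field lengths of an SSM layer.

    decay_log: (seq_len, num_channels) per-token log state-decay factors
        (e.g., delta_t * A in a Mamba layer); entries are <= 0.
    A past token counts as "visible" to the final position if its accumulated
    decay up to the end of the sequence still exceeds `threshold`.
    """
    # Suffix sum of log-decay: total decay applied from each position to the end.
    cum_decay = torch.flip(torch.cumsum(torch.flip(decay_log, dims=[0]), dim=0), dims=[0])
    visible = cum_decay.exp() > threshold        # (seq_len, num_channels) bool
    return visible.sum(dim=0)                    # receptive field length per channel

def split_channels(receptive_field: torch.Tensor,
                   train_len: int,
                   coverage: float = 0.9) -> torch.Tensor:
    """Label channels whose receptive field covers most of the training
    sequence length as global; all other channels are treated as local."""
    return receptive_field >= coverage * train_len
```

Channels flagged by split_channels correspond to the global channels in Fig. 1(a); the rest behave like the local channels described above.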

🔍 Finding 2: Global channels struggle to maintain global attention coverage (i.e., a global receptive field) at extended context lengths, as shown in Fig. 1(b), which limits SSMs' capability to capture and understand long-context information.

LongMamba: A Two-step Pipeline

Overview

Figure 2: (a) The challenge of directly applying an SSM to a sequence (length denoted as \(S\)) longer than the training sequence length (denoted as \(L\)); (b) the proposed LongMamba framework, where we enlarge the receptive fields of the global channels using the two-step pipeline detailed below.


In light of the above findings, which identify the limited receptive fields of the global channels as a major bottleneck for long-context understanding (as shown in Fig. 2(a)), the key idea of LongMamba is to mitigate memory decay in these global channels by preventing unimportant tokens from accumulating in their memory. As illustrated in Fig. 2(b), this is achieved through the following two-step pipeline:
  • Step 1: Identifying global channels within a target SSM. LongMamba first identifies the global channels, which are passed to the next step, while the remaining (local) channels are left untouched.
  • Step 2: Enlarging the receptive fields of the global channels. LongMamba selects critical tokens along the sequence and applies token filtering to remove less important tokens from the global channels' memory. As a result, the global channels' receptive fields are enlarged, allowing them to capture information from the entire sequence. A minimal code sketch of this pipeline is given below.
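
The sketch below illustrates how this two-step pipeline could be wired together in PyTorch. It reuses the per-channel log-decay tensor and global-channel mask from the earlier sketch and uses each token's decay magnitude as a stand-in importance score; the exact token-selection rule and state-update modification in LongMamba may differ, so treat this as an illustrative approximation rather than the official implementation.

```python
import torch

def enlarge_global_receptive_fields(decay_log: torch.Tensor,
                                    is_global: torch.Tensor,
                                    train_len: int) -> torch.Tensor:
    """Token filtering for global channels (illustrative sketch).

    decay_log: (seq_len, num_channels) per-token log state-decay factors.
    is_global: (num_channels,) bool mask from the channel-identification step.
    train_len: training sequence length L of the underlying SSM.
    """
    seq_len, _ = decay_log.shape
    filtered = decay_log.clone()
    if seq_len > train_len:
        # Rank tokens per global channel by decay magnitude (assumed importance
        # proxy) and keep only the top `train_len` of them.
        global_log = decay_log[:, is_global]                        # (S, G)
        topk = torch.topk(global_log.abs(), k=train_len, dim=0).indices
        keep = torch.zeros_like(global_log, dtype=torch.bool)
        keep.scatter_(0, topk, True)
        # Neutralize (log 1 = 0) the decay contributed by non-critical tokens,
        # so the accumulated memory decay stays within the range seen in training.
        filtered[:, is_global] = torch.where(keep, global_log,
                                             torch.zeros_like(global_log))
    return filtered
```

In a Mamba-style forward pass, these filtered log-decay values would replace the original ones for the global channels before the recurrent state update, while the local channels run unchanged.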

Evaluation Results

| Length | Method | S1 | S2 | S3 | MK1 | MV | MQ | VT | CWE | FWE | QA1 | QA2 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16k | Vanilla | 30.00 | 11.00 | 7.00 | 6.00 | 8.00 | 0.00 | 0.20 | 0.10 | 13.67 | 0.00 | 1.00 | 7.00 |
| 16k | LongMamba | 79.00 | 92.00 | 31.00 | 23.00 | 58.00 | 49.25 | 0.20 | 2.30 | 0.67 | 1.00 | 11.00 | 31.58 |
| 24k | Vanilla | 43.00 | 9.00 | 0.00 | 8.00 | 7.50 | 0.25 | 0.00 | 0.00 | 1.67 | 1.00 | 7.00 | 7.04 |
| 24k | LongMamba | 56.00 | 79.00 | 29.00 | 25.00 | 20.50 | 18.00 | 7.00 | 0.30 | 1.67 | 2.00 | 6.00 | 22.22 |
| 32k | Vanilla | 26.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.80 | 0.10 | 1.00 | 0.00 | 1.00 | 2.72 |
| 32k | LongMamba | 34.00 | 73.00 | 16.00 | 17.00 | 0.80 | 0.50 | 4.00 | 0.20 | 0.67 | 2.00 | 4.00 | 13.83 |

Table 1: Per-task accuracy (%) of LongMamba-enhanced and the vanilla Zamba2-1.2B model under different sequence lengths on RULER.

| Model | Method | PC | PR | GR | MN | MQA | QA | 2WM | HQA | SS | TR | TQA | LCC | RB | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mamba-1.4B | Vanilla | 1.41 | 4.59 | 7.41 | 8.19 | 7.97 | 3.91 | 7.40 | 4.43 | 4.72 | 15.67 | 17.29 | 16.39 | 9.45 | 8.37 |
| Mamba-1.4B | DeciMamba | 0.80 | 5.44 | 9.17 | 10.39 | 8.82 | 4.01 | 6.99 | 5.39 | 8.60 | 14.33 | 29.70 | 39.02 | 31.32 | 13.38 |
| Mamba-1.4B | LongMamba | 1.00 | 4.77 | 11.74 | 8.93 | 10.60 | 2.92 | 8.73 | 6.41 | 7.00 | 37.00 | 40.63 | 46.29 | 39.26 | 17.33 |
| Mamba2-1.3B | Vanilla | 1.29 | 0.81 | 7.67 | 5.84 | 11.45 | 2.19 | 2.31 | 2.88 | 4.69 | 14.67 | 10.08 | 22.94 | 19.89 | 8.21 |
| Mamba2-1.3B | LongMamba | 2.02 | 3.51 | 14.33 | 10.28 | 14.73 | 5.14 | 5.73 | 5.52 | 14.00 | 21.67 | 48.74 | 42.99 | 36.75 | 17.34 |
| Zamba2-1.2B | Vanilla | 5.72 | 1.75 | 9.31 | 5.39 | 4.30 | 4.53 | 4.42 | 6.29 | 9.80 | 33.33 | 28.14 | 15.56 | 20.07 | 11.43 |
| Zamba2-1.2B | LongMamba | 3.33 | 3.67 | 10.42 | 8.67 | 8.92 | 5.85 | 6.64 | 8.29 | 25.79 | 39.00 | 63.02 | 22.39 | 25.63 | 17.82 |

Table 2: Per-task accuracy (%) of LongMamba-enhanced and vanilla Mamba-1.4B, Mamba2-1.3B, and Zamba2-1.2B models (plus the DeciMamba baseline for Mamba-1.4B) on LongBench-E.

📊 Effectiveness: As presented in Tab. 1, the proposed LongMamba framework surpasses the vanilla Zamba2-1.2B model in average accuracy at every evaluated sequence length on the RULER benchmark, yielding up to a 24.58% improvement (31.58% vs. 7.00% at 16k tokens). This advantage is corroborated in Tab. 2, where LongMamba more than doubles the average accuracy of Mamba-1.4B (8.37% to 17.33%) and Mamba2-1.3B (8.21% to 17.34%) on the LongBench-E benchmark.

🧩 General Applicability: Tab. 2 further highlights the broad applicability of LongMamba. The method improves long-context performance on both pure state space models (Mamba-1.4B and Mamba2-1.3B) and a hybrid Transformer-SSM model (Zamba2-1.2B), demonstrating that LongMamba generalizes across diverse model architectures.

⚡ Minimal Overhead: Despite its substantial gains in accuracy, LongMamba introduces only a marginal increase in inference latency (less than 4% averaged across 8k, 16k, 32k, and 64k sequence lengths). These results are obtained on an NVIDIA A100 GPU for three representative models: Mamba-1.4B, Mamba2-1.3B, and Zamba2-1.2B.
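
As context, this kind of latency overhead is typically measured by timing full forward passes with and without the modification at each sequence length. The snippet below is a generic, hypothetical timing harness (not the paper's benchmarking script), assuming a CUDA-resident model and pre-tokenized inputs.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_ids, warmup: int = 3, iters: int = 10) -> float:
    """Average per-forward-pass latency in seconds on a CUDA device."""
    for _ in range(warmup):                 # warm up kernels and allocator caches
        model(input_ids)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(input_ids)
    torch.cuda.synchronize()                # wait for all queued GPU work
    return (time.perf_counter() - start) / iters

# Relative overhead (hypothetical usage):
# overhead = measure_latency(longmamba_model, ids) / measure_latency(vanilla_model, ids) - 1
```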

For a deeper dive into our methodology and full benchmarking results, please refer to our paper on arXiv.

BibTeX

@inproceedings{ye2025longmamba,
  title={LongMamba: Enhancing Mamba's Long-Context Capabilities via Training-Free Receptive Field Enlargement},
  author={Zhifan Ye and Kejing Xia and Yonggan Fu and Xin Dong and Jihoon Hong and Xiangchi Yuan and Shizhe Diao and Jan Kautz and Pavlo Molchanov and Yingyan Celine Lin},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=fMbLszVO1H}
}