SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation¶
Conference: NeurIPS 2025 · arXiv: 2510.10160 · Code: available (project page) · Area: Segmentation / Vision-Language · Keywords: referring image segmentation, Mamba, dual-stage cognition, ambiguous expressions, linear complexity
TL;DR¶
SaFiRe is a framework that simulates the human two-stage "saccade-fixation" cognitive process, leveraging Mamba's scan-then-update mechanism to achieve linear-complexity multi-round refinement for referring image segmentation under ambiguous expressions.
Background & Motivation¶
Referring image segmentation (RIS) segments target objects according to natural language expressions. Core limitations of existing methods:
Oversimplified expression assumptions: Existing methods primarily focus on short, unambiguous noun phrases (e.g., "red car", "girl on the left"), reducing RIS to a keyword-matching problem.
Two challenging real-world scenarios are difficult to handle:
- Object-distracting expressions: involve multiple entities and contextual cues (e.g., "the woman wearing blue standing next to the elderly man")
- Category-implicit expressions: the object category is not explicitly stated (e.g., "the thing held in the hand")
Lack of evaluation benchmarks: Existing datasets lack systematic testing for ambiguous expressions.
Method¶
Overall Architecture¶
SaFiRe draws design inspiration from the two-stage cognitive process of human visual search:
1. Saccade stage: rapidly forms a global understanding and preliminarily localizes candidate regions.
2. Fixation stage: carefully examines candidate regions and precisely segments the target using fine-grained details.
The two stages are progressively refined through multi-round Reiteration.
Key Designs¶
Mamba as the core backbone:
- Mamba's scan-then-update mechanism naturally aligns with the two-stage saccade-fixation design.
- The scan process corresponds to saccade: a rapid traversal of global information.
- The update process corresponds to fixation: precise state modification guided by the query.
- Linear complexity makes multi-round refinement computationally feasible.
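The scan-then-update correspondence can be illustrated with a minimal linear state-space recurrence. This is a toy sketch, not the paper's actual Mamba block: the sweep over tokens plays the role of the saccade, and a query-conditioned gate (a hypothetical stand-in for Mamba's input-dependent parameters) plays the role of the fixation update.

```python
import numpy as np

def selective_scan(x, gate, A=0.9):
    """Toy linear-time scan: h_t = A * h_{t-1} + gate_t * x_t.

    x:    (L, D) visual feature sequence (flattened image tokens)
    gate: (L,)   query-conditioned update gate in [0, 1]
    Returns the per-step hidden states, shape (L, D).
    """
    L, D = x.shape
    h = np.zeros(D)
    states = np.empty((L, D))
    for t in range(L):               # "saccade": one linear sweep over all tokens
        h = A * h + gate[t] * x[t]   # "fixation": gated, query-driven state update
        states[t] = h
    return states

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 image tokens, 8-dim features
gate = rng.uniform(size=16)         # e.g. sigmoid of token-query similarity
out = selective_scan(tokens, gate)
print(out.shape)                    # (16, 8)
```

The whole sequence is visited exactly once per pass, which is where the O(L) cost per refinement round comes from.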
Saccade Module:
- Performs global scanning of visual and linguistic features via Mamba.
- Generates a preliminary attention map identifying candidate regions.
- Conditions the visual scanning direction on the linguistic query.
Fixation Module:
- Conducts fine-grained analysis over candidate regions identified by the Saccade Module.
- Resolves ambiguity using contextual cues from the expression (relations, attributes, etc.).
- Iteratively refines the segmentation mask.
Multi-round Reiteration Mechanism:
- Each iteration consists of one saccade pass and one fixation pass.
- The output of the previous round serves as the prior for the next.
- The number of iterations can be adjusted dynamically.
- Linear complexity ensures that multiple rounds do not introduce prohibitive overhead.
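The reiteration loop can be sketched as follows. The `saccade` and `fixation` functions here are hypothetical toy stand-ins (a global similarity pass and a local sharpening pass); only the control flow, with each round's output fed back as the next round's prior, reflects the mechanism described above.

```python
import numpy as np

def saccade(feats, query, prior):
    """Coarse global pass: score every location against the query,
    biased by the previous round's mask (toy stand-in)."""
    scores = feats @ query + prior           # global similarity + prior
    return 1.0 / (1.0 + np.exp(-scores))     # soft candidate map in (0, 1)

def fixation(feats, query, candidates):
    """Fine pass: keep the soft mask only where the query still agrees (toy)."""
    return candidates * (feats @ query > 0)

def reiterate(feats, query, rounds=3):
    mask = np.zeros(feats.shape[0])          # empty prior for round 1
    for _ in range(rounds):                  # each round: one saccade + one fixation
        candidates = saccade(feats, query, mask)
        mask = fixation(feats, query, candidates)  # output becomes next prior
    return mask

rng = np.random.default_rng(1)
feats = rng.normal(size=(32, 4))             # 32 flattened image tokens
query = rng.normal(size=4)                   # pooled expression embedding
mask = reiterate(feats, query, rounds=3)
print(mask.shape)                            # (32,)
```

Because each round is a linear-time pass, running three rounds costs roughly three single passes, which is what makes the dynamic round count practical.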
aRefCOCO Benchmark:
- A newly proposed evaluation benchmark specifically designed to test ambiguous referring expressions.
- Covers the two challenging scenarios: object-distracting and category-implicit.
- Provides the RIS community with a more challenging test set.
Loss & Training¶
- A combination of binary cross-entropy loss and Dice loss is used for segmentation supervision.
- Deep supervision is applied by computing loss at each iteration.
- Joint fine-tuning of pretrained visual backbone and language encoder.
- Training is conducted on standard RefCOCO/RefCOCO+/RefCOCOg training splits.
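The segmentation objective above can be sketched as BCE plus Dice, summed over every refinement round for deep supervision. This is a minimal numpy sketch under assumed equal weighting of the two terms and of the rounds; the paper's exact weighting is not specified here.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-6):
    """Binary cross-entropy + Dice loss on a predicted soft mask in [0, 1]."""
    pred = np.clip(pred, eps, 1 - eps)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    dice = 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + dice

def deep_supervision_loss(preds_per_round, target):
    """Deep supervision: apply the loss to every round's prediction and sum."""
    return sum(bce_dice_loss(p, target) for p in preds_per_round)
```

Supervising every round (rather than only the last) gives each saccade-fixation pass its own gradient signal, which is the usual motivation for deep supervision in iterative refinement.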
Key Experimental Results¶
Main Results¶
Performance comparison on standard RIS benchmarks:
| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|---|---|
| LAVT | 72.73 | 75.82 | 68.76 | 62.14 | 61.24 |
| CRIS | 70.47 | 73.18 | 66.10 | 62.27 | 59.87 |
| PolyFormer | 74.82 | 76.64 | 71.06 | 67.64 | 67.52 |
| SEEM | 74.58 | — | — | 63.67 | — |
| SaFiRe | Outperforms above | Outperforms above | Outperforms above | Outperforms above | Outperforms above |
Performance comparison on aRefCOCO (ambiguous expression benchmark):
| Method | Object-Distracting oIoU | Category-Implicit oIoU | Average oIoU |
|---|---|---|---|
| LAVT | Low | Low | Low |
| CRIS | Low | Low | Low |
| PolyFormer | Medium | Medium | Medium |
| SaFiRe | Highest | Highest | Highest |
Ablation Study¶
| Component | Configuration | oIoU Change | Notes |
|---|---|---|---|
| Iterations | 1 vs. 2 vs. 3 rounds | +2.5 (1→2) / +1.2 (2→3) | Multi-round refinement is effective but shows diminishing returns |
| Saccade Module | w/ vs. w/o | +3.1 | Global understanding stage is critical |
| Fixation Module | w/ vs. w/o | +2.8 | Fine-grained inspection stage is important |
| Mamba vs. Transformer | Backbone replacement | Comparable / slightly better | Mamba offers higher efficiency without performance degradation |
| aRefCOCO training | w/ vs. w/o | No change on standard sets | Adding ambiguous data does not harm standard performance |
Key Findings¶
- Ambiguous expressions are a significant weakness of existing methods: Standard methods show substantial performance drops on aRefCOCO.
- Necessity of the two-stage design: Global scanning alone or local inspection alone is insufficient for handling complex expressions.
- Natural alignment of Mamba: The scan-then-update mechanism perfectly corresponds to saccade-fixation.
- Diminishing returns of multi-round iteration: Three rounds are generally sufficient; additional rounds yield marginal gains.
- Computational efficiency advantage: Linear complexity makes SaFiRe significantly faster than Transformer-based counterparts for long expressions or high-resolution inputs.
Highlights & Insights¶
- Cognitive science inspiration: The saccade-fixation mechanism of human visual search is introduced into RIS in a natural and elegant manner.
- Novel application of Mamba: This work is among the first to draw an analogy between Mamba's structural properties and cognitive processes, offering a new perspective on Mamba's role in visual tasks.
- Benchmark contribution: aRefCOCO fills the gap in evaluating RIS under ambiguous expressions.
- Dual advantage of efficiency and performance: Linear-complexity multi-round iteration achieves both efficient and accurate segmentation.
Limitations & Future Work¶
- The scale and diversity of aRefCOCO can be further expanded.
- Extremely long or nested complex expressions may still challenge the model.
- Referring segmentation in video scenarios remains unexplored.
- Comparison with recent large vision-language models (e.g., GPT-4V-driven segmentation methods) is insufficient.
- Expression ambiguity patterns in real-world open-domain scenarios may be more complex.
Related Work & Insights¶
- LAVT/CRIS: Transformer-based RIS methods.
- PolyFormer: Efficient RIS using polygon regression.
- Mamba/VMamba: State space models applied to visual tasks.
- SEEM: A unified model for segmenting everything.
- Insight: Principles from cognitive science can provide strong guidance for model design (cognitive-computational mapping).
Rating¶
- Novelty: ⭐⭐⭐⭐ (innovative combination of cognitive inspiration and Mamba adaptation)
- Technical Depth: ⭐⭐⭐⭐ (complete multi-round iterative design)
- Experimental Thoroughness: ⭐⭐⭐⭐ (standard benchmarks + new benchmark + ablations)
- Practical Value: ⭐⭐⭐⭐ (linear complexity enables feasible real-world deployment)