SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation¶
Conference: NeurIPS 2025 arXiv: 2505.21795 Code: GitHub Area: Image Segmentation Keywords: Few-Shot Segmentation, SAM2, semantic alignment, feature adaptation, Memory Attention
TL;DR¶
SANSA reveals that SAM2, despite being pre-trained in a class-agnostic manner, implicitly encodes rich semantic structure in its features. By inserting lightweight AdaptFormer adapters into the last two layers of a frozen SAM2 Image Encoder, the method redirects the Memory Attention mechanism from visual-similarity matching to semantic-similarity matching. This unified architecture achieves state-of-the-art performance on few-shot segmentation while being more than 3× faster and 4–5× smaller in parameter count than competing approaches.
Background & Motivation¶
Few-shot segmentation (FSS) aims to segment novel categories given only a handful of annotated examples. Existing methods typically decompose FSS into a two-stage pipeline: semantic correspondence is first established via DINOv2 feature matching, and high-quality masks are then generated by SAM. Although effective, this modular design introduces additional computational overhead and the complexity of coordinating multiple models.
The authors observe that SAM2's "prompt-and-propagate" mechanism naturally unifies the two core capabilities required by FSS—dense feature matching (via Memory Attention for cross-frame correspondence) and high-quality mask generation (via the Mask Decoder). The central question is therefore: can SAM2 be extended from visual-similarity tracking to "semantic tracking" grounded in shared conceptual categories?
Through empirical investigation, the authors identify a critical phenomenon: on datasets with low semantic diversity (e.g., lung X-rays, skin lesions), frozen SAM2 performs comparably to or even better than state-of-the-art methods; however, on datasets with high semantic diversity (e.g., COCO, LVIS), performance drops sharply. The intuitive conclusion would be that SAM2 has not learned semantic representations, but the authors challenge this interpretation. They note that SAM2's pre-training objective—matching object instances across frames—resembles self-supervised learning frameworks that have been shown to induce semantic understanding through view-invariance. They therefore hypothesize that SAM2 does encode semantic information, but this information is entangled with instance-level features optimized for tracking. If the hypothesis holds, a lightweight bottleneck transformation should be sufficient to disentangle this structure, and the learned semantic mapping should generalize to unseen categories.
Method¶
Overall Architecture¶
SANSA reframes FSS as "tracking a semantic concept in a pseudo-video." Given \(K\) reference image–mask pairs and a target image, they are concatenated into a pseudo-video sequence \(\mathcal{M} = [x_r^k, a_r^k]_{k=1}^K \cup [x_t, \varnothing]\), where \(\varnothing\) marks the unannotated target frame (notation consistent with the training episode defined below).
SAM2's streaming pipeline processes the annotated reference frames sequentially and propagates masks to the unannotated target frame via Memory Attention, achieving segmentation based on semantic similarity.
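To make the pseudo-video framing concrete, below is a minimal sketch of how the references and the target could be fed through a SAM2 video predictor. The `init_state` / `add_new_mask` / `propagate_in_video` calls mirror the public SAM2 video-predictor interface, but the exact signatures (in particular passing an in-memory frame list to `init_state`) and the wiring shown here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: treat the K annotated references plus the unlabeled target as a
# (K+1)-frame "pseudo-video" and let SAM2 propagate the concept to the last frame.
# `predictor` is assumed to be a SAM2 video predictor from the public repository
# (e.g. built with sam2.build_sam.build_sam2_video_predictor); signatures may differ.
import torch

def segment_target(predictor, ref_images, ref_masks, target_image):
    """ref_images, ref_masks: K reference image/mask pairs; target_image: query image."""
    frames = list(ref_images) + [target_image]   # references first, target last
    state = predictor.init_state(frames)         # assumption: accepts an in-memory frame list

    # Each reference is written into the Memory Bank through its ground-truth mask;
    # references never pass through Memory Attention, so the target prediction is
    # invariant to the order of the reference images.
    for k, mask in enumerate(ref_masks):
        predictor.add_new_mask(state, frame_idx=k, obj_id=0,
                               mask=torch.as_tensor(mask))

    # Memory Attention matches target features against the stored references,
    # and the Mask Decoder refines the match into the final segmentation.
    target_idx = len(ref_images)
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        if frame_idx == target_idx:
            return (mask_logits[0] > 0).squeeze().cpu().numpy()
```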
Key Designs¶
- Redirecting from object tracking to semantic tracking: SAM2's functionality is conceptually decomposed into two parts. (a) Dense feature matching: the Memory Encoder fuses reference masks with frame features into memory representations \(\mathcal{I}_r^k = \mathcal{F}_r^k + \text{conv\_down}(\hat{y}_r^k)\), which are stored in the Memory Bank; target-frame features then establish dense correspondences via cross-attention in Memory Attention: \(\mathcal{F}_{t,\text{match}} = \text{Attention}\big(Q(\mathcal{F}_t)\, K([\mathcal{I}_r^1,\ldots,\mathcal{I}_r^K])^{\top}\big)\, V([\mathcal{I}_r^1,\ldots,\mathcal{I}_r^K])\). (b) High-quality mask generation: the Mask Decoder refines coarse feature-matching results into segmentation outputs. A key design choice is that reference frames are encoded into the Memory Bank without passing through Memory Attention, avoiding cross-reference interactions and ensuring that target predictions are invariant to the order of reference images.
- SAM2 feature adaptation (AdaptFormer): AdaptFormer modules are inserted into the last two layers of the frozen SAM2 Image Encoder. Given a down-projection matrix \(\mathbf{W}_{down} \in \mathbb{R}^{d \times \tilde{d}}\) and an up-projection matrix \(\mathbf{W}_{up} \in \mathbb{R}^{\tilde{d} \times d}\), AdaptFormer operates token-wise as \(\mathcal{A}(x) = \sigma(x \cdot \mathbf{W}_{down}) \cdot \mathbf{W}_{up}\), where \(\sigma\) is ReLU and \(\tilde{d} < d\) is the bottleneck dimension. The adapted features are added residually to the Transformer block: \(x' = \text{MLP}(x_\text{self}) + x_\text{self} + \mathcal{A}(x_\text{self})\). Only the projection matrices (~10M parameters) are trained while SAM2 remains frozen, preserving its pre-trained priors. The last two layers are selected because they encode higher-level semantic representations; experiments show that higher-capacity adapters (e.g., larger bottlenecks or MONA) actually degrade generalization (see the code sketch after this list).
- Training objective — pseudo-reference self-training: An episodic training paradigm is adopted, but with an inverted \(k\)-shot setup: the model receives a single annotated reference image and is tasked with propagating the concept to multiple unannotated target images. The training episode is \(\mathcal{M}_{train} = [x_r, a_r] \cup [x_t^j, \varnothing]_{j=1}^J\). A key design is that the predicted representation \(\mathcal{I}_t^j\) for each target frame is also encoded into the Memory Bank, turning intermediate frames into pseudo-references for subsequent frames. This forces the model to disentangle semantic information from low-level features and prevents overfitting to individual image pairs.
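As a reference for the adaptation step, here is a minimal PyTorch sketch of the AdaptFormer bottleneck and the residual placement given by the formula above. The `bottleneck_ratio` default and the zero-initialization of the up projection are illustrative choices, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AdaptFormerAdapter(nn.Module):
    """Token-wise bottleneck A(x) = ReLU(x @ W_down) @ W_up (illustrative sketch)."""
    def __init__(self, d_model: int, bottleneck_ratio: float = 0.3):
        super().__init__()
        d_tilde = max(1, int(d_model * bottleneck_ratio))      # tilde_d < d
        self.down = nn.Linear(d_model, d_tilde, bias=False)    # W_down: d x d~
        self.up = nn.Linear(d_tilde, d_model, bias=False)      # W_up:   d~ x d
        nn.init.zeros_(self.up.weight)  # start as a near-identity residual (common practice)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(x)))

def adapted_block_output(mlp: nn.Module, adapter: AdaptFormerAdapter,
                         x_self: torch.Tensor) -> torch.Tensor:
    # x' = MLP(x_self) + x_self + A(x_self); the encoder (including the MLP) stays frozen
    return mlp(x_self) + x_self + adapter(x_self)
```

In training, only `adapter.parameters()` would be passed to the optimizer (e.g. `torch.optim.AdamW(adapter.parameters(), lr=1e-4)`, matching the recipe below), while the rest of SAM2 remains frozen.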
Loss & Training¶
- Binary Cross-Entropy loss and Dice loss supervise the predicted masklets \(\{\hat{y}_t^j\}_{j=1}^J\) (see the sketch after this list)
- AdamW optimizer with learning rate \(10^{-4}\)
- 5 epochs for the standard FSS setting; 20 epochs for the generalization setting
- Training uses \(k=1\) (single reference) and sequence length \(J=3\); the same model is evaluated under both 1-shot and 5-shot settings
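A minimal sketch of the supervision described above, combining BCE and Dice over the predicted masklets. The equal weighting of the two terms is an assumption; the paper's exact loss weights may differ.

```python
import torch
import torch.nn.functional as F

def mask_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits, targets: (J, H, W) predicted masklet logits and binary ground truth."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float())
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (probs.sum(dim=(-2, -1)) + targets.sum(dim=(-2, -1)) + eps)
    return bce + dice.mean()  # equal weighting assumed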
Key Experimental Results¶
Main Results¶
| Dataset | Metric | SANSA | Prev. SOTA | Gain |
|---|---|---|---|---|
| LVIS-92i | 1-shot mIoU | 48.8 | 40.5 (SegIC) | +8.3 |
| COCO-20i | 1-shot mIoU | 60.2 | 53.9 (VRP-SAM) | +6.3 |
| FSS-1000 | 1-shot mIoU | 91.4 | 90.2 (DiffewS) | +1.2 |
| LVIS-92i | 5-shot mIoU | 53.9 | 43.7 (DiffewS) | +10.2 |
| COCO-20i | 5-shot mIoU | 64.3 | 60.7 (DiffewS) | +3.6 |
Ablation Study¶
| Configuration | COCO-20i mIoU | Notes |
|---|---|---|
| Frozen SAM2 | 32.2 | Baseline, no adaptation |
| Full Fine-tuning (224M) | 51.6 | All parameters fine-tuned |
| QKV Fine-tuning (50M) | 55.3 | Only QKV projections fine-tuned |
| LoRA | 58.0 | Adaptation method |
| AdaptFormer (0.3× bottleneck) | 60.2 | SANSA (10M parameters) |
| MONA (complex adaptation) | 56.9 | Higher capacity hurts generalization |
| Adapting all stages (0–3) | 59.4 | Full-layer adaptation |
| Adapting late stages (2–3) | 60.2 | Last two layers are sufficient |
Key Findings¶
- SANSA uses only 234M parameters, is more than 3× faster than GF-SAM (945M), and outperforms it by 13.6% on LVIS-92i
- In promptable FSS, the performance gap from point prompts to mask prompts is only 6.8% (vs. 15.5% for VRP-SAM)
- In the generalization (in-context) setting, SANSA demonstrates cross-task generalization even without training on part-level data, surpassing DiffewS by 7.5% on Pascal-Part
- PCA visualizations clearly show that adapted features form clusters organized by semantic category, and this semantic organization generalizes to unseen categories
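A minimal sketch of how such a PCA visualization could be reproduced from pooled per-object features; the feature extraction itself and the category labels are assumed inputs, provided here only to show the projection-and-coloring step.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_feature_pca(features: np.ndarray, labels: np.ndarray) -> None:
    """features: (N, D) pooled object features; labels: (N,) semantic category ids."""
    coords = PCA(n_components=2).fit_transform(features)   # project to 2D
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Adapted SAM2 features, colored by category")
    plt.show()
```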
Highlights & Insights¶
- Core insight: SAM2's class-agnostic pre-training implicitly encodes rich semantic structure, analogous to the semantic understanding induced by view-invariance in self-supervised learning. This finding challenges the prevailing assumption that SAM2 lacks semantic understanding
- Minimalist design philosophy: Inserting only the simplest AdaptFormer into the last two layers, with ~10M trainable parameters, achieves state-of-the-art performance—demonstrating the principle that more constrained adaptation generalizes better
- Unified architecture advantage: The dual-model DINOv2+SAM pipeline is eliminated; a single SAM2 architecture simultaneously performs feature matching and mask generation
Limitations & Future Work¶
- SANSA still trails GF-SAM by 2.5% on 5-shot COCO-20i
- Episodic training may be constrained by the diversity of training categories
- The generalization capability of the adapters depends on the degree of semantic relatedness between base and novel categories
Related Work & Insights¶
- Compared to single-model methods such as SegIC and DiffewS, SANSA demonstrates that SAM2's Memory Attention is an ideal mechanism for unifying feature matching and mask generation
- The work provides a new paradigm for "mining hidden capabilities" of foundation models: exposing latent structure in pre-trained features through lightweight adaptation
- The pseudo-reference self-training strategy is transferable to other video understanding tasks
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to reveal and exploit the implicit semantic structure of SAM2; highly original perspective
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation on LVIS/COCO/FSS-1000, detailed ablations, and compelling PCA visualizations
- Writing Quality: ⭐⭐⭐⭐⭐ — Problem-driven narrative structured around three research questions; clear and logical
- Value: ⭐⭐⭐⭐⭐ — Minimalist design, state-of-the-art performance, and efficient inference make this highly practical