SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation¶
Conference: NeurIPS 2025 arXiv: 2505.21795 Code: GitHub Area: Image Segmentation Keywords: Few-Shot Segmentation, SAM2, semantic alignment, feature adaptation, Memory Attention
TL;DR¶
SANSA reveals that SAM2, despite being pre-trained in a class-agnostic manner, implicitly encodes rich semantic structure in its features. By inserting lightweight AdaptFormer adapters into the last two layers of a frozen SAM2 Image Encoder, the method redirects the Memory Attention mechanism from visual-similarity matching to semantic-similarity matching. This unified architecture achieves state-of-the-art performance on few-shot segmentation while being more than 3× faster and 4–5× smaller in parameter count than competing approaches.
Background & Motivation¶
Few-shot segmentation (FSS) aims to segment novel categories given only a handful of annotated examples. Existing methods typically decompose FSS into a two-stage pipeline: semantic correspondence is first established via DINOv2 feature matching, and high-quality masks are then generated by SAM. Although effective, this modular design introduces additional computational overhead and the complexity of coordinating multiple models.
The authors observe that SAM2's "prompt-and-propagate" mechanism naturally unifies the two core capabilities required by FSS—dense feature matching (via Memory Attention for cross-frame correspondence) and high-quality mask generation (via the Mask Decoder). The central question is therefore: can SAM2 be extended from visual-similarity tracking to "semantic tracking" grounded in shared conceptual categories?
Through empirical investigation, the authors identify a critical phenomenon: on datasets with low semantic diversity (e.g., lung X-rays, skin lesions), frozen SAM2 performs comparably to or even better than state-of-the-art methods; however, on datasets with high semantic diversity (e.g., COCO, LVIS), performance drops sharply. The intuitive conclusion would be that SAM2 has not learned semantic representations, but the authors challenge this interpretation. They note that SAM2's pre-training objective—matching object instances across frames—resembles self-supervised learning frameworks that have been shown to induce semantic understanding through view-invariance. They therefore hypothesize that SAM2 does encode semantic information, but this information is entangled with instance-level features optimized for tracking. If the hypothesis holds, a lightweight bottleneck transformation should be sufficient to disentangle this structure, and the learned semantic mapping should generalize to unseen categories.
Method¶
Overall Architecture¶
SANSA reframes FSS as "tracking a semantic concept in a pseudo-video." Given \(K\) reference image–mask pairs and a target image, they are concatenated into a pseudo-video sequence \(\mathcal{M} = [x_r^k, a_r^k]_{k=1}^K \cup [x_t, \varnothing]\), where \(\varnothing\) marks the unannotated target frame (notation consistent with the training episode defined below).
SAM2's streaming pipeline processes the annotated reference frames sequentially and propagates masks to the unannotated target frame via Memory Attention, achieving segmentation based on semantic similarity.
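To make the pseudo-video framing concrete, below is a minimal sketch of how the references and the target could be fed through a SAM2 video predictor. The `init_state` / `add_new_mask` / `propagate_in_video` calls mirror the public SAM2 video-predictor interface, but the exact signatures (in particular passing an in-memory frame list to `init_state`) and the wiring shown here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: treat the K annotated references plus the unlabeled target as a
# (K+1)-frame "pseudo-video" and let SAM2 propagate the concept to the last frame.
# `predictor` is assumed to be a SAM2 video predictor from the public repository
# (e.g. built with sam2.build_sam.build_sam2_video_predictor); signatures may differ.
import torch

def segment_target(predictor, ref_images, ref_masks, target_image):
    """ref_images, ref_masks: K reference image/mask pairs; target_image: query image."""
    frames = list(ref_images) + [target_image]   # references first, target last
    state = predictor.init_state(frames)         # assumption: accepts an in-memory frame list

    # Each reference is written into the Memory Bank through its ground-truth mask;
    # references never pass through Memory Attention, so the target prediction is
    # invariant to the order of the reference images.
    for k, mask in enumerate(ref_masks):
        predictor.add_new_mask(state, frame_idx=k, obj_id=0,
                               mask=torch.as_tensor(mask))

    # Memory Attention matches target features against the stored references,
    # and the Mask Decoder refines the match into the final segmentation.
    target_idx = len(ref_images)
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        if frame_idx == target_idx:
            return (mask_logits[0] > 0).squeeze().cpu().numpy()
```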
Key Designs¶
- Redirecting from object tracking to semantic tracking: SAM2's functionality is conceptually decomposed into two parts. (a) Dense feature matching: the Memory Encoder fuses reference masks with frame features into memory representations \(\mathcal{I}_r^k = \mathcal{F}_r^k + \text{conv\_down}(\hat{y}_r^k)\), which are stored in the Memory Bank; target-frame features then establish dense correspondences via cross-attention in Memory Attention: \(\mathcal{F}_{t,\text{match}} = \text{Attention}\big(Q(\mathcal{F}_t)\, K([\mathcal{I}_r^1,\ldots,\mathcal{I}_r^K])^{\top}\big)\, V([\mathcal{I}_r^1,\ldots,\mathcal{I}_r^K])\). (b) High-quality mask generation: the Mask Decoder refines coarse feature-matching results into segmentation outputs. A key design choice is that reference frames are encoded into the Memory Bank without passing through Memory Attention, avoiding cross-reference interactions and ensuring that target predictions are invariant to the order of reference images.
- SAM2 feature adaptation (AdaptFormer): AdaptFormer modules are inserted into the last two layers of the frozen SAM2 Image Encoder. Given a down-projection matrix \(\mathbf{W}_{down} \in \mathbb{R}^{d \times \tilde{d}}\) and an up-projection matrix \(\mathbf{W}_{up} \in \mathbb{R}^{\tilde{d} \times d}\), AdaptFormer operates token-wise as \(\mathcal{A}(x) = \sigma(x \cdot \mathbf{W}_{down}) \cdot \mathbf{W}_{up}\), where \(\sigma\) is ReLU and \(\tilde{d} < d\) is the bottleneck dimension. The adapted features are added residually to the Transformer block: \(x' = \text{MLP}(x_\text{self}) + x_\text{self} + \mathcal{A}(x_\text{self})\). Only the projection matrices (~10M parameters) are trained while SAM2 remains frozen, preserving its pre-trained priors. The last two layers are selected because they encode higher-level semantic representations; experiments show that higher-capacity adapters (e.g., larger bottlenecks or MONA) actually degrade generalization (see the code sketch after this list).
- Training objective — pseudo-reference self-training: An episodic training paradigm is adopted, but with an inverted \(k\)-shot setup: the model receives a single annotated reference image and is tasked with propagating the concept to multiple unannotated target images. The training episode is \(\mathcal{M}_{train} = [x_r, a_r] \cup [x_t^j, \varnothing]_{j=1}^J\). A key design is that the predicted representation \(\mathcal{I}_t^j\) for each target frame is also encoded into the Memory Bank, turning intermediate frames into pseudo-references for subsequent frames. This forces the model to disentangle semantic information from low-level features and prevents overfitting to individual image pairs.
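As a reference for the adaptation step, here is a minimal PyTorch sketch of the AdaptFormer bottleneck and the residual placement given by the formula above. The `bottleneck_ratio` default and the zero-initialization of the up projection are illustrative choices, not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AdaptFormerAdapter(nn.Module):
    """Token-wise bottleneck A(x) = ReLU(x @ W_down) @ W_up (illustrative sketch)."""
    def __init__(self, d_model: int, bottleneck_ratio: float = 0.3):
        super().__init__()
        d_tilde = max(1, int(d_model * bottleneck_ratio))      # tilde_d < d
        self.down = nn.Linear(d_model, d_tilde, bias=False)    # W_down: d x d~
        self.up = nn.Linear(d_tilde, d_model, bias=False)      # W_up:   d~ x d
        nn.init.zeros_(self.up.weight)  # start as a near-identity residual (common practice)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(x)))

def adapted_block_output(mlp: nn.Module, adapter: AdaptFormerAdapter,
                         x_self: torch.Tensor) -> torch.Tensor:
    # x' = MLP(x_self) + x_self + A(x_self); the encoder (including the MLP) stays frozen
    return mlp(x_self) + x_self + adapter(x_self)
```

In training, only `adapter.parameters()` would be passed to the optimizer (e.g. `torch.optim.AdamW(adapter.parameters(), lr=1e-4)`, matching the recipe below), while the rest of SAM2 remains frozen.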
Loss & Training¶
- Binary Cross-Entropy loss and Dice loss supervise the predicted masklets \(\{\hat{y}_t^j\}_{j=1}^J\) (see the sketch after this list)
- AdamW optimizer with learning rate \(10^{-4}\)
- 5 epochs for the standard FSS setting; 20 epochs for the generalization setting
- Training uses \(k=1\) (single reference) and sequence length \(J=3\); the same model is evaluated under both 1-shot and 5-shot settings
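A minimal sketch of the supervision described above, combining BCE and Dice over the predicted masklets. The equal weighting of the two terms is an assumption; the paper's exact loss weights may differ.

```python
import torch
import torch.nn.functional as F

def mask_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits, targets: (J, H, W) predicted masklet logits and binary ground truth."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float())
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + eps) / (probs.sum(dim=(-2, -1)) + targets.sum(dim=(-2, -1)) + eps)
    return bce + dice.mean()  # equal weighting assumed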
Key Experimental Results¶
Main Results¶
| Dataset | Metric | SANSA | Prev. SOTA | Gain |
|---|---|---|---|---|
| LVIS-92i | 1-shot mIoU | 48.8 | 40.5 (SegIC) | +8.3 |
| COCO-20i | 1-shot mIoU | 60.2 | 53.9 (VRP-SAM) | +6.3 |
| FSS-1000 | 1-shot mIoU | 91.4 | 90.2 (DiffewS) | +1.2 |
| LVIS-92i | 5-shot mIoU | 53.9 | 43.7 (DiffewS) | +10.2 |
| COCO-20i | 5-shot mIoU | 64.3 | 60.7 (DiffewS) | +3.6 |
Ablation Study¶
| Configuration | COCO-20i mIoU | Notes |
|---|---|---|
| Frozen SAM2 | 32.2 | Baseline, no adaptation |
| Full Fine-tuning (224M) | 51.6 | All parameters fine-tuned |
| QKV Fine-tuning (50M) | 55.3 | Only QKV projections fine-tuned |
| LoRA | 58.0 | Adaptation method |
| AdaptFormer (0.3× bottleneck) | 60.2 | SANSA (10M parameters) |
| MONA (complex adaptation) | 56.9 | Higher capacity hurts generalization |
| Adapting all stages (0–3) | 59.4 | Full-layer adaptation |
| Adapting late stages (2–3) | 60.2 | Last two layers are sufficient |
Key Findings¶
- SANSA uses only 234M parameters, is more than 3× faster than GF-SAM (945M), and outperforms it by 13.6% on LVIS-92i
- In promptable FSS, the performance gap from point prompts to mask prompts is only 6.8% (vs. 15.5% for VRP-SAM)
- In the generalization (in-context) setting, SANSA demonstrates cross-task generalization even without training on part-level data, surpassing DiffewS by 7.5% on Pascal-Part
- PCA visualizations clearly show that adapted features form clusters organized by semantic category, and this semantic organization generalizes to unseen categories
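A minimal sketch of how such a PCA visualization could be reproduced from pooled per-object features; the feature extraction itself and the category labels are assumed inputs, provided here only to show the projection-and-coloring step.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_feature_pca(features: np.ndarray, labels: np.ndarray) -> None:
    """features: (N, D) pooled object features; labels: (N,) semantic category ids."""
    coords = PCA(n_components=2).fit_transform(features)   # project to 2D
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Adapted SAM2 features, colored by category")
    plt.show()
```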
Highlights & Insights¶
- Core insight: SAM2's class-agnostic pre-training implicitly encodes rich semantic structure, analogous to the semantic understanding induced by view-invariance in self-supervised learning. This finding challenges the prevailing assumption that SAM2 lacks semantic understanding
- Minimalist design philosophy: Inserting only the simplest AdaptFormer into the last two layers, with ~10M trainable parameters, achieves state-of-the-art performance—demonstrating the principle that more constrained adaptation generalizes better
- Unified architecture advantage: The dual-model DINOv2+SAM pipeline is eliminated; a single SAM2 architecture simultaneously performs feature matching and mask generation
Limitations & Future Work¶
- SANSA still trails GF-SAM by 2.5% on 5-shot COCO-20i
- Episodic training may be constrained by the diversity of training categories
- The generalization capability of the adapters depends on the degree of semantic relatedness between base and novel categories
Related Work & Insights¶
- Compared to single-model methods such as SegIC and DiffewS, SANSA demonstrates that SAM2's Memory Attention is an ideal mechanism for unifying feature matching and mask generation
- The work provides a new paradigm for "mining hidden capabilities" of foundation models: exposing latent structure in pre-trained features through lightweight adaptation
- The pseudo-reference self-training strategy is transferable to other video understanding tasks
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to reveal and exploit the implicit semantic structure of SAM2; highly original perspective
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation on LVIS/COCO/FSS-1000, detailed ablations, and compelling PCA visualizations
- Writing Quality: ⭐⭐⭐⭐⭐ — Problem-driven narrative structured around three research questions; clear and logical
- Value: ⭐⭐⭐⭐⭐ — Minimalist design, state-of-the-art performance, and efficient inference make this highly practical