Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation

Conference: ICML 2025 Spotlight
arXiv: 2506.07376
Code: To be confirmed
Area: Image Segmentation
Keywords: Cross-domain few-shot segmentation, adapter, domain decoupling, SAM, singular value decomposition

TL;DR

This paper finds that adapters naturally decouple domain information as a consequence of their architecture rather than of any training loss. Building on this, the authors propose the Domain Feature Navigator (DFN) as a structural domain decoupler, paired with SAM-SVN to prevent overfitting on the source domain. The approach outperforms prior state-of-the-art methods in cross-domain few-shot semantic segmentation (CD-FSS), reaching a 1-shot average of 63.99% and a 5-shot average of 69.77% MIoU with a ViT-base backbone.

Background & Motivation

Background: Cross-domain few-shot semantic segmentation (CD-FSS) requires a model pre-trained on a large amount of annotated data in the source domain to be transferred to a target domain with only a few annotations for pixel-level segmentation. Existing methods are mainly divided into two categories: meta-learning-based few-shot segmentation methods (such as HSNet, PATNet) and adapter-based parameter-efficient fine-tuning methods.

Limitations of Prior Work: CD-FSS faces two core challenges—(1) The massive domain gap between the source and target domains, making it difficult for models to generalize from the source to the target domain; (2) The extreme scarcity of target domain data, which hampers the adaptation of models to new domain distributions. Existing domain decoupling methods (e.g., DIFEX, CCSA) rely on additional domain adversarial losses or regularization to separate domain-invariant/specific features, which increases training complexity and yields limited performance.

Key Challenge: Loss-based domain decouplers require explicit domain labels to constrain the degree of decoupling, but such constraints are hand-crafted and fail to exploit the intrinsic capability of the network architecture itself. Meanwhile, adapters have long been viewed in the literature solely as parameter-efficient fine-tuning tools, and their implicit domain decoupling properties have never been explored.

Goal: (a) How can domain information be separated without adding domain decoupling losses? (b) Under what conditions does the structural decoupling of adapters hold? (c) How can the overfitting risk introduced by structural decoupling be controlled?

Key Insight: Through Centered Kernel Alignment (CKA) similarity analysis, the authors discover a key phenomenon—inserting an adapter via residual connections into the deep layers of a frozen backbone network naturally causes the adapter to absorb domain-specific information, while guiding the subsequent encoder to learn domain-invariant features. This phenomenon is closely related to the design architecture of the adapter (position and connection style) rather than the training loss.
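
The CKA comparison behind this observation can be illustrated with a minimal sketch of linear CKA. It assumes each domain is represented by an equal-sized batch of pooled feature vectors of shape `(n_samples, dim)`; the function name `linear_cka` and the pooling step are illustrative assumptions, and the paper may use a different CKA variant.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Values near 1 mean the two sets of representations are highly similar;
    lower values mean they differ more.
    """
    # Center every feature dimension over the sample axis.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

In the setting described above, one would call this on source-domain and target-domain features taken at the backbone output and again at the encoder output, with and without the adapter attached, and compare the four values.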

Core Idea: Adapters naturally serve as domain information decouplers. Utilizing their structural characteristics (residual connection and deep-layer positioning), the authors propose DFN, coupled with SAM-SVN to constrain singular values and prevent overfitting, achieving structural domain decoupling without requiring additional losses.

Method

Overall Architecture

The inputs consist of a support image (with annotation) and a query image, which pass through a frozen ResNet-50 backbone network to extract multi-level feature pyramids \(\{(F_l^q, F_l^s)\}_{l=1}^L\). The DFN is attached to the deep-layer features of the backbone via residual connections, outputting navigated feature maps \(\{NF_l^q, NF_l^s\}\). The support features are masked and used alongside query features to compute a 4D cosine correlation tensor, which is then fed into a 4D convolutional pyramid encoder and a 2D decoder to obtain the final segmentation result.
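
As a rough illustration of this pipeline, the sketch below applies a residual \(1 \times 1\) navigator to one feature level and builds the ReLU-clamped cosine correlation tensor. It uses NumPy on a single image pair; the function names `navigate` and `cosine_correlation`, the zero initialization of `alpha`, and the random toy features are assumptions for illustration, not the authors' code.

```python
import numpy as np

def navigate(feat: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Residual 1x1 convolution: NF = F + N_alpha(F).

    feat: (C, H, W) feature map; alpha: (C, C) weight of the 1x1 conv.
    """
    return feat + np.einsum("oc,chw->ohw", alpha, feat)

def cosine_correlation(nf_q: np.ndarray, nf_s: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ReLU-clamped cosine similarity between every query and support position.

    Returns a 4D tensor of shape (Hq, Wq, Hs, Ws).
    """
    c, hq, wq = nf_q.shape
    _, hs, ws = nf_s.shape
    q = nf_q.reshape(c, -1)
    s = nf_s.reshape(c, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    corr = np.maximum(q.T @ s, 0.0)  # ReLU keeps only positive similarities
    return corr.reshape(hq, wq, hs, ws)

# Toy usage with random features (hypothetical shapes).
rng = np.random.default_rng(0)
C, H, W = 8, 5, 5
feat_q = rng.standard_normal((C, H, W))
feat_s = rng.standard_normal((C, H, W))
mask = (rng.random((H, W)) > 0.5).astype(feat_s.dtype)  # support annotation
alpha = np.zeros((C, C))  # assumption: zero init, so the navigator starts as identity
nf_q = navigate(feat_q, alpha)
nf_s = navigate(feat_s * mask[None], alpha)  # masked support feature
corr = cosine_correlation(nf_q, nf_s)
```

Support positions outside the mask become zero vectors, so their correlation with every query position is zero.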

Training consists of two phases:

  • Source Domain Training: The DFN is jointly trained with the encoder/decoder. Through its architectural characteristics, the DFN naturally absorbs source domain-specific information, guiding the model to learn domain-invariant knowledge; concurrently, SAM-SVN is applied to constrain the DFN to avoid overfitting.
  • Target Domain Fine-tuning: The encoder/decoder are frozen, and only the DFN is fine-tuned to learn target domain-specific features, integrating domain-specific and domain-invariant features to achieve cross-domain alignment.

Key Designs

  1. Adapter as a Decoupler: Discovery and Validation:

    • Function: To reveal the conditions under which adapters naturally possess the capability of domain information decoupling through systematic experiments.
    • Mechanism: Centered Kernel Alignment (CKA) similarity metrics are used to analyze the changes in domain similarity of the backbone network and encoder outputs before and after attaching the adapter, across the Pascal source domain and four target domains. Experiments show that after attaching the adapter, the CKA of the backbone outputs decreases (indicating more domain-specific information is absorbed by the adapter), while the CKA of the encoder outputs increases (indicating the encoder focuses more on domain-invariant information).
    • Design Motivation: To validate two conditions under which decoupling emerges. Position: the adapter must be inserted in the deep layers of the frozen backbone (deep-layer features are more semantic and domain-specific). Connection: residual connections must be used (serial connections would block the propagation of general features). The specific design of the module (e.g., traditional conv or LoRA) does not affect the decoupling capability. The authors frame this finding as "structural decoupling," in contrast to loss-based decoupling.
  2. Domain Feature Navigator (DFN):

    • Function: To act as a structural domain decoupler, absorbing domain-specific information and guiding the model to focus on domain-invariant knowledge.
    • Mechanism: DFN is implemented as a \(1 \times 1\) convolution \(\mathcal{N}_\alpha\) with parameters \(\alpha \in \mathbb{R}^{C \times C \times 1 \times 1}\) and matching input/output channels. It is attached to the backbone features via a residual connection: \(NF_l^s = \hat{F_l^s} + \mathcal{N}_\alpha(\hat{F_l^s})\), \(NF_l^q = F_l^q + \mathcal{N}_\alpha(F_l^q)\), where \(\hat{F_l^s}\) denotes the masked support feature. DFNs are connected to low, mid, and high-level features respectively to guarantee semantic consistency. The navigated features are used to build a 4D correlation tensor via cosine similarity: \(C_l(m,n) = \text{ReLU}(\frac{NF_l^q(m) \cdot NF_l^s(n)}{\|NF_l^q(m)\| \|NF_l^s(n)\|})\), where \(m\) and \(n\) index spatial positions in the query and support feature maps.
    • Design Motivation: Compared with loss-based decouplers (which require extra domain labels and adversarial training), the DFN utilizes the inherent properties of the network architecture to achieve decoupling, which is simpler and requires no additional loss functions. The discrepancy between the frozen pre-trained backbone and the from-scratch training of the adapter naturally guides the adapter to capture source domain-specific information.
  3. SAM-SVN (Sharpness-Aware Minimization on Singular Values of Navigator):

    • Function: To constrain the degree of overfitting of the DFN during source domain training, preventing it from learning sample-specific rather than domain-specific knowledge.
    • Mechanism: Singular value decomposition (SVD) is performed on the DFN weight matrix, yielding \(\alpha' = USV^T\). SAM perturbations are only applied to the singular value matrix \(S\): \(\epsilon = \rho \nabla L(S) / \|\nabla L(S)\|_2\). The gradients are then recomputed and updated using the perturbed parameters \(\hat{\alpha} = U(S+\epsilon)V^T\). Since singular values control the importance of different representations, constraining only the singular values flattens the loss landscape (preventing overfitting) without hindering the DFN's capability to absorb domain information via the \(U, V\) matrices.
    • Design Motivation: Pure structural decoupling has its vulnerabilities—the DFN might learn excessively complex patterns and overfit to specific source domain samples rather than the domain distribution (as indicated by the loss fluctuation rising from 0.398 to 0.521 in Table 5). Applying SAM directly to all parameters would restrict the absorption of domain information. Inspired by SAM-ON and BSP, the authors propose a compromise of constraining only the singular values. Because singular values are most sensitive to overfitting, constraining them controls the level of overfitting while the \(U\) and \(V\) matrices retain their capability to absorb domain information.
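
The SAM-SVN update can be sketched on a toy least-squares problem, treating the \(1 \times 1\) convolution weight as a \(C \times C\) matrix. This is a minimal illustration under stated assumptions: the toy loss, the names `loss_and_grad` and `sam_svn_step`, the learning rate, and the choice to apply the perturbed-point gradient to the full matrix with plain gradient descent are all assumptions, not the authors' implementation.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Toy loss 0.5 * mean ||x w^T - y||^2 and its gradient w.r.t. w."""
    resid = x @ w.T - y
    loss = 0.5 * np.mean(np.sum(resid ** 2, axis=1))
    grad = resid.T @ x / x.shape[0]
    return loss, grad

def sam_svn_step(w, x, y, lr=0.1, rho=0.5):
    """One update where the SAM perturbation touches only the singular values."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    _, grad_w = loss_and_grad(w, x, y)
    # Chain rule: dL/ds_i = u_i^T (dL/dW) v_i
    grad_s = np.einsum("ji,jk,ik->i", u, grad_w, vt)
    # Ascent direction on the singular values, scaled to norm rho.
    eps = rho * grad_s / (np.linalg.norm(grad_s) + 1e-12)
    # Perturbed weights U (S + eps) V^T; U and V are left untouched.
    w_hat = (u * (s + eps)) @ vt
    # Recompute the gradient at the perturbed point and apply it.
    _, grad_hat = loss_and_grad(w_hat, x, y)
    return w - lr * grad_hat, eps

# Toy data: recover a random linear map (hypothetical setup).
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 6))
w_true = rng.standard_normal((6, 6))
y = x @ w_true.T
w = rng.standard_normal((6, 6))
initial_loss, _ = loss_and_grad(w, x, y)
for _ in range(200):
    w, eps = sam_svn_step(w, x, y)
final_loss, _ = loss_and_grad(w, x, y)
```

Restricting the perturbation to the singular values keeps the perturbation space at \(C\) dimensions instead of \(C^2\), which matches the motivation given above: the sharpness constraint acts on the scaling of each representation direction while the directions themselves remain free to adapt.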

Loss & Training

  • Loss Function: Standard binary cross-entropy (BCE) loss, without extra domain-decoupling loss.
  • Source Domain Training: The backbone (ResNet-50, pre-trained on ImageNet) is frozen. DFN + encoder + decoder are jointly trained using the Adam optimizer with a learning rate of 1e-3, SAM hyperparameter \(\rho = 0.5\), and an image resolution of 400x400.
  • Target Domain Fine-tuning: Only the DFN is fine-tuned for 50 iterations, with different learning rates used for different datasets (FSS-1000: 1e-3, Deepglobe: 5e-1, ISIC/ChestX: 5e-3).
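
The two-phase schedule above can be summarized as a small configuration sketch. The values are taken from the settings listed here; the dictionary layout, the module names, and the helper functions `trainable_modules` and `frozen_modules` are hypothetical and only illustrate which parts are updated in each phase.

```python
# Settings listed in this section.
SOURCE_TRAINING = {
    "optimizer": "Adam",
    "lr": 1e-3,
    "sam_rho": 0.5,
    "image_size": (400, 400),
}
TARGET_FINETUNE_ITERS = 50
TARGET_FINETUNE_LR = {
    "FSS-1000": 1e-3,
    "Deepglobe": 5e-1,
    "ISIC": 5e-3,
    "ChestX": 5e-3,
}

# Hypothetical module names for the four parts of the model.
MODULES = ("backbone", "dfn", "encoder", "decoder")

def trainable_modules(phase):
    """Return the modules that receive gradient updates in a given phase."""
    if phase == "source":
        return ("dfn", "encoder", "decoder")  # backbone stays frozen
    if phase == "target":
        return ("dfn",)  # encoder and decoder are frozen as well
    raise ValueError(f"unknown phase: {phase!r}")

def frozen_modules(phase):
    """Return the modules kept fixed in a given phase."""
    active = trainable_modules(phase)
    return tuple(m for m in MODULES if m not in active)
```

For example, `frozen_modules("target")` lists every module except the DFN, reflecting that target-domain adaptation touches only the navigator.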

Key Experimental Results

Main Results

Comparison of 1-shot and 5-shot MIoU on standard CD-FSS benchmarks (source domain: Pascal, 4 target domains):

| Method | Backbone | FSS-1000 (1/5) | Deepglobe (1/5) | ISIC (1/5) | ChestX (1/5) | Average (1/5) |
| --- | --- | --- | --- | --- | --- | --- |
| HSNet (ICCV-21) | Res-50 | 77.53/80.99 | 29.65/35.08 | 31.20/35.10 | 51.88/54.36 | 47.57/51.38 |
| PATNet (ECCV-22) | Res-50 | 78.59/81.23 | 37.89/42.97 | 41.16/53.58 | 66.61/70.20 | 56.06/61.99 |
| ABCDFSS (CVPR-24) | Res-50 | 74.60/76.20 | 42.60/45.70 | 45.70/53.30 | 79.80/81.40 | 60.67/64.97 |
| APSeg (CVPR-24) | ViT-base | 79.71/81.90 | 35.94/39.98 | 45.43/53.98 | 84.10/84.50 | 61.30/65.09 |
| APM (NeurIPS-24) | Res-50 | 79.29/81.83 | 40.86/44.92 | 41.71/51.76 | 78.25/82.81 | 60.03/65.18 |
| DFN (Ours) | Res-50 | 80.73/85.80 | 45.66/47.98 | 36.30/51.13 | 85.21/90.34 | 61.98/68.81 |
| DFN (Ours) | ViT-base | 82.97/85.72 | 39.45/47.67 | 50.36/58.53 | 83.18/87.14 | 63.99/69.77 |

Under the ViT-base backbone, the 1-shot average (63.99%) surpasses the previous SOTA (APSeg, 61.30%) by 2.69 points, and the 5-shot average (69.77% vs. 65.09%) surpasses it by 4.68 points. Under the Res-50 backbone, DFN reaches 61.98%/68.81%, which is still 0.68/3.72 points above APSeg even though APSeg uses the stronger ViT-base backbone.

Ablation Study

| Configuration | 1-shot MIoU | 5-shot MIoU | Description |
| --- | --- | --- | --- |
| Baseline (HSNet) | 47.57 | 51.38 | Without DFN, without SAM |
| + DFN | 59.89 | 66.59 | DFN contributes +12.32% (1-shot) |
| + DFN + SAM | 60.65 | 67.74 | Apply SAM to all DFN parameters |
| + DFN + SAM-SVN | 61.98 | 68.81 | Apply SAM only to singular values, optimal |

Comparison of SAM perturbation targets:

| SAM Target Module | 1-shot MIoU | 5-shot MIoU |
| --- | --- | --- |
| Enc.+Dec.+DFN | 60.04 | 66.98 |
| DFN only | 60.65 | 67.74 |
| SVN (singular values) only | 61.98 | 68.81 |

Key Findings

  • DFN contributes the most: Introducing DFN yields a substantial boost of +12.32% (1-shot), validating the core value of structural decoupling.
  • SAM-SVN outperforms full-parameter SAM: Applying SAM to all DFN parameters restrains the absorption of domain information (60.65/67.74), whereas constraining only the singular values gives a better balance between overfitting control and domain absorption (61.98/68.81).
  • Improved stability: SAM-SVN reduces performance fluctuations (best-worst) from 1.76/2.33/3.12/2.03 (with DFN only) to 0.94/1.53/1.68/1.18.
  • Usage mode of DFN: Removing the DFN in the target domain still improves upon the baseline (56.85 vs. 47.57), but retaining and fine-tuning the DFN yields the best performance (61.98).
  • Particularly outstanding performance on ChestX: The 1-shot performance increases from the baseline of 51.88 to 85.21 (+33.33%), demonstrating that the advantages of structural decoupling are more pronounced in scenarios with massive domain gaps.

Highlights & Insights

  • A new perspective of "Adapters as decouplers": This is an extremely elegant discovery—while adapters were designed as parameter-efficient fine-tuning tools, their architectural properties (residual connections + deep insertion + training from scratch vs. a frozen backbone) naturally lead to the separation of domain information. This "free" decoupling capability requires no additional losses, simplifying the entire methodological pipeline.

  • Systematic experimental validation methodology: Instead of proposing the method directly, the authors first systematically validate the conditions under which an adapter can act as a decoupler through controlled variable experiments (position x architecture x connection style), before designing the method based on these findings. This research paradigm of "phenomenon discovery - mechanism explanation - method design" is highly exemplary.

  • Ingenious design of SAM-SVN: By performing SVD separation and only applying SAM perturbation to the singular values, the model controls overfitting while preserving the ability to absorb domain features. This idea can be transferred to any scenario requiring "partial regularization"—identifying the parameter subspace most sensitive to overfitting via SVD and constraining only that subspace.

  • Theoretical backing: The authors explain from the perspective of Information Bottleneck theory why adapters (with low capacity \(\theta_g \ll \theta_f\)) tend to absorb domain-specific information, and how the residual architecture achieves complementary learning objectives through gradient flow separation.

Limitations & Future Work

  • Limitations acknowledged by the authors: Only the few-shot scenario is validated; the applicability to many-shot settings has not been explored.
  • Weak performance on the ISIC dataset: Under Res-50, the 1-shot performance on ISIC is only 36.30%, well below APSeg's 45.43% (ViT-base) and ABCDFSS's 45.70% (Res-50), suggesting that for certain domain gaps structural decoupling alone can trail other adaptation strategies.
  • Hyperparameter sensitivity: Fine-tuning learning rates vary dramatically across different target domains (from 1e-3 to 5e-1), requiring per-target-domain parameter tuning.
  • Simple DFN architecture: Implemented using only a \(1\times1\) convolution, it possesses limited capacity; more complex adapter structures (such as multi-layer MLP adapters) might yield further performance gains.
  • Lack of comparison with foundation models: The method is not sufficiently compared with large models such as SAM (Segment Anything Model) under the CD-FSS setup.

Comparison with Related Work

  • vs. PATNet: PATNet bridges the domain gap through style transfer augmentation in the frequency domain, representing an external data augmentation route. In contrast, this work achieves decoupling from within the model architecture. These two approaches are complementary and can be combined.
  • vs. ABCDFSS: ABCDFSS utilizes multiple adapter configurations for cross-domain adaptation but misses the decoupling property of adapters. The discoveries of this paper can directly improve the adapter design strategies of ABCDFSS.
  • vs. APSeg: APSeg adapts models via prompts based on ViTs, which differs from the adapter-based route of this work but shares the same goals. The proposed DFN+SAM-SVN is also applicable to ViTs and yields superior results.
  • vs. Domain-Invariant Feature Extraction (DIFEX): DIFEX maximizes invariant features and domain-specific features through extra regularization terms, representing a classic loss-based decoupling approach. This work demonstrates that decoupling can be achieved solely through architecture, producing superior results.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The discovery of "adapters as decouplers" is highly novel, offering a fundamental re-understanding of the role of adapters.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Standard benchmarks on four target domains + comprehensive ablation + CKA/MMD analysis + visualizations; however, the weak result on ISIC is a slight downside.
  • Writing Quality: ⭐⭐⭐⭐⭐ The paper is clearly structured, presenting a very smooth logical chain from phenomenon to explanation and finally to method, with intuitive figure and table designs.
  • Value: ⭐⭐⭐⭐ The discovery has broad applicability (not limited to CD-FSS), and the SAM-SVN technique is highly transferable. However, the instability on ISIC and hyperparameter sensitivity impact its practical value.