Bridging Granularity Gaps: Hierarchical Semantic Learning for Cross-Domain Few-Shot Segmentation¶
Conference: AAAI 2026 arXiv: 2511.12200 Code: Available Area: Segmentation Keywords: Cross-domain few-shot segmentation, hierarchical semantic learning, style randomization, superpixel, prototype confidence
TL;DR¶
This paper proposes the HSL framework, which addresses the segmentation granularity gap between source and target domains in cross-domain few-shot segmentation (CD-FSS) via three modules — Dual Style Randomization (DSR), Hierarchical Semantic Mining (HSM), and Prototype Confidence-modulated Thresholding (PCMT) — achieving state-of-the-art performance across four target-domain datasets.
Background & Motivation¶
Cross-domain few-shot segmentation (CD-FSS) aims to segment novel categories from unseen target domains given only a few annotated samples. Existing methods primarily address the style gap (e.g., color, texture) between source and target domains, while overlooking a fundamental issue: the segmentation granularity gap.
Specifically, the distinction between foreground (e.g., birds) and background in the source domain is typically coarse-grained (i.e., visually prominent), whereas in the target domain, foreground and background may be highly similar (e.g., lesion regions versus normal skin), resembling the fine-grained intra-foreground variations in the source domain (e.g., subtle color differences among feathers). Since models are trained exclusively on the source domain, they tend to treat the foreground as a monolithic entity and fail to capture fine-grained foreground–background semantic distinctions in the target domain.
The core mechanism is to extract hierarchical semantic features that endow the model with intra-class compactness and inter-class discriminability at multiple granularities.
Method¶
Overall Architecture¶
The HSL framework consists of three core modules:
- DSR (Dual Style Randomization): Applies foreground-level and global-level style randomization to training data.
- HSM (Hierarchical Semantic Mining): Mines hierarchical semantic features using multi-scale superpixel masks.
- PCMT (Prototype Confidence-modulated Thresholding): Mitigates segmentation ambiguity caused by high foreground–background similarity at test time.
The pipeline proceeds as follows: multi-scale superpixel masks are obtained via a superpixel segmentation model → DSR augmentation → feature extraction via an image encoder → feature enhancement via HSM → prototype computation via the SSP module → final prediction generation via PCMT.
Key Designs¶
DSR: Dual Style Randomization¶
Foreground Style Randomization:

- A local region image \(\mathbf{I}^{local}\) is randomly sampled from the coarsest superpixel mask.
- Both the foreground image \(\mathbf{I}^{fg}\) and the local region image are decomposed into amplitude and phase spectra via FFT.
- Their amplitude spectra are fused: \(\mathbf{A}^{fusion} = \omega \mathbf{A}^{local} + (1-\omega) \mathbf{A}^{fg}\), where \(\omega \sim N(0, \sigma_f^2)\).
- The foreground phase spectrum is preserved, and the fused amplitude is used to reconstruct the image via IFFT, simulating varying degrees of foreground–background dissimilarity.
Global Style Randomization:

- A random convolution (RC) layer perturbs local textures of the foreground-randomized image.
- An FFT-based recombination is again applied: the phase spectrum of the original image is retained, while the amplitude spectrum of the RC output is used for reconstruction.
- This prevents the RC layer from directly corrupting content details.
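The amplitude–phase recombination underlying both randomizations can be sketched as below. The function name `amplitude_fusion`, the array conventions, and the `sigma_f` default are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def amplitude_fusion(fg, local, sigma_f=0.75, rng=None):
    """DSR-style frequency augmentation (sketch): fuse the amplitude spectrum
    of a local-region crop into the foreground image, keeping the foreground
    phase so that content is preserved while style changes."""
    rng = rng or np.random.default_rng()
    # Decompose both images into amplitude and phase spectra via FFT
    fft_fg = np.fft.fft2(fg, axes=(0, 1))
    fft_local = np.fft.fft2(local, axes=(0, 1))
    amp_fg, phase_fg = np.abs(fft_fg), np.angle(fft_fg)
    amp_local = np.abs(fft_local)
    # A_fusion = w * A_local + (1 - w) * A_fg, with w ~ N(0, sigma_f^2)
    w = rng.normal(0.0, sigma_f)
    amp_fused = w * amp_local + (1.0 - w) * amp_fg
    # Reconstruct via IFFT from the fused amplitude and the original phase
    return np.real(np.fft.ifft2(amp_fused * np.exp(1j * phase_fg), axes=(0, 1)))
```

Setting \(\omega = 0\) recovers the original foreground exactly, since only the amplitude spectrum is interpolated; the same recombination with the RC output's amplitude gives the global variant.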
HSM: Hierarchical Semantic Mining¶
The core idea is that multi-scale superpixel masks naturally partition an image into local regions at varying granularities, approximating semantic regions at different scales.
Procedure:

1. For each superpixel scale, binary masks are generated for each region.
2. Shallow low-level features \(\mathbf{F}^l\) and deep high-level features \(\mathbf{F}^h\) are extracted.
3. Low-level features are downsampled, and Masked Average Pooling (MAP) is applied to obtain low-level and high-level prototypes for each region.
4. Low-level prototypes are enhanced via two layers of multi-head self-attention (MSA) and then fused with high-level prototypes: \(\mathbf{p}_{ij} = \alpha \tilde{\mathbf{p}}_{ij}^l + (1-\alpha) \mathbf{p}_{ij}^h\).
5. Region prototypes are mapped back to feature maps via RMAP, and all per-scale feature maps are aggregated onto the high-level features.
This allows each pixel to be influenced by multi-scale region prototypes, enhancing intra-class compactness and inter-class discriminability across granularities.
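A minimal single-scale sketch of the MAP–fuse–RMAP loop, omitting the MSA enhancement and the multi-scale aggregation; the function names and the `alpha` default are hypothetical:

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """MAP: average (C, H, W) features over a binary (H, W) region mask."""
    denom = mask.sum() + 1e-6          # avoid division by zero on empty masks
    return (feat * mask[None]).sum(axis=(1, 2)) / denom

def fuse_and_remap(feat_l, feat_h, masks, alpha=0.5):
    """One superpixel scale of HSM, simplified: pool a low-level and a
    high-level prototype per region, fuse them, and map each fused prototype
    back onto its region's pixels (RMAP)."""
    out = np.zeros_like(feat_h)
    for mask in masks:                                 # one binary mask per region
        p_l = masked_average_pooling(feat_l, mask)     # low-level prototype
        p_h = masked_average_pooling(feat_h, mask)     # high-level prototype
        p = alpha * p_l + (1.0 - alpha) * p_h          # p = a*p^l + (1-a)*p^h
        out += p[:, None, None] * mask[None]           # broadcast back (RMAP)
    return out
```

Since the masks at one scale partition the image, each pixel receives exactly one fused prototype per scale; summing the per-scale maps onto \(\mathbf{F}^h\) yields the multi-granularity enhancement described above.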
PCMT: Prototype Confidence-modulated Thresholding¶
To address segmentation ambiguity arising from high foreground–background similarity at test time:
- A foreground confidence map is computed: \(\mathbf{M}_q^{conf} = \mathbf{M}_q^{fg} - \mathbf{M}_q^{bg}\).
- An adaptive threshold \(t\) is computed via OTSU.
- A prototype confidence score \(C\) is introduced to quantify the probability of segmentation ambiguity (based on cross-view prototype similarity).
- The final threshold is \(t' = \frac{1}{1+e^{\beta(C+\gamma)}}\, t\), where \(\beta\) and \(\gamma\) are hyperparameters controlling the modulation:
- High prototype confidence → threshold approaches 0 (equivalent to conventional similarity comparison).
- Low prototype confidence → adaptive threshold \(t\) is applied (mitigating ambiguity).
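The thresholding logic can be sketched as follows, assuming a plain NumPy Otsu implementation and placeholder values for \(\beta\) and \(\gamma\) (the paper's actual settings are not reproduced here):

```python
import numpy as np

def otsu_threshold(conf, bins=256):
    """Adaptive threshold t on a confidence map via Otsu's method:
    pick the bin center that maximizes between-class variance."""
    hist, edges = np.histogram(conf, bins=bins)
    hist = hist.astype(float)
    centers = (edges[:-1] + edges[1:]) / 2.0
    total = hist.sum()
    sum_total = (hist * centers).sum()
    w0 = np.cumsum(hist)                               # class-0 weight
    sum0 = np.cumsum(hist * centers)
    w1 = total - w0                                    # class-1 weight
    m0 = sum0 / np.maximum(w0, 1e-12)                  # class-0 mean
    m1 = (sum_total - sum0) / np.maximum(w1, 1e-12)    # class-1 mean
    between = w0 * w1 * (m0 - m1) ** 2                 # between-class variance
    return float(centers[np.argmax(between)])

def pcmt_threshold(t, C, beta=10.0, gamma=-0.5):
    """Modulate the Otsu threshold by prototype confidence C:
    t' = t / (1 + exp(beta * (C + gamma)))."""
    return t / (1.0 + np.exp(beta * (C + gamma)))
```

With these placeholder values, a high-confidence episode (large \(C\)) drives \(t'\) toward 0, recovering conventional similarity comparison, while a low-confidence episode keeps \(t'\) close to the adaptive Otsu threshold \(t\).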
Loss & Training¶
- BCE loss is used for training.
- DSR is applied only during training; PCMT is applied only during inference.
- A meta-learning paradigm is adopted, with each episode comprising a support set and a query set.
- Four superpixel scales are used: \(\{5^2, 10^2, 15^2, 20^2\}\).
- SGD optimization is applied for 5 epochs with a learning rate of 1e-3.
Key Experimental Results¶
Main Results¶
Models are trained on PASCAL VOC 2012 + SBD and evaluated on four target-domain datasets:
| Method | Backbone | Deepglobe 1/5-shot | ISIC 1/5-shot | Chest X-ray 1/5-shot | FSS-1000 1/5-shot | Avg. 1/5-shot |
|---|---|---|---|---|---|---|
| DRA | Res-50 | 41.29/50.12 | 40.77/48.87 | 82.35/82.31 | 79.05/80.40 | 60.86/65.42 |
| LoEC | ViT-base | 42.12/51.48 | 52.91/62.43 | 83.94/84.12 | 81.05/83.69 | 65.01/70.43 |
| HSL (Ours) | Res-50 | 46.13/53.80 | 48.01/55.56 | 84.57/85.34 | 78.22/80.36 | 64.23/68.77 |
| HSL (Ours) | ViT-base | 45.77/54.56 | 59.36/64.62 | 85.95/86.25 | 81.89/83.84 | 68.24/72.32 |
With ViT-base, the proposed method surpasses the previous SOTA LoEC by 3.23 and 1.89 average mIoU points in the 1-shot and 5-shot settings, respectively.
Ablation Study¶
| DSR | HSM | PCMT | Res-50 mIoU | ViT-base mIoU |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 57.82 | 62.24 |
| ✓ | ✗ | ✗ | 60.44 (+2.62) | 64.55 (+2.31) |
| ✗ | ✓ | ✗ | 60.92 (+3.10) | 65.29 (+3.05) |
| ✓ | ✓ | ✗ | 62.97 | 67.05 |
| ✓ | ✓ | ✓ | 64.23 | 68.24 |
Ablation over multi-scale superpixel masks: removing any single scale consistently degrades performance, with the 4-scale configuration being optimal (e.g., removing the \(5\times5\) scale reduces performance from 67.05 to 66.34).
Ablation over thresholding strategies (ViT-base, 1-shot):
| Strategy | Deepglobe | ISIC | Chest | FSS | Avg. |
|---|---|---|---|---|---|
| Fixed threshold 0 | 44.54 | 55.79 | 85.93 | 81.93 | 67.05 |
| OTSU | 45.57 | 59.98 | 85.81 | 80.43 | 67.95 |
| PCMT (Ours) | 45.77 | 59.36 | 85.95 | 81.89 | 68.24 |
Key Findings¶
- HSM contributes most: Introducing HSM alone yields a larger gain than DSR alone (3.10% vs. 2.62%), indicating that hierarchical semantic mining is the primary driver for addressing the granularity gap.
- PCMT achieves flexible balance: OTSU performs well on ambiguity-prone ISIC but degrades on FSS-1000 (which is closer to the source domain); PCMT adaptively adjusts the threshold on a per-sample basis.
- All scales are complementary: Removing any single scale results in performance degradation, confirming the complementary nature of fine-grained and coarse-grained information.
Highlights & Insights¶
- First work to focus on the segmentation granularity gap: Unlike prior CD-FSS methods that target style discrepancy, this paper identifies and addresses the overlooked issue of granularity gap.
- Elegant use of FFT frequency-domain operations: Foreground style randomization alters appearance while preserving content by fusing amplitude spectra.
- Superpixels as hierarchical semantic priors: Multi-scale superpixels naturally provide semantic partitions at varying granularities in a simple yet effective manner.
- Adaptive mechanism in PCMT: Avoids a one-size-fits-all thresholding strategy by smoothly transitioning between conventional similarity comparison and adaptive thresholding based on prototype confidence.
Limitations & Future Work¶
- The quality of the superpixel segmentation model directly affects HSM performance; for structurally simple target domains (e.g., medical images), superpixels may not constitute the optimal prior.
- The four superpixel scales and various hyperparameters (\(\sigma_f\), \(\sigma_g\), \(K\), \(\alpha\), \(\beta\), \(\gamma\)) require careful tuning.
- PCMT relies on the OTSU algorithm, which may not be robust for confidence maps with multimodal distributions.
- Validation is limited to the PASCAL VOC → 4 target domain setting; generalization to larger-scale source domains remains unexplored.
Related Work & Insights¶
- PATNet (ECCV 2022): The pioneering CD-FSS work, proposing to transform features into a domain-agnostic space.
- DRA (CVPR 2024): Enhances generalization through domain randomization.
- LoEC (CVPR 2025): Applies style perturbation at the feature level.
- Broad applicability of frequency-domain methods: FFT is increasingly important in domain adaptation and generalization; the amplitude-swap paradigm merits broader adoption across tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The granularity gap perspective is novel; the DSR+HSM+PCMT three-module design is well-motivated and coherent.
- Technical Depth: ⭐⭐⭐⭐ — FFT-based frequency augmentation, multi-scale superpixel mining, and adaptive threshold modulation are technically solid.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations covering two backbones and four target domains.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated with intuitive illustrations.