Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Conference: CVPR 2026
arXiv: 2603.17655
Code: CC-CDFSL
Area: Medical Imaging
Keywords: Cross-Domain Few-Shot Learning, CLIP, Local Feature Alignment, Cycle Consistency, Interpretability

TL;DR

This paper identifies and addresses the degradation of local feature alignment in CLIP under cross-domain few-shot learning (CDFSL), and proposes CC-CDFSL, a cycle-consistency-based framework. Through bidirectional T-I-T and I-T-I cyclic paths and a semantic anchor mechanism, CC-CDFSL improves patch-level vision-language alignment while enhancing model interpretability.

Background & Motivation

Background: Vision-language models such as CLIP provide a strong foundation for cross-domain few-shot learning.
Limitations of Prior Work: After fine-tuning on target domains, models struggle to focus on fine-grained visual cues (e.g., ground-glass opacities and local nodules in chest X-rays). Although CLIP can roughly cover all salient regions in the source domain, local patch-text alignment degrades far more severely than global alignment after domain transfer.
Key Challenge: A quantitative analysis measures a global alignment score \(\text{A}_g\) and a local alignment score \(\text{A}_l\). On cross-domain tasks the drop in \(\text{A}_l\) is significantly larger than the drop in \(\text{A}_g\), confirming that domain shift and data scarcity disproportionately harm local feature alignment.
Goal: Restore patch-level vision-language alignment in the target domain. The degradation is especially critical in domains that require fine-grained recognition, such as medical diagnosis: subtle textures or density variations indicative of pneumonia appear in only a few patches, yet model heatmaps only coarsely outline body contours.

Method

Overall Architecture

Three modules are added on top of CLIP fine-tuning (the Semantic Anchor (SA) mechanism, T-I-T cycle consistency, and I-T-I cycle consistency), applied in four steps: SA augmentation stage → T-I-T cycle consistency → SA contraction stage → I-T-I cycle consistency. The two cycle-consistency terms act as regularization losses added to the standard cross-entropy loss.

Key Designs

T-I-T Cycle Consistency (Text-to-Image-to-Text): For each text feature \(\mathbf{T}_j\), the most similar patch feature is selected from all patch features: \(\mathbf{L}_j^* = \mathbf{L}_{\arg\max_i \mathbf{D}_{j,i}^{txt}}\), where \(\mathbf{D}_{j,i}^{txt}\) is the similarity between text \(j\) and patch \(i\). This patch is then mapped back to the text space to retrieve the most similar text \(\mathbf{T}_j^{rec}\), enforcing \(\mathbf{T}_j \approx \mathbf{T}_j^{rec}\). Loss over the \(C\) classes: \(\mathcal{L}_{\text{cyc\_txt}} = 1 - \frac{1}{C}\sum_{j=1}^{C}\text{sim}(\mathbf{T}_j, \mathbf{T}_j^{rec})\). Design Motivation: Analogous to cycle consistency in machine translation, this aligns local visual and textual semantics without requiring patch-level annotations.
Semantic Anchor Mechanism (SA): Augmentation stage: \(A\) augmented views are generated per image to expand the patch candidate pool \(\mathbf{X}_{aug} \in \mathbb{R}^{((A+1) \cdot M) \times d}\), i.e., the original image plus \(A\) views, each contributing \(M\) patches of dimension \(d\). Contraction stage: the top-\(k\) most similar patches per class are selected as semantic anchors \(\mathbf{X}_{anchor}\), filtering out noisy and semantically irrelevant background regions. Design Motivation: The visual modality is information-rich but noisy; augmentation first provides diversity (enlarging the candidate pool for T-I-T), then contraction removes noise (retaining core semantics for I-T-I).
I-T-I Cycle Consistency (Image-to-Text-to-Image): For each anchor \(\mathbf{x}_n\), the most similar text \(t_n\) is retrieved, and \(t_n\) is then used to retrieve the most similar patch \(\hat{\mathbf{x}}_n\) from the augmented view space, enforcing \(\mathbf{x}_n \approx \hat{\mathbf{x}}_n\). Retrieving across views rather than within a single image enhances robustness to input transformations (e.g., rotation, flipping).
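
The three designs above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `sim` is taken to be cosine similarity, patch features are assumed to be already projected into the text space, and retrieval uses a hard `argmax` exactly as described (the source does not say how gradients are passed through the retrieval step during training).

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def tit_cycle_loss(text_feats, patch_feats):
    """T-I-T: text -> most similar patch -> most similar text."""
    T = l2_normalize(text_feats)                  # (C, d) class text features
    L = l2_normalize(patch_feats)                 # (P, d) candidate patches
    D_txt = T @ L.T                               # (C, P) text-to-patch similarity
    best_patch = L[D_txt.argmax(axis=1)]          # (C, d) L_j^* for each class
    T_rec = T[(best_patch @ T.T).argmax(axis=1)]  # (C, d) text retrieved back
    return 1.0 - float(np.mean(np.sum(T * T_rec, axis=1)))

def select_anchors(text_feats, pool_feats, k):
    """SA contraction: indices of the top-k pool patches per class, shape (C, k)."""
    sims = l2_normalize(text_feats) @ l2_normalize(pool_feats).T  # (C, P)
    return np.argsort(-sims, axis=1)[:, :k]

def iti_cycle_loss(anchor_feats, text_feats, view_feats):
    """I-T-I: anchor -> most similar text -> most similar patch in the views."""
    X = l2_normalize(anchor_feats)                # (N, d) semantic anchors
    T = l2_normalize(text_feats)                  # (C, d)
    V = l2_normalize(view_feats)                  # (P, d) augmented-view patches
    t_n = T[(X @ T.T).argmax(axis=1)]             # (N, d) text retrieved per anchor
    x_hat = V[(t_n @ V.T).argmax(axis=1)]         # (N, d) patch retrieved back
    # Assumed form: mirrors the text-side loss, averaged over anchors.
    return 1.0 - float(np.mean(np.sum(X * x_hat, axis=1)))

# Toy usage: 3 classes, a pool of (A + 1) * M = 12 patches, feature dimension 8.
rng = np.random.default_rng(0)
texts = rng.normal(size=(3, 8))
pool = rng.normal(size=(12, 8))
anchor_idx = select_anchors(texts, pool, k=2)     # (3, 2)
anchors = pool[anchor_idx.reshape(-1)]            # (6, 8)
loss_txt = tit_cycle_loss(texts, pool)
loss_img = iti_cycle_loss(anchors, texts, pool)
```

When every class retrieves a distinct, well-matched patch, the round trip returns to the starting text and the T-I-T loss approaches 0; when several classes collapse onto the same patch, the reconstructed texts disagree with the originals and the loss rises.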

Loss & Training

\[\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{\text{cyc\_txt}} + \lambda_2 \mathcal{L}_{\text{cyc\_img}}\]
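
Only \(\mathcal{L}_{\text{cyc\_txt}}\) is spelled out in this summary. Assuming the image-side term mirrors it over the \(N\) semantic anchors (the symbol \(N\) and this exact form are assumptions, not taken from the paper), it would read:

\[\mathcal{L}_{\text{cyc\_img}} = 1 - \frac{1}{N}\sum_{n=1}^{N}\text{sim}(\mathbf{x}_n, \hat{\mathbf{x}}_n)\]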

\(\lambda_1 = 3.0\), \(\lambda_2 = 2.0\) (determined via grid search on ISIC)
\(k=10\) (number of anchor patches), fixed across all experiments
ViT-Base/16 CLIP backbone, 100 epochs of fine-tuning, single RTX 4090
2-layer MLP to project local patch features into the text feature space
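
As a rough sketch of how these settings fit together, the snippet below projects patch features with a 2-layer MLP and combines the three loss terms using the reported weights. The function names, the hidden width, the ReLU activation, the weight initialization, and the toy dimensions are all assumptions made for illustration; the source specifies only the depth of the MLP and the values of \(\lambda_1\) and \(\lambda_2\).

```python
import numpy as np

def init_mlp(d_in, d_hidden, d_out, seed=0):
    """Random parameters for a 2-layer MLP (initialization scheme is assumed)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.02, (d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.02, (d_hidden, d_out)), "b2": np.zeros(d_out),
    }

def project_patches(params, patches):
    """Map local patch features (P, d_in) into the text feature space (P, d_out)."""
    hidden = np.maximum(patches @ params["W1"] + params["b1"], 0.0)  # ReLU (assumed)
    return hidden @ params["W2"] + params["b2"]

def total_loss(ce, cyc_txt, cyc_img, lambda1=3.0, lambda2=2.0):
    """L_total = L_CE + lambda1 * L_cyc_txt + lambda2 * L_cyc_img."""
    return ce + lambda1 * cyc_txt + lambda2 * cyc_img

# Toy usage with made-up dimensions and loss values.
params = init_mlp(d_in=8, d_hidden=16, d_out=4)
projected = project_patches(params, np.ones((5, 8)))   # shape (5, 4)
loss = total_loss(ce=1.0, cyc_txt=0.5, cyc_img=0.25)   # 1.0 + 1.5 + 0.5
```

Because both cycle losses enter only as weighted additive terms, the same function applies unchanged whichever PEFT method produces the cross-entropy term.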

Key Experimental Results

Main Results

Dataset	Task	CLIP-LoRA	CLIP-LoRA + Ours	Gain
ISIC (Dermatology)	5-way 1-shot	35.23	38.13	+2.90
ChestX (Chest X-ray)	5-way 1-shot	21.73	22.21	+0.48
EuroSAT (Satellite)	5-way 1-shot	81.49	86.07	+4.58
CropDisease	5-way 1-shot	85.11	88.91	+3.80
ISIC	5-way 5-shot	50.68	54.72	+4.04
EuroSAT	5-way 5-shot	92.63	94.35	+1.72
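
The Gain column can be recomputed directly from the rows above. The short script below is only an arithmetic check on the numbers in this table; the variable and function names are illustrative.

```python
# (dataset, setting, CLIP-LoRA, CLIP-LoRA + Ours), copied from the table above
rows = [
    ("ISIC", "1-shot", 35.23, 38.13),
    ("ChestX", "1-shot", 21.73, 22.21),
    ("EuroSAT", "1-shot", 81.49, 86.07),
    ("CropDisease", "1-shot", 85.11, 88.91),
    ("ISIC", "5-shot", 50.68, 54.72),
    ("EuroSAT", "5-shot", 92.63, 94.35),
]

def gains(setting):
    """Per-dataset gain of the proposed method over CLIP-LoRA for one setting."""
    return {name: round(ours - base, 2)
            for name, s, base, ours in rows if s == setting}

one_shot = gains("1-shot")
five_shot = gains("5-shot")
avg_one_shot = round(sum(one_shot.values()) / len(one_shot), 2)
print(one_shot)
print(five_shot)
print(avg_one_shot)  # 2.94
```

The four 1-shot gains average to +2.94, while ChestX stands out as the smallest individual gain.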

Ablation Study (5-way 5-shot)

Configuration	ISIC	ChestX	EuroSAT	CropDisease	Avg.
Baseline	50.68	24.44	92.63	96.20	65.98
+ T-I-T	51.13	25.15	93.79	96.37	66.61
+ T-I-T + SA	54.30	25.35	94.33	96.95	67.73
+ I-T-I + SA	53.81	25.14	93.83	97.01	67.45
Full (T-I-T + I-T-I + SA)	54.72	25.47	94.35	97.08	67.90

Key Findings

With SA enabled, the T-I-T cycle contributes more than the I-T-I cycle (+1.75 vs. +1.47 avg. over the baseline); T-I-T alone, without SA, adds +0.63. T-I-T focuses on the most semantically relevant patches, which reduces interference.
The SA mechanism yields significant gains for both cycles (+1.12 and +0.84 avg., respectively).
Cross-view retrieval outperforms intra-image retrieval, which in turn outperforms full-image retrieval; the diversity of augmented views is the key factor.
CC-CDFSL functions as a plug-and-play module compatible with multiple PEFT methods including CoOp, CLIP-Adapter, MaPLe, and CLIP-LoRA.
Improvements are also observed on 11 datasets in base-to-new generalization, particularly on EuroSAT (+3.6%).
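
The average gains discussed in these findings follow from the Avg. column of the ablation table. The snippet below is a small arithmetic check using only those five values; the dictionary keys simply reuse the configuration names.

```python
# Avg. column of the ablation table
avg = {
    "Baseline": 65.98,
    "+ T-I-T": 66.61,
    "+ T-I-T + SA": 67.73,
    "+ I-T-I + SA": 67.45,
    "Full": 67.90,
}

# Gain of each configuration over the baseline
delta = {name: round(value - avg["Baseline"], 2)
         for name, value in avg.items() if name != "Baseline"}

# Gain from adding SA on top of T-I-T alone
sa_gain_tit = round(avg["+ T-I-T + SA"] - avg["+ T-I-T"], 2)

for name, d in delta.items():
    print(f"{name}: +{d:.2f}")
print(f"SA on top of T-I-T: +{sa_gain_tit:.2f}")
```

Read this way, T-I-T alone adds 0.63 points, T-I-T with SA adds 1.75, I-T-I with SA adds 1.47, and the full model adds 1.92 over the baseline average.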

Highlights & Insights

This work is the first to identify and quantify the phenomenon that local alignment degradation exceeds global alignment degradation in CLIP under CDFSL.
Introducing cycle consistency from machine translation into VLM local alignment is an elegant self-supervised idea that requires no additional annotations.
The SA "augment-then-contract" design elegantly balances candidate diversity and noise filtering.
Interpretability of the T-I-T path: even when reconstructed texts are not perfectly matched, the method reveals pathological regions attended to by the model and cross-class semantic relationships.
Formulating the method as a regularization term endows it with strong plug-and-play generality.

Limitations & Future Work

Gains on ChestX are limited (+0.48 at 1-shot, +1.03 at 5-shot), possibly due to the greater semantic complexity of chest radiographs.
\(\lambda_1\) and \(\lambda_2\) require tuning on the target domain; optimal hyperparameters may differ across datasets.
The specific data augmentation strategies used for augmented view generation are not elaborated.
Validation is limited to ViT architectures; extension to other visual encoders has not been explored.
Computational overhead analysis is insufficient; the added patch similarity computations may affect training efficiency.

Related Work

The cycle consistency concept from CycleGAN (Zhu et al. 2017) is creatively applied to VLM local alignment.
FG-CLIP (Xie et al. 2025) and related works investigate the insufficient fine-grained capability of CLIP.
CLIP-LoRA (Zanella & Ben Ayed 2024) serves as the strongest baseline; the proposed method achieves an average improvement of +2.94 (5-way 1-shot, averaged over the four datasets) over this baseline.

Rating

Novelty: ⭐⭐⭐⭐ The problem is precisely identified; applying cycle consistency to VLM local alignment is a novel contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 4 CDFSL datasets (plus 11 datasets for base-to-new generalization), 4 PEFT methods, and 2 backbones, with detailed ablations; exceptionally comprehensive.
Writing Quality: ⭐⭐⭐⭐ Logically rigorous with rich visualizations; the problem–observation–solution narrative flows smoothly.
Value: ⭐⭐⭐⭐ A versatile plug-and-play framework with significant practical value for few-shot scenarios requiring fine-grained recognition, such as medical imaging.