Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

Conference: CVPR 2026
arXiv: 2603.13341
Code: zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap
Area: Medical Imaging / Cross-Domain Few-Shot Learning
Keywords: Source-Free CDFSL, Vision-Language Model, Cross-Modal Alignment, Visual Discriminability Trap, CLIP Fine-tuning

TL;DR

This paper shows that, when a vision-language model (VLM) is fine-tuned for cross-domain few-shot learning, enhancing visual discriminability paradoxically degrades cross-modal alignment, a phenomenon the authors term the "discriminability trap." Two plug-and-play modules, Suppressing Visual Learning (SVL) and Relationship Alignment (RA), suppress visual learning shortcuts and guide cross-modal alignment, achieving state-of-the-art performance on 4 CDFSL benchmarks and 11 FSL datasets.

Background & Motivation

Background: Source-free cross-domain few-shot learning (SF-CDFSL) targets domains such as medical or remote sensing imagery, where only a handful of labeled samples are available and the source-domain data cannot be accessed, so a pretrained VLM must be fine-tuned directly on the target data.

Limitations of Prior Work: VLMs such as CLIP and SigLIP perform classification by computing cosine similarity between image and text features, making cross-modal alignment quality the decisive performance factor. While conventional wisdom in visual models holds that more discriminative visual features lead to better classification, the authors find that in VLM-based SF-CDFSL, increasing visual discriminability actually reduces cross-modal classification accuracy. Moreover, existing methods — including prompt learning (CoOp/MaPLe), adapters (LP++/LDC), and LoRA fine-tuning — all overlook the shortcut effect of visual learning.
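
The cosine-similarity classification rule described above can be illustrated with a toy example. This is a minimal sketch, not the authors' code: the 3-d embeddings and the helper names `cosine` and `classify` are hypothetical. It shows two images that are easy to tell apart visually (they point in opposite directions along the third axis) yet receive the same label, because both lie slightly closer to the class-1 text embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def classify(image_feature, text_features):
    """Index of the class whose text embedding is most similar to the image."""
    sims = [cosine(image_feature, t) for t in text_features]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 3-d embeddings; the values are made up for illustration.
text_features = [[1.0, 0.0, 0.0],   # prompt embedding for class 0
                 [0.0, 1.0, 0.0]]   # prompt embedding for class 1

aligned_image = [0.9, 0.1, 0.0]     # a class-0 image that sits near its prompt

# Two images from different classes that are trivially separable along the
# third axis, yet both lean slightly toward the class-1 prompt.
image_class0 = [0.4, 0.5, 0.8]
image_class1 = [0.4, 0.5, -0.8]

predictions = [classify(x, text_features)
               for x in (aligned_image, image_class0, image_class1)]
print(predictions)
```

In this toy setting, separating the two images further along the third axis would not change either prediction, since the third axis contributes nothing to the dot product with either text embedding. That is one way to picture why visual discriminability alone need not help a cross-modal classifier.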

Key Challenge: The cross-entropy loss \(\mathcal{L}_{\text{vlm}}\) encompasses two optimization directions: visual learning and cross-modal learning. Visual learning can reduce the loss without improving cross-modal alignment, acting analogously to a bypass valve in a dual-valve drainage system that diverts resources away from the intended channel.

Goal: To identify and mitigate the discriminability trap by suppressing visual learning shortcuts and explicitly guiding cross-modal alignment during VLM fine-tuning for SF-CDFSL.

Method

Overall Architecture

The method builds upon a VLM backbone (CLIP/SigLIP/PE-Core) and adopts a two-stage training strategy:

Early stage (first 3/5 of the training epochs): \(\mathcal{L} = \mathcal{L}_{\text{vlm}} + \beta \mathcal{L}_{\text{ra}} + \lambda \mathcal{L}_{\text{ad}}\), suppressing visual learning while guiding cross-modal alignment.
Late stage (remaining 2/5 of the training epochs): \(\mathcal{L} = \mathcal{L}_{\text{vlm}}\), resuming standard fine-tuning to allow visual learning.
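
The two-stage schedule can be sketched as a small loss-selection function. This is an illustrative sketch under stated assumptions: epochs are 0-indexed, the switch point is taken as the first epoch at or beyond 3/5 of the total, and the function names are hypothetical. The default weights follow the hyperparameters reported in this summary (\(\beta = 3\), \(\lambda = 0.1\)).

```python
def in_early_stage(epoch, num_epochs):
    """True while the 0-indexed epoch falls in the first 3/5 of training."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return 5 * epoch < 3 * num_epochs

def total_loss(epoch, num_epochs, l_vlm, l_ra, l_ad, beta=3.0, lam=0.1):
    """Combine the already-computed loss terms according to the stage."""
    if in_early_stage(epoch, num_epochs):
        return l_vlm + beta * l_ra + lam * l_ad
    return l_vlm

# With 10 epochs, epochs 0-5 use the full objective and 6-9 use L_vlm alone.
stages = ["early" if in_early_stage(e, 10) else "late" for e in range(10)]
print(stages)
```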

Key Designs

Key Design 1: Suppressing Visual Learning (SVL)

Design Motivation: Visual learning causes intra-class visual features to cluster and inter-class features to diverge, but this constitutes a shortcut that bypasses cross-modal alignment.
Mechanism: An anti-visual-learning loss \(\mathcal{L}_{\text{ad}}\) is introduced. Classifier weights \(w'\) are generated by randomly sampling from the support set; the cross-entropy is then computed and its gradient is negated.
Function: Perturbs discriminative clustering of visual features, compelling the model to reduce \(\mathcal{L}_{\text{vlm}}\) through the cross-modal pathway instead.
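
A rough sketch of the SVL idea follows, with several assumptions that are not taken from the paper: one support feature is sampled per class to form \(w'\), logits are temperature-scaled cosine similarities, and the gradient negation is stood in for by flipping the sign of the loss value (in an autograd framework, minimizing the negated loss reverses the gradient direction). The function names and the temperature of 0.1 are hypothetical.

```python
import math
import random

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def visual_cross_entropy(features, labels, weights, temperature=0.1):
    """Mean cross-entropy of cosine logits against image-derived class weights."""
    total = 0.0
    for feat, label in zip(features, labels):
        logits = [_cosine(feat, w) / temperature for w in weights]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[label]
    return total / len(features)

def anti_visual_loss(features, labels, num_classes, rng):
    """Sample w' from the support set, then return the sign-flipped visual CE."""
    weights = []
    for c in range(num_classes):
        candidates = [f for f, y in zip(features, labels) if y == c]
        weights.append(rng.choice(candidates))  # assumed: one sample per class
    return -visual_cross_entropy(features, labels, weights)

labels = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # tight classes
scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]  # mixed classes

rng = random.Random(0)
loss_clustered = anti_visual_loss(clustered, labels, 2, rng)
loss_scattered = anti_visual_loss(scattered, labels, 2, rng)
```

In this sketch the tightly clustered features receive the higher (less negative) anti-visual loss, so minimizing the term pushes against purely visual clustering, leaving the cross-modal pathway as the remaining way to lower \(\mathcal{L}_{\text{vlm}}\).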

Key Design 2: Relationship Alignment (RA)

Design Motivation: Suppressing visual learning alone is insufficient; the visual modality requires a correct learning direction for its internal relational structure.
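Fused Relationship Matrix: \(A^{\text{fuse}} = (1 - \frac{e}{E}) A^v + \frac{e}{E} A^t[L,L]\), where \(A^v\) and \(A^t\) are the visual and textual relationship matrices, \(A^t[L,L]\) selects the textual entries for the labels \(L\) of the current samples, and \(e/E\) is the training progress (current epoch over the number of scheduled epochs). The textual relational structure thus progressively replaces the visual one as training advances.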
Alignment Loss: \(\mathcal{L}_{\text{ra}} = D_{KL}(A^v \| A^{\text{fuse}})\)
Progressive Strategy: In early epochs, \(A^{\text{fuse}} \approx A^v\) serves a stabilizing role; in later epochs, textual semantic relations are gradually introduced to guide visual feature alignment.
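
The fusion and alignment loss can be sketched as follows. This is a minimal sketch under assumptions that go beyond what the summary states: each relationship matrix is taken to be a row-wise softmax over pairwise cosine similarities, and \(A^t[L,L]\) is built by looking up each sample's class prompt before normalizing, so that every row remains a probability distribution. The function names are hypothetical.

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def relation_matrix(features):
    """Row-wise softmax over pairwise cosine similarities (rows sum to 1)."""
    rows = []
    for f in features:
        exps = [math.exp(_cosine(f, g)) for g in features]
        norm = sum(exps)
        rows.append([x / norm for x in exps])
    return rows

def ra_loss(image_features, text_features, labels, epoch, num_epochs):
    """KL(A_v || A_fuse) averaged over rows, with A_fuse blended by e/E."""
    a_v = relation_matrix(image_features)
    # A_t[L, L]: look up each sample's class prompt, then relate the prompts.
    a_t = relation_matrix([text_features[y] for y in labels])
    w = epoch / num_epochs
    a_fuse = [[(1 - w) * v + w * t for v, t in zip(rv, rt)]
              for rv, rt in zip(a_v, a_t)]
    kl = sum(p * math.log(p / q)
             for rp, rq in zip(a_v, a_fuse) for p, q in zip(rp, rq))
    return kl / len(a_v)

# Toy data: three support images (two of class 0, one of class 1).
images = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 1]

losses = [ra_loss(images, texts, labels, e, 10) for e in (0, 5, 10)]
```

At epoch 0 the fused target equals \(A^v\), so the loss is zero. As \(e/E\) grows the target moves toward the textual structure and the loss rises unless the visual relations follow it.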

Loss & Training

\[\mathcal{L} = \begin{cases} \mathcal{L}_{\text{vlm}} + \beta \mathcal{L}_{\text{ra}} + \lambda \mathcal{L}_{\text{ad}} & \text{(early stage)} \\ \mathcal{L}_{\text{vlm}} & \text{(late stage)} \end{cases}\]

Hyperparameters: \(\lambda = 0.1\) (visual branch) or \(0.001\) (text branch), \(\beta = 3\).

Key Experimental Results

Main Results: 4 CDFSL Datasets (5-way 1-shot / 5-shot)

Method	Backbone	ISIC	EuroSAT	CropDisease	ChestX	Avg
CLIP-LoRA-Vision	ViT/CLIP	36.40	81.72	84.62	21.86	56.07
CLIP-LoRA + Ours	ViT/CLIP	38.12	85.02	87.20	22.68	58.26
PE-Core-LoRA	ViT/PE-Core	40.89	84.49	91.75	22.02	59.78
PE-Core-LoRA + Ours	ViT/PE-Core	45.01	86.83	93.03	23.66	62.14

Under the 5-shot setting, PE-Core-LoRA + Ours achieves an average accuracy of 70.29% (vs. baseline 68.64%).
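
The size of these gains can be checked with a few lines of arithmetic. The sketch below only reuses the average accuracies quoted above; the dictionary keys and the helper name `gains` are illustrative.

```python
# (baseline average, baseline + SVL/RA average), accuracy in %, copied from above.
averages = {
    "CLIP-LoRA (table)": (56.07, 58.26),
    "PE-Core-LoRA (table)": (59.78, 62.14),
    "PE-Core-LoRA (5-shot)": (68.64, 70.29),
}

def gains(baseline, ours):
    """Absolute gain in percentage points and relative gain in percent."""
    absolute = ours - baseline
    return round(absolute, 2), round(100 * absolute / baseline, 2)

summary = {name: gains(*pair) for name, pair in averages.items()}
for name, (absolute, relative) in summary.items():
    print(f"{name}: +{absolute:.2f} points, +{relative:.2f}% relative")
```

The two table rows come out at roughly 2.2 to 2.4 points absolute, or about 3.9% relative. This suggests, though the summary does not say so explicitly, that the 3.9% improvement quoted under Key Findings is a relative figure.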

Method	CropDisease Gap↓	EuroSAT Gap↓	ISIC Gap↓	ChestX Gap↓
Fine-tune	0.014	0.048	0.406	0.356
FT + \(\mathcal{L}_v\)	0.022	0.072	0.626	0.742
FT + \(\mathcal{L}_{ad}\)	0.012	0.024	0.191	0.249
FT + \(\mathcal{L}_{ad}\) + \(\mathcal{L}_{ra}\)	0.009	0.027	0.171	0.238

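A smaller gap indicates better modal alignment. Adding an explicit visual-learning loss (FT + \(\mathcal{L}_v\)) widens the gap on every dataset (e.g., ChestX: 0.356 to 0.742), whereas SVL (\(\mathcal{L}_{ad}\)) narrows it on every dataset (ChestX: 0.356 to 0.249). Adding RA (\(\mathcal{L}_{ra}\)) narrows it further on three of the four datasets, with EuroSAT the exception (0.024 to 0.027).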

Ablation Study

SVL	RA	CropDisease	EuroSAT	ISIC	ChestX	Avg
✗	✗	84.6	81.7	36.4	21.8	56.07
✓	✗	86.4	83.8	37.6	22.4	57.55
✗	✓	85.9	83.2	37.4	22.2	57.17
✓	✓	87.2	85.0	38.1	22.7	58.26
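
To read the ablation at a glance, the per-module gains over the baseline can be computed from the Avg column. This is simple arithmetic on the numbers in the table; the variable names are illustrative.

```python
# Avg column of the ablation table, keyed by (SVL enabled, RA enabled).
avg = {(False, False): 56.07, (True, False): 57.55,
       (False, True): 57.17, (True, True): 58.26}

baseline = avg[(False, False)]
delta = {key: round(value - baseline, 2) for key, value in avg.items()}

svl_only = delta[(True, False)]
ra_only = delta[(False, True)]
both = delta[(True, True)]

# Difference between the summed individual gains and the combined gain.
overlap = round(svl_only + ra_only - both, 2)
print(svl_only, ra_only, both, overlap)
```

Each module helps on its own and the combination helps most, but the combined gain is smaller than the sum of the individual gains, so the two modules appear to be partly overlapping rather than fully independent in effect.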

Key Findings

Suppression timing: Suppressing visual learning in the early stage (the "Begin" setting) is most effective, whereas suppressing it in the late stage (the "Last" setting) is counterproductive and performs worse than the baseline.
Negligible computational overhead: Parameter count increases by 0.0028%, FLOPs by 0.000021%, while accuracy improves by 3.9%.
Strong generalizability: The method is effective across three VLM backbones (CLIP, SigLIP2, PE-Core) and three fine-tuning paradigms (CoOp, MaPLe, LoRA).
Broad coverage: Gains are consistent on 11 FSL datasets across all shot configurations.

Highlights & Insights

Insightful observation: This work is the first to reveal that visual discriminability learning constitutes a shortcut that undermines cross-modal alignment in VLM fine-tuning, supported by a triple chain of evidence: theoretical derivation, empirical validation, and visualization.
Intuitive dual-valve analogy: The optimization process is likened to a dual-valve drainage system, where visual learning acts as a bypass valve that diverts resources away from cross-modal alignment.
Minimalist design: SVL and RA require only a few lines of code, are plug-and-play, and are compatible with diverse VLM fine-tuning methods.
Near-zero overhead: Parameter count and FLOPs each grow by well under 0.01%, yet the method yields significant performance gains.
Elegant Gap Shift experimental design: Modal alignment is quantified by manually adjusting inter-modal distances, enabling intuitive and precise measurement.

Limitations & Future Work

Validation is limited to classification tasks; extension to downstream tasks such as detection and segmentation remains unexplored.
The two-stage training switch point (3/5 of total epochs) is a fixed ratio rather than an adaptive schedule.
Sensitivity of hyperparameters \(\lambda\) and \(\beta\) across domain-shift scenarios of varying magnitude is insufficiently discussed.
Anti-visual-learning may be unnecessary or even detrimental when the domain gap is small (e.g., in-domain FSL).
Text prompts rely on the simple template "a photo of [class]," without leveraging richer textual descriptions.

Related Work

SF-CDFSL: Methods such as StepSTP and FWT address source-free cross-domain few-shot learning but do not analyze the adverse effects of visual learning.
VLM fine-tuning: CoOp (prompt learning), CLIP-Adapter, and CLIP-LoRA all employ cross-entropy fine-tuning and are susceptible to the discriminability trap.
Modality gap research: Liang et al. first identified the modality gap; subsequent work analyzed modal misalignment in cross-domain settings but assumed standard fine-tuning was sufficient for correction.
Shortcut learning: Previously studied in conventional vision models; this paper is the first to introduce the concept into the context of VLM cross-modal fine-tuning.

Rating

Novelty: ⭐⭐⭐⭐⭐ — First to expose the conflict between visual discriminability and cross-modal alignment in VLM fine-tuning.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple VLM backbones, multiple fine-tuning paradigms, 15 datasets, theoretical analysis, ablations, and visualizations.
Writing Quality: ⭐⭐⭐⭐⭐ — The dual-valve analogy is elegant and the argumentation is logically rigorous.
Value: ⭐⭐⭐⭐ — A universally applicable plug-and-play method, though currently limited to classification tasks.