
Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

Conference: CVPR2026 arXiv: 2603.13341 Code: zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap Area: Medical Imaging / Cross-Domain Few-Shot Learning Keywords: Source-Free CDFSL, Vision-Language Model, Cross-Modal Alignment, Visual Discriminability Trap, CLIP Fine-tuning

TL;DR

This paper reveals that enhancing visual discriminability during VLM fine-tuning for cross-domain few-shot learning paradoxically degrades cross-modal alignment — a phenomenon termed the "discriminability trap." Two plug-and-play modules, SVL and RA, are proposed to suppress visual learning shortcuts and guide cross-modal alignment, achieving state-of-the-art performance on 4 CDFSL benchmarks and 11 FSL datasets.

Background & Motivation

Background: The Source-Free CDFSL setting involves target domains (e.g., medical or remote sensing images) with only a handful of labeled samples and no access to source-domain data, requiring direct fine-tuning of pretrained VLMs.

Limitations of Prior Work: VLMs such as CLIP and SigLIP perform classification by computing cosine similarity between image and text features, making cross-modal alignment quality the decisive performance factor. While conventional wisdom in visual models holds that more discriminative visual features lead to better classification, the authors find that in VLM-based SF-CDFSL, increasing visual discriminability actually reduces cross-modal classification accuracy. Moreover, existing methods — including prompt learning (CoOp/MaPLe), adapters (LP++/LDC), and LoRA fine-tuning — all overlook the shortcut effect of visual learning.
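Concretely, a VLM classifies a few-shot sample by comparing its image feature against one text feature per class, and \(\mathcal{L}_{\text{vlm}}\) is the cross-entropy over these similarities. Below is a minimal PyTorch sketch of this loss, assuming CLIP-style encoders; the function names and the temperature value are illustrative, not the paper's API.

```python
import torch
import torch.nn.functional as F

def vlm_classification_loss(image_encoder, text_encoder, images, class_token_ids,
                            labels, temperature=0.01):
    """Cross-entropy over cosine similarities between image and class-text features."""
    img_feat = F.normalize(image_encoder(images), dim=-1)          # (B, D) image features
    txt_feat = F.normalize(text_encoder(class_token_ids), dim=-1)  # (C, D): one prompt per class
    logits = img_feat @ txt_feat.t() / temperature                 # scaled cosine similarities
    return F.cross_entropy(logits, labels)                         # L_vlm
```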

Key Challenge: The cross-entropy loss \(\mathcal{L}_{\text{vlm}}\) encompasses two optimization directions: visual learning and cross-modal learning. Visual learning can reduce the loss without improving cross-modal alignment, acting analogously to a bypass valve in a dual-valve drainage system that diverts resources away from the intended channel.

Goal: To identify and mitigate the discriminability trap by suppressing visual learning shortcuts and explicitly guiding cross-modal alignment during VLM fine-tuning for SF-CDFSL.

Method

Overall Architecture

The method builds upon a VLM backbone (CLIP/SigLIP/PE-Core) and adopts a two-stage training strategy:

  • Early stage (first 3/5 epochs): \(\mathcal{L} = \mathcal{L}_{\text{vlm}} + \beta \mathcal{L}_{\text{ra}} + \lambda \mathcal{L}_{\text{ad}}\), suppressing visual learning while guiding cross-modal alignment.
  • Late stage (last 2/5 epochs): \(\mathcal{L} = \mathcal{L}_{\text{vlm}}\), resuming standard fine-tuning to allow visual learning.

Key Designs

Key Design 1: Suppressing Visual Learning (SVL)

  • Design Motivation: Visual learning causes intra-class visual features to cluster and inter-class features to diverge, but this constitutes a shortcut that bypasses cross-modal alignment.
  • Mechanism: An anti-visual-learning loss \(\mathcal{L}_{\text{ad}}\) is introduced. Classifier weights \(w'\) are generated by randomly sampling from the support set; the cross-entropy is then computed and its gradient is negated.
  • Function: Perturbs discriminative clustering of visual features, compelling the model to reduce \(\mathcal{L}_{\text{vlm}}\) through the cross-modal pathway instead.
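A minimal sketch of how SVL could be realized from the description above: a purely visual classifier \(w'\) is assembled from randomly sampled support features, its cross-entropy is computed, and the gradient flowing back into the visual features is negated via a gradient-reversal op. Function and variable names are assumptions, not the official implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates the gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def anti_visual_loss(img_feat, support_feat, support_labels, labels,
                     num_classes, temperature=0.01):
    # Build classifier weights w' by sampling one support feature per class.
    weights = []
    for c in range(num_classes):
        cls_feat = support_feat[support_labels == c]            # support features of class c
        weights.append(cls_feat[torch.randint(len(cls_feat), (1,)).item()])
    w = F.normalize(torch.stack(weights), dim=-1)               # (C, D) sampled classifier w'
    feat = GradReverse.apply(F.normalize(img_feat, dim=-1))     # negate gradients into the encoder
    logits = feat @ w.t() / temperature
    return F.cross_entropy(logits, labels)                      # L_ad
```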

Key Design 2: Relationship Alignment (RA)

  • Design Motivation: Suppressing visual learning alone is insufficient; the visual modality requires a correct learning direction for its internal relational structure.
  • Fused Relationship Matrix: \(A^{\text{fuse}} = (1 - \frac{e}{E}) A^v + \frac{e}{E} A^t[L,L]\), which progressively replaces visual relational structure with textual relational structure as training advances.
  • Alignment Loss: \(\mathcal{L}_{\text{ra}} = D_{KL}(A^v \| A^{\text{fuse}})\)
  • Progressive Strategy: In early epochs, \(A^{\text{fuse}} \approx A^v\) serves a stabilizing role; in later epochs, textual semantic relations are gradually introduced to guide visual feature alignment.
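A minimal sketch of RA under one reading of the formulas above: \(A^v\) is a row-normalized similarity matrix over the batch's visual features, \(A^t\) is the analogous matrix over the class-text features, and \(A^t[L,L]\) re-indexes it by the batch labels so both matrices compare the same sample pairs. The softmax temperature and the detaching of the fused target are my assumptions.

```python
import torch
import torch.nn.functional as F

def relationship_alignment_loss(img_feat, txt_feat, labels, epoch, total_epochs, tau=0.1):
    v = F.normalize(img_feat, dim=-1)              # (B, D) visual features
    t = F.normalize(txt_feat, dim=-1)              # (C, D) class-text features
    A_v = F.softmax(v @ v.t() / tau, dim=-1)       # visual relationship matrix A^v
    A_t = F.softmax(t @ t.t() / tau, dim=-1)       # textual relationship matrix A^t
    A_t_ll = A_t[labels][:, labels]                # A^t[L, L]: class relations per sample pair
    alpha = epoch / total_epochs                   # e / E, grows as training advances
    A_fuse = ((1 - alpha) * A_v + alpha * A_t_ll).detach()
    # L_ra = KL(A^v || A^fuse), computed row-wise and averaged over the batch
    kl = (A_v * (A_v.clamp_min(1e-8).log() - A_fuse.clamp_min(1e-8).log())).sum(dim=-1)
    return kl.mean()
```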

Loss & Training

\[\mathcal{L} = \begin{cases} \mathcal{L}_{\text{vlm}} + \beta \mathcal{L}_{\text{ra}} + \lambda \mathcal{L}_{\text{ad}} & \text{(early stage)} \\ \mathcal{L}_{\text{vlm}} & \text{(late stage)} \end{cases}\]

Hyperparameters: \(\lambda = 0.1\) (visual branch) or \(0.001\) (text branch), \(\beta = 3\).
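Putting the pieces together, here is a sketch of the two-stage schedule with the hyperparameters above, reusing the loss sketches from the earlier sections; the joint forward signature and the integer switch point at 3/5 of the epochs are assumptions.

```python
import torch.nn.functional as F

def train(model, loader, support_feat, support_labels, optimizer, total_epochs,
          beta=3.0, lam=0.1, temperature=0.01):
    switch_epoch = int(0.6 * total_epochs)          # stage switch after the first 3/5 of epochs
    for epoch in range(total_epochs):
        for images, class_token_ids, labels in loader:
            img_feat, txt_feat = model(images, class_token_ids)   # hypothetical joint forward
            logits = (F.normalize(img_feat, dim=-1)
                      @ F.normalize(txt_feat, dim=-1).t()) / temperature
            loss = F.cross_entropy(logits, labels)                # L_vlm
            if epoch < switch_epoch:                # early stage: add L_ra and L_ad
                loss = loss + beta * relationship_alignment_loss(
                    img_feat, txt_feat, labels, epoch, total_epochs)
                loss = loss + lam * anti_visual_loss(
                    img_feat, support_feat, support_labels, labels, txt_feat.size(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```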

Key Experimental Results

Main Results: 4 CDFSL Datasets (5-way 1-shot accuracy, %)

| Method | Backbone | ISIC | EuroSAT | CropDisease | ChestX | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP-LoRA-Vision | ViT/CLIP | 36.40 | 81.72 | 84.62 | 21.86 | 56.07 |
| CLIP-LoRA + Ours | ViT/CLIP | 38.12 | 85.02 | 87.20 | 22.68 | 58.26 |
| PE-Core-LoRA | ViT/PE-Core | 40.89 | 84.49 | 91.75 | 22.02 | 59.78 |
| PE-Core-LoRA + Ours | ViT/PE-Core | 45.01 | 86.83 | 93.03 | 23.66 | 62.14 |

Under the 5-shot setting, PE-Core-LoRA + Ours achieves an average accuracy of 70.29% (vs. baseline 68.64%).

Gap Shift Analysis: Cross-Modal Alignment Gap

| Method | CropDisease Gap↓ | EuroSAT Gap↓ | ISIC Gap↓ | ChestX Gap↓ |
| --- | --- | --- | --- | --- |
| Fine-tune | 0.014 | 0.048 | 0.406 | 0.356 |
| FT + \(\mathcal{L}_v\) | 0.022 | 0.072 | 0.626 | 0.742 |
| FT + \(\mathcal{L}_{ad}\) | 0.012 | 0.024 | 0.191 | 0.249 |
| FT + \(\mathcal{L}_{ad}\) + \(\mathcal{L}_{ra}\) | 0.009 | 0.027 | 0.171 | 0.238 |

A smaller gap indicates better modal alignment. Enhancing visual learning significantly worsens alignment, whereas SVL+RA yields substantial improvement.
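The summary does not spell out how the Gap metric in the table is computed; one common proxy for cross-modal alignment, in the spirit of Liang et al.'s modality-gap analysis, is the distance between the centroids of the normalized image and text embeddings. A hypothetical sketch:

```python
import torch
import torch.nn.functional as F

def modality_gap(img_feat, txt_feat):
    """Distance between the mean normalized image embedding and the mean text embedding."""
    img_center = F.normalize(img_feat, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_feat, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()
```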

Ablation Study

| SVL | RA | CropDisease | EuroSAT | ISIC | ChestX | Avg |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | 84.6 | 81.7 | 36.4 | 21.8 | 56.07 |
| ✓ |  | 86.4 | 83.8 | 37.6 | 22.4 | 57.55 |
|  | ✓ | 85.9 | 83.2 | 37.4 | 22.2 | 57.17 |
| ✓ | ✓ | 87.2 | 85.0 | 38.1 | 22.7 | 58.26 |

Key Findings

  • Suppression timing: Suppressing visual learning in the early training stage (the "Begin" setting) is most effective; applying suppression only in the late stage (the "Last" setting) is counterproductive and performs worse than the baseline.
  • Negligible computational overhead: Parameter count increases by 0.0028%, FLOPs by 0.000021%, while accuracy improves by 3.9%.
  • Strong generalizability: The method is effective across three VLM backbones (CLIP, SigLIP2, PE-Core) and three fine-tuning paradigms (CoOp, MaPLe, LoRA).
  • Consistent gains on 11 FSL datasets across all shot configurations.

Highlights & Insights

  • Insightful observation: This work is the first to reveal that visual discriminability learning constitutes a shortcut that undermines cross-modal alignment in VLM fine-tuning, supported by three lines of evidence: theoretical derivation, empirical validation, and visualization.
  • Intuitive dual-valve analogy: The optimization process is likened to a dual-valve drainage system, where visual learning acts as a bypass valve that diverts resources away from cross-modal alignment.
  • Minimalist design: SVL and RA require only a few lines of code, are plug-and-play, and are compatible with diverse VLM fine-tuning methods.
  • Negligible extra overhead: Parameters and computation remain virtually unchanged, yet the method yields significant performance gains.
  • Elegant Gap Shift experimental design: Modal alignment is quantified by manually adjusting inter-modal distances, enabling intuitive and precise measurement.

Limitations & Future Work

  • Validation is limited to classification tasks; extension to downstream tasks such as detection and segmentation remains unexplored.
  • The two-stage training switch point (3/5 of total epochs) is a fixed ratio rather than an adaptive schedule.
  • Sensitivity of hyperparameters \(\lambda\) and \(\beta\) across domain-shift scenarios of varying magnitude is insufficiently discussed.
  • Anti-visual-learning may be unnecessary or even detrimental when the domain gap is small (e.g., in-domain FSL).
  • Text prompts rely on the simple template "a photo of [class]," without leveraging richer textual descriptions.
Related Work

  • SF-CDFSL: Methods such as StepSTP and FWT address source-free cross-domain few-shot learning but do not analyze the adverse effects of visual learning.
  • VLM fine-tuning: CoOp (prompt learning), CLIP-Adapter, and CLIP-LoRA all employ cross-entropy fine-tuning and are susceptible to the discriminability trap.
  • Modality gap research: Liang et al. first identified the modality gap; subsequent work analyzed modal misalignment in cross-domain settings but assumed standard fine-tuning was sufficient for correction.
  • Shortcut learning: Previously studied in conventional vision models; this paper is the first to introduce the concept into the context of VLM cross-modal fine-tuning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First to expose the conflict between visual discriminability and cross-modal alignment in VLM fine-tuning.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Multiple VLM backbones, multiple fine-tuning paradigms, 15 datasets, theoretical analysis, ablations, and visualizations.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The dual-valve analogy is elegant and the argumentation is logically rigorous.
  • Value: ⭐⭐⭐⭐ — A universally applicable plug-and-play method, though currently limited to classification tasks.