
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin

Conference: ICML2025
arXiv: 2505.02056
Code: GitHub
Area: Multimodal VLM
Keywords: VLM Pseudolabels, CLIP Fine-tuning, Imbalanced Pseudolabels, Concept Alignment, Calibrated Margin, Unsupervised/Semi-supervised Learning

TL;DR

Proposes the CAP framework, which addresses the class imbalance of VLM-generated pseudolabels through concept alignment (detecting and fixing concept mismatch) and a confusion-aware calibrated margin (alleviating concept confusion). It achieves a 6.29% relative improvement over SOTA methods across six datasets and three paradigms.

Background & Motivation

Utilizing the zero-shot capability of VLMs (such as CLIP) to generate pseudolabels for downstream unlabeled data for fine-tuning has recently become a research hotspot. The core challenge is that VLMs exhibit preference bias toward different categories, which leads to a severely imbalanced pseudolabel distribution and consequently triggers confirmation bias.

Existing methods apply ex-post remedies: UPL and FPL force balance by selecting the top-k high-confidence samples per class, GRIP increases $k$ round by round, and CPL assigns each sample a set of candidate pseudolabels. None of them analyzes the root causes of the imbalance.

This work provides the first in-depth analysis of the two root causes of imbalance:

Concept Mismatch: The text features of class names are severely misaligned with image features, making it nearly impossible to correctly predict that class (e.g., approximately 5% of classes in RESISC45 suffer from this issue).

Concept Confusion: Text features of similar categories fail to capture the most discriminative visual concepts, causing predictions to skew toward a certain class (affecting about 30% of classes).

The authors visualize the feature clusters of the five RESISC45 classes with the lowest accuracy. The image features cluster well, yet CLIP's zero-shot accuracy on these classes is extremely low, confirming a semantic gap between the text and image features.

Method

Overall Architecture: CAP

CAP (Concept-Adaptive Pseudolabeling) consists of three steps:

Concept Alignment (§3.1): Detects concept mismatch categories and enhances text descriptions using LLMs.
Confusion-Aware Calibrated Margin (§3.2): Constructs a margin matrix based on inter-class similarity and prediction tendencies.
Dual-Adapter Fine-Tuning (§3.3): Learns from high-quality pseudolabels and dynamic pseudolabels, respectively.

3.1 Concept Alignment

Mismatch Detection Algorithm: An iterative clustering strategy gradually removes well-matched classes, leaving the remaining ones as mismatch classes.

Performs K-Means clustering (number of clusters = number of classes) on image features $\mathcal{I}$.
Computes the similarity matrix $\mathbf{S}^{\mathcal{TC}}$ between text features and cluster centers, applying softmax to obtain the probability matrix.
Finds the (text feature, cluster center) pair $(i^*, j^*)$ with the highest confidence and removes it.
Iterates until the number of remaining classes is below a threshold $t$. The remaining classes, $\mathcal{Y}_{\text{final}}$, are the candidate mismatch classes.
Takes the intersection with $\mathcal{Y}_{\text{low-}t}$, the classes having the fewest predicted samples: $\mathcal{Y}_{\text{MM}} = \mathcal{Y}_{\text{final}} \cap \mathcal{Y}_{\text{low-}t}$.
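
The detection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes a precomputed similarity matrix (rows are class text features, columns are K-Means cluster centers, any logit scaling already applied) instead of running K-Means itself. The function name `detect_mismatch`, the re-normalization of the softmax over still-unmatched clusters, and the reuse of $t$ as the size of the low-prediction set are all assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def detect_mismatch(sim, pred_counts, t):
    """Greedy text-to-cluster matching (hypothetical helper).

    sim[i][j]   : similarity of class i's text feature to cluster center j
    pred_counts : number of zero-shot predictions per class
    t           : stop once fewer than t classes remain; also used as the
                  size of the low-prediction set (assumption)
    """
    classes = list(range(len(sim)))
    clusters = list(range(len(sim[0])))
    while len(classes) >= t and clusters:
        best = None  # (probability, class index, cluster index)
        for i in classes:
            # Assumption: softmax is taken over the unmatched clusters only.
            probs = softmax([sim[i][j] for j in clusters])
            for p, j in zip(probs, clusters):
                if best is None or p > best[0]:
                    best = (p, i, j)
        _, i_star, j_star = best
        classes.remove(i_star)    # well-matched class is dropped
        clusters.remove(j_star)   # together with its cluster center
    # Classes with the fewest predicted samples.
    low_t = sorted(range(len(pred_counts)), key=lambda c: pred_counts[c])[:t]
    return sorted(set(classes) & set(low_t))
```

With a toy 4-class similarity matrix in which only the last class has a flat similarity row and almost no predictions, the function returns that class alone.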

LLM Text Enhancement: For mismatch classes, an LLM is queried to generate $n$ enhanced descriptions. The description with the highest similarity to the cluster center is selected to replace the original class name template, and pseudolabels are assigned based on top-k cosine similarity.
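
The selection step can be illustrated with a short sketch. It assumes the $n$ LLM descriptions have already been encoded into feature vectors and that a cluster center has been paired with the mismatched class (how that pairing is made is not spelled out in this summary). The names `pick_description` and `top_k_pseudolabels` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_description(desc_feats, center):
    """Index of the description feature most similar to the cluster center."""
    return max(range(len(desc_feats)), key=lambda k: cosine(desc_feats[k], center))


def top_k_pseudolabels(image_feats, text_feat, k):
    """Indices of the k images most similar to the chosen text feature."""
    order = sorted(
        range(len(image_feats)),
        key=lambda n: cosine(image_feats[n], text_feat),
        reverse=True,
    )
    return order[:k]
```

The images returned by `top_k_pseudolabels` would then receive the mismatched class as their pseudolabel.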

3.2 Confusion-Aware Calibrated Margin

Core Idea: Incorporates adaptive margins into the cross-entropy loss, encouraging the model to make more discriminative predictions between confusable classes.

Calibrated Margin Loss:

\[\mathcal{L}_m(y, \mathbf{z}) = -\log \frac{e^{z_y}}{e^{z_y} + \sum_{c \neq y} e^{z_c + \mathbf{M}_{yc}}}\]
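
A direct transcription of this loss for a single sample, written as a sketch in plain Python. The function name `calibrated_margin_loss` is an assumption; `M` is any square matrix of margins indexed as `M[y][c]`.

```python
import math


def calibrated_margin_loss(y, z, M):
    """-log( e^{z_y} / (e^{z_y} + sum_{c != y} e^{z_c + M[y][c]}) )."""
    # Add the margin to every non-target logit; the target logit is unchanged.
    shifted = [z[c] + (0.0 if c == y else M[y][c]) for c in range(len(z))]
    peak = max(shifted)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in shifted))
    return log_denominator - z[y]
```

With an all-zero `M` this reduces to the standard cross-entropy. A positive `M[y][c]` raises the loss unless $z_y$ exceeds $z_c$ by a correspondingly larger gap.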

Construction of Margin Matrix $\mathbf{M}$:

Inter-class Similarity Matrix $\mathbf{S}$: Takes the maximum of visual prototype similarity and textual prototype similarity. $$\mathbf{S}_{ij} = \max(\text{sim}(\bar{\mathbf{v}}_i, \bar{\mathbf{v}}_j), \text{sim}(\mathbf{w}_i, \mathbf{w}_j))$$
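Class Prediction Tendency $\delta_c$: Counts the number of samples $\sigma(c)$ predicted as class $c$ with confidence exceeding the threshold $\tau$. $$\delta_c = 1 - \frac{\sigma(c)}{\max_j \sigma(j)}$$ The most frequently predicted class therefore gets $\delta_c = 0$, and rarely predicted classes get $\delta_c$ close to 1.
Class Margin Scaling: $m_c = m \times \Delta \times \delta_c$, where $m$ is the base margin scale and $\Delta = \max_c(\delta_c)$.
Final Margin Matrix: $\mathbf{M} = \mathbf{S} \odot \mathbf{m}$ (Hadamard product), where $\mathbf{m}$ collects the per-class margins $m_c$.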
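
The construction can be sketched in a few lines. This summary does not state along which axis $\mathbf{m}$ is broadcast against $\mathbf{S}$; the sketch assumes `M[i][j] = S[i][j] * m_c[i]`, so that a rarely predicted label class receives larger margins against the classes it resembles. The function name `margin_matrix` and the guard against an all-zero count vector are also assumptions.

```python
def margin_matrix(sim_visual, sim_text, sigma, m):
    """Build M from prototype similarities and confident-prediction counts.

    sim_visual, sim_text : K x K similarity matrices between class prototypes
    sigma                : sigma[c] = samples predicted as c with confidence > tau
    m                    : base margin scale (hyperparameter)
    """
    K = len(sigma)
    # S_ij = max(visual prototype similarity, textual prototype similarity)
    S = [[max(sim_visual[i][j], sim_text[i][j]) for j in range(K)] for i in range(K)]
    # delta_c = 1 - sigma(c) / max_j sigma(j); the guard avoids division by zero.
    top = max(max(sigma), 1)
    delta = [1.0 - s / top for s in sigma]
    Delta = max(delta)
    m_c = [m * Delta * d for d in delta]
    # Assumption: per-class margins scale the rows of S.
    return [[S[i][j] * m_c[i] for j in range(K)] for i in range(K)]
```

For counts `[10, 5, 0]` the most frequently predicted class gets a zero row, while the never-predicted class gets margins equal to `m` times its similarities.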

Key Design: $\mathbf{M}$ is updated once per epoch to progressively mitigate confusion. For classes with low prediction tendencies (large $\delta_c$) and high similarity to other classes, a larger margin penalty is applied.

3.3 Dual-Adapter Fine-Tuning Framework

Based on MaPLe prompt tuning, two independent visual adapters are deployed:

Main Adapter ($\phi^m$): Learns solely from the high-precision pseudolabels $\mathcal{D}_{\text{PL}}$ obtained during the concept alignment phase, while simultaneously generating pseudolabels for unlabeled data.
Pseudo Adapter ($\phi^p$): Learns solely from the unlabeled data $\mathcal{D}_{\text{UL}}$ with dynamic pseudolabels (FixMatch-style, filtered by threshold $\tau$).
An adapter $\psi^a$ is also deployed on the text branch.
All adapters are disabled during inference.
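
The threshold filtering used for the dynamic pseudolabels can be sketched as below. This shows only the confidence filter, assuming the logits come from the branch that generates pseudolabels. The weak/strong augmentation pairing of FixMatch and the adapters themselves are omitted, and `select_pseudolabels` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudolabels(logits_per_sample, tau):
    """Keep (sample index, predicted class) pairs whose confidence is >= tau."""
    selected = []
    for n, logits in enumerate(logits_per_sample):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= tau:  # low-confidence samples are skipped this step
            selected.append((n, label))
    return selected
```

Only the selected pairs would contribute to $\mathcal{L}_{\text{UL}}$ in a given step. The per-class counts of these confident predictions are also what $\sigma(c)$ in the margin construction tallies.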

Total Loss: In the unsupervised (UL) setting, $\mathcal{L} = \mathcal{L}_{\text{PL}} + \mathcal{L}_{\text{UL}}$. The SSL and TRZSL settings additionally include $\mathcal{L}_{\text{L}}$, the loss on labeled data.

Key Experimental Results

The approach is compared across 3 paradigms (SSL / UL / TRZSL) on 6 datasets (Flowers102, RESISC45, DTD, EuroSAT, CUB, FGVCAircraft). The table lists accuracy (%) in the UL setting for five of them:

Method	Flowers102 (UL)	RESISC45 (UL)	DTD (UL)	EuroSAT (UL)	CUB (UL)
Zero-shot CLIP	63.40	54.46	43.45	30.54	51.57
FPL	65.67	68.13	44.96	48.96	53.04
GRIP	69.84	74.11	46.09	57.21	51.42
CPL	72.90	80.98	51.91	67.26	—
CAP (Ours)	76.80	83.32	55.29	—	—

Achieves a 3.9 percentage point improvement over CPL on Flowers102 UL, and a 2.3 percentage point improvement on RESISC45 UL.
Achieves SOTA performance across all three paradigms (SSL/UL/TRZSL), with an overall relative improvement of 6.29% over CPL.
The improvement is particularly significant on classes suffering from severe concept mismatch.
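
As a check on the arithmetic, the per-dataset gains can be recomputed from the UL numbers in the table. The sketch uses only the three columns where both CPL and CAP values are listed. The 6.29% overall figure is averaged over all datasets and paradigms and cannot be reproduced from this partial table.

```python
# UL accuracies (%) copied from the table above.
cpl = {"Flowers102": 72.90, "RESISC45": 80.98, "DTD": 51.91}
cap = {"Flowers102": 76.80, "RESISC45": 83.32, "DTD": 55.29}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {name: relative_gain(cap[name], cpl[name]) for name in cpl}
for name, gain in gains.items():
    points = cap[name] - cpl[name]
    print(f"{name}: +{points:.2f} points, {gain:.2f}% relative")
```

On these three columns the relative gains are roughly 5.35%, 2.89%, and 6.51%.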

Highlights & Insights

In-depth Problem Analysis: For the first time, pseudolabel imbalance is attributed to two manifestations of the semantic gap: concept mismatch and concept confusion, supported by quantitative statistics (5% mismatch classes, 30% confusion classes).
Iterative Clustering Detection: Automatically discovers mismatch categories in an unsupervised manner without requiring any annotation.
Ingenious Margin Matrix Design: Jointly utilizes inter-class similarity and prediction tendencies to adaptively adjust the decision boundaries between different class pairs.
Dual-Adapter Noise Isolation: The main adapter maintains high precision without being contaminated by dynamic pseudolabel noise, providing a simple yet effective architecture.
Comprehensive Paradigm Coverage: Unifies three learning paradigms (UL / SSL / TRZSL) into a single framework.

Limitations & Future Work

LLM Dependency: Concept alignment requires calling an LLM to generate text descriptions, increasing the pipeline complexity and cost.
Hyperparameter Sensitivity: Parameters such as threshold $t$ (for mismatch detection), $\tau$ (for confidence filtering), and the margin scale $m$ require tuning.
Limited to Classification Tasks: The framework design centers around image classification, and its applicability to tasks like object detection or segmentation has not been explored.
Limited Dataset Scale: The 6 datasets evaluated are of medium size; validation on large-scale scenarios such as ImageNet has not been performed.
Adapters Disabled during Inference: Adapters are enabled during training but disabled at inference. This training-inference discrepancy may limit the performance ceiling.

Rating

Novelty: ⭐⭐⭐⭐ — The problem analysis (mismatch vs. confusion) is novel; the combined scheme of iterative clustering detection and calibrated margin is highly original.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 6 datasets × 3 paradigms with complete ablation studies, though it lacks large-scale validation.
Writing Quality: ⭐⭐⭐⭐ — Features clear motivational diagrams and complete mathematical derivations, resulting in good overall readability.
Value: ⭐⭐⭐⭐ — VLM pseudolabel imbalance is a real practical pain point; the proposed method is transferable to other VLM fine-tuning scenarios.