Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning¶

Conference: CVPR 2025
arXiv: 2504.11930
Authors: Hairui Ren, Fan Tang, He Zhao, et al.
Institutions: Jilin University, Chinese Academy of Sciences (CAS), CSIRO
Area: Multimodal VLM
Keywords: VLM, unsupervised prompt learning, diffusion model, pseudo-label, auxiliary classifier

TL;DR¶

This paper proposes AiR (Augmenting discriminative Richness), a framework that uses a LoRA-fine-tuned Stable Diffusion model to generate synthetic images and builds an auxiliary classifier from them. Fusing this classifier with the text classifier extends unsupervised prompt learning from text-to-image matching to image-to-image matching, improving classification accuracy on challenging fine-grained and remote sensing datasets (+2.7% on average over CPL across four benchmarks).

Background & Motivation¶

Background: Vision-language models like CLIP have demonstrated powerful zero-shot classification capabilities via text-image alignment, yet a significant performance gap remains on specific downstream tasks. Prompt learning methods (e.g., CoOp, CoCoOp) adapt to downstream tasks by learning continuous prompts, but mostly require labeled data. Unsupervised prompt learning approaches (e.g., UPL, CPL) utilize pseudo-labels to bypass manual annotation, but the inherent noise in these pseudo-labels severely degrades learning quality.

Limitations of Prior Work: (1) Purely textual prompts possess limited descriptive granularity, making it difficult to capture fine-grained semantic differences in visual details (e.g., different flower species or remote sensing land cover types). (2) Pseudo-labels depend on CLIP's initial text classifier, which itself exhibits low accuracy on challenging datasets, creating a "chicken-and-egg" dilemma. (3) Existing methods remain bound to the text-to-image matching paradigm, ignoring informative discriminative cues that could be provided by image-to-image similarity.

Key Challenge: An inherent modality gap exists between textual descriptions and visual features: the textual description of a category is fixed, whereas its visual appearance is highly diverse. Consequently, text prompts alone cannot fully capture intra-class diversity and inter-class differences.

Goal: Leverage the "prior knowledge" of generative models to strengthen the classifier's discriminative ability without labeled data, compensating for the limitations of purely textual prompts in visual discrimination.

Key Insight: Treat the diffusion model as a "visual knowledge base": generate representative samples for each class, build an auxiliary image-to-image classifier from them, and fuse it with the original text classifier through a weighted sum.

Core Idea: Use synthetic images generated by a diffusion model as "visual prototypes" for each category, extending classification from text-to-image matching alone to joint text-to-image and image-to-image matching.

Method¶

Overall Architecture¶

The AiR framework comprises three core modules: (1) a LoRA-fine-tuned Stable Diffusion generator that synthesizes high-quality, representative class-specific images; (2) an Auxiliary Classifier Generation (ACG) module that selects representative synthetic images and builds an auxiliary classifier from them; and (3) a Pseudo-Label Generation (PLG) module that fuses the predictions of the text and auxiliary classifiers to output more accurate pseudo-labels.

Key Designs¶

LoRA-Fine-Tuned Diffusion Model:
- Function: Performs domain adaptation on Stable Diffusion to align generated images closely with the visual distribution of the target dataset.
- Mechanism: Uses unlabeled images from the target dataset to fine-tune the U-Net of Stable Diffusion with LoRA. LoRA freezes the original weights and trains only low-rank decomposition matrices, so the trainable parameters amount to roughly 0.1% of the original model.
- Design Motivation: Pre-trained Stable Diffusion performs well on general-domain images but lacks generation quality in specialized areas like remote sensing and medical imaging. LoRA fine-tuning adapts to the target domain distribution at a minimal computational cost, producing highly discriminative synthetic samples.
- Gain: LoRA fine-tuning accounts for a +3.4% to +8% improvement in accuracy across datasets.
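
The low-rank update behind LoRA can be sketched in a few lines of NumPy. This is a minimal illustration of the general LoRA idea, not the paper's training code: the names `lora_forward` and `trainable_fraction`, the layer sizes, the rank, and the `alpha / r` scaling are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    x: (batch, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    Only A and B would be trained; W stays fixed.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)      # low-rank weight update, shape (d_out, d_in)
    return x @ (W + delta).T

def trainable_fraction(d_out, d_in, r):
    """Fraction of one layer's weights that the adapter trains."""
    return r * (d_in + d_out) / (d_out * d_in)

rng = np.random.default_rng(0)
d_in, d_out, r = 320, 320, 4           # illustrative sizes, not taken from the paper
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))               # B starts at zero, so the update starts at zero
x = rng.standard_normal((2, d_in))

# Before any training step the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
print(f"trainable fraction for this layer: {trainable_fraction(d_out, d_in, r):.4f}")
```

The printed ratio is per layer, so it is not directly comparable to the roughly 0.1% figure above, which is stated relative to the whole model.
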
ACG Module—Auxiliary Classifier Generation:
- Function: Generates \(M\) synthetic images per category and selects the most representative samples to serve as class prototypes.
- Mechanism: Generates \(M\) images using the category name as a text prompt (e.g., "a photo of a residential area"). Visual features are extracted using the CLIP image encoder. The cosine similarity between each synthetic image's feature and the average feature of all synthetic images in that class is calculated. The \(K\) images with the highest similarity are chosen as representatives. The auxiliary classifier's prediction is formulated as: \(\hat{p}_c = \frac{1}{K}\sum_{k=1}^{K} \text{sim}(f_{\text{img}}, f_{\text{syn},k}^c)\)
- Design Motivation: Using all synthetic images without filtering introduces noise, because some generated images are of poor quality or deviate from the class semantics. Selecting the samples closest to the class centroid improves the auxiliary classifier's reliability. The sample-quantity experiment finds that about 120 synthetic images per class works best.
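
The selection and scoring steps described above can be sketched as follows. This is a rough illustration under stated assumptions: random vectors stand in for CLIP image features, features are L2-normalized before averaging (the summary does not say whether they are), and the helper names `select_prototypes` and `auxiliary_scores` are hypothetical.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def select_prototypes(syn_feats, k):
    """Keep the k synthetic features closest (by cosine) to their class mean.

    syn_feats: (M, D) features of the M synthetic images of one class.
    """
    feats = l2_normalize(syn_feats)
    centroid = l2_normalize(feats.mean(axis=0))   # assumed: mean of normalized features
    sims = feats @ centroid                       # (M,) cosine similarity to the centroid
    top = np.argsort(-sims)[:k]                   # indices of the k most similar images
    return feats[top]

def auxiliary_scores(img_feat, prototypes_per_class):
    """Mean cosine similarity between one image feature and each class's K prototypes."""
    f = l2_normalize(img_feat)
    return np.array([(protos @ f).mean() for protos in prototypes_per_class])

# Toy data: two classes whose synthetic features cluster around different directions.
rng = np.random.default_rng(0)
D, M, K = 16, 10, 3
centers = np.eye(D)[:2]
prototypes = [select_prototypes(c + 0.05 * rng.standard_normal((M, D)), K) for c in centers]
query = centers[1] + 0.05 * rng.standard_normal(D)
print(auxiliary_scores(query, prototypes))
```

Each entry of the returned array corresponds to \(\hat{p}_c\) for one class: the average similarity between the query feature and that class's \(K\) selected prototypes.
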
PLG Module—Pseudo-Label Generation:
- Function: Fuses predictions from both the text and auxiliary classifiers to generate more accurate pseudo-labels.
- Mechanism: The final prediction is a weighted fusion, defined as \(p_c^* = p_c + \lambda \hat{p}_c\), where \(p_c\) represents the prediction probability from the CLIP text-based classifier, \(\hat{p}_c\) is the prediction probability from the auxiliary classifier, and \(\lambda\) controls the weight of the auxiliary classifier.
- Design Motivation: The two classifiers yield complementary discriminative details: the text-based classifier excels at capturing high-level semantic class features, whereas the auxiliary classifier specializes in extracting visual texture and structural variations. Weighted fusion successfully combines the strengths of both.
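
A small worked example makes the fusion rule concrete. The scores and the value \(\lambda = 0.5\) below are made up for illustration, and `fuse_pseudo_labels` is a hypothetical helper name; only the formula \(p_c^* = p_c + \lambda \hat{p}_c\) comes from the text.

```python
import numpy as np

def fuse_pseudo_labels(p_text, p_aux, lam):
    """Return fused scores p* = p_text + lam * p_aux and their argmax pseudo-labels."""
    fused = p_text + lam * p_aux
    return fused, fused.argmax(axis=-1)

# One image, three classes (illustrative numbers only).
p_text = np.array([[0.40, 0.35, 0.25]])   # text classifier narrowly prefers class 0
p_aux = np.array([[0.20, 0.60, 0.20]])    # auxiliary classifier clearly prefers class 1

fused, labels = fuse_pseudo_labels(p_text, p_aux, lam=0.5)
print(fused)    # fused scores per class
print(labels)   # pseudo-label after fusion
```

Here the fused scores are 0.50, 0.65, and 0.35, so the pseudo-label moves from class 0 (text only) to class 1 once the auxiliary evidence is added. With `lam=0` the rule reduces to the text classifier alone.
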
Loss & Training:
- The total training loss is defined as: \(L = L_r + \beta L_s\)
- \(L_r\) represents the cross-entropy loss based on the fused pseudo-labels, used to optimize continuous prompts.
- \(L_s\) acts as an auxiliary self-supervised regularization loss, which constrains the consistency of augmented views to prevent overfitting to noisy pseudo-labels.
- \(\beta\) controls the strength of the regularization loss.
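
The combined objective can be sketched as below. The summary does not give the exact form of \(L_s\), so this example assumes a mean squared gap between the softmax predictions on two augmented views; that choice, the function names, and the value of `beta` are illustrative assumptions rather than the paper's definition.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """L_r: mean cross-entropy of logits (N, C) against integer pseudo-labels (N,)."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def consistency(logits_a, logits_b):
    """Assumed form of L_s: mean squared gap between predictions on two views."""
    return ((softmax(logits_a) - softmax(logits_b)) ** 2).sum(axis=-1).mean()

def total_loss(logits_a, logits_b, pseudo_labels, beta):
    """L = L_r + beta * L_s, with L_r computed on the first view."""
    return cross_entropy(logits_a, pseudo_labels) + beta * consistency(logits_a, logits_b)

rng = np.random.default_rng(0)
logits_view_a = rng.standard_normal((4, 3))                        # 4 images, 3 classes
logits_view_b = logits_view_a + 0.1 * rng.standard_normal((4, 3))  # a perturbed second view
pseudo_labels = np.array([0, 2, 1, 1])                             # fused pseudo-labels (toy values)
print(total_loss(logits_view_a, logits_view_b, pseudo_labels, beta=0.5))
```

When both views produce identical predictions the consistency term vanishes and the loss reduces to the cross-entropy on the pseudo-labels.
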

Key Experimental Results¶

Main Results: Comparison with SOTA Methods¶

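| Method | RESISC45 | Flowers102 | EuroSAT | DTD | Average |
| --- | --- | --- | --- | --- | --- |
| CLIP Zero-Shot | 60.2% | 66.1% | 42.0% | 43.8% | 53.0% |
| UPL | 72.4% | 65.8% | 48.3% | 55.2% | 60.4% |
| CPL | 77.3% | 69.2% | 52.1% | 57.9% | 64.1% |
| AiR (Ours) | 79.9% | 71.4% | 55.7% | 60.1% | 66.8% |
| Gain vs CPL | +2.6% | +2.2% | +3.6% | +2.2% | +2.7% |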
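
The Average column and the gain row can be re-derived from the per-dataset numbers. The snippet below simply recomputes them from the table values; the variable names are arbitrary.

```python
# Accuracy (%) on RESISC45, Flowers102, EuroSAT, DTD, copied from the table above.
results = {
    "CLIP Zero-Shot": [60.2, 66.1, 42.0, 43.8],
    "UPL": [72.4, 65.8, 48.3, 55.2],
    "CPL": [77.3, 69.2, 52.1, 57.9],
    "AiR": [79.9, 71.4, 55.7, 60.1],
}

# Unrounded mean accuracy per method.
averages = {name: sum(acc) / len(acc) for name, acc in results.items()}

# Per-dataset gain of AiR over CPL, rounded to one decimal like the table.
gains = [round(a - c, 1) for a, c in zip(results["AiR"], results["CPL"])]

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
print("gain vs CPL:", gains)
print(f"mean gain: {averages['AiR'] - averages['CPL']:.2f}")
```

The unrounded averages are 53.025, 60.425, 64.125, and 66.775, which round to the values in the table. The unrounded mean gain is 2.65 points; the reported +2.7% matches the difference of the rounded averages (66.8 and 64.1).
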

Ablation Study¶

| Configuration | RESISC45 Accuracy | Description |
| --- | --- | --- |
| Baseline (CPL) | 70.6% | Without auxiliary classifier |
| + \(\hat{p}_c\) (Auxiliary Classifier) | 72.3% | +1.7%, validating the efficacy of image-image matching |
| + \(L_s\) (Regularization Loss) | 72.9% | +2.3%, regularization reduces pseudo-label noise |
| + \(\hat{p}_c\) + \(L_s\) | 73.6% | +3.0%, demonstrating mutual complementarity |
| + LoRA Fine-tuning | 76.5% → 79.9% | LoRA provides an additional +3.4% to +6.3% improvement |
| Without LoRA vs With LoRA | 76.5% vs 79.9% | LoRA fine-tuning represents a critical component |

Impact of Synthetic Sample Quantity¶

| Synthetic Samples per Class | RESISC45 | Flowers102 | EuroSAT |
| --- | --- | --- | --- |
| 20 | 77.1% | 68.9% | 52.8% |
| 60 | 78.4% | 70.1% | 54.2% |
| 120 | 79.9% | 71.4% | 55.7% |
| 200 | 79.6% | 71.1% | 55.3% |
| 300 | 79.2% | 70.8% | 54.9% |

Key Findings¶

- Complementarity of the Auxiliary Classifier: Adding the auxiliary classifier \(\hat{p}_c\) yields a 1.7% accuracy gain on RESISC45 (70.6% to 72.3%), supporting the claim that image-to-image matching captures visual discriminative details that text-to-image matching misses.
- Crucial Role of LoRA Fine-Tuning: LoRA fine-tuning provides +3.4% to +8% improvements across all datasets, indicating that domain adaptation has a decisive impact on the quality of synthesized images.
- Optimal Synthetic Samples Around 120 per Class: Accuracy peaks at 120 samples per class on all three datasets. Fewer samples (20 to 60) fail to capture intra-class variation, while more samples (200 to 300) introduce noise that slightly lowers accuracy.
- Greater Gains on Challenging Datasets: The largest improvements over CPL are on EuroSAT (+3.6%) and RESISC45 (+2.6%), both remote sensing benchmarks where textual descriptions struggle to draw precise distinctions; DTD and Flowers102 each gain 2.2%.
- Independent Effectiveness of \(L_s\) Regularization: Adding only the regularization loss improves accuracy by 2.3% even without the auxiliary classifier, suggesting that the self-supervised consistency constraint mitigates pseudo-label noise.

Highlights & Insights¶

- Generative Models as a "Visual Knowledge Base": Departing from the conventional "generate-to-augment" training pipeline, AiR uses synthetic images directly as part of the classifier, namely as class prototypes. This avoids the domain shift issues that typically arise when synthetic and real data are mixed during training.
- Dual-Channel Text + Vision Classification: Fusing text-image matching with image-image matching lets the classifier use both linguistic and visual "anchors" in CLIP's feature space at the same time.
- High Cost-Efficiency of LoRA Fine-Tuning: Fine-tuning only about 0.1% of the model parameters yields significant performance improvements. It also requires no manual annotations (only raw, unlabeled images from the target domain), making it practical for real-world deployment.
- Orthogonality to Pseudo-Labeling Schemes: In principle, the auxiliary classifier in AiR can be combined with other unsupervised pseudo-labeling methods, which suggests good extensibility.

Limitations & Future Work¶

- Synthetic image generation adds computational overhead (LoRA fine-tuning and image synthesis), which may limit utility in resource-constrained environments.
- The efficacy of the auxiliary classifier depends on the diffusion model's capacity to represent the target domain, so it may fail on extreme out-of-distribution or highly novel classes.
- Tuning hyperparameters such as \(\lambda\) and \(\beta\) relies on validation sets, which are difficult to construct under fully unsupervised conditions.
- Evaluation has been performed exclusively with CLIP's visual encoder; suitability for alternative VLMs (such as BLIP-2 or SigLIP) requires further validation.
- It remains unexplored whether more advanced diffusion backbones (e.g., SDXL, Flux) could further improve synthetic image quality and final classification performance.