Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification¶

Conference: CVPR 2025
arXiv: 2503.16873
Code: https://github.com/k0u-id/CCD
Authors: Dongseob Kim, Hyunjung Shim
Institutions: Samsung Electronics, KAIST
Area: Social Computing
Keywords: CLIP, unsupervised multi-label classification, CAM, debiasing, pseudo-labels

TL;DR¶

This paper proposes Classifier-guided CLIP Distillation (CCD), an unsupervised multi-label classification method that reaches 90.1% mAP on VOC12 without any manual annotations, on par with a fully supervised ResNet-101 baseline. It relies on two core techniques: CAM-guided local view label aggregation and CLIP prediction debiasing.

Background & Motivation¶

Background: Multi-label image classification requires a model to predict multiple semantic labels for an image simultaneously. Traditional fully supervised methods rely heavily on manual annotations, which are expensive: each image requires, on average, 2-3 positive classes and dozens of negative classes to be annotated. Vision-language pre-trained models such as CLIP open new possibilities for unsupervised classification, because pre-training on massive image-text pairs gives them zero-shot classification capability.

Limitations of Prior Work: (1) CLIP predictions are view-dependent: CLIP yields very different class probabilities for different crops of the same image. For example, in an image containing "person + horse", a crop of the person predicts only "person", a crop of the horse predicts only "horse", and the global image may yield low probabilities for both. (2) CLIP exhibits an inherent prediction bias: the predicted probabilities for certain classes (e.g., "person") are systematically higher, while those for others (e.g., "sofa") are systematically lower, even when the objects are equally prominent in the image. This bias stems from category frequency imbalance in the pre-training corpus. (3) Existing unsupervised methods such as CDUL use CLIP but do not effectively address these two issues, leaving a significant performance gap to fully supervised approaches.

Key Challenge: CLIP possesses rich visual-semantic knowledge, which theoretically suffices for high-quality multi-label classification. However, its systematic deficiencies—view-dependency and prediction bias—hinder the effective transfer of this knowledge. Existing methods either ignore these limitations or only partially address them.

Goal: Systematically resolve CLIP's view-dependency and prediction bias so that its knowledge can be distilled into a downstream classifier, achieving high-quality multi-label classification without manual annotations.

Key Insight: The Class Activation Maps (CAMs) that the downstream classifier produces during its own training can guide CLIP's label generation in the reverse direction. A CAM indicates where each category is located in the image; local views are cropped around those locations, CLIP predicts on each view, and the predictions are aggregated into labels. This reverse, "classifier-guiding-CLIP" distillation is the core innovation.

Core Idea: Use CAM generated by the classifier to guide CLIP in predicting and aggregating labels on local regions, while debiasing CLIP probabilities using training set statistics, thereby achieving annotation-free multi-label classification.

Method¶

Overall Architecture¶

CCD consists of four phases: (1) Generating initial pseudo-labels using CLIP; (2) Updating pseudo-labels with classifier-CAM guided local view selection; (3) Debiasing CLIP predictions; (4) Final training with a consistency loss. The entire pipeline requires zero manual annotations, relying solely on the pre-trained CLIP model and a list of category names.

Key Designs¶

Phase 1: Initial Pseudo-Label Generation:
- Function: Generates initial multi-label pseudo-labels for each image in the training set.
- Mechanism: Computes the cosine similarity between the image and each category text ("a photo of {class}") using a frozen CLIP model to obtain a class probability vector. The per-class maximum over the global image and multiple random crops serves as the initial probability. A fixed threshold (0.5) binarizes these probabilities into pseudo-labels.
- Design Motivation: Global predictions may miss small objects, whereas random crops provide a degree of local view coverage. However, the label quality at this stage is relatively low (approx. 86.4% mAP on VOC12) because the cropping is random rather than semantically guided.
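
The Phase 1 mechanism can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the embeddings are assumed to come from a frozen CLIP encoder, and the function names, the softmax over scaled cosine similarities, and the `temperature` value are assumptions, since the summary does not state how similarities are turned into probabilities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_probs(view_emb, text_embs, temperature=0.01):
    """Softmax over scaled cosine similarities (assumed probability mapping)."""
    logits = [cosine(view_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def initial_pseudo_labels(view_embs, text_embs, threshold=0.5):
    """Per-class max over all views (global image + crops), then binarize."""
    per_view = [class_probs(v, text_embs) for v in view_embs]
    probs = [max(column) for column in zip(*per_view)]
    labels = [1 if p >= threshold else 0 for p in probs]
    return probs, labels
```

With two toy views that each match a different class text, the max over views keeps both classes, whereas either view alone would keep only one. This is the view-dependency that the aggregation step is meant to compensate for.
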
Phase 2: Classifier-Guided Label Update:
- Function: Leverages CAM information from the training classifier to specifically crop local regions containing target classes for CLIP to re-predict.
- Mechanism: (a) Perform a forward pass on the image using the current classifier to obtain the CAM heatmap for each class; (b) Threshold each CAM to obtain bounding boxes of activated regions; (c) Randomly crop multiple (approx. 10) local views around the bounding boxes; (d) Use CLIP to compute class probabilities for each local view and aggregate them via max pooling—retaining a class as long as any local view predicts its presence; (e) Weighted fusion of the locally aggregated labels with the initial labels: \(\hat{y} = \alpha \cdot y_{init} + (1-\alpha) \cdot y_{local}\), where \(\alpha=0.4\).
- Design Motivation: CAM naturally indicates "where the classifier believes the category is in the image." Relying on it to guide cropping ensures that local views indeed contain the target objects. Max pooling aggregation is chosen because, in multi-label scenarios, over-detection is preferable to missing targets (prioritizing high recall).
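
Steps (b) through (e) of Phase 2 can be sketched as follows, under stated assumptions. The CAM is a small 2D list, the threshold is taken relative to the CAM peak, and crops are produced by jittering the box edges; none of these details are given in the summary, and all function names and default values other than \(\alpha=0.4\) and the crop count of 10 are hypothetical.

```python
import random

def cam_to_bbox(cam, rel_threshold=0.5):
    """Inclusive box (top, left, bottom, right) around cells >= rel_threshold * peak.
    Returns None when the CAM has no positive activation."""
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return None
    cut = rel_threshold * peak
    rows = [i for i, row in enumerate(cam) if any(v >= cut for v in row)]
    cols = [j for j in range(len(cam[0])) if any(row[j] >= cut for row in cam)]
    return rows[0], cols[0], rows[-1], cols[-1]

def sample_crops(bbox, height, width, n=10, jitter=1, seed=0):
    """Randomly jitter each box edge to get n local views inside the image."""
    rng = random.Random(seed)

    def clamp(value, upper):
        return min(upper, max(0, value))

    top, left, bottom, right = bbox
    crops = []
    for _ in range(n):
        t = clamp(top + rng.randint(-jitter, jitter), height - 1)
        l = clamp(left + rng.randint(-jitter, jitter), width - 1)
        b = clamp(bottom + rng.randint(-jitter, jitter), height - 1)
        r = clamp(right + rng.randint(-jitter, jitter), width - 1)
        crops.append((t, l, max(t, b), max(l, r)))  # keep the box non-empty
    return crops

def aggregate_local_views(local_probs):
    """Max pooling over views: a class survives if any view scores it highly."""
    return [max(column) for column in zip(*local_probs)]

def fuse_labels(y_init, y_local, alpha=0.4):
    """y_hat = alpha * y_init + (1 - alpha) * y_local, per class."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(y_init, y_local)]
```

In a full pipeline, each crop from `sample_crops` would be passed through CLIP to obtain one probability vector per view; those vectors are what `aggregate_local_views` pools before `fuse_labels` blends them with the initial labels.
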
Phase 3: Debiasing CLIP Predictions:
- Function: Eliminates category frequency bias in CLIP predictions.
- Mechanism: Count the frequency of each category being predicted as positive by CLIP on the training set (occurrence count \(n_c\)), and normalize the raw CLIP probability using this frequency: \(p_{debiased}(c) = p_{raw}(c) / n_c^{\gamma}\), where \(\gamma\) controls the debiasing strength. Intuitively, if CLIP consistently scores "person" highly, \(n_{person}\) will be large, and its normalized probability will be appropriately scaled down.
- Design Motivation: Since CLIP is pre-trained on web-scale data, text-image pairs for high-frequency categories like "person" are far more abundant than low-frequency ones like "sofa", leading to systematic bias. This debiasing operation is analogous to IDF (Inverse Document Frequency) in information retrieval, where the weights of high-frequency categories are reduced.
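
The debiasing rule \(p_{debiased}(c) = p_{raw}(c) / n_c^{\gamma}\) can be illustrated with a short sketch. The value of \(\gamma\), the threshold used for counting positives, and the guard against a zero count are assumptions made for this example; the summary does not specify them, nor whether the scores are rescaled afterwards.

```python
def positive_counts(prob_matrix, threshold=0.5):
    """n_c: number of training images on which CLIP predicts class c as positive.
    prob_matrix holds one row of class probabilities per image."""
    return [sum(1 for p in column if p >= threshold) for column in zip(*prob_matrix)]

def debias(probs, counts, gamma=0.5):
    """Divide each raw probability by n_c ** gamma.
    max(n, 1) avoids division by zero for never-predicted classes (assumed guard)."""
    return [p / (max(n, 1) ** gamma) for p, n in zip(probs, counts)]

# Toy training set with two classes, e.g. ("person", "sofa").
train_probs = [
    [0.9, 0.6],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.9, 0.3],
]
counts = positive_counts(train_probs)      # "person" is positive on every image
scores = debias([0.8, 0.6], counts)        # frequent class is scaled down more
```

In this toy case the frequently predicted class is divided by a larger factor than the rare one, so the rare class ends up with the higher debiased score even though its raw probability was lower. Note that the debiased values are no longer probabilities in [0, 1], so any threshold applied afterwards would have to be chosen on the new scale.
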
Phase 4: Training with Consistency Loss:
- Function: Train the final classifier using the debiased pseudo-labels.
- Mechanism: The total loss is formulated as \(L_{total} = L_{ce} + \beta \cdot L_{consist}\). Here, \(L_{ce}\) is the standard cross-entropy loss on pseudo-labels. \(L_{consist}\) represents consistency regularization—enforcing identical probability distributions (by minimizing KL divergence) for two different data augmentations of the same image (e.g., random cropping + color jitter). Training pipeline: Warm-up for the first 2 epochs (using only \(L_{ce}\)), followed by joint training with the consistency loss until the 10th epoch.
- Design Motivation: Consistency loss is a classic semi-supervised learning technique (e.g., FixMatch) that provides additional learning signals when pseudo-labels are noisy. The warm-up phase allows the model to establish baseline classification capabilities first, preventing the consistency target from conflicting with the main loss during early training.
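
A minimal sketch of the objective \(L_{total} = L_{ce} + \beta \cdot L_{consist}\) with the two-epoch warm-up is given below. It assumes per-class sigmoid outputs, so \(L_{ce}\) is written as binary cross-entropy and \(L_{consist}\) as a per-class Bernoulli KL divergence. The KL direction, the value of \(\beta\), and the 0-indexed epoch convention are assumptions for illustration.

```python
import math

EPS = 1e-7  # clamp probabilities away from 0 and 1 before taking logs

def _clamp(p):
    return min(max(p, EPS), 1.0 - EPS)

def bce(probs, targets):
    """Mean binary cross-entropy between predictions and (soft) pseudo-labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = _clamp(p)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)

def bernoulli_kl(p_probs, q_probs):
    """Mean per-class KL(p || q) between two views' Bernoulli predictions."""
    total = 0.0
    for p, q in zip(p_probs, q_probs):
        p, q = _clamp(p), _clamp(q)
        total += p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total / len(p_probs)

def total_loss(probs_a, probs_b, pseudo_labels, epoch, beta=1.0, warmup_epochs=2):
    """L_ce only during warm-up (epochs 0 and 1), then L_ce + beta * L_consist."""
    loss = bce(probs_a, pseudo_labels)
    if epoch >= warmup_epochs:
        loss += beta * bernoulli_kl(probs_a, probs_b)
    return loss
```

Here `probs_a` and `probs_b` stand for the classifier's outputs on two augmentations of the same image. When the two views agree exactly, the consistency term vanishes and the loss reduces to the cross-entropy term.
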

Model Architecture¶

Classifier backbone: ImageNet pre-trained ResNet-101
CLIP model: ResNet50×64 variant (frozen, used for inference only)
CAM method: GradCAM (applied to the last convolutional layer)
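
For reference, the standard Grad-CAM formulation is shown below. The symbols are notation chosen for this note rather than taken from the paper: \(A^{k}\) is the \(k\)-th feature map of the last convolutional layer, \(s^{c}\) is the classifier score for class \(c\), \(Z\) is the number of spatial locations, and \(M^{c}\) is the resulting heatmap.

```latex
w_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial s^{c}}{\partial A^{k}_{ij}},
\qquad
M^{c} = \mathrm{ReLU}\!\left( \sum_{k} w_k^{c} A^{k} \right)
```

The channel weights \(w_k^{c}\) are spatially averaged gradients, and the ReLU keeps only regions that contribute positively to class \(c\).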

Key Experimental Results¶

Main Results: Unsupervised vs Fully Supervised Multi-Label Classification (mAP %)¶

Method	Supervision Type	VOC12	VOC07	COCO	NUS-WIDE
BCE (ResNet-101)	Fully Supervised	90.1	91.3	78.5	50.7
ASL	Fully Supervised	91.2	91.8	80.2	—
ADDS	Semi-Supervised (10%)	87.0	88.5	72.1	—
DualCoOp	Zero-shot (CLIP)	85.0	83.3	64.2	39.8
CDUL	Unsupervised	88.6	89.0	69.2	44.0
CCD (Ours)	Unsupervised	90.1	91.0	70.3	44.5

Ablation Study (VOC12 mAP %)¶

Configuration	mAP	Change relative to baseline	Description
Baseline (Global CLIP only)	86.4	—	No local views, no debiasing
+ Label Update (Phase 2)	88.7	+2.3	CAM-guided local view aggregation
+ Debiasing (Phase 3)	89.4	+3.0	CLIP prediction debiasing
+ Consistency (Phase 4)	90.1	+3.7	Complete CCD framework
Random cropping instead of CAM-guided	87.9	+1.5	CAM-guided outperforms random cropping
Mean pooling instead of Max pooling	88.1	+1.7	Max pooling is more suitable for multi-label

Improvements on Low-Frequency Categories from Debiasing (VOC12 per-class AP Change)¶

Category	AP Before Debiasing	AP After Debiasing	Gain
Plant	83.2	86.5	+3.3
Sofa	72.4	76.0	+3.6
TV Monitor	80.1	82.6	+2.5
Person	96.8	96.2	-0.6
Car	93.5	93.1	-0.4

Key Findings¶

Unsupervised competitive with fully supervised: CCD achieves 90.1% mAP on VOC12, matching the ResNet-101 fully supervised baseline (90.1%), and trails by only 0.3 percentage points on VOC07 (91.0 vs 91.3). On these two benchmarks, this suggests that CLIP's visual-semantic knowledge, when properly guided, can stand in for manual annotations.
CAM guidance outperforms random cropping: CAM-guided local view selection yields a 0.8 percentage point mAP improvement over random cropping (88.7 vs 87.9), suggesting that semantically guided crops cover target object regions more accurately.
Debiasing improves long-tail categories: Low-frequency categories (Plant +3.3, Sofa +3.6) improve noticeably, while high-frequency categories (Person, Car) drop only slightly, by 0.4-0.6, yielding a positive overall gain. This is consistent with CLIP's bias primarily hurting low-frequency classes.
Approximately 10 local inferences is optimal: The mAP saturates after reaching 10 local inferences; further increases introduce noise. This provides a clear balance between efficiency and performance for practical deployments.
Gaps persist on complex datasets: On COCO and NUS-WIDE, there remains a notable gap compared to fully supervised methods (70.3 vs 78.5, 44.5 vs 50.7). This is because these datasets feature more classes and complex object co-occurrence, where CAM's co-occurrence issue leads to multiple objects mixed within local crops.
Max pooling vs Mean pooling: Max pooling outperforms mean pooling by 0.6 percentage points in this multi-label setting (88.7 vs 88.1). The explanation offered is that multi-label classification favors high recall, so a category is retained if any local view detects it.

Highlights & Insights¶

Innovative "classifier backward-guiding CLIP" concept: While most CLIP distillation methods are unidirectional (CLIP -> Classifier), this work proposes bidirectional feedback where CAMs generated by the classifier guide CLIP on where to look. This mutually reinforcing framework is simple yet highly effective.
Elegance of the debiasing method: Systematic bias is reduced with a simple normalization based on how often CLIP predicts each class as positive on the training set, avoiding the need for complex calibration techniques. The same idea should transfer to other settings that generate pseudo-labels with CLIP.
Milestone achievement of unsupervised matching fully supervised: For the first time, unsupervised performance matches the fully supervised counterpart on VOC12. This holds significant practical value for annotation-sensitive applications like medical and remote sensing imagery.
Modular design: The four phases can be independently replaced and upgraded. For instance, GradCAM can be swapped with superior CAM methods (like ScoreCAM), and CLIP can be replaced by other VLMs.

Limitations & Future Work¶

CAM co-occurrence issue: When multiple objects spatially overlap in an image, CAM cannot cleanly separate individual object regions, mixing other categories into local crops and posing a performance bottleneck on complex datasets like COCO/NUS-WIDE.
Reliance on CLIP's category priors: The method assumes CLIP possesses basic recognition capability for all target categories, which may fail on fine-grained classes rarely seen in CLIP's pretraining data (e.g., specific bird subspecies).
Limitations of fixed thresholds: The pseudo-label binarization uses a fixed threshold of 0.5, overlooking that different categories might require different optimal thresholds. Adaptive thresholding strategies could further improve performance.
ResNet-101 backbone: Experiments only employed ResNet-101 as the downstream classifier backbone, leaving its efficacy on more modern architectures like Vision Transformers unvalidated.
Training efficiency: Each epoch of label updates requires multiple CLIP inferences on all training images (approx. 10 local inferences per image), incurring substantial computational costs.

vs CDUL: CDUL also utilizes CLIP for unsupervised multi-label classification but fails to handle view-dependency and prediction bias. CCD systematically addresses these issues via CAM guidance and debiasing.
vs DualCoOp: DualCoOp adapts CLIP by learning positive and negative prompts, but it operates in a zero-shot setting (requiring class definitions) and scores significantly lower (85.0% on VOC12) than CCD.
vs FixMatch/MixMatch: Semi-supervised learning methods require a small amount of annotated data, whereas CCD requires none, though it borrows the concept of consistency regularization.
vs CAM-based WSS: Weakly supervised semantic segmentation extensively uses CAM to generate pixel-level pseudo-labels from image-level labels. CCD conversely uses CAM to refine the quality of image-level labels, representing a novel application of CAM.

Rating¶

Novelty: ⭐⭐⭐⭐ Classifier backward-guiding CLIP and CLIP debiasing are both novel and effective designs.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 4 datasets with comprehensive ablations and per-class analyses, though lacking ViT backbone experiments.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition and easy-to-follow methodology.
Value: ⭐⭐⭐⭐ First unsupervised model to match fully supervised performance, showing great potential for practical applications.