Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation
Conference: CVPR 2025
arXiv: 2503.20826
Code: https://github.com/zwyang6/ExCEL
Area: Semantic Segmentation
Keywords: Weakly Supervised Semantic Segmentation, CLIP, patch-text alignment, Class Activation Map, vision-language pre-training
TL;DR
ExCEL uses the patch-text alignment paradigm (instead of traditional image-text alignment) to mine the dense knowledge of CLIP for weakly supervised semantic segmentation. By strengthening dense alignment through Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules, it surpasses the previous SOTA (WeCLIP) by 2.0% mIoU on PASCAL VOC val and 3.2% mIoU on MS COCO val, while training with only 3.2GB of VRAM and 6% of the training time of prior methods.
Background & Motivation
Background: Weakly supervised semantic segmentation (WSSS) aims to achieve pixel-level predictions using only image-level labels, typically relying on Class Activation Maps (CAMs) to provide localization cues. Recently, CLIP has been introduced to WSSS. For instance, CLIP-ES utilizes image-text alignment to generate GradCAM, and WeCLIP directly utilizes CLIP's vision encoder for segmentation.
Limitations of Prior Work: Existing methods primarily leverage CLIP's global image-text alignment capability while overlooking the dense knowledge available through patch-text alignment. Global alignment can indicate which objects are present in the image, but it cannot localize them at the pixel level.
Key Challenge: Patch-text alignment faces two obstacles: (1) Sparse text semantics: templates like "a photo of [CLASS]" only express that an object exists and lack the rich semantics required for localization; (2) Insufficient fine-grained visual features: because of its image-text contrastive pre-training, CLIP tends to extract global representations, resulting in overly uniform q-k attention maps that lose fine-grained spatial details.
Goal: (1) How to enrich text representations to support accurate patch-level matching; (2) How to mine fine-grained spatial information from CLIP's visual features.
Key Insight: The authors observe that the intra-correlation of q/k/v in intermediate layers of CLIP preserves more fine-grained information than the cross-spatial q-k attention. Meanwhile, category descriptions generated by LLMs can be clustered into an implicit attribute space to enhance text representations.
Core Idea: Replace traditional image-text alignment with patch-text cosine similarity to generate CAMs, and address the two major bottlenecks in dense alignment via LLM-based text semantic enrichment and intermediate-layer intra-correlation visual feature calibration.
Method
Overall Architecture
The input to ExCEL is an image and its category labels, and the output is pixel-level segmentation pseudo-labels. The overall pipeline consists of four steps: (1) The Text Semantic Enrichment (TSE) module enriches text semantics, generating rich category text representations \(T_c\); (2) The Static Visual Calibration (SVC) module replaces q-k attention with intra-correlation to extract fine-grained visual features \(P_s\) from frozen CLIP features, and the cosine similarity between \(P_s\) and \(T_c\) yields static CAMs; (3) The Learnable Visual Calibration (LVC) module learns dynamic distribution shifts via a lightweight adapter, further optimizing the visual features to generate dynamic CAMs; (4) The dynamic CAMs are refined into pseudo-labels that supervise the training of the segmentation network.
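The patch-text step of this pipeline can be illustrated with a small NumPy sketch. It is not the authors' code: random arrays stand in for CLIP patch features and text embeddings, and the function name `patch_text_cam`, the 14x14 grid, the 512-dimensional features, and the per-class min-max normalization are all assumptions made for illustration.

```python
import numpy as np

def patch_text_cam(patch_feats, text_embs, grid_hw):
    """Cosine similarity between every patch and every class text embedding.

    patch_feats: (N, D) patch features, with N = H * W
    text_embs:   (C, D) one text embedding per class present in the image
    returns:     (C, H, W) maps, min-max normalised per class (an assumption)
    """
    h, w = grid_hw
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = t @ p.T                                  # (C, N) cosine similarities
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    cam = (sim - lo) / (hi - lo + 1e-8)            # rescale each class map
    return cam.reshape(-1, h, w)

# Placeholder inputs standing in for real CLIP outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(14 * 14, 512))
texts = rng.normal(size=(2, 512))
cams = patch_text_cam(patches, texts, (14, 14))
```

Each row of `sim` is already a dense map over patches, which is why no gradient-based step such as GradCAM is needed in this formulation.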
Key Designs
- Text Semantic Enrichment (TSE):
  - Function: Expands sparse category text templates into semantically rich text representations.
  - Mechanism: First, GPT-4 generates \(n=20\) detailed descriptions (covering attributes such as appearance, color, and shape) for each category, and CLIP's text encoder encodes them into a knowledge base \(\mathcal{T}\). Then, as the key step, instead of fusing these descriptions directly, K-means clusters all descriptions into \(B\) implicit attributes (e.g., \(B=112\) for VOC). Finally, the global text embedding \(t_c\) retrieves the top-\(K\) most relevant attributes \(A_c = [a_1, \dots, a_K]\) from the attribute space, which are softmax-weighted and aggregated into the final representation \(T_c = t_c + \lambda \sum_{j=1}^{K} \mathrm{softmax}(t_c^\top A_c)_j \, a_j\).
  - Design Motivation: Explicit descriptions may have incomplete coverage and contain noise. Clustered implicit attributes are more compact and also capture knowledge shared across categories (e.g., "having wings" relates to both birds and airplanes), supplementing information missing from single-category descriptions.
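The TSE retrieval and aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: random vectors replace the encoded GPT-4 descriptions, a plain Lloyd's k-means stands in for whatever clustering routine the authors use, and the helper names (`l2norm`, `kmeans`, `enrich_text`) as well as the values of `top_k`, `lam`, and the toy sizes are hypothetical.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def kmeans(x, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (n_clusters, D) centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for k in range(n_clusters):
            members = x[assign == k]
            if len(members) > 0:        # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

def enrich_text(t_c, attributes, top_k=5, lam=0.5):
    """T_c = t_c + lam * sum_j softmax(t_c^T A_c)_j * a_j over the top-k attributes."""
    scores = attributes @ t_c                   # (B,) similarity to each attribute
    idx = np.argsort(scores)[-top_k:]           # indices of the top-k attributes
    a_c, s = attributes[idx], scores[idx]
    w = np.exp(s - s.max())
    w /= w.sum()                                # softmax over the retrieved scores
    return t_c + lam * (w @ a_c)

# Toy stand-ins: 3 classes x 20 descriptions, 32-d embeddings, B = 8 attributes.
rng = np.random.default_rng(0)
descriptions = l2norm(rng.normal(size=(60, 32)))
attributes = l2norm(kmeans(descriptions, n_clusters=8))
t_c = l2norm(rng.normal(size=32))
T_c = enrich_text(t_c, attributes)
```

Because the attribute bank is built from the descriptions of all classes together, an attribute retrieved for one class may originate from another class's descriptions, which is the cross-category sharing the design motivation refers to.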
- Static Visual Calibration (SVC):
  - Function: Extracts fine-grained visual features from CLIP's intermediate layers in a parameter-free manner.
  - Mechanism: The original q-k attention of CLIP produces overly uniform attention maps, which homogenizes different tokens. SVC replaces q-k attention with intra-correlation: instead of calculating \(q^\top k\), it calculates \(q^\top q\), \(k^\top k\), and \(v^\top v\) (i.e., the self-correlation within each space) and accumulates them over the last \(N=5\) intermediate layers. This amounts to comparing each patch with the other patches in the same space, thereby preserving spatial structural information.
  - Design Motivation: The q-k attention is trained for global image-text alignment and naturally tends to homogenize tokens to capture broad semantics. Intra-correlation bypasses this homogenization effect and directly exposes spatial relationships between patches. Combined with TSE, it yields CAMs comparable to trained methods (74.6% mIoU) without any training.
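A rough sketch of the intra-correlation idea behind SVC follows. The source only states that q-q, k-k, and v-v correlations are computed and accumulated over the last five layers; the softmax per correlation map, the dot-product scaling, the row renormalization, and the application of the accumulated map to last-layer value features are assumptions, as are the function names and toy shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intra_correlation(q, k, v):
    """Sum of the q-q, k-k and v-v self-correlation maps of one layer.

    q, k, v: (N, d) token matrices; returns an (N, N) map.
    """
    scale = q.shape[-1] ** -0.5
    return sum(softmax(x @ x.T * scale) for x in (q, k, v))

def static_calibration(layers, v_last):
    """Accumulate intra-correlation over several layers and re-aggregate tokens.

    layers: list of (q, k, v) tuples from the last few transformer blocks
    v_last: (N, d) value features of the final block
    """
    attn = sum(intra_correlation(q, k, v) for q, k, v in layers)
    attn = attn / attn.sum(axis=-1, keepdims=True)   # renormalise rows to sum to 1
    return attn @ v_last, attn

# Toy stand-ins for the q/k/v of the last N = 5 layers (16 tokens, 8 dims).
rng = np.random.default_rng(0)
layers = [tuple(rng.normal(size=(16, 8)) for _ in range(3)) for _ in range(5)]
v_last = layers[-1][2]
p_s, attn = static_calibration(layers, v_last)
```

No parameters are introduced anywhere in this computation, which matches the training-free nature of SVC.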
- Learnable Visual Calibration (LVC):
  - Function: Dynamically calibrates frozen visual features using a lightweight adapter.
  - Mechanism: The frozen features from layers 1-12 of CLIP are passed through independent MLPs and concatenated, followed by a convolutional layer that produces dynamic features \(F_d\). The self-similarity of \(F_d\) is calculated, mean-subtracted, and scaled to obtain a dynamic relationship matrix \(R\), in which negative values are set to \(-\infty\) to remove irrelevant relationships. Finally, \(\mathrm{softmax}(R)\) is added to the static attention map of SVC as a distribution shift.
  - Design Motivation: SVC features are frozen and cannot adapt to specific images. LVC introduces only a distribution shift without altering the pre-trained CLIP weights, preserving transferability while improving dense segmentation performance.
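The construction of the dynamic relationship matrix \(R\) in LVC can be sketched as below. The adapter itself (MLPs plus convolution) is omitted and replaced by random features; using cosine similarity for the self-similarity, subtracting the per-row mean, and the `scale` value are assumptions, since the source does not specify them. The name `dynamic_shift` and the uniform `static_attn` placeholder are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_shift(f_d, scale=10.0):
    """Build softmax(R) from adapter features f_d of shape (N, D)."""
    f = f_d / np.linalg.norm(f_d, axis=1, keepdims=True)
    sim = f @ f.T                                   # cosine self-similarity
    r = (sim - sim.mean(axis=1, keepdims=True)) * scale
    r = np.where(r < 0, -np.inf, r)                 # drop below-average relations
    # Each row keeps its diagonal (self-similarity 1 >= row mean), so no row
    # is entirely -inf and the softmax stays well defined.
    return softmax(r)

rng = np.random.default_rng(0)
f_d = rng.normal(size=(16, 8))                      # stand-in for adapter output
shift = dynamic_shift(f_d)

static_attn = np.full((16, 16), 1.0 / 16)           # placeholder for the SVC map
calibrated = static_attn + shift                    # shift added to static attention
```

Entries masked to \(-\infty\) receive exactly zero weight after the softmax, so the shift only redistributes attention among tokens that the adapter considers related.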
Loss & Training
The training objective is \(\mathcal{L}_{ExCEL} = \mathcal{L}_{seg} + \gamma \mathcal{L}_{div}\), with \(\gamma = 0.1\). \(\mathcal{L}_{seg}\) is the cross-entropy loss supervised by dynamic pseudo-labels. \(\mathcal{L}_{div}\) is a diversity loss that uses the pixel affinity of the static pseudo-labels generated by SVC to supervise the token relationships of the adapter features \(F_d\): the correlation of token pairs from the same class should be maximized, while that of pairs from different classes should be minimized. Training uses the AdamW optimizer with a learning rate of 1e-4, for 30K iterations on VOC and 100K iterations on COCO.
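The exact form of \(\mathcal{L}_{div}\) is not given here, so the following is only one plausible instantiation of "pull same-class token pairs together, push different-class pairs apart" using cosine similarity. The function names, the `ignore_index` convention, and the specific penalty terms are assumptions; only the combination \(\mathcal{L}_{seg} + \gamma \mathcal{L}_{div}\) with \(\gamma = 0.1\) comes from the text.

```python
import numpy as np

def diversity_loss(f_d, labels, ignore_index=255):
    """Affinity-style loss on adapter features (an assumed form).

    f_d:    (N, D) token features
    labels: (N,) static pseudo-label per token; ignore_index marks unreliable ones
    """
    f = f_d / np.linalg.norm(f_d, axis=1, keepdims=True)
    sim = f @ f.T
    valid = labels != ignore_index
    pair_valid = valid[:, None] & valid[None, :]
    same = (labels[:, None] == labels[None, :]) & pair_valid
    diff = (labels[:, None] != labels[None, :]) & pair_valid
    # Same-class pairs should have similarity near 1.
    pos = (1.0 - sim[same]).mean() if same.any() else 0.0
    # Different-class pairs are penalised only for positive similarity.
    neg = np.clip(sim[diff], 0.0, None).mean() if diff.any() else 0.0
    return float(pos + neg)

def excel_loss(l_seg, l_div, gamma=0.1):
    return l_seg + gamma * l_div

labels = np.array([0, 0, 1, 1])
f_good = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
f_bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
loss_good = diversity_loss(f_good, labels)   # features agree with the labels
loss_bad = diversity_loss(f_bad, labels)     # features contradict the labels
total = excel_loss(2.0, loss_bad)
```

In this toy case the features that match the label grouping incur zero loss, while the mismatched features are penalised on both the same-class and different-class terms.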
Key Experimental Results
Main Results
| Dataset | Metric | ExCEL | WeCLIP (Prev. SOTA) | Gain |
|---|---|---|---|---|
| VOC val | mIoU | 78.4% | 76.4% | +2.0% |
| VOC test | mIoU | 78.5% | 77.2% | +1.3% |
| COCO val | mIoU | 50.3% | 47.1% | +3.2% |
The training of ExCEL requires only 3.2GB of VRAM and 6% of the training time of prior methods. The training-free mode (only SVC+TSE without training) achieves 74.6% mIoU on CAM seeds, already surpassing most methods that require training.
Ablation Study
| Configuration | mIoU | Description |
|---|---|---|
| Baseline (CLIP) | 12.1% | Original CLIP used directly for segmentation |
| + SVC | 72.5% | Intra-correlation replaces q-k attention |
| + SVC + TSE | 74.6% | Added text semantic enrichment, recall improved by 3.6% |
| + SVC + LVC | 75.1% | Added learnable visual calibration |
| ExCEL (All) | 77.2% | Synergy of three modules |
Key Findings
- SVC contributes the most (+60.4% mIoU over the CLIP baseline), indicating that intra-correlation is far better suited to dense localization than the original q-k attention.
- Implicit attribute clustering (B=112) performs 2.1% better than directly fusing 20 explicit descriptions, validating the value of cross-category knowledge sharing.
- Intra-correlation performs best in the last 5 layers (rather than just the last layer): single layer 69.7% → multi-layers 74.6%, indicating that fine-grained information in intermediate layers needs to be accumulated layer by layer.
Highlights & Insights
- Replacing q-k attention with intra-correlation is an elegant design: Without any trainable parameters, the training-free pipeline raises CLIP's CAM quality from the 12.1% baseline to 74.6%. The core insight is that the homogenization of q-k attention is a byproduct of CLIP's global alignment training rather than a flaw in the patch-level features themselves.
- The design concept of implicit attribute space is transferable: The idea of clustering category descriptions into shared cross-category attributes can be applied to any visual task requiring text-enhanced guidance (such as open-vocabulary detection).
- Training cost is remarkably low: Surpassing the previous SOTA with only 3.2GB of VRAM and 6% of the training time suggests that fully exploiting the dense knowledge of pre-trained models is more effective than brute-force training.
Limitations & Future Work
- Reliance on GPT-4 to generate category descriptions introduces dependencies on external large models; exploring open-source LLMs as alternatives is a future direction.
- The number of clustered attributes B needs manual tuning for different datasets (VOC 112, COCO 224); adaptively determining the value of B is a potential improvement.
- Currently only validated on ViT-B; whether larger CLIP models (such as ViT-L/14) can yield further improvements remains to be explored.
Related Work & Insights
- vs CLIP-ES: CLIP-ES generates GradCAM using gradients of image-text alignment, which is essentially still a global alignment approach; ExCEL directly computes similarity at the patch-text level, resulting in more accurate localization.
- vs WeCLIP: WeCLIP is also a single-stage method that directly uses CLIP for segmentation, but it neither modifies the attention mechanism nor enriches the text; ExCEL outperforms it by 2.0% mIoU on VOC val under the same single-stage setting.
- vs MaskCLIP: MaskCLIP only uses the value features of the last layer, while ExCEL's intra-correlation is accumulated across multiple layers, which is more comprehensive (65.8% vs 74.6%).
Rating
- Novelty: ⭐⭐⭐⭐ The patch-text alignment paradigm and intra-correlation are valuable new concepts, though individual modules are not entirely novel when viewed in isolation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-dataset evaluation (VOC + COCO), comprehensive ablation studies, and dual evaluation of both CAM seeds and segmentation.
- Writing Quality: ⭐⭐⭐⭐ Clear logic, intuitive illustrations, and well-articulated motivations.
- Value: ⭐⭐⭐⭐⭐ Extremely low training cost combined with SOTA performance gives it high practical value, and the approach is insightful for mining CLIP's dense knowledge.