A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties¶

Conference: ECCV 2024
arXiv: 2312.13764
Code: https://github.com/lambert-x/ProLab
Area: Semantic Segmentation / Open-Vocabulary / Semantic Space Construction
Keywords: Attribute-level Label Space, LLM Knowledge Retrieval, Descriptive Properties, Interpretable Segmentation, K-Means Clustering

TL;DR¶

ProLab prompts LLMs for common-sense descriptions of each category, then compresses those descriptions into 256 interpretable descriptive properties via sentence embeddings and K-Means clustering. The resulting attribute-level multi-hot label space supervises the segmentation model in place of traditional one-hot category labels. It consistently outperforms category-level supervision across five classic benchmarks and shows emergent out-of-domain generalization.

Background & Motivation¶

Traditional semantic segmentation models (DeepLab, UperNet, SegFormer, etc.) use a one-hot category label space. This design ignores semantic correlations between categories: the semantic distance between "cat" and "dog" is treated the same as that between "cat" and "sky". Previous solutions either manually merged labels or modeled hierarchical relationships, but these often led to performance drops and scalability issues. Another approach constructs semantic spaces with CLIP text encoders, but CLIP lacks deep common-sense knowledge and represents rare scenes poorly because of the long-tail distribution of image-text paired data. The key challenge is therefore to build a label space that both models inter-category semantic relationships and remains interpretable to humans.

Core Problem¶

How can we construct an interpretable semantic space that does not rely on vision-language pre-training, yet lets segmentation models learn semantic associations between categories? Furthermore, can such a semantic space deliver both stronger closed-set performance and open-vocabulary generalization?

Method¶

The core idea of ProLab is: instead of directly teaching the model "this is a cat," teach it "this region has claws, fur, and appears in parks"—that is, using a set of interpretable descriptive properties to replace category labels.

Overall Architecture¶

The input is an RGB image, and the output is an activation vector of 256 descriptive properties for each pixel (rather than probabilities over N categories). During inference, category predictions are restored by comparing the cosine similarity between the pixel's attribute activation and the attribute signatures (multi-hot vectors) of each category. The overall pipeline consists of three steps: (1) extracting category descriptions from LLMs \(\rightarrow\) (2) encoding them with a sentence embedding model and clustering them into an attribute set \(\rightarrow\) (3) supervising the segmentation model with attribute-level multi-hot labels.
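The inference step described above can be illustrated with a minimal sketch. It assumes a toy space of 4 properties instead of 256; the property meanings, the `signatures` dictionary, and the function names are hypothetical and are not taken from the ProLab code base.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_category(activation, signatures):
    """Return the category whose multi-hot signature is most similar to the activation."""
    return max(signatures, key=lambda name: cosine(activation, signatures[name]))

# Hypothetical toy property space:
# [covered in fur, has claws, metallic surface, found on roads]
signatures = {
    "cat": [1, 1, 0, 0],
    "car": [0, 0, 1, 1],
}

# Property activations predicted for one pixel (made-up values).
pixel_activation = [0.9, 0.7, 0.1, 0.2]
print(predict_category(pixel_activation, signatures))
```

In a real model this comparison would run for every pixel at once as a batched matrix operation; the per-pixel loop-free form here is only meant to make the decision rule explicit.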

Key Designs¶

LLM Attribute Knowledge Retrieval: Using GPT-3.5 (also supporting LLaMA2-7B), standardized descriptions are generated for each category through a meticulously designed structured prompt, covering shape, orientation, physical features (materials, textures, colors), and common associated environments. The structured prompt ensures that descriptions across different categories align in terms of granularity and information types, which is critical for the subsequent clustering quality.
Description Embedding and Clustering: A pure language sentence embedding model (BGE-Sentence or Sentence Transformers) is used to encode all descriptions into vectors. Here, CLIP is intentionally not used because sentence embedding models trained via contrastive learning can better preserve the semantic correlation between sentences. K-Means clustering is then applied to group thousands of descriptions into 256 "descriptive properties," where each property represents a group of semantically similar descriptions (e.g., "smooth and shiny surface," "covered in fur").
Attribute-level Supervision and Classification: The single category label per pixel (one-hot) is replaced with a multi-hot label: a pixel that belongs to "cat" activates all properties associated with "cat." The model outputs an embedding \(\mathbf{e}_i\) for each pixel, which is multiplied by the attribute embedding bank \(\mathbf{E} \in \mathbb{R}^{d \times k}\) (with \(k = 256\) properties) and passed through a sigmoid \(\sigma\) to obtain attribute-level activations \(\mathbf{z}_i = \sigma(\mathbf{e}_i \mathbf{E})\). During inference, categories are recovered by computing \(c_i = \arg\max_j \text{sim}(\mathbf{y}_j, \mathbf{z}_i)\), where \(\mathbf{y}_j\) is the multi-hot signature of category \(j\) and \(\text{sim}\) is cosine similarity.
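To make the clustering and multi-hot construction concrete, here is a small sketch under stated assumptions: the 2-D vectors stand in for real sentence embeddings, K-Means is a plain Lloyd's loop initialised from the first `k` points (a simplification, not the authors' setup), and all names (`kmeans`, `multi_hot_signatures`, the toy descriptions) are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def multi_hot_signatures(owners, assign, k):
    """For each category, mark every cluster that one of its descriptions falls into."""
    signatures = {}
    for owner, cluster in zip(owners, assign):
        signatures.setdefault(owner, [0] * k)[cluster] = 1
    return signatures

# Hypothetical (category, embedding) pairs; comments give the description each stands for.
descriptions = [
    ("cat", [0.0, 0.0]),   # "covered in fur"
    ("car", [10.0, 0.0]),  # "smooth and shiny surface"
    ("car", [0.0, 10.0]),  # "has wheels"
    ("dog", [0.5, 0.2]),   # "covered in fur"
    ("bus", [9.5, 0.3]),   # "smooth and shiny surface"
    ("bus", [0.2, 9.6]),   # "has wheels"
]
owners = [owner for owner, _ in descriptions]
vectors = [vec for _, vec in descriptions]

assign = kmeans(vectors, k=3)
signatures = multi_hot_signatures(owners, assign, k=3)
for name, signature in signatures.items():
    print(name, signature)
```

Categories whose descriptions land in the same clusters end up with overlapping signatures, which is how the label space encodes that "cat" is closer to "dog" than to "car". A real pipeline would cluster thousands of high-dimensional embeddings into 256 properties, typically with a library implementation of K-Means.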

Loss & Training¶

Uses Cosine Similarity Loss + sigmoid (outperforms BCE and cosine loss without sigmoid), with a sigmoid temperature of 0.04.
When training large models (ViT-L), warm-up is first performed for 40K steps using one-hot category labels, before switching to attribute-level labels.
Optimizer: AdamW, learning rate: 6e-5, total batch size: 16, Poly learning rate schedule.
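The loss can be sketched for a single pixel as follows. Two details are assumptions rather than facts from the notes: that the temperature divides the logits before the sigmoid, and that the loss takes the form 1 minus cosine similarity. The function names are hypothetical.

```python
import math

def sigmoid(x, temperature=0.04):
    """Temperature-scaled sigmoid, computed in a numerically stable way.

    Dividing the logit by the temperature is an assumption about how the
    temperature of 0.04 is applied.
    """
    t = x / temperature
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def cosine_sigmoid_loss(logits, target, temperature=0.04):
    """Assumed form: 1 - cos(sigmoid(logits / T), target) for one pixel."""
    z = [sigmoid(x, temperature) for x in logits]
    dot = sum(a * b for a, b in zip(z, target))
    norm_z = math.sqrt(sum(a * a for a in z))
    norm_y = math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / (norm_z * norm_y + 1e-8)

# Made-up attribute logits for one pixel, with 4 properties instead of 256.
logits = [0.2, 0.2, -0.2, -0.2]
loss_match = cosine_sigmoid_loss(logits, [1, 1, 0, 0])     # target agrees with the logits
loss_mismatch = cosine_sigmoid_loss(logits, [0, 0, 1, 1])  # target disagrees
print(loss_match, loss_mismatch)
```

With a small temperature, modest logit differences already push activations close to 0 or 1, so the matching target yields a loss near 0 and the mismatched one a loss near 1.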

Key Experimental Results¶

Dataset	Metric	Ours (ProLab)	Category-level Baseline	Gain
ADE20K	mIoU	49.0	48.4	+0.6
COCO-Stuff	mIoU	45.4	43.1	+2.3
Pascal Context	mIoU	58.2	53.3	+4.9
Cityscapes	mIoU	81.4	79.9	+1.5
BDD	mIoU	65.7	60.7	+5.0

SOTA Comparison (ADE20K val): ProLab + ViT-L + BEiTv2 at 896 resolution reaches 58.7 mIoU, within 0.6 mIoU of SwinV2-G (59.3) while using only about 1/5 of the parameters.

Generalization to other frameworks: DeepLabv3+ improves from 42.7 to 43.6 (+0.9) and SegFormer from 41.5 to 42.3 (+0.8), suggesting that the gain is not tied to a particular architecture.

Open-Vocabulary Segmentation: ProLab + linear probe achieves 92.5 mIoU on PAS-20 and 57.7 mIoU on PC-59, outperforming a large number of CLIP-pre-trained methods.

Ablation Study¶

Number of Clusters: 256 is the best (48.3), compared to 64 \(\rightarrow\) 47.8, 128 \(\rightarrow\) 48.0, 512 \(\rightarrow\) 47.6. The best performance is observed when the number of clusters is about 1/6 to 1/8 of the total number of descriptions.
Embedding Models: BGE-Base (768d) > BGE-Small (384d) > Sent.TR-Base > Sent.TR-Small. The BGE series performs better than Sentence Transformers.
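Text Encoder vs. Attribute Space: BGE + Attribute Space (49.0) > CLIP + Category Space (48.6) > BGE + Category Space (47.7). With the same BGE encoder, switching from the category space to the attribute space adds 1.3 mIoU, indicating that the attribute space itself, rather than the choice of text encoder, is the key factor.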
LLM Selection: GPT-3.5 > LLaMA2-7B, structured prompt > naive prompt (contributing ~0.5 and ~0.2 mIoU respectively).
Loss Function: Cosine similarity + sigmoid (47.7) > BCE (47.4) > Cosine similarity without sigmoid (47.3).
Scalability: As the number of training steps increases, ProLab's performance keeps improving, while category-level supervision saturates or overfits.

Highlights & Insights¶

Redefining Label Space: Shifting from "teaching the model to recognize categories" to "teaching the model to recognize attributes" is an elegant re-engineering of the label space that lets the model learn richer semantic correlations.
No Need for Vision-Language Pre-training: The entire semantic space is built strictly using pure language models (LLMs + sentence embeddings) without relying on image-text pre-training like CLIP, offering lower costs and richer common-sense knowledge.
Emergent Generalization Capability: Although categories like PS5, AirPods, and quolls are absent in the training set, the model successfully segments them through compositions of learned attributes ("plastic and metal," "electronic device"). This compositional generalization from attributes to unseen categories is highly inspiring.
Interpretable Segmentation: The model not only outputs "cat" but can also explain why—because attributes like "has claws" and "has fur" are activated, which is highly valuable for domains requiring interpretability (medical, autonomous driving).
Plug-and-Play: It can be seamlessly integrated into various frameworks such as DeepLabv3+, SegFormer, ViT-Adapter, etc.
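The interpretability point above can be illustrated by reading off the most strongly activated properties for a pixel. This is a hedged sketch: naming each property by a representative description, the `top_properties` helper, and the activation values are all assumptions made for illustration.

```python
def top_properties(activation, property_names, k=3):
    """Return the k (name, score) pairs with the highest activation, best first."""
    ranked = sorted(
        zip(property_names, activation),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

# Hypothetical property names; in practice each would summarise one cluster of descriptions.
property_names = [
    "has claws",
    "covered in fur",
    "smooth and shiny surface",
    "found in parks",
]
pixel_activation = [0.8, 0.95, 0.05, 0.4]  # made-up values for one pixel

explanation = top_properties(pixel_activation, property_names, k=2)
for name, score in explanation:
    print(f"{name}: {score:.2f}")
```

Pairing the predicted category with such a list gives a simple per-pixel rationale, for example "cat, because 'covered in fur' and 'has claws' are active".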

Limitations & Future Work¶

Attribute descriptions are purely generated from LLM text without visual grounding, which may lead to "correct but visually irrelevant" attributes (e.g., "usually found in parks" might not be useful for pixel-level segmentation).
K-Means clustering implicitly assumes roughly spherical (isotropic) clusters of similar size, which might not be optimal for semantic embedding spaces; hierarchical or graph-based clustering could be explored.
Inference requires an extra attribute-to-category mapping step, which increases computational overhead.
Open-vocabulary experiments still rely on ImageNet pre-training, making the comparison with methods pre-trained on massive image-text pairs (such as ODISE) somewhat unfair.
The scenario of joint training on multiple datasets was not explored, although the attribute space is naturally suited for unifying labels across datasets.
How the number of attributes (256) should scale with the number of classes in a dataset remains unvalidated for extremely large label spaces (e.g., LVIS with 1000+ classes).

vs. CLIP-based methods (LSeg, ZegFormer): ProLab's semantic space is derived from pure-language LLMs instead of image-text alignment models. It has richer common-sense knowledge and is not limited by the long-tail distribution of image-text paired data. However, the lack of visual alignment might make it slightly inferior to CLIP in fine-grained differentiation.
vs. Hierarchical Label Methods (HSSN): Hierarchical methods require manually defined semantic hierarchies and often suffer from performance degradation. In contrast, ProLab's attribute space is automatically generated from LLMs, which is more flexible and delivers superior performance.
vs. Attribute Learning (Traditional ZSL methods): Traditional attribute learning relies on manually defined attribute lists (e.g., color, texture). ProLab automatically extracts attributes from LLMs and uses clustering to de-duplicate them, possessing a much stronger capability to scale up compared to manual designs.

This concept of "replacing category labels with interpretable attributes" can be transferred to 3D segmentation (attribute-level supervision of point clouds/voxels) or video segmentation (temporally consistent attribute propagation).
The idea of attribute clustering can be used for automatic label unification in multi-dataset training—mapping categories from different datasets into the same attribute space.
The evaluation based on images generated by DALL-E 3 is highly interesting, indicating that attribute-level understanding is more robust in cross-domain scenarios than category-level recognition.

Rating¶

Novelty: ⭐⭐⭐⭐ Replacing category space with an attribute-level label space is a solid insight, though attribute learning itself has a long history.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five datasets + three frameworks + detailed ablations + open-vocabulary + generated image evaluation.
Writing Quality: ⭐⭐⭐⭐ Clear logic, rich figures and tables, and the attribute lists and category comparison analysis in the appendix add to the credibility.
Value: ⭐⭐⭐⭐ Provides a new perspective on semantic space construction; the emergent generalization ability is highly inspiring, and the plug-and-play nature increases practical utility.