Auto-Vocabulary Semantic Segmentation
Conference: ICCV 2025 · arXiv: 2312.04539 · Code: Coming soon · Area: Segmentation / Open-Vocabulary / VLM · Keywords: auto-vocabulary segmentation, open-vocabulary, BLIP, zero-shot, LLM evaluator
TL;DR
This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a new task in which the AutoSeg framework autonomously discovers target categories from images and performs segmentation without any human-specified vocabulary. AutoSeg achieves 87.1 mIoU on PASCAL VOC, far surpassing the only comparable method, ZeroSeg (20.1), and even outperforming several open-vocabulary methods that require explicit category specification.
Background & Motivation
Background: Semantic segmentation has evolved from closed-set prediction over fixed categories to open-vocabulary segmentation (OVS), where VLMs such as CLIP enable handling of arbitrarily specified text categories. Nevertheless, OVS still keeps a "human in the loop": either the user specifies the vocabulary at inference time, or the method implicitly fixes its output categories by training on manually annotated datasets.
Limitations of Prior Work: (a) The categories to segment must be specified manually for each image, limiting scalability. (b) This is impractical in embodied settings: a kitchen robot that must distinguish among various tools and ingredients cannot be given category names before every grasp. (c) Text-query-based methods (e.g., LISA) need one inference pass per category, so multi-category scenes become extremely inefficient.
Key Challenge: VLMs (CLIP, BLIP) are trained on image-level features and lack the capacity for fine-grained local region reasoning, making precise segmentation boundaries difficult to obtain directly. Moreover, having a model automatically determine "what should be segmented" is a strictly harder problem than "segmenting a specified category."
Goal: (a) Eliminate human intervention from the segmentation pipeline by enabling the model to autonomously identify all relevant categories in an image. (b) Evaluate the correspondence between automatically generated categories and annotated categories—synonymy and hierarchical relationships make discrete category comparison non-trivial.
Key Insight: BLIP patch embeddings inherently encode local semantic information, which can be clustered and decoded into textual descriptions—effectively automating local image captioning—after which the generated category names drive an OVS segmenter.
Core Idea: Cluster BLIP features and decode them to generate an image-specific vocabulary, then feed this vocabulary as a self-guided input to an OVS model to achieve semantic segmentation without human intervention.
Method
Overall Architecture
The AutoSeg pipeline proceeds as follows: input image → BLIP encoder extracts patch embeddings → multi-resolution, multi-\(k\) clustering → CRF + majority filtering for denoising → BLIP decoder decodes each cluster into text → nouns are extracted to construct a vocabulary → vocabulary is fed into X-Decoder (OVS model) → semantic segmentation masks are produced. During evaluation, LAVE (LLM-based Auto-Vocabulary Evaluator) maps automatically generated categories to ground-truth categories for mIoU computation.
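The clustering and alignment stages can be made concrete with a short sketch: several \(k\)-means runs over the patch embeddings are aligned to a reference run via IoU-based Hungarian matching, yielding a per-patch cluster distribution (as elaborated under Key Designs below). This is an illustrative reconstruction on random stand-in embeddings, using one resolution for brevity; it is not the authors' unreleased code.

```python
# Toy reconstruction of BBoost's multi-k clustering + cross-clustering
# consistency; embeddings are random stand-ins, not real BLIP features.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
patches = rng.normal(size=(24 * 24, 64))  # N=576 patch embeddings, d=64

# One k-means run per k in {2..8}; the paper runs these at two
# resolutions (384 and 512), giving 14 cluster sets in total.
runs = [KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(patches)
        for k in range(2, 9)]

# The run with the most clusters serves as the reference set S.
ref = max(runs, key=lambda r: r.max() + 1)
n_ref = ref.max() + 1

votes = np.zeros((patches.shape[0], n_ref))
for labels in runs:
    k = labels.max() + 1
    # IoU between every (run cluster, reference cluster) pair.
    iou = np.zeros((k, n_ref))
    for a in range(k):
        for b in range(n_ref):
            inter = np.sum((labels == a) & (ref == b))
            union = np.sum((labels == a) | (ref == b))
            iou[a, b] = inter / union if union else 0.0
    # Hungarian matching aligns this run's clusters to the reference.
    row, col = linear_sum_assignment(-iou)  # maximize total IoU
    mapping = dict(zip(row, col))
    for p, a in enumerate(labels):
        votes[p, mapping[a]] += 1

# Per-patch cluster distribution P(n|p); clusters that attract few votes
# fade out, which adaptively selects the effective cluster count.
p_n_given_p = votes / votes.sum(axis=1, keepdims=True)
print(p_n_given_p.shape)  # (576, 8)
```

In AutoSeg this distribution is further refined with CRF denoising and majority filtering before each cluster's embeddings are decoded into text.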
Key Designs
- BBoost (Bootstrapped BLIP-based Vocabulary Generation):
- Function: Automatically discovers all relevant object category names from the input image.
- Mechanism: (a) The BLIP ViT encoder partitions the image into patch embeddings \(\hat{\mathbf{B}} \in \mathbb{R}^{N \times d}\). (b) \(k\)-means clustering with \(k \in \{2, \ldots, 8\}\) is applied at two resolutions (384×384 and 512×512), yielding 14 sets of clusters. (c) Hungarian matching aligns cluster indices across sets, and each patch obtains a probability distribution over clusters. (d) CRF denoising and majority filtering further refine cluster boundaries. (e) The set of patch embeddings belonging to each cluster is fed into the BLIP text decoder to generate descriptive text.
- Design Motivation: Although BLIP is trained at the image level, its patch embeddings naturally preserve local semantic information—clustering them enables "local captioning," an emergent capability discovered in this work.
- Distinction from ZeroSeg: ZeroSeg stitches together several components (DINO clustering, CLIP embeddings, a custom attention mechanism, and GPT-2), whereas AutoSeg uses BLIP alone for both clustering and captioning.
- Cross-Clustering Consistency:
- Function: Unifies clustering results across multiple \(k\)-means runs.
- Mechanism: The run with the largest number of clusters is designated as reference set \(S\); remaining runs are aligned to \(S\) via IoU-based Hungarian matching, and each patch obtains a cluster probability distribution \(P(n|p)\).
- Design Motivation: A single \(k\)-means run is noisy; ensembling across multiple runs reduces variance while allowing weak clusters to vanish naturally, thus providing adaptive cluster count selection.
- LAVE (LLM-based Auto-Vocabulary Evaluator):
- Function: Maps automatically generated category names to the fixed categories of annotated datasets for mIoU computation.
- Mechanism: Llama-2-7B receives all auto-generated category names together with the dataset category names and performs semantic mapping based on synonymy and hypernym/hyponym relationships.
- Design Motivation: Conventional cosine-similarity matching (e.g., Sentence-BERT) produces errors such as mapping "taxi" to "road" rather than "car"; LLMs handle complex semantic relationships more robustly.
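A short sketch of how LAVE's mapping query might be phrased; the exact prompt wording and output format are assumptions for illustration, since the paper's verbatim prompt is not reproduced here.

```python
# Hypothetical LAVE-style prompt builder; the wording is an assumption,
# not the paper's actual prompt.
def build_lave_prompt(generated, dataset_classes):
    return (
        "Map each generated category to the single closest dataset class, "
        "using synonym and hypernym/hyponym relations; answer 'none' if "
        "nothing fits.\n"
        f"Dataset classes: {', '.join(dataset_classes)}\n"
        + "\n".join(f"- {g} ->" for g in generated)
    )

prompt = build_lave_prompt(
    generated=["taxi", "dachshund", "coke"],
    dataset_classes=["car", "dog", "bottle", "road"],
)
print(prompt)
# The completed prompt is answered by Llama-2-7B, e.g. "taxi -> car"
# rather than the "taxi -> road" error made by cosine-similarity matching.
```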
Loss & Training
AutoSeg requires no training or fine-tuning whatsoever—both BLIP and X-Decoder use pretrained weights. Hyperparameters are tuned on VOC via Bayesian optimization: CRF smoothness weight 6, smoothness \(\theta = 0.8\), positional encoding dimension 256, nucleus sampling top-\(P = 1\), repetition penalty 100.
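For reference, the reported generation settings map directly onto the Hugging Face BLIP interface. The snippet below is plain full-image captioning, shown only to illustrate where top-\(P\) and the repetition penalty plug in; AutoSeg itself decodes from cluster-restricted patch embeddings, which requires access to BLIP internals rather than this high-level API.

```python
# Illustrative BLIP captioning call with the paper's decoding settings;
# AutoSeg decodes per-cluster embeddings instead of whole images.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

out = model.generate(
    **inputs,
    do_sample=True,
    top_p=1.0,                 # nucleus sampling top-P = 1
    repetition_penalty=100.0,  # repetition penalty 100, as reported
    max_new_tokens=20,
)
print(processor.decode(out[0], skip_special_tokens=True))
```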
Key Experimental Results
Main Results
Comparison under the Auto-Vocabulary setting, where every method must generate its own vocabulary (VOC = PASCAL VOC, PC = PASCAL Context, ADE = ADE20K, CS = Cityscapes):
| Method | VOC mIoU | PC mIoU | ADE mIoU | CS mIoU |
|---|---|---|---|---|
| AutoSeg (Ours) | 87.1 | 11.7 | 6.0 | 30.0 |
| ZeroSeg | 20.1 | 11.4 | — | — |
| LLaVA+X-Decoder | 56.7 | 11.4 | — | 23.4 |
| SAM+BLIP+X-Decoder | 41.1 | 11.3 | — | 27.4 |
| LLaVA+LISA | 7.7 | 0.2 | — | 1.5 |
Comparison with OVS methods (which receive ground-truth category names as input):
| Method | GT Vocabulary Required | VOC | PC | ADE | CS |
|---|---|---|---|---|---|
| CAT-Seg | ✓ | 97.2 | 19.0 | 13.3 | — |
| X-Decoder | ✓ | 96.2 | 16.1 | 6.4 | 50.8 |
| AutoSeg | ✗ (auto-generated) | 87.1 | 11.7 | 6.0 | 30.0 |
| OVSeg | ✓ | 94.5 | 11.0 | 9.0 | — |
| OpenSeg | ✓ | 72.2 | 9.0 | 8.8 | — |
Ablation Study
| Configuration | VOC c-mIoU | PC c-mIoU | ADE c-mIoU | CS c-mIoU |
|---|---|---|---|---|
| AutoSeg (full) | 71.8 | 47.7 | 29.2 | 35.8 |
| X-Decoder+BLIP | 38.0 | 35.3 | 26.7 | 29.2 |
| BBoost Embeddings only | 16.3 | 16.3 | 11.3 | 0.85 |
Key Findings
- AutoSeg achieves 87.1 mIoU on VOC, about 90% of the best OVS method (97.2), without requiring any category specification.
- AutoSeg surpasses OpenSeg and ODISE on VOC and PC, and OVSeg on PC, even though these methods rely on ground-truth category names.
- AutoSeg discovers far more categories than annotated datasets contain (VOC: 938 vs. 20; ADE: 1,578 vs. 847), capturing fine-grained categories absent from annotations (e.g., hawk, coke, dachshund).
- LAVE mapping differs minimally from human mapping (VOC: 87.1 vs. 88.2), confirming the reliability of the LLM-based evaluator.
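To see how LAVE's mapping feeds into the metric, here is a toy mIoU computation with an invented label remapping; the array contents and the mapping are purely illustrative.

```python
# Toy mIoU after LAVE-style remapping of auto-generated labels to
# dataset class ids; all values here are invented for illustration.
import numpy as np

def miou(pred, gt, n_classes):
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union:
            ious.append(inter / union)
    return float(np.mean(ious))

lave_map = {0: 2, 1: 2, 2: 0}            # e.g. "taxi" and "cab" -> "car"
pred_auto = np.array([[0, 1], [2, 2]])   # per-pixel auto-vocabulary labels
gt = np.array([[2, 2], [0, 0]])          # per-pixel dataset labels

pred = np.vectorize(lave_map.get)(pred_auto)
print(miou(pred, gt, n_classes=3))       # 1.0: the mapping resolves synonyms
```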
Highlights & Insights
- Emergent capability of BLIP patch embeddings: Although BLIP is trained solely on image-level captioning, clustering its patch embeddings and feeding them to the decoder enables local captioning—a surprising emergent property with broad implications.
- Multi-resolution, multi-\(k\) ensemble: Ensembling 14 clustering runs substantially reduces the noise of individual \(k\)-means runs and improves stability; the strategy is transferable to any clustering-based pipeline.
- Self-guidance paradigm: Using a self-generated vocabulary to guide segmentation eliminates external input dependencies, yielding a truly end-to-end, intervention-free system.
- LAVE evaluator: Leveraging an LLM for vocabulary mapping resolves the semantic alignment problem between auto-generated and annotated categories, providing considerably greater robustness than cosine-similarity-based approaches.
Limitations & Future Work
- Performance remains relatively weak on datasets with a large number of categories (ADE with 847 classes, Cityscapes), leaving room for improvement.
- BBoost occasionally misses non-salient objects (background regions), as captioning training is biased toward foreground targets.
- Segmentation quality is bounded by X-Decoder; as OVS models support more categories, AutoSeg's upper bound will rise accordingly.
- The optimal number of captioning rounds depends on dataset characteristics, and an adaptive strategy is currently lacking.
Related Work & Insights
- vs. ZeroSeg: ZeroSeg chains several components (DINO clustering, CLIP embeddings, a custom attention module, and GPT-2) yet reaches only 20.1 mIoU on VOC; AutoSeg attains 87.1 using BLIP alone.
- vs. LISA: LISA requires a separate inference pass for each queried category, making it extremely inefficient and ineffective in multi-category scenes (7.7 mIoU).
- vs. OVS methods: Without any category input, AutoSeg surpasses OpenSeg on VOC and PC and OVSeg on PC, even though both receive ground-truth category names, demonstrating the competitiveness of automatic vocabulary generation.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — The first truly intervention-free auto-vocabulary semantic segmentation framework; the discovery of BLIP's emergent local captioning capability is particularly inspiring.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Four datasets, extensive ablations, and comparisons with OVS methods; failure case analysis is somewhat lacking.
- Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear; the AVS vs. OVS comparison figure is intuitive.
- Value: ⭐⭐⭐⭐⭐ — Eliminating the human in the loop represents a meaningful advance for segmentation; the LAVE evaluation framework is broadly reusable.