Towards Open-Ended Visual Recognition with Large Language Model
Conference: ECCV 2024
arXiv: 2311.08400
Code: GitHub
Area: LLM/NLP
Keywords: Open-ended visual recognition, large language model, mask classification, generative recognition, multi-dataset training
TL;DR
This paper proposes the OmniScient Model (OSM), a generative mask classifier built from a frozen CLIP-ViT, a trainable MaskQ-Former, and a frozen LLM (Vicuna-7B). It shifts visual recognition from "selecting categories from a predefined vocabulary" to "directly generating category names," removing the dependency on a predefined vocabulary during both training and testing. Paired with an off-the-shelf segmenter, it outperforms DaTaSeg by +4.3 PQ on COCO panoptic segmentation (ResNet50 backbone).
Background & Motivation
Object localization and recognition in the open world are long-standing challenges in computer vision. Current mainstream methods decompose this problem into two sub-tasks:
- Category-agnostic mask/box proposal: Models like SAM have demonstrated strong zero-shot generalization capabilities.
- Open-vocabulary classification: Classifying based on pre-extracted text embeddings from VLMs like CLIP.
However, existing open-vocabulary recognition methods suffer from two fundamental limitations:
Limitation 1: Category names must be provided during testing. Users need to predefine all possible semantic categories, making the recognition performance highly dependent on the quality of this predefined vocabulary. In real-world scenarios, pre-enumerating all possible category names is almost impossible.
Limitation 2: Label conflicts in multi-dataset training. Label definitions from different datasets might conflict (e.g., "tv" in COCO is semantically equivalent to "monitor" in another dataset). Existing methods either need to train a separate decoder/classifier for each dataset or require manual consolidation of label spaces, significantly increasing engineering complexity.
The authors propose a paradigm shift: from discriminative recognition (selecting from a vocabulary) to generative recognition (directly generating category names), referred to as open-ended visual recognition.
Method
Overall Architecture
OSM adopts a modular architecture consisting of three core components:
- Frozen CLIP-ViT: Extracts high-resolution visual features (utilizing a sliding-window approach).
- Trainable MaskQ-Former: A mask-aware visual feature resampling module.
- Frozen LLM (Vicuna-7B): Predicts category names in a generative manner.
Overall pipeline: Input image \(\rightarrow\) CLIP-ViT extracts features \(\rightarrow\) Segmentation model (SAM/kMaX-DeepLab) generates masks \(\rightarrow\) MaskQ-Former extracts mask region features \(\rightarrow\) LLM generates category names.
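The pipeline above can be read as a composition of four stages. The following is a minimal sketch with toy stand-ins, not the authors' code: the function name `recognize_masks`, its parameter names, and every toy stage are hypothetical and exist only to show the data flow.

```python
from typing import Any, Callable, List


def recognize_masks(
    image: Any,
    extract_features: Callable[[Any], Any],
    propose_masks: Callable[[Any], List[Any]],
    resample: Callable[[Any, Any], Any],
    generate_name: Callable[[Any], str],
) -> List[str]:
    """Run the four pipeline stages and return one generated name per mask."""
    features = extract_features(image)      # frozen CLIP-ViT in OSM
    masks = propose_masks(image)            # e.g. SAM or kMaX-DeepLab
    names = []
    for mask in masks:
        mask_tokens = resample(features, mask)    # MaskQ-Former in OSM
        names.append(generate_name(mask_tokens))  # frozen LLM in OSM
    return names


# Toy stand-ins (hypothetical): the "image" is a 2x2 grid of intensities.
toy_image = [[0.9, 0.8], [0.1, 0.2]]
toy_masks = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]  # pixel coordinates per mask


def toy_resample(features, mask):
    # Average the features that fall inside the mask.
    return sum(features[r][c] for r, c in mask) / len(mask)


names = recognize_masks(
    toy_image,
    extract_features=lambda img: img,
    propose_masks=lambda img: toy_masks,
    resample=toy_resample,
    generate_name=lambda t: "bright region" if t > 0.5 else "dark region",
)
```

Passing the stages as callables mirrors the modular design: the segmenter can be swapped without touching the recognition stages.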
Problem Formulation: From Classification to Generation
Traditional classification: Given an image \(\mathbf{I}\) and \(M\) segmentation masks \(\mathbf{M}\), the task is to predict the category \(c_i \in C\) for each mask, where \(C\) is a predefined category set.
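Open-ended recognition: The vocabulary \(C\) is assumed to be unknown during both training and testing. Recognition is reformulated as a text generation task that maximizes the conditional likelihood of the category name under an autoregressive factorization:
\[p(c_i \mid \mathbf{I}, \mathbf{m}_i) = \prod_{j=1}^{L_i} p(c_{i,j} \mid \mathbf{I}, \mathbf{m}_i, c_{i,1}, \dots, c_{i,j-1})\]
where \(\mathbf{m}_i\) is the \(i\)-th mask, \(c_{i,j}\) is the \(j\)-th token of the category name \(c_i\), and \(L_i\) is the number of tokens in that name. The model directly generates the category name instead of selecting from a candidate set.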
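To make the token-level likelihood concrete, the sketch below sums per-token log-probabilities for one category name. It assumes the usual autoregressive factorization; the lookup table `TOY_MODEL` is a hypothetical stand-in for the LLM conditioned on the image and mask, and its probabilities are invented for illustration.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"traffic": 0.5, "tv": 0.5},
    ("traffic",): {"light": 0.8, "sign": 0.2},
    ("traffic", "light"): {"<eos>": 1.0},
}


def name_log_likelihood(
    tokens: Sequence[str],
    next_token_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
) -> float:
    """Sum of log p(token_j | earlier tokens) over the category-name tokens."""
    total = 0.0
    for j, token in enumerate(tokens):
        prefix = tuple(tokens[:j])
        total += math.log(next_token_probs(prefix)[token])
    return total


log_p = name_log_likelihood(["traffic", "light", "<eos>"], TOY_MODEL.__getitem__)
```

Here `math.exp(log_p)` equals the product 0.5 × 0.8 × 1.0 of the toy probabilities; training with next-token prediction loss minimizes the negative of this sum.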
High-Resolution Feature Extraction (Sliding Window Strategy)
Problem: CLIP-ViT is pre-trained at 224×224, and feeding it high-resolution inputs directly degrades performance severely, because a frozen ViT generalizes poorly to resolution changes.
Solution: A sliding-window strategy is applied at the input level:
- Slice the high-resolution image (e.g., 896×896) into multiple windows that match the pre-trained size.
- Process each window independently through the frozen ViT to extract features.
- Add global positional encodings to compensate for missing positional information across windows.
Experiments show that increasing the resolution from 224 to 448 improves classification accuracy by 12.8%, with 1120×1120 being the best-performing resolution. Feeding the high-resolution input to the ViT globally (without sliding windows) is 7.2% worse than the sliding-window approach.
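The slicing step can be sketched as follows. This assumes non-overlapping windows and an image size that is an exact multiple of the window size; the source does not state either detail, and `window_boxes` is a hypothetical helper name.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def window_boxes(height: int, width: int, window: int = 224) -> List[Box]:
    """Tile the image with non-overlapping square windows, row by row."""
    if height % window or width % window:
        raise ValueError("image size must be a multiple of the window size")
    return [
        (top, left, top + window, left + window)
        for top in range(0, height, window)
        for left in range(0, width, window)
    ]


boxes = window_boxes(896, 896)
# Grid position of each window, usable as an index for a global positional encoding.
grid_positions = [(top // 224, left // 224) for top, left, _, _ in boxes]
```

Each window is then passed through the frozen ViT on its own, and the grid position lets a global positional encoding restore the layout across windows.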
MaskQ-Former: Mask-Aware Feature Resampling
Existing Q-Former or Perceiver Resampler architectures use global attention to aggregate image features without considering segmentation mask information. OSM proposes MaskQ-Former, which contains two sets of learnable queries:
Mask Queries:
- Focus exclusively on the pixel features within the mask area via masked cross-attention.
- Concentrate on the visual information of the target object itself.
Context Queries:
- Focus on the expanded region of the mask (e.g., the bounding box area).
- Provide surrounding context for the target object (context is crucial for recognition).
The two sets of queries interact through self-attention layers with shared parameters (differing only in query initialization), incurring virtually no extra overhead. Finally, only Mask Queries are retained as input to the LLM.
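The difference between the two query types comes down to which positions the cross-attention may look at. Below is a minimal single-query, single-head sketch in plain Python, under the assumption that masking is implemented by excluding disallowed positions from the softmax; the function name and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def masked_cross_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    allowed: Sequence[bool],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention restricted to positions where allowed is True."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        raise ValueError("the mask selects no positions")
    top = max(kept)  # subtract the max for numerical stability
    weights = [math.exp(s - top) if a else 0.0 for s, a in zip(scores, allowed)]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query = [1.0, 0.0]

# Mask query: only positions inside the object mask are visible.
_, mask_weights = masked_cross_attention(query, keys, values, [True, True, False])
# Context query: a wider region (here, every position) is visible.
_, context_weights = masked_cross_attention(query, keys, values, [True, True, True])
```

Because both query types can share the same attention parameters and differ only in the visibility pattern, the context branch adds little beyond the extra queries.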
Ablation of Context Regions: Expanding the bounding box by 0.5× yields the best performance (+2.6% compared to global context); neither overly tight nor overly loose regions are optimal.
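One way to read "expanding the bounding box by 0.5×" is that the box grows by 50% of its height and width, split evenly between the two sides and clipped to the image. That reading is an assumption, not something the source spells out, and `mask_bbox` and `expand_bbox` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def mask_bbox(mask: List[List[int]]) -> Box:
    """Tight bounding box of the nonzero entries of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def expand_bbox(box: Box, ratio: float, height: int, width: int) -> Box:
    """Grow the box by `ratio` of its size (half on each side), clipped to the image."""
    top, left, bottom, right = box
    dh = (bottom - top) * ratio / 2
    dw = (right - left) * ratio / 2
    return (
        max(0, math.floor(top - dh)),
        max(0, math.floor(left - dw)),
        min(height, math.ceil(bottom + dh)),
        min(width, math.ceil(right + dw)),
    )


# A 16x16 mask with a 4x4 object occupying rows 4-7 and columns 4-7.
mask = [[1 if 4 <= r < 8 and 4 <= c < 8 else 0 for c in range(16)] for r in range(16)]
tight = mask_bbox(mask)
context = expand_bbox(tight, 0.5, 16, 16)
```

With `ratio=0.0` the function returns the tight box, which corresponds to the 0.0× row of the context-region ablation.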
Mode Query: Dual-Mode of Vocabulary-Specific vs. Vocabulary-Agnostic
To balance open-ended recognition capabilities with alignment to dataset-specific vocabularies, OSM introduces a Mode Query mechanism:
- Vocabulary-Specific Query: Each training dataset is associated with its own learnable query. Activating the corresponding dataset query during training helps the model "memorize" the label space of that dataset.
- Vocabulary-Agnostic Query: A shared general query across all datasets, which can be activated on any dataset during training to preserve open-ended recognition capabilities.
During training, half of the samples in each batch activate the vocab-specific query, while the other half activate the vocab-agnostic query. At test time, either mode can be selected:
- To align with a specific vocabulary \(\rightarrow\) activate the vocab-specific query.
- For open-ended predictions \(\rightarrow\) activate the vocab-agnostic query.
Ablation experiments show that, without the Mode Query, the model generalizes better but shows decreased alignment to specific datasets; adding the Mode Query improves both aspects.
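The half-and-half mode assignment can be sketched as below. Only the selection of a mode key is shown; in the model each key would index a learnable query embedding. The dataset keys, the `AGNOSTIC` constant, and both function names are hypothetical, and choosing the half at random is an assumption.

```python
import random
from typing import List, Optional

DATASETS = ["coco", "lvis", "ade20k", "cityscapes", "a847", "pc459"]
AGNOSTIC = "vocab_agnostic"  # key of the shared, dataset-independent query


def assign_modes(batch_datasets: List[str], rng: random.Random) -> List[str]:
    """Give half of the batch its dataset-specific mode, the rest the agnostic mode."""
    for name in batch_datasets:
        if name not in DATASETS:
            raise ValueError(f"unknown dataset: {name}")
    n = len(batch_datasets)
    specific = set(rng.sample(range(n), n // 2))
    return [ds if i in specific else AGNOSTIC for i, ds in enumerate(batch_datasets)]


def select_mode(dataset: Optional[str] = None) -> str:
    """Test time: pass a dataset name to align with its vocabulary, or None for open-ended."""
    return AGNOSTIC if dataset is None else dataset


modes = assign_modes(["coco"] * 8, random.Random(0))
```

At test time `select_mode("coco")` would activate the COCO-specific query, while `select_mode()` falls back to the agnostic one.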
Loss & Training
Datasets: Jointly trained on six public segmentation datasets:
- COCO panoptic segmentation, ADE20K panoptic segmentation, Cityscapes panoptic segmentation
- LVIS instance segmentation, ADE-847 semantic segmentation, PC-459 semantic segmentation
Training Scheme:
- Employs an instruction tuning strategy.
- In each iteration, an image and a ground-truth (GT) mask are randomly selected, a prompt is chosen among 19 instruction templates, and the true category name is inserted.
- Standard next-token prediction loss is used.
- Initialized from InstructBLIP pre-trained weights (EVA-ViT-g/224 + Vicuna-7B).
- 32 mask queries + 32 context queries + 1 mode query.
- AdamW optimizer, learning rate of \(4\times10^{-5}\), weight decay of \(0.05\), with cosine decay.
- Dataset-specific batch sizes are applied (COCO: 32, LVIS: 64, ADE20K: 16, etc.).
- A total of 6M masks are processed.
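For reference, a cosine decay starting at the stated learning rate can be written as below. This is a sketch under assumptions the source does not confirm: no warmup, and decay to a minimum of zero. The function name and the `min_lr` parameter are hypothetical.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 4e-5, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under a half-cosine decay from base_lr to min_lr."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step schedule.
schedule = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

The schedule starts at the base rate, passes through half of it at the midpoint, and reaches `min_lr` at the final step.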
Inference: Uses the default template "What is in the segmentation mask?" and adopts greedy decoding.
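Greedy decoding picks the most probable next token at every step until an end-of-sequence token appears. The sketch below uses a hypothetical lookup table in place of the LLM; the token strings, the probabilities, and the `max_len` cap are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

PROMPT = "What is in the segmentation mask?"  # default template from the text

# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"traffic": 0.6, "tv": 0.4},
    ("traffic",): {"light": 0.7, "sign": 0.3},
    ("traffic", "light"): {"<eos>": 0.9, "pole": 0.1},
}


def greedy_decode(
    next_token_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    eos: str = "<eos>",
    max_len: int = 8,
) -> List[str]:
    """Repeatedly append the highest-probability token until eos or max_len."""
    tokens: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(tuple(tokens))
        best = max(probs, key=probs.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens


category_name = " ".join(greedy_decode(TOY_MODEL.__getitem__))
```

In the real model the prompt and the mask tokens condition every step; here that conditioning is folded into the toy table.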
Key Experimental Results
Main Results: GT Mask Classification
OSM is evaluated on mask classification accuracy (Acc) and the not-in-vocabulary prediction ratio (NIV) across six datasets:
| Setting | COCO Acc | LVIS Acc | ADE20K Acc | Cityscapes Acc | A847 Acc | PC459 Acc | Average Acc |
|---|---|---|---|---|---|---|---|
| Single-dataset Training | 85.5 | 68.3 | 82.3 | 79.4 | 76.9 | 80.9 | - |
| Learnable Embed | - | - | - | - | - | - | 78.9 |
| Text Embed | - | - | - | - | - | - | < 78.7 |
| OSM vocab-specific | - | - | - | - | - | - | 78.7 |
| OSM vocab-agnostic | - | - | - | - | - | - | Slightly lower |
| OSM† vocab-specific | - | - | - | - | - | - | Significant improvement |
Key Findings: Although generative, OSM performs on par with the discriminative Learnable Embed baseline on this classification task (78.7% vs. 78.9% average accuracy) and has an extremely low NIV, suggesting that the generative model can be effectively constrained to the training vocabulary.
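The two metrics can be computed from generated names as sketched below. The matching rule is an assumption: a prediction counts as correct on a case-insensitive exact string match with the label, and as not-in-vocabulary when it matches no name in the dataset vocabulary. The evaluation protocol in the paper may differ, and `acc_and_niv` is a hypothetical name.

```python
from typing import Iterable, Sequence, Tuple


def acc_and_niv(
    predictions: Sequence[str],
    labels: Sequence[str],
    vocabulary: Iterable[str],
) -> Tuple[float, float]:
    """Return (accuracy, not-in-vocabulary ratio) over generated category names."""
    if not predictions or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    vocab = {name.strip().lower() for name in vocabulary}
    preds = [p.strip().lower() for p in predictions]
    golds = [g.strip().lower() for g in labels]
    correct = sum(p == g for p, g in zip(preds, golds))
    out_of_vocab = sum(p not in vocab for p in preds)
    return correct / len(preds), out_of_vocab / len(preds)


acc, niv = acc_and_niv(
    predictions=["tv", "monitor", "person", "dog"],
    labels=["tv", "tv", "person", "cat"],
    vocabulary=["tv", "person", "dog", "cat"],
)
```

In this toy case "monitor" is counted both as an error and as a not-in-vocabulary prediction, which is the kind of case the NIV analysis later in these notes looks at.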
Main Results: Combining with Off-the-shelf Segmenters
Using kMaX-DeepLab as the mask proposal model, OSM is compared with other multi-dataset generic segmentation methods:
| Method | Backbone | COCO PQ | ADE20K PQ | ADE20K mIoU | Cityscapes PQ | Cityscapes mIoU |
|---|---|---|---|---|---|---|
| Mask2Former (Expert Model) | ResNet50 | 51.9 | 39.7 | 46.1 | 62.1 | 77.5 |
| LMSeg | ResNet50 | 38.6 | 35.4 | 45.2 | 54.8 | 80.9 |
| DaTaSeg | ResNet50 | 49.0 | 29.8 | 48.1 | - | - |
| DaTaSeg | ViTDet-L | 53.5 | 33.4 | 54.0 | - | - |
| OSM | ResNet50 | 53.3 | 43.8 | 50.0 | 59.5 | 77.0 |
| OSM | ConvNeXt-L | 56.1 | 49.7 | 55.2 | 64.7 | 80.2 |
- OSM (ResNet50) outperforms LMSeg by +14.7 PQ (COCO), +8.4 PQ (ADE20K), and +4.7 PQ (Cityscapes).
- OSM outperforms DaTaSeg by +4.3 PQ (COCO, ResNet50) and +2.6 PQ (COCO, large backbones: ConvNeXt-L vs. ViTDet-L).
- With a ResNet50 backbone, OSM is comparable to the expert model Mask2Former: higher PQ on COCO and ADE20K, lower PQ on Cityscapes.
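The quoted PQ gains follow directly from the table. The short check below recomputes them from the table values; the dictionary layout and the `pq_gain` helper are just a convenient, hypothetical way to organize the numbers.

```python
# PQ values copied from the comparison table above.
PQ = {
    ("OSM", "ResNet50"): {"COCO": 53.3, "ADE20K": 43.8, "Cityscapes": 59.5},
    ("OSM", "ConvNeXt-L"): {"COCO": 56.1, "ADE20K": 49.7, "Cityscapes": 64.7},
    ("LMSeg", "ResNet50"): {"COCO": 38.6, "ADE20K": 35.4, "Cityscapes": 54.8},
    ("DaTaSeg", "ResNet50"): {"COCO": 49.0, "ADE20K": 29.8},
    ("DaTaSeg", "ViTDet-L"): {"COCO": 53.5, "ADE20K": 33.4},
    ("Mask2Former", "ResNet50"): {"COCO": 51.9, "ADE20K": 39.7, "Cityscapes": 62.1},
}


def pq_gain(ours, baseline, dataset):
    """PQ difference (ours minus baseline), rounded to one decimal place."""
    return round(PQ[ours][dataset] - PQ[baseline][dataset], 1)


gains_vs_lmseg = {
    d: pq_gain(("OSM", "ResNet50"), ("LMSeg", "ResNet50"), d)
    for d in ("COCO", "ADE20K", "Cityscapes")
}
gain_vs_dataseg_r50 = pq_gain(("OSM", "ResNet50"), ("DaTaSeg", "ResNet50"), "COCO")
gain_vs_dataseg_large = pq_gain(("OSM", "ConvNeXt-L"), ("DaTaSeg", "ViTDet-L"), "COCO")
```

Rounding to one decimal avoids floating-point noise in the subtraction.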
Ablation Study
Input Resolution Ablation:
| Resolution | Average Acc |
|---|---|
| 224×224 | Baseline |
| 448×448 | +12.8% |
| 896×896 | Continued improvement |
| 1120×1120 | Optimal |
| Higher | Performance drop |
Sliding Window vs. Global: The sliding-window approach is 7.2% better than processing the high-resolution input globally.
Context Region Expansion:
| Context Range | Change relative to Global |
|---|---|
| Global | Baseline |
| 0.0× (tight bbox) | +0.8% |
| 0.5× (expanded bbox) | +2.6% |
Open-Vocabulary Evaluation (trained only on COCO+LVIS, zero-shot evaluation on ADE20K):
| Method | PQ | AP | mIoU |
|---|---|---|---|
| MaskCLIP | 15.1 | 6.0 | 23.7 |
| FC-CLIP | 17.8 | 11.1 | 20.8 |
| ODISE | 19.5 | 10.8 | 23.8 |
| OSM | 21.4 | 12.4 | 26.9 |
Key Findings
- Generative \(\approx\) Discriminative: OSM performs on par with learnable embedding classifiers in discriminative tasks, challenging the assumption that "generative models are unsuitable for discriminative tasks."
- Accuracy-Generalization Trade-off: As the number of training masks increases (1M \(\rightarrow\) 9M), Acc improves while NIV decreases (weakening generalization capability), representing an inherent trade-off.
- Emergent Part Recognition: Although never trained on part segmentation data, OSM exhibits the emergent ability to predict parts like "tail" or "ear" (leveraging the world knowledge of the LLM).
- Comparison with GPT-4V: OSM is more accurate in mask classification than GPT-4V but tends to make more conservative predictions (e.g., predicting "person" instead of "man in armor").
Highlights & Insights
- Paradigm Shift: Transitioning from discriminative (selecting from a vocabulary) to generative (directly generating category names) completely eliminates the dependency on predefined vocabularies, marking a significant evolution in visual recognition paradigms.
- No Manual Intervention for Multi-Dataset Training: The Mode Query mechanism elegantly resolves cross-dataset label conflicts, eliminating the need for manual label consolidation.
- High-Resolution Feature Extraction via Sliding Window: This simple strategy yields a 12.8% accuracy gain when moving from 224 to 448 resolution, suggesting that high-resolution adaptation of frozen ViTs does not require complex designs.
- Mask-Context Dual-Query Design in MaskQ-Former: It balances precise target region features with contextual information, supported by an efficient and elegant parameter-sharing scheme.
- NIV Analysis: Visualizing NIV cases reveals that OSM's predictions (e.g., "monitor") are often more accurate than the ground-truth annotations (e.g., labeled as "tv" in COCO), exposing the inherent bias in fixed-vocabulary annotations.
Limitations & Future Work
- Accuracy-Generalization Trade-off: Longer training makes the model more prone to conservative predictions (lower NIV). Maintaining open-ended generation capacity while ensuring accuracy remains an open issue.
- Computational Overhead: The frozen LLM (Vicuna-7B) requires autoregressive generation for category names during inference, which is significantly slower than simple embedding matching.
- Prediction Conservatism: Compared to GPT-4V, OSM tends to generate more generic category names ("person" vs. "man in armor"), which could potentially be mitigated by using stronger base LLMs (e.g., Llama-2) or scaling up the training data.
- Unprocessed Image-Level Data: Image-level datasets like ImageNet are not utilized due to single-label constraints being unsuited for multi-object scenarios. Exploring image-level weakly supervised integration is a direction for future work.
- Dependency on External Segmenters: Mask quality directly impacts recognition performance; joint end-to-end training of segmentation and recognition could be superior.
Related Work & Insights
- Vocabulary-Free Classification: Concurrent works remove the predefined-vocabulary requirement by retrieving candidate names or parsing generated text, whereas OSM addresses this more thoroughly via a generative approach.
- InstructBLIP / LLaVA: Representative modular multimodal LLMs; OSM builds upon them by incorporating mask-aware capabilities.
- DaTaSeg / LMSeg: Discriminative multi-dataset segmentation methods that require independent classifiers or manual label consolidation, which OSM naturally avoids.
- Insights: The generative recognition paradigm can be extended to other vision tasks like detection and tracking, and the Mode Query mechanism can be applied to other multi-task or multi-dataset scenarios.
Rating
| Dimension | Score (1-10) | Description |
|---|---|---|
| Novelty | 9 | Open-ended recognition paradigm + MaskQ-Former + Mode Query, paradigm shift is highly significant. |
| Technical Depth | 8 | Exquisitely designed modules; sliding window, dual-query, and dual-mode queries are tightly integrated. |
| Experimental Thoroughness | 9 | 6 datasets, GT masks + off-the-shelf masks, multi-dimensional ablations, comparison with GPT-4V, and open-vocabulary evaluation. |
| Writing Quality | 8 | Clear formulation of motivations, with intuitive comparison diagrams of discriminative vs. generative paradigms. |
| Value | 8 | Eliminates vocabulary dependency with zero manual intervention for multi-dataset training, offering great practical deployment value. |
| Overall Score | 8.4 | Outstanding paradigm innovation with rigorous and comprehensive experiments. Reliable quality from ByteDance. |