ObjEmbed: Towards Universal Multimodal Object Embeddings
Conference: ICML 2026
arXiv: 2605.29118
Code: To be confirmed
Area: Multimodal / Vision-Language / Object Representation Learning
Keywords: Universal Object Embeddings, Multimodal Learning, Object Retrieval, Cross-task Representation
TL;DR
ObjEmbed trains a universal object embedding model that aligns multimodal object representations by jointly training on detection, segmentation, retrieval, captioning, and classification. A single embedding matches or outperforms task-specific SOTA on 11 tasks, including open-vocabulary detection (OVD), open-vocabulary segmentation (OVS), Text2Image-Object, and Open-Caption-Eval.
Background & Motivation
Background: Multimodal understanding of visual objects is a core problem in computer vision, but existing methods are largely task-specific. CLIP aligns images and text but has weak object-level granularity, OWL-ViT is strong at object detection but lacks generative capabilities, and SAM segments well but carries weak semantics.
Limitations of Prior Work: (1) Task-specific models make deployment costly; (2) representations are fragmented across tasks, so cross-task transfer fails; (3) there is no unified benchmark for evaluating object-level representations; (4) data is scarce, because high-quality object-level data for any single task is difficult to scale.
Key Challenge: Real-world applications require a single embedding to support multiple tasks like detection, segmentation, retrieval, and captioning, but current methods are either task-specific or lack sufficient granularity.
Goal: Construct a universal object embedding model where a single representation supports high performance across multiple tasks.
Key Insight: Objects serve as the "common carrier" for multimodal tasks: detection and segmentation localize them, retrieval matches them, and labeling and captioning describe their semantics. Learning universal object-level embeddings can therefore support all these tasks simultaneously.
Core Idea: Learn universal object embeddings through multi-task joint training + object-level alignment; train a single backbone using large-scale heterogeneous data (COCO/LVIS/RefCOCO/CC3M) + task-specific heads.
Method
Overall Architecture
- Dual-stream encoding: Image encoder (ViT-L) + text encoder (BERT-Large).
- Object detection head: Outputs object boxes based on DETR.
- Object embedding head: Outputs \(\mathbf{e}_{\text{obj}} \in \mathbb{R}^{512}\) for each object.
- Multi-task loss: Joint optimization of detection, segmentation, retrieval, captioning, and classification.
- Object-level alignment: Aligns image object embeddings with text object embeddings via contrastive learning.
Key Designs
- Object-level Alignment + Multimodal Contrastive Learning:
    - Function: Aligns image object embeddings with the embeddings of their corresponding text descriptions.
    - Mechanism: Each image produces \(N\) object embeddings \(\{\mathbf{e}_{\text{obj}}^i\}\) via the detection head. The text description \(\mathbf{t}^i\) of each object is obtained from datasets like RefCOCO and encoded as \(\mathbf{e}_{\text{text}}^i\). For object \(i\), the contrastive loss is \(\mathcal{L}_{\text{align}} = -\log \frac{\exp(\mathbf{e}_{\text{obj}}^i \cdot \mathbf{e}_{\text{text}}^i / \tau)}{\sum_j \exp(\mathbf{e}_{\text{obj}}^i \cdot \mathbf{e}_{\text{text}}^j / \tau)}\), where \(\tau\) is the temperature and the sum over \(j\) covers in-batch and cross-image negative samples.
    - Design Motivation: Image-level alignment (e.g., CLIP) is too coarse; object-level alignment lets embeddings capture fine-grained semantics. Contrastive learning supplies a large-scale training signal without class labels.
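
To make the alignment loss concrete, here is a minimal sketch of the formula in plain Python. It assumes the embeddings are L2-normalized before the dot product and uses a temperature of 0.07; neither detail is stated in the summary above, and the function names are illustrative rather than taken from the paper's code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def alignment_loss(obj_embs, text_embs, tau=0.07):
    """Mean InfoNCE loss over N matched (object, text) pairs.

    obj_embs[i] and text_embs[i] form a positive pair; every other text
    embedding in the batch acts as a negative for object i.
    tau=0.07 and the L2 normalization are assumptions, not paper values.
    """
    objs = [l2_normalize(e) for e in obj_embs]
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for i, e in enumerate(objs):
        logits = [sum(a * b for a, b in zip(e, t)) / tau for t in texts]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]  # equals -log softmax(logits)[i]
    return total / len(objs)

# Toy batch: three objects with 2-D embeddings (the real ones are 512-D).
objs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = alignment_loss(objs, objs)
shuffled = alignment_loss(objs, [objs[1], objs[2], objs[0]])
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f}")
```

When each object is paired with its own text embedding the loss is close to zero, and it grows sharply once the pairs are shuffled, which is the behavior the contrastive objective relies on.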
- Multi-task Joint Training + Task-specific Heads:
    - Function: Enables the backbone to learn generalized object representations through multi-task training.
    - Mechanism: Combined training includes detection loss \(\mathcal{L}_{\text{det}}\) (DETR set matching), segmentation loss \(\mathcal{L}_{\text{seg}}\) (mask prediction), retrieval loss \(\mathcal{L}_{\text{ret}}\) (contrastive alignment), captioning loss \(\mathcal{L}_{\text{cap}}\) (autoregressive generation), and classification loss \(\mathcal{L}_{\text{cls}}\) (cross-entropy). The total loss is \(\mathcal{L} = \sum_i \lambda_i \mathcal{L}_i\), with the weights \(\lambda_i\) adapted by GradNorm.
    - Design Motivation: Single-task training yields specialized representations, whereas joint multi-task training forces the backbone to learn universal features. GradNorm balances gradient scales across the different tasks.
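
The weighted sum and the GradNorm balancing step can be sketched as follows. This is a simplified, framework-free version: it takes the per-task gradient norms on the shared layer as given numbers instead of computing them with autograd, and `alpha`, `lr`, and all numeric values are assumptions chosen for illustration, not settings from the paper.

```python
def total_loss(losses, weights):
    """Weighted sum L = sum_i lambda_i * L_i over task names."""
    return sum(weights[k] * losses[k] for k in losses)

def gradnorm_step(weights, grad_norms, losses, initial_losses, alpha=1.5, lr=0.025):
    """One simplified GradNorm update of the task weights.

    grad_norms[k] is the norm of the unweighted gradient of task k's loss
    on the shared layer, so the weighted norm is weights[k] * grad_norms[k].
    alpha and lr are assumed hyperparameters.
    """
    tasks = list(weights)
    weighted = {k: weights[k] * grad_norms[k] for k in tasks}
    mean_norm = sum(weighted.values()) / len(tasks)
    # Loss ratio relative to the start: larger means the task learns slowly.
    ratios = {k: losses[k] / initial_losses[k] for k in tasks}
    mean_ratio = sum(ratios.values()) / len(tasks)
    new_weights = {}
    for k in tasks:
        # Target gradient norm for task k, treated as a constant.
        target = mean_norm * (ratios[k] / mean_ratio) ** alpha
        # d|w*g - target|/dw = sign(w*g - target) * g
        diff = weighted[k] - target
        sign = (diff > 0) - (diff < 0)
        new_weights[k] = max(weights[k] - lr * sign * grad_norms[k], 1e-6)
    # Renormalize so the weights sum to the number of tasks.
    scale = len(tasks) / sum(new_weights.values())
    return {k: w * scale for k, w in new_weights.items()}

# Hypothetical loss values and gradient norms for the five tasks.
losses = {"det": 1.2, "seg": 0.9, "ret": 2.0, "cap": 3.1, "cls": 0.7}
initial = {"det": 2.0, "seg": 1.8, "ret": 2.2, "cap": 3.5, "cls": 2.1}
norms = {"det": 4.0, "seg": 1.0, "ret": 0.5, "cap": 2.0, "cls": 1.5}
weights = {k: 1.0 for k in losses}
new_weights = gradnorm_step(weights, norms, losses, initial)
print(total_loss(losses, new_weights), new_weights)
```

In this toy setting the detection task has the largest gradient norm, so its weight shrinks, while the slow-learning retrieval and captioning tasks gain weight.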
- Large-scale Heterogeneous Data Training + Data Mixing Strategy:
    - Function: Trains the universal model using large-scale heterogeneous data.
    - Mechanism: Mixed training data includes COCO (detection/segmentation), LVIS (long-tail detection), RefCOCO (referring expression), CC3M (image-text alignment), and ImageNet (classification). Batches are sampled according to task proportions, and Online Hard Example Mining (OHEM) is used to improve performance on tail categories.
    - Design Motivation: A single dataset carries task biases (e.g., COCO has only 80 classes), whereas heterogeneous data provides diverse object types and contexts.
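
The mixing and hard-example steps can be illustrated with a short sketch. The mixing proportions, the `keep_ratio`, and both function names are hypothetical: the summary does not report the actual ratios or how OHEM is configured.

```python
import random

# Assumed mixing proportions; the source does not report the actual ratios.
MIX = {"COCO": 0.3, "LVIS": 0.25, "RefCOCO": 0.15, "CC3M": 0.2, "ImageNet": 0.1}

def sample_batch_sources(num_batches, mix=MIX, seed=0):
    """Pick the source dataset for each batch in proportion to `mix`."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=num_batches)

def ohem_select(per_example_losses, keep_ratio=0.25):
    """Return indices of the hardest examples (largest loss), hardest first."""
    keep = max(1, int(len(per_example_losses) * keep_ratio))
    order = sorted(range(len(per_example_losses)),
                   key=lambda i: per_example_losses[i], reverse=True)
    return order[:keep]

schedule = sample_batch_sources(1000)
hard = ohem_select([0.1, 2.3, 0.4, 1.7, 0.2, 0.05, 0.9, 3.0])
print(schedule[:5], hard)
```

Only the examples selected by `ohem_select` would contribute to the backward pass in that step, which concentrates updates on the examples the model currently gets most wrong, such as rare tail categories.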
Key Experimental Results
Main Results: Cross-task Performance (vs Task-specific SOTA)
| Task | Dataset | Prev. SOTA | Ours | Gain |
|---|---|---|---|---|
| Open-voc Detection | LVIS | OWL-ViT (33.7 AP) | 35.2 AP | +1.5 |
| Open-voc Segmentation | LVIS | OpenSeeD (26.5 mIoU) | 27.8 mIoU | +1.3 |
| Object Retrieval | COCO-Search | Detic (52.3 R@5) | 58.7 R@5 | +6.4 |
| Text-to-Image Object | Open-T2I | T2I-Object (47.8 mAP) | 49.5 mAP | +1.7 |
| Object Captioning | RefCOCOg | OFA-Cap (118.4 CIDEr) | 122.3 CIDEr | +3.9 |
| Object Classification | LVIS-V | Eva-CLIP (74.2 Acc) | 76.5 Acc | +2.3 |
Multi-task Joint Effects
| Training Strategy | LVIS AP | RefCOCO R@1 | Cap CIDEr | Task Average |
|---|---|---|---|---|
| Detection Only | 32.8 | 51.2 | 95.3 | 59.8 |
| Retrieval Only | 28.5 | 78.4 | 92.7 | 66.5 |
| Detection + Retrieval | 33.5 | 76.8 | 108.4 | 72.9 |
| Full Multi-task | 35.2 | 82.5 | 122.3 | 80.0 |
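
The "Task Average" column appears to be the unweighted mean of the three metric columns; the check below reproduces the reported values under that assumption. Note that the mean mixes metrics on different scales, so CIDEr contributes the largest share.

```python
# Rows copied from the table above: (LVIS AP, RefCOCO R@1, Cap CIDEr).
rows = {
    "Detection Only": (32.8, 51.2, 95.3),
    "Retrieval Only": (28.5, 78.4, 92.7),
    "Detection + Retrieval": (33.5, 76.8, 108.4),
    "Full Multi-task": (35.2, 82.5, 122.3),
}

def task_average(metrics):
    """Unweighted mean of the metric columns, rounded to one decimal."""
    return round(sum(metrics) / len(metrics), 1)

averages = {name: task_average(m) for name, m in rows.items()}
# Gain of full multi-task training over each other strategy.
gains = {name: round(averages["Full Multi-task"] - avg, 1)
         for name, avg in averages.items() if name != "Full Multi-task"}
print(averages)
print(gains)
```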
Data Scaling
| Training Data Scale | LVIS AP | RefCOCO R@1 | OOD Cap CIDEr |
|---|---|---|---|
| 100K images | 28.7 | 72.3 | 92.1 |
| 500K images | 32.5 | 78.6 | 108.7 |
| 1M images | 34.1 | 81.3 | 117.4 |
| 2M images | 35.2 | 82.5 | 122.3 |
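
To read the scaling trend more directly, the step-to-step gains can be computed from the table. This is plain arithmetic on the reported numbers; keep in mind the steps are not uniform (the first is a 5x increase in data, the others are 2x).

```python
# (images, LVIS AP, RefCOCO R@1, OOD Cap CIDEr), copied from the table above.
scaling = [
    (100_000, 28.7, 72.3, 92.1),
    (500_000, 32.5, 78.6, 108.7),
    (1_000_000, 34.1, 81.3, 117.4),
    (2_000_000, 35.2, 82.5, 122.3),
]

def step_gains(table):
    """Metric deltas between consecutive data scales, rounded to one decimal."""
    return [
        (cur[0], tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:])))
        for prev, cur in zip(table, table[1:])
    ]

for images, deltas in step_gains(scaling):
    print(f"up to {images:>9,} images: {deltas}")
```

Every step still improves all three metrics, but each gain is smaller than the one before it.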
Object Embedding Quality Evaluation
| Evaluation Metric | CLIP Object Embed | OWL-ViT | Ours |
|---|---|---|---|
| Object t-SNE Purity | 0.67 | 0.74 | 0.88 |
| Cross-dataset Transfer Acc | 58.3 | 64.7 | 78.9 |
| Zero-shot Object Cls | 71.5 | 73.2 | 79.4 |
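
The summary does not describe how the zero-shot object classification numbers are obtained. A common protocol, assumed here purely for illustration, is to assign each object embedding to the class whose text embedding has the highest cosine similarity. The embeddings and class names below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(obj_emb, class_text_embs):
    """Return the class name whose text embedding is most similar to obj_emb."""
    return max(class_text_embs, key=lambda name: cosine(obj_emb, class_text_embs[name]))

# Hypothetical 3-D embeddings for illustration; real ones would be 512-D.
class_text_embs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
prediction = zero_shot_classify([0.8, 0.3, 0.1], class_text_embs)
print(prediction)
```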
Key Findings
- Multi-task joint training significantly outperforms single-task training: Full multi-task training reaches a task average of 80.0, compared with 59.8 for detection-only (+20.2) and 66.5 for retrieval-only (+13.5).
- Data scale is not yet saturated: Performance is still improving at 2M images (from 1M to 2M: +1.1 LVIS AP, +1.2 RefCOCO R@1, +4.9 CIDEr), although each increase in data yields smaller gains than the last.
- Object embedding quality is substantially enhanced: t-SNE purity reaches 0.88, compared with 0.74 for OWL-ViT and 0.67 for CLIP.
Highlights & Insights
- Success of unified object representation learning: Overcomes the limitations of task specialization, showing that universal object embeddings can support multiple tasks simultaneously.
- Synergy in multi-task joint training: Multi-task constraints force the model to learn more universal and robust features.
- Object-level alignment + Heterogeneous data: The combination of fine-grained alignment and data diversity is the key driver of success.
Limitations & Future Work
- Model Scale: The deployment cost of a ViT-L model with multiple heads remains high.
- Task Coverage: Training currently covers 5 tasks (detection, segmentation, retrieval, captioning, classification); future work should extend to 3D objects, video objects, and compositional understanding.
- Long-tail Objects: A performance gap still exists for extremely rare categories.
- Proposed Improvements: Distilling the model into smaller versions for deployment; extending to 3D/video domains; incorporating active learning to tackle the long-tail problem.
Related Work & Insights
- vs CLIP: CLIP aligns at the image level; ObjEmbed aligns at the object level.
- vs OWL-ViT: OWL-ViT targets single-task open-vocabulary detection; ObjEmbed optimizes multiple tasks jointly.
- vs SAM: SAM offers strong segmentation with weak semantics; ObjEmbed pairs strong segmentation with strong semantics.
- vs Florence / Florence-2: Florence uses a universal backbone with task specialization; ObjEmbed uses a single embedding for all tasks.
- Insight: Object-level representation is the "atomic unit" of multimodal vision, and unified embeddings represent the future direction of the field.
Rating
- Novelty: ⭐⭐⭐⭐ Universal object embedding concepts have been explored, but this work contributes a systematic training framework and multi-task optimization.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 11 tasks, multiple baselines, scaling analysis, and embedding quality metrics.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, systematic methodology, and strong conclusions.
- Value: ⭐⭐⭐⭐⭐ Unified object embeddings reduce deployment costs and have a significant impact on open-vocabulary visual understanding.