CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Conference: CVPR 2025
arXiv: 2503.21747
Code: https://ctrl-o-paper.github.io
Area: Image Generation
Keywords: Object-Centric Representation, Slot Attention, Language Control, Contrastive Loss, Object Discovery

TL;DR

CTRL-O brings language controllability to object-centric representation learning. It initializes slot queries from language embeddings, conditions the decoder on language, and adds a control contrastive loss, which together bind slots to language-described objects without mask supervision. On COCO it reaches an FG-ARI of 47.5 (+7.0 over Dinosaur), and it supports zero-shot referring expression segmentation, instance-level image generation, and VQA.

Background & Motivation

Background: Object-centric representation learning (e.g., Slot Attention, Dinosaur) decomposes scenes into independent object representations (slots), but the slots are uncontrollable: there is no way to specify which slot corresponds to which object.
Limitations of Prior Work: (1) Slot assignment is arbitrary, making it impossible for users to specify objects of interest using language; (2) object discovery accuracy is limited in complex real-world scenes; (3) learned representations are difficult to apply directly to downstream tasks.
Key Challenge: Object-centric learning must "discover" objects without supervision, whereas "controlling" which object a slot binds to requires semantic understanding. The open question is how to introduce language control without requiring mask annotations.
Goal: Control the binding of slots to objects using language descriptions without mask supervision.
Key Insight: Initialize slot queries with pretrained LLM embeddings, so slots naturally tend to bind to objects of corresponding semantics.
Core Idea: Query initialization (LLM embedding + positional information) + decoder language conditioning + control contrastive loss.

Method

Overall Architecture

Input image \(\rightarrow\) extract features with frozen DINOv2 \(\rightarrow\) encode language descriptions via LLaMA-3-8B (LLM2Vec) + centroid coordinates \(\rightarrow\) initialize slot queries \(\rightarrow\) iterative assignment via Slot Attention \(\rightarrow\) reconstruct with decoder conditioned on slot + control query \(\rightarrow\) control contrastive loss to constrain slot-language alignment.
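The "iterative assignment via Slot Attention" step in this pipeline can be illustrated with a simplified NumPy sketch. It is not the authors' implementation: the GRU update, LayerNorm, and residual MLP of the original Slot Attention are omitted, random arrays stand in for the language-initialized queries and the frozen DINOv2 patch features, and the projection matrices `wq`, `wk`, `wv` and all sizes are hypothetical.

```python
import numpy as np

def slot_attention_step(slots, feats, wq, wk, wv, eps=1e-8):
    """One simplified Slot Attention iteration (no GRU, LayerNorm, or MLP).

    slots: (K, D) current slot queries
    feats: (N, D) patch features from the frozen backbone
    """
    q = slots @ wq                                   # (K, D)
    k = feats @ wk                                   # (N, D)
    v = feats @ wv                                   # (N, D)
    logits = (q @ k.T) / np.sqrt(q.shape[-1])        # (K, N)
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    # Softmax over the slot axis: slots compete for each patch.
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Normalize over patches so each slot receives a weighted mean of values.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ v, attn

rng = np.random.default_rng(0)
K, N, D = 4, 16, 8                       # toy sizes: slots, patches, feature dim
slots = rng.normal(size=(K, D))          # stand-in for language-initialized queries
feats = rng.normal(size=(N, D))          # stand-in for DINOv2 patch features
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

for _ in range(3):                       # a few refinement iterations
    slots, attn = slot_attention_step(slots, feats, wq, wk, wv)
```

Because the softmax runs over slots rather than patches, every patch distributes one unit of attention across the slots, which is what makes the slots compete for image regions.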

Key Designs

Language Query Initialization
- Function: Encourages slots to bind to language-described objects from the start.
- Mechanism: Concatenates LLaMA-3-8B (LLM2Vec) language embeddings with centroid coordinates to serve as initial slot queries. Dynamic class-to-prompt mapping: K-means clusters \(C\) classes into \(M\) prompts (updated per epoch).
- Design Motivation: The object to which randomly initialized slots bind is uncontrollable; language initialization provides semantic "anchors".
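A minimal sketch of the query construction described above, under stated assumptions: random vectors stand in for the LLM2Vec embeddings, centroids are normalized to [0, 1], and the linear projection (`w_proj`, `b_proj`) that maps the concatenation to the slot dimension is hypothetical, as are all sizes.

```python
import numpy as np

def init_slot_queries(lang_emb, centroids, w_proj, b_proj):
    """Build initial slot queries from language embeddings and centroids.

    lang_emb:  (K, D_lang) one embedding per controlled slot
    centroids: (K, 2) normalized (x, y) object centers in [0, 1]
    w_proj:    (D_lang + 2, D_slot) hypothetical projection to the slot dimension
    b_proj:    (D_slot,) hypothetical bias
    """
    control = np.concatenate([lang_emb, centroids], axis=-1)  # (K, D_lang + 2)
    return control @ w_proj + b_proj                          # (K, D_slot)

rng = np.random.default_rng(0)
d_lang, d_slot = 32, 64                       # toy sizes, not the real embedding width
phrase = rng.normal(size=(1, d_lang))         # stand-in embedding for one phrase
lang_emb = np.concatenate([phrase, phrase])   # same phrase used for two instances
centroids = np.array([[0.2, 0.5], [0.8, 0.5]])  # left instance vs right instance
w_proj = rng.normal(size=(d_lang + 2, d_slot)) * 0.02
b_proj = np.zeros(d_slot)

queries = init_slot_queries(lang_emb, centroids, w_proj, b_proj)
```

In the usage example both slots receive the same phrase embedding, so the centroid coordinates are the only thing that separates the two queries.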
Control Contrastive Loss
- Function: Enforces alignment between slot representations and their corresponding language embeddings.
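- Mechanism: \(\mathcal{L}_{CC}^l = -\sum_i \log\frac{\exp(z_i^{emb} \cdot l_i / \tau)}{\sum_t \exp(z_i^{emb} \cdot l_t / \tau)}\), where \(z_i = \sum_k a_{ik} h_k\) is the DINO feature vector aggregated with the attention weights \(a_{ik}\) of slot \(i\), \(z_i^{emb}\) is the embedding of \(z_i\) that is compared against the language embeddings, \(l_i\) is the language embedding paired with slot \(i\), \(l_t\) ranges over the language embeddings being contrasted, and \(\tau=0.1\) is the temperature.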
- Design Motivation: Initialization alone cannot guarantee that slots maintain correct binding after attention iterations.
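The loss above can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the linear map `w_emb` that produces \(z_i^{emb}\) from \(z_i\) is hypothetical, the negatives are simply the other language embeddings passed in, and no normalization is applied beyond what the formula states.

```python
import numpy as np

def control_contrastive_loss(attn, feats, lang, w_emb, tau=0.1):
    """Control contrastive loss summed over slots.

    attn:  (K, N) slot attention weights a_ik
    feats: (N, D) patch features h_k
    lang:  (K, E) language embeddings l_i, row i paired with slot i
    w_emb: (D, E) hypothetical projection giving z_i^emb
    """
    z = attn @ feats                          # z_i = sum_k a_ik h_k, shape (K, D)
    z_emb = z @ w_emb                         # (K, E)
    logits = (z_emb @ lang.T) / tau           # entry (i, t) = z_i^emb . l_t / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.trace(log_prob))         # -sum_i log softmax(logits)[i, i]

# Toy check: slots that match their own language embedding vs a shuffled pairing.
eye = np.eye(3)
aligned = control_contrastive_loss(eye, 10 * eye, eye, eye)
shuffled = control_contrastive_loss(eye, 10 * eye, eye[[1, 2, 0]], eye)
```

In the toy check, `aligned` is close to zero because each slot embedding points at its own language vector, while `shuffled` is much larger because every slot is paired with the wrong description.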
Decoder Language Conditioning
- Function: Injects language information during the reconstruction phase to enhance object-language association.
- Mechanism: Concatenates slots with control queries and feeds them into an MLP decoder.
- Design Motivation: Conditioning the decoder on language helps it learn more semantically meaningful reconstructions.
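One way to picture the conditioned decoder is the sketch below. It assumes a Dinosaur-style MLP decoder (each slot is broadcast to all patch positions, a positional embedding is added, and an MLP predicts patch features plus one alpha logit per slot); the source only states that slots and control queries are concatenated before the MLP, so the layer sizes, the ReLU, and every parameter name here are hypothetical.

```python
import numpy as np

def conditioned_mlp_decode(slots, control, pos_emb, w1, b1, w2, b2):
    """Decode patch features from slots concatenated with control queries.

    slots:   (K, Ds) slot vectors
    control: (K, Dc) control queries (language + centroid information)
    pos_emb: (N, Ds + Dc) positional embedding per patch
    Returns the reconstruction (N, D) and per-slot masks (K, N).
    """
    cond = np.concatenate([slots, control], axis=-1)     # (K, Ds + Dc)
    tokens = cond[:, None, :] + pos_emb[None, :, :]      # (K, N, Ds + Dc)
    hidden = np.maximum(tokens @ w1 + b1, 0.0)           # ReLU, (K, N, H)
    out = hidden @ w2 + b2                               # (K, N, D + 1)
    feats, alpha = out[..., :-1], out[..., -1]
    alpha = alpha - alpha.max(axis=0, keepdims=True)
    masks = np.exp(alpha)
    masks = masks / masks.sum(axis=0, keepdims=True)     # softmax over slots
    recon = (masks[..., None] * feats).sum(axis=0)       # (N, D)
    return recon, masks

rng = np.random.default_rng(0)
K, N, Ds, Dc, H, D = 4, 16, 8, 6, 32, 10                 # toy sizes
slots = rng.normal(size=(K, Ds))
control = rng.normal(size=(K, Dc))
pos_emb = rng.normal(size=(N, Ds + Dc)) * 0.1
w1, b1 = rng.normal(size=(Ds + Dc, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=(H, D + 1)) * 0.1, np.zeros(D + 1)

recon, masks = conditioned_mlp_decode(slots, control, pos_emb, w1, b1, w2, b2)
```

The per-slot masks sum to one at every patch, so the reconstruction is a convex combination of what each language-conditioned slot predicts for that patch.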

Loss & Training

The objective combines a reconstruction loss with the control contrastive loss. The DINOv2 ViT backbone stays frozen. Training runs on COCO+VG for 300K steps with a batch size of 128.

Key Experimental Results

Main Results

Method	FG-ARI↑	mBO↑	Binding Hits↑
Dinosaur	40.5	27.7	-
CTRL-O	47.5	27.2	61.3%

Task	CTRL-O	Best Baseline
RefCOCO mIoU (Zero-shot)	28.2	Shatter&Gather 21.8
Image Generation FID↓ (COCO)	25.20	Stable LSD 26.20
VQAv2 Accuracy	60.25%	CLIP 58.64%

Ablation Study

Configuration	Binding Hits	Note
w/o Language Initialization	~40%	Loss of semantic anchors
w/o Contrastive Loss	~48%	Alignment not persistent
w/ GT masks (Upper Bound)	71.2%	Supervised ceiling
Full CTRL-O	61.3%	No mask supervision; 9.9 points below the upper bound

Key Findings

The +7.0 FG-ARI gain primarily stems from language guidance enabling slots to segment boundaries more accurately.
The zero-shot referring expression segmentation score of 28.2 mIoU on RefCOCO exceeds the best listed baseline (Shatter&Gather, 21.8) by 6.4 points, a relative gain of about 29%.
Even with GT mask supervision, Binding Hits reach only 71.2%, indicating that object binding itself is a challenging problem.
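The margins behind these findings can be recomputed directly from the numbers reported in the tables above. The short script below is only arithmetic on those values; the variable names are illustrative.

```python
# Values copied from the result tables in this summary.
fg_ari = {"Dinosaur": 40.5, "CTRL-O": 47.5}
refcoco_miou = {"Shatter&Gather": 21.8, "CTRL-O": 28.2}
binding_hits = {"CTRL-O": 61.3, "GT masks": 71.2}

# Absolute FG-ARI gain over Dinosaur.
fg_ari_gain = round(fg_ari["CTRL-O"] - fg_ari["Dinosaur"], 1)

# Absolute and relative mIoU gain over the best listed baseline.
miou_gain = round(refcoco_miou["CTRL-O"] - refcoco_miou["Shatter&Gather"], 1)
miou_rel_gain_pct = round(
    100 * (refcoco_miou["CTRL-O"] - refcoco_miou["Shatter&Gather"])
    / refcoco_miou["Shatter&Gather"],
    1,
)

# Gap between CTRL-O and the mask-supervised upper bound on Binding Hits.
binding_gap = round(binding_hits["GT masks"] - binding_hits["CTRL-O"], 1)

print(fg_ari_gain, miou_gain, miou_rel_gain_pct, binding_gap)
# prints: 7.0 6.4 29.4 9.9
```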

Highlights & Insights

Controllability is the key missing piece in object-centric representations: CTRL-O fills this gap.
61.3% Binding Hits without mask supervision: Language-object alignment can be learned without segmentation annotations.
Unified framework supporting multiple tasks: Object discovery, referring segmentation, image generation, and VQA.

Limitations & Future Work

Multiple instances of the same category require additional spatial disambiguation (centroid coordinates).
The mBO of the MLP decoder (27.2) is slightly lower than that of Dinosaur (27.7).
The VQA accuracy of 60.25% is still far below large language model-based solutions (>80%).
Diffusion generation exhibits failure cases of object distortion or repetition.

vs Dinosaur: Dinosaur lacks language control and reaches 40.5 FG-ARI; CTRL-O raises this to 47.5 via language anchors.
vs CLIP: CLIP performs global image-level contrastive learning without object-level decomposition, whereas CTRL-O aligns language at the object level.

Rating

Novelty: ⭐⭐⭐⭐ Introducing language control to object-centric learning is a natural yet significant extension.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-angle validation spanning object discovery, segmentation, generation, and VQA.
Writing Quality: ⭐⭐⭐⭐ Clear and well-written.
Value: ⭐⭐⭐⭐ Controllable object representations that unify multiple tasks hold long-term research value.