GLASS: Guided Latent Slot Diffusion for Object-Centric Learning

Conference: CVPR 2025
arXiv: 2407.17929
Code: https://visinf.github.io/glass/
Area: Diffusion Models / Multimodal VLMs
Keywords: Object-Centric Learning, Slot Attention, Diffusion Models, Semantic Guidance, Instance Segmentation

TL;DR

This paper proposes GLASS, an object-centric learning method built on Slot Attention. By training in the space of images generated by a diffusion model, GLASS jointly addresses over-segmentation and under-segmentation with a semantic guidance module (pseudo-semantic masks derived from the diffusion model's cross-attention) and an instance guidance module (an MLP that reconstructs encoder features). It outperforms prior methods on object discovery, conditional generation, and compositional generation in real-world scenes; on object discovery it improves VOC mIoU_i by 9.3 points over SPOT.

Background & Motivation

Background: Object-Centric Learning (OCL) aims to decompose an image into a set of latent object representations (slots), where each slot competitively binds to an object. Slot Attention is the most popular OCL method but has long been limited to synthetic datasets (e.g., CLEVR) and performs poorly on real-world images. Recent methods (e.g., DINOSAUR, SPOT, StableLSD) use pre-trained DINOv2 encoders and diffusion model decoders to handle real-world images, but still face significant object segmentation quality issues.

Limitations of Prior Work: Slot representations in real-world scenes suffer from severe part-whole hierarchical ambiguity, where over-segmentation (splitting one object into multiple slots) and under-segmentation (merging multiple objects into a single slot) coexist. DINOSAUR reconstructs DINOv2 features using an MLP as a training signal; though the features themselves are instance-aware, the lack of semantic constraints leads to imprecise boundaries. StableLSD decodes with a diffusion model, but the slot quality remains limited. No existing method simultaneously supports object discovery, conditional generation, and compositional generation.

Key Challenge: Semantic-level guidance signals can resolve over-segmentation (ensuring different parts of the same object are covered by the same slot) but can cause slots to drift toward semantic categories rather than individual instances (resulting in under-segmentation). Conversely, instance-level feature reconstruction signals facilitate instance differentiation but lack semantic boundary constraints.

Goal: To simultaneously resolve both over-segmentation and under-segmentation within the Slot Attention framework, forcing slots to honor semantic boundaries while distinguishing between different instances, and to support object discovery, conditional image generation, and compositional scene generation all at once.

Key Insight: Training on images generated by a diffusion model (rather than real images) makes it possible to exploit the cross-attention layers of the diffusion model to extract pseudo-semantic masks as guidance signals. Combining semantic guidance (BCE loss aligning slot masks with pseudo-semantic masks) and instance guidance (MSE loss reconstructing DINOv2 features) allows them to complement each other and resolve the part-whole ambiguity.

Core Idea: Train Slot Attention on generated images, extract semantic guidance from the diffusion model's own cross-attention, and provide instance guidance via DINOv2 feature reconstruction. Together, the two signals address both over-segmentation and under-segmentation in real scenes.

Method

Overall Architecture

Input real image → BLIP-2 generates text description → Extract noun category labels to construct prompts → Generate image \(\mathbf{I}_{gen}\) from random noise using pre-trained Stable Diffusion → Extract semantic mask \(\mathbf{M}_{gen}\) from the diffusion model → Feed \(\mathbf{I}_{gen}\) into a DINOv2 encoder → Slot Attention generates slots → Diffusion decoder reconstructs image + MLP decoder reconstructs features → Align slot masks with semantic masks using Hungarian matching → Joint training with three loss terms. During inference, Slot Attention is run directly on the encoded real images.
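
The data flow above can be sketched as plain Python with every model passed in as a callable. This is only an illustration of the ordering of steps under stated assumptions: the `Components` container and the function names `training_inputs` and `infer` are hypothetical, and none of the callables corresponds to a real library API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Components:
    """Hypothetical bundle of the pre-trained and trainable parts."""
    caption: Callable[[Any], str]               # real image -> text (e.g. a BLIP-2 wrapper)
    nouns: Callable[[str], Sequence[str]]       # text -> noun class labels
    generate: Callable[[Sequence[str]], tuple]  # labels -> (I_gen, M_gen) via the diffusion model
    encode: Callable[[Any], Any]                # image -> encoder features (e.g. DINOv2)
    slot_attention: Callable[[Any], tuple]      # features -> (slots, slot masks)


def training_inputs(real_image, c: Components) -> dict:
    """Training path: the real image is only used to obtain a prompt."""
    labels = c.nouns(c.caption(real_image))
    i_gen, m_gen = c.generate(labels)           # generated image and pseudo-semantic mask
    feats = c.encode(i_gen)                     # features of the generated image
    slots, masks = c.slot_attention(feats)
    return {"I_gen": i_gen, "M_gen": m_gen, "F_inp": feats,
            "slots": slots, "masks": masks}


def infer(real_image, c: Components) -> tuple:
    """Inference path: no captioning and no image generation."""
    return c.slot_attention(c.encode(real_image))
```

The two functions make the asymmetry explicit: captioning and generation appear only in the training path, while inference touches just the encoder and Slot Attention.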

Key Designs

Semantic Guidance Module:
- Function: Leverages the cross-attention layers of the diffusion model to generate pseudo-semantic masks, guiding the slot masks to align with object boundaries and resolving over-segmentation.
- Mechanism: A single-token prompt is created individually for each class label \(c_k \in \mathcal{C}\). The diffusion model's cross-attention maps across multiple timesteps and resolutions are extracted and averaged to obtain rough attention maps. These are refined using self-attention maps (exponentiated self-attention × cross-attention), followed by argmax and range thresholding to produce the final semantic mask \(\mathbf{M}_{gen}\). Finally, Hungarian matching pairs the slot masks with semantic mask components, and a BCE loss is applied to the matched slots.
- Design Motivation: Since diffusion models naturally learn to align text tokens with image regions during training, their cross-attention maps inherently provide coarse semantic segmentation information. Applying this signal to guide the slots ensures that a single semantic object is not split into multiple slots.
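
The mask-extraction steps described above (refine cross-attention with exponentiated self-attention, take the argmax, threshold) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `pseudo_semantic_mask`, the exponent `power`, and the threshold `tau` are hypothetical, and the attention maps are assumed to be already averaged over timesteps and resolutions and flattened to N spatial positions.

```python
import numpy as np


def pseudo_semantic_mask(cross_attn, self_attn, power=2, tau=0.5):
    """Turn averaged attention maps into a label map.

    cross_attn: (K, N) array, one averaged cross-attention map per class prompt.
    self_attn:  (N, N) array, averaged self-attention map.
    Returns an (N,) integer array: 0 = background, k + 1 = class k.
    `power` and `tau` are illustrative values, not taken from the paper.
    """
    # Refine: propagate class evidence along self-attention affinities,
    # i.e. (self_attn ** power) applied to each class map.
    refined = cross_attn @ np.linalg.matrix_power(self_attn, power).T
    # Rescale each class map to [0, 1] so one threshold applies to all classes.
    lo = refined.min(axis=1, keepdims=True)
    hi = refined.max(axis=1, keepdims=True)
    refined = (refined - lo) / (hi - lo + 1e-8)
    # Argmax over classes, then send low-confidence positions to background.
    labels = refined.argmax(axis=0) + 1
    labels[refined.max(axis=0) < tau] = 0
    return labels
```

Splitting the resulting label map into one binary mask per class yields the components that are later matched against the slot masks.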

Instance Guidance Module:
- Function: Maintains the instance-awareness of slots by reconstructing DINOv2 encoder features, resolving under-segmentation (preventing slots from drifting into broad semantic classes due to semantic guidance).
- Mechanism: Each slot reconstructs its corresponding region's DINOv2 feature \(\mathbf{F}_{inp}\) via a lightweight MLP decoder, and is trained using the MSE loss \(\mathcal{L}_{instance} = \text{MSE}(\mathbf{F}_{inp}, \mathbf{F}_{recon})\). Because DINOv2 patch features already exhibit instance-aware properties (different patches of the same object have similar features), reconstructing these features naturally drives the slots to bind at the instance level.
- Design Motivation: Relying solely on semantic guidance shifts the slots toward broad semantic categories (e.g., merging two separate cats into a single "cat" slot). Incorporating feature reconstruction forces the slots to distinguish between different instances to accurately reconstruct features in each region.
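
A forward pass of such an MLP decoder can be sketched in NumPy. The sketch assumes a spatial-broadcast design in the style of DINOSAUR, where each slot is decoded at every position and the per-slot outputs are blended with softmax weights; whether GLASS uses exactly this layout is an assumption, and the names `mlp_decode` and `instance_loss` as well as all weight shapes are hypothetical.

```python
import numpy as np


def mlp_decode(slots, pos_emb, w1, b1, w2, b2):
    """Decode slots into a feature map.

    slots:   (K, D_s) slot vectors.
    pos_emb: (N, D_s) positional embeddings, one per patch.
    w1, b1, w2, b2: weights of a two-layer MLP with output size D_f + 1.
    Returns F_recon of shape (N, D_f) and blending weights of shape (K, N).
    """
    x = slots[:, None, :] + pos_emb[None, :, :]   # broadcast: (K, N, D_s)
    h = np.maximum(x @ w1 + b1, 0.0)              # ReLU hidden layer
    out = h @ w2 + b2                             # (K, N, D_f + 1)
    feats, logits = out[..., :-1], out[..., -1]
    # Softmax over the slot axis: slots compete for each patch.
    logits = logits - logits.max(axis=0, keepdims=True)
    alpha = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    recon = (alpha[..., None] * feats).sum(axis=0)  # (N, D_f)
    return recon, alpha


def instance_loss(f_inp, f_recon):
    """MSE between encoder features and their reconstruction."""
    return float(np.mean((f_inp - f_recon) ** 2))
```

Because the blending weights sum to one over slots at every patch, two instances with different features can only be reconstructed well if different slots take responsibility for them.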

Learning in the Generated Image Space:
- Function: Enables slots to be trained on generated images whose distribution is close to that of real data, achieving zero-shot generalization on real-world images.
- Mechanism: BLIP-2 is used to generate text descriptions from real images, and SD generates the corresponding images. All training is conducted on these generated images. During inference, Slot Attention is run directly on real images (without requiring the SD generation step).
- Design Motivation: Training on generated images offers two key advantages: (1) it allows leveraging the cross-attention of SD to obtain semantic masks (which are unavailable for real images), and (2) the distribution of SD-generated images is close to that of real data, allowing the model to generalize zero-shot to real images.

Loss & Training

The total loss is \(\mathcal{L} = \mathcal{L}_{MSE}(\mathbf{I}_{gen}, \mathbf{I}_{recon}) + \lambda_s \mathcal{L}_{BCE}(\mathbf{P}(\mathbf{M}_{gen}, \mathbf{A})) + \lambda_i \mathcal{L}_{MSE}(\mathbf{F}_{inp}, \mathbf{F}_{recon})\), with \(\lambda_s = 0.7\) and \(\lambda_i = 0.9\). Here \(\mathbf{A}\) denotes the slot masks and \(\mathbf{P}\) the Hungarian matching that pairs them with the components of \(\mathbf{M}_{gen}\); the third term is the instance loss \(\mathcal{L}_{instance}\) from the instance guidance module. Training consists of two phases: in Phase 1, only Slot Attention and the MLP decoder are trained (learning instance-level slot representations); in Phase 2, the diffusion decoder is trained jointly (fine-tuning Slot Attention with a low learning rate) to align the diffusion decoder with the slot embeddings.
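
The three-term objective can be sketched in NumPy as follows. The weights 0.7 and 0.9 come from the text; everything else is an assumption made for illustration. In particular, `match_masks` is a brute-force search over assignments used as a stand-in for the Hungarian algorithm (it returns the same optimum but only scales to a handful of slots), the negative-overlap matching cost is a guess, and the function names are hypothetical.

```python
from itertools import permutations

import numpy as np

LAMBDA_S, LAMBDA_I = 0.7, 0.9  # loss weights reported in the text


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between soft masks and binary targets."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))


def match_masks(slot_masks, sem_masks):
    """Brute-force stand-in for Hungarian matching.

    slot_masks: (K, N) soft masks in [0, 1]; sem_masks: (M, N) binary, M <= K.
    Returns a tuple p where p[m] is the slot assigned to semantic mask m.
    """
    n_slots, n_sem = len(slot_masks), len(sem_masks)
    cost = -(sem_masks @ slot_masks.T)  # (M, K): negative overlap (assumed cost)
    return min(permutations(range(n_slots), n_sem),
               key=lambda p: sum(cost[m, p[m]] for m in range(n_sem)))


def glass_loss(img_gen, img_recon, slot_masks, sem_masks, f_inp, f_recon):
    """Image reconstruction + weighted semantic BCE + weighted feature MSE."""
    idx = list(match_masks(slot_masks, sem_masks))
    l_img = np.mean((img_gen - img_recon) ** 2)
    l_sem = bce(slot_masks[idx], sem_masks)   # only matched slots are supervised
    l_inst = np.mean((f_inp - f_recon) ** 2)
    return float(l_img + LAMBDA_S * l_sem + LAMBDA_I * l_inst)
```

Slots left unmatched receive no semantic supervision in this sketch, so they remain free to bind to regions the pseudo-masks do not cover.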

Key Experimental Results

Main Results (Instance-level Object Discovery)

Method	VOC mIoU_i↑	VOC mBO_i↑	VOC mBO_c↑	COCO mIoU_i↑	COCO mBO_i↑
DINOSAUR-Trans.	42.0	43.2	47.8	31.6	33.3
SPOT	48.8	48.3	55.6	34.0	35.0
StableLSD	31.5	32.1	35.4	24.7	25.9
GLASS	58.1 (+9.3)	58.9 (+8.5)	62.2 (+6.6)	38.9 (+4.9)	40.6 (+5.6)

Compared to the previous SOTA (SPOT), mIoU_i improves by 9.3 points on VOC and 4.9 points on COCO.

Ablation Study

Configuration	VOC mIoU_i↑	COCO mIoU_i↑
Only Instance Guidance	~42	~31
Only Semantic Guidance	~50	~35
Semantic + Instance (GLASS)	58.1	38.9

Key Findings

Both Semantic and Instance Guidance are Indispensable: Using semantic guidance alone reduces over-segmentation but leads to under-segmentation (merging objects of the same category), while using instance guidance alone distinguishes instances but yields imprecise boundaries. Combining both improves VOC mIoU_i by roughly 8 points over semantic guidance alone and roughly 16 points over instance guidance alone.
GLASS even outperforms language-conditioned semantic segmentation methods (e.g., DiffuMask, SegCLIP) on semantic-level object discovery.
For the first time, compositional generation with slot attention is achieved in complex real-world scenes, allowing slots from different images to be swapped to compose new scenes.
GLASS maintains its advantage in zero-shot cross-dataset evaluations (Objects365, CLEVRTex).
Compared to StableLSD variants that use additional weak supervision (such as bounding box localization or known object counts), GLASS is still superior without requiring this information.
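
The slot swapping behind compositional generation amounts to a simple operation on slot sets. The sketch below is illustrative only: the function name `swap_slot` is hypothetical, and the step that would follow it, conditioning the trained diffusion decoder on the composed slots, is not shown.

```python
import numpy as np


def swap_slot(slots_a, slots_b, i, j):
    """Return a copy of slots_a whose i-th slot is replaced by slot j of slots_b.

    slots_a, slots_b: (K, D) slot sets extracted from two different images.
    The inputs are left unmodified.
    """
    composed = slots_a.copy()
    composed[i] = slots_b[j]
    return composed
```

Because every slot is a fixed-size vector, the composed set has the same shape as the original one and can be fed to the same decoder without any other change.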

Highlights & Insights

Training on Generated Images is the Core Innovation: This design serves two purposes. It gives access to the diffusion model's cross-attention, from which the semantic masks are derived, and it exploits the closeness of the generated and real image distributions to achieve zero-shot transfer. This paradigm of "training on synthetic data, deploying on real data" may transfer to other vision tasks.
The Complementary Design of Semantic + Instance Guidance is Elegant: Over-segmentation and under-segmentation form the core dilemma of OCL, and the paper addresses them with two complementary guidance signals. The BCE semantic loss ensures that slots cover whole objects, while the MSE feature reconstruction ensures that slots differentiate between instances.
Slot Attention Achieves Compositional Generation in Real Scenes for the First Time: Slots can be extracted from different images and combined to generate new scenes, which holds significant value for visual reasoning and scene editing.

Limitations & Future Work

GLASS relies on BLIP-2 captions and on CLIP embeddings; inaccurate captions can degrade the quality of the semantic masks.
The quality of the pseudo-semantic masks is limited by the precision of the diffusion model's cross-attention, which may not be fine-grained enough for small objects and partially occluded scenes.
Training requires running the entire SD generation pipeline to obtain the pseudo-masks, leading to significant computational overhead.
Although compositional generation is achieved in real-world scenes for the first time, there is still room for improvement regarding fidelity and consistency.
Fine-tuning the diffusion decoder in Phase 2 requires careful hyperparameter tuning; an excessively high learning rate can destroy the slot representations learned in Phase 1.

vs DINOSAUR: DINOSAUR only uses an MLP to reconstruct DINOv2 features, lacking semantic guidance and leading to imprecise boundaries. GLASS adds the semantic guidance module, significantly improving the quality of object discovery.
vs SPOT: SPOT uses a Transformer encoder-decoder but lacks semantic guidance. GLASS outperforms it by 9.3 points in VOC mIoU_i and does not confuse different instances of the same category.
vs StableLSD: StableLSD also employs a diffusion decoder but lacks guidance modules. GLASS outperforms it by a large margin. Even when StableLSD is equipped with additional weak supervision (bboxes), GLASS remains superior.
vs Semantic Segmentation Methods (e.g., DiffuMask): These methods perform only semantic segmentation and cannot distinguish instances or generate images. GLASS outperforms these specialized methods even on semantic discovery.

Rating

Novelty: ⭐⭐⭐⭐⭐ The design of training in the generative space plus dual semantic/instance guidance is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across multiple tasks (discovery, generation, composition, attribute prediction), multiple datasets, and zero-shot cross-dataset transfer.
Writing Quality: ⭐⭐⭐⭐ Clearly structured with well-explained motivations.
Value: ⭐⭐⭐⭐⭐ Strongly advances the field of object-centric learning, achieving compositional generation in real-world scenes for the first time.