From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
Conference: ICLR 2026 · arXiv: 2602.03390 · Code: https://github.com/hynnsk/SRL
Area: Video Understanding / Self-Supervised Learning / Object Discovery
Keywords: Object-centric learning, slot attention, contrastive learning, encoder-decoder alignment, unsupervised segmentation
TL;DR
This paper identifies a vicious cycle between the encoder (producing sharp but noisy attention maps) and the decoder (producing spatially consistent but blurry reconstruction masks) in slot-based object-centric learning. It proposes a synergistic contrastive learning objective paired with a slot regularization warm-up strategy to convert this vicious cycle into a virtuous one, achieving substantial improvements in object discovery performance on MOVi and YouTube-VIS.
Background & Motivation
Background: Object-Centric Learning (OCL) aims to decompose videos into per-object representations (slots) without supervision. Mainstream methods adopt a slot attention + reconstruction paradigm, and recent works leverage DINOv2 features to improve object segmentation quality.
Limitations of Prior Work: A vicious cycle exists between the encoder and decoder: (a) the encoder (DINOv2) produces attention maps that are sharp but contaminated by high-frequency noise, presenting the decoder with an ill-posed reconstruction task and yielding blurry reconstruction masks; (b) the MSE reconstruction loss acts as a low-pass filter, so the gradients fed back to the encoder lack high-frequency information and fail to suppress the noise.
Key Challenge: The encoder's noise problem and the decoder's blurriness problem mutually reinforce each other: neither component can learn from the other without being misled by its deficiencies.
Goal: To break the vicious cycle between the encoder and decoder so that each component reinforces rather than degrades the other.
Key Insight: Exploit the complementary strengths of the encoder and decoder — the encoder's attention maps have sharp boundaries despite being noisy, while the decoder's reconstruction masks are spatially consistent despite being blurry — and design a cross-directional contrastive learning objective that allows each to compensate for the other's weaknesses.
Core Idea: Use the decoder's spatially consistent masks to "denoise" the encoder's attention, while using the encoder's sharp attention to "deblur" the decoder's reconstruction, forming a virtuous cycle.
Method
Overall Architecture
Three components are added on top of the standard slot attention + reconstruction baseline (SlotContrast): (1) Slot Regularization Warm-up (Stage 1), (2) Stable Transition (Stage 2), and (3) Bidirectional Contrastive Learning (Stage 3). The three-stage training schedule ensures that slots first specialize, then stabilize, and finally are refined.
Key Designs
- Slot Regularization Warm-up (Stage 1, first 10% of training):
  - Function: Prevent multiple slots from collapsing onto the same object.
  - Mechanism: Iteratively identify the slot pair \((i, j)\) with the highest cosine similarity, select the less-specialized slot via its KL divergence to the uniform distribution, and apply a regularization penalty to its attention. This is repeated \(M = \lfloor S/2 \rfloor\) times, where \(S\) is the number of slots; a minimal sketch follows this item.
  - Design Motivation: Slot collapse is a common failure mode in OCL. If two slots represent the same object, subsequent contrastive learning cannot produce meaningful positive/negative sample partitions.
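A minimal single-sample sketch of the warm-up loop, for concreteness. The shapes (slots \([S, D]\), row-normalized attention \([S, N]\)), the function name, and the overlap-style penalty are illustrative assumptions; the paper specifies the selection rule, not this exact penalty:

```python
import math

import torch
import torch.nn.functional as F

def slot_regularization_loss(slots: torch.Tensor, attn: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """slots: [S, D] slot vectors; attn: [S, N] per-slot attention over N
    patches, assumed row-normalized (each row sums to 1)."""
    S, N = attn.shape
    loss = attn.new_zeros(())
    with torch.no_grad():  # pair selection is discrete, so no gradient needed
        # KL(attn_s || uniform) = sum_n attn*log(attn) + log N; a small value
        # means near-uniform attention, i.e. a less specialized slot.
        kl_to_uniform = (attn * attn.clamp_min(eps).log()).sum(-1) + math.log(N)
        sim = F.cosine_similarity(slots.unsqueeze(1), slots.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(-2.0)                   # exclude self-pairs
        active = torch.ones(S, dtype=torch.bool, device=slots.device)
        pairs = []
        for _ in range(S // 2):                    # M = floor(S/2) iterations
            ok = active.unsqueeze(0) & active.unsqueeze(1)
            idx = sim.masked_fill(~ok, -2.0).argmax()
            i, j = (idx // S).item(), (idx % S).item()
            # The slot closer to uniform is the "less specialized" one.
            victim, keeper = (i, j) if kl_to_uniform[i] < kl_to_uniform[j] else (j, i)
            pairs.append((victim, keeper))
            active[i] = active[j] = False          # retire the pair
    for victim, keeper in pairs:
        # Illustrative penalty: discourage the less specialized slot from
        # attending where its more specialized twin already attends.
        loss = loss + (attn[victim] * attn[keeper].detach()).sum()
    return loss
```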
- Deblurring Contrastive Learning (Deblurring CL, activated in Stage 3):
  - Function: Use the encoder's sharp attention maps as pseudo-labels to guide the decoder toward producing sharper reconstruction masks.
  - Mechanism: A three-level hierarchical contrastive objective is constructed: (a) positives: each decoder patch paired with its corresponding encoder patch; (b) semi-positives: the set of patches assigned to the same slot by the encoder's attention; (c) negatives: all remaining patches. A ranked contrastive loss encourages decoder features to cluster within the same slot and separate across slots; one plausible instantiation is sketched after this item.
  - Design Motivation: Standard MSE loss optimizes reconstruction fidelity but cannot improve spatial resolution; the hierarchical contrastive loss directly optimizes the discriminability of the masks.
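One plausible instantiation of the ranked objective, assuming \([N, D]\) patch features from both branches and hard (argmax) slot assignments from the encoder's attention as pseudo-labels; the two-term decomposition below (the exact pair beats everything, then same-slot beats different-slot) is an illustrative reading, not the authors' code:

```python
import torch
import torch.nn.functional as F

def deblurring_cl(dec_feats: torch.Tensor, enc_feats: torch.Tensor,
                  enc_assign: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """dec_feats/enc_feats: [N, D] patch features; enc_assign: [N] argmax slot
    index from the encoder's attention (the pseudo-labels)."""
    N = dec_feats.shape[0]
    dec = F.normalize(dec_feats, dim=-1)
    enc = F.normalize(enc_feats, dim=-1).detach()   # pseudo-labels: no gradient
    logits = dec @ enc.t() / tau                    # [N, N] cross similarities
    eye = torch.eye(N, dtype=torch.bool, device=logits.device)

    # Level (a): each decoder patch should match its own encoder patch.
    l_pos = F.cross_entropy(logits, torch.arange(N, device=logits.device))

    # Level (b): same-slot patches (semi-positives) should beat the rest;
    # the exact pair is masked out so it is handled only by level (a).
    semi_mask = (enc_assign.unsqueeze(0) == enc_assign.unsqueeze(1)) & ~eye
    log_prob = logits.masked_fill(eye, float('-inf')).log_softmax(dim=-1)
    l_semi = -(log_prob.masked_fill(~semi_mask, 0.0).sum(-1)
               / semi_mask.sum(-1).clamp_min(1)).mean()
    return l_pos + l_semi
```

The encoder side is detached because in this direction it only supplies targets; only the decoder features receive gradients.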
- Denoising Contrastive Learning (Denoising CL, activated in Stage 3):
  - Function: Use the decoder's spatially consistent masks as pseudo-labels to guide the encoder toward learning smoother, noise-free features.
  - Mechanism: Positive samples are drawn from the Top-K nearest neighbors in the DINOv2 feature space; semi-positive samples are drawn from patches assigned to the same slot by the decoder mask. The structure mirrors that of Deblurring CL but operates in the opposite direction; see the sketch after this item.
  - Design Motivation: Although the decoder masks are blurry, they are spatially consistent (noisy patches are not erroneously labeled as foreground). This spatial consistency is leveraged to guide the encoder toward a smoother feature distribution.
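A mirrored sketch under the same assumptions: Top-K neighbors are drawn from frozen DINOv2 features (the K value here is a placeholder hyperparameter) and semi-positives from the decoder's argmax masks:

```python
import torch
import torch.nn.functional as F

def denoising_cl(enc_feats: torch.Tensor, dino_feats: torch.Tensor,
                 dec_assign: torch.Tensor, k: int = 4,
                 tau: float = 0.1) -> torch.Tensor:
    """enc_feats: [N, D] trainable encoder patch features (anchors);
    dino_feats: [N, D2] frozen DINOv2 features, used only to pick neighbors;
    dec_assign: [N] argmax slot index from the decoder's masks. Assumes N > k."""
    N = enc_feats.shape[0]
    enc = F.normalize(enc_feats, dim=-1)
    logits = enc @ enc.t() / tau
    eye = torch.eye(N, dtype=torch.bool, device=logits.device)

    # Positives: Top-K nearest neighbors in the frozen DINOv2 feature space.
    dino = F.normalize(dino_feats, dim=-1)
    knn = (dino @ dino.t()).masked_fill(eye, float('-inf')).topk(k, dim=-1).indices
    pos_mask = torch.zeros_like(eye).scatter_(1, knn, True)

    # Semi-positives: same decoder slot, excluding self and the Top-K set.
    semi_mask = (dec_assign.unsqueeze(0) == dec_assign.unsqueeze(1)) \
                & ~eye & ~pos_mask

    log_prob = logits.masked_fill(eye, float('-inf')).log_softmax(dim=-1)
    l_pos = -(log_prob.masked_fill(~pos_mask, 0.0).sum(-1) / k).mean()
    l_semi = -(log_prob.masked_fill(~semi_mask, 0.0).sum(-1)
               / semi_mask.sum(-1).clamp_min(1)).mean()
    return l_pos + l_semi
```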
Loss & Training
Three-stage schedule: Stage 1 (0–10%) applies slot regularization; Stage 2 (10–20%) trains with the baseline loss only for stabilization; Stage 3 (20–100%) activates bidirectional contrastive learning. All contrastive loss weights are set to 0.1.
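The schedule can be stated as a single dispatch on training progress. In this hedged sketch the function name and signature are placeholders, while the 10%/20% boundaries and the 0.1 weight come from this section:

```python
def srl_total_loss(step: int, total_steps: int, l_base, l_slotreg,
                   l_deblur, l_denoise, w_cl: float = 0.1):
    """Combine the losses according to training progress."""
    progress = step / total_steps
    loss = l_base                        # baseline loss, always active
    if progress < 0.10:                  # Stage 1: slot regularization warm-up
        loss = loss + l_slotreg
    elif progress >= 0.20:               # Stage 3: bidirectional contrastive CL
        loss = loss + w_cl * (l_deblur + l_denoise)
    # Stage 2 (10-20% of training): baseline loss only, for stabilization.
    return loss
```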
Key Experimental Results
Main Results
| Method | MOVi-C FG-ARI | MOVi-C mBO | MOVi-E FG-ARI | YTVIS FG-ARI | YTVIS mBO |
|---|---|---|---|---|---|
| STEVE | 36.1 | 26.5 | 50.6 | 15.0 | 19.1 |
| VideoSAUR | 64.8 | 38.9 | 73.9 | 28.9 | 26.3 |
| SlotContrast | 70.4 | 31.7 | 80.9 | 36.2 | 32.9 |
| SRL (Ours) | 74.3 | 34.5 | 81.9 | 42.9 | 35.6 |
Ablation Study
| Deblurring CL | Denoising CL | Slot Reg. | FG-ARI | mBO |
|---|---|---|---|---|
| - | - | - | 70.8 | 31.4 |
| Y | - | - | 70.0 | 33.2 |
| - | Y | - | 72.2 | 31.2 |
| - | - | Y | 70.7 | 35.1 |
| Y | Y | Y | 74.2 | 33.2 |
Key Findings
- Compared to the SlotContrast baseline, FG-ARI on YouTube-VIS improves by 6.7 points (36.2 → 42.9, an 18.5% relative gain), indicating that the gains are not limited to synthetic data.
- The three components are complementary — applying Deblurring CL alone actually decreases FG-ARI (70.0 vs. 70.8), and it must be combined with Denoising CL or slot regularization.
- Slot regularization contributes most to mBO (31.4 → 35.1), suggesting that slot collapse is the primary cause of low mBO.
- On the cross-dataset transfer benchmark DAVIS 2017, the Jaccard metric improves by 11.7 points.
Highlights & Insights
- Diagnosis of the Vicious Cycle: The paper clearly identifies the causal feedback loop between encoder noise and decoder blurriness. This approach of rigorously analyzing the root cause of a problem is methodologically instructive.
- Symmetric Design: Deblurring and denoising are symmetric — each uses the other's pseudo-labels, forming a mutual teaching scheme. This "you help me denoise, I help you deblur" design paradigm is particularly elegant.
- Three-Stage Training Schedule: Stabilizing slots before introducing contrastive learning prevents incorrect gradients from arising due to slot collapse in early training.
- Transfer to Static Images: Performance improvements on COCO demonstrate that the method generalizes beyond video.
Limitations & Future Work
- The addition of two hierarchical contrastive losses may introduce non-trivial computational overhead, though the paper does not report training time comparisons.
- The three-stage schedule introduces additional hyperparameters (stage boundary ratios, regularization strength); the reported sensitivity analyses show some dependence on where the stage boundaries are placed.
- In scenes with a very large number of objects (e.g., MOVi-E with up to 23 objects), the benefit of slot regularization is limited.
- The quality of pseudo-labels directly affects contrastive learning efficacy; pseudo-labels generated in early training may be insufficiently accurate.
Related Work & Insights
- vs. SlotContrast: The direct baseline of this work; SRL extends it with bidirectional contrastive learning and slot regularization.
- vs. VideoSAUR: A pioneering work on using DINO features for slot learning; SRL further addresses the noise problem inherent in DINO features.
- vs. STEVE: A slot-based method operating in pixel space; SRL operates in feature space and substantially outperforms it.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The diagnosis of the vicious cycle and the symmetric mutual-teaching design are highly insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on MOVi-C/E, YTVIS, DAVIS, and COCO with comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ The vicious/virtuous cycle narrative is clear and persuasive.
- Value: ⭐⭐⭐⭐ Provides an effective methodological framework for unsupervised object discovery.