
OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics

Conference: ICCV 2025 · arXiv: 2404.18423 · Code: None · Area: Video Generation · Keywords: Object-centric learning, video prediction, kinematic modeling, Slot Attention, autoregressive Transformer

TL;DR

This paper proposes OCK (Object-Centric Kinematics), which augments object-centric video prediction by introducing explicit kinematic attributes (position, velocity, acceleration) as complements to slot representations. Two Transformer variants — Joint-OCK and Cross-OCK — are designed to fuse appearance and motion information, achieving significant improvements in dynamic video prediction quality across complex synthetic and real-world scenarios.

Background & Motivation

Human perception decomposes complex multi-object scenes into time-invariant appearance (size, shape, color) and time-varying motion (position, velocity, acceleration). Existing object-centric Transformer-based video prediction methods (e.g., SlotFormer, OCVP) primarily rely on appearance representations extracted by Slot Attention, and suffer from the following limitations:

Lack of explicit motion dynamics: Motion changes are learned only implicitly, making it difficult to accurately model dynamic interactions such as collisions and acceleration/deceleration.

Poor performance in complex scenes: Prediction quality degrades or even diverges in scenes with diverse object appearances, motion patterns, and backgrounds (e.g., MOVi-C/D/E).

Poor long-term generalization: The absence of explicit kinematic priors leads to rapid error accumulation.

Method

Overall Architecture

OCK consists of three main modules:

  1. Slot Encoder: a pretrained SAVi model that decomposes video frames into object slots \(\mathcal{S}_t \in \mathbb{R}^{N \times D_{\text{slot}}}\).
  2. Kinematics Encoder: extracts object kinematics \(\mathbf{K}_t \in \mathbb{R}^{N \times D_{\text{kin}}}\) from the video frames.
  3. Autoregressive OCK Transformer: fuses slot and kinematic information to predict the slots of the next timestep.
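
A minimal PyTorch-style sketch of how these three modules could be wired together at inference time, under assumed interfaces (`savi_encoder`, `kin_encoder`, and `ock_transformer` are hypothetical stand-ins; the official code is not released):

```python
# Schematic OCK rollout: encode context frames into slots and kinematics,
# then autoregressively predict future slots one timestep at a time.
import torch

def rollout(frames, savi_encoder, kin_encoder, ock_transformer, num_future: int):
    """frames: (B, T_in, C, H, W) context frames -> (B, num_future, N, D_slot) predicted slots."""
    slots = savi_encoder(frames)      # (B, T_in, N, D_slot), pretrained SAVi
    kin = kin_encoder(frames)         # (B, T_in, N, D_kin), 2D kinematic states
    preds = []
    for _ in range(num_future):
        next_slots = ock_transformer(slots, kin)                   # (B, N, D_slot) for t+1
        preds.append(next_slots)
        # Feed the model's own prediction back in (no teacher forcing);
        # how kinematics are updated for predicted steps is kept schematic here.
        slots = torch.cat([slots, next_slots.unsqueeze(1)], dim=1)
    return torch.stack(preds, dim=1)
```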

Key Designs

  1. Object Kinematics: A CNN extracts low-level image features to localize each object's 2D centroid, from which a three-component kinematic state (position, velocity, acceleration) is constructed (see the sketch after this list): \[\mathbf{K}_t = \begin{bmatrix} \mathbf{x}_t^{\text{pos}} \\ \mathbf{x}_t^{\text{vel}} \\ \mathbf{x}_t^{\text{acc}} \end{bmatrix} = \begin{bmatrix} \phi(\mathbf{o}_t) \\ \lambda(\mathbf{x}_t^{\text{pos}} - \mathbf{x}_{t-1}^{\text{pos}}) \\ \mathbf{x}_t^{\text{vel}} - \mathbf{x}_{t-1}^{\text{vel}} \end{bmatrix}\] where \(\lambda\) is a learnable scaling parameter. Kinematics are modeled in 2D image space, avoiding the computational overhead of 3D depth estimation, and require no task-specific loss functions.

  2. Two Kinematic Integration Strategies:

    • Analytical: Predicts the next-frame position from current kinematics as \(\mathbf{x}_{t+1}^{\text{pos}'} = \mathbf{x}_t^{\text{pos}} + \mathbf{x}_t^{\text{vel}} \times \delta\), then feeds both current and predicted kinematics into the Transformer.
    • Empirical: Uses only current-frame kinematics, allowing the Transformer to implicitly learn motion patterns.
  3. Two OCK Transformer Architectures:

    • Joint-OCK: Concatenates slots and kinematics and feeds them jointly into a standard Transformer encoder for self-attention.
    • Cross-OCK: Employs a cross-attention mechanism where slots serve as queries and kinematics as keys/values, with a temperature parameter \(\tau\) for attention calibration: \(\text{Cross-OCK}(\mathbf{v}, \mathbf{k}, \mathbf{q}; \tau) = \mathbf{v} \cdot \text{softmax}(\frac{\mathbf{k}^\top \mathbf{q}}{\tau})\)
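
A minimal sketch of the kinematic state construction, the analytical next-position extrapolation, and the Cross-OCK attention step, assuming PyTorch and illustrative tensor shapes (B batch, T time, N objects); names such as `centroids`, `lambda_scale`, and `tau` are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def kinematic_state(centroids: torch.Tensor, lambda_scale: torch.Tensor) -> torch.Tensor:
    """Stack [position; velocity; acceleration] per object from 2D centroids.

    centroids: (B, T, N, 2) object centre coordinates in image space.
    Returns:   (B, T, N, 6) kinematic attributes K_t.
    """
    pos = centroids
    vel = lambda_scale * (pos[:, 1:] - pos[:, :-1])   # finite difference over time
    vel = F.pad(vel, (0, 0, 0, 0, 1, 0))              # zero-pad the first timestep
    acc = vel[:, 1:] - vel[:, :-1]
    acc = F.pad(acc, (0, 0, 0, 0, 1, 0))
    return torch.cat([pos, vel, acc], dim=-1)

def analytical_next_position(pos, vel, delta: float = 1.0):
    """Analytical strategy: extrapolate x_{t+1} = x_t + v_t * delta."""
    return pos + vel * delta

def cross_ock_attention(slots, kin, tau: float = 1.0):
    """Cross-OCK core: slots act as queries, kinematics as keys and values."""
    q, k, v = slots, kin, kin                                    # (B, N, D) each, after projection
    attn = torch.softmax(q @ k.transpose(-2, -1) / tau, dim=-1)  # temperature-scaled attention
    return attn @ v
```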

Loss & Training

Training proceeds in two stages: SAVi is first trained to decompose video frames into slots, followed by training of the OCK Transformer.

The total loss is \(\mathcal{L} = \mathcal{L}_{\text{object}} + \alpha \mathcal{L}_{\text{image}}\):

  • Object reconstruction loss: L2 distance between predicted slots and ground-truth slots.
  • Image reconstruction loss: L2 distance between ground-truth frames and the frames decoded from predicted slots via the frozen SAVi decoder.
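
A hedged sketch of this two-term objective, with `savi_decoder` standing in for the frozen SAVi decoder and `alpha` weighting the image term (names and shapes are assumptions):

```python
import torch.nn.functional as F

def ock_loss(pred_slots, gt_slots, gt_frames, savi_decoder, alpha: float = 1.0):
    """L = L_object + alpha * L_image, both plain L2 terms."""
    loss_object = F.mse_loss(pred_slots, gt_slots)    # slot-space reconstruction
    pred_frames = savi_decoder(pred_slots)            # frozen decoder: params not updated,
                                                      # but gradients still flow to the slots
    loss_image = F.mse_loss(pred_frames, gt_frames)   # pixel-space reconstruction
    return loss_object + alpha * loss_image
```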

The model is trained with 6 input frames to predict 8 future frames, using temporal positional encodings that preserve permutation equivariance across objects.
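
To illustrate the permutation-equivariance point, here is a small sketch of a temporal-only sinusoidal encoding: one code per timestep, broadcast identically to all slots of that frame, so reordering the slots leaves the encoding unchanged (shapes are assumptions, not the exact implementation):

```python
import math
import torch

def temporal_positional_encoding(T: int, D: int) -> torch.Tensor:
    """Sinusoidal encoding over time only, shape (T, D); D assumed even."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, D, 2, dtype=torch.float32) * (-math.log(10000.0) / D))
    pe = torch.zeros(T, D)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# slots: (B, T, N, D) -- the same per-timestep code is added to every slot,
# so permuting the N slots does not change their positional information.
# slots = slots + temporal_positional_encoding(T, D)[None, :, None, :]
```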

Key Experimental Results

Main Results (Tables)

Video prediction quality on the synthetic datasets, ordered from simple to complex:

| Model | OBJ3D PSNR↑ | MOVi-A PSNR↑ | MOVi-C PSNR↑ | MOVi-D PSNR↑ | MOVi-E PSNR↑ |
|---|---|---|---|---|---|
| SlotFormer | 33.08 | 25.18 | 19.48 | 20.68 | 21.27 |
| OCVP-Seq | 33.10 | 26.24 | 17.95 | Diverges | Diverges |
| Joint-OCK | 35.13 | 27.26 | 21.04 | 22.09 | 22.39 |
| Cross-OCK | 34.10 | 27.58 | 21.04 | 22.34 | 22.34 |

Real-world Waymo Open Dataset:

| Model | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| SlotFormer | 19.13 | 0.330 | 0.714 |
| OCVP-Seq | 18.98 | 0.329 | 0.718 |
| Joint-OCK | 25.02 | 0.798 | 0.251 |
| Cross-OCK | 25.98 | 0.728 | 0.220 |

Ablation Study (Tables)

Ablation of Transformer components (MOVi-A):

| Setting | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Cross-OCK(A), default | 27.58 | 0.812 | 0.123 |
| Input frames = 4 | 27.01 | 0.801 | 0.125 |
| Input frames = 8 | 27.12 | 0.806 | 0.125 |
| Transformer layers = 6 | 26.92 | 0.796 | 0.130 |
| Transformer layers = 8 | 26.50 | 0.784 | 0.133 |
| Standard positional encoding | 23.60 | 0.591 | 0.205 |
| Teacher forcing | 23.58 | 0.589 | 0.207 |

Key Findings

  • Kinematics are critical for complex scenes: OCVP completely diverges on MOVi-D/E, whereas OCK maintains stable predictions.
  • On the Waymo real-world dataset, OCK improves PSNR by ~6.9 dB and reduces LPIPS by ~0.49 compared to SlotFormer.
  • Temporal positional encodings (preserving permutation equivariance) are crucial; replacing them with standard positional encodings reduces PSNR by 4 dB.
  • Teacher forcing is harmful: training the model to roll out and handle its own imperfect predictions, rather than always conditioning on ground truth, is more beneficial for long-term generalization.
  • The analytical strategy marginally outperforms the empirical strategy, as explicitly predicting the next-frame kinematic state provides more accurate guidance.
  • Six input frames suffice to capture object dynamics; increasing to 8 frames yields a slight performance drop.

Highlights & Insights

  • Physics-inspired design: Incorporating classical kinematics (position–velocity–acceleration) into object-centric learning is an intuitive and effective innovation.
  • Elegant Cross-OCK design: Using slots as queries and kinematics as keys/values achieves a favorable balance between computational efficiency and performance.
  • Long-term generalization: Trained on only 6 frames, the model generalizes to 18-frame prediction with slow error accumulation.
  • Strong performance is demonstrated on real-world autonomous driving data (Waymo).

Limitations & Future Work

  • Kinematics are modeled only in 2D image space, offering limited handling of 3D occlusions and depth variations.
  • The method depends on the quality of SAVi-pretrained slot encoders; slot decomposition may be imperfect in complex scenes.
  • Rotational and scale kinematics are not considered, limiting the modeling of complex motion types.
  • Handling of object appearance and disappearance is not discussed.
  • Incorporating object interaction graphs (GNNs) to explicitly model collision events is a promising future direction.
  • SlotFormer is the most direct baseline; OCK extends it by incorporating a kinematic dimension.
  • OCVP explores the separation of temporal and relational attention but diverges in complex scenes.
  • G-SWM models interactions via graph neural networks but underperforms Transformer-based methods.
  • The proposed approach has potential implications for physics simulator learning and robotic visual understanding.

Rating

  • Novelty: ⭐⭐⭐⭐ Introducing classical kinematics into object-centric learning is a natural yet effective contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 7 datasets including real-world scenarios, with comprehensive ablation analysis.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured; the systematic comparison between two strategies (analytical/empirical) and two architectures (Joint/Cross) is clearly presented.
  • Value: ⭐⭐⭐⭐ Addresses a key bottleneck in object-centric video prediction under complex scenes.