OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics¶
Conference: ICCV 2025 · arXiv: 2404.18423 · Code: None · Area: Video Generation · Keywords: Object-centric learning, video prediction, kinematic modeling, Slot Attention, autoregressive Transformer
TL;DR¶
This paper proposes OCK (Object-Centric Kinematics), which augments object-centric video prediction by introducing explicit kinematic attributes (position, velocity, acceleration) as complements to slot representations. Two Transformer variants — Joint-OCK and Cross-OCK — are designed to fuse appearance and motion information, achieving significant improvements in dynamic video prediction quality across complex synthetic and real-world scenarios.
Background & Motivation¶
Human perception decomposes complex multi-object scenes into time-invariant appearance (size, shape, color) and time-varying motion (position, velocity, acceleration). Existing object-centric Transformer-based video prediction methods (e.g., SlotFormer, OCVP) primarily rely on appearance representations extracted by Slot Attention, and suffer from the following limitations:
Lack of explicit motion dynamics: Motion changes are learned only implicitly, making it difficult to accurately model dynamic interactions such as collisions and acceleration/deceleration.
Poor performance in complex scenes: Prediction quality degrades or even diverges in scenes with diverse object appearances, motion patterns, and backgrounds (e.g., MOVi-C/D/E).
Poor long-term generalization: The absence of explicit kinematic priors leads to rapid error accumulation.
Method¶
Overall Architecture¶
OCK consists of three main modules:

1. Slot Encoder: A pretrained SAVi model that decomposes video frames into object slots \(\mathcal{S}_t \in \mathbb{R}^{N \times D_{\text{slot}}}\).
2. Kinematics Encoder: Extracts object kinematics \(\mathbf{K}_t \in \mathbb{R}^{N \times D_{\text{kin}}}\) from video frames.
3. Autoregressive OCK Transformer: Fuses slot and kinematic information to predict the slots at the next timestep.
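At a high level, one prediction step might look like the following sketch. This is an assumed interface, not the authors' code (none is released); `savi`, `kin_encoder`, and `ock_transformer` are placeholder module names.

```python
def ock_step(frames, savi, kin_encoder, ock_transformer):
    # 1. Slot Encoder: pretrained SAVi maps frames to object slots S_t.
    slots = savi.encode(frames)          # (T, N, D_slot), assumed shape
    # 2. Kinematics Encoder: per-object pos/vel/acc from the same frames.
    kin = kin_encoder(frames)            # (T, N, D_kin), assumed shape
    # 3. Autoregressive OCK Transformer: fuse both streams and predict
    #    the slots at the next timestep.
    next_slots = ock_transformer(slots, kin)
    return next_slots
```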
Key Designs¶
- Object Kinematics: A CNN extracts low-level image features to localize the 2D centroid coordinates of each object, constructing a three-level kinematic state:

  \[
  \mathbf{K}_t = \begin{bmatrix} \mathbf{x}_t^{\text{pos}} \\ \mathbf{x}_t^{\text{vel}} \\ \mathbf{x}_t^{\text{acc}} \end{bmatrix} = \begin{bmatrix} \phi(\mathbf{o}_t) \\ \lambda(\mathbf{x}_t^{\text{pos}} - \mathbf{x}_{t-1}^{\text{pos}}) \\ \mathbf{x}_t^{\text{vel}} - \mathbf{x}_{t-1}^{\text{vel}} \end{bmatrix}
  \]

  where \(\lambda\) is a learnable scaling parameter. Kinematics are modeled in 2D image space (avoiding the computational overhead of 3D depth estimation) and require no task-specific loss functions.
- Two Kinematic Integration Strategies:
- Analytical: Predicts the next-frame position from current kinematics as \(\mathbf{x}_{t+1}^{\text{pos}'} = \mathbf{x}_t^{\text{pos}} + \mathbf{x}_t^{\text{vel}} \times \delta\), then feeds both current and predicted kinematics into the Transformer.
- Empirical: Uses only current-frame kinematics, allowing the Transformer to implicitly learn motion patterns.
- Two OCK Transformer Architectures:
- Joint-OCK: Concatenates slots and kinematics and feeds them jointly into a standard Transformer encoder for self-attention.
- Cross-OCK: Employs a cross-attention mechanism where slots serve as queries and kinematics as keys/values, with a temperature parameter \(\tau\) for attention calibration: \(\text{Cross-OCK}(\mathbf{v}, \mathbf{k}, \mathbf{q}; \tau) = \mathbf{v} \cdot \text{softmax}(\frac{\mathbf{k}^\top \mathbf{q}}{\tau})\). A code sketch of these components follows this list.
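A minimal PyTorch sketch of the kinematic state, the analytical update, and the Cross-OCK fusion. All tensor shapes, projections, and module names are assumptions for illustration (the paper released no code); the cross-attention is written in the standard query-key-value form, which matches the formula above up to transposition.

```python
import torch
import torch.nn as nn

class KinematicsState(nn.Module):
    """Builds K_t = [pos; vel; acc] per object from 2D centroids."""
    def __init__(self):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(1.0))  # learnable lambda

    def forward(self, pos_t, pos_prev, vel_prev):
        # pos_t, pos_prev: (B, N, 2) centroids phi(o_t) from a CNN localizer
        vel_t = self.lam * (pos_t - pos_prev)   # x_t^vel
        acc_t = vel_t - vel_prev                # x_t^acc
        k_t = torch.cat([pos_t, vel_t, acc_t], dim=-1)  # K_t: (B, N, 6)
        return k_t, vel_t

def analytical_next_pos(pos_t, vel_t, delta=1.0):
    # Analytical strategy: x_{t+1}^pos' = x_t^pos + x_t^vel * delta
    return pos_t + vel_t * delta

class CrossOCK(nn.Module):
    """Cross-attention: slots as queries, kinematics as keys/values,
    with a temperature tau for attention calibration."""
    def __init__(self, d_slot, d_kin, tau=1.0):
        super().__init__()
        self.q = nn.Linear(d_slot, d_slot)
        self.k = nn.Linear(d_kin, d_slot)
        self.v = nn.Linear(d_kin, d_slot)
        self.tau = tau

    def forward(self, slots, kin):
        q, k, v = self.q(slots), self.k(kin), self.v(kin)  # (B, N, d_slot)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.tau, dim=-1)
        return attn @ v  # fused slot update, same shape as slots
```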
Loss & Training¶
Training proceeds in two stages: SAVi is first trained to decompose video frames into slots, followed by training of the OCK Transformer.
The total loss is \(\mathcal{L} = \mathcal{L}_{\text{object}} + \alpha \mathcal{L}_{\text{image}}\):

- Object reconstruction loss: L2 distance between predicted slots and ground-truth slots.
- Image reconstruction loss: L2 distance between the frames decoded from predicted slots via the frozen SAVi decoder and the ground-truth frames.
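A hedged sketch of this two-term objective; the weight `alpha` and the decoder interface are assumptions:

```python
import torch.nn.functional as F

def ock_loss(pred_slots, gt_slots, savi_decoder, gt_frames, alpha=1.0):
    """Two-term objective L = L_object + alpha * L_image."""
    # Object reconstruction: L2 between predicted and ground-truth slots.
    l_object = F.mse_loss(pred_slots, gt_slots)
    # Image reconstruction: decode predicted slots with the frozen SAVi
    # decoder (its weights receive no updates, but gradients still flow
    # back through it to the predicted slots).
    pred_frames = savi_decoder(pred_slots)
    l_image = F.mse_loss(pred_frames, gt_frames)
    return l_object + alpha * l_image
```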
The model is trained with 6 input frames to predict 8 future frames, using temporal positional encodings that preserve permutation equivariance across objects.
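The autoregressive rollout can be pictured as below; `ock_transformer`, the sliding-window handling, and the kinematics refresh are all illustrative assumptions:

```python
def rollout(ock_transformer, slots_ctx, kin_ctx, horizon=8):
    """Autoregressively predict `horizon` future slot sets from a context
    of observed slots/kinematics (6 frames at training time). The model's
    own predictions are fed back in -- no teacher forcing."""
    preds = []
    slots, kin = list(slots_ctx), list(kin_ctx)
    for _ in range(horizon):
        next_slots = ock_transformer(slots, kin)  # predict slots at t+1
        preds.append(next_slots)
        slots = slots[1:] + [next_slots]          # slide the window
        # Kinematics for the new step would be re-derived from the
        # predicted slots' centroids (assumed helper, not shown here).
        kin = kin[1:] + [kin[-1]]
    return preds
```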
Key Experimental Results¶
Main Results (Tables)¶
Video prediction quality (PSNR↑) on five synthetic datasets, ordered from simple to complex:

| Model | OBJ3D | MOVi-A | MOVi-C | MOVi-D | MOVi-E |
|---|---|---|---|---|---|
| SlotFormer | 33.08 | 25.18 | 19.48 | 20.68 | 21.27 |
| OCVP-Seq | 33.10 | 26.24 | 17.95 | Diverges | Diverges |
| Joint-OCK | 35.13 | 27.26 | 21.04 | 22.09 | 22.39 |
| Cross-OCK | 34.10 | 27.58 | 21.04 | 22.34 | 22.34 |
Real-world Waymo Open Dataset:
| Model | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| SlotFormer | 19.13 | 0.330 | 0.714 |
| OCVP-Seq | 18.98 | 0.329 | 0.718 |
| Joint-OCK | 25.02 | 0.798 | 0.251 |
| Cross-OCK | 25.98 | 0.728 | 0.220 |
Ablation Study (Tables)¶
Ablation on MOVi-A, with Cross-OCK using the analytical strategy as the default configuration:
| Setting | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Cross-OCK (analytical), default | 27.58 | 0.812 | 0.123 |
| Input frames = 4 | 27.01 | 0.801 | 0.125 |
| Input frames = 8 | 27.12 | 0.806 | 0.125 |
| Transformer layers = 6 | 26.92 | 0.796 | 0.130 |
| Transformer layers = 8 | 26.50 | 0.784 | 0.133 |
| Standard positional encoding | 23.60 | 0.591 | 0.205 |
| Teacher Forcing | 23.58 | 0.589 | 0.207 |
Key Findings¶
- Kinematics are critical for complex scenes: OCVP completely diverges on MOVi-D/E, whereas OCK maintains stable predictions.
- On the Waymo real-world dataset, OCK improves PSNR by ~6.9 dB and reduces LPIPS by ~0.49 compared to SlotFormer.
- Temporal positional encodings (preserving permutation equivariance) are crucial; replacing them with standard positional encodings reduces PSNR by 4 dB.
- Teacher Forcing is harmful: Training the model to handle its own imperfect predictions is more beneficial for long-term generalization.
- The analytical strategy marginally outperforms the empirical strategy, as explicitly predicting the next-frame kinematic state provides more accurate guidance.
- Six input frames suffice to capture object dynamics; increasing to 8 frames yields a slight performance drop.
Highlights & Insights¶
- Physics-inspired design: Incorporating classical kinematics (position–velocity–acceleration) into object-centric learning is an intuitive and effective innovation.
- Elegant Cross-OCK design: Using slots as queries and kinematics as keys/values achieves a favorable balance between computational efficiency and performance.
- Long-term generalization: Trained on only 6 frames, the model generalizes to 18-frame prediction with slow error accumulation.
- Strong performance is demonstrated on real-world autonomous driving data (Waymo).
Limitations & Future Work¶
- Kinematics are modeled only in 2D image space, offering limited handling of 3D occlusions and depth variations.
- The method depends on the quality of SAVi-pretrained slot encoders; slot decomposition may be imperfect in complex scenes.
- Rotational and scale kinematics are not considered, limiting the modeling of complex motion types.
- Handling of object appearance and disappearance is not discussed.
- Incorporating object interaction graphs (GNNs) to explicitly model collision events is a promising future direction.
Related Work & Insights¶
- SlotFormer is the most direct baseline; OCK extends it by incorporating a kinematic dimension.
- OCVP explores the separation of temporal and relational attention but diverges in complex scenes.
- G-SWM models interactions via graph neural networks but underperforms Transformer-based methods.
- The proposed approach has potential implications for physics simulator learning and robotic visual understanding.
Rating¶
- Novelty: ⭐⭐⭐⭐ Introducing classical kinematics into object-centric learning is a natural yet effective contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on 7 datasets including real-world scenarios, with comprehensive ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ Well-structured; the systematic comparison between two strategies (analytical/empirical) and two architectures (Joint/Cross) is clearly presented.
- Value: ⭐⭐⭐⭐ Addresses a key bottleneck in object-centric video prediction under complex scenes.