LPWM: Latent Particle World Models for Object-Centric Stochastic Dynamics

Conference: ICLR 2026
arXiv: 2603.04553
Code: Project Page
Area: World Models / Object-Centric Representation / Video Prediction
Keywords: Object-centric, Latent particles, Self-supervised, World models, Stochastic dynamics, Latent actions

TL;DR

LPWM is presented as the first self-supervised object-centric world model that scales to real-world multi-object datasets. Its core innovation is a latent action \(z_c^m\) whose distribution is learned independently for each particle. The model encodes all frames in parallel, predicts dynamics with a causal spatiotemporal Transformer, and supports diverse conditioning signals (actions, language, image goals, multi-view). It achieves state-of-the-art video prediction and demonstrates imitation learning capability (89% success rate on OGBench task3).

Background & Motivation

Background: Object-centric world models decompose scenes into independent object representations (slots/patches/particles), making them naturally suited for modeling multi-object interactions. The DLP (Deep Latent Particles) framework represents objects via keypoints and extended attributes (position, scale, depth, transparency, visual features).

Limitations of Prior Work:
- Slot-based methods (SlotFormer, etc.): suffer from inconsistent decomposition, blurry predictions, and convergence difficulties, and require two-stage training.
- Patch-based methods (G-SWM, etc.): rely on cross-frame post-hoc matching and fail to scale to complex data.
- DDLP (the current best particle-based method): depends on explicit particle tracking and sequential encoding → non-parallelizable and unable to model stochasticity.
- All object-centric methods are limited to simple simulated environments and cannot handle real-world multi-object videos.

Key Challenge: Object-centric representations offer natural advantages (interpretability, compositional generalization, sparse interaction modeling), but the key bottleneck for scaling to real-world complex scenes is handling the independent stochastic motion of multiple objects. Global latent actions cannot capture independent behaviors such as "object A moves left while object B remains stationary."

Core Idea: Learn an independent latent action \(z_c^m\), with its own distribution, for each latent particle. During training, the latent actions are inferred from frame pairs via inverse dynamics; at inference time, they are sampled from a learned latent policy. In both cases they condition a causal spatiotemporal Transformer via AdaLN.

Method

Overall Architecture

Video frame sequence → Parallel particle encoder (extracting \(M\) foreground particles \(z_{fg}^m = [z_p, z_s, z_d, z_t, z_f]\) + background particle per frame) → Context Module (learning per-particle latent actions \(z_c^m\)) → Causal spatiotemporal Transformer for dynamics prediction → Particle decoder for next-frame reconstruction.
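
At inference time the pipeline above runs autoregressively: a latent action is sampled per particle and fed to the dynamics model to produce the next set of particles. The following is a minimal sketch of that loop, not the authors' implementation. The names `rollout`, `toy_policy`, and `toy_dynamics` are hypothetical, the Gaussian form of the latent policy is an assumption, and the particle decoder is omitted.

```python
import random
from typing import Callable, List, Tuple

Particles = List[List[float]]                        # M particles, each a flat vector
GaussParams = List[Tuple[List[float], List[float]]]  # per particle: (means, std devs)


def rollout(
    history: List[Particles],
    latent_policy: Callable[[List[Particles]], GaussParams],
    dynamics: Callable[[List[Particles], Particles], Particles],
    horizon: int,
    rng: random.Random,
) -> List[Particles]:
    """Autoregressively predict `horizon` future particle sets."""
    frames = list(history)
    for _ in range(horizon):
        # The policy sees past frames only and returns one distribution per particle.
        params = latent_policy(frames)
        # Sample an independent latent action z_c^m for every particle m.
        z_c = [[rng.gauss(mu, sd) for mu, sd in zip(mus, sds)] for mus, sds in params]
        frames.append(dynamics(frames, z_c))
    return frames[len(history):]


def toy_policy(frames: List[Particles]) -> GaussParams:
    # Assumed toy behavior: particle 0 moves randomly, particle 1 stays still.
    return [([0.0, 0.0], [1.0, 1.0]), ([0.0, 0.0], [0.0, 0.0])]


def toy_dynamics(frames: List[Particles], z_c: Particles) -> Particles:
    # Placeholder for the Transformer: shift each particle by its own action.
    return [[p + a for p, a in zip(particle, action)]
            for particle, action in zip(frames[-1], z_c)]


start = [[[0.2, 0.3], [0.7, 0.7]]]  # one context frame with two 2D particles
future = rollout(start, toy_policy, toy_dynamics, horizon=3, rng=random.Random(0))
```

In this toy run the second particle keeps its position in every predicted frame while the first one drifts, which is the kind of independent behavior a single global latent action could not express.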

Key Designs

Latent Particle Representation (Tracking-Free)
- Function: Each frame is encoded into \(M\) foreground particles, each with an attribute vector \(z_{fg}^m = [z_p, z_s, z_d, z_t, z_f]\) (2D position, scale, depth, transparency, visual features).
- Key difference from DDLP: LPWM keeps all \(M\) encoded particles, each identified by its patch of origin, rather than tracking the trajectories of a small subset of particles. This enables parallel encoding across frames (no sequential dependency).
- Positioning: a middle ground between patch-based and fully object-centric representations, since particles can move within a bounded region around their origin but do not roam freely.
- Design Motivation: Explicit tracking is the scalability bottleneck of DDLP — tracking failures cause compounding errors.
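
The particle layout described above can be sketched as a small data structure. This is only an illustration: the class `Particle` and the stub `encode_frame_stub` are hypothetical, and the per-attribute sizes (2 for position, 2 for scale, 1 each for depth and transparency) are an assumption chosen to agree with the effective particle dimension \(6 + d_{obj}\) quoted in the ablation study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Particle:
    """One foreground particle; field sizes are assumptions for illustration."""
    z_p: List[float]  # 2D position (x, y)
    z_s: List[float]  # scale (assumed width, height)
    z_d: float        # depth
    z_t: float        # transparency
    z_f: List[float]  # visual features, length d_obj

    def flat(self) -> List[float]:
        """Concatenate attributes in the order [z_p, z_s, z_d, z_t, z_f]."""
        return [*self.z_p, *self.z_s, self.z_d, self.z_t, *self.z_f]


def encode_frame_stub(num_particles: int, d_obj: int) -> List[Particle]:
    """Hypothetical stand-in for the encoder: particle m is tied to patch m."""
    return [
        Particle(z_p=[0.0, 0.0], z_s=[0.1, 0.1], z_d=0.0, z_t=1.0,
                 z_f=[0.0] * d_obj)
        for _ in range(num_particles)
    ]


# Each frame is encoded on its own, so frames can be processed in parallel;
# index m refers to the same patch origin in every frame (no tracking step).
video = [encode_frame_stub(num_particles=4, d_obj=3) for _ in range(5)]
particle_dim = len(video[0][0].flat())  # 6 + d_obj under the assumed sizes
```

Because identity comes from the index `m` rather than from a tracker, no frame's encoding depends on the result for any other frame.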
Per-Particle Latent Actions (Context Module — Core Innovation)
- Function: Learns independent latent action distributions \(q(z_c^m | o_t, o_{t+1})\) for each particle \(m\).
- During training (inverse dynamics): Given two consecutive frames, infers each particle's latent action (analogous to an inverse model).
- During inference (latent policy): \(\pi(z_c^m | o_{\leq t})\) — predicts the latent action distribution for each particle based solely on historical frames; stochastic prediction is achieved by sampling.
- Training objective: KL divergence regularization \(D_{KL}(q(z_c^m | o_t, o_{t+1}) \| \pi(z_c^m | o_{\leq t}))\).
- vs. global latent actions (Genie, CADDY): Ablation experiments confirm that per-particle actions are critical — global actions cannot capture independent motion patterns across multiple objects.
- Design Motivation: In multi-object scenes, object motions are independent (a ball moves left while a block stays still), requiring per-object stochasticity modeling.
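
The KL term above can be made concrete with a short sketch. It assumes that both the inverse-dynamics posterior \(q\) and the latent policy \(\pi\) are diagonal Gaussians, which the summary does not state; the helper names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Gaussian = Tuple[Sequence[float], Sequence[float]]  # (mean, log-variance)


def kl_diag_gaussians(q: Gaussian, p: Gaussian) -> float:
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    (mu_q, logvar_q), (mu_p, logvar_p) = q, p
    return sum(
        0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
        for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p)
    )


def per_particle_kl(posteriors: List[Gaussian], priors: List[Gaussian]) -> List[float]:
    """One KL value per particle m: D_KL(q(z_c^m | o_t, o_{t+1}) || pi(z_c^m | o_{<=t}))."""
    return [kl_diag_gaussians(q, p) for q, p in zip(posteriors, priors)]


def sample_action(dist: Gaussian, rng: random.Random) -> List[float]:
    """Inference-time sampling of z_c^m from the latent policy."""
    mu, logvar = dist
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


# Two particles with 2D latent actions. The policy predicts "no motion" for both;
# the posterior agrees for particle 0 and disagrees for particle 1.
policy = [([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [0.0, 0.0])]
posterior = [([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])]
kl_terms = per_particle_kl(posterior, policy)
z_c = [sample_action(dist, random.Random(m)) for m, dist in enumerate(policy)]
```

Here `kl_terms` is `[0.0, 0.5]`: only the particle whose observed motion the policy failed to anticipate contributes to the regularizer, which is what keeping the distributions separate per particle buys.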
Causal Spatiotemporal Transformer Dynamics
- Function: Predicts particle attribute changes for the next frame.
- Mechanism: Causal attention (attending only to historical frames) + spatial attention (inter-particle interactions within the same frame) + AdaLN conditioning (integrating latent actions \(z_c^m\) into Transformer layers via Adaptive Layer Normalization).
- Design Motivation: AdaLN more effectively incorporates conditioning signals than additive positional embeddings (validated by ablation).
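
The two mechanisms above can be sketched in plain Python. The AdaLN form used here, \((1 + \text{scale}) \cdot \mathrm{LN}(x) + \text{shift}\) with scale and shift produced by linear maps of \(z_c^m\), is the common DiT-style formulation and is an assumption about LPWM, not something the summary confirms. The single joint attention mask is likewise a simplification; the actual model may factorize spatial and temporal attention into separate layers.

```python
import math
from typing import List

Matrix = List[List[float]]


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Matrix, b: List[float], z: List[float]) -> List[float]:
    """Dense layer: one output per row of w."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]


def adaln(x: List[float], z_c: List[float],
          w_scale: Matrix, b_scale: List[float],
          w_shift: Matrix, b_shift: List[float]) -> List[float]:
    """Modulate the normalized token of particle m with its own latent action."""
    scale = linear(w_scale, b_scale, z_c)
    shift = linear(w_shift, b_shift, z_c)
    return [(1.0 + s) * h + t for h, s, t in zip(layer_norm(x), scale, shift)]


def causal_spatiotemporal_mask(num_frames: int, num_particles: int) -> List[List[bool]]:
    """mask[q][k] is True if query token q may attend to key token k.

    Tokens are ordered frame by frame (index = t * M + m). A token sees every
    particle in its own frame and in earlier frames, never in later frames.
    """
    n = num_frames * num_particles
    return [[(k // num_particles) <= (q // num_particles) for k in range(n)]
            for q in range(n)]


x = [1.0, 2.0, 3.0]                       # one particle token, width 3
z_c = [0.5, -0.5]                         # its latent action, width 2
zeros = [[0.0, 0.0] for _ in range(3)]    # zero-initialized modulation weights
out = adaln(x, z_c, zeros, [0.0] * 3, zeros, [0.0] * 3)
mask = causal_spatiotemporal_mask(num_frames=2, num_particles=2)
```

With zero-initialized modulation weights the layer reduces to plain layer normalization, so the conditioning signal only starts to matter as those weights are learned.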
Multimodal Conditioning
- Action conditioning: External action signals are directly fused into the Transformer.
- Language conditioning: Text encodings serve as additional conditioning.
- Image goal conditioning: A goal frame encoding guides generation.
- Multi-view: Multi-view particles can be jointly modeled for dynamics.
- Design Motivation: A unified conditioning interface allows the same model to be applied to a variety of downstream tasks.

Loss & Training

Training is end-to-end and self-supervised on raw video (no object labels or segmentation annotations required).
The objective combines a reconstruction loss with KL regularization of the latent action distribution.
Training resolution is 128×128; the number of particles \(M\) is set per dataset.
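
Read together with the per-particle KL term defined earlier, the training objective can be written schematically as follows. The weight \(\beta\), the prediction \(\hat{o}_{t+1}\), and the symbol \(\mathcal{L}_{\text{rec}}\) are notation assumed here for illustration, and any further regularizers used by the actual model are left out.

\[
\mathcal{L} \;=\; \sum_{t} \Big[ \mathcal{L}_{\text{rec}}\big(\hat{o}_{t+1},\, o_{t+1}\big) \;+\; \beta \sum_{m=1}^{M} D_{KL}\big( q(z_c^m \mid o_t, o_{t+1}) \,\big\|\, \pi(z_c^m \mid o_{\leq t}) \big) \Big]
\]

The inner sum runs over particles, so each particle's latent action is regularized toward its own policy prediction rather than toward a single global one.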

Key Experimental Results

Video Prediction (Main Results)

| Dataset | Condition Type | DVAE LPIPS↓ | LPWM LPIPS↓ | DVAE FVD↓ | LPWM FVD↓ |
| --- | --- | --- | --- | --- | --- |
| Sketchy-U | Latent action | 0.113 | 0.070 | 140.06 | 85.45 |
| BAIR-U | Latent action | 0.063 | 0.062 | 164.41 | 163.91 |
| Bridge-L | Language | — | — | 146.85 | 47.78 |
| Mario-U | Latent action | — | Best | — | Best |

LPWM surpasses the baselines on LPIPS and FVD across the stochastic dynamics datasets, though the margin over DVAE on BAIR-U is small (0.062 vs. 0.063 LPIPS, 163.91 vs. 164.41 FVD).

Imitation Learning

| Environment / Task | GCIVL | HIQL | LPWM |
| --- | --- | --- | --- |
| PandaPush 1 Cube | 74±4 | — | 100±0 |
| PandaPush 3 Cubes | 62.1±4.4 | — | 89.4±2.5 |
| OGBench task1 | 84±4 | 80±6 | 100±0 |
| OGBench task3 | 16±8 | 61±11 | 89±9 |

LPWM significantly outperforms the baselines on the PandaPush and OGBench tasks. On OGBench task3 (involving 4 atomic behaviors), LPWM achieves 89% success, compared with 61% for HIQL and 16% for GCIVL.

Ablation Study

| Configuration | Effect | Notes |
| --- | --- | --- |
| Global vs. per-particle latent actions | Per-particle significantly superior | Core innovation validated |
| Latent action dimensionality | Best near effective particle dimension (\(6+d_{obj}\)) | Model is robust to dimensionality |
| AdaLN vs. additive positional embedding | AdaLN superior | Conditioning mechanism matters |

Key Findings

Per-particle latent actions are the decisive performance factor — global latent actions prevent independent motion modeling in multi-object scenes.
LPWM's compact model matches large-scale video generation models on BAIR-64 FVD (89.4), demonstrating the efficiency advantage of object-centric inductive biases.
Success on real-world datasets (Sketchy, BAIR, Bridge) challenges the prevailing assumption that object-centric methods are restricted to simulation.
Imagined trajectories closely match actual execution (Figure 4), suggesting that the world model's accuracy carries over to decision-making capability.

Highlights & Insights

Scalability breakthrough for object-centric representations: Object-centric methods have long been considered viable only in simple simulated settings. LPWM demonstrates that the right design choices — per-particle latent actions, removing explicit tracking, and parallel encoding — enable scaling to real-world data, representing a conceptual advance.
Computational realization of the "what-where" visual pathway: The particle representation naturally corresponds to the object decomposition in the visual system; latent actions correspond to motor prediction; the overall architecture aligns with neuroscientific findings on human visuospatial world models.
Unified multi-condition interface: A single model supports unconditional, action, language, image goal, and multi-view conditioning, enabling natural transfer from video pretraining to robotic control.

Limitations & Future Work

Assumes limited camera motion and repetitive scenes (e.g., robot tabletop manipulation); not suitable for arbitrary video (e.g., street scenes, films).
Long-horizon tasks involving more than 4 atomic behaviors (OGBench tasks 4/5) remain unsolved across all methods — long-term planning is still a challenge.
The particle count \(M\) is fixed and cannot dynamically adapt to scenes of varying complexity.
Implicit tracking (particles moving near their origin) may fail in large-displacement scenarios.
Integration with explicit reward models and reinforcement learning has not been explored.

Comparison with Prior Work

vs. SlotFormer (slot-based): SlotFormer relies on two-stage training and suffers from inconsistent decomposition and blurry predictions; LPWM trains end-to-end and uses per-particle latent actions for greater flexibility.
vs. Genie/CADDY (global latent actions): Global actions cannot capture independent multi-object motion; PlaySlot adds slot-level latent actions but is constrained by the limitations of slot representations.
vs. DDLP (closest particle-based predecessor): DDLP requires explicit tracking and sequential encoding; LPWM removes tracking, enables parallel encoding, and introduces stochasticity.

Rating

Novelty: ⭐⭐⭐⭐⭐ Per-particle latent actions represent a natural yet non-obvious key innovation; scaling object-centric world models to real-world data is a significant breakthrough.
Experimental Thoroughness: ⭐⭐⭐⭐ 6+ datasets (synthetic + real) covering video prediction, imitation learning, and ablation studies.
Writing Quality: ⭐⭐⭐⭐ Motivation and design logic are clearly presented, with thorough comparison to prior work.
Value: ⭐⭐⭐⭐⭐ A pioneering contribution to object-centric world models, with code, data, and models fully open-sourced.