What If: Understanding Motion Through Sparse Interactions¶
Conference: ICCV 2025 · arXiv: 2510.12777 · Code: compvis.github.io/flow-poke-transformer · Area: Image Segmentation · Keywords: motion understanding, optical flow distribution prediction, sparse interaction, moving part segmentation, Transformer
TL;DR¶
This paper proposes the Flow Poke Transformer (FPT), which directly predicts multimodal probability distributions over object motion in a scene (rather than a single deterministic outcome), conditioned on sparse "poke" interactions, enabling interpretable motion understanding and moving part segmentation.
Background & Motivation¶
The central challenge in understanding scene dynamics is that real-world motion is inherently uncertain and multimodal. Human visual intelligence does not predict a single deterministic future; rather, it infers the multiple ways in which objects might move.
Limitations of existing motion prediction methods:
Dense video prediction (e.g., diffusion-based video generation) must commit to one trajectory, ignoring the multimodality of motion — even if the generated frames are photorealistic, this does not imply an understanding of the underlying physical process.
Deterministic optical flow estimation (e.g., RAFT) computes motion between two given frames; because it requires the future frame as input, it cannot forecast future motion.
Existing conditional motion generation (e.g., DragAPart, PuppetMaster) synthesizes results directly in RGB space, making the underlying motion representation inaccessible and providing no uncertainty estimates.
Core motivation: design a model that directly outputs probability distributions over motion rather than merely sampling from them. This enables:
- Direct quantification of uncertainty
- Identification of multimodal motion
- Exploration of physical interactions in a scene via sparse pokes
Method¶
Overall Architecture¶
Given an image \(\mathcal{I}\), the model learns the conditional distribution \(p(\mathbf{f}(\mathbf{q})|\mathcal{P}, \mathcal{I})\): the distribution of motion \(\mathbf{f}(\mathbf{q}) \in \mathbb{R}^2\) at query point \(\mathbf{q}\), conditioned on a set of pokes \(\mathcal{P} = \{(\mathbf{p}_i, \mathbf{f}(\mathbf{p}_i))\}_{i=1}^{N_p}\) and the image. The architecture consists of an image encoder (ViT-Base initialized with DINOv2-R) and the Flow Poke Transformer (ViT-Base), totaling 220M parameters.
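To make the conditioning structure concrete, here is a minimal PyTorch sketch of the interface implied by the description above; the class and argument names (`FlowPokeTransformer`, `poke_pos`, `poke_flow`, `query_pos`) are illustrative assumptions, not the authors' API.

```python
import torch
import torch.nn as nn

class FlowPokeTransformer(nn.Module):
    """Sketch: a DINOv2-initialised ViT-B image encoder plus a ViT-B that mixes
    poke and query tokens and emits per-query flow-distribution parameters."""
    def __init__(self, image_encoder: nn.Module, poke_transformer: nn.Module, dist_head: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder      # encodes the image I into patch tokens
        self.poke_transformer = poke_transformer  # mixes image, poke, and query tokens
        self.dist_head = dist_head              # maps query features to distribution parameters

    def forward(self, image, poke_pos, poke_flow, query_pos):
        """image: (B, 3, H, W); poke_pos, poke_flow: (B, N_p, 2); query_pos: (B, N_q, 2).
        Returns the predicted distribution p(f(q) | P, I) for every query point."""
        img_tokens = self.image_encoder(image)
        query_feats = self.poke_transformer(img_tokens, poke_pos, poke_flow, query_pos)
        return self.dist_head(query_feats)
```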
Key Designs¶
- Sparse Kinematic Modeling: Each poke \((\mathbf{p}_i, \mathbf{f}(\mathbf{p}_i))\) and query point \(\mathbf{q}_j\) is represented as an independent token. Poke motion is encoded via Fourier embeddings (a possible embedding is sketched after this list), and positions are represented with RoPE relative positional encodings to support arbitrary-precision, off-grid locations. Query tokens attend only to themselves and the pokes (not to other queries), enabling efficient parallel prediction over multiple queries.
- Gaussian Mixture Model (GMM) Output Distribution: The projection head at the Transformer output directly predicts an \(N\)-component GMM, \(p_\theta(\mathbf{f}(\mathbf{q})) = \sum_{n=1}^N \pi^{(n)} \, \mathcal{N}(\boldsymbol{\mu}^{(n)}, \boldsymbol{\Sigma}^{(n)})\). A key improvement over GIVT is the use of full covariance matrices \(\boldsymbol{\Sigma}^{(n)} \in \mathbb{R}^{2 \times 2}\) (kept positive definite by predicting the lower-triangular Cholesky factor \(\mathbf{L}^{(n)}\)) rather than diagonal covariances, substantially increasing modeling expressiveness (see the GMM head sketch after this list).
- Query-Causal Attention: During training, a causal attention mask is applied over the pokes, and each query attends only to its corresponding poke subset. This reduces computational complexity from \(\mathcal{O}(N_p^2 \cdot N_q^2)\) to \(\mathcal{O}(N_p^2 + N_p \cdot N_q)\), enabling efficient training (a mask-construction sketch follows the list).
- Camera Motion Adaptation: Adaptive normalization layers (AdaIN) condition the model on whether the camera is static, preventing camera motion from dominating the learned motion distribution.
- Moving Part Segmentation: The influence of a poke on the motion distribution at a query point is measured with the KL divergence \(D_{KL}\big(p_\theta(\mathbf{f}(\mathbf{q}) \mid (\mathbf{p}, \mathbf{f}(\mathbf{p})), \mathcal{I}) \,\|\, p_\theta(\mathbf{f}(\mathbf{q}) \mid \mathcal{I})\big)\). A KL divergence of zero indicates motion independence, while a nonzero value indicates influence from the poke, directly quantifying part-level motion correlation (see the Monte Carlo KL sketch after this list).
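The Fourier embedding of poke motion mentioned in the first design point can look roughly like the following; the frequency count and the log-spaced frequencies are assumptions for illustration, not the paper's exact choice.

```python
import torch

def fourier_embed(flow: torch.Tensor, n_freqs: int = 8) -> torch.Tensor:
    """Map 2D flow vectors to sin/cos Fourier features, a common way to embed
    continuous, off-grid quantities. flow: (..., 2) -> (..., 4 * n_freqs)."""
    freqs = 2.0 ** torch.arange(n_freqs, dtype=flow.dtype, device=flow.device)  # (F,)
    angles = flow[..., None] * freqs                          # (..., 2, F)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 2, 2F)
    return feats.flatten(-2)                                  # (..., 4F)
```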
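The full-covariance GMM head can be parameterised as in the following sketch, which assumes `torch.distributions`; the component count, the softplus diagonal floor, and the class name are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, MultivariateNormal, MixtureSameFamily

class FullCovGMMHead(nn.Module):
    """Predicts an N-component 2D Gaussian mixture whose full covariances are
    parameterised by lower-triangular Cholesky factors (illustrative sketch)."""
    def __init__(self, dim: int, n_components: int = 4):
        super().__init__()
        self.n = n_components
        # per component: 1 mixture logit + 2 mean entries + 3 Cholesky entries (l11, l21, l22)
        self.proj = nn.Linear(dim, n_components * 6)

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        # h: (..., dim) per-query features from the Transformer
        p = self.proj(h).unflatten(-1, (self.n, 6))
        logits, mean = p[..., 0], p[..., 1:3]
        l11, l21, l22 = p[..., 3], p[..., 4], p[..., 5]
        zeros = torch.zeros_like(l11)
        # lower-triangular factor with strictly positive diagonal => Sigma = L L^T is positive definite
        tril = torch.stack([
            torch.stack([F.softplus(l11) + 1e-4, zeros], dim=-1),
            torch.stack([l21, F.softplus(l22) + 1e-4], dim=-1),
        ], dim=-2)                                              # (..., N, 2, 2)
        components = MultivariateNormal(loc=mean, scale_tril=tril)
        weights = Categorical(logits=logits)
        return MixtureSameFamily(weights, components)
```

Because the head returns a distribution object, probability densities, samples, component means, and mixture weights can all be read off directly, which is exactly what sample-only generators do not expose.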
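The attention pattern behind query-causal attention can be built as a boolean mask; the token ordering and the way a poke prefix is assigned to each query during training are assumptions in this sketch.

```python
import torch

def query_causal_mask(n_pokes: int, pokes_per_query: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask (True = may attend), illustrative layout.
    Token order: [poke_1 .. poke_Np, query_1 .. query_Nq].
    Pokes attend causally to themselves and earlier pokes; query j attends to
    its first pokes_per_query[j] pokes and to itself, never to other queries."""
    n_queries = pokes_per_query.numel()
    n = n_pokes + n_queries
    mask = torch.zeros(n, n, dtype=torch.bool)
    # causal attention among poke tokens
    mask[:n_pokes, :n_pokes] = torch.tril(torch.ones(n_pokes, n_pokes, dtype=torch.bool))
    # each query sees only its assigned poke prefix ...
    poke_idx = torch.arange(n_pokes)
    mask[n_pokes:, :n_pokes] = poke_idx[None, :] < pokes_per_query[:, None]
    # ... plus itself
    mask[n_pokes:, n_pokes:] = torch.eye(n_queries, dtype=torch.bool)
    return mask
```

Each query row has at most \(N_p + 1\) allowed entries and queries never attend to one another, so attention cost grows with the number of pokes and queries rather than with the square of the query count.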
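Since the KL divergence between two Gaussian mixtures has no closed form, a Monte Carlo estimate is one straightforward way to compute the influence score above; the `model(...)` signature here is hypothetical (matching the interface sketch earlier) and is assumed to return a `torch.distributions` object per query.

```python
import torch

@torch.no_grad()
def poke_influence(model, image, poke_pos, poke_flow, query_pos, n_samples: int = 256):
    """Monte Carlo estimate of KL(p(f(q) | poke, I) || p(f(q) | I)) per query point.
    High values mark query points whose predicted motion changes when the poke is
    applied, i.e. points that move together with the poked part."""
    no_pokes = poke_pos[:, :0]                                  # empty poke set, shape (B, 0, 2)
    p_poked = model(image, poke_pos, poke_flow, query_pos)      # conditioned on the poke
    p_prior = model(image, no_pokes, no_pokes, query_pos)       # unconditional motion prior
    x = p_poked.sample((n_samples,))                            # (S, B, N_q, 2)
    kl = (p_poked.log_prob(x) - p_prior.log_prob(x)).mean(0)    # (B, N_q)
    return kl                                                   # threshold to get a moving-part mask
```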
Loss & Training¶
The model is trained by minimizing the negative log-likelihood of the ground-truth optical flow: \(\mathcal{L} = -\log p_\theta(\mathbf{f}(\mathbf{q}) \mid \mathcal{P}, \mathcal{I}) = -\log\left(\sum_{n=1}^N \pi^{(n)} \, \mathcal{N}(\mathbf{f}(\mathbf{q}) \mid \boldsymbol{\mu}^{(n)}_\theta, \boldsymbol{\Sigma}^{(n)}_\theta)\right)\).
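With a mixture head like the sketch above, this objective is just the mixture's negative log-probability at each query point; a minimal helper, assuming the head returns a `MixtureSameFamily`:

```python
import torch
from torch.distributions import MixtureSameFamily

def flow_nll_loss(dist: MixtureSameFamily, gt_flow: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the ground-truth flow under the predicted mixture.
    dist: per-query mixture over R^2 (batch shape (B, N_q), event shape (2,));
    gt_flow: (B, N_q, 2) flow obtained from dense point tracking at the query points."""
    return -dist.log_prob(gt_flow).mean()
```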
Training details:
- Dataset: WebVid 3.8M subset (general-purpose pretraining); an alternative variant uses 5M open-domain videos
- Ground-truth optical flow: dense tracking on a \(48^2\) grid using CoTracker3 / TAPNext
- Optimizer: AdamW, lr = 5e-5, batch size 32→128, 800k steps
- Per image: 0–128 pokes and 15 random query points sampled
- Training time: 7 days on 2×H200 (or 24 h on 8×H200 with an optimized configuration)
Key Experimental Results¶
Main Results: Talking Face Motion Generation (TalkingHead-1KH)¶
| Method | Training Data | 1 Poke EPE↓ | 10 Pokes EPE↓ | 100 Pokes EPE↓ |
|---|---|---|---|---|
| InstantDrag | Face-specific | 9.24 | 8.39 | 7.29 |
| Motion-I2V | General (Zero-Shot) | 29.08 | 20.90 | n/a |
| FPT (Ours) | General (Zero-Shot) | 7.64 | 4.20 | 2.51 |
The zero-shot general-purpose model surpasses the face-specific model InstantDrag, with performance gains increasing markedly as the number of pokes grows.
Articulated Object Motion Estimation (Drag-A-Move)¶
| Method | Training Set | EPE↓ | PCK↑ | Seg. mIoU↑ |
|---|---|---|---|---|
| DragAPart | DAM (specialized) | 9.69 | 0.514 | 0.273 |
| PuppetMaster | DAM+OAHQ (specialized) | 9.62 | 0.472 | 0.112 |
| FPT (zero-shot) | General | 12.74 | 0.191 | 0.287 |
| FPT (fine-tuned) | General→DAM | 3.57 | 0.834 | 0.572 |
After fine-tuning, EPE drops to 3.57, a 63% reduction relative to DragAPart's 9.69, and moving part segmentation mIoU reaches 0.572 (vs. 0.273 for DragAPart).
Uncertainty & Multimodality Analysis¶
| Predictive Uncertainty Calibration | Description |
|---|---|
| Pearson \(\rho\) = 0.66 (sampled) | Predicted uncertainty strongly correlates with actual error |
| Pearson \(\rho\) = 0.64 (mean) | The correlation holds when the mixture mean is used as the point prediction |
| Pearson \(\rho\) = 0.62 (highest-confidence mode) | Higher confidence correlates with higher accuracy |
| Multimodal Analysis | Description |
|---|---|
| High mode diversity | Mode variation covers a large portion of the poke magnitude |
| Mode closest to GT has above-average confidence | Confidence predictions are semantically meaningful |
| More pokes → more unimodal | Distribution naturally converges as conditioning information increases |
Key Findings¶
- The uncertainty predicted by FPT correlates strongly with actual error (Pearson \(\rho > 0.6\)), validating the reliability of probabilistic distribution prediction
- General-purpose pretraining generalizes effectively: zero-shot performance on talking-face motion surpasses specialized models
- Moving part segmentation requires no dedicated training and emerges naturally from distribution comparison
- Single-H200 inference latency is <25ms with throughput >160k predictions/s, suitable for real-time applications
Highlights & Insights¶
- Direct access to the probability distribution: Unlike diffusion or GAN models that only support sampling, FPT's GMM output allows direct reading of probability densities, computation of KL divergences, and mode identification
- Sparse representation → efficiency: Sparse token modeling avoids the computational overhead of dense prediction while preserving rich motion semantics
- Moving part segmentation emerges naturally from motion understanding: No additional annotations or modules are required; part-level correlation is quantified solely via KL divergence — an elegant conceptual contribution
- The query-causal attention design substantially reduces training cost and is transferable to other sparse conditional prediction tasks
Limitations & Future Work¶
- Generalization to cartoon or animated imagery is limited (training data consists primarily of real-world video)
- The model sometimes incorrectly couples object shadows with object motion
- The current formulation models only 2D motion distributions; extending to 3D motion is a natural future direction
- Autoregressive dense sampling can generate globally consistent motion but is comparatively slow
Related Work & Insights¶
- The GMM output head design builds upon GIVT but extends it to full covariance matrices, enhancing modeling capacity
- The poke concept originates from iPoke (Blattmann et al. 2021) but is elevated from RGB synthesis to probabilistic distribution modeling
- RoPE positional encodings enable the model to support arbitrary-precision off-grid positions, contributing to strong generalization
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Direct prediction of motion probability distributions + emergent moving part segmentation
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-domain evaluation + uncertainty analysis + segmentation
- Value: ⭐⭐⭐⭐ — Real-time inference + general-purpose pretraining + multiple downstream tasks
- Overall: ⭐⭐⭐⭐⭐