TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation¶
Conference: CVPR 2026 arXiv: 2602.19053 Code: github.com/KTH-RPL/OpenSceneFlow Area: Self-Supervised Learning / Autonomous Driving Keywords: Scene Flow, Self-Supervised, Multi-frame Supervision, Temporal Aggregation, Feed-forward Network, Point Cloud
TL;DR¶
This paper proposes TeFlow — the first method to introduce multi-frame supervision into self-supervised feed-forward scene flow estimation. By constructing a motion candidate pool via temporal aggregation and aggregating temporally consistent supervision signals through consensus voting, TeFlow achieves a Three-way EPE of 3.57 cm on Argoverse 2 (on par with the optimization-based method Floxels) while maintaining real-time inference (8 s vs. 24 min), representing a 22.3% improvement over SeFlow++.
Background & Motivation¶
Background: Scene flow estimation aims to predict the 3D motion of each point in LiDAR point clouds. Existing self-supervised methods fall into two categories: (1) Optimization-based methods (NSFP, EulerFlow) — which optimize scene-specific models using multi-frame long-horizon constraints, achieving high accuracy but with prohibitive latency (hours to days); (2) Feed-forward methods (SeFlow, ZeroFlow) — which achieve efficient single-pass inference, but whose training objectives rely solely on two-frame point correspondences, making them susceptible to unstable supervision signals caused by occlusions, noise, and sparse observations.
Key Challenge: Multi-frame supervision has the potential to provide more stable training signals, but naively extending two-frame objectives to multiple frames is ineffective: inter-frame point correspondences vary dramatically, producing inconsistent signals. As shown in Figure 1b of the paper, even when the true motion is smooth, the direction of two-frame supervision signals oscillates sharply over time due to occlusion and noise.
Limitations of Prior Work: ZeroFlow generates pseudo-labels via knowledge distillation from a slow "teacher," requiring 7.2 GPU-months of computation. SeFlow improves the two-frame loss function but remains bounded by the two-frame signal ceiling. Multi-frame architectures (Flow4D, DeltaFlow) are effective under supervised learning, but are still constrained by two-frame objectives in the self-supervised setting.
Key Insight: Rather than designing a better two-frame loss, TeFlow mines temporally consistent motion cues across multiple frames — constructing a motion candidate pool followed by consensus voting to produce stable multi-frame supervision signals, enabling feed-forward models to fully leverage the temporal modeling capacity of multi-frame architectures under self-supervision for the first time.
Method¶
Overall Architecture¶
TeFlow builds upon the DeltaFlow multi-frame backbone network. Given 5 LiDAR frames as input (ego-motion aligned), it predicts a residual flow \(\mathcal{F}_{res}\). During training, point clouds are first partitioned into static and dynamic regions (provided by DUFOMap), and dynamic points are grouped into clusters \(\mathcal{C}_j\) (pre-computed by HDBSCAN). For each dynamic cluster, TeFlow generates reliable supervision targets \(\bar{\mathbf{f}}_{\mathcal{C}_j}\) via temporal aggregation, and training is performed using a combination of a static loss and a geometric consistency loss.
Key Designs¶
- Motion Candidate Generation
- Function: Generate diverse motion hypotheses for each dynamic cluster, balancing stability with geometry-driven evidence from data.
- Mechanism: The candidate pool consists of two sources — (a) Internal candidates \(\hat{\mathbf{f}}_{\mathcal{C}_j}\): the cluster-averaged flow from the current network prediction, serving as a stable anchor; (b) External candidates \(\mathbf{f}^{t'}_{\mathcal{C}_j,k}\): for each temporal frame \(t'\), the top-K nearest neighbor correspondences with the largest displacements are retrieved, normalized by the temporal interval: \(\mathbf{f}^{t'}_{\mathcal{C}_j,k} = \frac{\mathcal{NN}(\mathbf{p}_k, \mathcal{P}_{t',d}) - \mathbf{p}_k}{t' - t}\)
- Design Motivation: Internal candidates prevent training from drifting; external candidates mine true motion cues from multi-frame geometry. Selecting top-K largest displacements filters noisy points. Temporal normalization ensures comparability across candidates from different time intervals.
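To make the external-candidate formula concrete, here is a minimal numpy sketch of retrieving temporally normalized candidates for one cluster. Function names, shapes, and the brute-force nearest-neighbor search are illustrative assumptions, not the authors' implementation (which would operate on full dynamic point sets per frame).

```python
# Hypothetical sketch: external candidates f^{t'}_{C_j,k} for one cluster.
# cluster_pts / frame_pts are placeholder names, not from the paper's code.
import numpy as np

def external_candidates(cluster_pts, frame_pts, t_prime, t, top_k=8):
    """For each cluster point, find its nearest neighbor among the dynamic
    points of frame t', keep the top-K largest displacements (filtering
    near-static noisy matches), and normalize by the temporal interval
    (which assumes roughly linear motion over the window)."""
    # Brute-force nearest neighbor of every cluster point in frame t'.
    dists = np.linalg.norm(
        cluster_pts[:, None, :] - frame_pts[None, :, :], axis=-1)
    nn = frame_pts[np.argmin(dists, axis=1)]            # (N, 3)
    disp = nn - cluster_pts                             # raw displacements
    # Keep the K correspondences with the largest displacement magnitude.
    order = np.argsort(-np.linalg.norm(disp, axis=1))[:top_k]
    # Divide by the frame gap so candidates from different temporal
    # intervals are comparable per-frame flows.
    return disp[order] / (t_prime - t)
```

Note that for past frames (\(t' < t\)) the negative denominator flips the displacement sign, so all candidates are expressed as forward flow.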
- Consensus Voting
- Function: Extract the most reliable motion estimate from the candidate pool, filtering inconsistent noisy candidates.
- Mechanism: A consensus matrix \(\mathbf{M}_{ab} = \mathbf{1}[\cos(\mathbf{f}_a, \mathbf{f}_b) > \tau_{cos}]\) is constructed to measure directional consistency, combined with reliability weights \(w_i = \gamma^{m_i}(1 + \|\mathbf{f}_i\|_2^2)\) (temporal decay plus displacement magnitude weighting). The voting score \(\mathbf{S} = \mathbf{M}\mathbf{w}\) selects the highest-scoring candidate as the consensus winner \(a^\dagger\). The final supervision target is the weighted average of all candidates consistent with the winner's direction: \(\bar{\mathbf{f}}_{\mathcal{C}_j} = \frac{\sum_b \mathbf{M}_{a^\dagger b} w_b \mathbf{f}_b}{\sum_b \mathbf{M}_{a^\dagger b} w_b}\)
- Design Motivation: A single candidate is unreliable; voting aggregation exploits majority consistency. Temporal decay \(\gamma=0.9\) prioritizes nearby frames; large-displacement candidates are more informative. The resulting consensus signal is substantially more stable than two-frame signals (cf. Figure 1b).
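The voting step above can be sketched in a few lines of numpy. The decay \(\gamma = 0.9\) follows the text; the cosine threshold value and all names are illustrative assumptions (the paper defines \(\tau_{cos}\) but this summary does not state its value).

```python
# Minimal sketch of consensus voting over a candidate pool.
# tau_cos=0.8 is a placeholder value; gamma=0.9 follows the text.
import numpy as np

def consensus_vote(cands, frame_gaps, tau_cos=0.8, gamma=0.9):
    """cands: (N, 3) candidate flows; frame_gaps: (N,) temporal distance
    m_i of each candidate's source frame. Returns the weighted average of
    candidates directionally consistent with the voting winner."""
    unit = cands / (np.linalg.norm(cands, axis=1, keepdims=True) + 1e-8)
    # Consensus matrix M_ab: 1 where two candidates point the same way.
    M = (unit @ unit.T > tau_cos).astype(float)
    # Reliability w_i: temporal decay times displacement-magnitude term.
    w = gamma ** frame_gaps * (1.0 + np.sum(cands ** 2, axis=1))
    winner = np.argmax(M @ w)            # highest voting score S = M w
    mask = M[winner] * w                 # candidates consistent with winner
    return (mask[:, None] * cands).sum(axis=0) / mask.sum()
```

With two candidates pointing one way and one pointing the opposite way, the lone outlier receives zero weight in the final average, which is the stabilizing behavior the paper attributes to majority consistency.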
- Dynamic Cluster Loss
- Function: Supervise dynamic objects of varying sizes equitably, preventing large objects from dominating training due to their higher point counts.
- Mechanism: A combination of point-level loss (averaged L2 over all points) and cluster-level loss (intra-cluster average followed by cross-cluster average): \(\mathcal{L}_{dcls} = \frac{1}{|\mathcal{P}_\mathcal{C}|}\sum_j\sum_{\mathbf{p}_i \in \mathcal{C}_j}\|\hat{\mathbf{f}}_i - \bar{\mathbf{f}}_{\mathcal{C}_j}\|^2_2 + \frac{1}{N_c}\sum_j(\frac{1}{|\mathcal{C}_j|}\sum_{\mathbf{p}_i \in \mathcal{C}_j}\|\hat{\mathbf{f}}_i - \bar{\mathbf{f}}_{\mathcal{C}_j}\|^2_2)\)
- Design Motivation: Point-level loss alone causes small objects such as pedestrians to be overwhelmed (ablation shows a 53% increase on the PED category, Table 5); cluster-level loss alone is insufficient for fine-grained alignment of large objects (OTHER category error increases by 82%). The combination achieves optimal performance.
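A direct transcription of the two-term loss formula, again as a hedged numpy sketch (the real model would compute this on GPU tensors; data layout and names are assumptions):

```python
# Sketch of L_dcls: point-level term plus cluster-level term.
import numpy as np

def dynamic_cluster_loss(pred_flow, targets, cluster_ids):
    """pred_flow: (N, 3) per-point predicted flows; targets: dict mapping
    a cluster id to its consensus target flow (3,); cluster_ids: (N,)
    cluster label of each dynamic point."""
    per_point_err = []
    cluster_means = []
    for cid, tgt in targets.items():
        pts = pred_flow[cluster_ids == cid]
        err = np.sum((pts - tgt) ** 2, axis=1)   # squared L2 per point
        per_point_err.append(err)
        cluster_means.append(err.mean())         # intra-cluster average
    all_err = np.concatenate(per_point_err)
    # Point-level term (averaged over all dynamic points) plus
    # cluster-level term (averaged over clusters), equally weighted.
    return all_err.mean() + np.mean(cluster_means)
```

The cluster-level term gives a 5-point pedestrian the same vote as a 500-point truck, which is exactly the rebalancing the ablation in Table 5 measures.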
Loss & Training¶
Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{dcls} + \mathcal{L}_{static} + \mathcal{L}_{geom}\)
- \(\mathcal{L}_{static}\): Drives residual flow of static points toward zero.
- \(\mathcal{L}_{geom}\): Multi-frame Chamfer distance ensuring warped point clouds align geometrically with neighboring frames.
- Training configuration: Adam optimizer, lr=0.002, batch size=20, 10×RTX 3080, 15 epochs, approximately 15–20 hours.
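For completeness, a one-directional Chamfer term as used in geometric-consistency losses can be sketched as follows. The paper's \(\mathcal{L}_{geom}\) is a multi-frame variant warping to several neighboring frames; this sketch shows a single warped frame, and all names are placeholders.

```python
# Illustrative one-way Chamfer term for a geometric consistency loss
# (single target frame only; not the paper's exact multi-frame form).
import numpy as np

def chamfer_one_way(warped, target):
    """Mean squared distance from each warped point to its nearest
    neighbor in the target frame."""
    d = np.linalg.norm(warped[:, None, :] - target[None, :, :], axis=-1)
    return np.mean(np.min(d, axis=1) ** 2)

def geom_loss(points, flow, target):
    # Warp the source points by the predicted flow, then measure how
    # well the warped cloud aligns with the neighboring frame.
    return chamfer_one_way(points + flow, target)
```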
Key Experimental Results¶
Main Results: Argoverse 2 Test Set Leaderboard¶
| Method | Type | #Frames | Runtime/seq | Three-way EPE↓ | Dynamic Norm↓ | PED↓ |
|---|---|---|---|---|---|---|
| NSFP | Optimization | 2 | 60 min | 6.06 | 0.422 | 0.722 |
| EulerFlow | Optimization | all | 1440 min | 4.23 | 0.130 | 0.195 |
| Floxels | Optimization | 13 | 24 min | 3.57 | 0.154 | 0.195 |
| ZeroFlow | Feed-forward | 3 | 5.4 s | 4.94 | 0.439 | 0.808 |
| SeFlow | Feed-forward | 2 | 7.2 s | 4.86 | 0.309 | 0.464 |
| SeFlow++ | Feed-forward | 3 | 10 s | 4.40 | 0.264 | 0.367 |
| TeFlow | Feed-forward | 5 | 8 s | 3.57 | 0.205 | 0.253 |
- TeFlow's EPE of 3.57 cm matches Floxels (an optimization-based method) while being 180× faster (8 s vs. 24 min per sequence).
- Dynamic metric improves 22.3% over SeFlow++; pedestrian category error decreases by 31%.
Ablation Study: Number of Frames and Loss Terms¶
| Loss Combination | Dynamic Norm Mean↓ | CAR↓ | PED↓ | Three-way EPE↓ |
|---|---|---|---|---|
| \(\mathcal{L}_{geom}\) only | 0.386 | 0.317 | 0.297 | 8.85 |
| \(\mathcal{L}_{geom} + \mathcal{L}_{static}\) | 0.458 | 0.321 | 0.481 | 6.37 |
| \(\mathcal{L}_{dcls}\) only | 0.303 | 0.254 | 0.285 | 8.53 |
| \(\mathcal{L}_{static} + \mathcal{L}_{dcls}\) | 0.313 | 0.233 | 0.296 | 4.84 |
| All three terms | 0.265 | 0.198 | 0.295 | 4.43 |
| #Frames | Dynamic Norm Mean↓ | Three-way EPE Mean↓ |
|---|---|---|
| 2 (SeFlow) | 0.408 | 6.35 |
| 2 (TeFlow) | 0.353 | 5.98 |
| 4 | 0.283 | 4.57 |
| 5 | 0.265 | 4.43 |
| 6 | 0.269 | 4.55 |
| 8 | 0.300 | 5.40 |
Key Findings¶
- Even with 2 frames, TeFlow outperforms SeFlow by 13.5%, attributable to the candidate pool and cluster-level loss.
- 5 frames is the optimal window; performance slightly degrades at 6 frames and noticeably at 8 frames, suggesting that distant frames introduce noise.
- The dynamic cluster loss alone performs strongly on dynamic objects (Dynamic Norm Mean 0.303) but yields a high Three-way EPE of 8.53, driven by static points; the static loss is therefore necessary.
- Candidate pool ablation: the combined pool (0.265) outperforms external candidates only (0.321), which in turn outperforms internal candidates only (0.455), confirming the complementarity of internal anchoring and external geometric evidence.
- State-of-the-art results are also achieved on nuScenes: Dynamic Norm 0.395 vs. SeFlow++ 0.509, with a 33.8% reduction in pedestrian error.
Highlights & Insights¶
- Precise core insight: The key difficulty in self-supervised feed-forward scene flow is not architecture but supervision signal quality — mining temporal consistency is the right approach.
- Elegant candidate pool and voting mechanism: Unreliable multi-source motion estimates are aggregated into reliable signals without additional networks or complex optimization.
- Simple yet effective cluster-level loss: A single additional loss term resolves the large-small object imbalance, yielding substantial gains especially for small objects such as pedestrians.
- Pareto-optimal efficiency–accuracy trade-off: TeFlow achieves the highest accuracy among real-time methods and the fastest inference among high-accuracy methods, successfully bridging the gap between the two paradigms.
Limitations & Future Work¶
- Relies on external modules for static/dynamic segmentation (DUFOMap) and dynamic clustering (HDBSCAN); segmentation errors may propagate through the pipeline.
- Performance degrades beyond 5 frames, suggesting the consensus mechanism does not yet exploit distant frames with sufficient granularity.
- Candidate normalization assumes linear motion, limiting candidate quality for curvilinear motion (e.g., turning vehicles).
- Application of the temporal aggregation strategy at inference time (currently used only during training) remains unexplored.
Related Work & Insights¶
- vs. EulerFlow: EulerFlow optimizes a continuous ODE and achieves high accuracy (EPE 4.23) but requires 1440 minutes per sequence. TeFlow achieves 3.57 cm in only 8 seconds, making it the only viable high-accuracy solution for real-world deployment.
- vs. SeFlow/SeFlow++: These methods improve the two-frame loss but remain bounded by the two-frame signal ceiling. TeFlow improves signal quality at the source.
- vs. ZeroFlow: Knowledge distillation requires 7.2 GPU-months to generate pseudo-labels. TeFlow is fully end-to-end self-supervised.
- Insights: The multi-frame signal mining paradigm via consensus voting is transferable to other self-supervised visual tasks (optical flow, depth estimation). The cluster-level loss concept is broadly applicable to any 3D perception task involving object scale imbalance.
Rating¶
⭐⭐⭐⭐⭐ (5/5)
Overall assessment: TeFlow precisely identifies the core bottleneck of self-supervised feed-forward scene flow (unstable multi-frame supervision signals) and proposes a concise and elegant temporal aggregation solution. Experiments are comprehensive (Argoverse 2 + nuScenes + Waymo), ablations are thorough (number of frames / loss combinations / candidate pool / loss formulations), and both quantitative and qualitative results are convincing. This work is the first to achieve optimization-level accuracy in self-supervised feed-forward scene flow while maintaining real-time efficiency, establishing a new Pareto frontier.