
FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance

Conference: NeurIPS 2025
arXiv: 2502.14520
Authors: Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, Kenli Li (Hunan University)
Code: https://github.com/willemeng/FlowScene
Area: Autonomous Driving / 3D Semantic Scene Completion
Keywords: 3D Semantic Scene Completion, Optical Flow Guidance, Temporal Modeling, Occlusion Awareness, Voxel Refinement

TL;DR

This paper proposes FlowScene, which leverages optical flow to guide temporal feature aggregation and employs occlusion masks for voxel refinement. Using only 2 historical frames as input, FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks (mIoU 17.70 / 20.81).

Background & Motivation

3D Semantic Scene Completion (SSC) is a core perception task in autonomous driving, aiming to jointly infer 3D geometric structure and semantic labels from sparse observations. Existing methods exhibit two categories of limitations:

Single-frame methods (MonoScene, CGFormer, etc.): Rely solely on the current frame's limited observations to recover 3D geometry and semantics, leaving the completion under-constrained.

Existing temporal methods (VoxFormer-T, HTCL, etc.): Naively stack historical frame features or align features via estimated camera poses, ignoring scene motion context and failing to achieve temporal consistency.

Core Problem: How to accurately identify correspondences between historical and current frames to effectively guide temporal SSC modeling?

Optical flow naturally encodes pixel-level inter-frame motion correspondences, capturing motion, viewpoint variation, occlusion, and deformation. This paper proposes to leverage optical flow to guide temporal modeling, injecting motion-aware and occlusion information into the SSC pipeline.

Method

Overall Architecture

FlowScene comprises four core modules:

  1. Image Encoder: RepViT + FPN to extract current frame features \(F_t\) and historical frame features \(F_{temp}\)
  2. Optical Flow Estimation (OFE): Pre-trained GMFlow generates bidirectional optical flow; forward-backward consistency check yields occlusion mask \(M\)
  3. Flow-Guided Temporal Aggregation (FGTA): Aligns and aggregates temporal features in 2D feature space using optical flow
  4. Occlusion-Guided Voxel Refinement (OGVR): Adaptively fuses voxel features in 3D voxel space using the occlusion mask

Optical Flow Estimation and Occlusion Detection

Flow-Guided Warping: GMFlow estimates the flow field \(Flow^{t \to t-i}\) from the current frame to historical frames; bilinear interpolation warps historical frame features to the current frame coordinate system:

\[F_{warp}^{t-i \to t} = \text{Warp}(F_{t-i}, Flow^{t \to t-i})\]
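In PyTorch terms this is standard backward warping via `grid_sample`: every current-frame pixel samples the historical feature map at its flow-displaced location. A minimal sketch (illustrative names and shapes, not the authors' code):

```python
import torch
import torch.nn.functional as F

def flow_warp(feat_prev: torch.Tensor, flow_fwd: torch.Tensor) -> torch.Tensor:
    """Backward-warp historical features F_{t-i} onto the current frame's grid.

    feat_prev: (B, C, H, W) historical feature map F_{t-i}
    flow_fwd:  (B, 2, H, W) flow Flow^{t -> t-i} in pixel units (u, v)
    """
    _, _, H, W = feat_prev.shape
    # Base pixel grid of the current frame.
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=feat_prev.dtype, device=feat_prev.device),
        torch.arange(W, dtype=feat_prev.dtype, device=feat_prev.device),
        indexing="ij",
    )
    # Each current-frame pixel (x, y) samples F_{t-i} at (x + u, y + v).
    grid_x = xs.unsqueeze(0) + flow_fwd[:, 0]
    grid_y = ys.unsqueeze(0) + flow_fwd[:, 1]
    # grid_sample expects (x, y) coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat_prev, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```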

Occlusion Detection: Classical forward-backward consistency check is employed. For each pixel, the forward flow maps it to the historical frame, and the backward flow maps it back. If the round-trip residual exceeds threshold \(\tau\), the pixel is marked as occluded:

\[M(x) = \begin{cases} 1 & \text{if } \|\Delta(x)\| > \tau \quad \text{(occluded)} \\ 0 & \text{otherwise} \end{cases}\]

This detection approach requires zero parameters and zero training overhead, yet effectively identifies occluded regions.
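A minimal sketch of the consistency check, reusing `flow_warp` from the sketch above; the threshold default is a placeholder, not the paper's value:

```python
def occlusion_mask(flow_fwd: torch.Tensor, flow_bwd: torch.Tensor,
                   tau: float = 1.0) -> torch.Tensor:
    """M = 1 where the forward-backward round trip is inconsistent.

    flow_fwd: (B, 2, H, W) Flow^{t -> t-i}
    flow_bwd: (B, 2, H, W) Flow^{t-i -> t}
    """
    # Bring the backward flow onto the current frame's grid; for co-visible
    # pixels, Flow^{t->t-i}(x) + Flow^{t-i->t}(x + Flow^{t->t-i}(x)) ~ 0.
    flow_bwd_at_t = flow_warp(flow_bwd, flow_fwd)
    residual = (flow_fwd + flow_bwd_at_t).norm(dim=1, keepdim=True)
    return (residual > tau).float()  # (B, 1, H, W); 1 = occluded
```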

Flow-Guided Temporal Aggregation (FGTA)

The FGTA module performs temporal feature alignment and aggregation in 2D image feature space, consisting of two sub-modules:

(1) Temporal Aggregation: For warped historical frame features, spatial position weights are computed via cosine similarity:

\[w_{t-i \to t}(P) = \text{similarity}(F_{warp}^{t-i \to t}(P), F_t(P))\]

Weighted aggregation then yields \(F_{agg} = \sum_{i=0}^{N} w_{t-i \to t} \cdot F_{warp}^{t-i \to t}\), where \(i = 0\) denotes the current frame and \(N\) is the number of historical frames (here \(N = 2\)). Intuitively, historical features that remain similar to the current frame after warping receive higher weights, achieving motion-aware feature fusion.
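A minimal sketch of this aggregation; normalizing the per-pixel weights with a softmax over frames is an assumption about a detail not spelled out here:

```python
import torch
import torch.nn.functional as F

def temporal_aggregate(feat_cur: torch.Tensor,
                       feats_warped: list[torch.Tensor]) -> torch.Tensor:
    """Similarity-weighted fusion of current and warped historical features.

    feat_cur:     (B, C, H, W) current features F_t
    feats_warped: N feature maps aligned to frame t, e.g.
                  [F_t, F_warp^{t-1 -> t}, F_warp^{t-2 -> t}]
    """
    # Per-pixel cosine similarity of each aligned frame to the current frame.
    sims = torch.stack(
        [F.cosine_similarity(f, feat_cur, dim=1) for f in feats_warped], dim=0
    )                                           # (N, B, H, W)
    w = torch.softmax(sims, dim=0)              # normalize weights across frames
    stacked = torch.stack(feats_warped, dim=0)  # (N, B, C, H, W)
    return (w.unsqueeze(2) * stacked).sum(dim=0)
```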

(2) Occlusion Cross-Attention: A Neighborhood Cross-Attention (NCA) mechanism selectively supplements current frame features with information from non-occluded regions of historical frames:

\[F_t = \text{NCA}(F_t, (1 - M) \cdot F_{warp})\]

Here the current frame features serve as queries, while warped features from non-occluded historical regions serve as keys/values. This prevents unreliable occluded region information from being injected into the current frame, while exploiting texture and contextual information that is visible in historical frames but occluded in the current frame.
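A rough sketch of the idea. The paper's NCA restricts each query to a local neighborhood; a global `nn.MultiheadAttention` stands in here for brevity, and the residual connection is an assumption:

```python
import torch
import torch.nn as nn

class OcclusionCrossAttention(nn.Module):
    """Queries come from F_t; keys/values from non-occluded warped history."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_cur, feat_warp, occ_mask):
        # occ_mask: (B, 1, H, W) with 1 = occluded; zero out unreliable history.
        kv = (1.0 - occ_mask) * feat_warp
        B, C, H, W = feat_cur.shape
        q = feat_cur.flatten(2).transpose(1, 2)  # (B, HW, C) queries
        kv = kv.flatten(2).transpose(1, 2)       # (B, HW, C) keys/values
        out, _ = self.attn(q, kv, kv)
        # Residual connection is an assumption; the paper writes F_t = NCA(...).
        return feat_cur + out.transpose(1, 2).reshape(B, C, H, W)
```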

Occlusion-Guided Voxel Refinement (OGVR)

After FGTA completes temporal fusion in 2D space, explicit geometric modeling in 3D space is still lacking. The OGVR module elevates occlusion awareness to the 3D voxel space.

First, via LSS (Lift-Splat-Shoot) view transformation, \(F_t\), \(F_{agg}\), and \(M\) are projected into 3D voxel space to obtain \(V_t\), \(V_{agg}\), and \(V_{mask}\), respectively.

The two voxel features are then adaptively fused using the occlusion mask:

\[V_{fine} = \frac{(1 - V_{mask}) \cdot V_{agg} + V_t}{(1 - V_{mask}) + 1}\]

Design rationale:
  • Non-occluded regions: Prioritize aggregated features \(V_{agg}\) (fusing multi-frame information)
  • Occluded regions: Use current frame features \(V_t\) (historical frames are unreliable in these areas)
  • Normalization: Ensures smooth feature transitions at occlusion boundaries, avoiding abrupt changes

The OGVR module introduces zero additional parameter overhead, achieving significant gains solely through occlusion-mask-based weighting.
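Since the fusion is a pure elementwise reweighting, it reduces to a couple of tensor operations (a minimal sketch with illustrative shapes):

```python
import torch

def ogvr_fuse(v_t: torch.Tensor, v_agg: torch.Tensor,
              v_mask: torch.Tensor) -> torch.Tensor:
    """Occlusion-guided voxel fusion; no learnable parameters.

    v_t, v_agg: (B, C, X, Y, Z) voxel features from F_t and F_agg
    v_mask:     (B, 1, X, Y, Z) occlusion mask in [0, 1], 1 = occluded
    """
    w = 1.0 - v_mask  # weight of the aggregated (multi-frame) branch
    # Fully occluded voxels (w = 0) keep V_t; visible voxels average V_agg and V_t.
    return (w * v_agg + v_t) / (w + 1.0)
```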

Loss & Training

The overall loss function comprises four terms:

\[\mathcal{L} = \lambda_{sem}\mathcal{L}_{scal}^{sem} + \lambda_{geo}\mathcal{L}_{scal}^{geo} + \lambda_{ce}\mathcal{L}_{ce} + \lambda_d\mathcal{L}_d\]
  • \(\mathcal{L}_{scal}\): Scene-class affinity loss from MonoScene (semantic + geometric)
  • \(\mathcal{L}_{ce}\): Class-frequency-weighted cross-entropy loss
  • \(\mathcal{L}_d\): Binary cross-entropy supervision on depth distribution (using LiDAR projection)

Key Experimental Results

SemanticKITTI Test Set

| Method | Input | IoU (%) | mIoU (%) |
|---|---|---|---|
| MonoScene | S | 34.16 | 11.08 |
| CGFormer | S | 44.41 | 16.63 |
| VoxFormer-T | T (5 frames) | 43.21 | 13.41 |
| HTCL | T (3 frames) | 44.23 | 17.09 |
| FlowScene | T (2 frames) | 45.20 | 17.70 |

(S = single-frame input, T = temporal input)

Using only 2 historical frames, FlowScene surpasses HTCL (3 frames) by +0.61% mIoU and +0.97% IoU, and outperforms the best single-frame method CGFormer by +1.07% mIoU.

SSCBench-KITTI-360 Test Set

| Method | IoU (%) | mIoU (%) |
|---|---|---|
| CGFormer | 48.07 | 20.05 |
| Symphonies | 44.12 | 18.58 |
| FlowScene | 46.95 | 20.81 |

Advantages are particularly pronounced on dynamic objects (car, truck, person, etc.), with dynamic object mIoU reaching 14.87%, far exceeding CGFormer's 11.37%.

Performance at Different Distance Ranges (SemanticKITTI Validation Set)

| Method | mIoU @ 12.8 m | mIoU @ 25.6 m | mIoU @ 51.2 m |
|---|---|---|---|
| VLScene | 26.51 | 24.37 | 17.83 |
| FlowScene | 27.63 | 24.65 | 18.13 |

FlowScene outperforms the strong prior method VLScene at every distance range, with the largest margin at close range (12.8 m).

Efficiency Comparison

| Method | mIoU (%) | Inference Time (s) | Parameters (M) |
|---|---|---|---|
| HTCL | 17.13 | 0.297 | 181.4 |
| BRGScene | 15.43 | 0.285 | 161.4 |
| FlowScene | 18.13 | 0.301 | 52.4 |

FlowScene has only 52.4M parameters, far fewer than HTCL (181.4M), with comparable inference speed.

Ablation Study

| Configuration | IoU (%) | mIoU (%) |
|---|---|---|
| Baseline (naive stacking) | 43.98 | 15.89 |
| + Flow Warping | 44.13 | 16.21 |
| + Occlusion Detection | 44.38 | 16.43 |
| + Temporal Aggregation | 44.63 | 17.23 |
| + Occlusion Cross-Attention | 44.42 | 17.08 |
| Full (+ OGVR) | 45.01 | 18.13 |

Each module contributes clearly, with a total gain of +2.24% mIoU and +1.03% IoU.

Temporal Frame Count Ablation: 2 frames is the optimal trade-off. As the frame count increases, optical flow prediction quality degrades (due to large inter-frame intervals), and performance actually decreases with 4–5 frames.

Optical Flow Network Ablation: GMFlow > FlowFormer > RAFT > PWC-Net, with GMFlow achieving the best performance at the smallest parameter count (4.7M).

Backbone Ablation: RepViT-M2.3 > EfficientNet-B7 > ResNet-50, with RepViT-M2.3 offering the best performance at the fewest parameters.

Highlights & Insights

Novelty

  1. Optical flow-guided temporal SSC: The first work to systematically introduce optical flow into SSC, moving beyond naive frame stacking or pose-based alignment.
  2. Occlusion-aware dual-space modeling: Occlusion information is exploited in both 2D (FGTA) and 3D (OGVR) spaces, forming a complete occlusion-handling pipeline.
  3. Efficient design: Achieves SOTA with only 2 historical frames and 52.4M parameters; OGVR introduces zero additional parameters.

Limitations & Future Work

  1. Relies on a pre-trained optical flow model (GMFlow); estimation errors propagate downstream.
  2. Evaluation is limited to KITTI-series datasets; experiments on multi-camera settings such as nuScenes are absent.
  3. The number of usable historical frames is constrained by optical flow quality — flow estimates for temporally distant frames are unreliable.

Personal Reflections

  • The intuition behind introducing optical flow into SSC is clear — optical flow encodes precise pixel-level motion correspondences, preserving temporal consistency better than naive stacking or pose estimation.
  • The use of occlusion masks is elegantly designed: in 2D space, they filter out unreliable features; in 3D space, they enable adaptive fusion — the two modules are complementary.
  • The OGVR weighting formula (Eq. 7) is simple yet effective, with the core strategy being a divide-and-conquer approach: use historical information in non-occluded regions, and current-frame information in occluded regions.
  • A noteworthy trend: compared to methods requiring 5 input frames, FlowScene achieves superior performance with only 2 frames, suggesting that "quality > quantity" — high-quality temporal alignment matters more than stacking more frames.
  • Future directions include combining optical flow guidance with BEV perception (e.g., BEVFormer) or occupancy networks (e.g., Occ3D).