GDFusion: Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction
Conference: CVPR 2025
arXiv: 2504.12959
Code: https://cdb342.github.io/GDFusion
Area: Autonomous Driving / Occupancy Prediction
Keywords: 3D Semantic Occupancy Prediction, Temporal Fusion, Gradient Descent RNN, Motion Compensation, Scene Adaptation
TL;DR
GDFusion reinterprets the RNN update as gradient descent in feature space, which lets it fuse four heterogeneous types of temporal information (voxel-level, scene-level, motion, geometry) for VisionOcc in a single framework. It improves mIoU by 1.4%-4.8% on Occ3D while reducing GPU memory by 27%-72%.
Background & Motivation
Background: Temporal information is increasingly important in vision-based 3D semantic occupancy prediction (VisionOcc), but existing methods focus only on voxel-level feature fusion.
Limitations of Prior Work: Three temporal cues are neglected: scene-level consistency priors (unchanged weather/lighting in the short term), historical motion information to correct ego-motion alignment errors of the current frame, and historical geometric information to complement depth estimation of the current frame.
Key Challenge: The four types of temporal information have completely different representations (3D feature maps, network parameters, 3D flow fields, and probabilistic point clouds), making them difficult to fuse in a unified manner.
Core Idea: Reinterpret the vanilla RNN update \(h^t = Ah^{t-1} + Bx^t\) as a gradient descent step that minimizes \(\|Ah^{t-1} - Bx^t\|^2\). Heterogeneous representations can then be fused in a unified framework by designing a specific loss function for each of them.
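To see why a loss of this kind leads to an RNN-like update, it helps to write out one gradient step with respect to the hidden state. The step size \(\eta\) and the choice of \(h\) as the optimization variable are assumptions made here for illustration; the paper's exact loss and parametrization may differ.

\[
\begin{aligned}
\mathcal{L}(h) &= \|Ah - Bx^t\|^2, \\
\nabla_h \mathcal{L}(h) &= 2A^\top (Ah - Bx^t), \\
h^t &= h^{t-1} - \eta \nabla_h \mathcal{L}(h^{t-1}) \\
    &= (I - 2\eta A^\top A)\, h^{t-1} + 2\eta A^\top B\, x^t .
\end{aligned}
\]

Under these assumptions the step is again a linear recurrence \(h^t = \tilde{A} h^{t-1} + \tilde{B} x^t\), with \(\tilde{A} = I - 2\eta A^\top A\) and \(\tilde{B} = 2\eta A^\top B\). In this sense an RNN update and a gradient step on a discrepancy loss have the same form.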
Method
Key Designs
- Scene-level Temporal Fusion: Encodes scene information into trainable network parameters \(\mathbf{S}^t\) (including scale/shift of LayerNorm and linear layers), and updates the parameters frame-by-frame during inference via self-supervised reconstruction loss to adapt to the current scene.
- Motion Temporal Fusion: Learns displacement offsets \(\mathbf{M}^t\) to compensate for dynamic object motion and ego-motion estimation errors, where historical motion gradients correct the current frame predictions.
- Geometric Temporal Fusion: Fuses historical depth probability distributions (geometric priors from 2D-to-3D lifting) with the current frame to enhance depth estimation quality.
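The scene-level design above can be illustrated with a minimal sketch of test-time parameter adaptation: a scale and a shift are carried from frame to frame and nudged by one gradient step per frame. Everything here is hypothetical and not taken from the paper's code: the function names, the scalar scale/shift (a stand-in for LayerNorm parameters), the `target` list (a stand-in for whatever the self-supervised reconstruction task predicts), and the fixed learning rate.

```python
# Hypothetical sketch of scene-level adaptation: the "scene state" is just
# (scale, shift), updated by one gradient step per incoming frame.

def affine(feat, scale, shift):
    """Apply a scalar scale/shift to every feature value."""
    return [scale * v + shift for v in feat]

def recon_loss(scale, shift, feat, target):
    """Mean squared reconstruction error of the affine-transformed features."""
    pred = affine(feat, scale, shift)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(feat)

def adapt_scene_params(scale, shift, feat, target, lr=0.1):
    """One gradient descent step on recon_loss with respect to (scale, shift)."""
    n = len(feat)
    residual = [p - t for p, t in zip(affine(feat, scale, shift), target)]
    grad_scale = 2.0 / n * sum(r * f for r, f in zip(residual, feat))
    grad_shift = 2.0 / n * sum(residual)
    return scale - lr * grad_scale, shift - lr * grad_shift

# Toy "scene": the same statistics persist across 20 consecutive frames.
feat = [0.0, 1.0, 2.0, 3.0]
target = [2.0 * v + 1.0 for v in feat]  # assumed self-supervised target
scale, shift = 1.0, 0.0
losses = [recon_loss(scale, shift, feat, target)]
for _ in range(20):
    scale, shift = adapt_scene_params(scale, shift, feat, target)
    losses.append(recon_loss(scale, shift, feat, target))
```

In this toy setting the reconstruction loss shrinks as frames arrive, and the only thing carried across frames is the pair of parameters, which is the sense in which scene information lives in the parameters rather than in stored features.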
Loss & Training
Each temporal fusion process is cast in the same gradient descent form: compute a discrepancy loss between the current-frame representation and the historical state, then add the resulting gradient to the current representation as a temporal residual. The whole process is differentiable and maintains a historical state no larger than a single frame.
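A minimal sketch of one such fusion step is given below, assuming the simplest possible discrepancy loss, \(\tfrac{1}{2}\|h - x\|^2\), and a fixed step size. Both are assumptions for illustration; the paper designs a separate loss for each representation, and `fuse_step` is a hypothetical name.

```python
# Hypothetical sketch: temporal fusion as one gradient step on a discrepancy loss.
# Assumed loss: L(h) = 0.5 * ||h - x||^2, so dL/dh = h - x.

def fuse_step(history, current, lr=0.5):
    """Move the historical state one gradient step toward the current frame."""
    grad = [h - x for h, x in zip(history, current)]
    return [h - lr * g for h, g in zip(history, grad)]

# Stream three identical frames through the recurrence.
state = [0.0, 0.0]
frames = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
trajectory = []
for frame in frames:
    state = fuse_step(state, frame)
    trajectory.append(state)
```

With this particular loss the update reduces to `(1 - lr) * h + lr * x`, an exponential moving average, and the state always has the size of one frame no matter how many frames have been seen.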
Key Experimental Results
Main Results
| Baseline | Original mIoU (%) | +GDFusion mIoU (%) | GPU Memory Change |
|---|---|---|---|
| FB-Occ | 39.2 | 40.6 (+1.4) | -27% |
| COTR | 42.4 | 44.8 (+2.4) | -72% |
| SurroundOcc | 20.6 | 34.6 (+14.0) | - |
Key Findings
- The four temporal cues make complementary contributions.
- The gradient descent perspective enables the fusion of heterogeneous representations.
- Memory efficiency is significantly superior to methods like SOLOFusion.
- Achieves a 14.0-point mIoU gain on SurroundOcc (from 20.6% to 34.6%), showing how much headroom temporal fusion offers for weak baselines.
- Also achieves a 6.3% mIoU improvement on OpenOccupancy with almost negligible inference overhead.
Highlights & Insights
- The reinterpretation of RNN as gradient descent is highly elegant.
- Plug-and-play, applicable to various VisionOcc baselines.
- The "test-time adaptation" concept behind scene-level fusion is novel. Only a single-frame-sized historical state is maintained, so memory efficiency is far better than that of SOLOFusion, which must store multi-frame features.
- By designing specific loss functions to quantify the discrepancies between different temporal representations and the current frame, it achieves an elegant and unified fusion.
Limitations & Future Work
- The self-supervised task design for scene-level fusion is relatively simple.
- There is no explicit supervision for motion information, relying on indirect learning.
- The representational capacity of scene-adaptive parameters (LayerNorm scale/shift + linear layers) is limited, which may not capture complex scene changes.
- Geometric temporal fusion depends on the quality of historical depth estimation; errors may accumulate when depth estimates are wrong across multiple consecutive frames.
- Currently only validated on the nuScenes dataset; its performance on larger and more diverse datasets (such as Waymo) remains to be confirmed.
- The gradient descent view introduces step sizes (learning rates), and in practice the results may be sensitive to how they are chosen.
- In scenarios with dramatic weather changes (e.g., heavy rain, dense fog), the adaptation speed of scene-level fusion may not be sufficiently fast.
- The relationship with world-model-based methods (e.g., OccWorld) is worth further exploration.
Rating
- Novelty: 9/10. Unifies heterogeneous temporal fusion through a gradient descent view of RNNs
- Technical Depth: 9/10. Theoretical derivation plus four types of fusion
- Experimental Thoroughness: 8/10. Multiple baselines on three benchmarks
- Writing Quality: 8/10