GDFusion: Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction

Conference: CVPR 2025
arXiv: 2504.12959
Code: https://cdb342.github.io/GDFusion
Area: Autonomous Driving / Occupancy Prediction
Keywords: 3D Semantic Occupancy Prediction, Temporal Fusion, Gradient Descent RNN, Motion Compensation, Scene Adaptation

TL;DR

GDFusion reinterprets the RNN update as a gradient descent step in feature space, which lets it fuse four heterogeneous types of temporal information (voxel-level, scene-level, motion, geometry) for VisionOcc in a unified way. It improves mIoU by 1.4%-4.8% on Occ3D while reducing GPU memory by 27%-72%.

Background & Motivation

Background: Temporal information is increasingly important in vision-based 3D semantic occupancy prediction (VisionOcc), but existing methods focus only on voxel-level feature fusion.

Limitations of Prior Work: Three temporal cues are neglected: scene-level consistency priors (unchanged weather/lighting in the short term), historical motion information to correct ego-motion alignment errors of the current frame, and historical geometric information to complement depth estimation of the current frame.

Key Challenge: The four types of temporal information have completely different representations (3D feature maps, network parameters, 3D flow fields, and probabilistic point clouds), making them difficult to fuse in a unified manner.

Core Idea: Reinterpret the vanilla RNN update \(h^t = Ah^{t-1} + Bx^t\) as a gradient descent step that minimizes \(\|Ah^{t-1} - Bx^t\|^2\). Under this view, each type of temporal information gets its own loss function, so heterogeneous representations can be fused within one framework.
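To make the link between a gradient step and a linear recurrence concrete, the following worked derivation takes the objective exactly as stated above. The step size \(\eta\) and the effective matrices \(\tilde{A}\), \(\tilde{B}\) are notation introduced here for illustration and are not necessarily the paper's.

```latex
\begin{aligned}
\mathcal{L}(h) &= \|Ah - Bx^t\|^2, \\
\nabla_h \mathcal{L}(h) &= 2A^\top (Ah - Bx^t), \\
h^t &= h^{t-1} - \eta \,\nabla_h \mathcal{L}(h^{t-1}) \\
    &= \underbrace{(I - 2\eta A^\top A)}_{\tilde{A}}\, h^{t-1}
     + \underbrace{2\eta A^\top B}_{\tilde{B}}\, x^t .
\end{aligned}
```

The result has the same linear form as the RNN update, \(h^t = \tilde{A}h^{t-1} + \tilde{B}x^t\). In the special case \(A = B = I\), it reduces to \(h^t = (1 - 2\eta)h^{t-1} + 2\eta x^t\), an interpolation between the historical state and the current input.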

Method

Key Designs

Scene-level Temporal Fusion: Encodes scene information into trainable network parameters \(\mathbf{S}^t\) (the scale/shift of LayerNorm and linear layers) and updates them frame by frame during inference via a self-supervised reconstruction loss, so that the network adapts to the current scene.
Motion Temporal Fusion: Learns displacement offsets \(\mathbf{M}^t\) to compensate for dynamic object motion and ego-motion estimation errors; gradients from historical motion correct the current-frame prediction.
Geometric Temporal Fusion: Fuses historical depth probability distributions (geometric priors from 2D-to-3D lifting) with those of the current frame to improve depth estimation quality.
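As an illustration of the scene-level item above, here is a minimal sketch of a test-time gradient step on per-channel scale/shift parameters. It is a toy under stated assumptions, not the paper's implementation: the squared-error loss, the `target` argument (a stand-in for a self-supervised reconstruction target), and the function names `adapt_scene_params` and `reconstruction_loss` are all hypothetical.

```python
from typing import List, Tuple


def adapt_scene_params(
    scale: List[float],
    shift: List[float],
    feat: List[float],
    target: List[float],
    lr: float = 0.1,
) -> Tuple[List[float], List[float]]:
    """One gradient step on per-channel scale/shift parameters.

    Assumed toy loss: L = 0.5 * sum((scale * feat + shift - target) ** 2).
    `target` is a hypothetical stand-in for a self-supervised target.
    """
    new_scale, new_shift = [], []
    for s, b, x, r in zip(scale, shift, feat, target):
        err = s * x + b - r                 # per-channel reconstruction error
        new_scale.append(s - lr * err * x)  # dL/ds = err * x
        new_shift.append(b - lr * err)      # dL/db = err
    return new_scale, new_shift


def reconstruction_loss(scale, shift, feat, target) -> float:
    """Toy squared-error loss used by the step above."""
    return 0.5 * sum(
        (s * x + b - r) ** 2 for s, b, x, r in zip(scale, shift, feat, target)
    )


# The scene state carried across frames is only (scale, shift).
scale, shift = [1.0, 1.0], [0.0, 0.0]
# (features, hypothetical target) per frame; the same scene repeated 3 times.
frames = [([2.0, 1.0], [3.0, 0.5])] * 3
losses = []
for feat, target in frames:
    losses.append(reconstruction_loss(scale, shift, feat, target))
    scale, shift = adapt_scene_params(scale, shift, feat, target)
```

In this toy setting the loss recorded in `losses` shrinks from frame to frame, which mirrors the idea that parameters adapted on earlier frames carry scene information forward to later ones.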

Loss & Training

Each temporal fusion process is cast in the same gradient descent form: compute a discrepancy loss between the current-frame representation and the historical state, then add the resulting gradient to the current representation as a temporal residual. The whole process is differentiable and maintains a historical state no larger than a single frame.
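The recipe described above can be sketched in a few lines. This is a minimal illustration, assuming a squared-error discrepancy loss and a fixed step size `lr`; the names `fuse_step` and `run_sequence` are hypothetical, and the actual losses in the method are specific to each type of temporal information.

```python
from typing import List, Sequence


def fuse_step(current: List[float], history: List[float], lr: float = 0.5) -> List[float]:
    """One gradient-descent fusion step.

    Assumed loss: L = 0.5 * sum((current - history) ** 2), so
    dL/dcurrent = current - history. The negative gradient, scaled by lr,
    is added to the current representation as a temporal residual.
    """
    grad = [c - h for c, h in zip(current, history)]
    return [c - lr * g for c, g in zip(current, grad)]


def run_sequence(frames: Sequence[List[float]], lr: float = 0.5) -> List[float]:
    """Fuse a sequence of frames while keeping one single-frame-sized state."""
    history = list(frames[0])
    for frame in frames[1:]:
        # The fused result replaces the old state; no frame buffer is kept.
        history = fuse_step(frame, history, lr)
    return history


fused = fuse_step([1.0, 3.0], [0.0, 1.0], lr=0.5)
final_state = run_sequence([[0.0], [2.0], [4.0]], lr=0.5)
```

With this particular loss the step is a convex combination of the current frame and the historical state, `(1 - lr) * current + lr * history`. Only `history` persists between frames, which is the source of the memory advantage over approaches that store multi-frame features.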

Key Experimental Results

Main Results

Baseline	Original mIoU (%)	+GDFusion mIoU (%)	GPU Memory Change
FB-Occ	39.2	40.6 (+1.4)	-27%
COTR	42.4	44.8 (+2.4)	-72%
SurroundOcc	20.6	34.6 (+14.0)	-

Key Findings

The four temporal cues make complementary contributions.
The gradient descent perspective enables the fusion of heterogeneous representations.
Memory efficiency is significantly superior to methods like SOLOFusion.
Achieves a 14.0-point mIoU gain on SurroundOcc (from 20.6% to 34.6%), suggesting that temporal fusion leaves substantial headroom for weaker baselines.
Also achieves a 6.3% mIoU improvement on OpenOccupancy with almost negligible inference overhead.

Highlights & Insights

The reinterpretation of RNN as gradient descent is highly elegant.
Plug-and-play, applicable to various VisionOcc baselines.
The "test-time adaptation" concept behind scene-level fusion is novel. Because only a single-frame-sized historical state is maintained, memory efficiency is far better than that of SOLOFusion, which must store multi-frame features.
By designing specific loss functions to quantify the discrepancies between different temporal representations and the current frame, it achieves an elegant and unified fusion.

Limitations & Future Work

The self-supervised task design for scene-level fusion is relatively simple.
There is no explicit supervision for motion information, relying on indirect learning.
The representational capacity of scene-adaptive parameters (LayerNorm scale/shift + linear layers) is limited, which may not capture complex scene changes.
Geometric temporal fusion depends on the quality of historical depth estimation; errors may accumulate when depth estimates are wrong across multiple consecutive frames.
Currently validated only on nuScenes-based benchmarks; performance on larger and more diverse datasets (such as Waymo) remains to be confirmed.
In practice, the behavior of the gradient descent formulation may be sensitive to the choice of learning rates.
In scenarios with dramatic weather changes (e.g., heavy rain, dense fog), the adaptation speed of scene-level fusion may not be sufficiently fast.
The relationship with world-model-based methods (e.g., OccWorld) is worth further exploration.

Rating

Novelty: 9/10 (a gradient descent view of the RNN update that unifies temporal fusion)
Technical Depth: 9/10 (theoretical derivation plus four types of fusion)
Experimental Thoroughness: 8/10 (multiple baselines on three benchmarks)
Writing Quality: 8/10