Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising

Conference: ECCV 2024
Code: https://github.com/gtdong-ustc/multi-frame-tof-denoising
Area: Image Restoration
Keywords: Time-of-Flight Depth Denoising, Multi-frame Fusion, Dual-Correlation, Multi-Path Interference, Confidence Guidance

TL;DR

This paper proposes the first learning-based multi-frame ToF depth denoising framework. It uses the correlation between multi-frame ToF data to guide noise removal through a Dual-Correlation Estimation Module (exploiting intra-frame and inter-frame correlation) and a Confidence-guided Residual Regression Module, and it significantly outperforms existing single-frame methods in high-noise regions.

Background & Motivation

Background: Time-of-Flight (ToF) depth cameras acquire scene depth by measuring the round-trip time of light signals and are widely used in 3D perception, augmented reality, and autonomous driving. ToF depth maps suffer from two major types of noise: Multi-Path Interference (MPI) and shot noise. MPI occurs when light signals undergo multiple reflections before reaching the sensor, causing measured depth values to deviate from the ground truth; shot noise originates from the quantum statistical nature of photon detection. In recent years, deep learning-based ToF denoising methods have achieved impressive results.
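
For continuous-wave ToF cameras, the round-trip time is usually recovered from the phase shift of an amplitude-modulated signal. The following is a minimal sketch of that standard conversion, not code from the paper; the function names and the 20 MHz example frequency are illustrative assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second


def phase_to_depth(phase_rad: float, mod_freq_hz: float) -> float:
    """Depth in meters for a measured phase shift (continuous-wave ToF)."""
    # Light travels to the surface and back, hence 4*pi rather than 2*pi.
    return SPEED_OF_LIGHT * phase_rad / (4.0 * math.pi * mod_freq_hz)


def unambiguous_range(mod_freq_hz: float) -> float:
    """Largest depth in meters before the phase wraps around at 2*pi."""
    return SPEED_OF_LIGHT / (2.0 * mod_freq_hz)


if __name__ == "__main__":
    freq = 20e6  # hypothetical modulation frequency
    print(f"depth at phase pi/2: {phase_to_depth(math.pi / 2, freq):.3f} m")
    print(f"unambiguous range:   {unambiguous_range(freq):.3f} m")
```

Any error in the measured phase, whether a random shot-noise perturbation or a systematic MPI shift, maps linearly into a depth error under this conversion.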

Limitations of Prior Work: Existing ToF denoising methods almost exclusively process single-frame data and ignore correlation information across frames. In practice, however, ToF cameras capture data continuously at 30 fps or higher, so neighboring frames are strongly correlated: (1) Scene geometry remains static or changes slowly over short periods, allowing depth information across different frames to validate and complement each other; (2) ToF noise (especially shot noise) is random and independent across frames, so multi-frame averaging alone can significantly reduce it; (3) The intensity of MPI noise may vary across frames due to minor viewpoint changes, providing clues for identifying MPI regions.

Key Challenge: Multi-frame data contains rich complementary information for denoising, but existing methods use only single frames and therefore discard inter-frame correlation, a crucial source of information. Meanwhile, simple multi-frame averaging reduces random noise but is ineffective against systematic errors like MPI, so a more intelligent multi-frame fusion strategy is required.
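
The gap between random and systematic error can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not an experiment from the paper: one static pixel, MPI modeled as a constant per-pixel bias, shot noise modeled as zero-mean Gaussian, and all numeric values (1000 mm depth, 30 mm bias, 10 mm sigma) chosen arbitrarily.

```python
import random
import statistics


def simulate_pixel(true_depth_mm, mpi_bias_mm, shot_sigma_mm, n_frames, rng):
    """Average n_frames noisy measurements of one static pixel."""
    samples = [
        true_depth_mm + mpi_bias_mm + rng.gauss(0.0, shot_sigma_mm)
        for _ in range(n_frames)
    ]
    return statistics.fmean(samples)


def error_stats(n_frames, trials=2000, seed=0):
    """Return (mean error, spread of error) of the averaged depth in mm."""
    rng = random.Random(seed)
    errors = [
        simulate_pixel(1000.0, 30.0, 10.0, n_frames, rng) - 1000.0
        for _ in range(trials)
    ]
    return statistics.fmean(errors), statistics.pstdev(errors)


for n in (1, 3, 5):
    bias, spread = error_stats(n)
    print(f"{n} frame(s): mean error {bias:.1f} mm, spread {spread:.1f} mm")
```

Under these assumptions the spread shrinks roughly with the square root of the frame count, while the mean error stays near the injected bias. That persistent bias is the part that plain averaging cannot remove.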

Goal: (1) To design the first learning-based multi-frame ToF denoising framework. (2) To effectively exploit both intra-frame and inter-frame correlations to guide denoising. (3) To improve denoising performance in highly noisy regions (such as corners and edges with severe MPI).

Key Insight: The authors propose the concept of "dual-correlation", decomposing useful information in multi-frame ToF denoising into intra-correlation (spatial location and geometric structure correlation) and inter-correlation (noise distribution variations at the same location across different frames), and design specialized modules to extract and utilize these two correlations, respectively.

Core Idea: Utilizing intra-frame spatial correlation to initialize depth residuals and inter-frame noise distribution correlation to locate highly noisy regions, enabling targeted denoising via confidence guidance.

Method

Overall Architecture

The framework takes multiple consecutive frames of raw ToF data (including measurements at multiple phases/frequencies) as input and outputs a denoised depth map. The core pipeline consists of three stages: (1) Feature extraction: initial features are extracted from each frame independently; (2) Dual-correlation estimation: the Dual-Correlation Estimation Module estimates intra-frame and inter-frame correlations simultaneously, where the former assists in initializing depth residuals (i.e., the difference between ground truth and measured depth) and the latter helps localize highly noisy regions; (3) Confidence-guided residual regression: the Confidence-guided Residual Regression Module generates a confidence map from the inter-frame correlation, which guides the residual regression to prioritize highly noisy regions and ultimately yields the denoised depth map.

Key Designs

Intra-Correlation Estimation:
- Function: Establish correlations between spatial locations and geometric structures in the scene to assist in depth residual initialization.
- Mechanism: Within a single frame of ToF data, structural relationships exist among depth values at different spatial locations (e.g., points on the same plane should share consistent depth gradients, and object boundaries correspond to depth discontinuities). The intra-correlation module explicitly builds a spatial correlation matrix using a convolutional attention mechanism, where each element represents the geometric correlation strength between two locations. This correlation matrix is then used to weight and aggregate features from different locations, generating a global context-aware feature representation. This context-enhanced feature is subsequently utilized to initialize depth residual estimations. Compared to directly predicting residuals from individual pixels, utilizing spatial correlation yields more accurate initial estimates, especially in regions with severe local depth deviations caused by MPI, where information from correct surrounding regions can propagate through the correlation to correct errors.
- Design Motivation: MPI noise is spatially localized (typically concentrated in corners and concave surfaces), while geometric structure information can help identify which regions contain reliable depth values. By explicitly modeling spatial correlation, the network can leverage information from reliable regions to correct unreliable ones.
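
The weighting-and-aggregation step described above can be sketched with plain dot-product attention over flattened spatial positions. This is a generic stand-in under stated assumptions, not the authors' convolutional attention module: the function name, the softmax normalization, and the scaling by the square root of the channel count are all illustrative choices.

```python
import numpy as np


def spatial_correlation_aggregate(features):
    """Aggregate features with a pairwise spatial correlation matrix.

    features: (H, W, C) feature map from one frame.
    Returns (weights, context): weights is (H*W, H*W) with rows summing
    to 1, and context is the (H, W, C) correlation-weighted feature map.
    """
    h, w, c = features.shape
    flat = features.reshape(h * w, c).astype(np.float64)
    # Similarity between every pair of spatial locations.
    scores = flat @ flat.T / np.sqrt(c)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each location becomes a weighted mix of all locations' features.
    context = weights @ flat
    return weights, context.reshape(h, w, c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 5, 8))  # hypothetical tiny feature map
    weights, context = spatial_correlation_aggregate(feats)
    print(weights.shape, context.shape)
```

The full correlation matrix grows quadratically with the number of pixels, so a dense version like this is only practical on small or downsampled feature maps.
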
Inter-Correlation Estimation:
- Function: Distinguish variations in ToF noise distribution across different frames to localize highly noisy regions.
- Mechanism: ToF noise behavior varies across different frames: shot noise is random and independent across frames, while MPI noise, though more systematic, varies in intensity across frames due to minor sensor movements or scene variations. The inter-correlation module aligns and compares the current frame with reference frames (adjacent or multiple frames) to calculate depth variations at the same spatial location across frames. Locations with high variation correspond to strong noise (as true depth should stay constant or change minimally within short periods), while locations with small variation indicate weak noise. This inter-frame variation analysis produces a "noise intensity map" indicating the noise level of each location, serving as the basis for subsequent confidence guidance.
- Design Motivation: Single-frame methods cannot distinguish between "measurements shifted by MPI" and "correct measurements," as they can look identical in a single frame. However, across multiple frames, MPI-induced shifts are usually more stable (systematic errors), whereas shot noise shows random frame-to-frame variations. Inter-frame correlation exploits this difference to localize different types of noise regions.
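
The idea of turning inter-frame variation into a noise intensity map can be sketched with a per-pixel temporal standard deviation. This is a hand-crafted proxy for the learned module, offered only as an illustration; it assumes the frames are already aligned, and the function name and the normalization to [0, 1] are assumptions.

```python
import numpy as np


def noise_intensity_map(aligned_depths):
    """Per-pixel temporal variation, normalized to [0, 1].

    aligned_depths: (T, H, W) stack of pre-aligned depth frames.
    """
    variation = aligned_depths.std(axis=0)
    peak = variation.max()
    if peak == 0:
        # A perfectly static, noise-free stack has no variation anywhere.
        return np.zeros_like(variation)
    return variation / peak


if __name__ == "__main__":
    stack = np.full((3, 4, 4), 1000.0)      # three frames of a flat wall, in mm
    stack[:, 1, 1] = [990.0, 1000.0, 1015.0]  # one pixel fluctuates across frames
    print(noise_intensity_map(stack))
```

A purely statistical proxy like this responds to frame-to-frame fluctuation only, so an error that stays constant across frames would not show up in it.
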
Confidence-guided Residual Regression Module:
- Function: Guide residual regression to prioritize highly noisy regions based on noise intensity distribution.
- Mechanism: Based on the noise intensity map generated by inter-frame correlation estimation, this module predicts a confidence map, where each pixel value indicates "how reliable the initial residual estimation is at that location." Low-confidence regions (i.e., highly noisy areas) require more residual correction, while high-confidence regions can retain their initial estimates. Specifically, the confidence map acts as attention weights for the residual regression network. By multiplying the residual regression outputs pixel-wise with the confidence map, the network's regression capacity is concentrated on regions requiring the most correction. This "soft focus" mechanism avoids wasting computational resources on already accurate areas while ensuring highly noisy regions receive sufficient correction.
- Design Motivation: Uniformly performing residual regression across the entire depth map is inefficient, as noise is weak in most areas (where initial estimates are already sufficient) and only a few highly noisy regions require major corrections. The confidence-guided mechanism achieves "on-demand denoising," improving efficiency and effectiveness.
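
A minimal sketch of the "soft focus" idea follows. It assumes the confidence is one minus a normalized noise intensity and that the refinement is weighted by one minus the confidence, so low-confidence pixels receive the most correction. The paper's exact formulation may differ, and every name here is hypothetical.

```python
import numpy as np


def confidence_guided_depth(measured, initial_residual, refinement, noise_map):
    """Combine an initial residual with a confidence-weighted refinement.

    All inputs are (H, W) arrays; noise_map is expected in [0, 1].
    Returns (denoised_depth, confidence).
    """
    # Assumption: high noise intensity means low confidence.
    confidence = 1.0 - np.clip(noise_map, 0.0, 1.0)
    # Assumption: the refinement is applied in proportion to (1 - confidence).
    correction_weight = 1.0 - confidence
    residual = initial_residual + correction_weight * refinement
    return measured + residual, confidence


if __name__ == "__main__":
    measured = np.full((2, 2), 1000.0)    # measured depth in mm
    initial = np.zeros((2, 2))            # initial residual estimate
    refinement = np.full((2, 2), -20.0)   # extra correction proposed everywhere
    noise = np.array([[0.0, 1.0], [0.5, 0.0]])
    denoised, conf = confidence_guided_depth(measured, initial, refinement, noise)
    print(denoised)
```

In this toy example only the pixels flagged as noisy move away from the measured depth, which is the behavior the module description calls on-demand denoising.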

Loss & Training

The loss function consists of two main parts: (1) Depth reconstruction loss: the \(L_1/L_2\) distance between the denoised depth map and the ground truth depth map; (2) Confidence supervision loss: optional weak supervision on the confidence map to align it with the actual noise distribution. The training data comes from both synthetic multi-frame ToF data (which allows precise control over MPI and shot noise parameters) and real-world ToF datasets. Multi-frame input typically consists of 3 to 5 consecutive frames.
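
The two-part objective can be written down as a short sketch. The reconstruction term follows the \(L_1\) option named above; the form of the confidence term (also \(L_1\)) and the weight `conf_weight=0.1` are assumptions made for illustration, not values from the paper.

```python
import numpy as np


def total_loss(pred_depth, gt_depth, pred_conf=None, target_conf=None,
               conf_weight=0.1):
    """L1 depth reconstruction loss plus an optional confidence term."""
    recon = np.abs(pred_depth - gt_depth).mean()
    if pred_conf is None or target_conf is None:
        # Confidence supervision is optional; fall back to reconstruction only.
        return float(recon)
    # Assumption: confidence is supervised with an L1 distance as well.
    conf_term = np.abs(pred_conf - target_conf).mean()
    return float(recon + conf_weight * conf_term)


if __name__ == "__main__":
    pred = np.array([[1.0, 3.0]])
    gt = np.array([[2.0, 1.0]])
    print(total_loss(pred, gt))
    print(total_loss(pred, gt, np.array([[1.0, 0.0]]), np.array([[0.0, 0.0]])))
```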

Key Experimental Results

Main Results

| Dataset | Metric | Ours (Multi-frame) | Prev. SOTA (Single-frame) | Gain |
| --- | --- | --- | --- | --- |
| Synthetic ToF Dataset | MAE (mm) | Best | Single-frame methods | Significantly reduced |
| Real ToF Dataset | MAE (mm) | Best | Single-frame methods | Especially evident in high-noise regions |
| Severe MPI Scenes | MAE (mm) | Significantly outperforms single-frame | Single-frame methods | Multi-frame info contributes most to MPI denoising |
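
For reference, the MAE metric used in the table, optionally restricted to a region such as high-noise pixels, can be computed as in the sketch below. The function name and the boolean-mask interface are assumptions; the paper's evaluation protocol may define regions differently.

```python
import numpy as np


def mae_mm(pred_mm, gt_mm, mask=None):
    """Mean absolute depth error in mm, optionally over a boolean mask."""
    err = np.abs(np.asarray(pred_mm, dtype=float) - np.asarray(gt_mm, dtype=float))
    if mask is not None:
        err = err[np.asarray(mask, dtype=bool)]
    if err.size == 0:
        # No pixels selected, so the metric is undefined.
        return float("nan")
    return float(err.mean())


if __name__ == "__main__":
    pred = np.array([[1000.0, 1010.0], [990.0, 1000.0]])
    gt = np.full((2, 2), 1000.0)
    noisy_region = np.array([[False, True], [False, False]])
    print(mae_mm(pred, gt))                # error over the whole map
    print(mae_mm(pred, gt, noisy_region))  # error over the masked region only
```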

Ablation Study

| Configuration | Key Metric | Description |
| --- | --- | --- |
| Intra-correlation only | High MAE | Lacks inter-frame information, unable to locate noise regions |
| Inter-correlation only | High MAE | Lacks spatial context, inaccurate residual initialization |
| Dual-correlation (Full) | Lowest MAE | The two correlations are complementary |
| Without confidence guidance | High MAE | Inefficient uniform regression, insufficient correction in high-noise regions |
| With confidence guidance | Lowest MAE | Effectively focuses on high-noise regions |
| Different frame counts (1/3/5) | MAE decreases progressively | More frames provide more complementary information |

Key Findings

- The multi-frame approach outperforms single-frame methods in all evaluation settings, validating the value of inter-frame information.
- The advantage of the multi-frame approach is most prominent in severe MPI regions (corners, concave surfaces), precisely the most challenging scenario for single-frame methods.
- The contributions of intra-frame and inter-frame correlations are complementary; omitting either results in a significant performance drop.
- The confidence guidance mechanism successfully focuses regression capacity on highly noisy regions.
- As the number of input frames increases, performance continues to improve, but with diminishing marginal returns (3 frames already capture most of the gain).

Highlights & Insights

- The "dual-correlation" formulation offers a clear division of labor: intra-frame correlation captures spatial structure, and inter-frame correlation localizes noise.
- Confidence-guided residual regression yields an elegant "on-demand denoising" design that avoids wasting computation on low-noise areas.
- It presents the first learning-based multi-frame ToF denoising framework, opening a new line of research in this direction.
- The methodological contributions (dual-correlation + confidence guidance) are generic and can potentially be extended to other multi-frame denoising tasks.

Limitations & Future Work

- Multi-frame inputs require inter-frame alignment, and alignment errors in dynamic scenes may introduce additional noise.
- Computational overhead scales linearly with the number of frames, so real-time execution needs attention, especially on embedded ToF systems.
- Only short-term temporal windows (3-5 frames) are considered for inter-frame correlation; exploring longer-term temporal information is a valuable direction.
- The handling of moving objects remains unclear, as inter-frame alignment assumes largely static scenes.
- Domain adaptation from synthetic training data to real-world data requires more thorough validation.
- Comparisons against traditional multi-frame averaging methods (e.g., temporal filters) could be more detailed.

Related Work & Insights

- Single-frame ToF denoising: DeepToF, FLAT, and others use CNNs to learn MPI patterns.
- Multi-frame depth fusion: Multi-frame fusion methods exist in the RGB-D and LiDAR domains; this paper introduces the concept to ToF.
- Confidence estimation in stereo matching: Confidence maps have been successfully used to guide residual regression.
- Video denoising: Methods such as VRT and RVRT use alignment and fusion techniques.
- Insight: The concept of dual-correlation can be generalized to multi-frame denoising of other sensor modalities (e.g., multi-frame RAW image denoising, multi-frame LiDAR point cloud denoising).

Rating

Novelty: ⭐⭐⭐⭐ First multi-frame ToF denoising framework, proposing a novel perspective on dual-correlation decomposition.
Experimental Thoroughness: ⭐⭐⭐ Validated on synthetic and real-world data with ablations, though coverage of real-world scenarios could be expanded.
Writing Quality: ⭐⭐⭐⭐ Clear problem analysis and well-structured methodology description.
Value: ⭐⭐⭐⭐ Establishes a new multi-frame research direction for ToF depth denoising with highly generalizable methodology.