Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
Conference: CVPR 2026
Paper: CVF Open Access
Code: Open-sourced as promised (paper, models, and project page); repository URL pending
Area: 3D Vision
Keywords: Depth Completion, dToF Sensors, Metric Depth, Zero-shot Generalization, Cross-modal Fusion
TL;DR
To handle the extremely sparse, low-resolution, and noisy depth maps produced by direct Time-of-Flight (dToF) sensors, this paper proposes a depth-guided dual-branch ViT encoder with masked joint attention. Sparse depth guides RGB features unidirectionally, so the depth features are never contaminated by RGB cues, and a lightweight DPT decoder directly outputs dense metric depth. Trained entirely on synthetic data from a simulation pipeline covering flash and rotating dToF sensors, the model generalizes zero-shot across 6 datasets and 3 real dToF devices, matching or exceeding the prior state of the art while running about 20× faster and using about 10× less VRAM.
Background & Motivation
Background: Dense metric depth is essential for VR/XR, robotics, and 3D perception. Monocular depth foundation models (e.g., Depth Anything, MoGe) exhibit strong in-the-wild generalization but use scale-invariant loss, resulting in scale ambiguity and a failure to recover reliable absolute metric depth in complex real scenes.
Limitations of Prior Work: A common remedy is integrating sparse dToF measurements (LiDAR, low-resolution ToF) to anchor the true scale. However, existing depth completion methods have two major flaws: (1) Sensor-specific coupling—designs tailored for specific sampling patterns (fixed-line LiDAR, fixed-res ToF) break when devices change or when depth becomes extremely sparse/noisy. (2) Heavy computation—recent high-accuracy methods utilizing diffusion (Marigold-DC), iterative optimization (OMNI-DC), or multi-stage refinement (PriorDA) involve massive computational overhead, making them impractical for mobile or real-time scenarios.
Key Challenge: Existing methods treat "sparse depth as an auxiliary signal" and rely primarily on pre-trained monocular encoders, failing to model the complementary and mutually constraining structures between RGB appearance and geometric measurements. Consequently, it is difficult to balance generalization (cross-device/sparsity) and efficiency (avoiding heavy refinement).
Goal: Develop a single model that covers various sparse dToF devices and extreme sparsity/noise scenarios while remaining lightweight. This requires: (1) Ensuring sparse depth truly "guides" image features without being overwhelmed; (2) Producing high-quality dense metric depth without diffusion, iteration, or multi-stage refinement; (3) Overcoming the scarcity of paired training data.
Key Insight + Core Idea: The authors observe that simple concatenation or standard cross-attention allows unreliable RGB cues to contaminate accurate sparse depth. The Core Idea is to use a unidirectional masked joint attention that only allows "depth → image" guidance while blocking "image → depth" feedback. Combined with a simulation pipeline generating millions of synthetic scenes covering various dToF modalities, a zero-shot robust lightweight model is trained.
Method
Overall Architecture
The input consists of a high-resolution RGB image \(I \in \mathbb{R}^{H\times W\times 3}\) and a sparse depth map \(Z \in \mathbb{R}^{H\times W}\). The goal is to output dense metric depth \(\tilde{D} \in \mathbb{R}^{H\times W}\) and a validity mask \(\tilde{M}\) (marking unreliable areas such as sky or reflections). The pipeline follows four steps: (1) heterogeneous dToF depth is preprocessed into a unified representation with log-normalization; (2) a dual-branch ViT encoder separately encodes RGB and sparse depth, using masked joint attention for controlled fusion; (3) a lightweight DPT decoder predicts normalized depth and masks; (4) inverse normalization using the recorded scale parameters restores metric depth. The entire system is trained on data from a dToF simulation pipeline.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Input: RGB Image + Sparse dToF Depth"] --> B["Unified Depth Preprocessing<br/>resize+flood-fill+log norm→3 ch"]
B --> C["Masked Joint Attention Dual-branch Encoder<br/>Depth→Image Unidirectional Guidance"]
C --> D["Lightweight DPT Decoder & Inverse Norm<br/>Depth Head + Mask Head"]
D --> E["Output: Dense Metric Depth + Validity Mask"]
F["dToF Sensor Simulation Pipeline<br/>flash / sub-VGA / rotating + noise"] -.Training Data.-> C
Key Designs
1. Unified Depth Representation Preprocessing: Aligning Diverse dToF Depth for DINOv2
Because dToF resolutions and sampling patterns (point grids, low-res grids, line scans) vary widely, feeding the raw sensor output directly to the network prevents generalization. The method upsamples the sensor depth and its validity mask to RGB resolution, then applies nearest-neighbor interpolation and flood-fill to create a continuous depth field that preserves local geometry. Crucially, the validity mask remains sparse, allowing the network to distinguish real measurements from interpolated pixels.
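The fill step can be illustrated with a minimal sketch. It assumes the sparse measurements have already been placed on the RGB-resolution grid, and it uses a multi-source breadth-first flood-fill with 4-connectivity as a stand-in for the paper's nearest-neighbor interpolation plus flood-fill. The function name `flood_fill_nearest` and the toy grid are illustrative, not taken from the paper.

```python
from collections import deque

import numpy as np


def flood_fill_nearest(sparse_depth, valid_mask):
    """Propagate each measured depth to nearby empty pixels (multi-source BFS).

    Returns the densified depth and the ORIGINAL sparse mask, so a network
    can still tell measured pixels from filled-in ones.
    """
    h, w = sparse_depth.shape
    dense = np.where(valid_mask, sparse_depth, 0.0).astype(np.float64)
    filled = valid_mask.copy()
    queue = deque(zip(*np.nonzero(valid_mask)))  # start from every measurement
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not filled[nr, nc]:
                dense[nr, nc] = dense[r, c]  # copy the nearest measurement
                filled[nr, nc] = True
                queue.append((nr, nc))
    return dense, valid_mask


# Toy example: two measurements on a 4x6 grid.
depth = np.zeros((4, 6))
mask = np.zeros((4, 6), dtype=bool)
depth[1, 1], mask[1, 1] = 2.0, True
depth[2, 4], mask[2, 4] = 5.0, True
dense, sparse_mask = flood_fill_nearest(depth, mask)
```

The depth field becomes dense while `sparse_mask` still marks only the two measured pixels, which is the property the text emphasizes.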
To reuse DINOv2 pre-trained ViT weights for the depth branch, log-normalization aligns the depth distribution with RGB inputs. The normalization is computed from \(Z_{\min}\) and \(Z_{\max}\), the min/max values of the original valid depths. Log-normalization compresses large depth ranges and stabilizes scale variations. The normalized depth is replicated across the first two channels, with the third channel containing the sparse validity mask, forming a 3-channel depth input scaled to \([-1,1]\). The normalization parameters \(\alpha, \beta\) are stored for inverse normalization.
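As a rough illustration of the 3-channel depth input, the sketch below assumes a min-max normalization in log space, \(\hat{z} = (\log z - \alpha)/\beta\) with \(\alpha = \log Z_{\min}\) and \(\beta = \log Z_{\max} - \log Z_{\min}\), followed by a linear map to \([-1,1]\). This exact form, the treatment of the mask channel, and the name `log_normalize` are assumptions; the paper's definition of \(\alpha, \beta\) may differ.

```python
import numpy as np


def log_normalize(dense_depth, sparse_mask):
    """Build a 3-channel depth input from a filled depth map (assumed form).

    dense_depth must be strictly positive everywhere (i.e. already filled).
    """
    valid = dense_depth[sparse_mask]           # statistics from real measurements only
    alpha = np.log(valid.min())                # assumed: alpha = log Z_min
    beta = np.log(valid.max()) - alpha         # assumed: beta = log Z_max - log Z_min
    beta = max(beta, 1e-6)                     # guard against a constant depth map
    z01 = (np.log(dense_depth) - alpha) / beta
    z = 2.0 * z01 - 1.0                        # map [0, 1] to [-1, 1]
    m = 2.0 * sparse_mask.astype(np.float64) - 1.0  # assumed: mask also in [-1, 1]
    depth_input = np.stack([z, z, m], axis=-1)  # depth, depth, sparse mask
    return depth_input, (alpha, beta)


dense = np.array([[1.0, 1.0, 10.0], [1.0, 100.0, 100.0]])
mask = np.array([[True, False, True], [False, False, True]])
depth_input, (alpha, beta) = log_normalize(dense, mask)
```

Here depths of 1, 10, and 100 map to -1, 0, and 1, which shows how the logarithm compresses a two-decade range into evenly spaced values.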
2. Masked Joint Attention Dual-branch Encoder: Unidirectional Guidance
This is the core innovation. The encoder uses two parallel ViT branches, one for RGB and one for the normalized sparse depth. Instead of standard cross-attention, it uses joint attention with a directional mask. The queries, keys, and values of the image and depth tokens are concatenated, \(Q=[Q_I; Q_Z]\), \(K=[K_I; K_Z]\), \(V=[V_I; V_Z]\), and a mask \(G\) is applied during attention.
The mask \(G\) enforces asymmetric information flow: "depth → image" attention is allowed (accurate depth guides image features), but "image → depth" attention is blocked (so unreliable RGB cues cannot contaminate geometric depth). This yields depth-aware image embeddings while preserving depth feature purity. Furthermore, since the structure matches standard ViT self-attention, both branches are initialized with DINOv2 weights, injecting large-scale visual/geometric priors.
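The directional mask can be sketched in a few lines. This single-head version without learned projections is a minimal illustration: it assumes \(G\) is an additive mask on the attention logits (0 for allowed pairs, \(-\infty\) for blocked ones), which is the usual way to implement masked attention but is not confirmed by the summary. The function name and token shapes are hypothetical.

```python
import numpy as np


def masked_joint_attention(q_i, k_i, v_i, q_z, k_z, v_z):
    """Joint attention over [image; depth] tokens with a directional mask."""
    n_i, n_z = q_i.shape[0], q_z.shape[0]
    q = np.concatenate([q_i, q_z])
    k = np.concatenate([k_i, k_z])
    v = np.concatenate([v_i, v_z])
    logits = q @ k.T / np.sqrt(q.shape[-1])

    # G: rows are queries, columns are keys; 0 = allowed, -inf = blocked.
    g = np.zeros((n_i + n_z, n_i + n_z))
    g[n_i:, :n_i] = -np.inf  # depth queries may not read image keys
    logits = logits + g

    # Numerically stable softmax; blocked entries become exactly 0.
    logits -= logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)

    out = weights @ v
    return out[:n_i], out[n_i:], weights


rng = np.random.default_rng(0)
n_i, n_z, d = 5, 3, 8
tokens = [rng.standard_normal((n, d)) for n in (n_i, n_i, n_i, n_z, n_z, n_z)]
out_i, out_z, weights = masked_joint_attention(*tokens)
```

Image queries attend to both image and depth keys, while the depth rows only see depth keys, so the depth outputs are independent of the image tokens.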
3. Lightweight DPT Decoder and Inverse Normalization: Direct Metric Depth Extrapolation
The decoder uses the DPT architecture with two lightweight branches: one for the validity mask and one for the dense normalized depth \(\hat{D}\). Metric depth is restored by inverting the log-normalization with the stored parameters \(\alpha, \beta\).
Since \(\hat{D}\) is not constrained to \([0,1]\), the decoder can extrapolate depth beyond the sensor's range \([Z_{\min}, Z_{\max}]\) (e.g., distant regions). With no diffusion or iterative refinement, the model is 1-2 orders of magnitude lighter than OMNI-DC/Marigold-DC.
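To make the extrapolation point concrete, the sketch below inverts an assumed log min-max normalization, \(\tilde{d} = \exp(\beta \hat{d} + \alpha)\) with \(\alpha = \log Z_{\min}\) and \(\beta = \log Z_{\max} - \log Z_{\min}\). The exact parameterization in the paper may differ; `denormalize` and the sample values are illustrative only.

```python
import numpy as np


def denormalize(d_hat, alpha, beta):
    """Invert the assumed normalization: d = exp(beta * d_hat + alpha)."""
    return np.exp(beta * d_hat + alpha)


# Hypothetical sensor range of 0.5 m to 8 m.
z_min, z_max = 0.5, 8.0
alpha = np.log(z_min)
beta = np.log(z_max) - np.log(z_min)

# Normalized predictions in the [0, 1] convention; 1.2 lies outside it.
d_hat = np.array([0.0, 0.5, 1.0, 1.2])
metric = denormalize(d_hat, alpha, beta)
```

Predictions of 0 and 1 recover \(Z_{\min}\) and \(Z_{\max}\), 0.5 gives their geometric mean (2 m), and 1.2 yields a depth beyond \(Z_{\max}\), which is how an unconstrained \(\hat{D}\) reaches regions the sensor never measured.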
4. dToF Sensor Simulation Pipeline: Scaling Synthetic Data for Zero-shot Robustness
Given the scarcity of paired high-resolution ground truth and dToF data, the authors simulate dToF depth from large-scale synthetic RGB-D datasets. The pipeline covers two categories: (a) flash dToF, comprising extremely sparse point clouds (64 to 10K points) and sub-VGA flash sensors (low-res dense maps with edge degradation modeled via Perlin noise); (b) rotating dToF, simulating line-scanning patterns (e.g., Velodyne VLP-16/32). Noise augmentation (Gaussian noise, spatial jitter, 0.2-1.0% outliers, and inpainting with irregular masks) teaches the model to handle real-world artifacts.
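A simplified version of the sparse-point case might look like the following. It samples random pixels from a ground-truth depth map, adds Gaussian noise, and replaces a small fraction of samples with outliers. The relative noise model, the uniform outlier distribution, and the default values are assumptions chosen for illustration; spatial jitter, Perlin-noise edge degradation, and line-scan patterns are omitted.

```python
import numpy as np


def simulate_sparse_dtof(gt_depth, num_points, rng, noise_std=0.01, outlier_frac=0.005):
    """Sample a noisy sparse depth map and its validity mask from dense GT."""
    h, w = gt_depth.shape
    idx = rng.choice(h * w, size=num_points, replace=False)
    rows, cols = np.unravel_index(idx, (h, w))

    # Assumed noise model: Gaussian noise proportional to depth.
    samples = gt_depth[rows, cols] * (1.0 + noise_std * rng.standard_normal(num_points))

    # Replace a small fraction of samples with random outliers.
    n_out = int(round(outlier_frac * num_points))
    if n_out > 0:
        out_idx = rng.choice(num_points, size=n_out, replace=False)
        samples[out_idx] = rng.uniform(gt_depth.min(), gt_depth.max(), size=n_out)

    sparse = np.zeros_like(gt_depth)
    mask = np.zeros((h, w), dtype=bool)
    sparse[rows, cols] = samples
    mask[rows, cols] = True
    return sparse, mask


rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(48, 64))
sparse, mask = simulate_sparse_dtof(gt, num_points=1000, rng=rng)
```

Varying `num_points` between 64 and 10K would mimic the sparsity range quoted above, and `outlier_frac=0.005` sits inside the stated 0.2-1.0% band.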
Loss & Training
The total loss is \(\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_g + \mathcal{L}_l + \mathcal{L}_m\).
- Depth-weighted L1: \(\mathcal{L}_1 = \sum_{i\in\mathcal{M}} \frac{1}{d_i}|\tilde{d}_i - d_i|_1\). Inverse depth weighting prevents overfitting to large distant values.
- Global/Local Scale-invariant Loss: \(\mathcal{L}_g\) and \(\mathcal{L}_l\) use ROE solvers to maintain global scale consistency and local geometric structure.
- Mask Loss: \(\mathcal{L}_m\) uses binary cross-entropy to train the validity head.
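Two of the four terms are simple enough to sketch directly. The block below implements the depth-weighted L1 sum as written above and a mean binary cross-entropy for the mask head; the mean reduction and the clipping constant are assumptions. The scale-invariant terms \(\mathcal{L}_g\) and \(\mathcal{L}_l\) depend on the ROE solver and are left out.

```python
import numpy as np


def depth_weighted_l1(pred, gt, valid):
    """Sum over valid pixels of |pred - gt| / gt."""
    return np.sum(np.abs(pred[valid] - gt[valid]) / gt[valid])


def mask_bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted validity and target mask."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))


# A 10% error at 1 m and at 10 m contributes equally after 1/d weighting.
gt = np.array([1.0, 10.0])
pred = np.array([1.1, 11.0])
valid = np.array([True, True])
l1 = depth_weighted_l1(pred, gt, valid)

bce = mask_bce(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```

The toy values show the effect of the inverse-depth weight: without it, the 1 m absolute error at 10 m would dominate the 0.1 m error at 1 m.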
Training details: DINOv2 ViT-Small encoder; features from layers 6 and 12 are fed to the DPT decoder; dynamic token counts are supported; AdamW optimizer; 100k iterations on 8× A100 GPUs.
Key Experimental Results
Main Results
Zero-shot generalization is evaluated on real sensors (KITTI-DC, ZJUL5) and simulated sparse points (DDAD, DIODE, ETH3D, iBims-1). The table lists Rel for three of these datasets, along with the average Rel and \(\delta < 1.025\), inference time, and VRAM.
| Dataset / Metric | OMNI-DC | PromptDA(ViT-S) | PriorDA(ViT-B) | Marigold-DC | Ours(ViT-S) |
|---|---|---|---|---|---|
| KITTI-DC Rel↓ | 1.48 | 2.90 | 2.76 | 5.50 | 2.00 |
| ZJUL5 Rel↓ | 12.92 | 13.25 | 11.13 | 12.77 | 11.13 |
| ETH3D Rel↓ | 1.05 | 2.67 | 1.04 | 1.80 | 0.84 |
| Avg. Rel↓ | 3.77 | 5.09 | 3.99 | 6.11 | 3.46 |
| Avg. δ↑ | 79.0 | 66.3 | 74.5 | 59.1 | 77.9 |
| Inference Time↓ | 638 ms | 30 ms | 197 ms | 6470 ms | 34 ms |
| VRAM↓ | 3.49 GB | 0.40 GB | 1.62 GB | 5.71 GB | 0.44 GB |
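Ours achieves the lowest average Rel (3.46), while OMNI-DC keeps a slightly higher average δ (79.0 vs. 77.9). By the table, Ours is about 19× faster than OMNI-DC (34 ms vs. 638 ms) and uses about 8× less VRAM (0.44 GB vs. 3.49 GB), which the headline figures round to ~20× and ~10×.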
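For reference, the two accuracy metrics in the table can be computed as sketched below, using the standard definitions: Rel is the mean absolute relative error, and \(\delta < 1.025\) is the fraction of pixels whose ratio \(\max(\tilde{d}/d, d/\tilde{d})\) falls under the threshold. Reporting both as percentages is an assumption based on the magnitudes in the table, and the sample values are made up.

```python
import numpy as np


def rel_and_delta(pred, gt, valid, threshold=1.025):
    """Return (Rel, delta) in percent over valid pixels."""
    p, g = pred[valid], gt[valid]
    rel = np.mean(np.abs(p - g) / g)
    ratio = np.maximum(p / g, g / p)
    delta = np.mean(ratio < threshold)
    return 100.0 * rel, 100.0 * delta


gt = np.array([2.0, 4.0, 8.0, 10.0])
pred = np.array([2.02, 4.0, 8.4, 10.1])
valid = np.ones(4, dtype=bool)
rel, delta = rel_and_delta(pred, gt, valid)
```

With relative errors of 1%, 0%, 5%, and 1%, Rel is 1.75% and three of the four pixels pass the 1.025 threshold, giving δ = 75%. The tight threshold makes δ much stricter than the common \(\delta < 1.25\) variant.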
Ablation Study
Architecture components (Avg. Rel / δ):
| Configuration | Rel↓ | δ↑ | Description |
|---|---|---|---|
| Depth prompting | 4.07 | 73.6 | Depth injected only at decoder neck (like PromptDA) |
| Joint attention | 3.87 | 75.6 | Joint attention without directional mask |
| Masked joint attention (Ours, ViT-S) | 3.46 | 77.9 | Full model |
| Ours, ViT-B | 3.28 | 79.0 | Larger backbone |
Key Findings
- Directional masking is vital: Encoder-level fusion (3.87) outperforms decoder-level prompting (4.07); adding the directional mask (3.46) provides a significant boost, verifying the importance of blocking image-to-depth influence.
- Scalability: Performance improves monotonically from ViT-S to ViT-L.
- Superior at extreme sparsity: Ours maintains the highest accuracy at 100 points, demonstrating robustness to input density changes.
Highlights & Insights
- Asymmetric Information Flow: The \(2\times2\) mask \(G\) encodes the intuition "let the reliable guide the noisy, but prevent the noisy from polluting the reliable." This plug-and-play design is highly transferable.
- Expressive Encoder, Lightweight Decoder: Contrary to the diffusion/iteration trend, the paper shows that strong encoder-level fusion permits lightweight decoding, giving a roughly 20× speed advantage over OMNI-DC for real-world deployment.
- Structural Isomorphism: Designing the fusion layer to be isomorphic to ViT self-attention allows the depth branch to inherit large-scale vision priors from DINOv2 for free.
- Simulation for Zero-shot: Successfully transferring from pure synthetic training to 3 types of real devices highlights the scalability of physics-aware sensor simulation.
Limitations & Future Work
- The paper lacks absolute error metrics like RMSE/MAE, making it hard to judge absolute geometric precision in millimeters.
- The unidirectional mask assumes depth is always more reliable than RGB; this may fail in scenarios with dToF multipath interference or specular reflections.
- Generalization to unmodeled sensor mechanisms (e.g., specific solid-state LiDAR patterns) is unknown.
- Reliability of extrapolation depends on the \(Z_{\min}/Z_{\max}\) estimation, which can be sensitive to outliers.
Related Work & Insights
- vs. PromptDA: Ours fuses in the encoder via masked joint attention rather than at the decoder neck, yielding a lower average Rel (3.46 vs. 5.09 for PromptDA, and vs. 4.07 for the PromptDA-style depth-prompting ablation) and better cross-device generalization.
- vs. Marigold-DC: Diffusion methods generalize well but are extremely slow (~6.5 seconds). Ours is ~200× faster and more robust to noise.
- vs. OMNI-DC: Iterative optimization is expensive (638 ms). Ours achieves a lower average Rel (3.46 vs. 3.77) in a single forward pass, although OMNI-DC remains ahead on KITTI-DC (1.48 vs. 2.00).
Rating
- Novelty: ⭐⭐⭐⭐ (Simple but effective masked joint attention)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Zero-shot across 6 datasets and 3 devices)
- Writing Quality: ⭐⭐⭐⭐ (Method is clear, though absolute metrics are missing)
- Value: ⭐⭐⭐⭐⭐ (High industrial potential for VR/XR and mobile robotics)