Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting¶
- Conference: CVPR 2026
- arXiv: 2512.17908
- Authors: Ananta R. Bhattarai, Helge Rhodin (Bielefeld University)
- Code: GitHub
- Area: Self-Supervised
- Keywords: Monocular Depth Estimation, Test-Time Optimization, Score Distillation Sampling, Re-lighting, Depth Anything
TL;DR¶
This paper proposes Re-Depth Anything, which refines depth predictions from Depth Anything V2/3 at inference time through self-supervised optimization: the predicted depth map is augmented via re-lighting, and a 2D diffusion model's SDS loss is used to guide the optimization without any labeled data.
Background & Motivation¶
Foundation models such as Depth Anything V2 (DA-V2) deliver strong performance, yet they still exhibit errors on in-the-wild images with large distributional shifts from training data—e.g., lighting bias causes loss of fine structures, and flat regions suffer from spurious noise. Existing test-time adaptation methods either require multi-frame temporal information or rely on specific external priors (3D meshes or sparse point clouds). Meanwhile, large-scale 2D diffusion models encode rich physical-world priors that have not been fully exploited for test-time depth refinement.
Core Problem¶
How can 2D diffusion model priors be leveraged to perform label-free test-time refinement of depth maps from a single image?
Method¶
Mechanism: Re-lighting vs. Photometric Reconstruction¶
The key innovation is that the method deliberately avoids reconstructing the true lighting and material properties of the input image—an extremely ill-posed inverse rendering problem—and instead augments the input image. Specifically, the depth map is randomly re-lit and composited onto the original image, and a diffusion model is used to assess whether the augmented image "looks plausible."
4.1 Depth-Illumination Rendering¶
Surface normals \(\mathbf{N}\) are computed from the disparity map \(\hat{D}_{\text{disp}}\), and re-lighting is then applied using the Blinn-Phong model (a standard form and a code sketch follow the symbol list):
- \(\mathbf{l}\): randomly sampled light direction
- \(\beta_1, \beta_2\): diffuse/specular reflection intensities
- \(\tau(\cdot) = (\cdot)^{1/\gamma}\): tone mapping (\(\gamma=2.2\))
- The input image \(\mathbf{I}\) serves as an approximation of the diffuse albedo
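Assuming the standard Blinn-Phong ingredients behind the symbols above (the half-vector \(\mathbf{h}\) and shininess exponent \(p\) are textbook additions, not notation confirmed from the paper), the re-lit image plausibly takes the form

\[
\mathbf{I}_{\text{relit}} = \tau\!\left(\beta_1 \max(\mathbf{N}\cdot\mathbf{l},\,0)\,\mathbf{I} \;+\; \beta_2 \max(\mathbf{N}\cdot\mathbf{h},\,0)^{p}\right)
\]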
Handling relative depth: DA-V2 outputs normalized disparity; surface normals are invariant to global scale, so only the offset parameter \(b = ms\) needs to be optimized (in practice, fixing \(b=0.1\) suffices).
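A minimal PyTorch sketch of this rendering step, assuming finite-difference normals and a textbook Blinn-Phong shading term (the function names, padding scheme, and shininess value are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def normals_from_disparity(disp: torch.Tensor, b: float = 0.1) -> torch.Tensor:
    """Unit surface normals from a (1, H, W) disparity map.

    The paper notes normals are invariant to the global disparity scale,
    so only the offset b matters; b = 0.1 is the value it fixes.
    """
    z = 1.0 / (disp + b)                                        # disparity -> depth (up to scale)
    dz_dx = F.pad(z[:, :, 1:] - z[:, :, :-1], (0, 1))           # horizontal depth gradient
    dz_dy = F.pad(z[:, 1:, :] - z[:, :-1, :], (0, 0, 0, 1))     # vertical depth gradient
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(z)], dim=0)
    return n / n.norm(dim=0, keepdim=True)                      # (3, H, W) unit normals

def blinn_phong_relight(img, normals, light, beta1, beta2, shininess=32.0, gamma=2.2):
    """Re-light img (3, H, W) under a unit light direction `light`;
    the input image itself stands in for the diffuse albedo."""
    view = torch.tensor([0.0, 0.0, 1.0])                 # orthographic viewing direction
    half = F.normalize(light + view, dim=0)              # Blinn-Phong half-vector
    diffuse = (normals * light[:, None, None]).sum(0).clamp(min=0)
    specular = (normals * half[:, None, None]).sum(0).clamp(min=0) ** shininess
    shaded = beta1 * diffuse * img + beta2 * specular
    return shaded.clamp(0, 1) ** (1.0 / gamma)           # tone mapping tau, gamma = 2.2
```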
4.2 Augmentation Objective¶
At each step, \(\mathbf{l}\), \(\beta_1\), \(\beta_2\), and \(\alpha\) are randomly sampled, and the SDS loss measures the plausibility of the augmented image (a minimal sketch follows the list):
- SDS loss: a diffusion model (Stable Diffusion v1.5) evaluates the naturalness of the re-lit image
- Smoothness regularization: suppresses depth noise
- Text condition \(c\): automatically generated from the input image via BLIP-2
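A minimal sketch of the SDS term in the style of common DreamFusion-style implementations built on diffusers components (the timestep range, guidance scale, and classifier-free-guidance details are assumptions, not the paper's reported settings):

```python
import torch

def sds_loss(latents, text_emb, unet, scheduler, guidance_scale=7.5):
    """SDS surrogate loss on VAE latents of the re-lit composite image.

    latents:  requires_grad tensor from encoding the augmented image
    text_emb: [uncond, cond] prompt embeddings from the BLIP-2 caption
    unet/scheduler: frozen Stable Diffusion v1.5 components (diffusers)
    """
    t = torch.randint(50, 950, (1,), device=latents.device)    # random timestep
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)              # forward diffusion
    with torch.no_grad():                                       # denoiser stays frozen
        eps = unet(torch.cat([noisy] * 2), t,
                   encoder_hidden_states=text_emb).sample
        eps_uncond, eps_cond = eps.chunk(2)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    w = 1.0 - scheduler.alphas_cumprod.to(latents.device)[t]   # SDS weighting w(t)
    grad = w * (eps - noise)                                   # skip the U-Net Jacobian
    # Surrogate whose gradient w.r.t. latents equals `grad`:
    return (grad.detach() * latents).sum()
```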
4.3 Optimization Strategy¶
Directly optimizing the depth tensor or fine-tuning the full model both fail. The key design is to optimize only the intermediate embeddings \(\mathbf{W}\) and the DPT decoder weights \(\theta\), while freezing the ViT encoder:
- The frozen ViT encoder preserves its strong geometric priors
- Decoder weight optimization enables structural adjustments
- Embedding optimization enables per-image customization
Ensemble prediction: Due to the stochasticity of the SDS loss, \(N=10\) independent optimization runs are performed and their results are averaged; most of the gain is already achieved with 3 runs.
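Putting 4.1–4.3 together, one refinement pass could look like the sketch below; `encode`/`decode`, `random_relight`, and `sds` are hypothetical wrappers around the DA-V2 internals and the sketches above, not the released code:

```python
import copy
import torch

def refine_depth(model, image, num_runs=10, steps=1000):
    """Sketch of the test-time refinement loop (Sec. 4.3)."""
    preds = []
    for _ in range(num_runs):                      # N independent runs (ensemble)
        run = copy.deepcopy(model)
        for p in run.encoder.parameters():         # freeze the ViT encoder
            p.requires_grad_(False)
        W = run.encode(image).detach().requires_grad_(True)    # intermediate embeddings
        opt = torch.optim.AdamW([
            {"params": [W], "lr": 1e-3},                       # embedding learning rate
            {"params": run.decoder.parameters(), "lr": 2e-6},  # DPT decoder learning rate
        ])
        for _ in range(steps):                     # 1000 AdamW iterations
            disp = run.decode(W)                   # refined disparity map
            relit = random_relight(image, disp)    # random re-lit composite (Sec. 4.1)
            loss = sds(relit) + 1.0 * smoothness(disp)   # SDS term + lambda_1 = 1.0
            opt.zero_grad()
            loss.backward()
            opt.step()
        preds.append(run.decode(W).detach())
    return torch.stack(preds).mean(dim=0)          # average the ensemble
```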
Implementation Details¶
- 1000 AdamW iterations; embedding learning rate \(10^{-3}\), DPT weight learning rate \(2 \times 10^{-6}\)
- Scaled orthographic projection (default); regularization \(\lambda_1 = 1.0\)
- Approximately 80 seconds per run on an RTX 5000
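The smoothness term slotted into the loop above, with weight \(\lambda_1 = 1.0\), could be as simple as a total-variation penalty (an assumption; the paper's exact regularizer may differ, e.g. be edge-aware):

```python
import torch

def smoothness(disp: torch.Tensor) -> torch.Tensor:
    """Total-variation style regularizer on a (1, H, W) disparity map."""
    dx = (disp[:, :, 1:] - disp[:, :, :-1]).abs().mean()   # horizontal gradients
    dy = (disp[:, 1:, :] - disp[:, :-1, :]).abs().mean()   # vertical gradients
    return dx + dy
```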
Key Experimental Results¶
| Dataset | Method | AbsRel ↓ | RMSE ↓ | SI log ↓ | SqRel ↓ |
|---|---|---|---|---|---|
| KITTI | DA-V2 | 0.305 | 7.01 | 33.6 | 2.49 |
| KITTI | Ours + DA-V2 | 0.283 | 6.71 | 30.7 | 2.20 |
| KITTI | Relative Gain | 7.10% | 4.29% | 8.51% | 11.4% |
| ETH3D | DA-V2 | 0.113 | 0.955 | 15.1 | 0.391 |
| ETH3D | Ours + DA-V2 | 0.104 | 0.875 | 14.1 | 0.347 |
| ETH3D | Relative Gain | 8.30% | 8.39% | 6.22% | 11.1% |
| Dataset | Method | AbsRel ↓ | SqRel ↓ | Normal MSE ↓ |
|---|---|---|---|---|
| CO3D | DA3 | 0.00251 | 0.000317 | 0.000479 |
| CO3D | Ours + DA3 | 0.00238 | 0.000294 | 0.000409 |
| CO3D | Relative Gain | 4.83% | 7.39% | 14.65% |
Consistent improvements are also obtained on DA3, with normal error reduction reaching 14.7%.
Highlights & Insights¶
- Re-lighting instead of photometric reconstruction: elegantly sidesteps the ill-posedness of inverse rendering, repurposing diffusion priors as an "augmentation plausibility check"
- No labels, multi-view data, or additional supervision required: purely single-image self-supervised
- Backbone-agnostic: effective across both DA-V2 and DA3
- Notable fine-detail enhancement: recovers spherical textures, balcony railings, and other fine structures; removes spurious noise on flat surfaces
- Theoretically elegant: transfers the SfS+SDS paradigm from DreamFusion's text-to-3D setting to depth refinement
Limitations & Future Work¶
- Occasional hallucinated edges (e.g., stickers on a truck misinterpreted as geometric features)
- Sky regions may have geometry incorrectly extended
- Dark regions (e.g., under tree shadows) may be over-smoothed
- The 10-run ensemble requires approximately 13 minutes, which is prohibitive for real-time applications
- Stable Diffusion v1.5 is used as the diffusion prior, which may no longer be the optimal choice
Related Work & Insights¶
- vs. DreamFusion/RealFusion: these methods perform full 3D reconstruction with photometric consistency; this work only applies re-lighting augmentation, avoiding the stringent requirements of photometric reconstruction
- vs. classical Shape-from-Shading (SfS): SfS assumes constant albedo or known illumination and fails in many practical cases; this work replaces hand-crafted assumptions with diffusion priors
- vs. Marigold: Marigold uses a diffusion model directly for depth estimation; this work uses a diffusion model for test-time optimization, making it complementary to feed-forward methods
- vs. multi-frame TTA: the proposed single-image approach requires no temporal information, broadening its applicability
- The re-lighting augmentation + SDS paradigm is generalizable to other geometry tasks such as normal estimation and surface reconstruction
- "Augment rather than reconstruct" represents an important strategic shift for handling ill-posed inverse problems
- The design of freezing the encoder while optimizing embeddings and the decoder is transferable to test-time adaptation of other foundation models
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Self-supervised depth refinement via re-lighting SDS is entirely novel
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets, two backbones (DA-V2 and DA3), and detailed ablation studies
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation, elegant methodology, and thorough analysis
- Value: ⭐⭐⭐⭐ — Opens a new direction for test-time refinement of foundation models