Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Conference: ICCV 2025 · arXiv: 2411.17240 · Code: https://github.com/JunyuanDeng/DM-Calib · Area: 3D Vision · Keywords: Monocular camera calibration, diffusion model, depth estimation, 3D reconstruction, Camera Image

TL;DR

This paper proposes DM-Calib, which leverages Stable Diffusion priors for monocular camera intrinsic estimation. It introduces a Camera Image representation that losslessly encodes intrinsics as an image, and recovers focal length and principal point via RANSAC. DM-Calib significantly outperforms existing calibration methods on 5 zero-shot datasets and advances downstream tasks including metric depth estimation, pose estimation, and sparse-view reconstruction.

Background & Motivation

Camera calibration is a cornerstone of 3D vision. Traditional methods rely on multi-view images or checkerboard patterns, making them difficult to apply in sparse-view or monocular settings. Monocular calibration is inherently an ill-posed problem that requires additional constraints.

Limitations of Prior Work:

Geometry-based methods (leveraging vanishing points, Manhattan-world assumptions, face priors, etc.): rely on hand-crafted assumptions and generalize poorly to unconstrained real-world scenes.

Learning-based methods (e.g., WildCamera, UniDepth): constrained by the scale of public datasets and prone to overfitting to training distributions.

Existing diffusion-based methods (DiffCalib): use Incidence Maps to encode camera parameters, but this representation has a large domain gap from natural images and fails to fully exploit diffusion model priors; it also requires joint training with depth maps.

Key Observation: Stable Diffusion, trained on large-scale image-text pairs, implicitly learns the correspondence between focal length and image content. Long-focal-length images exhibit background blur and shallow depth of field; wide-angle images exhibit strong perspective distortion. Text prompts describing different focal lengths can generate images with corresponding visual characteristics.

Core Idea: Design a novel image-based representation called Camera Image that losslessly encodes camera intrinsics as a 3-channel color image, preserving high-frequency details of the original image (by embedding the grayscale image) and aligning closely with the image domain of diffusion models. Intrinsic estimation is thus reformulated as a conditional image generation task.

Method

Overall Architecture

DM-Calib is built upon Stable Diffusion V2.1. (1) The RGB input image and corresponding Camera Image are encoded into the latent space via a frozen VAE. (2) The noised latent of the Camera Image is concatenated with the RGB latent as input to the UNet (doubling the input channels). (3) The UNet learns to predict noise (v-prediction). (4) At inference, the Camera Image is generated by denoising from random noise, and intrinsics are recovered via RANSAC.
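
A minimal sketch of this inference flow with diffusers-style components, mirroring steps (1)–(4). The `vae`, `unet`, `scheduler`, and `text_embed` handles are assumed to be pre-loaded DM-Calib modules (an 8-input-channel UNet, a v-prediction scheduler, and an empty-prompt text embedding); this is an illustration, not the official pipeline:

```python
import torch

@torch.no_grad()
def generate_camera_image(rgb, vae, unet, scheduler, text_embed, steps=20):
    """Denoise a Camera Image latent conditioned on the RGB latent (sketch).

    Assumptions: `rgb` is (B, 3, H, W) in [0, 1]; `vae` is an AutoencoderKL;
    `unet` was fine-tuned to take the concatenated 8-channel latent;
    `scheduler` uses prediction_type="v_prediction".
    """
    z_rgb = vae.encode(rgb * 2 - 1).latent_dist.mode() * vae.config.scaling_factor
    z_cam = torch.randn_like(z_rgb)                      # start from pure noise
    scheduler.set_timesteps(steps, device=rgb.device)
    for t in scheduler.timesteps:
        x = torch.cat([z_rgb, z_cam], dim=1)             # channel-wise concat (step 2)
        v = unet(x, t, encoder_hidden_states=text_embed).sample  # v-prediction (step 3)
        z_cam = scheduler.step(v, t, z_cam).prev_sample  # scheduler inverts the v-prediction
    cam = vae.decode(z_cam / vae.config.scaling_factor).sample
    return (cam + 1) / 2                                 # back to [0, 1] for the RANSAC step
```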

Key Designs

  1. Camera Image Representation: Intrinsics \(K\) are encoded as a 3-channel image \(\mathbf{c}_{(u,v)} = [\arctan(r_1/r_3), \arccos(r_2), \mathbf{g}_{(u,v)}]\), where \(\vec{r}\) is the unit-normalized back-projected ray \(K^{-1}[u,v,1]^T\). The first two channels are the azimuth and elevation angles of a pseudo-spherical parameterization, and the third channel is the grayscale value of the original image (see the encoding sketch after this list).

    • Design Motivation: Incidence Maps are low-texture gradient images that differ significantly from natural images, causing large VAE encoding/decoding errors. Camera Image preserves high-frequency details by embedding the grayscale image, reducing VAE reconstruction error to nearly zero (FoV error < 0.1°), and aligning more closely with the natural image distribution seen during diffusion model training.
  2. RANSAC-based Intrinsic Recovery: Writing the unit ray in spherical form, \(\vec{r} = [\sin c_\varphi \sin c_\theta, \cos c_\varphi, \sin c_\varphi \cos c_\theta]^T\), and using \(r_1/r_3 = (u-c_x)/f_x\) and \(r_2/r_3 = (v-c_y)/f_y\), the first two channels \([c_\theta, c_\varphi]\) give the linear relationships \(\tan(c_\theta) f_x + c_x = u\) and \(\frac{1}{\cos(c_\theta)\tan(c_\varphi)} f_y + c_y = v\). Each pair of pixels yields one candidate line; RANSAC fits the optimal lines across all pixels, whose slopes and intercepts give the focal lengths and principal point (see the recovery sketch after this list).

    • Design Motivation: The Camera Image generated by the diffusion model inevitably contains noise; RANSAC provides a robust parameter recovery mechanism.
  3. Single-step Deterministic Metric Depth Estimation: The stochastic multi-step denoising process is replaced with a deterministic single-step forward pass. The RGB latent and Camera Image latent (without noise addition) are fed directly into the UNet, which predicts the depth latent. Both the UNet and VAE decoder are fine-tuned to support arbitrary output value ranges.

    • Design Motivation: The standard VAE decoder has a limited output range unsuitable for metric depth. Fine-tuning the decoder and adopting noise-free deterministic inference avoids the computational overhead and stochasticity of multi-step denoising.
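
As referenced in item 1, here is a minimal NumPy sketch of the Camera Image encoding, written directly from the definition above. `make_camera_image` is a hypothetical helper; a real pipeline would additionally rescale the angle channels to the VAE's expected input range:

```python
import numpy as np

def make_camera_image(K: np.ndarray, gray: np.ndarray) -> np.ndarray:
    """Encode pinhole intrinsics K as a 3-channel Camera Image:
    [azimuth arctan(r1/r3), elevation arccos(r2), grayscale]."""
    h, w = gray.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # homogeneous coords (h, w, 3)
    rays = pix @ np.linalg.inv(K).T                       # back-project: K^{-1} [u, v, 1]^T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)  # unit-normalize each ray
    azimuth = np.arctan2(rays[..., 0], rays[..., 2])      # arctan(r1 / r3)
    elevation = np.arccos(rays[..., 1])                   # arccos(r2)
    return np.stack([azimuth, elevation, gray], axis=-1)
```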
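
And, as referenced in item 2, a companion sketch of the intrinsic recovery, reusing `make_camera_image` and the NumPy import from the previous block for a round-trip check. The two-point line RANSAC here is a simplified stand-in for the authors' exact estimator:

```python
def ransac_line(x, y, iters=200, thresh=1.0, seed=0):
    """Robustly fit y = a*x + b: sample two points, count inliers, refit."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(x), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(x), size=2, replace=False)
        if np.isclose(x[i], x[j]):
            continue                                    # degenerate pair, skip
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        inliers = np.abs(a * x + b - y) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    a, b = np.polyfit(x[best], y[best], deg=1)          # least-squares refit on inliers
    return a, b

def recover_intrinsics(cam_img: np.ndarray):
    """Recover (fx, cx, fy, cy) from the azimuth/elevation channels."""
    h, w = cam_img.shape[:2]
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    az = cam_img[..., 0].ravel()
    el = cam_img[..., 1].ravel()
    fx, cx = ransac_line(np.tan(az), u.ravel())                     # u = fx*tan(c_theta) + cx
    fy, cy = ransac_line(1 / (np.cos(az) * np.tan(el)), v.ravel())  # v = fy/(cos*tan) + cy
    return fx, cx, fy, cy

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
cam = make_camera_image(K, np.zeros((480, 640)))
print(recover_intrinsics(cam))  # ~ (500, 320, 500, 240)
```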

Loss & Training

  • Intrinsic estimation training: \(\mathcal{L} = \mathbb{E}\|\hat{v}_\theta(z_t^c; z^x) - v_t\|_2^2\) (v-prediction objective), with multi-resolution noise.
  • Metric depth training: \(\mathcal{L}_{depth} = \mathbb{E}\|M \odot [\mathcal{D}(\mathcal{U}(z^x, \hat{z}^c)) - d]\|_1\), an L1 loss under the valid-depth mask \(M\), where \(\mathcal{U}\) is the UNet and \(\mathcal{D}\) the fine-tuned VAE decoder (see the sketch after this list).
  • Optimizer: AdamW, learning rate \(3 \times 10^{-5}\), batch size 196, 8×A800 GPUs, 30K iterations.
  • Training data sourced from multiple public datasets (NuScenes, KITTI, CityScapes, NYUv2, etc.).
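
A minimal PyTorch sketch of the masked depth loss above, as referenced in the list; tensor names are hypothetical, with `pred` standing for the decoded depth \(\mathcal{D}(\mathcal{U}(z^x, \hat{z}^c))\):

```python
import torch

def masked_depth_l1(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 depth loss over valid pixels only: E || M ⊙ (pred − d) ||_1."""
    diff = (pred - gt).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1)  # mean over valid (masked) pixels
```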

Key Experimental Results

Main Results

Monocular Calibration (zero-shot, average over 5 datasets)

| Method | \(e_f\) ↓ (focal length) | \(e_b\) ↓ (principal point) |
| --- | --- | --- |
| Perspective (geometry) | 0.239 | - |
| GeoCalib (geometry) | 0.215 | - |
| WildCamera (learning) | 0.155 | 0.041 |
| DiffCalib (diffusion) | 0.122 | 0.030 |
| DiffCalib-D (diffusion + depth joint) | 0.095 | 0.041 |
| DM-Calib (Ours) | 0.078 | 0.017 |

Focal length error of 0.078 (36% reduction vs. DiffCalib 0.122); principal point error of 0.017 (43% reduction vs. DiffCalib 0.030).

Metric Depth Estimation (zero-shot)

| Method | NYU-V2 \(\delta_1\) ↑ | NuScenes \(\delta_1\) ↑ | ETH3D \(\delta_1\) ↑ | IBims-1 \(\delta_1\) ↑ |
| --- | --- | --- | --- | --- |
| Metric3D | 92.6 | 72.3 | 45.6 | 79.7 |
| UniDepth | 97.2 | 83.3 | 22.9 | 79.4 |
| DM-Calib | 96.0 | 85.7 | 49.0 | 94.4 |

Achieves state of the art on NuScenes, ETH3D, and IBims-1, with superior detail preservation and foreground-background relationship understanding compared to UniDepth.

Ablation Study

| Camera Representation | VAE Reconstruction FoV Error (°) |
| --- | --- |
| Azimuth + elevation only (3rd channel duplicated) | ~2.5 |
| Azimuth + elevation + constant channel | ~1.5 |
| Incidence Map (DiffCalib) | ~1.0 |
| Camera Image (grayscale 3rd channel) | < 0.1 |

Camera Image achieves less than 1/10 the VAE reconstruction error of Incidence Map, validating the importance of preserving high-frequency details for diffusion models.

Key Findings

  • Camera Image also performs well on the Scenes11 dataset (an extreme scenario with randomly moving objects), with \(e_f=0.061\), demonstrating robustness.
  • Estimated intrinsics directly improve sparse-view reconstruction quality: using DM-Calib intrinsics reduces DUST3R reconstruction distance error by approximately 20%.
  • For monocular metric measurement (estimating real object sizes), DM-Calib is far more robust to varying focal lengths than Metric3D.

Highlights & Insights

  • Elegant Camera Image design: embedding the grayscale image in the third channel both bridges the domain gap and provides a high-frequency anchor for the VAE — a simple yet highly effective solution.
  • One model, multiple benefits: calibration results directly transfer to metric depth, 3D measurement, pose estimation, sparse reconstruction, and other downstream tasks.
  • Cleverly exploits the diffusion model's implicit knowledge of focal length, transforming an ill-posed problem into a conditional generation task.
  • RANSAC-based intrinsic recovery is clean and robust, leveraging the property that every pixel in the Camera Image encodes information.

Limitations & Future Work

  • Performance degrades on ultra-wide-angle (short focal length) images, owing to the scarcity of such images in the training data.
  • Only the pinhole camera model is addressed; distortion parameter estimation is not considered.
  • Inference requires multi-step denoising (inherent cost of diffusion models), limiting real-time applicability.
  • The metric depth component underperforms UniDepth on DIODE indoor and VOID benchmarks, possibly due to training data distribution mismatch.
  • Only single-image calibration is supported; temporal consistency across video sequences has not been explored.

Related Work Comparison

  • DiffCalib / WildCamera: DiffCalib uses Incidence Map + diffusion model; WildCamera directly regresses intrinsics. DM-Calib surpasses both with a superior representation.
  • Marigold / GeoWizard: Diffusion models applied to affine-invariant depth estimation; DM-Calib extends this to metric depth.
  • DUST3R: Sparse-view reconstruction method; providing intrinsic priors from DM-Calib significantly improves reconstruction quality.
  • Insight: The idea of encoding geometric parameters as diffusion-compatible image representations is generalizable to other geometric parameter estimation tasks (e.g., extrinsics, distortion coefficients, lighting parameters).

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The Camera Image representation is highly innovative; reformulating calibration as image generation is a novel perspective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across 5 zero-shot calibration benchmarks + 6 depth benchmarks + reconstruction/pose/measurement tasks — comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, thorough ablations, and intuitive visualizations.
  • Value: ⭐⭐⭐⭐⭐ A foundational tool; calibration results directly benefit multiple downstream 3D tasks.