Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
- Conference: ICCV 2025
- arXiv: 2411.17240
- Code: https://github.com/JunyuanDeng/DM-Calib
- Area: 3D Vision / Camera Calibration / Diffusion Models
- Keywords: monocular camera calibration, Camera Image, diffusion model prior, metric depth estimation, sparse-view reconstruction
TL;DR
This paper proposes DM-Calib, a diffusion-based monocular camera intrinsic estimation method. It introduces a Camera Image representation that losslessly encodes intrinsics as a 3-channel image (azimuth + elevation + grayscale), fine-tunes Stable Diffusion to generate Camera Images, and extracts intrinsics via RANSAC. The method outperforms all baselines on 5 zero-shot datasets and extends camera calibration to metric depth estimation, pose estimation, and sparse-view 3D reconstruction.
Background & Motivation
State of the Field
Background: Monocular camera calibration is an ill-posed problem. Traditional methods rely on strong priors such as the Manhattan World assumption or calibration boards, resulting in poor generalization. Learning-based methods are constrained by training data scale. Diffusion models implicitly understand the relationship between focal length and image content (e.g., telephoto → shallow depth of field / compression effect; wide-angle → exaggerated perspective), and this prior knowledge can be leveraged for camera calibration.
Mechanism
Goal: How can the implicit imaging priors encoded in diffusion models be effectively extracted for high-accuracy monocular camera intrinsic estimation? The key challenge is that numerical camera parameters \((f_x, f_y, c_x, c_y)\) are not directly compatible with image-based diffusion models.
Method
Overall Architecture
Input RGB image → VAE encodes it into an RGB latent → GT intrinsics are encoded as a Camera Image → noise is added to the Camera Image latent → the conditional UNet denoises to predict the clean Camera Image → the VAE decodes it → RANSAC extracts the intrinsics \((f_x, f_y, c_x, c_y)\) from the decoded Camera Image.
Key Designs
- Camera Image Representation: Encodes intrinsics as a 3-channel "image": Channel 1 = \(\arctan(r_1/r_3)\) (azimuth), Channel 2 = \(\arccos(r_2)\) (elevation), Channel 3 = grayscale image. Preserving high-frequency details from the input image reduces the domain gap with real images, making VAE encoding/decoding errors negligible. By contrast, incidence maps exhibit a large domain gap with real images.
- RANSAC Intrinsic Extraction: Each pixel in the Camera Image encodes a ray direction, giving one linear constraint \(u = \tan(\theta) \cdot f_x + c_x\); any two pixels suffice to solve for \((f_x, c_x)\). RANSAC fits a line over the \((\tan\theta, u)\) pairs from all pixels — the slope gives the focal length and the intercept the principal point (and analogously for \(f_y, c_y\) along the vertical axis).
- Extension to Metric Depth: The Camera Image is used as a conditional input to the UNet with single-step deterministic forward inference (no multi-step denoising), while the VAE decoder is jointly trained. This achieves the first diffusion-prior-based metric depth estimation.
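The first two designs above can be sketched end to end: encoding pinhole intrinsics as per-pixel ray angles, then recovering \(f_x, c_x\) by line fitting. A minimal numpy sketch — plain least squares stands in for the paper's RANSAC, and the function names are illustrative rather than taken from the released code:

```python
import numpy as np

def encode_camera_image(fx, fy, cx, cy, h, w):
    """Encode intrinsics as per-pixel ray angles (the paper's first two channels)."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx                      # ray direction r = (x, y, 1) before normalization
    y = (v - cy) / fy
    norm = np.sqrt(x**2 + y**2 + 1.0)
    azimuth = np.arctan2(x, 1.0)           # channel 1: arctan(r1 / r3)
    elevation = np.arccos(y / norm)        # channel 2: arccos(r2) with r normalized
    return azimuth, elevation

def fit_fx_cx(azimuth):
    """Recover (fx, cx) from one row via u = fx * tan(theta) + cx (least squares)."""
    h, w = azimuth.shape
    u = np.arange(w, dtype=float)
    t = np.tan(azimuth[h // 2])            # tan(theta) along the middle row
    A = np.stack([t, np.ones_like(t)], axis=1)
    (fx, cx), *_ = np.linalg.lstsq(A, u, rcond=None)
    return fx, cx

az, el = encode_camera_image(fx=500.0, fy=500.0, cx=320.0, cy=240.0, h=480, w=640)
fx, cx = fit_fx_cx(az)
```

In the noise-free case every pixel contributes one exact constraint, so the fit recovers the intrinsics essentially exactly; RANSAC matters only because the diffusion-generated Camera Image is imperfect.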
Loss & Training
- Intrinsic estimation: v-prediction loss + multi-resolution noise
- Metric depth: \(\mathcal{L} = \|M \odot (D(U(z_x, z_c)) - d)\|\)
- Intrinsics: AdamW, lr = 3e-5, 30K iterations, BS = 196, 8× A800
- Depth: same optimizer, BS = 96, ~5 days training
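The masked depth objective can be made concrete with a toy example. Here an L1 mean over valid pixels stands in for the paper's (unspecified) norm, and the mask \(M\) keeps only pixels with sparse ground truth:

```python
import numpy as np

def masked_depth_loss(pred, gt, valid_mask):
    """L = || M ⊙ (pred - d) ||, here as an L1 mean over valid pixels only."""
    diff = valid_mask * (pred - gt)
    n_valid = valid_mask.sum()
    return np.abs(diff).sum() / max(n_valid, 1)

gt = np.full((4, 4), 2.0)
gt[0, 0] = 0.0                           # invalid pixel (e.g. no LiDAR return)
mask = (gt > 0).astype(float)            # sparse-GT validity mask M
pred = np.full((4, 4), 2.5)              # stand-in for D(U(z_x, z_c))
loss = masked_depth_loss(pred, gt, mask)
```

Averaging only over valid pixels keeps the loss scale independent of how sparse the LiDAR/RGBD supervision is.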
Key Experimental Results
Zero-Shot Camera Calibration (focal length error \(e_f\) ↓)
| Method | Waymo | RGBD | ScanNet | MVS | Scenes11 | Average |
|---|---|---|---|---|---|---|
| WildCamera | 0.210 | 0.097 | 0.128 | 0.170 | 0.170 | 0.155 |
| DiffCalib | 0.188 | 0.092 | 0.089 | 0.135 | 0.108 | 0.122 |
| DM-Calib | best | best | 0.089 | best | best | best |
Sparse-View 3D Reconstruction (relative distance error)
| Method | Scene1 | Scene2 | Scene3 | Scene4 |
|---|---|---|---|---|
| w/o intrinsics | 1.67 | 0.87 | 1.03 | 1.43 |
| w/ intrinsics (DM-Calib) | 1.37 | 0.68 | 0.68 | 1.06 |
Reconstruction error reduced by ~20%.
Ablation Study
- Camera Image \((\theta, \phi, g)\) > \((\theta, \phi, \theta)\): \(e_f\) drops from 24.36° to ~4°
- Multi-resolution noise further reduces error
- Metric depth: removing Camera Image conditioning drops \(\delta_1\) from 85.8 to 83.8
- Single-step vs. multi-step inference: single-step performs better (multi-step training with sparse GT is difficult)
- Fine-tuning the VAE decoder is critical for metric depth
Highlights & Insights
- Camera Image Design: The key insight is placing a grayscale image in the third channel to reduce the domain gap, making VAE reconstruction error negligible — seemingly simple, but experimentally shown to make a dramatic difference.
- RANSAC Intrinsic Extraction: Transforms the mapping from dense Camera Image to 4 scalars into a straightforward line-fitting problem, achieving both robustness and efficiency.
- Diffusion Models Understand Focal Length: SD models genuinely encode the imaging characteristics of different focal lengths (as shown in Fig. 1 with telephoto vs. wide-angle generation), a finding that is valuable in its own right.
- Intrinsics → Metric Depth: Given accurate intrinsics, affine-invariant depth can be upgraded to metric depth — a critical yet often overlooked step in the pipeline.
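The last point is easy to see concretely: once \((f_x, f_y, c_x, c_y)\) are known, a metric depth map unprojects to a metric point cloud via the standard pinhole model (a generic backprojection sketch, not code from the paper):

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Lift a metric depth map (H, W) to camera-frame 3D points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    return np.stack([X, Y, depth], axis=-1)

depth = np.full((480, 640), 3.0)         # a fronto-parallel wall 3 m away
pts = unproject(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# the ray through the principal point hits the wall at (0, 0, 3)
```

Without accurate intrinsics the same depth map produces a point cloud that is sheared and wrongly scaled, which is why calibration errors propagate directly into reconstruction error.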
Limitations & Future Work
- Performance degrades on ultra-wide-angle (small focal length) images due to underrepresentation in training data
- Inference still requires multi-step diffusion sampling (can be accelerated with few-step methods)
- Metric depth training still requires sparse ground truth from LiDAR/RGBD sensors
- Non-pinhole camera models (e.g., radial distortion) are not addressed
Related Work & Insights
- vs. DiffCalib: Also uses diffusion models but generates incidence maps, which suffer from a large domain gap and require joint training with depth; DM-Calib's Camera Image is more compatible with diffusion models and can be trained independently.
- vs. WildCamera/GeoCalib: Non-diffusion methods relying on geometric features (e.g., vanishing points) with limited generalization.
- vs. UniDepth: Jointly trains intrinsics and depth, but mutual interference degrades intrinsic accuracy.
Takeaways
- The paradigm of applying diffusion model priors to 3D geometric tasks is worth studying.
- The Camera Image design philosophy — encoding non-image signals into image format to leverage pretrained diffusion models — has broad applicability.
- Intrinsic estimation is a critical component for any pipeline performing 3D reconstruction from in-the-wild images.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The Camera Image representation combined with diffusion priors for calibration is novel and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers calibration, depth, pose, reconstruction, and metric evaluation across 5 zero-shot datasets.
- Writing Quality: ⭐⭐⭐⭐ Method description is clear; VAE reconstruction error analysis is convincing.
- Value: ⭐⭐⭐⭐⭐ The paradigm of encoding non-visual signals as images to exploit diffusion priors is highly inspiring.