Vision-Language Embodiment for Monocular Depth Estimation
Conference: CVPR 2025
arXiv: 2503.16535
Code: None
Area: 3D Vision
Keywords: Monocular Depth Estimation, Camera Model Embodiment, Vision-Language Fusion, Scene Depth Prior, Variational Autoencoder
TL;DR
This paper proposes an embodied depth estimation framework that builds the physical characteristics of the camera model into a deep learning system, computing Embodied Scene Depth as a geometric prior. It also exploits vision-language complementarity (depth text descriptions, a textual VAE, and a conditional sampler) to fuse RGB image features with the physical depth prior for monocular depth estimation.
Background & Motivation
- Monocular depth estimation is a core problem in computer vision, but recovering 3D depth from a single 2D image is inherently ill-posed because the 3D-to-2D projection discards depth information.
- Existing depth estimation models mainly rely on inter-image relationships for supervised training, ignoring the intrinsic information provided by the camera itself.
- Although geometric priors (normal constraints, planar constraints) can reduce uncertainty, their overall impact is limited.
- Existing CLIP-based depth estimation methods (such as DepthCLIP) use fixed discrete depth descriptions, lacking expressive capability.
- Images and text are both inherently ambiguous modalities, but they have complementary strengths: images provide direct observations of the 3D scene, while text provides object scale priors.
- For flat areas such as roads, absolute depth can be computed directly from the camera model parameters with very high accuracy (>99% of pixels have an error of <10%).
- No existing depth estimation framework integrates physical camera models, environmental semantics, and linguistic priors in a unified way.
- Sparse LiDAR ground truth limits the density of available depth supervision signals.
Method
Overall Architecture
The system consists of three levels of embodiment and is trained with an alternating optimization strategy:
- Camera Embodiment: uses camera intrinsic/extrinsic parameters and road segmentation to compute Embodied Scene Depth.
- Language Embodiment: uses ExpansionNet-v2 to generate an image caption, generates depth text descriptions from the segmentation and Embodied Depth, and combines both as the input to a textual VAE.
- Vision-Language Fusion: an RGB encoder and a depth encoder fuse their features through cross-attention, a conditional sampler samples from the latent distribution of the textual VAE, and a depth decoder with shared weights outputs the final depth.
Key Designs
Design 1: Embodied Scene Depth
- Function: Uses physical camera parameters to compute a dense scene depth prior in real time.
- Mechanism: Under the coplanarity assumption, with the camera projection matrix \(A = [K \mid 0]\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}\) built from the intrinsic and extrinsic parameters, the depth \(z_c\) of each pixel is solved given the camera height \(h\). The road region is first identified via semantic segmentation to obtain Embodied Road Depth (over 99% of road pixels fall within ±10% error on KITTI), which is then extended to the ground and vertical surfaces. Finally, Telea inpainting fills the remaining gaps to produce the complete Embodied Scene Depth.
- Design Motivation: Roads satisfy the planar condition, so absolute depth can be computed analytically, with accuracy comparable to LiDAR but denser coverage. Extending the computation step by step to the whole scene reduces accuracy but still provides valuable geometric constraints.
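
The planar road computation in Design 1 can be illustrated with a minimal sketch. It assumes a pinhole camera with x right, y down, z forward, a flat road, and a known unit ground normal in the camera frame. The function name `road_depth`, the example intrinsics, and the zero-pitch normal are assumptions made for illustration and are not taken from the paper.

```python
def road_depth(u, v, fx, fy, cx, cy, cam_height, normal=(0.0, 1.0, 0.0)):
    """Depth z_c of pixel (u, v), assuming the pixel lies on the ground plane.

    Camera frame: x right, y down, z forward. The ground plane is
    n . X = cam_height, where n is the unit ground normal in camera
    coordinates (assumed (0, 1, 0), i.e. no pitch or roll).
    """
    # Back-project the pixel to a ray with unit z-component: K^{-1} [u, v, 1]^T.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Points on the ray are X = z_c * ray, so n . (z_c * ray) = cam_height.
    denom = sum(n_i * r_i for n_i, r_i in zip(normal, ray))
    if denom <= 1e-9:
        return None  # ray is parallel to the ground or points above the horizon
    return cam_height / denom


# Hypothetical intrinsics and camera height, chosen only for this example.
fx = fy = 720.0
cx, cy = 620.0, 180.0
h = 1.65
z = road_depth(u=620, v=300, fx=fx, fy=fy, cx=cx, cy=cy, cam_height=h)
print(z)
```

In this simplified setup, rows closer to the horizon (v near cy) map to larger depths, and rows at or above the horizon have no ground intersection, so the function returns `None` for them.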
Design 2: Depth-Guided Textual Variational Autoencoder
- Function: Uses linguistic descriptions to model the probability distribution of plausible 3D scene layouts.
- Mechanism: For each object \(O_i\), a depth text description \(T_i\) is generated, containing its depth value \(d_i\) and depth ranking \(r_i\). The descriptions are combined with the image caption, and the combined text is passed through a CLIP text encoder and an MLP to estimate the mean \(\hat{\mu}\) and standard deviation \(\hat{\sigma}\) of the latent distribution. The reparameterization trick \(\hat{z} = \hat{\mu} + \epsilon \cdot \hat{\sigma}\), with \(\epsilon\) drawn from a standard Gaussian, yields latent samples that the depth decoder maps to a depth map.
- Design Motivation: The object scale and spatial layout priors provided by text constrain the solution space of depth estimation, and the variational framework naturally models the diversity of possible scenes.
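
To make the depth text descriptions concrete, the sketch below builds one sentence per object from its depth value and its rank by distance, then joins the sentences with an image caption. The sentence template, the function name `depth_descriptions`, the object list, and the caption are all hypothetical; the summary does not give the paper's exact wording.

```python
def depth_descriptions(objects):
    """Build one sentence per object from (name, depth_in_meters) pairs.

    The rank r_i is the object's position when sorted from nearest to
    farthest (1 = nearest). The sentence template is hypothetical.
    """
    order = sorted(range(len(objects)), key=lambda i: objects[i][1])
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    return [
        f"The {name} is about {depth:.1f} meters away, "
        f"rank {rank[i]} of {len(objects)} by distance."
        for i, (name, depth) in enumerate(objects)
    ]


# Hypothetical objects with depths read from an Embodied Scene Depth map.
objects = [("car", 12.4), ("road", 5.0), ("building", 31.7)]
caption = "A car drives down a street next to a building."  # stand-in caption
text = " ".join([caption] + depth_descriptions(objects))
print(text)
```

The combined string plays the role of the text that would be fed to the text encoder; the encoder and MLP themselves are omitted here.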
Design 3: Embodiment-Driven Conditional Sampler
- Function: Conditioned on the image, samples the depth corresponding to that specific image from the latent distribution of the textual VAE.
- Mechanism: RGB and Embodied Depth features are fused via cross-attention, \(F_f^d = \text{softmax}\left(\frac{Q_r K_d^\top}{\sqrt{d_k}}\right)V_d\). Transformer blocks then encode the fused features into \(h \times w\) local samples \(\tilde{\epsilon}\), which replace the standard Gaussian noise \(\epsilon\) to give \(\tilde{z} = \hat{\mu} + \tilde{\epsilon} \cdot \hat{\sigma}\). The final depth is output by the shared-weight depth decoder.
- Design Motivation: The language-derived latent distribution can only describe the range of plausible 3D layouts; image information is required to pinpoint the latent vector that best matches the current scene.
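
The difference between the two sampling paths can be sketched in a few lines. The example below uses the standard scaled dot-product form of cross-attention (an assumption about the intended scaling) and, as a toy stand-in for the Transformer blocks, uses the attended feature directly as \(\tilde{\epsilon}\). The names `cross_attention` and `sample_latent` and all numeric values are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: RGB queries attend to depth keys/values."""
    d_k = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def sample_latent(mu, sigma, eps):
    """Element-wise z = mu + eps * sigma."""
    return [m + e * s for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)
mu, sigma = [0.2, -0.1, 0.5], [1.0, 0.5, 0.1]  # stand-ins for the text VAE outputs

# Text VAE path: eps is drawn from a standard Gaussian.
z_hat = sample_latent(mu, sigma, [rng.gauss(0.0, 1.0) for _ in mu])

# Conditional sampler path: eps is predicted from fused image features.
q_rgb = [[1.0, 0.0, 0.0]]                     # one RGB token (toy values)
k_depth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two depth tokens (toy values)
v_depth = [[0.3, -0.2, 0.1], [0.9, 0.4, -0.5]]
eps_tilde = cross_attention(q_rgb, k_depth, v_depth)[0]  # toy stand-in for the sampler
z_tilde = sample_latent(mu, sigma, eps_tilde)
```

Both paths share \(\hat{\mu}\) and \(\hat{\sigma}\); only the source of the noise term changes, which is what lets the same depth decoder serve both.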
Loss & Training
Training alternates between two stages. The KL divergence regularizes the latent distribution toward a standard Gaussian, and the scale-invariant logarithmic (SiLog) loss makes the depth supervision less sensitive to global scale.
- Stage 1: Freeze the conditional sampler and train the textual VAE and the depth decoder with \(\mathcal{L}_{KL}(\hat{\mu}, \hat{\sigma}) + \mathcal{L}_{SiLog}\).
- Stage 2: Freeze the textual VAE and train the conditional sampler and the depth decoder with the SiLog loss alone.
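
The two loss terms can be written out as a small sketch. The KL term below is the closed form for a diagonal Gaussian against a standard Gaussian. The SiLog term uses the common formulation with a variance weight `lam` and a scale factor `alpha`; the defaults 0.85 and 10 are values often used in prior depth estimation code and are assumptions here, since the summary does not state the paper's settings.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


def silog_loss(pred, gt, lam=0.85, alpha=10.0):
    """Scale-invariant log loss over pixels with valid ground truth (gt > 0).

    lam and alpha are assumed defaults, not values confirmed by the paper.
    """
    diffs = [math.log(p) - math.log(g) for p, g in zip(pred, gt) if g > 0]
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # Clamp at zero to guard against tiny negative values from rounding.
    return alpha * math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


pred = [10.5, 19.0, 42.0]
gt = [10.0, 20.0, 0.0]  # 0.0 marks a pixel without LiDAR ground truth

# Stage 1 objective (sampler frozen): KL + SiLog.
stage1 = kl_to_standard_normal([0.2, -0.1], [1.0, 0.5]) + silog_loss(pred, gt)
# Stage 2 objective (text VAE frozen): SiLog only.
stage2 = silog_loss(pred, gt)
```

With `lam=1.0` the SiLog term becomes fully invariant to a global rescaling of the prediction; smaller values keep part of the absolute scale error in the loss.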
Key Experimental Results
Main Results: KITTI Depth Estimation
| Method | AbsRel↓ | SqRel↓ | RMSE↓ | \(\delta<1.25\)↑ |
|---|---|---|---|---|
| BTS | 0.061 | 0.261 | 2.834 | 0.954 |
| Adabins | 0.058 | 0.190 | 2.360 | 0.964 |
| iDisc | 0.053 | 0.175 | 2.216 | 0.971 |
| ECoDepth | 0.054 | 0.171 | 2.173 | 0.970 |
| Ours | 0.050 | 0.159 | 2.054 | 0.974 |
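
For reference, the four metrics in the table follow the standard definitions used in monocular depth benchmarks. The sketch below computes them over pixels with valid ground truth; the function name `depth_metrics` and the toy inputs are illustrative, and the evaluation crop and depth cap used in the paper are not reproduced here.

```python
import math


def depth_metrics(pred, gt):
    """AbsRel, SqRel, RMSE and delta < 1.25 over pixels with gt > 0."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    return {
        "abs_rel": sum(abs(p - g) / g for p, g in pairs) / n,
        "sq_rel": sum((p - g) ** 2 / g for p, g in pairs) / n,
        "rmse": math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n),
        # Fraction of pixels whose ratio max(p/g, g/p) is below 1.25.
        "delta1": sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n,
    }


# Toy example: the third pixel has no ground truth and is ignored.
metrics = depth_metrics(pred=[10.0, 20.0, 30.0], gt=[10.0, 25.0, 0.0])
print(metrics)
```

Lower is better for the first three metrics, and higher is better for the threshold accuracy.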
Ablation Study: Embodied Depth Accuracy (KITTI Dataset)
| Depth Type | ±5% Error Range | ±10% Error Range |
|---|---|---|
| Embodied Road Depth | 80.24% | 99.33% |
| Embodied Ground Depth | 60.30% | 74.89% |
| Embodied Scene Depth | 38.88% | 52.45% |
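
The percentages above can be read as the share of valid pixels whose relative error stays inside the given range. The sketch below computes that share under this interpretation, which is an assumption about how the table was produced; the function name `within_relative_error` and the toy values are illustrative.

```python
def within_relative_error(prior, gt, tol):
    """Fraction of pixels with gt > 0 whose relative error is below tol."""
    pairs = [(p, g) for p, g in zip(prior, gt) if g > 0]
    hits = sum(abs(p - g) / g < tol for p, g in pairs)
    return hits / len(pairs)


# Toy example: the third pixel has no ground truth and is ignored.
prior = [10.2, 9.2, 20.0]
gt = [10.0, 10.0, 0.0]
share_5 = within_relative_error(prior, gt, tol=0.05)
share_10 = within_relative_error(prior, gt, tol=0.10)
print(share_5, share_10)
```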
Key Findings
- Embodied Road Depth has extremely high accuracy (99.33% of pixels have an error < 10%) and can serve as a substitute for LiDAR in road areas.
- Different semantic segmentation models have minimal impact on Embodied Depth (the gap compared with GT segmentation is < 1%), showing strong robustness.
- The complete framework achieves an AbsRel of 0.050 on KITTI, outperforming prior methods such as ECoDepth (0.054) and iDisc (0.053).
- Although the accuracy of Embodied Scene Depth decreases in non-planar regions, it provides valuable dense geometric priors.
- The alternating training strategy (textual VAE and conditional sampler) is crucial to the overall performance.
Highlights & Insights
- Direct Exploitation of Physical Camera Models: Depth priors are computed analytically rather than learned, achieving near-sensor-level accuracy in planar regions.
- Three-Level Embodiment Integration: A unified framework of camera + environment + language, where each level complements the others.
- Innovative Depth Text Descriptions: Object depth values and rankings are transformed into natural language, leveraging the semantic capacity of the CLIP text encoder.
- Low Dependency on Segmentation Models: The accuracy differences of different segmentation models have minimal impact on the final depth estimation.
Limitations & Future Work
- The method is validated primarily on driving datasets such as KITTI and DDAD; its applicability to scenes without a clear ground plane (e.g., indoor scenes) is limited.
- Embodied Scene Depth relies on the coplanarity assumption, causing accuracy to degrade on non-planar surfaces like ramps and steps.
- Textual descriptions depend on the quality of automatic caption generation tools.
- Requires known camera intrinsic and extrinsic parameters, which limits zero-shot generalizability.
- Future work can extend the method to more scene types and integrate stronger VLMs.
Related Work & Insights
- CLIP-based depth estimation methods such as DepthCLIP use fixed text templates, whereas this work dynamically generates text descriptions that incorporate depth information.
- This work shares a similar textual VAE structure with WorDepth but additionally fuses Embodied Scene Depth.
- The approach of analytically computing depth from camera models can serve as an optional prior input for any depth estimation method.
Rating
⭐⭐⭐⭐ — The idea of embodying camera models carries high practical value in driving scenarios, and the design of the vision-language fusion framework is sound; however, its application scenarios are constrained by the assumption of planar environments.