
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

  • Conference: CVPR 2026
  • arXiv: 2506.13387
  • Code: GitHub
  • Institution: The Chinese University of Hong Kong
  • Area: 3D Vision
  • Keywords: Monocular depth estimation, relative-to-metric depth transfer, language descriptions, cross-modal attention, contrastive learning, pixel-level scaling

TL;DR

TR2M is a framework that leverages images and textual descriptions to predict pixel-wise scale/shift maps, converting generalizable but scale-ambiguous relative depth into metric depth. With only 19M trainable parameters and 102K training images, it delivers strong zero-shot metric depth estimation across domains.

Background & Motivation

Monocular depth estimation (MDE) follows two major paradigms:

  • Metric MDE (MMDE): Outputs real-world scale (in meters), but is typically domain-specific and generalizes poorly across domains; camera intrinsics or large-scale data are required to mitigate this, at high cost.
  • Relative MDE (MRDE): Trained with affine-invariant losses, offering strong cross-domain generalization, but lacking absolute scale, which limits downstream applications such as robotic navigation and 3D reconstruction.

Existing relative-to-metric conversion methods suffer from two core problems:

  1. Single-factor scaling limitation: Prior methods (e.g., RSA) estimate only a single global scale and shift, which cannot correct locally erroneous regions in the relative depth map and may even amplify errors.
  2. Semantic description ambiguity: Objects of the same category can have varying depths across different distributions and scales, yet their textual descriptions may be similar, leading to inaccurate scale estimation.

Core Problem

How can the scale ambiguity of relative depth be efficiently resolved to convert it into metric depth, while preserving the cross-domain generalization of MRDE? The key challenge lies in achieving pixel-level fine-grained scaling rather than global scaling, and enforcing scale consistency at the feature level.

Method

Overall Architecture

TR2M takes as input an RGB image \(I \in \mathbb{R}^{H \times W \times 3}\) and a textual description \(L\) (automatically generated by LLaVA), and predicts two pixel-wise maps: a scale map \(A \in \mathbb{R}^{H \times W}\) and a shift map \(B \in \mathbb{R}^{H \times W}\). These maps convert the relative depth \(D_r\) output by a frozen relative depth model (Depth Anything-Small) into metric depth:

\[\hat{D}_m = \frac{1}{A \odot D_r + B}\]

where \(\odot\) denotes element-wise multiplication. This pixel-wise transformation can correct locally erroneous regions in the relative depth map, unlike global single-factor scaling.
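
A minimal sketch of this transformation, assuming PyTorch tensors (the epsilon guard is our addition to avoid division by zero, not part of the paper):

```python
import torch

def relative_to_metric(d_rel: torch.Tensor,
                       scale: torch.Tensor,
                       shift: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Apply pixel-wise scale/shift maps to relative depth.

    d_rel, scale, shift: (H, W) tensors. Per the equation above, the
    affine rescaling acts in inverse-depth space and the metric depth
    is its reciprocal; eps guards against division by zero.
    """
    return 1.0 / (scale * d_rel + shift).clamp(min=eps)
```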

Feature Extraction and Cross-Modal Attention

  • Image encoder: Frozen DINOv2 ViT-L, extracting image features \(F_I \in \mathbb{R}^{HW \times D}\)
  • Text encoder: Frozen CLIP ViT-L/14, extracting text features \(F_L \in \mathbb{R}^{1 \times D}\)

A cross-modal attention module fuses the two feature streams. With image features as the Query, attention is computed both over the image features themselves (self-attention) and over the text features (cross-attention):

\[\text{Attn}_{cm}^{i}(Q_I, K_i, V_i) = \text{softmax}\left(\frac{Q_I K_i^T}{\sqrt{d}}\right) \cdot V_i, \quad i \in \{I, L\}\]

The final fused feature is: \(F_{out} = F_I + \text{Attn}_{cm}^I + \text{Attn}_{cm}^L\)

The key design insight is that image features maintain pixel-level spatial resolution as the Query, while text features serve as global scale priors injected into each pixel position via cross-attention.
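
A sketch of this fusion, assuming standard `nn.MultiheadAttention` blocks and an arbitrary head count (the paper's exact attention implementation may differ):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Image tokens query both themselves (self-attention) and the
    text token (cross-attention); the branches are summed residually,
    i.e. F_out = F_I + Attn^I + Attn^L as in the equation above."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        # f_img: (B, HW, D) image patch tokens; f_txt: (B, 1, D) text token.
        attn_i, _ = self.self_attn(f_img, f_img, f_img)   # Attn_cm^I
        attn_l, _ = self.cross_attn(f_img, f_txt, f_txt)  # Attn_cm^L
        return f_img + attn_i + attn_l                     # F_out
```

Note that with a single text token the cross-attention softmax is trivial, so the text branch effectively broadcasts a projected global prior to every pixel position, consistent with the design insight above.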

Decoder

Two lightweight DPT-style decoder heads generate the scale map and shift map from the fused feature \(F_f\) (the \(F_{out}\) defined above):

  • \(A = \text{ScaleHead}(F_f)\)
  • \(B = \text{ShiftHead}(F_f)\)

Pseudo Metric Depth and Threshold Filtering

Since ground-truth metric depth is typically sparse (some pixels lack annotations), pseudo metric depth is generated by aligning relative depth to the GT via least-squares regression:

\[(\tilde{\alpha}, \tilde{\beta}) = \arg\min_{\tilde{\alpha}, \tilde{\beta}} \sum_{i=1}^{HW} (\tilde{\alpha} D_r(i) + \tilde{\beta} - D_m^{gt}(i))^2\]

yielding \(D_m^{pseudo} = \tilde{\alpha} D_r + \tilde{\beta}\).

A key innovation is quality filtering: the threshold accuracy \(\delta_1\) (the proportion of pixels where \(\max(D_m^{gt}/D_m^{pseudo}, D_m^{pseudo}/D_m^{gt}) < 1.25\)) is used to assess pseudo depth reliability. Supervision is applied only when \(\delta_1 > \rho\):

\[\mathcal{L}_{tp\text{-}si} = \mathbf{1}(\delta_1 > \rho) \cdot \mathcal{L}_{si}(\hat{D}_m, D_m^{pseudo})\]

This provides reasonable supervision signals for pixels outside sparse GT regions, improving generalization.
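
A compact sketch of the alignment and quality gate (`rho` is a stand-in threshold; the paper's exact value is not restated here):

```python
import torch

def pseudo_metric_depth(d_rel: torch.Tensor, d_gt: torch.Tensor,
                        mask: torch.Tensor, rho: float = 0.6):
    """Align relative depth to sparse GT by least squares, then gate
    supervision on the delta_1 accuracy of the resulting pseudo depth.

    d_rel, d_gt: (H, W); mask: bool map of pixels with valid GT.
    Returns (pseudo_depth, use_supervision).
    """
    x, y = d_rel[mask], d_gt[mask]
    # Closed-form solution of min_{a,b} sum (a * x + b - y)^2.
    A = torch.stack([x, torch.ones_like(x)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution   # (2, 1)
    a, b = sol[0, 0], sol[1, 0]
    d_pseudo = a * d_rel + b
    # delta_1: fraction of GT pixels whose ratio to pseudo depth < 1.25.
    ratio = torch.maximum(y / d_pseudo[mask], d_pseudo[mask] / y)
    delta1 = (ratio < 1.25).float().mean()
    return d_pseudo, bool(delta1 > rho)
```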

Dual-Level Scale-Oriented Contrast

This is the most novel module in the paper, enforcing consistency between features and depth scale distributions at two granularities:

1) Image-Level Coarse Contrast

Within each training batch, image-level embeddings \(\tilde{F}_f\) are obtained via average pooling over image features, then sorted by the pseudo scale factor \(\tilde{\alpha}\). Images with similar scales form positive pairs; those with large scale differences form negative pairs:

  • Positive pairs: \(|i - j| < t\) (rank distance after sorting by scale below threshold \(t\))
  • Negative pairs: \(|i - j| \geq t\)

An InfoNCE-style contrastive loss pulls scenes with similar scales closer in feature space.
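
A sketch of the coarse contrast, with assumed values for the rank threshold `t` and temperature `tau`:

```python
import torch
import torch.nn.functional as F

def coarse_scale_contrast(emb: torch.Tensor, alpha: torch.Tensor,
                          t: int = 2, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style image-level scale contrast over one batch.

    emb: (B, D) pooled image embeddings; alpha: (B,) pseudo scale
    factors. Samples are ranked by alpha; pairs within rank distance
    t are positives, all other pairs negatives.
    """
    rank = alpha.argsort().argsort()                # rank of each sample by scale
    dist = (rank[:, None] - rank[None, :]).abs()    # pairwise rank distance
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = (dist < t) & ~eye                         # positive-pair mask
    sim = F.cosine_similarity(emb[:, None], emb[None, :], dim=-1) / tau
    sim.fill_diagonal_(float('-inf'))               # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    loss = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```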

2) Pixel-Level Fine Contrast

GT depth maps are used to construct pixel-level contrast. Depth values are quantized into \(|\mathcal{C}|\) discrete categories over the depth range \([d_{min}, d_{max}]\):

\[Y = \text{round}\left(\frac{d_{max} - D}{d_{max} - d_{min}} \times |\mathcal{C}|\right)\]

An EMA dual-branch structure (similar to MoCo) is adopted: Query and Key branches process different samples and generate pixel-level features. For a Query pixel \(i\), pixels in the Key feature map belonging to the same depth category serve as positive samples, while those in different categories serve as negative samples. The loss maximizes similarity to positive samples and minimizes similarity to negative samples.
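
The category construction can be sketched as follows (the EMA key encoder and the InfoNCE loss itself are omitted; `num_bins` stands in for \(|\mathcal{C}|\)):

```python
import torch

def quantize_depth(d: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Map a GT depth map (H, W) to discrete scale categories,
    following the quantization equation above."""
    d_min, d_max = d.min(), d.max()
    return torch.round((d_max - d) / (d_max - d_min) * num_bins).long()

def fine_contrast_positives(d_query: torch.Tensor, d_key: torch.Tensor,
                            num_bins: int = 16) -> torch.Tensor:
    """Positive mask between query and key pixels: True where two
    pixels share a depth category, False (negative) elsewhere."""
    y_q = quantize_depth(d_query, num_bins).flatten()
    y_k = quantize_depth(d_key, num_bins).flatten()
    return y_q[:, None] == y_k[None, :]             # (HW_q, HW_k) bool mask
```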

The total contrastive loss is: \(\mathcal{L}_{soc} = \mathcal{L}_{coarse} + \mathcal{L}_{fine}\)

Loss & Training

\[\mathcal{L} = \lambda_1 \mathcal{L}_{si} + \lambda_2 \mathcal{L}_{tp\text{-}si} + \lambda_3 \mathcal{L}_{soc} + \lambda_4 \mathcal{L}_{es}\]

where \(\mathcal{L}_{es}\) is an edge-aware smoothness loss that regularizes the scale/shift maps. Hyperparameters are set as \(\lambda_1=1, \lambda_2=0.5, \lambda_3=0.1, \lambda_4=0.01\).
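
The smoothness term is only named above; a standard monodepth-style edge-aware formulation, from which the paper's exact \(\mathcal{L}_{es}\) may differ in detail, looks like:

```python
import torch

def edge_aware_smoothness(m: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    """Penalize gradients of a scale or shift map `m` (H, W), weighted
    down at image edges so discontinuities remain allowed there.
    img: (3, H, W) RGB image."""
    dm_x = (m[:, 1:] - m[:, :-1]).abs()        # horizontal map gradients
    dm_y = (m[1:, :] - m[:-1, :]).abs()        # vertical map gradients
    wx = torch.exp(-(img[:, :, 1:] - img[:, :, :-1]).abs().mean(0))
    wy = torch.exp(-(img[:, 1:, :] - img[:, :-1, :]).abs().mean(0))
    return (dm_x * wx).mean() + (dm_y * wy).mean()
```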

Implementation Details

| Configuration | Setting |
| --- | --- |
| GPU | NVIDIA RTX 4090 |
| Optimizer | AdamW |
| Batch size | 8 |
| Learning rate | \(1 \times 10^{-5}\), decayed by 0.9 per epoch |
| Training epochs | 20 |
| Learnable scale embeddings | 256 |
| Relative depth model | Depth Anything-Small (frozen) |
| Image encoder | DINOv2 ViT-L (frozen) |
| Text encoder | CLIP ViT-L/14 (frozen) |
| Trainable parameters | 19M |
| Training data size | 102K images |
| Training datasets | NYUv2 + KITTI + VOID + C3VD |
| Text generation | LLaVA v1.6 (Vicuna & Mistral) |

Key Experimental Results

NYUv2 Indoor Benchmark (Table 1)

| Method | Type | Trainable Params | \(\delta_1\uparrow\) | AbsRel\(\downarrow\) | RMSE\(\downarrow\) |
| --- | --- | --- | --- | --- | --- |
| DA V2 | Direct metric | 25M | 0.969 | 0.073 | 0.261 |
| UniDepth | Zero-shot metric | 347M | 0.981 | 0.072 | 0.229 |
| Metric3Dv2 | Zero-shot metric | 1011M | 0.980 | 0.067 | 0.260 |
| RSA | Language + scale factor | 4.7M | 0.752 | 0.156 | 0.528 |
| ScaleDepth | Language + domain adaptation | 109M | 0.913 | 0.099 | 0.329 |
| WorDepth | Language + domain adaptation | 137M | 0.926 | 0.090 | 0.330 |
| TR2M | Language + scale map | 19M | 0.954 | 0.082 | 0.293 |

Zero-Shot Cross-Domain Generalization (Table 3, 5 Unseen Datasets)

| Method | Backbone | Training Images | SUN \(\delta_1\) | iBims \(\delta_1\) | HyperSim \(\delta_1\) | DIODE \(\delta_1\) | SimCol \(\delta_1\) | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZoeDepth | BeiT384-L | - | 0.545 | 0.656 | 0.302 | 0.237 | 0.438 | 4.70 |
| DA Single | ViT-L | - | 0.660 | 0.714 | 0.361 | 0.288 | 0.553 | 2.80 |
| UniDepth | ViT-L | 3M | 0.443 | 0.217 | 0.545 | 0.635 | - | 4.00 |
| RSA | ViT-S | 102K | 0.527 | 0.450 | 0.230 | 0.244 | 0.162 | 5.80 |
| TR2M | ViT-S | 102K | 0.591 | 0.736 | 0.361 | 0.274 | 0.445 | 2.40 |

TR2M achieves the best average rank (2.40) with a ViT-Small backbone and only 102K training images, substantially outperforming methods that use larger models and more data.

Ablation Study

| Rescale Maps | \(\mathcal{L}_{tp\text{-}si}\) | \(\mathcal{L}_{soc}\) | NYUv2 AbsRel\(\downarrow\) | iBims \(\delta_1\uparrow\) | DIODE \(\delta_1\uparrow\) |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | ✗ | 0.118 | 0.566 | 0.225 |
| ✓ | ✗ | ✗ | 0.084 | 0.657 | 0.237 |
| ✓ | ✓ | ✗ | 0.085 | 0.708 | 0.243 |
| ✓ | ✓ | ✓ | 0.082 | 0.736 | 0.274 |

(The first row is the baseline with a single global scale/shift factor instead of pixel-wise rescale maps.)

  • Single-factor → pixel-wise maps: AbsRel drops from 0.118 to 0.084, the largest single gain.
  • Adding pseudo depth supervision: significant improvement in zero-shot settings (iBims δ₁ +0.051).
  • Adding scale-oriented contrastive learning: further zero-shot gains (DIODE δ₁ +0.031).

Highlights & Insights

  1. Pixel-level rescale maps replacing global single-factor scaling represent the core innovation, enabling correction of locally erroneous regions in relative depth — something prior methods (e.g., RSA) cannot achieve.
  2. Exceptional parameter efficiency: only 19M trainable parameters (all encoders are frozen), 1–2 orders of magnitude smaller than UniDepth (347M) and Metric3Dv2 (1011M), yet achieving comparable performance.
  3. Dual-level contrastive learning is elegantly designed: coarse-grained contrast ensures global scale consistency, fine-grained contrast ensures local depth distribution consistency, and the two are complementary.
  4. The pseudo depth + threshold filtering strategy is a concise and effective solution to sparse GT supervision, improving zero-shot generalization.
  5. Textual descriptions serve as "free" auxiliary priors, requiring no camera parameters or sensor information.

Limitations & Future Work

  1. Dependence on LLaVA-generated descriptions: Text quality is bounded by VLM capability, and descriptions for extreme domains (e.g., medical endoscopy) may be insufficiently accurate.
  2. Upper bound of the relative depth model: Framework performance is constrained by the output quality of the frozen Depth Anything-Small backbone.
  3. Remaining gap in some zero-shot domains: AbsRel remains high on SUN RGB-D and DIODE Outdoor (0.451/0.673), indicating that large cross-domain scale discrepancies remain challenging.
  4. Computational cost of contrastive learning: Pixel-level contrast requires an EMA dual-branch structure, increasing training time and memory consumption.
  5. A stronger relative depth backbone (e.g., Depth Anything V2) could further raise the performance ceiling.

Comparison with Related Methods

| Dimension | RSA | ScaleDepth/WorDepth | UniDepth/Metric3Dv2 | TR2M |
| --- | --- | --- | --- | --- |
| Input | Text | Text + image | Image (+ camera params) | Text + image + relative depth |
| Scaling approach | Global dual-factor | Implicit decoding | End-to-end metric head | Pixel-wise maps |
| Trainable parameters | 4.7M | 109–137M | 347–1011M | 19M |
| Training data | 102K | Single domain | 3–16M | 102K |
| Cross-domain ability | Poor | Domain-restricted | Moderate | Strong |
| Core limitation | Single factor too coarse | Limited by domain adaptation | Requires large data and models | Dependent on relative depth quality |

The relative-to-metric depth paradigm parallels the "pretrain + lightweight adaptation" paradigm in NLP; the strategy of freezing large models and fine-tuning with few parameters is broadly transferable. The pixel-level rescale map concept can be extended to other scale-recovery tasks (e.g., monocular normal estimation, optical flow scale recovery). The dual-level contrastive learning design (global + local) can be adapted as feature regularization in other dense prediction tasks. Incorporating more powerful MLLMs (e.g., GPT-4V) into text-assisted depth estimation could further enhance scene understanding and scale reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ (novel combination of pixel-level rescale maps and dual-level contrastive learning)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (4 training sets + 5 zero-shot test sets + comprehensive ablations)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, well-motivated)
  • Value: ⭐⭐⭐⭐ (efficient relative-to-metric depth conversion with strong practical applicability)