
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

  • Conference: CVPR 2026
  • arXiv: 2506.13387
  • Code: GitHub
  • Institution: The Chinese University of Hong Kong
  • Area: 3D Vision
  • Keywords: Monocular depth estimation, relative-to-metric depth transfer, language descriptions, cross-modal attention, contrastive learning, pixel-level scaling

TL;DR

TR2M is a framework that leverages images and textual descriptions to predict pixel-wise scale/shift maps, converting generalizable but scale-ambiguous relative depth into metric depth. With only 19M trainable parameters and 102K training images, it delivers strong zero-shot metric depth estimation across domains.

Background & Motivation

Monocular depth estimation (MDE) follows two major paradigms:

  • Metric MDE (MMDE): Outputs real-world scale (in meters), but is typically domain-specific and generalizes poorly across domains; camera intrinsics or large-scale data are required to mitigate this, at high cost.
  • Relative MDE (MRDE): Trained with affine-invariant losses, offering strong cross-domain generalization, but lacking absolute scale, which limits downstream applications such as robotic navigation and 3D reconstruction.

Existing relative-to-metric conversion methods suffer from two core problems:

  1. Single-factor scaling limitation: Prior methods (e.g., RSA) estimate only a single global scale and shift, which cannot correct locally erroneous regions in the relative depth map and may even amplify errors.
  2. Semantic description ambiguity: Objects of the same category can have varying depths across different distributions and scales, yet their textual descriptions may be similar, leading to inaccurate scale estimation.

Core Problem

How can the scale ambiguity of relative depth be efficiently resolved to convert it into metric depth, while preserving the cross-domain generalization of MRDE? The key challenge lies in achieving pixel-level fine-grained scaling rather than global scaling, and enforcing scale consistency at the feature level.

Method

Overall Architecture

TR2M takes as input an RGB image \(I \in \mathbb{R}^{H \times W \times 3}\) and a textual description \(L\) (automatically generated by LLaVA), and predicts two pixel-wise maps: a scale map \(A \in \mathbb{R}^{H \times W}\) and a shift map \(B \in \mathbb{R}^{H \times W}\). These maps convert the relative depth \(D_r\) output by a frozen relative depth model (Depth Anything-Small) into metric depth:

\[\hat{D}_m = \frac{1}{A \odot D_r + B}\]

where \(\odot\) denotes element-wise multiplication. This pixel-wise transformation can correct locally erroneous regions in the relative depth map, unlike global single-factor scaling.
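
A minimal sketch of this transformation, assuming PyTorch tensors (the epsilon guard is our addition to avoid division by zero, not part of the paper):

```python
import torch

def relative_to_metric(d_rel: torch.Tensor,
                       scale: torch.Tensor,
                       shift: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """Apply pixel-wise scale/shift maps to relative depth.

    d_rel, scale, shift: (H, W) tensors. Per the equation above, the
    affine rescaling acts in inverse-depth space and the metric depth
    is its reciprocal; eps guards against division by zero.
    """
    return 1.0 / (scale * d_rel + shift).clamp(min=eps)
```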

Feature Extraction and Cross-Modal Attention

  • Image encoder: Frozen DINOv2 ViT-L, extracting image features \(F_I \in \mathbb{R}^{HW \times D}\)
  • Text encoder: Frozen CLIP ViT-L/14, extracting text features \(F_L \in \mathbb{R}^{1 \times D}\)

A cross-modal attention module fuses the two feature streams. With image features as the Query, attention is computed both over the image features themselves (self-attention) and over the text features (cross-attention):

\[\text{Attn}_{cm}^{i}(Q_I, K_i, V_i) = \text{softmax}\left(\frac{Q_I K_i^T}{\sqrt{d}}\right) \cdot V_i, \quad i \in \{I, L\}\]

The final fused feature is: \(F_{out} = F_I + \text{Attn}_{cm}^I + \text{Attn}_{cm}^L\)

The key design insight is that image features maintain pixel-level spatial resolution as the Query, while text features serve as global scale priors injected into each pixel position via cross-attention.
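
A sketch of this fusion, assuming standard `nn.MultiheadAttention` blocks and an arbitrary head count (the paper's exact attention implementation may differ):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Image tokens query both themselves (self-attention) and the
    text token (cross-attention); the branches are summed residually,
    i.e. F_out = F_I + Attn^I + Attn^L as in the equation above."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        # f_img: (B, HW, D) image patch tokens; f_txt: (B, 1, D) text token.
        attn_i, _ = self.self_attn(f_img, f_img, f_img)   # Attn_cm^I
        attn_l, _ = self.cross_attn(f_img, f_txt, f_txt)  # Attn_cm^L
        return f_img + attn_i + attn_l                     # F_out
```

Note that with a single text token the cross-attention softmax is trivial, so the text branch effectively broadcasts a projected global prior to every pixel position, consistent with the design insight above.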

Decoder

Two lightweight DPT-style decoder heads generate the scale map and shift map from the fused feature \(F_f\) (the \(F_{out}\) defined above):

  • \(A = \text{ScaleHead}(F_f)\)
  • \(B = \text{ShiftHead}(F_f)\)

Pseudo Metric Depth and Threshold Filtering

Since ground-truth metric depth is typically sparse (some pixels lack annotations), pseudo metric depth is generated by aligning relative depth to the GT via least-squares regression:

\[(\tilde{\alpha}, \tilde{\beta}) = \arg\min_{\tilde{\alpha}, \tilde{\beta}} \sum_{i=1}^{HW} (\tilde{\alpha} D_r(i) + \tilde{\beta} - D_m^{gt}(i))^2\]

yielding \(D_m^{pseudo} = \tilde{\alpha} D_r + \tilde{\beta}\).

A key innovation is quality filtering: the threshold accuracy \(\delta_1\) (the proportion of pixels where \(\max(D_m^{gt}/D_m^{pseudo}, D_m^{pseudo}/D_m^{gt}) < 1.25\)) is used to assess pseudo depth reliability. Supervision is applied only when \(\delta_1 > \rho\):

\[\mathcal{L}_{tp\text{-}si} = \mathbf{1}(\delta_1 > \rho) \cdot \mathcal{L}_{si}(\hat{D}_m, D_m^{pseudo})\]

This provides reasonable supervision signals for pixels outside sparse GT regions, improving generalization.
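
A compact sketch of the alignment and quality gate (`rho` is a stand-in threshold; the paper's exact value is not restated here):

```python
import torch

def pseudo_metric_depth(d_rel: torch.Tensor, d_gt: torch.Tensor,
                        mask: torch.Tensor, rho: float = 0.6):
    """Align relative depth to sparse GT by least squares, then gate
    supervision on the delta_1 accuracy of the resulting pseudo depth.

    d_rel, d_gt: (H, W); mask: bool map of pixels with valid GT.
    Returns (pseudo_depth, use_supervision).
    """
    x, y = d_rel[mask], d_gt[mask]
    # Closed-form solution of min_{a,b} sum (a * x + b - y)^2.
    A = torch.stack([x, torch.ones_like(x)], dim=1)        # (N, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution   # (2, 1)
    a, b = sol[0, 0], sol[1, 0]
    d_pseudo = a * d_rel + b
    # delta_1: fraction of GT pixels whose ratio to pseudo depth < 1.25.
    ratio = torch.maximum(y / d_pseudo[mask], d_pseudo[mask] / y)
    delta1 = (ratio < 1.25).float().mean()
    return d_pseudo, bool(delta1 > rho)
```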

Dual-Level Scale-Oriented Contrast

This is the most novel module in the paper, enforcing consistency between features and depth scale distributions at two granularities:

1) Image-Level Coarse Contrast

Within each training batch, image-level embeddings \(\tilde{F}_f\) are obtained via average pooling over image features, then sorted by the pseudo scale factor \(\tilde{\alpha}\). Images with similar scales form positive pairs; those with large scale differences form negative pairs:

  • Positive pairs: \(|i - j| < t\) (rank distance after sorting by scale below threshold \(t\))
  • Negative pairs: \(|i - j| \geq t\)

An InfoNCE-style contrastive loss pulls scenes with similar scales closer in feature space.
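
A sketch of the coarse contrast, with assumed values for the rank threshold `t` and temperature `tau`:

```python
import torch
import torch.nn.functional as F

def coarse_scale_contrast(emb: torch.Tensor, alpha: torch.Tensor,
                          t: int = 2, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE-style image-level scale contrast over one batch.

    emb: (B, D) pooled image embeddings; alpha: (B,) pseudo scale
    factors. Samples are ranked by alpha; pairs within rank distance
    t are positives, all other pairs negatives.
    """
    rank = alpha.argsort().argsort()                # rank of each sample by scale
    dist = (rank[:, None] - rank[None, :]).abs()    # pairwise rank distance
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = (dist < t) & ~eye                         # positive-pair mask
    sim = F.cosine_similarity(emb[:, None], emb[None, :], dim=-1) / tau
    sim.fill_diagonal_(float('-inf'))               # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    loss = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```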

2) Pixel-Level Fine Contrast

GT depth maps are used to construct pixel-level contrast. Depth values are quantized into \(|\mathcal{C}|\) discrete categories over the depth range \([d_{min}, d_{max}]\):

\[Y = \text{round}\left(\frac{d_{max} - D}{d_{max} - d_{min}} \times |\mathcal{C}|\right)\]

An EMA dual-branch structure (similar to MoCo) is adopted: Query and Key branches process different samples and generate pixel-level features. For a Query pixel \(i\), pixels in the Key feature map belonging to the same depth category serve as positive samples, while those in different categories serve as negative samples. The loss maximizes similarity to positive samples and minimizes similarity to negative samples.
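
The category construction can be sketched as follows (the EMA key encoder and the InfoNCE loss itself are omitted; `num_bins` stands in for \(|\mathcal{C}|\)):

```python
import torch

def quantize_depth(d: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Map a GT depth map (H, W) to discrete scale categories,
    following the quantization equation above."""
    d_min, d_max = d.min(), d.max()
    return torch.round((d_max - d) / (d_max - d_min) * num_bins).long()

def fine_contrast_positives(d_query: torch.Tensor, d_key: torch.Tensor,
                            num_bins: int = 16) -> torch.Tensor:
    """Positive mask between query and key pixels: True where two
    pixels share a depth category, False (negative) elsewhere."""
    y_q = quantize_depth(d_query, num_bins).flatten()
    y_k = quantize_depth(d_key, num_bins).flatten()
    return y_q[:, None] == y_k[None, :]             # (HW_q, HW_k) bool mask
```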

The total contrastive loss is: \(\mathcal{L}_{soc} = \mathcal{L}_{coarse} + \mathcal{L}_{fine}\)

Loss & Training

\[\mathcal{L} = \lambda_1 \mathcal{L}_{si} + \lambda_2 \mathcal{L}_{tp\text{-}si} + \lambda_3 \mathcal{L}_{soc} + \lambda_4 \mathcal{L}_{es}\]

where \(\mathcal{L}_{es}\) is an edge-aware smoothness loss that regularizes the scale/shift maps. Hyperparameters are set as \(\lambda_1=1, \lambda_2=0.5, \lambda_3=0.1, \lambda_4=0.01\).
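
The smoothness term is only named above; a standard monodepth-style edge-aware formulation, from which the paper's exact \(\mathcal{L}_{es}\) may differ in detail, looks like:

```python
import torch

def edge_aware_smoothness(m: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    """Penalize gradients of a scale or shift map `m` (H, W), weighted
    down at image edges so discontinuities remain allowed there.
    img: (3, H, W) RGB image."""
    dm_x = (m[:, 1:] - m[:, :-1]).abs()        # horizontal map gradients
    dm_y = (m[1:, :] - m[:-1, :]).abs()        # vertical map gradients
    wx = torch.exp(-(img[:, :, 1:] - img[:, :, :-1]).abs().mean(0))
    wy = torch.exp(-(img[:, 1:, :] - img[:, :-1, :]).abs().mean(0))
    return (dm_x * wx).mean() + (dm_y * wy).mean()
```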

Implementation Details

| Configuration | Setting |
| --- | --- |
| GPU | NVIDIA RTX 4090 |
| Optimizer | AdamW |
| Batch size | 8 |
| Learning rate | \(1 \times 10^{-5}\), decayed by 0.9 per epoch |
| Training epochs | 20 |
| Learnable scale embeddings | 256 |
| Relative depth model | Depth Anything-Small (frozen) |
| Image encoder | DINOv2 ViT-L (frozen) |
| Text encoder | CLIP ViT-L/14 (frozen) |
| Trainable parameters | 19M |
| Training data size | 102K images |
| Training datasets | NYUv2 + KITTI + VOID + C3VD |
| Text generation | LLaVA v1.6 (Vicuna & Mistral) |

Key Experimental Results

NYUv2 Indoor Benchmark (Table 1)

| Method | Type | Trainable Params | \(\delta_1\uparrow\) | AbsRel\(\downarrow\) | RMSE\(\downarrow\) |
| --- | --- | --- | --- | --- | --- |
| DA V2 | Direct metric | 25M | 0.969 | 0.073 | 0.261 |
| UniDepth | Zero-shot metric | 347M | 0.981 | 0.072 | 0.229 |
| Metric3Dv2 | Zero-shot metric | 1011M | 0.980 | 0.067 | 0.260 |
| RSA | Language + scale factor | 4.7M | 0.752 | 0.156 | 0.528 |
| ScaleDepth | Language + domain adaptation | 109M | 0.913 | 0.099 | 0.329 |
| WorDepth | Language + domain adaptation | 137M | 0.926 | 0.090 | 0.330 |
| TR2M | Language + scale map | 19M | 0.954 | 0.082 | 0.293 |

Zero-Shot Cross-Domain Generalization (Table 3, 5 Unseen Datasets)

| Method | Backbone | Training Images | SUN \(\delta_1\) | iBims \(\delta_1\) | HyperSim \(\delta_1\) | DIODE \(\delta_1\) | SimCol \(\delta_1\) | Avg Rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZoeDepth | BeiT384-L | - | 0.545 | 0.656 | 0.302 | 0.237 | 0.438 | 4.70 |
| DA Single | ViT-L | - | 0.660 | 0.714 | 0.361 | 0.288 | 0.553 | 2.80 |
| UniDepth | ViT-L | 3M | 0.443 | 0.217 | 0.545 | 0.635 | - | 4.00 |
| RSA | ViT-S | 102K | 0.527 | 0.450 | 0.230 | 0.244 | 0.162 | 5.80 |
| TR2M | ViT-S | 102K | 0.591 | 0.736 | 0.361 | 0.274 | 0.445 | 2.40 |

TR2M achieves the best average rank (2.40) with a ViT-Small backbone and only 102K training images, substantially outperforming methods that use larger models and more data.

Ablation Study

| Rescale Maps | \(\mathcal{L}_{tp\text{-}si}\) | \(\mathcal{L}_{soc}\) | NYUv2 AbsRel\(\downarrow\) | iBims \(\delta_1\uparrow\) | DIODE \(\delta_1\uparrow\) |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | ✗ | 0.118 | 0.566 | 0.225 |
| ✓ | ✗ | ✗ | 0.084 | 0.657 | 0.237 |
| ✓ | ✓ | ✗ | 0.085 | 0.708 | 0.243 |
| ✓ | ✓ | ✓ | 0.082 | 0.736 | 0.274 |

(The first row is the baseline with a single global scale/shift factor instead of pixel-wise rescale maps.)

  • Single-factor → pixel-wise maps: AbsRel drops from 0.118 to 0.084, the largest single gain.
  • Adding pseudo depth supervision: significant improvement in zero-shot settings (iBims δ₁ +0.051).
  • Adding scale-oriented contrastive learning: further zero-shot gains (DIODE δ₁ +0.031).

Highlights & Insights

  1. Pixel-level rescale maps replacing global single-factor scaling represent the core innovation, enabling correction of locally erroneous regions in relative depth — something prior methods (e.g., RSA) cannot achieve.
  2. Exceptional parameter efficiency: only 19M trainable parameters (all encoders are frozen), 1–2 orders of magnitude smaller than UniDepth (347M) and Metric3Dv2 (1011M), yet achieving comparable performance.
  3. Dual-level contrastive learning is elegantly designed: coarse-grained contrast ensures global scale consistency, fine-grained contrast ensures local depth distribution consistency, and the two are complementary.
  4. The pseudo depth + threshold filtering strategy is a concise and effective solution to sparse GT supervision, improving zero-shot generalization.
  5. Textual descriptions serve as "free" auxiliary priors, requiring no camera parameters or sensor information.

Limitations & Future Work

  1. Dependence on LLaVA-generated descriptions: Text quality is bounded by VLM capability, and descriptions for extreme domains (e.g., medical endoscopy) may be insufficiently accurate.
  2. Upper bound of the relative depth model: Framework performance is constrained by the output quality of the frozen Depth Anything-Small backbone.
  3. Remaining gap in some zero-shot domains: AbsRel remains high on SUN RGB-D and DIODE Outdoor (0.451/0.673), indicating that large cross-domain scale discrepancies remain challenging.
  4. Computational cost of contrastive learning: Pixel-level contrast requires an EMA dual-branch structure, increasing training time and memory consumption.
  5. A stronger relative depth backbone (e.g., Depth Anything V2) could further raise the performance ceiling.

Comparison with Related Methods

| Dimension | RSA | ScaleDepth/WorDepth | UniDepth/Metric3Dv2 | TR2M |
| --- | --- | --- | --- | --- |
| Input | Text | Text + image | Image (+ camera params) | Text + image + relative depth |
| Scaling approach | Global dual-factor | Implicit decoding | End-to-end metric head | Pixel-wise maps |
| Trainable parameters | 4.7M | 109–137M | 347–1011M | 19M |
| Training data | 102K | Single domain | 3–16M | 102K |
| Cross-domain ability | Poor | Domain-restricted | Moderate | Strong |
| Core limitation | Single factor too coarse | Limited by domain adaptation | Requires large data and models | Dependent on relative depth quality |

The relative-to-metric depth paradigm parallels the "pretrain + lightweight adaptation" paradigm in NLP; the strategy of freezing large models and fine-tuning with few parameters is broadly transferable. The pixel-level rescale map concept can be extended to other scale-recovery tasks (e.g., monocular normal estimation, optical flow scale recovery). The dual-level contrastive learning design (global + local) can be adapted as feature regularization in other dense prediction tasks. Incorporating more powerful MLLMs (e.g., GPT-4V) into text-assisted depth estimation could further enhance scene understanding and scale reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ (novel combination of pixel-level rescale maps and dual-level contrastive learning)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (4 training sets + 5 zero-shot test sets + comprehensive ablations)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, well-motivated)
  • Value: ⭐⭐⭐⭐ (efficient relative-to-metric depth conversion with strong practical applicability)