# TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
- Conference: CVPR 2026
- arXiv: 2506.13387
- Code: GitHub
- Institution: The Chinese University of Hong Kong
- Area: 3D Vision
- Keywords: Monocular depth estimation, relative-to-metric depth transfer, language descriptions, cross-modal attention, contrastive learning, pixel-level scaling
## TL;DR
TR2M is a framework that leverages images and textual descriptions to predict pixel-wise scale/shift maps, converting generalizable but scale-ambiguous relative depth into metric depth. With only 19M trainable parameters and 102K training images, it achieves zero-shot cross-domain metric depth estimation.
## Background & Motivation
Monocular depth estimation (MDE) follows two major paradigms:
- Metric MDE (MMDE): Outputs real-world scale (in meters), but is typically domain-specific and generalizes poorly across domains; camera intrinsics or large-scale data are required to mitigate this, at high cost.
- Relative MDE (MRDE): Trained with affine-invariant losses, offering strong cross-domain generalization, but lacking absolute scale, which limits downstream applications such as robotic navigation and 3D reconstruction.
Existing relative-to-metric conversion methods suffer from two core problems:
- Single-factor scaling limitation: Prior methods (e.g., RSA) estimate only one global scale and shift factor, failing to correct locally erroneous regions in the relative depth map and potentially amplifying errors.
- Semantic description ambiguity: Objects of the same category can have varying depths across different distributions and scales, yet their textual descriptions may be similar, leading to inaccurate scale estimation.
## Core Problem
How can the scale ambiguity of relative depth be efficiently resolved to convert it into metric depth, while preserving the cross-domain generalization of MRDE? The key challenge lies in achieving pixel-level fine-grained scaling rather than global scaling, and enforcing scale consistency at the feature level.
## Method
### Overall Architecture
TR2M takes as input an RGB image \(I \in \mathbb{R}^{H \times W \times 3}\) and a textual description \(L\) (automatically generated by LLaVA), and predicts two pixel-wise maps: a scale map \(A \in \mathbb{R}^{H \times W}\) and a shift map \(B \in \mathbb{R}^{H \times W}\). These maps convert the relative depth \(D_r\) output by a frozen relative depth model (Depth Anything-Small) into metric depth:

\[D_m = A \odot D_r + B\]

where \(\odot\) denotes element-wise multiplication. Unlike global single-factor scaling, this pixel-wise transformation can correct locally erroneous regions in the relative depth map.
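The contrast between global single-factor scaling and the pixel-wise maps can be illustrated with a toy NumPy sketch (all values here are illustrative, not taken from the paper):

```python
import numpy as np

def global_rescale(d_rel, scale, shift):
    """Single-factor conversion in the spirit of prior work such as RSA:
    one scalar scale/shift for the whole image."""
    return scale * d_rel + shift

def pixelwise_rescale(d_rel, scale_map, shift_map):
    """TR2M-style conversion: per-pixel scale map A and shift map B,
    D_m = A (element-wise *) D_r + B."""
    return scale_map * d_rel + shift_map

# Toy example: relative depth with one locally erroneous pixel.
d_rel = np.ones((4, 4))
d_rel[0, 0] = 2.0          # region where relative depth is off
A = np.full((4, 4), 3.0)   # predicted scale map
A[0, 0] = 1.5              # a pixel-wise map can locally compensate
B = np.zeros((4, 4))
d_metric = pixelwise_rescale(d_rel, A, B)
```

A global factor of 3.0 would leave the erroneous pixel at 6.0, while the pixel-wise map brings every pixel to the same corrected metric value.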
### Feature Extraction and Cross-Modal Attention
- Image encoder: Frozen DINOv2 ViT-L, extracting image features \(F_I \in \mathbb{R}^{HW \times D}\)
- Text encoder: Frozen CLIP ViT-L/14, extracting text features \(F_L \in \mathbb{R}^{1 \times D}\)
A cross-modal attention module fuses the two feature streams. Using image features as Query, attention is computed both over the image features themselves (self-attention) and over the text features (cross-attention):

\[\text{Attn}_{cm}^I = \text{softmax}\!\left(\frac{Q K_I^\top}{\sqrt{d}}\right) V_I, \qquad \text{Attn}_{cm}^L = \text{softmax}\!\left(\frac{Q K_L^\top}{\sqrt{d}}\right) V_L\]

where \(Q\) is projected from \(F_I\), \(K_I, V_I\) from \(F_I\), and \(K_L, V_L\) from \(F_L\). The final fused feature is:

\[F_{out} = F_I + \text{Attn}_{cm}^I + \text{Attn}_{cm}^L\]
The key design insight is that image features maintain pixel-level spatial resolution as the Query, while text features serve as global scale priors injected into each pixel position via cross-attention.
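This fusion can be sketched in a few lines of NumPy (a minimal version: the learned projection matrices and multi-head splitting of the actual module are omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def cross_modal_fuse(f_img, f_txt):
    """Image tokens serve as Query for both self- and cross-attention;
    the two results are added residually: F_out = F_I + Attn^I + Attn^L."""
    attn_i = attention(f_img, f_img, f_img)  # self-attention over pixels
    attn_l = attention(f_img, f_txt, f_txt)  # cross-attention to text prior
    return f_img + attn_i + attn_l

rng = np.random.default_rng(0)
f_img = rng.standard_normal((64, 256))  # HW x D pixel-level image features
f_txt = rng.standard_normal((1, 256))   # 1 x D global CLIP text feature
f_out = cross_modal_fuse(f_img, f_txt)
```

Because the text side is a single global embedding, the cross-attention term effectively broadcasts the scale prior to every pixel position while the spatial resolution of the image features is preserved.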
### Decoder
Two lightweight DPT-style decoder heads generate the scale map and shift map from the fused feature \(F_f\):
- \(A = \text{ScaleHead}(F_f)\)
- \(B = \text{ShiftHead}(F_f)\)
### Pseudo Metric Depth and Threshold Filtering
Since ground-truth metric depth is typically sparse (some pixels lack annotations), pseudo metric depth is generated by aligning relative depth to the GT via least-squares regression:

\[(\tilde{\alpha}, \tilde{\beta}) = \arg\min_{\alpha, \beta} \sum_{i \in \Omega} \left(\alpha D_r(i) + \beta - D_m^{gt}(i)\right)^2\]

where \(\Omega\) is the set of pixels with valid GT, yielding \(D_m^{pseudo} = \tilde{\alpha} D_r + \tilde{\beta}\).
A key innovation is quality filtering: the threshold accuracy \(\delta_1\) (the proportion of pixels where \(\max(D_m^{gt}/D_m^{pseudo}, D_m^{pseudo}/D_m^{gt}) < 1.25\)) is used to assess pseudo depth reliability, and supervision with the pseudo depth is applied only when \(\delta_1 > \rho\).
This provides reasonable supervision signals for pixels outside sparse GT regions, improving generalization.
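The alignment and quality gate can be sketched as follows (the value of the threshold \(\rho\) and the toy data are assumptions for illustration; the paper's setting may differ):

```python
import numpy as np

def fit_scale_shift(d_rel, d_gt, mask):
    """Least-squares alignment of relative depth to sparse GT:
    argmin_{a,b} sum over valid pixels of (a*d_rel + b - d_gt)^2."""
    x, y = d_rel[mask], d_gt[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def delta1(d_pred, d_gt, mask):
    """Threshold accuracy: fraction of pixels with max depth ratio < 1.25."""
    r = np.maximum(d_pred[mask] / d_gt[mask], d_gt[mask] / d_pred[mask])
    return (r < 1.25).mean()

np.random.seed(0)
rho = 0.9                                # quality threshold (assumed value)
d_rel = np.random.rand(32, 32) + 0.5     # toy relative depth
d_gt = 2.0 * d_rel + 1.0                 # toy GT consistent with an affine map
mask = np.random.rand(32, 32) < 0.3      # sparse set of annotated pixels
a, b = fit_scale_shift(d_rel, d_gt, mask)
d_pseudo = a * d_rel + b
use_pseudo = delta1(d_pseudo, d_gt, mask) > rho  # quality gate
```

When the gate passes, the dense pseudo depth supervises pixels that the sparse GT leaves uncovered.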
### Dual-Level Scale-Oriented Contrast
This is the most novel module in the paper, enforcing consistency between features and depth scale distributions at two granularities:
1) Image-Level Coarse Contrast
Within each training batch, image-level embeddings \(\tilde{F}_f\) are obtained via average pooling over image features, then sorted by the pseudo scale factor \(\tilde{\alpha}\). Images with similar scales form positive pairs; those with large scale differences form negative pairs:
- Positive pairs: \(|i - j| < t\) (distance between sorted indices below threshold \(t\))
- Negative pairs: \(|i - j| \geq t\)
An InfoNCE-style contrastive loss pulls scenes with similar scales closer in feature space.
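A minimal sketch of this coarse contrast, assuming a rank-distance threshold `t` and temperature `tau` (both illustrative; the paper's values may differ):

```python
import numpy as np

def coarse_scale_contrast(feats, scales, t=2, tau=0.1):
    """InfoNCE-style contrast over a batch of pooled image embeddings,
    sorted by pseudo scale factor: pairs within rank distance t are
    positives, the rest are negatives."""
    order = np.argsort(scales)                        # sort by scale
    f = feats[order]
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # cosine similarity
    sim = f @ f.T / tau
    n = len(f)
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        pos = [j for j in others if abs(i - j) < t]   # scale-similar images
        if not pos:
            continue
        log_denom = np.log(np.exp(sim[i, others]).sum())
        loss += np.mean([log_denom - sim[i, j] for j in pos])
    return loss / n

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 32))   # batch of pooled image embeddings
scales = rng.random(8) * 10            # pseudo scale factors alpha~
loss = coarse_scale_contrast(feats, scales)
```

Minimizing this loss pulls scenes with similar scale factors together in embedding space and pushes scale-dissimilar scenes apart.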
2) Pixel-Level Fine Contrast
GT depth maps are used to construct pixel-level contrast: depth values are quantized into \(|\mathcal{C}|\) discrete categories based on their distribution.
An EMA dual-branch structure (similar to MoCo) is adopted: Query and Key branches process different samples and generate pixel-level features. For a Query pixel \(i\), pixels in the Key feature map belonging to the same depth category serve as positive samples, while those in different categories serve as negative samples. The loss maximizes similarity to positive samples and minimizes similarity to negative samples.
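A simplified sketch of the fine contrast (the quantile-based binning scheme and the omission of the EMA momentum update are assumptions of this sketch, not details from the paper):

```python
import numpy as np

def depth_bins(d, num_bins=8):
    """Quantize depth values into discrete categories by quantiles of
    their distribution (binning scheme assumed for illustration)."""
    edges = np.quantile(d, np.linspace(0, 1, num_bins + 1)[1:-1])
    return np.digitize(d, edges)

def pixel_contrast(q_feat, k_feat, q_depth, k_depth, tau=0.1):
    """For each Query pixel, Key pixels in the same depth category are
    positives and the rest negatives; the loss is InfoNCE-style.
    (The EMA update of the key branch is omitted here.)"""
    qc, kc = depth_bins(q_depth).ravel(), depth_bins(k_depth).ravel()
    q = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
    k = k_feat / np.linalg.norm(k_feat, axis=1, keepdims=True)
    sim = np.exp(q @ k.T / tau)                  # exp similarity matrix
    pos_mask = qc[:, None] == kc[None, :]        # same depth category
    pos = (sim * pos_mask).sum(axis=1)
    valid = pos > 0                              # queries with >= 1 positive
    return -np.log(pos[valid] / sim[valid].sum(axis=1)).mean()

rng = np.random.default_rng(0)
q_feat = rng.standard_normal((100, 16))  # flattened Query pixel features
k_feat = rng.standard_normal((100, 16))  # features from the EMA key branch
q_depth = rng.random(100) * 10
k_depth = rng.random(100) * 10
loss = pixel_contrast(q_feat, k_feat, q_depth, k_depth)
```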
The total contrastive loss is: \(\mathcal{L}_{soc} = \mathcal{L}_{coarse} + \mathcal{L}_{fine}\)
### Loss & Training
The total loss combines the metric depth supervision loss, the pseudo-depth supervision loss \(\mathcal{L}_{tp\text{-}si}\), the scale-oriented contrastive loss \(\mathcal{L}_{soc}\), and an edge-aware smoothness loss \(\mathcal{L}_{es}\) that regularizes the scale/shift maps, with weights \(\lambda_1=1, \lambda_2=0.5, \lambda_3=0.1, \lambda_4=0.01\) respectively.
## Implementation Details
| Configuration | Setting |
|---|---|
| GPU | NVIDIA RTX 4090 |
| Optimizer | AdamW |
| Batch size | 8 |
| Learning rate | \(1 \times 10^{-5}\), decayed by 0.9 per epoch |
| Training epochs | 20 |
| Learnable scale embeddings | 256 |
| Relative depth model | Depth Anything-Small (frozen) |
| Image encoder | DINOv2 ViT-L (frozen) |
| Text encoder | CLIP ViT-L/14 (frozen) |
| Trainable parameters | 19M |
| Training data size | 102K images |
| Training datasets | NYUv2 + KITTI + VOID + C3VD |
| Text generation | LLaVA v1.6 Vicuna & Mistral |
## Key Experimental Results
### NYUv2 Indoor Benchmark (Table 1)
| Method | Type | Trainable Params | \(\delta_1\uparrow\) | AbsRel\(\downarrow\) | RMSE\(\downarrow\) |
|---|---|---|---|---|---|
| DA V2 | Direct metric | 25M | 0.969 | 0.073 | 0.261 |
| UniDepth | Zero-shot metric | 347M | 0.981 | 0.072 | 0.229 |
| Metric3Dv2 | Zero-shot metric | 1011M | 0.980 | 0.067 | 0.260 |
| RSA | Language + scale factor | 4.7M | 0.752 | 0.156 | 0.528 |
| ScaleDepth | Language + domain adaptation | 109M | 0.913 | 0.099 | 0.329 |
| WorDepth | Language + domain adaptation | 137M | 0.926 | 0.090 | 0.330 |
| TR2M | Language + scale map | 19M | 0.954 | 0.082 | 0.293 |
### Zero-Shot Cross-Domain Generalization (Table 3, 5 Unseen Datasets)
| Method | Backbone | Training Images | SUN δ₁ | iBims δ₁ | HyperSim δ₁ | DIODE δ₁ | SimCol δ₁ | Avg Rank |
|---|---|---|---|---|---|---|---|---|
| ZoeDepth | BeiT384-L | - | 0.545 | 0.656 | 0.302 | 0.237 | 0.438 | 4.70 |
| DA Single | ViT-L | - | 0.660 | 0.714 | 0.361 | 0.288 | 0.553 | 2.80 |
| UniDepth | ViT-L | 3M | 0.443 | 0.217 | 0.545 | 0.635 | - | 4.00 |
| RSA | ViT-S | 102K | 0.527 | 0.450 | 0.230 | 0.244 | 0.162 | 5.80 |
| TR2M | ViT-S | 102K | 0.591 | 0.736 | 0.361 | 0.274 | 0.445 | 2.40 |
TR2M achieves the best average rank (2.40) with a ViT-Small backbone and only 102K training images, substantially outperforming methods that use larger models and more data.
### Ablation Study
| Rescale Maps | \(\mathcal{L}_{tp\text{-}si}\) | \(\mathcal{L}_{soc}\) | NYUv2 AbsRel | iBims δ₁ | DIODE δ₁ |
|---|---|---|---|---|---|
| ✕ | ✕ | ✕ | 0.118 | 0.566 | 0.225 |
| ✓ | ✕ | ✕ | 0.084 | 0.657 | 0.237 |
| ✓ | ✓ | ✕ | 0.085 | 0.708 | 0.243 |
| ✓ | ✓ | ✓ | 0.082 | 0.736 | 0.274 |
- Single-factor → pixel-wise maps: AbsRel drops from 0.118 to 0.084, the largest single gain.
- Adding pseudo depth supervision: significant improvement in zero-shot settings (iBims δ₁ +0.051).
- Adding scale-oriented contrastive learning: further zero-shot gains (DIODE δ₁ +0.031).
## Highlights & Insights
- Pixel-level rescale maps replacing global single-factor scaling represent the core innovation, enabling correction of locally erroneous regions in relative depth — something prior methods (e.g., RSA) cannot achieve.
- Exceptional parameter efficiency: only 19M trainable parameters (all encoders are frozen), 1–2 orders of magnitude smaller than UniDepth (347M) and Metric3Dv2 (1011M), yet achieving comparable performance.
- Dual-level contrastive learning is elegantly designed: coarse-grained contrast ensures global scale consistency, fine-grained contrast ensures local depth distribution consistency, and the two are complementary.
- The pseudo depth + threshold filtering strategy is a concise and effective solution to sparse GT supervision, improving zero-shot generalization.
- Textual descriptions serve as "free" auxiliary priors, requiring no camera parameters or sensor information.
## Limitations & Future Work
- Dependence on LLaVA-generated descriptions: Text quality is bounded by VLM capability, and descriptions for extreme domains (e.g., medical endoscopy) may be insufficiently accurate.
- Upper bound of the relative depth model: Framework performance is constrained by the output quality of the frozen Depth Anything-Small backbone.
- Remaining gap in some zero-shot domains: AbsRel remains high on SUN RGB-D and DIODE Outdoor (0.451/0.673), indicating that large cross-domain scale discrepancies remain challenging.
- Computational cost of contrastive learning: Pixel-level contrast requires an EMA dual-branch structure, increasing training time and memory consumption.
- A stronger relative depth backbone (e.g., Depth Anything V2) could further raise the performance ceiling.
## Related Work & Insights
| Dimension | RSA | ScaleDepth/WorDepth | UniDepth/Metric3Dv2 | TR2M |
|---|---|---|---|---|
| Input | Text | Text + image | Image (+ camera params) | Text + image + relative depth |
| Scaling approach | Global dual-factor | Implicit decoding | End-to-end metric head | Pixel-wise maps |
| Trainable parameters | 4.7M | 109–137M | 347–1011M | 19M |
| Training data | 102K | Single domain | 3–16M | 102K |
| Cross-domain ability | Poor | Domain-restricted | Moderate | Strong |
| Core limitation | Single factor too coarse | Limited by domain adaptation | Requires large data and models | Dependent on relative depth quality |
The relative-to-metric depth paradigm parallels the "pretrain + lightweight adaptation" paradigm in NLP; the strategy of freezing large models and fine-tuning with few parameters is broadly transferable. The pixel-level rescale map concept can be extended to other scale-recovery tasks (e.g., monocular normal estimation, optical flow scale recovery). The dual-level contrastive learning design (global + local) can be adapted as feature regularization in other dense prediction tasks. Incorporating more powerful MLLMs (e.g., GPT-4V) into text-assisted depth estimation could further enhance scene understanding and scale reasoning.
## Rating
- Novelty: ⭐⭐⭐⭐ (novel combination of pixel-level rescale maps and dual-level contrastive learning)
- Experimental Thoroughness: ⭐⭐⭐⭐ (4 training sets + 5 zero-shot test sets + comprehensive ablations)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, well-motivated)
- Value: ⭐⭐⭐⭐ (efficient relative-to-metric depth conversion with strong practical applicability)