TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast¶
Conference: CVPR2026
arXiv: 2506.13387
Code: GitHub
Institution: The Chinese University of Hong Kong
Area: 3D Vision
Keywords: Monocular Depth Estimation, Relative to Metric Depth, Language Descriptions, Cross-modal Attention, Contrastive Learning, Pixel-level Scaling
TL;DR¶
TR2M predicts pixel-level scale and shift maps from an image and its text description, converting highly generalizable but scale-less relative depth into metric depth. It achieves cross-domain zero-shot metric depth estimation with only 19M trainable parameters and 102K training images.
Background & Motivation¶
Monocular Depth Estimation (MDE) is divided into two primary paradigms:
- Metric Depth Estimation (MMDE): Outputs depth in real-world units (meters) but is typically restricted to specific domains and generalizes poorly across domains; mitigating this typically requires camera intrinsics or massive training data, at high cost.
- Relative Depth Estimation (MRDE): Trained via affine-invariant loss, showing strong cross-domain generalization, but the output lacks absolute scale, limiting downstream applications (robotic navigation, 3D reconstruction).
Existing relative-to-metric conversion methods face two core issues:
- Limitations of Single-factor Scaling: Prior methods (e.g., RSA) only estimate a global scale and shift factor, which cannot correct local errors in relative depth and may even amplify them.
- Semantic Description Ambiguity: Objects of the same category can appear at very different depths and scales across scenes, yet their text descriptions remain similar, which leads to inaccurate scale estimation.
Core Problem¶
How can scale uncertainty in relative depth be efficiently removed to convert it into metric depth while preserving the cross-domain generalization of MRDE? The key challenge lies in achieving pixel-level fine-grained scaling rather than global scaling and establishing scale consistency at the feature level.
Method¶
Overall Architecture¶
TR2M aims to resolve the dilemma where "relative depth generalizes well but lacks scale, while metric depth has scale but generalizes poorly." The objective is to retain the generalization of relative depth while adding real-world scale. Instead of directly regressing depth, it predicts a pair of pixel-level rescale maps. Given an RGB image \(I \in \mathbb{R}^{H \times W \times 3}\) and a text description \(L\) (automatically generated by LLaVA), the network outputs a scale map \(A \in \mathbb{R}^{H \times W}\) and a shift map \(B \in \mathbb{R}^{H \times W}\). These are used to transform the relative depth \(D_r\) provided by a frozen Depth Anything-Small into metric depth pixel-by-pixel:
\[\hat{D}_m = \frac{1}{A \odot D_r + B}\]
where \(\odot\) denotes element-wise multiplication. In this pipeline, the relative depth model, image encoder, and text encoder are all frozen; only the intermediate fusion and decoding modules (totaling 19M parameters) are trained.
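The rescaling rule \(\hat{D}_m = 1 / (A \odot D_r + B)\) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `rescale_to_metric`, the `eps` clamp, and the toy values are assumptions introduced here.

```python
import numpy as np

def rescale_to_metric(rel_depth, scale_map, shift_map, eps=1e-6):
    """Apply D_m = 1 / (A * D_r + B) element-wise.

    eps is an assumption added here to avoid division by zero; the source
    does not say how near-zero denominators are handled.
    """
    rel_depth = np.asarray(rel_depth, dtype=np.float64)
    denom = scale_map * rel_depth + shift_map
    return 1.0 / np.maximum(denom, eps)

# Toy 2x2 example with hand-picked maps (illustrative values only).
d_r = np.array([[0.5, 1.0], [2.0, 4.0]])
A = np.full((2, 2), 0.5)   # per-pixel scale map
B = np.zeros((2, 2))       # per-pixel shift map
d_m = rescale_to_metric(d_r, A, B)
```

Because `A` and `B` have the same shape as `D_r`, every pixel gets its own affine correction before the inversion; a global method corresponds to the special case where both maps are constant.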
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
I["RGB Image I"] --> ENC["DINOv2 ViT-L Image Encoder (Frozen)<br/>→ Image Feature F_I"]
L["Text Description L (LLaVA Generated)"] --> TENC["CLIP ViT-L/14 Text Encoder (Frozen)<br/>→ Text Feature F_L"]
I --> DA["Depth Anything-Small (Frozen)<br/>→ Relative Depth D_r"]
ENC --> FUSE["Cross-modal Attention<br/>Image Self-Attn + Image×Text Cross-Attn"]
TENC --> FUSE
FUSE --> SHEAD["Pixel-level scale/shift maps<br/>ScaleHead→A, ShiftHead→B"]
SHEAD --> COMBINE["Pixel-wise Rescaling<br/>D̂_m = 1 / (A⊙D_r + B)"]
DA --> COMBINE
COMBINE --> OUT["Metric Depth D̂_m"]
COMBINE -.Training Supervision.-> SUP1["Pseudo Metric Depth + Threshold Filtering<br/>Least Squares Alignment + δ₁ Gating"]
COMBINE -.Training Supervision.-> SUP2["Dual-Level Scale-Oriented Contrast<br/>Image-level + Pixel-level Scale Alignment"]
Key Designs¶
1. Pixel-level scale/shift maps: Replacing global factors with pixel-wise correction
Prior methods (e.g., RSA) estimate only a single global scale and shift. If a local region in the relative depth is incorrect, it cannot be fixed and is instead amplified by global scaling. TR2M directly predicts maps \(A\) and \(B\) at the same resolution as the image, providing a separate scale and shift for every pixel. This allows local error regions to be corrected individually. This step is the primary source of performance gain: in the ablation study, switching from a single global factor to pixel-level rescale maps reduced NYUv2 AbsRel from 0.118 to 0.084.
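A toy example can make the argument concrete. The sketch below builds a relative-depth strip that is an exact affine image of the target except at one pixel, then compares a single global scale/shift with a per-pixel scale. All values are made up for illustration, and the per-pixel scale is computed from the target itself, so it only shows what a pixel-level map is able to represent, not what a network would actually predict.

```python
import numpy as np

# Target values for a 1x6 strip (illustrative numbers only).
target = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2])
# Relative depth: an exact affine image of the target (x2, +1) ...
rel = 2.0 * target + 1.0
# ... except one pixel where the relative model is locally wrong.
rel[3] += 0.5

# Global fit: one scale and one shift shared by all pixels (least squares).
X = np.stack([rel, np.ones_like(rel)], axis=1)
(s, t), *_ = np.linalg.lstsq(X, target, rcond=None)
global_err = np.abs(s * rel + t - target)

# Per-pixel fit: each pixel gets its own scale (shift fixed to 0 here).
A = target / rel
pixel_err = np.abs(A * rel - target)
```

In this toy case the global fit leaves a residual at the corrupted pixel and also spreads error onto the pixels that were originally consistent, while the per-pixel map absorbs the local error completely.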
2. Cross-modal Attention: Utilizing text as a global scale prior injected into every pixel
Image features \(F_I \in \mathbb{R}^{HW \times D}\) are extracted by a frozen DINOv2 ViT-L, and text features \(F_L \in \mathbb{R}^{1 \times D}\) by a frozen CLIP ViT-L/14. During fusion, image features serve as the Query both for self-attention over the image tokens, giving \(\text{Attn}_{cm}^I\), and for cross-attention over the text tokens, giving \(\text{Attn}_{cm}^L\).
The fused feature \(F_f = F_I + \text{Attn}_{cm}^I + \text{Attn}_{cm}^L\) is processed by two lightweight DPT-style decoding heads to derive \(A = \text{ScaleHead}(F_f)\) and \(B = \text{ShiftHead}(F_f)\). This setup maintains pixel-level spatial resolution through image-based Queries while injecting global scale information (e.g., "indoor tabletop" vs. "outdoor street view") into each pixel location.
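The fusion step can be sketched as single-head scaled dot-product attention with a residual sum. This is a simplified illustration under stated assumptions: the projection matrices in `w`, the single head, and the absence of normalization or MLP layers are choices made here, since the source does not give the layer layout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def cross_modal_fuse(f_img, f_txt, w):
    """f_img: (HW, D) image tokens, f_txt: (T, D) text tokens.

    w is a dict of hypothetical (D, D) projection matrices; the real
    layer layout (heads, norms, MLPs) is not given in the source.
    """
    q = f_img @ w["q"]  # queries always come from the image
    attn_img = attention(q, f_img @ w["k_img"], f_img @ w["v_img"])
    attn_txt = attention(q, f_txt @ w["k_txt"], f_txt @ w["v_txt"])
    return f_img + attn_img + attn_txt  # residual sum, as in the text

rng = np.random.default_rng(0)
HW, D = 6, 8
w = {name: rng.normal(size=(D, D)) / np.sqrt(D)
     for name in ["q", "k_img", "v_img", "k_txt", "v_txt"]}
f_img = rng.normal(size=(HW, D))
f_txt = rng.normal(size=(1, D))  # a single pooled text token
f_fused = cross_modal_fuse(f_img, f_txt, w)
```

With a single text token, the softmax over keys is trivially 1, so in this sketch every pixel receives the same projected text vector. That matches the description of text acting as a global prior added at each pixel location, while the image queries keep the output at `HW` tokens.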
3. Pseudo Metric Depth + Threshold Filtering: Supervising pixels beyond sparse Ground Truth
Metric depth GT is often sparse. TR2M first aligns relative depth to GT with a least-squares fit of a global scale \(\tilde{\alpha}\) and shift \(\tilde{\beta}\), yielding the pseudo metric depth \(D_m^{pseudo} = \tilde{\alpha} D_r + \tilde{\beta}\).
To account for potential unreliability, a quality gate based on threshold accuracy \(\delta_1\) (the proportion of pixels where \(\max(D_m^{gt}/D_m^{pseudo}, D_m^{pseudo}/D_m^{gt}) < 1.25\)) is applied. The pseudo-depth supervision is included only if \(\delta_1 > \rho\), where \(\rho\) is a reliability threshold.
This provides credible supervision for regions outside sparse annotations, improving zero-shot iBims \(\delta_1\) by 0.051 (0.657 to 0.708 in the ablation).
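The alignment and gating steps can be sketched as follows. This is a minimal sketch, assuming that missing GT is stored as 0 and using a placeholder threshold `rho=0.9`; the source gives neither the actual value of \(\rho\) nor the validity encoding, and the function names are introduced here for illustration.

```python
import numpy as np

def align_least_squares(rel, gt, valid):
    """Closed-form fit of gt ~ alpha * rel + beta over valid pixels."""
    x, y = rel[valid], gt[valid]
    X = np.stack([x, np.ones_like(x)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha, beta

def delta1(pred, gt, valid, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh."""
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < thresh).mean())

def pseudo_depth_if_reliable(rel, gt, rho=0.9):
    """Return (pseudo_depth, use_it). rho=0.9 is a placeholder value."""
    valid = gt > 0  # assumes missing GT is encoded as 0
    alpha, beta = align_least_squares(rel, gt, valid)
    pseudo = alpha * rel + beta  # dense: defined at every pixel
    return pseudo, delta1(pseudo, gt, valid) > rho

# Toy example: gt = 2 * rel + 0.5, with one pixel missing from the GT.
rel = np.array([[1.0, 2.0], [3.0, 4.0]])
gt = np.array([[2.5, 0.0], [6.5, 8.5]])
pseudo, use_it = pseudo_depth_if_reliable(rel, gt)
```

The fit uses only pixels with GT, but `pseudo` is defined everywhere, which is what lets it supervise pixels the sparse annotation does not cover. The gate simply drops samples where the aligned relative depth disagrees with the available GT too often.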
4. Dual-Level Scale-Oriented Contrast: Aligning scale distributions at image and pixel levels
Beyond regression supervision, the feature space is organized by "scale." At the coarse-grained level, image-level embeddings \(\tilde{F}_f\) are obtained via average pooling and sorted by the pseudo scale factor \(\tilde{\alpha}\). With \(i\) and \(j\) denoting positions in this sorted order, pairs with similar scales (\(|i - j| < t\)) are treated as positives and pairs with large differences (\(|i - j| \geq t\)) as negatives within an InfoNCE-style loss. At the fine-grained level, GT depth is discretized into \(|\mathcal{C}|\) depth classes.
Using an EMA dual-branch structure (similar to MoCo), Query pixels find positive samples in Key feature maps among pixels of the same depth category. The total loss \(\mathcal{L}_{soc} = \mathcal{L}_{coarse} + \mathcal{L}_{fine}\) ensures global scale consistency and local depth distribution consistency, further improving zero-shot performance (DIODE \(\delta_1\) +0.031).
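The two levels can be sketched roughly as below. This is an interpretation, not the authors' formulation: it assumes that \(i\) and \(j\) are ranks after sorting by \(\tilde{\alpha}\), averages the log-probability over positives (in the style of supervised contrastive losses), uses a placeholder temperature `tau`, and bins depth uniformly. None of these details are stated in the source.

```python
import numpy as np

def rank_positive_mask(alphas, t):
    """Positives: pairs whose ranks (after sorting by alpha) differ by < t."""
    ranks = np.argsort(np.argsort(alphas))
    diff = np.abs(ranks[:, None] - ranks[None, :])
    mask = diff < t
    np.fill_diagonal(mask, False)  # a sample is not its own positive
    return mask

def scale_contrast_loss(emb, alphas, t=2, tau=0.1):
    """InfoNCE-style loss over image-level embeddings of shape (N, D).

    tau and the averaging over positives are assumptions. Requires t >= 2
    and N >= 2 so that every anchor has at least one positive.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    logits = emb @ emb.T / tau
    np.fill_diagonal(logits, -np.inf)  # exclude self-similarity
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = rank_positive_mask(alphas, t)
    # Mean log-probability of the positives for each anchor.
    per_anchor = -(np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1))
    return float(per_anchor.mean())

def depth_to_class(depth, num_classes, d_min, d_max):
    """Uniform binning of depth into class ids 0..num_classes-1 (assumed)."""
    edges = np.linspace(d_min, d_max, num_classes + 1)[1:-1]
    return np.digitize(depth, edges)

rng = np.random.default_rng(0)
alphas = np.array([0.5, 3.0, 0.6, 2.9])  # pseudo scale factors of 4 images
emb = rng.normal(size=(4, 8))            # stand-in pooled embeddings
pos_mask = rank_positive_mask(alphas, t=2)
loss = scale_contrast_loss(emb, alphas, t=2)
classes = depth_to_class(np.array([0.1, 5.0, 9.9]), 10, 0.0, 10.0)
```

At the pixel level, the class ids from `depth_to_class` would play the role that the rank window plays at the image level: query pixels treat key pixels with the same class id as positives.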
Loss & Training¶
The total loss is a weighted sum of four components with weights \(\lambda_1=1, \lambda_2=0.5, \lambda_3=0.1, \lambda_4=0.01\). One of these components, \(\mathcal{L}_{es}\), is an edge-aware smoothness loss that constrains \(A\) and \(B\). Key training configurations:
| Configuration | Setting |
|---|---|
| GPU | NVIDIA RTX 4090 |
| Optimizer | AdamW |
| Batch size | 8 |
| Learning rate | \(1 \times 10^{-5}\), decaying by 0.9 per epoch |
| Training epochs | 20 |
| Trainable Scale Embeddings | 256 |
| Relative Depth Model | Depth Anything-Small (Frozen) |
| Image Encoder | DINOv2 ViT-L (Frozen) |
| Text Encoder | CLIP ViT-L/14 (Frozen) |
| Trainable Parameters | 19M |
| Training Data Size | 102K images |
| Training Datasets | NYUv2 + KITTI + VOID + C3VD |
| Text Generation | LLaVA v1.6 Vicuna & Mistral |
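The learning-rate row can be read as a per-epoch exponential decay. The sketch below assumes that "decaying by 0.9 per epoch" means multiplying the rate by 0.9 after each epoch and that epochs are counted from 0; the source does not state either detail.

```python
def learning_rate(epoch, base_lr=1e-5, gamma=0.9):
    """Learning rate at a 0-indexed epoch under per-epoch exponential decay."""
    return base_lr * gamma ** epoch

# Rates for the 20 training epochs listed in the table.
schedule = [learning_rate(e) for e in range(20)]
```

Under these assumptions the rate in the final epoch is about \(1.35 \times 10^{-6}\), roughly 13.5% of the initial value.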
Key Experimental Results¶
Main Results on NYUv2 Indoor Benchmark (Table 1)¶
| Method | Type | Trainable Params | \(\delta_1\uparrow\) | AbsRel\(\downarrow\) | RMSE\(\downarrow\) |
|---|---|---|---|---|---|
| DA V2 | Direct Metric | 25M | 0.969 | 0.073 | 0.261 |
| UniDepth | Zero-shot Metric | 347M | 0.981 | 0.072 | 0.229 |
| Metric3Dv2 | Zero-shot Metric | 1011M | 0.980 | 0.067 | 0.260 |
| RSA | Lang + Scale Factor | 4.7M | 0.752 | 0.156 | 0.528 |
| ScaleDepth | Lang + Domain Adapt | 109M | 0.913 | 0.099 | 0.329 |
| WorDepth | Lang + Domain Adapt | 137M | 0.926 | 0.090 | 0.330 |
| TR2M | Lang + Rescale Map | 19M | 0.954 | 0.082 | 0.293 |
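For reference, the three metrics in the table follow the standard definitions used in depth estimation, which can be computed as in this short sketch. The convention that invalid GT pixels are stored as 0 is an assumption made here, and the toy arrays are illustrative only.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (delta1, abs_rel, rmse) over pixels with valid ground truth."""
    valid = gt > 0  # assumes invalid GT pixels are stored as 0
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    delta1 = float((ratio < 1.25).mean())           # higher is better
    abs_rel = float((np.abs(p - g) / g).mean())     # lower is better
    rmse = float(np.sqrt(((p - g) ** 2).mean()))    # lower is better, in meters
    return delta1, abs_rel, rmse

gt = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no GT
pred = np.array([1.1, 2.0, 6.0, 3.0])
d1, abs_rel, rmse = depth_metrics(pred, gt)
```

In this toy case two of the three valid pixels fall within the 1.25 ratio, so \(\delta_1 = 2/3\), and AbsRel is 0.2.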
Zero-shot Cross-domain Generalization (Table 3, 5 unseen datasets)¶
| Method | Backbone | Training Images | SUN δ₁ | iBims δ₁ | HyperSim δ₁ | DIODE δ₁ | SimCol δ₁ | Avg Rank |
|---|---|---|---|---|---|---|---|---|
| ZoeDepth | BeiT384-L | - | 0.545 | 0.656 | 0.302 | 0.237 | 0.438 | 4.70 |
| DA Single | ViT-L | - | 0.660 | 0.714 | 0.361 | 0.288 | 0.553 | 2.80 |
| UniDepth | ViT-L | 3M | 0.443 | 0.217 | 0.545 | 0.635 | - | 4.00 |
| RSA | ViT-S | 102K | 0.527 | 0.450 | 0.230 | 0.244 | 0.162 | 5.80 |
| TR2M | ViT-S | 102K | 0.591 | 0.736 | 0.361 | 0.274 | 0.445 | 2.40 |
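TR2M achieves the best average rank (2.40) using a ViT-Small backbone and 102K training images, ahead of DA Single (2.80) and UniDepth (4.00), both of which use a ViT-L backbone. It does not lead on every dataset: among the \(\delta_1\) columns shown, TR2M is best on iBims (0.736), while DA Single scores higher on SUN (0.660), DIODE (0.288), and SimCol (0.553).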
Ablation Study¶
| Rescale Maps | \(\mathcal{L}_{tp\text{-}si}\) | \(\mathcal{L}_{soc}\) | NYUv2 AbsRel | iBims δ₁ | DIODE δ₁ |
|---|---|---|---|---|---|
| ✕ | ✕ | ✕ | 0.118 | 0.566 | 0.225 |
| ✓ | ✕ | ✕ | 0.084 | 0.657 | 0.237 |
| ✓ | ✓ | ✕ | 0.085 | 0.708 | 0.243 |
| ✓ | ✓ | ✓ | 0.082 | 0.736 | 0.274 |
- From single factor \(\rightarrow\) pixel-level maps: AbsRel decreased from 0.118 to 0.084, the largest gain.
- Adding pseudo-depth supervision: Significant zero-shot improvement (iBims \(\delta_1\) +0.051).
- Adding scale contrastive learning: Further zero-shot gains (DIODE \(\delta_1\) +0.031).
Highlights & Insights¶
- Novelty: Replacing global scaling factors with pixel-level rescale maps is the core innovation, allowing for local error correction in relative depth which was impossible in prior work like RSA.
- Value: High parameter efficiency. With the encoders frozen, only 19M parameters are trained, roughly 18× fewer than UniDepth (347M) and 53× fewer than Metric3Dv2 (1011M), while NYUv2 \(\delta_1\) stays within about 0.03 of both (0.954 vs. 0.981 and 0.980).
- Mechanism: The dual-level contrastive design splits responsibilities: the coarse-grained (image-level) term enforces global scale consistency, and the fine-grained (pixel-level) term enforces local depth-distribution consistency.
- Supervision: Pseudo metric depth with threshold filtering addresses sparse GT in a simple way and improves zero-shot generalization.
- Language descriptions serve as a "free" auxiliary prior, requiring no camera parameters or sensor data.
Limitations & Future Work¶
- Dependency on LLaVA: Text quality is limited by the VLM's capability; descriptions for extreme scenarios (e.g., medical endoscopy) may be inaccurate.
- Ceiling of Relative Depth: Framework performance is capped by the quality of the frozen relative depth model (Depth Anything-Small).
- Zero-shot Gap: AbsRel remains high in some domains (e.g., 0.451/0.673 on SUN RGB-D and DIODE Outdoor), indicating that massive cross-domain scale gaps are still challenging.
- Computational Overhead: Pixel-level contrastive learning requires an EMA dual-branch structure, increasing training time and memory usage.
- Future work could explore stronger relative depth backbones (e.g., Depth Anything V2) to further push the performance ceiling.
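The EMA dual-branch structure mentioned in the limitations keeps a second, slowly updated copy of the trainable weights, which is where the extra memory comes from. A minimal sketch of a MoCo-style momentum update is shown below; the parameter dictionaries and the momentum values are placeholders, since the source does not give TR2M's settings.

```python
import numpy as np

def ema_update(key_params, query_params, momentum=0.999):
    """In-place update: key <- m * key + (1 - m) * query.

    momentum=0.999 is the MoCo default, used here only as a placeholder.
    """
    for name, q in query_params.items():
        key_params[name] *= momentum
        key_params[name] += (1.0 - momentum) * q

# Toy parameters: the key branch starts at 0, the query branch at 1.
query = {"w": np.ones((2, 2))}
key = {"w": np.zeros((2, 2))}
ema_update(key, query, momentum=0.9)
```

Only the query branch receives gradients; the key branch is updated by this rule after each step, so the cost is one extra forward pass and one extra copy of the trainable weights rather than a second backward pass.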
Related Work & Insights¶
| Dimension | RSA | ScaleDepth/WorDepth | UniDepth/Metric3Dv2 | TR2M |
|---|---|---|---|---|
| Input | Text | Text + Image | Image (+ Camera) | Text + Image + Rel Depth |
| Scaling Method | Global dual-factor | Implicit decoding | End-to-end metric head | Pixel-level maps |
| Trainable Params | 4.7M | 109-137M | 347-1011M | 19M |
| Training Data | 102K | Single domain | 3-16M | 102K |
| Cross-domain | Poor | Single domain | Good | Strong |
| Key Limitation | Global factor lacks detail | Limited by domain adaptation | Requires Big Data/Model | Rel Depth quality dependent |
Insights¶
- The relative-to-metric paradigm is similar to "pre-training + lightweight adaptation" in NLP; using frozen large models with parameter-efficient fine-tuning is a highly effective strategy.
- The pixel-level rescale map concept can be extended to other tasks requiring scale recovery (e.g., monocular normal estimation, optical flow scale recovery).
- Dual-level contrastive learning (global + local) can be transferred to other dense prediction tasks for feature regularization.
- Text-assisted depth estimation combined with stronger MLLMs (e.g., GPT-4V) may further enhance scene understanding and scale reasoning capabilities.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Combining pixel-level rescale maps with dual-level contrastive learning is novel)
- Experimental Thoroughness: ⭐⭐⭐⭐ (4 training sets + 5 zero-shot test sets + extensive ablation)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, well-articulated motivation)
- Value: ⭐⭐⭐⭐ (An efficient relative-to-metric depth conversion solution with high practicality)