High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

Conference: ECCV 2024
arXiv: 2408.00361
Code: https://github.com/wencheng256/RPrDepth
Area: 3D Vision
Keywords: Self-Supervised Depth Estimation, Monocular Depth Estimation, Knowledge Distillation, Prior Fusion, Attention-guided Feature Selection

TL;DR

This paper proposes RPrDepth, which utilizes features and predictions of "rich-resource" models (such as multi-frame and high-resolution models) as priors during training. Through a prior depth fusion module and a rich-resource guided loss, the model achieves or even exceeds the depth estimation accuracy of multi-frame high-resolution models while performing inference using only a low-resolution single image.

Background & Motivation

Background: Self-supervised monocular depth estimation removes the need for LiDAR ground truth by training with a view-synthesis reconstruction loss. Most top-performing methods, however, rely on "rich-resource" inputs such as multi-frame sequences, high-resolution images, or even future frames.

Limitations of Prior Work: Rich-resource inputs are often unavailable in practical applications. For example, multi-frame data cannot be obtained when a vehicle is stationary, and future frames are nonexistent in real-time systems. This severely limits the practicality of high-performance methods.

Key Challenge: A single low-resolution image lacks key information that rich-resource inputs encode, such as inter-frame disparity. This information gap caps the accuracy a single-image model can reach.

Goal: Achieve the depth estimation accuracy of rich-resource models during the inference phase using only a single low-resolution image.

Key Insight: Although rich-resource data is unavailable during inference, it is available during training. By utilizing the features and predictions of rich-resource models as offline prior information, the information gap of low-resource inputs can be bridged through feature retrieval and fusion.

Core Idea: Construct an offline database of rich-resource reference features. During inference, retrieve similar prior features for each pixel and fuse them with the single-image features to reach depth accuracy comparable to rich-resource models.

Method

Overall Architecture

The training phase contains two branches:
- Upper branch (Rich-resource branch): Uses a pretrained ManyDepth-HR (multi-frame + high-resolution) whose parameters are frozen. It provides reference features \(f_r\) and reference depth \(D_r\).
- Lower branch (Single-image branch): The target model is based on DIFFNet. It receives a single low-resolution image and improves its prediction through prior fusion.

During inference, only the single-image branch and the compressed prior data (compressed from 2.6 million pixels to 25k pixels) are retained.

Key Designs

Prior Depth Fusion Module: Contains two fusion schemes:

Pixel-wise Fusion: Retrieves the most similar pixels from the reference features via an affinity matrix. Specifically, target features \(F_s\) and reference features \(F_r\) are aligned in dimension, and the affinity is calculated as:

$\mathcal{A} = \text{Softmax}(F_s \otimes F_r)$

The affinity matrix is then used to construct pixel-wise reference features and reference depths:

$F_c = \mathcal{A} \times f_r, \quad D_c = \mathcal{A} \times D_r$

Here, \(F_c\) provides spatial geometric priors, and \(D_c\) provides direct depth priors.
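
The two equations above can be read as a soft nearest-neighbour lookup. The following NumPy sketch is an illustration under stated assumptions, not the authors' implementation: it treats \(\otimes\) as a matrix product between flattened per-pixel features, applies the softmax over the reference axis, and uses made-up shapes and random data. The function name `pixelwise_fusion` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixelwise_fusion(F_s, F_r, f_r, D_r):
    """Soft retrieval of reference features and depths (assumed shapes).

    F_s: (N_s, C)   target features, one row per target pixel
    F_r: (N_r, C)   reference features aligned to the same channel count
    f_r: (N_r, C_r) reference features to be retrieved
    D_r: (N_r,)     reference depth, one value per reference pixel
    """
    A = softmax(F_s @ F_r.T, axis=1)  # (N_s, N_r); each row sums to 1
    F_c = A @ f_r                     # (N_s, C_r) retrieved reference features
    D_c = A @ D_r                     # (N_s,) retrieved reference depth
    return A, F_c, D_c

# Toy data: 6 target pixels, 10 reference pixels (sizes are arbitrary).
rng = np.random.default_rng(0)
F_s = rng.normal(size=(6, 8))
F_r = rng.normal(size=(10, 8))
f_r = rng.normal(size=(10, 16))
D_r = rng.uniform(1.0, 80.0, size=10)

A, F_c, D_c = pixelwise_fusion(F_s, F_r, f_r, D_r)
```

Because each row of \(\mathcal{A}\) sums to one in this sketch, every entry of \(D_c\) is a convex combination of reference depths and therefore stays within the range of \(D_r\).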

Depth-hint Fusion: Uses a Transformer multi-head attention mechanism with target features \(F_s\) as Query, and reference features \(f_r\) as Key and Value:

$F_d = \text{MHA}(Q, K, V)$

This global attention fusion captures macro-level prior information, complementing the pixel-wise fusion.
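
As a rough illustration of the attention step, the sketch below implements plain multi-head scaled dot-product attention in NumPy with the target features as queries and the reference features as keys and values. It is a simplified sketch, not the paper's module: the random matrices `W_q`, `W_k`, `W_v`, `W_o` stand in for learned projections, the reference features are assumed to already share the channel count of \(F_s\), and biases, dropout, and normalization layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q_in, kv_in, W_q, W_k, W_v, W_o, num_heads):
    """q_in: (N_q, C) queries; kv_in: (N_r, C) keys/values; W_*: (C, C)."""
    n_q, C = q_in.shape
    d = C // num_heads  # channels per head

    def split_heads(x):
        # (N, C) -> (num_heads, N, d)
        return x.reshape(x.shape[0], num_heads, d).transpose(1, 0, 2)

    Q = split_heads(q_in @ W_q)
    K = split_heads(kv_in @ W_k)
    V = split_heads(kv_in @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)  # (heads, N_q, N_r)
    attn = softmax(scores, axis=-1)                 # rows sum to 1 per head
    heads = attn @ V                                # (heads, N_q, d)
    merged = heads.transpose(1, 0, 2).reshape(n_q, C)
    return merged @ W_o, attn

# Toy data: 6 target pixels, 10 reference pixels, 8 channels, 2 heads.
rng = np.random.default_rng(0)
C, num_heads = 8, 2
F_s = rng.normal(size=(6, C))    # queries
f_r = rng.normal(size=(10, C))   # keys and values
W_q, W_k, W_v, W_o = (rng.normal(size=(C, C)) * 0.1 for _ in range(4))

F_d, attn = multi_head_attention(F_s, f_r, W_q, W_k, W_v, W_o, num_heads)
```

Unlike the pixel-wise affinity, each head here learns its own projection of queries and keys, which is one way to read the claim that this path captures more global, macro-level cues.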

The results of both fusion methods are concatenated with the single-image features and compressed to the original channel count via convolution, yielding the feature \(F_o\) containing rich information.

Rich-resource Guided Loss: Composed of two parts:

View-reconstruction Loss: Uses the rich-resource input (high-resolution multi-frame) as the reconstruction target, providing a more precise supervision signal than the low-resolution source image:

$\mathcal{L}_{\text{vp}} = l_{vp}(\text{Resize}(D_o), I_r)$

Consistency Loss: Uses the depth prediction of the rich-resource model as pseudo-labels. Since self-supervised models predict relative disparity rather than absolute depth, scale differences exist across models. Therefore, the gradient difference is minimized instead of the direct value difference:

$\mathcal{L}_c = \|\tilde{G}_{x,y}(D_o) - \tilde{G}_{x,y}(D_p)\|_1$

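where \(D_o\) is the depth predicted by the single-image branch, \(D_p\) is the pseudo-label depth predicted by the rich-resource model, and \(\tilde{G}_{x,y}(\cdot)\) is the normalized sum of gradients in the x and y directions. This pushes the model to produce depth discontinuities at edges consistent with those of the rich-resource model.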
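
To see why comparing normalized gradients sidesteps the scale mismatch, consider the NumPy sketch below. The exact normalization is not specified here, so dividing by the mean depth of each map is an assumption made for illustration, as is averaging the L1 difference over pixels instead of summing it. Function names are hypothetical.

```python
import numpy as np

def normalized_gradient(D, eps=1e-7):
    """Sum of absolute x and y finite differences, divided by the mean depth.

    Dividing by the mean makes the result invariant to a global scale factor.
    This normalization is an assumption of the sketch.
    """
    gx = np.abs(D[:, 1:] - D[:, :-1])[:-1, :]  # horizontal differences, (H-1, W-1)
    gy = np.abs(D[1:, :] - D[:-1, :])[:, :-1]  # vertical differences,   (H-1, W-1)
    return (gx + gy) / (D.mean() + eps)

def consistency_loss(D_o, D_p):
    # Mean absolute difference between the two normalized gradient maps.
    return np.abs(normalized_gradient(D_o) - normalized_gradient(D_p)).mean()

rng = np.random.default_rng(0)
D_p = rng.uniform(1.0, 80.0, size=(8, 12))      # stand-in for the teacher prediction
D_o = 3.0 * D_p                                 # same structure, different global scale
D_other = rng.uniform(1.0, 80.0, size=(8, 12))  # unrelated structure

loss_scaled = consistency_loss(D_o, D_p)        # close to zero despite the 3x scale
loss_other = consistency_loss(D_other, D_p)     # clearly positive
value_l1 = np.abs(D_o - D_p).mean()             # a direct L1 penalizes the scale itself
```

In this toy case the direct value difference is large even though the two maps describe the same geometry, while the gradient-based term only reacts when the structure actually differs.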

Auxiliary Loss: Guides the learning of the affinity matrix: \(\mathcal{L}_{\text{aux}} = \|D_p - \text{Resize}(D_c)\|_1\)

Total loss: \(\mathcal{L} = \alpha \mathcal{L}_{\text{vp}} + \beta \mathcal{L}_c + \mathcal{L}_{\text{aux}}\), where \(\alpha\) and \(\beta\) are weighting coefficients.

Attention Guided Feature Selection: During inference, retrieving reference features from the complete reference set poses a large computational overhead. The solution is:
- Calculate the average attention weights of all reference pixels on the validation set: \(W_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N}(\mathcal{A}_i + \mathcal{A}_{\text{MHA},i})\)
- Select the subset of pixels with the highest weights as the compressed prior data.
- Compress from 2.6 million pixels to 25k pixels (1%), replacing the prior set, and fine-tune for several epochs.
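- Key finding: Accuracy does not degrade after compression. In the ablation, RMSE improves from 4.284 to 4.240 while Abs Rel (0.098) and \(\delta<1.25\) (0.898) are unchanged, which is attributed to the selection filtering out interference from irrelevant pixels.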
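
The selection steps above can be sketched as a top-k filter over averaged attention weights. This is a minimal illustration with assumptions that go beyond the text: each attention map is reduced to one weight per reference pixel by summing over query pixels, the multi-head attention map is assumed to be already averaged over heads, and the data are synthetic. The function name `select_top_reference_pixels` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_top_reference_pixels(attn_pixelwise, attn_mha, keep_ratio=0.01):
    """attn_*: lists of (N_s, N_r) attention maps, one pair per validation image."""
    n_images = len(attn_pixelwise)
    # Weight each reference pixel receives, summed over query pixels (assumption),
    # then averaged over images as in the W_avg formula.
    W_avg = sum(a.sum(axis=0) + m.sum(axis=0)
                for a, m in zip(attn_pixelwise, attn_mha)) / n_images
    k = max(1, int(round(keep_ratio * W_avg.shape[0])))
    top = np.argpartition(W_avg, -k)[-k:]  # indices of the k largest weights
    return np.sort(top), W_avg

# Synthetic attention maps in which the first 10 reference pixels are favoured.
rng = np.random.default_rng(0)
n_images, N_s, N_r = 4, 50, 1000

def random_attention():
    logits = rng.normal(size=(N_s, N_r))
    logits[:, :10] += 5.0  # make 10 reference pixels consistently useful
    return softmax(logits, axis=1)

attn_pixelwise = [random_attention() for _ in range(n_images)]
attn_mha = [random_attention() for _ in range(n_images)]

idx, W_avg = select_top_reference_pixels(attn_pixelwise, attn_mha, keep_ratio=0.01)
```

The returned indices would then be used to slice the stored reference features and depths, giving the compressed prior set that replaces the full one before fine-tuning.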

Loss & Training

- Uses DIFFNet (based on HR-Net) as the single-image baseline model.
- Uses ManyDepth-HR as the rich-resource guidance model (multi-frame + high-resolution + future frames).
- Reference dataset: Randomly selects 2,000 triplets from the training set, extracting features offline (1% pixel sampling × 2,000 images).
- Trains end-to-end while keeping the parameters of the rich-resource branch fixed.
- Replaces the reference set and fine-tunes after completing feature selection.

Key Experimental Results

Main Results (KITTI Eigen Split, 640×192 Resolution, Mono Training)

| Method | Input Frames | Abs Rel↓ | Sq Rel↓ | RMSE↓ | \(\delta<1.25\)↑ |
|---|---|---|---|---|---|
| Monodepth2 | 1 | 0.115 | 0.903 | 4.863 | 0.877 |
| DIFFNet (Baseline) | 1 | 0.102 | 0.764 | 4.483 | 0.896 |
| ManyDepth (Multi-frame) | 2 | 0.098 | 0.770 | 4.459 | 0.900 |
| RPrDepth (Single frame) | 1 | 0.097 | 0.658 | 4.279 | 0.900 |

High-resolution (1024×320) results:

| Method | Input Frames | Abs Rel↓ | RMSE↓ | \(\delta<1.25\)↑ |
|---|---|---|---|---|
| ManyDepth-HR (Guidance Model) | 2 | 0.093 | 4.245 | 0.909 |
| RPrDepth (Single frame) | 1 | 0.091 | 4.098 | 0.910 |

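At 1024×320, single-frame RPrDepth outperforms its own multi-frame, high-resolution teacher ManyDepth-HR on all three reported metrics (Abs Rel 0.091 vs 0.093, RMSE 4.098 vs 4.245, \(\delta<1.25\) 0.910 vs 0.909).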
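
For readers unfamiliar with the column headers, the sketch below computes the four metrics using their standard definitions in the depth-estimation literature. The toy depth values are invented for illustration and are unrelated to the reported numbers. Any evaluation-protocol details, such as depth caps or scale alignment, are not covered.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics over valid pixels (pred and gt positive, same shape)."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)     # Abs Rel: mean relative error
    sq_rel = np.mean((pred - gt) ** 2 / gt)       # Sq Rel: mean squared relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # RMSE, in the units of the depth
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                # fraction of pixels within 25%
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "delta1": delta1}

# Toy example with three pixels (values are made up).
gt = np.array([10.0, 20.0, 40.0])
pred = np.array([11.0, 18.0, 40.0])
metrics = depth_metrics(pred, gt)
```

Lower is better for the three error metrics, and higher is better for \(\delta<1.25\), which matches the arrows in the tables.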

Ablation Study

| Component | Abs Rel↓ | RMSE↓ | \(\delta<1.25\)↑ |
|---|---|---|---|
| Baseline (DIFFNet) | 0.102 | 4.483 | 0.896 |
| + PDF (Prior Fusion) | 0.098 | 4.284 | 0.898 |
| + AGFS (Feature Selection) | 0.098 | 4.240 | 0.898 |
| + RGL (Guided Loss) | 0.100 | 4.321 | 0.897 |
| + Full (All) | 0.097 | 4.279 | 0.900 |

Key Findings

The prior depth fusion module is the primary source of performance improvement (Abs Rel: 0.102 → 0.098).
Compressing the reference set by 99% via feature selection actually continues to reduce RMSE (4.284 → 4.240), as streamlining the features eliminates noise interference.
RPrDepth exhibits the best generalization performance, achieving the top result in the Make3D cross-domain test (Abs Rel 0.288 vs BRNet 0.302).
The gradient consistency loss effectively resolves the scale inconsistency problem between different models.

Highlights & Insights

Training-Inference Decoupling: Build priors using rich-resource data during training, while requiring only a single low-resolution image during inference, offering high practicality.
Student Outperforms Teacher: The single-frame model surpasses its multi-frame, high-resolution teacher through prior fusion. One likely reason is that multi-frame methods tend to degrade in regions with moving objects, which are common in autonomous driving, whereas a single-frame model with priors can correct these errors.
Philosophy of Feature Selection: Less is more—selecting the top 1% most representative reference pixels is more effective than retaining all data.
Gradient-level rather than Value-level Consistency: Cleverly bypasses the scale ambiguity problem in self-supervised depth estimation.

Limitations & Future Work

The reference feature set requires offline preprocessing and storage, which still introduces extra deployment overhead despite only being 25k pixels.
Prior quality is constrained by the capability of the rich-resource model; if the teacher model fails in certain scenarios, the prior information will also be unreliable.
Currently, only ManyDepth has been verified as the rich-resource model. Stronger teachers, such as the Transformer-based DPT, could be explored.
The diversity of the reference dataset must cover the target scene distribution; cross-domain deployment requires updating the reference set.
The approach can be extended to knowledge transfer for other dense prediction tasks (e.g., optical flow, semantic segmentation).

Related Work & Insights

ManyDepth: A representative method for multi-frame self-supervised depth estimation and the main teacher model for RPrDepth, providing cost volume features.
DIFFNet: An efficient single-frame baseline based on HR-Net, serving as the student backbone of RPrDepth.
Monodepth2: Established foundational designs for self-supervised monocular depth estimation, such as the min-reprojection loss.
Knowledge Distillation Perspective: Can be viewed as a more granular teacher-student learning scheme—not only distilling predictions but also distilling spatial priors at the feature level.

Rating

Novelty: ⭐⭐⭐⭐. The problem formulation of "rich resources during training, low resources during inference" has clear practical value, and the prior retrieval + fusion scheme is more flexible than direct distillation.
Experimental Thoroughness: ⭐⭐⭐⭐. Experiments cover three datasets (KITTI, Make3D, and Cityscapes) and multiple training modes, with complete ablation studies.
Writing Quality: ⭐⭐⭐⭐. The motivation is clearly articulated, and the pipeline figures and tables are well-designed.
Value: ⭐⭐⭐⭐⭐. Directly reduces the sensor requirements for depth estimation in autonomous driving: inference needs only a single low-resolution image plus the compact 25k-pixel prior set.