Consistency-aware Self-Training for Iterative-based Stereo Matching¶
Conference: CVPR 2025
arXiv: 2503.23747
Code: None
Area: 3D Vision / Stereo Matching
Keywords: Stereo Matching, Self-Training, Pseudo-label Filtering, Consistency-aware, Iterative Optimization
TL;DR¶
This paper proposes the first consistency-aware self-training framework (CST-Stereo) for iterative-based stereo matching. It evaluates pseudo-label reliability through multi-resolution prediction consistency filtering and iterative prediction consistency filtering, and combines them with a soft-weighted loss to leverage unlabeled real-world data effectively, thereby improving model performance and generalization.
Background & Motivation¶
Background¶
Iterative methods (e.g., RAFT-Stereo, IGEV-Stereo) have become the mainstream in stereo matching, but they rely heavily on annotated data.
Limitations of Prior Work¶
Obtaining high-quality stereo annotations is extremely expensive. Existing annotated datasets are mostly synthetic, leading to poor generalization in real-world scenes.
Key Challenge¶
Existing self-training methods are only suitable for cost-volume-based methods and cannot be directly applied to iterative methods that lack a complete cost volume.
Proposed Solution¶
Existing pseudo-label filtering strategies use hard-threshold binary selection, which discards valuable hard samples and cannot distinguish reliability differences among the pseudo-labels that pass the threshold. CST-Stereo instead assigns each pixel a soft reliability weight derived from prediction consistency.
Additional Notes¶
Core observation: regions with larger errors exhibit more pronounced oscillation in the model's predictions.
Method¶
Overall Architecture¶
CST-Stereo adopts a teacher-student self-training framework: the teacher model generates pseudo-labels on unlabeled data; the Consistency-aware Soft Filtering (CSF) module evaluates the reliability of the pseudo-labels; the student model learns with a soft-weighted loss under strongly augmented inputs; and the teacher is updated via EMA.
Key Designs¶
- Multi-Resolution Prediction Consistency Filter (MRPCF):
  - Function: Evaluates pseudo-label reliability along the spatial dimension.
  - Mechanism: Feeds the input images to the teacher model at three scales: upsampled, original, and downsampled. It then computes the pixel-level variance \(\sigma_{i,j}\) of the three predictions and converts it to a reliability weight \(w_{rc} = 1/(1 + e^{-\varepsilon_1(\sigma - \tau_1)})\) via a sigmoid mapping.
  - Design Motivation: Pixels whose predictions disagree across resolutions tend to be unreliable (e.g., edges, occluded regions), whereas pixels with consistent predictions are more reliable.
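The MRPCF mechanism can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the nested-list representation of disparity maps, the use of population variance, and the default `eps` and `tau` values are all assumptions. It also assumes the three predictions have already been resized to the original resolution and their disparities rescaled accordingly. Note that with the sigmoid as written above, the weight decreases as the variance grows only when \(\varepsilon_1\) is negative, so the sketch uses a negative `eps`; the sign convention is an assumption.

```python
import math


def sigmoid_weight(x, eps, tau):
    """w = 1 / (1 + exp(-eps * (x - tau))), the mapping quoted in the text."""
    z = -eps * (x - tau)
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(z))


def mrpcf_weights(pred_up, pred_orig, pred_down, eps=-5.0, tau=1.0):
    """Per-pixel reliability weight from the variance of three predictions.

    Each argument is a disparity map given as a list of rows, already
    aligned to the original resolution (assumption). eps and tau are
    hypothetical values, not taken from the paper.
    """
    weights = []
    for row_u, row_o, row_d in zip(pred_up, pred_orig, pred_down):
        out_row = []
        for u, o, d in zip(row_u, row_o, row_d):
            mean = (u + o + d) / 3.0
            # Population variance over the three resolutions (assumption).
            var = ((u - mean) ** 2 + (o - mean) ** 2 + (d - mean) ** 2) / 3.0
            out_row.append(sigmoid_weight(var, eps, tau))
        weights.append(out_row)
    return weights
```

A pixel predicted identically at all three scales has zero variance and receives a weight close to 1, while a pixel whose predictions spread widely is pushed toward 0. A practical version would operate on tensors rather than nested lists.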
- Iterative Prediction Consistency Filter (IPCF):
  - Function: Evaluates pseudo-label reliability along the temporal (iteration) dimension.
  - Mechanism: Averages the absolute prediction change between adjacent iterations over the second half of the \(n\) iterations, i.e., \(\Delta_{i,j} = \frac{1}{\lceil n/2 \rceil} \sum_{k} |P^{k+1}_{i,j} - P^k_{i,j}|\), and maps it to a weight \(w_{ic}\) with a sigmoid mapping, as in MRPCF.
  - Design Motivation: A pixel that still oscillates late in the iteration signals conflicting multi-source information: the model has not converged to a stable value, so its prediction is unreliable.
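The late-iteration oscillation measure \(\Delta_{i,j}\) can be sketched as below. The summary does not state the exact index range of the sum, so this sketch assumes it covers the last \(\lceil n/2 \rceil\) adjacent pairs of predictions; the function name and the nested-list layout are likewise illustrative assumptions.

```python
import math


def ipcf_delta(preds):
    """Mean absolute change between adjacent iterations in the late stage.

    preds: list of n disparity maps (each a list of rows), one per
    iteration. Assumption: the sum runs over the last ceil(n/2)
    adjacent pairs (P^k, P^{k+1}).
    """
    n = len(preds)
    if n < 2:
        raise ValueError("need at least two iterations")
    m = math.ceil(n / 2)
    # All adjacent pairs, then keep only the final m of them.
    pairs = list(zip(preds[:-1], preds[1:]))[-m:]
    height, width = len(preds[0]), len(preds[0][0])
    delta = [[0.0] * width for _ in range(height)]
    for prev, nxt in pairs:
        for i in range(height):
            for j in range(width):
                delta[i][j] += abs(nxt[i][j] - prev[i][j]) / m
    return delta
```

A pixel that has settled contributes zero, whereas one that keeps flipping between values accumulates a large \(\Delta\). The result would then be passed through a sigmoid to obtain \(w_{ic}\).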
- Consistency-aware Soft-Weighted Loss:
  - Function: Guides student model training by combining the weights of both filters.
  - Mechanism: \(w_{soft} = w_{rc} \odot w_{ic}\), and the loss is \(L_{st} = w_{soft} \odot |\hat{P} - P^O|\). The multiplication ensures that only pixels deemed reliable by both filters receive high weights.
  - Design Motivation: Avoids discarding valuable hard samples due to hard thresholding, while mitigating the impact of noisy pseudo-labels.
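A minimal sketch of the soft-weighted loss, assuming the two weight maps are already computed. The function and argument names are hypothetical, and reducing the per-pixel terms by a plain mean is an assumption, since the summary gives only the per-pixel form.

```python
def soft_weighted_l1(student_pred, pseudo_label, w_rc, w_ic):
    """Mean of w_rc * w_ic * |student - pseudo| over all pixels.

    All four arguments are maps of identical shape, given as lists of
    rows. The mean reduction is an assumption of this sketch.
    """
    total, count = 0.0, 0
    for p_row, t_row, rc_row, ic_row in zip(student_pred, pseudo_label, w_rc, w_ic):
        for p, t, rc, ic in zip(p_row, t_row, rc_row, ic_row):
            w_soft = rc * ic  # high only if both filters trust the pixel
            total += w_soft * abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("empty prediction map")
    return total / count
```

Because the weights multiply, a pixel that either filter distrusts contributes little to the loss, yet it is never removed outright as it would be under a binary mask.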
Loss & Training¶
- Pre-train on annotated datasets to obtain initial weights.
- Self-training phase: The student receives strongly augmented images while the teacher receives non-augmented ones.
- Update the teacher via EMA: \(\theta_T \leftarrow \lambda \theta_T + (1-\lambda) \theta_S\)
- Applicable to various iterative baselines: RAFT-Stereo, CREStereo, IGEV-Stereo, Selective-Stereo, etc.
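The EMA teacher update can be illustrated with a small sketch. Parameters are represented here as a dictionary mapping hypothetical names to lists of floats, and the momentum value `lam=0.99` is an assumed placeholder; the summary does not state the value of \(\lambda\).

```python
def ema_update(teacher, student, lam=0.99):
    """Return new teacher parameters: lam * theta_T + (1 - lam) * theta_S.

    teacher and student map parameter names to equally sized lists of
    floats. lam is a hypothetical momentum value.
    """
    return {
        name: [lam * t + (1.0 - lam) * s for t, s in zip(t_vals, student[name])]
        for name, t_vals in teacher.items()
    }
```

With \(\lambda\) close to 1 the teacher changes slowly, which keeps the pseudo-labels stable while still absorbing the student's improvements over time.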
Key Experimental Results¶
Main Results¶
| Method | KITTI2015 D1-all ↓ | Middlebury bad1.0 ↓ | ETH3D bad1.0 ↓ |
|---|---|---|---|
| RAFT-Stereo | 1.96 | 9.37 | 2.44 |
| IGEV-Stereo | 1.59 | 9.41 | 1.12 |
| Selective-IGEV | 1.55 | 6.53 | 1.23 |
| CST-Stereo | 1.50 | 6.23 | 1.02 |
Ablation Study¶
| Configuration | Key Effect | Description |
|---|---|---|
| Baseline without filtering | Performance drop | Direct usage of pseudo-labels leads to noise accumulation |
| MRPCF only | Gain | Captures error regions sensitive to spatial resolution |
| IPCF only | Gain | Captures error regions related to iterative oscillation |
| MRPCF + IPCF (Hard Threshold) | Suboptimal | Discards hard samples |
| MRPCF + IPCF (Soft Weights) | Optimal | Blends reliability evaluation and hard sample utilization |
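The difference between the hard-threshold and soft-weight rows can be made concrete with a small sketch. It assumes the hard filter keeps pixels whose inconsistency score falls below a threshold and that the soft filter uses the sigmoid form given for MRPCF with a negative scale; `hard_mask`, `soft_weight`, and the default `eps` and `tau` values are hypothetical.

```python
import math


def hard_mask(inconsistency, tau=1.0):
    """Binary selection: keep (1.0) below the threshold, drop (0.0) otherwise."""
    return 1.0 if inconsistency < tau else 0.0


def soft_weight(inconsistency, eps=-2.0, tau=1.0):
    """Sigmoid weight 1 / (1 + exp(-eps * (x - tau))); eps < 0 is assumed."""
    z = -eps * (inconsistency - tau)
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(z))


if __name__ == "__main__":
    for score in (0.2, 0.9, 1.1, 3.0):
        print(score, hard_mask(score), round(soft_weight(score), 3))
```

Scores just below and just above the threshold get opposite treatment from the hard mask but nearly equal soft weights, and scores that both pass the hard threshold still receive different soft weights. These are the two shortcomings of binary filtering that the soft design addresses.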
Key Findings¶
- The method is highly versatile and can be applied in a plug-and-play manner to boost multiple iterative baselines (e.g., reducing error by 35% for RAFT-Stereo and 21% for CREStereo).
- It is effective in all three scenarios: in-domain, domain adaptation, and domain generalization.
- It achieves SOTA performance among published methods on multiple public leaderboards, including KITTI2015, Middlebury, and ETH3D.
- Multi-resolution consistency and iterative consistency are complementary: MRPCF is effective in edge regions, whereas IPCF excels in blurry areas.
Highlights & Insights¶
- The observation of "oscillation characteristics in error regions" is highly intuitive and thoroughly verified through visualization.
- The soft-threshold design is significantly more elegant than hard binary filtering, preserving the training value of hard samples.
- Direct use of cost volumes is avoided, making this method the first to successfully extend self-training to the entire family of iterative stereo matching methods.
- The two consistency filters evaluate reliability from spatial and temporal dimensions, which are orthogonal to each other.
Limitations & Future Work¶
- Multi-resolution prediction requires three forward passes, increasing the training computational overhead.
- The soft threshold parameters \(\tau_1, \tau_2\) and scaling factors \(\varepsilon_1, \varepsilon_2\) in the filters require manual tuning.
- More complex weight fusion strategies (such as learned fusion weights) have not yet been explored.
- Generalizing the observation to other dense prediction tasks, such as optical flow estimation, could be explored.
Related Work & Insights¶
- Continues the line of self-training stereo matching work such as StereoBase and PCT-Stereo, but targets iterative-based methods specifically for the first time.
- The concept of soft filtering can be borrowed for other pseudo-label learning scenarios, such as semi-supervised semantic segmentation.
- The consistency observation may hold deeper significance for understanding the uncertainty of iterative models.
Rating¶
- Novelty: ⭐⭐⭐⭐ The observation of oscillation characteristics is novel, and the design of the two complementary filters is well-founded.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation across multiple baselines, scenarios, and datasets, with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with a complete logical flow from observation to design.
- Value: ⭐⭐⭐⭐ Provides plug-and-play performance gains for iterative-based stereo matching, offering high practicality.