TC-Stereo: Temporally Consistent Stereo Matching¶
Conference: ECCV 2024
arXiv: 2407.11950
Code: GitHub
Area: 3D Vision
Keywords: Stereo Matching, Temporal Consistency, Disparity Completion, Dual-Space Optimization, Video Stereo
TL;DR¶
Proposed TC-Stereo, which achieves temporally consistent stereo matching through temporal disparity completion for good initialization, temporal state fusion to maintain hidden state coherence, and dual-space (disparity + disparity gradient) iterative refinement to improve ill-posed regions.
Background & Motivation¶
Background¶
Existing stereo matching methods run inference on each frame independently, which leads to temporally inconsistent disparities. The inconsistency stems from two main sources: (1) each frame searches for disparity globally from scratch, resulting in large update steps and high variability; (2) inherent ambiguities in ill-posed regions (e.g., occlusions, reflections) lead to unstable outputs when appearance changes from frame to frame. This work addresses both issues by localizing the search range and improving predictions in ill-posed regions.
Method¶
Overall Architecture¶
TC-Stereo processes stereo video sequences online in three stages: (1) Temporal Disparity Completion (TDC) generates an initial dense disparity from the semi-dense disparity projected from the preceding frame; (2) Temporal State Fusion fuses the current completed state with the historical refined state; (3) Dual-Space Iterative Refinement alternately optimizes in the disparity space and the disparity gradient space.
Key Designs¶
Source of Semi-Dense Disparity: For the first frame, the semi-dense disparity is obtained from the cost volume via winner-take-all selection filtered by a confidence threshold; for subsequent frames, the preceding disparity map is projected into the current view using camera poses, which naturally leaves holes in non-overlapping regions.
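The two sources of semi-dense disparity can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(D, H, W)` similarity layout, the use of the softmax peak as confidence, the nearest-pixel splatting, and the 4x4 pose convention are all assumptions made for illustration.

```python
import numpy as np

def wta_semi_dense(similarity, conf_thresh=0.5):
    """Winner-take-all disparity with a confidence filter (hypothetical helper).

    similarity: (D, H, W) matching scores, higher = better (assumed layout).
    Returns a disparity map with 0 at rejected pixels, plus the validity mask.
    """
    shifted = similarity - similarity.max(axis=0, keepdims=True)
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)        # softmax over disparity
    disp = prob.argmax(axis=0).astype(np.float32)  # winner-take-all
    mask = prob.max(axis=0) > conf_thresh          # assumed confidence: softmax peak
    return np.where(mask, disp, 0.0).astype(np.float32), mask

def project_disparity(disp, mask, K, baseline, T_prev_to_cur):
    """Reproject a disparity map into the current view (hypothetical helper).

    K: 3x3 intrinsics, baseline: stereo baseline, T_prev_to_cur: 4x4 rigid
    transform from the previous to the current left camera (assumed convention).
    """
    H, W = disp.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(mask & (disp > 0))
    z = fx * baseline / disp[v, u]                 # disparity -> depth
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)])
    pts_cur = T_prev_to_cur @ pts                  # move points to the current camera
    front = pts_cur[2] > 1e-6                      # keep points in front of the camera
    x, y, zc = pts_cur[0, front], pts_cur[1, front], pts_cur[2, front]
    u2 = np.round(fx * x / zc + cx).astype(int)
    v2 = np.round(fy * y / zc + cy).astype(int)
    inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    u2, v2, zc = u2[inside], v2[inside], zc[inside]
    order = np.argsort(-zc)                        # far first, so nearer points overwrite
    out = np.zeros((H, W), dtype=np.float32)
    out[v2[order], u2[order]] = fx * baseline / zc[order]  # depth -> disparity
    return out, out > 0                            # holes stay 0 / False
```

Pixels that receive no projected point keep a zero disparity and a `False` mask entry, which corresponds to the holes in non-overlapping regions that the completion network has to fill.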
Temporal Disparity Completion: A lightweight encoder-decoder network that takes the encoder's context features, a semi-dense disparity, and a sparsity mask as inputs, and outputs a dense disparity and state features.
Temporal State Fusion: Uses a GRU-like module to fuse the current state \(c^t\) and the historical hidden state \(h_{N-1}^{t-1}\), addressing two challenges: (1) the historical state only encodes the information of the preceding viewpoint; (2) there are no historical states in non-overlapping regions.
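A GRU-like fusion of this kind might look roughly like the sketch below. It is only illustrative: the per-pixel linear gates (standing in for convolutions), the parameter names `Wz`, `Wr`, `Wq`, and the choice to fall back to the current state \(c^t\) wherever no historical state exists are assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_states(c_t, h_prev, valid, params):
    """GRU-style fusion of the current state with the historical hidden state.

    c_t, h_prev: (C, H, W) feature maps; valid: (H, W) bool, True where a
    historical state exists; params: dict with 'Wz', 'Wr', 'Wq', each (C, 2C),
    acting as 1x1 convolutions (all hypothetical).
    """
    # Assumption: in non-overlapping regions, substitute the current state.
    h_prev = np.where(valid[None], h_prev, c_t)
    x = np.concatenate([h_prev, c_t], axis=0)                    # (2C, H, W)
    z = sigmoid(np.einsum("oc,chw->ohw", params["Wz"], x))       # update gate
    r = sigmoid(np.einsum("oc,chw->ohw", params["Wr"], x))       # reset gate
    xq = np.concatenate([r * h_prev, c_t], axis=0)
    q = np.tanh(np.einsum("oc,chw->ohw", params["Wq"], xq))      # candidate state
    return (1.0 - z) * h_prev + z * q                            # fused hidden state
```

The update gate lets the network decide per pixel how much of the historical state to keep, which is one way to account for the viewpoint change between frames.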
Dual-Space Refinement: In the disparity space, a multi-level GRU is used to update the disparity (similar to RAFT-Stereo); in the gradient space, the disparity is converted to a gradient field and refined—leveraging the prior that real-world depths are typically locally smooth. Through gradient-guided disparity propagation, neighboring disparities are propagated to the current pixel based on the local plane assumption: \(\hat{d}_n = d + (u_n-u)\frac{\partial d}{\partial u} + (v_n-v)\frac{\partial d}{\partial v}\), followed by a softmax weighted sum to obtain the final disparity.
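The propagation formula can be made concrete with a short NumPy sketch. Each source pixel at \((u, v)\) extrapolates its local plane to a target location \((u_n, v_n)\), and every target pixel then takes a softmax-weighted sum of the candidates it receives from its window. The window radius, the `logits` input (standing in for network-predicted weights), and the masking of out-of-image neighbors are assumptions for illustration.

```python
import numpy as np

def propagate(disp, grad_u, grad_v, logits, radius=1):
    """Gradient-guided disparity propagation under a local plane assumption.

    disp, grad_u, grad_v: (H, W); logits: ((2*radius+1)**2, H, W) unnormalized
    neighbor weights (hypothetical input).
    """
    H, W = disp.shape
    d_p = np.pad(disp, radius, mode="edge")
    gu_p = np.pad(grad_u, radius, mode="edge")
    gv_p = np.pad(grad_v, radius, mode="edge")
    in_img = np.pad(np.ones((H, W), dtype=bool), radius, constant_values=False)
    cands, valid = [], []
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            sl = (slice(radius + dv, radius + dv + H),
                  slice(radius + du, radius + du + W))
            # The source sits at offset (du, dv) from the target, so the
            # target-minus-source offset is (-du, -dv) in the plane equation.
            cands.append(d_p[sl] - du * gu_p[sl] - dv * gv_p[sl])
            valid.append(in_img[sl])
    cands, valid = np.stack(cands), np.stack(valid)
    logits = np.where(valid, logits, -np.inf)      # ignore out-of-image sources
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)              # softmax over the window
    return (w * cands).sum(axis=0)
```

On an exactly planar disparity map with correct gradients, every candidate equals the true disparity, so the output is unchanged regardless of the weights; this is the sense in which the scheme exploits local smoothness.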
Loss & Training¶
- \(\mathcal{L}_{cv}\): Cost volume contrastive loss, maximizing similarity at the ground truth disparity and enforcing a margin
- \(\mathcal{L}_{disp}\): L1 disparity loss over three stages (completion, refinement, and propagation), using exponentially decaying weights
- \(\mathcal{L}_{grad}\): L1 loss of the disparity gradient after gradient-space refinement and propagation
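As a rough illustration of the disparity and gradient terms, the sketch below computes an L1 loss over a sequence of predictions with exponentially decaying weights, in the style popularized by RAFT. The decay factor `gamma=0.9`, the ordering convention (later predictions weighted more), and the forward-difference gradient are assumptions; the contrastive cost volume term is omitted because its exact form is not given here.

```python
import numpy as np

def sequence_l1(preds, gt, valid, gamma=0.9):
    """Weighted L1 loss over a list of predictions, last one weighted most.

    preds: list of (H, W) arrays; gt: (H, W); valid: (H, W) bool mask.
    gamma is an assumed decay factor, not a value from the paper.
    """
    n = len(preds)
    total = 0.0
    for i, p in enumerate(preds):
        weight = gamma ** (n - 1 - i)              # exponentially decaying weight
        total += weight * np.abs(p - gt)[valid].mean()
    return total

def disparity_gradient(d):
    """Forward-difference gradient field (d/du, d/dv); last column/row left at 0."""
    gu = np.zeros_like(d)
    gv = np.zeros_like(d)
    gu[:, :-1] = d[:, 1:] - d[:, :-1]
    gv[:-1, :] = d[1:, :] - d[:-1, :]
    return gu, gv
```

Under these assumptions, \(\mathcal{L}_{disp}\) would apply `sequence_l1` to the disparities and \(\mathcal{L}_{grad}\) would apply it to each component of the predicted gradients, with targets from `disparity_gradient` on the ground truth.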
Key Experimental Results¶
TartanAir Ablation Study¶
Main Results¶
| Setting | Iters | Temporal | State Fusion | TDC | Dual Space | ALL >1px↓ | OCC >1px↓ | ALL \|Δd\|↓ | OCC \|Δd\|↓ |
|---|---|---|---|---|---|---|---|---|---|
| RAFT-Stereo (5iter) | 5 | ✗ | ✗ | ✗ | ✗ | 8.04% | 34.91% | 0.35 | 1.15 |
| RAFT-Stereo (32iter) | 32 | ✗ | ✗ | ✗ | ✗ | 6.02% | 25.89% | 0.28 | 0.93 |
| +Temporal+Fusion | 5 | ✓ | ✓ | ✗ | ✗ | 6.98% | 29.89% | 0.25 | 0.86 |
| +Temporal+Fusion+TDC | 5 | ✓ | ✓ | ✓ | ✗ | 6.10% | 25.97% | 0.23 | 0.78 |
| +Temporal+Fusion+Dual Space | 5 | ✓ | ✓ | ✗ | ✓ | 6.28% | 25.89% | 0.21 | 0.74 |
| TC-Stereo (Full) | 5 | ✓ | ✓ | ✓ | ✓ | 5.98% | 24.67% | 0.21 | 0.71 |
KITTI 2015 Leaderboard¶
TC-Stereo ranks second while achieving better efficiency compared to other SOTA methods.
Key Findings¶
- Achieves better accuracy and consistency with only 5 iterations than RAFT-Stereo with 32 iterations
- Temporal jitter |Δd| in occluded regions drops from 1.15 (RAFT-Stereo with 5 iterations) to 0.71, a 38% reduction
- TDC and dual-space refinement are complementary in improving occluded regions: TDC provides a good initialization, while dual-space refinement enhances details in ill-posed regions
- Directly using the state of the preceding frame for initialization (similar to XR-Stereo) degrades performance, demonstrating the necessity of state fusion
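The relative reductions quoted above follow directly from the OCC |Δd| column of the ablation table. A small check, using only values from that table (the helper name is hypothetical):

```python
# OCC |Δd| values copied from the ablation table.
occ_jitter = {
    "RAFT-Stereo (5 iter)": 1.15,
    "RAFT-Stereo (32 iter)": 0.93,
    "TC-Stereo (Full)": 0.71,
}

def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

full = occ_jitter["TC-Stereo (Full)"]
vs_5_iter = relative_reduction(occ_jitter["RAFT-Stereo (5 iter)"], full)
vs_32_iter = relative_reduction(occ_jitter["RAFT-Stereo (32 iter)"], full)
print(f"vs 5 iter: {vs_5_iter:.1f}%, vs 32 iter: {vs_32_iter:.1f}%")
```

The 38% figure is measured against the 5-iteration baseline; against the 32-iteration baseline the reduction is smaller, at roughly 24%.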
Highlights & Insights¶
- Dual-space refinement serves as the core innovation: It constrains smoothness in the gradient space and iteratively propagates it globally, which is particularly effective for reflection and occlusion areas
- Temporal disparity completion restricts the search range from global to local, leading to smaller and more stable update steps
- Two hierarchical temporal consistency evaluation metrics are designed (absolute difference |Δd| and error divergence ReLU(Δe)), providing a more comprehensive evaluation
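One plausible reading of the two temporal metrics is sketched below. The exact definitions are not given in this summary, so the following is an assumption: |Δd| is taken as the absolute change between consecutive predictions at corresponding pixels, and ReLU(Δe) as the positive part of the change in absolute error, so that only frames where the error grows are penalized. The inputs are assumed to be already aligned to the same pixels (for example, with the previous frame projected into the current view).

```python
import numpy as np

def temporal_metrics(pred_prev, pred_cur, gt_prev, gt_cur, valid):
    """Mean |Δd| and mean ReLU(Δe) over valid pixels (assumed definitions).

    All maps are (H, W) and assumed to be aligned pixel-to-pixel.
    """
    delta_d = np.abs(pred_cur - pred_prev)              # prediction change
    e_prev = np.abs(pred_prev - gt_prev)                # error at frame t-1
    e_cur = np.abs(pred_cur - gt_cur)                   # error at frame t
    relu_delta_e = np.maximum(e_cur - e_prev, 0.0)      # only error increases count
    return delta_d[valid].mean(), relu_delta_e[valid].mean()
```

Under this reading, |Δd| measures raw flicker, while ReLU(Δe) isolates the part of the change that moves the prediction away from the ground truth.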
Limitations & Future Work¶
- Requires camera poses as input
- Robustness to extremely fast motions or large occlusion changes needs further improvement
- Dual-space refinement introduces a certain amount of computational overhead
Related Work & Insights¶
Incorporating video information into stereo matching has become increasingly popular (e.g., DynamicStereo, TemporalStereo). The proposed disparity gradient space refinement strategy could be extended to other tasks such as monocular depth estimation.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Practicality: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐