TC-Stereo: Temporally Consistent Stereo Matching¶

Conference: ECCV 2024
arXiv: 2407.11950
Code: GitHub
Area: 3D Vision
Keywords: Stereo Matching, Temporal Consistency, Disparity Completion, Dual-Space Optimization, Video Stereo

TL;DR¶

TC-Stereo achieves temporally consistent stereo matching through three components: temporal disparity completion, which provides a good initialization; temporal state fusion, which keeps hidden states coherent across frames; and dual-space (disparity + disparity gradient) iterative refinement, which improves ill-posed regions.

Background & Motivation¶

Background¶

Existing stereo matching methods run inference independently on each frame, which leads to temporal inconsistency. The inconsistency has two main sources: (1) each frame searches for disparity globally from scratch, resulting in large update steps and high variability; (2) inherent ambiguities in ill-posed regions (e.g., occlusions, reflections) make the output unstable under frame-to-frame appearance changes. This work addresses both by localizing the search range and improving ill-posed regions.

Proposed Solution¶

Goal: Exploit temporal information to localize the disparity search range and to stabilize ill-posed regions, using the three components detailed under Method.

Method¶

Overall Architecture¶

TC-Stereo processes stereo video sequences online in three stages:

Temporal Disparity Completion (TDC): generates an initial dense disparity from the semi-dense disparity projected from preceding frames
Temporal State Fusion: fuses the current completed state with the historical refined state
Dual-Space Iterative Refinement: alternately optimizes in the disparity space and the disparity gradient space

Key Designs¶

Source of Semi-Dense Disparity: For the first frame, the semi-dense disparity is obtained from the cost volume via winner-take-all, keeping only pixels that pass a confidence threshold. For subsequent frames, the preceding disparity map is projected into the current view using camera poses, which naturally leaves holes in non-overlapping regions.
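The pose-based projection can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: it assumes a rectified pinhole camera with focal length `f`, principal point `(cx, cy)`, baseline `b`, and a relative pose `(R, t)` that maps previous-camera coordinates to current-camera coordinates. All function and parameter names are hypothetical. Pixels that receive no projected point stay `None`, which is where the holes come from.

```python
def project_disparity(prev_disp, f, cx, cy, b, R, t):
    """Forward-warp a disparity map from the previous view into the current view.

    prev_disp: 2D list of floats (None marks invalid pixels), indexed [v][u].
    R, t: 3x3 rotation (nested list) and 3-vector translation taking
          previous-camera coordinates to current-camera coordinates.
    Returns a 2D list of the same size; None marks holes.
    """
    h, w = len(prev_disp), len(prev_disp[0])
    out = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            d = prev_disp[v][u]
            if d is None or d <= 0:
                continue
            # Back-project: depth from disparity, then pixel -> 3D point.
            z = f * b / d
            p = ((u - cx) * z / f, (v - cy) * z / f, z)
            # Rigid transform into the current camera frame.
            q = [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # point lies behind the current camera
            # Re-project and convert the new depth back to a disparity.
            u2 = round(f * q[0] / q[2] + cx)
            v2 = round(f * q[1] / q[2] + cy)
            d2 = f * b / q[2]
            if 0 <= u2 < w and 0 <= v2 < h:
                # Z-buffer: if several points land on one pixel, keep the
                # nearest one (largest disparity).
                if out[v2][u2] is None or d2 > out[v2][u2]:
                    out[v2][u2] = d2
    return out

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The camera translates 0.5 units to the right, so content shifts left by one pixel.
prev = [[2.0, 2.0, 2.0, 2.0]]
warped = project_disparity(prev, f=10.0, cx=0.0, cy=0.0, b=1.0,
                           R=IDENTITY, t=[-0.5, 0.0, 0.0])
print(warped)  # [[2.0, 2.0, 2.0, None]]
```

The rightmost pixel has no source in the previous frame and remains a hole, which is the region the completion network has to fill.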

Temporal Disparity Completion: A lightweight encoder-decoder network that takes the encoder's context features, a semi-dense disparity, and a sparsity mask as inputs, and outputs a dense disparity and state features.

Temporal State Fusion: Uses a GRU-like module to fuse the current state \(c^t\) and the historical hidden state \(h_{N-1}^{t-1}\), addressing two challenges: (1) the historical state only encodes the information of the preceding viewpoint; (2) there are no historical states in non-overlapping regions.
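A GRU-style gated fusion can be sketched per pixel as below. The source only states that the module is "GRU-like", so the gate layout, the scalar weights in `w`, and the name `fuse_state` are assumptions made for illustration; a real module would presumably operate on feature maps with learned convolutions. Representing a missing history as `h_prev = 0` is likewise an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_state(c, h_prev, w):
    """Fuse the current state c with the historical state h_prev (scalars here).

    w: dict of hypothetical scalar weights standing in for learned convolutions.
    """
    z = sigmoid(w["zc"] * c + w["zh"] * h_prev)              # update gate
    r = sigmoid(w["rc"] * c + w["rh"] * h_prev)              # reset gate
    cand = math.tanh(w["qc"] * c + w["qh"] * (r * h_prev))   # candidate state
    # Convex combination of the old state and the candidate.
    return (1.0 - z) * h_prev + z * cand

weights = {"zc": 1.0, "zh": 1.0, "rc": 1.0, "rh": 1.0, "qc": 1.0, "qh": 1.0}
fused_overlap = fuse_state(c=0.2, h_prev=0.8, w=weights)
# Non-overlapping region: no history is available (modelled here as h_prev = 0),
# so the fused state is driven by the current state alone.
fused_hole = fuse_state(c=0.2, h_prev=0.0, w=weights)
```

The update gate lets the module decide per location how much of the historical state to keep, which is one way to handle both the viewpoint change and the missing history.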

Dual-Space Refinement: In the disparity space, a multi-level GRU updates the disparity (similar to RAFT-Stereo). In the gradient space, the disparity is converted to a gradient field and refined, leveraging the prior that real-world depth is typically locally smooth. Gradient-guided disparity propagation then propagates neighboring disparities to the current pixel under a local plane assumption: \(\hat{d}_n = d + (u_n-u)\frac{\partial d}{\partial u} + (v_n-v)\frac{\partial d}{\partial v}\), and a softmax-weighted sum over the candidates \(\hat{d}_n\) gives the final disparity.
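The propagation step can be illustrated with a small sketch. It reads the formula as a first-order extrapolation along a local plane, with the plane anchored at each neighbor and evaluated at the current pixel. That reading, the neighbor set in `offsets`, and the per-neighbor `logits` (learned in practice) are assumptions, and `propagate` is a hypothetical name.

```python
import math

def propagate(disp, grad_u, grad_v, logits, u, v, offsets):
    """Estimate the disparity at (u, v) from the local planes of its neighbors.

    disp, grad_u, grad_v: 2D lists indexed [v][u].
    logits: one unnormalized score per offset (hypothetical stand-in for
            learned weights).
    offsets: list of (du, dv) neighbor offsets; out-of-bounds ones are skipped.
    """
    h, w = len(disp), len(disp[0])
    cands, scores = [], []
    for (du, dv), s in zip(offsets, logits):
        un, vn = u + du, v + dv
        if not (0 <= un < w and 0 <= vn < h):
            continue
        # Local plane at the neighbor, evaluated at the current pixel.
        d_hat = (disp[vn][un]
                 + (u - un) * grad_u[vn][un]
                 + (v - vn) * grad_v[vn][un])
        cands.append(d_hat)
        scores.append(s)
    # Numerically stable softmax over valid neighbors, then a weighted sum.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * c for e, c in zip(exps, cands))

# A perfectly planar disparity d = 0.5*u + 0.25*v + 3 with its exact gradients.
H, W = 3, 3
disp = [[0.5 * u + 0.25 * v + 3.0 for u in range(W)] for v in range(H)]
grad_u = [[0.5] * W for _ in range(H)]
grad_v = [[0.25] * W for _ in range(H)]
offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]
result = propagate(disp, grad_u, grad_v, [0.1, 2.0, -1.0, 0.3, 0.0], 1, 1, offsets)
```

On an exact plane every neighbor predicts the same value, so the result equals the true disparity at the center (3.75) regardless of the weights. Constant-disparity propagation would not have this property on slanted surfaces.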

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_{cv} + \mathcal{L}_{disp} + \mathcal{L}_{grad}\]

\(\mathcal{L}_{cv}\): Cost volume contrastive loss, maximizing similarity at the ground truth disparity and enforcing a margin
\(\mathcal{L}_{disp}\): L1 disparity loss over three stages (completion, refinement, and propagation), using exponentially decaying weights
\(\mathcal{L}_{grad}\): L1 loss of the disparity gradient after gradient-space refinement and propagation
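The disparity term can be sketched as a weighted sum of L1 losses over an ordered list of predictions. The schedule below (last prediction weighted 1, earlier ones decayed by `gamma` per step, as in RAFT-style sequence losses) and the value `gamma=0.9` are assumptions; the source only states that the weights decay exponentially.

```python
def l1(pred, gt):
    """Mean absolute error over pixels with valid ground truth (gt is not None)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g is not None]
    return sum(abs(p - g) for p, g in pairs) / len(pairs)

def sequence_l1_loss(preds, gt, gamma=0.9):
    """Weighted L1 over an ordered list of predictions.

    The last prediction gets weight 1; earlier ones decay by gamma per step
    (assumed schedule; gamma=0.9 is a placeholder value).
    """
    n = len(preds)
    return sum(gamma ** (n - 1 - i) * l1(p, gt) for i, p in enumerate(preds))

gt = [10.0, 20.0, None, 40.0]     # None marks a pixel without ground truth
preds = [
    [12.0, 18.0, 0.0, 40.0],      # e.g. completion output
    [11.0, 19.0, 0.0, 40.0],      # e.g. refinement output
    [10.0, 20.0, 0.0, 41.0],      # e.g. propagation output
]
loss = sequence_l1_loss(preds, gt)
```

Here the per-prediction errors are 4/3, 2/3, and 1/3, weighted by 0.81, 0.9, and 1 respectively. The same pattern would apply to \(\mathcal{L}_{grad}\) with disparity gradients in place of disparities.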

Key Experimental Results¶

TartanAir Ablation Study¶

Main Results¶

| Setting | Iters | Temporal | State Fusion | TDC | Dual Space | ALL >1px ↓ | OCC >1px ↓ | ALL \|Δd\| ↓ | OCC \|Δd\| ↓ |
|---|---|---|---|---|---|---|---|---|---|
| RAFT-Stereo (5 iter) | 5 | ✗ | ✗ | ✗ | ✗ | 8.04% | 34.91% | 0.35 | 1.15 |
| RAFT-Stereo (32 iter) | 32 | ✗ | ✗ | ✗ | ✗ | 6.02% | 25.89% | 0.28 | 0.93 |
| +Temporal+Fusion | 5 | ✓ | ✓ | ✗ | ✗ | 6.98% | 29.89% | 0.25 | 0.86 |
| +Temporal+Fusion+TDC | 5 | ✓ | ✓ | ✓ | ✗ | 6.10% | 25.97% | 0.23 | 0.78 |
| +Temporal+Fusion+Dual Space | 5 | ✓ | ✓ | ✗ | ✓ | 6.28% | 25.89% | 0.21 | 0.74 |
| TC-Stereo (Full) | 5 | ✓ | ✓ | ✓ | ✓ | 5.98% | 24.67% | 0.21 | 0.71 |

KITTI 2015 Leaderboard¶

TC-Stereo ranks second on the KITTI 2015 leaderboard while being more efficient than other SOTA methods.

Key Findings¶

With only 5 iterations, TC-Stereo is more accurate and more consistent than RAFT-Stereo with 32 iterations (ALL >1px 5.98% vs. 6.02%; ALL |Δd| 0.21 vs. 0.28)
Relative to 5-iteration RAFT-Stereo, temporal jitter |Δd| in occluded regions drops from 1.15 to 0.71, a 38% reduction
TDC and dual-space refinement are complementary in improving occluded regions: TDC provides a good initialization, while dual-space refinement enhances details in ill-posed regions
Directly using the state of the preceding frame for initialization (similar to XR-Stereo) degrades performance, demonstrating the necessity of state fusion
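The relative improvements quoted above can be checked directly from the ablation table. The short calculation below compares 5-iteration RAFT-Stereo with the full TC-Stereo; `relative_reduction` is just a helper name chosen here.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - ours) / baseline

# Values copied from the ablation table (RAFT-Stereo 5 iter vs. TC-Stereo Full).
occ_jitter = relative_reduction(1.15, 0.71)    # OCC |Δd|
all_jitter = relative_reduction(0.35, 0.21)    # ALL |Δd|
occ_1px = relative_reduction(34.91, 24.67)     # OCC >1px
print(f"OCC |Δd|: {occ_jitter:.1%}, ALL |Δd|: {all_jitter:.1%}, OCC >1px: {occ_1px:.1%}")
# OCC |Δd|: 38.3%, ALL |Δd|: 40.0%, OCC >1px: 29.3%
```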

Highlights & Insights¶

Dual-space refinement serves as the core innovation: It constrains smoothness in the gradient space and iteratively propagates it globally, which is particularly effective for reflection and occlusion areas
Temporal disparity completion restricts the search range from global to local, leading to smaller and more stable update steps
Two hierarchical temporal consistency metrics are designed (absolute difference |Δd| and error divergence ReLU(Δe)), providing a more comprehensive evaluation
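One plausible reading of the two metrics is sketched below; the exact definitions are in the paper and may differ. The sketch assumes both frames have already been aligned to a common view, that |Δd| is the mean absolute disparity change between them, and that Δe is the frame-to-frame change in absolute error. Function names are hypothetical.

```python
def abs_disparity_change(d_prev, d_cur):
    """Mean |Δd| between two disparity maps aligned to the same view."""
    return sum(abs(c - p) for p, c in zip(d_prev, d_cur)) / len(d_cur)

def error_divergence(d_prev, d_cur, gt_prev, gt_cur):
    """Mean ReLU(Δe), where Δe is the frame-to-frame change in absolute error."""
    total = 0.0
    for p, c, gp, gc in zip(d_prev, d_cur, gt_prev, gt_cur):
        delta_e = abs(c - gc) - abs(p - gp)
        total += max(0.0, delta_e)  # ReLU: only growing errors are penalized
    return total / len(d_cur)

d_prev = [10.0, 20.0, 30.0]
d_cur = [10.5, 19.0, 30.0]
gt = [10.0, 20.0, 31.0]  # static scene: same ground truth in both frames
jitter = abs_disparity_change(d_prev, d_cur)
divergence = error_divergence(d_prev, d_cur, gt, gt)
```

Under these definitions, a change that moves a prediction toward the ground truth raises |Δd| but not ReLU(Δe), which is why the two metrics complement each other.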

Limitations & Future Work¶

Requires camera poses as input
Robustness to extremely fast motions or large occlusion changes needs further improvement
Dual-space refinement introduces a certain amount of computational overhead

Incorporating video information into stereo matching has become increasingly popular (e.g., DynamicStereo, TemporalStereo). The proposed disparity gradient space refinement strategy could be extended to other tasks such as monocular depth estimation.

Rating¶

Novelty: ⭐⭐⭐⭐
Practicality: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐