ZeroStereo: Zero-shot Stereo Matching from Single Images¶
Conference: ICCV 2025
arXiv: 2501.08654
Code: GitHub
Area: 3D Vision
Keywords: stereo matching, zero-shot generalization, diffusion inpainting, monocular depth estimation, view synthesis
TL;DR¶
This paper proposes ZeroStereo, a pipeline that starts from an arbitrary single image, derives a pseudo disparity map via monocular depth estimation, and synthesizes a high-quality right view with a fine-tuned diffusion inpainting model. Trained on only 35K synthetic samples, the approach achieves state-of-the-art zero-shot stereo matching generalization.
Background & Motivation¶
- Core Problem: Supervised stereo matching models perform well on standard benchmarks, but generalization to real-world scenes degrades significantly due to the extreme scarcity of annotated real-world stereo data.
- Limitations of Prior Work:
- Domain-invariant feature learning (DSMNet, GraftNet, ITSA, etc.): A non-trivial domain gap still exists between synthetic and real data.
- Self-supervised learning (photometric loss): Severely affected by occlusions, ghosting artifacts, and ill-posed regions; large-scale collection of high-quality stereo image pairs is itself non-trivial.
- View synthesis:
- Early methods (Luo et al., MfS-Stereo) generate right-view images via monocular depth and forward warping, but occluded regions are filled with neighboring pixels or random backgrounds, causing structural inconsistency.
- NeRF-Stereo requires multi-view inputs for scene reconstruction, offering poor flexibility; NeRF also performs poorly for distant scene reconstruction, making it unsuitable for large-scale outdoor settings.
- Motivation: Can a single image combined with a pre-trained diffusion model be used to complete occluded regions with high quality, while assessing pseudo-label reliability without any additional training?
Method¶
Overall Pipeline¶
Given a single left image \(\mathbf{I}_l\), the pipeline proceeds as follows (a code sketch follows the list):
- Monocular Depth Estimation: Depth Anything V2 (DAv2) produces a normalized inverse depth map \(\mathbf{D}\).
- Adaptive Disparity Selection (ADS): \(\mathbf{D}\) is multiplied by a scaling factor \(s \cdot w\) to obtain the pseudo disparity map \(\mathbf{d}\), where \(s\) is sampled from a piecewise uniform distribution and \(w\) is the image width.
- Forward Warping: Following MfS-Stereo, a warped right image \(\tilde{\mathbf{I}}_r\), a non-occlusion mask \(\mathbf{M}_{noc}\), and an inpainting mask \(\mathbf{M}_{inp}\) are generated.
- Diffusion Inpainting: A fine-tuned Stable Diffusion V2 Inpainting (SDv2I) model performs semantically coherent completion of the occluded regions, yielding the final right image \(\mathbf{I}_r\).
- Training-Free Confidence Generation (TCG): The confidence map \(\mathbf{C}\) of the depth estimation is computed via horizontal flip symmetry.
- Stereo Training: RAFT-Stereo / IGEV-Stereo is trained using the ZeroStereo Loss.
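To make the data flow concrete, here is a minimal Python sketch of the generation loop. Every function name (`mono_depth`, `sample_scale`, `forward_warp`, `diffusion_inpaint`, `tcg_confidence`) is a placeholder for the corresponding stage above, not the paper's actual API; the individual stages are sketched in their own subsections below.

```python
import torch

def generate_stereo_pair(left: torch.Tensor):
    """Hypothetical orchestration of the ZeroStereo data-generation pipeline.

    left: (1, 3, H, W) image batch. All called functions are placeholders
    for the stages described above.
    """
    depth = mono_depth(left)                    # DAv2: normalized inverse depth D
    s = sample_scale()                          # ADS scaling factor (sketched below)
    disparity = depth * s * left.shape[-1]      # pseudo disparity d = D * s * w
    warped_right, noc_mask, inp_mask = forward_warp(left, disparity)
    right = diffusion_inpaint(warped_right, inp_mask)   # fine-tuned SDv2I
    confidence = tcg_confidence(mono_depth, left)       # TCG confidence map C
    return right, disparity, noc_mask, inp_mask, confidence
```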
Diffusion Inpainting¶
- Base Model: Stable Diffusion V2 Inpainting (SDv2I).
- Motivation for Fine-tuning:
- Standard text-guided inpainting lacks text prompts suited to stereo occlusion regions.
- Stereo occlusion masks are diverse and irregular; directly applying the pre-trained model yields suboptimal results.
- Fine-tuning Protocol:
- Trained on synthetic datasets with accurate dense disparity ground truth: Scene Flow, Tartan Air, CREStereo Dataset, and VKITTI 2.
- The VAE is frozen and only the U-Net is fine-tuned; text conditioning is disabled; a DDPM noise scheduler with 1000 steps is used.
- 50K steps at batch size 32 (gradient accumulation over 4 steps), AdamW with a one-cycle learning-rate schedule (2e-5), inputs cropped to 512×512.
- Inference Protocol: DDIM scheduler with 50 sampling steps. The final output is blended via the inpainting mask: \(\mathbf{I}_r = \mathbf{M}_{inp} \odot \mathbf{I}_d + (1-\mathbf{M}_{inp}) \odot \tilde{\mathbf{I}}_r\), where \(\mathbf{I}_d\) is the diffusion output (a code sketch follows this list).
- Advantages: Compared to StereoDiffusion (whose latent-space warping causes structural distortion), ZeroStereo requires only 5.8 GB of memory versus 14.6 GB and runs in 1.9 s versus 31.2 s per image.
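As a concrete illustration of the inference protocol, below is a minimal sketch using the Hugging Face `diffusers` library. The checkpoint path, the variable names, and the empty prompt standing in for disabled text conditioning are assumptions, not the authors' released code.

```python
import numpy as np
import torch
from diffusers import StableDiffusionInpaintPipeline, DDIMScheduler

# Load the base SDv2I pipeline; in practice the fine-tuned U-Net weights
# would replace the stock ones (the model id below is the public base).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# warped_right: PIL image of the forward-warped right view (I~_r)
# inpaint_mask: PIL image, white where occluded pixels must be filled (M_inp)
diffused = pipe(
    prompt="",                   # stands in for disabled text conditioning
    image=warped_right,
    mask_image=inpaint_mask,
    num_inference_steps=50,      # 50-step DDIM sampling
).images[0]

# Blend: diffusion output inside the mask, warped pixels outside it.
# I_r = M_inp * I_d + (1 - M_inp) * I~_r
m = np.asarray(inpaint_mask.convert("L"), np.float32)[..., None] / 255.0
right = m * np.asarray(diffused, np.float32) + (1 - m) * np.asarray(warped_right, np.float32)
```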
Training-Free Confidence Generation (TCG)¶
- Core Idea: Modern monocular depth models predict relative depth (inverse depth); the relative depth between pixels should remain consistent after horizontal flipping.
- Procedure (sketched in code below):
- The left image is horizontally flipped and fed into DAv2 alongside the original to obtain \(\mathbf{D}\) and \(\mathbf{D}'\).
- The flipped prediction is un-flipped for alignment, and a per-pixel agreement map is computed: \(\mathbf{u} = 1 - |\mathbf{D} - \mathbf{H}^{-1}(\mathbf{D}')|\), where \(\mathbf{H}^{-1}\) denotes the un-flip operation.
- Normalization yields the confidence map \(\mathbf{C}\).
- Effect: Low-confidence regions concentrate on edges, textureless areas, and fine structures — precisely the most ambiguous locations in stereo matching.
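The procedure translates into a few lines of PyTorch. This is a minimal sketch, assuming `model` maps a \((B, 3, H, W)\) batch to a \((B, H, W)\) normalized inverse-depth map; min-max normalization is one plausible reading of the paper's normalization step.

```python
import torch

def tcg_confidence(model, left: torch.Tensor) -> torch.Tensor:
    """Training-Free Confidence Generation (sketch).

    model: relative-depth estimator (e.g. Depth Anything V2), assumed to map
    a (B, 3, H, W) batch to a (B, H, W) normalized inverse-depth map.
    """
    with torch.no_grad():
        d = model(left)                               # D
        d_flip = model(torch.flip(left, dims=[-1]))   # D' from the flipped input
    d_back = torch.flip(d_flip, dims=[-1])            # H^{-1}(D'): un-flip to align
    u = 1.0 - (d - d_back).abs()                      # per-pixel agreement map u
    # Min-max normalize per image to [0, 1] to obtain the confidence map C.
    u_min = u.amin(dim=(-2, -1), keepdim=True)
    u_max = u.amax(dim=(-2, -1), keepdim=True)
    return (u - u_min) / (u_max - u_min + 1e-6)
```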
Adaptive Disparity Selection (ADS)¶
- Problem: MfS-Stereo samples the maximum disparity uniformly from a fixed range. At low resolutions the disparity-to-width ratio becomes too large, causing foreground distortion and excessive occlusion; at high resolutions it becomes too small, leaving insufficient left-right difference.
- Solution: Disparity is computed as \(\mathbf{d} = \mathbf{D} \cdot s \cdot w\), where \(w\) is the image width and \(s\) is sampled from a three-segment distribution (sketched after this list):
- Central segment \((c-r, c+r)\), probability \(p_c = 0.8\) (primary operating range).
- Small-disparity segment \((c-2r, c-r)\), probability \(p_s = 0.1\).
- Large-disparity segment \((c+r, c+2r)\), probability \(p_l = 0.1\).
- Default: \(c=0.1, r=0.05\).
- Effect: Adaptively adjusts the disparity range according to image width, ensuring diversity while avoiding degenerate cases.
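A minimal sketch of the three-segment sampler with the defaults above; `sample_scale` is an illustrative name, not the paper's.

```python
import random

def sample_scale(c: float = 0.1, r: float = 0.05,
                 p_c: float = 0.8, p_s: float = 0.1, p_l: float = 0.1) -> float:
    """Sample the ADS scaling factor s from the three-segment distribution.

    The pseudo disparity is then d = D * s * w for image width w, so the
    disparity-to-width ratio stays in (c - 2r, c + 2r) at any resolution.
    """
    segment = random.choices(["small", "central", "large"],
                             weights=[p_s, p_c, p_l])[0]
    if segment == "central":
        return random.uniform(c - r, c + r)
    if segment == "small":
        return random.uniform(c - 2 * r, c - r)
    return random.uniform(c + r, c + 2 * r)
```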
ZeroStereo Loss¶
The training objective combines a disparity loss and a photometric loss via the TCG confidence map (a code sketch follows the list):
- Disparity Loss: \(\mathcal{L}_d = \|\tilde{\mathbf{d}} - \mathbf{d}\|_1\), where \(\tilde{\mathbf{d}}\) is the disparity predicted by the stereo network.
- Non-occluded Photometric Loss \(\mathcal{L}_{np}\): The predicted disparity is used to backward-warp the right image, which is then compared against the left image with an SSIM + L1 combination (\(\beta=0.85\)); unreliable pixels are excluded via the non-occlusion mask and the backward-warped inpainting mask.
- ZeroStereo Loss: \(\mathcal{L}_{Zero} = \mathbf{C} \odot \mathcal{L}_d + \mu \cdot (1-\mathbf{C}) \odot \mathcal{L}_{np}\), with \(\mu=0.1\). High-confidence regions are supervised primarily by pseudo ground truth; low-confidence regions rely on photometric consistency self-supervision.
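A minimal PyTorch sketch of the combined objective, under stated assumptions: all maps share shape \((B, H, W)\), `ssim_map_fn` is any standard per-pixel SSIM implementation, `valid_mask` stands in for the paper's combination of the non-occlusion mask and the backward-warped inpainting mask, and the warping helper is illustrative.

```python
import torch
import torch.nn.functional as F

def backward_warp(right: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Warp the right image toward the left view; disp is in pixels.

    right: (B, 3, H, W), disp: (B, H, W). Illustrative helper.
    """
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    grid = torch.stack((xs - 2.0 * disp / (w - 1), ys), dim=-1)  # left(x) = right(x - d)
    return F.grid_sample(right, grid, align_corners=True)

def zerostereo_loss(pred_disp, pseudo_disp, left, right, conf, valid_mask,
                    ssim_map_fn, beta=0.85, mu=0.1):
    """Sketch of L_Zero = C * L_d + mu * (1 - C) * L_np."""
    # Disparity loss against the pseudo ground truth (per-pixel L1).
    l_d = (pred_disp - pseudo_disp).abs()

    # Non-occluded photometric loss: SSIM + L1 between the left image and
    # the right image backward-warped with the predicted disparity.
    warped = backward_warp(right, pred_disp)
    l1 = (left - warped).abs().mean(dim=1)
    ssim = ssim_map_fn(left, warped).mean(dim=1)   # per-pixel SSIM in [0, 1]
    l_np = (beta * 0.5 * (1 - ssim) + (1 - beta) * l1) * valid_mask

    # High-confidence pixels lean on pseudo GT, low-confidence on photometry.
    return (conf * l_d + mu * (1.0 - conf) * l_np).mean()
```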
Key Experimental Results¶
Ablation Study (IGEV-Stereo, Tab. 1)¶
| Module Combination | KITTI-15 EPE | KITTI-15 >3px | Midd-T EPE | ETH3D >1px |
|---|---|---|---|---|
| Baseline | 1.52 | 4.89 | 2.71 | 2.38 |
| +ADS | 1.24 | 4.84 | 2.28 | 2.27 |
| +Inpainting | 1.44 | 4.85 | 2.34 | 1.92 |
| +ADS+Inpainting | 1.06 | 4.74 | 2.26 | 2.05 |
| +ADS+Inpainting+TCG | 1.05 | 4.71 | 2.18 | 2.01 |
| All + ZeroStereo Loss | 1.04 | 4.73 | 2.09 | 1.90 |
Zero-Shot Generalization Benchmark (Tab. 8, Zero-RAFT-Stereo)¶
| Method | KITTI-15 >3px (All) | Midd-T H >2px (Noc) | ETH3D >1px (Noc) |
|---|---|---|---|
| Zero-RAFT-Stereo | 4.53 | 4.45 | 2.13 |
| NS-RAFT-Stereo (NeRF-Stereo official) | 5.41 | 6.45 | 2.55 |
| RAFT-Stereo (SceneFlow GT) | 5.47 | 8.66 | 2.29 |
Dataset Scale Comparison (Tab. 5, Zero-RAFT-Stereo)¶
Using only 35K synthetic samples (MfS35K), the method outperforms FoundationStereo trained on 1106K samples, demonstrating that data diversity matters more than absolute scale.
Synthesis Efficiency (Tab. 4)¶
| Method | Resolution | VRAM | Time per Image |
|---|---|---|---|
| RePaint | 256×256 | 2.7G | 156.5s |
| StereoDiffusion | 512×512 | 14.6G | 31.2s |
| ZeroStereo (Ours) | 512×512 | 5.8G | 1.9s |
Highlights & Insights¶
- Elegant paradigm: The task of stereo data generation is decomposed into three modular stages — monocular depth estimation, forward warping, and diffusion inpainting — each leveraging the strongest pre-trained models in its respective domain, avoiding end-to-end training from scratch.
- TCG requires no additional training: The flip symmetry of monocular depth models serves as a cost-free signal for confidence estimation, making the approach both concise and transferable to other pseudo-label settings.
- ADS adapts to resolution: This directly addresses the practical engineering issue of fixed disparity ranges being mismatched to varying image resolutions.
- 35K samples outperform millions: This demonstrates that training data synthesized from real single images possesses inherently superior scene diversity compared to larger purely synthetic datasets.
- Boundary error analysis (Tab. 7): Models trained on MfS35K exhibit the lowest errors in boundary regions, confirming that diffusion inpainting yields genuinely higher-quality completion at occlusion boundaries than competing approaches.
Limitations & Future Work¶
- The fine-tuned SDv2I still fails on complex scenes: ill-posed regions such as transparent objects and mesh-like structures are already mishandled at the forward-warping stage, so inpainting cannot recover them.
- Occasional color inconsistency: Because fine-tuning uses synthetic datasets, a color distribution gap exists between synthetic and real images.
- Dependence on monocular depth model quality: The performance ceiling of the entire pipeline is determined by DAv2's accuracy; failures in certain scenes propagate to all downstream components.
- Potential improvements:
- Replacing SDv2I with more advanced diffusion models (e.g., SDXL Inpainting or Flux) to enhance inpainting quality.
- Introducing multi-scale disparity generation strategies.
- Incorporating video diffusion models for temporally consistent stereo video data generation.
- The flip-invariance assumption in TCG may not hold for scenes with text or strong directional content; additional geometric transformations such as rotation could be considered.
Related Work & Insights¶
- MfS-Stereo [Watson et al.]: The direct predecessor of this work, generating stereo pairs via monocular depth and forward warping but filling occluded regions with random backgrounds.
- NeRF-Stereo [Tosi et al.]: Reconstructs 3D scenes via NeRF to render stereo pairs and introduces Ambient Occlusion for confidence estimation, but requires multi-view inputs and performs poorly on distant scenes.
- Marigold [Ke et al.]: Demonstrates that diffusion models can be fine-tuned from synthetic data for monocular depth estimation, inspiring the analogous approach adopted here for stereo image synthesis.
- Depth Anything V2: The monocular depth model used in this work to provide the basis for pseudo disparity.
- Stable Diffusion V2 Inpainting: The base inpainting model fine-tuned in this work.
- Insight: The paradigm of using strong pre-trained models from one domain to automatically generate training data for another domain holds promise for extension to optical flow estimation, scene flow estimation, and beyond.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Individual modules are not entirely novel, but the overall pipeline design and the flip-symmetry idea in TCG are notably elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablation study, multi-dataset and multi-model evaluation, fair comparison with NeRF-Stereo, and additional boundary error and synthesis efficiency analyses.
- Writing Quality: ⭐⭐⭐⭐ — The paper is clearly structured with information-dense figures and tables and consistent notation.
- Value: ⭐⭐⭐⭐ — Highly practical; achieving state-of-the-art performance with only 35K samples is particularly significant for resource-constrained settings.