
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?

Conference: CVPR 2026
arXiv: 2504.16930
Code: Not available (the paper states that the procedural generation code will be open-sourced, but no link is given)
Area: 3D Vision
Keywords: stereo matching, synthetic data, procedural generation, zero-shot generalization, dataset design, Infinigen

TL;DR

This paper systematically ablates the design space of synthetic training data for stereo matching (floating objects, backgrounds, materials, baselines, and more) and finds that "realistic indoor scenes + dense floating objects + wide baseline" is the best-performing combination. Networks trained solely on the resulting WMGStereo-150k dataset outperform the same networks trained on a mixture of four classical datasets on most zero-shot benchmarks, with ETH3D as the exception in the main results table.

Background & Motivation

Problem Definition: Stereo matching estimates per-pixel disparity from binocular RGB images. Synthetic data, which provides precise depth annotations, is central to training. However, the critical question of what constitutes effective synthetic data design has never been systematically studied.

Limitations of Prior Work:

Entangled design variables: Existing synthetic datasets vary enormously—from the random flying objects of FlyingThings3D to the photorealistic simulators of TartanAir—yet each new dataset simultaneously changes multiple factors (object types, materials, scene layout, camera parameters, etc.), making it impossible to attribute the contribution of any single design choice. For instance, FoundationStereo introduces both a new architecture and new data, and the relative importance of individual data factors (floating objects, random lighting, physical simulation, etc.) cannot be disentangled.

Non-reproducible generation pipelines: Classical datasets such as TartanAir and IRS do not release generation code or assets, creating a hard barrier to ablations such as "what if only the materials are changed?"

Limitations of prior analysis: The seminal analysis of Mayer et al. concluded that "realism is overrated," but their experiments were based solely on 2D-warp FlyingChairs-style datasets without covering modern 3D-rendered datasets, limiting the generalizability of that conclusion.

Core Motivation: Leveraging the controllability of the open-source procedural generation platform Infinigen, this work isolates and ablates each design dimension of synthetic stereo data one at a time, identifies the factors that truly govern zero-shot generalization performance, and constructs a superior dataset accordingly.

Method

Overall Architecture

The authors build a configurable procedural stereo data generation system on top of Infinigen and the Blender Python API. The core contribution is not a new stereo matching network, but rather a parameterizable data production pipeline coupled with systematic ablation experiments.
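
As an illustrative sketch of what such a parameterizable pipeline could look like, the dataclass below gathers the design dimensions that this summary reports as ablated and derives single-factor variants from a default configuration. The class name `StereoDataConfig`, its field names, and the `one_at_a_time` helper are hypothetical and are not taken from the paper's code; only the default values mirror settings reported in this summary.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StereoDataConfig:
    """Hypothetical knobs for one synthetic stereo dataset variant."""
    floating_objects: tuple = (10, 30)   # min/max floating objects per scene
    background_objects: bool = True      # keep furniture and other scene content
    object_types: str = "all"            # "all" or a single category name
    materials: str = "all"               # "all", "diffuse", or "metal_glass"
    baseline_m: tuple = (0.04, 0.4)      # stereo baseline range in meters
    lighting_augmentation: bool = True


def one_at_a_time(default, changes):
    """Build variants that each differ from `default` in exactly one field."""
    variants = {"default": default}
    for field_name, value in changes:
        variants[f"{field_name}={value}"] = replace(default, **{field_name: value})
    return variants


variants = one_at_a_time(
    StereoDataConfig(),
    [
        ("floating_objects", (0, 0)),
        ("background_objects", False),
        ("materials", "diffuse"),
        ("baseline_m", (0.04, 0.1)),
    ],
)
for name, cfg in variants.items():
    print(name, cfg)
```

Changing one field per variant is what lets a difference in benchmark error be attributed to a single design choice.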

The system supports three scene types:

Indoor Floating Objects: Objects are placed at random, via ray casting, inside rooms generated by Infinigen Indoors. This combines scene realism (furniture, walls, floors, etc.) with geometric diversity (additional suspended objects).
Dense Floating Objects: A large number of objects (~200) are densely placed within the camera frustum against a blank sky background, maximizing geometric diversity in the spirit of the classical FlyingThings3D design.
Nature: Outdoor natural scenes are generated directly using Infinigen Nature.
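
To make "placed within the camera frustum" concrete, here is a minimal sketch that samples object positions inside a pinhole camera frustum. It is not the paper's placement code: the function name, the field-of-view and depth values, the +z viewing direction (Blender cameras look along their local -z axis), and the choice of uniform depth sampling are all assumptions made for illustration.

```python
import math
import random


def sample_in_frustum(rng, hfov_deg, vfov_deg, near, far):
    """Sample a camera-space point (x, y, z) inside a pinhole view frustum.

    Depth is drawn uniformly in [near, far], which is an assumption and is
    not uniform in volume. The camera is assumed to look along +z.
    """
    z = rng.uniform(near, far)
    half_w = z * math.tan(math.radians(hfov_deg) / 2)  # half-width of the view at depth z
    half_h = z * math.tan(math.radians(vfov_deg) / 2)  # half-height of the view at depth z
    return (rng.uniform(-half_w, half_w), rng.uniform(-half_h, half_h), z)


rng = random.Random(0)
# ~200 objects per dense scene, as reported in the summary; the other values are illustrative.
positions = [sample_in_frustum(rng, 60.0, 40.0, 0.5, 8.0) for _ in range(200)]
print(len(positions), positions[0])
```

Every sampled center projects inside the image, so each object contributes visible geometry to the stereo pair.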

Key engineering designs:
- Floating object placement interface: Supports ray-casting-based placement (within the camera frustum) or bounding-box constraints, with controllable intersection with existing scene geometry.
- Material management tools: Automatically detects and removes glass materials from object sub-parts (to avoid ill-posed problems caused by fully transparent surfaces). Exterior windows are handled specially: rather than replacing the glass material (which would corrupt scene lighting), the window geometry is deleted entirely.
- Automatic removal of high-error objects/materials: Per-object and per-pixel error statistics are used to identify and discard problematic objects such as cacti, sea urchins (extremely fine needle-like structures), and shelves (tiny holes), as well as extreme materials that are fully transparent or fully reflective.
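
One way the per-object error statistics could be computed is sketched below: given a per-pixel disparity error map and an object-ID map of the same shape, average the error per object and flag objects above a threshold. The function name, the threshold, and the toy data are hypothetical; the paper's actual selection criterion is not described in this summary.

```python
from collections import defaultdict


def flag_high_error_objects(error_px, object_ids, threshold_px):
    """Return {object_id: mean_error} for objects whose mean error exceeds threshold_px."""
    total = defaultdict(float)
    count = defaultdict(int)
    for err_row, id_row in zip(error_px, object_ids):
        for err, obj in zip(err_row, id_row):
            total[obj] += err
            count[obj] += 1
    means = {obj: total[obj] / count[obj] for obj in total}
    return {obj: mean for obj, mean in means.items() if mean > threshold_px}


# Toy 2x4 "image": per-pixel disparity error and the object each pixel belongs to.
error_px = [[0.2, 0.4, 5.0, 7.0],
            [0.3, 0.1, 6.0, 0.5]]
object_ids = [["wall", "wall", "cactus", "cactus"],
              ["wall", "wall", "cactus", "chair"]]

flagged = flag_high_error_objects(error_px, object_ids, threshold_px=2.0)
print(flagged)
```

In this toy example only the needle-like object is flagged, while low-error objects are kept.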

Key Designs: Parametric Ablation Study

Experimental setup: For each parameter variant, 5,000 stereo pairs are generated using the indoor floating objects scene type. RAFT-Stereo is trained from random initialization for 75k steps and evaluated zero-shot on six benchmarks: Middlebury 2014/2021, ETH3D, KITTI-12/15, and Booster.
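
The numbers reported below are threshold error rates. As a hedged sketch, a "2px error" can be read as the percentage of valid pixels whose absolute disparity error exceeds 2 pixels. The function name and toy values are illustrative, and each benchmark's official protocol (resolution, masks, threshold) may differ from this simplified version.

```python
def bad_pixel_rate(pred, gt, threshold_px, valid=None):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold_px."""
    bad = 0
    total = 0
    for i, (p, g) in enumerate(zip(pred, gt)):
        if valid is not None and not valid[i]:
            continue  # skip pixels without ground truth
        total += 1
        if abs(p - g) > threshold_px:
            bad += 1
    if total == 0:
        raise ValueError("no valid pixels to evaluate")
    return 100.0 * bad / total


# Flattened toy disparity maps (in pixels); the last pixel has no ground truth.
pred = [10.0, 12.5, 30.0, 7.0, 0.0]
gt = [10.5, 15.0, 30.0, 3.0, 0.0]
valid = [True, True, True, True, False]

print(bad_pixel_rate(pred, gt, threshold_px=2.0, valid=valid))
```

Lower is better, so a drop from 12.52 to 6.60 means roughly half as many pixels exceed the threshold.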

① Floating object density (one of the most critical design choices):
- No floating objects → Middlebury 2014(H) 2px error: 12.52
- 0–10 floating objects → 7.78 (↓38%)
- 10–30 floating objects → 6.60 (↓47%)
- Finding: Floating objects, despite reducing scene realism, greatly increase geometric diversity and are crucial for zero-shot generalization. The final dataset places ~200 objects in dense scenes.

② Background objects (realism does help):
- Removing background objects such as furniture degrades performance across all benchmarks (e.g., Middlebury(H) rises from 6.60 to 8.35).
- Finding: This refutes the classical claim that "realism is overrated." A certain degree of scene realism provides significant benefit for zero-shot generalization.

③ Object types (diversity over specialization):
- Individual object categories perform best on specific benchmarks (chairs help Middlebury; shrubs help ETH3D/KITTI) but yield the worst cross-benchmark robustness.
- Using all object generators produces the most balanced performance across all benchmarks.

④ Object materials (a hard bottleneck for existing networks):
- Metal + glass only → best on KITTI-15/Booster but catastrophic on ETH3D (4.95 vs. 2.77)
- Diffuse only → best on ETH3D but severe degradation on Booster (12.73 vs. 9.80)
- Core finding: Existing stereo matching networks cannot learn non-Lambertian materials without degrading performance on diffuse regions, motivating co-design of architecture and data.

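⑤ Camera baseline randomization (an underappreciated critical factor):
- Narrow baseline only [0.04, 0.1 m] → Middlebury(H) degrades from 6.60 to 9.60; Booster from 10.60 to 17.03.
- Wide range [0.04, 0.4 m] gives the best results on Middlebury, KITTI-15, and Booster. In the Table 1 rows reproduced in this summary, ETH3D is the exception: the narrow baseline scores 2.89 against 3.92 for the wide range.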

⑥ Lighting augmentation: Minimal impact, but retained to cover diverse in-the-wild conditions.
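
A likely reason the baseline range matters is the standard rectified-stereo relation between disparity, focal length, baseline, and depth, d = f · B / Z: a wider baseline range exposes the network to a wider range of disparities for the same scene depth. The sketch below evaluates that relation for the two baseline ranges above. The focal length of 700 px and the depths of 1 m and 5 m are assumed values chosen for illustration and do not come from the paper.

```python
def disparity_px(focal_px, baseline_m, depth_m):
    """Disparity (pixels) of a point at depth_m for a rectified pinhole stereo pair."""
    return focal_px * baseline_m / depth_m


FOCAL_PX = 700.0  # assumed focal length in pixels, not from the paper
BASELINE_RANGES_M = {"narrow": (0.04, 0.1), "wide": (0.04, 0.4)}

for name, (b_min, b_max) in BASELINE_RANGES_M.items():
    for depth_m in (1.0, 5.0):
        d_min = disparity_px(FOCAL_PX, b_min, depth_m)
        d_max = disparity_px(FOCAL_PX, b_max, depth_m)
        print(f"{name:6s} depth={depth_m} m: {d_min:.1f} to {d_max:.1f} px")
```

Under these assumptions the wide range covers disparities up to four times larger than the narrow range at the same depth, since disparity scales linearly with baseline.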

Loss & Training

Training protocol: RAFT-Stereo, DLNR, and Selective-IGEV are each trained with their own original training procedure and hyperparameters. All models are trained from random initialization for 200k steps. No new loss functions or training tricks are introduced.
Balanced scene type sampling: The three scene types are sampled with equal weight (33%–33%–33%) during training; ablations confirm this ratio is optimal.
Masking strategy: Sky regions and untextured exterior room areas are masked out.
Cost optimization (6× speedup):
Solver steps reduced from 550 to 60 (greedy mode, which only adds objects and never removes them); indoor scene generation time reduced from 51 to 13 minutes.
Render samples reduced from 8,192 to 1,024, with Blender OptiX denoising; rendering time reduced to 27 seconds per frame.
Scene reuse: 20 independent camera placements per indoor scene; 200 randomizations (object poses, lighting, baseline) per dense scene.
Under a fixed compute budget, the low-cost configuration (30k samples) outperforms the high-cost configuration (5k samples).
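
The fixed-budget comparison can be checked with back-of-the-envelope arithmetic. The sketch below amortizes scene generation over the camera placements per scene and then asks how many samples fit in the budget of 5k high-cost samples. It assumes the reported 27 s render time applies per sample, that the 6× speedup applies to the whole per-sample cost, and that other overheads are negligible; the helper names are hypothetical.

```python
def amortized_seconds_per_sample(scene_gen_min, samples_per_scene, render_s):
    """Per-sample cost when one generated scene is reused for several samples."""
    return scene_gen_min * 60.0 / samples_per_scene + render_s


def samples_within_budget(budget_s, seconds_per_sample):
    """Number of whole samples that fit in a compute budget."""
    return int(budget_s // seconds_per_sample)


# Low-cost indoor configuration: 13 min per scene, 20 cameras per scene, 27 s per render.
low_cost_s = amortized_seconds_per_sample(13, 20, 27)

# Assumption: the high-cost configuration is 6x more expensive per sample overall.
high_cost_s = 6 * low_cost_s
budget_s = 5_000 * high_cost_s  # budget that buys 5k high-cost samples

print(low_cost_s, samples_within_budget(budget_s, low_cost_s))
```

Under these assumptions the same budget buys six times as many low-cost samples, which matches the 30k versus 5k comparison.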

Key Experimental Results

Main Results: Zero-Shot Stereo Matching (Table 2, 200k training steps)

Model	Midd 2014(H)	Midd 2021	ETH3D	KITTI-12	KITTI-15	Booster(Q)
RAFT-Mixed (SF+CRE+TA+IRS, 600k)	5.50	8.97	2.58	3.64	4.95	11.46
RAFT-WMGStereo-150k	4.48	8.17	2.93	3.25	4.25	9.17
DLNR-Mixed	5.21	9.30	2.50	3.68	4.95	12.17
DLNR-WMGStereo-150k	3.76	6.72	2.50	3.30	4.54	9.09
Sel-IGEV-Mixed	5.24	8.24	2.37	3.97	5.31	11.00
Sel-IGEV-WMGStereo-150k	3.61	7.62	2.47	3.26	4.55	8.84
FoundationStereo	1.10	4.17	0.50	2.30	2.80	4.16

DLNR-WMGStereo-150k vs. DLNR-Mixed: error on Middlebury 2014(H) is reduced by 28% (5.21 → 3.76) and on Booster(Q) by 25% (12.17 → 9.09).
RAFT trained solely on WMGStereo-150k surpasses StereoAnywhere—which leverages large-scale monocular priors—on Middlebury 2014.
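
The relative reductions quoted for DLNR can be recomputed from the table rows above. This short check uses only values copied from that table; the helper name is illustrative.

```python
BENCHMARKS = ["Midd 2014(H)", "Midd 2021", "ETH3D", "KITTI-12", "KITTI-15", "Booster(Q)"]
DLNR_MIXED = [5.21, 9.30, 2.50, 3.68, 4.95, 12.17]
DLNR_WMG = [3.76, 6.72, 2.50, 3.30, 4.54, 9.09]


def relative_reduction_pct(baseline_err, new_err):
    """Relative error reduction in percent (positive means the new error is lower)."""
    return 100.0 * (baseline_err - new_err) / baseline_err


for name, old, new in zip(BENCHMARKS, DLNR_MIXED, DLNR_WMG):
    print(f"{name}: {relative_reduction_pct(old, new):.1f}%")
```

The same computation shows no change on ETH3D, where both DLNR rows report 2.50.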

Ablation Study: Design Dimension Comparison (Table 1, 5k pairs + RAFT-Stereo 75k steps)

Design Choice	Midd 2014(H)	ETH3D	KITTI-15	Booster(Q)
No floating objects	12.52	4.47	6.19	16.40
10–30 floating objects	6.60	3.92	5.11	10.60
No background objects	8.35	4.39	6.28	12.72
With background objects	6.60	3.92	5.11	10.60
Diffuse materials only	7.21	2.77	5.41	12.73
Metal + glass only	8.37	4.95	4.97	9.80
All materials	6.60	3.92	5.11	10.60
Narrow baseline [0.04, 0.1]	9.60	2.89	6.64	17.03
Wide baseline [0.04, 0.4]	6.60	3.92	5.11	10.60
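
The material trade-off in the table above can be made explicit by ranking the three material settings per benchmark. The sketch uses only the table values; the helper names are illustrative.

```python
BENCHMARKS = ["Midd 2014(H)", "ETH3D", "KITTI-15", "Booster(Q)"]
MATERIAL_ROWS = {
    "Diffuse materials only": [7.21, 2.77, 5.41, 12.73],
    "Metal + glass only": [8.37, 4.95, 4.97, 9.80],
    "All materials": [6.60, 3.92, 5.11, 10.60],
}


def best_per_benchmark(rows, benchmarks):
    """Setting with the lowest error on each benchmark."""
    return {b: min(rows, key=lambda name: rows[name][i]) for i, b in enumerate(benchmarks)}


def worst_rank(rows, benchmarks):
    """Worst (largest) rank each setting receives across benchmarks; 1 is best."""
    worst = {name: 0 for name in rows}
    for i in range(len(benchmarks)):
        ordered = sorted(rows, key=lambda name: rows[name][i])
        for rank, name in enumerate(ordered, start=1):
            worst[name] = max(worst[name], rank)
    return worst


best = best_per_benchmark(MATERIAL_ROWS, BENCHMARKS)
ranks = worst_rank(MATERIAL_ROWS, BENCHMARKS)
print(best)
print(ranks)
```

Each specialized setting wins somewhere but also ranks last somewhere, while "All materials" never ranks last, which is the balance the ablation points to.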

Dataset Comparison (Table 5, DLNR 200k steps)

Training Data	Midd 2014(H)	Midd 2021	ETH3D	KITTI-12	KITTI-15	Booster(Q)
SceneFlow	6.20	8.44	23.12	9.45	15.74	18.17
CREStereo	11.53	10.60	5.18	4.95	5.90	14.61
TartanAir	7.27	14.47	4.35	3.98	5.33	18.14
IRS	6.13	8.49	3.91	4.56	5.60	10.32
FSD	3.27	6.93	2.13	3.56	4.18	7.51
WMGStereo-150k	3.76	6.72	2.50	3.30	4.54	9.09
FSD + WMGStereo	3.24	6.88	2.08	3.59	4.26	7.42

Key Findings

Exceptional sample efficiency: On Middlebury, a model trained on only 500 WMGStereo-150k samples outperforms one trained on 100,000 CREStereo samples, suggesting that the data "recipe" matters more than quantity.
Cross-architecture generalization: Consistent improvements are observed across RAFT-Stereo, DLNR, and Selective-IGEV, indicating that the gains are not overfitted to any specific network.
Generalization to held-out benchmarks: On DrivingStereo (not used for parameter selection), training on WMGStereo-150k reduces the 3px error by 27% relative to training on FSD.
Data complementarity: Mixing FSD + WMGStereo-150k outperforms either dataset alone on Middlebury(H), ETH3D, and Booster.
Mixed scene types are optimal: Equal mixture of the three scene types (33%–33%–33%) substantially outperforms any single scene type.
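
Equal-weight scene-type sampling can be sketched as a two-stage draw: pick a scene type uniformly, then pick a sample uniformly within that type's pool, so that a small pool is not under-sampled. The pool sizes and names below are toy values and the generator is a hypothetical illustration, not the paper's data loader.

```python
import random
from collections import Counter


def balanced_sampler(pools, rng):
    """Yield (scene_type, sample): uniform over scene types, then uniform within the pool."""
    scene_types = sorted(pools)
    while True:
        scene_type = rng.choice(scene_types)
        yield scene_type, rng.choice(pools[scene_type])


# Toy pools with deliberately unequal sizes.
pools = {
    "indoor_floating": [f"indoor_{i}" for i in range(700)],
    "dense_floating": [f"dense_{i}" for i in range(200)],
    "nature": [f"nature_{i}" for i in range(130)],
}

rng = random.Random(0)
sampler = balanced_sampler(pools, rng)
counts = Counter(next(sampler)[0] for _ in range(30_000))
print(counts)
```

Each scene type is drawn about one third of the time regardless of how many samples its pool contains.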

Highlights & Insights

First systematic study of the synthetic stereo data design space: The contributions of six dimensions—floating object density, background objects, object types, materials, baseline, and lighting—are isolated and quantified individually, providing actionable data engineering guidelines.
Realism and diversity are both indispensable: The optimal configuration—realistic indoor scenes with floating objects—simultaneously refutes "realism is overrated" (removing background objects degrades performance) and confirms the diversity value of random floating objects.
Non-Lambertian materials remain an open problem: Existing networks cannot simultaneously handle reflective/transparent materials and diffuse regions. This cannot be resolved through data alone and requires co-design of data and architecture, representing a valuable direction for future work.
More lower-quality data outperforms fewer higher-quality samples: Reducing rendering quality and solver precision while achieving a 6× speedup yields better performance under a fixed compute budget.
Methodological value of procedural generation: The parametric ablation methodology is transferable to dataset design for other vision tasks such as optical flow, depth completion, and semantic segmentation.

Limitations & Future Work

Significant gap remains relative to FoundationStereo: 3.76 (DLNR trained on WMGStereo-150k) vs. 1.10 on Middlebury 2014(H); FSD is larger in scale and uses physical simulation.
Absence of driving scenes: No CARLA/VirtualKITTI-style road scenes are included; the advantage on KITTI stems primarily from material and baseline settings rather than in-domain scene distribution.
Non-Lambertian materials are circumvented rather than solved: Removing extreme materials is a workaround; the ideal solution is to design new architectures capable of handling such materials.
Only existing architectures are validated: The interaction between data design and modern Transformer-based stereo architectures remains unexplored.
Underrepresented nature scenes: Nature scenes account for only ~21k of 163k samples (approximately 13%), which may limit generalization to outdoor environments.

Related Work & Insights

SceneFlow / FlyingThings3D: The seminal approach using random flying objects. This paper demonstrates that this design alone is suboptimal on most benchmarks, yet its core idea of geometric diversity remains valid.
FoundationStereo: The current state of the art. The parametric analysis in this paper helps explain FoundationStereo's success as the result of multiple synergistic factors (floating objects + realistic backgrounds + wide baseline + diverse materials) rather than any single innovation.
Mayer et al. (IJCV 2018): The classical "realism is overrated" conclusion applies only to 2D-warp datasets; this paper provides a more nuanced finding for 3D-rendered datasets—realism and diversity must be balanced.
Infinigen: An open-source procedural 3D generation platform; this paper demonstrates its effective extension and application to stereo matching data.
Broader implication: The methodology of procedural data generation combined with parametric ablation is generalizable to any vision task requiring synthetic training data.

Rating

Novelty: ⭐⭐⭐⭐ — No new architecture, but the perspective of systematically ablating the data design space is distinctive and practically valuable.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Ablations span 6 design dimensions, 6+ benchmarks, and 3 architectures, with cost analysis and sample efficiency curves.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, rich tables, actionable conclusions, and straightforward reproducibility.
Value: ⭐⭐⭐⭐ — Directly guides data engineering practices in the stereo matching community; the planned open-source release of the generation code would further amplify impact.