What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
Conference: CVPR 2026
arXiv: 2504.16930
Code: None (the paper states that the procedural generation code will be open-sourced, but no link is given)
Area: 3D Vision
Keywords: Stereo matching, synthetic data, procedural generation, zero-shot generalization, dataset design, Infinigen
TL;DR
This paper systematically ablates the design space of synthetic training data for stereo matching (including floating objects, backgrounds, materials, and baselines). It finds that combining realistic indoor scenes, dense floating objects, and wide baselines works best. Based on these findings, the authors construct WMGStereo-150k, a single dataset that outperforms mixed training on four classic datasets on most zero-shot benchmarks (ETH3D is the main exception).
Background & Motivation
Problem Definition: Stereo matching estimates pixel-wise disparity from binocular RGB images. Synthetic data is a core resource for training due to its precise depth annotations. However, the critical question of what constitutes an effective synthetic data design has lacked systematic research.
Limitations of Prior Work:
Entangled Design Variables: Existing synthetic datasets range from random flying objects (FlyingThings3D) to realistic scene simulators (TartanAir). Each new dataset changes multiple factors simultaneously (object types, materials, scene layouts, camera parameters), making it impossible to attribute gains to a single design choice. For instance, FoundationStereo introduced a new architecture alongside new data, leaving the relative importance of individual data factors (floating objects, random lighting, physics simulation) inseparable.
Irreproducible Generation Pipelines: The generation code and assets for classic datasets like TartanAir and IRS are not open-sourced, which blocks ablation studies such as "what if only the materials were changed."
Limitations of Existing Analysis: A classic study by Mayer et al. concluded that "realism is overrated." However, their experiments were based on 2D-warp FlyingChairs-style datasets and did not involve modern 3D rendering datasets, making the applicability of the conclusion questionable.
Goal: To leverage the controllability of the open-source procedural generation platform Infinigen to isolate and ablate each design dimension of synthetic stereo data, identifying key factors for zero-shot generalization and constructing a superior dataset based on these insights.
Method
Overall Architecture
The authors constructed a configurable procedural stereo data generation system based on Infinigen and the Blender Python API. The core contribution is not a new stereo matching network, but rather a parameter-controlled data production pipeline and a systematic ablation study.
The system supports three scene types:
- Indoor Floating Objects: Random objects are placed, using raycasting, within realistic indoor scenes generated by Infinigen Indoors. This balances scene realism (furniture, walls, floors) with geometric diversity (extra suspended objects).
- Dense Floating Objects: Similar to the classic FlyingThings3D design, a large number of objects (approximately 200) are placed densely within the camera's field of view against a blank sky background to maximize geometric diversity.
- Nature: Outdoor natural scenes are generated directly using Infinigen Nature.
Key Engineering Designs:
- Floating Object Placement Interface: Supports both raycasting (placement within the camera view) and bounding box constraints, allowing control over intersections with existing scene geometry.
- Material Management: Automatically detects and removes glass materials from object sub-parts (to avoid ill-posed problems from fully transparent surfaces). External windows are treated specially: their geometry is deleted, rather than the glass material being replaced, to avoid breaking scene lighting.
- Automatic Removal of High-Error Objects/Materials: Difficult objects such as cacti and sea urchins (extremely fine needle structures) and shelves (tiny holes) are identified via per-object and per-pixel error statistics and excluded, along with extremely transparent or reflective materials.
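
The error-based exclusion step can be illustrated with a small sketch. This is not the authors' implementation: the function name `find_high_error_categories`, the input layout (object category mapped to per-pixel disparity errors), the 2.0 px threshold, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean


def find_high_error_categories(errors_by_category, max_mean_error_px=2.0):
    """Return object categories whose mean per-pixel disparity error is too high.

    errors_by_category: dict mapping a category name to a list of absolute
    disparity errors (in pixels) measured on pixels covered by that category.
    The data layout and the threshold are hypothetical.
    """
    flagged = []
    for category, errors in errors_by_category.items():
        # Skip categories with no measurements rather than dividing by zero.
        if errors and mean(errors) > max_mean_error_px:
            flagged.append(category)
    return sorted(flagged)


# Toy error statistics, made up for illustration only.
stats = {
    "chair": [0.3, 0.5, 0.4],
    "cactus": [3.1, 4.8, 2.9],
    "shelf": [2.5, 2.2, 3.0],
}
excluded = find_high_error_categories(stats)
print(excluded)
```

In a real pipeline the flagged categories would then be removed from the pool of object generators before the next round of data generation.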
Key Designs
Experimental Setup: For each parameter variant, 5000 stereo image pairs are generated using the Indoor Floating Objects scene type. RAFT-Stereo is trained from scratch for 75k steps and evaluated on 6 benchmarks (Middlebury 2014/2021, ETH3D, KITTI-12/15, Booster) for zero-shot performance. The following six dimensions are isolated.
1. Floating Object Density: Geometric Diversity is the Primary Switch
The zero-shot capability of stereo networks relies heavily on the geometric diversity encountered during training, and realistic scenes on their own offer limited geometric variation. Adding floating objects reduces the 2px error on Middlebury 2014(H) from 12.52 (no floating objects) to 7.78 (0-10 objects, ↓38%) and to 6.60 (10-30 objects, ↓47%). While floating objects reduce realism, the resulting geometric diversity is overwhelmingly important for zero-shot generalization. Thus, 200 objects are placed in dense scenes.
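
For readers unfamiliar with the "2px error" figure quoted here, the sketch below shows the usual bad-pixel-rate definition: the percentage of valid pixels whose absolute disparity error exceeds a threshold. The function name and the flat-list inputs are illustrative assumptions, and the exact evaluation protocol of each benchmark (threshold, resolution, masks) may differ.

```python
def bad_pixel_rate(pred, gt, valid, threshold_px=2.0):
    """Percentage of valid pixels with |pred - gt| > threshold_px.

    pred, gt: flat sequences of predicted / ground-truth disparities (pixels).
    valid: flat sequence of booleans marking pixels with usable ground truth.
    """
    if not (len(pred) == len(gt) == len(valid)):
        raise ValueError("pred, gt and valid must have the same length")
    n_valid = sum(1 for v in valid if v)
    if n_valid == 0:
        raise ValueError("no valid pixels to evaluate")
    n_bad = sum(
        1 for p, g, v in zip(pred, gt, valid) if v and abs(p - g) > threshold_px
    )
    return 100.0 * n_bad / n_valid


# Toy example: one of the three valid pixels is off by more than 2 px.
pred = [10.0, 12.5, 30.0, 7.0]
gt = [10.5, 15.0, 30.5, 0.0]
valid = [True, True, True, False]  # last pixel has no ground truth
rate = bad_pixel_rate(pred, gt, valid)
print(f"{rate:.2f}% bad pixels")
```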
2. Background Objects: Realism is Not "Overrated"
The classic argument that "realism is overrated" stems from 2D-warp datasets like FlyingChairs and may not apply to modern 3D rendering. Removing background objects like furniture led to performance drops across all benchmarks (for example, Middlebury 2014(H) error rose from 6.60 to 8.35). This indicates that a certain level of scene realism significantly aids zero-shot generalization, contradicting the old conclusion for 3D-rendered data.
3. Object Types: Diversity Over Specialization
Single object types might perform better on specific benchmarks (e.g., chairs for Middlebury, bushes for ETH3D/KITTI), but they exhibit the poorest cross-benchmark robustness. Using the full set of object generators produced the most balanced results across all benchmarks. The conclusion is to use a wide variety of objects rather than tuning for a specific benchmark.
4. Object Materials: A Hard Bottleneck for Current Networks
Materials expose a problem that data alone cannot bypass. Using only metal and glass yielded the best results on KITTI-15 and Booster but substantially worsened ETH3D (4.95 vs 2.77 with diffuse only). Conversely, using only diffuse materials was best for ETH3D but degraded Booster (12.73 vs 9.80 with metal and glass only). Current stereo matching networks appear unable to learn non-Lambertian materials without harming performance in diffuse regions. The authors suggest this requires co-design of architecture and data.
5. Camera Baseline Randomization: An Underestimated Factor
The baseline range directly determines the disparity distribution. Using only a narrow baseline of [0.04, 0.1] m degraded performance on Middlebury 2014(H) from 6.60 to 9.60 and on Booster from 10.60 to 17.03. Expanding to [0.04, 0.4] m gave the best overall results, although the narrow baseline scored better on ETH3D (2.89 vs 3.92). Wide baselines expose the network to a larger range of disparities, leading to more stable generalization.
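
The link between baseline and disparity follows from the standard rectified-stereo relation d = f · B / Z (disparity in pixels, focal length f in pixels, baseline B and depth Z in metres). The sketch below evaluates it at the endpoints of the two baseline ranges; the focal length and depth are hypothetical values chosen only to make the effect visible.

```python
def disparity_px(focal_px, baseline_m, depth_m):
    """Disparity in pixels for a rectified stereo pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


FOCAL_PX = 1000.0  # hypothetical focal length in pixels
DEPTH_M = 2.0      # hypothetical depth of a surface in metres

# Disparity at the endpoints of each baseline range for the same surface.
narrow = [disparity_px(FOCAL_PX, b, DEPTH_M) for b in (0.04, 0.1)]
wide = [disparity_px(FOCAL_PX, b, DEPTH_M) for b in (0.04, 0.4)]
print("narrow baseline range:", narrow)
print("wide baseline range:", wide)
```

Under these assumed values the same surface spans roughly 20 to 50 px of disparity with the narrow range and roughly 20 to 200 px with the wide one, which is one way to see why the wide range covers more of the disparities found in the test benchmarks.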
6. Lighting Augmentation: Minimal Impact but Retained
Lighting randomization had limited impact on benchmarks but was retained to cover diverse outdoor conditions as a low-cost robustness measure.
Loss & Training
- Training Strategy: Follows the original training pipelines and hyperparameters of RAFT-Stereo / DLNR / Selective-IGEV, training for 200k steps from scratch without new loss functions.
- Scene Sampling: Equal weighting (33%-33%-33%) across the three scene types was found to be optimal.
- Masking Strategy: Masks out the sky and untextured regions outside the rooms.
- Cost Optimization (6x Acceleration):
  - Reduced solver steps from 550 to 60 (greedy mode), reducing indoor scene generation time from 51 to 13 minutes.
  - Reduced rendering samples from 8192 to 1024 with Blender OptiX denoising, bringing rendering time down to 27 seconds per frame.
  - Scene reuse: 20 independent camera positions per indoor scene and 200 randomizations (pose, light, baseline) per dense scene.
  - Under a fixed compute budget, the low-cost setting (30k samples) outperformed the high-cost setting (5k samples).
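
To see why scene reuse matters, the sketch below amortizes scene generation time over the views rendered from one scene, using the indoor figures quoted above (13 minutes per scene, 20 camera positions, 27 seconds of rendering). The cost model itself is an assumption for illustration; the paper's own accounting may differ, for example in whether render time is counted per frame or per stereo pair.

```python
def amortized_seconds_per_view(scene_gen_minutes, views_per_scene, render_seconds):
    """Cost per rendered view when one generated scene is reused for several views."""
    if views_per_scene <= 0:
        raise ValueError("views_per_scene must be positive")
    return scene_gen_minutes * 60.0 / views_per_scene + render_seconds


# Indoor scene figures from the list above, under the assumed cost model.
cost_with_reuse = amortized_seconds_per_view(13, 20, 27)
cost_without_reuse = amortized_seconds_per_view(13, 1, 27)
print(f"with reuse: {cost_with_reuse:.0f} s, without reuse: {cost_without_reuse:.0f} s")
```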
Key Experimental Results
Main Results: Zero-Shot Stereo Matching (Table 2, 200k steps)
| Model | Midd 2014(H) | Midd 2021 | ETH3D | KITTI-12 | KITTI-15 | Booster(Q) |
|---|---|---|---|---|---|---|
| RAFT-Mixed (SF+CRE+TA+IRS, 600k) | 5.50 | 8.97 | 2.58 | 3.64 | 4.95 | 11.46 |
| RAFT-WMGStereo-150k | 4.48 | 8.17 | 2.93 | 3.25 | 4.25 | 9.17 |
| DLNR-Mixed | 5.21 | 9.30 | 2.50 | 3.68 | 4.95 | 12.17 |
| DLNR-WMGStereo-150k | 3.76 | 6.72 | 2.50 | 3.30 | 4.54 | 9.09 |
| Sel-IGEV-Mixed | 5.24 | 8.24 | 2.37 | 3.97 | 5.31 | 11.00 |
| Sel-IGEV-WMGStereo-150k | 3.61 | 7.62 | 2.47 | 3.26 | 4.55 | 8.84 |
| FoundationStereo | 1.10 | 4.17 | 0.50 | 2.30 | 2.80 | 4.16 |
- DLNR-WMGStereo-150k vs DLNR-Mixed: Middlebury 2014(H) error reduced by 28% (5.21 to 3.76), Booster(Q) error reduced by 25% (12.17 to 9.09).
- RAFT trained only on WMGStereo-150k outperforms StereoAnywhere (which uses large-scale monocular priors) on Middlebury 2014.
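
The relative reductions quoted for DLNR can be recomputed from the table values. The helper name is illustrative.

```python
def relative_reduction_pct(baseline_error, new_error):
    """Relative error reduction in percent, measured against baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# DLNR-Mixed vs DLNR-WMGStereo-150k, values taken from the table above.
midd = relative_reduction_pct(5.21, 3.76)      # Middlebury 2014(H)
booster = relative_reduction_pct(12.17, 9.09)  # Booster(Q)
print(f"Middlebury 2014(H): {midd:.1f}%  Booster(Q): {booster:.1f}%")
```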
Ablation Study: Design Dimension Comparisons (Table 1, 5k pairs + RAFT-Stereo 75k steps)
| Design Choice | Midd 2014(H) | ETH3D | KITTI-15 | Booster(Q) |
|---|---|---|---|---|
| No Floating Objects | 12.52 | 4.47 | 6.19 | 16.40 |
| 10-30 Floating Objects | 6.60 | 3.92 | 5.11 | 10.60 |
| No Background Objects | 8.35 | 4.39 | 6.28 | 12.72 |
| With Background Objects | 6.60 | 3.92 | 5.11 | 10.60 |
| Diffuse Only | 7.21 | 2.77 | 5.41 | 12.73 |
| Metal + Glass Only | 8.37 | 4.95 | 4.97 | 9.80 |
| All Materials | 6.60 | 3.92 | 5.11 | 10.60 |
| Narrow Baseline [0.04, 0.1] | 9.60 | 2.89 | 6.64 | 17.03 |
| Wide Baseline [0.04, 0.4] | 6.60 | 3.92 | 5.11 | 10.60 |
Comparison of Datasets (Table 5, DLNR 200k steps)
| Training Data | Midd 2014(H) | Midd 2021 | ETH3D | KITTI-12 | KITTI-15 | Booster(Q) |
|---|---|---|---|---|---|---|
| SceneFlow | 6.20 | 8.44 | 23.12 | 9.45 | 15.74 | 18.17 |
| CREStereo | 11.53 | 10.60 | 5.18 | 4.95 | 5.90 | 14.61 |
| TartanAir | 7.27 | 14.47 | 4.35 | 3.98 | 5.33 | 18.14 |
| IRS | 6.13 | 8.49 | 3.91 | 4.56 | 5.60 | 10.32 |
| FSD | 3.27 | 6.93 | 2.13 | 3.56 | 4.18 | 7.51 |
| WMGStereo-150k | 3.76 | 6.72 | 2.50 | 3.30 | 4.54 | 9.09 |
| FSD + WMGStereo | 3.24 | 6.88 | 2.08 | 3.59 | 4.26 | 7.42 |
Key Findings
- Extreme Sample Efficiency: Only 500 WMGStereo-150k samples outperform 100,000 CREStereo samples on Middlebury, suggesting that the data "recipe" matters more than quantity.
- Cross-Architecture Generalization: Consistent improvements were observed across RAFT-Stereo, DLNR, and Selective-IGEV, showing gains are not network-specific.
- Generalization to Unseen Benchmarks: On DrivingStereo (not used for parameter selection), 3px error was reduced by 27% compared to FSD.
- Dataset Complementarity: Hybrid training with FSD + WMGStereo-150k outperformed either dataset alone on Middlebury(H), ETH3D, and Booster.
- Scene Type Mixing: An equal mix of the three scene types (33/33/33) significantly outperformed any single scene type.
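
An equal mix of scene types can be obtained by sampling the scene type of each training example with equal weights, independent of how many examples each pool contains. The sketch below is one minimal way to do this with the standard library; the scene-type labels and the function name are hypothetical, and the authors' data loader is not described in this summary.

```python
import random
from collections import Counter

# Hypothetical labels for the three scene types described in the Method section.
SCENE_TYPES = ("indoor_floating", "dense_floating", "nature")


def sample_scene_types(n, weights=(1, 1, 1), seed=0):
    """Draw n scene-type labels with the given relative weights."""
    if len(weights) != len(SCENE_TYPES):
        raise ValueError("need one weight per scene type")
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.choices(SCENE_TYPES, weights=weights, k=n)


draws = sample_scene_types(3000)
counts = Counter(draws)
print(counts)
```

Changing `weights` makes it easy to test other mixing ratios against the equal split.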
Highlights & Insights
- First Systematic Study of Synthetic Stereo Data Design Space: By isolating the effects of 6 dimensions (density, background, object types, materials, baseline, lighting), the paper provides an actionable data engineering guide.
- "Realism + Diversity" are Indispensable: The optimal solution is "realistic indoor scenes + floating objects," refuting "realism is overrated" (removing background drops performance) while affirming the value of diversity from random objects.
- Non-Lambertian Materials are an Open Problem: Current networks struggle to handle both reflective/transparent and diffuse areas simultaneously. This is not purely a data issue—it requires co-design of data and architecture.
- "More Low-Quality Data > Less High-Quality Data": Lowering rendering quality and solver precision (allowing 6x speedup) resulted in better performance under a fixed compute budget.
- Methodological Value of Procedural Generation: The parameter ablation methodology can be generalized to data design for other vision tasks like optical flow, depth completion, and semantic segmentation.
Limitations & Future Work
- Gap vs. FoundationStereo: The best models still trail FoundationStereo (e.g., DLNR-WMGStereo-150k at 3.76 vs 1.10 on Middlebury 2014(H)), which uses larger-scale data and physics simulation.
- Lack of Driving Scenes: Does not include CARLA/VirtualKITTI style road scenes. Gains on KITTI originate mostly from material/baseline settings rather than in-domain scene distribution.
- Avoiding vs. Solving Non-Lambertian Materials: Removing extreme materials is a temporary workaround. A better solution would involve designing architectures capable of handling these materials.
- Limited to Existing Architectures: The interaction between data design and newer architectures like Transformer-based stereo remains unexplored.
- Low Ratio of Nature Scenes: Natural scenes account for only ~13% (21k/163k), which may limit generalization in diverse outdoor scenarios.
Related Work & Insights
- SceneFlow / FlyingThings3D: The pioneering approach using random flying objects. This paper shows that such designs, when used alone, are sub-optimal on most benchmarks, but their core idea (geometric diversity) remains valid.
- FoundationStereo: Current SOTA. This parameter study helps explain FSD's success as a result of multi-factor synergy (floating objects + realistic background + wide baseline + diverse materials) rather than a single innovation.
- Mayer et al. (IJCV 2018): The conclusion "realism is overrated" applies to 2D-warp datasets. This paper provides a more nuanced conclusion for 3D rendered data—realism and diversity must be balanced.
- Infinigen: An open-source procedural 3D generation platform. This paper demonstrates its effective extension and application for stereo matching data.
- Insights: The methodology of procedural data generation combined with parameter ablation can be extended to any vision task requiring synthetic data.
Rating
- Novelty: ⭐⭐⭐⭐ - No new architecture, but the "systematic ablation of data design space" perspective is unique and practical.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ - Ablation covers 6 dimensions, 6+ benchmarks, and 3 architectures, and includes cost analysis and sample efficiency curves.
- Writing Quality: ⭐⭐⭐⭐⭐ - Clearly structured, with rich tables and actionable conclusions that are easy to reproduce.
- Value: ⭐⭐⭐⭐ - Direct guidance for data engineering in the stereo matching community, amplified by the planned open-source release of the generation code.