Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation

Conference: AAAI2026 arXiv: 2508.01303 Code: To be confirmed Area: 3D Vision Keywords: Stereo Matching, Domain Generalization, Data Augmentation, Uncertainty Modeling, Feature Consistency

TL;DR

This paper proposes UgDA-Stereo, a plug-and-play training-time module that simulates diverse unseen domain styles by applying Gaussian uncertainty perturbations—derived from batch statistics—to the per-channel mean and standard deviation of RGB images. Combined with a feature consistency constraint, the method substantially improves the cross-domain generalization of stereo matching models.

Background & Motivation

Stereo matching is a fundamental task in computer vision, aiming to estimate dense disparity maps from rectified image pairs for 3D reconstruction. State-of-the-art deep stereo matching networks are typically trained on synthetic data (e.g., SceneFlow), but suffer severe performance degradation in real-world scenarios due to domain gaps in color, illumination, contrast, and texture.

Two main approaches exist for addressing this issue: domain adaptation (requiring target-domain data) and domain generalization (requiring no target-domain data). This paper focuses on the more challenging single-domain generalization setting—training exclusively on synthetic data and evaluating on entirely unseen real-world domains.

Prior domain generalization methods either require architectural modifications (e.g., DSMNet's domain normalization layers), rely on complex learning objectives (e.g., ITSA's information-theoretic strategy), or demand additional modalities (e.g., depth prior distillation). While effective, these approaches introduce considerable complexity. This paper starts from a simple yet crucial observation: the statistics (mean and standard deviation) of RGB channels encode domain-specific information, and perturbing these statistics in a principled manner can generate samples from new domains.

Core Problem

  1. Stereo matching networks trained on synthetic data tend to learn domain-correlated shortcut features, failing to extract meaningful semantic and structural features on unseen real domains.
  2. How can data augmentation systematically broaden the sample distribution while ensuring diversity and effectiveness?
  3. Target domains may introduce domain shifts of uncertain direction and magnitude; how should such uncertainty be modeled?

Method

Overall Architecture

UgDA-Stereo is a plug-and-play module used only during training and can be integrated into any stereo matching network. The overall pipeline is:

  1. Apply uncertainty-guided data augmentation to the original left and right images to generate stylized images.
  2. Feed the augmented image pairs into the stereo matching network to obtain disparity estimates.
  3. Extract features from both original and augmented images and impose a feature consistency constraint.

Uncertainty-guided Data Augmentation (UgDA)

Core Idea: RGB channel statistics (per-channel mean \(\mu_c\) and standard deviation \(\sigma_c\)) reflect domain characteristics. Perturbing these statistics simulates domain shifts.

Procedure:

  1. Compute per-image channel statistics: For each image, compute the mean \(\mu_c(x)\) and standard deviation \(\sigma_c(x)\) of each RGB channel over the spatial dimensions.

  2. Compute within-batch variation of statistics: Measure the variance \(\sigma^2_{\mu_c}\) and \(\sigma^2_{\sigma_c}\) of the per-image statistics within a mini-batch. These quantities capture the appearance diversity among images in the batch and provide a reasonable scale for perturbation.

  3. Sample perturbed statistics: The perturbation of the mean is assumed to follow \(\mathcal{N}(0, \sigma^2_{\mu_c})\), and that of the standard deviation \(\mathcal{N}(0, \sigma^2_{\sigma_c})\). New statistics are sampled via the reparameterization trick:

     \(\mu'_c(x) = \mu_c(x) + \epsilon_{\mu_c} \cdot \sigma_{\mu_c}\), where \(\epsilon_{\mu_c} \sim \mathcal{N}(0,1)\)

     \(\sigma'_c(x) = \sigma_c(x) + \epsilon_{\sigma_c} \cdot \sigma_{\sigma_c}\), where \(\epsilon_{\sigma_c} \sim \mathcal{N}(0,1)\)

  4. Generate augmented images: Normalize the original pixels with the old statistics, then denormalize using the new ones:

     \(x^*_{c,h,w} = \frac{x_{c,h,w} - \mu_c(x)}{\sigma_c(x)} \cdot \sigma'_c(x) + \mu'_c(x)\)

Key Properties:

  • Only global per-channel statistics are modified; local texture, edges, and geometric content remain intact.
  • Uncertainty in perturbation direction and magnitude is modeled via batch statistics, yielding diverse augmentations.
  • The same augmentation is applied to both left and right images, preserving stereo consistency.
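The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the B×C×H×W tensor layout, and the small stabilizer added to the standard deviation are my own choices, and in the stereo setting the same sampled statistics would be reused for the left and right images.

```python
import numpy as np

def ugda_augment(batch, rng=None):
    """Uncertainty-guided augmentation sketch for a batch of images (B x C x H x W)."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: per-image, per-channel statistics over the spatial dimensions.
    mu = batch.mean(axis=(2, 3), keepdims=True)            # B x C x 1 x 1
    sigma = batch.std(axis=(2, 3), keepdims=True) + 1e-6   # B x C x 1 x 1
    # Step 2: within-batch variation of those statistics (perturbation scale).
    sigma_mu = mu.std(axis=0, keepdims=True)               # 1 x C x 1 x 1
    sigma_sigma = sigma.std(axis=0, keepdims=True)         # 1 x C x 1 x 1
    # Step 3: reparameterized Gaussian sampling of new statistics.
    eps_mu = rng.standard_normal(mu.shape)
    eps_sigma = rng.standard_normal(sigma.shape)
    mu_new = mu + eps_mu * sigma_mu
    sigma_new = sigma + eps_sigma * sigma_sigma
    # Step 4: normalize with the old statistics, denormalize with the new ones.
    # (For a stereo pair, mu_new/sigma_new would be shared between left and right.)
    return (batch - mu) / sigma * sigma_new + mu_new
```

Note that when all images in the batch share identical statistics, the perturbation scale is zero and the augmentation reduces to the identity, which matches the intuition that the batch diversity drives the strength of the domain shift.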

Feature Consistency Constraint

Since augmentation only alters style attributes (brightness, hue, contrast) while leaving scene structure unchanged, a feature consistency loss is introduced to encourage the network to learn domain-invariant representations that do not rely on shortcut features:

\[\mathcal{L}_{cons} = \|f_{feat}(x_L) - f_{feat}(x^*_L)\|_2 + \|f_{feat}(x_R) - f_{feat}(x^*_R)\|_2\]

Loss & Training

\[\mathcal{L} = \mathcal{L}_{smooth_{L_1}}(\hat{d}, d_{gt}) + \lambda \mathcal{L}_{cons}\]

where \(\lambda = 0.17\) is a hyperparameter.
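A minimal sketch of the combined objective, assuming a mean-reduced smooth-L1 disparity term; the helper names and the Huber threshold `beta` are illustrative choices, not taken from the paper:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) disparity loss, mean-reduced."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

def total_loss(d_pred, d_gt, feat_orig, feat_aug, lam=0.17):
    """Disparity supervision plus the feature consistency term.

    feat_orig / feat_aug stand for features extracted from the original
    and stylized images (left and right concatenated); names are illustrative.
    """
    l_cons = np.linalg.norm(feat_orig - feat_aug)
    return smooth_l1(d_pred, d_gt) + lam * l_cons
```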

Key Experimental Results

Experimental Setup

  • Training set: SceneFlow (35,454 pairs)
  • Test sets: KITTI 2012, KITTI 2015, Middlebury (half resolution), ETH3D
  • Baseline networks: PSMNet, GwcNet, CFNet
  • Training configuration: Single RTX 4090, Adam optimizer, 20 epochs, batch size = 4

Main Results (D1 Error Rate)

Method       KITTI2015  KITTI2012  Middlebury  ETH3D
PSMNet       16.3%      15.1%      34.2%       23.8%
UgDA-PSMNet  5.3%       4.8%       8.5%        10.2%
GwcNet       22.7%      20.2%      37.9%       54.2%
UgDA-GwcNet  4.9%       4.2%       8.3%        5.7%
CFNet        6.0%       5.2%       15.4%       5.7%
UgDA-CFNet   5.2%       4.7%       8.2%        4.9%

UgDA-PSMNet reduces the D1 error rate of the original PSMNet by 10.3–25.7 percentage points across the four datasets; UgDA-GwcNet achieves an even larger reduction of 16.0–48.5 percentage points.

Ablation Study (D1 Error Rate)

Augmentation  \(\mathcal{L}_{cons}\)  KITTI2012 (PSMNet)  KITTI2015 (PSMNet)
–             –                       15.1%               16.3%
✓             –                       5.8%                6.1%
✓             ✓                       4.8%                5.3%

Data augmentation accounts for the dominant performance gain, while the feature consistency loss contributes an additional improvement of approximately 1%.

DrivingStereo Weather Robustness

Method       Sunny  Cloudy  Rainy  Foggy  Average
PSMNet       62.5%  60.1%   60.5%  68.6%  63.9%
FT-PSMNet    4.0%   2.9%    11.5%  6.5%   6.3%
UgDA-PSMNet  4.2%   3.3%    6.5%   5.7%   4.9%

Without accessing any target-domain data, UgDA-PSMNet even surpasses the fine-tuned baseline in rainy and foggy conditions.

Highlights & Insights

  1. Minimal yet effective: The entire method operates at the input level, requires no backbone modification, no additional modalities, and incurs negligible computational overhead.
  2. Clear theoretical intuition: The approach is grounded in the observation that RGB statistics encode domain characteristics; modeling perturbation uncertainty with a Gaussian distribution follows naturally.
  3. Universal plug-and-play: Consistent and significant improvements are achieved across three distinct architectures (PSMNet, GwcNet, CFNet).
  4. Adverse weather robustness: The method performs stably under diverse weather conditions in DrivingStereo and outperforms fine-tuned baselines without accessing target-domain data.

Limitations & Future Work

  1. Global statistics only: The method cannot simulate local domain variations (e.g., local shadows or stylistic differences in occluded regions), which may limit performance in fine-grained regions.
  2. Gaussian assumption: Real-world domain shifts may not follow a Gaussian distribution; more expressive distribution modeling (e.g., Gaussian mixtures or flow-based models) could yield further improvements.
  3. Occlusion and non-Lambertian surfaces: The authors acknowledge that handling occluded regions and non-Lambertian surfaces (e.g., transparent or specular objects) remains an open challenge.
  4. Training on SceneFlow only: The impact of other synthetic datasets (e.g., TartanAir) or combinations thereof has not been explored.
  5. L2 feature consistency: Stronger alternatives such as contrastive learning or mutual information objectives could be considered.

Comparison with Prior Methods

Method         Strategy                                                   Requires Arch. Modification  Requires Target-Domain Data
DSMNet         Domain normalization layers                                Yes                          No
GraftNet       Large-scale pretrained features                            Yes                          No
ITSA           Information-theoretic perturbation + gradient adversarial  No                           No
HVT            Multi-level data augmentation                              No                           No
Masked-Stereo  Pseudo multi-task (matching + reconstruction)              Yes                          No
UgDA-Stereo    Statistics perturbation + feature consistency              No                           No

The primary advantage of UgDA-Stereo lies in its simplicity—it requires no architectural modification, no additional pretrained models, and no target-domain data, achieving state-of-the-art cross-domain generalization through input-level statistical perturbation and a single auxiliary loss.

The core idea of simulating domain shifts by perturbing channel statistics is not limited to stereo matching; it transfers directly to other dense prediction tasks such as monocular depth estimation, optical flow estimation, and semantic segmentation. The augmentation formula is fundamentally an inverse normalization with random reparameterization, which bears a deep connection to AdaIN-based style transfer, suggesting the potential benefit of incorporating richer style libraries. Furthermore, using within-batch statistical variation to drive augmentation is an elegant and generalizable strategy applicable to other scenarios requiring domain diversity.
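The AdaIN connection mentioned above is easy to see in code: UgDA's formula is exactly an AdaIN-style renormalization in which the target statistics are random perturbations of the source statistics. A sketch (function name and the small stabilizer are my own; shapes assume a single C×H×W image with per-channel target statistics):

```python
import numpy as np

def adain(x, mu_target, sigma_target):
    """AdaIN-style renormalization: align x's per-channel statistics to targets."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True) + 1e-6
    # Same form as the UgDA augmentation: normalize, then denormalize
    # with the target statistics instead of perturbed source statistics.
    return (x - mu) / sigma * sigma_target + mu_target
```

Substituting a style library's statistics for the Gaussian-perturbed ones would turn the augmentation into explicit style transfer, which is the "richer style libraries" direction suggested above.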

Rating

  • Novelty: 3/5 — The idea of perturbing channel statistics has precedents, but the combination of Gaussian uncertainty modeling and feature consistency is novel.
  • Experimental Thoroughness: 4/5 — Covers four standard benchmarks, weather robustness, ablation studies, and visualization comprehensively.
  • Writing Quality: 3/5 — Generally clear, with minor issues in formula typesetting and phrasing.
  • Value: 4/5 — Plug-and-play, simple, effective, and well-suited for practical deployment.