Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts¶
Conference: ICCV 2025 | arXiv: 2507.04631 | Code: GitHub | Area: 3D Vision | Keywords: Stereo Matching, Visual Foundation Model, Mixture-of-Experts, LoRA, Cross-Domain Generalization
TL;DR¶
This paper proposes SMoEStereo, which integrates variable-rank MoE-LoRA and variable-kernel MoE-Adapter modules into a frozen Visual Foundation Model (VFM). A lightweight decision network selectively activates the MoE modules per scene, yielding scene-adaptive, robust stereo matching with state-of-the-art performance on cross-domain and joint generalization benchmarks.
Background & Motivation¶
Stereo matching is a core task in computer vision with broad applications in autonomous driving, robotic navigation, and augmented reality. Although recent learning-based methods perform well on standard benchmarks, their cross-domain generalization remains limited:
Core Challenges:
Domain Shift: Significant scene discrepancies and imbalanced disparity distributions exist across datasets.
Noisy Feature Maps: Domain shift can lead to noisy and distorted feature maps, degrading model robustness.
Why VFMs? Visual Foundation Models (e.g., DINOv2, SAM, DepthAnything), pretrained on large-scale diverse data, are capable of extracting robust and generalizable features. However, direct application raises two critical issues:
Limited Zero-Shot Performance: VFMs excel at semantic information extraction (segmentation/classification) but lack the discriminative features required for precise similarity measurement.
Inflexible Fixed Fine-Tuning: Fixed-rank LoRA or fixed-kernel CNN decoders apply uniform processing to all inputs, failing to adapt to the heterogeneity of in-the-wild scenes.
Proposed Solution: Both the rank of LoRA and the kernel size of CNN Adapters are made dynamically selectable as MoE experts, enabling scene-specific feature adaptation.
Method¶
Overall Architecture¶
SMoEStereo builds on RAFT-Stereo: the feature extractor is replaced with a VFM, and two types of MoE modules, together with a decision network, are embedded into the ViT blocks:
VFM Feature Extraction → MoE-LoRA Layer (attention branch) + MoE-Adapter Layer (MLP branch) → Shallow CNN compression → Correlation volume → GRU iterative disparity update
MoE-LoRA Layer (Variable-Rank Experts)¶
Each MoE-LoRA layer contains \(M\) LoRA experts, each corresponding to a different matrix rank \(r_i\):

\[
y = x + \sum_{i=1}^{M} G_i(x)\, E_L^i(x)
\]

where each expert \(E_L^i(x) = W_i^{\text{up}} W_i^{\text{down}} x\) with \(W_i^{\text{down}} \in \mathbb{R}^{r_i \times d}\) and \(W_i^{\text{up}} \in \mathbb{R}^{d \times r_i}\), and the router \(G\) activates only the optimal experts via Top-k selection:

\[
G(x) = \text{Top-}k\big(\text{Softmax}(x W_g)\big)
\]
Design Motivation: Different scenes require different low-rank subspaces — low-rank suffices for simple scenes, while high-rank is needed for complex textured regions.
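The variable-rank routing described above can be sketched as follows in PyTorch. This is a minimal illustration, not the authors' implementation: the class name `MoELoRA`, per-image routing via mean pooling, and the specific rank set are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRA(nn.Module):
    """Sketch of a variable-rank MoE-LoRA layer (illustrative, not the paper's code).

    Each expert is a LoRA pair (down-project to rank r_i, up-project back),
    and a linear router picks the Top-k experts per image.
    """
    def __init__(self, dim, ranks=(2, 4, 8, 16), top_k=2):
        super().__init__()
        self.down = nn.ModuleList(nn.Linear(dim, r, bias=False) for r in ranks)
        self.up = nn.ModuleList(nn.Linear(r, dim, bias=False) for r in ranks)
        self.router = nn.Linear(dim, len(ranks))
        self.top_k = top_k

    def forward(self, x):                        # x: (B, N, dim) token features
        logits = self.router(x.mean(dim=1))      # one routing decision per image: (B, M)
        weights = F.softmax(logits, dim=-1)
        topv, topi = weights.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)   # renormalize selected gates
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, i in zip(topv[b], topi[b]):
                i = int(i)
                out[b] = out[b] + w * self.up[i](self.down[i](x[b]))
        return x + out                           # residual LoRA update
```

The key point the sketch captures is that the *rank* of the active low-rank update varies per input, rather than being fixed at training time.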
MoE-Adapter Layer (Variable-Kernel Experts)¶
Multiple CNN Adapter experts with different convolutional kernel sizes \(k_i\) are embedded to capture local geometric structures at varying receptive fields:

\[
y = x + \sum_{i=1}^{M} G_i(x)\, E_A^i(x)
\]

where each expert \(E_A^i\) is a lightweight convolutional adapter with kernel size \(k_i\), routed with the same Top-k selection as the LoRA experts.
Complementary Design: The CNN branch emphasizes fine-grained local geometric details, while the LoRA path models long-range interactions, reducing D1 error by up to 30%.
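A minimal sketch of the variable-kernel idea, under the assumption that each expert is a depthwise convolution (the actual adapter architecture in the paper may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    """Sketch of a variable-kernel MoE-Adapter (illustrative assumption:
    depthwise convs as experts, one global routing decision per image)."""
    def __init__(self, dim, kernels=(1, 3, 5, 7), top_k=1):
        super().__init__()
        # one depthwise conv per kernel size; padding preserves spatial size
        self.experts = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernels
        )
        self.router = nn.Linear(dim, len(kernels))
        self.top_k = top_k

    def forward(self, x):                          # x: (B, C, H, W) feature map
        logits = self.router(x.mean(dim=(2, 3)))   # global-pooled routing: (B, M)
        weights = F.softmax(logits, dim=-1)
        topv, topi = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.size(0)):
            for w, i in zip(topv[b], topi[b]):
                out[b] = out[b] + w * self.experts[int(i)](x[b:b + 1])[0]
        return x + out
```

Varying the kernel size varies the receptive field of the injected inductive bias, which is the complementary axis to the LoRA experts' rank.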
Decision Network (Selective Activation)¶
A lightweight MLP predicts a binary (on/off) activation decision for each MoE layer; the Gumbel-Softmax relaxation makes this discrete decision differentiable for end-to-end training:

\[
d_j = \frac{\exp\!\big((\log \pi_j + g_j)/\tau\big)}{\sum_{j'} \exp\!\big((\log \pi_{j'} + g_{j'})/\tau\big)}, \qquad g_j \sim \text{Gumbel}(0, 1)
\]
The hyperparameter \(\gamma \in (0,1]\) controls the computational budget.
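The gating mechanism can be sketched with PyTorch's built-in straight-through Gumbel-Softmax. The class name `LayerGate` and the hidden width are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerGate(nn.Module):
    """Sketch of the per-layer binary decision: an MLP outputs {skip, activate}
    logits, discretized with straight-through Gumbel-Softmax."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, x, tau=1.0):        # x: (B, dim) pooled features
        logits = self.mlp(x)
        # hard=True: discrete one-hot in the forward pass,
        # soft gradients in the backward pass (straight-through estimator)
        d = F.gumbel_softmax(logits, tau=tau, hard=True)
        return d[:, 1]                    # 1.0 -> run the MoE layer, 0.0 -> skip it
```

At inference, a budget constraint (the \(\gamma\) above) would cap how many gates may be open, trading accuracy for speed.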
Loss & Training¶
The training objective combines a disparity term with an expert-balancing term:

- \(\mathcal{L}_{\text{disp}}\): L1 disparity loss with exponentially increasing weights on later iterative predictions
- \(\mathcal{L}_{\text{blc}}\): MoE load-balancing loss that prevents the router from collapsing onto a few experts
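A common way to implement such a balancing term is the switch-style loss (product of per-expert importance and top-1 assignment fraction); the function below is a sketch of that standard formulation, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_probs):
    """Switch-style load-balancing loss sketch (assumed formulation).

    router_probs: (B, M) softmax routing weights over M experts.
    Minimized (value 1.0) when routing mass and top-1 assignments
    are both spread uniformly across experts.
    """
    M = router_probs.size(1)
    importance = router_probs.mean(dim=0)                 # avg routing weight per expert
    top1 = F.one_hot(router_probs.argmax(dim=-1), M)      # hard top-1 assignment counts
    top1 = top1.float().mean(dim=0)                       # fraction of batch per expert
    return M * torch.sum(importance * top1)
```

A collapsed router (all mass on one expert) drives this loss toward \(M\), while uniform utilization keeps it near its minimum of 1.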
Key Experimental Results¶
Cross-Domain Generalization (Main Results)¶
| Method | KITTI 2012 Bad 3.0 | KITTI 2015 Bad 3.0 | Middlebury Bad 2.0 | ETH3D Bad 1.0 |
|---|---|---|---|---|
| RAFT-Stereo | 5.1 | 5.7 | 12.6 | 3.3 |
| LoS | 4.4 | 5.5 | 19.6 | 3.1 |
| Former-RAFT‡ (DAM) | 3.9 | 5.1 | 8.1 | 3.3 |
| SMoEStereo (DAMV2) | 4.22 | 4.86 | 7.05 | 2.10 |
Ablation Study: Component Effectiveness¶
| Ablation Setting | Key Findings |
|---|---|
| Fixed LoRA vs. MoE-LoRA | MoE-LoRA improves performance through scene-adaptive rank selection |
| w/o Adapter vs. MoE-Adapter | Injecting inductive bias significantly improves geometric feature extraction |
| Full MoE vs. Selective MoE | Decision network reduces redundant computation while maintaining accuracy |
| VFM capacity comparison | ViT-Base achieves satisfactory results with only 1/3 the parameters of ViT-Large |
Efficiency Comparison¶
| Method | Backbone | Memory (GB) | Time (s) | Extra Params (M) |
|---|---|---|---|---|
| Former-RAFT (DAM)† | ViT-Large | 4.1 | 0.47 | 6.9 |
| SMoEStereo (DAM) | ViT-Base | 1.9 | 0.18 | 2.86 |
Key Findings¶
- SMoEStereo achieves state-of-the-art cross-domain generalization across all four benchmarks without dataset-specific adaptation.
- Compared to a fixed-rank VFM-LoRA baseline, D1 error is reduced by up to 30%.
- The decision network flexibly controls the computational budget (\(\gamma\) from 0.3 to 1.0), accommodating different resource constraints.
- On DrivingStereo adverse weather evaluation, the average D1 is reduced from 5.0 to 4.3.
Highlights & Insights¶
- Heterogeneous MoE Design: Unlike conventional homogeneous MoE, LoRA experts employ variable ranks while Adapter experts use variable kernel sizes, leveraging the complementarity of both rank and receptive field dimensions.
- Selective Activation: The binary decision mechanism of the decision network achieves Pareto-optimal accuracy-efficiency trade-offs.
- Plug-and-Play: SMoE can be integrated as a plugin module into most stereo matching networks.
Limitations & Future Work¶
- Router training requires sufficiently diverse training data to learn meaningful expert assignment strategies.
- More extreme domain shifts (e.g., underwater or medical stereo imaging) remain unvalidated.
- The additional parameters introduced by MoE, though modest, increase overall model complexity.
Related Work & Insights¶
- Robust Stereo Matching: CFNet, CREStereo++, LoS, Selective-IGEV
- VFM Fine-Tuning: LoRA, AdaptFormer, VPT
- MoE: Sparse MoE, LoRA-MoE fusion
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to apply heterogeneous MoE with selective activation to VFM-based stereo matching
- Technical Depth: ⭐⭐⭐⭐ — Complete design comprising variable-rank LoRA, variable-kernel Adapter, and decision network
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive multi-benchmark, multi-VFM, multi-ablation, and efficiency evaluations
- Practical Value: ⭐⭐⭐⭐ — Open-source code, clear efficiency advantages, suitable for practical deployment