Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Conference: CVPR 2026
arXiv: 2511.16555
Code: tomtomtommi/LiteAnyStereo
Area: 3D Vision
Keywords: Stereo matching, zero-shot generalization, efficient inference, hybrid cost aggregation, knowledge distillation

TL;DR

This paper proposes Lite Any Stereo, which ranks first among efficient methods on four real-world benchmarks while using less than 1% of the computation of state-of-the-art accurate methods (33G MACs vs. 12824G for FoundationStereo). This is accomplished via a hybrid 2D-3D cost aggregation module and a three-stage million-scale training strategy (supervised → self-distillation → real-data knowledge distillation), demonstrating for the first time that ultra-lightweight models can exhibit strong zero-shot generalization.

Background & Motivation

The stereo matching field suffers from a severe accuracy–efficiency dichotomy:

Background: Current stereo matching methods fall into two camps. Accurate methods (FoundationStereo, Selective-IGEV, etc.) leverage foundation-model depth priors or large-scale computation to achieve high accuracy, but cost thousands of GMACs per forward pass (12824G MACs for FoundationStereo); efficient methods (LightStereo, BANet, etc.) target real-time inference at the cost of accuracy.

Limitations of Prior Work: Most efficient methods are fine-tuned only for specific domains (e.g., KITTI) and lack zero-shot generalization; the community broadly assumes that lightweight models are inherently incapable of zero-shot generalization due to limited capacity.

Key Challenge: efficiency vs. zero-shot generalization — the community implicitly treats these as mutually exclusive.

Key Gap: Although StereoAnything attempts to train efficient models using 30M pseudo-disparity maps generated by monocular depth models, monocular depth quality is limited, and efficient models still lag far behind accurate methods.

Key Insight: The authors argue that the problem lies not in model capacity per se, but in (a) architectures that fail to exploit the complementary information of 2D and 3D representations, and (b) training strategies that do not leverage the abundance of unlabeled real-world data.

Core Idea: By combining hybrid cost aggregation with a three-stage training strategy, even ultra-lightweight models can bridge the sim-to-real gap and achieve strong zero-shot generalization.

Method

Overall Architecture

A feed-forward network with a four-stage pipeline: shared-weight feature extraction (MobileNetV2 backbone, multi-scale features unified to 1/4 resolution) → correlation computation (constructing cost volume \(\mathbf{C}(d,h,w) = \frac{1}{N_c}\langle \mathbf{F}_L^{1/4}(h,w), \mathbf{F}_R^{1/4}(h,w-d) \rangle\)) → hybrid 3D-2D cost aggregation → disparity estimation (soft-argmax + convex upsampling to full resolution). The entire model requires only 33G MACs.
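
The correlation and soft-argmax steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes \(N_c\) is the number of feature channels, fills out-of-range columns (\(w - d < 0\)) with zeros, and applies soft-argmax directly to the raw correlation volume, whereas the real model aggregates the volume first. The toy one-hot features and the `1e3` sharpening factor are hypothetical choices made only so the demo has an unambiguous answer.

```python
import numpy as np

def correlation_volume(feat_l, feat_r, max_disp):
    """Build C(d, h, w) = <F_L(h, w), F_R(h, w - d)> / N_c.

    feat_l, feat_r: arrays of shape (N_c, H, W), e.g. 1/4-resolution features.
    Returns an array of shape (max_disp, H, W).
    """
    n_c, h, w = feat_l.shape
    volume = np.zeros((max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_l * feat_r).sum(axis=0) / n_c
        else:
            # Columns w < d have no counterpart in the right image (assumed zero).
            volume[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).sum(axis=0) / n_c
    return volume

def soft_argmax(volume):
    """Expected disparity under a softmax taken over the disparity axis."""
    z = volume - volume.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(z)
    prob /= prob.sum(axis=0, keepdims=True)
    disp = np.arange(volume.shape[0], dtype=volume.dtype).reshape(-1, 1, 1)
    return (prob * disp).sum(axis=0)

# Toy check: every right-image column has a unique one-hot feature, and the
# left image is the right image shifted by 3 pixels (true disparity = 3).
width = 8
feat_r = np.eye(width)[:, None, :].repeat(2, axis=1)  # shape (N_c=8, H=2, W=8)
feat_l = np.roll(feat_r, 3, axis=2)
volume = correlation_volume(feat_l, feat_r, max_disp=6)
disparity = soft_argmax(volume * 1e3)  # sharpen the softmax for the demo only
```

In the toy case the correlation peaks at \(d = 3\) for every column with a valid match, and the soft-argmax recovers that value there.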

Key Designs

Compact Backbone:
- Function: Efficient multi-scale matching feature extraction.
- Mechanism: An ImageNet-pretrained MobileNetV2 is used as a shared-weight feature extractor, producing \(\{1/4, 1/8, 1/16, 1/32\}\) multi-scale features unified to 1/4 resolution via residual upsampling.
- Design Motivation: The authors find MobileNetV2's channel configuration better suited to stereo matching than the more recent ConvNeXt v2. Tab. 2c reports 5.39 (MobileNetV2) vs. 5.03 (ConvNeXt v2) on ETH3D and 10.89 vs. 10.52 on Middlebury, so the error rates of the two backbones are close. External priors such as DepthAnything are deliberately excluded to maintain minimal computation.
Hybrid Cost Aggregation:
- Function: Jointly exploits 2D and 3D representations to capture complementary spatial and disparity cues.
- Core Problem: Pure 2D aggregation collapses the disparity dimension into channels and cannot model structural continuity along the disparity direction; pure 3D aggregation is computationally expensive and many disparity-dimension layers contribute minimally.
- Mechanism: A sequential 3D→2D structure is adopted: \(\mathbf{C}_{agg} = \mathbf{G}_{2D}(\mathbf{G}_{3D}(\mathbf{C}))\). The 3D block uses multi-scale 3D convolutions (kernel \((3,3,3)\)) to perceive cross-disparity structure at approximately 4.8% of total computation; the 2D block uses ConvNeXt layers for efficient spatial refinement.
- Design Comparison: Four integration schemes are explored: (a) parallel bilateral, (b) 2D→3D, (c) 3D→2D, and (d) interleaved. Ablations favor 3D→2D (Tab. 2a): establishing disparity-structural awareness with a small number of 3D convolutions and then refining spatially with efficient 2D layers gives the best overall results.
- 3D Ratio Ablation: Just 4.8% 3D computation suffices; increasing the share to 9.5% or 15.6% degrades performance (Middlebury error rises from 9.50 to 10.06 and 10.34), as excessive 3D computation crowds out spatial refinement capacity within a fixed MACs budget.
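
To make the cost argument concrete, the sketch below counts multiply-accumulates for one 2D layer (disparity folded into channels) and one 3D layer over the same cost volume. It is a back-of-the-envelope illustration under stated assumptions, not the paper's layer configuration: the 320×736 input is the fine-tuning crop size, the volume is assumed to have \(D_{max}/4 = 48\) disparity bins at 1/4 resolution, and the 3D channel width `C3D = 8` is hypothetical.

```python
def conv2d_macs(h, w, c_in, c_out, k=3):
    """MACs of a stride-1, same-padded 2D convolution (bias ignored)."""
    return h * w * c_in * c_out * k * k

def conv3d_macs(d, h, w, c_in, c_out, k=3):
    """MACs of a stride-1, same-padded 3D convolution (bias ignored)."""
    return d * h * w * c_in * c_out * k ** 3

H, W = 320 // 4, 736 // 4  # 1/4-resolution feature map
D = 192 // 4               # assumed number of disparity bins at 1/4 resolution
C3D = 8                    # hypothetical channel width of the 3D block

# 2D aggregation: the D disparity bins act as input and output channels.
macs_2d = conv2d_macs(H, W, c_in=D, c_out=D)
# 3D aggregation: a (3, 3, 3) kernel slides over (D, H, W) with C3D channels.
macs_3d = conv3d_macs(D, H, W, c_in=C3D, c_out=C3D)

# 4.8% of the 33G total, as quoted for the 3D block.
budget_3d = 0.048 * 33e9

print(f"2D layer: {macs_2d / 1e9:.2f} GMACs, 3D layer: {macs_3d / 1e9:.2f} GMACs")
print(f"3D budget at 4.8% of 33G: {budget_3d / 1e9:.2f} GMACs")
```

Under these assumptions a single thin 3D layer already costs four times the 2D layer, which is consistent with the observation that only a small 3D share fits a fixed MACs budget.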
Three-Stage Million-Scale Training Strategy:
- Stage ① Supervised Training on Synthetic Data: End-to-end training for 150K steps on 1.8M annotated synthetic samples (SceneFlow 35K + FallingThings 30K + FSD 1.1M + CREStereo 0.2M + VKITTI2 21K + TartanAir 0.31M + Dynamic Replica 0.14M) using smooth L1 loss to establish fundamental matching capability.
- Stage ② Self-Distillation on Synthetic Data: Teacher and student share the same architecture and are both initialized from Stage ①. The teacher receives clean inputs; the student receives strongly augmented inputs. A feature alignment loss encourages domain-invariant representations: \(\mathcal{L}_{feat} = 1 - \frac{1}{HW}\sum_{i=1}^{HW} \cos(F_i, F_i')\), where \(F_i\) and \(F_i'\) are the two networks' features at location \(i\). Ablations compare three teacher-update schemes: (a) fixed teacher weights, (b) EMA-updated teacher, and (c) hard-copy student→teacher. The fixed teacher (a) performs best (KITTI 2012, abbreviated K.12 below: 3.64 vs. 3.97 vs. 4.22), likely because a stable anchor gives the lightweight student a more consistent target for learning domain-invariant features.
- Stage ③ Knowledge Distillation on Real Data: 0.5M unlabeled real stereo pairs are collected (Flickr1024, InStereo2K, Holopix50K, DrivingStereo, SouthKenSV, UASOL). A frozen FoundationStereo model serves as the teacher to generate pseudo-labels for 100K steps of fine-tuning. Key finding: data quality matters far more than quantity — low-quality data (e.g., Stereo4D at only 512×512 resolution, HRWSI with poor rectification) actively degrades generalization performance.
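
The Stage ② feature alignment loss and the three teacher-update schemes can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: parameters are represented as plain dictionaries of arrays, and the `momentum` value and the `eps` stabilizer are assumptions not given in the source.

```python
import numpy as np

def feature_alignment_loss(f_a, f_b, eps=1e-8):
    """L_feat = 1 - mean_i cos(F_i, F_i') for feature maps of shape (C, H, W)."""
    c = f_a.shape[0]
    a = f_a.reshape(c, -1)  # one column per spatial location i
    b = f_b.reshape(c, -1)
    cos = (a * b).sum(axis=0) / (
        np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + eps
    )
    return 1.0 - cos.mean()

def update_teacher(teacher, student, scheme, momentum=0.999):
    """Return the teacher parameters after one student step.

    scheme: "fixed" (teacher never changes), "ema" (exponential moving
    average of the student), or "hard_copy" (teacher overwritten by student).
    """
    if scheme == "fixed":
        return teacher
    if scheme == "ema":
        return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
                for k in teacher}
    if scheme == "hard_copy":
        return {k: student[k].copy() for k in teacher}
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3, 5))
loss_same = feature_alignment_loss(feats, 3.0 * feats)  # cosine ignores scale
loss_opposite = feature_alignment_loss(feats, -feats)

teacher = {"w": np.zeros(2)}
student = {"w": np.full(2, 2.0)}
ema_teacher = update_teacher(teacher, student, "ema", momentum=0.5)
```

The loss is zero for features that agree up to scale and reaches 2 for opposite features; with the fixed scheme the teacher stays at its Stage ① weights for the whole stage.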

Loss & Training

Stage ①: \(\mathcal{L}_{disp} = \text{smooth}_{L_1}(\mathbf{D} - \mathbf{D}_{gt})\)
Stage ②: \(\mathcal{L}_{disp} + \mathcal{L}_{feat}\) (cosine feature alignment)
Stage ③: Smooth L1 loss on pseudo-labels; self-distillation is not applied (no additional benefit on pseudo-labels)
Training configuration: 150K + 50K + 100K steps, batch size 176, A100 GPUs, AdamW + one-cycle LR (peak 2e-4), crops 256×512 → fine-tuning 320×736, \(D_{max}=192\)
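
A smooth L1 disparity loss of the kind used in Stages ① and ③ might look like the sketch below. It is an illustration under assumptions: the validity mask (targets in \((0, D_{max})\)) and the `beta = 1.0` transition point are common conventions, not details confirmed by the source. In Stage ③ the `target` would be the pseudo-label produced by the frozen teacher.

```python
import numpy as np

def smooth_l1_disparity_loss(pred, target, max_disp=192, beta=1.0):
    """Mean smooth L1 loss over pixels with a valid target disparity."""
    diff = np.abs(pred - target)
    # Quadratic near zero, linear for large errors.
    per_pixel = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    # Assumed mask: ignore missing targets (<= 0) and disparities beyond D_max.
    valid = (target > 0) & (target < max_disp)
    if not valid.any():
        return 0.0
    return float(per_pixel[valid].mean())

pred = np.array([[1.5, 10.0, 50.0]])
target = np.array([[1.0, 7.0, 300.0]])  # last pixel exceeds D_max and is masked
loss = smooth_l1_disparity_loss(pred, target)
```

Here the first pixel falls in the quadratic regime (0.125), the second in the linear regime (2.5), and the third is ignored, so the loss is their mean.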

Key Experimental Results

Main Results (Million-Scale Zero-Shot Generalization)

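| Dataset | Metric (lower is better) | Lite Any Stereo | Prev. Best Efficient Method | Accurate Method Ref. | MACs (ours vs. efficient vs. accurate) |
|---|---|---|---|---|---|
| KITTI 2012 | D1 | 3.04 | 4.00 (StereoAnything-L) | 2.51 (FoundationStereo) | 33G vs. 84G vs. 12824G |
| KITTI 2015 | D1 | 3.87 | 4.81 (StereoAnything-L) | 2.83 (FoundationStereo) | 33G vs. 84G vs. 12824G |
| ETH3D | Bad 1.0 | 3.53 | 3.81 (StereoAnything-L) | 0.49 (FoundationStereo) | 33G vs. 84G vs. 12824G |
| Middlebury | Bad 2.0 | 7.51 | 9.82 (StereoAnything-L) | 1.12 (FoundationStereo) | 33G vs. 84G vs. 12824G |
| DrivingStereo Weather | D1 | 8.74 | n/a | 10.71 (FoundationStereo) | 33G vs. 12824G (ours vs. accurate) |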
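
For readers unfamiliar with the metrics in this table, the sketch below implements their usual definitions: Bad-τ is the percentage of valid pixels whose absolute disparity error exceeds τ pixels, and D1 is the percentage whose error exceeds both 3 pixels and 5% of the ground-truth disparity. These are the standard benchmark conventions; the exact masks used in the paper's evaluation (for example, occluded vs. non-occluded pixels) are not given here, so the `gt > 0` validity mask is an assumption.

```python
import numpy as np

def bad_tau(pred, gt, tau):
    """Percentage of valid pixels with absolute disparity error > tau pixels."""
    valid = gt > 0  # assumed convention: non-positive ground truth is missing
    err = np.abs(pred - gt)
    return 100.0 * float((err[valid] > tau).mean())

def d1(pred, gt):
    """Percentage of valid pixels with error > 3 px and > 5% of ground truth."""
    valid = gt > 0
    err = np.abs(pred - gt)
    outlier = (err > 3.0) & (err > 0.05 * gt)
    return 100.0 * float(outlier[valid].mean())

gt = np.array([10.0, 100.0, 100.0, 20.0])
pred = np.array([10.5, 104.0, 106.0, 24.0])  # absolute errors: 0.5, 4, 6, 4
bad2 = bad_tau(pred, gt, tau=2.0)
d1_score = d1(pred, gt)
```

The second pixel shows why the two metrics differ: a 4-pixel error on a 100-pixel disparity counts as bad under Bad 2.0 but not under D1, because it stays within 5% of the ground truth.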

Note: On the DrivingStereo weather subset, Lite Any Stereo (33G MACs) even surpasses FoundationStereo, its teacher model with 389× more computation.
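
The compute ratios quoted in this section follow directly from the MACs column. A short check using only the numbers from the table above:

```python
# MACs in units of 1e9, taken from the main results table.
MACS_G = {
    "Lite Any Stereo": 33,
    "StereoAnything-L": 84,
    "FoundationStereo": 12824,
}

teacher_ratio = MACS_G["FoundationStereo"] / MACS_G["Lite Any Stereo"]
compute_fraction = MACS_G["Lite Any Stereo"] / MACS_G["FoundationStereo"]
efficient_ratio = MACS_G["StereoAnything-L"] / MACS_G["Lite Any Stereo"]

print(f"FoundationStereo / Lite Any Stereo: {teacher_ratio:.1f}x")
print(f"Lite Any Stereo share of teacher compute: {100 * compute_fraction:.2f}%")
print(f"StereoAnything-L / Lite Any Stereo: {efficient_ratio:.2f}x")
```

The teacher ratio rounds to the 389× figure in the note, and the share of teacher compute is well under the 1% claimed in the TL;DR.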

Ablation Study

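| Training Stage Ablation | K.12 | K.15 | ETH3D | Middlebury | Note |
|---|---|---|---|---|---|
| Stage ① only | 4.05 | 4.55 | 4.43 | 8.49 | Synthetic supervised baseline |
| + Stage ② | 3.66 | 4.53 | 4.69 | 7.03 | Self-distillation: K.12 and Middlebury improve significantly; ETH3D regresses slightly |
| + Stage ③ | 3.04 | 3.87 | 3.53 | 7.51 | Real-data distillation: K.12, K.15, and ETH3D improve; Middlebury regresses |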
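
The per-stage changes in this table are easy to misread, so the snippet below computes them explicitly. It uses only the numbers from the stage ablation above (all metrics are error rates, so a positive delta is a regression).

```python
BENCHMARKS = ["K.12", "K.15", "ETH3D", "Middlebury"]
STAGES = [
    ("Stage 1", [4.05, 4.55, 4.43, 8.49]),
    ("Stage 2", [3.66, 4.53, 4.69, 7.03]),
    ("Stage 3", [3.04, 3.87, 3.53, 7.51]),
]

regressions = {}
for (_, prev), (name, cur) in zip(STAGES, STAGES[1:]):
    # Error change relative to the previous stage; negative means improvement.
    deltas = {b: round(c - p, 2) for b, p, c in zip(BENCHMARKS, prev, cur)}
    regressions[name] = [b for b, d in deltas.items() if d > 0]
    print(name, deltas)
```

Each added stage improves three of the four benchmarks and regresses on one: ETH3D after Stage ② and Middlebury after Stage ③.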

| Aggregation Scheme (Tab. 2a) | K.12 | K.15 | ETH3D | Middlebury |
|---|---|---|---|---|
| Pure 2D | 5.02 | 5.01 | 6.48 | 11.29 |
| 3D→2D (default) | 4.78 | 4.64 | 5.39 | 10.89 |
| Bilateral parallel | 5.10 | 5.10 | 8.55 | 12.00 |
| Interleaved | 4.61 | 4.73 | 6.20 | 11.34 |
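
A quick way to read the aggregation ablation is to pick the lowest-error scheme per benchmark. The snippet uses only the values listed in the table above; the ASCII key `"3D->2D"` stands for the 3D→2D default.

```python
BENCHMARKS = ["K.12", "K.15", "ETH3D", "Middlebury"]
SCHEMES = {
    "Pure 2D": [5.02, 5.01, 6.48, 11.29],
    "3D->2D": [4.78, 4.64, 5.39, 10.89],
    "Bilateral parallel": [5.10, 5.10, 8.55, 12.00],
    "Interleaved": [4.61, 4.73, 6.20, 11.34],
}

# Lower is better for every column.
best = {
    bench: min(SCHEMES, key=lambda name: SCHEMES[name][i])
    for i, bench in enumerate(BENCHMARKS)
}
print(best)
```

The 3D→2D design wins on K.15, ETH3D, and Middlebury, while the interleaved variant is slightly ahead on K.12.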

Key Findings

Only 4.8% 3D computation suffices for significant disparity-structural awareness — increasing the ratio degrades performance by crowding out the MACs budget.
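The 3D→2D sequential design gives the lowest error on three of the four benchmarks in Tab. 2a (the interleaved variant is slightly better on K.12, 4.61 vs. 4.78), supporting the view that modeling disparity structure before spatial refinement is the better information flow direction.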
Fixed-teacher self-distillation outperforms EMA and hard-copy variants — a stable anchor is more effective for lightweight students learning domain-invariant features.
The training strategy exhibits architectural generality: applying the same strategy to LightStereo-M and BANet-2D yields consistent improvements.
Inference is the fastest among the compared methods: 21 ms on GTX 1080 Ti, 19 ms on RTX 2080 Ti, 23 ms on RTX 3090, 17 ms on RTX 4090, with only 2.5 GB VRAM for 2K inputs.
The student surpasses the teacher on the DrivingStereo weather subset (8.74 vs. 10.71). Since DrivingStereo is among the unlabeled Stage ③ data sources, this suggests that distillation on in-domain real data can make a lightweight model more robust than its teacher on that specific distribution.

Highlights & Insights

Breaking a Cognitive Barrier: For the first time, an ultra-lightweight model (<1% of the MACs) is shown to approach accurate methods on zero-shot benchmarks such as KITTI 2012 (3.04 vs. 2.51) and even to surpass FoundationStereo on the DrivingStereo weather subset, challenging the community consensus that "lightweight = weak generalization."
Minimalism in Hybrid Aggregation: Just 4.8% 3D computation captures the critical disparity-structural information, suggesting that 3D convolutions play a targeted rather than brute-force role along the disparity dimension.
Universal Value of the Training Strategy: The three-stage strategy is effective across different architectures (LightStereo, BANet) and can serve directly as a standard training paradigm for efficient stereo matching.
Student Surpassing Teacher: The lightweight student outperforming FoundationStereo on DrivingStereo demonstrates that distillation combined with domain specialization can enable small models to outperform large ones on specific distributions — a phenomenon that warrants further investigation.

Limitations & Future Work

Gap from Prior-Based Methods: Without depth priors such as DepthAnything, performance headroom is limited (ETH3D: 3.53 vs. FoundationStereo's 0.49).
Pseudo-Label Bottleneck: Stage ③ quality depends entirely on FoundationStereo — where the teacher fails, the student cannot learn effectively.
Middlebury Indoor Scenes: Middlebury error deteriorates from 7.03 to 7.51 after Stage ③, which suggests that the real-data mix lacks indoor training samples.
Fixed Maximum Disparity: \(D_{max}=192\) may limit applicability in scenes with very large disparities.
Transparent and Reflective Surfaces: The authors acknowledge that these challenging scenarios still require improvement.
Temporal Consistency: Not explored; multi-frame stereo video may offer further gains.

vs. LightStereo: Both are lightweight methods built around 2D aggregation (33G MACs), but LightStereo has weak zero-shot generalization. Lite Any Stereo reduces the K.12 error from 4.10 to 3.04 by adding a minimal amount of 3D aggregation and the three-stage training strategy.
vs. StereoAnything: Also targets efficient model generalization but relies on monocular depth pseudo-labels (30M samples), whose quality is inferior to the stereo pseudo-labels generated by FoundationStereo in this work. Furthermore, StereoAnything-L requires 84G MACs vs. 33G here.
vs. FoundationStereo: Serves as the Stage ③ teacher. Notably, the student outperforms the teacher on DrivingStereo weather, illustrating the unique advantage of distillation combined with a lightweight architecture on specific distributions.
vs. BANet: A concurrent high-efficiency method (36G MACs) with very poor zero-shot generalization under SceneFlow-only training (ETH3D: 44.89); applying the proposed three-stage strategy reduces that error to 4.05.

Rating

Novelty: ⭐⭐⭐⭐ First to achieve SOTA zero-shot stereo matching at extremely low computation, overturning the stereotype of "lightweight = weak generalization."
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five real-world benchmarks, multi-GPU inference time comparisons, detailed ablations, and validation of the training strategy's architectural generality.
Writing Quality: ⭐⭐⭐⭐ Clear structure, rich figures and tables, and systematic and persuasive ablations.
Value: ⭐⭐⭐⭐⭐ Directly advances practical deployment of stereo matching; the three-stage training strategy has broad reference value.