3D Visual Illusion Depth Estimation¶
Conference: NeurIPS 2025 arXiv: 2505.13061 Code: GitHub Area: 3D Vision / Depth Estimation Keywords: 3D visual illusion, depth estimation, monocular-stereo fusion, vision-language model, Flow Matching
TL;DR¶
This paper reveals that 3D visual illusions (e.g., wall paintings, screen replays, mirror reflections) severely mislead existing state-of-the-art monocular and stereo depth estimation methods. The authors construct a large-scale dataset comprising approximately 3k scenes and 200k images, and propose a VLM-driven monocular-stereo adaptive fusion framework that achieves state-of-the-art performance across diverse illusion scenarios.
Background & Motivation¶
Depth estimation is critical for downstream applications such as AR/VR and robotics. Current monocular methods (e.g., DepthAnything V2, Marigold) and stereo methods (e.g., RAFT-Stereo, IGEV) have approached human-level performance on ordinary scenes, yet they fail severely in 3D visual illusion scenarios—paintings on flat surfaces, printed images, screen content, holographic projections, and mirror/transparent objects all induce erroneous depth predictions. Such illusions are pervasive in the real world and pose threats to safety-critical applications such as autonomous driving and robot navigation, yet systematic research and evaluation benchmarks have been lacking.
Core Problem¶
- Quantitatively revealing the impact of 3D visual illusions on depth estimation—how different illusion types separately mislead monocular and stereo methods;
- Constructing a large-scale benchmark for systematic evaluation;
- Designing a fusion framework that exploits the complementarity of monocular and stereo methods to resist illusions.
Key insight: monocular methods rely on texture cues (shape, perspective, shading) and are easily deceived by 3D textures simulated on flat surfaces, yet can handle mirrors through learned priors; stereo methods rely on pixel matching and are immune to texture illusions, but fail on mirror/transparent surfaces due to overlapping reflections. The two approaches are thus strongly complementary.
Method¶
Overall Architecture¶
The proposed VLM-Driven Monocular-Stereo Fusion Model consists of two core components:

1. Dual-branch prediction network: simultaneously outputs a stereo disparity map and a monocular depth map.
2. VLM fusion network: leverages the commonsense reasoning capability of a vision-language model to assess the reliability of each depth cue across different regions and generates a confidence map to guide fusion.
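A high-level sketch of how these two components could fit together in a forward pass is shown below (module names and signatures are illustrative assumptions, not the authors' code):

```python
import torch.nn as nn

class MonoStereoFusionSketch(nn.Module):
    """Illustrative skeleton of the dual-branch + VLM-confidence fusion idea."""

    def __init__(self, stereo_branch, mono_branch, vlm_confidence, fusion_head):
        super().__init__()
        self.stereo_branch = stereo_branch    # GRU-based iterative stereo network
        self.mono_branch = mono_branch        # frozen DepthAnything V2-style monocular model
        self.vlm_confidence = vlm_confidence  # VLM + flow-matching confidence estimator
        self.fusion_head = fusion_head        # affine alignment + convolutional fusion

    def forward(self, left, right):
        # monocular branch: affine-invariant inverse depth plus context features
        mono_disp, mono_feats = self.mono_branch(left)
        # stereo branch: per-iteration disparity estimates, assisted by monocular features
        stereo_disps = self.stereo_branch(left, right, mono_feats)
        # VLM-driven per-pixel reliability of the stereo cue
        conf = self.vlm_confidence(left, stereo_disps[-1], mono_disp)
        final_disp = self.fusion_head(mono_disp, stereo_disps[-1], conf)
        return final_disp, stereo_disps, mono_disp, conf
```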
Key Designs¶
- Dual-branch prediction network:
  - Stereo branch: based on a GRU iterative refinement framework; extracts features from the rectified image pair, constructs a cost volume, and iteratively refines disparity from a zero initialization.
  - Monocular branch: a frozen DepthAnything V2 predicts affine-invariant inverse depth; its features are also used as left-view context to assist the stereo disparity refinement.
  - Elegant design: the monocular features not only produce an independent depth prediction but also feed back into the stereo branch.
- VLM prediction stage:
  - A pretrained Qwen2-VL-7B is employed; the visual prompt comprises the left image, the stereo disparity map, and the monocular disparity map.
  - The language prompt is phrased in terms of "which materials interfere with stereo matching" (e.g., transparent/reflective objects) rather than directly describing complex illusion textures.
  - The last layer of the VLM is fine-tuned with LoRA.
- Confidence map generation (Flow Matching; see the first sketch after this list):
  - Inspired by FLUX, flow matching learns a guided path from Gaussian noise to the confidence distribution.
  - Image-text embeddings from the VLM serve as conditioning, injected into the Transformer via cross-attention.
  - A VAE decoder maps the final state back to image space; the result is concatenated with the cost volume and passed through a convolutional layer to predict the confidence map.
- Global fusion stage (see the second sketch after this list):
  - An affine transformation aligns the monocular disparity to absolute (metric) scale: \(\tilde{D}_m = s_m \cdot D_m + t_m\).
  - The affine parameters are learned via convolution over the concatenation of the monocular and stereo disparities.
  - Parameters in low-confidence regions are refined by pooling from high-confidence neighboring regions.
  - The aligned monocular disparity, the stereo disparity, and the confidence map are concatenated and processed through convolution and upsampling to yield the final high-resolution disparity map.
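A minimal sketch of the flow-matching sampling loop for the confidence latent, assuming a learned velocity field conditioned on the VLM embeddings and simple Euler integration (all names and the number of steps are illustrative):

```python
import torch

@torch.no_grad()
def sample_confidence_latent(velocity_net, vlm_embeddings, latent_shape, steps=20):
    """Integrate a learned velocity field from noise (t=0) toward the confidence latent (t=1).

    velocity_net(x_t, t, cond) is assumed to be a Transformer that injects the
    VLM image-text embeddings via cross-attention.
    """
    x = torch.randn(latent_shape)                # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)
        v = velocity_net(x, t, vlm_embeddings)   # predicted velocity along the learned flow
        x = x + v * dt                           # Euler step
    return x                                     # decoded to image space by a VAE downstream
```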
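And a sketch of the global fusion step, showing the affine alignment \(\tilde{D}_m = s_m \cdot D_m + t_m\) followed by a confidence-weighted combination. This is a simplification: the paper predicts the final disparity with further convolution and upsampling rather than a fixed blend, and the module names are illustrative.

```python
import torch
import torch.nn as nn

class GlobalFusionSketch(nn.Module):
    """Align monocular disparity to the stereo (metric) scale, then fuse by confidence."""

    def __init__(self):
        super().__init__()
        # predicts per-pixel scale s_m and shift t_m from the concatenated disparities
        self.affine_head = nn.Conv2d(2, 2, kernel_size=3, padding=1)

    def forward(self, mono_disp, stereo_disp, conf):
        # mono_disp, stereo_disp, conf: (B, 1, H, W)
        params = self.affine_head(torch.cat([mono_disp, stereo_disp], dim=1))
        s_m, t_m = params[:, :1], params[:, 1:]
        aligned_mono = s_m * mono_disp + t_m      # affine alignment to metric disparity
        # low-confidence (e.g., mirror/transparent) regions lean on the aligned monocular cue
        fused = conf * stereo_disp + (1.0 - conf) * aligned_mono
        return fused
```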
Loss & Training¶
- Disparity loss \(\mathcal{L}_d\): L1 loss supervising the GRU disparity at each iteration, the aligned monocular disparity, and the final predicted disparity.
- Confidence map loss \(\mathcal{L}_c\): Focal Loss, with ground truth derived from the discrepancy between the final stereo prediction and the ground truth.
- Training strategy: pre-trained on SceneFlow, then fine-tuned on the synthetic 3D-Visual-Illusion data.
- Trained on 4×H100 GPUs for approximately 20 days with a batch size of 6 per GPU.
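A hedged sketch of the two loss terms described above; the exponential per-iteration weighting, the error threshold for the confidence target, and the focal-loss hyperparameters are assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def disparity_loss(iter_disps, aligned_mono, final_disp, gt, gamma=0.9):
    """L1 supervision on each GRU iteration (exponentially weighted, an assumption),
    on the aligned monocular disparity, and on the final fused disparity."""
    n = len(iter_disps)
    loss = sum(gamma ** (n - 1 - i) * F.l1_loss(d, gt) for i, d in enumerate(iter_disps))
    return loss + F.l1_loss(aligned_mono, gt) + F.l1_loss(final_disp, gt)

def confidence_loss(conf_logits, stereo_disp, gt, err_thresh=2.0, alpha=0.25, gamma=2.0):
    """Focal loss on the confidence map; the target marks pixels where the final
    stereo prediction stays within err_thresh of ground truth (threshold is an assumption)."""
    target = (torch.abs(stereo_disp - gt) < err_thresh).float()
    p = torch.sigmoid(conf_logits)
    pt = target * p + (1 - target) * (1 - p)
    w = alpha * target + (1 - alpha) * (1 - target)
    return (-w * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-6))).mean()
```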
Dataset Construction (3D-Visual-Illusion Dataset)¶
Five illusion categories are covered:

- Inpainting illusion (paintings on walls/floors)
- Picture illusion (images printed or drawn on paper)
- Replay illusion (content replayed on screens)
- Holography illusion (holographic projections)
- Mirror illusion (mirror/transparent surfaces)
Synthetic data: 5,226 videos (52M frames) crawled from the web, automatically filtered by Qwen2-VL-72B and manually curated to 1,384 videos/236k frames; mirror-type data generated by Sora/Kling/HunyuanVideo yielding 234 videos/2,382 frames. Depth ground truth is produced via DepthAnything V2 + SAM2 segmentation + RANSAC plane fitting correction.
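The plane-fitting correction can be sketched as follows: fit a plane to the depth of the (reliable) supporting region around an illusion mask, then overwrite the illusion region's depth with the plane's depth. This is a simplified reconstruction of the idea, not the authors' pipeline; all names are illustrative.

```python
import numpy as np

def fit_plane_ransac(uvz, iters=200, thresh=1e-3):
    """RANSAC fit of an affine model 1/z = a*u + b*v + c.

    For a pinhole camera, the inverse depth of a 3D plane is affine in pixel
    coordinates, so fitting in (u, v, 1/z) space recovers the supporting plane.
    uvz: (N, 3) array of pixel coordinates and depth values.
    """
    u, v, z = uvz[:, 0], uvz[:, 1], uvz[:, 2]
    inv_z = 1.0 / np.clip(z, 1e-6, None)
    A_all = np.c_[u, v, np.ones_like(u)]
    best_inliers, best_model = 0, None
    for _ in range(iters):
        idx = np.random.choice(len(uvz), 3, replace=False)
        try:
            model = np.linalg.solve(A_all[idx], inv_z[idx])   # (a, b, c)
        except np.linalg.LinAlgError:
            continue
        inliers = (np.abs(A_all @ model - inv_z) < thresh).sum()
        if inliers > best_inliers:
            best_inliers, best_model = inliers, model
    return best_model

def correct_illusion_depth(depth, illusion_mask, support_mask):
    """Overwrite depth inside the illusion region with the supporting plane's depth."""
    vs, us = np.nonzero(support_mask)
    model = fit_plane_ransac(np.stack([us, vs, depth[vs, us]], axis=1))
    vi, ui = np.nonzero(illusion_mask)
    out = depth.copy()
    out[vi, ui] = 1.0 / (model[0] * ui + model[1] * vi + model[2])
    return out
```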
Real data: ZED Mini stereo camera + RealSense L515 LiDAR, 72 scenes/617 frames, with GT depth obtained through calibration, Z-buffering, and back-projection validation.
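The "back-projection validation" step presumably amounts to a cross-view consistency check on the LiDAR depth; a rough sketch of such a check is given below. This is an assumption about the implementation, and the names and threshold are illustrative.

```python
import numpy as np

def backproject_validate(depth, K, baseline, left, right, thresh=10.0):
    """Keep only depth pixels whose stereo reprojection is photometrically consistent.

    depth: (H, W) metric depth for the left view; K: 3x3 intrinsics;
    baseline: stereo baseline in the same units as depth;
    left/right: (H, W) grayscale images (rectified).
    """
    H, W = depth.shape
    fx = K[0, 0]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    disp = fx * baseline / np.clip(depth, 1e-6, None)     # depth -> disparity
    xr = np.round(xs - disp).astype(int)                  # corresponding column in the right view
    valid = (depth > 0) & (xr >= 0) & (xr < W)
    diff = np.abs(left.astype(np.float32)
                  - right[ys, np.clip(xr, 0, W - 1)].astype(np.float32))
    return valid & (diff < thresh)                        # mask of validated depth pixels
```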
Key Experimental Results¶
Illusion Region Evaluation on Real Data¶
| Method | Type | EPE↓ | bad2↓ | AbsRel↓ | δ1↑ |
|---|---|---|---|---|---|
| DA V2 | Mono | 5.81 | 61.45 | 0.14 | 92.86 |
| DepthPro+align | Mono | 4.36 | 44.98 | 0.09 | 93.83 |
| RAFT-Stereo | Stereo | 1.62 | 24.32 | 0.04 | 99.18 |
| Ours | Fusion | 1.77 | 26.72 | 0.03 | 99.60 |
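For reference, the reported metrics can be computed as in the following generic sketch (EPE and bad2 on disparity, AbsRel and δ1 on depth; this is not the paper's evaluation code):

```python
import numpy as np

def disparity_metrics(pred_disp, gt_disp):
    err = np.abs(pred_disp - gt_disp)
    epe = err.mean()                          # end-point error, in pixels
    bad2 = (err > 2.0).mean() * 100.0         # % of pixels with error > 2 px
    return epe, bad2

def depth_metrics(pred_depth, gt_depth):
    abs_rel = (np.abs(pred_depth - gt_depth) / gt_depth).mean()
    ratio = np.maximum(pred_depth / gt_depth, gt_depth / pred_depth)
    delta1 = (ratio < 1.25).mean() * 100.0    # % of pixels within 1.25x of GT
    return abs_rel, delta1
```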
Zero-Shot Generalization on Booster Dataset (Mirror/Transparent Regions)¶
| Method | All EPE↓ | All bad2↓ | Trans EPE↓ | Trans bad2↓ |
|---|---|---|---|---|
| RAFT-Stereo | 4.08 | 17.61 | 9.55 | 67.84 |
| MochaStereo | 3.79 | 16.77 | 9.18 | 66.64 |
| Ours | 2.43 | 13.84 | 7.32 | 56.77 |
Zero-Shot Generalization on Middlebury¶
| Metric | RAFT-Stereo | Selective-IGEV | MochaStereo | Ours |
|---|---|---|---|---|
| EPE↓ | 2.34 | 2.66 | 2.89 | 1.50 |
| bad2↓ | 12.04 | 10.18 | 11.93 | 11.79 |
Ablation Study¶
- The pure stereo baseline achieves bad2=80.38% on Booster; incorporating VLM fusion reduces this to 56.77% (↓24 pp).
- Introducing monocular features (MF) improves the bad metric but slightly degrades EPE—better overall geometry at the cost of some extreme offsets.
- Adaptive post-fusion (APF, dual GRU) introduces noise due to inconsistent updates between the two branches.
- The VLM confidence estimator achieves an error rate of approximately 20% in zero-shot scenarios, demonstrating strong generalization.
Highlights & Insights¶
- Novel problem formulation: the first systematic study of 3D visual illusions in depth estimation, defining five illusion categories.
- Insightful complementarity analysis: clearly articulates the failure modes of monocular (generative/texture-to-geometry mapping) versus stereo (discriminative/pixel matching) methods under different illusions and their complementary relationship.
- VLM commonsense-driven fusion: leverages LLM commonsense knowledge (e.g., "mirrors/glass cause matching failures") to guide confidence estimation, achieving better generalization than purely data-driven approaches.
- Comprehensive dataset construction: combines synthetic and real data using web-crawled videos, generative models, and LiDAR ground truth in a systematic pipeline.
- No degradation on ordinary scenes: EPE improves on Middlebury, demonstrating the generality of the fusion framework.
Limitations & Future Work¶
- High computational cost: the VLM component requires approximately 54 GB VRAM and 4.77 s/sample (H100), far exceeding pure stereo methods (5.6 GB, 0.87 s).
- Manual annotation dependency: synthetic GT generation relies on SAM2-assisted manual annotation of illusion regions and their supporting regions.
- Limited real data coverage: only three illusion categories (inpainting, picture, replay) are covered in real data; holography and mirror types are absent.
- Single-illusion assumption: complex scenarios involving multiple superimposed illusion types are not studied.
- Planarity assumption: disparity correction assumes that illusion regions are coplanar with supporting regions, which may fail for non-planar cases.
Related Work & Insights¶
- Compared with monocular methods (DepthAnything V2, Marigold, etc.): this paper reveals that, as "generative models" mapping texture cues to geometry, these methods are inherently susceptible to simulated textures; fine-tuning does not fundamentally resolve the issue—performance improves on illusion regions but degrades on ordinary regions.
- Compared with stereo methods (RAFT-Stereo, IGEV, etc.): reflections on mirror/transparent surfaces cause matching failures; fine-tuning is similarly limited because the learning signals from different illusion types conflict with each other.
- Compared with multi-view methods (DUSt3R, VGGT): these exhibit a strong monocular bias in illusion scenarios.
- Compared with the Booster dataset: Booster focuses on transparent/reflective surfaces, whereas this work covers a broader scope (five illusion categories).
The core insight—complementary monocular-stereo fusion driven by VLM commonsense reasoning—is generalizable to other 3D tasks that require handling "counter-intuitive" scenes.
Rating¶
- Novelty: ⭐⭐⭐⭐ The problem formulation is highly novel (the first systematic study of 3D visual illusions), though the fusion approach itself (dual-branch + confidence-based fusion) follows a relatively standard technical paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ The dataset construction is comprehensive, the baseline comparisons are extensive, the ablation study is thorough, and fine-tuning analysis as well as downstream task (3D detection) visualization are included.
- Writing Quality: ⭐⭐⭐⭐ The writing is clear, with in-depth analysis of problem motivation and complementarity; however, certain dataset construction details (e.g., mathematical derivation of plane fitting) occupy disproportionate space.
- Value: ⭐⭐⭐⭐ The paper surfaces an overlooked yet important problem, and the dataset and benchmark offer long-term value to the community; however, the high computational cost of the VLM fusion component limits practical applicability.