SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

Conference: CVPR 2026
arXiv: 2603.02133
Code: https://xiac20.github.io/SimRecon/
Area: Other
Keywords: compositional scene reconstruction, simulation-ready, scene graph, active viewpoint optimization, physical assembly

TL;DR

This paper proposes SimRecon, a framework that automatically constructs simulation-ready compositional 3D scenes from real videos via a three-stage "perception → generation → simulation" pipeline. The core innovations are Active Viewpoint Optimization (AVO), which identifies the optimal projection viewpoint for single-object generation, and the Scene Graph Synthesizer (SGS), which guides physically plausible hierarchical assembly.

Background & Motivation

Background: 3D scene reconstruction follows three main paradigms — holistic neural reconstruction (3DGS/NeRF, non-interactive), manually or procedurally constructed simulators (AI2-THOR, ProcTHOR), and emerging compositional reconstruction (decomposing individual objects from multi-view input).

Limitations of Prior Work:
- Holistic reconstruction lacks object boundaries and complete geometry, making it unsuitable for simulation and interaction.
- Manual/procedural simulators are costly to build and yield unrealistic layouts.
- Existing compositional reconstruction methods (DPRecon, InstaScene) rely on heuristic viewpoint selection, causing distortion in generated objects, and their outputs remain visual representations rather than simulation-ready scenes.

Key Challenge: Two critical gaps exist between real video and simulation-ready scenes — a visual fidelity gap in the "perception → generation" stage and a physical plausibility gap in the "generation → simulation" stage.

Key Insight: Rather than redesigning the entire pipeline, the paper designs two bridging modules to address the core challenges at each stage transition.

Core Idea: Active Viewpoint Optimization acquires the projection with maximum information gain as the generation condition, while the Scene Graph Synthesizer guides hierarchical physical assembly.

Method

Overall Architecture

SimRecon adopts a three-stage pipeline with an object-centric spatial representation as the unified interface:

Perception Stage: Semantic reconstruction from video input (2DGS + semantic segmentation) to obtain pose, scale, and semantic label for each object.
Generation Stage: AVO identifies the optimal viewpoint projection, which serves as the conditioning input for a single-object generative model (Rodin) to produce complete geometry and texture.
Simulation Stage: SGS constructs a scene graph to guide hierarchical assembly within a physics simulator (Blender/Isaac Sim).

Key Designs

Object-Centric Scene Representation:
- Function: Represents the scene as \(\mathcal{S}_\text{comp} = \{o_1, o_2, ..., o_L\}\), a set of discrete object primitives.
- Each object contains intrinsic attributes (spatial \(T_i \in SE(3)\), appearance \(\mathcal{M}_i, \mathcal{T}_i\), physical \(l_i, \text{mat}_i, m_i\)) and relational attributes (support/attachment relations in the scene graph).
- Design Motivation: Provides a unified interface across all three stages, with attributes populated progressively.
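The progressive population of attributes can be illustrated with a small data structure. This is only a sketch: the class name `SceneObject`, all field names, and the reading of \(l_i\) as a semantic label are assumptions made for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple

# 4x4 homogeneous transform stored as nested tuples (assumed encoding of T_i).
Matrix4 = Tuple[Tuple[float, float, float, float], ...]

IDENTITY: Matrix4 = (
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
)


@dataclass
class SceneObject:
    """One object primitive o_i; all field names are hypothetical."""
    name: str
    pose: Matrix4 = IDENTITY              # spatial attribute T_i in SE(3)
    mesh_path: Optional[str] = None       # appearance: geometry M_i
    texture_path: Optional[str] = None    # appearance: texture T_i
    label: Optional[str] = None           # l_i (assumed: semantic label)
    material: Optional[str] = None        # mat_i
    mass: Optional[float] = None          # m_i
    parent: Optional[str] = None          # relational: parent in scene graph
    relation: Optional[str] = None        # relational: "support" or "attach"

    def unset_attributes(self) -> List[str]:
        """Names of attributes that no stage has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Perception fills pose and label; later stages fill the rest.
cup = SceneObject(name="cup_01", label="cup")
cup.mesh_path = "cup_01.glb"              # hypothetical generation output
cup.parent, cup.relation = "table_01", "support"
```

After these assignments, `cup.unset_attributes()` lists only the fields that later stages would still need to provide (here the texture, material, and mass).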
Active Viewpoint Optimization (AVO):
- Function: Searches 3D space for the projection viewpoint with maximum information gain for each object.
- Mechanism: Formulates viewpoint selection as an information-theoretic problem, \(IG(v) = H(X|v_0) - H(X|v)\), and uses the accumulated opacity \(A(v)\) rendered by 3DGS over the object's pixels \(\mathcal{P}_\text{obj}(v)\) as a differentiable proxy for information gain: \(\max_v IG(v) \approx \max_v A(v) = \max_v \sum_{p \in \mathcal{P}_\text{obj}(v)} \alpha(p,v)\)
- Depth regularization prevents viewpoint collapse onto the object surface: \(L_\text{depth}(v) = \frac{\lambda_\text{depth}}{|\mathcal{P}_\text{obj}(v)|}\sum_p (D(p,v) - d_\text{target}(s_i))^2\)
- Iterative Viewpoint Expansion: After each viewpoint is selected, the effective opacity of already-covered Gaussians is multiplicatively decayed as \(\alpha_i^{(k)} = \alpha_i^{(k-1)} \cdot (1 - \text{clip}(\alpha_i'(v_k^*), 0, 1))\), directing subsequent iterations toward unobserved regions.
- Distinction from heuristic methods: Instead of relying on manually defined viewpoint sampling strategies, AVO performs gradient-based optimization directly in 3D space.
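The interplay of the opacity proxy, the depth regularizer, and the multiplicative decay can be sketched on a toy problem. The sketch below makes several simplifying assumptions that are not from the paper: it scores a small discrete set of candidate views instead of running gradient-based optimization, it replaces the rendered per-pixel opacity with a per-Gaussian visibility weight, it combines the two terms as \(A(v) - L_\text{depth}(v)\), and it takes \(\alpha_i'(v)\) to be a Gaussian's opacity times its visibility. All function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# A candidate view: (per-Gaussian visibility in [0, 1], per-pixel object depths).
View = Tuple[Sequence[float], Sequence[float]]


def clip01(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def accumulated_opacity(alphas: Sequence[float], visibility: Sequence[float]) -> float:
    """Toy stand-in for A(v): opacity of each Gaussian weighted by visibility."""
    return sum(a * w for a, w in zip(alphas, visibility))


def depth_penalty(depths: Sequence[float], d_target: float, lam: float) -> float:
    """L_depth(v): lam times the mean squared deviation from the target depth."""
    if not depths:
        return 0.0
    return lam / len(depths) * sum((d - d_target) ** 2 for d in depths)


def decay(alphas: Sequence[float], visibility: Sequence[float]) -> List[float]:
    """alpha_i <- alpha_i * (1 - clip(alpha_i'(v*), 0, 1)) for the chosen view."""
    return [a * (1.0 - clip01(a * w)) for a, w in zip(alphas, visibility)]


def select_views(alphas: Sequence[float], views: Dict[str, View],
                 d_target: float, lam: float, k: int) -> List[str]:
    """Greedily pick k views, decaying covered Gaussians after each pick."""
    current = list(alphas)
    chosen: List[str] = []
    for _ in range(k):
        def score(name: str) -> float:
            vis, depths = views[name]
            return accumulated_opacity(current, vis) - depth_penalty(depths, d_target, lam)
        best = max(views, key=score)
        chosen.append(best)
        current = decay(current, views[best][0])
    return chosen


ALPHAS = [0.9, 0.8, 0.7, 0.6]
VIEWS: Dict[str, View] = {
    "front": ([1.0, 1.0, 0.0, 0.0], [1.5, 1.5]),
    "back": ([0.0, 0.0, 1.0, 1.0], [1.5, 1.5]),
    "close": ([1.0, 1.0, 1.0, 0.0], [0.1, 0.1]),  # sees more, but far too near
}

print(select_views(ALPHAS, VIEWS, d_target=1.5, lam=1.0, k=2))
print(select_views(ALPHAS, VIEWS, d_target=1.5, lam=0.0, k=1))
```

With these toy values the regularized search picks `front` and then `back`, because the decay suppresses the Gaussians already covered by `front`. With `lam=0.0` the first pick becomes `close`, a small-scale analogue of the viewpoint collapse that the depth term is meant to prevent.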
Scene Graph Synthesizer (SGS):
- Function: Infers support/attachment physical relations between objects and constructs a globally consistent scene graph.
- Regionalized Inference: DBSCAN clusters objects into spatial regions; AVO obtains the optimal observation viewpoint for each region, and a VLM (Qwen2.5-VL) infers local subgraphs \(\mathcal{G}_k\).
- Online Graph Merging: BFS traversal incrementally merges subgraphs, detects conflicting edges (missing paths or hierarchical contradictions), and resolves conflicts by acquiring arbitration viewpoints and re-querying the VLM.
- Hierarchical Physical Assembly: BFS proceeds from base nodes (floor/walls); support relations are resolved via gravity settlement simulation, while attachment relations are anchored with fixed constraints.
- Design Motivation: Naive object placement leads to floating or interpenetrating objects, necessitating an understanding of physical dependencies.
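The hierarchical assembly step can be sketched as a breadth-first traversal that turns scene-graph edges into an ordered list of placement actions. This is an illustrative sketch only: the edge encoding, the function name `assembly_order`, and the action labels `gravity_settle` and `fixed_constraint` are assumptions, and the actual physics calls (Blender/Isaac Sim) are deliberately left out.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str, str]   # (parent, child, relation), relation in {"support", "attach"}
Step = Tuple[str, str, str]   # (child, parent, action)

ACTIONS = {
    "support": "gravity_settle",     # drop the child onto its parent and let it settle
    "attach": "fixed_constraint",    # anchor the child to its parent
}


def assembly_order(edges: Sequence[Edge], base_nodes: Sequence[str]) -> List[Step]:
    """BFS from the base nodes; each object is placed only after its parent."""
    children: Dict[str, List[Tuple[str, str]]] = {}
    for parent, child, relation in edges:
        if relation not in ACTIONS:
            raise ValueError(f"unknown relation: {relation!r}")
        children.setdefault(parent, []).append((child, relation))

    queue = deque(base_nodes)
    placed = set(base_nodes)
    steps: List[Step] = []
    while queue:
        parent = queue.popleft()
        for child, relation in children.get(parent, []):
            if child in placed:       # already placed via another edge
                continue
            steps.append((child, parent, ACTIONS[relation]))
            placed.add(child)
            queue.append(child)
    return steps


EDGES: List[Edge] = [
    ("floor", "table", "support"),
    ("table", "cup", "support"),
    ("wall", "painting", "attach"),
]

for step in assembly_order(EDGES, base_nodes=["floor", "wall"]):
    print(step)
```

Processing parents before children means a supported object is only dropped once the surface beneath it is already in place, which is the dependency ordering that naive placement ignores.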

Loss & Training

The framework involves no end-to-end training; each stage relies on existing off-the-shelf components (2DGS, SceneSplat, Rodin). AVO optimization takes approximately 30 seconds per object.

Key Experimental Results

Main Results: ScanNet Compositional 3D Reconstruction

Method	CD↓	F-Score↑	NC↑	PSNR↑	SSIM↑	LPIPS↓	MUSIQ↑	Time
Gen3DSR	11.69	30.19	70.50	19.26	0.886	0.425	60.94	17min
DPRecon	9.26	46.12	78.28	21.97	0.913	0.257	71.49	10h42m
InstaScene	6.90	49.69	82.55	22.35	0.907	0.302	71.57	29min
SimRecon	4.34	62.65	87.37	24.43	0.924	0.153	73.56	21min

Ablation Study

Configuration	Description
Max. 2D Visibility	Maximizes 2D pixel coverage only; viewpoints are insufficiently informative.
w/o \(L_\text{depth}\)	Viewpoints collapse onto the object surface, rendering projections invalid.
Full AVO	Information gain maximization + depth regularization → optimal viewpoints.
Global Infer. (SGS)	Single global inference misses objects and relations.
Naive Merging (SGS)	Simple merging without conflict resolution produces contradictory relations.
Full SGS	Regionalized inference + online conflict resolution → consistent scene graph.

Key Findings

SimRecon reduces Chamfer Distance (CD) by 37% relative to InstaScene (4.34 vs. 6.90), suggesting that viewpoint quality is critical for generation outcomes.
Maximizing 2D visibility does not equate to maximizing 3D information — small objects may have high 2D coverage yet suffer from severe occlusion.
SGS hierarchical assembly yields significantly better physical plausibility than the MCMC post-processing used in MetaScenes.
The modular pipeline design allows individual components at each stage (reconstruction/generation/simulation) to be replaced independently.
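The relative CD reduction quoted above can be checked directly against the main results table. The snippet below is a small arithmetic check using only the CD values from that table; the helper name `relative_reduction` is introduced here for illustration.

```python
# Chamfer Distance values copied from the main results table (lower is better).
CD = {"Gen3DSR": 11.69, "DPRecon": 9.26, "InstaScene": 6.90, "SimRecon": 4.34}


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` lowers the baseline value."""
    return (baseline - ours) / baseline


reductions = {
    method: relative_reduction(value, CD["SimRecon"])
    for method, value in CD.items()
    if method != "SimRecon"
}

for method, r in reductions.items():
    print(f"{method}: {r:.1%}")
```

This gives reductions of about 62.9% against Gen3DSR, 53.1% against DPRecon, and 37.1% against InstaScene.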

Highlights & Insights

Information-Theoretic Viewpoint Optimization: Formulates viewpoint selection as information gain maximization and employs 3DGS opacity as a differentiable proxy.
Bridging Module Design Paradigm: Rather than redesigning the full pipeline, the approach identifies transition bottlenecks and introduces targeted bridging solutions.
Scene Graph as Physical Scaffold: Physical relation reasoning is decoupled from reconstruction; the scene graph both guides assembly and remains interpretable.
Decay Mechanism for Iterative Viewpoint Expansion: Attenuating the contribution of already-covered regions after each viewpoint selection naturally directs attention toward unobserved areas.

Limitations & Future Work

Relies on a VLM (Qwen2.5-VL) for physical relation inference, which may produce erroneous relations.
Validation is limited to 20 ScanNet scenes; larger-scale and outdoor settings remain untested.
The generation stage depends on Rodin, which may underperform on complex, transparent, or specular objects.
Conflict resolution in SGS requires additional VLM queries, potentially reducing efficiency in complex scenes.
The quality of inferred physical attributes (mass, material) depends on the VLM and lacks quantitative validation.

Rating

Novelty: ⭐⭐⭐⭐ The information-theoretic viewpoint formulation in AVO and the online graph merging strategy in SGS are genuinely innovative.
Experimental Thoroughness: ⭐⭐⭐ Limited to 20 ScanNet scenes; scale is relatively small.
Writing Quality: ⭐⭐⭐⭐ Clear structure with intuitive illustrations of the three-stage pipeline.
Value: ⭐⭐⭐⭐ Provides a practical end-to-end solution for the "video-to-simulation" problem.