CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Conference: ICLR 2026
arXiv: 2512.14696
Code: Available (project page)
Area: 3D Vision / Real2Sim
Keywords: Real2Sim, monocular video, planar scene primitives, human-scene interaction, RL humanoid control
TL;DR
This paper proposes CRISP, a method for recovering simulatable human motion and scene geometry from monocular video. By fitting planar primitives to obtain clean, simulation-ready geometry and leveraging human-scene contact modeling to reconstruct occluded regions, CRISP reduces the motion tracking failure rate of a humanoid controller from 55.2% to 6.9%.
Background & Motivation
Real2Sim (converting real environments into simulation-ready representations) is a central problem in robotics and AR/VR. Recovering physically simulatable human motion and scene geometry from monocular video has significant value for robot policy training, motion retargeting, and virtual reality content creation.
Limitations of Prior Work:
- Joint optimization methods with data-driven priors: These approaches rely on learned priors for joint human-scene reconstruction but operate with no physics in the loop, resulting in physically implausible outputs such as body-object interpenetration.
- Direct geometric reconstruction methods: Although capable of recovering scene geometry, the results typically contain noise and artifacts. Such unclean geometry causes scene-interaction failures when fed into motion tracking policies. For example, an uneven chair surface can trigger abnormal physical collisions when a humanoid controller attempts to sit.
Key Challenge: Existing methods either lack physical plausibility or produce geometry that is insufficiently clean to be directly used for interactive physical simulation.
Core Idea: Fit planar primitives to the scene point cloud to obtain convex, clean, simulation-ready geometry, and use human-scene contact modeling to recover geometrically occluded regions during interaction.
Method
Overall Architecture
Input: Monocular RGB video. Output: Simulatable human motion sequences and clean scene geometry. The pipeline consists of three main stages: (1) scene geometry reconstruction via planar primitive fitting, (2) occluded region recovery via contact guidance, and (3) physical validation via a humanoid controller trained with RL.
Key Designs
- Planar Primitive Fitting:
  - A dense point cloud is first reconstructed from the video using existing methods.
  - A clustering pipeline segments the point cloud based on three features: depth, surface normals, and optical flow.
  - A planar primitive is fit to each cluster.
  - The result is a compact scene representation composed of convex planar surfaces.
  - Design Motivation: Planar primitives are convex and free of surface noise, which makes them well-suited for physics simulation engines. Unlike meshes or implicit representations, they introduce no noisy artifacts and support efficient, stable collision detection.
  - Additional Benefit: Simulation throughput improves by 43%, as convex geometry enables much faster collision detection than complex meshes.
- Contact-Guided Occlusion Recovery:
  - During human-scene interaction, portions of the scene geometry are occluded by the human body (e.g., a chair seat is hidden when a person sits).
  - Human-scene contact modeling is used to infer the occluded geometry.
  - Core Idea: Human pose implicitly encodes scene geometry. For instance, a sitting pose can be used to infer the position and shape of a chair seat.
  - By estimating contact points between body joints and the scene, the method infers where the planes of occluded scene regions lie.
  - This approach requires no prior CAD models or scene category templates.
- Physical Validation via Humanoid Controller + RL:
  - The recovered human motion and scene geometry are used to drive a humanoid controller.
  - An RL policy is trained to track the original video motion within the reconstructed scene.
  - This step serves both as a validation mechanism (poor reconstruction quality leads to RL failure) and as a final output (producing simulatable human motion).
  - Physical simulation enforces physical plausibility in the final results: no penetration, maintained balance, and reasonable contact.
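To make the contact-guided idea concrete, here is a minimal sketch of inferring an occluded support plane from body contact points. It is not the paper's implementation: the function names `infer_support_plane` and `penetration_depth`, the assumption that the hidden surface is horizontal, the use of a median, and the example coordinates are all illustrative assumptions.

```python
import numpy as np


def infer_support_plane(contact_points, up=(0.0, 0.0, 1.0)):
    """Estimate a support plane normal . x = offset from 3D contact points.

    Assumption (not from the paper): the occluded surface is horizontal,
    so its normal equals `up` and only its height is unknown.
    """
    pts = np.asarray(contact_points, dtype=float)
    normal = np.asarray(up, dtype=float)
    normal = normal / np.linalg.norm(normal)
    heights = pts @ normal
    # The median is less sensitive than the mean to one bad contact estimate.
    offset = float(np.median(heights))
    return normal, offset


def penetration_depth(points, normal, offset):
    """Depth by which each point lies below the plane (0 if on or above it)."""
    signed = np.asarray(points, dtype=float) @ normal - offset
    return np.maximum(0.0, -signed)


# Hypothetical contact estimates (meters) for a seated pose:
# pelvis and two thigh contacts at roughly seat height.
contacts = np.array([
    [0.00, 0.10, 0.45],
    [0.12, 0.35, 0.44],
    [-0.12, 0.35, 0.46],
])
normal, offset = infer_support_plane(contacts)
depths = penetration_depth(contacts, normal, offset)
```

In this toy case the recovered seat height is the median contact height, and `penetration_depth` gives a simple check of how far each contact point would sink below the inferred surface.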
Loss & Training
- Clustering stage: Unsupervised clustering using distance metrics over depth, surface normal, and optical flow features.
- Plane fitting: Least-squares fitting of plane parameters for each cluster.
- RL controller training: Standard PPO or similar policy gradient method; reward function includes motion tracking error and physical plausibility penalties (e.g., penetration, loss of balance).
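The plane-fitting step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses orthogonal (total) least squares via SVD, which is one common way to fit a plane to a point cluster, and the function name `fit_plane` and the synthetic cluster are hypothetical.

```python
import numpy as np


def fit_plane(points):
    """Fit a plane normal . x = offset to an (N, 3) array of points.

    Uses orthogonal least squares: the normal is the direction of least
    variance of the centered points. The sign of the normal is arbitrary.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing
    # singular value; the last row is the least-variance direction.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    normal = vt[-1]
    offset = float(normal @ centroid)
    residuals = (pts - centroid) @ normal  # signed point-to-plane distances
    rms = float(np.sqrt(np.mean(residuals ** 2)))
    return normal, offset, rms


# Synthetic cluster: noisy samples of the plane z = 0.5.
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
z = 0.5 + rng.normal(scale=0.01, size=200)
cluster = np.column_stack([xy, z])
normal, offset, rms = fit_plane(cluster)
```

The RMS residual gives a rough measure of how planar a cluster is, which could be used to reject clusters that a single plane describes poorly.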
Key Experimental Results
Main Results
Evaluated on the human-centric video benchmarks EMDB and PROX:
| Method | Motion Tracking Failure Rate ↓ | RL Simulation Throughput | Notes |
|---|---|---|---|
| Prior methods (noisy geometry) | 55.2% | Baseline | Geometric artifacts cause frequent failures |
| CRISP (Ours) | 6.9% | +43% vs. baseline | Clean geometry drastically reduces failures |
In-the-Wild Video Validation
| Video Type | Result | Notes |
|---|---|---|
| Casually captured daily videos | Success | Generalizes to uncontrolled environments |
| Internet videos | Success | Generalizes to diverse scenes |
| Sora-generated videos | Success | Applicable even to AI-generated content |
Ablation Study
| Configuration | Key Metric | Notes |
|---|---|---|
| Without planar primitives (raw mesh) | Failure rate increases substantially | Validates the critical role of planar primitives |
| Without contact-guided occlusion recovery | Poor performance on interaction scenes | Occlusion recovery is necessary for interactions |
| Without RL validation (direct output) | Physically implausible penetration | RL ensures physical realism |
| Different clustering feature combinations | Depth + normals + flow is optimal | Three features are complementary |
Key Findings
- Planar primitives are an ideal scene representation for Real2Sim: clean, convex, and computationally efficient.
- Human pose is a powerful signal for inferring occluded scene geometry.
- The motion tracking failure rate drops from 55.2% to 6.9%, a reduction of 48.3 percentage points, or roughly 88% relative to the baseline.
- The 43% throughput gain stems from more efficient collision detection with convex geometry.
- The method generalizes well to in-the-wild videos, including AI-generated content from Sora.
- The entire pipeline requires no CAD model libraries or scene category priors.
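The relative figure quoted above follows directly from the two reported failure rates. A short check, with variable names chosen here for illustration:

```python
baseline_failure = 55.2  # percent, prior methods (reported)
crisp_failure = 6.9      # percent, CRISP (reported)

absolute_drop = baseline_failure - crisp_failure       # percentage points
relative_reduction = absolute_drop / baseline_failure  # fraction of baseline
ratio = baseline_failure / crisp_failure               # baseline-to-CRISP ratio

print(f"{absolute_drop:.1f} pp, {relative_reduction:.1%}, {ratio:.1f}x")
```

The relative reduction works out to 87.5%, which rounds to the "approximately 88%" figure, and the baseline fails about eight times as often as CRISP.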
Highlights & Insights
- The insight of replacing complex meshes with planar primitives is simple yet powerful: a modest loss of geometric detail buys a substantial gain in simulation compatibility.
- Contact-guided occlusion recovery is the key innovation: human pose serves as a "mold" for inferring hidden geometry.
- Using the RL humanoid controller as a physical plausibility validator forms a meaningful closed loop.
- Successful validation on Sora-generated videos demonstrates strong generalization potential and forward-looking applicability.
- The depth + normals + optical flow feature combination for clustering is concise yet effective.
- The method can generate physically valid human motion and interaction environments at scale, with direct applications in robotics and AR/VR.
Limitations & Future Work
- The planar primitive assumption limits representational capacity for curved objects (e.g., spheres, cylinders).
- The pipeline depends on the quality of upstream point cloud reconstruction; inaccurate depth estimation propagates errors throughout.
- Contact-based inference relies on human pose, so it cannot recover geometry hidden by distant objects that the person never touches.
- Clustering hyperparameters (e.g., number of clusters, distance thresholds) may require scene-specific tuning.
- Dynamic scenes involving moving objects are not currently handled.
- Training the RL controller itself requires substantial computational resources.
- Future work may extend to multi-person interaction scenarios and more complex object manipulation tasks.
Related Work & Insights
- Related to human-scene interaction reconstruction methods such as PROX and LEMO, but introduces physical simulation as a validation step.
- Complementary to physics-constrained motion generation approaches such as PhysDiff.
- The planar primitive fitting idea connects to classical computational geometry work on plane detection, but its application to Real2Sim is novel.
- Insights: (1) Simple geometric representations often outperform noisy but detailed ones in simulation contexts; (2) human pose as an implicit encoder of scene geometry is a direction worth deeper exploration; (3) the use of RL as a validation tool is transferable to other reconstruction tasks.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐