# Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
- Conference: CVPR 2026
- arXiv: 2603.12789
- Code: Project Page
- Area: 3D Vision / Joint Human-Scene Reconstruction
- Keywords: Multi-view human reconstruction, multi-person scene, SMPL-X, 3D foundation model, scale alignment
## TL;DR
CHROMM is a unified framework that integrates the geometric prior of Pi3X and the human prior of Multi-HMR into a single feed-forward network, enabling joint reconstruction of cameras, scene point clouds, and SMPL-X human meshes from multi-person multi-view video in a single pass—without external modules, preprocessing, or iterative optimization. It achieves a multi-view WA-MPJPE of 53.1 mm on RICH and runs more than 8× faster than HAMSt3R.
## Background & Motivation
Background: Joint 3D human-scene reconstruction is a core problem in computer vision, with applications in robotics, autonomous driving, and AR/VR. Recent 3D foundation models (DUSt3R, VGGT, Pi3X) have advanced scene reconstruction, while Multi-HMR has enabled multi-person human mesh recovery.
Limitations of Prior Work:
- Monocular methods such as UniSH and Human3R cannot exploit multi-view information, limiting their accuracy.
- Multi-view methods such as HSfM and HAMSt3R rely on additional modules (2D keypoint detectors, cross-view ReID modules) or require iterative optimization, resulting in high system complexity.
- Appearance-based ReID methods fail in visually similar scenarios (e.g., uniformed subjects).
- The near-metric scale output by Pi3X is misaligned with the true metric scale of SMPL—causing human meshes to penetrate the ground or float above it.
Key Challenge: Simultaneous reconstruction of the scene and multiple humans is hindered by scale inconsistency between the two, by the difficulty of cross-view person association, and by existing pipelines' reliance on external preprocessing.
Goal: To build a unified feed-forward framework that requires no external modules or preprocessed data and performs joint multi-person multi-view human-scene reconstruction in a single pass.
Key Insight: Fuse the scene prior of Pi3X with the human prior of Multi-HMR, design a scale adjustment module to bridge the two, and replace appearance-based matching with geometric cues for cross-view association.
Core Idea: Dual-encoder late fusion + head-pelvis ratio scale adjustment + view-invariant/view-dependent decomposition fusion + geometry-driven multi-person association.
## Method
### Overall Architecture
Input multi-view multi-person video \(\{I_t^v\}\) → Dual encoders (Pi3X for scene features; Multi-HMR for human features) → Pi3X decoder reconstructs point maps and cameras → Head detection extracts human tokens, which are fused with scene tokens → SMPL decoder regresses pose/shape/translation → At test time: per-view tracking → geometry-driven cross-view multi-person association → view-invariant/view-dependent decomposition fusion → scale adjustment module aligns scene and human.
### Key Designs
- **Dual-Encoder Late Fusion Architecture**
- The Pi3X encoder captures global 3D geometry; the Multi-HMR encoder is optimized for human body representation.
- Key design decision: early fusion is avoided—feeding human tokens into the Pi3X decoder disrupts the input distribution and degrades scene reconstruction.
- Human tokens are fused with scene tokens only after decoding, via an MLP: \(H_n = \text{MLP}_{\text{fuse}}([Z_n^{\text{scene}} | Z_n^{\text{human}}])\)
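A minimal PyTorch sketch of this fusion step. Only the concatenate-then-MLP structure comes from the formula above; the token dimensions, layer depth, and GELU activation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Late fusion of per-person scene and human tokens (sizes are assumed)."""
    def __init__(self, d_scene=1024, d_human=1024, d_out=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_scene + d_human, d_out),
            nn.GELU(),
            nn.Linear(d_out, d_out),
        )

    def forward(self, z_scene, z_human):
        # H_n = MLP_fuse([Z_n^scene | Z_n^human]): channel-wise concatenation
        return self.mlp(torch.cat([z_scene, z_human], dim=-1))

fuse = FusionMLP()
h = fuse(torch.randn(3, 1024), torch.randn(3, 1024))  # 3 detected persons
print(h.shape)  # torch.Size([3, 1024])
```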
- **Depth-Residual Translation Estimation**
- Rather than directly regressing 3D head translation, the method exploits the depth prior provided by Pi3X point maps.
- A residual relative to the scene depth map is predicted: \(d_n^m = d_n^{m,\text{coarse}} + \Delta d_n^m\), which is back-projected to a 3D position using the 2D head keypoint and camera intrinsics.
- Ablation: depth residual (107.5 mm) vs. direct depth (133.8 mm) vs. direct translation regression (196.4 mm)—the differences are substantial.
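A sketch of the back-projection, assuming a pinhole camera with intrinsics \(K\); the function name and concrete numbers are hypothetical:

```python
import torch

def head_translation(uv, d_coarse, d_residual, K):
    """uv: 2D head keypoint in pixels; d_*: depths in meters; K: (3, 3) intrinsics."""
    d = d_coarse + d_residual                 # refined depth = coarse + residual
    uv1 = torch.tensor([uv[0], uv[1], 1.0])   # homogeneous pixel coordinate
    return d * (torch.linalg.inv(K) @ uv1)    # 3D head position in the camera frame

K = torch.tensor([[1000.0, 0.0, 512.0],
                  [0.0, 1000.0, 384.0],
                  [0.0, 0.0, 1.0]])
t = head_translation((600.0, 300.0), d_coarse=3.2, d_residual=-0.15, K=K)
print(t)  # tensor([ 0.2684, -0.2562,  3.0500])
```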
- **Head-Pelvis Ratio Scale Adjustment**
- Problem: The near-metric scale \(s\) output by Pi3X may be underestimated (human penetrates the ground) or overestimated (human floats above it).
- Solution: Compute the ratio between the 2D head-pelvis distance in the image \(\ell^{\text{img}}\) and the projected SMPL head-pelvis distance \(\ell^{\text{smpl}}\), then average over all frames and persons to obtain a global adjustment factor \(r = \frac{1}{|\mathcal{S}|}\sum \frac{\ell^{\text{smpl}}}{\ell^{\text{img}}}\), yielding the corrected scale \(s^* = r \cdot s\).
- Coarse-to-fine pelvis localization: the head token estimates a coarse position → the corresponding patch regresses an offset → the coarse position is used as fallback if the pelvis is out of bounds.
- Ablation: scale adjustment reduces WA-MPJPE from 169.7 to 102.6 mm (−39.5%).
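A sketch of the global scale correction. Only \(r = \frac{1}{|\mathcal{S}|}\sum \ell^{\text{smpl}}/\ell^{\text{img}}\) and \(s^* = r \cdot s\) come from the description above; the distance values below are made up:

```python
import numpy as np

def adjust_scale(s, ell_img, ell_smpl):
    """s: near-metric scene scale; ell_*: 2D head-pelvis distances (pixels)
    over all (frame, person) samples in S."""
    r = np.mean(np.asarray(ell_smpl) / np.asarray(ell_img))  # r = (1/|S|) Σ ℓ^smpl / ℓ^img
    return r * s                                             # s* = r · s

ell_img = [210.0, 198.0, 205.0]    # measured in the image, per (frame, person)
ell_smpl = [230.0, 221.0, 224.0]   # projected from the predicted SMPL meshes
print(adjust_scale(1.0, ell_img, ell_smpl))  # ≈ 1.101: scene scale raised ~10%
```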
- **Multi-View Fusion (Test Time, Optimization-Free)**
- View-invariant quantities (shape \(\beta\), pose \(\theta\)): direct parameter averaging, which outperforms implicit token max-pooling.
- View-dependent quantities (rotation \(R\), translation \(\tau\)): transformed to world coordinates, then averaged via quaternion averaging and multi-view ray triangulation, respectively.
- Ablation (RICH WA-MPJPE, mm; lower is better): Avg+Tri (53.1) outperforms MaxPool+Tri (63.2) and parameter averaging alone (69.3).
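A NumPy sketch of the two view-dependent aggregation steps: rotation fusion via the standard largest-eigenvector quaternion average and translation fusion via least-squares ray intersection. Both are textbook techniques standing in for the paper's exact implementation; view-invariant \(\beta\) and \(\theta\) are simply averaged, as in the first bullet:

```python
import numpy as np

def average_quaternions(quats):
    """quats: (V, 4) unit quaternions of the same world rotation from V views."""
    A = sum(np.outer(q, q) for q in quats)  # sign-invariant accumulation
    _, eigvecs = np.linalg.eigh(A)
    q = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
    return q if q[0] >= 0 else -q           # q and -q encode the same rotation

def triangulate(origins, dirs):
    """Least-squares 3D point closest to all rays (origin o_v, unit direction d_v)."""
    A, b = np.zeros((3, 3)), np.zeros(3)
    for o, d in zip(origins, dirs):
        P = np.eye(3) - np.outer(d, d)      # projector onto the ray's normal plane
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

qs = np.array([[0.999, 0.02, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]])
qs /= np.linalg.norm(qs, axis=1, keepdims=True)
print(average_quaternions(qs))              # ≈ [1.0, 0.01, 0.0, 0.0]

# Two camera rays pointing at a person standing at (0.5, 0, 2):
origins = [np.zeros(3), np.array([1.0, 0.0, 0.0])]
dirs = [np.array([0.5, 0.0, 2.0]), np.array([-0.5, 0.0, 2.0])]
dirs = [d / np.linalg.norm(d) for d in dirs]
print(triangulate(origins, dirs))           # ≈ [0.5, 0.0, 2.0]
```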
- **Geometry-Based Multi-Person Association**
- Per-view tracking: head token L2 distance for inter-frame matching; Sinkhorn optimal transport handles unmatched detections.
- Cross-view association cost: \(\mathcal{C}(a,b) = 0.8\,\|p_a - p_b\| + 0.2\,\|\theta_a - \theta_b\|\), where \(p\) is the 3D world position and \(\theta\) the canonical pose; it is solved with the Hungarian algorithm for one-to-one matching.
- Ablation: position alone 91.1% precision vs. pose alone 70.6%; combined 91.3%.
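A sketch of the cross-view matching. The 0.8/0.2 weighting and the one-to-one Hungarian assignment follow the text; the concrete pose distance (L2 over canonical pose vectors) is an assumption:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(pos_a, pos_b, pose_a, pose_b):
    """pos_*: (N, 3) world positions; pose_*: (N, D) canonical pose vectors."""
    d_pos = np.linalg.norm(pos_a[:, None] - pos_b[None], axis=-1)   # (N, N)
    d_pose = np.linalg.norm(pose_a[:, None] - pose_b[None], axis=-1)
    cost = 0.8 * d_pos + 0.2 * d_pose                # C(a, b)
    rows, cols = linear_sum_assignment(cost)         # Hungarian, one-to-one
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

pos_a = np.array([[0.0, 0.0, 2.0], [1.5, 0.0, 3.0]])   # persons in view a
pos_b = np.array([[1.6, 0.1, 3.0], [0.1, 0.0, 2.1]])   # same persons, view b
pose_a = np.random.randn(2, 72)
pose_b = pose_a[::-1].copy()                            # view b lists them swapped
print(associate(pos_a, pos_b, pose_a, pose_b))          # [(0, 1), (1, 0)]
```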
### Loss & Training
- Two-stage training: Stage 1 freezes the Pi3X and Multi-HMR encoders and trains new modules including the SMPL decoder (20 epochs, BEDLAM, lr = 5e-5; scale adjustment disabled for the first 10 epochs).
- Stage 2 unfreezes only the pelvis detection MLP (10 epochs, mixed 3DPW + MPII + COCO + BEDLAM, lr = 1e-4).
- Stage 1 losses: 3D vertex/joint L1 (\(\lambda = 5.0\)) + 2D reprojection L1 + SMPL parameter L1 + detection BCE + pelvis BCE (see the loss sketch after this list).
- Stage 2 adds: Chamfer distance (visible SMPL vertices vs. predicted depth map).
- Training hardware: 4 × A100, approximately 2 days.
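An illustrative composition of the Stage-1 objective in PyTorch (the sketch referenced above). Only the \(\lambda = 5.0\) weight on the 3D term is stated; the unit weights on the other terms and the key names are assumptions:

```python
import torch.nn.functional as F

def stage1_loss(pred, gt):
    """pred/gt: dicts of tensors; key names are hypothetical."""
    l_3d  = F.l1_loss(pred["verts3d"], gt["verts3d"])   # 3D vertex/joint L1
    l_2d  = F.l1_loss(pred["kpts2d"], gt["kpts2d"])     # 2D reprojection L1
    l_par = F.l1_loss(pred["smpl"], gt["smpl"])         # SMPL parameter L1
    l_det = F.binary_cross_entropy_with_logits(pred["det_logits"], gt["det"])
    l_pel = F.binary_cross_entropy_with_logits(pred["pelvis_logits"], gt["pelvis"])
    return 5.0 * l_3d + l_2d + l_par + l_det + l_pel    # λ = 5.0 on the 3D term
```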
## Key Experimental Results
### Main Results (Global Human Motion Estimation)
| Method | Multi-View | No External Modules | EMDB-2 WA-MPJPE↓ (mm) | RICH WA-MPJPE↓ (mm) | RICH W-MPJPE↓ (mm) |
|---|---|---|---|---|---|
| JOSH3R | ✗ | ✗ | 220.0 | — | — |
| UniSH | ✗ | ✗ | 118.5 | 118.1 | 183.2 |
| Human3R | ✗ | ✓ | 112.2 | 110.0 | 184.9 |
| CHROMM-mono | ✗ | ✓ | 102.6 | 87.5 | 138.3 |
| CHROMM-multi | ✓ | ✓ | — | 53.1 | 79.0 |
### Multi-View Pose Estimation
| Method | No ReID | No Optimization | EgoHumans W-MPJPE↓ (m) | EgoHumans GA-MPJPE↓ (m) | EgoExo4D W-MPJPE↓ (m) |
|---|---|---|---|---|---|
| HSfM | ✗ | ✗ | 1.04 | 0.21 | 0.56 |
| HAMSt3R | ✓ | △ | 3.80 | 0.42 | 0.51 |
| CHROMM | ✓ | ✓ | 0.51 | 0.15 | 0.26 |
### Runtime
| Method | Per-Frame Inference Time (3 persons, 4 views) |
|---|---|
| HSfM | ~118 s |
| HAMSt3R | ~32 s |
| CHROMM | ~4 s (8×+ speedup) |
### Key Findings
- Multi-view fusion yields substantial gains: RICH WA-MPJPE drops from 87.5 mm (monocular) to 53.1 mm (multi-view), a 39.3% improvement.
- Scale adjustment is the most critical module: removing it raises WA-MPJPE from 102.6 to 169.7 mm (+65.4%).
- The depth-residual strategy outperforms direct translation regression by 89 mm (107.5 vs. 196.4 mm).
- Geometry-based association (91.3% accuracy) substantially outperforms pose-only matching (70.6%).
- CHROMM is 29× faster than HSfM and 8× faster than HAMSt3R, while requiring no ReID.
## Highlights & Insights
- First end-to-end unified framework for multi-person multi-view human-scene reconstruction: requires no external modules, preprocessing, or optimization.
- Head-pelvis ratio scale adjustment: bridges the scale gap between scene and human using anatomical proportions—simple yet highly effective.
- View-invariant/view-dependent decomposition fusion: explicit parameter averaging combined with triangulation outperforms implicit token aggregation.
- Geometry-driven cross-view association: avoids the failure of appearance matching in uniformed scenarios; the combination of 3D position and canonical pose is an elegant design choice.
## Limitations & Future Work
- Heavy reliance on head tokens for human detection—performance degrades when heads are occluded or invisible.
- The dual encoders are not unified into a single encoder—there remains room to improve scene-human interaction modeling.
- Extreme close-up shots (head filling the image) and close-range interpersonal interactions are typical failure cases.
- Scale adjustment depends on pelvis visibility—it degrades under full-body occlusion.
## Related Work & Insights
- vs. Human3R: CHROMM extends to multi-view without external modules, achieving 9.6 mm improvement on EMDB-2 and 57 mm on RICH.
- vs. HSfM: CHROMM is 29× faster, with EgoHumans W-MPJPE of 0.51 m vs. 1.04 m (50% improvement).
- vs. HAMSt3R: CHROMM is 8× faster and supports multi-person association without external ReID.
- Insights: integrating 3D foundation models with human body priors is an emerging trend; scale alignment is a central engineering challenge; the view-invariant/view-dependent decomposition is generalizable to other multi-view estimation tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ — First unified multi-person multi-view framework free of external dependencies; scale adjustment and geometric association are novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, monocular and multi-view comparisons, comprehensive ablations, and runtime analysis.
- Writing Quality: ⭐⭐⭐⭐ — Contributions are clearly articulated; each design decision is validated experimentally.
- Value: ⭐⭐⭐⭐ — Fast inference and preprocessing-free operation are practically significant for real-world deployment.