
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Conference: CVPR 2026 arXiv: 2603.12789 Code: Project Page Area: 3D Vision / Joint Human-Scene Reconstruction Keywords: Multi-view human reconstruction, multi-person scene, SMPL-X, 3D foundation model, scale alignment

TL;DR

CHROMM is a unified framework that integrates the geometric prior of Pi3X and the human prior of Multi-HMR into a single feed-forward network, enabling joint reconstruction of cameras, scene point clouds, and SMPL-X human meshes from multi-person multi-view video in a single pass—without external modules, preprocessing, or iterative optimization. It achieves a multi-view WA-MPJPE of 53.1 mm on RICH and runs more than 8× faster than HAMSt3R.

Background & Motivation

Background: Joint 3D human-scene reconstruction is a core problem in computer vision, with applications in robotics, autonomous driving, and AR/VR. Recent 3D foundation models (DUSt3R, VGGT, Pi3X) have advanced scene reconstruction, while Multi-HMR has enabled multi-person human mesh recovery.

Limitations of Prior Work:

  1. Monocular methods such as UniSH and Human3R cannot exploit multi-view information, limiting their accuracy.
  2. Multi-view methods such as HSfM and HAMSt3R rely on additional modules (2D keypoint detectors, cross-view ReID modules) or require iterative optimization, resulting in high system complexity.
  3. Appearance-based ReID methods fail in visually similar scenarios (e.g., uniformed subjects).
  4. The near-metric scale output by Pi3X is misaligned with the true metric scale of SMPL—causing human meshes to penetrate the ground or float above it.

Key Challenge: Simultaneous reconstruction of the scene and multiple humans is hindered by scale inconsistency between the two, difficulty in cross-view person association, and the desire to avoid reliance on external preprocessing.

Goal: To build a unified feed-forward framework that requires no external modules or preprocessed data and performs joint multi-person multi-view human-scene reconstruction in a single pass.

Key Insight: Fuse the scene prior of Pi3X with the human prior of Multi-HMR, design a scale adjustment module to bridge the two, and replace appearance-based matching with geometric cues for cross-view association.

Core Idea: Dual-encoder late fusion + head-pelvis ratio scale adjustment + view-invariant/view-dependent decomposition fusion + geometry-driven multi-person association.

Method

Overall Architecture

Input multi-view multi-person video \(\{I_t^v\}\) → Dual encoders (Pi3X for scene features; Multi-HMR for human features) → Pi3X decoder reconstructs point maps and cameras → Head detection extracts human tokens, which are fused with scene tokens → SMPL decoder regresses pose/shape/translation → At test time: per-view tracking → geometry-driven cross-view multi-person association → view-invariant/view-dependent decomposition fusion → scale adjustment module aligns scene and human.
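One concrete step in this pipeline is lifting a detected 2D head keypoint, at its estimated (coarse + residual) depth, to a 3D position with the camera intrinsics. The sketch below assumes a simple pinhole camera model; all function names and numbers are illustrative, not the paper's actual API.

```python
# Back-project a 2D head keypoint at a given depth into camera-frame 3D
# coordinates using pinhole intrinsics. The depth is the coarse value read
# from the scene point map plus a predicted residual (Key Design 2).
# All names and values here are illustrative assumptions.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at the given depth to a camera-frame 3D point."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)

# Coarse depth from the scene point map, refined by a predicted residual.
d_coarse, d_residual = 2.40, -0.15
head_3d = backproject(320.0, 180.0, d_coarse + d_residual,
                      fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Because the depth prior comes from the scene point map, the network only has to predict a small residual, which is consistent with the large ablation gap the paper reports for this design.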

Key Designs

  1. Dual-Encoder Late Fusion Architecture

    • The Pi3X encoder captures global 3D geometry; the Multi-HMR encoder is optimized for human body representation.
    • Key design decision: early fusion is avoided—feeding human tokens into the Pi3X decoder disrupts the input distribution and degrades scene reconstruction.
    • Human tokens are fused with scene tokens only after decoding, via an MLP: \(H_n = \text{MLP}_{\text{fuse}}([Z_n^{\text{scene}} | Z_n^{\text{human}}])\)
  2. Depth-Residual Translation Estimation

    • Rather than directly regressing 3D head translation, the method exploits the depth prior provided by Pi3X point maps.
    • A residual relative to the scene depth map is predicted: \(d_n^m = d_{n,m}^{\text{coarse}} + \Delta d_n^m\), which is back-projected to a 3D position using the 2D head keypoint and camera intrinsics.
    • Ablation: depth residual (107.5 mm) vs. direct depth (133.8 mm) vs. direct translation regression (196.4 mm)—the differences are substantial.
  3. Head-Pelvis Ratio Scale Adjustment

    • Problem: The near-metric scale \(s\) output by Pi3X may be underestimated (human penetrates the ground) or overestimated (human floats above it).
    • Solution: Compute the ratio between the 2D head-pelvis distance in the image \(\ell^{\text{img}}\) and the projected SMPL head-pelvis distance \(\ell^{\text{smpl}}\), then average over all frames and persons to obtain a global adjustment factor \(r = \frac{1}{|\mathcal{S}|}\sum \frac{\ell^{\text{smpl}}}{\ell^{\text{img}}}\), yielding the corrected scale \(s^* = r \cdot s\).
    • Coarse-to-fine pelvis localization: the head token first estimates a coarse pelvis position → the corresponding patch regresses a refinement offset → the coarse position serves as a fallback if the refined pelvis falls out of bounds.
    • Ablation: scale adjustment reduces WA-MPJPE from 169.7 to 102.6 mm (−39.5%).
  4. Multi-View Fusion (Test Time, Optimization-Free)

    • View-invariant quantities (shape \(\beta\), pose \(\theta\)): direct parameter averaging, which outperforms implicit token max-pooling.
    • View-dependent quantities (rotation \(R\), translation \(\tau\)): transformed to world coordinates, then averaged via quaternion averaging and multi-view ray triangulation, respectively.
    • Ablation (WA-MPJPE, mm; lower is better): Avg+Tri (53.1) outperforms MaxPool+Tri (63.2) and averaging alone (69.3).
  5. Geometry-Based Multi-Person Association

    • Per-view tracking: head token L2 distance for inter-frame matching; Sinkhorn optimal transport handles unmatched detections.
    • Cross-view association cost: \(\mathcal{C}(a,b) = 0.8 \cdot \|3\text{D position difference}\| + 0.2 \cdot \|\text{canonical pose difference}\|\), solved with the Hungarian algorithm for one-to-one matching.
    • Ablation: position alone 91.1% precision vs. pose alone 70.6%; combined 91.3%.
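The head-pelvis ratio scale adjustment (Key Design 3) reduces to a few lines: average the ratio of projected-SMPL to image head-pelvis lengths over all frames and persons, then rescale the scene. The sketch below assumes the 2D image and projected SMPL keypoints are already available; names and data are illustrative.

```python
# Sketch of the head-pelvis ratio scale adjustment: the global correction
# factor r is the mean, over all frames and persons, of the projected SMPL
# head-pelvis length over the image head-pelvis length, and the corrected
# scene scale is s* = r * s. Data below are illustrative.
import math

def head_pelvis_dist(head, pelvis):
    return math.dist(head, pelvis)

def adjust_scale(s, pairs):
    """pairs: (head_img, pelvis_img, head_smpl, pelvis_smpl) 2D points."""
    ratios = [
        head_pelvis_dist(hs, ps) / head_pelvis_dist(hi, pi)
        for hi, pi, hs, ps in pairs
    ]
    r = sum(ratios) / len(ratios)  # global adjustment factor
    return r * s                   # corrected scale s*

pairs = [
    ((100.0, 50.0), (100.0, 150.0), (100.0, 50.0), (100.0, 170.0)),  # ratio 1.2
    ((200.0, 60.0), (200.0, 160.0), (200.0, 60.0), (200.0, 140.0)),  # ratio 0.8
]
s_star = adjust_scale(1.0, pairs)  # r = (1.2 + 0.8) / 2 = 1.0
```

Averaging over all frames and persons makes the factor robust to individual noisy detections, which matters because a single global scale is applied to the whole scene.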

Loss & Training

  • Two-stage training: Stage 1 freezes the Pi3X and Multi-HMR encoders and trains new modules including the SMPL decoder (20 epochs, BEDLAM, lr = 5e-5; scale adjustment disabled for the first 10 epochs).
  • Stage 2 unfreezes only the pelvis detection MLP (10 epochs, mixed 3DPW + MPII + COCO + BEDLAM, lr = 1e-4).
  • Stage 1 losses: 3D vertex/joint L1 (\(\lambda = 5.0\)) + 2D reprojection L1 + SMPL parameter L1 + detection BCE + pelvis BCE.
  • Stage 2 adds: Chamfer distance (visible SMPL vertices vs. predicted depth map).
  • Training hardware: 4 × A100, approximately 2 days.
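Under the stated weights, the Stage-1 objective can be sketched as a weighted sum. Note that the paper only gives \(\lambda = 5.0\) for the 3D term; the unit weights on the remaining terms below are an assumption for illustration.

```python
# Hedged sketch of the Stage-1 loss combination: 3D vertex/joint L1 weighted
# by lambda = 5.0, plus 2D reprojection L1, SMPL parameter L1, detection BCE,
# and pelvis BCE. Unit weights on the non-3D terms are an assumption; the
# source only specifies lambda for the 3D term.

def stage1_loss(l1_3d, l1_2d, l1_param, bce_det, bce_pelvis, lam=5.0):
    return lam * l1_3d + l1_2d + l1_param + bce_det + bce_pelvis

# Illustrative per-batch loss values.
total = stage1_loss(l1_3d=0.02, l1_2d=0.10, l1_param=0.05,
                    bce_det=0.30, bce_pelvis=0.20)
```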

Key Experimental Results

Main Results (Global Human Motion Estimation)

| Method | Multi-View | No External Modules | EMDB-2 WA-MPJPE ↓ (mm) | RICH WA-MPJPE ↓ (mm) | RICH W-MPJPE ↓ (mm) |
| --- | --- | --- | --- | --- | --- |
| JOSH3R | – | – | 220.0 | – | – |
| UniSH | ✗ | – | 118.5 | 118.1 | 183.2 |
| Human3R | ✗ | – | 112.2 | 110.0 | 184.9 |
| CHROMM-mono | ✗ | ✓ | 102.6 | 87.5 | 138.3 |
| CHROMM-multi | ✓ | ✓ | – | 53.1 | 79.0 |

Cells marked "–" are not reported in the source.

Multi-View Pose Estimation

| Method | No ReID | No Optimization | EgoHumans W-MPJPE ↓ (m) | EgoHumans GA-MPJPE ↓ (m) | EgoExo4D W-MPJPE ↓ (m) |
| --- | --- | --- | --- | --- | --- |
| HSfM | – | ✗ | 1.04 | 0.21 | 0.56 |
| HAMSt3R | ✗ | – | 3.80 | 0.42 | 0.51 |
| CHROMM | ✓ | ✓ | 0.51 | 0.15 | 0.26 |

Runtime

| Method | Per-Frame Inference Time (3 persons, 4 views) |
| --- | --- |
| HSfM | ~118 s |
| HAMSt3R | ~32 s |
| CHROMM | ~4 s (8×+ speedup) |

Key Findings

  • Multi-view fusion yields substantial gains: RICH WA-MPJPE drops from 87.5 mm (monocular) to 53.1 mm (multi-view), a 39.3% improvement.
  • Scale adjustment is the most critical module: removing it raises WA-MPJPE from 102.6 to 169.7 mm (+65.5%).
  • The depth-residual strategy outperforms direct translation regression by 89 mm (107.5 vs. 196.4 mm).
  • Geometry-based association (91.3% accuracy) substantially outperforms pose-only matching (70.6%).
  • CHROMM is 29× faster than HSfM and 8× faster than HAMSt3R, while requiring no ReID.
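The geometry-based association behind the last two findings can be sketched directly from the cost in Key Design 5: \(\mathcal{C}(a,b) = 0.8 \cdot \|\Delta\text{position}\| + 0.2 \cdot \|\Delta\text{pose}\|\). The paper solves the one-to-one assignment with the Hungarian algorithm; this toy version brute-forces permutations, which is equivalent for small person counts. All data below are illustrative.

```python
# Sketch of geometry-driven cross-view association: pairwise cost combining
# 3D position distance (weight 0.8) and canonical pose distance (weight 0.2),
# solved as a one-to-one assignment. Brute-force permutation search stands in
# for the Hungarian algorithm here; data are illustrative.
import itertools
import math

def cost(a, b, w_pos=0.8, w_pose=0.2):
    return (w_pos * math.dist(a["pos"], b["pos"])
            + w_pose * math.dist(a["pose"], b["pose"]))

def associate(view_a, view_b):
    """Return the permutation of view_b minimizing the total matching cost."""
    best = min(
        itertools.permutations(range(len(view_b))),
        key=lambda perm: sum(cost(view_a[i], view_b[j])
                             for i, j in enumerate(perm)),
    )
    return list(best)

view_a = [
    {"pos": (0.0, 0.0, 2.0), "pose": (0.1, 0.2)},
    {"pos": (1.0, 0.0, 3.0), "pose": (0.5, 0.4)},
]
view_b = [  # same two people, observed in a different order
    {"pos": (1.1, 0.0, 3.0), "pose": (0.5, 0.4)},
    {"pos": (0.0, 0.1, 2.0), "pose": (0.1, 0.2)},
]
matches = associate(view_a, view_b)  # → [1, 0]
```

Because the cues are purely geometric (world positions and canonical poses), the matching does not degrade when subjects look alike, unlike appearance-based ReID.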

Highlights & Insights

  • First end-to-end unified framework for multi-person multi-view human-scene reconstruction: requires no external modules, preprocessing, or optimization.
  • Head-pelvis ratio scale adjustment: bridges the scale gap between scene and human using anatomical proportions—simple yet highly effective.
  • View-invariant/view-dependent decomposition fusion: explicit parameter averaging combined with triangulation outperforms implicit token aggregation.
  • Geometry-driven cross-view association: avoids the failure of appearance matching in uniformed scenarios; the combination of 3D position and canonical pose is an elegant design choice.
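The view-dependent half of the decomposition fusion hinges on quaternion averaging. The source does not spell out its exact averaging scheme; the sketch below uses a sign-aligned normalized mean, a common approximation that is accurate for nearby rotations, as one would expect across views of the same person.

```python
# Sketch of quaternion averaging for fusing per-view world rotations
# (Key Design 4). A sign-aligned normalized mean is used as a stand-in;
# the paper does not specify its exact averaging method.
import math

def quat_average(quats):
    """Average unit quaternions (w, x, y, z), flipping signs to agree
    with the first quaternion so antipodal pairs don't cancel."""
    ref = quats[0]
    acc = [0.0, 0.0, 0.0, 0.0]
    for q in quats:
        sign = 1.0 if sum(a * b for a, b in zip(q, ref)) >= 0 else -1.0
        acc = [a + sign * b for a, b in zip(acc, q)]
    norm = math.sqrt(sum(a * a for a in acc))
    return tuple(a / norm for a in acc)

# Two nearby rotations about the z-axis (10° and 20°) average to 15°.
q1 = (math.cos(math.radians(5)), 0.0, 0.0, math.sin(math.radians(5)))
q2 = (math.cos(math.radians(10)), 0.0, 0.0, math.sin(math.radians(10)))
q_avg = quat_average([q1, q2])
```

View-invariant quantities (shape, pose) need no such care and are averaged directly, which is the explicit-parameter path the ablation favors over token max-pooling.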

Limitations & Future Work

  • Heavy reliance on head tokens for human detection—performance degrades when heads are occluded or invisible.
  • The dual encoders are not unified into a single encoder—there remains room to improve scene-human interaction modeling.
  • Extreme close-up shots (head filling the image) and close-range interpersonal interactions are typical failure cases.
  • Scale adjustment depends on pelvis visibility—it degrades under full-body occlusion.

Comparisons & Takeaways

  • vs. Human3R: CHROMM extends to multi-view operation without external modules, achieving a 9.6 mm improvement on EMDB-2 and a 57 mm improvement on RICH.
  • vs. HSfM: CHROMM is 29× faster, with an EgoHumans W-MPJPE of 0.51 m vs. 1.04 m (a 50% improvement).
  • vs. HAMSt3R: CHROMM is 8× faster and supports multi-person association without external ReID.
  • Insights: integrating 3D foundation models with human-body priors is an emerging trend; scale alignment is a central engineering challenge; the view-invariant/view-dependent decomposition generalizes to other multi-view estimation tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — First unified multi-person multi-view framework free of external dependencies; scale adjustment and geometric association are novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, monocular and multi-view comparisons, comprehensive ablations, and runtime analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Contributions are clearly articulated; each design decision is validated experimentally.
  • Value: ⭐⭐⭐⭐ — Fast inference and preprocessing-free operation are practically significant for real-world deployment.