HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation¶
Conference: CVPR 2026 · arXiv: 2602.24148 · Code: Unavailable · Area: 3D Vision · Keywords: 3D human reconstruction, video diffusion model, multi-view generation, LoRA fine-tuning, orbit video
TL;DR¶
This paper reformulates single-image 3D human reconstruction as a 360° orbital video generation problem. A video diffusion model (Wan 2.1) is fine-tuned via LoRA using only 500 3D scans to generate 81-frame orbital videos, from which high-quality textured meshes are reconstructed via VGGT and Mesh Carving. The approach requires no pose annotations and surpasses existing methods in multi-view consistency and identity preservation.
Background & Motivation¶
Background: Reconstructing realistic 3D humans from a single image is a long-standing challenge with applications in communication, gaming, and AR/VR. Current approaches include large reconstruction models (e.g., InstantMesh), human-specific models relying on 3D human datasets, and multi-view diffusion-based methods.
Limitations of Prior Work:

- Scarcity of 3D human data: High-quality multi-view/3D datasets require specialized capture studios (dense calibrated cameras, controlled environments), entailing prohibitive cost and limited diversity.
- Inconsistency in image diffusion-based multi-view methods: Methods such as Zero-1-to-3 and SyncDreamer still exhibit noticeable artifacts in cross-view consistency, particularly in facial and hand details.
- Dependence on external priors: Methods such as PSHuman require SMPL body shape and camera pose annotations, limiting applicability to non-full-body scenarios such as half-body portraits and headshots.
- Large training data requirements: Human4DiT requires large-scale multi-dimensional human datasets.
Core Insight: 2D human image data vastly outnumbers 3D datasets. State-of-the-art DiT video diffusion models (e.g., Wan 2.1), trained on billions of real videos, have acquired strong temporal consistency and implicit 3D structural priors — generating an orbital video can thus be treated as multi-view synthesis.
Key Insight: Rather than adapting image diffusion models, this work fine-tunes a video diffusion model to generate orbital videos, leveraging the model's inherent temporal consistency to ensure multi-view geometric coherence while requiring only minimal 3D training data.
Method¶
Overall Architecture¶
A two-stage pipeline:

1. HumanOrbit model: Given a single input image, generates an 81-frame 360° orbital video.
2. 3D reconstruction pipeline: VGGT estimates camera parameters and a point cloud → NormalCrafter estimates normal maps → Poisson surface reconstruction provides an initialization → differentiable-rendering Mesh Carving refines the mesh.
Key Designs¶
- LoRA Fine-Tuning of the Video Diffusion Model
Built upon the Wan 2.1 Image-to-Video 480p model (3D VAE + CLIP image encoder + umT5 text encoder + DiT blocks). The input image is zero-padded along the temporal dimension and encoded by the VAE into conditional latents, which are concatenated with noise and a binary mask before being denoised by the DiT blocks.
Training strategy:

- LoRA (rank=32) is applied exclusively to the DiT blocks; all other parameters are frozen.
- Training data: orbital videos rendered in Blender from only 500 PosedPro 3D scans, covering full-body and shoulder-up compositions, with slight rotation augmentation.
- Final dataset: 3,000 videos, each 81 frames at 640×640 resolution.
- Trained for 10 epochs on a single A100 GPU.
Design Motivation: The video diffusion model has learned priors over complex motion and camera trajectories from billions of real videos. LoRA fine-tuning only needs to teach the model a specific pattern — orbital motion — rather than learning 3D consistency from scratch. This enables high-quality multi-view generation with minimal data.
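The core of this training strategy — freezing the pretrained weights and learning only a rank-32 update — can be illustrated with a minimal NumPy sketch. The layer sizes below are illustrative, not Wan 2.1's actual DiT dimensions, and the `alpha` scaling follows the common LoRA convention rather than anything stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_linear(x, W, A, B, alpha=32.0):
    """Forward pass of a linear layer with a LoRA update.

    W    : frozen pretrained weight, shape (d_out, d_in)
    A, B : trainable low-rank factors, shapes (r, d_in) and (d_out, r)
    alpha: scaling constant; the update is scaled by alpha / r
    """
    r = A.shape[0]
    delta = (B @ A) * (alpha / r)  # rank-r weight update, never materialized during training
    return x @ (W + delta).T

d_in, d_out, r = 1024, 1024, 32
W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable down-projection
B = np.zeros((d_out, r))                       # zero-init: training starts from the pretrained model

x = rng.standard_normal((4, d_in))
y = lora_linear(x, W, A, B)
# With B = 0 the layer reproduces the frozen model exactly, which is why
# LoRA fine-tuning preserves pretrained behavior at initialization.
assert np.allclose(y, x @ W.T)
```

The zero-initialized `B` factor is what makes this scheme safe on small datasets: the model starts exactly at the pretrained video prior and only gradually specializes toward orbital motion.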
- Pose-Free Reconstruction Pipeline
- Camera estimation: VGGT (a feedforward 3D scene attribute estimation network) directly predicts camera parameters \(\Pi = \{\pi_i\}_{i=1}^K\) and depth-projected point clouds from the generated multi-view images, without requiring predefined camera trajectories.
- Normal estimation: NormalCrafter is used to obtain temporally consistent normal maps.
- Mesh initialization: Poisson surface reconstruction is applied to the VGGT point cloud (rather than relying on SMPL), preserving generalizability to non-full-body scenarios.
- Mesh Carving optimization: Iterative optimization via differentiable rendering with the loss function:
\(\mathcal{L}_{recon} = \mathcal{L}_{mask} + \mathcal{L}_{normal} = \sum_i \|M_i - \hat{M}_i\|_2^2 + \sum_i M_i \odot \|N_i - \hat{N}_i\|_2^2\)
Following geometry optimization, per-vertex color is further optimized: \(\mathcal{L}_{color} = \sum_i M_i \odot \|I_i - \hat{I}_i\|_2\)
Design Motivation: Conventional methods require predefined camera poses or SMPL fitting, limiting their applicability. This work instead has VGGT, a feedforward network that replaces traditional SfM, estimate all parameters directly from the generated video, demonstrating that the 3D consistency of the generated frames is sufficient for reliable camera estimation.
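The Mesh Carving objectives above can be written down as a small NumPy sketch. The array shapes are assumptions for illustration; in the actual pipeline the rendered masks, normals, and images come from a differentiable rasterizer, which is not shown here:

```python
import numpy as np

def recon_loss(M, M_hat, N, N_hat):
    """Mask + normal reconstruction loss summed over K rendered views.

    M, M_hat : (K, H, W) target and rendered silhouette masks
    N, N_hat : (K, H, W, 3) target and rendered normal maps
    The normal term is masked so only foreground pixels contribute.
    """
    l_mask = ((M - M_hat) ** 2).sum()
    l_normal = (M[..., None] * (N - N_hat) ** 2).sum()
    return l_mask + l_normal

def color_loss(M, I, I_hat):
    """Masked per-pixel L2 photometric loss for per-vertex color fitting."""
    return (M * np.linalg.norm(I - I_hat, axis=-1)).sum()

# Tiny smoke test: identical renders give zero loss, mismatches do not.
M = np.ones((2, 4, 4))
N = np.zeros((2, 4, 4, 3))
I = np.full((2, 4, 4, 3), 0.5)
zero = recon_loss(M, M, N, N) + color_loss(M, I, I)
nonzero = recon_loss(M, 0 * M, N, N)
```

In the paper's schedule, geometry is optimized first with the mask and normal terms, and per-vertex color is fitted afterwards with the photometric term, so the two losses never compete.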
- Data-Efficient Design
The key lies in the pretraining priors of the video diffusion model: Wan 2.1, trained on billions of videos, already understands the motion pattern of "rotating around an object." LoRA requires only a small number of parameters (rank=32) to specialize this general capability into a precise 360° human orbital trajectory. Five hundred 3D scans yielding 3,000 training videos prove sufficient.
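A quick back-of-the-envelope count makes the data-efficiency argument concrete. The hidden size below is a hypothetical stand-in for one d × d projection layer, not a figure from the paper:

```python
# Parameter count for a rank-32 LoRA update on a single d x d projection.
d, r = 1536, 32
full = d * d          # parameters touched by full fine-tuning of the layer
lora = r * d + d * r  # A (r x d) plus B (d x r)
ratio = lora / full   # 2r / d ~= 4% of the layer's weights
print(full, lora, ratio)
```

With only a few percent of the weights trainable per layer, 500 scans' worth of rendered videos are enough to steer the model without overwriting its pretrained priors.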
Loss & Training¶
- Video generation: Standard diffusion training loss, LoRA rank=32, 10 epochs, single A100.
- Mesh reconstruction: \(\mathcal{L}_{recon} = \mathcal{L}_{mask} + \mathcal{L}_{normal}\), followed by \(\mathcal{L}_{color}\) for texture optimization.
- No body shape annotations, camera pose annotations, or face recognition modules are required.
Key Experimental Results¶
Main Results¶
| Dataset | Metric | HumanOrbit | PSHuman | SV3D | MV-Adapter |
|---|---|---|---|---|---|
| CCP (full-body) | CLIP Score ↑ | 0.8317 | 0.8282 | 0.7888 | 0.7735 |
| CCP (full-body) | MEt3R ↓ | 0.3175 | 0.3576 | 0.2966 | 0.3721 |
| CCP (full-body) | MVReward ↑ | 0.8035 | 0.6814 | 0.2378 | 0.6795 |
| CelebA (headshot) | CLIP Score ↑ | 0.7073 | - | 0.6582 | 0.6729 |
| CelebA (headshot) | MVReward ↑ | 0.4947 | - | 0.4918 | 0.4727 |
Ablation Study¶
| Configuration | Observation | Notes |
|---|---|---|
| VGGT vs. COLMAP | VGGT: dense point cloud + continuous trajectory; COLMAP: sparse point cloud + broken trajectory | COLMAP leads to missing left arm in reconstruction |
| Non-human objects (chair/dog) | Orbital video generated successfully | LoRA fine-tuning preserves pretrained generalization |
| Fixed-elevation orbit | Top of head/chin regions not visible | Richer camera trajectories should be explored |
Key Findings¶
- HumanOrbit substantially outperforms PSHuman on the MVReward metric (most aligned with human preference): 0.8035 vs. 0.6814, indicating notably superior generation quality and consistency.
- SV3D tends to produce blurred contours and distorted faces; PSHuman lacks fine detail; MV-Adapter occasionally exhibits topological errors (spurious shoes).
- VGGT reliably recovers a circular camera trajectory from the generated video, indirectly validating the 3D consistency of the generated frames.
- The method generalizes to headshot scenarios (where PSHuman fails due to SMPL dependence), demonstrating broader applicability.
- The model also works on non-human objects (chairs, dogs), indicating that a general orbital motion pattern has been learned.
Highlights & Insights¶
- Elegant problem reformulation: Recasting multi-view generation from "image diffusion + 3D constraints" to "video diffusion + orbital motion" naturally yields temporal consistency.
- Extreme data efficiency: Only 500 3D scans suffice to train a model that outperforms methods requiring far larger 3D datasets, owing to the strong priors of the pretrained video model.
- Pose-free design: No external pose annotations (SMPL or predefined cameras) are needed; the model freely generates an orbital video, from which SfM recovers all parameters, avoiding generation–annotation misalignment.
- Minimal architectural modification: The entire method adds only LoRA parameters with negligible architectural changes.
Limitations & Future Work¶
- Fixed elevation: The orbital trajectory lies on a single horizontal plane, leaving the top of the head and chin regions unobserved. Multi-elevation or helical trajectories could be explored.
- Slow inference: Generation of 81 orbital frames takes approximately 17 minutes due to the large video diffusion backbone. Preliminary attempts to reduce the frame count degrade quality; more efficient inference strategies remain to be investigated.
- Dependence on VGGT robustness: If the generated video exhibits poor consistency, VGGT camera estimation will also fail.
- Comparisons with recent methods such as MEAT and Pippo are absent, as their code is not publicly available.
Related Work & Insights¶
- PSHuman: Multi-view diffusion with cross-scale design and SMPL-initialized mesh carving; the most direct baseline.
- SV3D: Stability AI's orbital video diffusion model (21 frames), but insufficient in consistency for human subjects.
- VGGT: Feedforward 3D scene attribute estimation, serving as a replacement for traditional SfM on generated videos.
- Wan 2.1: DiT video diffusion model; this work demonstrates that LoRA fine-tuning alone can specialize it as a multi-view generation tool.
- Insight: Video diffusion models as carriers of implicit 3D priors may represent a new paradigm for single-image 3D reconstruction. The advantage of LoRA in preserving pretrained knowledge enables effective fine-tuning with small datasets.
Rating¶
- Novelty: ⭐⭐⭐⭐ The paradigm shift from video diffusion to multi-view generation is novel; the pose-free design is elegant.
- Experimental Thoroughness: ⭐⭐⭐ Multi-view generation evaluation is comprehensive, but 3D reconstruction assessment is limited to qualitative comparisons without quantitative metrics; comparisons with some recent methods are missing.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear, the method is concisely presented, and experimental results are intuitively illustrated.
- Value: ⭐⭐⭐⭐ A highly data-efficient solution for single-image 3D human reconstruction with important implications for 3D data generation.