
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation

Conference: CVPR 2026 · arXiv: 2602.24148 · Code: Unavailable · Area: 3D Vision · Keywords: 3D human reconstruction, video diffusion model, multi-view generation, LoRA fine-tuning, orbit video

TL;DR

This paper reformulates single-image 3D human reconstruction as a 360° orbital video generation problem. A video diffusion model (Wan 2.1) is fine-tuned via LoRA using only 500 3D scans to generate 81-frame orbital videos, from which high-quality textured meshes are reconstructed via VGGT and Mesh Carving. The approach requires no pose annotations and surpasses existing methods in multi-view consistency and identity preservation.

Background & Motivation

Background: Reconstructing realistic 3D humans from a single image is a long-standing challenge with applications in communication, gaming, and AR/VR. Current approaches include large reconstruction models (e.g., InstantMesh), human-specific models relying on 3D human datasets, and multi-view diffusion-based methods.

Limitations of Prior Work:

  • Scarcity of 3D human data: High-quality multi-view/3D datasets require specialized capture studios (dense calibrated cameras, controlled environments), entailing prohibitive cost and limited diversity.
  • Inconsistency in image diffusion-based multi-view methods: Methods such as Zero-1-to-3 and SyncDreamer still exhibit noticeable cross-view inconsistency artifacts, particularly in facial and hand details.
  • Dependence on external priors: Methods such as PSHuman require SMPL body shape and camera pose annotations, limiting applicability to non-full-body scenarios such as half-body portraits and headshots.
  • Large training data requirements: Human4DiT requires large-scale multi-dimensional human datasets.

Core Insight: 2D human image data vastly outnumbers 3D datasets. State-of-the-art DiT video diffusion models (e.g., Wan 2.1), trained on billions of real videos, have acquired strong temporal consistency and implicit 3D structural priors — generating an orbital video can thus be treated as multi-view synthesis.

Key Insight: Rather than adapting image diffusion models, this work fine-tunes a video diffusion model to generate orbital videos, leveraging the model's inherent temporal consistency to ensure multi-view geometric coherence while requiring only minimal 3D training data.

Method

Overall Architecture

A two-stage pipeline (sketched below):

  1. HumanOrbit model: Given a single input image, generates an 81-frame 360° orbital video.
  2. 3D reconstruction pipeline: VGGT estimates camera parameters and a point cloud → NormalCrafter estimates normal maps → Poisson surface reconstruction initializes the mesh → Mesh Carving refines it via differentiable rendering.
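The paper releases no code, so the following Python sketch only shows how the two stages connect; every function name is an illustrative placeholder, not a released API.

```python
# Hypothetical end-to-end sketch of the two-stage pipeline; function names
# are placeholders standing in for the components named above.

def humanorbit_pipeline(image):
    # Stage 1: one input image -> 81-frame 360° orbital video.
    frames = orbit_video_model(image, num_frames=81)     # LoRA-tuned Wan 2.1

    # Stage 2: pose-free reconstruction from the generated frames.
    cameras, points = vggt_predict(frames)               # feedforward cameras + point cloud
    normals = normalcrafter(frames)                      # temporally consistent normal maps
    mesh = poisson_reconstruct(points)                   # initialization, no SMPL prior
    mesh = mesh_carving(mesh, frames, cameras, normals)  # differentiable-rendering refinement
    return mesh
```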

Key Designs

  1. LoRA Fine-Tuning of the Video Diffusion Model

Built upon the Wan 2.1 Image-to-Video 480p model (3D VAE + CLIP image encoder + umT5 text encoder + DiT blocks). The input image is zero-padded along the temporal dimension and encoded by the VAE into conditional latents, which are concatenated with noise and a binary mask before being denoised by the DiT blocks.
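As a rough illustration of this conditioning scheme, the sketch below assembles the DiT input in PyTorch; all shapes (including the assumed 16-channel latent with 4× temporal and 8× spatial compression) are assumptions, not Wan 2.1's exact interface.

```python
import torch

B, C, T, H, W = 1, 3, 81, 640, 640                  # batch, RGB, frames, resolution

image = torch.randn(B, C, 1, H, W)                  # the single input image as a 1-frame clip
video = torch.cat([image, torch.zeros(B, C, T - 1, H, W)], dim=2)  # zero-pad along time

# cond_latents = vae.encode(video)                  # 3D VAE in the real model
cond_latents = torch.randn(B, 16, 21, 80, 80)       # assumed latent shape for illustration

noise = torch.randn_like(cond_latents)
mask = torch.zeros_like(cond_latents[:, :1])        # binary mask over latent frames
mask[:, :, 0] = 1.0                                 # 1 marks the conditioning (first) frame

dit_input = torch.cat([noise, cond_latents, mask], dim=1)  # channel-wise concat for the DiT
```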

Training strategy:

  • LoRA (rank=32) is applied exclusively to the DiT blocks; all other parameters are frozen.
  • Training data: orbital videos rendered in Blender from only 500 PosedPro 3D scans, covering full-body and shoulder-up compositions, with slight rotation augmentation.
  • Final dataset: 3,000 videos, each 81 frames at 640×640 resolution.
  • Trained for 10 epochs on a single A100 GPU.
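A hedged sketch of this setup with the `peft` library follows; the target module names are typical attention projections and are assumptions, since the paper only states that LoRA is applied to the DiT blocks.

```python
from peft import LoraConfig, get_peft_model

# dit_model: the loaded Wan 2.1 DiT backbone (loading not shown).
lora_config = LoraConfig(
    r=32,                                                 # rank from the paper
    lora_alpha=32,                                        # assumed scaling
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed projection names
)
model = get_peft_model(dit_model, lora_config)  # base weights stay frozen by default
model.print_trainable_parameters()              # only the LoRA adapters train
```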

Design Motivation: The video diffusion model has learned priors over complex motion and camera trajectories from billions of real videos. LoRA fine-tuning only needs to teach the model a specific pattern — orbital motion — rather than learning 3D consistency from scratch. This enables high-quality multi-view generation with minimal data.

  2. Pose-Free Reconstruction Pipeline

    • Camera estimation: VGGT (a feedforward 3D scene attribute estimation network) directly predicts camera parameters \(\Pi = \{\pi_i\}_{i=1}^K\) and depth-projected point clouds from the generated multi-view images, without requiring predefined camera trajectories.
    • Normal estimation: NormalCrafter is used to obtain temporally consistent normal maps.
    • Mesh initialization: Poisson surface reconstruction is applied to the VGGT point cloud (rather than relying on SMPL), preserving generalizability to non-full-body scenarios.
    • Mesh Carving optimization: Iterative optimization via differentiable rendering with the loss function:

    \(\mathcal{L}_{recon} = \mathcal{L}_{mask} + \mathcal{L}_{normal} = \sum_i \|M_i - \hat{M}_i\|_2^2 + \sum_i M_i \odot \|N_i - \hat{N}_i\|_2^2\)

Following geometry optimization, per-vertex color is further optimized: \(\mathcal{L}_{color} = \sum_i M_i \odot \|I_i - \hat{I}_i\|_2\)
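A direct PyTorch transcription of these objectives, assuming \(M_i, N_i, I_i\) come from the generated views and hatted quantities are differentiably rendered:

```python
import torch

def recon_loss(masks, masks_hat, normals, normals_hat):
    """L_recon = L_mask + L_normal over K views.
    masks/normals: (K, 1/3, H, W) from the generated video; *_hat: rendered."""
    l_mask = ((masks - masks_hat) ** 2).sum()                # squared L2 silhouette error
    l_normal = (masks * (normals - normals_hat) ** 2).sum()  # foreground-masked normal error
    return l_mask + l_normal

def color_loss(masks, images, images_hat):
    """L_color: foreground-masked L2 photometric loss for per-vertex color."""
    diff = torch.linalg.norm(images - images_hat, dim=1, keepdim=True)
    return (masks * diff).sum()
```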

Design Motivation: Conventional methods require predefined camera poses or SMPL fitting, limiting their applicability. This work instead estimates all parameters directly from the generated video with a feedforward network (VGGT, in place of traditional SfM), demonstrating that the generated video is 3D-consistent enough to support reliable camera estimation.
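As a concrete (assumed) instantiation of the initialization step, Poisson reconstruction on the VGGT point cloud can be run with Open3D; the paper names only the algorithm, so the library choice, the normal-orientation step, and `depth=9` are all choices of this sketch.

```python
import numpy as np
import open3d as o3d

points = np.load("vggt_points.npy")                # (N, 3) VGGT points; hypothetical file

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.estimate_normals()                             # Poisson needs oriented normals
pcd.orient_normals_consistent_tangent_plane(30)    # make orientations coherent

mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)                                  # depth=9 is an assumed setting
```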

  3. Data-Efficient Design

The key lies in the pretraining priors of the video diffusion model: Wan 2.1, trained on billions of videos, already understands the motion pattern of "rotating around an object." LoRA requires only a small number of parameters (rank=32) to specialize this general capability into a precise 360° human orbital trajectory. Five hundred 3D scans yielding 3,000 training videos prove sufficient.
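A back-of-envelope count makes the "small number of parameters" concrete; the DiT width and the set of adapted projections below are assumptions for illustration.

```python
d, r = 3072, 32             # assumed DiT hidden width; rank from the paper
per_matrix = 2 * r * d      # low-rank factors A (d x r) and B (r x d) per d x d weight
per_block = 4 * per_matrix  # q, k, v, out projections (assumed LoRA targets)
print(f"{per_matrix:,} params per projection, {per_block:,} per attention block")
# -> 196,608 params per projection, 786,432 per attention block
```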

Loss & Training

  • Video generation: Standard diffusion training loss, LoRA rank=32, 10 epochs, single A100.
  • Mesh reconstruction: \(\mathcal{L}_{recon} = \mathcal{L}_{mask} + \mathcal{L}_{normal}\), followed by \(\mathcal{L}_{color}\) for texture optimization.
  • No body shape annotations, camera pose annotations, or face recognition modules are required.

Key Experimental Results

Main Results

| Dataset | Metric | HumanOrbit | PSHuman | SV3D | MV-Adapter |
|---|---|---|---|---|---|
| CCP (full-body) | CLIP Score ↑ | 0.8317 | 0.8282 | 0.7888 | 0.7735 |
| CCP (full-body) | MEt3R ↓ | 0.3175 | 0.3576 | 0.2966 | 0.3721 |
| CCP (full-body) | MVReward ↑ | 0.8035 | 0.6814 | 0.2378 | 0.6795 |
| CelebA (headshot) | CLIP Score ↑ | 0.7073 | – | 0.6582 | 0.6729 |
| CelebA (headshot) | MVReward ↑ | 0.4947 | – | 0.4918 | 0.4727 |

Ablation Study

| Configuration | Observation | Notes |
|---|---|---|
| VGGT vs. COLMAP | VGGT yields a dense point cloud and a continuous trajectory; COLMAP yields a sparse point cloud and a broken trajectory | COLMAP leads to a missing left arm in the reconstruction |
| Non-human objects (chair/dog) | Orbital video generated successfully | LoRA fine-tuning preserves pretrained generalization |
| Fixed-elevation orbit | Top of head and chin regions not visible | Richer camera trajectories should be explored |

Key Findings

  • HumanOrbit substantially outperforms PSHuman on the MVReward metric (most aligned with human preference): 0.8035 vs. 0.6814, indicating notably superior generation quality and consistency.
  • SV3D tends to produce blurred contours and distorted faces; PSHuman lacks fine detail; MV-Adapter occasionally exhibits topological errors (spurious shoes).
  • VGGT reliably recovers a circular camera trajectory from the generated video, indirectly validating the 3D consistency of the generated frames.
  • The method generalizes to headshot scenarios (where PSHuman fails due to SMPL dependence), demonstrating broader applicability.
  • The model also works on non-human objects (chairs, dogs), indicating that a general orbital motion pattern has been learned.

Highlights & Insights

  • Elegant problem reformulation: Recasting multi-view generation from "image diffusion + 3D constraints" to "video diffusion + orbital motion" naturally yields temporal consistency.
  • Extreme data efficiency: Only 500 3D scans suffice to train a model that outperforms methods requiring far larger 3D datasets, owing to the strong priors of the pretrained video model.
  • Pose-free design: No external pose annotations (SMPL or predefined cameras) are needed; the model freely generates an orbital video, from which VGGT (standing in for traditional SfM) recovers all camera parameters, avoiding generation–annotation misalignment.
  • Minimal architectural modification: The entire method adds only LoRA parameters with negligible architectural changes.

Limitations & Future Work

  • Fixed elevation: The orbital trajectory lies on a single horizontal plane, leaving the top of the head and chin regions unobserved. Multi-elevation or helical trajectories could be explored.
  • Slow inference: Generation of 81 orbital frames takes approximately 17 minutes due to the large video diffusion backbone. Preliminary attempts to reduce the frame count degrade quality; more efficient inference strategies remain to be investigated.
  • Dependence on VGGT robustness: If the generated video exhibits poor consistency, VGGT camera estimation will also fail.
  • Comparisons with recent methods such as MEAT and Pippo are absent, as their code is not publicly available.

Key References

  • PSHuman: Multi-view diffusion with cross-scale design and SMPL-initialized mesh carving; the most direct baseline.
  • SV3D: Stability AI's orbital video diffusion model (21 frames); its multi-view consistency proves insufficient for human subjects.
  • VGGT: Feedforward 3D scene attribute estimation, serving as a replacement for traditional SfM on generated videos.
  • Wan 2.1: DiT video diffusion model; this work demonstrates that LoRA fine-tuning alone can specialize it as a multi-view generation tool.
  • Insight: Video diffusion models as carriers of implicit 3D priors may represent a new paradigm for single-image 3D reconstruction. The advantage of LoRA in preserving pretrained knowledge enables effective fine-tuning with small datasets.

Rating

  • Novelty: ⭐⭐⭐⭐ The paradigm shift from video diffusion to multi-view generation is novel; the pose-free design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐ Multi-view generation evaluation is comprehensive, but 3D reconstruction assessment is limited to qualitative comparisons without quantitative metrics; comparisons with some recent methods are missing.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, the method is concisely presented, and experimental results are intuitively illustrated.
  • Value: ⭐⭐⭐⭐ A highly data-efficient solution for single-image 3D human reconstruction with important implications for 3D data generation.