Orientation Matters: Making 3D Generative Models Orientation-Aligned¶
Conference: NeurIPS 2025 · arXiv: 2506.08640 · Code: Project Page · Area: 3D Vision · Keywords: 3D generation, orientation alignment, dataset construction, Objaverse, pose estimation
TL;DR¶
This paper introduces the task of orientation-aligned 3D object generation, constructs the Objaverse-OA dataset comprising 14,832 orientation-aligned 3D models across 1,008 categories, fine-tunes two mainstream 3D generation frameworks (Trellis and Wonder3D) to achieve orientation-aligned object generation, and demonstrates two downstream applications: zero-shot orientation estimation and arrow-guided rotation manipulation.
Background & Motivation¶
Humans can intuitively perceive the shape and orientation of objects from a single image (object constancy), yet existing 3D generative models produce objects with inconsistent orientations—chairs may face arbitrary directions, cups may appear tilted, and vehicles may be misaligned. This stems from the inconsistency of 3D model orientations in training data such as Objaverse.
Consequences of orientation inconsistency:
1. Direct use in analysis-by-synthesis pose estimation is infeasible.
2. Placing objects in AR/VR requires tedious manual orientation adjustment.
3. Downstream applications (e.g., robotic manipulation, scene editing) require a consistent canonical coordinate system.
Existing alternatives and their limitations:
- Post-processing alignment: generate first, then estimate orientation via PCA, a VLM, or Orient Anything. However, PCA cannot disambiguate principal-axis directions, VLMs struggle with objects lacking salient frontal features, and Orient Anything has limited accuracy.
- Category-level pose estimation: requires large-scale, manually annotated, orientation-aligned datasets and is restricted to a limited number of categories (ImageNet3D covers only 200).
The core idea of this paper: directly fine-tune 3D generative models to produce orientation-aligned objects—which requires a sufficiently diverse orientation-aligned dataset as a foundation. To this end, Objaverse-OA is constructed (14,832 models × 1,008 categories), far exceeding the scale of existing datasets.
Method¶
Overall Architecture¶
Objaverse-OA dataset construction (VLM preprocessing + human correction) → Fine-tuning of 3D generative models (Trellis-OA / Wonder3D-OA) → Downstream applications (zero-shot orientation estimation / arrow-guided rotation manipulation)
Key Designs¶
- Objaverse-OA Dataset Construction
VLM Preprocessing: Starting from the 46,219 models in Objaverse-LVIS, four orthogonal views (front/back/left/right) are rendered per model, and Gemini-2.0 identifies the frontal view for alignment. 20,664 models are successfully aligned this way; however, the VLM exhibits three typical failure modes:
- Stick-like objects (e.g., forks, keys): roll/pitch misalignment
- Narrow/thin objects (e.g., fish, bicycles): the VLM relies solely on frontal features without lateral reasoning
- Ambiguously fronted objects (e.g., teapots, fire extinguishers): inherent ambiguity in the orientation definition
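The VLM preprocessing step can be sketched as follows. The rendering and Gemini query are stubbed out (the helpers are hypothetical); what is shown is the yaw correction implied by the VLM's answer, assuming the four orthogonal views are rendered at fixed yaw offsets:

```python
# Sketch of the VLM preprocessing logic. Rendering and the Gemini query
# are omitted (hypothetical helpers); only the yaw-correction arithmetic
# is shown, under the assumption that the four orthogonal views are
# rendered at the yaw angles below.

# Yaw (degrees) at which each orthogonal view is rendered.
VIEW_YAW = {"front": 0, "right": 90, "back": 180, "left": 270}

def yaw_correction(frontal_view: str) -> int:
    """Yaw rotation (degrees, about the up axis) that brings the
    VLM-identified frontal view to the canonical front (yaw 0)."""
    if frontal_view not in VIEW_YAW:
        raise ValueError(f"unknown view: {frontal_view}")
    return (-VIEW_YAW[frontal_view]) % 360

# If the VLM picks the "left" rendering as the object's front,
# the mesh must be rotated by 90 degrees to become aligned.
correction = yaw_correction("left")  # 90
```

Note that a yaw-only correction is exactly why stick-like objects fail: their misalignment lives in roll/pitch, which this scheme cannot express.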
Human Correction: Objects in approximately 600 categories require manual correction using Blender. For ambiguous objects, orientation definitions from ImageNet3D are adopted as reference. Low-quality geometry and multi-object scenes are filtered out.
- Trellis-OA (Fine-tuning a 3D VAE Generative Model)
Trellis consists of three modules: a sparse structure generator \(\mathcal{G}_S\), a structured latent code generator \(\mathcal{G}_L\), and a 3D decoder \(\mathcal{D}\).
Key finding: fine-tuning only the sparse structure generator \(\mathcal{G}_S\) is sufficient to achieve orientation alignment. This is because the poses generated by Trellis are randomly sampled from four orthogonal directions, and the aligned pose distribution falls within this range—thus \(\mathcal{G}_L\) and \(\mathcal{D}\) require no additional fine-tuning.
Training: batch size 64, 30,000 steps, approximately 10 hours on 8×A100 GPUs.
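The selective fine-tuning recipe — train \(\mathcal{G}_S\), freeze \(\mathcal{G}_L\) and \(\mathcal{D}\) — can be illustrated with a minimal parameter-selection sketch (the module and parameter names are placeholders, not the actual Trellis internals):

```python
# Minimal sketch of selective fine-tuning: collect only the parameters
# of the modules to be trained, freezing the rest. Module and parameter
# names are illustrative placeholders, not Trellis's real internals.

def trainable_params(modules: dict, train_only: set) -> list:
    """Names of parameters to optimize; everything else stays frozen."""
    keep = []
    for name, params in modules.items():
        if name in train_only:
            keep.extend(f"{name}.{p}" for p in params)
    return keep

trellis = {
    "G_S": ["layer0.weight", "layer1.weight"],  # sparse structure generator
    "G_L": ["layer0.weight"],                   # structured latent generator
    "D":   ["conv.weight"],                     # 3D decoder
}

# Only G_S is fine-tuned for orientation alignment.
params = trainable_params(trellis, {"G_S"})
```

In a real training loop this corresponds to setting `requires_grad=False` on the frozen modules and passing only the selected parameters to the optimizer.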
- Wonder3D-OA (Fine-tuning a Multi-view Diffusion Model)
Core modifications:
- Fixed camera configuration: renders 6 canonical views (front/front-left/front-right/left/right/back), replacing the original input-view-dependent setup.
- LoRA fine-tuning: serves as a lightweight adapter that preserves the original 3D prior.
- Pixel injector: injects the input image as a 7th view into the 3D self-attention (inspired by ImageDream), resolving the failure of the original feature alignment under the fixed camera configuration.
- LGM replacing NeuS: uses 4 views (front/left/right/back) to directly generate 3DGS, replacing the time-consuming optimization-based 3D lifting.
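The pixel-injector tensor manipulation can be sketched in NumPy. The encoder and the attention itself are stand-ins; what is shown is the shape change from \((b, 6, c, h, w)\) to \((b, 7, c, h, w)\) when the encoded input image is appended as an extra view:

```python
import numpy as np

# Sketch of the pixel-injector idea: the encoded input image is appended
# as an extra "view" so that 3D self-attention sees 7 views instead of 6.
# Shapes follow the paper's (b, v, c, h, w) notation; the image encoder
# and the attention layer are stand-ins.

b, c, h, w = 2, 8, 16, 16
view_feats  = np.random.randn(b, 6, c, h, w)   # 6 canonical views
input_feats = np.random.randn(b, 1, c, h, w)   # encoded input image

# Inject the input image as the 7th view along the view axis.
feats7 = np.concatenate([view_feats, input_feats], axis=1)

# Flatten views x spatial positions into one token sequence, as a
# 3D self-attention over all views would consume it.
tokens = feats7.transpose(0, 1, 3, 4, 2).reshape(b, 7 * h * w, c)
```

Attending over the concatenated sequence lets each canonical view see the input image's pixels directly, which is what restores feature alignment under the fixed camera setup.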
- Zero-Shot Orientation Estimation
A generated orientation-aligned 3D model serves as a template → FoundationPose performs multi-view rendering and pose refinement → DINOv2 feature matching selects the best viewpoint. The key point is that no CAD model or depth map is required; the generative model supplies the template instead.
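The viewpoint-selection step at the end of this pipeline amounts to nearest-neighbor matching in feature space. A minimal sketch, with the feature extractor (DINOv2 in the paper) stubbed out as precomputed vectors:

```python
import numpy as np

# Sketch of template-based viewpoint selection: render the generated,
# orientation-aligned model from candidate poses, embed each rendering
# and the query image with a feature extractor (DINOv2 in the paper,
# represented here by precomputed vectors), and keep the pose whose
# features best match the query.

def best_viewpoint(query_feat: np.ndarray, view_feats: np.ndarray) -> int:
    """Index of the rendered view with the highest cosine similarity
    to the query image's feature vector."""
    q = query_feat / np.linalg.norm(query_feat)
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    return int(np.argmax(v @ q))

# Toy example: the query is closest to the second candidate view.
idx = best_viewpoint(np.array([0.1, 0.9, 0.0]), np.eye(3))  # 1
```

Because the template is already orientation-aligned, the selected viewpoint directly yields the object's orientation in the query image.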
Loss & Training¶
- Trellis-OA: directly fine-tunes the sparse structure generator with end-to-end training.
- Wonder3D-OA: LoRA fine-tuning with a pixel injector that modifies the 3D attention dimensions from \((b_z, 6, c, h, w)\) to \((b_z, 7, c, h, w)\).
- The normal map generation branch is omitted to simplify the pipeline.
Key Experimental Results¶
Main Results: Orientation-Aligned Generation Quality (Wonder3D Backbone)¶
| Method | GSO CD↓ | GSO LPIPS↓ | GSO CLIP↑ | Toys4k CD↓ | Toys4k CLIP↑ |
|---|---|---|---|---|---|
| Wonder3D | 0.0894 | 0.2799 | 76.37 | 0.0932 | 87.10 |
| + PCA | 0.0788 | 0.2554 | 77.80 | 0.0858 | 87.58 |
| + VLM (Gemini) | 0.0850 | 0.2752 | 76.30 | 0.0880 | 87.53 |
| + Orient Anything | 0.1015 | 0.2600 | 77.50 | 0.1079 | 88.12 |
| Wonder3D-OA | 0.0564 | 0.2270 | 80.30 | 0.0548 | 92.09 |
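For reference, the "CD" column is the symmetric Chamfer Distance between predicted and ground-truth point sets; a minimal NumPy version (brute-force nearest neighbors, fine for small clouds):

```python
import numpy as np

# Minimal sketch of the symmetric Chamfer Distance reported as "CD":
# the mean nearest-neighbor distance from each set to the other.
# Brute-force O(n*m) pairwise distances; fine for small point clouds.

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets p (n,3) and q (m,3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (n, m)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Identical point sets have zero Chamfer Distance.
pts = np.random.rand(100, 3)
assert chamfer_distance(pts, pts) == 0.0
```

Orientation alignment matters for this metric: a well-shaped but misoriented generation incurs a large CD, which is why post-processing baselines lag behind direct orientation-aligned generation in the table above.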
Trellis Backbone¶
| Method | GSO CD↓ | GSO CLIP↑ | Toys4k CD↓ | Toys4k CLIP↑ |
|---|---|---|---|---|
| Trellis + VLM | 0.0421 | 89.97 | 0.0564 | 95.19 |
| Trellis + OA (small) | 0.0448 | 82.46 | 0.0465 | 93.74 |
| Trellis-OA | 0.0407 | 88.41 | 0.0393 | 95.71 |
Zero-Shot Orientation Estimation¶
| Method | Toys4k Acc@30↑ | Toys4k Abs↓ | Stick-like Acc@30↑ |
|---|---|---|---|
| FSDetView (Few-shot) | 20.90 | 91.66 | 10.29 |
| Orient Anything (ViT-L) | 63.18 | 36.37 | 9.8 |
| Ours (ViT-L) | 52.87 | 46.76 | 62.25 |
Ablation Study¶
| Training Data | Toys4k CD↓ | Toys4k CLIP↑ |
|---|---|---|
| 100 categories + 5,720 objects (small) | 0.0465 | 93.74 |
| 1,008 categories + 14,832 objects (full) | 0.0393 | 95.71 |
Key Findings¶
- Directly fine-tuning the generative model significantly outperforms post-processing alignment approaches (CD reduced by 30–40%).
- Fine-tuning only the sparse structure generator in Trellis is sufficient—indicating that orientation information is primarily encoded in the structure.
- For stick-like objects (forks, keys, etc.), Orient Anything nearly fails entirely (Acc@30 of only 9.8%), whereas the proposed method achieves 62.25%.
- Category diversity is critical—expanding from 100 to 1,008 categories yields substantial performance gains.
Highlights & Insights¶
- A new task is defined—orientation-aligned 3D generation, bridging the gap between 3D generation and real-world applications.
- A pragmatic dataset construction strategy: coarse VLM filtering followed by human refinement, balancing efficiency and quality.
- The finding that "only the sparse structure generator needs fine-tuning" reveals where orientation information is encoded within the 3D VAE.
- The arrow manipulation application intuitively demonstrates the user experience improvements enabled by orientation alignment.
Limitations & Future Work¶
- Zero-shot orientation estimation underperforms Orient Anything on common objects (approximately 10 points lower on Toys4k Acc@30); the method's strength lies in long-tail and stick-like objects.
- Dataset construction still requires substantial human effort (approximately 600 categories need manual correction), leaving room for improved automation.
- Ambiguity in orientation definition is an inherent challenge (e.g., what constitutes the "front" of a cup), and definitions may differ across datasets.
- No strategy is discussed for handling symmetric objects.
Related Work & Insights¶
- The analysis of VLM failure modes in orientation recognition (stick-like objects, thin objects, ambiguous objects) offers practical guidance for data annotation.
- The choice of fine-tuning strategy (full fine-tuning vs. LoRA) depends on how the model architecture encodes orientation information.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Clear definition of a new task and a valuable dataset, though the methodology is primarily based on fine-tuning.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two backbones, two held-out datasets, real-world scenarios, and downstream applications.
- Writing Quality: ⭐⭐⭐⭐ — Task motivation is clear and visualizations are rich.
- Value: ⭐⭐⭐⭐ — The Objaverse-OA dataset itself has lasting value, and downstream application scenarios are well-defined.