# Orient Anything V2: Unifying Orientation and Rotation Understanding
Conference: NeurIPS 2025 · arXiv: 2601.05573 · Code: https://orient-anythingv2.github.io/ · Area: Computer Vision / 3D Understanding · Keywords: orientation estimation, rotational symmetry, 6DoF pose estimation, synthetic data, foundation model
## TL;DR
Orient Anything V2 unifies 3D object orientation and rotation understanding via a scalable synthetic data engine, a symmetry-aware periodic distribution objective, and a multi-frame architecture, achieving zero-shot state-of-the-art performance across three tasks: orientation estimation, 6DoF pose estimation, and symmetry recognition.
## Background & Motivation
Estimating the 3D orientation of objects from images is a fundamental computer vision task with critical applications in robotic manipulation, autonomous driving, and AR/VR.
Limitations of Orient Anything V1:
Ignoring rotational symmetry: V1 defines orientation based on a unique canonical front face; for symmetric objects with multiple valid fronts (e.g., cups, chairs), it can only classify them as "front-free."
No support for relative rotation estimation: Inferring relative rotation from independently predicted absolute orientations leads to severe error accumulation.
Data quality issues: V1 relies on real 3D assets from Objaverse, which suffer from class imbalance, low texture quality, and fixed poses.
These limitations directly constrain the model's applicability to downstream tasks such as 6DoF pose estimation and robotic grasping.
## Method
### Overall Architecture
Orient Anything V2 upgrades both the data and model dimensions:
Data side: A scalable synthetic data engine is constructed to generate 600K high-quality 3D assets (12× larger than V1), paired with a robust orientation annotation system.
Model side: A symmetry-aware periodic distribution fitting objective is introduced, and the architecture is extended to multi-frame inputs to support relative rotation estimation. The model is initialized from VGGT (a 1.2B-parameter feed-forward Transformer).
### Key Designs
#### 1. Synthetic 3D Asset Generation
A structured pipeline is adopted: Class Tag → Caption → Image → 3D Mesh
- ImageNet-21K class tags → Qwen-2.5 generates rich textual descriptions
- FLUX.1-Dev generates images (with positional descriptions to enhance 3D structure)
- Hunyuan-3D-2.0 generates high-quality 3D meshes
- Final output: 600K assets, approximately 30 per category, with complete geometry and rich textures
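A hypothetical orchestration of this pipeline in Python (the wrapper functions below are placeholders standing in for Qwen-2.5, FLUX.1-Dev, and Hunyuan-3D-2.0; they are not real APIs of those projects):

```python
from dataclasses import dataclass

@dataclass
class Asset:
    tag: str
    caption: str
    mesh_path: str

def generate_caption(tag: str) -> str:
    """Placeholder for Qwen-2.5: expand a bare class tag into a rich
    description; the paper adds positional cues to encourage 3D-friendly
    structure in the generated image."""
    return f"a {tag}, centered, seen from a slightly elevated three-quarter view"

def generate_image(caption: str) -> bytes:
    """Placeholder for FLUX.1-Dev text-to-image generation."""
    raise NotImplementedError

def generate_mesh(image: bytes) -> str:
    """Placeholder for Hunyuan-3D-2.0 image-to-3D; returns a mesh path."""
    raise NotImplementedError

def build_assets(class_tags: list[str], per_class: int = 30) -> list[Asset]:
    """~30 assets per ImageNet-21K class yields the paper's ~600K total."""
    assets = []
    for tag in class_tags:
        for _ in range(per_class):
            caption = generate_caption(tag)
            mesh_path = generate_mesh(generate_image(caption))
            assets.append(Asset(tag, caption, mesh_path))
    return assets
```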
#### 2. Robust Annotation System
Multi-view pseudo-label aggregation: An improved V1 model is first trained as an annotator. It generates pseudo-labels from multiple rendered views of each 3D asset, which are then projected back into the 3D world coordinate frame, and a periodic Gaussian distribution is fitted to the resulting azimuth samples on the horizontal plane.
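One plausible parameterization of this periodic Gaussian is a wrapped sum of equally spaced modes (a reconstruction from the surrounding symbols, not the paper's verbatim formula; \(\sigma\) is an assumed spread parameter, and front-free objects, \(\bar{\alpha} = 0\), degenerate to a uniform distribution over azimuth):

$$
p(\varphi) \;\propto\; \sum_{k=0}^{\bar{\alpha}-1} \exp\!\left(-\frac{\bigl(\varphi - \bar{\varphi} - 2\pi k/\bar{\alpha}\bigr)^{2}}{2\sigma^{2}}\right)
$$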
where \(\bar{\alpha}\) encodes periodicity (rotational symmetry) and \(\bar{\varphi}\) encodes the principal azimuth direction.
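A minimal NumPy sketch of this fitting step, assuming the pseudo-labels have already been projected into world-frame azimuths; the grid search, \(\sigma\), and likelihood scoring are illustrative choices, not the paper's implementation:

```python
import numpy as np

def fit_periodic_gaussian(azimuths, symmetries=(1, 2, 4), sigma=np.deg2rad(10)):
    """Fit the wrapped sum-of-Gaussians above to multi-view azimuth
    pseudo-labels (radians) by grid search over the phase.

    Returns (symmetry, phase), i.e. the (alpha-bar, phi-bar) pair maximizing
    the mean log-likelihood. Front-free objects (alpha-bar = 0) would be
    detected separately, e.g. by near-uniform azimuth scatter.
    """
    azimuths = np.asarray(azimuths, dtype=float)
    phase_grid = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
    best_sym, best_phase, best_score = 1, 0.0, -np.inf
    for alpha in symmetries:
        offsets = 2 * np.pi * np.arange(alpha) / alpha  # equally spaced modes
        for phi in phase_grid:
            d = azimuths[:, None] - (phi + offsets)[None, :]
            d = np.angle(np.exp(1j * d))                # wrap to (-pi, pi]
            ll = np.logaddexp.reduce(-d**2 / (2 * sigma**2), axis=1)
            score = ll.mean() - np.log(alpha)           # mixture weight 1/alpha
            if score > best_score:
                best_sym, best_phase, best_score = alpha, phi, score
    return best_sym, best_phase
```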
Cross-asset consistency calibration: Assuming objects of the same category should share the same rotational symmetry, intra-category annotation consistency is verified; only approximately 15% of categories require manual review.
#### 3. Symmetry-Aware Periodic Distribution
The training objective is extended from V1's unimodal distribution to a periodic distribution over the azimuth.
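A minimal sketch of how such a target could be discretized for training (bin count and \(\sigma\) are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def periodic_target(symmetry: int, azimuth: float, n_bins: int = 360,
                    sigma: float = np.deg2rad(5)) -> np.ndarray:
    """Discretized periodic target distribution over azimuth bins.

    symmetry = 0 denotes a front-free object (uniform target); symmetry in
    {1, 2, 4} places that many equally spaced Gaussian modes, the first
    centered at `azimuth`.
    """
    if symmetry == 0:
        return np.full(n_bins, 1.0 / n_bins)
    bins = np.linspace(0.0, 2 * np.pi, n_bins, endpoint=False)
    centers = azimuth + 2 * np.pi * np.arange(symmetry) / symmetry
    d = np.angle(np.exp(1j * (bins[:, None] - centers[None, :])))  # wrapped
    target = np.exp(-d**2 / (2 * sigma**2)).sum(axis=1)
    return target / target.sum()
```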
This design naturally subsumes V1's separate orientation confidence prediction; different rotational symmetry types are directly inferred from the predicted probability distribution.
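One hypothetical decision rule for reading the symmetry type off a predicted distribution, reusing `periodic_target` from the sketch above (the paper does not spell out its exact rule):

```python
import numpy as np

def infer_symmetry(pred: np.ndarray, candidates=(0, 1, 2, 4)) -> int:
    """Pick the candidate whose best-fitting periodic template (over all
    phases) is closest to the predicted distribution in L1 distance."""
    n_bins = len(pred)
    best_sym, best_err = 0, np.inf
    for sym in candidates:
        if sym == 0:
            err = np.abs(pred - 1.0 / n_bins).sum()
        else:
            # By periodicity, searching phases in [0, 2*pi/sym) suffices.
            err = min(
                np.abs(pred - periodic_target(sym, phi, n_bins)).sum()
                for phi in np.linspace(0.0, 2 * np.pi / sym, 90, endpoint=False)
            )
        if err < best_err:
            best_sym, best_err = sym, err
    return best_sym
```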
#### 4. Multi-Frame Relative Rotation Estimation
Each input frame is encoded by DINOv2 into \(K\) patch tokens and augmented with a per-frame learnable token. Tokens from all frames are jointly encoded by a shared Transformer. The learnable token of the first frame predicts the absolute orientation, while those of subsequent frames predict relative rotations.
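A structural sketch of this token layout in PyTorch (dimensions, depth, and the single-angle relative head are illustrative; the real model inherits its joint layers from VGGT and predicts full 3D rotations):

```python
import torch
import torch.nn as nn

class MultiFrameOrientationHead(nn.Module):
    """Illustrative layout: each frame contributes K patch tokens plus one
    learnable query token; all frames are jointly encoded, and the query of
    frame 0 decodes the absolute orientation while later queries decode
    relative rotation distributions."""

    def __init__(self, dim: int = 768, n_bins: int = 360, depth: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.abs_head = nn.Linear(dim, n_bins)  # absolute azimuth distribution
        self.rel_head = nn.Linear(dim, n_bins)  # relative rotation (one angle here)

    def forward(self, frame_tokens: torch.Tensor):
        # frame_tokens: (B, F, K, dim) patch tokens, e.g. from DINOv2
        B, F, K, D = frame_tokens.shape
        q = self.query.expand(B, F, 1, D)                     # one query per frame
        x = torch.cat([q, frame_tokens], dim=2)               # (B, F, K+1, D)
        x = self.encoder(x.reshape(B, F * (K + 1), D))        # joint encoding
        queries = x.reshape(B, F, K + 1, D)[:, :, 0]          # (B, F, D)
        abs_dist = self.abs_head(queries[:, 0]).softmax(-1)   # frame 0: absolute
        rel_dist = self.rel_head(queries[:, 1:]).softmax(-1)  # frames 1..: relative
        return abs_dist, rel_dist
```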
### Loss & Training
- Loss function: Binary Cross-Entropy (BCE) loss to fit target orientation/rotation distributions (see the sketch after this list)
- Initialization: VGGT (1.2B parameters, pretrained on 3D geometric tasks)
- Training: 20K iterations with cosine learning rate scheduling, initial learning rate \(1 \times 10^{-3}\)
- Data augmentation: Random patch masking to simulate real-world occlusion
- Batch size: Effective batch size 48; each sample randomly draws 1–2 frames
- Symmetry constraint: Only four rotational symmetry types \(\{0, 1, 2, 4\}\) are considered (0 denoting front-free objects), covering the vast majority of objects
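A minimal sketch of the BCE objective and the occlusion-style augmentation referenced above, reusing `periodic_target` from the earlier sketch (the per-bin BCE treatment and the mask ratio are assumptions):

```python
import torch
import torch.nn.functional as F

def orientation_bce(pred: torch.Tensor, symmetry: int, azimuth: float) -> torch.Tensor:
    """BCE between a predicted azimuth distribution of shape (n_bins,) and a
    periodic target, treating each bin as an independent binary label."""
    target = torch.from_numpy(periodic_target(symmetry, azimuth, len(pred)))
    return F.binary_cross_entropy(pred, target.to(pred.dtype))

def random_patch_mask(tokens: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Zero out random patch tokens to simulate real-world occlusion; the
    mask ratio p is an assumed value, not from the paper."""
    keep = torch.rand(tokens.shape[:-1], device=tokens.device) > p
    return tokens * keep.unsqueeze(-1)
```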
## Key Experimental Results
### Main Results: Zero-Shot Absolute Orientation Estimation
| Model | SUN-RGBD Med↓ | ARKitScenes Med↓ | Pascal3D+ Med↓ | Objectron Med↓ | Ori_COCO Acc↑ |
|---|---|---|---|---|---|
| Orient Anything V1 | 33.94 | 77.58 | 22.90 | 30.67 | 72.4 |
| Orient Anything V2 | 26.00 | 36.48 | 15.02 | 22.62 | 86.4 |
V2 significantly outperforms V1 across all datasets. On ARKitScenes, the median error drops from 77.58° to 36.48°; on Ori_COCO, accuracy improves from 72.4% to 86.4%.
### Zero-Shot Relative Rotation Estimation (6DoF Pose)
| Model | Pair Sampling (avg. rotation) | LINEMOD Med↓ | YCB-Video Med↓ | OnePose++ Med↓ | OnePose Med↓ |
|---|---|---|---|---|---|
| POPE | POPE (14.85°) | 15.73 | 13.94 | 6.27 | 2.16 |
| Orient Anything V2 | POPE (14.85°) | 7.82 | 6.07 | 6.18 | 6.76 |
| POPE | Random (78.22°) | 98.03 | 41.88 | 88.21 | 45.73 |
| Orient Anything V2 | Random (78.22°) | 28.83 | 15.78 | 12.83 | 11.72 |
The advantage is especially pronounced under large rotation angles (random sampling): on LINEMOD, the median error drops from 98.03° to 28.83°. Feature-matching-based methods fail under large rotations, whereas V2 remains robust through holistic semantic understanding.
### Ablation Study
| Row | Asset Type | # Assets | Init | Objectron Med↓ | LINEMOD Med↓ / Acc@15°↑ | YCB-Video Med↓ / Acc@15°↑ |
|---|---|---|---|---|---|---|
| 1 | Real | 40K | VGGT | 25.05 | 10.70 / 69.8 | 15.49 / 72.5 |
| 2 | Synthetic | 40K | VGGT | 24.44 | 10.16 / 74.1 | 7.28 / 76.2 |
| 5 | Synthetic | 600K | VGGT | 22.62 | 7.82 / 89.7 | 6.07 / 86.4 |
| 7 | Synthetic | 600K | None | 62.08 | 16.54 / 45.3 | 13.93 / 52.2 |
### Key Findings
- Synthetic vs. real data: Equal-scale synthetic data is on par with real data for orientation estimation, but substantially superior for rotation estimation due to richer textures.
- Data scale effect: Rotation estimation is more sensitive to data scale (requiring diverse textures and fine-grained details); scaling from 40K to 600K assets improves LINEMOD Acc@15° from 74.1% to 89.7%.
- Importance of pretraining: Models without pretraining degrade substantially (Objectron Med rises from 22.62° to 62.08°); VGGT (3D geometry pretraining) outperforms DINOv2 (semantic pretraining).
- Symmetry recognition: V2 achieves 65.2% accuracy on Omni6DPose, surpassing GPT-4o (62.5%) and other leading VLMs.
## Highlights & Insights
- Closed-loop data engine design: the pipeline of improved-V1 pseudo-labels → multi-view aggregation → cross-asset calibration feeds the model's prediction capability back into data construction.
- Elegance of symmetry modeling: The periodic distribution naturally unifies the representation of front-free, unique-front, and multi-front objects.
- Unification of orientation and rotation: The inherent coupling between absolute orientation and relative rotation enables knowledge sharing and transfer.
- Effectiveness of synthetic data: This work is among the first to demonstrate that synthetic 3D assets can replace real assets for orientation estimation tasks, and that texture diversity renders them superior for rotation estimation.
## Limitations & Future Work
- Poor performance on low-information viewpoints: Prediction accuracy degrades under heavy occlusion or highly uninformative views.
- Maximum two-frame input: The architecture currently accepts at most two frames and cannot scale to video understanding scenarios.
- Only four symmetry types: \(\{0, 1, 2, 4\}\) cannot handle higher-order rotationally symmetric objects (e.g., pentagrams).
- Test set limitations: Existing benchmarks typically provide only a single ground-truth orientation, making it difficult to fully evaluate multi-orientation prediction capability.
## Related Work & Insights
- Repurposing VGGT: The "camera" token of VGGT is repurposed for orientation/rotation prediction, leveraging the correlation between camera pose and object rotation.
- Scalable synthetic data paradigm: The Class Tag → Caption → Image → 3D Mesh pipeline is generalizable to other 3D understanding tasks.
- Iterative foundation model upgrading: The V1→V2 evolution provides a template for iterative improvement of other foundation models.
## Rating
- Novelty: ⭐⭐⭐⭐ — Symmetry-aware periodic distribution and multi-frame architecture are well-motivated and novel
- Theoretical Contribution: ⭐⭐⭐ — Primarily engineering innovation with limited theoretical depth
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 11 benchmarks with comprehensive ablation studies
- Value: ⭐⭐⭐⭐⭐ — A foundation model unifying orientation, rotation, and symmetry understanding
- Overall Recommendation: ⭐⭐⭐⭐