
Any6D: Model-free 6D Pose Estimation of Novel Objects

Conference: CVPR 2025
arXiv: 2503.18673
Code: Project Page
Area: Human Understanding
Keywords: 6D pose estimation, model-free, single anchor, InstantMesh, FoundationPose, render-and-compare

TL;DR

This paper proposes the Any6D framework to estimate the 6D pose and size of novel objects from a single RGB-D anchor image. By combining InstantMesh 3D reconstruction, oriented bounding box coarse alignment, and joint size-pose refinement, Any6D achieves an ADD-S of 98.7% on HO3D, significantly outperforming GEDI's 71.9%.

Background & Motivation

Background: 6D object pose estimation is crucial in robotic manipulation and augmented reality. Existing methods can be categorized into instance-level (requiring precise CAD models), category-level (requiring category priors), and category-agnostic methods.

Limitations of Prior Work:
- CAD model-based methods require precise textured 3D models, which are expensive to acquire.
- Multi-view methods (Gen6D, OnePose, FoundationPose in model-free mode) require multiple reference images or video sequences.
- Single-view matching methods (Oryon, LoFTR) suffer drastic performance drops under occlusion or non-overlapping viewpoints.

Key Challenge: In practical robotic scenarios, robots facing novel objects in new environments cannot acquire CAD models or multi-view images. None of the existing methods can handle this effectively.

Key Insight: Use an image-to-3D generative model (InstantMesh) to reconstruct the complete 3D shape from a single image, and use the depth channel to recover the metric scale, which enables robust full-to-partial matching.

Core Idea: Single RGB-D \(\to\) InstantMesh normalized 3D reconstruction \(\to\) Oriented bounding box coarse alignment \(\to\) FoundationPose joint size-pose refinement \(\to\) Render-and-compare to select the optimal hypothesis.

Method

Overall Architecture

Given an anchor image \(I_A\) (RGB-D) and a query image \(I_Q\) (RGB-D), the goal is to estimate the relative pose \(\mathbf{T}_{A \to Q} \in SE(3)\). The method consists of two steps:

1. Reconstruct a normalized shape \(O_N\) from the anchor image, then estimate the metric-scale shape \(O_M\) and the anchor pose \(T_{O_M \to A}\) via Object Alignment.
2. Estimate \(T_{O_M \to Q}\) from the metric-scale shape and the query image. With \(T_{X \to Y}\) mapping points from frame \(X\) to frame \(Y\), the final relative pose is \(\mathbf{T}_{A \to Q} = T_{O_M \to Q} \cdot (T_{O_M \to A})^{-1}\).
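
The composition of the two object poses into a relative pose can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the common column-vector convention in which \(T_{X \to Y}\) maps points from frame \(X\) to frame \(Y\), so the relative pose is \(T_{O_M \to Q} \cdot (T_{O_M \to A})^{-1}\). The helper names `make_pose` and `relative_pose` and the numeric poses are hypothetical.

```python
import numpy as np

def make_pose(rotation_z_deg: float, translation) -> np.ndarray:
    """Build a 4x4 rigid transform: rotation about z plus a translation."""
    theta = np.deg2rad(rotation_z_deg)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_pose(T_obj_to_anchor: np.ndarray, T_obj_to_query: np.ndarray) -> np.ndarray:
    """Anchor-frame to query-frame transform built from two object poses."""
    return T_obj_to_query @ np.linalg.inv(T_obj_to_anchor)

# Hypothetical object poses in the anchor and query camera frames.
T_OA = make_pose(30.0, [0.1, 0.0, 0.5])
T_OQ = make_pose(75.0, [-0.2, 0.1, 0.6])
T_AQ = relative_pose(T_OA, T_OQ)

# An object point must reach the same query-frame location whether it goes
# object -> query directly or object -> anchor -> query.
p_obj = np.array([0.03, -0.02, 0.01, 1.0])
assert np.allclose(T_AQ @ (T_OA @ p_obj), T_OQ @ p_obj)
```

Under a row-vector convention the factors would appear in the opposite order, so it is worth checking which convention a given codebase uses before composing poses.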

Key Designs

3D Shape Reconstruction (InstantMesh)
- Function: Generates a normalized 3D mesh \(O_N\) (range \([-1, 1]\)) from the RGB of the anchor image.
- Core Limitation: The generated shape lacks a metric scale and cannot be directly used for pose estimation.
- Advantages: Compared to NeRF or partial-view reconstruction, it generates a complete shape, supporting full-to-partial matching.
Coarse Object Alignment
- Function: Estimates the initial object size \(s \in \mathbb{R}^3\) and the coarse pose.
- Mechanism: Uses the Oriented Bounding Box (OBB) to determine the object center.
- Why not use other center estimation methods:
  - Point cloud mean: Unreliable when partially visible.
  - Axis-Aligned Bounding Box (AABB): Center shifts under partial occlusion.
- Workflow: Samples different rotation angles, calculates the bounding box IoU between \(I_A\) and \(O_N\), and selects the rotation + scale combination with the highest IoU.
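
To illustrate why the box center can be a more stable reference than the point-cloud mean, the sketch below fits an oriented bounding box with PCA and compares both center estimates on an unevenly sampled box. PCA is an assumption made for this example; the summary does not state how Any6D computes its OBB. The function `obb_from_points` and all numeric values are hypothetical.

```python
import numpy as np

def obb_from_points(points: np.ndarray):
    """PCA-based oriented bounding box: returns (center, axes, extents)."""
    mean = points.mean(axis=0)
    # Eigenvectors of the covariance matrix serve as box axes (columns).
    _, axes = np.linalg.eigh(np.cov((points - mean).T))
    local = (points - mean) @ axes            # coordinates in the box frame
    lo, hi = local.min(axis=0), local.max(axis=0)
    center = mean + axes @ ((lo + hi) / 2.0)  # box midpoint in the world frame
    return center, axes, hi - lo

rng = np.random.default_rng(0)
half = np.array([0.15, 0.05, 0.02])           # hypothetical half-sizes in meters
true_center = np.array([0.2, -0.1, 0.7])
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])

# Uniform samples over the box, plus extra samples on one half (local x > 0)
# to mimic an object that is observed more densely on one side.
uniform = rng.uniform(-half, half, size=(2000, 3))
extra = rng.uniform([0.0, -half[1], -half[2]], half, size=(2000, 3))
points = np.vstack([uniform, extra]) @ R.T + true_center

center, axes, extents = obb_from_points(points)
mean_error = np.linalg.norm(points.mean(axis=0) - true_center)
obb_error = np.linalg.norm(center - true_center)
print(f"mean error: {mean_error:.4f} m, OBB center error: {obb_error:.4f} m")
```

In this toy setup the mean is pulled toward the densely sampled side, while the box center depends only on the extremes along each axis. When a whole side of the object is missing, a box fitted to the observed points shifts as well, which is presumably why the method aligns against a complete reconstructed shape rather than the observation alone.
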
Fine Object Alignment
- Function: Jointly refines size and pose.
- Extensions based on FoundationPose:
  - Original FoundationPose only samples pose hypotheses in \(SO(3)\).
  - Any6D additionally samples size scale factors \(\Delta s \in [0.6, 1.4]\).
- Alternate iteration of three modules: pose estimation \(\to\) size estimation \(\to\) axis alignment.
- Render-and-Compare to select the best hypothesis: pose ranking network + self-attention global scoring.
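
A joint size-pose hypothesis set can be pictured as the Cartesian product of rotation candidates and size scale factors. The sketch below is a deliberate simplification: it samples only a yaw angle instead of full 3D rotations, uses a regular grid, and takes the \([0.6, 1.4]\) range from the text. The function `sample_hypotheses` and the grid resolutions are hypothetical, not taken from Any6D or FoundationPose.

```python
import itertools
import math

def sample_hypotheses(num_yaw: int = 8, num_scales: int = 5,
                      scale_range=(0.6, 1.4)):
    """Enumerate (yaw_radians, scale_factor) pairs on a regular grid."""
    lo, hi = scale_range
    yaws = [2.0 * math.pi * i / num_yaw for i in range(num_yaw)]
    scales = [lo + (hi - lo) * j / (num_scales - 1) for j in range(num_scales)]
    # Every rotation candidate is paired with every scale candidate.
    return list(itertools.product(yaws, scales))

hypotheses = sample_hypotheses()
print(len(hypotheses))  # 8 yaw angles x 5 scale factors = 40
```

Each pair would then be refined and scored; adding the size dimension multiplies the number of hypotheses, which is one reason the scale range is kept narrow.
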
Pose Selection
- Two-stage strategy: first, a pose ranking network compares the rendered images with the cropped observations; then a self-attention module fuses all hypothesis embeddings to output the final score.
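
The render-and-compare principle itself can be shown with a toy example in which the learned scoring network is replaced by a hand-written silhouette IoU. Everything here is an illustrative stand-in: `render_mask`, `mask_iou`, and `select_best` are hypothetical names, and the real pipeline scores rendered RGB-D crops with learned networks rather than binary masks.

```python
import numpy as np

def render_mask(side_px: int, image_size: int = 64) -> np.ndarray:
    """Toy renderer: a centered square silhouette with the given side length."""
    mask = np.zeros((image_size, image_size), dtype=bool)
    start = (image_size - side_px) // 2
    mask[start:start + side_px, start:start + side_px] = True
    return mask

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def select_best(observed: np.ndarray, base_side_px: int, scales):
    """Render each scale hypothesis, score it, and keep the best one."""
    scores = [mask_iou(render_mask(int(round(base_side_px * s))), observed)
              for s in scales]
    return scales[int(np.argmax(scores))], scores

observed = render_mask(36)            # stand-in for the cropped observation
scales = [0.6, 0.8, 1.0, 1.2, 1.4]    # candidate size scale factors
best_scale, scores = select_best(observed, base_side_px=30, scales=scales)
print(best_scale)  # 1.2, since 30 px * 1.2 = 36 px matches the observation
```
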

Loss & Training

No additional training required: Leverages pre-trained InstantMesh and FoundationPose.
Online inference: Conducts optimization-based alignment during online inference.

Key Experimental Results

Main Results (HO3D Dataset)

| Method | Input Modality | ADD-S↑ | ADD↑ | AR↑ |
| --- | --- | --- | --- | --- |
| Oryon | RGB-D + Language | 23.0 | 0.0 | 1.0 |
| LoFTR | RGB-D | 29.5 | 2.3 | 3.2 |
| GEDI | Depth | 71.9 | 9.7 | 7.4 |
| Any6D (Ours) | RGB-D | 98.7 | 40.4 | 38.3 |
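
The gap between the ADD-S and ADD columns is easier to read with the underlying distances in mind. The sketch below implements the standard per-instance ADD and ADD-S distances; how the benchmark turns them into the percentages reported here (for example, thresholding or area under the curve) is not specified in this summary, so only the raw distances are shown. The function names and the toy square object are hypothetical.

```python
import numpy as np

def transform(points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ R.T + t

def add_distance(points, R_gt, t_gt, R_pred, t_pred) -> float:
    """ADD: mean distance between corresponding model points."""
    diff = transform(points, R_gt, t_gt) - transform(points, R_pred, t_pred)
    return float(np.linalg.norm(diff, axis=1).mean())

def adds_distance(points, R_gt, t_gt, R_pred, t_pred) -> float:
    """ADD-S: mean distance from each ground-truth point to the nearest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())

# Toy object: the four corners of a 10 cm square, symmetric under 90-degree turns.
points = 0.05 * np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0],
                          [-1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])
t = np.array([0.0, 0.0, 0.5])
R_gt = np.eye(3)
R_pred = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90 degrees about z

add_value = add_distance(points, R_gt, t, R_pred, t)
adds_value = adds_distance(points, R_gt, t, R_pred, t)
print(f"ADD = {add_value:.3f} m, ADD-S = {adds_value:.3f} m")
```

For this symmetric object the rotated prediction is penalized by ADD (each corner moves to a neighboring corner) but not by ADD-S, which only asks that some predicted point be nearby. ADD-S is therefore the more lenient of the two, consistent with the much higher ADD-S values in the table.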

Other Datasets

| Dataset | Reported metrics (↑) |
| --- | --- |
| YCBInEOAT | ADD-S: 89.3, ADD: 45.6, AR: 37.5 |
| Toyota-Light | ADD(-S): 32.2, AR: 43.3, MSSD: 55.8 |
| REAL275 | ADD(-S): 53.5, AR: 51.0, MSPD: 65.3 |
| LM-O (vs GigaPose) | AR: 28.6, MSPD: 36.1, VSD: 17.6 |

Ablation Study (HO3D Dataset)

| Configuration | ADD-S↑ | ADD↑ | AR↑ | CD↓ |
| --- | --- | --- | --- | --- |
| Baseline (NeRF partial view) | 28.6 | 0.0 | 0.2 | 1.02 |
| (1) Without any alignment | 0.0 | 0.0 | 0.0 | 1.47 |
| (2) W/o coarse size, w/ refinement + axis alignment | 98.0 | 25.5 | 26.8 | 0.53 |
| (3) W/ coarse size, w/o refinement | 83.7 | 26.6 | 22.5 | 0.92 |
| (4) W/ coarse size + refinement, w/o axis alignment | 92.3 | 23.6 | 24.9 | 0.66 |
| Full (Ours) | 98.7 | 40.4 | 38.3 | 0.49 |

Key Findings

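Alignment is indispensable: with no alignment at all, ADD-S, ADD, and AR all fall to 0 (Configuration 1).
Coarse size estimation mainly benefits the stricter metrics: without it, ADD-S stays at 98.0, but ADD drops from 40.4 to 25.5 and AR from 38.3 to 26.8 (Configuration 2).
Refinement matters: without it, ADD-S drops from 98.7 to 83.7 and CD rises from 0.49 to 0.92 (Configuration 3). Size refinement prevents XYZ aspect ratio distortion.
Axis alignment brings a significant improvement to ADD and AR: compared with Configuration 4, ADD rises from 23.6 to 40.4 (+16.8) and AR from 24.9 to 38.3 (+13.4).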

Highlights & Insights

Single RGB-D is sufficient: No CAD models, multi-view images, or video sequences are required.
Oriented Bounding Box (OBB) center estimation is simple and effective, addressing partial visibility issues.
Full-to-partial matching: Complete reconstruction eliminates the ambiguity of partial matching.
Significantly outperforms baseline methods in hand-occlusion (HO3D) and robotic-grasping (YCBInEOAT) scenarios.

Limitations & Future Work

Relies heavily on the reconstruction quality of InstantMesh; performance drops when the initial 3D shape is inaccurate.
Currently lacks a shape update/refinement step.
Inference speed is bottlenecked by InstantMesh.

Rating

Novelty: ⭐⭐⭐⭐ Jointly estimates size and pose with InstantMesh + FoundationPose.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets + detailed ablation.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and complete framework.
Value: ⭐⭐⭐⭐⭐ Holds significant practical value for robotic manipulation.