Any6D: Model-free 6D Pose Estimation of Novel Objects

Conference: CVPR 2025
arXiv: 2503.18673
Code: Project Page
Area: Human Understanding
Keywords: 6D pose estimation, model-free, single anchor, InstantMesh, FoundationPose, render-and-compare

TL;DR

This paper proposes the Any6D framework to estimate the 6D pose and size of novel objects from a single RGB-D anchor image. By combining InstantMesh 3D reconstruction, oriented bounding box coarse alignment, and joint size-pose refinement, Any6D achieves an ADD-S of 98.7% on HO3D, significantly outperforming GEDI's 71.9%.

Background & Motivation

Background: 6D object pose estimation is crucial in robotic manipulation and augmented reality. Existing methods can be categorized into instance-level (requiring precise CAD models), category-level (requiring category priors), and category-agnostic methods.

Limitations of Prior Work:

  • CAD model-based methods require precise textured 3D models, which are expensive to acquire.
  • Multi-view methods (Gen6D, OnePose, FoundationPose model-free mode) require multiple reference images or video sequences.
  • Single-view matching methods (Oryon, LoFTR) suffer from drastic performance drops under occlusion or non-overlapping viewpoints.

Key Challenge: In practical robotic scenarios, robots facing novel objects in new environments cannot acquire CAD models or multi-view images. None of the existing methods can handle this effectively.

Key Insight: Leveraging image-to-3D generative models (InstantMesh) to reconstruct the complete 3D shape from a single image, combined with depth information to estimate the metric scale, thereby enabling a robust full-to-partial matching.

Core Idea: Single RGB-D \(\to\) InstantMesh normalized 3D reconstruction \(\to\) Oriented bounding box coarse alignment \(\to\) FoundationPose joint size-pose refinement \(\to\) Render-and-compare to select the optimal hypothesis.

Method

Overall Architecture

Given an anchor image \(I_A\) (RGB-D) and a query image \(I_Q\) (RGB-D), the goal is to estimate the relative pose \(\mathbf{T}_{A \to Q} \in SE(3)\). The method consists of two steps:

  1. Reconstruct a normalized shape \(O_N\) from the anchor image, then estimate the metric-scale shape \(O_M\) and the anchor pose \(T_{O_M \to A}\) via Object Alignment.
  2. Estimate \(T_{O_M \to Q}\) using the metric-scale shape and the query image.

With points treated as column vectors, the final relative pose is \(\mathbf{T}_{A \to Q} = T_{O_M \to Q} \cdot (T_{O_M \to A})^{-1}\): a point in the anchor camera frame is first mapped back to the object frame and then into the query camera frame.
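The way the two object poses combine into a relative pose can be checked numerically. The following is a minimal sketch, assuming 4x4 homogeneous matrices and the column-vector convention (a point is transformed as \(p' = T p\)); under that convention the anchor-to-query transform is \(T_{O_M \to Q} \cdot (T_{O_M \to A})^{-1}\). The helper names `make_pose` and `relative_pose` and all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_pose(rotation_z_deg: float, translation) -> np.ndarray:
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    theta = np.deg2rad(rotation_z_deg)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def relative_pose(T_obj_to_anchor: np.ndarray, T_obj_to_query: np.ndarray) -> np.ndarray:
    """Anchor-camera to query-camera transform (column-vector convention)."""
    # Undo the anchor pose (anchor frame -> object frame), then apply the query pose.
    return T_obj_to_query @ np.linalg.inv(T_obj_to_anchor)

# Illustrative object poses in the anchor and query camera frames.
T_obj_to_anchor = make_pose(30.0, [0.1, 0.0, 0.5])
T_obj_to_query = make_pose(-45.0, [-0.2, 0.05, 0.7])
T_anchor_to_query = relative_pose(T_obj_to_anchor, T_obj_to_query)

# Consistency check: one object point seen from both cameras.
p_obj = np.array([0.02, -0.01, 0.03, 1.0])
p_anchor = T_obj_to_anchor @ p_obj
p_query = T_obj_to_query @ p_obj
assert np.allclose(T_anchor_to_query @ p_anchor, p_query)
```

The final assertion is the property that matters: moving a point from the anchor camera frame with the relative pose must land on the same point expressed in the query camera frame. Multiplying the two factors in the opposite order fails this check unless a row-vector convention is used.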

Key Designs

  1. 3D Shape Reconstruction (InstantMesh)

    • Function: Generates a normalized 3D mesh \(O_N\) (range \([-1, 1]\)) from the RGB of the anchor image.
    • Core Limitation: The generated shape lacks a metric scale and cannot be directly used for pose estimation.
    • Advantages: Compared to NeRF or partial-view reconstruction, it generates a complete shape, supporting full-to-partial matching.
  2. Coarse Object Alignment

    • Function: Estimates the initial object size \(s \in \mathbb{R}^3\) and the coarse pose.
    • Mechanism: Uses the Oriented Bounding Box (OBB) to determine the object center.
    • Why not use other center estimation methods:
      • Point cloud mean: Unreliable when partially visible.
      • Axis-Aligned Bounding Box (AABB): Center shifts under partial occlusion.
    • Workflow: Samples different rotation angles, calculates the bounding box IoU between \(I_A\) and \(O_N\), and selects the rotation + scale combination with the highest IoU.
  3. Fine Object Alignment

    • Function: Jointly refines size and pose.
    • Extensions based on FoundationPose:
      • Original FoundationPose only samples pose hypotheses in \(SO(3)\).
      • Any6D additionally samples sizes \(\Delta s \in [0.6, 1.4]\).
    • Alternate iteration of three modules: pose estimation \(\to\) size estimation \(\to\) axis alignment.
    • Render-and-Compare to select the optimal: Pose ordering network + self-attention global scoring.
  4. Pose Selection

    • Two-stage strategy: First, a pose ordering network compares the rendered images with the cropped observations, and then a self-attention module fuses all hypothesis embeddings to output the final score.
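The size-sampling idea in the fine alignment step can be illustrated with a toy example. This is only a sketch under several loudly stated assumptions: the pose is held fixed, the size correction \(\Delta s\) is applied independently per axis, a random point set stands in for the InstantMesh output \(O_N\), and a one-sided nearest-neighbour distance stands in for the learned render-and-compare scorer. None of the names below come from the Any6D or FoundationPose code.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the normalized reconstruction O_N: points in [-1, 1]^3.
normalized_points = rng.uniform(-1.0, 1.0, size=(500, 3))

# Ground-truth metric size (full extents in metres), used only to fake an observation.
true_size = np.array([0.10, 0.06, 0.20])
# Coarse size estimate that is off by a different factor on each axis.
coarse_size = true_size / np.array([1.2, 0.8, 1.0])

# Partial observation: only the half of the object with z > 0, plus small sensor noise.
visible = normalized_points[:, 2] > 0.0
observed = normalized_points[visible] * (true_size / 2.0)
observed = observed + rng.normal(0.0, 2e-4, size=observed.shape)

def one_sided_distance(partial: np.ndarray, full: np.ndarray) -> float:
    """Mean distance from each observed point to its nearest model point."""
    diffs = partial[:, None, :] - full[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).min(axis=1).mean())

def refine_size(coarse: np.ndarray, factors: np.ndarray):
    """Score every per-axis size correction and return the best-scoring one."""
    best_delta, best_score = None, np.inf
    for delta in itertools.product(factors, repeat=3):
        size = coarse * np.array(delta)
        # Scale the full normalized shape to this metric size hypothesis.
        score = one_sided_distance(observed, normalized_points * (size / 2.0))
        if score < best_score:
            best_delta, best_score = np.array(delta), score
    return best_delta, best_score

# Five samples per axis inside the [0.6, 1.4] range quoted above (125 hypotheses).
factors = np.linspace(0.6, 1.4, 5)
best_delta, best_score = refine_size(coarse_size, factors)
refined_size = coarse_size * best_delta
print("best correction:", best_delta)
print("refined size (m):", refined_size)
```

Because the distance is measured from the partial observation to the full shape, the missing half of the object does not penalize a correct hypothesis, which is the point of full-to-partial matching. In the actual method the pose is refined alongside the size and the hypotheses are ranked by a learned scorer rather than by a geometric distance.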

Loss & Training

  • No additional training required: Leverages pre-trained InstantMesh and FoundationPose.
  • Online inference: Conducts optimization-based alignment during online inference.

Key Experimental Results

Main Results (HO3D Dataset)

| Method | Input Modality | ADD-S↑ | ADD↑ | AR↑ |
|---|---|---|---|---|
| Oryon | RGB-D + Language | 23.0 | 0.0 | 1.0 |
| LoFTR | RGB-D | 29.5 | 2.3 | 3.2 |
| GEDI | Depth | 71.9 | 9.7 | 7.4 |
| Any6D (Ours) | RGB-D | 98.7 | 40.4 | 38.3 |
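The gap between the ADD-S and ADD columns is easier to read with the per-frame errors in mind. The sketch below uses the standard definitions: ADD averages the distance between corresponding model points under the ground-truth and estimated poses, while ADD-S averages the distance from each ground-truth point to its nearest estimated point. The toy model, the poses, and the function names are assumptions for illustration, and how the per-frame errors are aggregated into the reported percentages is not specified in this summary.

```python
import numpy as np

def transform(points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply a rigid transform to an (N, 3) array of points."""
    return points @ R.T + t

def add_error(points, R_gt, t_gt, R_est, t_est) -> float:
    """ADD: mean distance between corresponding points under the two poses."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return float(np.linalg.norm(gt - est, axis=1).mean())

def add_s_error(points, R_gt, t_gt, R_est, t_est) -> float:
    """ADD-S: mean distance from each ground-truth point to the nearest estimated point."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(pairwise.min(axis=1).mean())

# Toy model: a 5 x 5 x 5 grid of points, symmetric under a 180-degree turn about z.
axis = np.linspace(-0.05, 0.05, 5)
model = np.array([[x, y, z] for x in axis for y in axis for z in axis])

R_gt = np.eye(3)
R_est = np.diag([-1.0, -1.0, 1.0])  # 180-degree rotation about the z axis
t = np.array([0.0, 0.0, 0.5])       # same translation for both poses

add = add_error(model, R_gt, t, R_est, t)
add_s = add_s_error(model, R_gt, t, R_est, t)
print(f"ADD   = {add:.4f} m")
print(f"ADD-S = {add_s:.4f} m")
```

In this toy case the estimated pose is wrong by half a turn, so ADD is large, yet the rotated point set coincides with the original one, so ADD-S is essentially zero. ADD-S is therefore the more lenient metric, which is one plausible reason every method in the table scores much higher on ADD-S than on ADD.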

Other Datasets

| Dataset | Metric 1↑ | Metric 2↑ | Metric 3↑ |
|---|---|---|---|
| YCBInEOAT | ADD-S: 89.3 | ADD: 45.6 | AR: 37.5 |
| Toyota-Light | ADD(-S): 32.2 | AR: 43.3 | MSSD: 55.8 |
| REAL275 | ADD(-S): 53.5 | AR: 51.0 | MSPD: 65.3 |
| LM-O (vs GigaPose) | AR: 28.6 | MSPD: 36.1 | VSD: 17.6 |

Ablation Study (HO3D Dataset)

| Configuration | ADD-S↑ | ADD↑ | AR↑ | CD↓ |
|---|---|---|---|---|
| Baseline (NeRF partial view) | 28.6 | 0.0 | 0.2 | 1.02 |
| (1) Without any alignment | 0.0 | 0.0 | 0.0 | 1.47 |
| (2) W/o coarse size, w/ refinement + axis alignment | 98.0 | 25.5 | 26.8 | 0.53 |
| (3) W/ coarse size, w/o refinement | 83.7 | 26.6 | 22.5 | 0.92 |
| (4) W/ coarse size + refinement, w/o axis alignment | 92.3 | 23.6 | 24.9 | 0.66 |
| Full (Ours) | 98.7 | 40.4 | 38.3 | 0.49 |

Key Findings

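  • Object alignment is the foundation; omitting it entirely leads to complete failure (Configuration 1: 0.0 on ADD-S, ADD, and AR). Skipping only the coarse size step (Configuration 2) keeps ADD-S at 98.0 but lowers ADD from 40.4 to 25.5 (-14.9).
  • Axis alignment brings a significant improvement to ADD and AR: compared with Configuration 4, the full method gains +16.8 ADD (23.6 to 40.4) and +13.4 AR (24.9 to 38.3).
  • Size refinement prevents XYZ aspect ratio distortion; without it (Configuration 3), ADD-S falls to 83.7 and CD rises from 0.49 to 0.92.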

Highlights & Insights

  • Single RGB-D is sufficient: No CAD models, multi-view images, or video sequences are required.
  • Oriented Bounding Box (OBB) center estimation is simple and effective, addressing partial visibility issues.
  • Full-to-partial matching: Complete reconstruction eliminates the ambiguity of partial matching.
  • Significantly outperforms baseline methods in hand-occlusion (HO3D) and robotic-grasping (YCBInEOAT) scenarios.

Limitations & Future Work

  • Relies heavily on the reconstruction quality of InstantMesh; performance drops when the initial 3D shape is inaccurate.
  • Currently lacks a shape update/refinement step.
  • Inference speed is bottlenecked by InstantMesh.

Rating

  • Novelty: ⭐⭐⭐⭐ Jointly estimates size and pose with InstantMesh + FoundationPose.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets + detailed ablation.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and complete framework.
  • Value: ⭐⭐⭐⭐⭐ Holds significant practical value for robotic manipulation.