PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis¶
Conference: NeurIPS 2025
arXiv: 2510.19527
Code: https://github.com/maoqingsunny/PoseCrafter
Area: Video Generation
Keywords: Extreme Pose Estimation, Video Diffusion, Hybrid Video Generation, Feature Matching Selection, Sparse Overlap
TL;DR¶
This paper proposes PoseCrafter, a training-free framework for extreme pose estimation. It synthesizes high-fidelity intermediate frames via Hybrid Video Generation (HVG, a two-stage pipeline combining DynamiCrafter and ViewCrafter) to address pose estimation for image pairs with minimal or no overlap, and employs a Feature Matching Selector (FMS) to efficiently identify the most informative intermediate frames. The method achieves significant improvements in extreme pose estimation accuracy across four datasets.
Background & Motivation¶
Background: Relative pose estimation from image pairs is a fundamental problem in 3D vision. Existing methods based on feature matching + RANSAC + the five-point algorithm are well-established for sufficiently overlapping pairs, but fail entirely under minimal or zero overlap.
Limitations of Prior Work:
- InterPose bridges non-overlapping image pairs by generating intermediate frames via video interpolation, but the synthesized frames are blurry (especially the central ones), and its statistical self-consistency score for frame selection is slow and misaligned with the pose estimation objective.
- DynamiCrafter produces high-quality frames near the input endpoints but geometrically inconsistent intermediate frames, since the inputs themselves share little overlap.
- Commercial models (Runway/Luma) yield sharper results but are costly and still exhibit drift.
Key Challenge: No single video model can simultaneously guarantee geometric consistency across all frames under minimal overlap.
Key Insight: Decompose the problem into two steps — first use video interpolation to obtain a small set of "reliable relay frames" (frames near the input endpoints are more trustworthy), then refine intermediate frames using a pose-conditioned ViewCrafter.
Core Idea: Couple video interpolation with a pose-conditioned novel view synthesis model, leveraging the complementary strengths of each, combined with feature-correspondence-based frame selection.
Method¶
Overall Architecture¶
Input: An image pair \((I_0, I_T)\) with minimal or no overlap.
1. HVG Stage 1: DynamiCrafter interpolates a coarse video; the 4 frames nearest the endpoints, \(\{I_0, I_1, I_{T-1}, I_T\}\), are kept as "relay frames".
2. HVG Stage 2: DUSt3R estimates camera poses from the relay frames; spherical linear interpolation yields a dense camera trajectory; ViewCrafter generates high-fidelity intermediate frames conditioned on these poses.
3. FMS: Feature matching + RANSAC is applied between each synthesized frame and the input pair; the top-\(k\) frames with the highest inlier counts are passed to the pose estimation model.
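The three-stage flow reduces to a thin orchestration layer. The sketch below is a minimal stand-in, not the released code: every callable (`interpolate_video`, `estimate_poses`, `slerp_trajectory`, `synthesize_views`, `count_inliers`) is a hypothetical wrapper around the corresponding pre-trained component (DynamiCrafter, DUSt3R, SLERP, ViewCrafter, SuperPoint + RANSAC).

```python
from typing import Callable

def posecrafter_pipeline(
    I0, IT,
    interpolate_video: Callable,   # hypothetical DynamiCrafter wrapper
    estimate_poses: Callable,      # hypothetical DUSt3R wrapper
    slerp_trajectory: Callable,    # SO(3) interpolation of relay-frame poses
    synthesize_views: Callable,    # hypothetical ViewCrafter wrapper
    count_inliers: Callable,       # feature matching + RANSAC inlier count
    k: int = 4,
):
    # HVG Stage 1: coarse interpolation; keep only the 4 near-endpoint relay frames
    video = interpolate_video(I0, IT)
    relays = [video[0], video[1], video[-2], video[-1]]
    # HVG Stage 2: pose-guided refinement along a dense SLERP trajectory
    poses = estimate_poses(relays)
    frames = synthesize_views(relays, slerp_trajectory(poses))
    # FMS: rank synthesized frames by total inlier count against both inputs
    scores = [count_inliers(f, I0) + count_inliers(f, IT) for f in frames]
    order = sorted(range(len(frames)), key=lambda i: -scores[i])
    return [frames[i] for i in order[:k]]

# Toy demo with stand-in components: "frames" are integers, "inliers" = frame value.
top = posecrafter_pipeline(
    None, None,
    interpolate_video=lambda a, b: list(range(8)),
    estimate_poses=lambda r: r,                  # identity stand-in
    slerp_trajectory=lambda p: p,                # identity stand-in
    synthesize_views=lambda r, t: [10, 20, 30],
    count_inliers=lambda f, img: f,
    k=2,
)
```

The dummy demo only exercises the control flow; in the actual system each callable would wrap a heavyweight model invocation.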
Key Designs¶
- Hybrid Video Generation (HVG):
- Function: Two-stage synthesis of sharp intermediate frames.
- Stage 1 (Coarse Interpolation): DynamiCrafter generates a \(T\)-frame video. Only the 4 most reliable frames — \(\{I_0, I_1, I_{T-1}, I_T\}\) — are retained. Ablations confirm that retaining 4 frames outperforms 2 (too little structural information), 6 (blurry frames creep in), and all frames (the central frames are uniformly blurry).
- Stage 2 (Pose-Guided Refinement): DUSt3R recovers poses from the 4 relay frames; SO(3) spherical linear interpolation produces a dense trajectory; ViewCrafter generates conditioned frames.
- Design Motivation: DynamiCrafter excels at synthesizing "near-endpoint frames," while ViewCrafter excels at "sharp synthesis given a known pose" — the two are complementary.
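The trajectory construction in Stage 2 is standard quaternion SLERP between relay-frame orientations. A minimal numpy sketch (the paper interpolates DUSt3R's SO(3) estimates; the quaternion representation here is an illustrative choice, not the authors' implementation):

```python
import numpy as np

def quat_slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1 at fraction t."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the shorter arc on the quaternion sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def dense_trajectory(q_start, q_end, n_frames):
    """Evenly spaced camera orientations between two relay-frame poses."""
    return [quat_slerp(q_start, q_end, t) for t in np.linspace(0.0, 1.0, n_frames)]

# identity -> 90° rotation about z, sampled at 5 poses
q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
traj = dense_trajectory(q_id, q_z90, 5)
```

The midpoint of `traj` is the 45° rotation, as expected from constant-angular-velocity interpolation; a full camera trajectory would pair this with linearly interpolated translations.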
- Feature Matching Selector (FMS):
- Function: Selects the \(k\) synthesized frames most beneficial to pose estimation.
- Mechanism: Local descriptors (e.g., SuperPoint) are extracted from each candidate frame and matched against the input image pair; RANSAC computes the inlier count. The top-\(k\) frames with the highest total inlier counts are selected.
- Design Motivation: InterPose's statistical self-consistency score requires multiple video generation passes (slow) and does not directly optimize for pose estimation utility. FMS uses feature correspondence counts to directly measure whether a frame can help establish geometric relationships between \(I_0\) and \(I_T\).
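The selection logic itself is simple: score each candidate frame by its summed RANSAC inlier counts against both inputs, then keep the top-\(k\). The sketch below substitutes a toy 2D-translation RANSAC for the real SuperPoint + epipolar pipeline, so it only illustrates the ranking mechanism, not the actual geometric model:

```python
import numpy as np

def ransac_inliers(src, dst, thresh=2.0, iters=100, seed=0):
    """Toy RANSAC: fit a 2D translation to putative matches src -> dst
    ((N, 2) arrays) and return the inlier count of the best hypothesis."""
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(iters):
        i = rng.integers(len(src))
        t = dst[i] - src[i]                         # 1-point translation hypothesis
        resid = np.linalg.norm(dst - (src + t), axis=1)
        best = max(best, int((resid < thresh).sum()))
    return best

def select_top_k(candidate_matches, k):
    """candidate_matches: per synthesized frame, a pair of match sets
    ((src0, dst0), (srcT, dstT)) against inputs I_0 and I_T.
    Rank frames by total inlier count; return top-k indices and all scores."""
    scores = [ransac_inliers(*m0) + ransac_inliers(*mT)
              for m0, mT in candidate_matches]
    return np.argsort(scores)[::-1][:k].tolist(), scores

# Synthetic check: a geometrically consistent frame vs. a drifting one.
rng = np.random.default_rng(1)
pts = rng.uniform(0, 100, (50, 2))
consistent = (pts, pts + np.array([5.0, -3.0]))     # matches agree on one motion
drifting = (pts, rng.uniform(0, 100, (50, 2)))      # incoherent matches
top, scores = select_top_k([(drifting, drifting), (consistent, consistent)], k=1)
```

A real implementation would estimate an essential matrix per pair rather than a translation, but the ranking criterion, total inlier count across both inputs, is the same.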
Loss & Training¶
- Completely training-free — all pre-trained models (DynamiCrafter, ViewCrafter, DUSt3R, SuperPoint) are used off-the-shelf.
- No ground-truth poses or 3D supervision are required.
Key Experimental Results¶
Main Results — Extreme Pose Estimation Accuracy (Mean Rotation Error MRE↓)¶
| Dataset | DUSt3R (Direct) | InterPose (Single-Stage) | PoseCrafter (Hybrid) |
|---|---|---|---|
| Cambridge Landmarks | 22.3° | 17.8° | 14.5° |
| ScanNet | 25.1° | 19.7° | 16.2° |
| DL3DV-10K | 18.6° | 15.2° | 14.3° |
| NAVI | 11.2° | 7.8° | 6.9° |
Ablation Study — Number of Relay Frames¶
| Relay Frames | Cambridge MRE↓ | ScanNet MRE↓ | NAVI MRE↓ |
|---|---|---|---|
| 2 | 20.6° | 19.7° | 7.8° |
| 4 | 14.5° | 16.2° | 6.9° |
| 6 | 16.7° | 17.0° | 7.2° |
| 16 (all) | 17.8° | 18.6° | 10.9° |
HVG vs. Single-Model Baselines¶
| Method | Cambridge MRE↓ | DUSt3R Confidence |
|---|---|---|
| DynamiCrafter only | 17.8° | Low (blurry central frames) |
| ViewCrafter only | 19.2° | Medium (no reliable initial poses) |
| HVG (coupled) | 14.5° | High |
Key Findings¶
- 4 relay frames are optimal — too few (2) provide insufficient structural information; too many (6+) introduce blurry intermediate frames that degrade overall pose estimation accuracy.
- HVG outperforms any single video model — DynamiCrafter provides reliable near-endpoint frames, while ViewCrafter leverages pose conditioning for sharp synthesis; the combination is super-additive.
- FMS is faster and more effective than InterPose's self-consistency score — it directly measures feature correspondence counts without requiring multiple video generation passes.
- DUSt3R confidence maps show that HVG frames achieve significantly higher confidence than DynamiCrafter frames, confirming that sharper frames genuinely benefit pose estimation.
Highlights & Insights¶
- The "reliable relay frame" concept is elegant — not all synthesized frames are equally valuable; only those near the input endpoints are trustworthy. This insight stems from a deep understanding of video diffusion model failure modes.
- The coupling of two models rather than simple sequential substitution — DynamiCrafter addresses "where to start," while ViewCrafter addresses "how to get there."
- FMS aligns frame selection with the downstream task objective (feature matching quality ≈ pose estimation utility), making it more purposeful than statistical proxy scores.
Limitations & Future Work¶
- The method relies on DUSt3R for intermediate pose estimation, so errors in DUSt3R propagate through the pipeline.
- ViewCrafter's synthesis quality is bounded by its pre-training data and model capacity.
- Inference cost remains high, as two video models, DUSt3R, and feature matching must all be executed.
- Evaluation is limited to static scenes — object motion in dynamic scenes would introduce additional challenges.
Related Work & Insights¶
- vs. InterPose: InterPose employs single-stage video interpolation with statistical frame selection; PoseCrafter improves on both dimensions with hybrid generation and feature-matching-based selection.
- vs. DUSt3R: DUSt3R directly estimates pose/depth from two images; PoseCrafter "bridges" non-overlapping pairs by synthesizing intermediate frames, effectively increasing the available information.
- vs. JOG3R: JOG3R fine-tunes intermediate features of a video model for SfM; PoseCrafter is entirely training-free.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Novel combination of hybrid video generation and feature-matching-based frame selection.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four datasets, detailed ablations, and comparisons against InterPose and DUSt3R.
- Writing Quality: ⭐⭐⭐⭐ — Clear method pipeline with intuitive visualizations (confidence map comparisons).
- Value: ⭐⭐⭐⭐ — Addresses a practically relevant problem of extreme viewpoint pose estimation.