GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis

Conference: CVPR 2026 arXiv: 2603.01010 Code: N/A Area: 3D Vision Keywords: Novel View Synthesis, Flow Matching, Geodesic, Probability Density Manifold, Data-to-Data Mapping

TL;DR

This paper proposes a Probability Density Geodesic Flow Matching (PDG-FM) framework that replaces the stochastic noise-to-data diffusion process with deterministic data-to-data flow matching, and steers interpolation paths through high-density regions of the data manifold via probability-density-based geodesics, achieving geometrically consistent novel view synthesis.

Background & Motivation

Novel view synthesis (NVS) aims to generate unseen viewpoints of a scene from limited observations. Although diffusion models achieve high generation quality, they rely on stochastic noise-to-data transitions, which obscure deterministic structure and lead to inconsistent cross-view predictions. Flow Matching offers a deterministic alternative, yet existing conditional Flow Matching (CFM) methods largely employ simple linear interpolation between source and target data, failing to faithfully capture the non-linear geometry of the data manifold in latent space and thus yielding suboptimal view transitions.

Core problem: how to explicitly incorporate data-dependent geometric regularization into the generation process so that intermediate interpolations in flow matching adhere to the data manifold, thereby improving view consistency?

Method

Overall Architecture

PDG-FM consists of two major components:

  1. Data-to-Data Flow Matching (D2D-FM): learns a deterministic flow between paired samples \((x_0, x_1)\) without requiring a noise prior.
  2. Variational Distillation of Geodesics: trains a GeodesicNet to align interpolation paths with the probability density manifold.

Key Designs

  1. Data-to-Data Flow Matching: Unlike conventional noise-to-data Flow Matching, D2D-FM establishes a deterministic flow directly between encoded pairs \((x_0, x_1)\) of different viewpoints of the same scene. The velocity network \(v_\theta(x_t, t, q, c)\) adopts a U-Net architecture with intermediate state \(x_t\) and time \(t\) as inputs; conditioning includes Plücker ray embeddings (target camera pose), CLIP-encoded source-view semantic features (injected via cross-attention), and VAE-encoded source-view spatial features (concatenated with \(x_t\) as input). Linear interpolation: \(x_t = (1-t)x_0 + tx_1 + \sigma_{\min}\epsilon\); target velocity: \(u_t = x_1 - x_0\).

  2. Probability Density Geodesic (PDG): A local metric tensor \(G(x) = p(x)^{-2}I\) is defined inversely proportional to data density, with path length \(S[\gamma] = \int_0^1 \|\dot{\gamma}\|_{G(\gamma)} dt\). The geodesic satisfies the Euler–Lagrange equation: \(\ddot{\gamma} + \|\dot{\gamma}\|^2 (I - \hat{\dot{\gamma}}\hat{\dot{\gamma}}^\top)\nabla\log p(\gamma) = 0\). The data density gradient \(\nabla\log p\) is approximated via the score function of a pretrained diffusion model, estimated at DDIM timestep \(\tau=0.6\) using classifier-free guidance.

  3. GeodesicNet Distillation: A teacher–student architecture is adopted — the teacher network \(\phi_\xi\) optimizes geodesic paths in diffusion latent space by minimizing the Euler–Lagrange residual, while the student network \(\phi_\eta\) distills these paths into VAE space via DDIM inversion. The geodesic interpolation is parameterized as \(x_t = (1-t)x_0 + tx_1 + \phi_\eta(x_0, x_1, t)\), where \(\phi_\eta\) satisfies the boundary constraints \(\phi_\eta(x_0,x_1,0) = \phi_\eta(x_0,x_1,1) = 0\). This two-stage design decouples geometric optimization from efficient path generation.
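For intuition on the density-weighted metric in item 2, the path length can be discretized directly: with \(G(x) = p(x)^{-2}I\), each segment costs its Euclidean length divided by the density at its midpoint, so paths through high-density regions are cheap. A minimal NumPy sketch under an assumed standard-Gaussian density (`path_length` and `log_p` are illustrative names, not from the paper):

```python
import numpy as np

def path_length(path, log_p):
    """Discretized length S = sum_i ||x_{i+1} - x_i|| / p(m_i) under the
    metric G(x) = p(x)^{-2} I, with the density evaluated at segment
    midpoints m_i."""
    seg = np.diff(path, axis=0)
    mid = 0.5 * (path[:-1] + path[1:])
    return float(np.sum(np.linalg.norm(seg, axis=1) * np.exp(-log_p(mid))))

# Unnormalized standard-Gaussian log-density: geodesics under this metric
# prefer paths through the high-density region near the origin.
log_p = lambda x: -0.5 * np.sum(x**2, axis=-1)

xs = np.linspace(-2.0, 2.0, 101)
through = np.stack([xs, np.zeros_like(xs)], axis=1)  # straight through origin
detour = np.stack([xs, 2.0 - np.abs(xs)], axis=1)    # detour via (0, 2)
assert path_length(through, log_p) < path_length(detour, log_p)
```

The same discretization underlies the Euler–Lagrange residual that the teacher network minimizes, only evaluated in latent space with the score-based density proxy.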
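The boundary-constrained interpolation in item 3 can be sketched as follows. Scaling a network output by \(t(1-t)\) is one common way to hard-code \(\phi_\eta(x_0,x_1,0) = \phi_\eta(x_0,x_1,1) = 0\); the paper's exact parameterization may differ, and `phi_eta` here is a toy stand-in for the distilled student network:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy latent dimension

# Toy stand-in for the student GeodesicNet phi_eta. The t*(1-t) factor
# enforces the boundary constraints phi(., ., 0) = phi(., ., 1) = 0
# (an assumed parameterization, not confirmed by the paper).
W = rng.standard_normal((2 * D, D)) * 0.1

def phi_eta(x0, x1, t):
    return t * (1.0 - t) * np.tanh(np.concatenate([x0, x1]) @ W)

def geodesic_interp(x0, x1, t):
    """x_t = (1 - t) x0 + t x1 + phi_eta(x0, x1, t)."""
    return (1.0 - t) * x0 + t * x1 + phi_eta(x0, x1, t)

x0, x1 = rng.standard_normal(D), rng.standard_normal(D)
# The boundary constraints guarantee the path still connects the data pair.
assert np.allclose(geodesic_interp(x0, x1, 0.0), x0)
assert np.allclose(geodesic_interp(x0, x1, 1.0), x1)
```

Because the endpoints are pinned to the data pair, the learned deviation only bends the interior of the path toward the manifold.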

Loss & Training

Three-stage training:

  1. D2D-FM Training: \(\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}[\|v_\theta(x_t,t,q,c) - (x_1-x_0)\|^2]\); source views are augmented with a cosine schedule.
  2. GeodesicNet Distillation: the teacher minimizes the functional derivative \(\ell^\tau(\xi) = \mathbb{E}_t[\text{StopGrad}(g_t) \cdot z_t]\); the student minimizes the MSE \(\ell^0(\eta) = \mathbb{E}_t[\|x_t - \text{DDIM-B}(z_t,c,\tau)\|^2]\).
  3. Geodesic FM Training: fine-tuned from the pretrained velocity network; the target velocity incorporates the time derivative of \(\phi_\eta\): \(v_{\text{target}} = x_1 - x_0 + \partial_t\phi_\eta(x_0,x_1,t)\).
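The stage-3 regression target can be sketched as below, assuming a toy boundary-constrained \(\phi_\eta\) and a central-finite-difference time derivative (an autodiff framework would compute \(\partial_t\phi_\eta\) exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy latent dimension
W = rng.standard_normal((2 * D, D)) * 0.1

def phi_eta(x0, x1, t):
    # Toy stand-in for the distilled GeodesicNet; the t*(1-t) factor is an
    # assumed way to satisfy the boundary constraints phi = 0 at t in {0, 1}.
    return t * (1.0 - t) * np.tanh(np.concatenate([x0, x1]) @ W)

def target_velocity(x0, x1, t, eps=1e-4):
    """v_target = x1 - x0 + d/dt phi_eta(x0, x1, t)."""
    dphi = (phi_eta(x0, x1, t + eps) - phi_eta(x0, x1, t - eps)) / (2 * eps)
    return x1 - x0 + dphi

def geodesic_fm_loss(v_pred, x0, x1, t):
    """Per-sample CFM regression loss ||v_pred - v_target||^2."""
    return float(np.sum((v_pred - target_velocity(x0, x1, t)) ** 2))

x0, x1 = rng.standard_normal(D), rng.standard_normal(D)
# A predictor that exactly matches the target drives the loss to zero.
assert geodesic_fm_loss(target_velocity(x0, x1, 0.5), x0, x1, 0.5) < 1e-12
```

Note that when \(\phi_\eta \equiv 0\) this reduces exactly to the stage-1 linear target \(x_1 - x_0\), which is why the geodesic model can be fine-tuned from the pretrained velocity network.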

  • Optimizer: AdamW; batch size: 256; learning rate: \(1 \times 10^{-5}\); resolution: 256×256.
  • Training data: Objaverse (772k+ 3D objects, 12 rendered views per object).

Key Experimental Results

Main Results

| Dataset | Metric | D2D-FM (Ours) | Naive FM | Free3D | Zero-1-to-3 |
|---|---|---|---|---|---|
| Objaverse | FID↓ | 5.43 | 5.51 | 5.54 | 6.00 |
| Objaverse | PSNR↑ | 20.84 | 20.82 | 20.32 | 19.59 |
| Objaverse | SSIM↑ | 0.8634 | 0.8622 | 0.8537 | 0.8446 |
| GSO30 | FID↓ | 15.05 | 15.28 | 12.06 | 12.58 |
| Objaverse (10 NFE) | FID↓ | 5.82 | 5.78 | 22.45 | - |

Geodesic FM vs. Linear FM:

| Dataset | Metric | Geodesic FM | Linear FM |
|---|---|---|---|
| Objaverse | FID↓ | 10.40 | 11.81 |
| Objaverse | SSIM↑ | 0.8768 | 0.8736 |
| Objaverse | LPIPS↓ | 0.0804 | 0.0809 |

Ablation Study

| Configuration | PPL↓ | AOFM↑ | Description |
|---|---|---|---|
| Linear Interpolation | 0.213 | 1.04 | Simple blending, minimal geometric motion |
| DDIM Initialization | 0.571 | 6.48 | Motion present but unoptimized |
| Geodesic Interpolation | 0.502 | 13.70 | Strongest geometrically consistent motion along the manifold |

Key Findings

  • D2D-FM consistently outperforms Noise-to-Data FM in fidelity and perceptual quality, particularly on FID and LPIPS.
  • The advantage of D2D-FM is more pronounced under few-step inference (10 NFE): the diffusion baseline (Free3D) degrades from FID 5.5 to 22.5, whereas D2D-FM degrades only from 5.4 to 5.8.
  • The average optical flow magnitude (AOFM) of geodesic interpolation is 13× that of linear interpolation, indicating genuine viewpoint transformation rather than simple blending.
  • Geodesic paths exhibit lower Euler–Lagrange residuals, confirming adherence to meaningful manifold structure.
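The AOFM metric itself is simple to state: the mean per-pixel magnitude of a dense optical flow field. A sketch, assuming the flow between adjacent interpolated frames has already been estimated (the paper presumably uses an off-the-shelf flow estimator; `aofm` is an illustrative name):

```python
import numpy as np

def aofm(flow):
    """Average Optical Flow Magnitude: mean Euclidean norm of the
    per-pixel displacement in a dense flow field of shape (H, W, 2)."""
    return float(np.linalg.norm(flow, axis=-1).mean())

# A pure cross-fade between two aligned static frames induces near-zero
# flow, while a genuine viewpoint change displaces every pixel.
static = np.zeros((8, 8, 2))
shifted = np.full((8, 8, 2), [3.0, 4.0])  # every pixel moves by (3, 4)
assert aofm(static) == 0.0
assert np.isclose(aofm(shifted), 5.0)
```

This is why AOFM separates blending from real view transitions: low PPL can be achieved by fading, but only actual pixel motion raises AOFM.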

Highlights & Insights

  • Theoretical Elegance: Introducing probability density geodesics into conditional Flow Matching provides a rigorous mathematical framework for geometric regularization in generative models.
  • Two-Stage Decoupled Design: GeodesicNet distillation separates the score-dependent Riemannian metric from flow model training and deployment, ensuring computational efficiency.
  • Data-to-Data Paradigm: Eliminating the noise prior and establishing deterministic flows directly between structured data pairs yields significant advantages under few-step inference.
  • AOFM as a Novel Evaluation Metric: Low PPL may correspond to simple blending; AOFM more faithfully reflects the quality of genuine viewpoint transformation.

Limitations & Future Work

  • Experiments are limited to single-object NVS (Objaverse/GSO); validation on scene-level multi-view data is absent.
  • GeodesicNet distillation requires the score function of a pretrained diffusion model, increasing training complexity and dependency.
  • Generation resolution is limited to 256×256, falling short of current high-resolution generation requirements.
  • Evaluation focuses primarily on perceptual metrics, lacking direct measures of 3D consistency (e.g., multi-view reconstruction quality).
  • Geodesic optimization in high-dimensional latent spaces may be susceptible to local optima.
Comparison with Related Work

  • Riemannian Flow Matching (RFM) performs flow matching on a fixed geometry; PDG-FM further introduces data-dependent metrics.
  • The Zero-1-to-3 series represents canonical diffusion-based NVS methods; this work demonstrates the advantages of Flow Matching on the same architectural backbone.
  • Metric Flow Matching (MFM) applies Riemannian metrics directly during training; the two-stage approach proposed here is more computationally efficient.
  • Insights: the probability density geodesic idea extends naturally to video generation, 4D scene modeling, and other tasks requiring spatiotemporal consistency; using the score function as a proxy for a manifold metric is a broadly applicable and valuable idea.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of probability density geodesics and Flow Matching is novel, with solid theoretical contributions.
  • Experimental Thoroughness: ⭐⭐⭐ Evaluation is confined to single-object NVS; scene-level and high-resolution experiments are missing.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous, the framework is clearly presented, and illustrations are intuitive.
  • Value: ⭐⭐⭐⭐ Opens a new direction for geometric regularization in Flow Matching, offering meaningful inspiration to the generative modeling community.