
Any Resolution Any Geometry: From Multi-View To Multi-Patch

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://dreamaker-mrc.github.io/Any-Resolution-Any-Geometry (Project Page)
Area: 3D Vision
Keywords: High-resolution depth estimation, surface normals, multi-patch Transformer, cross-patch attention, VGGT

TL;DR

A single ultra-high-definition image is decomposed into patches, which are treated as "virtual multi-views" and processed jointly within a VGGT-style framework. Combined with cross-patch attention for global consistency reasoning, the model outputs sharp and globally coherent high-resolution depth maps and surface normals in a single forward pass, reducing AbsRel on UnrealStereo4K from 0.0582 (PatchRefiner, p=49) to 0.0291.

Background & Motivation

Background: High-resolution depth/normal estimation is essential for 3D reconstruction and scene understanding. However, mainstream joint depth-normal models (GeoWizard, Metric3D v2) are constrained by GPU memory and computation, often limited to low input resolutions, which blurs details when processing 4K/8K images.

Limitations of Prior Work: To handle high resolutions, the community has turned to patch-based refinement (PatchFusion, PatchRefiner, PRO), which predicts patches independently and stitches them. This approach suffers from two issues: ① Methods like PatchRefiner refine blocks in isolation, lacking information exchange between adjacent patches, leading to depth jumps and blocking artifacts at seams; ② These pipelines are primarily designed for depth only, making it difficult to extend to joint depth-normal estimation while ensuring geometric consistency at scale.

Key Challenge: High-resolution geometric prediction must satisfy two conflicting goals—preserving local details like object boundaries (requiring a small receptive field) while maintaining global consistency for the entire depth/normal field (requiring a large receptive field). Patching solves details and memory but sacrifices the global view; avoiding patching preserves the global view but fails at high resolutions.

Key Insight: The authors observe that multi-view Transformers (DUSt3R, VGGT) have demonstrated that "processing multiple views in a unified backbone using attention for global information propagation" effectively scales geometric prediction. This leads to the question: can patches from a single high-res image be treated as "virtual multi-views"? The relationships between patches correspond precisely to those between multiple views.

Core Idea: Transfer the "multi-view" paradigm to a "multi-patch" framework. Patches of a high-res image are processed jointly in a shared backbone, replacing inter-view attention with cross-patch attention to preserve local details while enforcing global consistency via global token communication. This results in the Ultra Resolution Geometry Transformer (URGT).

Method

Overall Architecture

URGT acts as a geometry refiner: given a high-resolution RGB image \(I \in \mathbb{R}^{3 \times H \times W}\), it first obtains low-resolution coarse estimates using off-the-shelf models (Depth-Anything v2 for \(D^{coarse}\), Metric3D v2 for \(n^{coarse}\)), which are brought to the input resolution via bilinear upsampling. The image and coarse estimates are then split into patches and fed into a unified Transformer that outputs offsets relative to the coarse estimates; these offsets are added back to obtain refined high-resolution depth and normals. The key is enabling the model to reason within patches (local details) and across patches (global consistency) in a single forward pass.

Specifically, the \(k\)-th RGB patch \(J_k\) and its aligned coarse depth/normal crops are encoded by DINOv2 into visual, depth, and normal tokens. These are summed element-wise into a geometry-aware representation \(t^{joint}_k = t_{J_k} + t_{D^{coarse}_k} + t_{n^{coarse}_k}\). The fused tokens of all patches form a unified sequence that passes through \(L\) blocks of alternating "intra-patch attention + cross-patch attention." Finally, a lightweight DPT-style head predicts the depth offset \(\Delta^{Depth}_k\) and normal offset \(\Delta^{Normal}_k\), yielding \(D^{refined}_k = D^{coarse}_k + \Delta^{Depth}_k\) and \(n^{refined}_k = n^{coarse}_k + \Delta^{Normal}_k\).
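The token fusion and residual refinement described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: plain Python lists stand in for the DINOv2 token embeddings and the DPT head output, and the function names `fuse_tokens` and `refine` are hypothetical.

```python
from typing import List

Vector = List[float]


def fuse_tokens(t_rgb: Vector, t_depth: Vector, t_normal: Vector) -> Vector:
    """Element-wise sum of the three embeddings at one token position."""
    if not (len(t_rgb) == len(t_depth) == len(t_normal)):
        raise ValueError("all three token embeddings must share one dimension")
    return [a + b + c for a, b, c in zip(t_rgb, t_depth, t_normal)]


def refine(coarse: Vector, offset: Vector) -> Vector:
    """Residual refinement: refined = coarse + predicted offset."""
    if len(coarse) != len(offset):
        raise ValueError("coarse estimate and offset must have the same length")
    return [c + d for c, d in zip(coarse, offset)]


# Toy values (assumed, for illustration only).
t_joint = fuse_tokens([1.0, 2.0], [0.5, 0.5], [0.0, -1.0])
d_refined = refine([10.0, 12.0], [-0.25, 0.5])
print(t_joint)    # [1.5, 1.5]
print(d_refined)  # [9.75, 12.5]
```

Because the head only has to predict the offset, a zero offset already reproduces the coarse estimate, which is what makes the refinement task easier than predicting absolute values.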

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["High-res RGB Image<br/>+ Coarse Depth/Normal Priors<br/>(DAv2 + Metric3D v2 Upsampled)"] --> B["Multi-view to Multi-patch Reconstruction<br/>Patching + DINOv2 Encoding<br/>Token Summation Fusion"]
    B --> C["Global Positional Encoding + Cross-Patch Attention<br/>Global RoPE Calibration<br/>L× (intra + cross attention)"]
    C --> D["DPT Head Predicts Offsets"]
    D -->|Offset + Coarse Estimate| E["Refined High-res Depth + Normal"]

Key Designs

1. Multi-view to Multi-patch: Treating Patches as Virtual Views for Joint Processing

This directly addresses the loss of global context in patch-based methods. The authors reuse the unified Transformer architecture from VGGT designed for multi-views but redefine its semantics to "multi-patch." Each patch is no longer processed in isolation but is treated as a view within the same sequence and backbone. Every patch carries its aligned coarse geometric priors, making the tokens "geometry-aware" from the start. This allows the model to learn residual offsets rather than absolute values, simplifying the task, while cross-patch attention naturally provides a global communication mechanism.

2. Global Positional Encoding + Cross-patch Attention: Enforcing Global Consistency while Preserving Details

In standard patching, local coordinates start from (0,0) for each patch, making tokens at the same physical location indistinguishable across patches. The authors assign a global origin \((x_k, y_k)\) for each patch \(J_k\), mapping local coordinates \(p_i=(u_i,v_i)\) to global coordinates \(p^g_i = (u_i + x_k, v_i + y_k)\), which are then used for RoPE encoding.
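The coordinate shift can be illustrated with a small sketch. It assumes integer token coordinates in row-major order and two horizontally adjacent patches of 2×2 tokens; the helper names `local_grid` and `to_global` are hypothetical, and the RoPE rotation itself is omitted.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def local_grid(width: int, height: int) -> List[Coord]:
    """Local token coordinates (u, v) in row-major order, starting at (0, 0)."""
    return [(u, v) for v in range(height) for u in range(width)]


def to_global(coords: List[Coord], origin: Coord) -> List[Coord]:
    """Shift local coordinates by the patch origin (x_k, y_k)."""
    x_k, y_k = origin
    return [(u + x_k, v + y_k) for u, v in coords]


# Two horizontally adjacent patches, each 2x2 tokens.
local = local_grid(2, 2)
left = to_global(local, (0, 0))
right = to_global(local, (2, 0))
print(left)   # [(0, 0), (1, 0), (0, 1), (1, 1)]
print(right)  # [(2, 0), (3, 0), (2, 1), (3, 1)]
```

With local coordinates both patches would share the same four positions; after the shift every token has a unique position in the full image, so positional encoding can reflect true spatial offsets between patches.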

Two types of attention are alternated: intra-patch attention performs self-attention within a patch (\(\text{softmax}(\tilde Q_c \tilde K_c^\top/\sqrt{d})\tilde V_c\)) to refine local details; cross-patch attention operates on the entire sequence (\(\text{softmax}(\tilde Q \tilde K^\top/\sqrt{d})\tilde V\)), allowing every token to attend to all others across the image. Removing cross-patch attention causes AbsRel to degrade from 0.0500 to 0.0678, while replacing global RoPE with local RoPE results in a consistency error (CE) spike from 0.0635 to 0.2830.
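The difference between the two attention types comes down to which tokens form the sequence. The sketch below is a simplified illustration under stated assumptions: tokens are used directly as queries, keys, and values (no learned projections, no RoPE, single head), and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out


def intra_patch_attention(patches: List[Matrix]) -> List[Matrix]:
    """Each patch attends only to its own tokens."""
    return [attention(p, p, p) for p in patches]


def cross_patch_attention(patches: List[Matrix]) -> List[Matrix]:
    """All tokens of all patches attend to each other as one sequence."""
    seq = [tok for p in patches for tok in p]
    out = attention(seq, seq, seq)
    result, start = [], 0
    for p in patches:  # split the sequence back into per-patch groups
        result.append(out[start:start + len(p)])
        start += len(p)
    return result


# Two toy patches whose tokens carry different content.
patches = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
intra = intra_patch_attention(patches)
cross = cross_patch_attention(patches)
print(intra[0][0])  # [1.0, 0.0]: no information from the other patch
print(cross[0][0])  # second component is now non-zero
```

In the intra-patch case the first patch never sees the second, so its tokens are unchanged; in the cross-patch case information from the other patch leaks in, which is the mechanism the ablation credits for removing seam artifacts.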

3. GridMix Sampling: Probabilistic Multi-scale Patching as Data Augmentation

To overcome the scarcity of high-resolution training data and improve generalization, GridMix is used. While the patch dimensions are fixed at \(\frac{H}{4}\times\frac{W}{4}\), each iteration randomly selects one of four configurations based on a probability distribution: \(M=1\) (single random patch), \(M\in\{2,3\}\) (random \(M\times M\) grid), and \(M=4\) (fixed \(4\times4\) grid covering the image). This forces the model to adapt to various granularities. Ablations show \((p_1,p_2,p_3,p_4)=(0.1,0.2,0.3,0.4)\) is optimal.
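One way to picture GridMix is as a weighted draw of the grid size followed by a random placement. The sketch below is an assumption-laden illustration: it places each \(M\times M\) block on the fixed 4×4 patch lattice, whereas the source does not say whether random patches are lattice-aligned, and `sample_gridmix` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple

# (p_1, p_2, p_3, p_4): the setting reported as best in the ablation.
GRIDMIX_PROBS = (0.1, 0.2, 0.3, 0.4)


def sample_gridmix(
    rng: random.Random,
    probs: Sequence[float] = GRIDMIX_PROBS,
    grid: int = 4,
) -> List[Tuple[int, int]]:
    """Return the (row, col) lattice cells of the patches used this iteration."""
    # Draw the grid size M in {1, 2, 3, 4} according to the probabilities.
    m = rng.choices([1, 2, 3, 4], weights=probs, k=1)[0]
    # Random top-left corner; for M == grid the only valid corner is (0, 0).
    r0 = rng.randint(0, grid - m)
    c0 = rng.randint(0, grid - m)
    return [(r0 + i, c0 + j) for i in range(m) for j in range(m)]


rng = random.Random(0)
for _ in range(3):
    cells = sample_gridmix(rng)
    print(len(cells), cells)
```

Each call yields 1, 4, 9, or 16 patches, so the number of tokens per training step varies, which is the source of the multi-granularity exposure described above.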

4. Geometric Consistency Supervision: Binding Depth and Normals to the Same Geometry

To prevent conflicts between predicted depth and normals, the authors derive a pseudo-normal field \(n^{pseudo}\) from the GT depth \(D^{gt}\) via local least squares. The depth loss \(L_{depth}\) constrains numerical accuracy (MSE) and boundary sharpness (gradient), while the normal loss \(L_{normal}\) constrains orientation (angular loss) and alignment. Since \(n^{pseudo}\) is derived from \(D^{gt}\), both heads are bound to the same underlying geometry, leading to self-consistency.
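As a rough illustration of deriving normals from depth by local least squares, the sketch below fits a plane \(z = ax + by + c\) to each 3×3 depth window and takes the normal proportional to \((-a, -b, 1)\). It is a simplification under stated assumptions: pixel indices are used directly as \(x, y\) coordinates (camera intrinsics are ignored), the sign convention of the normal is a choice made here, and the window size and function names are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Tuple

Normal = Tuple[float, float, float]


def pseudo_normal(depth: List[List[float]], row: int, col: int) -> Normal:
    """Unit normal at an interior pixel from a least-squares plane fit (3x3 window)."""
    sx = sy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            z = depth[row + dy][col + dx]
            sx += dx * z
            sy += dy * z
    # Closed-form least squares on a centered 3x3 grid: sum(dx^2) = sum(dy^2) = 6.
    a, b = sx / 6.0, sy / 6.0
    norm = math.sqrt(a * a + b * b + 1.0)
    return (-a / norm, -b / norm, 1.0 / norm)


def angular_error_deg(n1: Normal, n2: Normal) -> float:
    """Angle in degrees between two unit normals."""
    dot = sum(p * q for p, q in zip(n1, n2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


# Planar ramp z = 2 + 0.5 * x, so the fitted slopes are a = 0.5 and b = 0.
ramp = [[2.0 + 0.5 * x for x in range(4)] for _ in range(4)]
n = pseudo_normal(ramp, 1, 1)
print(n)
print(angular_error_deg(n, (0.0, 0.0, 1.0)))
```

A predicted normal can then be compared against this depth-derived normal with `angular_error_deg`, which is the sense in which both outputs are tied to the same underlying surface.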

Key Experimental Results

Main Results

Joint depth and normal evaluation on UnrealStereo4K (4K images):

| Method | Inference Time↓ | AbsRel↓ | δ1↑ | RMSE↓ | CE↓ |
| --- | --- | --- | --- | --- | --- |
| Depth-Anything v2 | - | 0.0812 | 0.924 | 2.86 | - |
| PatchRefiner (p=16) | 1.02s | 0.0633 | 0.950 | 2.28 | 0.0753 |
| PatchRefiner (p=49) | 4.12s | 0.0582 | 0.956 | 2.17 | 0.0715 |
| PRO | 1.88s | 0.0771 | 0.927 | 2.73 | 0.0549 |
| Ours (Separate) | 0.94s | 0.0295 | 0.982 | 1.38 | 0.0418 |
| Ours (Joint) | 0.97s | 0.0291 | 0.983 | 1.31 | 0.0415 |

Normal estimation (UnrealStereo4K, Angular Error):

| Method | Mean↓ | RMSE↓ | Median↓ | <5°↑ | <11.25°↑ | <30°↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Metric3D v2 | 23.36 | 33.15 | 13.90 | 11.74 | 44.96 | 79.77 |
| Ours (Joint) | 18.51 | 28.83 | 9.60 | 29.37 | 59.43 | 85.06 |

Compared to PatchRefiner (p=49), the joint model halves AbsRel (0.0582 to 0.0291, a 50% reduction) and lowers RMSE by about 40% (2.17 to 1.31), while running in 0.97s instead of 4.12s.

Ablation Study

| Configuration | Key Metric | Description |
| --- | --- | --- |
| GridMix (0.1,0.2,0.3,0.4) | AbsRel 0.0295 / CE 0.0418 | Optimal mixed granularity |
| Pure 1×1 (1,0,0,0) | AbsRel 0.0500 / CE 0.0648 | Good details, poor global context |
| Pure 4×4 (0,0,0,1) | AbsRel 0.0321 / CE 0.0635 | Poor consistency |
| Global RoPE | AbsRel 0.0321 / CE 0.0635 | Core alignment mechanism |
| Local RoPE | AbsRel 0.0343 / CE 0.2830 | Causes severe cross-patch misalignment |
| w/ Cross-Patch Attn | AbsRel 0.0500 / RMSE 2.01 | Full model |
| w/o Cross-Patch Attn | AbsRel 0.0678 / RMSE 2.51 | Significant seam artifacts |

Key Findings

  • Cross-patch attention is essential: Removing it degrades AbsRel from 0.0500 to 0.0678 and RMSE from 2.01 to 2.51, and makes patch seams visible.
  • Global RoPE is critical for alignment: Switching to local RoPE causes the consistency error (CE) to surge, proving its role in cross-patch alignment.
  • Mixed sampling is superior: Probabilistic mixing of different grid sizes outperforms any single fixed slicing method.
  • Joint training is beneficial: Coupling depth and normals gives a small but consistent gain over separate training (AbsRel 0.0295 to 0.0291, RMSE 1.38 to 1.31, CE 0.0418 to 0.0415).
  • Scalable to 8K: The framework can process 8K images without retraining, maintaining fine details and global coherence.

Highlights & Insights

  • Elegant "Patch = Virtual View" paradigm: Transfers the mature multi-view Transformer architecture to high-res single-image tasks by redefining semantics.
  • Predicting offsets instead of absolute values: Building on frozen foundation model priors simplifies the task to refinement, leveraging pre-trained generalization.
  • Global RoPE utility: A simple global coordinate shift enables standard RoPE to achieve cross-patch alignment with zero extra parameters.
  • GridMix as data augmentation: Treats patching strategy as a stochastic training variable, enhancing robustness to various resolutions and slicing methods.

Limitations & Future Work

  • Dependency on coarse priors: URGT is a refiner; if the foundation models (DAv2/Metric3D v2) fail completely in a specific domain, the refinement cannot recover the geometry.
  • Noisy pseudo-normal supervision: \(n^{pseudo}\) is derived from GT depth and may contain noise, limiting the upper bound of normal accuracy.
  • Computational scaling: Memory usage grows with the number of patches; scaling behavior for resolutions beyond 8K has not been fully characterized.
  • GridMix hyperparameter sensitivity: The probabilities \((p_1, p_2, p_3, p_4)\) require manual tuning and lack an adaptive solution.

Comparison with Related Work

  • vs PatchRefiner / PatchFusion: These methods rely on post-hoc fusion or consistency losses, whereas URGT uses single-pass global communication, proving faster and more consistent.
  • vs Metric3D v2 / GeoWizard: Instead of competing with these foundation models, URGT builds upon them to achieve high-resolution refinement.
  • vs VGGT / DUSt3R: Proves that set-based geometry reasoning is not limited to physical multi-view scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐