
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection

Conference: CVPR2026 arXiv: 2603.05042 Code: GitHub (open-sourced per authors; link to be confirmed) Area: Autonomous Driving Keywords: multi-camera 3D object detection, cross-configuration generalization, spatial prior modulation, 3D Gaussian data augmentation, BEV perception

TL;DR

CoIn3D is a framework that explicitly models the spatial prior discrepancies arising from camera intrinsics, extrinsics, and array layouts through two modules, Spatial-aware Feature Modulation (SFM) and Camera-aware Data Augmentation (CDA), enabling multi-camera 3D detection models trained on a source configuration to generalize to unseen target configurations. The framework is plug-and-play across three mainstream paradigms, represented by BEVDepth, BEVFormer, and PETR.

Background & Motivation

  1. Widespread deployment of multi-camera 3D detection (MC3D): Autonomous vehicles and robotic platforms increasingly adopt surround-view multi-camera setups for 3D object detection, creating an urgent need for cross-platform deployment capability.
  2. Difficulty of cross-configuration generalization: Current MC3D models perform well under their training configurations but suffer severe performance degradation when transferred to new platforms with different intrinsics, extrinsics, camera counts, and layouts. For instance, direct transfer of BEVDepth from NuScenes to Waymo yields only 0.040 mAP.
  3. Incompleteness of existing solutions: Prior methods either align images to a meta-camera via warping (causing resolution loss and 3D scene structure distortion) or address only focal length discrepancies (virtual focal length + depth rescaling), without comprehensively accounting for extrinsics and array layouts.
  4. Focal length ambiguity: The same object occupies different pixel-space sizes under different focal lengths, which introduces ambiguity into depth estimation and feature aggregation and prevents a consistent understanding of object distance.
  5. Ground geometry priors vary with extrinsics: Cameras mounted at different heights and orientations produce distinct ground depth distributions and depth growth rates, causing models to overfit to specific perspective effects during training.
  6. Array layout differences affect multi-camera fusion: Variations in camera count and overlap regions across platforms directly alter multi-camera feature association and fusion patterns, which existing methods do not model.

Method

Overall Architecture

CoIn3D comprises two core modules: Spatial-aware Feature Modulation (SFM) and Camera-aware Data Augmentation (CDA). During training, CDA first renders novel-view images under randomly sampled configurations via 3DGS, after which SFM embeds spatial priors into features. During inference, only SFM is required for generalization to new configurations. The framework is plug-and-play across three paradigms: bottom-up BEV (BEVDepth), top-down BEV (BEVFormer), and sparse query (PETR).

Spatial-aware Feature Modulation (SFM)

SFM explicitly encodes camera configurations via four types of pixel-level spatial prior representations:

  1. Inverse Focal Map: Image features are multiplied by the inverse squared focal length \(M_{IF} = \mathbf{1} \cdot \frac{1}{f^2}\), eliminating focal length ambiguity. The rationale is that a \(k\)-fold difference in focal length causes a \(k^2\)-fold difference in object pixel area; normalization renders feature activations consistent across focal lengths.
  2. Ground Depth Map: Assuming a flat ground plane, a plane equation \(Ax+By+Cz+D=0\) is fitted from at least three non-collinear ground points; for each pixel, the ground depth is then obtained by intersecting its viewing ray with the plane, giving \(z(u,v) = -\frac{D}{AX+BY+C}\), where \((X, Y, 1)^{\top} = \mathbf{K}^{-1}(u, v, 1)^{\top}\) is the ray direction. This provides the model with explicit scene spatial priors.
  3. Ground Gradient Map: Obtained by computing row-wise differences of the ground depth map followed by a log-inverse transform \(M_{GG} = \log(\frac{1}{\Delta z} + 1)\), encoding differences in depth growth rates across varying mounting heights and preventing overfitting to specific perspective effects.
  4. Plücker Raymap: For each pixel, the ray direction \(\mathbf{d} = \mathbf{R}\mathbf{K}^{-1}\mathbf{p}\) from the optical center and the moment \(\mathbf{m} = \mathbf{t} \times \mathbf{d}\) are computed, yielding a 6-channel Plücker coordinate representation that holistically encodes FoV, rotation, translation, and continuous spatial positions of pixels across cameras.
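For concreteness, here is a minimal NumPy sketch of how the four pixel-level prior maps could be computed from a pinhole camera model. The function name, the convention that the ground plane is expressed in the camera frame, and the epsilon/clipping details are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spatial_prior_maps(K, R, t, plane, H, W):
    """Illustrative computation of the four pixel-level priors:
    inverse focal map, ground depth, ground gradient, Pluecker raymap.

    K     : (3, 3) camera intrinsics
    R, t  : camera-to-ego rotation (3, 3) and translation (3,)
    plane : (A, B, C, D) of the ground plane Ax + By + Cz + D = 0 (camera frame)
    """
    f = K[0, 0]
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3) homogeneous pixels

    # Inverse focal map: constant 1 / f^2 at every pixel.
    M_if = np.full((H, W), 1.0 / f**2)

    # Normalized ray directions in the camera frame: K^{-1} p (z-component is 1).
    rays_cam = pix @ np.linalg.inv(K).T                       # (H, W, 3)

    # Ground depth map: intersect each ray with the plane, z = -D / (A*X + B*Y + C).
    A, B, C, D = plane
    denom = A * rays_cam[..., 0] + B * rays_cam[..., 1] + C
    safe = np.where(np.abs(denom) > 1e-6, denom, np.inf)
    M_gd = np.clip(-D / safe, 0.0, None)                      # keep rays hitting the ground ahead

    # Ground gradient map: row-wise depth differences + log-inverse transform.
    dz = np.abs(np.diff(M_gd, axis=0, prepend=M_gd[:1]))
    M_gg = np.log(1.0 / (dz + 1e-6) + 1.0)

    # Pluecker raymap: direction d = R K^{-1} p, moment m = t x d, 6 channels per pixel.
    d = rays_cam @ R.T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    m = np.cross(t, d)                                        # broadcasts t over the pixel grid
    M_pr = np.concatenate([d, m], axis=-1)                    # (H, W, 6)

    return M_if, M_gd, M_gg, M_pr
```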

Fusion pipeline: The inverse focal map is first multiplied with the image features to obtain focal-length-invariant features \(F^1\); the ground depth (GD), ground gradient (GG), and Plücker raymap (PR) priors are then concatenated and encoded by a shallow projector into spatial embeddings that are added to \(F^1\) to produce \(F^2\); finally, the four raw prior maps are concatenated with \(F^2\) to yield the final spatial-aware feature \(F^3\).
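A minimal PyTorch sketch of this three-stage fusion, assuming the prior maps have already been resized to the feature resolution; the module name, projector architecture, and channel counts are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class SFMFusion(nn.Module):
    """Illustrative fusion of the four spatial priors with image features:
    F1 = F * M_if, F2 = F1 + proj(cat(GD, GG, PR)), F3 = cat(priors, F2)."""

    def __init__(self, feat_channels):
        super().__init__()
        # Shallow projector mapping the concatenated GD/GG/PR priors (1+1+6 channels)
        # into the image-feature space.
        self.projector = nn.Sequential(
            nn.Conv2d(8, feat_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=1),
        )

    def forward(self, feat, m_if, m_gd, m_gg, m_pr):
        # feat: (B, C, h, w); m_if/m_gd/m_gg: (B, 1, h, w); m_pr: (B, 6, h, w).
        # Stage 1: focal-length-invariant features.
        f1 = feat * m_if
        # Stage 2: add spatial embeddings from ground depth / gradient / Pluecker priors.
        priors = torch.cat([m_gd, m_gg, m_pr], dim=1)         # (B, 8, h, w)
        f2 = f1 + self.projector(priors)
        # Stage 3: concatenate the raw prior maps with the modulated features.
        f3 = torch.cat([m_if, m_gd, m_gg, m_pr, f2], dim=1)   # (B, C + 9, h, w)
        return f3
```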

Camera-aware Data Augmentation (CDA)

A training-free ego-centric 3DGS construction pipeline is proposed to dynamically and efficiently generate training images under diverse configurations:

  1. Decomposition and reconstruction: Using 4D annotations, LiDAR sequences are decomposed into foreground objects and background, each reconstructed as meshes via TSDF integration; object meshes are patched into closed surfaces.
  2. Depth rendering and completion: Meshes are assembled per-frame according to annotations and rendered as depth maps, followed by depth completion to fill regions without mesh coverage.
  3. Texture asset construction: Point clouds are sampled from object meshes and camera blind spots; textures are retrieved via cross-frame depth matching and completed for invisible regions.
  4. Gaussian representation: RGB-D images are projected into textured point clouds, and each point is assigned an isotropic Gaussian (fixed radius, no rotation, opacity set to 1), enabling high-speed point-based rendering at roughly 450 fps.
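A rough sketch of the last step: an RGB-D frame is back-projected into a colored point cloud and each point becomes an isotropic Gaussian with a fixed radius and opacity 1. The radius value, the camera-to-world convention, and the output dictionary layout are illustrative assumptions, not the paper's asset format.

```python
import numpy as np

def rgbd_to_isotropic_gaussians(rgb, depth, K, cam2world, radius=0.05):
    """Illustrative conversion of one RGB-D image into isotropic Gaussians
    (fixed radius, no rotation, opacity 1), following the CDA description.

    rgb       : (H, W, 3) image
    depth     : (H, W) rendered/completed per-pixel depth
    K         : (3, 3) intrinsics
    cam2world : (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    n = int(valid.sum())

    # Back-project valid pixels into the camera frame, then into the world frame.
    rays = np.stack([u[valid], v[valid], np.ones(n)], axis=-1) @ np.linalg.inv(K).T
    pts_cam = rays * depth[valid][:, None]
    pts_world = pts_cam @ cam2world[:3, :3].T + cam2world[:3, 3]

    return {
        "means": pts_world,                          # (N, 3) Gaussian centers
        "scales": np.full((n, 3), radius),           # isotropic, fixed radius
        "colors": rgb[valid] / 255.0,                # per-Gaussian RGB
        "opacities": np.ones(n),                     # opacity fixed to 1
    }
```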

During training, novel camera configurations are randomly sampled to render novel-view images; random focal length scaling augmentation is additionally applied to original images.
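A minimal sketch of the focal-length scaling augmentation, under the assumption that scaling the focal length by \(s\) is realized by resizing the image by \(s\) and cropping or zero-padding back to the original resolution while updating the intrinsics; the exact resize/crop policy used by the authors is not specified in this summary.

```python
import numpy as np
import cv2

def random_focal_scaling(img, K, scale_range=(0.8, 1.2)):
    """Illustrative focal-length scaling: resize by s (equivalent to multiplying
    the focal length and principal point by s), then crop/pad back to (H, W)."""
    H, W = img.shape[:2]
    s = float(np.random.uniform(*scale_range))

    resized = cv2.resize(img, (int(round(W * s)), int(round(H * s))))
    h, w = resized.shape[:2]

    K_aug = K.copy().astype(np.float64)
    K_aug[:2] *= s                       # f_x, f_y, c_x, c_y all scale with the resize

    out = np.zeros_like(img)
    dy, dx = (h - H) // 2, (w - W) // 2
    if s >= 1.0:                         # center-crop back to the original size
        out = resized[dy:dy + H, dx:dx + W]
        K_aug[0, 2] -= dx
        K_aug[1, 2] -= dy
    else:                                # zero-pad back to the original size
        dy, dx = -dy, -dx
        out[dy:dy + h, dx:dx + w] = resized
        K_aug[0, 2] += dx
        K_aug[1, 2] += dy
    return out, K_aug
```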

Loss & Training

The original detection losses of each backbone model (BEVDepth / BEVFormer / PETR) are retained. SFM and CDA, as plug-and-play modules, introduce no additional training losses.

Key Experimental Results

Main Results: Cross-Dataset Generalization with BEVDepth

| Setting | Method | mAP↑ | mATE↓ | mAOE↓ | NDS*↑ |
| --- | --- | --- | --- | --- | --- |
| NuScenes→Waymo | Direct Transfer | 0.040 | 1.303 | 0.790 | 0.178 |
| NuScenes→Waymo | UDGA-BEV (Prev. SOTA) | 0.349 | 0.754 | 0.250 | 0.459 |
| NuScenes→Waymo | CoIn3D (Ours) | 0.381 | 0.687 | 0.155 | 0.513 |
| NuScenes→Lyft | Direct Transfer | 0.112 | 0.997 | 0.389 | 0.296 |
| NuScenes→Lyft | UDGA-BEV | 0.324 | 0.709 | 0.180 | 0.487 |
| NuScenes→Lyft | CoIn3D (Ours) | 0.375 | 0.660 | 0.101 | 0.534 |
| Waymo→NuScenes | CoIn3D (Ours) | 0.349 | 0.727 | 0.179 | 0.481 |
| Lyft→NuScenes | CoIn3D (Ours) | 0.303 | 0.647 | 0.377 | 0.452 |

State-of-the-art results are achieved across all settings, with NDS* gains over UDGA-BEV of +0.054 / +0.047 / +0.004 / +0.031 respectively.

Cross-Paradigm Generalization: BEVFormer and PETR

| Setting | Method | mAP↑ | NDS*↑ |
| --- | --- | --- | --- |
| N→L (BEVFormer) | Direct Transfer | 0.149 | 0.115 |
| N→L (BEVFormer) | CoIn3D | 0.237 | 0.377 |
| N→L (PETR) | Direct Transfer | 0.013 | 0.046 |
| N→L (PETR) | CoIn3D | 0.332 | 0.456 |

CoIn3D is the first unified cross-configuration generalization framework applicable to all three major MC3D paradigms.

Ablation Study

Module ablation (NuScenes→Waymo):

| CDA | SFM | NDS*↑ |
| --- | --- | --- |
|  |  | 0.178 |
|  | ✓ | 0.358 |
| ✓ |  | 0.224 |
| ✓ | ✓ | 0.513 |

  • SFM alone is effective (+0.180); CDA alone yields limited gain (+0.046); their combination produces strong synergy.
  • The Camera-Aware SE module in BEVDepth conflicts with SFM; removing CA yields superior performance (0.513 vs. 0.504).

SFM spatial prior ablation: The inverse focal map contributes the most (+0.238), with ground depth/gradient/Plücker adding +0.036 / +0.008 / +0.007 incrementally.

CDA augmentation ablation: Focal length augmentation contributes +0.060; novel-view synthesis augmentation adds a further +0.095, demonstrating that NVS substantially outperforms simple focal length scaling for diversifying camera configurations.

Highlights & Insights

  • Systematic decomposition of cross-configuration discrepancies: The generalization problem is systematically decomposed into three dimensions — intrinsics (focal length/FoV), extrinsics (mounting pose), and array layout — with four targeted spatial prior representations designed accordingly.
  • Inverse focal normalization is concise yet highly effective: A simple \(1/f^2\) multiplication raises NDS* from 0.224 to 0.462, representing the single largest contribution in the ablation study.
  • Training-free 3DGS data augmentation: The approach avoids the high training cost of conventional 3DGS by constructing Gaussian representations directly from predefined parameters via point-based rendering at ~450 fps, making online dynamic augmentation practical.
  • Paradigm-agnostic unified framework: The same SFM+CDA pipeline is plug-and-play across BEVDepth, BEVFormer, and PETR without relying on paradigm-specific depth prediction designs.
  • Substantially closes the gap to Oracle: NDS* on NuScenes→Waymo improves from 0.178 to 0.513 (Oracle: 0.649), bridging approximately 71% of the performance gap.

Limitations & Future Work

  • Semantic distribution shift remains unaddressed: The current approach handles only configuration discrepancies; category and scene distribution differences across datasets continue to affect cross-domain generalization, which the authors identify as future work.
  • Dependency on LiDAR for 3DGS construction: The CDA module requires LiDAR data for mesh reconstruction and depth, limiting applicability to purely vision-based datasets.
  • Flat ground plane assumption: The ground depth and gradient maps are derived under a flat ground assumption, which may fail in non-planar scenarios such as ramps or undulating surfaces.
  • Predominantly single-class evaluation: Main experiments are primarily validated on a unified "car" category; generalization across multiple categories warrants further investigation.
  • Storage overhead of CDA: Per-frame construction and storage of ego-centric Gaussian point clouds incurs non-trivial storage and preprocessing costs for large-scale datasets.

Comparison with Prior Methods

| Method | Focal Length | Extrinsics | Array Layout | Paradigm | NDS* (N→W) |
| --- | --- | --- | --- | --- | --- |
| DG-BEV | Virtual focal length | – | – | Bottom-up BEV | 0.415 |
| PD-BEV | Virtual focal length + depth rescaling | – | – | Bottom-up BEV | – |
| UDGA-BEV | Virtual focal length + depth/photometric consistency | – | – | Bottom-up BEV | 0.459 |
| UniPAD [47] | Image warping to sphere | Spherical alignment | – | Bottom-up BEV | – |
| CoIn3D (Ours) | Inverse focal map | Ground depth/gradient + Plücker | Plücker continuous encoding | All paradigms | 0.513 |

This work is the first to comprehensively and explicitly model all three categories of configuration priors and the only approach simultaneously applicable to all three major MC3D paradigms.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combined design of four spatial priors and the training-free 3DGS augmentation are novel; the inverse focal normalization is concise and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets × three paradigms × four settings, with detailed ablations and comprehensive comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Problem analysis is systematic and clear; figures are intuitive; mathematical derivations are complete.
  • Value: ⭐⭐⭐⭐ — Addresses a practical pain point in cross-platform MC3D deployment with strong industrial application potential.