
Adversarial Exploitation of Data Diversity Improves Visual Localization

Conference: ICCV 2025
arXiv: 2412.00138
Code: https://ai4ce.github.io/RAP
Area: Visual Localization
Keywords: Absolute Pose Regression, 3D Gaussian Splatting, Adversarial Training, Data Augmentation, Appearance Diversity

TL;DR

This paper proposes RAP, a framework that synthesizes diverse training data via appearance-controllable 3DGS and introduces an adversarial discriminator to bridge the synthetic-to-real domain gap, enabling absolute pose regression methods to substantially surpass the state of the art across multiple datasets — reducing indoor translation/rotation errors by 50%/41% and outdoor errors by 38%/44%.

Background & Motivation

Visual localization — estimating the 6-DoF camera pose from a query image — is a foundational capability for autonomous driving, robotics, and VR. Absolute Pose Regression (APR) methods directly regress poses from images, offering fast inference and advantages in sparse-view or highly variable illumination settings, yet lag behind geometry-based methods in accuracy.

A key insight from Sattler et al. is that APR essentially performs image-based memorization, i.e., retrieval of poses seen during training. To improve such memorization, methods such as DFNet, LENS, and PMNet augment training with novel views synthesized via NeRF. However, appearance diversity has been largely neglected; LENS did attempt appearance perturbation via NeRF-W but found it ineffective.

The authors hypothesize that appearance augmentation failed not because appearance diversity is inherently useless, but because the training pipeline could not effectively exploit diverse data: artifacts in synthetic images corrupt the feature space, so a dedicated mechanism is needed to bridge the synthetic-to-real domain gap.

Method

Overall Architecture

RAP consists of three components: (1) an appearance-controllable 3DGS data engine that efficiently renders synthetic images under diverse lighting and weather conditions; (2) a Transformer-based pose regressor (Pose Transformer); and (3) dual-branch joint training, where Branch-1 aligns synthetic and real features via an adversarial discriminator and Branch-2 provides additional supervision by synthesizing images online at novel poses with novel appearances.

Key Designs

  1. Appearance-Controllable 3DGS Data Engine:

    • The engine builds on GS-W: each Gaussian carries intrinsic attributes (position \(\bm{\mu}\), spherical harmonics \(\bm{\mathcal{Y}}\)) and a dynamic appearance feature \(\bm{\mathcal{E}}\).
    • Features extracted from input images are assigned to each Gaussian via a learnable sampler \(\bm{\mathcal{S}}\).
    • Final color is fused via an MLP: \(\bm{\mathcal{C}} = \text{MLP}(\bm{\mu}, \bm{\mathcal{Y}}, \omega \bm{\mathcal{E}}, \theta)\), where \(\omega\) is a blending weight controlling the strength of the dynamic appearance (a fusion sketch follows this list).
    • Deblurring Modeling: Inspired by Deblur-GS, camera motion blur is modeled as equivalent inverse scene motion (SE(3) transforms of the Gaussian positions), with time steps sampled along a linear trajectory and the resulting renders blended.
    • Design Motivation: Localization datasets frequently contain motion blur and appearance variation; addressing both simultaneously is critical for rendering quality and localization accuracy.
  2. Pose Transformer Regressor:

    • Multi-scale features are extracted with EfficientNet-B0; layers 3 and 4 are used for translation and rotation regression, respectively.
    • Learnable global tokens (Trans and Rot) are appended to the flattened feature sequence and fed into a Transformer.
    • After multi-head self-attention, only the processed global tokens are passed through MLP regression heads to output \(\hat{\bm{t}}\) and \(\hat{\bm{r}}\) (a regressor sketch follows this list).
    • Design Motivation: Compared to CNN-based regression heads, the Transformer better captures long-range dependencies and is less susceptible to noise from fine-grained local features.
  3. Dual-Branch Joint Training Paradigm:

    • Branch-1 (Feature Alignment): For each real image \(\bm{I}\), a synthetic image \(\bm{I}'\) is rendered at the same pose via 3DGS; both undergo pose regression while an adversarial discriminator is introduced.
      • Discriminator objective: distinguish real from synthetic features.
      • Generator (feature extractor) objective: fool the discriminator with synthetic features.
      • The LSGAN loss is adopted to avoid vanishing gradients: \(\mathcal{L}_{Dis} = \frac{1}{2}\mathbb{E}[(D(\text{Adj}(\mathcal{F}_t(\bm{I})))-1)^2] + \frac{1}{2}\mathbb{E}[D(\text{Adj}'(\mathcal{F}_t(\bm{I}')))^2]\); \(\mathcal{L}_{Gen}\) takes the standard LSGAN form, pushing the discriminator's output on synthetic features toward 1 (a loss sketch follows this list).
    • Branch-2 (Progressive Data Synthesis): Every 20 epochs, new images with pose perturbations and random appearance blending weights are synthesized online as additional training samples (a perturbation sketch follows this list).
      • Indoor: \(\delta t = 20\) cm, \(\delta r = 10°\); outdoor: \(\delta t = 150\) cm, \(\delta r = 4°\).
      • Synthesis stops when validation MSE and median error cease to decrease.
    • Total loss: \(\mathcal{L}_{total} = \beta_1 \mathcal{L}_{pose}^1 + \beta_2 \mathcal{L}_{pose}^2 + \beta_3 (\mathcal{L}_{Gen} + \mathcal{L}_{Dis})\)
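
To make the data engine concrete, below is a minimal PyTorch sketch of the color-fusion step in design (1). It is a sketch under assumptions: the learnable sampler \(\bm{\mathcal{S}}\), view dependence, and the rest of the GS-W pipeline are omitted, and all names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class AppearanceColorHead(nn.Module):
    """Sketch of C = MLP(mu, Y, omega * E, theta): fuse intrinsic Gaussian
    attributes with a dynamic appearance feature scaled by omega."""

    def __init__(self, sh_dim: int, app_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(            # theta = the MLP parameters
            nn.Linear(3 + sh_dim + app_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),
            nn.Sigmoid(),                    # RGB in [0, 1]
        )

    def forward(self, mu, sh_feat, app_feat, omega: float):
        # omega = 0 keeps only the intrinsic appearance; increasing omega
        # blends in more of the per-image dynamic appearance feature E.
        return self.mlp(torch.cat([mu, sh_feat, omega * app_feat], dim=-1))
```

Sweeping \(\omega\) (or swapping \(\bm{\mathcal{E}}\) between reference images) turns one reconstructed scene into a family of renders with identical geometry but diverse appearance.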
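
The Pose Transformer in design (2) can be sketched similarly; the backbone wiring, token dimensions, and rotation parameterization below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PoseTransformerHead(nn.Module):
    """Sketch: a learnable global token is prepended to the flattened
    feature sequence; after self-attention, only that token is regressed.
    One head each is used for translation and rotation."""

    def __init__(self, feat_dim: int, out_dim: int, depth: int = 4, heads: int = 4):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                  nn.Linear(feat_dim, out_dim))

    def forward(self, fmap):                       # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        seq = fmap.flatten(2).transpose(1, 2)      # (B, H*W, C)
        seq = torch.cat([self.token.expand(B, -1, -1), seq], dim=1)
        return self.head(self.encoder(seq)[:, 0])  # global token only

# t_hat = trans_head(layer3_feats)  # translation from EfficientNet-B0 layer 3
# r_hat = rot_head(layer4_feats)    # rotation (e.g., a quaternion) from layer 4
```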
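
The Branch-1 adversarial objective in design (3) maps directly to code. The sketch below matches the \(\mathcal{L}_{Dis}\) formula above and uses the standard LSGAN counterpart for \(\mathcal{L}_{Gen}\); `disc` and the feature tensors stand in for \(D\) and the adapted features \(\text{Adj}(\mathcal{F}_t(\cdot))\).

```python
import torch
import torch.nn as nn

def lsgan_losses(disc: nn.Module, feats_real: torch.Tensor,
                 feats_syn: torch.Tensor):
    """feats_real ~ Adj(F_t(I)), feats_syn ~ Adj'(F_t(I')) (sketch)."""
    # Discriminator step: push real features toward 1, synthetic toward 0.
    # Inputs are detached so this term updates only the discriminator.
    d_real = disc(feats_real.detach())
    d_syn = disc(feats_syn.detach())
    loss_dis = 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_syn ** 2).mean()
    # Generator step: the shared feature extractor tries to fool the
    # discriminator, pushing its output on synthetic features toward 1.
    loss_gen = 0.5 * ((disc(feats_syn) - 1.0) ** 2).mean()
    return loss_dis, loss_gen
```

Note that the "generator" here is the feature extractor itself, so alignment happens in feature space rather than image space.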
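
Branch-2's pose perturbation can be sketched as below (assuming NumPy/SciPy; the paper specifies the ranges \(\delta t, \delta r\), not this exact sampling scheme).

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def perturb_pose(t, rot, delta_t_m, delta_r_deg, rng=None):
    """Sample a novel pose near (t, rot) for online synthesis (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    t_new = t + rng.uniform(-delta_t_m, delta_t_m, size=3)  # translation jitter
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                            # random rotation axis
    angle = np.deg2rad(rng.uniform(-delta_r_deg, delta_r_deg))
    return t_new, R.from_rotvec(angle * axis) * rot

# Indoor:  perturb_pose(t, rot, 0.20, 10.0)  # delta_t = 20 cm, delta_r = 10 deg
# Outdoor: perturb_pose(t, rot, 1.50, 4.0)   # delta_t = 150 cm, delta_r = 4 deg
# Each synthesized view is also rendered with a random blending weight omega.
```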

Loss & Training

Pose regression employs adaptive weights to balance translation and rotation: \(\mathcal{L}_{pose} = \mathcal{L}_t \exp(-s_t) + s_t + \mathcal{L}_r \exp(-s_r) + s_r\), where \(s_t, s_r\) are learnable parameters. 3DGS training is performed without masking dynamic objects. At inference, only the pose regressor is retained; the discriminator and adaptation layers are discarded.
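
This is the familiar homoscedastic-style uncertainty weighting; a minimal PyTorch sketch follows (initial values of \(s_t, s_r\) are illustrative assumptions).

```python
import torch
import torch.nn as nn

class AdaptivePoseLoss(nn.Module):
    """L = L_t * exp(-s_t) + s_t + L_r * exp(-s_r) + s_r, with s_t and s_r
    learned jointly with the network (sketch)."""

    def __init__(self, s_t: float = 0.0, s_r: float = -3.0):
        super().__init__()
        self.s_t = nn.Parameter(torch.tensor(s_t))
        self.s_r = nn.Parameter(torch.tensor(s_r))

    def forward(self, loss_t, loss_r):
        return (loss_t * torch.exp(-self.s_t) + self.s_t
                + loss_r * torch.exp(-self.s_r) + self.s_r)
```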

Key Experimental Results

Main Results

Cambridge Landmarks (outdoor) median translation (cm) / rotation (°) error:

Method   College   Hospital   Shop      Church     Average
DFNet    73/2.37   200/2.98   67/2.21   137/4.03   119/2.90
PMNet    68/1.97   103/1.31   58/2.10   133/3.73    90/2.27
RAP      52/0.90    87/1.21   33/1.48    53/1.52    56/1.28

7-Scenes (indoor) average error:

Method            Avg. Translation (cm) / Rotation (°)
PMNet             10/3.24
CoordiNet+LENS    9/3.07
RAP (SfM GT)      5/1.90
RAPref (SfM GT)   0.60/0.20

Ablation Study

Configuration                   Translation (cm)↓   Rotation (°)↓   Note
I: VGG16 baseline               174                 5.45            Baseline
II: EfficientNet-B0             103                 4.64            Better features
III: + Pose augmentation         75                 3.52            Novel views effective
IV: + Appearance augmentation    60                 3.14            Appearance diversity effective
V: + Conv decoder                52                 2.51            More parameters
VI: + Transformer                40                 1.98            Long-range dependency
VII: + Discriminator             33                 1.48            Domain gap bridged

Key Findings

  • MARS autonomous driving scene: despite dynamic objects, illumination variation, and motion blur, RAP achieves an average error of 28 cm/0.60°, substantially outperforming PoseNet at 121 cm/1.67°.
  • Aachen Day-Night: RAP reduces rotation error from 75.99° to 13.70°, demonstrating the critical role of appearance diversity under extreme illumination changes — while SCR methods such as ACE (104.50°) and GLACE (36.4°) fail in this scenario.
  • Generalization evaluation: On St. George's Basilica, the model still produces reasonable pose predictions even for test regions completely unseen during training, suggesting that APR begins to exhibit generalization beyond simple memorization.
  • RAPref, combined with a single render-and-match refinement step, reduces indoor error to sub-centimeter level (0.60 cm/0.20°).

Highlights & Insights

  • Core Finding: The prior failure of appearance augmentation in APR was not due to the futility of appearance diversity, but rather because the training pipeline — lacking domain alignment — could not exploit synthetic data containing artifacts.
  • Elegant Use of Adversarial Training: Instead of using a GAN to generate images, a discriminator is employed to align feature spaces, encouraging the regressor to learn domain-invariant pose features.
  • This work challenges the prevailing view that "APR is merely doing image retrieval" — with sufficiently diverse synthetic data, APR can indeed perform interpolation and a degree of extrapolation on the SE(3) manifold.

Limitations & Future Work

  • Generalization on the SE(3) manifold has boundaries: when rotational perturbations are large (i.e., visual content changes entirely), the model still fails to generalize.
  • 3DGS training without masking dynamic objects may introduce noise in highly dynamic scenes.
  • Hyperparameters of the progressive data synthesis strategy (perturbation range, synthesis frequency) require manual tuning for different scenes.
  • Extension of appearance augmentation to a broader range of environmental variations (e.g., seasonal changes, haze) remains unexplored.

Relation to Prior Work

  • DFNet and PMNet perform only pose augmentation; RAP additionally incorporates appearance augmentation and adversarial training, forming a complete "diverse data utilization" paradigm.
  • The appearance modeling concept from GS-W is repurposed as a data augmentation engine rather than for scene reconstruction per se.
  • The strategy of using an adversarial discriminator to bridge domain gaps is generalizable to other vision tasks trained on synthetic data.

Rating

  • Novelty: ⭐⭐⭐⭐ First effective application of adversarial training to bridge the synthetic-to-real domain gap in visual localization.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four datasets (indoor/outdoor/driving/day-night) with comprehensive ablations and exploration of generalization boundaries.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; experimental design is convincing.
  • Value: ⭐⭐⭐⭐ Achieves substantial improvements in APR and redefines the role of data augmentation for this task.