# VidTAG: Temporally Aligned Video to GPS Geolocalization
- **Conference:** CVPR 2026
- **arXiv:** 2604.12159
- **Code:** https://parthpk.github.io/vidtag_webpage
- **Area:** Video Understanding / Geolocalization
- **Keywords:** Video geolocalization, frame-to-GPS retrieval, temporal consistency, trajectory prediction, denoising
## TL;DR

VidTAG is a frame-to-GPS retrieval framework for temporally consistent per-frame video geolocalization at global scale. A dual encoder (CLIP + DINOv2) represents each frame, a TempGeo module performs inter-frame temporal alignment, and a GeoRefiner encoder-decoder refines the retrieved GPS predictions.
## Background & Motivation
**Background:** Image geolocalization is dominated by two paradigms: classification (partitioning the Earth into regions and predicting region labels) and retrieval (matching the query against a geo-referenced image database). GeoCLIP embeds images and GPS coordinates into a shared space to enable direct GPS retrieval.

**Limitations of Prior Work:** Classification methods offer only coarse, city-level localization, while image-retrieval methods require enormous image databases, making them infeasible at global scale. For video, applying image-based methods frame by frame produces "jittery" trajectories whose worst-case predictions span continents. The only global-scale video method, CityGuessr, reasons at the full-video level and does not support per-frame localization.

**Key Challenge:** Achieving accurate and temporally consistent per-frame trajectories at global scale remains an open challenge.

**Goal:** (1) Introduce a new frame-to-GPS retrieval paradigm; (2) address the temporal inconsistency of per-frame video predictions.

**Key Insight:** Constructing a GPS coordinate gallery (rather than an image gallery) is simple and inexpensive, making frame-to-GPS retrieval tractable at global scale.

**Core Idea:** TempGeo performs inter-frame temporal alignment and GeoRefiner applies denoising-based refinement; together they enable temporally consistent per-frame GPS prediction.
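To make the key insight concrete, here is a minimal sketch of how a uniform GPS coordinate gallery could be built. The grid resolution (and the absence of land masking) are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def build_gps_gallery(step_deg: float = 0.1) -> np.ndarray:
    """Build a uniform lat/lon grid as the retrieval gallery.

    step_deg is a hypothetical resolution, not a value from the paper.
    Returns an (N, 2) array of (latitude, longitude) pairs.
    """
    lats = np.arange(-90.0, 90.0, step_deg)
    lons = np.arange(-180.0, 180.0, step_deg)
    grid = np.stack(np.meshgrid(lats, lons, indexing="ij"), axis=-1)
    return grid.reshape(-1, 2)

gallery = build_gps_gallery(step_deg=1.0)  # coarse demo grid
print(gallery.shape)  # (64800, 2): 180 latitudes x 360 longitudes
```

Even at 0.1° spacing the gallery is only about 6.5 million coordinate pairs, far cheaper to build and store than a global-scale image database.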
## Method

### Overall Architecture

Training proceeds in two phases. Phase I trains the dual frame encoder (CLIP + DINOv2), TempGeo, and a location encoder via contrastive learning. Phase II freezes the Phase I components and trains the GeoRefiner encoder-decoder for denoising-based GPS refinement. At inference, frames pass through the dual encoder and TempGeo to produce embeddings, initial GPS predictions are retrieved from the coordinate gallery, and GeoRefiner then refines the predicted sequence.
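A minimal sketch of this inference flow. The module names and interfaces are hypothetical stand-ins for the paper's components, assuming PyTorch modules with the shapes noted in the comments:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def localize_video(frames, clip_enc, dino_enc, tempgeo, loc_enc, georefiner,
                   gallery):
    """Illustrative VidTAG-style inference.

    frames:  (T, 3, H, W) video frames
    gallery: (N, 2) tensor of candidate GPS coordinates
    Returns: (T, 2) refined per-frame GPS predictions.
    """
    # 1) Dual frame encoding: concatenate CLIP and DINOv2 CLS tokens.
    z = torch.cat([clip_enc(frames), dino_enc(frames)], dim=-1)   # (T, D)
    # 2) TempGeo: self-attention across frames for temporal alignment.
    z = tempgeo(z.unsqueeze(0)).squeeze(0)                        # (T, D)
    # 3) Frame-to-GPS retrieval against the encoded coordinate gallery.
    g = loc_enc(gallery)                                          # (N, D)
    sim = F.normalize(z, dim=-1) @ F.normalize(g, dim=-1).T       # (T, N)
    init_gps = gallery[sim.argmax(dim=-1)]                        # (T, 2)
    # 4) GeoRefiner: denoise the retrieved GPS sequence with visual context.
    return georefiner(z.unsqueeze(0), init_gps.unsqueeze(0)).squeeze(0)
```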
### Key Designs

- **Dual Frame Encoder (CLIP + DINOv2):**
  - **Function:** Generates semantically and visually complementary representations for each frame.
  - **Mechanism:** CLIP provides language-aligned semantics (disambiguating landmarks, signs, and scenes); DINOv2 provides robust self-supervised features (global appearance, insensitive to domain shift). The CLS tokens from both are concatenated as the frame representation \(\mathbf{z}_t = [\mathbf{f}_{\text{clip}} \,\|\, \mathbf{f}_{\text{dino}}]\).
  - **Design Motivation:** CLIP excels at semantic understanding while DINOv2 excels at visual description; their complementary strengths benefit frame-to-GPS retrieval.
- **TempGeo Temporal Alignment Module:**
  - **Function:** Produces temporally consistent frame embeddings via inter-frame attention.
  - **Mechanism:** A lightweight Transformer encoder applies full self-attention across all frames, augmented with temporal positional encodings. Uncertain or ambiguous frames can borrow contextual information from neighboring and distant frames, pulling isolated outlier predictions toward the consensus.
  - **Design Motivation:** Unlike post-hoc smoothing, TempGeo performs temporal alignment prior to retrieval, allowing cross-frame context to directly shape the learning signal.
- **GeoRefiner Denoising Refinement Module:**
  - **Function:** Refines the predicted GPS sequence through an encoder-decoder architecture.
  - **Mechanism:** The encoder processes frame embeddings from TempGeo; the decoder receives GPS embeddings as queries and aligns the GPS sequence with the visual tokens via cross-attention. During training, synthetic noise is injected into ground-truth GPS coordinates to simulate typical Phase I failure modes (sequential drift, collapse, and random jitter), and the decoder learns to denoise them using visual context (minimal sketches of TempGeo and this noise-injection scheme follow this list).
  - **Design Motivation:** Per-frame retrieval predictions from Phase I remain noisy; GeoRefiner refines them directly in GPS space.
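The TempGeo description maps naturally onto a small Transformer encoder. A minimal sketch, assuming learned temporal positional embeddings and illustrative layer sizes (not the paper's configuration):

```python
import torch
import torch.nn as nn

class TempGeo(nn.Module):
    """Minimal TempGeo-style temporal aligner. Depth, width, and the
    learned positional embedding are illustrative assumptions."""

    def __init__(self, dim: int = 1536, depth: int = 2, heads: int = 8,
                 max_frames: int = 256):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))  # temporal PE
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, dim) per-frame dual-encoder embeddings. Full
        # self-attention lets ambiguous frames borrow context from
        # neighboring and distant frames before retrieval.
        return self.encoder(z + self.pos[:, : z.shape[1]])
```

GeoRefiner's training depends on corrupting ground-truth tracks. Below is a sketch of the three failure modes the paper names; the noise magnitudes and the stand-in regression loss are assumptions (the paper's Phase II uses a weighted hinge loss):

```python
import torch

def corrupt_gps(gt: torch.Tensor, mode: str, scale: float = 0.05) -> torch.Tensor:
    """Inject synthetic noise into a (T, 2) ground-truth GPS track (degrees)."""
    if mode == "jitter":    # i.i.d. per-frame noise
        return gt + scale * torch.randn_like(gt)
    if mode == "drift":     # slowly accumulating sequential offset
        return gt + torch.cumsum(0.1 * scale * torch.randn_like(gt), dim=0)
    if mode == "collapse":  # a chunk of frames snaps to one wrong location
        out = gt.clone()
        bad = gt[torch.randint(gt.shape[0], (1,))] + scale * torch.randn(2)
        mask = torch.rand(gt.shape[0]) < 0.5
        out[mask] = bad
        return out
    raise ValueError(f"unknown mode: {mode}")

# Phase II step (schematic): the decoder denoises the corrupted track,
# conditioned on visual tokens from the frozen Phase I encoder.
# noisy = corrupt_gps(gt_track, mode="drift")
# pred  = georefiner(frame_tokens, noisy)    # hypothetical module
# loss  = (pred - gt_track).abs().mean()     # stand-in for the hinge loss
```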
### Loss & Training

Phase I uses a contrastive loss: cross-entropy between the frame-GPS embedding similarity matrix and the identity target (matched pairs on the diagonal). Phase II uses a weighted hinge loss that jointly optimizes frame-level and video-level alignment.
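A minimal sketch of the Phase I objective, written CLIP-style; the symmetric two-direction form and the temperature value are common-practice assumptions rather than details confirmed by the paper:

```python
import torch
import torch.nn.functional as F

def phase1_contrastive_loss(frame_emb: torch.Tensor,
                            gps_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss between B frame embeddings and their B paired
    GPS embeddings; the identity matrix is the target assignment.
    temperature=0.07 is a common default, not a value from the paper."""
    f = F.normalize(frame_emb, dim=-1)
    g = F.normalize(gps_emb, dim=-1)
    logits = f @ g.T / temperature                          # (B, B)
    target = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.T, target)) / 2
```

Each batch pairs frames with their ground-truth GPS coordinates, so matched pairs sit on the diagonal of the similarity matrix.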
## Key Experimental Results

### Main Results
| Model | Frame@1km↑ | Frame@5km↑ | Frame Median Error↓ | Video@1km↑ | DFD↓ | MRD↓ |
|---|---|---|---|---|---|---|
| GeoCLIP-ZS (zero-shot) | 2.7% | 22.9% | 11.54km | 3.8% | 24.94 | 2.83 |
| GeoCLIP-FT (fine-tuned) | 22.5% | 63.0% | 2.97km | 18.6% | 22.52 | 2.82 |
| DINOv2-Cls (classification) | 18.1% | 58.2% | 3.86km | 18.4% | 4.28 | 1.60 |
| VidTAG | 41.0% | 76.7% | 1.35km | 39.8% | 3.87 | 1.07 |
### Ablation Study

| Configuration | Frame@1km↑ | Median Error↓ | DFD↓ |
|---|---|---|---|
| CLIP only | 32.5% | 1.85km | 8.42 |
| DINOv2 only | 28.3% | 2.15km | 5.12 |
| Dual encoder | 35.2% | 1.62km | 6.78 |
| + TempGeo | 38.1% | 1.48km | 4.25 |
| + GeoRefiner (full) | 41.0% | 1.35km | 3.87 |
### Key Findings

- VidTAG surpasses GeoCLIP by 20 percentage points at the 1 km threshold on MSLS and outperforms the prior state of the art by 25% on CityGuessr68k.
- TempGeo and GeoRefiner yield the most substantial improvements in trajectory quality (DFD, MRD).
- The complementarity of the dual encoder is confirmed through ablation.
## Highlights & Insights
- Frame-to-GPS retrieval is an elegant problem reformulation: GPS gallery construction is simple and inexpensive, making global-scale per-frame localization feasible.
- The denoising training strategy of GeoRefiner is noteworthy: injecting synthetic noise rather than directly using Phase I predictions avoids train-inference distribution mismatch.
## Limitations & Future Work
- The method relies on a uniform-grid GPS gallery; gallery resolution directly caps localization accuracy.
- Performance may degrade in regions with sparse geographic coverage.
- Additional cues such as OCR (road signs, text) are not exploited.
- Integration with multimodal large language models for further geographic reasoning is a promising direction.
## Related Work & Insights

- **vs. GeoCLIP:** GeoCLIP operates at the image level; VidTAG extends retrieval to video frames and addresses temporal consistency.
- **vs. CityGuessr:** CityGuessr performs only video-level city prediction; VidTAG enables per-frame localization and trajectory mapping.
## Rating

- **Novelty:** ⭐⭐⭐⭐ First global-scale per-frame video geolocalization method
- **Experimental Thoroughness:** ⭐⭐⭐⭐⭐ Multi-dataset, multi-metric, multi-baseline evaluation
- **Writing Quality:** ⭐⭐⭐⭐ Problem formulation and method description are clear
- **Value:** ⭐⭐⭐⭐ Practical applications in forensics, social media analysis, and related domains