# VidTAG: Temporally Aligned Video to GPS Geolocalization
- **Conference:** CVPR 2026
- **arXiv:** 2604.12159
- **Code:** https://parthpk.github.io/vidtag_webpage
- **Area:** Video Understanding / Geolocalization
- **Keywords:** Video geolocalization, frame-to-GPS retrieval, temporal consistency, trajectory prediction, denoising
## TL;DR

VidTAG is a frame-to-GPS retrieval framework for temporally consistent per-frame video geolocalization at global scale. A dual encoder (CLIP + DINOv2) represents each frame, a TempGeo module performs inter-frame temporal alignment, and a GeoRefiner encoder-decoder refines the retrieved GPS predictions.
## Background & Motivation
**Background:** Image geolocalization is dominated by two paradigms: classification (partitioning the Earth into regions and predicting region labels) and retrieval (matching the query against a geo-referenced image database). GeoCLIP embeds images and GPS coordinates into a shared space to enable direct GPS retrieval.

**Limitations of Prior Work:** Classification methods offer only coarse, city-level localization, while image-retrieval methods require enormous image databases, making them infeasible at global scale. For video, applying image-based methods frame by frame produces "jittery" trajectories whose worst-case predictions span continents. The only global-scale video method, CityGuessr, reasons at the full-video level and does not support per-frame localization.

**Key Challenge:** Achieving accurate and temporally consistent per-frame trajectories at global scale remains an open challenge.

**Goal:** (1) Introduce a new frame-to-GPS retrieval paradigm; (2) address the temporal inconsistency of per-frame video predictions.

**Key Insight:** Constructing a GPS coordinate gallery (rather than an image gallery) is simple and inexpensive, making frame-to-GPS retrieval tractable at global scale.

**Core Idea:** TempGeo performs inter-frame temporal alignment and GeoRefiner applies denoising-based refinement; together they enable temporally consistent per-frame GPS prediction.
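To make the key insight concrete, here is a minimal sketch of how a uniform GPS coordinate gallery could be built. The grid resolution (and the absence of land masking) are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def build_gps_gallery(step_deg: float = 0.1) -> np.ndarray:
    """Build a uniform lat/lon grid as the retrieval gallery.

    step_deg is a hypothetical resolution, not a value from the paper.
    Returns an (N, 2) array of (latitude, longitude) pairs.
    """
    lats = np.arange(-90.0, 90.0, step_deg)
    lons = np.arange(-180.0, 180.0, step_deg)
    grid = np.stack(np.meshgrid(lats, lons, indexing="ij"), axis=-1)
    return grid.reshape(-1, 2)

gallery = build_gps_gallery(step_deg=1.0)  # coarse demo grid
print(gallery.shape)  # (64800, 2): 180 latitudes x 360 longitudes
```

Even at 0.1° spacing the gallery is only about 6.5 million coordinate pairs, far cheaper to build and store than a global-scale image database.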
## Method

### Overall Architecture

Training proceeds in two phases. Phase I trains the dual frame encoder (CLIP + DINOv2), TempGeo, and a location encoder via contrastive learning. Phase II freezes the Phase I components and trains the GeoRefiner encoder-decoder for denoising-based GPS refinement. At inference, frames pass through the dual encoder and TempGeo to produce embeddings, initial GPS predictions are retrieved from the coordinate gallery, and GeoRefiner then refines the predicted sequence.
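A minimal sketch of this inference flow. The module names and interfaces are hypothetical stand-ins for the paper's components, assuming PyTorch modules with the shapes noted in the comments:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def localize_video(frames, clip_enc, dino_enc, tempgeo, loc_enc, georefiner,
                   gallery):
    """Illustrative VidTAG-style inference.

    frames:  (T, 3, H, W) video frames
    gallery: (N, 2) tensor of candidate GPS coordinates
    Returns: (T, 2) refined per-frame GPS predictions.
    """
    # 1) Dual frame encoding: concatenate CLIP and DINOv2 CLS tokens.
    z = torch.cat([clip_enc(frames), dino_enc(frames)], dim=-1)   # (T, D)
    # 2) TempGeo: self-attention across frames for temporal alignment.
    z = tempgeo(z.unsqueeze(0)).squeeze(0)                        # (T, D)
    # 3) Frame-to-GPS retrieval against the encoded coordinate gallery.
    g = loc_enc(gallery)                                          # (N, D)
    sim = F.normalize(z, dim=-1) @ F.normalize(g, dim=-1).T       # (T, N)
    init_gps = gallery[sim.argmax(dim=-1)]                        # (T, 2)
    # 4) GeoRefiner: denoise the retrieved GPS sequence with visual context.
    return georefiner(z.unsqueeze(0), init_gps.unsqueeze(0)).squeeze(0)
```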
### Key Designs

- **Dual Frame Encoder (CLIP + DINOv2):**
  - **Function:** Generates semantically and visually complementary representations for each frame.
  - **Mechanism:** CLIP provides language-aligned semantics (disambiguating landmarks, signs, and scenes); DINOv2 provides robust self-supervised features (global appearance, insensitive to domain shift). The CLS tokens from both are concatenated as the frame representation \(\mathbf{z}_t = [\mathbf{f}_{\text{clip}} \,\|\, \mathbf{f}_{\text{dino}}]\).
  - **Design Motivation:** CLIP excels at semantic understanding while DINOv2 excels at visual description; their complementary strengths benefit frame-to-GPS retrieval.
- **TempGeo Temporal Alignment Module:**
  - **Function:** Produces temporally consistent frame embeddings via inter-frame attention.
  - **Mechanism:** A lightweight Transformer encoder applies full self-attention across all frames, augmented with temporal positional encodings. Uncertain or ambiguous frames can borrow contextual information from neighboring and distant frames, pulling isolated outlier predictions toward the consensus.
  - **Design Motivation:** Unlike post-hoc smoothing, TempGeo performs temporal alignment prior to retrieval, allowing cross-frame context to directly shape the learning signal.
- **GeoRefiner Denoising Refinement Module:**
  - **Function:** Refines the predicted GPS sequence through an encoder-decoder architecture.
  - **Mechanism:** The encoder processes frame embeddings from TempGeo; the decoder receives GPS embeddings as queries and aligns the GPS sequence with the visual tokens via cross-attention. During training, synthetic noise is injected into ground-truth GPS coordinates to simulate typical Phase I failure modes (sequential drift, collapse, and random jitter), and the decoder learns to denoise them using visual context (minimal sketches of TempGeo and this noise-injection scheme follow this list).
  - **Design Motivation:** Per-frame retrieval predictions from Phase I remain noisy; GeoRefiner refines them directly in GPS space.
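The TempGeo description maps naturally onto a small Transformer encoder. A minimal sketch, assuming learned temporal positional embeddings and illustrative layer sizes (not the paper's configuration):

```python
import torch
import torch.nn as nn

class TempGeo(nn.Module):
    """Minimal TempGeo-style temporal aligner. Depth, width, and the
    learned positional embedding are illustrative assumptions."""

    def __init__(self, dim: int = 1536, depth: int = 2, heads: int = 8,
                 max_frames: int = 256):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))  # temporal PE
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, dim) per-frame dual-encoder embeddings. Full
        # self-attention lets ambiguous frames borrow context from
        # neighboring and distant frames before retrieval.
        return self.encoder(z + self.pos[:, : z.shape[1]])
```

GeoRefiner's training depends on corrupting ground-truth tracks. Below is a sketch of the three failure modes the paper names; the noise magnitudes and the stand-in regression loss are assumptions (the paper's Phase II uses a weighted hinge loss):

```python
import torch

def corrupt_gps(gt: torch.Tensor, mode: str, scale: float = 0.05) -> torch.Tensor:
    """Inject synthetic noise into a (T, 2) ground-truth GPS track (degrees)."""
    if mode == "jitter":    # i.i.d. per-frame noise
        return gt + scale * torch.randn_like(gt)
    if mode == "drift":     # slowly accumulating sequential offset
        return gt + torch.cumsum(0.1 * scale * torch.randn_like(gt), dim=0)
    if mode == "collapse":  # a chunk of frames snaps to one wrong location
        out = gt.clone()
        bad = gt[torch.randint(gt.shape[0], (1,))] + scale * torch.randn(2)
        mask = torch.rand(gt.shape[0]) < 0.5
        out[mask] = bad
        return out
    raise ValueError(f"unknown mode: {mode}")

# Phase II step (schematic): the decoder denoises the corrupted track,
# conditioned on visual tokens from the frozen Phase I encoder.
# noisy = corrupt_gps(gt_track, mode="drift")
# pred  = georefiner(frame_tokens, noisy)    # hypothetical module
# loss  = (pred - gt_track).abs().mean()     # stand-in for the hinge loss
```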
### Loss & Training

Phase I uses a contrastive loss: cross-entropy between the frame-GPS embedding similarity matrix and the identity target (matched pairs on the diagonal). Phase II uses a weighted hinge loss that jointly optimizes frame-level and video-level alignment.
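A minimal sketch of the Phase I objective, written CLIP-style; the symmetric two-direction form and the temperature value are common-practice assumptions rather than details confirmed by the paper:

```python
import torch
import torch.nn.functional as F

def phase1_contrastive_loss(frame_emb: torch.Tensor,
                            gps_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss between B frame embeddings and their B paired
    GPS embeddings; the identity matrix is the target assignment.
    temperature=0.07 is a common default, not a value from the paper."""
    f = F.normalize(frame_emb, dim=-1)
    g = F.normalize(gps_emb, dim=-1)
    logits = f @ g.T / temperature                          # (B, B)
    target = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.T, target)) / 2
```

Each batch pairs frames with their ground-truth GPS coordinates, so matched pairs sit on the diagonal of the similarity matrix.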
## Key Experimental Results

### Main Results
| Model | Frame@1km↑ | Frame@5km↑ | Frame Median Error↓ | Video@1km↑ | DFD↓ | MRD↓ |
|---|---|---|---|---|---|---|
| GeoCLIP-ZS (zero-shot) | 2.7% | 22.9% | 11.54km | 3.8% | 24.94 | 2.83 |
| GeoCLIP-FT (fine-tuned) | 22.5% | 63.0% | 2.97km | 18.6% | 22.52 | 2.82 |
| DINOv2-Cls (classification) | 18.1% | 58.2% | 3.86km | 18.4% | 4.28 | 1.60 |
| VidTAG | 41.0% | 76.7% | 1.35km | 39.8% | 3.87 | 1.07 |
### Ablation Study

| Configuration | Frame@1km↑ | Median Error↓ | DFD↓ |
|---|---|---|---|
| CLIP only | 32.5% | 1.85km | 8.42 |
| DINOv2 only | 28.3% | 2.15km | 5.12 |
| Dual encoder | 35.2% | 1.62km | 6.78 |
| + TempGeo | 38.1% | 1.48km | 4.25 |
| + GeoRefiner (full) | 41.0% | 1.35km | 3.87 |
### Key Findings

- VidTAG surpasses GeoCLIP by 20 percentage points at the 1 km threshold on MSLS and outperforms the prior state of the art by 25% on CityGuessr68k.
- TempGeo and GeoRefiner yield the most substantial improvements in trajectory quality (DFD, MRD).
- The complementarity of the dual encoder is confirmed through ablation.
## Highlights & Insights
- Frame-to-GPS retrieval is an elegant problem reformulation: GPS gallery construction is simple and inexpensive, making global-scale per-frame localization feasible.
- The denoising training strategy of GeoRefiner is noteworthy: injecting synthetic noise rather than directly using Phase I predictions avoids train-inference distribution mismatch.
## Limitations & Future Work
- The method relies on a uniform-grid GPS gallery; gallery resolution directly caps localization accuracy.
- Performance may degrade in regions with sparse geographic coverage.
- Additional cues such as OCR (road signs, text) are not exploited.
- Integration with multimodal large language models for further geographic reasoning is a promising direction.
## Related Work & Insights

- **vs. GeoCLIP:** GeoCLIP operates at the image level; VidTAG extends retrieval to video frames and addresses temporal consistency.
- **vs. CityGuessr:** CityGuessr performs only video-level city prediction; VidTAG enables per-frame localization and trajectory mapping.
## Rating

- **Novelty:** ⭐⭐⭐⭐ First global-scale per-frame video geolocalization method
- **Experimental Thoroughness:** ⭐⭐⭐⭐⭐ Multi-dataset, multi-metric, multi-baseline evaluation
- **Writing Quality:** ⭐⭐⭐⭐ Problem formulation and method description are clear
- **Value:** ⭐⭐⭐⭐ Practical applications in forensics, social media analysis, and related domains