VidTAG: Temporally Aligned Video to GPS Geolocalization

Conference: CVPR 2026
arXiv: 2604.12159
Code: https://parthpk.github.io/vidtag_webpage
Area: Video Understanding / Geolocalization
Keywords: Video geolocalization, frame-to-GPS retrieval, temporal consistency, trajectory prediction, denoising

TL;DR

This paper proposes VidTAG, a dual-encoder (CLIP + DINOv2) frame-to-GPS retrieval framework for temporally consistent per-frame video geolocalization at global scale: a TempGeo module performs inter-frame temporal alignment, and a GeoRefiner encoder-decoder module refines the retrieved GPS predictions.

Background & Motivation

Background: Image geolocalization is dominated by two paradigms — classification (partitioning the Earth into regions and predicting labels) and retrieval (matching against a geo-referenced image database). GeoCLIP embeds images and GPS coordinates into a shared space to enable direct GPS retrieval.

Limitations of Prior Work: Classification methods offer only coarse city-level localization; image retrieval methods require enormous image databases, making them infeasible at global scale. For video, applying image-based methods frame-by-frame produces "jittery" trajectories, with worst-case predictions spanning continents. The only global-scale video method, CityGuessr, reasons at the full video level and does not support per-frame localization.

Key Challenge: Achieving accurate and temporally consistent per-frame trajectories at global scale remains an open challenge.

Goal: (1) Introduce a new frame-to-GPS retrieval paradigm; (2) Address the temporal inconsistency of per-frame predictions across a video.

Key Insight: Constructing a GPS coordinate gallery (rather than an image gallery) is simple and inexpensive, making frame-to-GPS retrieval tractable at global scale.
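
To make the cost argument concrete, here is a minimal sketch of building such a gallery as a uniform latitude-longitude grid, consistent with the uniform-grid gallery mentioned under Limitations below (the function name and step size are illustrative assumptions):

```python
# Minimal GPS gallery sketch: a uniform lat/lon grid. The paper's exact
# sampling scheme may differ (e.g. land-masked or density-aware grids).
import numpy as np

def build_gps_gallery(step_deg: float = 0.1) -> np.ndarray:
    """Return an (N, 2) array of (lat, lon) gallery coordinates."""
    lats = np.arange(-90.0, 90.0, step_deg)
    lons = np.arange(-180.0, 180.0, step_deg)
    grid = np.stack(np.meshgrid(lats, lons, indexing="ij"), axis=-1)
    return grid.reshape(-1, 2)

gallery = build_gps_gallery(step_deg=1.0)  # coarse 1-degree grid
print(gallery.shape)                       # (64800, 2): 180 x 360 points
```

Even at 0.1° resolution the gallery is only about 6.5M coordinate pairs, versus the massive geo-tagged image databases an image-retrieval gallery would require.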

Core Idea: TempGeo performs inter-frame temporal alignment, and GeoRefiner applies denoising-based refinement, together enabling temporally consistent per-frame GPS prediction.

Method

Overall Architecture

Training proceeds in two phases. Phase I trains the dual frame encoder (CLIP + DINOv2), TempGeo, and a location encoder via contrastive learning. Phase II freezes the Phase I components and trains the GeoRefiner encoder-decoder for denoising-based GPS refinement. At inference, frames pass through the dual encoder and TempGeo to produce embeddings; initial GPS predictions are retrieved from the coordinate gallery and then refined by GeoRefiner.
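
A minimal PyTorch sketch of this inference path, assuming the components described above (module interfaces, shapes, and names are illustrative, not the authors' code):

```python
# Inference sketch: encode frames, retrieve the nearest GPS gallery point per
# frame, then refine the trajectory. All modules are passed in as callables.
import torch

@torch.no_grad()
def localize_video(frames, frame_encoder, tempgeo, location_encoder,
                   geo_refiner, gps_gallery):
    """frames: (T, 3, H, W) tensor; gps_gallery: (N, 2) tensor of (lat, lon)."""
    z = frame_encoder(frames)                # (T, D) concatenated CLIP+DINOv2 features
    z = tempgeo(z.unsqueeze(0)).squeeze(0)   # (T, D) temporally aligned embeddings
    g = location_encoder(gps_gallery)        # (N, D) GPS embeddings (precomputable)
    sims = torch.nn.functional.normalize(z, dim=-1) @ \
           torch.nn.functional.normalize(g, dim=-1).T    # (T, N) cosine similarity
    coarse = gps_gallery[sims.argmax(dim=-1)]            # (T, 2) retrieved per-frame GPS
    return geo_refiner(z, coarse)                        # (T, 2) refined trajectory
```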

Key Designs

  1. Dual Frame Encoder (CLIP + DINOv2):

    • Function: Generates semantically and visually complementary representations for each frame.
    • Mechanism: CLIP provides language-aligned semantics (disambiguating landmarks, signs, and scenes); DINOv2 provides robust self-supervised features (global appearance, insensitive to domain shift). The CLS tokens from both are concatenated as the frame representation \(\mathbf{z}_t = [\mathbf{f}_{clip} \| \mathbf{f}_{dino}]\).
    • Design Motivation: CLIP excels at semantic understanding while DINOv2 excels at visual description; their complementary strengths benefit frame-to-GPS retrieval.
  2. TempGeo Temporal Alignment Module:

    • Function: Produces temporally consistent frame embeddings via inter-frame attention.
    • Mechanism: A lightweight Transformer encoder applies full self-attention across all frames, augmented with temporal positional encodings. Uncertain or ambiguous frames can borrow contextual information from neighboring and distant frames, pulling isolated outlier predictions toward the consensus.
    • Design Motivation: Unlike post-hoc smoothing, TempGeo performs temporal alignment prior to retrieval, allowing cross-frame context to directly shape the learning signal (a minimal sketch of the dual encoder and TempGeo follows this list).
  3. GeoRefiner Denoising Refinement Module:

    • Function: Refines GPS sequence predictions through an encoder-decoder architecture.
    • Mechanism: The encoder processes frame embeddings from TempGeo; the decoder receives GPS embeddings as queries and aligns the GPS sequence with visual tokens via cross-attention. During training, synthetic noise is injected into ground-truth GPS coordinates to simulate typical Phase I failure modes (sequential drift, collapse, and random jitter), and the decoder learns to denoise using visual context.
    • Design Motivation: Per-frame predictions from Phase I remain noisy; GeoRefiner performs in-domain refinement directly in GPS space, using visual context to denoise the retrieved trajectory (see the denoising sketch after this list).
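
To ground the dual encoder and TempGeo, a minimal PyTorch sketch (backbone interfaces, embedding dimensions, and depth are assumptions, not the paper's exact configuration):

```python
# Dual-encoder concatenation and a lightweight Transformer for inter-frame
# attention; all sizes are illustrative.
import torch
import torch.nn as nn

class TempGeo(nn.Module):
    """Full self-attention across all frames with temporal positional encodings."""
    def __init__(self, dim: int = 1536, depth: int = 2, heads: int = 8, max_len: int = 512):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))  # learned temporal encodings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, dim) per-frame embeddings; every frame attends to all others,
        # so ambiguous frames borrow context from the rest of the video.
        return self.encoder(z + self.pos[:, : z.size(1)])

def encode_frames(frames, clip_visual, dino):
    # CLS token from each backbone, concatenated: z_t = [f_clip || f_dino]
    f_clip = clip_visual(frames)   # (T, 768) language-aligned semantics
    f_dino = dino(frames)          # (T, 768) self-supervised visual features
    return torch.cat([f_clip, f_dino], dim=-1)   # (T, 1536)
```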
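
And a sketch of GeoRefiner's denoising training setup, simulating the three Phase I failure modes named above (noise magnitudes, layer sizes, and the corruption sampler are illustrative assumptions):

```python
# GeoRefiner sketch: corrupt ground-truth GPS to mimic Phase I errors, then
# learn to denoise with cross-attention to visual tokens.
import torch
import torch.nn as nn

def corrupt_gps(gt: torch.Tensor) -> torch.Tensor:
    """gt: (T, 2) ground-truth GPS sequence -> synthetically corrupted copy."""
    T = gt.size(0)
    mode = torch.randint(0, 3, (1,)).item()
    if mode == 0:                                        # sequential drift
        return gt + torch.cumsum(0.01 * torch.randn(T, 2), dim=0)
    if mode == 1:                                        # collapse to one location
        return gt[torch.randint(0, T, (1,))].expand(T, 2).clone()
    return gt + 0.05 * torch.randn(T, 2)                 # random jitter

class GeoRefiner(nn.Module):
    """Encoder over visual tokens; decoder cross-attends GPS queries to them."""
    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 8):
        super().__init__()
        self.gps_embed = nn.Linear(2, dim)    # GPS coordinates -> decoder queries
        self.head = nn.Linear(dim, 2)         # decoder output -> (lat, lon)
        self.model = nn.Transformer(d_model=dim, nhead=heads,
                                    num_encoder_layers=depth,
                                    num_decoder_layers=depth, batch_first=True)

    def forward(self, visual_tokens, noisy_gps):
        # visual_tokens: (B, T, dim) from TempGeo; noisy_gps: (B, T, 2)
        q = self.gps_embed(noisy_gps)
        out = self.model(src=visual_tokens, tgt=q)   # denoise via visual context
        return self.head(out)                        # (B, T, 2) refined GPS
```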

Loss & Training

Phase I: contrastive loss (cross-entropy between the similarity matrix of frame and GPS embeddings and the identity matrix). Phase II: weighted hinge loss jointly optimizing frame-level and video-level alignment.
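
A minimal sketch of the Phase I objective as described, i.e. symmetric cross-entropy over the frame-GPS similarity matrix with identity targets (the temperature value is an assumption):

```python
# InfoNCE-style contrastive loss: the i-th frame should match the i-th GPS.
import torch
import torch.nn.functional as F

def contrastive_loss(frame_emb, gps_emb, temperature: float = 0.07):
    # frame_emb, gps_emb: (N, D) paired embeddings for N frames and their GPS
    f = F.normalize(frame_emb, dim=-1)
    g = F.normalize(gps_emb, dim=-1)
    logits = f @ g.T / temperature                        # (N, N) similarity matrix
    targets = torch.arange(len(f), device=f.device)       # identity-matrix targets
    return 0.5 * (F.cross_entropy(logits, targets) +      # frames -> GPS
                  F.cross_entropy(logits.T, targets))     # GPS -> frames
```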

Key Experimental Results

Main Results

Model        Frame@1km↑   Frame@5km↑   Frame Median Error↓   Video@1km↑   DFD↓    MRD↓
GeoCLIP-ZS   2.7%         22.9%        11.54 km              3.8%         24.94   2.83
GeoCLIP-FT   22.5%        63.0%        2.97 km               18.6%        22.52   2.82
DINOv2-Cls   18.1%        58.2%        3.86 km               18.4%        4.28    1.60
VidTAG       41.0%        76.7%        1.35 km               39.8%        3.87    1.07

Ablation Study

Configuration         @1km↑    Median Error↓   DFD↓
CLIP only             32.5%    1.85 km         8.42
DINOv2 only           28.3%    2.15 km         5.12
Dual encoder          35.2%    1.62 km         6.78
+ TempGeo             38.1%    1.48 km         4.25
+ GeoRefiner (full)   41.0%    1.35 km         3.87

Key Findings

  • VidTAG surpasses GeoCLIP by 20 percentage points in @1km accuracy on MSLS, and outperforms the prior state of the art by 25% on CityGuessr68k.
  • TempGeo and GeoRefiner yield the most substantial improvements in trajectory quality (DFD, MRD).
  • The complementarity of the dual encoder is confirmed through ablation.

Highlights & Insights

  • Frame-to-GPS retrieval is an elegant problem reformulation: GPS gallery construction is simple and inexpensive, making global-scale per-frame localization feasible.
  • The denoising training strategy of GeoRefiner is noteworthy: injecting synthetic noise rather than directly using Phase I predictions avoids train-inference distribution mismatch.

Limitations & Future Work

  • The method relies on a uniform-grid GPS gallery; gallery resolution directly caps localization accuracy.
  • Performance may degrade in regions with sparse geographic coverage.
  • Additional cues such as OCR (road signs, text) are not exploited.
  • Integration with multimodal large language models for further geographic reasoning is a promising direction.

Comparison with Prior Methods

  • vs. GeoCLIP: GeoCLIP operates at the image level; VidTAG extends retrieval to video frames and addresses temporal consistency.
  • vs. CityGuessr: CityGuessr performs only video-level city prediction; VidTAG enables per-frame localization and trajectory mapping.

Rating

  • Novelty: ⭐⭐⭐⭐ First global-scale per-frame video geolocalization method
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-dataset, multi-metric, multi-baseline evaluation
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation and method description are clear
  • Value: ⭐⭐⭐⭐ Practical applications in forensics, social media analysis, and related domains