DINO-Foresight: Looking into the Future with DINO

Conference: NeurIPS 2025 · arXiv: 2412.11673 · Homepage: https://dino-foresight.github.io/ · Area: Autonomous Driving · Keywords: Future Prediction, VFM Feature Forecasting, DINOv2, Multi-Task Dense Prediction, Masked Feature Transformer, Self-Supervised Learning

TL;DR

This paper proposes DINO-Foresight, which forecasts future-frame feature evolution within the semantic feature space of a Vision Foundation Model (VFM). A self-supervised Masked Feature Transformer predicts PCA-compressed representations of multi-layer DINOv2 features. Paired with plug-and-play task-specific heads, a single model simultaneously handles semantic segmentation, instance segmentation, depth estimation, and surface normal prediction, substantially outperforming the VISTA world model while achieving 100× faster inference.

Background & Motivation

Background: Future scene prediction is critical for autonomous driving and robotics. Existing approaches fall into two categories: (a) pixel-level prediction—computationally expensive and focused on irrelevant details; (b) latent-space generative methods—using VAE latents for diffusion/autoregressive prediction, but VAE latents lack semantic alignment and cannot be directly used for downstream scene understanding.

Limitations of Prior Work: (a) VAE latents lack semantic content, requiring reconstruction back to RGB before applying task heads; (b) each downstream task requires an independently trained prediction model (PFA, F2MF, etc. are not scalable); (c) world models such as VISTA have 2.5B parameters and extremely slow inference.

Key Challenge: Autonomous driving decision systems require semantic scene understanding (object locations and categories), not low-level appearance reconstruction. Existing methods waste model capacity modeling irrelevant low-level details.

Key Insight: VFM features (e.g., DINOv2) are inherently rich in semantics and support multi-task heads. Directly predicting the temporal evolution of VFM features bypasses RGB reconstruction and enables future-frame understanding.

Core Idea: Rather than predicting future RGB frames or VAE latents, DINO-Foresight predicts the temporal evolution of DINOv2 features. The VFM feature space is treated as a semantically rich high-dimensional latent space; after prediction, off-the-shelf task heads can be attached directly for various dense prediction tasks.

Method

Overall Architecture

Given an input video sequence of \(N\) frames (\(N_c\) context frames + \(N_p\) frames to be predicted), a frozen DINOv2 ViT-B/14 extracts multi-layer features from all frames. After PCA compression, these serve as the target feature space. A Masked Feature Transformer replaces future-frame tokens with [MASK] tokens and predicts the masked features via spatiotemporal factorized attention. The predicted features are directly fed into task heads (DPT / Mask2Former) to produce various dense prediction outputs.
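
To make the pipeline concrete, here is a minimal inference-time sketch in PyTorch. The function and argument names (forecast_future_features, pca_mean, pca_basis, predictor, num_layers) are illustrative assumptions rather than the authors' released code; only the overall flow (frozen multi-layer DINOv2 features → PCA compression → masked feature prediction → frozen task head) follows the description above.

```python
import torch

@torch.no_grad()
def forecast_future_features(frames, backbone, pca_mean, pca_basis,
                             predictor, num_layers=4, n_future=1):
    """frames: (B, N_c, 3, H, W) context frames -> (B, n_future, h*w, D) predicted features.

    `backbone` is assumed to be a frozen DINOv2 ViT-B/14 exposing get_intermediate_layers;
    `pca_mean` (L*D_enc,) and `pca_basis` (D, L*D_enc) come from a PCA pre-fitted offline;
    `num_layers` (the L layers that are concatenated) is a placeholder, not the paper's value.
    """
    B, Nc = frames.shape[:2]
    tokens = []
    for t in range(Nc):
        # Frozen multi-layer feature extraction; each layer is (B, h*w, D_enc).
        layers = backbone.get_intermediate_layers(frames[:, t], n=num_layers)
        x = torch.cat(layers, dim=-1)              # (B, h*w, L*D_enc)
        # PCA compression to D dimensions, applied per spatial token.
        x = (x - pca_mean) @ pca_basis.T           # (B, h*w, D)
        tokens.append(x)
    tokens = torch.stack(tokens, dim=1)            # (B, N_c, h*w, D)

    # Masked Feature Transformer: future slots are filled internally with learned
    # [MASK] tokens and predicted via factorized space/time attention.
    return predictor(tokens, n_future=n_future)    # (B, n_future, h*w, D)

# A frozen task head (DPT / Mask2Former) then consumes the predicted features,
# optionally after decompressing them back to the full L*D_enc dimensionality.
```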

Key Designs

  1. Hierarchical Target Feature Construction:

    • Function: Construct a high-quality target feature space for prediction.
    • Mechanism: \(L\) layers of features \(\mathbf{F}^{(l)} \in \mathbb{R}^{N \times H \times W \times D_{enc}}\) are extracted from DINOv2 ViT and concatenated along the channel dimension to form \(\mathbf{F}_{concat} \in \mathbb{R}^{N \times H \times W \times L \cdot D_{enc}}\), then compressed via PCA to \(D \ll L \cdot D_{enc}\) dimensions, yielding target features \(\mathbf{F}_{TRG} = \mathbf{F}_{PCA}\).
    • Design Motivation: Multi-layer features capture semantic information at different levels of abstraction; PCA compression substantially reduces prediction difficulty while retaining over 98% of variance.
  2. Masked Feature Transformer:

    • Function: Self-supervised prediction of future-frame VFM features.
    • Mechanism: A 12-layer transformer, where each layer consists of temporal MSA + spatial MSA + FFN (see the block sketch after this list). Token embeddings project \(D\)-dimensional features to hidden dimension \(D_{dec}=1152\). During training, future-frame tokens are replaced with learnable [MASK] vectors; at inference, [MASK] tokens are appended for the frames to be predicted. Factorizing attention over space and time reduces the attention cost from \(O((N H W)^2)\) to \(O(H W \cdot N^2 + N \cdot (H W)^2)\).
    • Training Objective: SmoothL1 loss, \(\mathcal{L}_{MFM} = \mathbb{E}_{x \in \mathcal{X}} \left[ \sum_{p \in \mathcal{P}} L(\mathbf{F}_{TRG}(p), \tilde{\mathbf{F}}_{TRG}(p)) \right]\), with \(\beta=0.1\). Compared with MSE, SmoothL1 is more robust to outliers, while keeping smooth (quadratic) gradients near zero unlike plain L1.
  3. High-Resolution Training Strategy (three variants compared):

    • Low-resolution training + high-resolution inference via positional encoding interpolation: suffers from distribution shift, performs worst.
    • Sliding window: features extracted at high resolution, with random \(16 \times 32\) patch crops during training and sliding window at inference.
    • Two-stage training (optimal): train at low resolution \(224 \times 448\) for more epochs, then fine-tune at high resolution \(448 \times 896\) for fewer epochs. The advantage is that the transformer can exploit larger spatial context at full resolution.
  4. Modular Multi-Task Prediction Framework:

    • Function: A plug-and-play task head library where adding new tasks requires no retraining of the core model.
    • Mechanism: DPT heads are used for semantic segmentation, depth, and surface normals; Mask2Former is used for instance segmentation. Task heads are trained independently on frozen VFM features, optionally with PCA compression/decompression adaptation, and do not require video data during training.
    • Design Motivation: The unified nature of the VFM feature space allows task heads to be trained independently and combined freely.
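
The factorized spatiotemporal attention of design 2 can be sketched as a single predictor layer as follows. This is a minimal PyTorch sketch under assumptions: the head count, pre-norm placement, and exact ordering of temporal vs. spatial MSA follow the description above, not released code.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """One predictor layer: temporal MSA over frames, spatial MSA over tokens, then FFN.
    Input/output shape: (B, N, P, D) with N frames and P = h*w spatial tokens."""
    def __init__(self, dim=1152, heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, x):
        B, N, P, D = x.shape
        # Temporal attention: each spatial location attends across the N frames.
        t = x.permute(0, 2, 1, 3).reshape(B * P, N, D)
        tq = self.norm_t(t)
        t = t + self.attn_t(tq, tq, tq)[0]
        x = t.reshape(B, P, N, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends across its P tokens.
        s = x.reshape(B * N, P, D)
        sq = self.norm_s(s)
        s = s + self.attn_s(sq, sq, sq)[0]
        x = s.reshape(B, N, P, D)
        # Feed-forward network with residual connection.
        return x + self.ffn(self.norm_f(x))
```

Stacking 12 such blocks over the (N, h·w) token grid gives a predictor of the kind described above; each attention call only ever operates over one axis, which is where the cost reduction comes from.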

Loss & Training

  • Sequence length: \(N=5\) (\(N_c=4\) context frames + \(N_p=1\) prediction frame).
  • Hardware: 8×A100 40 GB, effective batch size 64.
  • Optimizer: Adam (\(\beta_1=0.9\), \(\beta_2=0.99\)), lr \(6.4 \times 10^{-4}\), cosine annealing.
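
A minimal training-step sketch tying the masked-feature loss and the optimizer settings together; the predictor interface and masking details are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(predictor, optimizer, feats, n_context=4):
    """feats: (B, N, P, D) PCA-compressed target features, N = n_context + n_future frames."""
    target = feats[:, n_context:]                    # features of the frames to predict
    # The predictor is assumed to substitute its learned [MASK] token for the
    # future-frame inputs internally; only the context features are passed in.
    pred = predictor(feats[:, :n_context], n_future=target.shape[1])
    # SmoothL1 masked feature modeling loss with beta = 0.1, as in the paper.
    loss = F.smooth_l1_loss(pred, target, beta=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer per the paper's settings, e.g.:
# optimizer = torch.optim.Adam(predictor.parameters(), lr=6.4e-4, betas=(0.9, 0.99))
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
```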

Key Experimental Results

Main Results — Cityscapes Multi-Task Future Prediction (Short-term)

| Method | Seg ALL | Seg MO | Inst AP50 | Depth \(\delta_1\) | Normals (11.25°) | Params |
|---|---|---|---|---|---|---|
| F2MF | 69.6 | 67.7 | - | - | - | - |
| PFA (semantic) | 71.1 | 69.2 | - | - | - | - |
| PFA (instance) | - | - | 48.7 | - | - | - |
| Futurist | 73.9 | 74.9 | - | 96.0 | - | - |
| VISTA (fine-tuned) | 64.9 | 62.1 | 33.1 | 86.4 | 93.0 | 2.5B |
| DINO-Foresight | 71.8 | 71.7 | 50.5 | 88.6 | 94.4 | ~0.1B |

Key comparisons:

  • Compared to VISTA (2.5B parameter world model): +6.9 mIoU on Seg ALL, +9.6 on MO, +17.4 AP50 on instance segmentation, +2.2 on depth \(\delta_1\).
  • Most significant advantage: a single prediction model handles all 4 tasks simultaneously, whereas methods such as PFA require an independent prediction model per task.
  • Inference speed: mid-term prediction for 500 scenes takes ~5 minutes vs. ~8.3 hours for VISTA (single A100), a 100× speedup.

VFM Encoder Comparison

| Encoder | Seg Short ALL | Seg Short MO | Depth \(\delta_1\) Short |
|---|---|---|---|
| VAE (Stable Diffusion) | 33.4 | 17.9 | 64.1 |
| SAM (ViT-B) | 65.3 | 59.3 | 81.3 |
| EVA2-CLIP (ViT-B) | 66.3 | 64.2 | 85.1 |
| DINOv2-Reg (ViT-B) | 71.8 | 71.7 | 88.6 |

VAE latent features are entirely inadequate for scene understanding (33.4 vs. 71.8), validating the paper's central hypothesis: what features are predicted matters far more than how they are predicted. DINOv2 achieves the best results across all tasks among the VFM encoders evaluated.

Continuous vs. Discrete VFM Representations

| Method | Seg Short ALL | Seg Mid ALL |
|---|---|---|
| Discrete (4M tokenizer) | 61.7 | 53.7 |
| Continuous (ours) | 68.9 | 57.3 |

Retaining continuous VFM feature representations (without vector quantization) yields a clear advantage for dense semantic prediction tasks.

Ablation Study on High-Resolution Training Strategy

| Strategy | Seg Short ALL | Seg Mid ALL |
|---|---|---|
| Low-res training + positional interpolation | 64.34 | 48.31 |
| Sliding window | 71.26 | 58.75 |
| Two-stage training | 71.81 | 59.78 |

Two-stage training achieves the best results, as the transformer can leverage larger spatial context at full resolution.

Key Findings

  • VFM features vs. VAE latents: The performance gap is striking (71.8 vs. 33.4), confirming that prediction in a semantic feature space is the fundamental reason for the method's success.
  • Multi-layer features outperform single-layer features by ~1.3 mIoU, validating the value of multi-scale semantic representations.
  • Zero-shot transfer: A model trained on Cityscapes, when evaluated directly on nuScenes, performs only marginally below a model trained directly on nuScenes, while surpassing all baselines.
  • Model scalability: Increasing parameters from Small (115M) → Base (258M) → Large (460M) yields consistent performance gains; combining training data (Cityscapes + nuScenes) is similarly effective.
  • Features taken from intermediate layers of the forecasting transformer can further improve downstream task performance (Appendix A.2), suggesting that the self-supervised feature-prediction training itself refines the VFM representations.

Highlights & Insights

  • Paradigm shift: from "predict pixels/latents → reconstruct RGB → run task heads" to "directly predict VFM features → attach off-the-shelf heads." This simplifies the system (no RGB decoder is needed) and naturally supports multi-task extension; it is the paper's central contribution.
  • PCA compression appears simple but is critical: reducing the \(L \cdot D_{enc}\)-dimensional concatenated features to \(D\) dimensions substantially lowers the difficulty of prediction while retaining over 98% of the variance (a brief sanity-check sketch follows this list). Simple techniques applied in the right place are often the most effective.
  • Modular design offers substantial engineering value: adding a new task requires only training a new head (without even requiring video data), with no need to retrain the core prediction model, making deployment highly practical.
  • An important implicit insight is the temporal continuity of the VFM feature space: although VFMs are trained on static images, their features evolve smoothly and predictably over time, which is a prerequisite for the entire approach to work.
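
As a side note on the PCA point above, the "over 98% retained variance" criterion can be checked with a standard PCA fit on a sample of concatenated multi-layer tokens. This is a generic sketch, not the paper's procedure; the shapes and the compressed dimension are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

# tokens: (num_samples, L * D_enc) concatenated multi-layer DINOv2 features.
# Random data is used here only so the snippet runs standalone; in practice the
# tokens would be gathered from a subset of training frames.
tokens = np.random.randn(20_000, 4 * 768).astype(np.float32)

pca = PCA(n_components=768)          # compressed dimension D (placeholder value)
pca.fit(tokens)
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by {pca.n_components_} components: {retained:.3f}")
```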

Limitations & Future Work

  • Only \(N_p=1\) frame (~0.5 s) is predicted; performance degradation under long-horizon multi-frame prediction is not thoroughly investigated.
  • Dependence on a frozen VFM means that the VFM's own biases (e.g., weaknesses in extreme weather or nighttime scenes) are directly inherited by the prediction.
  • Action-conditioned prediction is not addressed, precluding use in closed-loop planning scenarios.
  • Instance segmentation degrades substantially under mid-term prediction (AP50 drops from 50.5 to 27.3), indicating that fine-grained instance-level information is harder to preserve over longer horizons in feature space.
  • Evaluation is limited to two urban driving datasets (Cityscapes and nuScenes); generalization to other domains (indoor, off-road) is not verified.

Comparison with Related Work

  • vs. VISTA: VISTA is a 2.5B-parameter world model that reconstructs full RGB frames; DINO-Foresight (~0.1B) predicts only semantic features, achieves better performance on all four tasks, and is 100× faster. VISTA's sole advantage is its ability to generate visualizable RGB frames.
  • vs. F2F/F2MF/PFA: These methods rely on task-specific encoder features, requiring an independent prediction model per task; DINO-Foresight uses a unified VFM feature space to achieve multi-task prediction with a single model.
  • vs. Futurist: Futurist supports only semantic segmentation and depth, and requires multi-modal feature prediction; DINO-Foresight extends to four tasks with a simpler architecture.
  • vs. DINO-WM: A concurrent work that also uses DINOv2 for world modeling, but targets action-conditioned planning in simulated environments; DINO-Foresight targets multi-task dense prediction in real-world scenes.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The paradigm of VFM feature space prediction with plug-and-play heads is pioneering
  • Experimental Thoroughness: ⭐⭐⭐⭐ Four tasks with extensive ablations, though validated only on two urban driving datasets
  • Writing Quality: ⭐⭐⭐⭐ Clear exposition, well-designed experiments, and convincing motivation
  • Value: ⭐⭐⭐⭐⭐ A paradigm-level contribution; the modular design is engineering-friendly and opens an efficient new direction for scene prediction