# DONUT: A Decoder-Only Model for Trajectory Prediction
**Conference:** ICCV 2025 | **arXiv:** 2506.06854 | **Code:** https://vision.rwth-aachen.de/donut | **Area:** Autonomous Driving | **Keywords:** trajectory prediction, decoder-only, autoregressive model, motion prediction, autonomous driving
## TL;DR
DONUT draws inspiration from the decoder-only architecture of LLMs and proposes a unified autoregressive model for processing both historical and future trajectories, coupled with an overprediction strategy to improve anticipation of the distant future. It achieves state-of-the-art performance on the Argoverse 2 benchmark.
## Background & Motivation
Motion prediction is a core task in autonomous driving—by forecasting the future trajectories of surrounding agents, autonomous vehicles can plan ahead. Dominant approaches adopt an encoder-decoder architecture: the encoder embeds historical trajectories, and the decoder predicts future ones.
Limitations of Prior Work:
- Single-shot prediction of the entire future lacks awareness of scene elements near distant time steps, leading to inaccurate long-horizon predictions.
- Recurrent decoders, while iterative, suffer from inconsistency—inputs alternate between learned embeddings and previous-step outputs, complicating the decoder's task.
- Recurrent decoders can only access the "stale" historical information provided by the encoder; at distant time steps, the states of other agents are already significantly out of sync.
- Encoder and decoder employ different modules, creating a structural disconnect in how historical and future trajectories are processed.
Key Insight: Analogous to the successful decoder-only paradigm in LLMs, the paper unifies the processing of historical and future trajectory sequences within a single autoregressive model to ensure consistency and up-to-date information. Inspired by multi-token prediction in LLMs, an overprediction strategy is introduced to help the model anticipate further into the future.
## Method

### Overall Architecture
All agent trajectories are segmented into sub-trajectories of \(T_{\text{sub}}=10\) time steps (1 second). A single decoder network processes historical sub-trajectories and autoregressively predicts future sub-trajectories step by step. Each step consists of: a proposer generating initial predictions and overpredictions → updating the reference point → a refiner correcting the predictions.
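The decode loop above can be sketched as follows. This is a minimal numpy mock-up of the control flow only: `proposer` and `refiner` are stand-in stubs (the real modules are attention networks), and the motion-token construction is a simplifying assumption.

```python
import numpy as np

T_SUB = 10          # time steps per sub-trajectory (1 s at 10 Hz)
N_FUTURE_STEPS = 6  # six autoregressive steps -> 6 s horizon
D = 2               # (x, y)

def proposer(token, ref):
    """Stub proposer: emits the next sub-trajectory plus an
    overprediction of the one after it (both relative to `ref`)."""
    pred = np.tile(token, (T_SUB, 1)) + ref          # main prediction
    over = np.tile(token, (T_SUB, 1)) + ref + token  # auxiliary overprediction
    return pred, over

def refiner(pred, ref):
    """Stub refiner: identity here; the real module re-attends
    to the scene from the updated reference point."""
    return pred

def rollout(history):
    """Autoregressive decoding: proposer -> reference update -> refiner."""
    ref = history[-1]                   # current reference point
    token = history[-1] - history[-2]   # crude motion token from history
    future = []
    for _ in range(N_FUTURE_STEPS):
        pred, _over = proposer(token, ref)   # overprediction discarded at inference
        ref = pred[-1]                       # move reference to predicted endpoint
        pred = refiner(pred, ref)            # refine with up-to-date scene info
        future.append(pred)
        token = pred[-1] - pred[-2]          # next input token from own output
    return np.concatenate(future, axis=0)    # (N_FUTURE_STEPS * T_SUB, D)

history = np.cumsum(np.ones((T_SUB, D)), axis=0)  # dummy straight-line history
future = rollout(history)
print(future.shape)  # (60, 2)
```

Note how the loop feeds the model's own output back as the next input token, which is exactly the consistency property the paper argues recurrent encoder-decoder designs lack.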
### Key Designs
- **Unified Decoder-Only Architecture**
    - Function: A single model both encodes historical trajectories and predicts future ones.
    - Mechanism:
        - Historical trajectories are processed through the proposer and refiner to produce historical tokens.
        - Future sub-trajectories are predicted step by step with the same network structure.
        - After each prediction, the reference point is updated to the predicted endpoint, and relative positional encodings with respect to surrounding scene elements are recomputed.
        - A query-centric scheme (each scene element has its own local reference frame) enables reuse of features from the previous step.
    - Design Motivation: Eliminates the inconsistencies of encoder-decoder designs (heterogeneous input types, stale information about other agents) and lets the model transfer knowledge naturally from historical encoding to future prediction.
- **Overprediction Strategy**
    - Function: When predicting the current sub-trajectory, the model also predicts the next sub-trajectory as an auxiliary task.
    - Mechanism: The proposer outputs a prediction \(\hat{Y}'_{\{0;T_{\text{sub}}\}}\) and an overprediction \(\hat{Y}'^{\text{over}}_{\{T_{\text{sub}};2T_{\text{sub}}\}}\). The overprediction is supervised by ground truth during training and discarded at inference.
    - Design Motivation: Inspired by multi-token prediction in LLMs. It forces the model to consider a longer temporal horizon, giving the current prediction step better awareness of the future. Experiments show that overprediction stabilizes training convergence and unlocks the potential of the refiner.
- **Proposer and Refiner**
    - Function: Produce the initial prediction and its subsequent correction, respectively.
    - Core Structure: Both share the same architecture.
        - Tokenizer: Extracts Fourier features of position, heading, motion vectors, and velocity for each sub-trajectory; an MLP fuses these into a single token.
        - Four attention types: (1) temporal self-attention (historical tokens of the same agent), (2) map attention (road tokens within radius \(r = 50\) m), (3) social attention (tokens of other agents), (4) modal attention (across different modes of the same agent).
        - Detokenizer: An MLP predicting the next sub-trajectory and the overprediction.
    - Design Motivation: After the reference point is updated, the refiner can access the latest predicted trajectories of other agents and more accurate scene information.
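The tokenizer step can be illustrated with a small numpy sketch. The frequency schedule, feature dimension (128), and the single random linear layer standing in for the trained MLP are all assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def fourier_features(x, n_freqs=4):
    """Map inputs to [sin, cos] features at geometrically spaced
    frequencies (the exact schedule here is an assumption)."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi  # 2^k * pi
    ang = x[..., None] * freqs                 # (..., n_freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

def tokenize_subtrajectory(pos, heading, rng=np.random.default_rng(0)):
    """Turn one sub-trajectory into a single token: Fourier-encode
    position, heading, motion vectors, and speed, then fuse with a
    (random, untrained) linear layer standing in for the MLP."""
    motion = np.diff(pos, axis=0, prepend=pos[:1])  # per-step motion vectors
    speed = np.linalg.norm(motion, axis=-1)         # per-step speed
    feats = np.concatenate([
        fourier_features(pos).reshape(len(pos), -1),
        fourier_features(heading),
        fourier_features(motion).reshape(len(pos), -1),
        fourier_features(speed),
    ], axis=-1).reshape(-1)                         # flatten over time
    W = rng.standard_normal((128, feats.size))      # stand-in for the MLP
    return W @ feats                                # one 128-d token

pos = np.cumsum(np.ones((10, 2)) * 0.1, axis=0)  # dummy 1 s sub-trajectory
heading = np.zeros(10)
token = tokenize_subtrajectory(pos, heading)
print(token.shape)  # (128,)
```

Collapsing ten time steps into one token is what makes the sequence short enough for autoregressive decoding; it is also the likely source of the minADE gap discussed in the limitations below.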
### Loss & Training
- Positions are parameterized with a Laplace mixture distribution: \(p(\hat{Y}_n^{\text{pos}}) = \sum_k P_{n,k} \prod_t \text{Laplace}(\hat{Y}_{n,t}^{\text{pos}} \mid \mu_{n,t,k}, b_{n,t,k})\), where \(P_{n,k}\) is the probability of mode \(k\) and \(\mu_{n,t,k}, b_{n,t,k}\) are per-step location and scale parameters.
- Headings are parameterized with a von Mises distribution.
- Loss is computed only for the closest mode (minimum endpoint distance).
- Losses are applied independently to proposed and refined trajectories, as well as to main predictions and overpredictions.
- Training: AdamW optimizer, 4×H100 GPUs, batch size 64 (with gradient accumulation), 60 epochs.
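The closest-mode (winner-takes-all) position loss can be sketched in a few lines of numpy. This covers only the Laplace regression term for the best mode; the mode-probability supervision and the von Mises heading loss are omitted, and all shapes here are illustrative.

```python
import numpy as np

def laplace_nll(y, mu, b):
    """Per-element Laplace negative log-likelihood: log(2b) + |y - mu| / b."""
    return np.log(2.0 * b) + np.abs(y - mu) / b

def closest_mode_loss(gt, mu, b):
    """Winner-takes-all regression: pick the mode whose predicted endpoint
    is closest to the ground-truth endpoint, then score only that mode's
    full trajectory under the Laplace likelihood.
    gt: (T, 2); mu, b: (K, T, 2)."""
    end_dist = np.linalg.norm(mu[:, -1] - gt[-1], axis=-1)  # (K,)
    k = int(np.argmin(end_dist))                            # closest mode
    return laplace_nll(gt, mu[k], b[k]).mean(), k

T, K = 60, 6
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(T, 2)), axis=0)   # dummy ground-truth track
mu = np.stack([gt + m * 0.1 for m in range(K)])   # mode 0 matches gt exactly
b = np.full((K, T, 2), 0.5)
loss, best = closest_mode_loss(gt, mu, b)
print(best)  # 0
```

Applying this loss only to the closest mode keeps the remaining modes free to cover other plausible futures, which is why the model can maintain diverse multi-modal predictions.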
## Key Experimental Results

### Main Results (Argoverse 2 Test Leaderboard; * marks ensemble entries, shown for reference)
| Method | b-minFDE₆↓ | minFDE₆↓ | minADE₆↓ | MR₆↓ |
|---|---|---|---|---|
| QCNet | 1.91 | 1.29 | 0.65 | 0.16 |
| DeMo | 1.84 | 1.17 | 0.61 | 0.13 |
| SmartRefine | 1.86 | 1.23 | 0.63 | 0.15 |
| SEPT* | 1.74 | 1.15 | 0.61 | 0.14 |
| QCNet* (ensemble) | 1.78 | 1.19 | 0.62 | 0.14 |
| DeMo* (ensemble) | 1.73 | 1.11 | 0.60 | 0.12 |
| DONUT | 1.79 | 1.16 | 0.63 | 0.14 |
DONUT achieves state-of-the-art among non-ensemble methods on the primary metric b-minFDE₆, and achieves overall state-of-the-art on MR₁ (0.54).
### Ablation Study
| Configuration | b-minFDE₆↓ | minFDE₆↓ | minADE₆↓ | MR₆↓ | Note |
|---|---|---|---|---|---|
| Encoder-decoder baseline | 1.874 | 1.253 | 0.720 | 0.157 | QCNet |
| Decoder-only | 1.838 | 1.198 | 0.745 | 0.145 | Decoder-only alone improves FDE |
| + Overprediction | 1.838 | 1.193 | 0.728 | 0.146 | Marginal improvement |
| + Refinement | 1.835 | 1.218 | 0.751 | 0.150 | Unstable training when used alone |
| + Both | 1.807 | 1.176 | 0.722 | 0.144 | Significant synergy |
### Key Findings
- The decoder-only architecture outperforms the encoder-decoder baseline on both b-minFDE₆ and minFDE₆, with a particularly pronounced advantage in long-horizon prediction (6 seconds).
- Overprediction and refinement yield limited gains individually, but their combination produces significant synergistic improvement.
- The largest accuracy gains occur at distant horizons: the decoder-only model achieves substantially lower FDE than the encoder-decoder in the 3–6 second range.
- DONUT achieves state-of-the-art on MR₁ (0.54), indicating highly accurate best-mode predictions.
- The encoder-decoder baseline slightly outperforms DONUT on minADE₆, possibly because its encoder generates a dedicated token for each individual time step, offering finer granularity.
## Highlights & Insights
- The core insight draws on the analogy to the success of LLMs: motion prediction is fundamentally a sequence prediction task, for which the decoder-only paradigm is most fitting.
- The overprediction strategy cleverly adapts multi-token prediction from LLMs to the trajectory prediction domain.
- Real-time reference point updates ensure the model always has access to scene information in the vicinity of the current position.
- Qualitative analyses are highly intuitive: in complex intersection scenarios, DONUT demonstrably outperforms QCNet.
## Limitations & Future Work
- DONUT slightly underperforms the encoder-decoder baseline on minADE₆; the coarse-grained sub-trajectory tokenization may sacrifice precision at intermediate time steps.
- Validation is currently limited to single-agent prediction; extension to joint multi-agent prediction warrants future exploration.
- Inference speed analysis is absent—it remains unclear whether the autoregressive approach is slower than single-shot prediction.
- Sensitivity analysis with respect to different values of \(T_{\text{sub}}\) is insufficient.
## Related Work & Insights
- Builds upon QCNet's query-centric scene encoding, replacing the decoder structure on top of this foundation.
- Closely mirrors the decoder-only trend in the LLM community (GPT series).
- Differs from GPT-style models in motion simulation (e.g., MotionLM): motion prediction requires a fixed number of multi-modal predictions to cover the future distribution.
- Takeaway: successful paradigms from NLP can be transferred to other sequence prediction tasks through thoughtful adaptation.
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of decoder-only architecture and overprediction is a novel contribution to the trajectory prediction field.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are comprehensive, leaderboard validation is rigorous, and qualitative analyses are intuitive.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is articulated clearly; the analogy to LLMs is apt and well-calibrated.
- Value: ⭐⭐⭐⭐ Achieves state-of-the-art on the highly competitive Argoverse 2 benchmark, validating the effectiveness of the decoder-only paradigm for trajectory prediction.