Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction

Conference: CVPR 2026
arXiv: 2603.10597
Code: zhouhao94/PRF
Area: Autonomous Driving
Keywords: trajectory prediction, variable-length observation, progressive retrospection, knowledge distillation, autonomous driving

TL;DR

This paper proposes the Progressive Retrospective Framework (PRF), which employs cascaded retrospective units to progressively align features from incomplete observations to those of complete observations, substantially improving variable-length trajectory prediction performance in a plug-and-play manner compatible with existing methods.

Background & Motivation

Trajectory prediction is a core task in autonomous driving: Accurately forecasting the future motion of dynamic traffic participants is critical for safe planning and collision avoidance.
Existing methods rely on fixed-length observations: The vast majority of methods are optimized for standard-length inputs (e.g., 50 or 20 steps) and are highly sensitive to variation in observation length.
Incomplete observations are pervasive in real-world scenarios: Vehicles newly entering the perception range, re-detected after occlusion, or recovered from tracking loss all produce variable-length or incomplete trajectories.
Performance degrades sharply as observations shorten: The SOTA method DeMo sees mADE6 deteriorate from 0.658 to 0.861 on Argoverse 2 when the observation length drops from 50 to 10 steps, a relative increase of roughly 31%.
One-step mapping strategies struggle with short trajectories: Existing approaches (DTO, FLN, LaKD, CLLS) directly map incomplete features to complete features, performing poorly when the information gap is large.
Independent training (IT) offers diminishing returns at high cost: Training a separate model for each observation length yields marginal improvements but incurs enormous computational and storage overhead.

Method

Overall Architecture: Progressive Retrospective Framework (PRF)

PRF inserts \(\tau\) cascaded retrospective units between the encoder and decoder. Each unit \(\Phi^v\) maps the features of an incomplete observation of length \(T_v\) toward the features of the next longer length \(T_{v-1} = T_v + \Delta T\), i.e., it recovers an additional \(\Delta T\) steps of history. At inference, an input of length \(T_v\) passes sequentially through \(\Phi^v, \Phi^{v-1}, \dots, \Phi^1\), progressively recovering to the standard length \(T_0\), before being fed into the shared decoder for prediction.

Plug-and-play: PRF operates between the encoder and decoder and is directly compatible with existing prediction methods (QCNet, DeMo).
Shared encoder: A single encoder extracts features for all variable-length observations, avoiding the need to maintain multiple models.
Progressive alignment: Each unit need only bridge a small temporal gap \(\Delta T\), reducing learning difficulty.
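
The cascaded routing described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `run_cascade`, `make_toy_unit`, and the value `tau = 4` are hypothetical, and the toy units only record which stage ran, whereas real units would be learned modules operating on feature tensors.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a feature tensor: a list of applied-stage labels.
Feature = List[str]
Unit = Callable[[Feature], Feature]


def make_toy_unit(v: int) -> Unit:
    """Build a toy unit Phi^v that only records that stage v was applied."""
    def unit(features: Feature) -> Feature:
        return features + [f"Phi^{v}"]
    return unit


def run_cascade(features: Feature, v: int, units: Dict[int, Unit]) -> Feature:
    """Pass features of an observation of length T_v through Phi^v, ..., Phi^1.

    v = 0 means the observation already has the standard length T_0,
    so no retrospective unit is applied.
    """
    for stage in range(v, 0, -1):  # v, v-1, ..., 1
        features = units[stage](features)
    return features


tau = 4  # assumed number of units (e.g. lengths 10, 20, 30, 40 recovering to 50)
units = {v: make_toy_unit(v) for v in range(1, tau + 1)}

print(run_cascade([], 3, units))  # ['Phi^3', 'Phi^2', 'Phi^1']
print(run_cascade([], 0, units))  # []
```

The output of `run_cascade` would then go to the shared decoder, so a full-length input skips the cascade entirely.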

Retrospective Distillation Module (RDM)

RDM adopts a residual distillation strategy that models missing time-step features as learnable residuals, avoiding feature conflicts induced by the shared encoder:

Scene context injection: Agent features are fused with HD Map features \(\mathbf{F}_m\) via cross-attention.
Dual-branch structure:
Logit branch: self-attention → MLP → Sigmoid, producing an element-wise gating vector \(\mathbf{g}^v\).
Residual branch: self-attention → MLP → ReLU, learning a residual feature \(\mathbf{F}_r^v\).
Gated fusion: \(\tilde{\mathbf{F}}^{v-1} = \mathbf{g}^v \odot \mathbf{F}^v + \mathbf{F}_r^v\), retaining reliable components while supplementing missing information.
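
The gated fusion \(\mathbf{g}^v \odot \mathbf{F}^v + \mathbf{F}_r^v\) can be illustrated on plain vectors. In this sketch the gate logits and residual pre-activations are made-up numbers standing in for the outputs of the two branches' attention and MLP layers, which are not modeled here; `gated_fusion` is a hypothetical helper name.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(0.0, x)


def gated_fusion(
    f_v: List[float],
    gate_logits: List[float],
    residual_pre: List[float],
) -> List[float]:
    """Element-wise g * F^v + F_r, with g = sigmoid(logits), F_r = relu(pre-activation)."""
    if not (len(f_v) == len(gate_logits) == len(residual_pre)):
        raise ValueError("all inputs must have the same length")
    return [
        sigmoid(z) * f + relu(r)
        for f, z, r in zip(f_v, gate_logits, residual_pre)
    ]


# Toy inputs (assumed values, for illustration only).
f_v = [2.0, -1.0, 0.5]
gate_logits = [0.0, 10.0, -10.0]   # gate ~0.5, ~1 (keep), ~0 (suppress)
residual_pre = [1.0, -3.0, 0.25]   # ReLU zeroes the negative entry

fused = gated_fusion(f_v, gate_logits, residual_pre)
print(fused)
```

A gate near 1 passes the original feature through almost unchanged, while a gate near 0 lets the residual branch supply that component instead.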

Retrospective Prediction Module (RPM)

RPM recovers the missing \(\Delta T\) historical time steps from the distilled features, employing a decoupled query strategy for coarse-to-fine retrospection:

Anchor-Free Mode Queries: \(K\) mode queries are initialized via MLP → cross-attention extracts scene features → self-attention models inter-mode interactions → multimodal coarse trajectory proposals are predicted.
Anchor-Based State Queries: \(\Delta T\) state queries are initialized via MLP → cross-attention + Mamba models temporal dynamics → fine-grained refinement anchored on the coarse proposals.
Cross-unit sharing: All retrospective units share a single RPM (since each recovers a fixed \(\Delta T\) steps); batched processing accelerates training.
Training-only module: RPM provides implicit supervision for RDM and is disabled at inference, adding no inference overhead.

Rolling-Start Training Strategy (RSTS)

Exploiting PRF's natural support for short-trajectory training, RSTS generates multiple training samples from a single sequence:

In addition to the standard sample \(([1,50], [51,110])\), samples \(([1,40],[41,100])\), \(([1,30],[31,90])\), and \(([1,20],[21,80])\) are also generated.
The number of training samples a retrospective unit receives decreases as its input length grows: shorter observations are harder to retrospect and thus receive more training data.
On Argoverse 2, a single sequence yields 4 decoder training samples and \(\{4,3,2,1\}\) samples for the retrospective units, ordered from the shortest input length to the longest.
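
The sample windows and per-unit counts above can be reproduced with a short sketch. The counting rule is an assumption made for illustration: a unit mapping length \(L\) to \(L + \Delta T\) is taken to be trainable on any sample with at least \(L + \Delta T\) observed steps. The function names and defaults (50 observed steps, 60 future steps, \(\Delta T = 10\)) are hypothetical and chosen to match the Argoverse 2 numbers quoted here.

```python
from typing import Dict, List, Tuple

# ((obs_start, obs_end), (fut_start, fut_end)), 1-indexed and inclusive.
Window = Tuple[Tuple[int, int], Tuple[int, int]]


def rolling_start_samples(
    t_obs: int = 50, t_fut: int = 60, delta_t: int = 10, min_obs: int = 20
) -> List[Window]:
    """Generate (observation, future) windows with progressively earlier splits."""
    samples = []
    for obs_end in range(t_obs, min_obs - 1, -delta_t):
        samples.append(((1, obs_end), (obs_end + 1, obs_end + t_fut)))
    return samples


def samples_per_unit(samples: List[Window], delta_t: int = 10) -> Dict[int, int]:
    """Count samples usable by each unit, keyed by the unit's input length.

    Assumption: the unit with input length L needs a sample whose observed
    part covers at least L + delta_t steps.
    """
    max_obs = max(obs[1] for obs, _ in samples)
    return {
        length: sum(1 for obs, _ in samples if obs[1] >= length + delta_t)
        for length in range(delta_t, max_obs, delta_t)
    }


samples = rolling_start_samples()
print(samples[0], samples[-1])    # ((1, 50), (51, 110)) ((1, 20), (21, 80))
print(len(samples))               # 4
print(samples_per_unit(samples))  # {10: 4, 20: 3, 30: 2, 40: 1}
```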

Loss & Training

End-to-end training comprises three components:

Decoder loss: Smooth-L1 (trajectory regression) + cross-entropy (mode probability classification), following the QCNet/DeMo setup.
RPM loss: \(\mathcal{L}_{rpm} = \frac{1}{\tau}\sum_{v=1}^{\tau}(\mathcal{L}_{mq}^v + \mathcal{L}_{sq}^v)\), supervising mode queries and state queries respectively.
RDM loss: \(\mathcal{L}_{rdm} = \frac{1}{\tau}\sum_{v=1}^{\tau}\text{SmoothL1}(\tilde{\mathbf{F}}^{v-1}, \mathbf{F}^{v-1})\)
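
As a rough illustration of the RDM term, the sketch below averages a mean-reduced Smooth-L1 distance over the units. The threshold `beta = 1.0` and the mean reduction are assumptions (they mirror common library defaults, not settings stated in the source), and the feature vectors are toy values.

```python
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean Smooth-L1: quadratic for |d| < beta, linear otherwise."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def rdm_loss(
    distilled: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> float:
    """(1 / tau) * sum over units of SmoothL1(distilled features, target features)."""
    if len(distilled) != len(targets) or len(distilled) == 0:
        raise ValueError("need one target per unit")
    per_unit = [smooth_l1(d, t, beta) for d, t in zip(distilled, targets)]
    return sum(per_unit) / len(per_unit)


# Two toy units: the first is off by 0.5 and 2.0, the second matches exactly.
distilled = [[0.0, 2.0], [1.0, 1.0]]
targets = [[0.5, 0.0], [1.0, 1.0]]
print(rdm_loss(distilled, targets))  # 0.40625
```

The first unit contributes (0.125 + 1.5) / 2 = 0.8125 and the second contributes 0, so the average over the two units is 0.40625.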

Key Experimental Results

Variable-Length Trajectory Prediction (Argoverse 2 Validation Set, mADE6/mFDE6)

Method	Obs=10	Obs=20	Obs=30	Obs=40	Obs=50	Avg-Δ50
DeMo-Ori	0.861/1.533	0.700/1.358	0.671/1.306	0.662/1.288	0.658/1.278	0.066/0.093
DeMo-CLLS	0.641/1.258	0.630/1.249	0.623/1.234	0.614/1.225	0.615/1.223	0.012/0.019
DeMo-PRF	0.617/1.183	0.603/1.155	0.598/1.143	0.599/1.145	0.596/1.142	0.008/0.015
QCNet-CLLS	0.735/1.247	0.727/1.232	0.725/1.227	0.719/1.222	0.714/1.215	0.013/0.017
QCNet-PRF	0.727/1.213	0.711/1.181	0.706/1.169	0.702/1.164	0.702/1.166	0.010/0.016
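
The Avg-Δ50 column is not defined in this summary. Assuming it denotes the mean gap between each shorter observation length and the Obs=50 result, the reported values can be checked from the table rows as in the sketch below; `avg_gap_to_full` is a hypothetical helper name.

```python
from typing import Sequence


def avg_gap_to_full(metric_by_length: Sequence[float]) -> float:
    """Mean of (metric at a shorter length - metric at full length).

    The last entry is taken as the full-length (Obs=50) result.
    """
    *shorter, full = metric_by_length
    if not shorter:
        raise ValueError("need at least one shorter-length result")
    return sum(m - full for m in shorter) / len(shorter)


# mADE6 at Obs = 10, 20, 30, 40, 50, copied from the table above.
demo_ori_made = [0.861, 0.700, 0.671, 0.662, 0.658]
demo_prf_made = [0.617, 0.603, 0.598, 0.599, 0.596]

print(f"{avg_gap_to_full(demo_ori_made):.5f}")
print(f"{avg_gap_to_full(demo_prf_made):.5f}")
```

Under this assumed definition the results are about 0.0655 for DeMo-Ori and 0.0083 for DeMo-PRF, in line with the 0.066 and 0.008 reported in the mADE6 part of the Avg-Δ50 column.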

Standard Prediction: Argoverse 2 Leaderboard (b-mFDE6)

Method	b-mFDE6	mADE6	mFDE6	MR6
DeMo+ReMo	1.84	0.61	1.17	0.13
DeMo-PRF	1.81	0.60	1.14	0.13

Ablation Study (Argoverse 2 Validation Set, DeMo backbone)

RDM	RPM	RSTS	Obs=10 (mADE6/mFDE6)	Obs=50 (mADE6/mFDE6)
✗	✗	✗	0.876/1.455	0.725/1.256
✓	✗	✗	0.655/1.257	0.639/1.231
✓	✓	✗	0.652/1.241	0.635/1.208
✓	✓	✓	0.617/1.183	0.596/1.142

RDM contributes the most: at Obs=10, mADE6 drops from 0.876 to 0.655 (↓25.2%).
RPM further reduces mFDE6 by approximately 1.3% on top of RDM at Obs=10 (1.257 to 1.241).
RSTS improves performance across all observation lengths; at Obs=10, mADE6 decreases by roughly a further 5.4% (0.652 to 0.617).
Progressive distillation vs. direct distillation: mADE6 of 0.652 vs. 0.663 at Obs=10, with the advantage more pronounced for shorter sequences.
Mamba vs. GRU vs. Attention (RPM temporal modeling): Mamba achieves the best mFDE6 across all settings.
Inference overhead: each additional retrospective stage adds only approximately 0.07G FLOPs and 0.03s latency.
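
The relative reductions quoted in this list follow directly from the ablation table. A small helper (hypothetical name `relative_reduction`) makes the arithmetic explicit:

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a lower-is-better metric drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return 100.0 * (before - after) / before


# Values copied from the ablation table (Obs=10).
rdm_gain = relative_reduction(0.876, 0.655)   # mADE6, adding RDM
rpm_gain = relative_reduction(1.257, 1.241)   # mFDE6, adding RPM
rsts_gain = relative_reduction(0.652, 0.617)  # mADE6, adding RSTS

print(f"{rdm_gain:.2f}% {rpm_gain:.2f}% {rsts_gain:.2f}%")
```

The three values come out at roughly 25.2%, 1.3%, and 5.4% respectively.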

Highlights & Insights

Progressive retrospection is both simple and effective: Decomposing "long-range feature alignment" into multiple "short-range alignment" steps substantially reduces learning difficulty, as clearly verified by t-SNE visualization.
Plug-and-play design: Inserted between encoder and decoder, PRF successfully adapts to two SOTA methods, QCNet and DeMo.
RPM is training-only: It provides implicit supervision without adding inference overhead, making it engineering-friendly.
RSTS data augmentation is elegantly designed: It leverages the variable-length property to generate multiple samples from a single sequence, and shorter trajectories receive more training data.
SOTA on both standard and variable-length tracks: PRF leads comprehensively on variable-length prediction and simultaneously sets a new record on the standard Argoverse 2 leaderboard.

Limitations & Future Work

Discrete observation lengths: Only observation lengths that are integer multiples of \(\Delta T\) are supported; intermediate lengths must be truncated to the nearest valid value (e.g., 32→30), potentially wasting information.
Inference latency scales linearly with missing steps: The shortest observation must pass through all \(\tau\) retrospective units; at 10 steps, inference time is 1.9× that of the standard setting (0.268s vs. 0.140s).
Validation limited to two backbones: Although plug-and-play compatibility is claimed, the method is only validated on QCNet and DeMo; diffusion-based and GPT-based predictors remain untested.
Extremely short observations unexplored: The behavior under only 1–5 observation steps is not investigated.
Training cost not thoroughly compared: RSTS generates several times more samples, yet total training time (8×RTX4090, 60 epochs) is not compared against baselines.
No real-world deployment validation: All experiments are conducted on offline datasets; online or on-vehicle deployment results are not presented.
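
The first two limitations can be made concrete with a small sketch. It assumes \(\Delta T = 10\) and a standard length of 50 steps, and it linearly interpolates latency between the two reported endpoints (0.140s with no units, 0.268s with four), giving about 0.032s per unit. The linear model and all function names are assumptions for illustration, not measurements from the paper.

```python
def truncate_length(t_obs: int, delta_t: int = 10, t_full: int = 50) -> int:
    """Round an observation length down to the nearest supported multiple of delta_t."""
    if t_obs < delta_t:
        raise ValueError("observation is shorter than the smallest supported length")
    return min(t_full, (t_obs // delta_t) * delta_t)


def units_needed(t_obs: int, delta_t: int = 10, t_full: int = 50) -> int:
    """Number of retrospective units the truncated input must pass through."""
    return (t_full - truncate_length(t_obs, delta_t, t_full)) // delta_t


def estimated_latency(
    t_obs: int, base_s: float = 0.140, per_unit_s: float = 0.032
) -> float:
    """Assumed linear latency model: base cost plus a fixed cost per unit."""
    return base_s + per_unit_s * units_needed(t_obs)


print(truncate_length(32))   # 30
print(units_needed(32))      # 2
print(units_needed(50))      # 0
print(f"{estimated_latency(10):.3f}")  # 0.268
```

In this sketch an input of 32 steps discards its two oldest usable steps and still pays for two retrospective units, which is the information loss the first limitation points to.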

Method	Strategy	Short-Trajectory Performance	Inference Overhead	Compatibility
DTO	Teacher–student distillation	Moderate	None	Medium
FLN	Temporal-invariant representation	Moderate	None	Medium
LaKD	Length-agnostic distillation	Good	None	Medium
CLLS	Contrastive learning	Good	None	Medium
PRF	Progressive retrospective distillation	Best	Marginal increase	High (plug-and-play)

Core distinction: the above methods all perform one-step mapping from short features to long features, whereas PRF progressively aligns features through cascaded units—yielding greater advantages as the information gap widens.

Rating

Novelty: ⭐⭐⭐⭐ (The combination of progressive retrospection, residual distillation, and decoupled queries is original; the approach is conceptually clean and elegant.)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Two datasets, two backbones, six baselines, detailed ablations, t-SNE visualization, and efficiency analysis.)
Writing Quality: ⭐⭐⭐⭐ (Clear structure, well-formatted equations, and rich figures and tables.)
Value: ⭐⭐⭐⭐ (Variable-length observation is a key pain point in real-world driving; the method is practical and achieves SOTA.)