Progressive Pretext Task Learning for Human Trajectory Prediction
Conference: ECCV 2024
arXiv: 2407.11588
Code: Yes (https://github.com/iSEE-Laboratory/PPT)
Area: Autonomous Driving
Keywords: Pedestrian Trajectory Prediction, Progressive Learning, Pretext Task, Transformer, Knowledge Distillation
TL;DR
Proposes PPT, a progressive pretext task learning framework that trains a trajectory predictor in three stages (step-by-step next-position prediction → destination prediction → complete trajectory prediction), so the model first learns short-term dynamics, then long-term dependencies, and finally the full task. Combined with an efficient two-step non-autoregressive Transformer predictor, it achieves state-of-the-art results on multiple pedestrian trajectory prediction benchmarks.
Background & Motivation
Pedestrian trajectory prediction requires predicting all future positions from the short term to the long term. However, short-term and long-term predictions rely on vastly different understanding capabilities:
- Short-term prediction: Requires identifying fine-grained local dynamic patterns between adjacent time steps.
- Long-term prediction: Requires inferring global motion trends and capturing long-range dependencies of the trajectory.
Limitations of Prior Work:
1. Most methods (Social-GAN, MID, LED, etc.) handle predictions across all time horizons with a single unified training paradigm, which often forces sub-optimal trade-offs between short-term and long-term performance.
2. Destination-driven methods (MemoNet, PECNet, etc.) first predict the destination and then interpolate intermediate positions, but no knowledge is transferred between the destination predictor and the trajectory predictor, so the two remain disconnected.
3. Existing Transformer methods mostly employ autoregressive generation, which makes inference slow. The non-autoregressive alternatives either rely on slow diffusion sampling (MID) or ignore temporal dynamics (TUTR), which limits performance.
Core Idea of this paper: Since short-term and long-term predictions require different capabilities, why not train these capabilities progressively in stages?
Method
Overall Architecture
The PPT framework consists of three progressive training stages and a Transformer backbone model:
- Stage I - Step-by-step next-position prediction: Learn short-term dynamics
- Stage II - Jump-step destination prediction: Learn long-range dependencies
- Stage III - Complete trajectory prediction: Leverage knowledge from the first two stages to complete the final task
Each stage uses the same architecture but progressively enhances its capabilities, using cross-task knowledge distillation to prevent forgetting.
Key Designs
Task-I: Step-by-step next-position prediction
- Randomly sample a subsequence \(\mathcal{S}^{T_1:T_{t-1}}\) from the complete trajectory \(\mathcal{S}^{T_1:T_e}\), and predict the next position \(\mathcal{S}^{T_t}\).
- Utilize causal self-attention masks to achieve parallel processing of multiple random subsequences in a single forward pass, improving training efficiency.
- Inputs of arbitrary lengths enable the model to comprehensively understand local motion patterns in the trajectory.
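The parallel-prefix idea in Task-I can be illustrated with a small sketch. It assumes a plain list-of-tuples trajectory; the helper names `causal_mask` and `next_position_pairs` are hypothetical and are not taken from the PPT code. The sketch shows only how a causal mask and a one-step shift turn a single sequence into one training example per prefix, not the model itself.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def next_position_pairs(trajectory):
    """Shift the trajectory by one step to get inputs and next-position targets.

    Under a causal mask, the output at input position t depends only on
    inputs 0..t, so it acts as the prediction for the prefix ending at t.
    One forward pass therefore covers every prefix length at once.
    """
    inputs = trajectory[:-1]
    targets = trajectory[1:]
    return inputs, targets


# Toy 2D trajectory with five time steps (illustrative values only).
traj = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5), (4.0, 2.0)]
inputs, targets = next_position_pairs(traj)
mask = causal_mask(len(inputs))
```

Each row of `mask` corresponds to one prefix: row 0 sees only the first position, while the last row sees the whole input.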
Task-II: Jump-step destination prediction
- Input the observed trajectory \(\mathcal{S}^{T_1:T_h}\), and predict the destination of the entire trajectory \(\mathcal{S}^{T_e}\).
- Since there is no position input at time \(T_{e-1}\), a learnable prompt embedding is appended to the observed sequence and assigned the positional encoding of \(T_{e-1}\), which enables "jump-step" prediction.
- Predict K=20 candidate destinations, trained with a precision loss plus a diversity loss (\(L_{Des}\) in Loss & Training).
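A rough sketch of the destination objective follows. The notes give only the overall form \(L_{Des} = L_{Precision} + \lambda_d L_{Diversity}\) with \(\lambda_d = 100\), and mention that the diversity term uses a Gaussian RBF kernel. The exact definitions below are assumptions: a best-of-K distance for precision, a mean pairwise RBF similarity for diversity, and a hypothetical bandwidth `sigma`.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def precision_loss(candidates, gt):
    """Assumed form: distance from the closest of the K candidates to the GT."""
    return min(math.sqrt(sq_dist(c, gt)) for c in candidates)


def diversity_loss(candidates, sigma=1.0):
    """Assumed form: mean Gaussian RBF similarity over distinct candidate pairs.

    The value approaches 1 when candidates collapse onto one point and
    approaches 0 when they are far apart, so minimizing it spreads them out.
    """
    k = len(candidates)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    total = sum(
        math.exp(-sq_dist(candidates[i], candidates[j]) / sigma) for i, j in pairs
    )
    return total / len(pairs)


def destination_loss(candidates, gt, lambda_d=100.0, sigma=1.0):
    """L_Des = L_Precision + lambda_d * L_Diversity (lambda_d = 100 per the notes)."""
    return precision_loss(candidates, gt) + lambda_d * diversity_loss(candidates, sigma)
```

Under these assumptions, a set of candidates that all sit exactly on the ground truth still pays the full diversity penalty, which is the mechanism that discourages mode collapse.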
Task-III: Complete trajectory prediction
- Copy the trained Task-II model \(\theta_{II}\) as both the destination predictor and the trajectory predictor.
- The destination predictor generates K candidate destinations; the destination closest to the ground truth (GT) is input into the trajectory predictor.
- The input to the trajectory predictor consists of three parts: observed trajectory + learnable prompt embeddings (representing future intermediate positions) + pseudo-destination.
- Outputs all future positions in a non-autoregressive, parallel fashion.
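The three-part input of the trajectory predictor can be sketched as a token sequence. This is an illustration under stated assumptions: the `PROMPT` placeholder stands in for a learnable embedding, the helper names are hypothetical, and the assignment of time indices (prompts at the intermediate future steps, pseudo-destination at the final step) is a guess at the layout rather than a detail confirmed by the notes.

```python
PROMPT = "<prompt>"  # stand-in for a learnable prompt embedding


def closest_destination(candidates, gt_destination):
    """Pick the candidate with the smallest squared distance to the ground truth."""
    gx, gy = gt_destination
    return min(candidates, key=lambda c: (c[0] - gx) ** 2 + (c[1] - gy) ** 2)


def build_task3_tokens(observed, pseudo_destination, num_future):
    """Return (token, time_index) pairs: observed + prompts + pseudo-destination."""
    h = len(observed)
    tokens = [(pos, t) for t, pos in enumerate(observed)]
    # One prompt per intermediate future step (every future step except the last).
    tokens += [(PROMPT, h + i) for i in range(num_future - 1)]
    # The pseudo-destination occupies the final time index.
    tokens.append((pseudo_destination, h + num_future - 1))
    return tokens


observed = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [(6.0, 3.0), (6.0, 0.2), (4.0, -2.0)]
pseudo = closest_destination(candidates, gt_destination=(6.0, 0.0))
tokens = build_task3_tokens(observed, pseudo, num_future=4)
```

Because every future slot is already present in the input, a single forward pass can emit all future positions at once instead of decoding them one by one.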
Backbone: Transformer Encoder
- 3-layer Transformer encoder, hidden dimension of 128, with 8 attention heads.
- The input 2D positions are mapped via an embedding layer and then added to temporal positional encodings.
- Outputs the next-frame prediction for each position, obtaining 2D coordinates through a LayerNorm and a linear projector.
Cross-Task Knowledge Distillation:
- \(L_{kd}^t\): Trajectory features of the Task-I model guide the Task-III trajectory predictor
- \(L_{kd}^d\): Destination features of the Task-II model guide the Task-III destination predictor
- Calculate the L2 distance after aligning feature dimensions using linear projections
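A feature-level distillation term of this kind might look like the sketch below. It is written for a single feature vector in plain Python. Which side is projected (student here), the use of an unsquared L2 distance, and the names `linear` and `kd_loss` are all assumptions made for illustration.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def kd_loss(student_feat, teacher_feat, weight, bias):
    """L2 distance between the projected student feature and the teacher feature."""
    projected = linear(student_feat, weight, bias)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(projected, teacher_feat)))


# Identity projection on 2D features (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
same = kd_loss([1.0, 2.0], [1.0, 2.0], identity, zero_bias)
apart = kd_loss([1.0, 2.0], [1.0, 0.0], identity, zero_bias)
```

The teacher features come from a frozen earlier-stage model, so only the student and the projection would receive gradients in a real training loop.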
Loss & Training
- Task-I: Next-position prediction loss based on L2 distance
- Task-II: \(L_{Des} = L_{Precision} + \lambda_d L_{Diversity}\), \(\lambda_d = 100\)
- Task-III: \(L_{Traj} = L_{Recon} + \lambda_{kd}^t L_{kd}^t + \lambda_{kd}^d L_{kd}^d\), \(\lambda_{kd}^t = 5\), \(\lambda_{kd}^d = 0.5\)
The learning rates for the three stages are 0.001, 0.0001, and 0.0015 respectively. Before Task-II training, the MLP is first warmed up, followed by joint training of the whole model.
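The schedule and the Task-III objective can be collected into a small configuration sketch. The learning rates and loss weights are the values stated above. The `train_stage` callback, the dictionary layout, and the simple "pass weights forward" loop are hypothetical, and the sketch glosses over the fact that Task-III keeps two copies of the Task-II model.

```python
# Stage order and learning rates as reported in the notes.
STAGES = [
    {"name": "Task-I", "objective": "next-position", "lr": 1e-3},
    {"name": "Task-II", "objective": "destination", "lr": 1e-4},
    {"name": "Task-III", "objective": "full-trajectory", "lr": 1.5e-3},
]

LAMBDA_KD_T = 5.0  # weight of the trajectory distillation term
LAMBDA_KD_D = 0.5  # weight of the destination distillation term


def task3_loss(recon, kd_traj, kd_dest):
    """L_Traj = L_Recon + lambda_kd^t * L_kd^t + lambda_kd^d * L_kd^d."""
    return recon + LAMBDA_KD_T * kd_traj + LAMBDA_KD_D * kd_dest


def run_schedule(train_stage, init_weights=None):
    """Run the stages in order, initializing each from the previous result.

    train_stage(stage, weights) is a hypothetical callback that trains one
    stage and returns the updated weights.
    """
    weights = init_weights
    for stage in STAGES:
        weights = train_stage(stage, weights)
    return weights
```

For example, `run_schedule(lambda stage, w: (w or []) + [stage["name"]])` returns the stage names in training order, which is a quick way to check the wiring before plugging in a real trainer.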
Key Experimental Results
Main Results
minADE20/minFDE20 on SDD dataset (pixels):
| Method | ADE↓ | FDE↓ |
|---|---|---|
| Social-GAN | 27.23 | 41.44 |
| PECNet | 9.96 | 15.88 |
| MemoNet | 8.56 | 12.66 |
| SocialVAE | 8.10 | 11.72 |
| MID | 7.61 | 14.30 |
| LED | 8.48 | 11.66 |
| TUTR | 7.76 | 12.69 |
| PPT (Ours) | 7.03 | 10.65 |
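For readers unfamiliar with the metric, minADE20/minFDE20 scores the best of K=20 sampled trajectories against the ground truth. The sketch below is one common way to compute it, with hypothetical function names. It takes the minimum of each metric independently, which is an assumption: some codebases instead pick a single best sample and report both of its errors.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean point-wise distance over all steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred, gt):
    """Final displacement error: distance at the last predicted step."""
    return math.dist(pred[-1], gt[-1])


def min_ade_fde(candidates, gt):
    """Best-of-K ADE and FDE, each minimized independently over the candidates."""
    return min(ade(c, gt) for c in candidates), min(fde(c, gt) for c in candidates)


# Two candidate futures for a two-step ground truth (illustrative values only).
gt = [(1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(1.0, 0.0), (2.0, 1.0)],  # right first step, wrong endpoint
    [(1.0, 1.0), (2.0, 0.0)],  # wrong first step, right endpoint
]
best_ade, best_fde = min_ade_fde(candidates, gt)
```

In this toy case the two minima come from different samples, which shows why best-of-K numbers say little about the quality of any single prediction.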
minADE20/minFDE20 on ETH/UCY dataset (meters):
| Method | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
|---|---|---|---|---|---|---|
| Social-GAN | 0.87/1.62 | 0.67/1.37 | 0.76/1.52 | 0.35/0.68 | 0.42/0.84 | 0.61/1.21 |
| MemoNet | 0.40/0.61 | 0.11/0.17 | 0.24/0.43 | 0.18/0.32 | 0.14/0.24 | 0.21/0.35 |
| SocialVAE | 0.41/0.58 | 0.13/0.19 | 0.21/0.36 | 0.17/0.29 | 0.13/0.22 | 0.21/0.33 |
| PPT | 0.36/0.51 | 0.11/0.15 | 0.22/0.40 | 0.17/0.30 | 0.12/0.21 | 0.20/0.31 |
GCS dataset (pixels):
| Method | ADE↓ | FDE↓ |
|---|---|---|
| EigenTrajectory | 7.42 | 12.49 |
| PPT (Ours) | 6.20 | 9.34 |
PPT clearly outperforms the prior SOTA on GCS (EigenTrajectory in the table above): ADE is reduced by 16.4% (7.42 → 6.20) and FDE by 25.2% (12.49 → 9.34).
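The quoted percentages are relative reductions, and they can be reproduced from the GCS table with a short check. The function name is a hypothetical helper.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# GCS values from the table: EigenTrajectory vs. PPT.
ade_gain = relative_reduction(7.42, 6.20)
fde_gain = relative_reduction(12.49, 9.34)
print(f"ADE reduction: {ade_gain:.1f}%  FDE reduction: {fde_gain:.1f}%")
```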
Ablation Study
Ablation of pretext tasks (SDD dataset):
| Task-I | Task-II | Task-III | ADE↓ | FDE↓ |
|---|---|---|---|---|
| ✗ | ✗ | ✓ | 10.40 | 18.64 |
| ✗ | ✓ | ✓ | 7.71 | 11.42 |
| ✓ | ✓ | ✓ | 7.03 | 10.65 |
Other ablation findings:
- Task-I reduces the destination prediction FDE of Task-II from 11.58 to 10.70
- Cross-task knowledge distillation reduces prediction variance and improves training stability
- The diversity loss weight \(\lambda_d = 100\) is optimal; too small leads to mode collapse, while too large sacrifices precision
Key Findings
- Progressive training significantly outperforms direct training: Direct training without pretext tasks yields an ADE/FDE of 10.40/18.64 on SDD, which drops to 7.03/10.65 once both pretext tasks are added, a relative reduction of about 32%/43%.
- Both pretext tasks contribute and complement each other: Task-I improves short-term precision, while Task-II enhances long-term accuracy and destination diversity; neither can be omitted.
- Efficient inference: The two-step inference (first the destination, then parallel generation of all intermediate points) takes only 5.28ms/sample. This is far faster than autoregressive and diffusion-based methods (STAR 35.8ms, AgentFormer 99.3ms, MID 736.8ms) and comparable to TUTR (4.06ms), while achieving lower error than TUTR (7.03/10.65 vs. 7.76/12.69 ADE/FDE on SDD).
- Efficient training: Pre-training in previous stages accelerates convergence in subsequent stages. The total training time on SDD is only 4.7 hours (on a single RTX 3090).
Highlights & Insights
- "Easy-to-hard" training philosophy: Similar to curriculum learning, the model is first trained to walk (next-step prediction), then to look far ahead (destination prediction), and finally to complete the whole journey (complete trajectory). This progressive strategy prevents the model from making sub-optimal trade-offs when simultaneously learning short-term and long-term patterns.
- Clever design of learnable prompt embeddings: Prompts are used to represent unknown future positions. Combined with positional encodings, they enable non-autoregressive parallel generation. This maintains the sequence modeling advantages of Transformers while avoiding the efficiency bottleneck of step-by-step decoding.
- Cross-task knowledge distillation to prevent catastrophic forgetting: The models from the first two stages act as teachers to continuously supervise the Task-III model, ensuring that short-term and long-term capabilities are not forgotten.
- Convincing visual validation: Models with Task-I are more accurate in proximal trajectory segments, while models with Task-II are more accurate in distal trajectory segments; combining both achieves overall optimality.
Limitations & Future Work
- It only models individual pedestrian trajectories. It does not explicitly model interactions between pedestrians or scene constraints (e.g., obstacles, road boundaries), which may be insufficient in highly crowded scenarios.
- Although training in each of the three sequential stages is relatively fast, it increases the complexity of the training pipeline, requiring careful design of hyperparameters for each stage.
- Uses the Best-of-20 evaluation protocol, which is standard practice but can mask the true quality of the generated distribution.
- The diversity loss for destination prediction is based on a Gaussian RBF kernel; more flexible distribution modeling approaches (e.g., normalizing flows) could be explored.
- The current Transformer is a 3-layer encoder-only model; an encoder-decoder architecture or deeper models could be explored.
Related Work & Insights
- Connection to curriculum learning / progressive training: Similar to ProGAN progressively increasing resolution to train GANs, and PGBIG progressively refining motion prediction, PPT is the first to introduce progressive pretext tasks to the field of trajectory prediction.
- Comparison with TUTR: Like PPT, TUTR is a non-autoregressive Transformer, but it uses neither prompt embeddings nor progressive pre-training, and it scores worse than PPT on SDD (7.76/12.69 vs. 7.03/10.65 ADE/FDE).
- Advantages of the destination-driven strategy: Explicitly decomposing long-range dependencies into destination prediction + intermediate point generation is more effective than end-to-end prediction of all positions. Furthermore, the shared model architecture between both stages allows knowledge transfer to occur naturally.
- The idea of progressive pretext tasks can be extended to related tasks such as vehicle trajectory prediction and robot path planning.
Rating
- Novelty: ⭐⭐⭐⭐ — Progressive pretext task training is a first in trajectory prediction
- Technical Quality: ⭐⭐⭐⭐ — The three-stage design is well-justified, with knowledge distillation preventing forgetting
- Experimental Thoroughness: ⭐⭐⭐⭐ — Four datasets, detailed ablations, and visual analysis
- Practicality: ⭐⭐⭐⭐ — Efficient inference (5.28ms), suitable for real-time applications
- Overall Recommendation: ⭐⭐⭐⭐