VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Conference: ECCV 2024
arXiv: 2407.12345
Code: https://moonseokha.github.io/VisionTrap (with project page)
Area: Autonomous Driving / Trajectory Prediction
Keywords: trajectory prediction, visual semantics, textual guidance, contrastive learning, nuScenes-Text

TL;DR

This work proposes VisionTrap, which introduces surround-view camera images and textual descriptions to the trajectory prediction task. Using a BEV visual semantic encoder and text-driven debiased contrastive learning, the model learns visual semantic cues (e.g., pedestrian poses, turn signals). It improves prediction accuracy over its trajectory-only baseline (ADE10 from 1.48 to 1.17 on nuScenes) while maintaining real-time inference at 53 ms, and releases the nuScenes-Text dataset.

Background & Motivation

Trajectory prediction is a core task in autonomous driving. Existing methods primarily rely on historical agent trajectories and High-Definition Maps (HD Maps) as inputs. However, HD Maps are static and fail to reflect temporary changes (such as construction zones). Furthermore, they cannot provide visual cues critical for behavior prediction, such as pedestrian gaze direction, gestures, or vehicle turn signals.

Although a few works attempt to introduce visual data, they either only use the front-view camera or perform global processing on the entire image without targeted extraction of key regions. Consequently, the models only focus on salient features, limiting their efficacy. In addition, existing intent classification methods categorize behaviors into a limited number of classes (e.g., stationary, lane change, right turn), which introduces annotation ambiguity and restricts the model's expressiveness.

Core Idea: Leverage surround-view visual semantic information to enhance trajectory prediction, and use textual descriptions generated by VLMs and LLMs as supervision signals during training. This guides the model to learn richer behavioral contexts from images while maintaining real-time inference (no text input is required during inference).

Method

Overall Architecture

VisionTrap consists of four main modules:
1. Per-agent State Encoder: Encodes the historical state of each agent.
2. Visual Semantic Encoder: Encodes surround-view images into BEV features and fuses them with agent states.
3. Text-driven Guidance Module: Uses textual descriptions to supervise visual semantic learning during training (not used during inference).
4. Trajectory Decoder: Predicts future trajectories for all agents.

Input: Agent historical trajectories + types + surround-view camera images + HD map; Output: Multimodal future trajectory distributions for all agents.

Key Designs

Per-agent State Encoder:
- Function: Encodes the spatiotemporal states of each agent and their interaction relationships.
- Mechanism: Represents historical trajectories using relative displacement (rather than absolute positions), combined with agent types and temporal positional encodings.
- Key formula: \(s_i^t = f_{\text{geometric}}(p_i^t - p_i^{t-1}) + f_{\text{type}}(a_i) + f_{\text{PE}}(e^t)\)
- Three-level encoding: (1) MLP encodes geometric displacement, (2) Temporal Transformer encodes temporal information, (3) Cross-attention encodes interaction relationships between agents.
- Interaction encoding is completed once in the ego-centric coordinate system, avoiding redundant computation for each individual agent.
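
The key formula above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bias-free linear map standing in for \(f_{\text{geometric}}\), the lookup table standing in for \(f_{\text{type}}\), the sinusoidal encoding standing in for \(f_{\text{PE}}\), and all toy weights are assumptions made here for clarity.

```python
import math

def relative_displacements(positions):
    """Turn absolute (x, y) positions into per-step displacements p^t - p^{t-1}."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]

def sinusoidal_pe(t, dim):
    """Sinusoidal encoding of time index t (an assumed choice for f_PE)."""
    return [math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(t / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def linear(x, weight):
    """Bias-free linear map; weight is a list of output rows (stand-in for an MLP)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def state_embeddings(positions, agent_type, w_geo, type_table):
    """s^t = f_geometric(p^t - p^{t-1}) + f_type(a) + f_PE(t) for every history step."""
    dim = len(w_geo)
    out = []
    for t, disp in enumerate(relative_displacements(positions)):
        geo = linear(disp, w_geo)
        typ = type_table[agent_type]
        pe = sinusoidal_pe(t, dim)
        out.append([g + a + p for g, a, p in zip(geo, typ, pe)])
    return out

# Hypothetical toy parameters (embedding size 4).
W_GEO = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
TYPE_TABLE = {"vehicle": [0.1, 0.0, 0.0, 0.0],
              "pedestrian": [0.0, 0.1, 0.0, 0.0]}

track = [(10.0, 5.0), (11.0, 5.0), (12.0, 5.5)]
shifted = [(x + 100.0, y - 40.0) for x, y in track]

emb = state_embeddings(track, "vehicle", W_GEO, TYPE_TABLE)
emb_shifted = state_embeddings(shifted, "vehicle", W_GEO, TYPE_TABLE)
```

Because only displacements enter the embedding, `emb` and `emb_shifted` coincide: translating the whole track does not change the encoded state, which is the practical benefit of using relative rather than absolute positions.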
Visual Semantic Encoder:
- Function: Converts surround-view images into BEV features and fuses them with agent state embeddings.
- Mechanism: Uses the BEVDepth architecture to encode multi-view images into BEV image features \(B_I \in \mathbb{R}^{h \times w \times d_{\text{bev}}}\), which are then concatenated with the rasterized map features \(B_{\text{map}}\) to obtain the composite BEV scene features \(B = [B_I; B_{\text{map}}]\).
- Scene-agent interaction: Uses Deformable Cross-Attention to extract region information relevant to each agent from BEV features. A Recurrent Trajectory Prediction module is introduced, which uses the predicted positions from an auxiliary trajectory predictor as reference points for the deformable attention.
- Key formula: \(z_i^{\text{scene}} = z_i^{\text{interact}} + \sum_{h=1}^{H} W_h \left[\sum_{o=1}^{O} \alpha_{hio} W'_h \mathbf{B}_{(u_i^{\text{aux}} + \Delta u_{hio}^{\text{aux}})}\right]\)
- Design Motivation: Compared to the global perception of ConvNet, deformable attention can selectively focus on the areas that an agent needs to attend to. Furthermore, the number of attention points is much smaller than the number of scene elements, reducing the computational load.
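
To make the sampling step in the key formula concrete, the sketch below implements a simplified single-head version: it bilinearly samples the BEV grid at the reference point plus each offset, weights the samples with softmax-normalized attention, and adds the result to the agent embedding. The projections \(W_h\) and \(W'_h\) are omitted (treated as identity), and the offsets and attention logits are passed in directly rather than predicted from the agent embedding; both simplifications are assumptions for illustration only.

```python
import math

def bilinear_sample(bev, x, y):
    """Bilinearly interpolate bev[row][col] (a feature vector per cell) at (x, y).
    x indexes columns, y indexes rows; out-of-range points are clamped to the grid."""
    h, w = len(bev), len(bev[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    return [
        (1 - fx) * (1 - fy) * a + fx * (1 - fy) * b
        + (1 - fx) * fy * c + fx * fy * d
        for a, b, c, d in zip(bev[y0][x0], bev[y0][x1], bev[y1][x0], bev[y1][x1])
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_scene_update(z_interact, bev, ref_point, offsets, logits):
    """z_scene = z_interact + sum_o alpha_o * B(ref_point + offset_o), single head."""
    weights = softmax(logits)
    agg = [0.0] * len(z_interact)
    for (dx, dy), alpha in zip(offsets, weights):
        feat = bilinear_sample(bev, ref_point[0] + dx, ref_point[1] + dy)
        agg = [s + alpha * f for s, f in zip(agg, feat)]
    return [z + s for z, s in zip(z_interact, agg)]

# Toy 3x3 BEV grid whose feature at each cell is its own (col, row) coordinate.
bev = [[[float(c), float(r)] for c in range(3)] for r in range(3)]

z_scene = deformable_scene_update(
    z_interact=[0.0, 0.0],
    bev=bev,
    ref_point=(1.0, 1.0),              # stands in for the auxiliary predicted position
    offsets=[(0.5, 0.0), (-0.5, 0.5)],
    logits=[0.0, 0.0],                 # equal attention on both sampling points
)
```

Only `len(offsets)` grid locations are read per agent, regardless of the BEV resolution, which is the source of the computational saving mentioned in the design motivation.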
Text-driven Guidance Module:
- Function: Uses textual descriptions as supervision during training to guide agent state embeddings to learn fine-grained visual semantics.
- Mechanism: Aligns the agent state embedding \(z_i^{\text{scene}}\) with its textual description embedding \(\mathcal{T}_i\) via cross-modal contrastive learning.
- Debiased design: Multiple agents in driving scenes may exhibit similar behaviors, so naive contrastive learning can produce false negatives. The solution: (1) Use BERT to extract sentence-level embeddings \(\mathcal{T}_i\), (2) filter out samples with cosine similarity \(> \theta_{th}=0.8\) (potential false negatives), and (3) select the top-\(k\) most dissimilar samples as negative samples to ensure a stable sample count.
- Key formula: \(\mathcal{L}_{\text{cl}} = -\log\frac{e^{\text{sim}_{\text{cos}}(z_i^{\text{scene}}, \mathcal{T}_i)/\tau}}{\sum_{j=1}^{k} e^{\text{sim}_{\text{cos}}(z_i^{\text{scene}}, \mathcal{T}_j)/\tau}}\)
- An asymmetric contrastive loss (only from agent to text direction) is employed.
- No text input is required during inference, resulting in zero additional latency.
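
The debiasing steps and the agent-to-text loss can be sketched as follows. This is an illustrative reading of the description above, not the released code: the toy 2-D embeddings, the temperature `tau=0.1`, and the function names are hypothetical, and the denominator here includes the positive pair alongside the \(k\) selected negatives (the common InfoNCE convention), which is an assumption about how the sum in the formula is indexed.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_negatives(text_embs, i, k, threshold=0.8):
    """Pick up to k negatives for anchor i: drop texts whose similarity to the
    anchor's text exceeds the threshold (likely false negatives), then keep the
    k most dissimilar of the remainder."""
    candidates = []
    for j, emb in enumerate(text_embs):
        if j == i:
            continue
        sim = cosine(text_embs[i], emb)
        if sim > threshold:
            continue
        candidates.append((sim, j))
    candidates.sort()  # ascending similarity, i.e. most dissimilar first
    return [j for _, j in candidates[:k]]

def agent_to_text_loss(z, text_embs, i, negatives, tau=0.1):
    """Asymmetric (agent -> text) contrastive loss for one agent embedding z."""
    pos = math.exp(cosine(z, text_embs[i]) / tau)
    neg = sum(math.exp(cosine(z, text_embs[j]) / tau) for j in negatives)
    return -math.log(pos / (pos + neg))

# Toy sentence embeddings: index 1 nearly duplicates index 0's description.
texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]

negatives = select_negatives(texts, i=0, k=2)
loss_aligned = agent_to_text_loss([1.0, 0.0], texts, 0, negatives)
loss_misaligned = agent_to_text_loss([0.0, 1.0], texts, 0, negatives)
```

In this toy case the near-duplicate description at index 1 is filtered out rather than pushed away, and the two most dissimilar descriptions are kept as negatives, so the number of negatives stays fixed at `k` whenever enough candidates survive the filter.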
Transformation Module + Trajectory Decoder:
- Function: Performs multimodal trajectory prediction after normalizing the orientations of the agents.
- Mechanism: Because all agents are encoded in a shared ego-centric frame, the features of agents with similar behaviors but different headings are not standardized. A rotation matrix \(\mathcal{R}\) is therefore used to normalize the orientation of each agent before prediction.
- The Trajectory Decoder models the future trajectory distribution using a GMM: \(p(u) = \sum_{m=1}^{M}\rho_m \prod_{t=1}^{T_f} \mathcal{N}(u_t - \mu_m^t, \Sigma_m^t)\)
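
As a worked instance of the mixture density above, the sketch below evaluates the negative log-likelihood of one ground-truth trajectory under a GMM with per-timestep Gaussians. It assumes diagonal covariances (per-axis standard deviations) and strictly positive mixture weights, neither of which is stated in this summary; the function names are hypothetical.

```python
import math

def log_gauss_diag(u, mu, sigma):
    """Log density of a 2-D Gaussian with diagonal covariance diag(sigma_x^2, sigma_y^2)."""
    return sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for x, m, s in zip(u, mu, sigma)
    )

def gmm_nll(traj, rho, mus, sigmas):
    """-log p(u), with p(u) = sum_m rho_m * prod_t N(u_t; mu_m^t, Sigma_m^t).
    traj: T points; rho: M positive weights; mus[m][t], sigmas[m][t]: per-mode, per-step."""
    log_terms = []
    for weight, mu_m, sigma_m in zip(rho, mus, sigmas):
        log_p = sum(log_gauss_diag(u, mu, s)
                    for u, mu, s in zip(traj, mu_m, sigma_m))
        log_terms.append(math.log(weight) + log_p)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return -(peak + math.log(sum(math.exp(v - peak) for v in log_terms)))

gt = [(1.0, 0.0), (2.0, 0.0)]
mus = [
    [(1.0, 0.0), (2.0, 0.0)],   # mode 0 matches the ground truth
    [(0.0, 1.0), (0.0, 2.0)],   # mode 1 heads in a different direction
]
sigmas = [[(1.0, 1.0), (1.0, 1.0)], [(1.0, 1.0), (1.0, 1.0)]]

nll_uniform = gmm_nll(gt, [0.5, 0.5], mus, sigmas)
nll_confident = gmm_nll(gt, [0.9, 0.1], mus, sigmas)
```

Shifting mixture weight toward the mode that matches the ground truth lowers the loss, so minimizing this quantity trains both the mode means and the mode probabilities \(\rho_m\).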

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{\text{traj}} + \lambda_{\text{traj}}^{\text{aux}} \mathcal{L}_{\text{traj}}^{\text{aux}} + \lambda_{\text{cl}} \mathcal{L}_{\text{cl}}\)
\(\mathcal{L}_{\text{traj}}\): GMM negative log-likelihood loss.
\(\mathcal{L}_{\text{traj}}^{\text{aux}}\): Auxiliary trajectory prediction loss (used for the Recurrent Trajectory Prediction module).
\(\mathcal{L}_{\text{cl}}\): Debiased asymmetric contrastive learning loss.
nuScenes-Text Dataset Generation: (1) Finetune BLIP-2 with the DRAMA dataset, (2) crop images for each agent in nuScenes and generate descriptions, and (3) use GPT to refine the text, combining Ground Truth (GT) types and rule-extracted behavioral features. A total of 1,216,206 textual descriptions were generated, averaging 13 words per description, with a human-evaluated alignment accuracy of 94.8%.

Key Experimental Results

Main Results

Comparison on the nuScenes prediction dataset (single/multi-agent prediction):

Method	Prediction Mode	Inference Time (ms)	ADE10↓	MR10↓	FDE1↓
PGP	Single agent	215	1.00	0.37	7.17
LAformer	Single agent	115	0.93	0.33	-
Trajectron++	Multi-agent	38	1.51	0.57	9.52
VisionTrap baseline	Multi-agent	13	1.48	0.56	10.75
+ Map Encoder	Multi-agent	21	1.40	0.53	10.41
+ Visual Semantic Encoder	Multi-agent	53	1.23	0.36	9.32
+ Text-driven Guidance (Full)	Multi-agent	53	1.17	0.32	8.72

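The full model achieves a 27.56% improvement over the baseline, a figure that matches the mean relative reduction across ADE10, MR10, and FDE1. In the multi-agent prediction mode its MR10 (0.32) is on par with single-agent SOTA methods, although its ADE10 (1.17) still trails LAformer (0.93), while its inference time is only 53 ms (real-time).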
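
The metric columns in the table above are not defined in this summary. The sketch below shows how such metrics are commonly computed, assuming the usual nuScenes prediction conventions: ADE and FDE are taken as the minimum over the top-k predicted modes, and a sample counts as a miss when every one of the k modes has a maximum pointwise error above 2 m. Whether VisionTrap's evaluation follows exactly these conventions is an assumption here.

```python
import math

def l2(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def min_ade(preds, gt):
    """Smallest average displacement error over the candidate trajectories."""
    return min(sum(l2(p, g) for p, g in zip(traj, gt)) / len(gt) for traj in preds)

def min_fde(preds, gt):
    """Smallest final-point displacement error over the candidate trajectories."""
    return min(l2(traj[-1], gt[-1]) for traj in preds)

def is_miss(preds, gt, threshold=2.0):
    """True when every candidate's worst pointwise error exceeds the threshold (metres)."""
    return all(max(l2(p, g) for p, g in zip(traj, gt)) > threshold for traj in preds)

gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
close_mode = [(1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]   # off by 1 m at the last step
far_mode = [(1.0, 3.0), (2.0, 3.0), (3.0, 3.0)]     # off by 3 m everywhere

ade = min_ade([close_mode, far_mode], gt)
fde = min_fde([close_mode, far_mode], gt)
missed_with_both = is_miss([close_mode, far_mode], gt)
missed_with_far_only = is_miss([far_mode], gt)
```

A miss rate such as MR10 would then be the fraction of evaluated agents for which `is_miss` is true when given the 10 most likely modes.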

Ablation Study

nuScenes Full Dataset (all agents):

Method	ADE10↓	FDE10↓	MR10↓
VisionTrap baseline	0.425	0.641	0.081
+ Map Encoder	0.407	0.601	0.075
+ Visual Semantic Encoder	0.382	0.551	0.056
+ Text-driven Guidance	0.368	0.535	0.051

Internal Ablation of the Text-driven Guidance Module:

Configuration	ADE6↓	FDE6↓	MR6↓
A. CLIP symmetric contrastive loss	0.51	0.79	0.10
B. Symmetric loss variant	0.50	0.76	0.10
C. W/o negative sample refinement	0.49	0.72	0.09
D. W/o top-k constraint	0.46	0.67	0.08
E. Full Model	0.44	0.66	0.07

Key Findings

The Visual Semantic Encoder is the component making the largest contribution: ADE10 drops from 1.40 to 1.23 (a 12.1% improvement), and MR10 drops from 0.53 to 0.36.
On top of this, the Text-driven Guidance module yields an additional 4.9% improvement in ADE10 (1.23 to 1.17), indicating that textual descriptions help guide the model to focus on fine-grained visual semantics.
The debiased contrastive learning strategy is crucial: removing negative sample refinement or the top-\(k\) constraint leads to performance degradation.
Qualitative analysis indicates that the model can utilize visual cues such as traffic light states (to predict whether pedestrians will cross the street), pedestrian gaze directions, and vehicle turn signals.
UMAP visualization demonstrates that incorporating visual and textual semantics results in tighter embedding clusters for agents with similar behaviors.
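
The percentages quoted in this section can be rechecked from the Main Results table. The short script below recomputes them; the observation that 27.56% equals the mean of the three per-metric reductions is an inference from the arithmetic, not a statement taken from the paper.

```python
def rel_improvement(before, after):
    """Relative reduction, in percent, for a lower-is-better metric."""
    return 100.0 * (before - after) / before

# Rows copied from the Main Results table.
baseline = {"ADE10": 1.48, "MR10": 0.56, "FDE1": 10.75}
with_map = {"ADE10": 1.40, "MR10": 0.53, "FDE1": 10.41}
with_vision = {"ADE10": 1.23, "MR10": 0.36, "FDE1": 9.32}
full = {"ADE10": 1.17, "MR10": 0.32, "FDE1": 8.72}

# Baseline -> full model, per metric and averaged.
per_metric = {name: rel_improvement(baseline[name], full[name]) for name in baseline}
mean_gain = sum(per_metric.values()) / len(per_metric)

# Incremental ADE10 gains of the two vision-related components.
vision_gain = rel_improvement(with_map["ADE10"], with_vision["ADE10"])
text_gain = rel_improvement(with_vision["ADE10"], full["ADE10"])

print({name: round(value, 1) for name, value in per_metric.items()})
print(round(mean_gain, 2), round(vision_gain, 1), round(text_gain, 1))
```

The per-metric reductions are roughly 20.9% (ADE10), 42.9% (MR10), and 18.9% (FDE1), so the headline number is dominated by the miss-rate improvement.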

Highlights & Insights

Text-guided training, free at inference: Textual descriptions are used only for contrastive learning supervision during training, incurring zero extra overhead during inference. This design paradigm of "rich during training, streamlined during inference" is highly practical.
Debiased contrastive learning: By filtering out false negatives via textual semantic similarity, it resolves the negative sample ambiguity problem caused by similar multi-agent behaviors in driving scenes.
nuScenes-Text Dataset: A semi-automatic annotation pipeline combining VLM finetuning and LLM refinement, with a human-evaluated alignment accuracy of 94.8%. The pipeline could be transferred to textual annotation of other datasets.
Scene-centric real-time inference: All agents are processed simultaneously in a shared ego-centric frame. The 53 ms inference time is roughly 2.2× and 4.1× shorter than that of the single-agent methods LAformer (115 ms) and PGP (215 ms) in the main results table.

Limitations & Future Work

The generation of the textual dataset relies on a cascaded VLM+LLM pipeline, making the annotation quality dependent on the performance of both models.
It is only validated on nuScenes; generalization to other datasets like Waymo has not been tested.
Currently, it does not support long-horizon prediction (limited to 2-6 seconds). Longer forecasting horizons may stand to benefit even more from visual semantics.
The rotation normalization of the Transformation Module only considers orientation, leaving distribution discrepancies in other state quantities like velocity unaddressed.

vs. TPNet/IntentNet: These methods use front-view images or agent-region images, whereas VisionTrap utilizes surround-view BEV features, providing more comprehensive information.
vs. Trajectron++: Both are scene-centric multi-agent prediction methods, but VisionTrap introduces visual and textual semantics, reducing ADE10 from 1.51 to 1.17.
vs. CLIP-based methods: VisionTrap does not use generic CLIP models. Instead, it designs a contrastive learning framework tailored for trajectory prediction, taking false negative debiasing into account.

Rating

Novelty: ⭐⭐⭐⭐ The approach of using textual descriptions as training-time supervision to guide visual semantic learning is novel, and the debiasing strategy is soundly designed.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-level ablations, qualitative visualizations, UMAP analysis, and human evaluation of the nuScenes-Text dataset are provided.
Writing Quality: ⭐⭐⭐⭐ Good integration of text and figures; the qualitative analysis intuitively demonstrates the contribution of visual semantics.
Value: ⭐⭐⭐⭐ This work demonstrates the value of visual semantics in trajectory prediction, and the nuScenes-Text dataset is a useful public resource.