VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

Conference: AAAI 2026 | arXiv: 2601.12672 | Code: None | Area: Autonomous Driving / VLM Applications | Keywords: VLM-in-the-Loop, Adversarial Trajectory Generation, Closed-Loop Training, Long-Tail Scenarios, Driving Policy Robustness

TL;DR

VILTA embeds a VLM (Gemini-2.5-Flash) directly into the RL training loop for autonomous driving. Via a Vision-Language-Editing (VLE) paradigm, the VLM edits the future trajectories of surrounding vehicles to generate challenging hazardous scenarios. The resulting driving policy achieves a 13.3% improvement in route completion rate and a 28.5% reduction in collision rate on challenging CARLA scenarios.

Background & Motivation

Background: Safe deployment of autonomous driving systems is severely hindered by the long-tail problem — extreme hazardous scenarios are exceedingly rare in real-world driving data, causing AD models to perform poorly on these corner cases.

Limitations of Prior Work: Existing approaches fall into two categories: (1) safety-critical scenario generation (offline generation not used for training); and (2) closed-loop learning (generated scenarios used for training, but relying on rule-based / resampling / offline generative models with insufficient diversity). While recent work has employed VLMs to analyze scenes and guide downstream models for trajectory generation, this two-stage approach limits the generative potential of VLMs.

Key Challenge: VLMs possess strong generalization and reasoning capabilities, yet existing methods use them only for high-level description or analysis before delegating actual generation to downstream models — the diversity of the final trajectories is therefore capped by the generalization ability of those downstream models.

Goal: How can the generative capacity of VLMs be fully leveraged to create diverse and highly challenging adversarial driving scenarios that are integrated into the training loop?

Key Insight: Drawing inspiration from the image editing domain — rather than generating trajectories from scratch (which tends to produce implausible results), the VLM edits a rule-generated base trajectory to make it more challenging while preserving kinematic plausibility.

Core Idea: The VLM directly edits surrounding vehicle trajectories within the training loop to create adversarial scenarios, replacing the indirect two-stage generation pipeline.

Method

Overall Architecture

Closed-loop training: the RL environment provides the current scene → BEV representation + vehicle states → VLM (Gemini-2.5-Flash) performs simultaneous scene understanding and trajectory editing → post-processing enforces kinematic feasibility → the edited trajectory controls the "hazardous vehicle" in the environment → the ego vehicle learns under these challenging conditions. Normal and challenging scenarios are trained in alternation (default ratio 1:2, i.e., one challenging scenario for every two normal scenarios).
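
Below is a minimal Python sketch of one such closed-loop iteration; the interfaces (build_scene, vlm_edit, post_process, env.set_npc_trajectory) are assumptions standing in for the unreleased implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np

Waypoint = tuple[float, float]  # (x, y) in the BEV/ego frame

@dataclass
class SceneContext:
    bev_image: np.ndarray                # rendered BEV (ego=white, hazard=yellow, others=blue)
    hazard_id: int                       # vehicle selected as the adversary
    maneuver: str                        # assigned type, e.g. "hard_braking", "lane_cutting"
    base_trajectory: Sequence[Waypoint]  # rule-generated T_base

def adversarial_step(
    env,                                                 # CARLA-style RL environment
    policy,                                              # SAC ego policy
    build_scene: Callable[[object], SceneContext],       # BEV rendering + hazard selection
    vlm_edit: Callable[[SceneContext], list[Waypoint]],  # VLM edits T_base -> T_edit
    post_process: Callable[[Sequence[Waypoint], Sequence[Waypoint]], list[Waypoint]],
):
    """One environment step under a VLM-edited hazardous scenario."""
    obs = env.observation()
    scene = build_scene(obs)
    t_edit = vlm_edit(scene)                               # VLE: edit the base, not from scratch
    t_final = post_process(scene.base_trajectory, t_edit)  # enforce kinematic feasibility
    env.set_npc_trajectory(scene.hazard_id, t_final)       # hazardous vehicle follows T_final
    action = policy.act(obs)
    return env.step(action)                                # ego policy learns under this scene
```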

Key Designs

  1. Scene Representation and Hazardous Vehicle Selection:

    • Function: Converts the driving scene into a BEV representation and identifies the most appropriate object to serve as the "hazardous vehicle."
    • Mechanism: A BEV map is constructed (ego = white, hazardous vehicle = yellow, others = blue, drivable area = grey, lane markings = magenta). The vehicle nearest to the ego within a hazard radius is selected. Based on its relative position and heading, a hazardous maneuver type is automatically assigned (ahead → hard braking; behind → overtaking; adjacent → lane cutting; oncoming → U-turn / lane crossing). A selection-and-assignment sketch is given after this list.
    • Design Motivation: Rule-based hazardous maneuver type assignment ensures accuracy of scene description; BEV input allows the VLM to directly comprehend spatial relationships.
  2. Vision-Language-Editing (VLE) Paradigm:

    • Function: Enables the VLM to edit a base trajectory rather than generate one from scratch.
    • Mechanism: A "normal" base trajectory \(T_{\text{base}}\) is first generated by linearly interpolating between a CTRV motion model and map waypoints (early phase dominated by CTRV for continuity; late phase dominated by map waypoints for directionality). The VLM then edits this trajectory based on scene understanding and the assigned hazardous maneuver type, outputting edited waypoints \(T_{\text{edit}}\). A sketch of this base-trajectory construction is given after this list.
    • Design Motivation: Analogous to image editing — editing is more controllable than generation from scratch. Experimental validation shows that while trajectories directly generated by the VLM exhibit comparable diversity, they are insufficiently challenging (minimum distance to ego is comparable to the original trajectory), whereas VLE-edited trajectories are both more diverse and more challenging.
  3. Three-Stage Trajectory Post-Processing:

    • Function: Ensures kinematic feasibility of VLM-edited trajectories.
    • Three components: (a) B-spline smoothing \(T_{\text{edit}} \rightarrow T_B\); (b) Sigmoid blending \(T_{\text{curve}} = w_i \cdot T_{\text{base}} + (1-w_i) \cdot T_B\) (initial-phase weights favor \(T_{\text{base}}\) to ensure behavioral continuity; later phases favor the edited trajectory); (c) An LQR controller executes \(T_{\text{curve}}\) to produce the final kinematically feasible trajectory \(T_{\text{final}}\). A sketch of stages (a) and (b) is given after this list.
    • Design Motivation: Raw waypoints output by the VLM may contain kinematically infeasible abrupt changes; the three-stage processing progressively smooths and constrains them.
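
For Key Design 1, the snippet below is a minimal sketch of how the hazardous vehicle might be selected and its maneuver type assigned from relative position and heading; the hazard radius and angle thresholds are illustrative assumptions, not values reported in the paper.

```python
import math

# Illustrative hazard-vehicle selection and maneuver assignment.
# Radius and angle thresholds are assumptions; the exact rules are not released.
def select_hazard(ego_xy, ego_heading, vehicles, hazard_radius=30.0):
    """vehicles: list of dicts with 'id', 'xy', 'heading' (world frame, radians)."""
    candidates = [v for v in vehicles if math.dist(ego_xy, v["xy"]) <= hazard_radius]
    if not candidates:
        return None, None
    hazard = min(candidates, key=lambda v: math.dist(ego_xy, v["xy"]))

    # Relative bearing of the hazard vehicle in the ego frame.
    dx, dy = hazard["xy"][0] - ego_xy[0], hazard["xy"][1] - ego_xy[1]
    bearing = math.atan2(dy, dx) - ego_heading
    bearing = math.atan2(math.sin(bearing), math.cos(bearing))  # wrap to [-pi, pi]
    heading_diff = abs(math.atan2(math.sin(hazard["heading"] - ego_heading),
                                  math.cos(hazard["heading"] - ego_heading)))

    if heading_diff > math.pi / 2:        # roughly oncoming traffic
        maneuver = "u_turn_or_lane_crossing"
    elif abs(bearing) < math.pi / 6:      # ahead of the ego
        maneuver = "hard_braking"
    elif abs(bearing) > 5 * math.pi / 6:  # behind the ego
        maneuver = "overtaking"
    else:                                 # adjacent lane
        maneuver = "lane_cutting"
    return hazard, maneuver
```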
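
For Key Design 2, this is a rough sketch of how the rule-generated base trajectory \(T_{\text{base}}\) could be built: a CTRV (constant turn rate and velocity) rollout blended with map waypoints, weighted toward CTRV early and toward the map later. The linear weight ramp and time step are assumptions; the paper only states the qualitative schedule.

```python
import numpy as np

def ctrv_rollout(x, y, yaw, v, yaw_rate, n_steps, dt=0.1):
    """Constant turn rate and velocity rollout from the current vehicle state."""
    pts = []
    for _ in range(n_steps):
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += yaw_rate * dt
        pts.append((x, y))
    return np.asarray(pts)

def base_trajectory(state, map_waypoints, dt=0.1):
    """state: (x, y, yaw, v, yaw_rate); map_waypoints: (N, 2) array of lane points."""
    n = len(map_waypoints)
    ctrv = ctrv_rollout(*state, n_steps=n, dt=dt)
    alpha = np.linspace(1.0, 0.0, n)[:, None]  # 1 -> CTRV early, 0 -> map waypoints late
    return alpha * ctrv + (1.0 - alpha) * np.asarray(map_waypoints)
```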
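
For Key Design 3, the sketch below covers stages (a) and (b): B-spline smoothing of the VLM-edited waypoints and the sigmoid blend \(T_{\text{curve}} = w_i \cdot T_{\text{base}} + (1-w_i) \cdot T_B\). Stage (c), LQR tracking, is simulator-specific and omitted; the smoothing factor and the sigmoid steepness/midpoint are assumptions.

```python
import numpy as np
from scipy.interpolate import splev, splprep

def bspline_smooth(waypoints, smoothing=1.0):
    """Stage (a): fit a smoothing B-spline through the edited waypoints T_edit."""
    pts = np.asarray(waypoints, dtype=float)
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=smoothing)
    u = np.linspace(0.0, 1.0, len(pts))
    x, y = splev(u, tck)
    return np.column_stack([x, y])

def sigmoid_blend(t_base, t_b, steepness=10.0, midpoint=0.3):
    """Stage (b): T_curve = w_i * T_base + (1 - w_i) * T_B, with w_i decaying from ~1 to ~0."""
    t_base, t_b = np.asarray(t_base), np.asarray(t_b)
    i = np.linspace(0.0, 1.0, len(t_base))
    w = 1.0 / (1.0 + np.exp(steepness * (i - midpoint)))  # early steps favor T_base
    return w[:, None] * t_base + (1.0 - w[:, None]) * t_b
```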

Loss & Training

The ego driving policy is trained using SAC (Soft Actor-Critic). Normal and challenging scenarios are trained in alternation; ablation studies show that a 1:8 ratio (one challenging scenario for every eight normal scenarios) yields the best performance. Gemini-2.5-Flash is accessed via API without fine-tuning.
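
As a minimal illustration of the alternation schedule (the round-robin generator below is an assumption; the paper specifies only the ratio):

```python
# One VLM-edited challenging episode for every `normals_per_challenge` normal
# episodes; with the default of 8 this realizes the 1:8 ratio.
def scenario_schedule(num_episodes: int, normals_per_challenge: int = 8):
    for ep in range(num_episodes):
        if ep % (normals_per_challenge + 1) == normals_per_challenge:
            yield "challenging"
        else:
            yield "normal"

# list(scenario_schedule(10)) -> ['normal'] * 8 + ['challenging', 'normal']
```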

Key Experimental Results

Main Results — CARLA Town01–03 (Challenging Scenarios)

| Method | Total RC↑ | Total CR↓ | Total CPM↓ |
| --- | --- | --- | --- |
| VLM-RL | 2.10 | 1.44 | 51.74 |
| CAT | 1.90 | 1.70 | 67.50 |
| VILTA | 2.38 | 1.03 | 48.62 |

Ablation Study — Town02 Challenging Scenarios

| Configuration | RC↑ | CR↓ | CPM↓ |
| --- | --- | --- | --- |
| w/o Post-Processing | 0.73 | 0.60 | 18.75 |
| w/o Following Reward | 0.74 | 0.60 | 19.28 |
| w/o VLE (direct generation) | 0.75 | 0.53 | 18.50 |
| ×2 (default) | 0.77 | 0.50 | 18.13 |
| ×8 (optimal ratio) | 0.87 | 0.30 | 5.33 |

Key Findings

  • VILTA achieves 13.3% higher total RC than VLM-RL on challenging scenarios (2.38 vs. 2.10) and 28.5% lower collision rate (1.03 vs. 1.44).
  • VILTA performs best on normal scenarios (RC = 2.70, CR = 0.63), demonstrating the absence of catastrophic forgetting.
  • VLE editing vs. direct generation: edited trajectories exhibit a substantially smaller minimum distance to the ego vehicle than directly generated ones, along with more extreme acceleration and steering angles; VLE thus increases both diversity and challenge simultaneously.
  • Post-processing contributes significantly: its inclusion raises RC by 5.5% and reduces CPM by 3.3%.
  • Performance is sensitive to scenario alternation frequency: a 1:8 ratio is optimal; either too frequent (1:2) or too infrequent (1:16) degrades performance.
  • Offline validation on nuScenes is consistent: VILTA achieves a 65% success rate on challenging scenarios (vs. BC+PPO 62%) and 93% on normal scenarios (vs. BC+PPO 89%).

Highlights & Insights

  • VLE Paradigm Innovation: Having the VLM edit a base trajectory rather than generate from scratch — the insight that "editing > generation" is borrowed from the image editing domain and validated here for the first time in autonomous driving. Direct generation yields diversity but insufficient challenge; editing achieves both simultaneously.
  • VLM as a Closed-Loop Adversary: This breaks the two-stage paradigm in which VLMs serve only as analyzers or describers, enabling the VLM to directly participate in trajectory editing and fully exploit its generalization capability.
  • Alternating Normal/Challenging Training Prevents Catastrophic Forgetting: A simple alternation strategy effectively preserves performance on normal scenarios.

Limitations & Future Work

  • Validation is conducted in simulation only; sim-to-real transfer is not discussed.
  • Only a single "hazardous vehicle" trajectory is edited at a time; real-world scenarios may involve complex multi-vehicle coordinated interactions.
  • Hazardous vehicle selection and maneuver type assignment rely on predefined rules, which could be replaced by a learned discovery mechanism.
  • Performance depends on the capabilities of the underlying VLM; specific biases of the VLM are not analyzed.
  • On Town05, VILTA's collision rate in challenging scenarios is higher than CAT's (0.43 vs. 0.27), indicating that generalization requires further improvement.

Comparison with Related Work

  • vs. CurricuVLM: CurricuVLM uses a VLM to analyze weaknesses and guide resampling (two-stage); VILTA has the VLM directly edit trajectories (single-stage).
  • vs. CAT: CAT generates adversarial trajectories using a motion prediction model, whose diversity is limited by training data; VILTA exploits the generalization capacity of VLMs to produce more diverse challenges.
  • vs. VLM-RL: VLM-RL applies VLMs to reward signal generation; VILTA applies them to scenario generation — these represent different dimensions of VLM utilization.
  • Insight: The TAPA paper's concept of "LLM as action-space modulator" and VILTA's "VLM as scenario generator" share a similar philosophy — both use large models as "behind-the-scenes designers" rather than front-end decision makers.

Rating

  • Novelty: ⭐⭐⭐⭐ VLM-in-the-Loop combined with the VLE editing paradigm is a novel contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ CARLA across 5 towns + nuScenes + extensive ablations + trajectory analysis.
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear and visualizations are rich.
  • Value: ⭐⭐⭐⭐ Establishes a new paradigm for applying VLMs in autonomous driving training.