CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving¶
Conference: ICCV 2025 arXiv: 2503.08683 Code: cxliu0314/CoLMDriver Area: Autonomous Driving Keywords: cooperative driving, V2V, LLM negotiation, actor-critic, waypoint planning, CARLA
TL;DR¶
CoLMDriver is the first end-to-end LLM-based cooperative driving system. Through an Actor-Critic language-negotiation module and an intention-guided waypoint generator, it achieves an 11% higher success rate than existing methods across diverse V2V interaction scenarios.
Background & Motivation¶
Core Problem: Vehicle-to-vehicle (V2V) cooperative autonomous driving can compensate for the perception and prediction uncertainties of single-vehicle systems by sharing sensory information, yet faces two key challenges:
Limitations of Prior Work: Existing V2V cooperative methods (e.g., CoDriving) rely on rigid cooperation protocols whose predefined interaction rules generalize poorly to unseen complex scenarios. In non-standard situations such as multi-vehicle intersections or emergency avoidance, fixed rules lack flexibility.
Difficulties of Applying LLMs Directly to Driving: Although LLMs possess general reasoning capabilities, two obstacles remain: (a) insufficient spatial planning ability — LLMs cannot directly output precise trajectory coordinates; (b) unstable inference latency — real-time driving demands extremely tight latency bounds, and variable LLM inference times cause control instability.
Key Insight: Rather than forcing LLMs to perform spatial planning, the paper leverages LLMs for what they excel at — language-based negotiation. The driving system is decomposed into two parallel pipelines: an LLM negotiates cooperative intentions via natural language (e.g., "I will pass first; you slow down and yield"), and a dedicated trajectory generator then translates the negotiation outcome into executable waypoints.
Method¶
Overall Architecture¶
CoLMDriver adopts a parallel driving pipeline architecture consisting of three modules:
- Cooperative Perception Module: BEV feature fusion based on V2Xverse/CoDriving; multiple vehicles share LiDAR point clouds and extract voxel features via SpConv to produce a unified bird's-eye-view (BEV) representation.
- LLM-based Negotiation Module: Employs an Actor-Critic paradigm in which vehicles conduct multi-round natural language negotiation.
- Intention-Guided Waypoint Generator: Converts the negotiated driving intention into 20 future waypoints.
Collaborative workflow: the perception module provides environmental understanding → a VLM analyzes the scene and generates driving intentions → the LLM negotiation module coordinates intentions across vehicles → the waypoint generator outputs executable trajectories → a PID controller executes the plan.
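The workflow above can be sketched as a control loop in which slow LLM negotiation runs on a background thread while perception, planning, and control run every tick, always consuming the most recent negotiated intention. This is a minimal illustrative sketch, not the paper's implementation; the four module callables (`perceive`, `negotiate`, `plan`, `control`) are hypothetical stand-ins.

```python
import threading

class CoLMDriverLoop:
    """Hedged sketch of the parallel pipeline: negotiation runs asynchronously
    so variable LLM latency never blocks the per-tick planning/control path."""

    def __init__(self, perceive, negotiate, plan, control):
        # All four callables are hypothetical stand-ins for the real modules.
        self.perceive, self.negotiate = perceive, negotiate
        self.plan, self.control = plan, control
        self.intention = "lane follow"   # default until the first consensus
        self._lock = threading.Lock()

    def _negotiation_worker(self, scene):
        new_intention = self.negotiate(scene)   # slow multi-round LLM calls
        with self._lock:
            self.intention = new_intention

    def tick(self):
        scene = self.perceive()                 # cooperative BEV perception
        threading.Thread(target=self._negotiation_worker,
                         args=(scene,), daemon=True).start()
        with self._lock:
            intention = self.intention          # latest consensus, no blocking
        waypoints = self.plan(scene, intention) # 20 future waypoints
        self.control(waypoints)                 # PID execution
        return waypoints
```

The key design choice this sketch illustrates: the planner never waits on the LLM, so a late negotiation result simply takes effect on a later tick.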
Key Design 1: LLM Negotiation Module (Actor-Critic Paradigm)¶
This is the paper's central contribution, modeling multi-vehicle negotiation as an Actor-Critic iterative process:
Actor (Negotiation Generation): Each vehicle hosts an LLM agent (Comm_Client) whose inputs include:
- Ego-vehicle information: ID, intention (e.g., "turn left"), speed
- Surrounding vehicle information: heading, course, distance, speed, intention
- Traffic rules (right-of-way, yielding to emergency vehicles, etc.)
- Historical negotiation records and Critic feedback
The LLM generates natural-language negotiation proposals, e.g., "I will slow down to let Vehicle 2 pass first, then proceed with my left turn."
Critic (Negotiation Evaluation): Implemented by Nego_Client, evaluating negotiation quality along three dimensions:
| Evaluation Dimension | Computation | Weight |
|---|---|---|
| Consensus Score | LLM analyzes dialogue content, outputs a 0–100 consensus score | 3 |
| Safety Score | Based on minimum distance between predicted trajectory pairs using a sigmoid: \(\frac{1}{1+e^{-2(d_{min}-3)}}\) | 5 |
| Efficiency Score | Evaluates whether each vehicle travels efficiently based on trajectory length | 2 |
Total score: \(S_{total} = 3 \cdot S_{cons} + 5 \cdot \min(S_{safety}) + 2 \cdot S_{eff}\)
Iterative Feedback: When the safety score < 0.7, safety suggestions are generated (e.g., "Vehicle X and Vehicle Y may conflict; one should yield"); when the efficiency score < 0.4, efficiency suggestions are generated (e.g., "Some vehicles can accelerate"). These suggestions are injected as Critic Feedback into the next Actor prompt. A maximum of 3 rounds (local_max_round=3) are performed, with early termination upon consensus.
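The Critic scoring and the iterative feedback loop can be sketched as follows. The sigmoid, the 3/5/2 weights, the 0.7 and 0.4 thresholds, and the 3-round cap come from the description above; the `actors` and `critic` callables are hypothetical stand-ins for the LLM clients.

```python
import math

def safety_score(d_min: float) -> float:
    # Sigmoid of the minimum predicted inter-trajectory distance;
    # the 3 m offset corresponds to a safe following distance.
    return 1.0 / (1.0 + math.exp(-2.0 * (d_min - 3.0)))

def total_score(s_cons: float, pair_safety: list, s_eff: float) -> float:
    # Weighted sum from the paper: 3*consensus + 5*min(safety) + 2*efficiency
    return 3 * s_cons + 5 * min(pair_safety) + 2 * s_eff

def negotiate(actors, critic, max_rounds=3):
    """Actor-Critic negotiation loop (local_max_round=3, early termination).
    actors: callables mapping Critic feedback -> one proposal per vehicle.
    critic: callable mapping proposals -> (s_cons, pair_safety_scores, s_eff)."""
    feedback = None
    for _ in range(max_rounds):
        proposals = [actor(feedback) for actor in actors]
        s_cons, pair_safety, s_eff = critic(proposals)
        suggestions = []
        if min(pair_safety) < 0.7:   # below safety threshold
            suggestions.append("safety: one of the conflicting vehicles should yield")
        if s_eff < 0.4:              # below efficiency threshold
            suggestions.append("efficiency: some vehicles can accelerate")
        if not suggestions:          # consensus reached -> terminate early
            return proposals
        feedback = suggestions       # injected into the next Actor prompt
    return proposals
```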
Key Design 2: VLM Scene Understanding¶
A vision-language model (VLM) processes front-camera images together with a system prompt (containing vehicle measurements, communication information, and perception results) to output structured driving commands:
- Directional commands (6 classes): turn left / turn right / straight / lane follow / lane change left / lane change right
- Speed commands (4 classes): Stop / Slow down / Hold / Accelerate
The VLM is fine-tuned via LoRA (using the ms-swift framework) on data from the V2Xverse simulation dataset.
Key Design 3: Intention-Guided Waypoint Generator¶
The core model is WaypointPlanner_e2e_cmd_attn_fix_20points, with the following architecture:
Input Fusion:
- Occupancy map: 6-channel representation including obstacle and road information, shape (B, T=5, C=6, H=192, W=96)
- BEV perception features: 128-dimensional feature map from the cooperative perception module
- Navigation commands: directional embedding (Embedding, 6→256) + speed embedding (Embedding, 4→256) + target point encoding (MLP, 2→256)
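The command encodings above can be sketched with plain lookup tables and a small MLP. This is a minimal numpy illustration of the stated shapes (Embedding 6→256, Embedding 4→256, MLP 2→256), with random weights standing in for learned parameters; the one-hidden-layer MLP and the token stacking are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # shared embedding width from the description

# Learnable lookup tables (random stand-ins for trained weights):
dir_table = rng.standard_normal((6, D))   # Embedding(6, 256): directional commands
spd_table = rng.standard_normal((4, D))   # Embedding(4, 256): speed commands
# MLP(2 -> 256) for the (x, y) target point; one hidden layer is an assumption
W1, W2 = rng.standard_normal((2, D)), rng.standard_normal((D, D))

def encode_commands(direction_id: int, speed_id: int, target_xy):
    d = dir_table[direction_id]                # directional embedding, (256,)
    s = spd_table[speed_id]                    # speed embedding, (256,)
    t = np.maximum(target_xy @ W1, 0.0) @ W2   # target-point MLP with ReLU
    # Stack as three command tokens for the decoder's cross-attention
    return np.stack([d, s, t])                 # shape (3, 256)
```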
Encoder: Spatio-temporal convolutional (STC) blocks alternating Conv2D and Conv3D for spatial and temporal modeling:
- STC Block 1: Conv2D(32→64) → Conv3D(64→64) (temporal fusion 5→3 frames)
- STC Block 2: Conv2D(64→128) → Conv3D(128→128) (temporal fusion 3→1 frame)
- Spatial features further convolved to 256 dimensions
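As a sanity check on the temporal-fusion counts, the 5→3→1 reduction is exactly what a Conv3D with temporal kernel 3 and no temporal padding produces. The kernel size and padding are assumptions inferred from the frame counts, not stated hyperparameters.

```python
def conv3d_temporal_extent(t_in: int, k: int = 3, pad: int = 0, stride: int = 1) -> int:
    # Standard output-length formula for a convolution along one (temporal) axis
    return (t_in + 2 * pad - k) // stride + 1

assert conv3d_temporal_extent(5) == 3  # STC Block 1: 5 -> 3 frames
assert conv3d_temporal_extent(3) == 1  # STC Block 2: 3 -> 1 frame
```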
Decoder: Cross-attention mechanism that interacts navigation command features with BEV spatial features:
- Constructs query embeddings
- Alternates between command-to-query attention and spatial-to-query attention
- Final regression branch outputs 20 waypoints (B, 20, 2), with cumulative summation enforcing trajectory continuity
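The cumulative-summation step can be shown in one line: the regression head predicts per-step offsets, and summing them turns offsets into absolute waypoints, so consecutive waypoints can never jump discontinuously. A minimal numpy sketch:

```python
import numpy as np

def deltas_to_waypoints(deltas: np.ndarray) -> np.ndarray:
    """Convert per-step (dx, dy) offsets into absolute waypoints via cumulative
    summation along the time axis: each waypoint is the previous waypoint plus
    a predicted offset, which enforces trajectory continuity."""
    return np.cumsum(deltas, axis=1)   # (B, 20, 2) -> (B, 20, 2)
```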
Loss & Training¶
A weighted L1 loss (WaypointL1Loss20) supervises trajectory prediction, with per-step weights decreasing from 0.14 for the nearest waypoint to 0.04 for the farthest, so near-term accuracy is emphasized over long-horizon accuracy. ADE/FDE serve as auxiliary evaluation metrics.
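A sketch of this loss and the auxiliary metrics, assuming a linear decay between the two stated endpoint weights (the exact schedule is not given, only the 0.14 and 0.04 endpoints):

```python
import numpy as np

def waypoint_l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Hedged sketch of WaypointL1Loss20: per-step weights decay from 0.14
    (nearest waypoint) to 0.04 (farthest); linear decay is an assumption."""
    T = pred.shape[1]                               # 20 timesteps
    w = np.linspace(0.14, 0.04, T)                  # decreasing weights
    per_step = np.abs(pred - target).sum(axis=-1)   # L1 over (x, y): (B, T)
    return float((per_step * w).sum(axis=-1).mean())  # weighted sum, batch mean

def ade_fde(pred: np.ndarray, target: np.ndarray):
    """Average / Final Displacement Error over (B, T, 2) trajectories."""
    disp = np.linalg.norm(pred - target, axis=-1)   # Euclidean error: (B, T)
    return float(disp.mean()), float(disp[:, -1].mean())
```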
InterDrive Benchmark¶
The authors introduce InterDrive, a V2V interactive driving evaluation benchmark based on CARLA 0.9.10.1:
- 10 challenging interaction scenario types: covering multi-vehicle intersection crossing, lane merging conflicts, emergency avoidance, etc.
- 92 test cases: 46 cooperative-vehicle-only scenarios (no NPC) + 46 scenarios with background traffic participants (with NPC)
- Evaluation metrics: success rate, collision rate, timeout rate, etc.
- Two evaluation modes: ideal (ignoring inference latency) and realtime (accounting for inference latency)
Key Experimental Results¶
Main Results¶
Comparison against multiple baselines on the InterDrive benchmark:
| Method | Type | Success Rate (↑) |
|---|---|---|
| TCP | Single-vehicle end-to-end | baseline |
| CoDriving | V2V cooperative (rule-driven) | baseline |
| LMDrive | LLM-based driving (single vehicle) | baseline |
| UniAD | End-to-end planning | baseline |
| VAD | End-to-end planning | baseline |
| CoLMDriver | V2V + LLM negotiation | +11% vs. best baseline |
CoLMDriver substantially outperforms all methods in high-interaction V2V scenarios, demonstrating the effectiveness of language negotiation for cooperative driving.
Ablation Study¶
| Configuration | Effect |
|---|---|
| w/o negotiation | Significant performance drop; validates the necessity of the negotiation module |
| w/o Critic feedback | Single-round negotiation only; performance drops, validating the value of iterative feedback |
| w/o VLM intention | Removing scene understanding and using rule-based commands instead; performance drops |
| Varying negotiation rounds | 3 rounds achieves a favorable trade-off |
Key Findings¶
- Negotiation quality correlates positively with driving safety: The Critic feedback mechanism elevates safety scores from the unstable levels of single-round negotiation to consistently high values.
- NPC scenarios are more challenging: Success rates in scenarios with background traffic are generally lower than in cooperative-only scenarios.
- Impact of inference latency: Performance degrades in realtime mode, but CoLMDriver's parallel pipeline design effectively mitigates this — negotiation and trajectory planning proceed asynchronously.
Highlights & Insights¶
- "Let LLMs do what they are good at": Rather than forcing LLMs to output precise coordinates, the system assigns them natural-language negotiation while a dedicated model converts the outcome into trajectories. This division of labor is elegant.
- Actor-Critic negotiation paradigm: The Actor-Critic concept from reinforcement learning is transferred to multi-vehicle negotiation. The Critic evaluates along three dimensions — safety, efficiency, and consensus — and automatically generates improvement suggestions, realizing interpretable closed-loop optimization.
- Clever Critic design: The safety score is based on a sigmoid function of minimum inter-trajectory distance, \(\frac{1}{1+e^{-2(d-3)}}\), where the 3-meter threshold corresponds to the safe following distance, making the physical interpretation explicit. The consensus score leverages the LLM's own comprehension ability to assess dialogue quality.
- Fully open-sourced: The complete codebase — from perception to planning to negotiation — along with model weights, training scripts, and the InterDrive benchmark, is fully released, representing a substantial contribution to the community.
- Parallel pipeline: The negotiation process runs in parallel with perception and planning, effectively alleviating the impact of LLM inference latency on real-time performance.
Limitations & Future Work¶
- Simulation-to-real gap: Experiments are conducted entirely in the CARLA simulator; real-world issues such as V2V communication delay, bandwidth constraints, and signal loss are not addressed.
- High LLM inference cost: Each negotiation step requires multiple LLM calls (one per vehicle per round plus Critic evaluation); three rounds of negotiation involve substantial API calls, making deployment cost and latency ongoing challenges.
- Strong negotiation assumptions: The framework assumes all cooperative vehicles honestly participate in negotiation and accurately report their intentions. Adversarial problems such as malicious vehicles and communication spoofing in real-world settings are not addressed.
- Limited scenario scale: The InterDrive benchmark only covers interactions among 2–3 cooperative vehicles; the negotiation efficiency and scalability for larger platoons remain unvalidated.
- VLM relies on a single-view image: Scene understanding is based solely on the front-facing camera, potentially missing lateral and rear information.
Related Work & Insights¶
- V2Xverse / CoDriving: Foundation for the perception module, providing the V2V cooperative perception framework and simulation dataset.
- TCP (Trajectory-guided Control Prediction): Serves as the single-vehicle end-to-end driving baseline.
- LMDrive: A pioneering work applying LLMs to single-vehicle driving, but lacking V2V cooperative capability.
- Bench2Drive: A CARLA driving evaluation benchmark that partially informs the paper's evaluation framework.
Insights:
- Natural-language negotiation in multi-agent cooperation is a promising direction, extensible to robot formation control, drone swarms, and related domains.
- The Actor-Critic negotiation paradigm generalizes to any scenario requiring multi-party consensus (e.g., multi-robot task allocation).
- Decoupling LLM language reasoning from precise control by domain-specific models is a broadly applicable system-design principle.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First end-to-end LLM cooperative driving system; Actor-Critic negotiation paradigm is original |
| Technical Depth | ⭐⭐⭐⭐ | Full-stack perception–negotiation–planning design with well-reasoned inter-module coupling |
| Experimental Thoroughness | ⭐⭐⭐⭐ | Introduces a new benchmark, compares against multiple baselines, and provides complete ablation studies |
| Writing Quality | ⭐⭐⭐⭐ | Problem formulation is clear; method description is detailed |
| Practical Value | ⭐⭐⭐ | Simulation results are convincing, but substantial work remains for real-world deployment |
| Overall | ⭐⭐⭐⭐ | An important contribution to cooperative driving and a valuable exploration of LLM integration with autonomous driving |