AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

Conference: ICCV 2025
arXiv: 2511.06253
Code: https://github.com/ReaFly/AdaDrive
Area: Autonomous Driving
Keywords: Large Language Models, Autonomous Driving, Adaptive Slow-Fast System, Language-Grounded Driving, Efficient Inference

TL;DR

AdaDrive presents the first LLM-augmented autonomous driving framework with an adaptive slow-fast architecture. Two adaptive connectors dynamically determine when to activate the LLM (Connector-W) and how much the LLM contributes (Connector-H), achieving SOTA performance on language-grounded driving benchmarks (driving score 80.9%) while reducing inference latency to 189ms and GPU memory to 6.79GB.

Background & Motivation

  • Background: LLMs can provide high-level reasoning and decision-making capabilities for autonomous driving, yet efficient integration remains an open problem.
  • First-generation methods (LMDrive, AD-H): Synchronous architectures where the LLM participates at every step—precise in reasoning but with high latency and memory overhead, precluding real-time deployment.
  • Second-generation methods (AsyncDriver, DriveVLM): Asynchronous architectures with fixed-frequency LLM activation—reduced overhead but unable to adapt to dynamic driving scenarios. The LLM may not be invoked during emergencies, yet wastes computation in simple scenarios.
  • Key Challenge: High-frequency LLM activation ensures performance but incurs unacceptable latency; low-frequency fixed activation misses critical scenarios and lacks flexibility.
  • Key Insights: (1) LLM activation should be scene-complexity-driven rather than fixed-frequency; (2) the LLM's contribution should not be a binary all-or-nothing decision—continuous weighted fusion (e.g., weight 0.7) outperforms full-weight fusion (1.0), as validated experimentally (Table 4: ID#3 vs. ID#4).

Method

Overall Architecture

AdaDrive adopts a parallel slow-fast dual-path architecture: the fast path (lightweight planner) processes every frame at high frequency for trajectory prediction; the slow path (LLM) is activated at low frequency as a cognitive unit providing decision assistance. The two paths are adaptively connected via Connector-W and Connector-H. An LS-Qformer handles temporal feature extraction, and a streaming memory buffer manages historical context.

Key Designs

  1. Connector-W: Adaptive LLM Activation

     • Function: Dynamically determines whether the LLM should be activated for the current frame.
     • Mechanism: An MLP predicts a confidence score \(\theta_T\) from the current driving context feature \(f_T'\), which is converted to a binary decision \(\pi_T \in \{0, 1\}\) via Gumbel-Softmax.
     • Adaptive Activation Loss (Core Innovation): \(\mathcal{L}_{ada} = \pi_T \cdot (\mathcal{L}_T^{LLM} + \gamma) + (1-\pi_T) \cdot \mathcal{L}_T\), where \(\gamma = \max(d - (\mathcal{L}_T - \mathcal{L}_T^{LLM}), 0)\).
     • Training Mechanism: Two forward passes are performed at each step, one with LLM assistance (yielding \(W_T^{LLM}\)) and one without (yielding \(W_T\)), and their trajectory losses are compared. The model automatically learns to activate the LLM when it provides a significant benefit (\(\mathcal{L}_T^{LLM} \ll \mathcal{L}_T\)).
     • Role of the Penalty Term \(\gamma\): Controls activation frequency via a preset margin \(d = 0.3\), so the LLM is activated only when its contribution exceeds that margin.
     • Design Motivation: No manual annotation of "when to use the LLM" is required; comparative learning automatically discovers the optimal activation timing.
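The gating and loss above can be sketched in a few lines of plain Python. This is a minimal, torch-free toy: `gumbel_binary_gate` stands in for the MLP-plus-Gumbel-Softmax decision (in practice a straight-through relaxation keeps it differentiable), and the function and parameter names are illustrative, not from the paper's code.

```python
import math
import random

def gumbel_binary_gate(theta, tau=1.0, rng=random.Random(0)):
    """Hard Gumbel-Softmax draw over {activate, skip}: returns pi_T in {0, 1}.

    theta is Connector-W's predicted confidence that the LLM helps this frame.
    """
    eps = 1e-8
    g_on = -math.log(-math.log(rng.random() + eps) + eps)   # Gumbel noise
    g_off = -math.log(-math.log(rng.random() + eps) + eps)
    z_on = (math.log(theta + eps) + g_on) / tau
    z_off = (math.log(1.0 - theta + eps) + g_off) / tau
    return 1.0 if z_on > z_off else 0.0

def adaptive_activation_loss(loss_llm, loss_base, pi, d=0.3):
    """L_ada = pi * (L^LLM + gamma) + (1 - pi) * L,
    with gamma = max(d - (L - L^LLM), 0): activating only pays off when the
    LLM reduces the trajectory loss by more than the margin d."""
    gamma = max(d - (loss_base - loss_llm), 0.0)
    return pi * (loss_llm + gamma) + (1.0 - pi) * loss_base
```

With a large benefit (loss 1.0 without the LLM vs. 0.1 with it), activation strictly lowers the loss; with a marginal benefit (1.0 vs. 0.9, below the 0.3 margin), the penalty \(\gamma\) makes skipping the cheaper choice, which is exactly how the preset margin throttles activation frequency.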

  2. Connector-H: Dynamic LLM Contribution Scaling

     • Function: Controls the degree to which LLM features contribute to trajectory prediction when the LLM is activated.
     • Mechanism: The confidence score \(\theta_T\) predicted by Connector-W serves as a continuous weighting coefficient rather than a hard full-weight fusion.
     • Fusion Formula: \(W_T^{Fuse} = \mathcal{P}(f_T' + \theta_T \cdot f_T'')\)
     • Unified Inference Formula: \(W_T = \begin{cases} \mathcal{P}(f_T'), & \text{LLM not activated} \\ \mathcal{P}(f_T' + \theta_T \cdot f_T''), & \text{LLM activated} \end{cases}\)
     • Design Motivation: Experiments demonstrate that continuous weighting (e.g., \(\theta_T = 0.7\)) outperforms binary full-weight fusion (\(\theta_T = 1.0\)), and adaptive scaling yields finer-grained feature integration.
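The unified inference rule is just a weighted residual add before the planner. A minimal sketch, where `planner_stub` (a sum) stands in for the lightweight planner \(\mathcal{P}\) and features are plain lists; the names are hypothetical:

```python
def planner_stub(feat):
    # Stand-in for the lightweight trajectory planner P(.): here just a sum.
    return sum(feat)

def connector_h_fuse(f_fast, f_llm, theta, activated, planner=planner_stub):
    """W_T = P(f'_T) when the LLM is skipped,
    else  W_T = P(f'_T + theta_T * f''_T)  (continuous-weight fusion)."""
    if not activated or f_llm is None:
        return planner(f_fast)
    fused = [a + theta * b for a, b in zip(f_fast, f_llm)]
    return planner(fused)
```

Because \(\theta_T\) is the same confidence Connector-W already predicts, a low-confidence activation contributes weakly instead of all-or-nothing, which matches the finding that partial weights (e.g., 0.7) beat full-weight fusion.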

  3. Long-Short Q-former (LS-Qformer)

     • Function: Enhances temporal modeling of visual features, balancing current-frame precision with long-range context retention.
     • Mechanism: Learnable tokens are divided into two groups: memory tokens \(\mathbf{Q}^m\) propagate across frames to aggregate long-range information, while local tokens \(\mathbf{Q}^l\) focus on the current frame.
     • Formula: \(f_T' = [\mathbf{Q}^l; \mathbf{Q}_T^m] = \text{Q-former}(\mathbf{Q}^l, \mathbf{Q}_{T-1}^m, f_T, \mathbf{I}_T)\)
     • Hyperparameters: 20 local tokens + 20 memory tokens.
     • Design Motivation: Standard Q-formers process each frame independently, neglecting temporal dependencies. LS-Qformer's grouped design simultaneously extracts salient current-frame features and models temporal evolution.
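The token-grouping recurrence can be illustrated with a deliberately crude stand-in: cross-attention is replaced by elementwise averaging, and features are flat lists. The point is only the data flow, namely that local tokens restart from their learned initialization every frame while memory tokens carry state from frame \(T-1\); nothing here mirrors the actual Q-former internals.

```python
def ls_qformer_step(local_init, mem_prev, frame_feat):
    """One toy LS-Qformer step (attention replaced by averaging).

    local_init : learned local-token init Q^l, reused every frame
    mem_prev   : memory tokens Q^m_{T-1}, propagated from the last frame
    frame_feat : current visual features f_T
    Returns (f'_T, Q^m_T); f'_T = [Q^l; Q^m_T] feeds the planner/LLM,
    and Q^m_T is carried to frame T+1.
    """
    local = [(q + f) / 2 for q, f in zip(local_init, frame_feat)]
    mem = [(m + f) / 2 for m, f in zip(mem_prev, frame_feat)]
    return local + mem, mem
```

Running two steps shows the asymmetry: the local token tracks only the current frame, while the memory token blends the whole history.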

  4. Propagative Memory Fusion (PMF)

     • Function: Manages a fixed-size memory buffer for streaming data, preventing unbounded GPU memory growth.
     • Mechanism: When the buffer is full, features of the frame to be evicted are fused into the adjacent frame: \(\hat{f}_{T-k+1}' = (f_{T-k}' + f_{T-k+1}')/2\)
     • Comparison with FIFO: FIFO directly discards the oldest frame's information; PMF preserves historical context through fusion.
     • Buffer Capacity: \(k = 10\).
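The eviction rule fits in a tiny class. A minimal sketch assuming \(k \ge 2\) and list-valued features; the class and method names are invented for illustration:

```python
class PropagativeMemoryBuffer:
    """Fixed-capacity streaming buffer. On overflow, the oldest frame is
    averaged into its neighbour instead of being dropped (unlike FIFO):
    f_hat_{T-k+1} = (f_{T-k} + f_{T-k+1}) / 2.
    """

    def __init__(self, k=10):
        self.k = k
        self.frames = []  # oldest first

    def push(self, feat):
        if len(self.frames) == self.k and len(self.frames) >= 2:
            oldest = self.frames.pop(0)
            # Fuse the evicted frame into its (now oldest) neighbour.
            self.frames[0] = [(a + b) / 2 for a, b in zip(oldest, self.frames[0])]
        self.frames.append(feat)
```

Each eviction halves the weight of older frames rather than zeroing it, so a short buffer (\(k = 10\)) can still summarize a long history, consistent with the finding that the smaller buffer works best.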

Loss & Training

  • AdamW optimizer with cosine learning rate scheduling; initial learning rate \(1 \times 10^{-5}\).
  • Training for 15 epochs; margin \(d=0.3\) in the adaptive activation loss.
  • The visual encoder is initialized from LMDrive pretraining and frozen; the LLM is TinyLLaMA (1.1B).
  • The planner is a 4-layer Transformer with only 3M parameters.
  • A warmup phase allows the LLM-assisted and LLM-free trajectory losses to converge to stable values before adaptive training begins.

Key Experimental Results

Main Results

Method                 LLM Params   DS ↑    RC ↑    IS ↑    Memory ↓   Latency ↓
LMDrive (LLaMA2-7B)    7B           32.8    40.1    0.81    26.91G     526ms
LMDrive (TinyLLaMA)    1.1B         25.2    38.6    0.71    16.29G     445ms
AD-H (Mipha-3B)        3.35B        41.1    48.5    0.86    –          –
AdaDrive               1.1B + 3M    42.9    53.4    0.82    6.79G      189ms

Ablation Study

ID   Connector-W   Connector-H   LS-Qformer   DS ↑    RC ↑    IS ↑
1                                             67.4    75.3    0.86
2                                             71.9    82.6    0.84
3                                             77.9    84.8    0.89
4                                             80.9    87.6    0.90

Key Findings

  • The average activation frequency of the adaptive mechanism is only 0.28 (short routes) and 0.33 (full routes), yet performance approaches that of full activation (frequency = 1.0), with GFLOPs reduced by 62%.
  • Challenging routes (dense urban streets, nighttime, mountain roads) exhibit higher activation frequencies, validating the rationale of the adaptive mechanism.
  • Temporal distribution analysis shows the LLM is primarily activated at critical moments such as turns and intersections, remaining silent during cruising.
  • LS-Qformer improves DS from 75.8 to 80.9 (+5.1) compared to a standard Q-former.
  • PMF outperforms hard FIFO replacement; notably, a smaller memory buffer (\(k=10\)) yields the best results.

Highlights & Insights

  • Elegant Design of the Adaptive Activation Loss: No ground-truth annotation of when to activate the LLM is required; comparative learning during training automatically discovers the optimal activation timing. This design is generalizable to other systems requiring on-demand activation of expensive modules.
  • Continuous Fusion Outperforms Binary Fusion: The ablation results for Connector-H (ID#3 vs. ID#4) reveal a valuable insight—LLM outputs should not be used at full weight; adaptive weighting is superior.
  • Extreme Efficiency: Using a 1.1B small model with a 3M planner, AdaDrive substantially outperforms the 7B model baseline in both memory (6.79G vs. 26.91G) and speed (189ms vs. 526ms), while achieving stronger performance.
  • Interpretability of LLM Activation Patterns: Activations are concentrated at turns and intersections, which aligns with intuitive expectations.

Limitations & Future Work

  • Training Connector-W requires two forward passes per step (with/without LLM), increasing training cost.
  • The margin \(d=0.3\) in the penalty term is a manually preset hyperparameter that may require tuning for different scenarios.
  • The simple averaging in PMF may not be optimal; attention-weighted fusion could be more effective.
  • The IS score (0.82) on long-distance tasks remains below AD-H (0.86), indicating room for improvement in safety.
  • Validation is limited to CARLA simulation; real-world deployment performance remains unknown.
Comparison with Related Methods

  • vs. LMDrive: Synchronous per-step LLM invocation yields unacceptable 526ms latency; AdaDrive's adaptive invocation needs only 189ms.
  • vs. AsyncDriver/DriveVLM: Fixed-frequency activation cannot adapt to dynamic scenarios; AdaDrive activates on demand.
  • vs. AD-H: AD-H employs additional intermediate language command supervision to train a hierarchical multi-agent system with a larger total parameter count (3.35B); AdaDrive is more lightweight yet achieves stronger performance.
  • vs. Flash-VStream: Video understanding focuses on high-level semantic dialogue, whereas autonomous driving targets low-level high-frequency trajectory prediction—the two have fundamentally different objectives.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The adaptive activation loss and continuous fusion design are highly innovative; the dual-connector architecture is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are comprehensive and activation pattern analyses are convincing, though evaluation is limited to CARLA simulation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated and figures are informative.
  • Value: ⭐⭐⭐⭐⭐ Provides a practical paradigm for efficient LLM deployment in autonomous driving.