Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Conference: CVPR 2026
arXiv: 2603.25740
Code: https://dmw-cvpr.github.io/
Area: Autonomous Driving
Keywords: personalized driving, VLA model, preference alignment, reinforcement fine-tuning, user embedding

TL;DR

This paper proposes DMW (Drive My Way), a personalized VLA driving framework that learns long-term driving habits via user embeddings and adapts to short-term preferences through natural language instructions. Personalized driving behavior is generated using GRPO-based reinforcement fine-tuning and style-aware rewards.

Background & Motivation

Driving behavior is inherently highly personalized — different drivers exhibit markedly distinct preferences in acceleration, braking, lane changing, overtaking, and other maneuvers. However, existing end-to-end autonomous driving systems suffer from the following shortcomings:

Generic optimization: Existing systems typically optimize general objectives such as safety and efficiency, ignoring individual differences.

Fixed preset modes: Only a few modes (e.g., "sport/comfort/economy") are offered, which cannot capture nuanced and continuously evolving user preferences.

Inability to understand natural language: Users cannot adjust driving style through intuitive utterances such as "I'm tired" or "I'm going to be late for work."

Existing personalization approaches have two major limitations:

Data-driven methods (behavioral cloning / IRL): Require large-scale data, scale poorly, and cannot handle real-time language interaction.

Language-driven methods (e.g., Talk2Drive): Validated only in simple scenarios, without considering long-term driving habits.

The core idea of DMW is to simultaneously address long-term preference alignment and short-term instruction adaptation.

Method

Overall Architecture

DMW adopts SimLingo (based on InternVL2-1B) as the VLA backbone. The inputs include front-camera images, navigation targets, a user profile, and language instructions; the output is personalized driving actions (throttle / brake / steering).
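
As a rough sketch of this interface (hypothetical types and names, not taken from the paper's code release), the policy maps multimodal observations plus personalization signals to low-level controls:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class DrivingObservation:
    front_image: np.ndarray          # front-camera RGB frame
    nav_target: Tuple[float, float]  # navigation goal in the ego frame
    user_profile: str                # long-term profile text from the questionnaire
    instruction: Optional[str]       # short-term command, e.g. "I'm going to be late for work"


@dataclass
class DrivingAction:
    throttle: float  # in [0, 1]
    brake: float     # in [0, 1]
    steering: float  # in [-1, 1]
```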

Key Designs

  1. Personalized Driving Dataset (PDD):

    • 30 real drivers with diverse backgrounds were recruited.
    • Each driver completed 20 standardized scenarios (overtaking, merging, intersections, pedestrian crossings, etc.).
    • Data were collected in CARLA using a Logitech steering wheel and pedals.
    • Recorded information: ego-vehicle motion state, surrounding perception (vehicles / pedestrians / cyclists / roadside hazards), and traffic context (traffic lights / speed limits / route).
    • Each driver completed a structured questionnaire (demographics, driving history, travel purpose, etc.) as their profile.
    • Expert target speed (PDM-Lite) was recorded; the deviation between human speed and expert speed serves as a style descriptor.
  2. Long-Term Preference Learning and Alignment:

    • User embedding learning: Contrastive learning is used to establish a shared latent space between profile embeddings and behavior embeddings.
      • Profile encoder \(f_p(\cdot)\): DeBERTaV3 + projection head → user embedding \(z_p^m\)
      • Behavior encoder \(f_b(\cdot)\): Temporal encoder + multi-head self-attention processing a past \(k\)-step trajectory window → behavior embedding \(z_{b,t}^m\)
      • InfoNCE contrastive loss: pulls together the profile and behavior embeddings of the same driver while pushing apart those of different drivers (a minimal sketch appears after this list).
    • Preference alignment: The user embedding \(z_p^m\) is injected into the VLA policy and further adapted through reinforcement fine-tuning.
    • Data augmentation: The driver \(u\) whose embedding is least similar to the target driver is selected; augmented actions are generated by scaling according to the action statistics ratio: \(\tilde{a}_t^m = \frac{\bar{a}^m}{\bar{a}^u} \cdot a_t^m\).
  3. Personalization via Reinforcement Fine-Tuning (GRPO):

    • Group Relative Policy Optimization is employed.
    • Residual decoder: Learnable residual query tokens are injected into the language model, outputting discrete residual adjustments (speed delta + steering delta).
    • Final action = base action + personalized residual: \(a_t = a_t^{base} + a_t^\Delta\)
    • This design injects personalized expression while preserving safe planning.
  4. Style-Aware Reward Adaptation:

    • Weighted reward: \(\mathcal{R}(s_t, a_t) = w_s \cdot R_{safety} + w_e \cdot R_{efficiency} + w_c \cdot R_{comfort}\)
    • Safety reward: Based on time-to-collision (TTC): \(R_{safety} = \mathbb{I}(TTC_t \geq \beta_{safety})\)
    • Efficiency reward: \(R_{efficiency} = \exp(-\alpha \cdot |v_t - v_{pref}|)\)
    • Comfort reward: Steering and acceleration do not exceed predefined thresholds.
    • Reward parameters (weights, thresholds, preferred speed) are dynamically adjusted based on language instructions and scene context (a reward sketch follows this list).
    • Parameters are initialized via GPT-5 inference and refined through expert review.
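
To make the long-term alignment step concrete, below is a minimal PyTorch sketch (my own naming, not the authors' released code) of the symmetric InfoNCE objective over a batch of drivers, together with the cross-driver augmentation scaling rule \(\tilde{a}_t^m = \frac{\bar{a}^m}{\bar{a}^u} \cdot a_t^m\):

```python
import torch
import torch.nn.functional as F


def profile_behavior_infonce(z_profile: torch.Tensor,
                             z_behavior: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of M drivers.

    z_profile:  (M, D) profile embeddings, one row per driver
    z_behavior: (M, D) behavior embeddings from the same drivers (row-aligned)
    """
    zp = F.normalize(z_profile, dim=-1)
    zb = F.normalize(z_behavior, dim=-1)
    logits = zp @ zb.t() / temperature                    # (M, M) cosine similarities
    targets = torch.arange(zp.size(0), device=zp.device)  # positive of row i is column i
    loss_p2b = F.cross_entropy(logits, targets)           # profile -> behavior direction
    loss_b2p = F.cross_entropy(logits.t(), targets)       # behavior -> profile direction
    return 0.5 * (loss_p2b + loss_b2p)


def cross_driver_augment(a_t: torch.Tensor,
                         mean_a_target: torch.Tensor,
                         mean_a_dissimilar: torch.Tensor) -> torch.Tensor:
    # Scale driver m's action by the ratio of per-driver action statistics,
    # where the denominator comes from the least-similar driver u.
    return (mean_a_target / mean_a_dissimilar) * a_t
```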
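
Likewise, the residual action composition and the style-aware reward can be sketched as follows. The TTC indicator, exponential speed-tracking term, and threshold-based comfort term follow the formulas above, but the weights, thresholds, and preferred speed here are placeholder values:

```python
import math


def compose_action(base_action: float, residual: float) -> float:
    # a_t = a_t^base + a_t^Delta: personalization is a residual on top of the base plan.
    return base_action + residual


def style_aware_reward(ttc: float, speed: float, steer: float, accel: float,
                       v_pref: float,
                       weights: tuple = (1.0, 1.0, 1.0),
                       beta_safety: float = 2.0,   # TTC threshold (placeholder)
                       alpha: float = 0.5,         # speed-deviation sharpness (placeholder)
                       steer_max: float = 0.3,     # comfort thresholds (placeholders)
                       accel_max: float = 2.0) -> float:
    w_s, w_e, w_c = weights
    r_safety = 1.0 if ttc >= beta_safety else 0.0          # indicator on time-to-collision
    r_efficiency = math.exp(-alpha * abs(speed - v_pref))   # peaks at the preferred speed
    r_comfort = 1.0 if (abs(steer) <= steer_max and abs(accel) <= accel_max) else 0.0
    return w_s * r_safety + w_e * r_efficiency + w_c * r_comfort
```

An instruction such as "I'm tired" would shift weight toward the comfort term and lower the preferred speed; in the paper these parameters are initialized via GPT-5 inference, refined by expert review, and adjusted per instruction and scene context rather than hard-coded as here.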

Loss & Training

  • User embedding training: AdamW, weight decay 1e-3, lr 1e-4.
  • After the preference encoder converges, it is frozen; the motion predictor and residual decoder are then fine-tuned.
  • LoRA is used to adapt Qwen2-0.5B.
  • 8× A6000 GPUs, per-GPU batch size 8.
  • GRPO samples 4 responses per input (see the advantage sketch below).
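
For reference, the group-relative advantage at the core of GRPO can be sketched as follows (a simplified, standalone illustration; in the paper each group consists of the 4 responses sampled per input, scored with the style-aware reward):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages for one group of responses sampled from the same input.

    rewards: (G,) scalar rewards, e.g. G = 4 in the paper's setup.
    """
    # Each response is scored relative to its own group, so no learned critic is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: the highest-reward response gets a positive advantage, the lowest a negative one.
adv = group_relative_advantages(torch.tensor([0.8, 1.2, 0.5, 1.0]))
```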

Key Experimental Results

Main Results

Bench2Drive closed-loop driving metrics (DS = Driving Score, SR = Success Rate):

| Method | Style | DS | SR | Efficiency | Comfort | Speed | TT |
|---|---|---|---|---|---|---|---|
| SimLingo | Aggressive | 78.56 | 65.83 | 247.60 | 18.61 | 7.66 | 25.35 |
| SimLingo | Conservative | 78.18 | 65.56 | 238.77 | 26.99 | 7.21 | 33.02 |
| DMW | Aggressive | 79.50 | 67.36 | 281.56 | 21.62 | 7.72 | 26.93 |
| DMW | Conservative | 82.72 | 71.56 | 237.06 | 34.62 | 6.18 | 47.38 |

Switching from the Conservative to the Aggressive setting, DMW improves efficiency by 18.77% (versus only 3.70% for SimLingo), while its DS declines by only 3.89%.

Long-term preference alignment (user study):

| Method | AS (ID: D1/D2) | AS (OOD: D3/D4) | Ratings (ID) | Ratings (OOD) |
|---|---|---|---|---|
| MORL-PD | 0.42 / 0.58 | 0.25 / 0.33 | 5.1 / 6.2 | 3.9 / 3.5 |
| DMW | 0.92 / 0.92 | 0.83 / 0.83 | 8.7 / 8.3 | 7.8 / 8.0 |

Ablation Study

Adaptive Average Pooling (AAP) ablation:

| Driver | w/ AAP | w/o AAP | Notes |
|---|---|---|---|
| D1 (AS) | 0.92 | 0.67 | Higher alignment score with AAP |
| D2 (AS) | 0.92 | 0.58 | |
| D3 (AS) | 0.83 | 0.25 | Larger gap in the OOD setting |

| Configuration | Key Metrics | Notes |
|---|---|---|
| w/o AAP | AS avg. 0.50, Ratings avg. 5.5 | Global mean pooling reduces embedding expressiveness |
| w/ AAP | AS avg. 0.88, Ratings avg. 8.2 | Semantically important embeddings are preserved |

Key Findings

  1. DMW achieves effective style differentiation while maintaining safety: DS/SR is highest under the Conservative setting, and efficiency improves significantly under the Aggressive setting.
  2. Long-term preference alignment generalizes to OOD drivers: Alignment Score reaches 0.83 for unseen drivers D3/D4.
  3. Policy behavior differs significantly across driver conditions: Aggressive drivers (D1/D4) exhibit higher speed and acceleration, while conservative drivers (D2/D3) maintain larger following distances.
  4. Short-term instructions can be superimposed on long-term preferences: The two personalization dimensions are orthogonal and complementary.

Highlights & Insights

  1. Decoupling long- and short-term preferences: Driving personalization is decomposed into long-term habits (user embeddings) and short-term intent (language instructions) — a design that is both concise and effective.
  2. Residual action design: \(a_t = a_t^{base} + a_t^\Delta\) superimposes personalization on a safe base action, mitigating the risks of fully end-to-end approaches.
  3. Real driver data: The PDD dataset, collected from 30 real participants driving in CARLA, exhibits greater behavioral diversity than synthetic data.
  4. Interpretability of style-aware rewards: The mapping from language instructions to safety/efficiency/comfort weights is transparent and traceable.
  5. GRPO reinforcement fine-tuning: Achieves better specialization to individual styles compared to pure behavioral cloning.

Limitations & Future Work

  1. Validated in CARLA simulation only: Real-world performance is unknown, and the sim-to-real gap may be substantial.
  2. Limitations of profile questionnaires: Driving style may change dynamically with mood and road conditions; a static profile cannot fully capture such variations.
  3. Diversity of the 30-driver sample: The sample size is limited, and whether it covers the diversity of global driving cultures remains questionable.
  4. Safety risks in Aggressive mode: Lowering the TTC threshold in Aggressive mode may introduce safety hazards.
  5. Computational overhead: The real-time feasibility of VLA + GRPO + user embedding inference is not adequately discussed.

Related Work & Insights

  • SimLingo: Serves as the VLA backbone, providing foundational language-vision-action capabilities.
  • Talk2Drive: A pioneer in language-driven personalization, but validated only in simple scenarios.
  • MAVERIC: Learns a latent space for diverse socially-aware driving behaviors.
  • StyleDrive: A comparison method that injects fixed style conditions into the policy.
  • Insight: The paradigm of preference alignment combined with reinforcement fine-tuning may be applicable to other embodied AI tasks requiring personalization, such as robotic manipulation.

Rating

  • Novelty: ⭐⭐⭐⭐ — The long/short-term preference decoupling and GRPO + residual design are creative, though the overall approach is not a fundamental breakthrough.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Closed-loop evaluation plus user study, but limited to CARLA simulation.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-designed experiments, though notation is dense.
  • Value: ⭐⭐⭐⭐ — Personalized driving addresses a genuine practical need, and the PDD dataset has potential for reuse.