# Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Conference: CVPR 2026 | arXiv: 2603.25740 | Code: https://dmw-cvpr.github.io/ | Area: Autonomous Driving | Keywords: personalized driving, VLA model, preference alignment, reinforcement fine-tuning, user embedding
## TL;DR
This paper proposes DMW (Drive My Way), a personalized VLA driving framework that learns long-term driving habits via user embeddings and adapts to short-term preferences through natural language instructions. Personalized driving behavior is generated using GRPO-based reinforcement fine-tuning and style-aware rewards.
## Background & Motivation
Driving behavior is inherently highly personalized — different drivers exhibit markedly distinct preferences in acceleration, braking, lane changing, overtaking, and other maneuvers. However, existing end-to-end autonomous driving systems suffer from the following shortcomings:
Generic optimization: Existing systems typically optimize general objectives such as safety and efficiency, ignoring individual differences.
Fixed preset modes: Only a few modes (e.g., "sport/comfort/economy") are offered, which cannot capture nuanced and continuously evolving user preferences.
Inability to understand natural language: Users cannot adjust driving style through intuitive utterances such as "I'm tired" or "I'm going to be late for work."
Two major limitations of existing personalization approaches:
- Data-driven methods (behavioral cloning / IRL): require large-scale data, scale poorly, and cannot handle real-time language interaction.
- Language-driven methods (e.g., Talk2Drive): validated only in simple scenarios, without considering long-term driving habits.
The core idea of DMW is to simultaneously address long-term preference alignment and short-term instruction adaptation.
## Method
### Overall Architecture
DMW adopts SimLingo (based on InternVL2-1B) as the VLA backbone. The inputs include front-camera images, navigation targets, a user profile, and language instructions; the output is personalized driving actions (throttle / brake / steering).
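To make the interface concrete, here is a hypothetical sketch of the policy's inputs and outputs; the field names and tensor shapes are illustrative assumptions, not SimLingo's actual API:

```python
from dataclasses import dataclass
import torch

@dataclass
class PolicyInput:
    """Inputs to the personalized VLA policy (names/shapes are illustrative)."""
    front_image: torch.Tensor     # (B, 3, H, W) front-camera RGB
    nav_target: torch.Tensor      # (B, 2) navigation waypoint in ego frame
    user_embedding: torch.Tensor  # (B, D) long-term preference embedding z_p
    instructions: list[str]       # B short-term utterances, e.g. "I'm tired"

@dataclass
class PolicyOutput:
    """Personalized low-level driving actions."""
    throttle: torch.Tensor  # (B,) in [0, 1]
    brake: torch.Tensor     # (B,) in [0, 1]
    steer: torch.Tensor     # (B,) in [-1, 1]
```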
### Key Designs
**Personalized Driving Dataset (PDD):**
- 30 real drivers with diverse backgrounds were recruited.
- Each driver completed 20 standardized scenarios (overtaking, merging, intersections, pedestrian crossings, etc.).
- Data were collected in CARLA using a Logitech steering wheel and pedals.
- Recorded information: ego-vehicle motion state, surrounding perception (vehicles / pedestrians / cyclists / roadside hazards), and traffic context (traffic lights / speed limits / route).
- Each driver completed a structured questionnaire (demographics, driving history, travel purpose, etc.) as their profile.
- Expert target speed (PDM-Lite) was recorded; the deviation between human speed and expert speed serves as a style descriptor.
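As a toy illustration of the style descriptor just mentioned (the assumed form is the per-step speed deviation from the expert; data and units here are invented):

```python
import numpy as np

# Toy sketch: the per-step deviation between human speed and the PDM-Lite
# expert target speed as a scalar style signal (values are invented).
rng = np.random.default_rng(0)
expert_speed = rng.uniform(3.0, 9.0, size=200)               # expert targets (m/s)
human_speed = expert_speed + rng.normal(0.8, 0.5, size=200)  # a faster-than-expert driver

style_deviation = human_speed - expert_speed                 # style descriptor per step
print(f"mean deviation: {style_deviation.mean():+.2f} m/s")  # > 0 suggests aggressive
```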
**Long-Term Preference Learning and Alignment:**
- User embedding learning: Contrastive learning is used to establish a shared latent space between profile embeddings and behavior embeddings.
- Profile encoder \(f_p(\cdot)\): DeBERTaV3 + projection head → user embedding \(z_p^m\)
- Behavior encoder \(f_b(\cdot)\): Temporal encoder + multi-head self-attention processing a past \(k\)-step trajectory window → behavior embedding \(z_{b,t}^m\)
- InfoNCE contrastive loss: pulls together the profile and behavior embeddings of the same driver while pushing apart those of different drivers.
- Preference alignment: The user embedding \(z_p^m\) is injected into the VLA policy and further adapted through reinforcement fine-tuning.
- Data augmentation: The driver \(u\) whose embedding is least similar to the target driver is selected; augmented actions are generated by scaling according to the action statistics ratio: \(\tilde{a}_t^m = \frac{\bar{a}^m}{\bar{a}^u} \cdot a_t^m\).
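A minimal PyTorch sketch of the InfoNCE objective and the augmentation rule above; the temperature value and the symmetric formulation are assumptions on top of what the notes state:

```python
import torch
import torch.nn.functional as F

def info_nce(z_profile: torch.Tensor, z_behavior: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of M drivers.

    z_profile  (M, D): profile embeddings z_p^m from the DeBERTaV3 encoder.
    z_behavior (M, D): behavior embeddings z_b^m from the temporal encoder.
    Row m of both tensors belongs to the same driver (the positive pair).
    """
    z_p = F.normalize(z_profile, dim=-1)
    z_b = F.normalize(z_behavior, dim=-1)
    logits = z_p @ z_b.t() / temperature                    # (M, M) similarity matrix
    targets = torch.arange(z_p.size(0), device=z_p.device)  # diagonal = same driver
    # Pull matched profile/behavior pairs together, push other drivers apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def augment_actions(a_m: torch.Tensor, a_bar_m: float, a_bar_u: float) -> torch.Tensor:
    """Augmentation rule from the notes: a~_t^m = (a_bar^m / a_bar^u) * a_t^m,
    where u is the driver whose embedding is least similar to driver m."""
    return (a_bar_m / a_bar_u) * a_m
```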
**Personalization via Reinforcement Fine-Tuning (GRPO):**
- Group Relative Policy Optimization is employed.
- Residual decoder: Learnable residual query tokens are injected into the language model, outputting discrete residual adjustments (speed delta + steering delta).
- Final action = base action + personalized residual: \(a_t = a_t^{base} + a_t^\Delta\)
- This design injects personalized expression while preserving safe planning.
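A sketch of the residual composition described above, assuming the discrete residuals are decoded from per-bin logits; the bin count and delta ranges are invented for illustration:

```python
import torch

NUM_BINS = 21          # assumed discretization of each residual head
SPEED_DELTA_MAX = 2.0  # assumed max |speed delta|, m/s
STEER_DELTA_MAX = 0.1  # assumed max |steering delta|

def decode_residual(speed_logits: torch.Tensor, steer_logits: torch.Tensor):
    """Map residual-query-token logits of shape (B, NUM_BINS) to continuous deltas."""
    bins = torch.linspace(-1.0, 1.0, NUM_BINS, device=speed_logits.device)
    d_speed = bins[speed_logits.argmax(dim=-1)] * SPEED_DELTA_MAX
    d_steer = bins[steer_logits.argmax(dim=-1)] * STEER_DELTA_MAX
    return d_speed, d_steer

def personalize(base_speed, base_steer, speed_logits, steer_logits):
    """a_t = a_t^base + a_t^Delta: the residual rides on top of the safe base plan."""
    d_speed, d_steer = decode_residual(speed_logits, steer_logits)
    return base_speed + d_speed, base_steer + d_steer
```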
**Style-Aware Reward Adaptation:**
- Weighted reward: \(\mathcal{R}(s_t, a_t) = w_s \cdot R_{safety} + w_e \cdot R_{efficiency} + w_c \cdot R_{comfort}\)
- Safety reward: Based on time-to-collision (TTC): \(R_{safety} = \mathbb{I}(TTC_t \geq \beta_{safety})\)
- Efficiency reward: \(R_{efficiency} = \exp(-\alpha \cdot |v_t - v_{pref}|)\)
- Comfort reward: Steering and acceleration do not exceed predefined thresholds.
- Reward parameters (weights, thresholds, preferred speed) are dynamically adjusted based on language instructions and scene context.
- Parameters are initialized via GPT-5 inference and refined through expert review.
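Putting the reward together as a runnable sketch; the indicator-style comfort term and every default threshold below are illustrative assumptions consistent with the description above:

```python
import math

def style_reward(ttc: float, speed: float, accel: float, steer_rate: float,
                 w_s: float, w_e: float, w_c: float, v_pref: float,
                 beta_safety: float = 2.0, alpha: float = 0.5,
                 accel_max: float = 2.5, steer_rate_max: float = 0.3) -> float:
    """R = w_s * R_safety + w_e * R_efficiency + w_c * R_comfort."""
    r_safety = 1.0 if ttc >= beta_safety else 0.0        # I(TTC_t >= beta_safety)
    r_efficiency = math.exp(-alpha * abs(speed - v_pref))
    r_comfort = 1.0 if (abs(accel) <= accel_max and
                        abs(steer_rate) <= steer_rate_max) else 0.0
    return w_s * r_safety + w_e * r_efficiency + w_c * r_comfort

# e.g. "I'm going to be late for work" -> shift weight toward efficiency
# and raise the preferred speed:
r = style_reward(ttc=3.0, speed=8.5, accel=1.2, steer_rate=0.05,
                 w_s=0.4, w_e=0.5, w_c=0.1, v_pref=9.0)
```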
## Loss & Training
- User embedding training: AdamW, weight decay 1e-3, lr 1e-4.
- After the preference encoder converges, it is frozen; the motion predictor and residual decoder are then fine-tuned.
- LoRA is used to adapt Qwen2-0.5B.
- 8× A6000 GPUs, per-GPU batch size 8.
- GRPO samples 4 responses per input.
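A sketch of this training setup; the LoRA rank and target modules, and the group-advantage normalization typically used with GRPO, are standard choices assumed here rather than details from the paper:

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA adaptation of the Qwen2-0.5B language model (rank/targets assumed).
base_llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
policy = get_peft_model(base_llm, LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

# The notes report AdamW with lr 1e-4 / weight decay 1e-3 for user-embedding
# training; reusing the same settings for the policy is an assumption.
optimizer = AdamW(policy.parameters(), lr=1e-4, weight_decay=1e-3)

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style group-relative advantages.
    rewards: (B, G) style-aware returns for G = 4 rollouts of the same input."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```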
## Key Experimental Results
### Main Results
Bench2Drive closed-loop driving metrics (DS = Driving Score, SR = Success Rate):
| Method | Style | DS | SR | Efficiency | Comfort | Speed | TT |
|---|---|---|---|---|---|---|---|
| SimLingo | Aggressive | 78.56 | 65.83 | 247.60 | 18.61 | 7.66 | 25.35 |
| SimLingo | Conservative | 78.18 | 65.56 | 238.77 | 26.99 | 7.21 | 33.02 |
| DMW | Aggressive | 79.50 | 67.36 | 281.56 | 21.62 | 7.72 | 26.93 |
| DMW | Conservative | 82.72 | 71.56 | 237.06 | 34.62 | 6.18 | 47.38 |
Switching from Conservative to Aggressive, DMW improves Efficiency by 18.77% (237.06 → 281.56) while its DS declines by only 3.89% (82.72 → 79.50); SimLingo's corresponding Efficiency gain is just 3.70% (238.77 → 247.60), i.e., much weaker style differentiation.
Long-term preference alignment (user study; AS = Alignment Score, for in-distribution drivers D1/D2 and out-of-distribution drivers D3/D4):
| Method | AS (ID) D1/D2 | AS (OOD) D3/D4 | Ratings (ID) | Ratings (OOD) |
|---|---|---|---|---|
| MORL-PD | 0.42/0.58 | 0.25/0.33 | 5.1/6.2 | 3.9/3.5 |
| DMW | 0.92/0.92 | 0.83/0.83 | 8.7/8.3 | 7.8/8.0 |
### Ablation Study
Adaptive Average Pooling (AAP) ablation:
| Driver | w/ AAP | w/o AAP | Notes |
|---|---|---|---|
| D1 AS | 0.92 | 0.67 | Higher alignment score with AAP |
| D2 AS | 0.92 | 0.58 | |
| D3 AS | 0.83 | 0.25 | Larger gap in OOD setting |
Averaged across drivers:

| Configuration | Key Metrics | Notes |
|---|---|---|
| w/o AAP | AS avg. 0.50, Ratings avg. 5.5 | Global mean pooling reduces embedding expressiveness |
| w/ AAP | AS avg. 0.88, Ratings avg. 8.2 | Semantically important embeddings are preserved |
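To make the ablation concrete, a sketch of the two pooling variants applied to the temporal encoder's output; the slot count `k` is an assumption:

```python
import torch
import torch.nn as nn

# seq: (B, T, D) per-timestep features from the temporal behavior encoder.
seq = torch.randn(4, 120, 256)

# w/o AAP: global mean pooling collapses the whole window into one vector,
# averaging away short, style-revealing events (hard braking, sharp turns).
global_mean = seq.mean(dim=1)                      # (B, D)

# w/ AAP: adaptive average pooling keeps k temporal slots, so distinct
# phases of the trajectory survive into the behavior embedding.
k = 8  # assumed slot count
aap = nn.AdaptiveAvgPool1d(k)
pooled = aap(seq.transpose(1, 2)).transpose(1, 2)  # (B, k, D)
behavior_embedding = pooled.flatten(1)             # (B, k * D)
```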
## Key Findings
- DMW achieves effective style differentiation while maintaining safety: DS/SR is highest under the Conservative setting, and efficiency improves significantly under the Aggressive setting.
- Long-term preference alignment generalizes to OOD drivers: Alignment Score reaches 0.83 for unseen drivers D3/D4.
- Policy behavior differs significantly across driver conditions: Aggressive drivers (D1/D4) exhibit higher speed and acceleration, while conservative drivers (D2/D3) maintain larger following distances.
- Short-term instructions can be superimposed on long-term preferences: The two personalization dimensions are orthogonal and complementary.
## Highlights & Insights
- Decoupling long- and short-term preferences: Driving personalization is decomposed into long-term habits (user embeddings) and short-term intent (language instructions) — a design that is both concise and effective.
- Residual action design: \(a_t = a_t^{base} + a_t^\Delta\) superimposes personalization on a safe base action, mitigating the risks of fully end-to-end approaches.
- Real driver data: The PDD dataset, collected from 30 real participants driving in CARLA, exhibits greater behavioral diversity than synthetic data.
- Interpretability of style-aware rewards: The mapping from language instructions to safety/efficiency/comfort weights is transparent and traceable.
- GRPO reinforcement fine-tuning: Achieves better specialization to individual styles compared to pure behavioral cloning.
## Limitations & Future Work
- Validated in CARLA simulation only: Real-world performance is unknown, and the sim-to-real gap may be substantial.
- Limitations of profile questionnaires: Driving style may change dynamically with mood and road conditions; a static profile cannot fully capture such variations.
- Diversity of the 30-driver sample: The sample size is limited, and whether it covers the diversity of global driving cultures remains questionable.
- Safety risks in Aggressive mode: Lowering the TTC threshold in Aggressive mode may introduce safety hazards.
- Computational overhead: The real-time feasibility of VLA + GRPO + user embedding inference is not adequately discussed.
## Related Work & Insights
- SimLingo: Serves as the VLA backbone, providing foundational language-vision-action capabilities.
- Talk2Drive: A pioneer in language-driven personalization, but validated only in simple scenarios.
- MAVERIC: Learns a latent space for diverse socially-aware driving behaviors.
- StyleDrive: A comparison method that injects fixed style conditions into the policy.
- Insight: The paradigm of preference alignment combined with reinforcement fine-tuning may be applicable to other embodied AI tasks requiring personalization, such as robotic manipulation.
## Rating
- Novelty: ⭐⭐⭐⭐ — The long/short-term preference decoupling and GRPO + residual design are creative, though the overall approach is not a fundamental breakthrough.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Closed-loop evaluation plus user study, but limited to CARLA simulation.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-designed experiments, though notation is dense.
- Value: ⭐⭐⭐⭐ — Personalized driving addresses a genuine practical need, and the PDD dataset has potential for reuse.