MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

Conference: CVPR 2026
arXiv: 2511.18810
Code: None
Area: Robotics / Embodied Intelligence
Keywords: VLA model merging, multi-skill robotics, sparse LoRA masking, action expert redesign, test-time task routing

TL;DR

This paper presents the first systematic diagnosis of two root causes underlying the non-mergeability of VLA models—LoRA selfish parameter conflicts and task coupling induced by self-attention in action experts—and proposes MergeVLA. By combining task-mask sparse LoRA activation, self-attention-free action experts, and training-free test-time task routing, MergeVLA merges multiple single-skill VLA specialists into a unified generalist agent, achieving a 90.2% success rate on LIBERO and 90% on the real-robot SO101 platform.

Background & Motivation

Background: Vision-Language-Action (VLA) models fine-tune large-scale VLMs on millions of robot demonstration trajectories and achieve strong performance in single-task or single-embodiment settings. However, a truly generalist real-world agent must support diverse skills, embodiments, and environments. A natural approach is to merge multiple independently fine-tuned VLA specialists into a unified policy.

Limitations of Prior Work: Model merging is well-established for LLMs and VLMs (e.g., Task Arithmetic, TIES, DARE), yet directly applying these methods to VLAs causes the merged success rate to collapse to 0%—a failure mode never observed in LLM merging.

Root Cause Diagnosis (a core contribution of this paper):

LoRA Selfish Parameter Problem: After fine-tuning four LIBERO tasks with LoRA, more than 75% of parameters are "selfish"—retained exclusively by a single task's mask—indicating that different tasks push LoRA updates into highly disjoint directions. Naively averaging or sign-based merging activates irrelevant or conflicting parameters, corrupting the shared visual-language subspace.

Action Expert Architecture Incompatibility: Even when the VLM backbone is merged perfectly, simply averaging action expert weights still yields 0% success. The root cause lies in the action experts being trained from scratch with self-attention layers; self-attention allows task-specific information to accumulate across layers, causing deep-layer parameters to become highly task-specialized and non-recomposable.

Key Insight: Since the problem stems from architectures that are inherently non-mergeable, the solution is to design VLA architectures that are inherently mergeable from the ground up.

Method

Overall Architecture

MergeVLA consists of three complementary components:

  1. Task-Mask Sparse LoRA (resolves LoRA parameter conflicts in the VLM backbone)
  2. Self-Attention-Free Action Expert (resolves action expert architectural incompatibility)
  3. Training-Free Test-Time Task Routing (resolves unknown task identity at inference time)

The backbone VLM is Qwen2.5-0.5B; the action expert is redesigned from the VLA-Adapter architecture, yielding a total of approximately 0.7B parameters.

Key Design 1: Task-Mask Sparse LoRA

Problem: Merging \(M\) tasks' LoRA updates \(\tau_m = \Theta_m - \Theta_0\) into \(\tau_{\text{merge}}\) introduces pervasive conflicts.

Solution: A binary mask \(\mathbf{S}_m\) is constructed for each task \(m\) to selectively activate the merged parameters beneficial to that task:

\[\Theta_{\text{merge}}^{(m)} = \Theta_0 + \mathbf{S}_m \odot \tau_{\text{merge}}\]

Masks are generated via a parameter-level consistency test:

\[\mathbf{S}_m = \mathbb{I}\left[|\tau_m| > \lambda |\tau_{\text{merge}} - \tau_m|\right]\]

Intuitively, a parameter is retained only when its task-specific update is large relative to its deviation from the merged update—that is, when the two agree in magnitude and direction; \(\lambda\) controls sparsity. Empirically, \(\lambda = 0.6\) performs best. A beneficial side effect is that masked-out LoRA parameters revert to the pre-trained weights, thereby preserving the original visual-language representations.
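The two equations above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and a simple average stands in for the merged task vector (the paper also supports TIES-style merging at this step):

```python
import numpy as np

def merge_with_task_masks(theta0, task_thetas, lam=0.6):
    """Sketch of Task-Mask Sparse LoRA merging (hypothetical helper).

    theta0: pre-trained parameter vector Theta_0
    task_thetas: fine-tuned parameter vectors Theta_m, one per task
    Returns one masked merged parameter vector per task.
    """
    # Task vectors: tau_m = Theta_m - Theta_0
    taus = [theta - theta0 for theta in task_thetas]
    # Merged task vector (plain average here; TIES etc. also possible).
    tau_merge = np.mean(taus, axis=0)
    merged = []
    for tau_m in taus:
        # Consistency test: keep parameters whose task update dominates
        # its deviation from the merged update.
        mask = (np.abs(tau_m) > lam * np.abs(tau_merge - tau_m)).astype(theta0.dtype)
        # Masked-out entries revert to the pre-trained weights Theta_0.
        merged.append(theta0 + mask * tau_merge)
    return merged
```

Note that entries where the mask is zero fall back to \(\Theta_0\) exactly, which is the side effect described above.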

Key Design 2: Self-Attention-Free Action Expert

Problem: The VLA-Adapter action expert consists of \(L\) Transformer blocks (self-attention + cross-attention + FFN) trained from scratch. Self-attention propagates task dependencies across layers, causing inter-task parameter distances to grow explosively in deeper layers.

Two Architectural Modifications:

  • Remove self-attention: Only the cross-attention pathway is retained, forcing the expert to rely on the robust shared features provided by the VLM rather than its own highly task-specialized representations learned from scratch.
  • Sigmoid replaces tanh gating: The original tanh gate can produce negative values that suppress VLM signals; sigmoid ensures VLM information is always preserved and positively propagated.
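The two modifications combine into a block where action tokens only cross-attend to VLM features through a strictly positive gate. The sketch below is a single-head toy version (no FFN, LayerNorm, or multi-head machinery) written to show the gating logic, not the paper's exact block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_attention_block(x, vlm_feats, Wq, Wk, Wv, gate):
    """One self-attention-free expert block (single head, simplified).

    x: action tokens (n_act x dim); vlm_feats: VLM features (n_vlm x dim).
    gate: learnable scalar; sigmoid(gate) lies in (0, 1), so the VLM
    signal is never negated or fully suppressed, unlike a tanh gate.
    """
    q = x @ Wq                    # queries from action tokens only
    k = vlm_feats @ Wk            # keys from VLM features
    v = vlm_feats @ Wv            # values from VLM features
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # Residual update scaled by a strictly positive sigmoid gate.
    return x + sigmoid(gate) * (attn @ v)
```

Because there is no self-attention term, information flows only from the VLM into the action tokens; task-specific structure cannot accumulate across the expert's own token interactions.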

Layer-wise Merging Strategy: Shallow block parameters exhibit small inter-task differences and can be directly averaged. Deep layers—typically the final block, termed the expert head—are highly specialized due to regression targets and are therefore not merged; each task retains its own dedicated expert head.
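The layer-wise strategy amounts to a simple split over the expert's state dict: average everything except the head, and keep one head per task. A hedged sketch with hypothetical parameter naming:

```python
import numpy as np

def merge_action_experts(task_experts, head_keys):
    """Layer-wise merge sketch (parameter names are illustrative).

    task_experts: list of {param_name: array} dicts, one per task.
    head_keys: names of parameters in the final 'expert head' block.
    Returns (shared, heads): averaged shallow parameters plus one
    dedicated head per task.
    """
    shared = {}
    for name in task_experts[0]:
        if name not in head_keys:
            # Shallow blocks: inter-task differences are small -> average.
            shared[name] = np.mean([e[name] for e in task_experts], axis=0)
    # Deep expert heads are highly task-specialized -> keep per task.
    heads = [{k: e[k] for k in head_keys} for e in task_experts]
    return shared, heads
```

At inference, the router (described below) picks which per-task head to attach to the shared trunk.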

Unexpected Benefit: The self-attention-free design improves out-of-distribution (OOD) performance on LIBERO-Plus by 13.4% over VLA-Adapter, demonstrating that leveraging the VLM's pre-trained robustness is more effective than learning task-specific representations from scratch.

Key Design 3: Training-Free Test-Time Task Routing

When task identity is unknown at inference time, the router must automatically select the corresponding task mask and expert head from initial observations.

Procedure:

  1. For each candidate task \(m\), run the VLM with mask \(\mathbf{S}_m\) to obtain hidden states.
  2. Apply SVD to the value projection matrix of the merged action expert's \(l\)-th block and retain the top \(k_r = 8\) right singular vectors to form the principal subspace.
  3. Project each task's hidden state onto this subspace and compute a response score \(r_m\).
  4. Select the task with the highest softmax score and fix the corresponding mask and expert head for the entire episode.

Design Choice: Empirical results demonstrate that the value projection (V) subspace is more stable and discriminative than the key projection (K)—V encodes actual behavioral semantics, while K defines query similarity structure and is more prone to collapsing into task-specialized subspaces.
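The routing procedure can be sketched as follows. The specific response score (mean projected energy) is an assumption for illustration; the paper's exact scoring may differ:

```python
import numpy as np

def route_task(hidden_states, W_v, k_r=8):
    """Training-free routing sketch (score definition is assumed).

    hidden_states: list over candidate tasks m of the VLM hidden-state
    matrix (tokens x dim) obtained under mask S_m.
    W_v: value projection matrix (dim x dim) of the merged expert's
    chosen block.
    Returns (selected task index, softmax scores).
    """
    # Principal subspace: top-k_r right singular vectors of W_v.
    _, _, vt = np.linalg.svd(W_v)
    basis = vt[:k_r]                          # (k_r, dim)
    scores = []
    for h in hidden_states:
        proj = h @ basis.T                    # project onto subspace
        # Response score: mean energy captured by the subspace.
        scores.append(np.mean(np.sum(proj ** 2, axis=-1)))
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over tasks
    return int(np.argmax(probs)), probs
```

The selected mask and expert head are then held fixed for the rest of the episode, so the SVD and projections are paid only once at \(t = 0\).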

Loss & Training

  • Each task is fine-tuned independently (LoRA + action expert trained from scratch); 50 demonstrations per task; single NVIDIA A6000 (48 GB) GPU.
  • The merging phase is entirely offline: merge LoRA → compute masks → average shallow action expert blocks → retain expert heads.
  • The router requires no training; it is based purely on SVD parameter subspace analysis.
  • Default hyperparameters: \(l = L\), \(k_r = 8\), \(\lambda = 0.6\), \(\alpha = 1\).

Key Experimental Results

Main Results: LIBERO Success Rate (%)

| Method | Spatial | Object | Goal | Long | Avg. |
| --- | --- | --- | --- | --- | --- |
| OpenVLA (independent fine-tuning) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| VLA-Adapter (independent fine-tuning) | 99.6 | 99.6 | 98.2 | 96.4 | 98.5 |
| MergeVLA (independent fine-tuning) | 98.0 | 98.6 | 95.0 | 95.0 | 96.7 |
| OpenVLA + TA (full merge) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| OpenVLA + TA + Mask | 74.2 | 82.6 | 68.8 | 24.0 | 62.4 |
| VLA-Adapter + TA + Mask | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MergeVLA + TIES + Mask | 94.8 | 94.6 | 91.8 | 79.4 | 90.2 |
| MergeVLA + TA + Mask | 98.0 | 98.8 | 85.4 | 76.6 | 89.7 |

OOD Robustness: LIBERO-Plus Success Rate (%)

| Method | BG | View | Instr. | Light | Layout | Robot State | Noise | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| π₀ (independent fine-tuning) | 81.4 | 13.8 | 58.8 | 85.0 | 68.9 | 6.9 | 79.0 | 56.3 |
| VLA-Adapter (independent fine-tuning) | 76.6 | 36.4 | 73.8 | 71.0 | 70.2 | 37.4 | 57.2 | 59.0 |
| MergeVLA (independent fine-tuning) | 92.7 | 62.4 | 75.7 | 92.7 | 73.7 | 46.4 | 74.7 | 72.4 |
| MergeVLA + TIES Merge | 85.7 | 50.7 | 66.0 | 84.2 | 68.1 | 30.3 | 66.0 | 62.5 |

Ablation Study: Routing Subspace Selection (LIBERO Success Rate %)

| Subspace | Spatial | Object | Goal | Long | Avg. |
| --- | --- | --- | --- | --- | --- |
| K only | 98.0 | 0.0 | 39.6 | 76.6 | 53.6 |
| K & V | 98.0 | 0.0 | 85.8 | 76.6 | 65.1 |
| V only | 98.0 | 98.8 | 85.4 | 76.6 | 89.7 |

Real-Robot SO101 Success Rate (%)

| Method | Pick & Place | Push | Stack | Avg. |
| --- | --- | --- | --- | --- |
| Independent fine-tuning | 90.0 | 85.0 | 95.0 | 90.0 |
| MergeVLA + TA | 70.0 | 70.0 | 60.0 | 66.7 |
| MergeVLA + TIES | 90.0 | 90.0 | 90.0 | 90.0 |

Key Findings

  • Both fixes are necessary: Applying masks alone without architectural changes (VLA-Adapter + TA + Mask) still yields 0%, and modifying the architecture alone without masks likewise fails.
  • Removing self-attention yields unexpected OOD gains: MergeVLA with independent fine-tuning already outperforms VLA-Adapter on LIBERO-Plus by 13.4%.
  • Merged ≈ independently fine-tuned: TIES merging fully matches independent fine-tuning performance on the real robot (90% vs. 90%).
  • Cross-embodiment generalization: On RoboTwin with 3 bimanual robot types × 3 tasks, TIES merging achieves 70.7%.
  • Routing accuracy: Value projection subspace routing substantially outperforms key projection (89.7% vs. 53.6%).
  • Mask sparsity: \(\lambda \in [0.6, 0.9]\) is optimal; too small admits conflicting parameters, too large discards useful information.

Highlights & Insights

  1. Systematic diagnosis of "VLA non-mergeability": This work is the first to identify LoRA selfish parameters (>75%) and self-attention task coupling as two independent root causes; the diagnostic contribution alone is significant.
  2. Architecture as mergeability: Rather than designing better merging algorithms, MergeVLA eliminates non-mergeability at the architectural level—an elegant approach generalizable to other multimodal domains.
  3. Unexpected OOD benefit from removing self-attention: A modification originally motivated by mergeability turns out to improve OOD generalization, suggesting that having action experts "trust the VLM" is more robust than learning independently from scratch.
  4. Training-free routing: Using SVD principal components for task discrimination is both elegant and practical, requiring no additional training data or auxiliary networks.
  5. End-to-end validation from simulation to real robot: Four-tier evaluation across LIBERO, LIBERO-Plus, RoboTwin, and real-robot SO101 covers cross-task, cross-environment, and cross-embodiment dimensions.

Limitations & Future Work

  1. No online incremental merging: Adding a new task requires recomputing masks and re-merging; plug-and-play extension is not supported.
  2. Limited VLM scale: Only Qwen2.5-0.5B is evaluated; applicability to larger models (e.g., 7B+) remains unexplored.
  3. Linear growth in expert heads: Each task retains a dedicated expert head, leading to parameter redundancy as the number of tasks scales.
  4. Potential limitations for long-horizon tasks: Removing self-attention may constrain tasks requiring extended temporal reasoning; current evaluation covers only relatively short-horizon tabletop manipulation.
  5. Routing depends on initial observation: Routing is performed solely from the \(t=0\) observation; if the initial frame is not sufficiently discriminative, routing may fail.
  6. Future directions: Online incremental mask updates; lightweight learnable routers to replace SVD; validation on large-scale VLMs; expert head compression or sharing.
Comparison with Related Work

  • vs. Task Arithmetic / TIES: These methods are effective for LLMs/VLMs but fail completely for VLAs (0%); MergeVLA restores their applicability through architectural redesign.
  • vs. jointly trained multi-task VLAs (OpenVLA, π₀): Joint training requires retraining on all data simultaneously, whereas MergeVLA merges weights entirely offline without access to the original training data.
  • vs. VLA-Adapter: Self-attention in the action expert causes non-mergeability; MergeVLA replaces it with cross-attention and demonstrates superior performance.
  • vs. ReVLA: ReVLA uses merging to address visual forgetting; MergeVLA uses merging to achieve multi-skill capability—the objectives are distinct.
  • Takeaway: Model merging in the VLA domain is far from mature; architectural design has a decisive impact on downstream mergeability—a lesson with broad implications for the modular design of all multimodal models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic solution to VLA merging; diagnosis and design philosophy are both elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three simulation benchmarks + real robot + extensive ablations + OOD evaluation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Diagnosis → solution → validation follows a rigorous logical arc with clear figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Provides a viable lightweight pathway for multi-skill scaling in embodied AI.