# MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

- Conference: CVPR 2026
- arXiv: 2511.18810
- Code: Project Page
- Area: Robotic Manipulation / VLA Models / Model Merging
- Keywords: Vision-Language-Action, model merging, multi-task robotics, task mask, cross-skill generalization
## TL;DR
MergeVLA diagnoses two root causes of VLA model unmergeability—LoRA parameter conflicts and architectural incompatibility induced by self-attention in the action expert—and addresses them via sparsely activated task masks and a self-attention-free action expert architecture. This enables training-free merging of multiple single-task VLA experts, achieving 90.2% success on LIBERO and 90.0% on a real-robot SO101 platform.
## Background & Motivation
Background: VLA (Vision-Language-Action) models achieve strong single-task performance by fine-tuning VLMs for robotic manipulation, but fail to generalize across multiple tasks. Model merging has proven effective in the LLM/VLM literature.
Limitations of Prior Work:
- Directly merging VLA experts causes success rates to drop to zero—a failure mode never observed in LLM/VLM merging.
- When merging four tasks, over 75% of LoRA parameters are "selfish" (retained by only one task), resulting in severe parameter conflicts.
- Self-attention layers in the action expert accumulate strong task-specific dependencies during training, causing explosive growth in parameter distances across deep blocks and undermining modular composability.
Key Challenge: LoRA parameters diverge drastically across tasks, and self-attention in the action expert propagates task-specific information throughout all layers—their combined effect renders existing merging methods completely ineffective.
Goal: Design a VLA architecture that is "merging-friendly" by construction, enabling efficient merging of single-task experts into a generalist model.
Key Insight: Precisely diagnose the two root causes of failure before designing targeted solutions—task masks to resolve parameter conflicts, and self-attention removal to resolve architectural incompatibility.
Core Idea: Suppress conflicting LoRA parameters via sparsely activated task masks, and eliminate task-dependency propagation by removing self-attention from the action expert, making VLA models inherently mergeable.
## Method

### Overall Architecture
MergeVLA builds on the VLA-Adapter architecture (Qwen2.5-0.5B as the VLM backbone) with three key modifications: (1) task-specific binary masks applied to VLM LoRA parameters to resolve parameter conflicts; (2) removal of all self-attention layers from the action expert, retaining only cross-attention; (3) a training-free task router at inference time to identify the current task. Each task is fine-tuned independently; the merging stage is entirely training-free.
### Key Designs
- **Task Mask (Resolving LoRA Parameter Conflicts)**
    - Function: constructs a binary mask per task to selectively activate merged parameters consistent with that task and suppress conflicting ones.
    - Mechanism: for each parameter position, the method checks whether the task vector aligns directionally and significantly with the merged vector: \(S_m = \mathbf{I}[|\tau_m| > \lambda \cdot |\tau_{\text{merge}} - \tau_m|]\), where \(\lambda\) controls the tolerance threshold (sketched below).
    - Practical Effect: when merging four tasks, over 75% of parameters are selfish; the mask retains beneficial parameters, suppresses conflicting ones, and encourages some parameters to revert to the pre-trained weights, mitigating visual forgetting.
    - Design Motivation: naive merging activates parameters irrelevant or contradictory to the current task; the mask enables selective parameter activation.
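A minimal sketch of how such masks could be computed, under my reading of the formula above. The merged vector is built here with a plain sum (task arithmetic) as a stand-in for TIES/WUDI, and applying the mask as `theta_pretrained + mask * tau_merge` is my assumption about how masked activation is used; names and signatures are illustrative, not the authors' implementation.

```python
import torch

def build_task_masks(task_vectors: list[torch.Tensor], lam: float = 0.8):
    """Sketch of per-task binary masks S_m = 1[|tau_m| > lam * |tau_merge - tau_m|].

    task_vectors: one flattened task vector (LoRA delta) per task.
    Returns the merged vector and one boolean mask per task.
    """
    # Simple task-arithmetic merge (sum of task vectors); TIES/WUDI would refine this step.
    tau_merge = torch.stack(task_vectors).sum(dim=0)
    masks = []
    for tau_m in task_vectors:
        # Keep a coordinate only if this task's own contribution dominates what the
        # other tasks add at that position; lam controls the tolerance.
        keep = tau_m.abs() > lam * (tau_merge - tau_m).abs()
        masks.append(keep)
    return tau_merge, masks

# Assumed usage at inference for task m:
#   theta_m = theta_pretrained + masks[m] * tau_merge
# Positions where the mask is zero fall back to the pre-trained weights.
```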
- **Self-Attention-Free Action Expert (Resolving Architectural Incompatibility)**
    - Function: redesigns the action expert architecture to be inherently mergeable.
    - Mechanism: (a) removes all self-attention layers, retaining only cross-attention, forcing the expert to rely on the VLM's robust representations; (b) replaces tanh gating with sigmoid gating to prevent negative activations from suppressing VLM signals (a block sketch follows below).
    - Merging: shallow blocks are merged via weight averaging; the final layer (expert head) remains task-specific and is not merged.
    - Design Motivation: self-attention accumulates task-specific bias during training from scratch and propagates it across layers. Its removal enforces reliance on pre-trained VLM features, which improves generalization (+13.4% on OOD).
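A minimal sketch of what a cross-attention-only expert block with sigmoid gating might look like. The layer sizes, normalization placement, gate initialization, and the use of `nn.MultiheadAttention` are my assumptions; the point is only that action tokens attend to VLM features and the gate stays in (0, 1).

```python
import torch
import torch.nn as nn

class CrossAttentionOnlyBlock(nn.Module):
    """Expert block with no self-attention: action tokens can only attend to VLM features."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Learnable gate; sigmoid keeps it in (0, 1), so the VLM signal is never negated
        # the way a tanh gate (range (-1, 1)) could be.
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, action_tokens: torch.Tensor, vlm_feats: torch.Tensor) -> torch.Tensor:
        # Queries come from action tokens; keys/values come from the (frozen) VLM features.
        attn_out, _ = self.cross_attn(
            self.norm_q(action_tokens), self.norm_kv(vlm_feats), self.norm_kv(vlm_feats)
        )
        action_tokens = action_tokens + torch.sigmoid(self.gate) * attn_out
        return action_tokens + self.mlp(action_tokens)
```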
- **Test-Time Task Router (Training-Free Task Inference)**
    - Function: automatically identifies the current task at inference time when task identity is unknown, selecting the corresponding mask and expert head.
    - Mechanism: for each candidate task \(m\), construct a VLM variant using the corresponding task mask → extract hidden states → project onto the top-\(k_r\) right singular vector subspace of the action expert's value matrix → compute activation strength → select the highest-scoring task via softmax (see the sketch below).
    - Routing is performed once at \(t=0\) and fixed thereafter.
    - Design Motivation: the value subspace directly encodes task-dependent information and is more stable and discriminative than query or key subspaces.
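A rough sketch of the value-subspace routing score as I understand it. How the hidden states are pooled, the exact activation-strength measure (a Frobenius norm here), and the function signature are all assumptions for illustration.

```python
import torch

def route_task(hidden_states_per_task: list[torch.Tensor],
               value_weight: torch.Tensor, k_r: int = 16) -> int:
    """Pick the task whose masked-VLM hidden states activate the value subspace most strongly.

    hidden_states_per_task[m]: (num_tokens, dim) hidden states from the VLM variant
        built with task m's mask, collected once at t = 0.
    value_weight: (dim_out, dim) value projection matrix of the merged action expert.
    """
    # The top-k_r right singular vectors of W_v span the input directions the expert "reads".
    _, _, vh = torch.linalg.svd(value_weight, full_matrices=False)
    subspace = vh[:k_r]                      # (k_r, dim)
    scores = []
    for h in hidden_states_per_task:
        proj = h @ subspace.T                # project tokens onto the value subspace
        scores.append(proj.norm())           # activation strength (assumed: Frobenius norm)
    probs = torch.softmax(torch.stack(scores), dim=0)
    return int(probs.argmax())               # routing decision, kept fixed for the episode
```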
### Loss & Training
- Each task is trained independently with standard imitation learning (30k–50k steps, batch size 8, LoRA rank 32).
- The merging stage is entirely training-free: LoRA weights are merged via TIES/TA/WUDI; shallow action expert blocks are merged by weight averaging; task-specific expert heads are preserved.
- Hardware: Single NVIDIA A6000 Ada 48GB GPU.
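A schematic of the training-free merging stage described above: average the shallow action-expert blocks across single-task experts and keep each task's expert head un-merged. The state-dict layout and the `head.` prefix are hypothetical; the LoRA side (TIES/TA/WUDI plus task masks) is sketched earlier and omitted here.

```python
import copy
import torch

def merge_experts(expert_state_dicts: list[dict], head_prefix: str = "head."):
    """Average shared action-expert weights; keep per-task expert heads task-specific."""
    merged = copy.deepcopy(expert_state_dicts[0])
    for name in merged:
        if name.startswith(head_prefix):
            continue  # expert heads are not merged
        # Element-wise weight averaging across all single-task experts.
        merged[name] = torch.stack([sd[name] for sd in expert_state_dicts]).mean(dim=0)
    # Collect the task-specific heads so the router can pick one at inference time.
    task_heads = [{k: v for k, v in sd.items() if k.startswith(head_prefix)}
                  for sd in expert_state_dicts]
    return merged, task_heads
```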
## Key Experimental Results

### Main Results
| Method | Dataset | Avg. Success Rate (%) | Comparison |
|---|---|---|---|
| MergeVLA (TIES+Mask) | LIBERO (4 suites) | 90.2 | Single-task upper bound 96.7% (−6.5 pp) |
| MergeVLA | LIBERO-Plus (OOD) | 62.5 | VLA-Adapter 59.0% (single-task fine-tuning) |
| MergeVLA | RoboTwin (cross-embodiment) | 70.7 | Single-task upper bound 76.0% (−5.3 pp) |
| MergeVLA | SO101 Real Robot (3 tasks) | 90.0 | On par with single-task fine-tuning |

Per-suite success rates on LIBERO (%):

| Method | Params (B) | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|---|
| OpenVLA (TA merge) | 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| OpenVLA (TA+Mask) | 7 | 74.2 | 82.6 | 68.8 | 24.0 | 62.4 |
| VLA-Adapter (TA+Mask) | 0.68 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| MergeVLA (TIES+Mask) | 0.70 | 94.8 | 94.6 | 91.8 | 79.4 | 90.2 |
### Ablation Study
| Configuration | LIBERO Avg | Notes |
|---|---|---|
| Mask only (no action expert modification) | 0.0% | Mask necessary but insufficient |
| Self-attention removal only (no mask) | 65.5% | Architecture modification effective but requires mask |
| Self-attention removal + Mask | 90.2% | Both components are indispensable |
| \(\lambda = 0.6\)–\(0.9\) | >70% | Optimal tolerance range |
| Routing via Value | 89.7% | Most stable |
| Routing via Key | Significant drop | Zero success on certain tasks |
| Self-attention removal (LIBERO-Plus OOD) | +13.4% | This modification alone substantially improves generalization |
### Key Findings
- Task mask and self-attention removal are both indispensable: the former resolves LoRA parameter conflicts in the VLM, the latter resolves the action expert's architectural incompatibility.
- Removing self-attention alone improves OOD performance by 13.4%, indicating that self-attention is the primary bottleneck for generalization.
- Value subspace routing substantially outperforms query/key-based alternatives.
- On the real robot, the merged model matches single-task fine-tuning (90.0%), demonstrating practical viability.
## Highlights & Insights
- The diagnosis-driven research paradigm is particularly elegant: root causes are experimentally identified with precision before targeted solutions are designed.
- The architectural modifications are minimal yet highly effective—removing self-attention and replacing the gating function substantially improves generalization.
- The test-time task router is entirely training-free, leveraging SVD of the value subspace for task discrimination.
- On the real SO101 robot, the merged model matches single-task fine-tuning performance (90%), demonstrating strong practical value.
## Limitations & Future Work
- Each task still requires a dedicated expert head and task mask; storage grows linearly with the number of tasks.
- The VLM backbone is limited to Qwen2.5-0.5B; effectiveness on larger models (7B+) remains unverified.
- Routing is performed only once at \(t=0\), which may be insufficient for long-horizon tasks requiring mid-sequence skill switching.
- Cross-embodiment experiments are conducted at a relatively small scale (3 robot types); scalability to large-scale heterogeneous merging remains to be validated.
## Related Work & Insights
- vs. OpenVLA: Direct merging fails completely (0%), as task conflicts in the LM body cannot be resolved by simple methods. MergeVLA circumvents this via task masks.
- vs. VLA-Adapter: Self-attention renders the action expert incomposable; even adding masks fails. MergeVLA eliminates the architectural obstacle by design.
- vs. pi0/pi0.5: Large-scale VLAs rely on joint training for multi-task capability at high cost. MergeVLA permits independent training followed by merging, offering greater flexibility.
- Insight: "Removing self-attention improves generalization" deserves broader attention—similar phenomena may exist in other modules trained from scratch, and extension to continual skill learning scenarios is worth exploring.
## Rating
- Novelty: ⭐⭐⭐⭐ — The diagnosis-plus-design paradigm is well-structured, though individual technical components are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three simulation benchmarks, real-robot experiments, and extensive ablations and analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ — Narrative logic is clear, progressing systematically from diagnosis to solution.
- Value: ⭐⭐⭐⭐ — Addresses a critical obstacle in VLA merging with practical implications for multi-skill robot learning.