AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots¶

Conference: CVPR 2026
arXiv: 2603.07648
Institution: Sun Yat-sen University, Peng Cheng Laboratory, Yinwang Intelligence
Area: Robotic Manipulation / Vision-Language-Action Models
Keywords: VLA, Atomic Skills, Mixture-of-Experts, Continual Learning, Task Planning, Skill Routing

TL;DR¶

This paper proposes AtomicVLA, a unified planning-execution framework built upon π₀ that adaptively switches between Think and Act modes to generate atomic skill abstractions, and employs a Skill-Guided MoE (SG-MoE) to route actions to specialized experts. The approach improves LIBERO-LONG success rate from 85.2% to 95.2% (+10.0 percentage points over π₀), gains 18.3 points on real-world Franka long-horizon tasks (AtomicVLA* vs. π₀.₅), and leads π₀.₅ by 21.0 points in average success rate after skill expansion in the continual learning experiment (82.0% vs. 61.0%).

Background & Motivation¶

Existing Bottleneck: Current VLA models (π₀, OpenVLA, etc.) employ a single action decoder that conflates all skill knowledge within the same parameter space. They lack explicit planning capabilities for multi-step long-horizon tasks and suffer from catastrophic forgetting when learning new skills.
Limitations of Prior Work: Two-stage methods such as SayCan and Inner Monologue rely on an external LLM for high-level planning and independent low-level controllers for execution, but the planner and executor lack mutual awareness, resulting in stale or irrelevant instructions and system-level latency.
Core Problem: A robot model must simultaneously support (1) high-level reasoning and task planning, (2) fine-grained action generation, and (3) scalable continual learning — requirements that existing methods cannot jointly satisfy.
Mechanism: Complex tasks are decomposed into reusable atomic skills (e.g., grasp, push, rotate), each handled by a dedicated expert, enabling continual expansion of the skill library through modular design.

Method¶

Overall Architecture¶

AtomicVLA constructs an end-to-end framework on top of π₀ that unifies thinking and acting modalities. Given multi-camera observations \(O_t^{1:n}\) and a language instruction \(\ell\), the model first adaptively predicts whether to enter a thinking or acting mode:

Thinking Mode: Activated at task initiation or sub-skill transitions; produces three outputs — a task chain \(C_{0 \to k}\) (decomposing the instruction into ordered sub-goals), current progress \(C_t\), and an atomic skill abstraction \(\sigma\).
Acting Mode: Activated during normal execution; conditions on the latest skill abstraction \(\sigma\) and proprioceptive state \(s_t\) to generate an action chunk \(A_t\) via SG-MoE.

Key Designs¶

1. Adaptive Think-Act Switching Mechanism

Function: Enables the model to autonomously decide "whether to reason or act," avoiding the computational overhead of planning at every step.
Core Idea: Two special output tokens, [think] and [act], are introduced. At each decision step, the model first predicts the identifier: [think] triggers high-level planning to generate the task chain and atomic skill labels, while [act] triggers action chunk output.
Design Motivation: Planning is only needed at task initiation and sub-skill transitions; intermediate execution steps can directly generate actions. This adaptive switching is more efficient than fixed-period planning and more robust than pure action prediction without any planning. When execution fails (e.g., object drop), the model automatically detects the anomaly, re-triggers thinking, generates a new skill abstraction, and resumes execution.
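The switching behavior described above can be illustrated with a toy control loop. This is a minimal sketch, not the authors' implementation: `ToyPolicy`, its rule-based `predict_mode`, and the observation keys `skill_done` and `dropped` are hypothetical stand-ins for what the real model predicts from images and language.

```python
from dataclasses import dataclass, field
from typing import Optional

THINK, ACT = "[think]", "[act]"


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the unified model (not the authors' code)."""
    task_chain: list = field(default_factory=lambda: ["pick", "place"])
    progress: int = 0
    skill: Optional[str] = None

    def predict_mode(self, obs):
        # Think at task start, at a sub-skill transition, or after a failure.
        if self.skill is None or obs.get("skill_done") or obs.get("dropped"):
            return THINK
        return ACT

    def think(self, obs):
        # Advance progress on a finished sub-skill, then emit the current skill.
        if obs.get("skill_done"):
            self.progress = min(self.progress + 1, len(self.task_chain) - 1)
        self.skill = self.task_chain[self.progress]
        return self.skill

    def act(self, obs):
        # A real model would return an action chunk conditioned on the skill.
        return f"chunk<{self.skill}>"


def run_episode(policy, observations):
    """Return the (mode, output) pair chosen at each decision step."""
    trace = []
    for obs in observations:
        mode = policy.predict_mode(obs)
        output = policy.think(obs) if mode == THINK else policy.act(obs)
        trace.append((mode, output))
    return trace


steps = [{}, {}, {"dropped": True}, {}, {"skill_done": True}, {}]
for mode, output in run_episode(ToyPolicy(), steps):
    print(mode, output)
```

In this toy run, thinking happens only at the first step, after the simulated drop, and at the skill transition; every other step goes straight to action generation.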

2. Skill-Guided Mixture-of-Experts Architecture (SG-MoE)

Function: Routes action generation for different atomic skills to their respective specialized experts, minimizing inter-skill interference.
Core Idea: SG-MoE comprises three components:
- Skill Router: Makes routing decisions based on the atomic skill abstraction \(\sigma\). A noise-schedule-style embedding is adopted: each atomic skill is mapped to a scalar noise level \(\sigma \in [0, 100]\), which is transformed into a high-dimensional vector \(Z_\sigma = E(\text{norm}(\log(\sigma)))\) via a learnable embedding function. The router computes an expert probability distribution over \(Z_\sigma\) and activates the top-1 skill expert.
- Shared Expert: Retains the pretrained π₀ weights to preserve general action generation capability; all tokens pass through this expert.
- Atomic Skill Experts: Each expert specializes in one atomic skill, with specialization emerging naturally through training.
Design Motivation: Conventional token-level MoE routing still causes each expert to learn a mixture of skills without explicit specialization. SG-MoE routes using semantically explicit atomic skill labels, ensuring that all action tokens belonging to the same skill are processed by the same expert, thereby reducing inter-skill interference.
Output Fusion: The final action is a weighted combination of the shared expert and the activated skill expert: \(F_{out} = (1-w_k) \cdot F_{share}(x_t) + w_k \cdot F_k(x_t)\)
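The routing and fusion steps can be sketched in a few lines of dependency-free Python. Several details are assumptions rather than facts from the paper: the embedding \(E\) is modeled as a learnable linear map, \(\text{norm}(\log(\sigma))\) is taken to be \(\log(\sigma)/\log(100)\) (so \(\sigma\) must be strictly positive here), \(w_k\) is taken to be the router probability of the selected expert, and every expert is a plain linear layer with random weights.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


class SGMoE:
    """Toy skill-guided MoE layer; shapes and initialization are illustrative."""

    def __init__(self, num_skills, x_dim, z_dim=4, sigma_max=100.0, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.sigma_max = sigma_max
        self.embed = rand(z_dim, 1)            # assumed form of E: scalar -> z_dim
        self.router = rand(num_skills, z_dim)  # router logits over skill experts
        self.shared = rand(x_dim, x_dim)       # stands in for the pretrained expert
        self.experts = [rand(x_dim, x_dim) for _ in range(num_skills)]

    def route(self, sigma):
        # Assumed normalization of log(sigma); requires sigma > 0.
        u = math.log(sigma) / math.log(self.sigma_max)
        z_sigma = matvec(self.embed, [u])
        probs = softmax(matvec(self.router, z_sigma))
        k = max(range(len(probs)), key=probs.__getitem__)  # top-1 expert
        return k, probs[k]

    def forward(self, x, sigma):
        k, w_k = self.route(sigma)
        f_share = matvec(self.shared, x)
        f_k = matvec(self.experts[k], x)
        # F_out = (1 - w_k) * F_share(x) + w_k * F_k(x)
        return [(1 - w_k) * s + w_k * e for s, e in zip(f_share, f_k)], k


moe = SGMoE(num_skills=5, x_dim=3)
out, k = moe.forward([0.1, -0.2, 0.3], sigma=20.0)
print("expert", k, "output", out)
```

Because the router sees only the skill level and never the token `x`, every action token generated under the same skill reaches the same expert, which is the property that distinguishes this scheme from token-level routing.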

3. Modular Continual Learning (Skill Expansion)

Function: Enables continual acquisition of new skills without forgetting existing ones.
Core Idea: When a new atomic skill is introduced, three steps are taken: (a) a new, randomly initialized skill expert is added; (b) the skill router's embedding space is extended to cover the new skill label; and (c) all existing expert parameters are frozen, so that only the new expert and the updated router branch are trained.
Design Motivation: Full fine-tuning of conventional VLAs causes catastrophic forgetting (e.g., π₀.₅ loses 15 percentage points on average across existing skills after learning a new one). The "add-only, no-modify" modular strategy avoids forgetting at the architectural level; the router is initialized by copying existing weights to ensure smooth integration.
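The add-only procedure amounts to bookkeeping over which parameters remain trainable. The sketch below is illustrative only: `Param`, `SkillLibrary`, and the `freeze_existing` flag are hypothetical names, and copying an existing router branch to initialize the new one is one possible reading of "initialized by copying existing weights".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Param:
    values: list
    trainable: bool = True


@dataclass
class SkillLibrary:
    experts: dict = field(default_factory=dict)  # skill -> expert weights
    router: dict = field(default_factory=dict)   # skill -> router branch weights

    def add_skill(self, name, dim=4, seed=0, freeze_existing=True):
        rng = random.Random(seed)
        if freeze_existing:
            # (c) Freeze everything learned so far.
            for p in list(self.experts.values()) + list(self.router.values()):
                p.trainable = False
        # (a) New expert, randomly initialized.
        self.experts[name] = Param([rng.gauss(0, 0.02) for _ in range(dim)])
        # (b) Extend the router. Old branches are kept as they are; the new
        # branch starts as a copy of an existing one (assumption).
        if self.router:
            template = next(iter(self.router.values())).values
            self.router[name] = Param(list(template))
        else:
            self.router[name] = Param([0.0] * dim)

    def trainable(self):
        names = [f"expert:{k}" for k, p in self.experts.items() if p.trainable]
        names += [f"router:{k}" for k, p in self.router.items() if p.trainable]
        return names


lib = SkillLibrary()
for i, skill in enumerate(["grasp", "stack", "close", "press"]):
    lib.add_skill(skill, seed=i, freeze_existing=False)  # base skills, trained jointly
lib.add_skill("open", seed=99)                           # skill expansion
print(lib.trainable())
```

After the expansion step, only the entries for the new skill remain trainable, so gradient updates cannot alter the parameters that encode earlier skills.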

4. High-Quality Embodied Reasoning Data Generation

Function: Provides high-quality training data (task chains, progress annotations, and skill labels) for the thinking mode.
Core Idea: A two-stage data construction pipeline is employed. First, principal-axis motion analysis is applied to robot demonstration trajectories (comparing the magnitudes of translational/rotational components and tracking gripper states) to automatically segment trajectories into atomic skill segments (e.g., sustained z-axis descent + gripper closing → "pick" action). Second, InternVideo2.5 is used to semantically annotate each segment, automatically correcting and enriching the initial atomic action labels.
Design Motivation: Traditional approaches rely on VLMs for video understanding or optical flow features for segmentation, which are prone to ambiguity and noise and require extensive manual post-processing. Decomposition based on kinematic principal-axis analysis is more precise and less dependent on human annotation.
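The first stage of the pipeline can be approximated with a simple per-step rule followed by run-length grouping. The sketch below is a simplification under stated assumptions: the step format `(dx, dy, dz, droll, dpitch, dyaw, gripper_change)`, the primitive labels, the threshold `eps`, and the direct comparison of translational and rotational magnitudes (which ignores their different units) are all hypothetical choices, and only the "descent + gripper closing → pick" rule comes from the description above.

```python
import math
from itertools import groupby


def primitive(step, eps=1e-3):
    """Label one step by its dominant motion component."""
    dx, dy, dz, droll, dpitch, dyaw, grip = step
    if grip < 0:
        return "close"
    if grip > 0:
        return "open"
    trans = math.sqrt(dx * dx + dy * dy + dz * dz)
    rot = math.sqrt(droll * droll + dpitch * dpitch + dyaw * dyaw)
    if max(trans, rot) < eps:
        return "idle"
    if rot > trans:
        return "rotate"
    # Principal translational axis and its direction.
    axis, value = max(zip("xyz", (dx, dy, dz)), key=lambda av: abs(av[1]))
    return f"{axis}{'+' if value > 0 else '-'}"


def segment(trajectory):
    """Group consecutive steps with the same label into (label, start, end)."""
    runs, start = [], 0
    for label, group in groupby(trajectory, key=primitive):
        length = len(list(group))
        runs.append((label, start, start + length))
        start += length
    return runs


def to_skills(runs):
    """Merge a z-axis descent followed by gripper closing into 'pick'."""
    skills, i = [], 0
    while i < len(runs):
        if i + 1 < len(runs) and runs[i][0] == "z-" and runs[i + 1][0] == "close":
            skills.append(("pick", runs[i][1], runs[i + 1][2]))
            i += 2
        else:
            skills.append(runs[i])
            i += 1
    return skills


down = (0, 0, -0.02, 0, 0, 0, 0)
close = (0, 0, 0, 0, 0, 0, -1)
up = (0, 0, 0.02, 0, 0, 0, 0)
turn = (0, 0, 0, 0, 0, 0.1, 0)
print(to_skills(segment([down] * 3 + [close] + [up] * 2 + [turn] * 2)))
```

In the described pipeline, coarse segments of this kind would then be passed to InternVideo2.5 for semantic correction and enrichment.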

Loss & Training¶

Think Mode: Cross-entropy loss over the predicted token sequence of task chains, progress, and skill labels.
Act Mode: Flow matching loss (inherited from π₀) for continuous action chunk prediction.
Total Loss: \(\mathcal{L}_{total} = \mathcal{L}_{think} + \mathcal{L}_{act}\)
Expert Configuration: 5 skill experts for LIBERO and real-robot experiments; 8 skill experts for CALVIN.
Two Variants: AtomicVLA is built on π₀; AtomicVLA* is built on π₀.₅.
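The two terms of the total loss can be made concrete with a small numeric sketch. The flow matching term below uses one common convention (linear interpolation from noise at \(\tau = 0\) to data at \(\tau = 1\), with target velocity "action minus noise"); this summary does not state the exact parameterization, so treat that choice, along with the function names, as an assumption.

```python
import math
import random


def cross_entropy(logits_seq, target_ids):
    """Mean token-level cross-entropy for the Think-mode text outputs."""
    total = 0.0
    for logits, target in zip(logits_seq, target_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def flow_matching_loss(velocity_fn, actions, noise, tau):
    """Squared error between predicted and target velocity at time tau."""
    x_tau = [(1 - tau) * n + tau * a for n, a in zip(noise, actions)]
    target = [a - n for n, a in zip(noise, actions)]  # assumed convention
    pred = velocity_fn(x_tau, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(actions)


rng = random.Random(0)
actions = [0.5, -0.1, 0.3]                      # one flattened action chunk
noise = [rng.gauss(0, 1) for _ in actions]
tau = rng.random()

l_think = cross_entropy([[2.0, 0.1, -1.0], [0.0, 1.5, 0.2]], [0, 1])
l_act = flow_matching_loss(lambda x, t: [0.0] * len(x), actions, noise, tau)
l_total = l_think + l_act                       # L_total = L_think + L_act
print(l_think, l_act, l_total)
```

The lambda passed as `velocity_fn` is a placeholder that always predicts zero velocity; in the real model this role is played by the action expert conditioned on observations and the skill abstraction.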

Key Experimental Results¶

Main Results: LIBERO Benchmark (Success Rate %)¶

Method	Spatial	Object	Goal	Long	Avg.
Octo	78.9	85.7	84.6	51.1	75.1
OpenVLA	84.9	88.4	79.2	53.7	76.5
CoT-VLA	87.5	91.6	87.6	69.0	81.1
π₀	96.4	98.8	95.8	85.2	94.2
π₀.₅	98.8	98.2	98.0	92.4	96.9
AtomicVLA	96.8	98.0	96.4	95.2	96.6
AtomicVLA*	98.8	98.8	97.2	96.2	97.8

Main Results: CALVIN ABC→D (Long-Horizon Sequence Completion Rate %)¶

Method	Task 1	Task 2	Task 3	Task 4	Task 5	Avg. Len↑
π₀	94.3	87.0	77.9	68.5	59.4	3.87
π₀.₅	91.9	84.6	79.4	75.5	71.0	4.02
AtomicVLA	95.0	87.8	81.9	75.0	69.1	4.09
AtomicVLA*	94.1	88.7	85.2	81.7	77.6	4.27
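The Avg. Len column follows directly from the per-task columns: the expected number of consecutively completed tasks equals the sum over \(k\) of the probability of completing at least \(k\) tasks in a row. A short check against the table, with ASCII keys standing in for the method names:

```python
calvin = {
    "pi0": [94.3, 87.0, 77.9, 68.5, 59.4],
    "pi0.5": [91.9, 84.6, 79.4, 75.5, 71.0],
    "AtomicVLA": [95.0, 87.8, 81.9, 75.0, 69.1],
    "AtomicVLA*": [94.1, 88.7, 85.2, 81.7, 77.6],
}


def avg_len(rates_percent):
    """Average completed sequence length from cumulative success rates (%)."""
    return sum(rates_percent) / 100.0


for name, rates in calvin.items():
    print(f"{name}: {avg_len(rates):.2f}")
```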

Real Franka Robot: Long-Horizon Tasks (Success Rate %)¶

Method	Object-to-Plate	Object-to-Drawer	Object-to-Microwave	Avg.	ΔAvg. (vs. base model)
π₀	45	55	10	36.7	—
π₀.₅	65	35	35	45.0	—
AtomicVLA	65	60	45	56.7	+20.0
AtomicVLA*	75	60	55	63.3	+18.3
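The ΔAvg. values are consistent with comparing each variant against the model it is built on (AtomicVLA against π₀, AtomicVLA* against π₀.₅). A quick recomputation from the per-task rates, with ASCII keys standing in for the method names:

```python
franka = {
    "pi0": [45, 55, 10],
    "pi0.5": [65, 35, 35],
    "AtomicVLA": [65, 60, 45],
    "AtomicVLA*": [75, 60, 55],
}
base_of = {"AtomicVLA": "pi0", "AtomicVLA*": "pi0.5"}


def mean(values):
    return sum(values) / len(values)


def delta_vs_base(name):
    """Gain in average success rate over the corresponding base model."""
    return mean(franka[name]) - mean(franka[base_of[name]])


for name in base_of:
    print(f"{name}: avg {mean(franka[name]):.1f}, delta {delta_vs_base(name):+.1f}")
```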

Continual Learning: Skill Expansion Experiment (Success Rate %)¶

Method	Grasp	Stack	Close	Press	Open (New)	Avg.	ΔAvg. (existing skills)
π₀.₅ (Baseline)	85	65	70	90	—	77.5	—
π₀.₅ (Continual Learning)	70	45	60	75	55	61.0	−15.0
AtomicVLA* (Baseline)	95	80	70	100	—	86.3	—
AtomicVLA* (Continual Learning)	90	80	80	100	70	82.0	−1.3

Ablation Study: SG-MoE Routing Mechanism (LIBERO-LONG %)¶

Method	LIBERO-LONG
π₀ (No MoE)	85.2
+ Standard Token-Level MoE	88.6 (+3.4)
+ MoDE (Denoising-Step Routing)	89.5 (+4.3)
+ SG-MoE (Atomic Skill Routing)	95.2 (+10.0)

Key Findings¶

Most Significant Gain on LIBERO-LONG: AtomicVLA improves success rate by 10.0 percentage points over π₀ (85.2% → 95.2%) on the most challenging long-horizon task suite, indicating that explicit planning combined with skill decomposition is critical for multi-step tasks.
SG-MoE Substantially Outperforms Standard MoE: Standard token-level MoE and MoDE yield only +3.4 and +4.3 points respectively, whereas SG-MoE achieves +10.0 points. The former two still route at the token level, causing experts to learn mixed skills, while SG-MoE ensures all tokens of the same skill are processed by the same expert.
Near-Zero Forgetting in Continual Learning: π₀.₅ loses 15.0 points on average across existing skills after learning a new one (Stack drops by 20 points, from 65% to 45%), whereas AtomicVLA* declines by only 1.3 points, addressing forgetting at the architectural level.
Inter-Skill Interference in Mixed Training: Tasks with conflicting gripper state requirements interfere with each other when trained jointly (e.g., drawer opening does not require gripper closure, which interferes with grasping tasks); SG-MoE effectively mitigates this through skill isolation.
Error Recovery Capability: AtomicVLA can automatically detect anomalies during execution and re-plan (e.g., regenerating a skill abstraction after an object drop), although the CALVIN evaluation protocol does not credit completions achieved after recovery, potentially underestimating the true capability.

Highlights & Insights¶

Unified Think-Act Paradigm: Rather than naively appending chain-of-thought, planning is deeply coupled with atomic skill abstraction — the output of thinking directly drives the MoE routing decision, forming a closed loop between planning and execution within a single model.
Noise-Schedule-Style Skill Embedding: Borrowing from diffusion model design, discrete skill labels are mapped into a continuous embedding space for routing — an elegant design that experiments confirm to be superior to general token-level routing.
"Add-Only" Continual Learning: Rather than relying on regularization (e.g., EWC) or experience replay, old experts are frozen and new experts are appended at the architectural level — a simple yet effective strategy.
Principal-Axis Analysis for Data Generation: Kinematic principal-axis analysis replaces VLM-based video understanding for atomic action segmentation, leveraging physical priors for greater reliability and less reliance on manual annotation.

Limitations & Future Work¶

The quality of atomic skill labels depends on the generation capability of InternVideo2.5, which may produce inaccurate annotations for rare or highly specialized manipulations.
SG-MoE adopts top-1 routing; actions requiring simultaneous coordination of multiple skills (e.g., "push while rotating") may benefit from top-k routing.
Each new skill adds one expert, leading to linear parameter growth over time — no expert merging or pruning mechanism is provided.
Gains on CALVIN (+0.22 avg. len for AtomicVLA over π₀, +0.25 for AtomicVLA* over π₀.₅) are relatively modest compared to LIBERO, potentially because CALVIN task granularity is less tightly aligned with atomic skills.
Inter-skill interference from heterogeneous task mixing (e.g., conflicting gripper states) is mitigated but not fully eliminated by SG-MoE.

Related Work & Insights¶

vs. π₀/π₀.₅: These models rely on pure flow matching action prediction without explicit planning. AtomicVLA adds thinking and skill routing on top, with +10.0 points on LIBERO-LONG (over π₀) supporting the value of planning.
vs. SayCan/Inner Monologue: These approaches separate planning (external LLM) from execution (independent controller), creating a modality gap. AtomicVLA unifies both within a single model.
vs. MoDE: MoDE uses the denoising timestep as the routing signal, which is still inherently token-level routing. SG-MoE routes using semantically explicit skill labels; its 5.7-point margin on LIBERO-LONG (95.2% vs. 89.5%) supports the superiority of skill-level routing.
Generality of MoE Skill Routing: The noise-schedule-style embedding routing could plausibly transfer to multi-task NLP settings, using task description embeddings to route experts.
Modular Continual Learning Paradigm: The strategy of freezing old experts and appending new ones may also apply to domain-incremental continual pretraining of large vision models.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The unified Think-Act paradigm, SG-MoE noise-schedule routing, and modular continual learning form three self-consistent innovations.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers four LIBERO subsets, CALVIN, real Franka robot, ablation studies, and continual learning experiments with comprehensive data.
Writing Quality: ⭐⭐⭐⭐ Motivation–method–experiment logic is clear; the SG-MoE architecture diagram is intuitive and the algorithmic pseudocode is well-structured.
Value: ⭐⭐⭐⭐⭐ Makes important contributions to continual learning and long-horizon task planning for VLAs; the SG-MoE design offers broadly transferable insights.