AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

Conference: CVPR 2026 arXiv: 2603.07648 Institution: Sun Yat-sen University, Peng Cheng Laboratory, Yinwang Intelligence Area: Robotic Manipulation / Vision-Language-Action Models Keywords: VLA, Atomic Skills, Mixture-of-Experts, Continual Learning, Task Planning, Skill Routing

TL;DR

This paper proposes AtomicVLA, a unified planning-execution framework built upon π₀ that adaptively switches between Think and Act modes to generate atomic skill abstractions, and employs a Skill-Guided MoE (SG-MoE) to route actions to specialized experts. The approach raises the LIBERO-LONG success rate from 85.2% to 95.2% (+10.0 points), gains +18.3 points on real-world Franka long-horizon tasks, and +21.0 points on a continual learning benchmark.

Background & Motivation

  • Existing Bottleneck: Current VLA models (π₀, OpenVLA, etc.) employ a single action decoder that conflates all skill knowledge within the same parameter space. They lack explicit planning capabilities for multi-step long-horizon tasks and suffer from catastrophic forgetting when learning new skills.
  • Limitations of Prior Work: Two-stage methods such as SayCan and Inner Monologue rely on an external LLM for high-level planning and independent low-level controllers for execution, but the planner and executor lack mutual awareness, resulting in stale or irrelevant instructions and system-level latency.
  • Core Problem: A robot model must simultaneously support (1) high-level reasoning and task planning, (2) fine-grained action generation, and (3) scalable continual learning — requirements that existing methods cannot jointly satisfy.
  • Mechanism: Complex tasks are decomposed into reusable atomic skills (e.g., grasp, push, rotate), each handled by a dedicated expert, enabling continual expansion of the skill library through modular design.

Method

Overall Architecture

AtomicVLA constructs an end-to-end framework on top of π₀ that unifies thinking and acting modalities. Given multi-camera observations \(O_t^{1:n}\) and a language instruction \(\ell\), the model first adaptively predicts whether to enter a thinking or acting mode:

  • Thinking Mode: Activated at task initiation or sub-skill transitions; produces three outputs — a task chain \(C_{0 \to k}\) (decomposing the instruction into ordered sub-goals), current progress \(C_t\), and an atomic skill abstraction \(\sigma\).
  • Acting Mode: Activated during normal execution; conditions on the latest skill abstraction \(\sigma\) and proprioceptive state \(s_t\) to generate an action chunk \(A_t\) via SG-MoE.

Key Designs

1. Adaptive Think-Act Switching Mechanism

  • Function: Enables the model to autonomously decide "whether to reason or act," avoiding the computational overhead of planning at every step.
  • Core Idea: Two special output tokens, [think] and [act], are introduced. At each decision step, the model first predicts the identifier: [think] triggers high-level planning to generate the task chain and atomic skill labels, while [act] triggers action chunk output.
  • Design Motivation: Planning is only needed at task initiation and sub-skill transitions; intermediate execution steps can directly generate actions. This adaptive switching is more efficient than fixed-period planning and more robust than pure action prediction without any planning. When execution fails (e.g., object drop), the model automatically detects the anomaly, re-triggers thinking, generates a new skill abstraction, and resumes execution.
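
The switching mechanism above can be sketched as a simple dispatch loop. This is a minimal illustration, not the authors' API: the model interface (`predict_mode`/`think`/`act`), the `ExecState` container, and the toy model are all hypothetical stand-ins; only the [think]/[act] token logic follows the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ExecState:
    skill: Optional[float] = None              # latest atomic skill abstraction sigma
    task_chain: List[str] = field(default_factory=list)

def control_step(model, obs, instruction, state):
    """One decision step: the model first emits [think] or [act]."""
    if model.predict_mode(obs, state) == "[think]":
        chain, progress, sigma = model.think(obs, instruction)  # C_{0->k}, C_t, sigma
        state.task_chain, state.skill = chain, sigma            # cache plan for acting
        return None                                             # no action this step
    return model.act(obs, state.skill)                          # action chunk A_t

# Toy model: thinks once at task start (no cached skill), then acts.
class ToyModel:
    def predict_mode(self, obs, state):
        return "[think]" if state.skill is None else "[act]"
    def think(self, obs, instruction):
        return (["reach", "grasp", "place"], 0, 10.0)
    def act(self, obs, sigma):
        return [0.0] * 7  # dummy 7-DoF action chunk

state = ExecState()
m = ToyModel()
assert control_step(m, None, "pick cup", state) is None        # first step: plan
assert control_step(m, None, "pick cup", state) == [0.0] * 7   # then act
```

An execution failure would be handled by having `predict_mode` emit [think] again, regenerating the skill abstraction before execution resumes.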

2. Skill-Guided Mixture-of-Experts Architecture (SG-MoE)

  • Function: Routes action generation for different atomic skills to their respective specialized experts, minimizing inter-skill interference.
  • Core Idea: SG-MoE comprises three components:
    • Skill Router: Makes routing decisions based on the atomic skill abstraction \(\sigma\). A noise-schedule-style embedding is adopted: each atomic skill is mapped to a scalar noise level \(\sigma \in [0, 100]\), which is transformed into a high-dimensional vector \(Z_\sigma = E(\text{norm}(\log(\sigma)))\) via a learnable embedding function. From \(Z_\sigma\), the router computes a probability distribution over experts and activates the top-1 skill expert.
    • Shared Expert: Retains the pretrained π₀ weights to preserve general action generation capability; all tokens pass through this expert.
    • Atomic Skill Experts: Each expert specializes in one atomic skill, with specialization emerging naturally through training.
  • Design Motivation: Conventional token-level MoE routing still causes each expert to learn a mixture of skills without explicit specialization. SG-MoE routes using semantically explicit atomic skill labels, ensuring that all action tokens belonging to the same skill are processed by the same expert, thereby reducing inter-skill interference.
  • Output Fusion: The final action is a weighted combination of the shared expert and the activated skill expert: \(F_{out} = (1-w_k) \cdot F_{share}(x_t) + w_k \cdot F_k(x_t)\)
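
A compact sketch of the routing and fusion path described above. The embedding matrix `E`, the router weights, and the normalization constant are illustrative assumptions (here random stand-ins for learned parameters); only the noise-schedule mapping \(Z_\sigma = E(\text{norm}(\log \sigma))\), top-1 selection, and the fusion rule \(F_{out} = (1-w_k) F_{share}(x) + w_k F_k(x)\) follow the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS = 16, 5

E = rng.normal(size=(1, D))               # learnable embedding (random stand-in)
W_router = rng.normal(size=(D, N_EXPERTS))

def route(sigma):
    """Map a scalar skill label sigma in (0, 100] to a top-1 expert and its weight."""
    z = np.log(sigma + 1e-6) / np.log(100.0)   # normalized log noise level
    Z_sigma = z * E                             # (1, D) skill embedding
    logits = Z_sigma @ W_router
    probs = np.exp(logits - logits.max())       # softmax over experts
    probs /= probs.sum()
    k = int(probs.argmax())                     # activate top-1 skill expert
    return k, float(probs[0, k])

def sg_moe_forward(x, experts, shared, sigma):
    """Fuse the always-on shared expert with the routed skill expert."""
    k, w_k = route(sigma)
    return (1.0 - w_k) * shared(x) + w_k * experts[k](x)
```

Because routing depends only on the skill label, every action token of the same atomic skill deterministically reaches the same expert, which is the property the paper credits for reduced inter-skill interference.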

3. Modular Continual Learning (Skill Expansion)

  • Function: Enables continual acquisition of new skills without forgetting existing ones.
  • Core Idea: When a new atomic skill is introduced, only (a) a new skill expert (randomly initialized) is added, (b) the skill router's embedding space is extended to cover the new skill label, and (c) all existing expert parameters are frozen — only the new expert and the updated router branch are trained.
  • Design Motivation: Full fine-tuning of conventional VLAs causes catastrophic forgetting (e.g., π₀.₅ suffers an average 15% decline on existing skills after learning new ones). The "add-only, no-modify" modular strategy avoids forgetting at the architectural level; the router is initialized by copying existing weights to ensure smooth integration.
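
The "add-only" expansion step can be sketched as follows. The `SGMoE` container is a toy stand-in, not the paper's implementation; it only illustrates the three moves described above: freeze old experts, append a fresh one, and extend the router by copying existing weights.

```python
import numpy as np

class SGMoE:
    def __init__(self, dim, n_skills, rng):
        self.rng = rng
        self.experts = [rng.normal(size=(dim, dim)) for _ in range(n_skills)]
        self.trainable = [True] * n_skills
        self.router = rng.normal(size=(dim, n_skills))  # one column per skill

    def add_skill(self):
        """Expand for a new atomic skill without modifying old parameters."""
        self.trainable = [False] * len(self.experts)            # freeze old experts
        dim = self.router.shape[0]
        self.experts.append(self.rng.normal(size=(dim, dim)))   # new expert only
        self.trainable.append(True)                             # only it is trained
        # Extend the router: initialize the new column by copying an existing
        # one so routing starts near the current distribution.
        new_col = self.router[:, -1:].copy()
        self.router = np.concatenate([self.router, new_col], axis=1)

moe = SGMoE(dim=8, n_skills=5, rng=np.random.default_rng(0))
old = [e.copy() for e in moe.experts]
moe.add_skill()
assert len(moe.experts) == 6 and moe.trainable == [False] * 5 + [True]
assert all(np.array_equal(a, b) for a, b in zip(old, moe.experts[:5]))
```

Since old expert weights are never touched, forgetting is prevented structurally rather than by regularization.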

4. High-Quality Embodied Reasoning Data Generation

  • Function: Provides high-quality training data (task chains, progress annotations, and skill labels) for the thinking mode.
  • Core Idea: A two-stage data construction pipeline is employed. First, principal-axis motion analysis is applied to robot demonstration trajectories (comparing the magnitudes of translational/rotational components and tracking gripper states) to automatically segment trajectories into atomic skill segments (e.g., sustained z-axis descent + gripper closing → "pick" action). Second, InternVideo2.5 is used to semantically annotate each segment, automatically correcting and enriching the initial atomic action labels.
  • Design Motivation: Traditional approaches rely on VLMs for video understanding or optical flow features for segmentation, which are prone to ambiguity and noise and require extensive manual post-processing. Decomposition based on kinematic principal-axis analysis is more precise and less dependent on human annotation.
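
The descent-plus-grasp rule above can be illustrated with a toy segmenter. The threshold, window sizes, and the single "pick" rule are hypothetical simplifications; the paper's pipeline also compares rotational components and uses InternVideo2.5 to refine the labels afterward.

```python
import numpy as np

def segment_pick(positions, gripper, z_axis=2, descend_thr=0.01):
    """Return (start, end, label) spans where sustained z-descent ends in gripper closing."""
    dz = np.diff(positions[:, z_axis])
    descending = dz < -descend_thr                   # sustained downward motion
    closing = np.diff(gripper.astype(float)) < 0     # gripper open -> closed
    segments, start = [], None
    for t, d in enumerate(descending):
        if d and start is None:
            start = t                                # descent begins
        elif not d and start is not None:
            # label "pick" if the gripper closes near the end of the descent
            if closing[max(start, t - 2):t + 1].any():
                segments.append((start, t, "pick"))
            start = None
    return segments

# Toy trajectory: z descends for five steps, gripper closes at the bottom.
z = np.array([0.50, 0.48, 0.46, 0.44, 0.42, 0.40, 0.40, 0.40])
pos = np.stack([np.zeros(8), np.zeros(8), z], axis=1)
grip = np.array([1, 1, 1, 1, 1, 0, 0, 0])            # 1 = open, 0 = closed
print(segment_pick(pos, grip))                       # [(0, 5, 'pick')]
```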

Loss & Training

  • Think Mode: Cross-entropy loss over the predicted token sequence of task chains, progress, and skill labels.
  • Act Mode: Flow matching loss (inherited from π₀) for continuous action chunk prediction.
  • Total Loss: \(\mathcal{L}_{total} = \mathcal{L}_{think} + \mathcal{L}_{act}\)
  • Expert Configuration: 5 skill experts for LIBERO and real-robot experiments; 8 skill experts for CALVIN.
  • Two Variants: AtomicVLA is built on π₀; AtomicVLA* is built on π₀.₅.
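
The two-branch objective can be sketched directly. The velocity "model" below is a stand-in function, not a real network; the flow-matching target uses the standard straight-line interpolation path (consistent with π₀-style training), and the total loss is the unweighted sum given above.

```python
import numpy as np

def ce_loss(logits, targets):
    """Token-level cross-entropy over the think-mode output sequence."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_loss(v_model, a0, a1, t):
    """Regress the predicted velocity toward the straight-line target a1 - a0."""
    x_t = (1 - t) * a0 + t * a1          # point on the interpolation path
    target_v = a1 - a0                   # constant velocity along the path
    return float(((v_model(x_t, t) - target_v) ** 2).mean())

def total_loss(think_logits, think_targets, v_model, noise, actions, t):
    """L_total = L_think + L_act, as in the paper."""
    return ce_loss(think_logits, think_targets) + \
           flow_matching_loss(v_model, noise, actions, t)
```

A perfect velocity predictor drives the act-mode term to zero, leaving only the think-mode cross-entropy.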

Key Experimental Results

Main Results: LIBERO Benchmark (Success Rate %)

| Method | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|
| Octo | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 84.9 | 88.4 | 79.2 | 53.7 | 76.5 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| π₀ | 96.4 | 98.8 | 95.8 | 85.2 | 94.2 |
| π₀.₅ | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| AtomicVLA | 96.8 | 98.0 | 96.4 | 95.2 | 96.6 |
| AtomicVLA* | 98.8 | 98.8 | 97.2 | 96.2 | 97.8 |

Main Results: CALVIN ABC→D (Long-Horizon Sequence Completion Rate %)

| Method | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Avg. Len ↑ |
|---|---|---|---|---|---|---|
| π₀ | 94.3 | 87.0 | 77.9 | 68.5 | 59.4 | 3.87 |
| π₀.₅ | 91.9 | 84.6 | 79.4 | 75.5 | 71.0 | 4.02 |
| AtomicVLA | 95.0 | 87.8 | 81.9 | 75.0 | 69.1 | 4.09 |
| AtomicVLA* | 94.1 | 88.7 | 85.2 | 81.7 | 77.6 | 4.27 |

Real Franka Robot: Long-Horizon Tasks (Success Rate %)

| Method | Object-to-Plate | Object-to-Drawer | Object-to-Microwave | Avg. | ΔAvg. |
|---|---|---|---|---|---|
| π₀ | 45 | 55 | 10 | 36.7 | – |
| π₀.₅ | 65 | 35 | 35 | 45.0 | – |
| AtomicVLA | 65 | 60 | 45 | 56.7 | +20.0 |
| AtomicVLA* | 75 | 60 | 55 | 63.3 | +18.3 |

Continual Learning: Skill Expansion Experiment (Success Rate %)

| Method | Grasp | Stack | Close | Press | Open (New) | Avg. | ΔAvg. |
|---|---|---|---|---|---|---|---|
| π₀.₅ (Baseline) | 85 | 65 | 70 | 90 | – | 77.5 | – |
| π₀.₅ (Continual Learning) | 70 | 45 | 60 | 75 | 55 | 61.0 | −15.0 |
| AtomicVLA* (Baseline) | 95 | 80 | 70 | 100 | – | 86.3 | – |
| AtomicVLA* (Continual Learning) | 90 | 80 | 80 | 100 | 70 | 82.0 | −1.3 |

Ablation Study: SG-MoE Routing Mechanism (LIBERO-LONG %)

| Method | LIBERO-LONG |
|---|---|
| π₀ (No MoE) | 85.2 |
| + Standard Token-Level MoE | 88.6 (+3.4) |
| + MoDE (Denoising-Step Routing) | 89.5 (+4.3) |
| + SG-MoE (Atomic Skill Routing) | 95.2 (+10.0) |

Key Findings

  • Most Significant Gain on LIBERO-LONG: AtomicVLA achieves a 10% improvement on the most challenging long-horizon task suite, demonstrating that explicit planning combined with skill decomposition is critical for multi-step tasks.
  • SG-MoE Substantially Outperforms Standard MoE: Standard token-level MoE and MoDE yield only +3.4% and +4.3% respectively, whereas SG-MoE achieves +10% — because the former two still route at the token level, causing experts to learn mixed skills, while SG-MoE ensures all tokens of the same skill are processed by the same expert.
  • Near-Zero Forgetting in Continual Learning: π₀.₅ experiences an average 15% decline on existing skills after learning a new one (Stack drops by 20%), whereas AtomicVLA* declines by only 1.3%, resolving forgetting at the architectural level.
  • Inter-Skill Interference in Mixed Training: Tasks with conflicting gripper state requirements interfere with each other when trained jointly (e.g., drawer opening does not require gripper closure, which interferes with grasping tasks); SG-MoE effectively mitigates this through skill isolation.
  • Error Recovery Capability: AtomicVLA can automatically detect anomalies during execution and re-plan (e.g., regenerating a skill abstraction after an object drop), although the CALVIN evaluation protocol does not credit completions achieved after recovery, potentially underestimating the true capability.

Highlights & Insights

  • Unified Think-Act Paradigm: Rather than naively appending chain-of-thought, planning is deeply coupled with atomic skill abstraction — the output of thinking directly drives the MoE routing decision, forming a closed loop between planning and execution within a single model.
  • Noise-Schedule-Style Skill Embedding: Borrowing from diffusion model design, discrete skill labels are mapped into a continuous embedding space for routing — an elegant design that experiments confirm to be superior to general token-level routing.
  • "Add-Only" Continual Learning: Rather than relying on regularization (e.g., EWC) or experience replay, old experts are frozen and new experts are appended at the architectural level — a simple yet effective strategy.
  • Principal-Axis Analysis for Data Generation: Kinematic principal-axis analysis replaces VLM-based video understanding for atomic action segmentation, leveraging physical priors for greater reliability without requiring manual annotation.

Limitations & Future Work

  • The quality of atomic skill labels depends on the generation capability of InternVideo2.5, which may produce inaccurate annotations for rare or highly specialized manipulations.
  • SG-MoE adopts top-1 routing; actions requiring simultaneous coordination of multiple skills (e.g., "push while rotating") may benefit from top-k routing.
  • Each new skill adds one expert, leading to linear parameter growth over time — no expert merging or pruning mechanism is provided.
  • Gains on CALVIN (+0.22 avg. len) are relatively modest compared to LIBERO, potentially because CALVIN task granularity is less tightly aligned with atomic skills.
  • Inter-skill interference from heterogeneous task mixing (e.g., conflicting gripper states) is mitigated but not fully eliminated by SG-MoE.

Comparison with Related Methods

  • vs. π₀/π₀.₅: These models rely on pure flow matching action prediction without explicit planning. AtomicVLA adds thinking and skill routing on top, with +10% on LIBERO-LONG validating the value of planning.
  • vs. SayCan/Inner Monologue: These approaches separate planning (external LLM) from execution (independent controller), creating a modality gap. AtomicVLA unifies both within a single model.
  • vs. MoDE: MoDE uses the denoising timestep as the routing signal, which is still inherently token-level routing. SG-MoE routes using semantically explicit skill labels, with a +5.7% margin confirming the superiority of skill-level routing.

Transferable Insights

  • Generality of MoE Skill Routing: The noise-schedule-style embedding routing is transferable to multi-task NLP settings, using task description embeddings to route experts.
  • Modular Continual Learning Paradigm: The strategy of freezing old experts and appending new ones is applicable to domain-incremental continual pretraining of large vision models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The unified Think-Act paradigm, SG-MoE noise-schedule routing, and modular continual learning form three self-consistent innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers four LIBERO subsets, CALVIN, real Franka robot, ablation studies, and continual learning experiments with comprehensive data.
  • Writing Quality: ⭐⭐⭐⭐ Motivation–method–experiment logic is clear; the SG-MoE architecture diagram is intuitive and the algorithmic pseudocode is well-structured.
  • Value: ⭐⭐⭐⭐⭐ Makes important contributions to continual learning and long-horizon task planning for VLAs; the SG-MoE design offers broadly transferable insights.