
VLA-Arena: An Open-Source Framework for Evaluating Vision-Language-Action Models

Conference: ICML 2026
arXiv: 2512.22539
Code: https://github.com/VLA-Arena/VLA-Arena
Area: Robotics / Vision-Language-Action Models / Agent
Keywords: Vision-Language-Action Models, Benchmark Evaluation, Robotic Manipulation, Generalization Performance, Safety Constraints

TL;DR

VLA-Arena is a structured benchmark for VLA models that systematically quantifies difficulty along three orthogonal dimensions: task structure, language commands, and visual observations. Across 170 tasks, it reveals critical deficiencies in the generalization, visual perception, and safety of existing VLA models.

Background & Motivation

Background: VLA models are evolving rapidly—from RT-1 and RT-2 to the recent \(\pi_0\) and UniVLA, demonstrating cross-embodiment, cross-scene, and long-horizon manipulation capabilities. However, their capability boundaries and failure modes lack quantitative characterization.

Limitations of Prior Work: Existing robotics benchmarks (LIBERO, VLABench, RoboCasa) suffer from three major issues: (1) Task designs are oversimplified with single-level complexity definitions; (2) They focus either on noise robustness or task generalization, making it difficult to understand how models perform under multi-dimensional concurrent challenges; (3) Safety constraints are ignored, assuming operation in ideal environments.

Key Challenge: Does high performance on in-distribution tasks (L0) truly reflect a model's generalization capability? Does the model operate via robust multimodal understanding or fragile pattern matching?

Goal: To design a benchmark capable of systematically quantifying VLA capability boundaries, distinguishing true language-vision-action understanding from superficial pattern memorization, and assessing safety.

Key Insight: Control task difficulty simultaneously through three orthogonal dimensions—task structural complexity, semantic variation of language instructions, and systemic perturbations of visual observations.

Core Idea: Construct a "Three-Axis Difficulty Space" benchmark system. Through structured, controllable, and repeatable task design, VLA evaluation is upgraded from binary "pass/fail" judgments to fine-grained diagnostics of "where the capability frontier lies."

Method

Overall Architecture

The benchmark varies difficulty along three axes: (1) Task Structure Dimension (T-axis): 170 tasks divided into 11 suites, organized by four core dimensions (safety, distractors, generalization, long-horizon) and three difficulty levels (L0/L1/L2); (2) Language Command Dimension (W-axis): levels W0-W4, produced by replacing words with semantically similar WordNet words; (3) Visual Observation Dimension (V-axis): levels V0-V4, produced by cumulative visual perturbations (lighting → color → viewpoint → noise).
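
To make the three-axis space concrete, the sketch below enumerates every (L, W, V) combination as a labeled point. The class and function names are hypothetical, and the summary does not say whether the benchmark evaluates the full grid or only selected slices, so the exhaustive product here is purely illustrative.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DifficultyPoint:
    """One evaluation setting in the three-axis space (hypothetical type)."""
    task_level: int    # 0..2 -> L0, L1, L2
    lang_level: int    # 0..4 -> W0..W4
    visual_level: int  # 0..4 -> V0..V4

    def label(self) -> str:
        return f"L{self.task_level}-W{self.lang_level}-V{self.visual_level}"


def difficulty_grid():
    """Enumerate all 3 x 5 x 5 combinations of task, language, and visual levels."""
    return [DifficultyPoint(l, w, v)
            for l, w, v in product(range(3), range(5), range(5))]


grid = difficulty_grid()
print(len(grid))                          # 75
print(grid[0].label(), grid[-1].label())  # L0-W0-V0 L2-W4-V4
```

Holding two coordinates at zero and sweeping the third gives the single-axis diagnostics reported in the results.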

Key Designs

  1. Constrained BDDL (CBDDL):

    • Function: Extends standard BDDL to natively support dynamic entities, perturbations, and safety predicates.
    • Mechanism: Precisely defines constraints using 10 types of safety predicates (collision, torque limits, object falls, etc.). Introduces Cumulative Cost (CC): \(CC(\tau) = \sum_{t=0}^{L-1} c^{inst}(s_t, a_t) + \alpha \cdot c^{term}(s_L)\), where \(c^{inst}\) is the instantaneous cost incurred at step \(t\), \(c^{term}\) is the terminal cost of the final state \(s_L\), and \(\alpha=10\) weights end-state hazards.
    • Design Motivation: Distinguishes "task completion with safety violations" from "truly safe success," which is critical for real-world deployment.
  2. Principled Language Perturbation (WordNet-based):

    • Function: Systematically rewrites language instructions by replacing semantically similar words to diagnose model understanding versus rote memorization.
    • Mechanism: Identifies instruction semantic slots (action verbs, target objects, locations) and uses WordNet to find synonyms with a distance of 1 for replacement. Difficulty levels correspond to the number of replaced slots from W0 (original) to W4 (all 4 slots replaced).
    • Design Motivation: Distinguishes true language understanding from surface-level keyword matching. A model that performs well at W0 but collapses at W4 is likely relying on pattern matching.
  3. Cumulative Hierarchical Visual Perturbations (V0-V4):

    • Function: Applies visual challenges layer by layer to diagnose visual robustness.
    • Mechanism: V0 Standard → V1 Lighting perturbation → V2 Object color randomization → V3 Viewpoint shift → V4 Gaussian noise. Each level accumulates on the previous one.
    • Design Motivation: Addresses sim-to-real issues and neural network shortcuts; moves from mild natural variations to extreme perturbations to pinpoint failure boundaries.
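
As a worked example of the Cumulative Cost formula in the CBDDL item above, the following minimal sketch sums an instantaneous cost over the \(L\) steps of a trajectory and adds the \(\alpha\)-weighted terminal cost. The state dictionaries and the two cost functions are hypothetical stand-ins, not CBDDL's actual predicates.

```python
from typing import Callable, Sequence


def cumulative_cost(
    states: Sequence[dict],
    actions: Sequence[dict],
    inst_cost: Callable[[dict, dict], float],
    term_cost: Callable[[dict], float],
    alpha: float = 10.0,
) -> float:
    """CC(tau) = sum_{t=0}^{L-1} c_inst(s_t, a_t) + alpha * c_term(s_L)."""
    if len(states) != len(actions) + 1:
        raise ValueError("a trajectory of L actions needs L + 1 states")
    # zip stops after the L actions, so the final state s_L is excluded here.
    running = sum(inst_cost(s, a) for s, a in zip(states, actions))
    return running + alpha * term_cost(states[-1])


# Hypothetical costs: 1.0 per step spent in collision,
# 1.0 if an object has fallen in the final state.
def collision_cost(state: dict, action: dict) -> float:
    return 1.0 if state.get("collision", False) else 0.0


def fallen_cost(final_state: dict) -> float:
    return 1.0 if final_state.get("object_fallen", False) else 0.0


states = [
    {"collision": False},
    {"collision": True},
    {"collision": False},
    {"collision": False, "object_fallen": True},
]
actions = [{}, {}, {}]  # three placeholder actions, so L = 3

cc = cumulative_cost(states, actions, collision_cost, fallen_cost)
print(cc)  # 11.0: one collision step (1.0) plus 10 * terminal hazard (1.0)
```

With \(\alpha=10\), a single end-state hazard outweighs ten steps of unit instantaneous cost, which is how the metric separates "completed but unsafe" episodes from safe ones.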

Key Experimental Results

Main Results: Six Mainstream VLA Models

| Task Dimension | Model | L0-SR | L1-SR | L2-SR | L0→L2 Decline | Key Findings |
| --- | --- | --- | --- | --- | --- | --- |
| Safety-StaticObstacles | \(\pi_0\) | 1.0 | 0.7 | 0.3 | 70% | Even the strongest model collapses at L2 |
| Safety-StaticObstacles | OpenVLA | 0.6 | 0.6 | 0.0 | 100% | Unable to handle multiple obstacles at L2 |
| Distractor | \(\pi_0\) | 0.9 | 0.1 | 0.0 | 100% | Distractors significantly impact performance |
| Extrapolation-UnseenObjects | \(\pi_0\) | 0.8 | 0.5 | 0.0 | 100% | Total failure on unseen objects |
| Long Horizon | \(\pi_0\) | 0.9 | 0.0 | 0.0 | 100% | Incapable of composing skills |
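
The decline column in the table above appears to be the relative drop in success rate from L0 to L2, that is \((SR_{L0} - SR_{L2}) / SR_{L0}\). This reading is an assumption, but it reproduces every tabulated percentage, as the short check below shows (the function name is hypothetical).

```python
def relative_decline(sr_l0: float, sr_l2: float) -> float:
    """Relative drop in success rate from L0 to L2, as a fraction of the L0 rate."""
    if sr_l0 <= 0:
        raise ValueError("L0 success rate must be positive")
    return (sr_l0 - sr_l2) / sr_l0


# (suite, model) -> (L0 success rate, L2 success rate), copied from the table.
rows = {
    ("Safety-StaticObstacles", "pi_0"): (1.0, 0.3),
    ("Safety-StaticObstacles", "OpenVLA"): (0.6, 0.0),
    ("Distractor", "pi_0"): (0.9, 0.0),
    ("Extrapolation-UnseenObjects", "pi_0"): (0.8, 0.0),
    ("Long Horizon", "pi_0"): (0.9, 0.0),
}

declines = {key: relative_decline(*sr) for key, sr in rows.items()}
for (suite, model), decline in declines.items():
    print(f"{suite:28s} {model:8s} {decline:.0%}")
```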

Ablation Study

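| Task Type | Perturbation Dim | Level 0 (W0/V0) | Level 2 (W2/V2) | Level 4 (W4/V4) | Trend | Insight |
| --- | --- | --- | --- | --- | --- | --- |
| StatePreservation | Language | 0.8 | 0.8 | 0.7 | Gentle | Relies on visual cues |
| StatePreservation | Visual | 0.8 | 0.7 | 0.1 | Steep drop | Collapse at V3+V4 |
| UnseenObjects | Language | 0.8 | 0.4 | 0.1 | Monotonic decrease | Requires true understanding |
| UnseenObjects | Visual | 0.8 | 0.5 | 0.0 | Steep drop | Visuals equally lethal |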
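
The language levels in the rows above count how many instruction slots have been replaced by a synonym. A minimal sketch of that mechanism follows, under loud assumptions: the synonym table is a hand-written stand-in for a WordNet lookup, the template and its four slots are hypothetical, and slots are replaced in a fixed left-to-right order, which the summary does not specify.

```python
# Hypothetical synonym table standing in for a WordNet distance-1 lookup.
SYNONYMS = {"lift": "raise", "mug": "cup", "place": "put", "plate": "dish"}

# Hypothetical instruction template with four semantic slots.
TEMPLATE = "{verb1} the {obj} and {verb2} it on the {loc}"
SLOTS = {"verb1": "lift", "obj": "mug", "verb2": "place", "loc": "plate"}


def perturb_instruction(template: str, slots: dict, level: int) -> str:
    """Return the instruction with the first `level` slots replaced by synonyms."""
    if not 0 <= level <= len(slots):
        raise ValueError(f"level must be between 0 and {len(slots)}")
    filled = {}
    # dicts preserve insertion order, so slots are replaced left to right.
    for index, (name, word) in enumerate(slots.items()):
        filled[name] = SYNONYMS[word] if index < level else word
    return template.format(**filled)


for level in range(5):
    print(f"W{level}: {perturb_instruction(TEMPLATE, SLOTS, level)}")
# W0: lift the mug and place it on the plate
# W4: raise the cup and put it on the dish
```

Every level describes the same task, so a success rate that falls as the level rises points to keyword matching rather than understanding.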

VLM vs VLA Grounding Gap

| Visual Level | Qwen3-VL-8B Localization Acc. | VLA Average Drop |
| --- | --- | --- |
| V0 | 100% | - |
| V1 | 100% | 13.5% |
| V2 | 100% | 24.0% |
| V3 | 96.7% | 30.5% |
| V4 | 93.3% | 50.5% |

Key Findings: The comparison reveals "catastrophic forgetting": fine-tuning causes VLAs to abandon general visual concepts. While the VLM's localization accuracy drops by only 6.7 percentage points at V4 (100% → 93.3%), VLAs drop by 50.5% on average, suggesting that fine-tuning leads the model to relearn mappings tied to specific pixel distributions rather than retain robust representations.
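
Each visual level in this comparison stacks one more perturbation on top of the previous level. The sketch below shows only that cumulative structure on a tiny RGB image stored as nested lists. The four perturbation functions are deliberately crude, hypothetical stand-ins (a channel rotation for color randomization, a pixel roll for a viewpoint shift) and not the benchmark's actual implementation.

```python
import random


def _clamp(value: float) -> int:
    return max(0, min(255, int(round(value))))


def lighting(img, rng):
    # V1 stand-in: darken the whole frame by a fixed factor.
    return [[tuple(_clamp(c * 0.6) for c in px) for px in row] for row in img]


def recolor(img, rng):
    # V2 stand-in: rotate the channels (r, g, b) -> (b, r, g).
    return [[(b, r, g) for (r, g, b) in row] for row in img]


def shift_view(img, rng):
    # V3 stand-in: roll each row one pixel to the right.
    return [row[-1:] + row[:-1] for row in img]


def add_noise(img, rng):
    # V4: add zero-mean Gaussian noise to every channel.
    return [[tuple(_clamp(c + rng.gauss(0, 10)) for c in px) for px in row]
            for row in img]


# Order follows the text: lighting -> color -> viewpoint -> noise.
PERTURBATIONS = [lighting, recolor, shift_view, add_noise]


def apply_visual_level(img, level: int, seed: int = 0):
    """Apply the first `level` perturbations in order, so V_k includes V_(k-1)."""
    if not 0 <= level <= len(PERTURBATIONS):
        raise ValueError(f"level must be between 0 and {len(PERTURBATIONS)}")
    rng = random.Random(seed)  # seeded so the noise is repeatable
    for perturb in PERTURBATIONS[:level]:
        img = perturb(img, rng)
    return img


frame = [[(100, 200, 50), (10, 20, 30)]]
for level in range(5):
    print(f"V{level}:", apply_visual_level(frame, level))
```

Because the levels are nested, the first level at which success collapses identifies which newly added perturbation the policy cannot tolerate.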

Key Findings

  • Memorization outperforms generalization: Models perform best on the in-distribution L0 tasks, but success rates fall sharply at L1-L2, typically by 50%-100%.
  • Rampant visual shortcuts: \(\pi_0\) performance on StatePreservation tasks drops from 0.8 to 0.2 by level V3.
  • Lack of semantic understanding: Language perturbations lead to a monotonic decrease in UnseenObjects performance (0.8 → 0.1).
  • Sim-to-Real validation: Performance degradation (60% → 3.3%) and safety violations (0/10 → 4/10) from L0 to L2 were largely reproduced on real Franka robots.

Highlights & Insights

  • Power of Three-Axis Orthogonal Design: The three dimensions vary independently of one another, allowing for the precise isolation of failure causes.
  • Innovation with CBDDL + CC Metrics: Safety is directly integrated into the benchmark definition, with cumulative costs effectively penalizing instantaneous hazards and end-state failures.
  • WordNet-driven Principled Language Perturbation: More natural and controllable compared to random rewriting or template transformations.
  • Rigorous Sim-to-Real Validation: Verifies the transferability of simulation results on physical Franka robots.
  • Rank Reversal Phenomenon: The best-performing model differs across difficulty levels, which indicates that each level provides non-redundant insights.

Limitations & Future Work

  • VLA-Arena is based on MuJoCo simulation, with limited coverage of real-world scenarios.
  • The dataset is not large enough to fully rule out the influence of the fine-tuning data on the results.
  • Keyword replacement is based on WordNet, with limited support for non-English languages.
  • The "skill composition" in the long-horizon dimension is relatively simple (maximum of 3 sub-skills).
  • Improvements: Multi-embodiment learning; dynamic environments; open-vocabulary tasks; online learning evaluation.

Related Work & Insights

  • vs LIBERO/RoboCasa: VLA-Arena introduces a "Three-Axis Difficulty Space" and a safety dimension, significantly enhancing diagnostic capabilities.
  • vs VLABench: VLABench emphasizes language diversity, but its task design is flat; VLA-Arena decomposes difficulty via orthogonal dimensions.
  • vs LIBERO-Pro/Plus: VLA-Arena is more systematic, featuring hierarchical diagnostics not only for the visual dimension (V0-V4) but also for the language dimension (W0-W4).
  • Insight: Benchmark design should move from asking "what can it do" to asking "where are the boundaries"; safety should be a first-class citizen in benchmarks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Pioneers the combination of a "Three-Axis Orthogonal Difficulty Space" and "W/V Perturbation Diagnostic Probes."
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 mainstream VLAs + 170 tasks + real Franka validation + detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear logic with well-designed figures and tables.
  • Value: ⭐⭐⭐⭐⭐ Makes a structural contribution to how the VLA community evaluates models.