CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

Conference: NeurIPS 2025
arXiv: 2508.21046
Code: https://jiutian-vl.github.io/CogVLA-page
Area: Robotics / Multimodal VLM
Keywords: VLA, token routing, sparsification, instruction-driven, robotic manipulation

TL;DR

CogVLA proposes a three-stage VLA architecture inspired by human multimodal cognition: EFA-Routing aggregates and compresses visual tokens to 25% of their original count inside the encoder, LFP-Routing prunes 50% of the visual tokens within the LLM in an instruction-aware way, and V-L-A coupled attention (CAtten) decodes whole action chunks in parallel. It reaches a 97.4% average success rate on LIBERO with roughly 2.5× lower training cost and 2.8× lower inference latency than OpenVLA, surpasses strong baselines such as OpenVLA-OFT, and achieves a 70.0% success rate on real-robot tasks.

Background & Motivation

Background: VLA models (e.g., OpenVLA, π₀, RT-2) unify vision, language, and action under pretrained VLMs for end-to-end robotic control. However, post-training adaptation to action spaces is computationally prohibitive—fine-tuning a 7B VLA model on a single LIBERO task, for instance, requires over 600 A100 GPU hours.

Limitations of Prior Work: Existing sparsification and acceleration methods (Mixture-of-Depths, layer skipping, early exit) suffer from two fundamental issues: (a) they focus exclusively on optimizing computation within the LLM while neglecting end-to-end cross-modal semantic coupling from perception to control—visual compression may discard task-critical features, and token skipping may disrupt contextual coherence; (b) the inherently distinct attention patterns of vision, language, and action modalities (selective attention for vision, causal reasoning for language, temporal coherence for action) are treated with a uniform attention strategy.

Key Insight: Drawing inspiration from cognitive science, human object manipulation involves a highly optimized multimodal coordination mechanism: the Visual Attention System (VAS) selectively focuses on task-relevant targets → the Supplementary Motor Area (SMA) injects action intent to filter irrelevant information → the Premotor Cortex (PMC) dynamically integrates inputs to produce coherent action trajectories. These three stages correspond to CogVLA's EFA-Routing → LFP-Routing → CAtten. Core Idea: Instruction-driven, cross-modal progressive sparsification—rather than blind compression, task instructions guide selective retention of the most relevant information at each stage.

Method

Overall Architecture

CogVLA embeds three-stage progressive sparsification into the standard VLA pipeline (visual encoder → LLM → action output). Stage 1 performs instruction-guided cross-branch aggregation within the visual encoder (25% compression); Stage 2 applies instruction-guided token pruning within the LLM (50% sparsification); Stage 3 employs a hybrid attention mask ensuring causal attention for vision–language and bidirectional attention for action tokens. Actions are generated as a complete action chunk in a single forward pass via parallel decoding.

Key Designs

  1. EFA-Routing (Encoder-FiLM based Aggregation Routing):

    • Function: Aggregates and compresses visual tokens to 25% of their original count within the visual encoder, conditioned on the task instruction.
    • Mechanism: Two-step aggregation. (a) Intra-encoder Aggregation: An Encoder-FiLM module converts the instruction embedding into scale/shift vectors \((\gamma, \beta)\) that modulate the self-attention output of each encoder branch (SigLIP and DINOv2). Learnable aggregation tokens progressively accumulate instruction-relevant information layer by layer; only the aggregation tokens are retained and the original image tokens are discarded, leaving 25% of the original token count. (b) Cross-encoder Aggregation: An instruction-conditioned routing gate (MLP → Sigmoid) dynamically computes the fusion weight \(\alpha\) between the SigLIP and DINOv2 branches, reflecting that different instructions place different demands on semantic (SigLIP) vs. spatial (DINOv2) features.
    • Design Motivation: A dual encoder (semantic + spatial) is necessary but produces redundant tokens. FiLM modulation offers a lightweight conditioning mechanism more efficient than cross-attention. Instruction-conditioned dynamic fusion avoids the information loss inherent in a fixed 50/50 blending ratio. (A minimal sketch of this module follows the list.)
  2. LFP-Routing (LLM-FiLM based Pruning Routing):

    • Function: Prunes 50% of visual tokens at each LLM layer based on instruction awareness, reducing attention computation.
    • Mechanism: At each Transformer layer \(l\), visual tokens are first modulated via LLM-FiLM with instruction-conditioned parameters \((\gamma_\text{LLM}, \beta_\text{LLM})\); a Task-Guided Pruning Router (MLP) then computes a routing weight \(R_l^j\) for each token. A retention ratio \(\beta\) is set, and the \(\beta\)-quantile of routing weights at the current layer serves as a threshold—tokens above the threshold proceed through full self-attention and FFN computation, while tokens below the threshold are skipped (their values passed through unchanged). Retained tokens are re-weighted by their routing weights.
    • Design Motivation: Although EFA-Routing reduces token count, the aggregation process may still retain semantics irrelevant to the LLM's current computation. LFP-Routing provides a deeper filtering stage, analogous to how the SMA injects action intent into the visual processing stream. (A sketch of this pruning step also follows the list.)
  3. V-L-A Coupled Attention (CAtten):

    • Function: Maintains cross-modal logical consistency and temporal action coherence over the compressed multimodal input.
    • Mechanism: A hybrid attention mask \(M_\text{hybrid}\) is designed: (a) the vision–language region uses causal attention \(M_\text{causal}^\text{VL}\) to preserve sequential reasoning (because the visual tokens already encode instruction intent from EFA-Routing, it is acceptable that they cannot attend forward to the language tokens under the causal mask); (b) action tokens within the action chunk use bidirectional attention \(M_\text{bi}^\text{act}\), allowing all action tokens to attend to one another and enabling parallel decoding, i.e., generating \(K\) action steps in a single forward pass rather than \(K \times D\) autoregressive steps; (c) action tokens attend to all vision–language tokens for full context, while vision–language tokens do not attend to action tokens (preserving the causal direction).
    • Design Motivation: Under standard causal attention, an earlier action token in the chunk cannot attend to later ones, so a sparsified, parallel-decoded chunk loses coherence. Bidirectional attention over action tokens lets information flow across the whole chunk, ensuring temporal consistency, and parallel decoding reduces inference from \(K \times D\) forward passes to one. (A sketch of the hybrid mask follows the list.)
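
Below is a minimal PyTorch sketch of the EFA-Routing idea (the first module above). Module names, toy dimensions, and the single cross-attention step standing in for the paper's layer-by-layer aggregation are all assumptions for illustration, not the authors' implementation: the instruction embedding yields FiLM scale/shift vectors that modulate a branch's patch features, learnable aggregation tokens pool the modulated patches, and an instruction-conditioned gate fuses the SigLIP and DINOv2 branches.

```python
import torch
import torch.nn as nn

class EncoderFiLMAggregation(nn.Module):
    """Instruction-conditioned aggregation for one encoder branch (illustrative only)."""
    def __init__(self, dim: int, num_agg_tokens: int):
        super().__init__()
        self.agg_tokens = nn.Parameter(torch.randn(1, num_agg_tokens, dim) * 0.02)
        self.film = nn.Linear(dim, 2 * dim)   # instruction embedding -> (gamma, beta)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # FiLM-modulate the patch features with the pooled instruction embedding.
        gamma, beta = self.film(instr_emb).chunk(2, dim=-1)            # (B, dim) each
        modulated = gamma.unsqueeze(1) * patch_tokens + beta.unsqueeze(1)
        # Aggregation tokens attend to the modulated patches; only they are kept,
        # so the branch output is compressed to num_agg_tokens (25% of the patches here).
        agg = self.agg_tokens.expand(patch_tokens.size(0), -1, -1)
        agg, _ = self.attn(query=agg, key=modulated, value=modulated)
        return agg

class CrossEncoderGate(nn.Module):
    """Instruction-conditioned fusion weight alpha between the two branches."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                  nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, siglip_agg, dino_agg, instr_emb):
        alpha = self.gate(instr_emb).unsqueeze(1)                      # (B, 1, 1)
        return alpha * siglip_agg + (1.0 - alpha) * dino_agg

# Toy usage: 256 patches per branch compressed to 64 aggregation tokens, then fused.
B, N, D = 2, 256, 768
siglip_branch = EncoderFiLMAggregation(D, num_agg_tokens=N // 4)
dino_branch = EncoderFiLMAggregation(D, num_agg_tokens=N // 4)
instr = torch.randn(B, D)
fused = CrossEncoderGate(D)(siglip_branch(torch.randn(B, N, D), instr),
                            dino_branch(torch.randn(B, N, D), instr), instr)
print(fused.shape)   # torch.Size([2, 64, 768])
```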
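
LFP-Routing (the second module) can be sketched in the same spirit; the router, FiLM layer, and stand-in attention/FFN block below are hypothetical names chosen for illustration. Visual tokens are FiLM-modulated by the instruction, a small router scores each token, and only tokens above the retention-ratio quantile go through the heavy computation; the rest are carried through unchanged, and retained tokens are re-weighted by their routing scores.

```python
import torch
import torch.nn as nn

class TaskGuidedPruningRouter(nn.Module):
    """Per-layer LFP-Routing sketch: FiLM-modulate, score, and skip low-scoring tokens."""
    def __init__(self, dim: int, retain_ratio: float = 0.5):
        super().__init__()
        self.retain_ratio = retain_ratio
        self.film = nn.Linear(dim, 2 * dim)   # instruction embedding -> (gamma, beta)
        self.router = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                    nn.Linear(dim // 4, 1), nn.Sigmoid())

    def forward(self, vis_tokens, instr_emb, heavy_block):
        # heavy_block stands in for the layer's self-attention + FFN, applied only to retained tokens.
        gamma, beta = self.film(instr_emb).chunk(2, dim=-1)
        modulated = gamma.unsqueeze(1) * vis_tokens + beta.unsqueeze(1)
        scores = self.router(modulated).squeeze(-1)                    # (B, N) routing weights

        # Tokens scoring above the layer's quantile threshold are computed; the rest skip the block.
        thresh = torch.quantile(scores, 1.0 - self.retain_ratio, dim=-1, keepdim=True)
        keep = scores >= thresh                                        # (B, N) boolean mask

        out = vis_tokens.clone()
        for b in range(vis_tokens.size(0)):                            # per-sample loop kept simple for clarity
            kept = vis_tokens[b, keep[b]].unsqueeze(0)
            # Retained tokens go through the heavy block and are re-weighted by their routing scores.
            out[b, keep[b]] = heavy_block(kept).squeeze(0) * scores[b, keep[b]].unsqueeze(-1)
        return out

# Toy usage: 64 visual tokens per sample, roughly half routed through a stand-in attention/FFN block.
B, N, D = 2, 64, 512
block = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
pruned = TaskGuidedPruningRouter(D, retain_ratio=0.5)(torch.randn(B, N, D), torch.randn(B, D), block)
print(pruned.shape)   # torch.Size([2, 64, 512])
```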
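
Finally, the CAtten hybrid mask (third module) reduces to a few lines once a token ordering is fixed; the vision → language → action-chunk ordering below is an assumption, and True means "may attend".

```python
import torch

def build_hybrid_mask(n_vis: int, n_lang: int, n_act: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) illustrating the CAtten layout."""
    n_vl = n_vis + n_lang            # vision-language prefix length
    n = n_vl + n_act                 # total sequence length including the action chunk
    mask = torch.zeros(n, n, dtype=torch.bool)

    # (a) Vision-language prefix: standard causal (lower-triangular) attention.
    mask[:n_vl, :n_vl] = torch.tril(torch.ones(n_vl, n_vl)).bool()

    # (b) Action chunk: fully bidirectional among action tokens, enabling parallel decoding.
    mask[n_vl:, n_vl:] = True

    # (c) Action tokens see the whole vision-language context; the prefix never sees actions.
    mask[n_vl:, :n_vl] = True
    return mask

# Toy example: 4 visual tokens, 3 language tokens, 2 action tokens.
print(build_hybrid_mask(4, 3, 2).int())
```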

Loss & Training

Standard action prediction loss (MSE or token classification loss) is used for training. Training is conducted on 4× A800 GPUs; due to sparsification, training cost is only 4.7 h per 10k steps (compared to 12.5 h for OpenVLA-OFT).
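
As a small, hedged illustration of the continuous-action variant (all shapes made up; the text above only names the loss type), a chunk-level MSE objective might look like this:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 8 trajectories, action chunk of K = 8 steps, 7-DoF actions.
pred_chunk = torch.randn(8, 8, 7, requires_grad=True)   # decoded in one forward pass (parallel decoding)
gt_chunk = torch.randn(8, 8, 7)                         # ground-truth demonstration actions

loss = F.mse_loss(pred_chunk, gt_chunk)                 # standard action-prediction loss
loss.backward()
print(loss.item())
```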

Key Experimental Results

Main Results (LIBERO Benchmark)

LIBERO success rates (%):

| Method | Spatial SR | Object SR | Goal SR | Long SR | Avg SR | Rank |
|---|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 9 |
| π₀ fine-tuned | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 | 5 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 | 2 |
| PD-VLA | 95.5 | 96.7 | 94.9 | 91.7 | 94.7 | 3 |
| CogVLA | 98.6 | 98.8 | 96.6 | 95.4 | 97.4 | 1 |

Efficiency comparison:

| Method | Inference Time ↓ | Throughput ↑ | FLOPs ↓ | Training Cost / 10k Steps ↓ | SR |
|---|---|---|---|---|---|
| OpenVLA | 0.254 s | 3.9 Hz | 8.48 T | 11.7 h | 76.5% |
| OpenVLA-OFT | 0.132 s | 60.6 Hz | 8.45 T | 12.5 h | 97.1% |
| CogVLA | 0.091 s | 87.9 Hz | 2.72 T | 4.7 h | 97.4% |

Ablation Study

| Configuration | Inference Time | FLOPs | Notes |
|---|---|---|---|
| Full CogVLA | 0.091 s | 2.72 T | Complete method |
| w/o Stage 1 (EFA-Routing) | 0.162 s | 5.38 T | Visual tokens uncompressed → FLOPs doubled |
| w/o Stage 2 (LFP-Routing) | 0.117 s | 3.52 T | No LLM pruning → increased computation |

Real-Robot Experiments (Cobot Agilex ALOHA)

| Method | Object Placement | Drawer Manipulation | T-Shirt Folding | Avg SR |
|---|---|---|---|---|
| OpenVLA-OFT | 7/10 → 5/10 | 8/10 → 5/10 | 7/10 → 5/10 | 56.7% |
| PD-VLA | 8/10 → 4/10 | 6/10 → 4/10 | 7/10 → 4/10 | 50.0% |
| CogVLA | 9/10 → 7/10 | 8/10 → 7/10 | 9/10 → 6/10 | 70.0% |

Key Findings

  • CogVLA achieves SOTA simultaneously in both performance and efficiency—97.4% SR ranking first, with FLOPs only 32% of OpenVLA's.
  • Real-robot results (70.0% vs. OFT's 56.7%) validate sim-to-real transfer capability, with a particularly pronounced advantage on long-horizon tasks (T-shirt folding, 3 steps).
  • 75% of visual tokens and 50% of LLM tokens can be safely removed without degrading—and in some cases improving—performance, confirming that a large proportion of tokens are task-irrelevant.
  • Stage 1 and Stage 2 contribute complementarily: Stage 1 primarily reduces FLOPs (5.38T → 2.72T), while Stage 2 primarily reduces inference latency.

Highlights & Insights

  • The cognitively inspired three-stage design (VAS → SMA → PMC) is more than a metaphor—it corresponds to a computationally grounded "select → filter → coordinate" information processing flow.
  • Instruction conditioning is the key: both FiLM modulation and routing gates are conditioned on the task instruction, realizing a "decide what to look at and what to think about based on what you are doing" principle—demonstrably more effective than unconditional compression (e.g., ViT token merging).
  • Parallel decoding combined with bidirectional action attention represents an important direction for VLA efficiency—the latency of autoregressively generating \(K\) action steps is eliminated.
  • A throughput of 87.9 Hz comfortably satisfies the demands of practical robot control (typically 10–50 Hz).

Limitations & Future Work

  • Validation is limited to LIBERO (10 tasks × 4 suites = 40 tasks) and 3 real-robot tasks, leaving task diversity relatively narrow.
  • Compression ratios (25% visual + 50% LLM pruning) are fixed; tasks of varying complexity may benefit from different ratios—adaptive compression rate scheduling is a clear direction for future improvement.
  • FiLM modulation adds parameters through the MLPs that generate its scale/shift vectors; although lightweight, its scalability to larger VLAs remains to be verified.
  • Bidirectional action attention assumes that actions within a chunk can be generated independently in parallel, which may not hold for fine-grained manipulation tasks with strong temporal dependencies on preceding actions.

Takeaways

  • The efficiency trend in VLAs runs from OpenVLA (3.9 Hz) to OFT (60.6 Hz) to CogVLA (87.9 Hz); efficiency improvement is the critical path toward practical VLA deployment.
  • Revival of FiLM modulation for cross-modal conditioning: FiLM (Feature-wise Linear Modulation), originally proposed for visual question answering, regains relevance in the VLA domain—lightweight, end-to-end trainable, and non-intrusive to the backbone architecture.
  • Task-awareness as the basis for token sparsification: CogVLA demonstrates that "prune tokens according to the task" substantially outperforms "prune tokens according to attention scores"—task semantics should serve as the primary criterion for sparsification.

Rating

  • Novelty: ⭐⭐⭐⭐ The cognition-inspired three-stage instruction-driven sparsification is creative, though individual components (FiLM / token pruning / parallel decoding) are not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ LIBERO + real-robot + efficiency comparisons + ablations are solid, though task variety remains limited.
  • Writing Quality: ⭐⭐⭐⭐ Architecture descriptions are clear; the cognitive science analogy is illuminating.
  • Value: ⭐⭐⭐⭐ Directly applicable to efficient VLA deployment; 87.9 Hz throughput makes real-time control feasible.