🤖 Robotics & Embodied AI

📷 CVPR2026 · 49 paper notes

Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

This work leverages the pretrained 3D geometric foundation model π3 as a perception backbone, fuses 3D geometric, 2D semantic, and proprioceptive features, and jointly predicts future action chunks and future 3D pointmaps with a diffusion model. Using only RGB inputs, the method outperforms point-cloud-based approaches across the board on the RoboTwin bimanual benchmark.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

This paper proposes Adaptive Action Chunking (AAC), a strategy that leverages action entropy as a signal to dynamically determine the optimal chunk size at inference time, requiring no additional training or architectural modification. AAC consistently improves the task success rates of GR00T N1.5 and π₀.5 on benchmarks including RoboCasa and LIBERO.
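
As a rough illustration of the entropy-gating idea (a minimal sketch only: the paper's exact entropy estimator and entropy-to-chunk mapping are not specified here, and all names below are hypothetical), assuming a discretized action head:

```python
import numpy as np

def adaptive_chunk_size(action_logits, min_chunk=1, max_chunk=16):
    """Pick an execution chunk length from per-step action entropy:
    low entropy -> the policy is confident -> commit to a longer chunk;
    high entropy -> replan sooner. `action_logits`: (horizon, n_bins)."""
    probs = np.exp(action_logits - action_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)   # (horizon,)
    norm = entropy.mean() / np.log(action_logits.shape[-1])  # in [0, 1]
    size = round(max_chunk - (max_chunk - min_chunk) * norm)
    return int(max(min_chunk, min(max_chunk, size)))
```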

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

AtomicVLA is a unified planning-execution framework built upon π₀ that adaptively switches between Think and Act modes to generate task chains and atomic skill abstractions, and employs a Skill-Guided MoE (SG-MoE) to build a scalable atomic skill expert library and route actions to specialized experts. It raises the LIBERO-LONG success rate from 85.2% (π₀) to 95.2%, gains +18.3% on real-world Franka long-horizon tasks, and exceeds baselines by 21% in real-world continual learning, with forgetting as low as 1.3%.

BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration

This paper proposes BiPreManip, a framework for bimanual preparatory manipulation based on visual affordance representations. The system first anticipates the primary hand's target interaction region, then guides the assistive hand to perform preparatory actions (e.g., flipping a bottle so its cap faces the primary hand), achieving substantial improvements over baselines in both simulated and real-world environments.

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

This paper proposes a Feasible Action Neighborhood (FAN) regularizer that shapes the output distribution of VLA models into a Gaussian form matching physical action tolerances. The approach consistently improves success rate, generalization, and sample efficiency under both SFT and RFT finetuning paradigms (RFT requires only 1/3 of training steps to reach 90% success rate).
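
What a Gaussian-shaped neighborhood penalty could look like, as a minimal sketch assuming a Gaussian policy head and a per-dimension physical tolerance `tol_sigma` (names and formulation are hypothetical, not the paper's exact regularizer):

```python
import torch

def fan_regularizer(pred_mean, pred_logstd, demo_action, tol_sigma):
    """Pull the policy's predicted Gaussian toward a target Gaussian centered
    on the demonstration with std equal to the physical action tolerance,
    via the closed-form KL( N(pred) || N(demo, tol) ) per dimension."""
    pred_std = pred_logstd.exp()
    kl = (torch.log(tol_sigma / pred_std)
          + (pred_std ** 2 + (pred_mean - demo_action) ** 2) / (2 * tol_sigma ** 2)
          - 0.5)
    return kl.mean()
```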

Chain of World: World Model Thinking in Latent Motion (CoWVLA)

CoWVLA unifies the strengths of world-model VLAs and latent-action VLAs: a Latent Motion Extractor decomposes video into structural and motion latent variables, enabling the VLA to perform world-model prediction in the latent motion space rather than reconstructing redundant pixels. Combined with Co-Fine-tuning that alternately generates keyframe and action tokens, CoWVLA achieves 95.2% on LIBERO-Long (surpassing π₀ at 85.2%) and an average score of 0.560 on SimplerEnv-WidowX (surpassing π₀ at 0.425).

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

CoMo jointly addresses the shortcut-learning problem in continuous latent motion learning via two mechanisms, Early Temporal Difference (Td) and Temporal Contrastive Learning (Tcl). These enable the extraction of fine-grained continuous pseudo-action labels from internet videos and joint training on video data and robot actions under a unified continuous distribution, substantially improving policy performance.
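
A minimal sketch of a temporal contrastive term of this flavor, assuming motion latents are batched so that rows 2i and 2i+1 come from the same video segment (illustrative names, not CoMo's actual loss; the early temporal difference is only indicated in a comment):

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z_motion, temperature=0.1):
    """InfoNCE over motion latents: each latent's positive is its segment
    partner (row 2i <-> 2i+1); everything else in the batch is a negative.
    Upstream (not shown), an early temporal difference would feed frame
    deltas f_{t+1} - f_t into the motion encoder."""
    z = F.normalize(z_motion, dim=-1)                  # (2B, D)
    sim = z @ z.t() / temperature                      # (2B, 2B)
    sim.fill_diagonal_(float('-inf'))                  # exclude self-similarity
    targets = torch.arange(z.shape[0], device=z.device) ^ 1  # 0<->1, 2<->3, ...
    return F.cross_entropy(sim, targets)
```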

Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning

This paper proposes NeSyCR, a neurosymbolic counterfactual reasoning framework that abstracts video demonstrations into a symbolic world model, detects cross-domain incompatibilities via counterfactual state simulation, and automatically corrects program steps. NeSyCR achieves a 31.14% improvement in success rate over the strongest baseline, Statler, on cross-domain demo-to-code tasks.

CycleManip: Enabling Cyclic Task Manipulation via Effective Historical Perception and Understanding

CycleManip is the first work to systematically address cyclic robotic manipulation tasks (e.g., shaking a bottle N times). It enhances historical perception via a cost-aware history sampling strategy and improves historical understanding through auxiliary multi-task learning objectives, enabling controllable cycle-count manipulation in an end-to-end imitation learning framework.

DAWN: Pixel Motion Diffusion is What We Need for Robot Control

This paper proposes DAWN, a two-stage fully diffusion-based vision-language-action framework. A Motion Director (latent diffusion model) generates dense pixel motion fields as interpretable intermediate representations, while an Action Expert (diffusion Transformer policy) translates pixel motion into executable robot actions. DAWN achieves state-of-the-art performance on the CALVIN benchmark (average length 4.00) and demonstrates strong generalization on real-world single-arm and dual-arm manipulation tasks.

DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

This paper proposes DecoVLN, a framework that decouples observation, reasoning, and correction in VLN tasks. By introducing an adaptive memory refinement (AMR) mechanism and a state-action-pair-based correction fine-tuning strategy, DecoVLN achieves state-of-the-art performance on R2R-CE and RxR-CE using only egocentric RGB input.

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

This paper proposes the DeepSketcher suite, comprising a high-quality dataset of 31k interleaved image-text CoT examples built via code rendering and a self-contained Embedding Editor model, enabling VLMs to generate "visual thoughts" directly in the visual embedding space for multimodal reasoning without relying on any external tools.

Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

This paper proposes ViFailback, a framework that leverages explicit visual symbols (arrows, crosshairs, labels, etc.) to efficiently annotate real-world robotic manipulation failures. It constructs a large-scale dataset of 58,128 VQA pairs and fine-tunes ViFailback-8B, a VLM capable of failure diagnosis and of issuing both visual and textual corrective guidance; integrated with a VLA model in real-robot experiments, it improves the average task success rate by 22.2% through failure recovery.

Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

To address the limitation of MoE-LoRA methods in which all experts share an identical structure (uniform rank) and thus cannot adapt to tasks of varying complexity, this paper proposes Expert Pyramid Tuning (EPT), which transplants the multi-scale feature pyramid (FPN) idea from computer vision into the MoE-LoRA paradigm: a parameter pyramid built from a shared low-dimensional meta-knowledge subspace and deconvolution experts with kernels of varying sizes, coupled with an Adaptive LoRA Pruner and contrastive learning-based task embeddings. EPT achieves an average score of 87.0% on GLUE with only 0.41M parameters per task, outperforming all MoE-LoRA variants while cutting parameter count by roughly 50%.
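
The core structural idea, experts of different capacities fed from a shared meta-knowledge subspace, can be sketched as below; for brevity the deconvolution projections are replaced with plain linear maps, and all module names are illustrative rather than EPT's actual components:

```python
import torch
import torch.nn as nn

class PyramidLoRAExperts(nn.Module):
    """MoE-LoRA sketch with a rank *pyramid*: experts of ranks 2/4/8 instead
    of one uniform rank, all reading a shared low-dimensional subspace."""
    def __init__(self, d_model, ranks=(2, 4, 8), meta_rank=4):
        super().__init__()
        self.shared_down = nn.Linear(d_model, meta_rank, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(meta_rank, r, bias=False),
                          nn.Linear(r, d_model, bias=False))
            for r in ranks)
        self.router = nn.Linear(d_model, len(ranks))

    def forward(self, x):
        gate = torch.softmax(self.router(x), dim=-1)             # (..., E)
        h = self.shared_down(x)
        out = torch.stack([e(h) for e in self.experts], dim=-1)  # (..., d, E)
        return (out * gate.unsqueeze(-2)).sum(-1)                # weighted mix
```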

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

This paper proposes Fast-ThinkAct, which compresses verbose textual CoT reasoning (~250 tokens) into 6 verbalizable continuous latent tokens, combined with reward-guided preference distillation and visual trajectory alignment, achieving an 89.3% reduction in inference latency (9.3× faster than ThinkAct-7B) while maintaining or surpassing the performance of state-of-the-art reasoning VLAs.

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

This paper proposes FineCog-Nav, a zero-shot UAV vision-language navigation framework inspired by human cognition. It decomposes navigation into seven fine-grained cognitive modules—language processing, perception, attention, memory, imagination, reasoning, and decision-making—each driven by moderate-scale foundation models, enabling long-range navigation in complex 3D environments without any training.

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

This work identifies the root cause of poor transferability in visual jailbreaking attacks as their residence in high-sharpness loss regions, arising from shallow-layer over-reliance on model-specific representations and excessive influence of high-frequency spectral components. FORCE corrects these non-generalizable feature dependencies via layer-aware regularization that broadens the shallow-layer feasible region and spectral rescaling that suppresses high-frequency non-semantic components, guiding attacks toward flatter loss landscapes and substantially improving cross-model transferability.

ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

This paper proposes ForceVLA2, the first end-to-end model that unifies force awareness and hybrid force-position control within a VLA framework. Force-based Prompts injected into a VLM expert construct cross-phase force-aware task concepts, while a Cross-Scale MoE adaptively fuses task semantics with real-time interaction forces to achieve closed-loop force-position regulation. The model achieves an average success rate of 66% across 5 contact-rich tasks, surpassing π₀ and π₀.5 by 48.0% and 35.0%, respectively.

GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer

GeCo-SRT proposes the first continual cross-task sim-to-real transfer paradigm, exploiting the domain- and task-invariance of local geometric features to accumulate knowledge across successive transfers and adapt efficiently to new tasks. Through a Geo-MoE module for reusable geometric knowledge extraction and Geo-PER for expert-level forgetting prevention, it achieves an average success rate of 63.3% across four real-robot tasks (a 52% improvement over baselines) while needing only 1/6 of the data to match baseline performance.

IGen: Scalable Data Generation for Robot Learning from Open-World Images

IGen starts from a single open-world image and automatically generates large-scale vision-action training data through a pipeline of 3D scene reconstruction → VLM task planning → SE(3) action generation → point cloud synthesis → frame rendering. Policies trained exclusively on the generated data can successfully perform real-world manipulation tasks.

Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics

This paper establishes via the NTK framework that linearized attention does not converge to the infinite-width kernel limit (convergence would require width \(m = \Omega(\kappa^6)\)), and proposes the "influence malleability" metric to quantify the dual implications of this non-convergence: attention exhibits 6–9× higher data-dependent flexibility than ReLU networks, which simultaneously reduces approximation error and increases adversarial vulnerability. Expressive power and vulnerability thus share a common origin, a data-dependent kernel structure that deviates from the kernel regime.

Language-Grounded Decoupled Action Representation for Robotic Manipulation (LaDA)

LaDA uses natural language as a semantic bridge to decouple continuous 7-DoF actions into three interpretable primitives (translation, rotation, and gripper state), and employs semantically guided soft-label contrastive learning to align cross-task action representations in a shared visual-language-action embedding space. With only 0.6B parameters, LaDA achieves a 93.6% success rate on LIBERO, outperforming all baselines with 1.3B–8.5B parameters.
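
A minimal sketch of soft-label contrastive alignment, assuming the soft targets are precomputed from inter-instruction semantic similarity (hypothetical names; not LaDA's exact objective):

```python
import torch.nn.functional as F

def soft_label_contrastive(action_emb, text_emb, soft_targets, temperature=0.07):
    """Cross-modal contrastive loss whose targets are soft similarity
    distributions over the batch instead of one-hot identity labels."""
    a = F.normalize(action_emb, dim=-1)        # (B, D)
    t = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = a @ t.t() / temperature           # (B, B)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(-1).mean()
```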

Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

This paper proposes the TVVE framework, which employs a reinforcement learning-driven Multi-View Exploration Policy (MVEP) to select optimal virtual camera viewpoints and re-render observations online. A task-aware MoE visual encoder (TaskMoE) is designed to mitigate cross-task feature interference. The framework achieves an average success rate of 86.6% across 18 tasks on RLBench.

ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

ManipArena proposes a standardized real-world robot manipulation evaluation framework comprising 20 reasoning-oriented tasks and 10,812 expert trajectories. Through a green-screen controlled environment, systematic diversity design, and hierarchical OOD evaluation, it provides a fair and reproducible benchmark for VLA models and world models.

MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

This paper presents the first systematic diagnosis of two root causes underlying the non-mergeability of VLA models: selfish LoRA parameter conflicts and task coupling induced by self-attention in the action expert. MergeVLA addresses both via sparsely activated task-mask LoRA, a self-attention-free action expert architecture, and training-free test-time task routing, merging multiple single-skill VLA specialists into a unified generalist agent that achieves a 90.2% success rate on LIBERO and 90.0% on the real-robot SO101 platform.
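
The task-mask merging step can be sketched as follows, assuming each specialist ships a LoRA weight delta plus a {0, 1} mask over the same parameters (illustrative only; the routing and attention-free expert design are not shown):

```python
import torch

def merge_lora_deltas(deltas, masks):
    """Mask-gated merging: specialist i contributes its delta only where its
    task mask is active, so (mostly) disjoint specialists sum without
    interference; overlapping positions are averaged."""
    merged = torch.zeros_like(deltas[0])
    overlap = torch.zeros_like(deltas[0])
    for delta, mask in zip(deltas, masks):  # masks: float tensors in {0, 1}
        merged += delta * mask
        overlap += mask
    return merged / overlap.clamp(min=1.0)
```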

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

This paper proposes PALM, a unified VLA framework that employs structured fine-grained affordance prediction across four categories (global, local, spatial, and dynamic) as implicit reasoning anchors, and incorporates continuous sub-task progress estimation to enable seamless task transitions. PALM achieves an average completion length of 4.48 on CALVIN ABCD (surpassing the previous SOTA by 12.5%), a success rate of 91.8% on LIBERO-LONG, and more than twice the baseline performance in real-world long-horizon generalization evaluations.

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

PanoAffordanceNet introduces the novel task of holistic affordance grounding in 360° panoramic indoor environments, systematically addressing ERP geometric distortion, sparse functional regions, and semantic drift. It employs a Distortion-Aware Spectrum Modulator (DASM) to correct ERP distortions, an Omnidirectional Sphere Densification Head (OSDH) to recover continuous affordance regions from sparse activations, and multi-level training objectives, achieving substantial gains over existing approaches on 360-AGD, the first panoramic affordance dataset, which the authors construct.

Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition

This paper proposes CroBo, a self-supervised framework that learns visual state representations via a global-local reconstruction objective: a global reference image is compressed into a single bottleneck token, which is then used to reconstruct a heavily masked (90%) local crop, compelling the bottleneck token to encode pixel-level "what-is-where" scene composition. CroBo achieves state-of-the-art performance on the Franka Kitchen and DMC robot policy learning benchmarks.
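
A structural sketch of the global-local objective, assuming generic ViT-style encoder and decoder modules (all names hypothetical; the 90% mask and the reconstruction loss are left to the caller):

```python
import torch
import torch.nn as nn

class BottleneckReconstructor(nn.Module):
    """Compress a global view into ONE bottleneck token via attention
    pooling, then decode a heavily masked local crop conditioned on it,
    forcing the token to encode what-is-where scene composition."""
    def __init__(self, encoder, decoder, d_model, n_heads=8):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, global_img, local_patches, mask):  # mask: (N,) bool, ~90% True
        tokens = self.encoder(global_img)                        # (B, N, D)
        q = self.query.expand(tokens.size(0), -1, -1)
        state, _ = self.pool(q, tokens, tokens)                  # (B, 1, D)
        visible = local_patches[:, ~mask]                        # surviving ~10%
        return self.decoder(torch.cat([state, visible], dim=1))  # predict masked crop
```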

Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

This paper reformulates Multimodal Misinformation Detection (MMD) as a structured probabilistic reasoning problem over concept graphs. The proposed PCGR framework employs MLLMs to automatically discover and validate human-interpretable concept nodes, constructs a hierarchical probabilistic concept graph, and achieves interpretable misinformation detection, outperforming 13 baselines across three benchmarks.

ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

ProFocus is a training-free progressive framework that achieves state-of-the-art zero-shot VLN performance on R2R and REVERIE benchmarks through two mechanisms: proactive perception (converting panoramic observations into semantic maps and having an LLM generate targeted visual queries) and focused reasoning (BD-MCTS filtering top-k high-value waypoints from large navigation histories).

PULSE: Privileged Knowledge Transfer from Rich to Deployable Sensors for Embodied Multi-Sensory Learning

This paper proposes PULSE, a framework that performs knowledge distillation from a frozen privileged-sensor (e.g., EDA) teacher to a student model relying solely on cheap, deployable sensors (e.g., ECG, BVP, accelerometer). PULSE introduces shared-private embedding decomposition and a reconstruction-based collapse-prevention mechanism, achieving 0.994 AUROC for stress detection without EDA at inference time—surpassing even models that use all sensors.
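
A sketch of the three training signals as described, assuming they combine additively (weights, names, and the exact reconstruction target are hypothetical):

```python
import torch.nn.functional as F

def pulse_style_loss(student_shared, teacher_shared, recon, target_signal,
                     y_logits, y_true, w=(1.0, 1.0, 1.0)):
    """(1) Distill the frozen privileged-sensor teacher's shared embedding
    into the student's; (2) reconstruct the input modality to keep the
    private embedding informative (collapse prevention); (3) task loss."""
    distill = F.mse_loss(student_shared, teacher_shared.detach())
    recon_loss = F.mse_loss(recon, target_signal)
    task = F.binary_cross_entropy_with_logits(y_logits, y_true)
    return w[0] * distill + w[1] * recon_loss + w[2] * task
```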

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

This paper proposes QuantVLA, the first training-free post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. Through a selective quantization layout and two lightweight calibration mechanisms—Attention Temperature Matching (ATM) and Output Head Balancing (OHB)—QuantVLA achieves approximately 70% memory reduction under W4A8 precision while surpassing the task success rate of the full-precision baseline.
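
The W4A8 layout rests on ordinary fake quantization; a minimal symmetric per-channel version is sketched below (this is the generic primitive only: QuantVLA's selective layout and its ATM/OHB calibration are not reproduced):

```python
import torch

def fake_quantize(x, n_bits):
    """Symmetric per-channel quantize-dequantize ("fake quant")."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

w4 = fake_quantize(torch.randn(256, 256), n_bits=4)  # 4-bit weights
a8 = fake_quantize(torch.randn(1, 256), n_bits=8)    # 8-bit activations
```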

RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

This paper proposes Robot-Conditioned Normalizing Flow (RC-NF), a conditional normalizing flow that decouples the processing of robot-state and object-trajectory features and models their joint distribution for real-time anomaly detection at under 100 ms latency. Trained unsupervised on normal demonstrations only, RC-NF serves as a plug-and-play monitor for VLA models (e.g., π₀), supporting task-level replanning and state-level trajectory rollback (homing), and outperforms state-of-the-art methods (including VLM baselines such as GPT-5 and Gemini 2.5 Pro) by roughly 8% AUC and 10% AP on LIBERO-Anomaly-10.
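
At run time the monitor reduces to thresholding a conditional log-likelihood; a sketch assuming a flow object with an nflows-style `log_prob(inputs, context)` interface (threshold calibration from normal demonstrations is not shown):

```python
import torch

@torch.no_grad()
def is_anomalous(flow, obj_traj, robot_state, threshold):
    """Score object trajectories under the conditional flow trained only on
    normal demonstrations; low likelihood flags an anomaly, which can then
    trigger task-level replanning or state-level rollback (homing)."""
    log_p = flow.log_prob(obj_traj, context=robot_state)  # (B,)
    return log_p < threshold
```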

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

SaPaVe is an end-to-end active manipulation framework that decouples camera actions from manipulation actions via a two-stage bottom-up training strategy: it first learns active perception priors from 200K semantic camera-control pairs, then jointly optimizes for active manipulation, yielding semantics-driven active perception and viewpoint-invariant execution. On real-world tasks it surpasses GR00T N1 and π₀ by 31.25% and 40%, respectively.

STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation

STRNet proposes a unified spatio-temporal representation framework for visual navigation. It employs a graph reasoning module to model intra-frame spatial topology, and combines hybrid temporal shifting with multi-resolution differential convolution to capture temporal dynamics, achieving substantial improvements in goal-conditioned navigation success rates (70% gain over NoMaD).

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This paper introduces the Test-time Ego-Exo Adaptation for Action Anticipation (TE2A3) task and proposes the DCPGN network, which leverages multi-label prototype growing and dual-clue (visual + textual) consistency to online-adapt a source-view trained model to the target view at test time for action anticipation, substantially outperforming existing TTA methods.

Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning

For general scene-adaptive vision-language navigation (GSA-VLN) in open environments, this paper proposes the slow4fast-VLN framework, inspired by Kahneman's dual-process theory of cognition. A fast reasoning module performs real-time navigation via an end-to-end policy network while accumulating historical memory; a slow reasoning module leverages LLM-based reflection to generate structured, generalizable experience entries. These experiences are fed back into the fast reasoning network via attention-based fusion, enabling continuous adaptation to unseen environments and diverse instruction styles. The framework achieves comprehensive improvements over the previous SOTA (GR-DUET) on the GSA-R2R dataset.

Towards Training-Free Scene Text Editing

This paper proposes TextFlow, a training-free scene text editing framework that employs Flow Manifold Steering (FMS) during the early denoising stage to preserve style consistency and Attention Boost (AttnBoost) during the late stage to enhance text rendering accuracy, achieving editing quality comparable to or better than training-based methods without any task-specific training.