🤖 Robotics & Embodied AI

📷 CVPR2026 · 130 paper notes

🔥 Top topics: Robotics ×50 · Multimodal/VLM ×30 · Navigation ×18 · Reasoning ×17 · Agents ×10

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

Addressing the issue of unstable 6-DoF grasping caused by occluded geometric information in "corner views" of single-view point clouds, this paper proposes a post-fusion framework utilizing an auxiliary view captured easily by a robotic arm. By employing self-supervised contrastive learning, cross-view point features are mapped to be "spatially consistent + directionally discriminable." A "Cross-view Aligned Cylindrical Integration" module fuses geometry from two views within a grasp-related cylindrical neighborhood. On GraspNet-1Billion, the Seen split AP reaches 74.08 (RealSense, +3.55 Gain), with a 96% clearing success rate on a real robotic arm.

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

ACoT-VLA replaces the "intermediate reasoning" of VLAs, previously language subtasks or target images, with coarse-grained reference action sequences in the action space (Action Chain-of-Thought). An explicit action reasoner generates reference trajectories, while an implicit action reasoner extracts action priors from the VLM's KV cache. These two pathways jointly condition the action head, achieving SOTA on the LIBERO, LIBERO-Plus, and VLABench simulation benchmarks and on real-world hardware.

Action-Sketcher: From Reasoning to Action via Visual Sketches for Robotic Manipulation

This paper proposes Action-Sketcher, which enables VLA models to operate in a "See-Think-Sketch-Act" loop. It first draws spatial intent as a Visual Sketch (composed of points, boxes, and arrows) as a human-readable and editable intermediate representation before generating actions. It significantly outperforms strong baselines like π0.5 and OpenVLA-OFT on long-horizon, cluttered, and referentially ambiguous real-world manipulation tasks. Furthermore, sketches allow for direct human intervention to further improve success rates.

ActiveGrasp: Information-Guided Active Grasping with Calibrated Energy-based Model

Addressing the challenge of grasping targets in cluttered scenes with limited viewpoints, ActiveGrasp employs a calibrated energy-based model to directly model grasp distributions on the SE(3) manifold. It defines the information gain of the "Next-Best-View" (NBV) as the reduction in grasp success entropy, guiding the robot to regions of highest uncertainty. This approach achieves higher success rates with a smaller view budget in both simulation (79% SR) and real-world experiments.
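As a rough illustration of the entropy-reduction criterion described above, the sketch below scores candidate views by the drop in Bernoulli entropy of grasp success. The function names, the candidate-view dictionary, and the predicted post-observation probabilities are hypothetical; the paper's calibrated energy-based model on SE(3) is not reproduced here.

```python
import math

def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def information_gain(p_before, p_after):
    """Reduction in grasp-success entropy after observing from a new view."""
    return bernoulli_entropy(p_before) - bernoulli_entropy(p_after)

def next_best_view(p_before, predicted):
    """Pick the candidate view with the largest predicted entropy reduction."""
    return max(predicted, key=lambda v: information_gain(p_before, predicted[v]))

# Hypothetical predictions: the current belief is maximally uncertain (p = 0.5).
candidates = {"left": 0.55, "top": 0.9, "right": 0.7}
best = next_best_view(0.5, candidates)
```

In this toy setting the view whose predicted success probability is furthest from 0.5 wins, because it removes the most uncertainty about whether the grasp will succeed.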

ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

ActiveVLA integrates "active perception" into 3D Vision-Language-Action (VLA) models: it first utilizes multi-view orthogonal projections and heatmaps to locate 3D key regions, then actively selects optimal virtual camera views around these regions and performs virtual Zoom-in to enhance resolution. This approach significantly improves success rates in scenarios involving occlusions and fine manipulations (achieving a 91.8% average on RLBench).

AdaDexTrack: Dynamic Modulation for Adaptive and Generalizable Dexterous Manipulation Tracking

AdaDexTrack redefines the "Language Command → Dexterous Hand-Object Interaction" pipeline as modulated tracking. A distilled general tracker acts as the "skill carrier," while an RL-trained modulator is integrated into the feedback loop. This modulator performs real-time correction through three interfaces—reference trajectory, object latent variables, and position targets—enabling the stable execution of noisy text-generated references for long-horizon, drift-resistant manipulation and achieving zero-shot sim-to-real transfer.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

The paper proposes Adaptive Action Chunking (AAC), which utilizes action entropy as a cue to dynamically determine the optimal chunk size during inference without additional training or architectural modifications. It consistently improves success rates for GR00T N1.5 and π0.5 on benchmarks such as RoboCasa and LIBERO.
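One way to picture entropy-guided chunking is the sketch below, which assumes the policy exposes a discrete action distribution for every step of a predicted chunk and executes steps only while their entropy stays below a threshold. The threshold value, the `min_size` parameter, and the stopping rule are assumptions made for illustration; the paper's actual criterion may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adaptive_chunk_size(step_distributions, threshold, min_size=1):
    """Return how many leading steps of the predicted chunk to execute.

    Steps are executed while their action entropy stays at or below
    `threshold`; at least `min_size` steps are always executed.
    """
    size = 0
    for probs in step_distributions:
        if entropy(probs) > threshold and size >= min_size:
            break
        size += 1
    return size

chunk = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.90, 0.05, 0.03, 0.02],  # confident
    [0.25, 0.25, 0.25, 0.25],  # uncertain: stop and re-plan from here
    [0.97, 0.01, 0.01, 0.01],
]
n = adaptive_chunk_size(chunk, threshold=0.6)
```

Because the rule only inspects the policy's own output distribution, it needs no extra training, which matches the inference-time framing of the summary.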

Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation

Addressing the "Memory Trap" issue where VLA models rigidly follow training trajectories towards old object locations under scene perturbations, this paper proposes a training-free 3D Spatial Affordance Field (SAF) as a plug-and-play module. The system uses proprioception to detect traps, rolls back to safe historical poses, and employs SAF to sample waypoints and rerank VLA candidate trajectories based on cumulative affordance, achieving an average improvement of 23.5% in real-world OOD scenarios.

AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence

AffordGen transforms "affordance semantic correspondence" from an online planning signal into an offline data generation prior. By establishing keypoint correspondences across large-scale 3D meshes using DINOv2, it batch-transfers grasping and skill segments from a single human demonstration to hundreds of new objects. This process synthesizes a trajectory dataset covering full 6D poses and multiple categories, which is then used to train a closed-loop visuomotor policy, achieving zero-shot generalization to genuinely unseen objects.

AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

AGENTSAFE is the first benchmark to systematically evaluate the safety of "embodied VLM agents executing hazardous instructions." It utilizes an adversarial simulation sandbox (SAFE-THOR) that interfaces with arbitrary agents, a collection of 9,900 hazardous instructions categorized by the "Three Laws of Robotics" (SAFE-VERSE), and a fine-grained diagnostic protocol (SAFE-DIAGNOSE) spanning the "perception-planning-execution" stages. The study evaluates 9 VLMs and 2 agent workflows, revealing a systemic failure where current agents "recognize danger but fail to incorporate this cognition into planning and execution," and proposes a thought-level defense module called SAFE-AUDIT.

AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning

AGiLe jointly trains a "backward planner + forward evaluator" to generate latent sub-goal sequences that are both goal-aligned and dynamically reachable (temporal robustness). These abstract sub-goals are used as queries to filter visual features via cross-attention, implicitly grounding them to pixel-level affordances to drive actions (spatial robustness). It achieves a 97.1% average success rate on LIBERO-LONG, an 8.5% improvement over the previous strongest baseline, LBP.

Align While Search: Belief-Guided Exploratory Inference for World-Grounded Embodied Agents

To address the issue where LLM embodied agents "mechanically replay training trajectories" during object search in partially observable environments, AWS models search as a single-state Bayes-adaptive control. It maintains a hierarchical belief (global linguistic hypotheses + low-level action distribution) at test time, utilizes a frozen LLM to simulate observations for "update \(\to\) projection" belief refreshes, and selects actions based on predicted information gain. Without any gradient updates, it simultaneously improves search success rates and reduces token overhead compared to inference-time scaling and training-time world model baselines.

Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning

Arcadia redefines embodied learning from "single-stage optimization" to a "full-lifecycle problem," utilizing a tightly coupled real→sim→real loop to string together autonomous exploration, generative scene reconstruction, shared navigation/manipulation backbones, and deployment feedback into a self-improving system. It achieves average improvements of 7.07% and 11.08% on navigation and manipulation benchmarks respectively, with real-world success rates significantly exceeding NaVILA and OpenVLA.

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

AT-VLA introduces a learnable tactile gating mechanism into pre-trained VLAs (GO-1), injecting tactile signals into the action expert only during the moment of "object contact" to prevent the new modality from disrupting pre-trained visual grounding capabilities. By decoupling a slow visual stream and a fast tactile stream, it achieves a 0.04s closed-loop reaction, improving average success rates from 0.22 (vanilla) to 0.50 in real contact-rich tasks such as unzipping, stamping, wiping vases, and unscrewing caps.

AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots

This paper proposes AtomicVLA, a unified planning-execution framework built upon \(\pi_0\). By utilizing adaptive Think-Act switching to generate atomic skill abstractions and routing actions to specialized experts via Skill-Guided MoE (SG-MoE), it improves the LIBERO-LONG success rate from 85.2% to 95.2% (+10 points), achieves a +18.3% gain in real-world Franka long-horizon tasks, and increases continual learning performance by 21%.

AURA: Multi-modal Shared Autonomy for Urban Navigation

AURA decomposes urban sidewalk navigation into hierarchical shared autonomy where "humans provide high-level instructions and AI handles low-level control." By using a Spatial-Aware Instruction Encoder (SIE) to align text, sketches, and arrows with scene semantics and geometry, and an anchor-based diffusion policy to generate trajectories, it reduces human takeover frequency by 44% and operational costs by over 70% in both simulation and real-world environments.

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

This work re-examines the visual processing of VLA models from a POMDP perspective and proposes the AVA-VLA framework. By utilizing a recurrent state and an active visual attention module, it dynamically modulates the importance of current-frame visual tokens based on historical context, achieving SOTA performance on benchmarks such as LIBERO and CALVIN.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

AwareVLN equips end-to-end VLN models with "self-aware reasoning" capabilities—sparsely triggering structured reasoning only at critical navigation nodes (subtask completion, path deviation, or stopping errors). It utilizes an automated data engine that requires no manual annotation to generate introspective supervision, enabling a pure monocular RGB agent to significantly outperform previous SOTA models on R2R-CE and RxR-CE.

Beyond Mimicry: Learning Whole-Body Human-Humanoid Interaction from Human-Human Demonstrations

To enable humanoid robots to learn whole-body physical interactions such as hugging, handshaking, and high-fives, this paper first utilizes Contact-Aware Interaction Retargeting (PAIR) to translate massive "Human-Human Interaction" (HHI) data into physically consistent "Human-Humanoid Interaction" (HHoI) data. It then employs a hierarchical diffusion policy (D-STAR) that decouples "when to move" and "where to move" to learn synchronized interactions. The method achieves an average success rate of 75.4% across 6 interaction tasks and is deployed on the Unitree G1 robot.

Beyond Success: Refining Elegant Robot Manipulation from Mixed-Quality Data via Just-in-Time Intervention

Addressing the problem where Vision-Language-Action (VLA) policies learn "successful but non-elegant" behaviors from mixed-quality human demonstrations, this work avoids retraining the base policy. Instead, it trains an Elegance Critic offline (using Cal-QL to estimate the "elegance value" of actions) and triggers multi-candidate re-selection only during critical decision moments by monitoring Q-value fluctuations. This improves the Elegant Success Rate from approximately 50% to 67% in LIBERO-Elegant and real-world experiments (+23.7 pts on hardware).

BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration

The BiPreManip framework is proposed to achieve bimanual preparatory manipulation based on visual affordance representations. It first imagines the target interaction region for the lead hand and then guides the helper hand to perform preparatory actions (e.g., flipping a bottle so the cap faces the lead hand), significantly outperforming baselines in both simulation and real-world environments.

Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior

This paper proposes the Feasible Action Neighborhood (FAN) regularizer, which shapes the output distribution of VLA models into a Gaussian form that matches physical action tolerances. It significantly improves success rates, generalization, and sample efficiency in both SFT and RFT paradigms (RFT achieves a 90% success rate with only 1/3 of the training steps).

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

This paper proposes HSGM—a hierarchical map that rasterizes 3D geometric information into multi-channel 2D top-down views readable by VLMs. This allows the VLM to focus on high-level semantic decisions ("picking the next waypoint on the map") while an A* algorithm handles low-level collision-free movement. In a completely training-free zero-shot setting, it achieves 47.9% / 41.8% SR on R2R-CE / RxR-CE, surpassing all zero-shot methods and even some supervised models.
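The low-level half of this division of labor can be sketched with a textbook A* search on a 2D occupancy grid, where the goal cell stands in for the waypoint picked by the VLM. The grid encoding, 4-connectivity, unit step cost, and Manhattan heuristic are assumptions for illustration, not details taken from the paper.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (1 = obstacle).

    Returns a list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < cost.get((nr, nc), float("inf")):
                    cost[(nr, nc)] = ng
                    came_from[(nr, nc)] = cell
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None

grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
```

Here the wall in the middle row forces the path around the right-hand side, which is the kind of collision-free detail the VLM no longer has to reason about.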

Chain of World: World Model Thinking in Latent Motion (CoWVLA)

CoWVLA is proposed to unify the advantages of world-model VLAs and latent-action VLAs. By utilizing a Latent Motion Extractor to decompose video into structural and motion latents, the VLA performs world-model prediction within the latent motion space instead of reconstructing redundant pixels. Combined with Co-Fine-tuning to alternately generate keyframes and action tokens, it achieves 95.2% on LIBERO-LONG, surpassing \(\pi_0\) (85.2%), and an average score of 0.560 on SimplerEnv-WidowX, exceeding \(\pi_0\) (0.425).

CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

CLaD enables robots to plan within a compact latent space. It models the co-evolution of modalities through an asymmetric cross-attention mechanism where "proprioceptive changes query semantic changes." It predicts latent foresight grounded by both EMA targets and reconstruction losses, which then modulates a diffusion policy for action generation. On LIBERO-LONG, it achieves a 94.7% success rate with only 0.66B parameters, outperforming the 7B OpenVLA.

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

This paper proposes CoMo, which synergistically addresses the shortcut learning problem in continuous latent motion learning through two mechanisms: Early Temporal Differencing (Td) and Temporal Contrastive Learning (Tcl). It extracts fine-grained continuous pseudo-action labels from internet videos, allowing video data and robot actions to be co-trained under a unified continuous distribution, significantly improving policy performance.

Contact-Aware Neural Dynamics

Addressing the sim-to-real gap in contact-rich manipulation with dexterous hands, this paper utilizes off-the-shelf simulators as priors and develops a neural forward dynamics model that implicitly aligns simulation and reality through a "predict contact events first, then predict contact-conditional diffusion poses" approach. By anchoring physical reality with binary contact signals from robot tactile sensors, the model achieves state-of-the-art MSE and ADD-S in long-horizon predictions for single/multi-object tasks and enables screening/fine-tuning of policies trained purely in simulation for higher real-world success rates.

Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning

This paper proposes NeSyCR, a neurosymbolic counterfactual reasoning framework that abstracts video demonstrations into symbolic world models. By performing counterfactual state deduction to detect cross-domain incompatibilities and automatically correcting program steps, it achieves a 31.14% improvement in success rate over the strongest baseline, Statler, on cross-domain demo-to-code tasks.

Cross-Hand Latent Representation for Vision-Language-Action Models

XL-VLA trains a shared, embodiment-independent latent action space for four structurally diverse dexterous hands. By plugging this space into a VLA framework like \(\pi_0\) to replace original joint state tokens, a single hand-agnostic policy can simultaneously control multiple dexterous hands, improving the average cross-embodiment manipulation success rate from 0.55 to 0.90 on real hardware.

Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

Addressing the language-perception alignment challenge caused by "partial observability" in VLN, this paper proposes imagining future key semantics through language (rather than images). It introduces ATD, a dual-branch left-right brain structure: the left-brain LLM estimates the current navigation state, while the right-brain LLM textualizes the imagination of the scene ahead. Irrelevant details are filtered via State Grounded Cross-Attention (SGCA), and the information is injected into a graph-based navigation policy via a decoder-free latent vector. With only 1.5B parameters, it achieves a 12%/11% improvement in SR/SPL on R2R val unseen compared to the baseline.

CUBic: Coordinated Unified Bimanual Perception and Control Framework

CUBic reformulates "bimanual coordination" as a unified perception representation problem—using a pair of shared-mapping VQ codebooks to bind the perception tokens of the left and right arms in the same latent space, followed by a DiT diffusion policy to output actions. Both "arm independence" and "bimanual coordination" emerge naturally from the architecture, achieving an average success rate on RoboTwin 12% higher than SOTA visuomotor policies.

CycleManip: Enabling Cycle-based Manipulation via Effective History Perception and Understanding

Addressing cyclic manipulation tasks such as "shaking a bottle three times" or "hammering a nail eight times," which require precise cycle counting and timely termination, CycleManip introduces "Cost-Aware Sampling" to efficiently expand history perception and "Multi-task Progress Prediction" to force the model to understand cycle stages within end-to-end imitation learning. It increases success rates on cyclic tasks from single-digit/low percentages to 53–97% in both simulation and real-world experiments.

D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation

D3D-VLP reformulates "planning, 3D grounding, and navigation" into a unified internal autoregressive 3D Chain-of-Thought (3D CoT) within a 3D-VLM, complemented by a CoT memory feedback loop for dynamic re-planning. By utilizing a "Fragmented Supervision" strategy, the model jointly trains on 10 million samples with incomplete annotations (e.g., navigation-only labels), achieving new SOTA results on multiple embodied navigation and grounding benchmarks, including R2R-CE, REVERIE-CE, HM3D-OVON, and SG3D.

Dejavu: Towards Experience Feedback Learning for Embodied Intelligence

An Experience Feedback Network (EFN) is attached to a frozen VLA policy. It retrieves semantically similar historical trajectories from an "experience bank" that grows continuously during deployment. Using reinforcement learning, it predicts a residual correction added to the original action. This allows robots to improve through "accumulating and invoking memory" without updating any backbone weights. Success rates for LIBERO long tasks increased from 53.7% to 76.5%, and average real-world success rates improved from 25.8% to 70.2%.

DemoFunGrasp: Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

This work decomposes "functional grasping" into two conditions: affordance (where to grasp) and grasping style (how to grasp). By employing "Single-step Demonstration-Editing RL"—which collects only one demonstration and requires the policy to output residual corrections—it bypasses the multi-step, multi-task exploration challenges of high-DOF dexterous hands. A universal functional grasping policy is trained on 3,200 objects and achieves zero-shot transfer to real-world robots (64.4% success rate under VLM guidance).

DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

DextER reformulates "language-driven multi-finger dexterous grasping" as an autoregressive sequence: the model first generates contact tokens (specifying which finger link contacts which 3D position on the object surface) and then generates grasp action tokens. By using "contact reasoning" as an intermediate step for an embodied Chain-of-Thought, it raises the success rate on DexGYS to 67.14% (+3.83 p.p.) and improves the intent alignment metric P-FID by 96.4% relative to the previous SOTA.

Dexterous World Models

Given a static 3D scene and a sequence of first-person dexterous hand movements, DWM utilizes a scene-action conditioned video diffusion model to generate only the residual visual changes (grasping, opening doors, moving objects) caused by hand manipulation. By keeping camera motion and unaffected regions unchanged, it enables static digital twins to "move" for the first time and serves as a visual world model for evaluating candidate actions.

Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

The ViFailback framework is proposed to efficiently annotate real-world robot manipulation failures using visual symbols (arrows, crosshairs, labels, etc.), constructing a dataset of 58,128 VQA pairs. ViFailback-8B VLM is trained to achieve failure diagnosis and visual+textual correction guidance, which, when integrated with VLA, achieves a 22.2% improvement in task success rate.

DiffuView: Multi-View Diffusion Pretraining for 3D-Aware Robotic Manipulation

DiffuView treats "multi-view diffusion generation" as a 3D-consistent visual pretraining task—teaching the network to "generate a target view given source view observations and camera poses" to implicitly recover scene geometry. The pretrained diffusion UNet is then utilized as a visual backbone for a diffusion action policy, enabling stable robot arm manipulation even under camera view shifts. It achieves a success rate nearly 20% higher than existing methods in view-offset scenarios.

Do You Have Freestyle? Expressive Humanoid Locomotion via Audio Control

RoboPerform establishes an end-to-end, retargeting-free generative framework for "audio-to-humanoid motion." It aligns audio latents with the motion latent space via contrastive learning, trains a teacher policy using Residual Mixture of Experts (\(\Delta\)MoE), and distills a diffusion student policy that decouples "content" (text-specified task) and "style" (audio rhythm/prosody). This enables the Unitree G1 to directly dance to music or perform co-speech gestures with latency significantly lower than conventional cascaded pipelines that rely on intermediate human motion reconstruction.

DynBridge: Bridging Imagination and Control through Interaction Dynamics for Robot Manipulation

DynBridge proposes "interaction dynamics" as a latent representation to end-to-end couple "imagining the future (trajectory generation)" and "control decision-making (action prediction)". This allows the robot to learn not just "where" the environment changes but also "how" actions cause these changes. It outperforms methods like ATM and GraphMimic on simulation and real-world benchmarks (LIBERO / Meta-World) without requiring pre-training on additional robot data.

EgoRoC: Towards Egocentric Robotic Control via Task-Agnostic Visual Alignment

EgoRoC decouples "how the robot sees" from "how the robot acts" by introducing a plug-and-play egocentric alignment head. Before manipulation, it aligns the wrist camera view to the target, outputting only a 6-DoF pose interface to the downstream VLA. A diffusion-based online hand-eye calibration module then transforms or corrects the alignment action into the end-effector coordinate system. Trained once using only static image pairs, it enhances the success rates of various VLAs across tasks and hardware in a zero-shot manner (especially for long-horizon and out-of-distribution tasks).

End-to-End Language-Action Model for Humanoid Whole Body Control

SENTINEL is the first fully end-to-end "language → humanoid low-level action" model. It generates a large-scale language-action dataset by using a pre-trained whole-body controller to track human motions with text annotations in simulation. It then employs a flow matching action expert to map language instructions and proprioception directly to 29-dimensional joint targets, with a residual reinforcement learning head to correct open-loop drift. It achieves significantly better semantic alignment and execution success rates (99.45% in simulation) on both simulation and the Unitree G1 hardware compared to two-stage "text-to-motion + controller" baselines.

EnergyAction: Unimanual to Bimanual Composition with Energy-Based Models

Two pre-trained unimanual policies are treated as energy functions and "composed" into a bimanual policy via energy summation. Spatiotemporal coordination is ensured through energy constraints, and an energy-aware adaptive denoising scheme determines the number of steps. This achieves coordinated bimanual manipulation with minimal dual-arm demonstration data (77.3% success rate on RLBench2 with only 20 demonstrations, outperforming the runner-up by 32.5%).
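The idea of composing policies by energy summation can be illustrated on a toy problem. The sketch below sums two quadratic "unimanual" energies and a coordination term over a two-dimensional joint action, then minimizes the total by gradient descent. The quadratic energies, the coordination constraint, and the finite-difference optimizer are all stand-ins chosen for illustration; they are not the paper's learned energy functions or its adaptive denoising scheme.

```python
def compose_energies(*energies):
    """Compose energy functions by summation (lower energy = more preferred)."""
    return lambda a: sum(e(a) for e in energies)

def minimize(energy, a, lr=0.1, steps=500, eps=1e-5):
    """Gradient descent with central finite differences on a list of floats."""
    a = list(a)
    for _ in range(steps):
        grad = []
        for i in range(len(a)):
            hi = a[:i] + [a[i] + eps] + a[i + 1:]
            lo = a[:i] + [a[i] - eps] + a[i + 1:]
            grad.append((energy(hi) - energy(lo)) / (2 * eps))
        a = [x - lr * g for x, g in zip(a, grad)]
    return a

# Toy joint action a = [left, right]; each arm prefers its own target.
left_energy = lambda a: (a[0] - 1.0) ** 2
right_energy = lambda a: (a[1] - 3.0) ** 2
# Hypothetical coordination term: keep the two arms 1.0 apart.
coord_energy = lambda a: (a[1] - a[0] - 1.0) ** 2

joint = compose_energies(left_energy, right_energy, coord_energy)
action = minimize(joint, [0.0, 0.0])
```

The minimizer lands between each arm's individual preference and the coordination constraint, which is the compromise that summing energies is meant to express.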

Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

Evo-1 utilizes a native multimodal VLM with only 0.77B parameters as the backbone, paired with a pure cross-attention flow-matching diffusion action expert and a "freeze-then-fine-tune" two-stage training strategy. Without any robot data pre-training, it achieves SOTA on Meta-World, RoboTwin, and LIBERO by preserving the VLM's semantic space, reaching a 78% success rate in real-world tests with 16.4 Hz inference and only 2.3 GB VRAM.

Extending Embodied Question Answering from Perception to Decision

This work constructs EQA-Decision, a 4-million-scale embodied question answering dataset (covering nine sub-tasks across four modules: static scenes, spatial understanding, task dynamics, and instant decision-making). Based on Qwen3-VL-8B, the authors train a strong baseline model, RoboDecision, through a three-stage "SFT → CoT-SFT → GRPO + Mixed Reward" pipeline. This advances embodied QA from "what is seen" to "what should be done now," improving the overall score from 48.84 to 68.06 across six task categories in the self-built benchmark.

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation

FantasyVLN enables a VLN model to learn textual, visual, and multimodal Chain-of-Thought (CoT) during training. It compresses "imagined future observations" into a Visual AutoRegressive (VAR) latent space to avoid token explosion. Through cross-modal alignment constraints, these reasoning capabilities are distilled into a "direct decision-making" path that bypasses explicit CoT generation during inference. This achieves instruction-to-action mapping with zero explicit reasoning overhead while retaining reasoning power. On the long-horizon LH-VLN benchmark, the Success Rate (SR) improved from 0.65 to 2.44, with inference latency roughly an order of magnitude faster than explicit CoT methods.

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Fast-ThinkAct is proposed to compress lengthy textual CoT reasoning (~250 tokens) into 6 verbalizable continuous latent tokens. By combining reward-guided preference distillation with visual trajectory alignment, it achieves an 89.3% reduction in inference latency (9.3× faster than ThinkAct-7B) while maintaining or exceeding the performance of SOTA reasoning VLAs.

FLARE: A Failure-Aware Framework for Autonomous Correction and Recovery in Visual-Language Robotic Manipulation

FLARE categorizes robotic VLA failures into "robot pose error (ID)" and "environmental destruction (OOD)." It uses perturbation-bridge data augmentation to provide models with endogenous "retry" capabilities and MLLM-driven offline mining of failure videos to automatically learn object-level "reset" skills. An online MLLM monitor then orchestrates closed-loop switching between these skills, improving the average success rate of \(\pi_{0.5}\) on 9 contact-rich RoboMimic tasks from 72.2% to 84.0%.

FloVerse: Floor Plan-Guided Multi-Modal Navigation

FloVerse uses the floor plan as a unified spatial prior and proposes a navigation task and dataset that merges three target modalities (PointNav, ObjectNav, and ImageNav) into a single model. Using a two-stage diffusion policy called ThreeDiff (a planner with modality masking plus a depth-SDF-based refiner), the method achieves higher success rates and path efficiency across all three modalities compared to mapless approaches or single-modality expert models.

FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising

FM-Steer introduces a test-time computation framework for flow-matching (FM) VLA generalist policies. It employs an intermediate flow verifier to estimate Q-values for "semi-denoised" candidate actions and selects the optimal one via Best-of-N. Subsequently, a lightweight Lite-Flow denoiser asynchronously completes the remaining denoising steps. This approach enhances π0 performance by +4.4%, +25.9%, and +12.9% on LIBERO, Simpler, and real-world robots respectively, while increasing the control frequency from 4 Hz to 90 Hz without retraining the foundation model.
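The selection step can be sketched as a generic Best-of-N loop: draw several partially denoised candidates, score each with a verifier, and finish denoising only the winner. Every callable below (`sample_partial`, `q_value`, `finish`) is a hypothetical toy stand-in operating on a one-dimensional action; none of it reflects the actual FM-Steer verifier or Lite-Flow denoiser.

```python
import random

def best_of_n(sample_partial, q_value, finish, n, rng):
    """Draw n partially denoised candidates, keep the highest-scoring one,
    and run the remaining denoising steps on that candidate only."""
    candidates = [sample_partial(rng) for _ in range(n)]
    best = max(candidates, key=q_value)
    return finish(best), candidates

# Toy stand-ins (all hypothetical): a 1-D "action" whose ideal value is 0.5.
target = 0.5
sample_partial = lambda rng: rng.uniform(-1.0, 1.0)  # semi-denoised action
q_value = lambda a: -abs(a - target)                 # verifier score
finish = lambda a: a + 0.5 * (target - a)            # remaining denoising

rng = random.Random(0)
action, candidates = best_of_n(sample_partial, q_value, finish, n=8, rng=rng)
```

Scoring candidates before the final denoising steps means the expensive remainder of the computation is spent on a single action rather than on all N.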

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Analysis reveals that the root cause of poor transferability in visual jailbreak attacks is that the attack resides in a high-sharpness loss region—stemming from an over-reliance of shallow features on model-specific representations and the excessive influence of high-frequency information. This work proposes the FORCE method, which expands the feasible region of shallow layers through layer-aware regularization and suppresses high-frequency non-semantic components via spectral rescaling, guiding the attack into a flatter loss landscape to significantly improve cross-model transferability.

ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation

ForceVLA2 is the first end-to-end model within a VLA framework to unify force awareness and hybrid force-position control. By constructing cross-stage force-aware task concepts in VLMs via Force-based Prompts and adaptively fusing task semantics with real-time interactive forces through a Cross-Scale MoE, it achieves closed-loop force-position regulation. Across five contact-rich tasks, it achieves an average success rate of 66%, outperforming \(\pi_0\) and \(\pi_{0.5}\) by 48.0% and 35.0%, respectively.

ForeAct: Steering Your VLA with Efficient Visual Foresight Planning

Instead of driving a VLA with a single high-level language instruction, ForeAct utilizes an efficient "foresight image generator + VLM sub-task planner" to progressively provide the VLA with "imagined future observations + sub-task text." This allows the VLA to focus exclusively on visuo-motor mapping. On 11 real-world multi-step tasks, it improves the average success rate of \(\pi_0\) from 46.5% to 87.4% (+40.9%).

From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation

ManualVLA employs a unified Mixture-of-Transformers framework that enables a VLA model to first "imagine" intermediate manuals (comprising sub-goal images, pixel coordinates, and text instructions) from a "goal state." It then translates these manuals into precise actions through explicit and implicit Manual Chain-of-Thought paths. On long-horizon tasks such as LEGO assembly and object rearrangement, the average success rate is 32% higher than previous hierarchical SOTA methods.

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

This work projects RGB-D observations into a compact, agent-centric Bird's-Eye View (BEV) representation that fuses explicit depth geometry with implicit priors from a 3D foundation model. By replacing redundant dense RGB patch tokens in MLLM navigators with this representation, the method achieves SOTA performance in continuous-environment VLN with significantly fewer tokens, without requiring DAgger augmentation or VQA co-training.

Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains

Gallant voxelizes vehicle-grade LiDAR point clouds into robot-centric occupancy grids, utilizing a lightweight 2D CNN that treats the z-axis as channels for end-to-end mapping to whole-body control strategies. By incorporating high-fidelity LiDAR simulation that accounts for the robot's own limbs, a single policy achieves zero-shot sim-to-real transfer. It marks the first instance of achieving >90% success rates in tasks like stair climbing and high-platform mounting while covering ground, lateral, and overhead obstacles simultaneously.
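
The voxelization step can be illustrated with a small sketch. The grid extents, resolution, and the function name `voxelize` are assumptions chosen for the example, not values from the paper. The point is only the layout: the z-axis becomes the channel dimension, so a 2D CNN can consume the grid like a multi-channel image.

```python
def voxelize(points, half_extent=1.0, z_min=-1.0, z_max=1.0, res=0.5):
    """Build a robot-centric occupancy grid indexed as grid[z][x][y].

    All sizes are illustrative assumptions. Points outside the box are dropped.
    """
    nxy = int(round(2 * half_extent / res))
    nz = int(round((z_max - z_min) / res))
    grid = [[[0] * nxy for _ in range(nxy)] for _ in range(nz)]
    for x, y, z in points:
        # Floor division keeps negative offsets out of bounds instead of
        # truncating them toward cell 0.
        ix = int((x + half_extent) // res)
        iy = int((y + half_extent) // res)
        iz = int((z - z_min) // res)
        if 0 <= ix < nxy and 0 <= iy < nxy and 0 <= iz < nz:
            grid[iz][ix][iy] = 1
    return grid


# Two points inside the box and one far outside it (made-up coordinates).
grid = voxelize([(0.1, 0.1, 0.1), (-0.9, 0.9, -0.9), (5.0, 0.0, 0.0)])
occupied = sum(cell for plane in grid for row in plane for cell in row)
print(len(grid), "z-channels,", occupied, "occupied voxels")
```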

GeCo-SRT: Geometry-aware Continual Adaptation for Cross-Task Sim-to-Real Transfer

GeCo-SRT transforms sim-to-real transfer from a one-off parameter tuning process into a continual, cross-task accumulation process. It quantifies the sim-to-real gap using human-in-the-loop correction trajectories and uses a "Geometry-aware Mixture-of-Experts (Geo-MoE)" to treat local geometric features of point clouds (planarity, linearity, saliency) as reusable knowledge carriers that are invariant across both tasks and domains. In addition, "Geometry-expert guided Priority Experience Replay (Geo-PER)" prevents the knowledge held by idle experts from being forgotten. On four real robotic arm tasks, the average success rate is 52% higher than the baseline, and performance parity is achieved using only 1/6 of the data.

General Process Reward Modeling for Robotic Reinforcement Learning

This work proposes Robo-Dopamine. It first trains a "step-wise, cross-task" general process reward model (GRM) on 3,400 hours of multi-view video, then feeds dense reward signals to RL through a theoretically guaranteed "policy-invariant reward shaping" mechanism. This enables real-robot policies to improve from near 0% to a 95% success rate with a single demonstration and approximately 150 online rollouts (~1 hour).
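
One standard way to obtain policy-invariant shaping is the classic potential-based form, \(r' = r + \gamma\,\Phi(s') - \Phi(s)\). The sketch below assumes that form, with a progress estimate playing the role of the potential \(\Phi\). Whether the paper uses exactly this construction is not stated in this note, and all numbers are made up. The shaped return differs from the original only by a term that does not depend on the actions taken, which is why the optimal policy is unchanged.

```python
def shaped_rewards(rewards, potentials, gamma):
    """rewards[t] belongs to the transition s_t -> s_{t+1};
    potentials must hold one more entry than rewards."""
    return [r + gamma * potentials[t + 1] - potentials[t]
            for t, r in enumerate(rewards)]


def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.9
rewards = [0.0, 0.0, 1.0]            # sparse task reward (illustrative)
potentials = [0.1, 0.4, 0.8, 0.0]    # hypothetical progress estimates, 0 at the terminal state

plain_return = discounted_return(rewards, gamma)
shaped_return = discounted_return(shaped_rewards(rewards, potentials, gamma), gamma)

# With a zero terminal potential the sum telescopes to: shaped = plain - Phi(s_0).
print(plain_return, shaped_return)
```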

GeniNav: Generative Model Driven Image-Goal Navigation via Imagination-Guided Consistency Flow Matching

GeniNav employs a VLM to "imagine" intermediate subgoals in a latent space to guide a Multi-Segment Consistency Flow Matching (MS-CFM) policy for generating smooth trajectories. A Hybrid Ranking Module (HRM), which integrates geometric safety, semantic alignment, and field-of-view gain, is then used to select the optimal path, improving the success rate from ~54% to 68.7% in mapless image-goal navigation.
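
As a rough illustration of ranking candidate trajectories on several criteria, the sketch below combines three scores with a weighted sum. The weighted-sum rule, the weights, and the field names are all assumptions for the example; this note does not specify how the HRM actually integrates its terms.

```python
def rank_trajectories(candidates, weights=(0.5, 0.3, 0.2)):
    """Sort candidates by a weighted sum of three scores in [0, 1].

    The weights and the linear combination are illustrative assumptions.
    """
    w_safe, w_sem, w_fov = weights

    def score(c):
        return w_safe * c["safety"] + w_sem * c["semantic"] + w_fov * c["fov_gain"]

    return sorted(candidates, key=score, reverse=True)


# Made-up candidate scores.
candidates = [
    {"name": "A", "safety": 0.9, "semantic": 0.2, "fov_gain": 0.1},
    {"name": "B", "safety": 0.6, "semantic": 0.9, "fov_gain": 0.5},
    {"name": "C", "safety": 0.2, "semantic": 0.9, "fov_gain": 0.9},
]
ranked = [c["name"] for c in rank_trajectories(candidates)]
print(ranked)
```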

GeoDexGrasp: Geometry-aware Generation for Data-efficient and Physics-plausible Dexterous Grasping

GeoDexGrasp utilizes a SIM(3) equivariant network with self-supervised disentangled pre-training to extract four categories of interpretable and transferable geometric representations (shape, size, pose, and interaction direction) from point clouds. It decomposes dexterous grasping into a two-stage decoupled pipeline: "root rotation generation on the SO(3) manifold + finger joint diffusion generation in Euclidean space." It achieves comparable success rates with less than one-fifth of the parameters of SOTA models and reduces penetration depth by approximately 40%.

GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

GeoPredict augments a continuous-action VLA policy (based on \(\pi_0\)) with two "future prediction" auxiliary tasks—predicting multi-step 3D trajectories of robot keypoints and predicting future 3D Gaussian geometry of the workspace. These two modules serve as supervision signals only during training and are not executed during inference. This allows the policy to learn internal representations oriented toward 3D space and long-horizon dynamics without increasing deployment overhead, significantly outperforming the \(\pi_0\) baseline on RoboCasa, LIBERO, and real-world platforms.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

OptimusVLA equips the action generator of a hierarchical VLA with two memory modules: the Global Prior Memory (GPM) replaces the Gaussian noise starting point with retrieved similar trajectories to shorten the flow matching path, and the Local Consistency Memory (LCM) models historical actions with a lightweight structure to inject temporal consistency constraints. This achieves higher success rates (98.6% on LIBERO) while delivering 2.9× inference acceleration on real robots.

GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions

GraspALL encodes continuously varying illumination into a set of learnable "luminance curves," using estimated light levels to dynamically regulate the fusion weights of RGB and depth (non-RGB) features. This generates illumination-consistent garment grasping representations under arbitrary low-light conditions, improving the grasp success rate by 32–44% over baselines on a self-constructed multi-illumination garment dataset.

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

GraspGen-X conditions a diffusion-based 6-DOF grasp model on "gripper representation"—a 12-dimensional Swept Volume heuristic describing the space the fingers sweep through during closing. By training on 25 procedurally generated grippers and 395 million simulated grasps, it achieves zero-shot 6-DOF grasping for unseen real grippers + unseen objects for the first time, with a real-robot success rate of 79%, significantly outperforming baselines such as grasp pose retargeting.

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

GraspLDP injects grasp pose priors and graspness-map visual cues from a pre-trained grasp detector into a latent diffusion policy framework. Through guidance in an action latent space encoded by a VAE and a self-supervised reconstruction objective, it significantly improves grasping precision and generalization.

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

The HiF-VLA framework is proposed, utilizing Motion Vectors as compact temporal primitives to unify Hindsight, Insight, and Foresight reasoning capabilities. By achieving bidirectional temporal expansion for VLA models, it significantly outperforms baselines in long-horizon manipulation tasks with minimal computational overhead.

HQC-NBV: A Hybrid Quantum-Classical View Planning Approach

The "Next-Best-View" (NBV) problem in robotic exploration is reformulated as finding the ground state of a quantum Hamiltonian. Using a 10-qubit variational circuit with VQE/SPSA to simultaneously evaluate multiple movement directions, the approach leverages quantum superposition and entanglement to escape local optima common in classical heuristics or sampling methods. In 2D exploration scenarios, it improves exploration efficiency by 7.9–49.2% compared to classical methods.

HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

HTNav establishes a foundation for urban UAV vision-language navigation using a hybrid training paradigm of "IL pre-training + PPO fine-tuning," combined with a tiered decision-making mechanism ("macro-level waypoint planning + micro-level action selection") and a residual map encoding module. It more than doubles the success rate on unseen test scenarios in CityNav, from 9.70% to 25.49%.

IGen: Scalable Data Generation for Robot Learning from Open-World Images

IGen starts from a single open-world image and automatically generates large-scale vision-action training data through a pipeline of 3D scene reconstruction → VLM task planning → SE(3) action generation → point cloud synthesis → frame rendering. Policies trained solely on this generated data can successfully perform real-world manipulation tasks.

INSIGHT Bench: Towards Grounded IN-SItu Guidance for Robotic Manipulation

Addressing the gap where current VLA models follow external language instructions but fail to understand in-situ symbols like "PUSH/PULL/Arrows/Squeeze" printed on objects, this paper proposes INSIGHT Bench—a robotic manipulation benchmark that programmatically binds in-situ visual guidance with physical constraints. It features a five-category guidance taxonomy, a scalable automated data generation pipeline, and a dataset of 14,076 trajectories, revealing that π0, GR00T N1.5, and SmolVLA generally fail to stably ground such in-situ guidance.

InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

InternData-A1 utilizes a fully decoupled and autonomous simulation synthesis pipeline to generate 630,000 trajectories (7,433 hours) of high-fidelity robotic manipulation data. It demonstrates for the first time that a VLA model pre-trained solely on "purely synthetic data" can match the performance of the official \(\pi_0\) pre-trained on the closed-source real-world \(\pi\text{-dataset}\) across 49 simulation and 9 real-world tasks.

Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control

This paper proposes CLAIMS—a closed-loop framework where "motion data synthesis" and "humanoid controller training" co-evolve. It utilizes motion diffusion models to generate professional high-dynamic motions from difficulty-graded semantic template prompts. Following dual filtering via physics and VLM, a physics-based motion tracker is trained. Feedback from physics metrics and VLM then drives an LLM to automatically escalate difficulty, reducing the average failure rate of the PHC tracker by 45% on a 2201-segment test set using only approximately 1/10 of the AMASS data volume.

Language-Grounded Decoupled Action Representation for Robotic Manipulation (LaDA)

The LaDA framework is proposed to decouple continuous 7-DoF actions into interpretable primitives (translation, rotation, and gripper) using natural language as a semantic bridge. By employing soft-label contrastive learning to align cross-task action representations in a shared embedding space, the model achieves a 93.6% success rate on LIBERO with only 0.6B parameters, surpassing all baselines ranging from 1.3B to 8.5B parameters.

Learning a Unified Latent Action Space from Videos with Action-centric Cycle Consistency

CycleMimic is proposed to learn a latent action tokenizer from unlabeled videos using "Action-centric Cycle Consistency (AC3)." By establishing a closed loop of "sampling latent actions → generating future frames → predicting the action back from the original and generated frames," the method enforces a semantically consistent and unified cross-embodiment latent action space. It improves performance over OpenVLA by 20.1% on LIBERO and increases the average completed tasks on CALVIN from 3.27 to 3.93.

Learning Surgical Robotic Manipulation with 3D Spatial Priors

A feed-forward 3D geometric reconstruction model (MASt3R) is fine-tuned on a self-constructed synthetic surgical dataset to extract 3D implicit representations end-to-end from stereo endoscopic images. These representations are aligned to the robot action space using a lightweight connector, enabling real surgical robots to achieve SOTA success rates in delicate tasks like knot tying and ex-vivo gallbladder dissection without relying on wrist-mounted cameras.

Learning to Act Robustly with View-Invariant Latent Actions

VILA argues that view-invariance should be imposed not on the "visual representation of the entire scene" but solely on "action-related dynamic changes." It learns a compact latent action that encodes the change between adjacent frames via IDM/FDM, then uses ground-truth action sequences for action-guided weighted contrastive learning and structural alignment, so that latent actions of the same movement align across different views. Finally, this latent policy serves as a view-invariant encoder to condition downstream policies, achieving significantly higher robustness to unseen views and new tasks in both simulation and real-world experiments.

Learning to Control Physically-simulated 3D Characters via Generating and Mimicking 2D Motions

Mimic2DM reformulates "learning physically controllable characters from video" as a pure 2D reprojection tracking problem. Using only 2D keypoints extracted from in-the-wild videos and leveraging physical simulation as a prior to filter infeasible poses, it trains a view-invariant tracking policy. This policy is extended to 3D tracking via multi-view feature aggregation in a zero-shot manner and integrated with an autoregressive 2D motion generator to form a hierarchical controller. It synthesizes physically plausible motions like dancing, soccer dribbling, and quadruped locomotion without ever touching explicit 3D motion data.
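
The idea of tracking in 2D can be made concrete with a reprojection error: project the simulated character's 3D joints into the image and compare them with the 2D keypoints from video. The sketch assumes an ideal pinhole camera with joints already in the camera frame. The function names and numbers are illustrative, and the paper's actual reward or loss is not given in this note.

```python
def project(point, focal=1.0):
    """Ideal pinhole projection of a camera-frame 3D point (z > 0)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def reprojection_error(joints_3d, keypoints_2d, focal=1.0):
    """Mean Euclidean distance between projected joints and 2D keypoints."""
    total = 0.0
    for joint, (u, v) in zip(joints_3d, keypoints_2d):
        pu, pv = project(joint, focal)
        total += ((pu - u) ** 2 + (pv - v) ** 2) ** 0.5
    return total / len(keypoints_2d)


joints = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]            # simulated character joints
perfect = reprojection_error(joints, [(0.0, 0.0), (0.5, 0.5)])
offset = reprojection_error(joints, [(0.3, 0.4), (0.5, 0.5)])
print(perfect, offset)
```

A tracking policy in this spirit would be rewarded for keeping the error low, with the physics simulator ruling out poses that match the 2D evidence but are not physically feasible.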

Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

The TVVE framework is proposed, which selects optimal virtual camera viewpoints through a reinforcement learning-driven Multi-View Exploration Policy (MVEP) and performs online observation re-rendering. Concurrently, a task-aware MoE visual encoder (TaskMoE) is designed to resolve feature interference in multi-task settings, achieving an average success rate of 86.6% across 18 RLBench tasks.

LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models

Addressing the illusion of "VLA models reporting 95%+ success rates on LIBERO but frequently failing in real deployment," this work constructs LIBERO-Plus, an automated, fine-grained robustness benchmark with seven dimensions of controllable perturbations. Systematic evaluation of 10 mainstream VLA models reveals that success rates plummet from 95% to below 30% under moderate perturbations, uncovering deep vulnerabilities such as "ignoring language, relying on fixed visuals, and depending on positional memory."

Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

A lifelong imitation learning framework is proposed that stores and replays compact representations in the feature space of frozen encoders via Multimodal Latent Replay (MLR). It introduces the Incremental Feature Adjustment (IFA) mechanism with angular distance constraints to maintain inter-task separability, achieving a 10–17 point AUC improvement and up to 65% reduction in forgetting on the LIBERO benchmark.
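
To make "angular distance constraints" concrete, the sketch below measures the angle between per-task feature prototypes and flags pairs that fall below a margin. The prototypes, the margin, and the helper names are hypothetical; how IFA enforces its constraint during training is not described in this note.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def pairs_below_margin(prototypes, margin):
    """Return task pairs whose prototypes are closer than `margin` radians."""
    names = sorted(prototypes)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if angular_distance(prototypes[a], prototypes[b]) < margin]


# Made-up 2D prototypes for three tasks.
prototypes = {"pick": [1.0, 0.0], "place": [0.0, 1.0], "push": [1.0, 0.1]}
too_close = pairs_below_margin(prototypes, margin=0.5)
print(too_close)
```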

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Mantis decouples "future frame prediction" from the VLA backbone—using a set of latent action queries and an independent Diffusion Transformer (DiT) head to generate future frames. This allows the backbone to output only compact inter-frame dynamics as action supervision signals, preserving the benefits of visual foresight while maintaining backbone capacity for language understanding and reasoning. It achieves a 96.7% success rate on LIBERO and outperforms π0.5 in instruction following and generalization on real robots.

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Addressing the issue where fine-tuning VLA models initialized from VLMs destroys pre-trained representations and compromises generalization, MAPS replaces the "global proximity constraint" in robust fine-tuning with a module-wise schedule that linearly decays from the vision encoder to the language layers. By keeping vision layers strictly tethered to pre-trained geometric priors while allowing action-oriented language layers to adapt freely—without adding parameters or data—OOD generalization is improved by up to 30% across SimplerEnv, CALVIN, LIBERO, and real-world Franka robots.

MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis

MaskDexGrasp decomposes dexterous hand grasping into six components (palm + five fingers) based on hand anatomy, quantizes them into discrete tokens using VQ-VAE, and iteratively samples these tokens via a bidirectional masked Transformer conditioned on object point clouds and task text. This approach generates high-quality, semantically aligned, and per-finger editable grasps, achieving SOTA on the self-built TDG dataset (65k grasps / 260k texts / 11 task categories).

Memory-Augmented Scene Understanding and Exploration for Open-World Aerial Object-Goal Navigation

Focusing on the Aerial ObjectNav task in large-scale outdoor scenes—where only target descriptions are provided without step-by-step instructions—this paper proposes OctMem-Agent. It utilizes an adaptive octree memory to incrementally aggregate historical RGB-D observations into a scalable hierarchical 3D representation, then employs instruction-modulated memory queries to extract compact "localization" and "exploration" tokens for VLA decision-making. On the UAV-ON benchmark, the success rate is improved by 7.5% over the previous SOTA.

MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

This work provides the first systematic diagnosis of two root causes preventing VLA model merging (selfish parameter conflicts in LoRA and task coupling caused by self-attention in action experts). It proposes MergeVLA—a framework that merges multiple single-skill VLA experts into a generalist agent using task-masked sparse LoRA activation, de-self-attention action experts, and training-free test-time routing. It achieves a 90.2% success rate on LIBERO and 90% on the real-world SO101 robot.

MM-ACT: Learn from Multimodal Parallel Generation to Act

MM-ACT represents text, images, and actions within a unified set of discrete tokens, utilizing a masked token predictor with bidirectional attention for unified parallel decoding (multi-step re-masking for text/images, and one-step generation for actions). Through Context-Shared multimodal learning, task planning and future image prediction enhance action generation. It achieves 96.3% on LIBERO, 52.38% on eight tasks in RoboTwin2.0 (with a +9.25% gain from cross-modal training), and 72.0% on Franka real-world robots.

MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models

MoEActok decomposes a single action tokenizer into "skill-clustered multi-expert VQ-VAEs," where each expert is responsible only for one category of action skill (e.g., translation / grasping). Combined with a coarse-to-fine training paradigm that "first predicts the skill category, then generates action tokens," it significantly outperforms existing discretization methods such as Binning, FAST, VQ-BET, and VQ-VLA in RoboTwin, Simpler-Env simulations, and real-world zero-shot transfer.

Motus: A Unified Latent Action World Model

Motus employs a Mixture-of-Transformers (MoT) architecture to integrate three pre-trained experts (Understanding, Video Generation, and Action) via shared self-attention (Tri-model Joint Attention) and UniDiffuser-style asynchronous scheduling. It unifies five embodied paradigms (VLA, World Model, IDM, video generation, and joint video-action prediction) within a single model. By extracting pixel-level "latent actions" from optical flow, the action expert can be pre-trained on massive unlabeled videos. Motus outperforms \(\pi_{0.5}\) by 45% and X-VLA by 15% in simulation, with real-world improvements ranging from 11% to 48%.

Obstruction Reasoning for Robotic Grasping

Addressing the long-neglected problem in cluttered scenes where "the target object is obstructed and obstructions must be removed first," this paper proposes UNOGrasp. It is a Vision-Language Model (VLM) that constructs target-centric directed obstruction graphs and is trained via SFT+RFT (GRPO + IoU reward). Accompanied by the self-built UNOBench benchmark (100k+ obstruction paths), it outperforms Qwen2.5-VL and Google’s proprietary Gemini Robotics-ER 1.5 in both obstruction reasoning and grasping success rates across synthetic and real-world scenarios.
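
The "GRPO + IoU reward" combination can be sketched as follows: each sampled prediction is scored by its box IoU against the ground truth, and the rewards are normalized within the sampling group to obtain advantages. This is a generic illustration. The boxes are made up, and the paper's exact reward definition and normalization details are not given in this note.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (mean 0, unit scale)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


target = (0, 0, 2, 2)
predictions = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]  # one group of samples
rewards = [box_iou(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are relative to the group, a prediction is reinforced only when it overlaps the target better than its siblings, so no separate value network is needed.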

OctoNav: Towards Generalist Embodied Navigation

OctoNav unifies five fragmented navigation tasks—ObjNav, PointNav, ImgNav, Ins-ImgNav, and VLN—into a single "free-form, multi-modal, multi-capability" instruction format. The work releases OctoNav-Bench, containing 45k+ instruction-trajectory pairs, and the TBA-CoT dataset with reasoning chains. It introduces OctoNav-R1 (based on LLaMA-VID), a VLA model that "thinks before acting" trained via a three-stage Hybrid Training Paradigm (SFT, GRPO, and online RL), improving the overall success rate from the previous best of 9.2% to 19.4% in a unified setting.

Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory

Addressing the issue that conventional Aerial Vision-and-Dialog Navigation (AVDN) requires supervised fine-tuning and lacks cross-domain generalization, this paper proposes PSC-AVDN. This training-free framework decomposes MLLM navigation into a "Parsing-Search-Confirmation" three-stage Chain-of-Thought (CoT) and incorporates a Structured Spatial Memory (SSM) to compensate for missing spatial/historical information. It achieves training-free SOTA performance on ANDH / ANDH-Full, performing comparably to or better than several fine-tuned methods.

Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts

This paper proposes "analytic concepts"—a procedural representation of object structure and manipulation knowledge defined via mathematical symbols, directly computable and simulatable by machines. It grounds semantic-level commonsense reasoned by MLLMs into the physical world to guide robots in manipulating articulated objects, achieving approximately a 27% improvement over A3VLM on unseen categories in simulation.

Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering

Pred-EQA transforms Embodied Question Answering (EQA) from a "look-and-move" reactive exploration into a "predict-then-explore" loop of prediction and correction. A high-level planner predicts where evidence might be hidden to generate exploration branches with long-term intent; a low-level executor actively reduces uncertainty within these branches and prunes them upon prediction failure. Coupled with a dual-memory system that separates "stable structural priors" from "question-relevant visual evidence," the method achieves SOTA performance in both answer accuracy and exploration efficiency on A-EQA and Express-Bench.

Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection

This paper reformulates multimodal misinformation detection (MMD) as a structured probabilistic reasoning problem based on a concept graph. It proposes the PCGR framework, which utilizes MLLMs to automatically discover and verify human-understandable concept nodes, constructing a hierarchical probabilistic concept graph. This achieves interpretable misinformation detection and comprehensively outperforms 13 baselines across three benchmarks.

ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

The study proposes ProFocus, a training-free progressive framework that achieves SOTA performance for zero-shot methods on R2R and REVERIE benchmarks. It utilizes proactive perception (converting panoramas to semantic maps + LLM-generated targeted visual queries) and focused reasoning (BD-MCTS to filter top-k high-value candidates from extensive historical waypoints).

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

QuantVLA is proposed as the first training-free post-training quantization (PTQ) framework for Vision-Language-Action (VLA) models. By employing a selective quantization layout and two lightweight calibration mechanisms—Attention Temperature Matching (ATM) and Output Head Balancing (OHB)—it achieves approximately 70% memory savings at W4A8 precision while exceeding the task success rate of the full-precision baseline.
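
For readers unfamiliar with the W4A8 notation (4-bit weights, 8-bit activations), the sketch below shows plain symmetric per-tensor quantization. It is a generic baseline for illustration only: it does not implement ATM, OHB, or the selective layout, and the function names and values are assumptions.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers of the given bit width with one scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


weights = [3.5, -1.5, 0.0, 0.5]             # made-up weight values
q_w, s_w = quantize_symmetric(weights, bits=4)
restored = dequantize(q_w, s_w)

activations = [1.27, -0.5, 0.02]            # made-up activation values
q_a, s_a = quantize_symmetric(activations, bits=8)
print(q_w, s_w, q_a)
```

Storing a tensor in 4 bits instead of 16 cuts its footprint to a quarter. An overall saving below that figure would be expected when some layers are kept at higher precision.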

RealAppliance: Let High-fidelity Appliance Assets Controllable and Workable as Aligned Real Manuals

The authors manually modeled 100 high-fidelity appliance digital assets strictly aligned with real-world manuals (dimensions, textures, physical mechanisms, electronic mechanisms, and program logic are all reproduced according to real manuals). On top of these assets, they built RealAppliance-Bench to evaluate mainstream MLLMs and embodied planning models on four tasks: "manual retrieval, component grounding, open-loop planning, and closed-loop correction." The evaluation finds that even GPT-5 reaches only a single-digit success rate on complete open-loop planning.

Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

This work proposes R²VLM, which processes local video segments step-by-step through a recurrent reasoning framework and maintains a dynamically updated CoT to record task decomposition and completion status. Combined with multi-dimensional RL rewards, it achieves SOTA in long-horizon embodied task progress estimation and supports downstream applications such as policy learning, reward modeling, and active assistance.

Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation

This work presents the first systematic empirical study on the properties of wrist-mounted fisheye cameras in imitation learning for robotic manipulation. It addresses three core questions regarding spatial localization, scene generalization, and hardware generalization, revealing the advantages and limitations of wide field-of-view (FoV). Furthermore, it proposes Random Scale Augmentation (RSA) to mitigate the scale overfitting issue during cross-camera transfer.

Rethinking Intermediate Representation for VLM-based Robot Manipulation

Addressing the task of "VLM translating human instructions into executable intermediate representations," this work draws inspiration from Context-Free Grammar (CFG) to decompose intermediate representations into Vocabulary + Grammar. It designs the SEAM representation, which is both comprehensible for VLMs and generalizable to unseen tasks, paired with a RAG-based few-shot open-vocabulary part segmentation module. Real-robot success rates are approximately 15% higher than the previous SOTA.

Rethinking Visual Rearrangement from A Diffusion Perspective

This work reinterprets the embodied rearrangement task of "restoring a cluttered room" as a diffusion bridge process—where shuffling is forward diffusion and restoration is reverse denoising. By representing object states as Gaussian Mixture Models (GMM) and using a Denoising Transformer to iteratively infer movements, the method improves the success rate on RoomR from 14.2% to 17.8%.

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

RoboAgent is proposed as a capability-driven embodied task planning framework that employs a single VLM to simultaneously function as a scheduler and five basic capabilities (Exploration Guidance, Object Grounding, Scene Description, Action Decoding, and Experience Summarization). Through a three-stage training process (SFT + DAgger + Expert-guided RL), it achieves SOTA performance on EB-ALFRED and ALFWorld.

RoboTAG: End-to-end Robot Pose Estimation via Topological Alignment Graph

To address the pain points of monocular RGB robot pose estimation—specifically its high reliance on annotations and the loss of spatial priors when compressing 3D problems into 2D—RoboTAG organizes camera-robot system state variables into a "Topological Alignment Graph" featuring 2D and 3D branches. By identifying "closed loops" within the graph to impose 2D-3D consistency supervision, the two backbone networks co-evolve. This allows training on unlabeled in-the-wild images, achieving SOTA on 5 out of 9 DREAM benchmarks with an average AUC of 76.9%.

RoboWheel: A Data Engine from Real-World Human Demonstrations for Cross-Embodiment Robotic Learning

RoboWheel automatically converts monocular RGB(D) videos of "human-hand-object interaction" into robot supervision data suitable for training VLA / imitation learning policies. Through high-precision reconstruction, physics-plausible optimization, cross-embodiment retargeting, and simulation-domain augmentation, it generates the HORA dataset with 150,000 trajectories, providing the first quantitative proof that HOI videos can serve as effective supervision for robotic learning.

Scalable Trajectory Generation for Whole-Body Mobile Manipulation

AutoMoMa unifies the mobile base, robotic arm, and the manipulated object into a single "Augmented Kinematic Representation (AKR)," then offloads trajectory optimization and collision detection to the GPU for batch parallelism. This enables the automatic synthesis of 500,000 physically feasible whole-body coordinated trajectories at a rate of 5,000 per GPU-hour (approximately 80x faster than CPU baselines), proving that the fundamental bottleneck previously hindering whole-body mobile manipulation policy learning was data scale rather than algorithms.

Semantic Audio-Visual Navigation in Continuous Environments

This paper proposes the SAVN-CE task, extending semantic audio-visual navigation to continuous 3D environments. It introduces MAGNet (Memory-Augmented Goal-description Network), which achieves robust goal reasoning after the target sound ceases by fusing historical context with self-motion cues, resulting in an absolute success rate improvement of up to 12.1%.

SemanticVLA: Towards Semantic Reasoning over Action Memorization via Synergistic Explicit Trace and Latent Action Planning

SemanticVLA adopts a dual-path design of "explicit trace reasoning + implicit action tokens" to effectively leverage the native spatial grounding capabilities of VLMs for robotic manipulation. It achieves a 97.0% success rate on LIBERO and 65.1% on SimplerEnv WidowX, demonstrating significantly higher stability in instruction rewriting, long-horizon, and reasoning-intensive tasks compared to baselines.

SIR: Structured Image Representations for Explainable Robot Learning

SIR transforms robotic observations into a fully connected scene graph and employs an end-to-end learnable sparsification module to retain only task-relevant nodes. This "thinned subgraph" serves as the state representation for the policy—improving success rates on RoboCasa from 14.81% to 19.5% while providing intrinsic explainability. This allows for the identification of spurious correlations and positional biases within datasets.

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

To bridge the perception-action gap where VLA models "use 2D vision to drive 3D physical actions," this paper proposes a "Spatial-Aware Pretraining" phase before learning robot policies. By extracting 3D visual and 3D action annotations from large-scale human manipulation videos as supervision, the dual-encoder model VIPA-VLA aligns 2D semantic vision with 3D space. Consequently, without using a single frame of robot data for pretraining, it achieves a 92.4% average success rate on LIBERO and significantly outperforms strong baselines on real robots.

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

SPEAR-1 argues that robot foundation models generalize poorly because the base VLM understands only 2D. The VLM is therefore first trained into a 3D-aware SPEAR-VLM that predicts 3D coordinates, using "easy-to-collect non-robot 2D images + automatically generated 3D annotations." An action expert is then trained on top of it to form the VLA. Its zero-shot performance in unseen Franka (DROID) environments matches \(\pi_{0.5}\) and exceeds \(\pi_0\)-FAST, while using \(20\times\) less robot demonstration data.

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

SRPO uses self-generated successful trajectories within the same training batch as references and measures "how close a failed trajectory is to success" via world model latent representations. This converts the 0/1 sparse rewards of GRPO into dense process rewards without extra demonstrations or manual reward engineering, improving OpenVLA* on LIBERO from 48.9% to 99.2% (within 200 steps).
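The reward idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the trajectory latents are taken as given (in the summary they come from a world model), and the Euclidean distance, the nearest-reference rule, and the `exp(-d)` shaping are hypothetical choices made here, not details confirmed by the summary. The name `progress_rewards` is likewise invented for the example.

```python
import math

def progress_rewards(latents, successes):
    """Turn 0/1 outcomes into dense rewards using in-batch successes as references.

    latents:   one latent vector per trajectory (assumed to be given).
    successes: one bool per trajectory.
    Successful trajectories get 1.0. A failed trajectory gets a value in (0, 1]
    that grows as its latent approaches the nearest successful one.
    """
    refs = [z for z, ok in zip(latents, successes) if ok]
    rewards = []
    for z, ok in zip(latents, successes):
        if ok:
            rewards.append(1.0)
        elif not refs:
            # No successful reference in this batch: fall back to the sparse reward.
            rewards.append(0.0)
        else:
            nearest = min(math.dist(z, ref) for ref in refs)
            rewards.append(math.exp(-nearest))  # assumed shaping function
    return rewards

# Toy batch: one success and two failures at different latent distances from it.
batch_latents = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
batch_success = [True, False, False]
rewards = progress_rewards(batch_latents, batch_success)
```

In this toy batch the failure that lies closer to the successful latent receives the larger reward, which is the property that lets a group-relative method distinguish between failures that a 0/1 reward would treat as identical.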

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

StaMo uses a lightweight encoder and a pre-trained DiT decoder to compress a static image, without supervision, into a compact state representation of only two 1024-dimensional tokens. It shows that the "difference between two state tokens" naturally serves as an executable robot action (latent action). Without any video or temporal modeling, it improves VLA performance on LIBERO by 11.6% and increases the success rate on real robots by 31%.
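As a rough illustration of the "difference between two state tokens" idea, the sketch below subtracts two encoded states element-wise. The encoder itself is not shown; the states are assumed to be already encoded, and the function name `latent_action` and the tiny 2-dimensional tokens (instead of 1024-dimensional ones) are choices made only for readability.

```python
def latent_action(state_now, state_next):
    """Element-wise difference between two encoded states.

    Each state is a list of tokens, each token a list of floats
    (two tokens per state in the summary above).
    """
    if len(state_now) != len(state_next):
        raise ValueError("both states must have the same number of tokens")
    return [
        [nxt - cur for cur, nxt in zip(tok_now, tok_next)]
        for tok_now, tok_next in zip(state_now, state_next)
    ]

# Two hypothetical encoded states, each with two tokens of dimension 2.
state_a = [[1.0, 2.0], [0.0, 0.0]]
state_b = [[1.5, 2.0], [1.0, -1.0]]
delta = latent_action(state_a, state_b)
```

How such a difference is mapped to executable robot commands is not specified in the summary and is left out of the sketch.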

Structural Action Transformer for 3D Dexterous Manipulation

SAT flips dexterous action chunks from "temporally ordered action vectors \((T,D_a)\)" to "joint-ordered trajectory sequences \((D_a,T)\)." This allows the Transformer to naturally handle heterogeneous embodiments by treating the number of joints as a variable sequence length. Coupled with an Embodied Joint Codebook describing kinematic roles and Flow Matching to generate actions from 3D point clouds, the model outperforms 2D/3D baselines on 11 simulation and 6 real-world bimanual tasks with only 19.36M parameters.
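The layout change from \((T,D_a)\) to \((D_a,T)\) is a transpose of the action chunk, after which each joint contributes one sequence element. A minimal sketch with plain Python lists follows; the helper name `to_joint_major` and the two toy embodiments are hypothetical.

```python
def to_joint_major(chunk):
    """Transpose a (T, D_a) action chunk into D_a joint trajectories of length T."""
    if not chunk:
        return []
    width = len(chunk[0])
    if any(len(step) != width for step in chunk):
        raise ValueError("every time step must have the same number of joints")
    return [list(trajectory) for trajectory in zip(*chunk)]

# Two hypothetical embodiments with the same horizon T=3 but different joint counts.
gripper_chunk = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]                    # D_a = 2
hand_chunk = [[float(10 * t + j) for j in range(4)] for t in range(3)]  # D_a = 4

gripper_tokens = to_joint_major(gripper_chunk)  # 2 sequence elements
hand_tokens = to_joint_major(hand_chunk)        # 4 sequence elements
```

After the transpose, every sequence element has the same length T regardless of embodiment, and only the number of elements changes, which is the quantity a Transformer already treats as variable.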

SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

SwiftVLA enables a lightweight 0.45B VLA to "borrow" 4D spatiotemporal features during training to learn geometric and dynamic reasoning. These features are distilled into the 2D branch via masked reconstruction, allowing the 4D module to be discarded at inference. On edge devices, it runs \(18\times\) faster and uses \(12\times\) less VRAM than \(\pi_0\), while achieving success rates comparable to models with \(7\times\) more parameters.

Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

This work identifies the Test-time Ego-Exo Adaptation for Action Anticipation (TE2A3) task for the first time. It proposes the DCPGN network, which utilizes multi-label prototype growing and dual-clue (visual + textual) consistency to adapt a model trained on a source view to a target view online during inference, significantly outperforming existing TTA methods.

Test-Time Perturbation Tuning with Delayed Feedback for Vision-Language-Action Models

To address the "trajectory overfitting" problem where VLAs fail under minute object pose variations, this paper proposes PDF, a verifier-free test-time adaptation framework that keeps the backbone frozen. It employs uncertainty-driven adaptive data augmentation and multi-view voting to suppress spurious correlations, combined with a lightweight perturbation head trained via delayed feedback after episode completion to correct model overconfidence. PDF improves the success rate on LIBERO by 7.4% and the Human Normalized Score (HNS) on Atari by 0.10.

Towards Human-Like Robot Handwriting via Contour-Aware Generation

To enable writing robots to produce characters with human-like stroke thickness variations, this paper proposes a new task, "Contour-aware Handwriting Trajectory Reconstruction (CHTR)," and builds the CHTR-110K dataset with 110,000 samples. It introduces G-HTR, a method based on multi-scale character graphs that reconstructs character images into "trajectory sequences with stroke width." G-HTR significantly surpasses state-of-the-art methods such as TrajFormer across multiple metrics and is deployed on real calligraphy robots.

Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

Inspired by the Turing Test, the authors propose the "Motion Turing Test" (MTT), which evaluates whether a human can distinguish between pose sequences of humans and humanoid robots based solely on motion (stripping away appearance). They release the HHMotion dataset containing 1,000 segments across 15 action categories from 11 robot types and humans (annotated with 0–5 human-likeness scores). A regression baseline, PTR-Net, is provided. Results indicate a significant gap between current robot motion and humans, and even SOTA multimodal large models fail to score these motions accurately.

Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning

Addressing the General Vision-Language Navigation (GSA-VLN) task in open environments, this paper proposes the slow4fast-VLN framework, inspired by the human dual-process cognitive system. The fast reasoning module performs real-time navigation and accumulates historical memory based on an end-to-end policy network. The slow reasoning module utilizes LLMs to reflect on and generate structured generalization experiences. These experiences are fed back to enhance the fast reasoning network through attention fusion, achieving continuous adaptation in unseen environments and under diverse instructions. The framework consistently outperforms the previous SOTA (GR-DUET) on the GSA-R2R dataset.

Towards Training-Free Scene Text Editing

TextFlow is proposed as a training-free scene text editing framework. By utilizing Flow Manifold Steering (FMS) in the early denoising stages to maintain style consistency and Attention Boost (AttnBoost) in the later stages to enhance text rendering accuracy, it achieves editing quality comparable to or even better than training-based methods without requiring task-specific training.

TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

TraceGen shifts "world modeling" from pixel space to a compact scene-level 3D trace space. Accompanied by the TraceForge data engine, it unifies 123,000 human and robot videos into consistent 3D traces to pre-train a cross-embodiment motion prior. Consequently, it achieves an 80% success rate on new robots/tasks with only 5 target demonstrations while inferring 50–600 times faster than video-generative world models.

Training One Model to Master Cross-Level Agentic Actions via Reinforcement Learning

CrossHA unifies heterogeneous action spaces including "language, grounding, motion, atomic, and latent" into a single VLM agent. It employs a three-stage GRPO pipeline—"Mixed SFT → Single-step RL → Multi-turn RL"—to train the agent to autonomously select the most suitable action space at each step of a trajectory. Trained on only 30 Minecraft tasks, it generalizes to over 800 tasks and achieves SOTA performance (54.6% ASR across all tasks).

TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation

TrajRAG compresses historical navigation trajectories into a "topological-polar" structure stored in a lifelong cumulative RAG knowledge base. During navigation, each candidate frontier generates a hypothetical trajectory to retrieve similar historical experiences in a coarse-to-fine manner. These experiences are then fed into an LLM planner to select the next waypoint, achieving new SOTA results across three zero-shot ObjectNav benchmarks: MP3D, HM3D-v1, and HM3D-v2.

TRM-VLA: Temporal-Aware Chain-of-Thought Reasoning and Memorization for Vision-Language-Action Models

TRM-VLA enables VLA models to perform hierarchical Chain-of-Thought (CoT) reasoning only at "keyframes" and utilizes a granularity-adaptive memory buffer to retrieve historical reasoning across frames. This achieves a new state-of-the-art success rate (72.9% on SIMPLER) while reducing the CoT token count per step by approximately 4× across SIMPLER, LIBERO-90, and four real-world robot tasks.

UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs

UAST uses a mapless RGB-D framework to unify "active search for arbitrary targets" and "persistent tracking" into a single perception-control pipeline. A dual-branch perception module combined with a regulated point search strategy adaptively switches between "Visible Tracking," "Short-term Occlusion Compensation," and "Lost Exploration" states. A lightweight control network directly outputs dynamically feasible trajectories, improving high-speed long-range tracking success rates by over 50% compared to SOTA and increasing search speed by approximately 3× in both simulation and real-world experiments.
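The three states named above can be pictured as a small state machine. The sketch below is a deliberately simplified stand-in: it switches purely on how many frames have passed since the target was last detected, which is an assumption made for illustration and not the regulated point search strategy itself. The names `Mode`, `select_mode`, `mode_sequence`, and the `occlusion_budget` parameter are all hypothetical.

```python
from enum import Enum

class Mode(Enum):
    VISIBLE_TRACKING = "visible_tracking"
    OCCLUSION_COMPENSATION = "short_term_occlusion_compensation"
    LOST_EXPLORATION = "lost_exploration"

def select_mode(frames_since_seen, occlusion_budget):
    """Pick a mode from the number of frames since the last detection."""
    if frames_since_seen < 0:
        raise ValueError("frames_since_seen must be non-negative")
    if frames_since_seen == 0:
        return Mode.VISIBLE_TRACKING
    if frames_since_seen <= occlusion_budget:
        return Mode.OCCLUSION_COMPENSATION
    return Mode.LOST_EXPLORATION

def mode_sequence(detections, occlusion_budget=2):
    """Map a per-frame detection flag sequence to a mode sequence."""
    modes = []
    since_seen = occlusion_budget + 1  # nothing seen yet, so start by exploring
    for seen in detections:
        since_seen = 0 if seen else since_seen + 1
        modes.append(select_mode(since_seen, occlusion_budget))
    return modes

modes = mode_sequence([False, True, False, False, False, True])
```

With a budget of two frames, a brief dropout keeps the system in occlusion compensation, and only a longer loss falls back to exploration.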

Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation (VITA)

VITA proposes unifying perception and control using a "vision-action shared discrete latent space." The same sequence of tokens autoregressively generated by the VLM backbone is simultaneously decoded into "future video frames" and "robot actions." By treating visual prediction as an inductive bias for action generation (Implicit Visual CoT), the model bridges the modality gap between visual observations and low-dimensional actions while avoiding the training instability and high latency of "predict-then-act" paradigms. It achieves gains of 14.5%/9.6%/12.1% on CALVIN/LIBERO/SimplerEnv respectively, and an 80.5% average success rate across 6 real-world tasks.

Video2Robo: 3DGS-based Synthetic Data from One Video Enables Scalable Robot Learning

Video2Robo uses a single monocular human demonstration video recorded by a smartphone. Leveraging 3DGS, it reconstructs task-relevant objects, tracks their 6D trajectories, and parses manipulation skills. A virtual Franka robot arm then "takes over" these trajectories with multi-dimensional scene augmentation to mass-synthesize photorealistic and kinematically plausible robot training data. The resulting policy enables zero-calibration transfer to real-world robot arms.

Visual-RRT: Finding Paths toward Visual-Goals via Differentiable Rendering

This work augments the sampling-based exploration of RRT with visual gradients obtained through differentiable robot rendering. This allows robotic arms to plan collision-free motion paths given only a single goal image, without goal joint angles, improving success rates on Franka / UR5e / Fetch from the ~20% range to ~75%.

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

This paper decomposes pre-trained VLA models into "Spatial Modeling (Vision Encoder)" and "Physical Modeling (VLM + Action Expert)". It demonstrates that the failure of VLAs under new viewpoints or visual perturbations is caused by representation drift in spatial modeling rather than the loss of physical modeling capabilities. By using two extremely lightweight one-shot adaptations—Feature Token Modulation (FTM) with 4K parameters for affine modulation and Feature Linear Adaptation (FLA) with 4.7M parameters for ViT low-rank updates—the success rate on LIBERO's new viewpoints is increased from 48.5% to 90.8%, matching or exceeding full LoRA fine-tuning with only 1% of the parameters.
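To make "affine modulation" of feature tokens concrete, the sketch below applies a scale and a shift to each feature channel of every vision token. Sharing one `gamma` and one `beta` per channel across all tokens is an assumption made here; the summary only states that FTM is an affine modulation with about 4K parameters. The helper names `init_modulation` and `modulate` are hypothetical.

```python
def init_modulation(dim):
    """Identity initialisation: scale 1 and shift 0 leave features unchanged."""
    return [1.0] * dim, [0.0] * dim

def modulate(tokens, gamma, beta):
    """Apply a per-channel affine map gamma * x + beta to every token."""
    if len(gamma) != len(beta):
        raise ValueError("gamma and beta must have the same length")
    if any(len(tok) != len(gamma) for tok in tokens):
        raise ValueError("token dimension must match gamma and beta")
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]

tokens = [[1.0, -2.0, 0.5], [0.0, 4.0, 2.0]]   # two toy tokens of dimension 3
gamma, beta = init_modulation(3)
unchanged = modulate(tokens, gamma, beta)       # identity at initialisation
adapted = modulate(tokens, [2.0, 1.0, 0.5], [0.0, 1.0, -1.0])
num_params = len(gamma) + len(beta)             # 2 * dim trainable values
```

Under this assumption the trainable parameter count is twice the feature dimension and independent of the number of tokens, which is what keeps such an adapter tiny compared with the frozen backbone.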