
🤖 Robotics & Embodied AI

🧪 ICML2026 · 53 paper notes


🔥 Top topics: Multimodal/VLM ×21 · Robotics ×9 · Diffusion Models ×7 · Agents ×4 · Reasoning ×3

Contrastive Representation Regularization for Vision-Language-Action Models: The authors observe that representations in VLA models inherited from VLMs are dominated by visual appearance and are insensitive to robot proprioceptive states. They propose Robot State-aware Contrastive Loss (RS-CL), which uses the Euclidean distance between proprioceptive states as "soft contrastive labels" to reshape representations. Combined with "view cutoff" feature-level augmentation, this method achieves a SOTA success rate of 69.7% on RoboCasa-Kitchen using GR00T N1.5 and improves success rates from 45.0% to 58.3% on real-world Franka pick-and-place tasks.
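The soft-label idea in RS-CL can be sketched in a few lines. This is a minimal illustration and not the paper's exact loss: it assumes the targets are a softmax over negative Euclidean state distances and the predictions are a softmax over cosine similarities. The names `soft_contrastive_loss`, `tau_feat`, and `tau_state` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def soft_contrastive_loss(features, states, tau_feat=0.1, tau_state=1.0):
    """Cross-entropy between state-distance soft labels and feature similarities.

    Assumption: soft labels are softmax(-||s_i - s_j|| / tau_state); the paper
    may define them differently.
    """
    n = len(features)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Closer proprioceptive states receive a larger target weight.
        targets = softmax([-math.dist(states[i], states[j]) / tau_state for j in others])
        # Predicted distribution from representation similarity.
        probs = softmax([cosine(features[i], features[j]) / tau_feat for j in others])
        total += -sum(t * math.log(p + 1e-12) for t, p in zip(targets, probs))
    return total / n
```

Under these assumptions, representations whose similarity structure mirrors the proprioceptive distances get a lower loss than representations that group samples with distant robot states.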
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation: For zero-shot robotic manipulation transferring from trained tasks to entirely new tasks, the authors decompose demonstrations into "atomic skill-action" pairs as intermediate representations. They use a dual library (a dynamic library that retrieves demonstrations by visual and planning similarity, plus a static library that fills in missing skill tokens via IDF weighting) to give LLMs in-context demonstrations covering the required skills, upgrading "trajectory imitation" to "compositional skill reasoning."
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies: This paper shifts VLA action decoding from Autoregressive (AR) or external continuous diffusion heads to "masked diffusion on discrete action tokens within a unified Transformer." Combined with adaptive parallel decoding ranked by confidence and secondary re-masking for error correction, it achieves a 96.4% average success rate on LIBERO and a 64.1% total mean score on SimplerEnv-Fractal. Notably, performance degrades by only 0.8% / 20.4% under OOD language/visual perturbations, significantly outperforming continuous diffusion and parallel decoding baselines while preserving the multimodal priors of the pre-trained VLM.
BEAR: Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis: BEAR decomposes embodied tasks into 14 atomic skills and constructs 4,469 interleaved image-video-text VQA pairs. By performing horizontal and vertical skill-level diagnosis on 20 MLLMs, it discovers that perception (rather than reasoning) is the primary bottleneck. Consequently, BEAR-Agent is developed using external visual/spatial tools—such as GroundingDINO, 3D scene graphs, and trajectory visualization—improving GPT-5 performance by 17.5% relative to the baseline and increasing real-robot grasping success by 20.17%.
Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation: SceneDiver mitigates visual hallucinations in both high-level planning and reactive control by filtering task-related objects before feeding them back into the model. It employs a two-stage focus plan—coarse-grained sub-scene decomposition via scene graphs followed by agentic VLM verification—and distills this explicit reasoning into VLA using a Slot Attention adapter.
DLO-Lab: Benchmarking Deformable Linear Object Manipulations with Differentiable Physics: DLO-Lab develops a differentiable simulator based on Taichi on the Genesis platform, utilizing Discrete Elastic Rods (DER) as its core. It supports bidirectional coupling, bending plasticity, and closed-loop topology. The platform includes 10 benchmark tasks for rope/cable/elastic bands and a specialized agent using VLM for "grasp proposal + task decomposition." It evaluates various policy learning algorithms (PPO/SAC/SHAC/SAPO/CMA-ES/GD) and validates sim-to-real transitions via system identification.
Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning: This paper proposes CAPS, which reinterprets "instruction drift" as a systematic sampling error. It uses SNR (\(= \log|\mathcal{A}|-\mathcal{H}\)) as a metacognitive switch that triggers Metropolis-Hastings iterative refinement toward a power distribution \(\pi \propto p^\alpha\) only during high-entropy "Pivotal Windows." Without any training, it outperforms OpenVLA and TACO on RoboTwin, Simpler-WindowX, and LIBERO-Long.
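The SNR switch and the power-distribution refinement can be sketched as follows. This is a rough illustration under stated assumptions: the action distribution `p` is a discrete list of probabilities, the Metropolis-Hastings proposal is the base distribution itself (an independence sampler), and the fallback outside pivotal windows is one ordinary sample. The names `select_action` and `snr_threshold` are hypothetical and do not come from the paper.

```python
import math
import random

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def snr(p):
    """SNR = log|A| - H: zero for a uniform distribution, log|A| for a one-hot one."""
    return math.log(len(p)) - entropy(p)

def mh_power_sample(p, alpha, steps, rng):
    """Metropolis-Hastings targeting pi(a) proportional to p(a)**alpha.

    With proposals drawn from p itself, the acceptance ratio
    [p(y)^alpha * p(x)] / [p(x)^alpha * p(y)] simplifies to (p(y)/p(x))^(alpha-1).
    """
    actions = range(len(p))
    x = rng.choices(actions, weights=p)[0]
    for _ in range(steps):
        y = rng.choices(actions, weights=p)[0]
        accept = min(1.0, (p[y] / p[x]) ** (alpha - 1.0))
        if rng.random() < accept:
            x = y
    return x

def select_action(p, snr_threshold, alpha=4.0, steps=20, rng=None):
    rng = rng or random.Random(0)
    if snr(p) < snr_threshold:
        # Low SNR (high entropy): assumed to mark a pivotal window, so refine.
        return mh_power_sample(p, alpha, steps, rng)
    # Otherwise fall back to one ordinary sample (an assumption of this sketch).
    return rng.choices(range(len(p)), weights=p)[0]
```

For `alpha > 1` the power distribution sharpens `p`, so the refinement concentrates on high-probability actions exactly where the base policy is least certain.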
Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model: DUST utilizes a "dual-stream" multi-modal diffusion Transformer (MMDiT) to process action flows and future visual embedding flows in parallel. By employing shared attention for cross-modal fusion, combined with independent noise scheduling and asynchronous action-vision sampling, it enables the VLA to simultaneously learn "what actions to perform" and "what consequences those actions produce." It consistently outperforms GR00T-N1.5+FLARE on RoboCasa, GR-1, and real-world Franka robots.
Dual Advantage Fields: This paper observes that in the bilinear goal-conditioned value model \(V_\theta(s,g)=\psi_\theta(s)^\top\phi_\theta(g)\), the goal embedding \(\phi_\theta(g)\) is exactly the gradient of the value field with respect to the state embedding. Taking the inner product of an "action-feature displacement predictor" \(u_\xi(s,a)\approx\gamma\psi(s')-\psi(s)\) with the goal embedding yields a local advantage score without learning a separate Q-network. This approach significantly improves the RLiable aggregated metrics across OGBench long-range navigation, manipulation, and puzzle tasks.
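The inner-product score can be checked numerically. The sketch below assumes plain Python lists as embeddings and uses the exact displacement \(\gamma\psi(s')-\psi(s)\) in place of the learned predictor \(u_\xi\); all function names are hypothetical. With an exact displacement, the score equals \(\gamma V(s',g) - V(s,g)\).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def value(psi_s, phi_g):
    """Bilinear value V(s, g) = psi(s)^T phi(g)."""
    return dot(psi_s, phi_g)

def displacement(psi_s, psi_next, gamma):
    """Exact target that the displacement predictor u(s, a) is trained to match."""
    return [gamma * b - a for a, b in zip(psi_s, psi_next)]

def advantage_score(u_sa, phi_g):
    """Local advantage score: inner product of displacement and goal embedding."""
    return dot(u_sa, phi_g)

# Toy embeddings (illustrative values only).
psi_s, psi_next, phi_g, gamma = [1.0, 2.0], [2.0, 1.0], [0.5, -1.0], 0.9
score = advantage_score(displacement(psi_s, psi_next, gamma), phi_g)
one_step_gain = gamma * value(psi_next, phi_g) - value(psi_s, phi_g)
```

Here `score` and `one_step_gain` agree up to floating-point error, which is why ranking actions by the inner product needs no separate Q-function in this toy setting.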
Dual Quaternion SE(3) Synchronization with Recovery Guarantees: This paper parameterizes the SE(3) synchronization problem using Unit Dual Quaternions (UDQ) instead of \(4\times4\) matrices. It calculates spectral initialization via power iteration on Hermitian dual quaternion matrices, followed by iterative refinement using the Dual Quaternion Generalized Power Method (DQGPM) with element-wise projection to \(\mathrm{UDQ}^n\). It provides the first finite-step linear convergence and explicit error bounds for SE(3) synchronization, reducing both rotation/translation errors and computational time below those of matrix-based methods in multi-scan point cloud registration.
Efficient Skill Grounding via Code Refactoring with Small Language Models: RECENT enables robots to adapt to different morphologies (different arms or grippers) or dynamic environments without using Large Language Models (LLMs) to rewrite skill code from scratch. Instead, it decouples "semantic intent" from "execution binding" in executable code. A 7B Small Language Model (sLM) performs local refactoring via Fill-in-the-Middle (FIM) only on the execution binding lines—addressing morphological differences through ontological reasoning before deployment and environmental differences via in-situ patches during runtime. This achieves success rates close to GPT-5.2-Codex using only edge-side small models.
EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents: EmbGuard decouples "physical safety judgment for embodied agents" from the policy into an independent, lightweight guardrail model. It takes (observation image, candidate action) as input and outputs (risk binary, risk category, hazard explanation). With only 2B/4B parameters, it matches the performance of GPT-5.1/Gemini-2.5-Pro while significantly suppressing the "over-conservative false positive" issues prevalent in baseline models.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models: This paper reformulates "vision-action attribution" as an intervention estimation problem. It proposes two metrics, ISS (Intervention Saliency Score) and NMR (Nuisance Mass Ratio), using Bernoulli masking + Gaussian blur perturbation + Action MSE as a proxy for KL divergence to quantify which visual regions VLA policies rely on. It demonstrates that NMR has a strong negative correlation of \(r = -0.77\) with OOD task success rates, serving as a cost-effective diagnostic tool for predicting VLA generalization capabilities.
Embodied Task Planning via Graph-Informed Action Generation with Large Language Models: GiG equips the LLM planner with a "graph-in-graph" dual-layer memory (Scene Graph and State Transition Graph), GNN encoding, and 1-step lookahead, improving Pass@1 by 6–37 percentage points over ReCAP on Robotouille (Sync/Async) and ALFWorld.
Fourier Features Let Agents Learn High Precision Policies with Imitation Learning: By applying a NeRF-style Fourier feature mapping to point cloud Cartesian coordinates before feeding them into a point cloud encoder, this work eliminates the "spectral bias" where point cloud policies focus on low frequencies and fail to capture high frequencies. This approach significantly improves the success rates of diffusion imitation learning policies in high-precision manipulation tasks across RoboCasa, ManiSkill3, and real-world setups (increasing the real-world normalized score from 14.8% to 40.2%), while remaining robust across various encoders and hyperparameters.
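A NeRF-style Fourier feature mapping for point coordinates might look like the sketch below. It assumes frequencies \(2^k\pi\) for \(k = 0, \dots, L-1\) and keeps the raw coordinates alongside the sinusoids. The band count and the choice to concatenate the input are illustrative assumptions, not settings taken from the paper.

```python
import math

def fourier_features(point, num_bands=4, include_input=True):
    """Map a coordinate tuple to [x, sin(2^k pi x), cos(2^k pi x), ...]."""
    feats = list(point) if include_input else []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for c in point:
            feats.append(math.sin(freq * c))
            feats.append(math.cos(freq * c))
    return feats

def encode_cloud(points, num_bands=4):
    """Apply the mapping to every point before the point cloud encoder."""
    return [fourier_features(p, num_bands) for p in points]

cloud = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]
encoded = encode_cloud(cloud, num_bands=4)  # each point: 3 + 3 * 2 * 4 = 27 values
```

The higher bands oscillate quickly in coordinate space, so small positional differences produce large feature differences, which is the property that counteracts the low-frequency bias described above.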
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model: BehaviorVLA utilizes a causal triple-stream Mamba encoder (VBE) to compress long-horizon demonstrations into a time-invariant "behavioral prototype \(z_{\text{proto}}\)" and a time-variant "phase state \(z_{\text{phase}}\)". A phase-conditioned behavior decoder (PBD) then expands the behavioral skeleton into phase-aligned Gaussian priors via a Predictor-Corrector mechanism to guide the flow matching policy. It sets new SOTA results on LIBERO, RoboTwin 2.0, and CALVIN, matching OpenVLA-OFT performance using only 50% of real-world data.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation: MoLA utilizes a set of "Modality-aware Inverse Dynamics Models (IDM)" pre-trained on large-scale robotic data to translate future frames predicted by a video generation model into three-way discrete latent actions (semantic, depth, and optical flow). A policy head then performs control based on these action-centric representations, making the "imagine-execute" interface both stable and precise across CALVIN, LIBERO, LIBERO-Plus, and real UR5e platforms.
Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents: Code-as-Policies (CaP) methods generate code from scratch, which makes them slow (long prompts are prefilled repeatedly) and fragile (API mismatches, missing safety checks). To address both issues, FCGraft maintains a library of "function-level verified code skeletons + corresponding KV caches." It uses cache-stitching to assemble the KV pairs of cached functions into new policies and cache-patching to locally regenerate only the erroneous segments. On open-domain tasks like ALFRED and RLBench, FCGraft achieves an 18.31% higher success rate and a 2.3× reduction in synthesis latency compared to RAGCache.
SAFAG: Generalizable Actionable Part Pose Estimation without Symmetry Annotation: SAFAG decomposes GAPart 6D pose estimation into a two-stage framework of "candidate quaternion generation + tangent space refinement." By utilizing adaptive probability distributions to implicitly learn symmetry axes/planes across \(x, y, z\) axes, it reduces cross-category rotation error for actionable parts from 5.51° to 3.23° in the complete absence of symmetry annotations.
HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks: HDFlow employs diffusion models to generate sparse strategic sub-goals and rectified flows to generate dense trajectories, integrated with energy guidance and manifold projection. This forms a dual-layer "slow-fast" planner that improves success rates by 20–30% in long-horizon sparse-reward tasks such as furniture assembly.
Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies: LP-DS treats a frozen diffusion/flow-matching policy as a black-box decoder \(\Phi(s,w)\) and learns a state-conditional residual only on its initial noise \(w=\epsilon+\Delta_\theta(s)\). By using a Lagrangian trust region \(\mathbb{E}_s[\|\Delta_\theta(s)\|_2^2]\le\delta\) to constrain the perturbation magnitude, it achieves sample-efficient online RL fine-tuning while preserving multimodal priors. It is more stable than DSRL and DPPO on RoboMimic / Gym / Adroit / LIBERO, with reward gains up to +25%.
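The trust-region constraint \(\mathbb{E}_s[\|\Delta_\theta(s)\|_2^2]\le\delta\) is commonly handled with a Lagrange multiplier updated by dual ascent. The sketch below shows that generic pattern. It is an assumption about how such a constraint could be enforced, not the paper's exact procedure, and the names `lagrangian_objective`, `dual_ascent_step`, and `lr` are hypothetical.

```python
def mean_sq_norm(deltas):
    """Batch estimate of E_s[||Delta(s)||_2^2] from a list of residual vectors."""
    return sum(sum(c * c for c in d) for d in deltas) / len(deltas)

def lagrangian_objective(task_loss, deltas, lam, delta_budget):
    """Primal objective: task loss plus multiplier times constraint violation."""
    return task_loss + lam * (mean_sq_norm(deltas) - delta_budget)

def dual_ascent_step(lam, deltas, delta_budget, lr=0.1):
    """Raise lam when the budget is exceeded, lower it otherwise, and keep lam >= 0."""
    return max(0.0, lam + lr * (mean_sq_norm(deltas) - delta_budget))

# Toy batch of noise residuals Delta(s) (illustrative values only).
batch = [[3.0, 4.0], [0.0, 0.0]]  # mean squared norm = 12.5
lam_violated = dual_ascent_step(0.0, batch, delta_budget=10.0)   # budget exceeded
lam_satisfied = dual_ascent_step(0.5, batch, delta_budget=20.0)  # budget respected
```

In this pattern the multiplier grows only while the residuals exceed the budget, so the perturbation stays small and the frozen policy's prior dominates unless the reward signal justifies moving away from it.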
LangForce: Bayesian Decomposition of Vision-Language-Action Models via Latent Action Queries: LangForce formulates the VLA policy as a Bayesian decomposition \(\pi(a\mid v,\ell)=p(\ell\mid a,v)\,p(a\mid v)/p(\ell\mid v)\). By introducing learnable Latent Action Queries, it executes both "vision-only" and "vision+language" branches using a single set of VLM weights. It explicitly penalizes "visual shortcuts" by maximizing the log-likelihood ratio between actions and instructions, achieving an 11.3 percentage point absolute improvement over the QwenGR00T baseline on SimplerEnv.
Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models: LaRA-VLA internalizes both textual and visual Chain-of-Thought (CoT) into continuous latents. Through a three-stage curriculum training process (explicit CoT → latent replacement → action expert adaptation), reasoning is performed within the latent space. This reduces inference latency by up to 90% compared to explicit CoT, restoring control frequencies to real-time ranges.
LIMMT: Less is More for Motion Tracking: This paper investigates physics-based humanoid motion tracking from a "data-centric" perspective and proposes a three-stage filtering framework, GQS (Physical Feasibility Filtering → Semantic Motion Embedding → Complexity-Weighted Subset Sampling). It demonstrates that training on less than 3% of the AMASS dataset achieves better tracking performance than training on the full dataset, and that the filtering approach transfers to other trackers such as Any2Track and TWIST2 in a plug-and-play manner.
ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics: This work addresses the gap where vision-language manipulation research primarily covers rigid arms and ignores soft continuum arms. The ManiSoft benchmark is built on a hybrid simulator coupling "Cosserat rod soft dynamics + MuJoCo rigid body contact + elastic force constraints." It defines 4 categories of tasks reflecting soft arm control difficulties and automatically generates 6,300 scenes and expert trajectories via a "high-level rule planner + low-level RL torque actuator." Results systematically reveal that DP/RDT/OpenVLA-OFT are moderately successful in clean scenes (~30%) but suffer a cliff-like drop in randomized scenes (up to 29.4 points). The root causes of failure are the inability to estimate proprioceptive states from vision and the failure to exploit soft-body deformability for obstacle avoidance.
Mixture of Horizons in Action Chunking: VLA models face a trade-off between long-horizon planning and short-horizon precision when choosing the action chunk length (horizon). This paper proposes Mixture of Horizons (MoH), which decomposes a single action chunk into sub-chunks of different lengths, predicts them in parallel with a shared action transformer, and fuses them with a 2k-parameter linear gate, complemented by a load-balancing loss and dynamic inference via "cross-horizon consensus." With MoH, \(\pi_{0.5}\) reaches a 99% average success rate on LIBERO for the first time while increasing throughput to 2.5× the baseline.
Moving Out: Physically-grounded Human-AI Collaboration: To address the lack of "physics-grounded constraints" in existing discrete/symbolic benchmarks, this paper introduces Moving Out, a collaborative environment based on a 2D rigid-body physics engine with continuous state-action spaces (e.g., two agents carrying heavy objects around corners). It proposes BASS (Behavior Augmentation, Simulation, and Selection), which enables the AI to collaborate stably when facing unseen human behaviors and object properties, nearly doubling the task completion rate in real human-AI trials.
Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models: NIAF redefines the "action chunk" of VLA models from a sequence of discrete waypoints to a continuous time function \(\mathcal{A}(\tau)=\Phi(\tau;\theta)\). By utilizing an MLLM as a "hierarchical spectral modulator" to output parameters \(\theta\) for a SIREN, the model achieves \(C^\infty\) smooth trajectories, arbitrary frequency querying, and analytically derivable velocity/jerk signals. It achieves SOTA results on CALVIN/LIBERO and eliminates jitter in real-robot impedance control.
Neural Low-Discrepancy Sequences: NeuroLDS utilizes a small MLP that maps integer indices via sinusoidal position encoding to points. By first regressing against Sobol' sequences and then fine-tuning using a closed-form \(L_2\) discrepancy loss over all prefixes, it generates the first extensible neural low-discrepancy sequence. It consistently outperforms Sobol'/Halton across 4D discrepancy metrics, Borehole integration, RRT motion planning, and Black–Scholes PDE solving.
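A closed-form \(L_2\) discrepancy averaged over all prefixes can be illustrated with the classical \(L_2\) star discrepancy (Warnock's formula). Whether NeuroLDS uses this exact variant is an assumption of this sketch, and `prefix_loss` is a hypothetical name for the prefix-averaged objective.

```python
def l2_star_discrepancy_sq(points):
    """Squared L2 star discrepancy of points in [0, 1]^d via Warnock's formula."""
    n, d = len(points), len(points[0])
    single = 0.0
    for x in points:
        prod = 1.0
        for xk in x:
            prod *= (1.0 - xk * xk) / 2.0
        single += prod
    pairs = 0.0
    for x in points:
        for y in points:
            prod = 1.0
            for xk, yk in zip(x, y):
                prod *= 1.0 - max(xk, yk)
            pairs += prod
    return 3.0 ** (-d) - 2.0 * single / n + pairs / (n * n)

def prefix_loss(points):
    """Average discrepancy over every prefix, so the sequence stays extensible."""
    n = len(points)
    return sum(l2_star_discrepancy_sq(points[:k]) for k in range(1, n + 1)) / n

# A van der Corput-like sequence versus a clumped one (1D, illustrative values).
spread = [(0.5,), (0.25,), (0.75,)]
clumped = [(0.9,), (0.95,), (0.99,)]
```

Scoring every prefix rather than only the full set is what rewards sequences that remain well spread no matter where they are truncated. In this example `prefix_loss(spread)` is smaller than `prefix_loss(clumped)`.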
Online Self-Training for Co-Adaptation in Hierarchical Diffusion Policies: ORCHID utilizes "self-training" to enable hierarchical diffusion robot policies to improve online. By repeatedly sampling trajectories and filtering for those where both the planner and controller succeed using sparse environment signals, it distills these successes back into both the high-level planner and low-level controller. This mechanism induces bidirectional co-adaptation between the high-level (HL) and low-level (LL) layers, allowing a lightweight model to outperform VLAs twice its size on the CALVIN benchmark.
Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges: This paper characterizes anonymous Multi-Agent Path Finding (MAPF) as a class of Markovian Multi-Marginal Optimal Transport (MMOT), compressing the \(K^{T+1}\) dimensional transport tensor into a polynomial-scale LP (P1) and guaranteeing integer optimality through Total Unimodularity (TU). It further generalizes this into a Schrödinger bridge for Sinkhorn-style entropic relaxation (P2) to produce a "shadow transport," followed by pruning the graph based on the shadow and solving a sparse LP (P3) to recover integer solutions, achieving 3.6×–7.1× speedup under \(K^{1.15}\) complexity with a cost gap <10%.
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation: This paper proposes SAGE: automatically synthesizing large-scale navigation tasks and IF-THEN experience rules within a physics-constrained semantic sandbox. These experiences are distilled into a VLM policy using GRPO with mixed-prompt sampling and asymmetric adaptive clipping, improving the LLM-Match success rate on A-EQA from 43.5% to 53.2% (2B) / 60.2% (4B), with successful zero-shot transfer to real indoor robots.
Position: Good Embodied Reward Models Need Bad Behavior Data: This position paper uses human ratings from RoboArena to show empirically that three types of SOTA embodied reward models (ReWind, GVL, and Dopamine) systematically "overestimate" actual failed robot behaviors. The root cause is identified as the training data consisting almost exclusively of expert success demonstrations. By inserting real "bad" behavior videos and dense negative reward labels into GVL's in-context prompts, the authors show that even a small number of negative samples significantly corrects preference ranking, and they call on the community to actively collect and release "bad" robot data.
PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making: This paper proposes PSG-Nav, which replaces traditional deterministic scene graph navigation with three components: a 3D probabilistic scene graph that retains full category distributions, a multiverse decision-making process that samples multiple consistent worlds from a joint distribution, and an evidential calibration library based on success/failure memories. It achieves new SOTA results on three major ObjectNav benchmarks (HM3D, MP3D, and HSSD), with SR of 66.1%, 44.8%, and 67.9%, respectively.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning: R2R2 incorporates VICReg-style redundancy reduction constraints into Self-Predictive Learning (SPL) to stabilize high UTD training. A key modification is the removal of zero-centering—theoretically proving that zero-centering eliminates constant eigenmodes (i.e., global dynamics information) in SPL spectral decomposition. Experiments on TD7 with UTD=20 improved scores from 1.02 to 1.24 (+22%), and the newly proposed SimbaV2-SPL architecture achieved a new SOTA in continuous control.
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies: RoboMME is the first to systematically map "temporal/spatial/object/procedural" memory from human cognition to 16 long-horizon robotic manipulation tasks (770k high-quality timesteps). By performing a systematic ablation of 14 "memory representation × integration method" combinations on a π0.5 base, it concludes that "Perceptual Memory + AdaLN Modulator" currently offers the best comprehensive trade-off.
Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation: BridgePolicy reformulates the diffusion policy paradigm—moving from "observation as a condition with sampling starting from random noise" to using a diffusion bridge that embeds observations directly into the endpoint of the forward SDE. This allows action sampling to originate from an "information-rich observation prior." By utilizing a semantic aligner to compress heterogeneous observations into an isomorphic representation with actions, the method consistently outperforms existing generative policies across 52 simulation tasks and 5 real-robot tasks.
SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models: SCALE enables autoregressive VLAs to utilize a "self-uncertainty" score, calculated solely from output logits during inference, to simultaneously modulate action sampling temperature and visual attention temperature. By exploring broadly when uncertain and focusing greedily when certain, SCALE significantly improves the success rates of multiple SOTA VLAs with zero additional training, no external verifiers, and a single forward pass.
Scaling by Diversified Experience for Vision-Language-Action Models: SyVLA utilizes a dual-system architecture comprising a "VLM + Flow Matching Action Expert + Feature Query Token" for "think-before-act" robotic control. It incorporates two core designs: an intention decoupling algorithm based on gradient norm masking (stripping high-level reasoning from control intent) and similar-sample guided RL (fixing expert sample advantage at 1.0 to stabilize real-world online RL). Using less than 5% of the \(\pi_0\) pre-training data, it achieves higher real-world success rates and stronger OOD generalization while preserving the original VLM's vision-language capabilities.
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation: Addressing the performance collapse of Vision-Language-Action (VLA) models under simple perturbations, this paper proposes a video transfer pipeline consisting of "semantic/geometric condition extraction → caption rewriting → conditional video diffusion rendering" to augment simulation data with visual and environmental diversity. Combined with a three-stage velocity caching mechanism that reduces generation time by 61% and a coreset sampling strategy driven by both difficulty and diversity that selects only 10% of key trajectories, it achieves a 5–15% performance gain for RDT-1B / \(\pi_0\) on RoboTwin 2.0, LIBERO-Plus, and real-world robots.
Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action: SOMA equips VLA with a persistent spatial-semantic memory built via active scanning with a movable head camera. This memory supports incremental online updates and instruction-based retrieval, enabling robots to stably manipulate objects outside the current field of view (OOV). In five real-world OOV grasping tasks, SOMA reduces the time to first gaze, head search path, and number of grasp attempts by 40-60%.
SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning: The authors observe that VLA inference is compute-bound, making pruning the optimal acceleration path. Given the high overlap of visual information between consecutive action steps, they propose SpecPrune-VLA. This training-free method uses a three-way fusion (previous global attention + current early-layer local attention + frame-difference dynamic tokens) for static pruning, combined with intra-layer dynamic pruning and a velocity-aware coarse/fine granularity controller. It achieves 1.57× speedup on LIBERO and 1.70× on real robots with negligible success rate loss.
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data: Addressing the collapse of VLA models under visual perturbations, the authors identify the MLP projector between the vision encoder and the LLM as the primary source of vulnerability. By replacing it with a "Channel-wise Information Bottleneck Adapter (IB-Adapter)" with fewer than 10M parameters, the 0.5B StableVLA achieves an average performance gain of approximately 35% under severe LIBERO perturbations without any additional training data or augmentation strategies. It also exhibits higher stability than the 14× larger OpenPi in real-world pick-and-place tasks.
STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction: STEP attaches a lightweight Transformer predictor ("previous action history + current observation → next action") to the diffusion policy. Its output serves as a denoising starting point (warm-start), compressing 100 denoising steps to 2. It also includes a "velocity-aware" execution deadlock defense mechanism that injects noise when action changes are too small. It outperforms BRIDGER/DDIM by an average of 21.6% / 27.5% in success rate across 9 simulation and 2 real-robot tasks.
TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation: TapSampling proposes a policy-agnostic, plug-and-play inference-time sampling framework. It first learns a low-dimensional posterior from a small number of policy-generated actions using an Action-VAE to efficiently sample many candidate actions. Subsequently, a semantically interpretable verifier, which predicts "changes in task progress," is used to score and fuse candidates via weighted averaging. Without fine-tuning original policies, it consistently improves the success rates of various general robotic policies such as Diffusion Policy, OpenVLA, VPP, \(\pi_0\), and \(\pi_{0.5}\) on CALVIN/LIBERO and real-world robots.
Test-Time Training for Visual Foresight Vision-Language-Action Models: Visual Foresight VLA (VF-VLA) models, which predict future images before generating actions, suffer misalignment in both stages at once in Out-of-Distribution (OOD) scenarios. To address this, the paper proposes T3VF, which treats the predicted future images and the actual observations several steps later as natural self-supervised pairs. At test time, the model updates only the minimal visual query modules while filtering noisy steps using "action variance + adaptive quantile buffering." T3VF improves the average success rate on LIBERO-Plus by approximately 5% (relative) with ~1.3× inference overhead, without modifying any network architecture.
The Lie We Tell: Correcting the Euclidean Fallacy in Vision-Language-Action Policies via Score Matching on Tangent Space: Lie Diffuser Actor (LDA) corrects the "Euclidean Fallacy" of flattening \(SE(3)\) poses into \(\mathbb{R}^{12}\) by returning to manifold-native diffusion: injecting noise into the Lie algebra \(\mathfrak{se}(3)\) via left-invariant SDEs, pulling back via the exponential map, and predicting scores in the tangent space. Theoretically, this achieves manifold closure, coordinate equivariance, and geodesic optimality, pushing the average task length on CALVIN ABC→D from 3.27 to 3.51.
Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models: To address the issues of low speed and error accumulation in explicit Chain-of-Thought (CoT) for VLA, the authors propose AVA-VLA—modeling reasoning as a sequence of invisible latent variables, using Reinforcement Learning (RL) to denoise the latent trajectory, and employing an early exit mechanism to adaptively determine reasoning steps based on state confidence. It achieves a 98.3% average success rate on LIBERO while being approximately 6 times faster than explicit CoT reasoning.
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance: TimeRewarder formalizes "task progress" as the normalized temporal distance between video frame pairs. It trains a self-supervised ViT distance regressor using only action-free expert videos and provides the predicted distance between adjacent frames as a dense reward to DrQ-v2. On 10 Meta-World tasks, it approaches a 9/10 success rate within 200K interactions, even outperforming manually designed environmental dense rewards.
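The training targets and the dense reward can be sketched as follows. This is an illustration under assumptions: signed frame-index differences are normalized by \(T-1\), and a generic `predict_distance(frame_a, frame_b)` callable stands in for the trained ViT regressor. Both function names are hypothetical.

```python
def temporal_distance_pairs(num_frames):
    """All ordered frame-index pairs with a normalized signed temporal distance."""
    pairs = []
    for i in range(num_frames):
        for j in range(num_frames):
            pairs.append((i, j, (j - i) / (num_frames - 1)))
    return pairs

def dense_rewards(frames, predict_distance):
    """Reward at each step = predicted distance between adjacent frames."""
    return [predict_distance(a, b) for a, b in zip(frames, frames[1:])]

# Toy rollout where each "frame" is just a scalar task-progress value, and the
# oracle predictor returns the progress difference (stand-in for the regressor).
rollout = [0.0, 0.5, 0.4, 1.0]
rewards = dense_rewards(rollout, lambda a, b: b - a)
```

Under this signed convention, a step that moves backward in task progress (here from 0.5 to 0.4) receives a negative reward, which lets a regressor trained only on expert videos still penalize unproductive behavior.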
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning: This paper proposes FAN, which compresses "expensive generative policies + distributional critics" into "single-step flow anchoring + single noise sample critic." By using Flow Anchoring to perform behavior regularization within a single flow evaluation and a noise-conditioned critic to replace multi-quantile samples with a single Gaussian noise sample, FAN achieves SOTA performance on D4RL/OGBench while being 5-14× faster to train than similar distributional methods.
DiBO: Offline Black-box Optimization with Diffusion Language Models (DNA + Robot Morphology): DiBO adapts the diffusion language model LLaDA-8B for offline black-box optimization. It uses delimiter tokens to unify three heterogeneous signals (prompt/design/label), followed by a three-stage post-training pipeline: Domain Adaptation, Masked-response SFT, and Label-improvement RL. The model achieves SOTA results on several Design-Bench tasks with only 500 labeled samples (e.g., a +8% normalized score gain on DNA tasks) and completes training for a discrete task in 1.5 hours on a single H100.
Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation: To address the continuous environment distribution drift in online vision-language navigation, this paper proposes the IDEA framework. It encapsulates soft prompts learned during each test-time adaptation, along with domain coordinates and uncertainty, into reusable "assets." By utilizing Wasserstein convex hull projection to map the target domain onto a combination of historical assets, a training-free cross-domain bridge is achieved, resulting in an average improvement of +2.5% SR and +1.9% SPL on REVERIE / R2R.
WestWorld: Scalable Trajectory World Models with Knowledge Encoding: WestWorld integrates trajectory dynamics of diverse heterogeneous robots into a single scalable world model using System-aware MoE (Sys-MoE) and knowledge-encoded structural embeddings. After pre-training on 89 simulated and real environments, its zero-shot/few-shot trajectory prediction MAE/MSE significantly outperforms MLP Ensemble, TDM, and TrajWorld. It also enhances downstream MPPI control and is successfully deployed on a real Unitree Go1.