TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

Conference: ICML 2026
arXiv: 2605.25547
Code: Project Page (noted at the end of the paper, repository address not provided in text)
Area: Robotics
Keywords: Inference-time sampling, Action-VAE, Task progress verifier, Generalist robot policy, Plug-and-play

TL;DR

TapSampling proposes a policy-agnostic, plug-and-play inference-time sampling framework. It first uses an Action-VAE to encode a small number of policy-generated actions into a low-dimensional posterior, from which many candidate actions can be sampled efficiently. A semantically interpretable verifier that "predicts task progress changes" then scores the candidates for weighted fusion. Without fine-tuning the original policy, it consistently improves the success rates of generalist robot policies such as Diffusion Policy, OpenVLA, VPP, \(\pi_0\), and \(\pi_{0.5}\) on CALVIN/LIBERO and on real-world robots.

Background & Motivation

Background: Current mainstream generalist robot policies are mostly built on VLMs or video diffusion models and output actions through diffusion action heads or next-token prediction (e.g., OpenVLA, \(\pi_0\), VPP, Diffusion Policy). They reach strong performance by scaling data and model size, but all follow a single-shot inference paradigm: only one action is produced per decision.

Limitations of Prior Work: These non-deterministic paradigms (diffusion and autoregressive) naturally carry variance; the policy may succeed or fail unpredictably under the same environment and instruction. Single-shot inference lacks a mechanism to correct such stochastic failures. While the LLM and diffusion image communities have verified that multiple sampling + verifier selection at inference time can steadily improve performance, there are few principled solutions in robotics.

Key Challenge: Implementing inference-time sampling on robots is hindered by two factors. (1) Sampling: Repeatedly sampling directly from the policy is too slow because latency grows roughly in proportion to the number of samples, while sampling from a Gaussian distribution (as in RoboMonkey) ignores the correlations across dimensions within an action chunk, causing candidates to deviate from the true action distribution. (2) Verification: Low-level actions (joint angles/end-effector poses/gripper states) lack off-the-shelf VLM evaluators. Existing verifiers often rely on manual rewards, preference learning, or offline RL training, which lack interpretability and are coupled to specific policy architectures, preventing plug-and-play use.

Goal: Construct a policy-agnostic inference-time sampling framework that simultaneously solves the sub-problems of "efficient and high-fidelity sampling" and "semantically interpretable, high-throughput verification."

Key Insight: The authors observe that when humans judge robot actions, they are essentially estimating "how much this segment of action will advance task progress." This is a semantically meaningful continuous value. Since expert trajectories implicitly contain a monotonically increasing progress curve, positive and negative samples can be constructed at zero cost to train this progress predictor. For sampling, rather than re-running the policy, a VAE can be learned to compress the action distribution, encoding policy actions into a low-dimensional posterior for batch decoding.

Core Idea: Use the posterior learned by Action-VAE for efficient high-fidelity sampling, and task progress change prediction as a semantically interpretable verifier, combining them into the plug-and-play TapSampling.

Method

Overall Architecture

The inputs are the current observation \(s\) and instruction \(l\). The pipeline consists of three steps:

  1. Low-N Policy Sampling: Invoke the original policy \(\pi(a\mid s,l)\) to obtain \(N\) (default \(N=4\)) initial action chunks \(A_\pi=\{a_i\}_{i=1}^N\).
  2. Posterior Mixing + Decoding: Pass each \(a_i\) through the Action-VAE encoder \(\mathcal{E}\) to get a Gaussian \(q_\mathcal{E}(z\mid a_i)\), mix them as \(q_\text{mix}(z\mid A_\pi)=\frac{1}{N}\sum_i q_\mathcal{E}(z\mid a_i)\), sample \(M\) latents from it, and decode into \(M\) candidate actions \(A^*=\{\mathcal{D}(z_i)\}_{i=1}^M\) (decoding uses a single tiny transformer, latency \(<0.01\)s).
  3. Progress Verification + Weighted Fusion: The verifier \(\mathcal{V}(s,l,a)\) predicts a "task progress change" \(\Delta p\in[-1,1]\) for each candidate. Candidates with \(\Delta p\) below a threshold are discarded, and the remaining ones are weighted-averaged by their scores to produce the final action.
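The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `encode`, `decode`, and `verify` are hypothetical callables standing in for the Action-VAE encoder, the decoder, and the progress verifier, and the handling of edge cases (all candidates below the threshold, non-positive weights) is an assumption, since the summary does not specify it.

```python
import numpy as np

def sample_mixture(mus, sigmas, m, rng):
    """Draw m latents from q_mix = (1/N) * sum_i N(mu_i, diag(sigma_i^2))."""
    n, d = mus.shape
    comp = rng.integers(0, n, size=m)        # equal-weight component choice
    eps = rng.standard_normal((m, d))
    return mus[comp] + sigmas[comp] * eps    # reparameterized Gaussian draw

def fuse(candidates, scores, tau=0.0):
    """Discard candidates scoring below tau, then score-weight the survivors."""
    keep = scores >= tau
    if not keep.any():
        # Assumption: fall back to the best-scoring candidate when none survive.
        return candidates[np.argmax(scores)]
    w = np.clip(scores[keep], 1e-8, None)    # assumption: keep weights positive
    w = w / w.sum()
    return np.tensordot(w, candidates[keep], axes=1)

def tapsampling_step(policy_chunks, encode, decode, verify, m=16, tau=0.0, seed=0):
    """policy_chunks: (N, horizon, action_dim) chunks already sampled from the policy."""
    rng = np.random.default_rng(seed)
    mus, sigmas = encode(policy_chunks)      # each (N, latent_dim)
    z = sample_mixture(mus, sigmas, m, rng)  # (M, latent_dim)
    candidates = decode(z)                   # (M, horizon, action_dim)
    scores = verify(candidates)              # (M,) predicted delta-p in [-1, 1]
    return fuse(candidates, scores, tau)
```

Sampling from the mixture by first picking a component uniformly and then drawing from that component's Gaussian is the standard way to sample an equal-weight mixture; the values `m=16` and `tau=0.0` are illustrative defaults only.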

Key Designs

  1. Action-VAE Learned Compressed Posterior for Efficient Sampling:

    • Function: Compresses action chunks into low-dimensional latent posteriors, allowing arbitrary multiples of candidate actions to be sampled while ensuring they remain close to the true policy distribution.
    • Mechanism: Both encoder and decoder are transformers. The encoder takes action chunk \(a\) and outputs a diagonal Gaussian \(q_\mathcal{E}(z\mid a)=\mathcal{N}(z;\mu_\mathcal{E}(a),\mathrm{diag}(\sigma_\mathcal{E}^2(a)))\); the decoder reconstructs actions from latents. The objective is \(\mathcal{L}_{avae}=\mathcal{L}_{rec}+\lambda_{KL}\mathcal{L}_{KL}\). During inference, \(N\) policy actions are encoded and equally mixed to form \(q_\text{mix}\), from which \(M\) latents are sampled for decoding.
    • Design Motivation: Repeated policy sampling increases latency linearly (observed \(\sim 8\times\) for \(k=16\)), while independent Gaussian sampling breaks correlations between time steps and dimensions within action chunks. Action-VAE's low-dimensional posterior is both "fast" (only \(N=4\) policy calls + one light decoding) and "high-fidelity" (Table 5 of the paper shows MMD reduced by roughly a third to nearly a half compared with Gaussian sampling: 0.064 vs 0.098 at \(\gamma=2\), and 0.112 vs 0.212 at \(\gamma=10\)).
  2. Progress Prediction Verifier Based on Expert Temporal Consistency:

    • Function: Assigns a semantic score to candidate actions based on "how much they advance task progress." Scores are interpretable (negative = hinders task, small positive = steady but slow, large positive = rapid advancement).
    • Mechanism: Assuming task progress grows linearly along an expert trajectory of length \(t\), step \(i\) is assigned \(p_i=i/t\), so executing \(k\) consecutive expert actions changes progress by \(k/t\). Positive samples come directly from the trajectories: \((l,s_i,a_{i:i+k-1})\mapsto k/t\). Negative samples are constructed by playing the actions back in reverse: \((l,s_i,a_{i:i-k+1}^r)\mapsto -k/t\) (reversing joint angles/poses has a clear physical meaning). The verifier uses a VLA-Adapter architecture (Qwen2.5-0.5B backbone + learnable queries + an action head modified to take the query actions as input and regress \(\Delta p\)). The loss is \(\mathcal{L}_{tap}=\lVert \mathcal{V}(s,l,a)-\Delta p\rVert_1\).
    • Design Motivation: Existing verifiers often produce uninterpretable scores or are tied to specific architectures. This design offers three benefits: (a) zero human annotation for training; (b) continuous progress values provide more information than 0/1 rewards; (c) it is decoupled from the original policy and generalizable across different architectures.
  3. Batch Parallel Verification with Shared Backbone:

    • Function: Reduces the verification latency of \(M\) candidates to nearly the same as one, preventing the verifier from becoming an inference bottleneck.
    • Mechanism: Since all candidates share the same \((s,l)\) context, the verifier runs the VLM backbone (Qwen2.5-0.5B) only once to get hidden states. These states are then duplicated into a batch and fed into the lightweight action head along with \(M\) candidate actions for parallel \(\Delta p\) regression.
    • Design Motivation: In contrast, RoboMonkey uses LLaVA-7B as a reward head, requiring the 7B backbone to run for every candidate. Ours is approximately \(12\times\) faster at \(k=16\), making inference-time sampling practical for real-time control.
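The shared-backbone idea in the third design can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the VLA-Adapter verifier: `ToyVerifier`, its random weights, and its layer shapes are all hypothetical. It only shows the control flow, where the expensive context encoder runs once and its output is shared across all \(M\) candidates in the cheap head.

```python
import numpy as np

class ToyVerifier:
    """Stand-in verifier: an expensive context encoder plus a cheap action head."""

    def __init__(self, ctx_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w_ctx = rng.standard_normal((ctx_dim, hidden)) / np.sqrt(ctx_dim)
        self.w_act = rng.standard_normal((act_dim, hidden)) / np.sqrt(act_dim)
        self.w_out = rng.standard_normal(hidden) / np.sqrt(hidden)
        self.backbone_calls = 0

    def backbone(self, context):
        """Expensive part (the VLM in the paper); depends only on (s, l)."""
        self.backbone_calls += 1
        return np.tanh(context @ self.w_ctx)                 # (hidden,)

    def head(self, hidden, actions):
        """Cheap part: combine context features with each flattened action chunk."""
        feats = np.tanh(hidden + actions @ self.w_act)       # (M, hidden)
        return np.tanh(feats @ self.w_out)                   # (M,) in [-1, 1]

    def score_batch(self, context, actions):
        h = self.backbone(context)                           # run once per decision
        h = np.broadcast_to(h, (len(actions), h.shape[-1]))  # share across candidates
        return self.head(h, actions)
```

Scoring 16 candidates through `score_batch` triggers one backbone call, whereas scoring them one at a time triggers 16; the scores are identical either way, which is why batching changes latency but not the result.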

Loss & Training

Two-stage independent training: (1) Action-VAE is trained on action data with \(\mathcal{L}_{rec}+\lambda_{KL}\mathcal{L}_{KL}\). (2) The verifier uses L1 regression for \(\Delta p\). The original policy is completely frozen and requires no fine-tuning. The threshold \(\tau\) and candidate count \(M\) are key hyperparameters at inference.
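Both training objectives can be written down compactly. The following is a sketch under stated assumptions rather than the paper's code: the reconstruction term is taken to be mean squared error and the KL term is taken against a standard normal prior (neither is specified in this summary), actions are assumed to be absolute targets so that a reversed chunk is simply the earlier actions in reverse order, and the function names are hypothetical.

```python
import numpy as np

def progress_samples(actions, k):
    """Build (start_index, action_chunk, delta_p) triples from one expert trajectory.

    actions: array of shape (t, action_dim); progress at step i is assumed to be i/t.
    """
    t = len(actions)
    samples = []
    for i in range(t):
        if i + k <= t:                      # forward chunk a_{i:i+k-1} -> +k/t
            samples.append((i, actions[i:i + k], k / t))
        if i - k + 1 >= 0:                  # reversed chunk a_{i:i-k+1} -> -k/t
            samples.append((i, actions[i - k + 1:i + 1][::-1], -k / t))
    return samples

def l1_loss(pred, target):
    """Verifier loss: mean absolute error between predicted and target delta-p."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def avae_loss(a, a_rec, mu, log_var, lambda_kl):
    """L_rec + lambda_KL * L_KL for a diagonal-Gaussian posterior."""
    rec = np.mean((a - a_rec) ** 2)         # assumption: MSE reconstruction
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), averaged over the batch.
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1).mean()
    return float(rec + lambda_kl * kl)
```

For a 10-step trajectory and \(k=2\), every sample carries a target of \(\pm 0.2\), which shows how the labels come for free from the expert data without any annotation.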

Key Experimental Results

Main Results

CALVIN ABC→D (zero-shot long-horizon manipulation; metrics: success rate for completing 1 to 5 consecutive tasks, and average completed sequence length, Avg.Len):

| Policy | Avg.Len (Orig.) | Avg.Len (+TapSampling) | Task 5 Success Gain |
|---|---|---|---|
| Diffusion Policy | 2.41 | 2.58 (+0.17) | +4.2 |
| OpenVLA | 3.30 | 3.51 (+0.21) | +6.4 |
| VPP (Strong Baseline) | 4.39 | 4.46 (+0.07) | +2.8 |

LIBERO-Long (Hardest subset): \(\pi_{0.5}\) average success rate 96.8% → 98.0%; Real Franka 7-DoF (Knock Down/Pick & Place/Stack, Seen/Unseen objects): \(\pi_0\) 78.3% → 83.3% (+10 on Unseen Stack).

Ablation Study

Comparison of three sampling strategies on CALVIN with VPP (\(k=16\) candidates):

| Sampling Strategy | Avg.Len | Task 5 | Single-step Latency (s) |
|---|---|---|---|
| VPP (Original) | 4.39 | 78.3 | 0.136 |
| Gaussian Sampling | 4.43 (+0.04) | 79.9 (+1.6) | 0.465 |
| Learned Posterior (Ours) | 4.46 (+0.07) | 81.1 (+2.8) | 0.488 |
| Policy Sampling (Upper Bound) | 4.50 (+0.11) | 82.4 (+4.1) | 2.638 |

Distribution Fidelity (MMD, lower is better, multi-bandwidth RBF):

| Bandwidth \(\gamma\) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| Gaussian | 0.098 | 0.155 | 0.187 | 0.204 | 0.212 |
| Ours | 0.064 | 0.082 | 0.095 | 0.104 | 0.112 |
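For readers unfamiliar with the metric, the MMD comparison above can be reproduced in spirit with a few lines of NumPy. This is a generic sketch, not the paper's evaluation code: the kernel parametrization \(k(x,y)=\exp(-\lVert x-y\rVert^2/(2\gamma^2))\) and the use of the biased squared-MMD estimator are assumptions, since the summary only says "multi-bandwidth RBF".

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Pairwise RBF kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * gamma ** 2))   # assumption: gamma acts as a length scale

def mmd2(x, y, gamma):
    """Biased estimate of squared MMD between two sample sets."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

def mmd2_multi_bandwidth(x, y, gammas=(2, 4, 6, 8, 10)):
    """Evaluate squared MMD at each bandwidth, mirroring the columns of the table."""
    return {g: mmd2(x, y, g) for g in gammas}
```

Here `x` would hold flattened action chunks sampled directly from the policy and `y` the flattened candidates from the sampler under test; a lower value means the candidates are statistically closer to the policy's own action distribution.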

Key Findings

  • Verifier is more critical than the sampler: Even weak Gaussian Sampling improves performance with the proposed verifier. Replacing Policy Sampling with Learned Posterior costs little performance (Avg.Len 4.46 vs 4.50) while cutting single-step latency by about \(5\times\) (0.488 s vs 2.638 s).
  • Verifier scores are semantically meaningful: Selecting the actions with the highest verifier scores leads to significantly shorter trajectories, while selecting the lowest-scoring ones leads to longer or failed trajectories, indicating that \(\Delta p\) tracks real task progress.
  • Weaker base policies gain more: Diffusion Policy/OpenVLA see gains of \(\sim\)+0.2 Avg.Len, while the already strong VPP gains +0.07, aligning with the intuition that verifiers primarily help avoid stochastic failures.
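The latency and performance trade-off in the first finding can be checked directly from the ablation table. The snippet below only does arithmetic on the reported numbers; the dictionary layout and the `summarize` helper are illustrative choices, not part of the paper.

```python
# Numbers copied from the ablation table (VPP on CALVIN, k=16 candidates).
rows = {
    "VPP (Original)":    {"avg_len": 4.39, "task5": 78.3, "latency_s": 0.136},
    "Gaussian Sampling": {"avg_len": 4.43, "task5": 79.9, "latency_s": 0.465},
    "Learned Posterior": {"avg_len": 4.46, "task5": 81.1, "latency_s": 0.488},
    "Policy Sampling":   {"avg_len": 4.50, "task5": 82.4, "latency_s": 2.638},
}
base = rows["VPP (Original)"]
upper = rows["Policy Sampling"]

def summarize(name):
    """Relate one sampling strategy to the original policy and to the upper bound."""
    r = rows[name]
    return {
        "speedup_vs_policy_sampling": upper["latency_s"] / r["latency_s"],
        "slowdown_vs_original": r["latency_s"] / base["latency_s"],
        "avg_len_gain_recovered": (r["avg_len"] - base["avg_len"])
                                  / (upper["avg_len"] - base["avg_len"]),
        "task5_gain_recovered": (r["task5"] - base["task5"])
                                / (upper["task5"] - base["task5"]),
    }

ours = summarize("Learned Posterior")
```

On these numbers the learned posterior is about 5.4 times faster than repeated policy sampling, about 3.6 times slower than the single-shot policy, and recovers roughly two thirds of the upper-bound gain on both Avg.Len and Task 5 success.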

Highlights & Insights

  • Inference-time scaling successfully migrated to robotics: While "multi-sampling + verifier" is common for LLMs/Diffusion, robotics was limited by the lack of interpretable verifiers and efficient samplers. This paper bridges both gaps.
  • "Reversed actions = negative samples" is a clever zero-cost data construction: Leveraging the physical reversibility of robot actions allows automatic generation of signal-rich negative samples from expert trajectories, avoiding preference annotation or RL rollouts.
  • Shared backbone + batch action heads is an engineering highlight for verifier design, allowing for higher complexity (0.5B VLM) while maintaining near-constant latency, applicable to other embodied tasks requiring score-and-select.
  • Weighted average (not argmax) for action output: Treating the verifier as a soft vote rather than a hard selection is more stable than executing only the top-scoring action.

Limitations & Future Work

  • The linear progress assumption \(p_i=i/t\) may fail in long-horizon, phased tasks where progress might jump between stages, requiring finer non-linear modeling.
  • Weighted fusion in multimodal action distributions might average into invalid actions (e.g., between two feasible trajectories); though thresholding mitigates this, it is not systematically discussed.
  • The 0.5B VLM verifier is not extremely lightweight; with 0.5s latency on Franka, it is still heavy for high-frequency control (>100Hz).
  • Experimental validation was limited to tabletop manipulation; effectiveness in mobile manipulation / bimanual / contact-rich tasks remains unverified.
  • Physical reversibility assumptions for negative samples may not strictly hold in tasks with collisions, friction, or non-conservative constraints.

Comparison with Related Work

  • vs. RoboMonkey (Kwok et al., 2025): RoboMonkey uses Gaussian sampling + LLaVA-7B reward head. Ours wins on both ends: lower MMD for the sampler, and \(12\times\) faster batch verification with semantic scores.
  • vs. TACO (Yang et al., 2025a): TACO uses state-count optimization for verifiers, which is more tightly coupled to the policy. TapSampling promotes plug-and-play generality with similar gains on LIBERO-Long.
  • vs. Progress estimation works (Zhai/Zhang 2025; Ghasemipour 2025): Their progress functions usually look only at state \(s\) for RL rewards. Ours explicitly conditions on candidate action \(a\) to predict \(\Delta p\) for inference scoring.
  • vs. LLM/Diffusion verifier-guided sampling (PRM, Best-of-N): Conceptually similar, but this work demonstrates that in physical execution, "verifier interpretability" and "latency control" are the decisive factors.

Rating

  • Novelty: ⭐⭐⭐⭐ Porting inference-time scaling to robotics with specific innovations like time-reversed negative samples and batch scoring.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three sim policies + LIBERO + multi-task Franka, covering latency, fidelity (MMD), and upper/lower bound comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and methodology with well-aligned formulas and tables.
  • Value: ⭐⭐⭐⭐ Plug-and-play capability for mainstream VLAs with high engineering reuse value; significant for the inference-time scaling paradigm in embodied AI.