TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation
Conference: ICML 2026
arXiv: 2605.25547
Code: Project Page (noted at the end of the paper, repository address not provided in text)
Area: Robotics
Keywords: Inference-time sampling, Action-VAE, Task progress verifier, Generalist robot policy, Plug-and-play
TL;DR
TapSampling is a policy-agnostic, plug-and-play inference-time sampling framework. An Action-VAE first encodes a small number of policy-generated actions into a low-dimensional posterior, from which many candidate actions can be sampled efficiently. A semantically interpretable verifier that predicts task-progress change then scores the candidates for weighted fusion. Without fine-tuning the original policy, it consistently improves the success rates of generalist robot policies such as Diffusion Policy, OpenVLA, VPP, \(\pi_0\), and \(\pi_{0.5}\) on CALVIN/LIBERO and real-world robots.
Background & Motivation
Background: Current mainstream generalist robot policies are mostly built on VLMs or video diffusion models and output actions through diffusion action heads or next-token prediction (e.g., OpenVLA, \(\pi_0\), VPP, Diffusion Policy). While they achieve strong performance by scaling data and models, they all follow a single-shot inference paradigm, producing only one action per decision.
Limitations of Prior Work: These non-deterministic paradigms (diffusion and autoregressive) naturally carry variance; the policy may succeed or fail unpredictably under the same environment and instruction. Single-shot inference lacks a mechanism to correct such stochastic failures. While the LLM and diffusion image communities have verified that multiple sampling + verifier selection at inference time can steadily improve performance, there are few principled solutions in robotics.
Key Challenge: Implementing inference-time sampling on robots is hindered by two factors. (1) Sampling: Repeatedly sampling directly from the policy is too slow, because latency grows with every additional policy call (about \(19\times\) the single-shot latency at 16 candidates in the ablation below), while sampling from a Gaussian distribution (as in RoboMonkey) ignores the correlations across dimensions within an action chunk, causing candidates to deviate from the true action distribution. (2) Verification: Low-level actions (joint angles/end-effector poses/gripper states) lack off-the-shelf VLM evaluators. Existing verifiers often use manual rewards, preference learning, or offline RL training, which lack interpretability and are coupled with specific policy architectures, preventing plug-and-play use.
Goal: Construct a policy-agnostic inference-time sampling framework that simultaneously solves the sub-problems of "efficient and high-fidelity sampling" and "semantically interpretable, high-throughput verification."
Key Insight: The authors observe that when humans judge robot actions, they are essentially estimating "how much this segment of action will advance task progress." This is a semantically meaningful continuous value. Since expert trajectories implicitly contain a monotonically increasing progress curve, positive and negative samples can be constructed at zero cost to train this progress predictor. For sampling, rather than re-running the policy, a VAE can be learned to compress the action distribution, encoding policy actions into a low-dimensional posterior for batch decoding.
Core Idea: Use the posterior learned by Action-VAE for efficient high-fidelity sampling, and task progress change prediction as a semantically interpretable verifier, combining them into the plug-and-play TapSampling.
Method
Overall Architecture
The inputs are the current observation \(s\) and instruction \(l\). The pipeline consists of three steps:
1. Low-N Policy Sampling: Invoke the original policy \(\pi(a\mid s,l)\) to obtain \(N\) (default \(N=4\)) initial action chunks \(A_\pi=\{a_i\}_{i=1}^N\).
2. Posterior Mixing + Decoding: Pass each \(a_i\) through the Action-VAE encoder \(\mathcal{E}\) to get a Gaussian \(q_\mathcal{E}(z\mid a_i)\), mix them as \(q_\text{mix}(z\mid A_\pi)=\frac{1}{N}\sum_i q_\mathcal{E}(z\mid a_i)\), sample \(M\) latents from it, and decode into \(M\) candidate actions \(A^*=\{\mathcal{D}(z_i)\}_{i=1}^M\) (decoding uses a single tiny transformer, latency \(<0.01\)s).
3. Progress Verification + Weighted Fusion: The verifier \(\mathcal{V}(s,l,a)\) predicts a "task progress change" \(\Delta p\in[-1,1]\) for each candidate. Candidates with \(\Delta p\) below a threshold are discarded, and the remaining ones are weighted-averaged by their scores to produce the final action.
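Steps 2 and 3 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_mixture` and `fuse`, the array shapes, the clipping of weights to be non-negative, and the fallback used when no candidate passes the threshold are all assumptions. The encoder and decoder are omitted; `mus` and `sigmas` stand in for the encoder outputs \(\mu_\mathcal{E}(a_i)\) and \(\sigma_\mathcal{E}(a_i)\).

```python
import numpy as np

def sample_mixture(mus, sigmas, m, rng):
    """Draw m latents from the equal-weight mixture (1/N) * sum_i N(mu_i, diag(sigma_i^2)).

    mus, sigmas: arrays of shape (N, latent_dim).
    """
    n, latent_dim = mus.shape
    comp = rng.integers(0, n, size=m)            # uniform choice of mixture component
    eps = rng.standard_normal((m, latent_dim))   # reparameterised Gaussian noise
    return mus[comp] + sigmas[comp] * eps

def fuse(candidates, scores, tau=0.0):
    """Discard candidates scoring below tau, then average the rest weighted by score.

    candidates: (M, horizon, action_dim); scores: (M,) predicted progress changes.
    """
    keep = scores >= tau
    if not keep.any():
        # Assumption: if nothing passes the threshold, execute the best-scoring candidate.
        return candidates[int(np.argmax(scores))]
    w = np.clip(scores[keep], 0.0, None)         # assumption: weights must be non-negative
    if w.sum() == 0.0:
        w = np.ones_like(w)                      # all kept scores are zero: uniform average
    w = w / w.sum()
    return np.tensordot(w, candidates[keep], axes=1)

# Usage with stand-in encoder outputs (N=4 policy actions, 8-dim latent, M=16 samples).
rng = np.random.default_rng(0)
mus = rng.standard_normal((4, 8))
sigmas = 0.1 * np.ones((4, 8))
latents = sample_mixture(mus, sigmas, m=16, rng=rng)   # shape (16, 8)
```

Sampling a component index first and then adding scaled noise is the standard way to draw from a Gaussian mixture; it keeps each latent close to one encoded policy action rather than to the average of all of them.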
Key Designs
- Action-VAE Learned Compressed Posterior for Efficient Sampling:
  - Function: Compresses action chunks into low-dimensional latent posteriors, allowing arbitrary multiples of candidate actions to be sampled while ensuring they remain close to the true policy distribution.
  - Mechanism: Both encoder and decoder are transformers. The encoder takes action chunk \(a\) and outputs a diagonal Gaussian \(q_\mathcal{E}(z\mid a)=\mathcal{N}(z;\mu_\mathcal{E}(a),\mathrm{diag}(\sigma_\mathcal{E}^2(a)))\); the decoder reconstructs actions from latents. The objective is \(\mathcal{L}_{avae}=\mathcal{L}_{rec}+\lambda_{KL}\mathcal{L}_{KL}\). During inference, \(N\) policy actions are encoded and equally mixed to form \(q_\text{mix}\), from which \(M\) latents are sampled for decoding.
  - Design Motivation: Repeated policy sampling increases latency linearly (observed \(\sim 8\times\) for \(k=16\)), while independent Gaussian sampling breaks correlations between time steps and dimensions within action chunks. Action-VAE's low-dimensional posterior is both "fast" (only \(N=4\) policy calls + one light decoding) and "high-fidelity" (Table 5 shows MMD is reduced by about 35% at \(\gamma=2\), 0.064 vs 0.098, and by nearly half at larger bandwidths, e.g., 0.082 vs 0.155 at \(\gamma=4\)).
- Progress Prediction Verifier Based on Expert Temporal Consistency:
  - Function: Assigns a semantic score to candidate actions based on "how much they advance task progress." Scores are interpretable (negative = hinders task, small positive = steady but slow, large positive = rapid advancement).
  - Mechanism: Assuming task progress grows linearly along expert trajectories, step \(i\) is assigned \(p_i=i/t\). Positive samples come from the trajectories: \((l,s_i,a_{i:i+k-1})\mapsto k/t\). Negative samples are constructed by reversing action playback: \((l,s_i,a_{i:i-k+1}^r)\mapsto -k/t\) (reversing joint angles/poses has clear physical meaning). The verifier uses a VLA-Adapter architecture (Qwen2.5-0.5B backbone + learnable queries + action head modified to take query actions as input to regress \(\Delta p\)). The loss is \(\mathcal{L}_{tap}=\lVert \mathcal{V}(s,l,a)-\Delta p\rVert_1\).
  - Design Motivation: Existing verifiers often produce uninterpretable scores or are tied to specific architectures. This design offers three benefits: (a) zero human annotation for training; (b) continuous progress values provide more information than 0/1 rewards; (c) it is decoupled from the original policy and generalizable across different architectures.
- Batch Parallel Verification with Shared Backbone:
  - Function: Reduces the verification latency of \(M\) candidates to nearly the same as one, preventing the verifier from becoming an inference bottleneck.
  - Mechanism: Since all candidates share the same \((s,l)\) context, the verifier runs the VLM backbone (Qwen2.5-0.5B) only once to get hidden states. These states are then duplicated into a batch and fed into the lightweight action head along with \(M\) candidate actions for parallel \(\Delta p\) regression.
  - Design Motivation: In contrast, RoboMonkey uses LLaVA-7B as a reward head, requiring the 7B backbone to run for every candidate. Ours is approximately \(12\times\) faster at \(k=16\), making inference-time sampling practical for real-time control.
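The shared-backbone idea can be illustrated with a toy verifier. The sketch below is an assumption-laden stand-in, not the VLA-Adapter architecture: `ToyVerifier`, its random linear layers, and the flattened context and action vectors are hypothetical. It only demonstrates the control flow, in which the expensive context encoding runs once and the cheap head scores all \(M\) candidates in one batch.

```python
import numpy as np

class ToyVerifier:
    """Toy stand-in for a verifier with an expensive backbone and a cheap action head."""

    def __init__(self, ctx_dim, act_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w_ctx = rng.standard_normal((ctx_dim, hidden)) / np.sqrt(ctx_dim)
        self.w_act = rng.standard_normal((act_dim, hidden)) / np.sqrt(act_dim)
        self.w_out = rng.standard_normal(hidden) / np.sqrt(hidden)
        self.backbone_calls = 0   # counts how often the "expensive" part runs

    def backbone(self, context):
        """Encode the shared (observation, instruction) context; assumed expensive."""
        self.backbone_calls += 1
        return np.tanh(context @ self.w_ctx)

    def head(self, hidden, actions):
        """Lightweight head: combine hidden state and actions, output scores in [-1, 1]."""
        feats = np.tanh(hidden + actions @ self.w_act)
        return np.tanh(feats @ self.w_out)

    def score_batch(self, context, actions):
        """Score M candidates with a single backbone pass."""
        h = self.backbone(context)                               # runs once
        h_rep = np.broadcast_to(h, (actions.shape[0], h.shape[0]))  # duplicated view, no copy
        return self.head(h_rep, actions)

verifier = ToyVerifier(ctx_dim=6, act_dim=4)
context = np.ones(6)
candidates = np.random.default_rng(1).standard_normal((16, 4))
scores = verifier.score_batch(context, candidates)   # 16 scores, one backbone call
```

Because the head is small relative to the backbone, the cost of scoring grows only slowly with the number of candidates in a design of this shape.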
Loss & Training
Two-stage independent training: (1) Action-VAE is trained on action data with \(\mathcal{L}_{rec}+\lambda_{KL}\mathcal{L}_{KL}\). (2) The verifier uses L1 regression for \(\Delta p\). The original policy is completely frozen and requires no fine-tuning. The threshold \(\tau\) and candidate count \(M\) are key hyperparameters at inference.
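The verifier's training targets follow directly from the linear-progress assumption described in the Mechanism of the progress verifier. The sketch below builds positive and negative samples from one expert trajectory and evaluates an L1 loss. The names `progress_samples` and `l1_loss` are hypothetical, observations and instructions are omitted, the loss is averaged over the batch, and the default `reverse_fn` (negating each action, which suits delta-style actions) is an assumption: the source does not specify how a reversed action is computed.

```python
import numpy as np

def progress_samples(actions, k, reverse_fn=lambda a: -a):
    """Build (step index, action chunk, progress label) triples from one expert trajectory.

    actions: array of shape (t, action_dim). Positive chunks a_{i:i+k-1} get label +k/t;
    reversed chunks a_{i:i-k+1} get label -k/t.
    """
    t = len(actions)
    samples = []
    for i in range(t):
        if i + k <= t:                       # forward chunk fits inside the trajectory
            samples.append((i, actions[i:i + k], k / t))
        if i - k + 1 >= 0:                   # enough history to play k steps backwards
            backwards = actions[i - k + 1:i + 1][::-1]
            samples.append((i, reverse_fn(backwards), -k / t))
    return samples

def l1_loss(pred, target):
    """Mean absolute error between predicted and target progress change."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

# Usage: a 10-step trajectory with 1-D actions and chunk length k=2.
traj = np.arange(10, dtype=float).reshape(10, 1)
data = progress_samples(traj, k=2)
labels = [label for _, _, label in data]
```

No human annotation enters this construction: every label is derived from the chunk length \(k\) and the trajectory length \(t\).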
Key Experimental Results
Main Results
CALVIN ABC→D (zero-shot long-horizon manipulation; metrics: success rate for completing 1 to 5 consecutive tasks, and average completed sequence length, Avg.Len):
| Policy | Avg.Len (Orig.) | Avg.Len (+TapSampling) | Task 5 Success Gain |
|---|---|---|---|
| Diffusion Policy | 2.41 | 2.58 (+0.17) | +4.2 |
| OpenVLA | 3.30 | 3.51 (+0.21) | +6.4 |
| VPP (Strong Baseline) | 4.39 | 4.46 (+0.07) | +2.8 |
LIBERO-Long (Hardest subset): \(\pi_{0.5}\) average success rate 96.8% → 98.0%; Real Franka 7-DoF (Knock Down/Pick & Place/Stack, Seen/Unseen objects): \(\pi_0\) 78.3% → 83.3% (+10 on Unseen Stack).
Ablation Study
Comparison of three sampling strategies on CALVIN with VPP (\(k=16\) candidates):
| Sampling Strategy | Avg.Len | Task 5 | Single-step Latency (s) |
|---|---|---|---|
| VPP (Original) | 4.39 | 78.3 | 0.136 |
| Gaussian Sampling | 4.43 (+0.04) | 79.9 (+1.6) | 0.465 |
| Learned Posterior (Ours) | 4.46 (+0.07) | 81.1 (+2.8) | 0.488 |
| Policy Sampling (Upper Bound) | 4.50 (+0.11) | 82.4 (+4.1) | 2.638 |
Distribution Fidelity (MMD, lower is better, multi-bandwidth RBF):
| Bandwidth \(\gamma\) | 2 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|---|
| Gaussian | 0.098 | 0.155 | 0.187 | 0.204 | 0.212 |
| Ours | 0.064 | 0.082 | 0.095 | 0.104 | 0.112 |
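For readers unfamiliar with the metric, the following is a minimal NumPy sketch of an RBF-kernel MMD estimate between two sample sets. It rests on assumptions: the kernel is taken as \(\exp(-\lVert x-y\rVert^2/(2\gamma^2))\), the estimator is the biased (V-statistic) squared MMD, and action chunks are flattened to vectors. The source does not state which kernel parameterisation or estimator the paper uses, so the values in the table above cannot be reproduced from this code alone.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Pairwise RBF kernel exp(-||x - y||^2 / (2 * gamma^2)) for x: (n, d), y: (m, d)."""
    d2 = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * gamma**2))   # clamp tiny negative distances

def mmd2(x, y, gamma):
    """Biased estimate of squared MMD between sample sets x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

# Usage: compare candidate samples against reference policy samples at several bandwidths.
rng = np.random.default_rng(0)
policy_actions = rng.standard_normal((200, 4))
close_candidates = rng.standard_normal((200, 4))
far_candidates = rng.standard_normal((200, 4)) + 3.0
for gamma in (2, 4, 6, 8, 10):
    near = mmd2(policy_actions, close_candidates, gamma)
    far = mmd2(policy_actions, far_candidates, gamma)
```

A lower value means the candidate set is harder to distinguish from the reference set under the chosen kernel, which is the sense in which the learned posterior is described as higher fidelity than Gaussian sampling.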
Key Findings
- Verifier is more critical than the sampler: Even weak Gaussian Sampling improves performance with the proposed verifier (+0.04 Avg.Len). Replacing Policy Sampling with the Learned Posterior costs little performance (4.46 vs 4.50 Avg.Len) while cutting single-step latency by about \(5\times\) (0.488 s vs 2.638 s), a favorable accuracy-latency trade-off.
- Verifier scores are semantically meaningful: Executing the highest-scoring candidates yields significantly shorter trajectories, while executing the lowest-scoring ones yields longer or failed trajectories, indicating that \(\Delta p\) tracks real task progress.
- Weaker base policies gain more: Diffusion Policy/OpenVLA see gains of \(\sim\)+0.2 Avg.Len, while the already strong VPP gains +0.07, aligning with the intuition that verifiers primarily help avoid stochastic failures.
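The latency trade-off behind these findings can be checked with simple arithmetic on the ablation table. The snippet below uses only the reported numbers; the variable names are illustrative. It gives a latency of about 3.6× the original policy for the learned posterior versus about 19.4× for policy sampling, a ratio of about 5.4× between the two, and shows that the learned posterior recovers roughly 64% of the upper-bound Avg.Len gain.

```python
# Values copied from the sampling-strategy ablation table (CALVIN, VPP).
latency_s = {
    "VPP (Original)": 0.136,
    "Gaussian Sampling": 0.465,
    "Learned Posterior (Ours)": 0.488,
    "Policy Sampling (Upper Bound)": 2.638,
}
avg_len = {
    "VPP (Original)": 4.39,
    "Gaussian Sampling": 4.43,
    "Learned Posterior (Ours)": 4.46,
    "Policy Sampling (Upper Bound)": 4.50,
}

base = "VPP (Original)"
ours = "Learned Posterior (Ours)"
upper = "Policy Sampling (Upper Bound)"

# Latency of each strategy relative to single-shot inference.
overhead = {name: sec / latency_s[base] for name, sec in latency_s.items()}

# How much faster the learned posterior is than repeated policy sampling.
speedup_vs_policy = latency_s[upper] / latency_s[ours]

# Fraction of the upper-bound Avg.Len improvement retained by the learned posterior.
gain_recovered = (avg_len[ours] - avg_len[base]) / (avg_len[upper] - avg_len[base])
```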
Highlights & Insights
- Inference-time scaling successfully migrated to robotics: While "multi-sampling + verifier" is common for LLMs/Diffusion, robotics was limited by the lack of interpretable verifiers and efficient samplers. This paper bridges both gaps.
- "Reversed actions = negative samples" is a clever zero-cost data construction: Leveraging the physical reversibility of robot actions allows automatic generation of signal-rich negative samples from expert trajectories, avoiding preference annotation or RL rollouts.
- Shared backbone + batch action heads is an engineering highlight of the verifier design: it permits a larger verifier (a 0.5B VLM) while keeping latency nearly constant in the number of candidates, and it applies to other embodied tasks that require score-and-select.
- Weighted average (not argmax) for action output: Treating the verifier as a soft vote rather than a hard selection is more stable than executing only the top-scoring action.
Limitations & Future Work
- The linear progress assumption \(p_i=i/t\) may fail in long-horizon, phased tasks where progress might jump between stages, requiring finer non-linear modeling.
- Weighted fusion in multimodal action distributions might average into invalid actions (e.g., between two feasible trajectories); though thresholding mitigates this, it is not systematically discussed.
- The 0.5B VLM verifier is not extremely lightweight; with 0.5s latency on Franka, it is still heavy for high-frequency control (>100Hz).
- Experimental validation was limited to tabletop manipulation; effectiveness in mobile manipulation / bimanual / contact-rich tasks remains unverified.
- Physical reversibility assumptions for negative samples may not strictly hold in tasks with collisions, friction, or non-conservative constraints.
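The multimodal-averaging concern can be made concrete with a toy example. Everything here is hypothetical: two feasible 2-D displacement actions pass an obstacle on the left or on the right, and `weighted_fusion` is a simplified stand-in that assumes positive scores and at least one candidate above the threshold. With equal scores the lateral components cancel and the fused action heads straight at the obstacle; if the verifier scores one mode below the threshold, the fused action stays inside the other mode.

```python
import numpy as np

def weighted_fusion(candidates, scores, tau):
    """Score-weighted mean of the candidates whose score is at least tau."""
    keep = scores >= tau
    w = scores[keep] / scores[keep].sum()
    return w @ candidates[keep]

# Two feasible modes: (forward, lateral) displacement passing left or right of an obstacle.
left = np.array([0.05, 0.10])
right = np.array([0.05, -0.10])
candidates = np.stack([left, left, right, right])

# Case 1: the verifier rates both modes equally, so the lateral motion averages out.
equal_scores = np.array([0.6, 0.6, 0.6, 0.6])
fused_equal = weighted_fusion(candidates, equal_scores, tau=0.3)

# Case 2: one mode falls below the threshold, so fusion stays within the other mode.
skewed_scores = np.array([0.6, 0.6, 0.1, 0.1])
fused_skewed = weighted_fusion(candidates, skewed_scores, tau=0.3)
```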
Related Work & Insights
- vs. RoboMonkey (Kwok et al., 2025): RoboMonkey uses Gaussian sampling + LLaVA-7B reward head. Ours wins on both ends—lower MMD for the sampler and \(12\times\) faster batch verification with semantic scores.
- vs. TACO (Yang et al., 2025a): TACO uses state-count optimization for verifiers, which is more tightly coupled to the policy. TapSampling promotes plug-and-play generality with similar gains on LIBERO-Long.
- vs. Progress estimation works (Zhai/Zhang 2025; Ghasemipour 2025): Their progress functions usually look only at state \(s\) for RL rewards. Ours explicitly conditions on candidate action \(a\) to predict \(\Delta p\) for inference scoring.
- vs. LLM/Diffusion verifier-guided sampling (PRM, Best-of-N): Conceptually similar, but this work demonstrates that in physical execution, "verifier interpretability" and "latency control" are the decisive factors.
Rating
- Novelty: ⭐⭐⭐⭐ Porting inference-time scaling to robotics with specific innovations like time-reversed negative samples and batch scoring.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three sim policies + LIBERO + multi-task Franka, covering latency, fidelity (MMD), and upper/lower bound comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and methodology with well-aligned formulas and tables.
- Value: ⭐⭐⭐⭐ Plug-and-play capability for mainstream VLAs with high engineering reuse value; significant for the inference-time scaling paradigm in embodied AI.