SITCOM: Scaling Inference-Time COMpute for VLAs

Conference: NeurIPS 2025 · arXiv: 2510.04041 · Code: N/A · Area: Multimodal VLM
Keywords: Inference-time compute scaling, VLA, world model, model predictive control, robotic manipulation

TL;DR

SITCOM proposes an inference-time compute scaling framework inspired by Model Predictive Control (MPC). It performs multi-step rollout simulation of a pretrained VLA using a learned dynamics model and selects optimal trajectories via a reward model, transforming a single-step VLA into a robust long-horizon planner. On the SIMPLER benchmark, it improves task success rate from 48% to 72%.

Background & Motivation

Robot learning has long been constrained by three core challenges: the high cost of acquiring annotated data, limited generalization, and difficulty in long-horizon planning. Vision-Language-Action (VLA) models have made notable progress by grounding natural language instructions into control commands, yet they face critical limitations in real-world deployment:

  • Lack of lookahead: VLAs are inherently single-step predictors and cannot evaluate the long-term consequences of actions.
  • Accumulated errors: In open-loop execution, small errors compound across steps, causing failures in multi-step tasks.
  • Poor adaptability to dynamic environments: Plans cannot be adjusted in response to environmental changes during execution.

Existing solutions either train explicit reasoning via chain-of-thought (CoT) data—which requires expensive annotation—or employ world models that are often computationally expensive and task-specific.

Key Insight: SITCOM transfers the concept of inference-time compute scaling from language models to robot control. Rather than modifying the training paradigm, it "thinks more" at inference time through parallel rollouts and reward-based ranking before acting—analogous to MPC: lookahead simulation, evaluation, and selection at each decision step.

Method

Overall Architecture

SITCOM's inference pipeline (a minimal sketch of the loop follows the list):

  1. Candidate generation: At each decision step, \(n\) candidate actions are sampled from the VLA policy with a high sampling temperature.
  2. Rollout simulation: For each candidate action, a dynamics model predicts the next-state image; the VLA then samples a subsequent action from that predicted image, iterating for \(l\) steps to generate complete trajectories.
  3. Reward evaluation: A reward is computed for the final state of each trajectory, incorporating gripper-object distance, object-target distance, and a grasp success indicator.
  4. Trajectory selection and execution: The action sequence from the highest-reward trajectory is executed in the real environment.
  5. Repetition: The loop repeats at the environment's replanning frequency until task completion.
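The loop above maps directly onto a few lines of control code. Here is a minimal, hypothetical sketch of the rollout-and-ranking step; the interfaces `vla.sample_action`, `dynamics.predict_next`, and `reward_fn` are illustrative assumptions, not the paper's actual API.

```python
def sitcom_plan(obs, instruction, vla, dynamics, reward_fn,
                n_candidates=5, rollout_len=10, temperature=1.0):
    """Return the action sequence of the best-scoring imagined trajectory."""
    best_actions, best_reward = None, float("-inf")
    for _ in range(n_candidates):
        img, actions = obs, []
        for _ in range(rollout_len):
            # High-temperature sampling from the VLA for candidate diversity.
            action = vla.sample_action(img, instruction, temperature=temperature)
            actions.append(action)
            # Imagine the next observation with the learned dynamics model.
            img = dynamics.predict_next(img, action)
        # Score only the final imagined state of the trajectory.
        reward = reward_fn(img)
        if reward > best_reward:
            best_reward, best_actions = reward, actions
    return best_actions
```

At deployment, the selected sequence is executed until the next replanning point, after which planning restarts from the newly observed state.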

Two rollout modes are provided:

  • SITCOM (EnvSim): Oracle rollouts using environment instances (serves as an upper bound).
  • SITCOM (Dynamics): Rollouts using the learned dynamics model (the practically deployable variant).

Key Designs

  1. Transformer dynamics model: An encoder-decoder architecture is adopted. The encoder processes image patches concatenated with action information; the decoder predicts patches of the next frame. Joint training uses an \(L_1\) pixel loss and LPIPS perceptual loss to balance low-level accuracy with high-level visual coherence. A two-stage training strategy is employed: (1) pretraining on approximately 25,000 BridgeV2 trajectories to learn general dynamics; (2) fine-tuning on SIMPLER environment trajectories to adapt to target environment visuals and physics, bridging the Real2Sim gap.
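A rough sketch of the training objective, assuming a PyTorch setup and the `lpips` package; the weight `w_lpips` and the normalization convention are assumptions, not values from the paper:

```python
import torch.nn.functional as F
import lpips  # pip install lpips

perceptual = lpips.LPIPS(net="vgg")  # pretrained perceptual metric

def dynamics_loss(pred_frame, next_frame, w_lpips=1.0):
    """Composite L1 + LPIPS loss; frames assumed normalized to [-1, 1]."""
    l1 = F.l1_loss(pred_frame, next_frame)            # pixel-level accuracy
    perc = perceptual(pred_frame, next_frame).mean()  # perceptual coherence
    return l1 + w_lpips * perc
```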

  2. DAgger-style adaptation strategy: A model trained only on single-step prediction performs well for one step but suffers severe object reconstruction degradation during multi-step rollouts due to compounding errors. Inspired by DAgger, the model is trained to predict from its own previous predictions rather than always from ground-truth observations, aligning the training distribution with the autoregressive inference distribution and substantially reducing prediction drift in long-horizon rollouts.
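A minimal sketch of what such a training step might look like, assuming PyTorch; the horizon, the L1-only supervision (LPIPS omitted for brevity), and the `detach` choice are illustrative assumptions:

```python
import torch.nn.functional as F

def rollout_training_step(model, frames, actions, optimizer, horizon=5):
    """DAgger-style step: train the dynamics model on its own rollouts.

    frames:  ground-truth clip, shape (horizon + 1, B, C, H, W)
    actions: executed actions,  shape (horizon, B, action_dim)
    """
    optimizer.zero_grad()
    current, loss = frames[0], 0.0
    for t in range(horizon):
        pred = model(current, actions[t])
        # Supervise each step against the ground-truth next frame.
        loss = loss + F.l1_loss(pred, frames[t + 1])
        # Key move: condition the next step on the model's own prediction
        # (detached to bound memory), not the ground-truth frame, so the
        # training distribution matches autoregressive inference.
        current = pred.detach()
    (loss / horizon).backward()
    optimizer.step()
```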

  3. VLA fine-tuning: As no public expert data exists for the SIMPLER environment, approximately 100 expert trajectories are curated: the pretrained model generates trajectories, heuristic rules filter successful executions, and human review ensures high quality. Standard cross-entropy loss is used to fine-tune the discretized action tokens.
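For concreteness, the fine-tuning objective reduces to ordinary token-level cross-entropy; a minimal sketch, with shapes and names as illustrative assumptions:

```python
import torch.nn.functional as F

def action_token_loss(logits, target_tokens):
    """logits: (B, T, vocab) from the VLA head; target_tokens: (B, T)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        target_tokens.reshape(-1),            # discretized action bins
    )
```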

Loss & Training

The dynamics model employs a composite \(L_1\) + LPIPS loss: \(L_1\) ensures pixel-level accuracy, while LPIPS ensures perceptual visual fidelity. The reward comprises three signals: gripper-object clearance (guiding approach), object-target distance (guiding placement), and a grasp success indicator.
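A hedged sketch of how such a reward might be composed; the weights and the state accessors (`gripper_pos`, `object_pos`, `target_pos`, `grasped`) are illustrative assumptions, with state read from the oracle simulator as noted in the limitations:

```python
import numpy as np

def trajectory_reward(state, w_reach=1.0, w_place=1.0, w_grasp=1.0):
    """Score the final rollout state; higher is better."""
    reach = -np.linalg.norm(state.gripper_pos - state.object_pos)  # approach
    place = -np.linalg.norm(state.object_pos - state.target_pos)   # placement
    grasp = 1.0 if state.grasped else 0.0                          # grasp bonus
    return w_reach * reach + w_place * place + w_grasp * grasp
```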

Default configuration: rollout length of 10 steps, 5 candidate trajectories. Planning time scales linearly with the number of candidates (approximately 35 seconds for 5 candidates, approximately 160 seconds for 25 candidates).

Key Experimental Results

Main Results — Task Success Rate on SIMPLER

| Task | OpenVLA | OpenVLA-SFT | SITCOM (EnvSim) | SITCOM (Dynamics) |
|---|---|---|---|---|
| Put carrot on plate | 0.0 | 0.50 | 0.71 | 0.66 |
| Put spoon on tablecloth | 0.0 | 0.63 | 0.83 | 0.83 |
| Stack green block on yellow block | 0.042 | 0.17 | 0.58 | 0.62 |
| Put eggplant in basket | 0.0 | 0.63 | 0.92 | 0.79 |
| Average | 0.01 | 0.48 | 0.76 | 0.72 |

Dynamics Model Quality

| Model | FID ↓ | OFL ↓ |
|---|---|---|
| Base model (BridgeV2 only) | 17.0 | 1.665 |
| Fine-tuned model | 11.2 | 0.992 |

Ablation Study — Effect of Number of Candidates

| # Candidates | 1 | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|---|
| Planning time (s) | 21 | 35 | 75 | 100 | 130 | 160 |

Key Findings

  • Inference-time compute scaling is effective in robot control: success rate improves from 48% (OpenVLA-SFT) to 72–76% (SITCOM).
  • The learned dynamics model (72%) comes within 4 percentage points of the oracle simulator (76%), validating the feasibility of learned-model rollouts.
  • VLA fine-tuning alone improves performance from 1% to 48%, resolving approximately 40% of the Real2Sim gap.
  • Increasing the number of candidates yields consistent gains up to 25 candidates, with earlier saturation on some tasks.
  • Complex tasks (e.g., placing eggplant in basket) benefit more from longer rollouts.
  • DAgger-style adaptation substantially improves object reconstruction quality in long-horizon rollouts, though prediction drift is not fully eliminated.
  • Primary failure modes: VLA Real2Sim gap and insufficient dexterous manipulation capability (e.g., slight mistiming of gripper closure causing objects to slip).

Highlights & Insights

  • The core idea is concise and powerful: transferring "inference-time compute scaling" from LLMs to robot control, trading more computation for better decisions.
  • The MPC-style rollout-and-ranking framework is general and compatible with any VLA policy.
  • The DAgger-style training strategy elegantly addresses the distributional shift inherent in autoregressive prediction.
  • The qualitative analysis of failure modes is thorough and candid, clearly identifying dexterous manipulation and the Real2Sim gap as primary bottlenecks.
  • The two-stage dynamics model training (large-scale pretraining + in-domain fine-tuning) is a practical and reproducible recipe.

Limitations & Future Work

  • The reward signal relies on oracle simulator states (assuming perfect environment knowledge); real-world deployment requires replacing this with a learned reward model.
  • A deterministic dynamics model struggles with stochastic environments; future work could explore probabilistic action-conditioned video diffusion models.
  • Inference time is a significant bottleneck (35 seconds for 5 candidates), limiting real-time control frequency.
  • Evaluation is conducted only in simulation (SIMPLER); real-robot experiments are absent.
  • The dynamics model is trained solely on successful trajectories, yielding unreliable predictions for failure states.
  • Rollout length must be set manually, with no adaptive mechanism despite task-dependent optimal values.
  • Promising future directions include combining action chunking to reduce the number of inference calls, and developing general vision-based reward models.

Comparison with Related Approaches

  • Unlike explicit task decomposition via CoT reasoning (e.g., ECoT and zero-shot annotation methods), SITCOM conducts long-horizon reasoning through implicit simulation-based evaluation.
  • Compared to world-model approaches such as Dreamer and TD-MPC2, SITCOM predicts in pixel space using a lightweight Transformer, balancing visual fidelity with computational efficiency.
  • Compared to generative video models such as GAIA-1 and UniSim, SITCOM avoids expensive diffusion architectures.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐