This paper proposes Sparse Imagination, which achieves substantial inference speedup in ViT patch token-based world model planning by randomly dropping tokens and training with randomly grouped attention (50% drop rate reduces planning time by ~50%), while maintaining or even surpassing full-token planning performance on certain tasks. A key finding is that simple random dropout outperforms sophisticated token selection methods, as static importance ranking suffers from a "blind spot problem" in dynamic planning scenarios.
Background & Key Challenges:
1. World model-based planning (MPC) enables decision-making by imagining future trajectories, but each planning step requires \(K \times M \times H\) world model forward passes, and the cost of each pass scales quadratically with the token count.
2. ViT patch tokens as visual state representations (e.g., DINO-WM) retain richer spatial information than a single CLS token, offering clear advantages in fine-grained manipulation tasks.
3. However, the computational overhead of full patch tokens in MPC makes real-time deployment difficult, especially in compute-constrained robotic settings (embedded GPUs), which demand substantial reductions in inference cost without sacrificing accuracy.
4. ViT representations are known to be redundant: multiple studies (Raghu et al., Pan et al., Kim et al.) show that not all patches are equally important for downstream tasks.
5. Existing token pruning methods (attention ranking, learned selection, merging, training-time dropout) are effective on static tasks such as classification but have not been validated in iterative, dynamic planning scenarios.
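To make the cost structure concrete, here is a toy cost model (the values of \(K, M, H, N\) below are hypothetical, and `planning_cost` is an illustrative helper, not from the paper). Token dropping leaves the number of forward passes unchanged but shrinks the quadratic attention cost of each pass; the paper's ~50% wall-clock reduction at a 50% drop rate is plausible since attention is only part of the total compute.

```python
def planning_cost(K, M, H, N, drop_rate=0.0):
    """Return (world-model forward passes per planning step,
    relative self-attention cost per forward pass)."""
    forward_passes = K * M * H               # K candidates x M iterations x H horizon
    kept = int(round((1 - drop_rate) * N))   # tokens kept after random dropping
    attention_cost = kept ** 2               # self-attention scales quadratically
    return forward_passes, attention_cost

full = planning_cost(K=32, M=10, H=5, N=196)
half = planning_cost(K=32, M=10, H=5, N=196, drop_rate=0.5)
print(full[0])            # 1600 forward passes in both cases
print(half[1] / full[1])  # per-pass attention cost shrinks to 0.25
```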
World Model Architecture: A frozen pretrained DINO encoder extracts visual patch tokens \(z_t \in \mathbb{R}^{H_p \times W_p \times D}\); a causal Transformer decoder predicts future token sequences.
Training Loss: MSE prediction loss \(\mathcal{L}_{wm} = \frac{1}{N}\sum_{i=1}^N \|\hat{z}_{t+1,i} - z_{t+1,i}\|^2\); goal distance is also measured with MSE.
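The loss above is just a mean over tokens of squared L2 distances; a minimal NumPy sketch (shapes and the \(14 \times 14\), \(D = 384\) values are illustrative assumptions):

```python
import numpy as np

def wm_loss(z_pred, z_true):
    # Mean over the N tokens of the squared L2 distance between
    # predicted and ground-truth D-dimensional patch tokens.
    return np.mean(np.sum((z_pred - z_true) ** 2, axis=-1))

rng = np.random.default_rng(0)
z_true = rng.normal(size=(196, 384))   # e.g. 14x14 patches, D = 384 (assumed)
z_pred = z_true + 0.1                  # uniform per-dimension error of 0.1
# Each token contributes 384 * 0.1**2 = 3.84, so the mean is ~3.84.
print(wm_loss(z_pred, z_true))
```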
Sparse Imagination: At world model inference time, a fraction \(p\) of patch tokens is randomly dropped, and forward prediction is performed using only \((1-p)N\) tokens.
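The inference-time dropping step can be sketched in a few lines (function name and shapes are illustrative; the paper's implementation details may differ):

```python
import numpy as np

def drop_tokens(z, p, rng):
    """Randomly keep (1-p)*N of the N patch tokens in z (shape (N, D))."""
    n = z.shape[0]
    keep = rng.choice(n, size=int((1 - p) * n), replace=False)
    keep.sort()                      # preserve the spatial order of kept patches
    return z[keep], keep             # forward the world model on this subset

rng = np.random.default_rng(0)
z = rng.normal(size=(196, 384))      # assumed 14x14 patches, D = 384
z_sparse, idx = drop_tokens(z, p=0.5, rng=rng)
print(z_sparse.shape)                # (98, 384): half the tokens remain
```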
Randomly Grouped Attention Training: During training, tokens of each frame are randomly partitioned into two groups; attention masks restrict interactions to within each group, enabling the model to handle arbitrary token subsets. Group assignments are kept consistent along the temporal dimension.
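A sketch of the training-time mask construction, under my reading of the description: two random groups per the paper, one shared assignment across all \(T\) frames, causal attention over frames, and within-group attention only (the function and block layout are my assumptions):

```python
import numpy as np

def grouped_attention_mask(T, N, rng):
    """Boolean (T*N, T*N) mask: True where attention is allowed.
    Tokens are split into two random groups; the assignment is shared
    across all T frames, and frame t may only attend to frames <= t."""
    groups = rng.integers(0, 2, size=N)              # per-token group id
    same_group = groups[:, None] == groups[None, :]  # (N, N) within-group mask
    causal = np.tril(np.ones((T, T), dtype=bool))    # causal over frames
    # Allowed iff the frame pair is causal AND the tokens share a group.
    mask = np.kron(causal, same_group).astype(bool)
    return mask, groups
```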
MPC Integration: The dropout mask is resampled independently at each planning step; both prediction and CEM optimization are performed on the sparse token set.
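A hedged sketch of one planning step showing where the mask resampling sits inside CEM. Here `rollout` is a stand-in for the world model's \(H\)-step rollout, and all hyperparameter values are placeholders, not the paper's:

```python
import numpy as np

def plan_step(z0, z_goal, rollout, action_dim, H=5, K=32, iters=3,
              elite_frac=0.25, p=0.5, rng=None):
    """One MPC step: CEM over action sequences, with the token mask
    resampled independently at every CEM iteration."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = np.zeros((H, action_dim))
    sigma = np.ones((H, action_dim))
    n_elite = max(1, int(elite_frac * K))
    N = z0.shape[0]
    for _ in range(iters):
        keep = rng.choice(N, size=int((1 - p) * N), replace=False)  # fresh mask
        acts = mu + sigma * rng.normal(size=(K, H, action_dim))
        # Goal distance is MSE, evaluated on the sparse token subset only.
        costs = np.array([np.mean((rollout(z0[keep], a) - z_goal[keep]) ** 2)
                          for a in acts])
        elites = acts[np.argsort(costs)[:n_elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # execute the first action of the refined mean sequence
```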
VLA-Guided Planning: For long-horizon tasks, \(K\) candidate action sequences are sampled from a pretrained VLA (SmolVLA) to replace CEM's random sampling, substantially improving long-horizon planning efficiency.
Key Finding: Simple random sampling outperforms complex attention-based or learned ranking methods because static importance metrics exhibit "blind spots" in dynamic planning: patches that appear unimportant in the current state may become critical when evaluating candidate action sequences.
Random sampling avoids systematic omissions through unbiased coverage — resampling the mask at each iteration ensures every region has a nonzero probability of being included.
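The coverage argument is a one-liner: if masks are drawn independently, a patch is missed in all \(T\) iterations only with probability \(p^T\) (the values of \(p\) and \(T\) below are illustrative, not the paper's settings):

```python
# With drop ratio p and an independently resampled mask at each of T
# planning iterations, a given patch is excluded from every single
# iteration with probability p**T.
p, T = 0.5, 10
never_seen = p ** T
print(never_seen)   # 0.0009765625: under 0.1% chance of a total blind spot
```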
This finding contradicts the common conclusion in the token pruning literature that "learned selection outperforms random," highlighting the unique characteristics of planning scenarios.
The optimal drop ratio must be tuned manually per task; an adaptive mechanism is lacking. A potential improvement is to dynamically adjust the ratio based on task complexity or the current state.
The number of groups is fixed at 2; the effect of more groups (e.g., 3–4) is unexplored.
The method relies on the redundancy assumption of DINO features, which may not hold in information-dense scenarios (e.g., text-heavy interfaces).
Real-world validation is limited to two relatively simple tasks (PickPlace + Drawer); more complex manipulation tasks are not tested.
Integration with token merging methods (e.g., ToMe) is unexplored — combining sparse selection with merging could further improve efficiency.
vs. Dreamer series (Hafner et al.): Dreamer imagines in a low-dimensional vector latent space, whereas this paper imagines in a high-dimensional patch token space, retaining richer spatial information at greater computational cost. Sparse Imagination narrows that cost gap while keeping the richer representation.
vs. DINO-WM (Zhou et al. 2024): This work builds directly on DINO-WM and addresses its computational bottleneck via sparse imagination.
vs. ToMe (Bolya et al.): ToMe reduces computation through token merging; this work uses token dropping — a simpler design that requires no additional merging logic.
vs. SmolVLA (Shukor et al.): SmolVLA provides a pretrained policy for guided planning; sparse imagination accelerates the world model evaluation under VLA-guided planning.
Insights: The sparse imagination paradigm can be generalized to other settings requiring many forward passes — e.g., value network evaluation in MCTS search, or world simulation in multi-step reasoning.