Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning¶
Conference: CVPR2026 arXiv: 2601.09708 Code: Project Page Area: Robotics Keywords: VLA, reasoning, latent CoT, knowledge distillation, preference learning, robot manipulation
TL;DR¶
This paper proposes Fast-ThinkAct, which compresses verbose textual CoT reasoning (~250 tokens) into 6 verbalizable continuous latent tokens. Combined with reward-guided preference distillation and visual trajectory alignment, this cuts inference latency by 89.3% (9.3× faster than ThinkAct-7B) while matching or surpassing the performance of state-of-the-art reasoning VLAs.
Background & Motivation¶
Vision-Language-Action (VLA) tasks require agents to reason over complex visual scenes and execute adaptive actions. Recent VLA models are primarily trained via large-scale robot demonstrations with supervised learning, achieving strong performance on basic skills (e.g., pick-and-place), but exhibiting insufficient generalization in the following areas:
- Long-horizon planning: Complex tasks requiring multi-step reasoning (e.g., turning on a stove before placing a pan)
- Failure recovery: Detecting failures at runtime and generating corrective plans
- Few-shot adaptation: Rapidly adapting to novel scenes and tasks
Reasoning VLAs (e.g., ThinkAct, CoT-VLA, MolmoAct) improve generalization by incorporating explicit chain-of-thought reasoning. However, generating lengthy reasoning chains introduces severe inference latency bottlenecks:
- ThinkAct-7B requires approximately 7.5 seconds per step (~0.1 Hz)
- Robot manipulation requires real-time decision frequencies of 1–15 Hz
- ECoT-Lite attempts to accelerate via reasoning dropout, but directly truncating textual reasoning discards critical information, leading to performance degradation
Core motivation: How can verbose textual CoT be compressed into a compact representation while preserving reasoning capability and correctly capturing spatiotemporal dynamics?
Core Problem¶
- Textual CoT reasoning generates long sequences (~250 tokens), with inference latency reaching several seconds, rendering real-time manipulation infeasible
- Latent reasoning methods from the LLM domain (e.g., Coconut, CODI) cannot be directly transferred to VLA tasks, which require spatiotemporal understanding and must bridge semantic reasoning with embodied control
- After compressing reasoning into the continuous latent space, direct supervision signals guiding what the latent should encode are absent
Method¶
Overall Architecture¶
Fast-ThinkAct comprises three core stages:
- Reward-Guided Preference Distillation: The teacher's GRPO reward signal guides the student to learn high-quality latent reasoning
- Visual Trajectory Alignment: Aligns trajectory-level representations between teacher and student to transfer visual planning capability
- Reasoning-Enhanced Policy Learning: Freezes the student VLM and augments the action model with latent reasoning features for action generation
Stage 1: Verbalizable Latent CoT by Reward Preferences¶
Teacher training: The textual teacher VLM \(\mathcal{F}_{\theta^T}\) is trained via GRPO from a CoT-SFT checkpoint using action-aligned visual rewards to generate explicit textual reasoning chains. The GRPO advantage function \(A(\tau)\) naturally serves as an indicator of reasoning quality.
Constructing preference pairs: The highest- and lowest-advantage reasoning chains are selected from each rollout group as positive and negative samples:
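Formally, this selection can be sketched as follows (a reconstruction from the description above, since the paper's own equation is not reproduced here): for a rollout group \(G\) scored by the GRPO advantage \(A(\cdot)\),

\[
\tau^{+} = \arg\max_{\tau \in G} A(\tau), \qquad \tau^{-} = \arg\min_{\tau \in G} A(\tau).
\]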
Student learning: The student VLM \(\mathcal{F}_\theta\) does not generate text tokens; instead, it autoregressively generates \(M=6\) continuous latent vectors \(\mathbf{z} = \{z_m\}_{m=1}^M\), where \(z_m \in \mathbb{R}^d\).
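A minimal numpy sketch of this autoregressive latent rollout, with toy dimensions and a random linear map standing in for the full VLM forward pass (all names here are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # latent dimension (toy; the paper uses the VLM hidden size)
M = 6           # number of continuous latent tokens, as in the paper

# Toy stand-in for one decoding step of the student VLM: the previous
# hidden state is mapped to the next continuous latent token. In the
# real model this is a full transformer forward pass.
W = rng.standard_normal((d, d)) / np.sqrt(d)

def next_latent(h):
    # tanh keeps the toy rollout bounded; the real model emits raw hidden states
    return np.tanh(h @ W)

h = rng.standard_normal(d)    # hidden state after the multimodal prompt
latents = []
for _ in range(M):
    z = next_latent(h)
    latents.append(z)
    h = z                     # each latent is fed back autoregressively

Z = np.stack(latents)         # shape (M, d): the latent CoT z_1..z_M
print(Z.shape)                # (6, 16)
```

The key contrast with textual CoT is that each step emits a continuous vector directly, so no vocabulary projection or sampling of ~250 text tokens is needed.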
Verbalizer: A verbalizer LLM \(\mathcal{V}_\psi\) (Qwen3-0.6B with inserted cross-attention layers) is introduced to decode latents into natural language. The training objective encourages the verbalizer to assign higher likelihood to high-quality reasoning \(\tau^+\):
This is a DPO-style objective where \(\beta=0.1\) controls preference strength. Through this formulation, the student is guided to encode latents that the verbalizer can decode into high-quality reasoning.
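One standard DPO-style instantiation consistent with this description (a sketch; \(\mathcal{V}_{\text{ref}}\) denotes a frozen reference verbalizer, and the paper's exact form may differ) is:

\[
\mathcal{L}_{\text{verb}} = -\log \sigma\!\left( \beta \log \frac{\mathcal{V}_\psi(\tau^{+} \mid \mathbf{z})}{\mathcal{V}_{\text{ref}}(\tau^{+} \mid \mathbf{z})} - \beta \log \frac{\mathcal{V}_\psi(\tau^{-} \mid \mathbf{z})}{\mathcal{V}_{\text{ref}}(\tau^{-} \mid \mathbf{z})} \right), \qquad \beta = 0.1.
\]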
Stage 2: Action-Aligned Visual Plan Distillation¶
The hidden states of the teacher and student at the <answer> token are aligned to transfer trajectory-level visual planning capability:
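A natural form for this alignment (an assumption; the paper's exact distance function is not reproduced here) is an \(\ell_2\) loss between the two hidden states at the answer position, with gradients stopped into the frozen teacher:

\[
\mathcal{L}_{\text{distill}} = \left\lVert h^{S}_{\langle\text{answer}\rangle} - \mathrm{sg}\!\left(h^{T}_{\langle\text{answer}\rangle}\right) \right\rVert_2^2,
\]

where \(h^{S}\) and \(h^{T}\) are the student and teacher hidden states and \(\mathrm{sg}(\cdot)\) denotes stop-gradient.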
Additionally, \(K=5\) learnable spatial tokens \(\{s_i\}_{i=1}^K\) are appended after the latent reasoning sequence. The output hidden state of each spatial token is projected in parallel through an MLP to a waypoint \(p_i \in \mathbb{R}^6\) (format \([x_{\text{single}}, y_{\text{single}}, x_{\text{left}}, y_{\text{left}}, x_{\text{right}}, y_{\text{right}}]\)), replacing the teacher's autoregressive generation of 60–70 waypoint text tokens.
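The parallel waypoint head can be sketched as follows, with a toy hidden size and a hypothetical two-layer MLP (the paper's exact head architecture may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 16, 5                  # toy hidden size; K = number of spatial tokens

# Output hidden states of the K learnable spatial tokens from one forward pass
H = rng.standard_normal((K, d))

# Hypothetical two-layer MLP head mapping each token to a 6-D waypoint
W1, b1 = rng.standard_normal((d, 32)) / np.sqrt(d), np.zeros(32)
W2, b2 = rng.standard_normal((32, 6)) / np.sqrt(32), np.zeros(6)

def mlp(h):
    # ReLU MLP: hidden state -> 6-D waypoint
    return np.maximum(h @ W1 + b1, 0) @ W2 + b2

# All K waypoints come from one parallel projection -- no autoregressive
# decoding of 60-70 waypoint text tokens as in the teacher.
P = np.stack([mlp(h) for h in H])   # shape (K, 6)
print(P.shape)                      # (5, 6)
```

Since the K projections are independent given the hidden states, this step costs a single forward pass regardless of K.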
The overall training objective is: \(\mathcal{L}_{\text{student}} = \mathcal{L}_{\text{verb}} + \mathcal{L}_{\text{distill}} + \mathcal{L}_{\text{ans}}\)
Stage 3: Reasoning-Enhanced Policy Learning¶
The student VLM \(\mathcal{F}_\theta\) is frozen, and visual latent planning features \(c_t\) are extracted from the early-layer KV cache of the spatial tokens, then injected into a diffusion Transformer action model \(\pi_\phi\) (DiT-Policy or RDT) via cross-attention:
Ablations confirm that KV from earlier layers captures visual planning information better than KV from deeper layers (LIBERO: 89.7 vs. 88.3 vs. 87.1 across increasingly deep layers).
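A toy single-head cross-attention sketch of this injection (numpy, illustrative shapes; the real model uses the DiT's multi-head cross-attention over the cached spatial-token KV):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16                         # toy model width
T, S = 4, 5                    # action tokens, spatial tokens

def softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Early-layer hidden states cached for the S spatial tokens (standing in for
# the planning features c_t), and queries from the action model's tokens.
kv = rng.standard_normal((S, d))     # stand-in for early-layer KV cache
q  = rng.standard_normal((T, d))     # action-model queries

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Action tokens attend over the spatial-token cache: (T, S) attention map
attn = softmax((q @ Wq) @ (kv @ Wk).T / np.sqrt(d))
out = attn @ (kv @ Wv)               # (T, d) features injected into the policy
print(out.shape)                     # (4, 16)
```

Because the spatial-token KV is computed once per step by the frozen VLM, the diffusion policy can reuse it across all denoising iterations.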
Loss & Training¶
- VLM backbone: Qwen2.5-VL 3B
- Training pipeline: SFT → CoT-SFT → Teacher GRPO + Student distillation (4,500 iterations)
- Verbalizer warmup: 3,000 iterations (LM loss), then switched to \(\mathcal{L}_{\text{verb}}\) for 1,500 iterations
- Policy learning: 20K iterations with frozen VLM and state encoder
- At inference, only \(\mathcal{F}_\theta + \pi_\phi\) are required; the verbalizer is used only during training and for interpretability
Key Experimental Results¶
LIBERO & SimplerEnv (Robot Manipulation)¶
| Method | LIBERO (avg) | SimplerEnv-Google | Inference Latency (ms) |
|---|---|---|---|
| OpenVLA-7B | 76.5 | 40.2 | N/A |
| ThinkAct-7B | 84.4 | 68.3 | 7513 |
| MolmoAct-7B | 86.8 | 64.9 | 6723 |
| ThinkAct-3B | 83.1 | 64.7 | 5674 |
| Fast-ThinkAct-3B | 89.7 | 68.7 | 805 (7.0× faster) |
Fast-ThinkAct-3B surpasses ThinkAct-3B by 6.6 points on LIBERO and 4.0 points on SimplerEnv, while reducing latency by 7×.
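The speedup and reduction figures quoted above follow directly from the latency column; a quick arithmetic check:

```python
# Latency figures from the table above (ms per step)
lat = {"ThinkAct-7B": 7513, "MolmoAct-7B": 6723,
       "ThinkAct-3B": 5674, "Fast-ThinkAct-3B": 805}

fast = lat["Fast-ThinkAct-3B"]
speedup_vs_7b = lat["ThinkAct-7B"] / fast   # ~9.3x, matching the TL;DR
speedup_vs_3b = lat["ThinkAct-3B"] / fast   # ~7.0x, matching the table
reduction = 1 - fast / lat["ThinkAct-7B"]   # ~89.3% latency reduction
hz = 1000 / fast                            # ~1.24 Hz control frequency
print(round(speedup_vs_7b, 1), round(speedup_vs_3b, 1),
      round(reduction, 3), round(hz, 2))    # 9.3 7.0 0.893 1.24
```

At ~1.24 Hz, Fast-ThinkAct lands inside the 1–15 Hz range cited earlier as the requirement for real-time manipulation, whereas the textual-CoT baselines do not.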
RoboTwin2.0 (Bimanual Manipulation)¶
| Method | Easy Avg | Hard Avg |
|---|---|---|
| RDT | 56.4 | 22.8 |
| ThinkAct | 62.4 | 24.7 |
| Fast-ThinkAct | 65.7 | 26.4 |
Advantages are more pronounced on long-horizon tasks (270+ steps).
Embodied Reasoning¶
| Method | EgoPlan-Bench2 | RoboVQA (B-Avg) | OpenEQA | Overall |
|---|---|---|---|---|
| ThinkAct-3B | 44.0 | 55.3 | 48.9 | 49.4 |
| Fast-ThinkAct-3B | 46.4 | 60.8 | 51.2 | 52.8 |
Surpasses commercial models including GPT-4V (36.4) and Gemini-2.5-Flash (38.9).
Ablation Study¶
- Removing \(\mathcal{L}_{\text{verb}}\): Overall drops from 52.8 to 48.5 (−4.3), indicating the necessity of preference-guided supervision
- Removing \(\mathcal{L}_{\text{distill}}\): Further drops to 47.7, confirming the importance of visual planning transfer
- Comparison with efficient textual-reasoning baselines: teacher direct reasoning 49.8; 6 text tokens 46.3; RL length penalty 47.8; Fast-ThinkAct with 6 latent tokens 53.3
- Latent token count ablation: \(M=1\) is insufficient; \(M=30/100\) introduces noise; \(M=6\) is optimal
Highlights & Insights¶
- Elegant verbalizable latent design: Latents can be decoded into text via the verbalizer, achieving both compression and interpretability while fundamentally addressing the lack of direct supervision in the latent space
- Reward-guided preference distillation: Reuses the teacher GRPO reward signal to construct DPO preference pairs without additional annotation, yielding highly efficient training signals
- Dramatic latency reduction: Parallel prediction with 6 latent tokens and 5 spatial tokens achieves an 89.3% latency reduction, transforming an unusable system (0.1 Hz) into a real-time capable one
- Strong failure recovery: Outperforms the second-best method by 10.9–16.4 points on RoboFAC, demonstrating that latent reasoning preserves the ability to understand errors and plan corrections
Limitations & Future Work¶
- The verbalizer inherits hallucination tendencies from the pretrained LLM — verbalized reasoning may produce plausible-sounding but inaccurate descriptions (without affecting action inference)
- Evaluation is conducted exclusively in simulated environments; real-robot deployment results are not demonstrated
- Only a 3B VLM backbone is used for the student; ablations on a 7B version are insufficient (evaluated only on reasoning benchmarks, not comprehensively validated on manipulation tasks)
- The number of spatial tokens is fixed at \(K=5\); adaptive token counts are not explored
- The training pipeline is complex (SFT → CoT-SFT → Teacher GRPO → Student distillation → Policy learning), leaving substantial room for end-to-end simplification
Related Work & Insights¶
| Dimension | ThinkAct | MolmoAct | CoT-VLA | ECoT-Lite | Fast-ThinkAct |
|---|---|---|---|---|---|
| Reasoning form | Textual CoT | 2D visual trace | Visual goal + text | Reasoning dropout | Latent CoT |
| Reasoning length | ~250 tokens | ~250 tokens | — | Variable | 6 latent tokens |
| Inference latency | 7.5s (7B) | 6.7s (7B) | — | Reduced but unstable | 0.8s (3B) |
| RL training | GRPO | None | None | None | Teacher GRPO → DPO distill |
| Interpretability | High (text) | High (visual) | Medium | Low | Medium (optional verbalization) |
Core distinction: Fast-ThinkAct migrates reasoning from token space to continuous latent space and replaces direct distillation with preference learning, achieving an effective balance between efficiency and quality.
Beyond the direct comparisons above, the verbalizable latent paradigm generalizes to real-time reasoning scenarios such as autonomous driving — any task requiring CoT capability under latency constraints. The reward-guided distillation pipeline (Teacher GRPO → Student DPO) circumvents the annotation challenge in latent space and is transferable to other latent reasoning works. The finding that early-layer KV cache outperforms late-layer KV implies that visual planning information is encoded in the shallower layers of VLMs, intersecting with the VLM probing literature. This work complements LLM latent reasoning approaches such as Coconut and CODI, representing the first extension of latent reasoning to the VLA domain.
Rating¶
- Novelty: 8/10 — The combination of verbalizable latents and reward preference distillation is a novel design that addresses the key challenge of supervision signals in latent reasoning
- Experimental Thoroughness: 9/10 — Six benchmarks (3 reasoning + 3 manipulation), comprehensive ablations, and detailed latency analysis
- Writing Quality: 8/10 — Clear structure, complete mathematical formulation, and intuitive illustrations
- Value: 9/10 — Reducing inference latency from seconds to sub-second while improving performance resolves a critical bottleneck for deploying reasoning VLAs in practice