AStar: Boosting Multimodal Reasoning with Automated Structured Thinking¶
Conference: AAAI 2026 arXiv: 2502.02339 Code: Not released Area: Multimodal VLM Keywords: Multimodal reasoning, thought cards, MCTS, training-free, structured thinking
TL;DR¶
This paper proposes AStar, a training-free multimodal reasoning paradigm that constructs a library of high-level "thought card" reasoning templates from 500 seed samples. At inference time, the most suitable templates are adaptively retrieved to guide structured reasoning in MLLMs. A 7B model achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%), requiring only 50 minutes of preprocessing and no model training.
Background & Motivation¶
MLLMs perform poorly on complex visual reasoning tasks. Existing enhancement methods fall into two categories: (1) search-based methods (e.g., MCTS) with prohibitive computational costs; and (2) post-training methods (e.g., SFT/GRPO) that require large-scale data (>100K) and significant compute, and suffer from training instability. RL-based methods can only shift the output distribution without introducing external knowledge, which fundamentally limits the upper bound of reasoning capability. A more efficient, training-free approach to enhance multimodal reasoning is therefore needed.
Core Problem¶
How can the complex visual reasoning capabilities of MLLMs be substantially improved without large-scale training? The core challenges are: direct MCTS search is too slow; SFT/RL requires too much data and compute; and existing methods fail to effectively generalize high-level reasoning strategies to unseen problems.
Method¶
Overall Architecture¶
A two-stage pipeline: (1) Thought Card Construction — MCTS is applied to 500 seed samples to identify optimal reasoning paths, which are selected via the VOC criterion and then distilled into abstract "thought card" templates; (2) Adaptive Reasoning & Verification — at inference time, the 5 most relevant thought cards are retrieved based on problem complexity (PC) and text-image semantics (TIS), instantiated into candidate solutions, and verified via self-consistency combined with an outcome reward model.
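The two-stage pipeline can be sketched in code. This is a hypothetical skeleton (the paper's code is not released), so all names here (`ThoughtCard`, `build_card_library`, `solve`) and the card fields are our own assumptions:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical sketch of the two-stage AStar pipeline; names and fields
# are assumptions, since no official implementation has been released.

@dataclass
class ThoughtCard:
    actions: Sequence[str]   # atomic action sequence, e.g. ("VP", "SA", "CoT")
    pc: float                # problem complexity (PC) of the seed group
    tis: List[float]         # text-image semantic (TIS) embedding of the group

def build_card_library(seed_problems: Sequence[str],
                       search_and_distill: Callable[[str], ThoughtCard]) -> List[ThoughtCard]:
    """Stage 1: run MCTS on each seed problem and distill the VOC-optimal
    reasoning path into an abstract thought card (search details omitted)."""
    return [search_and_distill(q) for q in seed_problems]

def solve(question: str,
          cards: List[ThoughtCard],
          retrieve: Callable[[str, List[ThoughtCard]], List[ThoughtCard]],
          instantiate: Callable[[str, ThoughtCard], str],
          verify: Callable[[List[str]], str]) -> str:
    """Stage 2: retrieve matching cards, instantiate candidate solutions,
    and select the best via verification (self-consistency + ORM)."""
    selected = retrieve(question, cards)
    candidates = [instantiate(question, c) for c in selected]
    return verify(candidates)
```

The retrieval, instantiation, and verification steps are passed in as callables here purely to keep the sketch self-contained.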
Key Designs¶
- Visual Reasoning Actions: Six atomic reasoning actions are defined: Visual Parsing (VP), System Analysis (SA), One-Step Thought (OST), Chain-of-Thought (CoT), Divide and Conquer (DC), and Self-Reflection (SR). These serve as the atomic operations of thought cards, and different combinations yield different reasoning strategies.
- Thought Card Construction (MCTS + VOC Distillation): MCTS is first applied to each seed problem to generate reasoning trees with multiple valid paths. The optimal path is then selected using the VOC criterion \(Score(q, p) = k \cdot R(p|q) - (1-k) \cdot C(p)\), which balances reward against cost. Problems are grouped by problem complexity (PC, estimated by a small 2B model) and CLIP semantic embeddings (TIS), with each group sharing a common thought-card template (e.g., \(a_1 \to a_2 \to a_4\)).
- Adaptive Retrieval Mechanism: At inference time, the PC and TIS of a test question are computed, and all thought cards are ranked along both dimensions: \(R_{TIS}\) (semantic-similarity rank) and \(R_{PC}\) (complexity-similarity rank). The top-5 cards by combined rank are selected and instantiated to generate five candidate solutions, from which the best is chosen via self-consistency and an outcome reward model (ORM).
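The VOC selection step above can be sketched as follows. `reward` and `cost` stand for the paper's \(R(p|q)\) and \(C(p)\) (which in practice come from MCTS statistics); the path tuples and values are illustrative:

```python
# Sketch of VOC-based path selection; reward/cost values are illustrative,
# and the exact definitions of R(p|q) and C(p) follow the paper.

def voc_score(reward: float, cost: float, k: float = 0.5) -> float:
    """Score(q, p) = k * R(p|q) - (1 - k) * C(p)."""
    return k * reward - (1 - k) * cost

def select_best_path(paths, k: float = 0.5):
    """Each path is (action_sequence, reward, cost); return the one
    maximizing the reward/cost trade-off."""
    return max(paths, key=lambda p: voc_score(p[1], p[2], k))
```

With the balance weight `k = 0.5`, a slightly less rewarding but much cheaper path can win, which is exactly the trade-off the criterion encodes.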
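The rank-based retrieval and the self-consistency vote can be sketched like this. The `Card` fields, the cosine-similarity-via-dot-product shortcut (valid for L2-normalized embeddings), and the plain rank sum are our assumptions about the paper's mechanism:

```python
from collections import Counter, namedtuple

# Illustrative sketch of adaptive card retrieval and self-consistency;
# field names and the combined-rank formula are assumptions.

Card = namedtuple("Card", ["pc", "tis", "actions"])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_cards(q_pc, q_emb, cards, top_k=5):
    """Rank all cards by complexity closeness (R_PC) and by semantic
    similarity (R_TIS, assuming L2-normalized embeddings), then keep
    the top_k cards with the smallest combined rank."""
    n = len(cards)
    r_pc = {i: r for r, i in enumerate(
        sorted(range(n), key=lambda i: abs(cards[i].pc - q_pc)))}
    r_tis = {i: r for r, i in enumerate(
        sorted(range(n), key=lambda i: -_dot(q_emb, cards[i].tis)))}
    order = sorted(range(n), key=lambda i: r_pc[i] + r_tis[i])
    return [cards[i] for i in order[:top_k]]

def self_consistency(answers):
    """Majority vote over candidate final answers (the paper additionally
    uses an outcome reward model for verification, omitted here)."""
    return Counter(answers).most_common(1)[0][0]
```

Each retrieved card is then instantiated into a candidate solution, and the vote (plus the ORM, not shown) picks the final answer.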
Loss & Training¶
Fully training-free. Thought card construction requires only 500 seed samples and 50 minutes of preprocessing on a single GPU. No additional computational overhead is introduced at inference time beyond template retrieval.
Key Experimental Results¶
| Method | Type | Data | MathVerse | MathVista | MathVision |
|---|---|---|---|---|---|
| GPT-4o | Closed-source | - | 50.2 | 60.1 | 30.4 |
| URSA-8B | SFT | 1100K | 45.7 | 59.8 | 26.2 |
| R1-VL-7B | GRPO | 260K | 40.0 | 63.5 | 27.1 |
| MM-Eureka-7B | GRPO | 15K | 50.3 | 59.4 | 26.9 |
| Mulberry-7B | Search | 260K | 44.9 | 61.3 | 26.4 |
| AStar(Qwen2.5-7B) | Free | 0.5K | 53.9 | 64.2 | 32.7 |
| AStar(Qwen2-VL-7B) | Free | 0.5K | 47.5 | 61.7 | 27.9 |
Data efficiency: AStar surpasses URSA-8B by 8.2 points on MathVerse using only 1/2200 of its data. Plug-and-play: AStar combined with RL (LMM-R1) reaches 48.3% on MathVerse, a 6.5-point gain over the RL baseline alone. Cross-domain transfer: math thought cards lift GPT-4o from 70.3% to 73.2% on MMMU and from 47.8% to 52.2% on GAOKAO.
Ablation Study¶
- Removing thought cards leads to a 9.5-point drop on MathVerse, demonstrating the core value of structured reasoning patterns.
- Replacing adaptive retrieval with random card selection causes a 2.2–6.3-point degradation, highlighting the importance of problem–pattern matching.
- Seed-data scaling: 50→100→500→1000 samples yield average accuracies of 33.5→39.4→43.3→44.1; 500 samples offers the best cost-performance trade-off.
- Replacing the full verification pipeline with self-consistency alone costs only 1.5 points, indicating that the thought cards themselves already enable high-quality solution generation.
Highlights & Insights¶
- The "Thought Cards" concept is remarkably elegant — reasoning paths found via MCTS are distilled into reusable high-level templates, realizing a "search once, reuse many times" paradigm.
- Extraordinary data efficiency — a thought card library can be constructed from 500 samples in 50 minutes, outperforming methods trained on millions of samples.
- Strong cross-domain transferability — thought cards built from math-domain problems improve performance on science reasoning, visual perception, chart understanding, and other entirely distinct tasks.
- Plug-and-play compatibility — AStar can be stacked on top of SFT/GRPO-trained models for further gains, suggesting it captures complementary reasoning patterns.
- Weak-to-strong generalization — thought cards constructed from Qwen2-VL-7B can even improve GPT-4o's reasoning performance.
Limitations & Future Work¶
- Thought card construction still relies on MCTS to search reasoning paths over seed samples; the quality of the initial search affects card quality.
- The six reasoning actions are predefined and may not cover all reasoning scenarios.
- The retrieval mechanism relies on two simple metrics (PC and TIS), which may lack sufficient granularity.
- The verification stage depends on a text-domain ORM; visual-domain verification models are absent.
- Validation on broader tasks such as video understanding and document understanding has not been performed.
Related Work & Insights¶
- vs. Mulberry/AR-MCTS (search-based methods): AStar requires only 500 seed samples, whereas Mulberry requires 260K samples plus GPT-4o distillation — a roughly 500× gap in data efficiency.
- vs. URSA/R1-VL (post-training methods): AStar is fully training-free, requires no GPU training, and achieves higher accuracy.
- vs. Buffer-of-Thoughts/ReasonFlux: These works also explore thought templates but are restricted to the text domain; AStar is the first to extend this paradigm to multimodal settings.
- Inspiration for future directions: Can thought cards be applied to other VLM tasks, such as visual grounding or image captioning, by constructing task-specific thought card libraries?
- Thought cards are essentially "meta-learning of reasoning strategies" — learning from a small number of samples which reasoning strategy to apply to which type of problem.
- A potential connection to Distillation Dynamics: in ViT distillation, "distillation strategy cards" could perhaps be constructed to select optimal distillation layers based on the information bottleneck location between teacher and student.
- The strong cross-domain transferability suggests that reasoning patterns (e.g., "decompose → reason → verify") are domain-invariant, echoing the "compression → expansion" pattern described by Information Bottleneck theory.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The combination of thought cards, MCTS distillation, and adaptive retrieval constitutes a genuinely novel paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across 8 benchmarks and 4 dimensions (performance, efficiency, flexibility, transferability) with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, strong motivation, and theoretical grounding (VOC/metareasoning) for every design choice.
- Value: ⭐⭐⭐⭐⭐ Training-free, minimal data requirements, and superior performance to GPT-4o yield both strong practical utility and significant academic contribution.