AStar: Boosting Multimodal Reasoning with Automated Structured Thinking

Conference: AAAI 2026 · arXiv: 2502.02339 · Code: Not released · Area: Multimodal VLM · Keywords: Multimodal reasoning, thought cards, MCTS, training-free, structured thinking

TL;DR

This paper proposes AStar, a training-free multimodal reasoning paradigm that constructs a library of high-level "thought card" reasoning templates from 500 seed samples. At inference time, the most suitable templates are adaptively retrieved to guide structured reasoning in MLLMs. A 7B model achieves 53.9% accuracy on MathVerse (surpassing GPT-4o's 50.2%), requiring only 50 minutes of preprocessing and no model training.

Background & Motivation

MLLMs perform poorly on complex visual reasoning tasks. Existing enhancement methods fall into two categories: (1) search-based methods (e.g., MCTS) with prohibitive computational costs; and (2) post-training methods (e.g., SFT/GRPO) that require large-scale data (>100K) and significant compute, and suffer from training instability. RL-based methods can only shift the output distribution without introducing external knowledge, which fundamentally limits the upper bound of reasoning capability. A more efficient, training-free approach to enhance multimodal reasoning is therefore needed.

Core Problem

How can the complex visual reasoning capabilities of MLLMs be substantially improved without large-scale training? The core challenges are: direct MCTS search is too slow; SFT/RL requires too much data and compute; and existing methods fail to effectively generalize high-level reasoning strategies to unseen problems.

Method

Overall Architecture

A two-stage pipeline: (1) Thought Card Construction — MCTS is applied to 500 seed samples to identify optimal reasoning paths, which are selected via the VOC criterion and then distilled into abstract "thought card" templates; (2) Adaptive Reasoning & Verification — at inference time, the 5 most relevant thought cards are retrieved based on problem complexity (PC) and text-image semantics (TIS), instantiated into candidate solutions, and verified via self-consistency combined with an outcome reward model.
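The stage-1 selection step can be sketched as a minimal toy version, assuming per-path reward and cost values already produced by MCTS. The helper names, the data layout, and the trade-off weight `k = 0.7` are hypothetical placeholders, not the authors' code:

```python
# Minimal sketch of stage 1: pick the best MCTS path per seed problem
# via the VOC criterion Score(q, p) = k*R(p|q) - (1-k)*C(p).
# All names and the toy data layout are illustrative assumptions.

def voc(reward: float, cost: float, k: float = 0.7) -> float:
    """Value-of-computation score balancing path reward against path cost."""
    return k * reward - (1 - k) * cost

def build_cards(seed_paths: dict) -> dict:
    """seed_paths maps problem_id -> [(action_sequence, reward, cost), ...];
    returns problem_id -> the highest-VOC action sequence (the 'thought card')."""
    return {
        pid: max(paths, key=lambda p: voc(p[1], p[2]))[0]
        for pid, paths in seed_paths.items()
    }
```

In the paper, the surviving action sequences (e.g., VP → SA → CoT) are further grouped by problem complexity and semantics so that one card covers a whole cluster of similar problems.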

Key Designs

  1. Visual Reasoning Actions: Six atomic reasoning actions are defined: Visual Parsing (VP), System Analysis (SA), One-Step Thought (OST), Chain-of-Thought (CoT), Divide and Conquer (DC), and Self-Reflection (SR). These serve as the atomic operations of thought cards, and different combinations yield different reasoning strategies.

  2. Thought Card Construction (MCTS + VOC Distillation): MCTS is first applied to each seed problem to generate reasoning trees with multiple valid paths. The optimal path is then selected using the VOC criterion \(Score(q, p) = k \cdot R(p|q) - (1-k) \cdot C(p)\), which balances reward and cost. Problems are grouped by PC (assessed by a 2B small model) and CLIP semantic embeddings (TIS), with each group sharing a common thought card template (e.g., \(a_1 \to a_2 \to a_4\)).

  3. Adaptive Retrieval Mechanism: At inference time, the PC and TIS of a test question are computed, and all thought cards are ranked along both dimensions: \(R_{TIS}\) (semantic similarity rank) and \(R_{PC}\) (complexity similarity rank). The top-5 cards by combined rank are selected. These templates are instantiated to generate 5 candidate solutions, from which the best is selected via self-consistency and an ORM.
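The retrieval-and-verify step above can be sketched as follows. This is a toy version that assumes each card carries a precomputed TIS similarity and PC distance to the query; the outcome reward model is replaced by a plain majority vote for brevity, and all names are hypothetical:

```python
from collections import Counter

def retrieve_cards(cards: list, top_k: int = 5) -> list:
    """cards: dicts with 'tis_sim' (semantic similarity to the query, higher
    is better) and 'pc_dist' (|card PC - query PC|, lower is better).
    Cards are ranked by the combined rank R_TIS + R_PC, as in the paper."""
    idx = range(len(cards))
    r_tis = {i: r for r, i in enumerate(sorted(idx, key=lambda i: -cards[i]["tis_sim"]))}
    r_pc = {i: r for r, i in enumerate(sorted(idx, key=lambda i: cards[i]["pc_dist"]))}
    order = sorted(idx, key=lambda i: r_tis[i] + r_pc[i])
    return [cards[i] for i in order[:top_k]]

def self_consistency(answers: list):
    """Majority vote over candidate solutions (ORM re-ranking omitted here)."""
    return Counter(answers).most_common(1)[0][0]
```

Each retrieved card is instantiated into one candidate solution, so `self_consistency` receives as many answers as cards were retrieved.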

Loss & Training

Fully training-free. Thought card construction requires only 500 seed samples and 50 minutes of preprocessing on a single GPU. No additional computational overhead is introduced at inference time beyond template retrieval.

Key Experimental Results

| Method | Type | Data | MathVerse | MathVista | MathVision |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | Closed-source | - | 50.2 | 60.1 | 30.4 |
| URSA-8B | SFT | 1100K | 45.7 | 59.8 | 26.2 |
| R1-VL-7B | GRPO | 260K | 40.0 | 63.5 | 27.1 |
| MM-Eureka-7B | GRPO | 15K | 50.3 | 59.4 | 26.9 |
| Mulberry-7B | Search | 260K | 44.9 | 61.3 | 26.4 |
| AStar (Qwen2.5-7B) | Training-free | 0.5K | 53.9 | 64.2 | 32.7 |
| AStar (Qwen2-VL-7B) | Training-free | 0.5K | 47.5 | 61.7 | 27.9 |

Data efficiency: AStar surpasses URSA by 8.2% using only 1/2200 of its data. Plug-and-play: AStar+RL (LMM-R1) achieves 48.3% on MathVerse, a +6.5% gain over the RL baseline alone. Cross-domain transfer: math thought cards improve GPT-4o on MMMU to 73.2% (vs. 70.3%) and GAOKAO to 52.2% (vs. 47.8%).

Ablation Study

  • Removing thought cards causes a 9.5% accuracy drop on MathVerse, demonstrating the core value of structured reasoning patterns.
  • Replacing adaptive retrieval with random card selection degrades accuracy by 2.2–6.3%, highlighting the importance of problem–pattern matching.
  • Seed data scaling: 50→100→500→1000 samples yield average accuracies of 33.5→39.4→43.3→44.1; 500 samples offers the optimal cost-performance trade-off.
  • Replacing the full verification pipeline with self-consistency alone results in only a −1.5% drop, indicating that thought cards themselves enable high-quality solution generation.

Highlights & Insights

  • The "Thought Cards" concept is remarkably elegant — reasoning paths found via MCTS are distilled into reusable high-level templates, realizing a "search once, reuse many times" paradigm.
  • Extraordinary data efficiency — a thought card library can be constructed from 500 samples in 50 minutes, outperforming methods trained on millions of samples.
  • Strong cross-domain transferability — thought cards built from math-domain problems improve performance on science reasoning, visual perception, chart understanding, and other entirely distinct tasks.
  • Plug-and-play compatibility — AStar can be stacked on top of SFT/GRPO-trained models for further gains, suggesting it captures complementary reasoning patterns.
  • Weak-to-strong generalization — thought cards constructed from Qwen2-VL-7B can even improve GPT-4o's reasoning performance.

Limitations & Future Work

  • Thought card construction still relies on MCTS to search reasoning paths over seed samples; the quality of the initial search affects card quality.
  • The six reasoning actions are predefined and may not cover all reasoning scenarios.
  • The retrieval mechanism relies on two simple metrics (PC and TIS), which may lack sufficient granularity.
  • The verification stage depends on a text-domain ORM; visual-domain verification models are absent.
  • Validation on broader tasks such as video understanding and document understanding has not been performed.
  • vs. Mulberry/AR-MCTS (search-based methods): AStar requires only 500 seed samples, whereas Mulberry requires 260K samples plus GPT-4o distillation — a 500× efficiency gap.
  • vs. URSA/R1-VL (post-training methods): AStar is fully training-free, requires no GPU training, and achieves higher accuracy.
  • vs. Buffer-of-Thoughts/ReasonFlux: These works also explore thought templates but are restricted to the text domain; AStar is the first to extend this paradigm to multimodal settings.
  • Inspiration for future directions: Can thought cards be applied to other VLM tasks, such as visual grounding or image captioning, by constructing task-specific thought card libraries?
  • Thought cards are essentially "meta-learning of reasoning strategies" — learning from a small number of samples which reasoning strategy to apply to which type of problem.
  • A potential connection to Distillation Dynamics: in ViT distillation, "distillation strategy cards" could perhaps be constructed to select optimal distillation layers based on the information bottleneck location between teacher and student.
  • The strong cross-domain transferability suggests that reasoning patterns (e.g., "decompose → reason → verify") are domain-invariant, echoing the "compression → expansion" pattern described by Information Bottleneck theory.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of thought cards, MCTS distillation, and adaptive retrieval constitutes a genuinely novel paradigm.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across 8 benchmarks and 4 dimensions (performance, efficiency, flexibility, transferability) with detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, strong motivation, and theoretical grounding (VOC/metareasoning) for every design choice.
  • Value: ⭐⭐⭐⭐⭐ Training-free, minimal data requirements, and superior performance to GPT-4o yield both strong practical utility and significant academic contribution.