AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=nUGPEmQ2ut
Code: https://github.com/ssmisya/AdaReasoner
Area: Multimodal VLM / Visual Reasoning / Tool Augmentation
Keywords: Multimodal Reasoning, Tool Orchestration, Multi-turn GRPO, Visual Tools, Adaptive Tool Use

TL;DR

AdaReasoner teaches Multimodal Large Language Models (MLLMs) to dynamically orchestrate a set of visual tools during multi-turn visual reasoning. Through a two-stage training process of "Tool Cold Start + Multi-turn Tool GRPO," it enables a small 7B model to autonomously select tools, discard them, and adjust how often it calls them. It achieves an average performance gain of +38.7% and reaches a near-perfect 97.6% on VSP, surpassing GPT-5 and Claude Sonnet 4.

Background & Motivation

Background: Equipping MLLMs with external tools is a popular direction for enhancing visual reasoning. Early SFT/prompt methods (CogCoM, TACO, LLaVA-Plus) used predefined tools with scripted calls. Recent RL methods (DeepEyes, Pixel-Reasoner) utilize crop/zoom-in searches to enhance perception.

Limitations of Prior Work: These works are mostly restricted to single, atomic tools and single-step trajectories. They neither address multi-turn planning nor select effective tool combinations for complex tasks. More critically, pure R1-style rule-based rewards only optimize the "reasoning process" without directly improving the model's underlying perception capability. Perceptual errors accumulate into hallucinations, a phenomenon termed "guided guessing": the model roughly locates relevant regions but misses key details, causing linguistic capabilities to lose visual anchors and regress into semantic-prior-based guessing.

Key Challenge: Visual reasoning fundamentally lacks iterative exploration and refinement of visual evidence. The decision-making process—"which tool to use, when to use it, and how to combine them"—is itself a multimodal reasoning capability that prior methods have failed to treat as a learnable objective.

Goal: Transform tools from "static appendages" into "supports for active manipulation and refinement of visual representations," enabling models to learn multi-turn planning and dynamic combination over a diverse set of candidate tools. This requires solving three sub-problems: (1) Generating high-quality multi-turn tool trajectory data; (2) Designing RL to optimize multi-turn tool invocation; (3) Ensuring the toolset accommodates both lightweight offline tools and compute-intensive expert models.

Key Insight: The authors leverage Extended Mind Theory (Clark & Chalmers 1998), where external tools are viewed as organic components of cognition. By following an iterative "Observe → Manipulate → Verify → Reflect" workflow, the model can delegate difficult sub-tasks to high-precision tools while focusing on judgment and synthesis.

Core Idea: A two-stage approach—"Tool Cold Start (learning how and when to use tools) + Multi-turn Tool GRPO (optimizing multi-turn tool trajectories)"—to train a reasoning agent capable of adaptive tool orchestration, allowing it to autonomously devise optimal strategies from a broad toolset.

Method

Overall Architecture

AdaReasoner formalizes "tool-augmented multimodal reasoning" as a sequential decision process. A policy $\pi_\theta$ (the MLLM) generates a reasoning trajectory $\tau = \{(s_0, a_0, o_0), \dots, (s_T, a_T, o_T)\}$, where $s_t$ is the state at turn $t$, $a_t \in \mathcal{T}$ is a tool-calling action, and $o_t$ is the tool observation. The toolset $\mathcal{T} = \{t_1, \dots, t_n\}$ covers three core functions: Perception (POINT, OCR), Manipulation (e.g., DRAW2DPATH, INSERTIMAGE), and Computation (ASTAR).
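The sequential decision process above can be sketched as a simple rollout loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Step` record, the `rollout` function, the `"ANSWER"` terminating action, and the toy policy and tool are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only; not the paper's interfaces.
Action = Tuple[str, dict]        # (tool name, arguments); "ANSWER" ends the episode
Tool = Callable[[dict], str]     # a tool maps its arguments to a textual observation
Policy = Callable[[List[str]], Action]


@dataclass
class Step:
    state: List[str]             # s_t: context visible to the policy at turn t
    action: Action               # a_t: the action chosen by the policy
    observation: Optional[str]   # o_t: tool output (None for the final answer)


def rollout(policy: Policy, tools: Dict[str, Tool], question: str,
            max_turns: int = 8) -> List[Step]:
    """Collect one trajectory of (state, action, observation) triples."""
    context = [question]
    trajectory: List[Step] = []
    for _ in range(max_turns):
        action = policy(context)
        name, args = action
        if name == "ANSWER":
            trajectory.append(Step(list(context), action, None))
            break
        # Unknown tools yield an error observation instead of crashing the episode.
        obs = tools[name](args) if name in tools else f"error: unknown tool {name}"
        trajectory.append(Step(list(context), action, obs))
        context = context + [f"{name}({args}) -> {obs}"]
    return trajectory


def toy_policy(context: List[str]) -> Action:
    """Stand-in for the MLLM: call POINT once, then answer."""
    if len(context) == 1:
        return ("POINT", {"target": "start"})
    return ("ANSWER", {"text": "done"})


toy_tools: Dict[str, Tool] = {"POINT": lambda args: "(3, 4)"}
traj = rollout(toy_policy, toy_tools, "Where is the start?")
```

Each observation is appended to the context, so the next state depends on earlier tool outputs; that dependency is what makes the process multi-turn rather than a single scripted call.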

The pipeline comprises three components: a persistent Tool Server hosting various tools; a Tool Cold Start phase using synthetic high-fidelity multi-turn trajectories for SFT; and a Multi-turn Tool GRPO phase using reinforcement learning tailored for tool trajectories.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input: Image + Question"] --> B["Multifunctional Visual Toolset<br/>Perception/Manipulation/Computation<br/>Lightweight Offline + Heavy Expert Services"]
    B --> C["High-Fidelity Trajectory Construction<br/>Blueprint → Tool Filling → CoT<br/>Inject Reflection & Tool Failures"]
    C -->|SFT Cold Start| D["Multi-turn Tool GRPO<br/>Multi-turn Reward Accumulation + Adaptive Tool Reward"]
    D -->|RL Optimization| E["Adaptive Tool Orchestration<br/>Select / Discard / Frequency Adjustment"]
    E --> F["Multi-turn Reasoning Trajectory → Answer"]

Key Designs

1. Multifunctional Visual Toolset: Integrating Lightweight and Expert Service Tools

Addressing the "single atomic tool" limitation, AdaReasoner designs a toolset matrix spanning three functional categories and two compute types (lightweight offline tools and compute-intensive expert models). Functions include Perception (POINT, OCR), Manipulation (DRAW2DPATH, INSERTIMAGE, CROP, DETECTBLACKAREA), and Computation (ASTAR). The tools are unified via a central Tool Server, so compute-intensive expert models are accessed through the same interface as lightweight tools, making it feasible for the model to "use expert models as tools." Table 3 of the paper shows that while Qwen2.5-VL accuracy for origin positioning is low (2.47%-50.0%), it reaches 100.0% when using the expert POINT tool.
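One way to picture the unified interface is a registry that dispatches every tool through a single `call` method. The sketch below is an in-process stand-in under stated assumptions: the `ToolServer` class, its methods, the response dictionary shape, and the `point_stub` placeholder are hypothetical and do not reflect the actual server in the AdaReasoner repository.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolServer:
    """In-process stand-in for a tool server: one call interface for all tools."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Dict[str, Any]:
        # Failures come back as data so the caller can react to them.
        if name not in self._tools:
            return {"ok": False, "error": f"unknown tool: {name}"}
        try:
            return {"ok": True, "result": self._tools[name](**kwargs)}
        except Exception as exc:
            return {"ok": False, "error": str(exc)}


def crop(image: List[List[int]], top: int, left: int,
         height: int, width: int) -> List[List[int]]:
    """Lightweight offline tool: crop a nested-list image."""
    return [row[left:left + width] for row in image[top:top + height]]


def point_stub(image: List[List[int]], query: str) -> Tuple[int, int]:
    """Placeholder for an expert model; it simply returns the image center."""
    return (len(image) // 2, len(image[0]) // 2)


server = ToolServer()
server.register("CROP", crop)
server.register("POINT", point_stub)

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
cropped = server.call("CROP", image=image, top=0, left=1, height=2, width=2)
located = server.call("POINT", image=image, query="start")
missing = server.call("OCR", image=image)  # not registered in this sketch
```

Because a cheap function and a heavy model sit behind the same `call` signature, the policy only needs to emit a tool name and arguments; where the computation runs is hidden from it.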

2. High-Fidelity Multi-turn Trajectory Construction: Blueprints and Reflection

To enable planning, the authors generate multi-turn trajectories via a three-stage pipeline: (1) Abstract Trajectory Design (creating blueprints for tasks like VSP or Jigsaw); (2) Tool Call Filling (executing actual tools to populate inputs/outputs); (3) CoT Generation (using a strong LLM to fill reasoning chains). Crucially, they inject Reflection & Backtracking and Explicit Tool Failures to teach the model how to handle errors and when to rely on its internal capabilities.
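The three stages can be sketched as a small function that walks a blueprint, fills in real tool outputs, and then attaches a thought to each step. This is a rough illustration, assuming a blueprint is a list of `(tool, args)` pairs; `build_trajectory`, `fail_at`, the `"TOOL_ERROR"` marker, and the template-based `write_thought` (standing in for the strong LLM) are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Blueprint = List[Tuple[str, Dict[str, Any]]]   # stage 1: abstract plan of tool calls


def build_trajectory(blueprint: Blueprint,
                     tools: Dict[str, Callable[..., Any]],
                     write_thought: Callable[[Dict[str, Any]], str],
                     fail_at: Optional[int] = None) -> List[Dict[str, Any]]:
    """Fill a blueprint with tool outputs, then attach a thought to each step."""
    steps: List[Dict[str, Any]] = []
    # Stage 2: execute the tools to populate inputs and outputs.
    for i, (name, args) in enumerate(blueprint):
        # Optionally inject an explicit tool failure at one step.
        output = "TOOL_ERROR" if i == fail_at else tools[name](**args)
        steps.append({"tool": name, "args": args, "output": output})
    # Stage 3: generate the reasoning text for every step.
    for step in steps:
        step["thought"] = write_thought(step)
    return steps


def template_thought(step: Dict[str, Any]) -> str:
    """Stand-in for the LLM that writes chain-of-thought text."""
    if step["output"] == "TOOL_ERROR":
        return f"{step['tool']} failed; fall back to reasoning without it."
    return f"Called {step['tool']} and observed {step['output']}."


demo_tools: Dict[str, Callable[..., Any]] = {
    "POINT": lambda target: (0, 0) if target == "start" else (2, 2),
    "ASTAR": lambda start, goal: ["R", "R", "D", "D"],
}
demo_blueprint: Blueprint = [
    ("POINT", {"target": "start"}),
    ("ASTAR", {"start": (0, 0), "goal": (2, 2)}),
]

clean = build_trajectory(demo_blueprint, demo_tools, template_thought)
with_failure = build_trajectory(demo_blueprint, demo_tools, template_thought, fail_at=1)
```

Running the same blueprint with and without `fail_at` yields paired examples, one where the tool output is trusted and one where the trajectory has to recover from a failed call.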

3. Multi-turn Tool GRPO: Optimizing Long Trajectories

Standard GRPO is adapted for multi-turn tool trajectories using a multiplicative-additive reward structure:

$$R_{\text{total}} = R_{\text{format}} \cdot (\lambda_{\text{tool}} \cdot R_{\text{tool}} + \lambda_{\text{acc}} \cdot R_{\text{acc}})$$

where $R_{\text{format}}$ is a product over all turns (so it drops to zero if any turn violates the format), $R_{\text{tool}}$ is the average hierarchical score of all tool calls, and $R_{\text{acc}}$ is the final answer accuracy.
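A small numerical sketch may make the gating effect of $R_{\text{format}}$ concrete. It assumes that each turn's format score is binary and that per-call tool scores lie in $[0, 1]$; the function name `total_reward` and the default weights are placeholders, not values from the paper.

```python
from typing import Sequence


def total_reward(format_ok: Sequence[bool],
                 tool_scores: Sequence[float],
                 accuracy: float,
                 lambda_tool: float = 1.0,
                 lambda_acc: float = 1.0) -> float:
    """R_total = R_format * (lambda_tool * R_tool + lambda_acc * R_acc)."""
    # Product over turns of binary format scores: any failure zeroes the reward.
    r_format = 1.0 if all(format_ok) else 0.0
    # Mean score over all tool calls (0 if no tool was called).
    r_tool = sum(tool_scores) / len(tool_scores) if tool_scores else 0.0
    return r_format * (lambda_tool * r_tool + lambda_acc * accuracy)


well_formed = total_reward([True, True], [1.0, 0.5], accuracy=1.0)
one_bad_turn = total_reward([True, False], [1.0, 0.5], accuracy=1.0)
```

With identical tool scores and a correct answer, a single malformed turn still removes the whole reward, which is the practical consequence of multiplying by the format term instead of adding it.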

4. Adaptive Tool Reward: Asymmetric Incentives

To prevent over-reliance on tools while encouraging necessary help-seeking, an asymmetric reward is used. Correct trajectories receive a full reward (8 points) regardless of tool use. Incorrect trajectories are penalized severely if they guessed without tools (0 points), but receive partial credit (up to 4 points) if they correctly utilized tools. This teaches the model: "Answer directly if certain; use a structured tool-assisted process if uncertain."
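The asymmetric scheme can be written as a three-branch function. The point values (8, 0, up to 4) come from the description above; the linear scaling of partial credit by a `tool_quality` score in $[0, 1]$ is an assumption made for illustration, as are the function and parameter names.

```python
def adaptive_reward(correct: bool, used_tools: bool,
                    tool_quality: float = 0.0) -> float:
    """Asymmetric reward: full credit if correct, partial credit for tool-backed misses."""
    if correct:
        return 8.0                      # full reward, with or without tools
    if not used_tools:
        return 0.0                      # wrong answer guessed without tools
    # Wrong answer, but tools were used: up to 4 points.
    # Linear scaling by tool_quality is an assumption of this sketch.
    clipped = min(max(tool_quality, 0.0), 1.0)
    return 4.0 * clipped


cases = {
    "correct, no tools": adaptive_reward(True, False),
    "correct, with tools": adaptive_reward(True, True, 0.9),
    "wrong, no tools": adaptive_reward(False, False),
    "wrong, good tool use": adaptive_reward(False, True, 1.0),
    "wrong, sloppy tool use": adaptive_reward(False, True, 0.5),
}
```

The ordering of these cases encodes the intended preference: a correct answer always beats a wrong one, and among wrong answers a well-executed tool-assisted attempt beats an unsupported guess.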

Key Experimental Results

Main Results

The evaluation spans six benchmarks (VSPO, VSP, Jigsaw, etc.); the table below reports VSP, Jigsaw, and BLINK-J. TC = Tool Cold Start, TG = Tool GRPO.

Model	VSP Overall	Jigsaw	BLINK-J	Description
Qwen2.5-VL-7B (Base)	31.64	45.70	52.67	Starting Point
+ Direct SFT	46.64	86.40	88.00	Strong Baseline
+ Direct GRPO	30.18	64.90	80.00	Strong Baseline
+ Ours TC + TG	97.64	96.60	96.00	Full AdaReasoner
GPT-5	55.64	80.10	73.33	Closed-source
Claude Sonnet 4	56.27	58.60	65.33	Closed-source

Ablation Study

Configuration	VSP Overall	Key Finding
TG Only (7B)	35.09	RL alone has limited effect
TC Only (7B)	64.91	Cold start is the foundation
TC + TG (Full)	97.64	Both stages are essential
$\lambda_{\text{tool}}:\lambda_{\text{acc}}=0:1$	71.45	No tool reward
$\lambda_{\text{tool}}:\lambda_{\text{acc}}=2:1$	93.27	Higher tool reward weight is better

Key Findings

Adaptive tool use is an emergent behavior: During RL, the model learns to adopt beneficial tools (e.g., ASTAR in navigation) and discard irrelevant ones.
Zero-shot tool introduction: The model can successfully generalize to unseen tools (e.g., ASTAR) during inference, though RL is needed to suppress interference.
Three roles of tools: Perception tools help the model "see," manipulation tools help "verify," and computation/trajectories help "plan."

Highlights & Insights

Tool selection as learnable multimodal reasoning: Treats "which tool to use" as a learnable skill that emerges through RL.
Asymmetric adaptive reward: Effectively balances tool reliance and direct answering by encoding preferences into the reward structure.
Tool Server abstraction: Decouples heavy expert models from the execution environment, providing a scalable interface for agentic workflows.
Small Model + Good Tools = SOTA: Demonstrates that performance bottlenecks are shifting from model scale to tool quality.

Limitations & Future Work

Reflection data trade-off: Training on reflection data can lead to "rigidity," making the model less flexible in adopting new tools at inference time.
Toolset/Task coupling: Human-designed blueprints and specific tools make generalization to open-domain visual reasoning challenging.
Expert tool dependency: The framework relies on the high precision of expert tools; gains diminish if the tools themselves are inaccurate.

vs DeepEyes / Pixel-Reasoner: AdaReasoner moves beyond single-step/atomic tools to multi-turn planning and dynamic combinations.
vs CogCoM / TACO: Evolves from scripted SFT to adaptive RL-based orchestration.
vs Rule-based GRPO: Addresses the "perceptual bottleneck" by using external tools as visual anchors, preventing hallucinations.

Rating

Novelty: ⭐⭐⭐⭐⭐ Systematizes dynamic tool orchestration as a learnable capability.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive benchmarks, scale comparisons, and behavior analysis.
Writing Quality: ⭐⭐⭐⭐ Clear storyline and well-defined reward structures.
Value: ⭐⭐⭐⭐⭐ Provides a benchmark strategy for tool-augmented agent training.