AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent¶
Conference: ICLR 2026 | arXiv: 2512.20745 | Code: None | Area: LLM Reasoning | Keywords: Mathematical Reasoning, Tool Augmentation, Reinforcement Learning, Code Interpreter, Agent Framework
TL;DR¶
AgentMath proposes a tool-augmented agent framework that seamlessly integrates LLM reasoning with the computational precision of a code interpreter through automated data synthesis, multi-turn interactive reinforcement learning, and an efficient asynchronous training system. At the 30B-A3B scale, it achieves state-of-the-art performance on AIME24/25 and HMMT25 (90.6%/86.4%/73.8%), surpassing o3-mini and Claude-Opus-4.0-Thinking.
Background & Motivation¶
Large reasoning models (LRMs) such as o3 and DeepSeek-R1 have made remarkable progress in long chain-of-thought reasoning, yet they still suffer from low computational efficiency and insufficient accuracy when handling problems that require precise mathematical operations — the inherent limitations of pure-text reasoning lead to frequent arithmetic errors and redundant corrections. Existing tool-augmented approaches face three major challenges: (1) high-quality tool-use data is extremely scarce, and manual annotation is costly and non-scalable; (2) the potential of agentic reinforcement learning for optimizing tool-use policies remains largely unexplored; (3) competition-level mathematics problems involve ultra-long reasoning chains (96k tokens, 96 tool calls), which conventional synchronous batch RL training cannot handle. The core idea of this paper is to build an end-to-end agent framework that addresses data scarcity through automated data synthesis, learns optimal tool-use policies through Agentic RL, and resolves training efficiency bottlenecks through an asynchronous training architecture.
Method¶
Overall Architecture¶
AgentMath models tool-augmented mathematical reasoning as a Markov Decision Process (MDP), in which the LLM policy generates alternating reasoning segments and executable code blocks that interact with a sandboxed environment. The system adopts a structured token protocol: <think> marks natural-language reasoning, <code> marks executable code, and <interpreter> encapsulates execution feedback. The overall pipeline consists of two stages: (1) SFT on synthesized tool-augmented trajectories to establish initial tool-use capability; and (2) large-scale RL to drive exploration toward optimal tool-use policies.
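A minimal sketch of this generate–pause–execute–resume protocol, assuming hypothetical `llm_generate` and `sandbox_run` helpers (the authors release no code):

```python
def rollout(llm_generate, sandbox_run, prompt: str, max_tool_calls: int = 8) -> str:
    """Interleave <think>/<code> reasoning with <interpreter> feedback until a final answer."""
    trajectory = prompt
    for _ in range(max_tool_calls):
        # Generation pauses whenever the model closes a code block.
        segment = llm_generate(trajectory, stop=["</code>"])
        trajectory += segment
        if "<code>" not in segment:
            break  # no tool call requested: the model has produced its final answer
        code = segment.split("<code>", 1)[1]  # code body up to the stop string
        feedback = sandbox_run(code)          # execute in the sandboxed interpreter
        trajectory += f"</code><interpreter>{feedback}</interpreter>"
    return trajectory
```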
Key Designs¶
- Tool-Driven Data Synthesis: A three-stage automated synthesis pipeline. Stage 1: Long chain-of-thought data in pure-text form is aggregated from public sources such as AM-Thinking and Open-Thoughts; N-gram filtering removes overlap with evaluation sets, yielding 346k high-quality samples. DeepSeek-V3 is then used as a teacher model to replace computationally intensive steps with executable code blocks, while retaining simple calculations in text form to prevent excessive tool dependency. Stage 2: Multi-dimensional quality refinement: format consistency correction, code executability verification (sandbox execution), environment feedback alignment (Qwen3-32B judges consistency and replaces simulated outputs with actual execution results), and tool-use rationality assessment (unnecessary code is excluded via AST depth and line-count constraints; a filter sketch is given after this list). Stage 3: Self-correction capability injection: failure trajectories are sampled and the teacher model generates correction trajectories following a "diagnose error → fix code → re-execute → continue reasoning" pattern. The final output is a tool-augmented training set of 316k samples, averaging 8.3 tool calls and 16.9k tokens per sample.
- Agentic RL: Built on the GRPO optimization algorithm with three system-level innovations. (a) Agent trajectories with interleaved code execution: During rollout, hybrid trajectories are constructed via a "generate–pause–execute–resume" loop, with a maximum of \(T\) tool calls. (b) Selective loss masking: Advantage signals are applied only to tokens in <think> and <code> segments; tokens in <interpreter> segments (environment feedback) are masked during optimization, ensuring gradient updates derive solely from the model's own decisions. (c) Adaptive batch construction: Problems for which all rollouts are either all correct or all incorrect (providing limited learning signal) are filtered out, and back-filling maintains a consistent batch size (see the batch-construction sketch after this list).
- Composite Reward Design: The reward function integrates answer correctness and tool-use efficiency: \(R_{total} = R_{acc} + \mathbb{I}(R_{acc}=1) \cdot R_{tool}\). Here \(R_{acc}\) is a binary signal based on mathematical equivalence, and \(R_{tool} = \min(R_{max}, \alpha + \beta \cdot N_{code})\) incentivizes efficient tool utilization when the answer is correct (a minimal implementation sketch follows this list).
- Scalable Agent RL Infrastructure: Four technical innovations address the training bottlenecks imposed by ultra-long sequences and high-frequency tool interactions. (a) Distributed code execution sandbox cluster: CPU-intensive code execution is offloaded from the training loop, reducing tool-call latency from 175s to 1.2s. (b) Request-level asynchronous rollout scheduling: Each trajectory is treated as an independent long-running request; the inference engine and the agent communicate asynchronously, so the engine immediately processes other ready requests while a request is paused waiting for a tool call, eliminating head-of-line blocking. (c) Agent partial rollout: Long trajectories are decomposed into budget-constrained segments (\(\tau = \tau^{(1)} \oplus \tau^{(2)} \oplus \ldots\)), each bounded by a maximum generation length \(L_{seg}\) and a maximum tool-call count \(T_{seg}\), preventing any single trajectory from monopolizing resources and achieving a 2.2–2.5× speedup. (d) Prefix-aware weighted load balancing: Dynamic weights \(w_j = \lfloor L_j / L_{base} \rfloor + w_{base}\) are assigned based on prefix length, combined with LRU sticky sessions to maximize KV-cache reuse (see the weighting sketch after this list). The overall system achieves a 4–5× training speedup.
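To make the tool-use rationality assessment concrete, here is a minimal sketch of an AST-depth and line-count filter; the thresholds and function names are assumptions, since the paper only states that such constraints are used:

```python
import ast

def ast_depth(node: ast.AST, depth: int = 1) -> int:
    """Maximum nesting depth of a syntax tree."""
    return max([depth] + [ast_depth(child, depth + 1) for child in ast.iter_child_nodes(node)])

def is_worthwhile_code(code: str, min_depth: int = 3, min_lines: int = 2) -> bool:
    """Drop code blocks that merely wrap trivial arithmetic (thresholds are hypothetical)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    n_lines = sum(1 for line in code.splitlines() if line.strip())
    return ast_depth(tree) >= min_depth and n_lines >= min_lines
```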
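The adaptive batch construction can be pictured as a filter-and-backfill routine; the data layout (a list of rollout groups, each a list of dicts with a `correct` flag) is an illustrative assumption:

```python
def build_batch(rollout_groups, backlog, batch_size):
    """Keep only prompts whose rollout group is mixed (neither all correct nor all incorrect),
    then back-fill from a backlog of already-filtered groups to restore the batch size."""
    kept = [group for group in rollout_groups
            if 0 < sum(r["correct"] for r in group) < len(group)]
    while len(kept) < batch_size and backlog:
        kept.append(backlog.pop(0))
    return kept
```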
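A minimal sketch of the composite reward as described; the coefficient values are placeholders, as the paper does not report \(\alpha\), \(\beta\), or \(R_{max}\) here:

```python
def composite_reward(is_correct: bool, n_code_calls: int,
                     alpha: float = 0.1, beta: float = 0.05, r_max: float = 0.3) -> float:
    """R_total = R_acc + 1[R_acc = 1] * R_tool, with R_tool = min(R_max, alpha + beta * N_code).
    The coefficient values here are placeholders, not reported settings."""
    r_acc = 1.0 if is_correct else 0.0
    r_tool = min(r_max, alpha + beta * n_code_calls)
    return r_acc + (r_tool if is_correct else 0.0)
```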
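The prefix-aware weighting rule can be sketched as follows; the value of \(L_{base}\) and the greedy routing loop are assumptions, and the LRU sticky-session logic is omitted:

```python
def request_weight(prefix_len: int, l_base: int = 8192, w_base: int = 1) -> int:
    """w_j = floor(L_j / L_base) + w_base: longer prefixes count as heavier requests."""
    return prefix_len // l_base + w_base

def route_request(engine_loads: dict, prefix_len: int) -> str:
    """Send a resumed request to the engine with the smallest accumulated weighted load."""
    target = min(engine_loads, key=engine_loads.get)
    engine_loads[target] += request_weight(prefix_len)
    return target
```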
Loss & Training¶
- SFT stage: Autoregressive loss with selective feedback masking, \(\mathcal{L}_{\text{SFT-masked}} = -\sum_t \sum_k (1 - \mathbb{I}(z_{t,k})) \log \pi_\theta(z_{t,k} \mid \cdot)\), which masks <interpreter>-segment tokens (a minimal sketch follows this list).
- RL stage: A multi-stage adaptive policy that automatically scales when the truncation rate exceeds 10%: context length expands from 48k → 72k → 96k, the tool-call limit from 48 → 72 → 96, and the number of partial rollouts from 2 → 3 → 4.
- Llama-Factory is used for SFT (6 epochs, learning rate 6e-5); verl 0.5.0 is used for RL (learning rate 1e-6, batch size 64, 8 rollouts per problem).
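A minimal PyTorch sketch of the selective feedback masking; tensor shapes and names are assumptions rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def masked_sft_loss(logits: torch.Tensor, labels: torch.Tensor,
                    interpreter_mask: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy with <interpreter> feedback tokens excluded from the loss.
    logits: (B, T, V); labels: (B, T); interpreter_mask: (B, T) bool, True on feedback tokens."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    keep = (~interpreter_mask).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```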
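The multi-stage adaptive policy can be expressed as a simple stage table plus a promotion rule; interpreting "k" as 1024 tokens and the stage-advance mechanics are assumptions:

```python
# Stage table for the adaptive scaling policy described above.
STAGES = [
    {"max_context": 48 * 1024, "max_tool_calls": 48, "partial_rollout_segments": 2},
    {"max_context": 72 * 1024, "max_tool_calls": 72, "partial_rollout_segments": 3},
    {"max_context": 96 * 1024, "max_tool_calls": 96, "partial_rollout_segments": 4},
]

def maybe_advance(stage_idx: int, truncation_rate: float, threshold: float = 0.10) -> int:
    """Move to the next stage once more than 10% of trajectories are being truncated."""
    if truncation_rate > threshold and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx
```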
Key Experimental Results¶
Main Results¶
| Dataset | Metric | AgentMath-8B | AgentMath-30B-A3B | AgentMath-235B-A22B-SFT | Prev. SOTA (same scale) | Gain (8B vs. prev. SOTA) |
|---|---|---|---|---|---|---|
| AIME24 | avg@32 | 89.8% | 90.6% | 93.4% | 86.0% (DS-0528-Qwen3-8B) | +3.8% |
| AIME25 | avg@32 | 84.7% | 86.4% | 90.8% | 76.3% (DS-0528-Qwen3-8B) | +8.4% |
| HMMT25 | avg@32 | 71.3% | 73.8% | 81.7% | 61.5% (DS-0528-Qwen3-8B) | +9.8% |
AgentMath-30B-A3B (with only 3B active parameters) surpasses OpenAI-o3-mini (87.3%/86.3%) and Claude-Opus-4.0-Thinking (83.0%/72.0%) on AIME24/25, approaching DeepSeek-R1-671B (91.4%/87.5%).
Ablation Study¶
| Configuration | AIME24 | AIME25 | Notes |
|---|---|---|---|
| Unrefined synthesized data | 35.3% | 25.7% | Format inconsistency and non-executable code degrade performance |
| + Format consistency correction | 47.4% | 40.1% | +12.1%/+14.4% |
| + Code executability verification | 52.8% | 44.8% | +5.4%/+4.7% |
| + Environment feedback alignment | 56.3% | 48.3% | +3.5%/+3.5% |
| + Self-correction capability injection | 58.6% | 50.8% | +2.3%/+2.5% |
| + SFT selective masking | 60.5% | 53.3% | Final SFT performance |
| Text-Based-SFT vs. AgentMath-SFT | 57.1% vs. 60.5% | 49.2% vs. 53.3% | Advantage of tool-augmented data |
| Text-Based-RL vs. AgentMath-RL | 68.7% vs. 76.2% | 57.5% vs. 67.5% | 4× efficiency gain at RL stage |
Training Efficiency¶
| Method | Time per Step | Speedup |
|---|---|---|
| Static synchronous batch rollout | 3600–4000s | — |
| + Request-level async scheduling | 2100–2500s | 1.5–1.8× |
| + Agent partial rollout | 1100–1300s | 3.0–3.3× |
| + Prefix-aware load balancing | 750–900s | 4.0–5.0× |
Key Findings¶
- The tool-augmented model reaches 76.2% (AIME24) in only ~400 RL steps, whereas the pure-text model requires ~1,600 steps to reach 68.7% — a 4× efficiency improvement.
- Emergent code self-correction capability arises during multi-stage RL training.
- Reasoning sequence length is reduced by approximately 4k tokens (~14%), as short code blocks executed by the interpreter replace lengthy manual calculations.
- Scaling data from 2k to 300k samples improves AIME24 performance from 27.2% to 78.4%, demonstrating favorable scaling laws.
Highlights & Insights¶
- Systematic resolution of three bottlenecks: Data scarcity (automated synthesis pipeline), policy optimization (Agentic RL), and training efficiency (asynchronous infrastructure) form a complete technical loop.
- Emergent code self-correction: During RL training, the model autonomously learns to diagnose and repair code errors — an emergent behavior that was not explicitly trained.
- Remarkable efficiency of MoE models: The 30B-A3B model, with only 3B active parameters, approaches the performance of a 671B-parameter model, demonstrating that tool-augmented strategies can substantially compensate for parameter count deficits.
- Elegant design of partial rollout: Decomposing ultra-long trajectories into manageable segments resolves long-tail latency without degrading accuracy (accuracy holds at ~70% across different segment-budget settings of \(N\)).
Limitations & Future Work¶
- Due to computational constraints, the 235B-scale model undergoes only SFT without RL training, leaving potential gains unexplored.
- The study focuses exclusively on mathematical competition benchmarks and does not validate generalization to broader domains such as scientific reasoning or engineering computation.
- The tool-use reward component of the composite reward function is relatively simple and may not precisely guide optimal tool-invocation timing.
- The code interpreter is currently limited to Python/SymPy; integration of other computational tools (e.g., Mathematica, SageMath) remains unexplored.
Related Work & Insights¶
- Comparison with ToRL/ReTool: These methods also explore RL combined with tool use, but fall short of AgentMath in data quality and training efficiency, with limited performance gains.
- Comparison with CoRT: CoRT relies on high-quality manual annotation and is not scalable; AgentMath's automated synthesis pipeline directly addresses this limitation.
- Engineering insights: The design principles of the asynchronous training system (request-level scheduling + partial rollout + prefix-aware load balancing) are highly generalizable and transferable to other agent RL scenarios.
- On agent system design: This work demonstrates that simple outcome-based rewards optimized with GRPO are sufficiently effective in agent settings, without requiring complex process rewards.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐