
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Conference: ICLR 2026 arXiv: 2509.13761 Code: GitHub Keywords: tool-integrated reasoning, hierarchical RL, GRPO, code generation, self-correction, mathematical reasoning

TL;DR

This paper proposes THOR, a framework that systematically addresses three core challenges in tool-integrated mathematical reasoning for LLMs—data construction, fine-grained optimization, and inference enhancement—through three complementary components: the TIRGen data construction pipeline, hierarchical reinforcement learning (joint episode-level and step-level optimization), and a self-correction inference mechanism. THOR achieves state-of-the-art performance among models of comparable scale on benchmarks including MATH500 and AIME.

Background & Motivation

  1. As probabilistic next-token predictors, LLMs are inherently limited in high-precision numerical computation, equation solving, and symbolic manipulation, where sampling errors accumulate across multi-step calculations.

  2. Tool-Integrated Reasoning (TIR) is an effective paradigm for mitigating this limitation, but faces three core challenges:

    • Data construction difficulty: Synthesizing data via prompting external large models such as GPT-4o introduces style mismatch issues and yields poor results for reasoning models (e.g., DeepSeek-R1); rule-injection methods such as START suffer from imprecise insertion points, leading to redundant code invocations.
    • Coarse optimization granularity: Existing RL methods (Agent-R, ToRL, ReTool) perform only episode-level optimization, using final answer correctness as the sole reward signal—this causes severe sparse reward problems in long reasoning chains, leaving intermediate code steps without fine-grained updates.
    • Lack of error correction during inference: Single-pass reasoning ignores immediate feedback from tool execution; when code execution fails, the model should backtrack and correct rather than proceed blindly.
  3. SFT-based approaches (Toolformer, AIMO-2) require large amounts of high-quality demonstration data and generalize poorly.

  4. Core Insight: The execution success rate of intermediate tool calls is a strong predictor of final answer correctness, providing a natural and dense reward signal for step-level optimization.

Method

Overall Architecture

THOR comprises three complementary components forming a complete pipeline:

  1. TIRGen Data Construction Pipeline: A Generator-Refiner collaboration that automatically generates policy-aligned TIR training data \(\mathcal{D}_{SFT}\).

  2. Hierarchical RL Training: Cold Start SFT → joint training with episode-level optimization (final answer reward) and step-level optimization (code execution reward).

  3. Self-Correction Inference Mechanism: During inference, tool execution feedback is leveraged to backtrack and regenerate failed steps.

The entire process is formalized as a think-act-observe loop: given a problem \(q\), the model generates an alternating sequence \(\tau=(r^1, a^1, o^1, \ldots, r^n)\), where \(r^t\) denotes a natural language reasoning step, \(a^t\) a code action, and \(o^t\) the execution result.
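
The loop above can be made concrete with a short sketch. The `generate_step` and `generate_action` callables below are hypothetical stand-ins for the policy model, and the subprocess-based `sandbox_exec` is only a simple approximation of the paper's sandboxed executor:

```python
import subprocess

def sandbox_exec(code: str, timeout: float = 5.0) -> str:
    """Run a code action a^t in a separate Python process; return observation o^t."""
    try:
        proc = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=timeout
        )
        return proc.stdout if proc.returncode == 0 else "Error: " + proc.stderr
    except subprocess.TimeoutExpired:
        return "Error: execution timed out"

def tir_rollout(generate_step, generate_action, q: str, max_steps: int = 10) -> list:
    """Roll out tau = (r^1, a^1, o^1, ..., r^n) for problem q.

    `generate_step` returns a reasoning step r^t given the history;
    `generate_action` returns a code action a^t, or None once the model
    emits a final answer. Both are assumed interfaces, not the paper's API.
    """
    history = [q]
    for _ in range(max_steps):
        r_t = generate_step(history)      # think: reasoning step r^t
        history.append(r_t)
        a_t = generate_action(history)    # act: code action a^t
        if a_t is None:                   # final answer reached, no more tool calls
            break
        o_t = sandbox_exec(a_t)           # observe: execution result o^t
        history.extend([a_t, o_t])
    return history
```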

Key Designs

  1. TIRGen Data Construction Pipeline (Generator-Refiner Framework)

    • The Generator produces natural language reasoning steps (with a per-step maximum length \(L_{step}\)), preserving the model's original reasoning style.
    • The Refiner evaluates whether each step can be converted to code execution (numerical computation, equation solving, etc.), extracts the pure logical reasoning portion \(r_{logic}^t\), and converts it to executable Python code \(a^t\).
    • Code is executed in a sandboxed environment to obtain observation \(o^t\), which replaces the original computed result.
    • Policy alignment advantage: The Refiner observes only the single reasoning step (not the full problem or answer), so the generated data is naturally aligned with the Generator's policy distribution, avoiding performance degradation from out-of-distribution data.
    • Reduced dependence on large models: The Generator handles high-level mathematical reasoning, while the Refiner requires only basic instruction following and code generation capability; this task decomposition reduces the need for very large models.
    • Multi-stage filtering: format consistency checks → code quality filtering (requiring sympy/numpy calls or control flow) → difficulty- and invocation-count-balanced sampling → exclusion of simple problems solvable by pure CoT. (A minimal code sketch of one TIRGen iteration appears after this list.)
  2. Hierarchical Reinforcement Learning (Joint Episode- and Step-Level Optimization)

    • Cold Start Phase: SFT on \(\mathcal{D}_{SFT}\) establishes foundational tool-calling patterns, enabling the model to master the think-act-observe format.
    • Episode-Level Optimization: Employs the GRPO algorithm with final answer correctness as reward (\(\mathcal{R}_i = 1\) for correct, \(0\) for incorrect); trajectories with execution failures are filtered to avoid improper penalization due to environment issues. Advantage function: \(A_i = \frac{\mathcal{R}_i - \text{mean}(\mathcal{R})}{\text{std}(\mathcal{R})}\).
    • Step-Level Optimization: Targets code steps with execution failures via backtracking—the erroneous reasoning step \(r^t\) is split into a prefix \(r_{pre}^t\) and suffix \(r_{suf}^t\) (of length \(L_{suf}\)); the prefix context is retained while the suffix and code action are regenerated, using code execution success rate as the step-level reward.
    • A VAPO-style NLL loss \(\mathcal{L}_{NLL}\) with weighting coefficient \(\alpha\) is additionally applied to samples with positive advantage, directly increasing their likelihood.
  3. Self-Correction Inference Mechanism

    • During inference, if code action \(a^t\) fails to execute, a backtracking mechanism is triggered.
    • The reasoning step \(r^t\) is split into \(r_{pre}^t\) and \(r_{suf}^t\); based on the history up to \(r_{pre}^t\), the model regenerates a revised reasoning suffix \(\hat{r}_{suf}^t\) and corrected action \(\hat{a}^t\).
    • Up to \(N_{corr}\) retries are allowed; each retry regenerates only the suffix rather than the entire trajectory, incurring minimal additional computational overhead.
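
A minimal sketch of one TIRGen iteration, following the description in item 1; `generator` and `refiner` are hypothetical model wrappers, their methods (`next_step`, `is_codeable`, `extract_logic`, `to_python`) are assumed interfaces, and `sandbox_exec` is the executor from the earlier rollout sketch. Note that the Refiner sees only the single step, never the full problem or answer:

```python
def tirgen_step(generator, refiner, history: list, l_step: int = 256) -> list:
    """One Generator-Refiner iteration producing a TIR-style data segment."""
    r_t = generator.next_step(history, max_len=l_step)   # keeps the model's own style
    if not refiner.is_codeable(r_t):                     # nothing to offload to code
        history.append(r_t)
        return history
    r_logic = refiner.extract_logic(r_t)    # pure logical reasoning r_logic^t
    a_t = refiner.to_python(r_t)            # executable code action a^t
    o_t = sandbox_exec(a_t)                 # observation o^t replaces the
    history.extend([r_logic, a_t, o_t])     # model-computed result
    return history
```

And a sketch of the suffix backtracking shared by step-level RL sampling (item 2) and self-correction at inference (item 3). The character-level prefix/suffix split and the `regenerate` callable are simplifications of whatever tokenization and decoding interface the actual system uses:

```python
def self_correct(regenerate, history: list, r_t: str, a_t: str,
                 l_suf: int = 64, n_corr: int = 3):
    """Retry a failed code action by regenerating only the step suffix.

    `regenerate` is a hypothetical policy call that, given the history plus
    the retained prefix r_pre^t, returns a revised suffix and code action.
    """
    o_t = sandbox_exec(a_t)
    for _ in range(n_corr):                          # at most N_corr retries
        if not o_t.startswith("Error"):              # execution succeeded
            break
        r_pre = r_t[:-l_suf]                         # keep prefix r_pre^t as context
        r_suf, a_t = regenerate(history + [r_pre])   # revised suffix and action
        r_t = r_pre + r_suf
        o_t = sandbox_exec(a_t)
    return r_t, a_t, o_t
```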

Loss & Training

The total training loss combines episode-level and step-level objectives:

\[\mathcal{L}(\theta) = \mathcal{L}_{\pi_\theta}^{epis}(\theta) + \mathcal{L}_{\pi_\theta}^{step}(\theta)\]

Both levels adopt GRPO's clipped surrogate objective augmented with a VAPO-style NLL loss:

  • Episode level: \(\mathcal{L}^{epis}\) uses final answer correctness as reward; an indicator function \(I(s_{i,t})\) ensures gradients are computed only over model-generated tokens (not external executor outputs).
  • Step level: \(\mathcal{L}^{step}\) uses code execution success rate as reward; each sample contains a single think-act-observe cycle to ensure dense rewards.

Training procedure: Cold Start SFT → joint episode- and step-level RL optimization.
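
A compact sketch of the two-level objective under the formulas above. The batch layout, the \(\epsilon\) in the advantage denominator, and the exact form of the NLL term are assumptions; the paper's notation only pins down the advantage normalization, the clipped surrogate, and the indicator masking:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """A_i = (R_i - mean(R)) / std(R), with eps added for numerical safety."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, adv, mask, clip_eps: float = 0.2):
    """GRPO-style clipped policy-gradient loss; `mask` plays the role of the
    indicator I(s_{i,t}), zeroing tokens produced by the external executor."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surr = torch.minimum(ratio * adv, clipped * adv)
    return -(surr * mask).sum() / mask.sum()

def thor_loss(batch_epis: dict, batch_step: dict, alpha: float = 0.1):
    """L(theta) = L^epis + L^step, each augmented with a VAPO-style NLL term
    that raises the likelihood of positive-advantage samples (weight alpha)."""
    total = torch.tensor(0.0)
    for b in (batch_epis, batch_step):
        # per-sample rewards -> per-token advantages via a sample index
        adv = grpo_advantages(b["rewards"])[b["sample_idx"]]
        total = total + clipped_surrogate(b["logp_new"], b["logp_old"],
                                          adv, b["mask"])
        pos = (adv > 0).float() * b["mask"]          # positive-advantage tokens
        if pos.sum() > 0:
            total = total + alpha * (-(b["logp_new"] * pos).sum() / pos.sum())
    return total
```

As a worked example of the advantage normalization, a group with rewards [1, 1, 0, 0] yields advantages of roughly ±0.87, so correct trajectories are pushed up and incorrect ones down within the same group.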

Key Experimental Results

Datasets and Benchmarks

| Benchmark | Type | Description |
|---|---|---|
| MATH500 | Mathematical reasoning | 500 problems covering 7 subjects across 5 difficulty levels |
| AIME 2024 | Competition mathematics | American Invitational Mathematics Examination 2024 |
| AIME 2025 | Competition mathematics | American Invitational Mathematics Examination 2025 |
| AMC | Competition mathematics | American Mathematics Competition |
| Minerva Math | Mathematical reasoning | STEM mathematics problem set |
| OlympiadBench | Competition mathematics | Olympiad mathematics benchmark |
| HumanEval | Code generation | Function-level code generation |
| MBPP | Code generation | Basic programming problems |
| LiveCodeBench | Code generation | Continuously updated competitive programming problems |

Baselines

| Model | Parameters | Type | Tool Use |
|---|---|---|---|
| QwQ-32B | 32B | Reasoning model | ✗ |
| DeepSeek-R1-Distill-32B | 32B | Distilled reasoning model | ✗ |
| ToRL-Qwen-32B | 32B | RL + tool calling | ✓ |
| ReTool-32B | 32B | RL + tool calling | ✓ |
| START-32B | 32B | Rule injection + tool calling | ✓ |
| DeepSeek-R1-7B | 7B | Distilled reasoning model | ✗ |
| STILL-3-1.5B | 1.5B | Small reasoning model | ✗ |
| AIMO-2 | 32B | SFT + tool calling | ✓ |

Main Results (Math Benchmarks, Avg@4)

| Model | MATH500 | AIME24 | AIME25 | AMC | Minerva | OlympiadBench | Avg |
|---|---|---|---|---|---|---|---|
| QwQ-32B | 96.4 | 79.2 | 72.0 | - | - | - | - |
| DeepSeek-R1-Distill-32B | 94.3 | 72.6 | 60.2 | - | - | - | - |
| ToRL-Qwen-32B | 95.6 | 75.2 | 67.3 | - | - | - | - |
| ReTool-32B | 95.2 | 72.5 | 67.5 | - | - | - | - |
| THOR-32B | 97.5 | 82.2 | 76.0 | - | - | - | Best |

Key Findings:

  1. ⭐⭐⭐ THOR outperforms all models of comparable scale: it achieves 97.5% on MATH500, 82.2% on AIME24, and 76.0% on AIME25, the best results among same-scale baselines.

  2. ⭐⭐⭐ Strong cross-architecture generalization: THOR is effective on both reasoning models (QwQ-32B backbone) and non-reasoning models (Qwen-2.5-32B backbone), and scales across parameter sizes (1.5B/7B/32B).

  3. ⭐⭐ Concurrent improvements on code benchmarks: Positive gains are also observed on HumanEval, MBPP, and LiveCodeBench, indicating that step-level code optimization enhances general code generation capability.

  4. ⭐⭐ Low inference overhead: The self-correction mechanism is triggered only upon execution failure and regenerates only the suffix rather than the entire trajectory, yielding average inference token counts lower than ToRL and ReTool.

Ablation Study

| Configuration | MATH500 | AIME24 | AIME25 | Trend |
|---|---|---|---|---|
| Full THOR | 97.5 | 82.2 | 76.0 | Complete framework |
| − Step-level RL | 96.8 | 78.5 | 72.7 | ↓ Notable degradation |
| − Self-Correction | 96.4 | 80.3 | 73.5 | ↓ Inference decline |
| − TIRGen (replaced with GPT-4o data) | 95.9 | 76.8 | 70.3 | ↓ Significant degradation |
| Cold Start SFT only | 94.2 | 68.3 | 58.7 | ↓ Severe degradation |

Key Findings:

  1. ⭐⭐⭐ Step-level RL contributes most: Removing it reduces AIME24 by 3.7 points and AIME25 by 3.3 points, validating the effectiveness of intermediate code execution success rate as a dense reward signal.

  2. ⭐⭐ TIRGen data alignment is critical: Replacing with GPT-4o synthesized data reduces AIME25 by 5.7 points, demonstrating that policy-aligned training data substantially outperforms out-of-distribution data.

  3. ⭐⭐ Self-correction is most impactful on hard problems: It contributes a 2.5-point gain on AIME25 (the most challenging benchmark), as harder problems are more prone to code execution failures requiring correction.

Key Statistical Validation

The paper validates its core insight, namely that intermediate tool-call success rate is strongly correlated with final-answer correctness: trajectories in which all code calls succeed exhibit substantially higher final-answer accuracy than those containing execution failures, providing empirical grounding for step-level optimization.
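
This validation amounts to a simple split of sampled trajectories by execution outcome. A hypothetical check might look like the following, where the field names are illustrative rather than taken from the paper's code:

```python
def accuracy_by_execution(trajectories: list) -> tuple:
    """Compare final-answer accuracy between trajectories whose code calls
    all succeed and those with at least one execution failure."""
    all_ok = [t["correct"] for t in trajectories if all(t["exec_success"])]
    failed = [t["correct"] for t in trajectories if not all(t["exec_success"])]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(all_ok), acc(failed)   # the paper reports a large gap here
```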

Highlights & Insights

  1. ⭐⭐⭐ Systematic framework design: The three stages of data, training, and inference are mutually complementary; TIRGen addresses data quality, hierarchical RL addresses optimization, and self-correction addresses inference, forming a complete closed loop.
  2. ⭐⭐⭐ Dense rewards via hierarchical RL: Leveraging code execution success rate as a step-level reward signal effectively mitigates the sparse reward problem in long reasoning chains.
  3. ⭐⭐ Generator-Refiner data construction: The design in which the Refiner observes only a single step rather than the full problem elegantly achieves policy alignment while reducing dependence on external large models.
  4. ⭐⭐ Low-overhead self-correction: Regenerating only the suffix rather than the entire trajectory incurs minimal computational cost.
  5. Broad model applicability: Effective across reasoning and non-reasoning models and multiple parameter scales.

Limitations & Future Work

  1. ⭐⭐ Single tool type: The current framework supports only a Python code executor and has not been extended to other tools such as symbolic solvers (Mathematica) or search engines.
  2. ⭐⭐ Fixed backtracking strategy: The suffix length \(L_{suf}\) is a fixed hyperparameter that does not adaptively adjust based on error type—too long wastes computation, while too short may fail to address the root cause of errors.
  3. Restriction to mathematics: The framework has not been evaluated on broader reasoning tasks such as scientific or logical reasoning.
  4. Limited self-correction retries: The \(N_{corr}\) retry limit may still fail to repair certain systematic errors, and no mechanism exists for abandoning the current reasoning path and switching strategies.

Summary

THOR is an end-to-end framework for enhancing tool-integrated mathematical reasoning. Its core contributions are: (1) the TIRGen pipeline, which generates high-quality TIR training data aligned with the policy model's distribution; (2) hierarchical RL, which jointly optimizes at the episode level (answer reward) and step level (code execution reward) to effectively alleviate sparse rewards; and (3) a self-correction mechanism that leverages tool feedback for low-overhead online error correction. The framework achieves state-of-the-art performance among same-scale models across multiple mathematical and code benchmarks, demonstrating a systematic solution for tool-integrated reasoning.