
MARGE: Improving Math Reasoning for LLMs with Guided Exploration

Conference: ICML 2025
arXiv: 2505.12500
Code: https://github.com/georgao35/MARGE
Area: LLM Reasoning / Mathematical Reasoning
Keywords: Mathematical Reasoning, Guided Exploration, Intermediate States, Credit Assignment, Self-Generated Data

TL;DR

MARGE proposes a "hit-guided exploration" approach to enhance the mathematical reasoning capabilities of LLMs. By systematically exploring the intermediate reasoning states in self-generated solutions, it achieves thorough exploration and better credit assignment without requiring external annotations or additional value models, simultaneously improving single-attempt accuracy and exploration diversity.

Background & Motivation

Background: LLMs have demonstrated strong potential in mathematical reasoning, but the shortage of high-quality training data limits further progress. Current mainstream approaches scale training by utilizing self-generated data, where LLMs generate their own solution processes and undergo reinforcement learning (RL) training using correct/incorrect signals.

Limitations of Prior Work:

  • Spurious correlation data: Data generated by existing methods contains many "spuriously correct" reasoning paths, in which the final answer happens to be correct but the intermediate reasoning steps are faulty.
  • Insufficient exploration: Standard RL methods (e.g., GRPO, ReST) primarily obtain feedback from the final results of complete solutions, neglecting the exploration of intermediate reasoning steps.
  • Difficulty in credit assignment: When a multi-step reasoning process ultimately fails, it is difficult to determine which specific step went wrong.
  • Accuracy-diversity trade-off: Existing alignment methods typically reduce exploration diversity (decreasing pass@k) while improving accuracy.

Key Challenge: To improve mathematical reasoning, more high-quality training data is required. However, existing data generation methods only utilize the correctness signal of the final answer, wasting the rich information embedded in the intermediate reasoning steps.

Key Insight: Instead of looking solely at whether the final answer is correct, one should delve into each intermediate reasoning state to determine if a correct answer can be reached starting from that state—this is "hit-guided exploration."

Core Idea: Extract intermediate reasoning states from the model's self-generated solutions, re-sample subsequent reasoning paths from these states to estimate their "hit rate," and then utilize these hit rates to guide exploration and training. This enables the model to learn not only "which answers are correct" but also "which reasoning paths are more reliable."

Method

Overall Architecture

The overall pipeline of MARGE is an iterative self-improvement loop:

  1. Generation Phase: Generate multiple complete solutions for training problems using the current policy.
  2. Intermediate State Extraction: Extract intermediate reasoning states from the generated solutions.
  3. Hit Estimation: Re-sample subsequent reasoning paths for each intermediate state to estimate its hit rate.
  4. Training Data Construction: Use hit rates to construct training pairs that contrast advantageous states (high hit rate) with disadvantageous states (low hit rate).
  5. Policy Update: Train the model on the constructed data to enhance its reasoning capabilities.
  6. Repeat the above loop.
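The loop above can be sketched as a short driver function. This is only an illustrative skeleton, not the authors' implementation: every name (`run_self_improvement`, `generate`, `estimate_hit`, `build_pairs`, `update`) is hypothetical, each phase is passed in as a callable, and treating every proper prefix of a solution as an intermediate state is an assumption.

```python
from typing import Any, Callable, Dict, List


def run_self_improvement(
    policy: Any,
    problems: List[str],
    generate: Callable[[Any, str], List[List[str]]],
    estimate_hit: Callable[[Any, str, List[str]], float],
    build_pairs: Callable[[List[Dict]], List[Dict]],
    update: Callable[[Any, List[Dict]], Any],
    num_rounds: int = 2,
) -> Any:
    """Hypothetical driver for the six-step loop; all callables are placeholders."""
    for _ in range(num_rounds):                           # 6. repeat
        records: List[Dict] = []
        for problem in problems:
            for steps in generate(policy, problem):       # 1. generation
                # 2. state extraction: assume each proper prefix is a state
                for t in range(len(steps)):
                    prefix = steps[:t]
                    # 3. hit estimation for this intermediate state
                    hit = estimate_hit(policy, problem, prefix)
                    records.append(
                        {"problem": problem, "prefix": prefix, "hit": hit}
                    )
        pairs = build_pairs(records)                      # 4. data construction
        policy = update(policy, pairs)                    # 5. policy update
    return policy
```

Passing the phases as callables keeps the sketch independent of any particular model or training library; in practice each would wrap an LLM sampler, an answer checker, and a trainer.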

Key Designs

  1. Intermediate State Exploration:

    • Given a self-generated solution trajectory \(\tau = (s_0, a_0, s_1, a_1, ..., s_T)\)
    • "Fork" at each intermediate state \(s_t\)—keep the reasoning prefix up to \(s_t\), and re-sample subsequent steps.
    • Sample \(K\) subsequent paths for each state, and calculate how many of them reach the correct final answer.
    • Hit rate \(h(s_t) = \frac{\text{number of sampled paths that reach the correct answer}}{K}\)
    • Design Motivation: The hit rate reflects the quality of an intermediate state—a high hit rate indicates that it is easy to reach the correct answer from this state (a good intermediate reasoning result), while a low hit rate indicates that the intermediate reasoning has already deviated.
  2. Hit-Guided Credit Assignment:

    • Traditional methods (such as DPO) only label the entire trajectory as positive/negative based on the final outcome.
    • MARGE performs more fine-grained credit assignment for each intermediate step using hit rates.
    • Specifically, for a trajectory that is ultimately correct, steps with low hit rates might still be "spuriously positive steps" (which happened to lead to the correct answer by chance).
    • For a trajectory that is ultimately incorrect, steps with high hit rates might still contain good reasoning segments.
    • In this way, training data is constructed by comparing state-action pairs with different hit rates.
    • Design Motivation: This addresses the issue of coarse "one-size-fits-all" credit assignment across the entire trajectory found in traditional methods.
  3. Preserving Exploration Diversity:

    • Standard RL training tends to collapse the model into a few fixed patterns with high rewards, reducing the pass@k metric.
    • MARGE naturally maintains higher exploration diversity by exploring different branches of intermediate states.
    • The training data includes both samples of "successful exploration from good states" and "rectification from bad states."
    • This allows the model to learn richer reasoning strategies rather than simply memorizing a single solution template.
    • Design Motivation: In mathematical reasoning, a higher pass@k means that at least one of k sampled solutions is correct more often, which indicates that the model still explores diverse solution paths. This matters for practical sampling-based uses such as majority voting.
  4. No External Annotation and Value Models Required:

    • No human annotation is needed for the correctness of intermediate steps.
    • No need to train additional reward or value models.
    • The hit rate is estimated entirely through the model's own sampling.
    • Design Motivation: This reduces the implementation complexity and computational overhead of the approach.
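The hit-rate estimate from design 1 can be illustrated with a small Monte Carlo sketch. It assumes a solution is a list of step strings, that a state \(s_t\) is the prefix of the first t steps, and that two hypothetical callables are available: `sample_continuation` (returns a final answer given a prefix) and `is_correct` (checks that answer against the ground truth). The toy sampler below is a deterministic stand-in for an LLM, used only so the example runs.

```python
from typing import Callable, List


def estimate_hit_rate(
    prefix: List[str],
    sample_continuation: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    k: int,
) -> float:
    """Fraction of k re-sampled continuations from `prefix` that are correct."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for _ in range(k) if is_correct(sample_continuation(prefix)))
    return hits / k


def hit_rates_along_trajectory(
    steps: List[str],
    sample_continuation: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    k: int,
) -> List[float]:
    """Fork at every proper prefix steps[:t] and estimate its hit rate."""
    return [
        estimate_hit_rate(steps[:t], sample_continuation, is_correct, k)
        for t in range(len(steps))
    ]


# Toy stand-ins (hypothetical): the answer is wrong once a bad step is in the prefix.
def toy_sampler(prefix: List[str]) -> str:
    return "0" if "bad step" in prefix else "42"


def toy_checker(answer: str) -> bool:
    return answer == "42"


rates = hit_rates_along_trajectory(
    ["good step", "bad step", "good step"], toy_sampler, toy_checker, k=4
)
```

With a real stochastic sampler, each estimate is a binomial proportion, so its standard error is roughly \(\sqrt{h(1-h)/K}\); this is the usual reason small K gives noisy hit rates.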

Loss & Training

MARGE uses a preference optimization loss similar to DPO, but the training pairs are constructed based on hit rates: (state, action) pairs with high hit rates are treated as Chosen, and those with low hit rates are treated as Rejected. Specifically, the data format is {"query": math problem, "guidance hit": guidance hit, "gt": ground truth}. Training is based on open-source datasets (Math-Step-DPO-10K and Big-Math-RL-Verified), and the code utilizes the backwardlearning framework.
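As a rough illustration of "DPO-like loss on hit-rate pairs", the sketch below pairs continuations that share a prefix by their hit rates and evaluates the standard DPO objective on scalar log-probabilities. The pairing rule, the `min_gap` threshold, and the function names are assumptions made for this example; the summary does not specify MARGE's exact pairing criterion or loss.

```python
import math
from typing import List, Tuple


def build_preference_pairs(
    candidates: List[Tuple[str, float]], min_gap: float = 0.5
) -> List[Tuple[str, str]]:
    """Pair continuations of the same prefix as (chosen, rejected).

    `candidates` holds (continuation_text, hit_rate). A pair is kept only
    if the hit rates differ by at least `min_gap` (a hypothetical threshold).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pairs = []
    for i, (chosen, h_chosen) in enumerate(ranked):
        for rejected, h_rejected in ranked[i + 1:]:
            if h_chosen - h_rejected >= min_gap:
                pairs.append((chosen, rejected))
    return pairs


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and the reference model agree, and decreases as the policy assigns relatively more probability to the high-hit-rate continuation.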

Key Experimental Results

Main Results

| Model | Method | MATH | GSM8K | Other Benchmarks |
|---|---|---|---|---|
| Qwen2.5-Math-7B-Instruct | Base | Baseline | Baseline | Baseline |
| Qwen2.5-Math-7B-Instruct | +MARGE | Significant Improvement | Significant Improvement | Mostly Improved |
| LLaMA-3.1-8B-Instruct | Base | Baseline | Baseline | Baseline |
| LLaMA-3.1-8B-Instruct | +MARGE | Improved | Improved | Consistently Improved |
| MetaMath-Mistral | Base | Baseline | Baseline | Baseline |
| MetaMath-Mistral | +MARGE | Improved | Improved | Effective Across Architectures |

Ablation Study

| Configuration | Single-attempt Accuracy | pass@k | Description |
|---|---|---|---|
| Standard DPO (Entire Trajectory) | Improved | Decreased | Accuracy-diversity trade-off |
| MARGE (Without Hit Guidance) | Slight Improvement | Maintained | Pure exploration is insufficient |
| MARGE (With Hit Guidance) | Significant Improvement | Improved | Simultaneously improves both |
| Different Sample Size K | Improves with larger K | - | But with diminishing marginal returns |
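For readers unfamiliar with the pass@k column, the sketch below computes the widely used unbiased estimator \(1 - \binom{n-c}{k} / \binom{n}{k}\) from n sampled solutions of which c are correct. Whether the paper uses exactly this estimator is an assumption; the summary does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples, c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to c / n, the single-attempt accuracy, which is why the two columns can move in different directions: a model may raise c / n on problems it already solves while losing the occasional correct sample on harder ones.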

Key Findings

  • MARGE is one of the few methods that can simultaneously improve single-attempt accuracy and pass@k, breaking the common accuracy-diversity trade-off.
  • Hit-guided credit assignment is the core contribution: in the ablation, intermediate state exploration without hit guidance yields only a slight accuracy improvement.
  • The method is effective across multiple backbone models (Qwen2.5-Math, LLaMA-3.1, MetaMath-Mistral), demonstrating good generalizability.
  • As the volume of self-generated data scales up, the advantage of MARGE becomes more pronounced, demonstrating its capability to unlock the scaling potential of self-generated data.
  • Eliminating the need for an additional reward model or value model reduces implementation costs.

Highlights & Insights

  • "Hit rate" is an elegant signal: It circumvents the need to train complex value networks. The quality of each intermediate state can be estimated simply by re-sampling continuations from it. The idea is simple yet effective.
  • Breaking the accuracy-diversity trade-off: This is a persistent challenge in alignment methods. MARGE naturally resolves this issue through intermediate state exploration.
  • Transferable to other reasoning tasks: The framework of intermediate state exploration combined with hit-guided feedback is not limited to math. It can be applied to any task requiring multi-step reasoning, such as code generation and logical reasoning.
  • A new perspective on scaling: When training data is scarce, forking from intermediate states yields many additional training signals per problem (one hit-rate estimate per forked state) without requiring new problems or annotations.

Limitations & Future Work

  • Estimating the hit rate requires sampling K continuations for each intermediate state, so the sampling cost grows with both K and the number of states explored per trajectory.
  • The choice of "split points" for intermediate states may affect performance—determining which positions are worth exploring remains an open question.
  • The paper primarily validates the method on mathematical reasoning; its effectiveness on other reasoning tasks (such as code or commonsense reasoning) has not been fully verified.
  • Estimation of the hit rate can be noisy—when K is small, inaccurate estimation may introduce erroneous signals.
  • Future work could consider combining the hit rate with other signals (such as ORM scores) to further improve the quality of the signal.

Comparison with Related Work

  • vs Standard RL (GRPO/PPO): Standard RL learns only from the final reward of the complete trajectory. MARGE examines intermediate steps, providing richer signals and more precise credit assignment.
  • vs Process Reward Model (PRM): PRMs require human annotations on intermediate steps or separate training of reward models. MARGE does not require an extra model, estimating hit rates solely through sampling.
  • vs STaR/ReST: These self-improvement methods only retain ultimately correct trajectories. MARGE can extract valuable intermediate segments even from incorrect trajectories.

Rating

  • Novelty: ⭐⭐⭐⭐ The concept of intermediate state hit-guided exploration is novel, though the overall framework shares similarities with existing self-improvement methods.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple models and benchmarks are used, with solid ablation analysis, though more complete numerical details of the experiments could be provided.
  • Writing Quality: ⭐⭐⭐⭐ The motivation is clearly articulated, and the methodology is presented in a well-structured manner.
  • Value: ⭐⭐⭐⭐ Simultaneously improving accuracy and diversity is a valuable contribution, providing insight into methods for improving mathematical reasoning.