EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning for LLMs

Conference: ACL 2026
arXiv: 2508.07809
Code: https://github.com/gtxygyzb/EvoCoT
Area: Reinforcement Learning / LLM Reasoning / Curriculum Learning
Keywords: RLVR, Chain-of-Thought, Curriculum Learning, Self-Evolution, Sparse Rewards

TL;DR

This paper proposes EvoCoT, a two-stage self-evolving curriculum learning framework. First, it has the LLM self-generate verifiable CoT trajectories conditioned on the known final answer; then, it progressively expands the exploration space by deleting reasoning steps from the tail of each CoT. This enables stable RLVR training on hard problems with sparse rewards, without relying on teacher models or human-written CoTs. On hard MATH problems, R1-Qwen-1.5B's accuracy rises from 55.7% (GRPO baseline) to 87.8%.

Background & Motivation

Background: RLVR (Reinforcement Learning with Verifiable Reward) has become a mainstream paradigm for LLM post-training—models like DeepSeek-R1 / Kimi-k1.5 rely on it to advance reasoning capabilities. The basic workflow involves rollout sampling, rule-based verification of the final answer, and updating the policy via GRPO/DAPO. However, this method requires rollouts to hit the correct answer to receive a reward.

Limitations of Prior Work: On hard problems, the rollout hit rate is extremely low (after GRPO training, Qwen2.5-7B still fails to solve 8.8% of GSM8K and 22.0% of MATH), resulting in persistent sparse rewards and near-zero learning signals. Existing solutions fall into two categories: (i) Teacher-model dependence (LUFFY / Guide-GRPO / ReLIFT / TAPO / SRFT), which requires distillation from GPT-4o level models and is unsuitable for flagship model training; (ii) Curriculum filtering (RORL / AdaRFT / SEC), which discards hard problems that could otherwise provide learning signals.
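To make the "near-zero learning signal" concrete, the sketch below computes group-relative advantages in the style commonly used by GRPO, where each rollout's reward is normalized by the mean and standard deviation of its group. This is a minimal illustration, not the paper's code: the function name, the `eps` value, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps).

    eps and the population standard deviation are illustrative assumptions;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard problem: none of the 8 rollouts reaches the correct answer.
hard = group_relative_advantages([0.0] * 8)

# Easier problem: 2 of the 8 rollouts are correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```

When every reward in the group is 0, every advantage is 0 as well, so the policy gradient for that problem vanishes. A group with mixed outcomes yields positive advantages for correct rollouts and negative ones for incorrect rollouts.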

Key Challenge: Hard problems are critical for extending the upper bound of reasoning capabilities, yet sparse rewards prevent RLVR from learning from them. Enabling LLMs to learn from hard problems under the constraints of "no teacher model" and "no data filtering" is a fundamental RLVR bottleneck noted by Yue / Zhao (2025).

Goal: (i) Design a distillation-free and unfiltered training framework (EvoCoT is the only method in Table 1 satisfying both); (ii) Provide dense rewards for hard problems on which 0 of the 8 initial rollouts are correct; (iii) Ensure orthogonality with existing GRPO / DAPO, so that EvoCoT works as a post-training enhancement layer.

Key Insight: The authors identify a mismatch between the exploration space and the model's current capability as the key bottleneck (Figure 1). By temporarily constraining the exploration space to a range the model can reach, hard problems can generate dense rewards; the exploration space is then gradually expanded until it matches the difficulty of the target task. This follows curriculum learning principles, but unlike prior work (which requires external difficulty labels or static full CoT data, as in R3/AdaBack), EvoCoT generates its own difficulty gradient.

Core Idea: Transform each hard problem into "self-generated CoT + progressively shortened prefix"—longer CoTs reduce difficulty (only requiring step completion), while shorter ones increase it. Each sample naturally forms an "easy-to-hard" gradient without external labels or teacher CoTs.

Method

Overall Architecture

EvoCoT is an iterative framework that nests two stages. Stage 1 (Answer-Guided Reasoning Path Self-Generation): For each hard problem \((Q,A)\), the LLM generates a candidate CoT \(\hat{C}\) conditioned on the final answer \(A\); a forward consistency check \((Q,\hat{C}) \to \hat{A}\) keeps only candidates with \(\hat{A}=A\), and each retained \(\hat{C}\) is split into steps \(\{\hat{c}_1,\dots,\hat{c}_n\}\). Stage 2 (Step-Wise Curriculum Learning): Steps are removed from the tail, so rollouts start from \((Q,\hat{c}_1,\dots,\hat{c}_n)\), then \((Q,\hat{c}_1,\dots,\hat{c}_{n-1})\), and so on down to \((Q)\) alone, creating a difficulty gradient from "full prefix guidance" to "zero guidance." The prefix is held fixed during each rollout, while the remaining steps are generated freely and optimized via RLVR. The two stages alternate for \(t\) iterations (Equation 5); as LLM capability grows, Stage 1 generates higher-quality CoTs in subsequent rounds, forming a self-evolution loop.
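The tail-removal schedule of Stage 2 can be sketched as a small prompt builder. This is an illustrative sketch under stated assumptions, not the authors' implementation: splitting a CoT into one step per non-empty line and joining the prefix with newlines are both assumptions, as are the function names.

```python
def split_steps(cot: str) -> list[str]:
    """Split a CoT into steps (assumption: one step per non-empty line)."""
    return [line.strip() for line in cot.splitlines() if line.strip()]

def curriculum_prompts(question: str, steps: list[str]) -> list[str]:
    """Return rollout prompts ordered from full prefix (easiest) to no prefix (hardest)."""
    prompts = []
    for k in range(len(steps), -1, -1):
        prefix = steps[:k]  # keep the first k steps, drop the tail
        prompts.append("\n".join([question, *prefix]))
    return prompts

question = "Compute 2 + 3 * 4."
cot = "3 * 4 = 12\n2 + 12 = 14\nThe answer is 14."
steps = split_steps(cot)
prompts = curriculum_prompts(question, steps)
```

A CoT with n steps yields n + 1 prompts. The first contains the question plus all steps, and the last contains only the question, which matches the "full prefix guidance" to "zero guidance" ordering described above.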

Key Designs

  1. Answer-Guided Backward CoT Self-Generation + Answer-Consistency Filtering:

    • Function: Generates credible step-by-step reasoning for hard problems from \((Q,A)\) supervision.
    • Mechanism: By viewing \(Q\) and \(A\) simultaneously, the LLM generates \(\hat{C}\), leveraging the intuition that backward reasoning from a known answer is easier. Forward consistency verification confirms \((Q, \hat{C})\) leads to the correct answer. The process is formalized as \((Q,A) \xrightarrow{\text{LLM}} \hat{C}\) and \((Q,\hat{C}) \xrightarrow{\text{LLM}} \hat{A}\), keeping cases where \(\hat{A}=A\).
    • Design Motivation: (i) Distillation-free (self-generation and self-verification); (ii) Consistency filtering ensures CoTs lead to the correct answer, avoiding "shortcuts"; (iii) Similar to STaR (Zelikman 2022) but used here as scaffolding for RL curriculum.
  2. Step-Wise Reverse-Prefix Curriculum:

    • Function: Converts a single CoT into a difficulty gradient from "full guidance" to "zero guidance."
    • Mechanism: CoT steps are removed in reverse order: \((Q,\hat{c}_1,\dots,\hat{c}_n) \to (Q,\hat{c}_1,\dots,\hat{c}_{n-1}) \to \dots \to (Q)\). At each curriculum stage, the current prefix serves as a fixed rollout anchor (similar to teacher forcing), and the model generates only the continuation. Longer prefixes give high success rates and dense rewards; shorter prefixes raise the exploration difficulty once the model has stabilized on the earlier stages.
    • Design Motivation: Based on two observations: (a) Long-prefix guidance is easier (natural gradient); (b) Progressive shortening prevents reward hacking (copying answers), ensuring the model eventually solves \((Q)\) directly. Unlike R3/AdaBack, EvoCoT does not require external CoT data.
  3. Self-Evolving Iteration:

    • Function: Creates a positive feedback loop between CoT quality and LLM capability.
    • Mechanism: In round \(t\), the current LLM generates \(\hat{C}^{(t)}\), and EvoCoT trains \(\text{LLM}^{(t+1)}\) on \((Q, \hat{C}^{(t)}, A)\); the stronger model then generates better \(\hat{C}^{(t+1)}\).
    • Design Motivation: A single round is limited by initial CoT quality; multiple iterations make the framework robust. EvoCoT remains orthogonal to GRPO / DAPO.
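The three designs above can be tied together in a short control-flow sketch. Everything here is hypothetical scaffolding: `sample_cots`, `solve_with_cot`, and `train_stage2` stand in for the LLM's answer-guided generation, its forward pass, and the RLVR update, and the toy stubs exist only to make the example runnable. Keeping every consistent CoT (rather than one per problem) is also an assumption.

```python
from typing import Callable, Iterable

def consistent_cots(question: str, answer: str, candidates: Iterable[str],
                    solve_with_cot: Callable[[str, str], str]) -> list[str]:
    """Keep candidate CoTs whose forward pass (Q, C_hat) -> A_hat gives A_hat == A."""
    return [c for c in candidates if solve_with_cot(question, c) == answer]

def self_evolve(problems, sample_cots, solve_with_cot, train_stage2, iterations=2):
    """Alternate Stage 1 (generate + filter) and Stage 2 (curriculum RL)."""
    history = []
    for _ in range(iterations):
        dataset = []
        for question, answer in problems:
            candidates = sample_cots(question, answer)  # (Q, A) -> candidate C_hat
            kept = consistent_cots(question, answer, candidates, solve_with_cot)
            dataset.extend((question, cot, answer) for cot in kept)
        train_stage2(dataset)  # placeholder for the RLVR update on shrinking prefixes
        history.append(len(dataset))
    return history

# Toy stubs (hypothetical): canned CoTs and a "solver" that reads the last result.
def toy_sample_cots(question, answer):
    return ["3*4=12\n2+12=14", "2+3=5\n5*4=20"]

def toy_solve(question, cot):
    return cot.splitlines()[-1].split("=")[-1]

trained = []
history = self_evolve([("2+3*4?", "14")], toy_sample_cots, toy_solve, trained.append)
```

In the toy run, only the first canned CoT ends in the reference answer, so the consistency filter keeps one training triple per iteration. In a real system the model would change between iterations, so the candidates from `sample_cots` would differ from round to round.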

Loss & Training

Stage 2 uses RLVR (GRPO by default; compatible with DAPO and PRIME). Rollouts continue from the fixed prefix, and rewards come from verifying the final answer. Stage 1 identifies hard problems in GSM8K + MATH as those with 0 correct answers out of 8 rollouts, and samples 8 CoTs per problem (temperature = 1.0). Experiments ran on 8×A100 (40GB) with a fixed number of training steps. EvoCoT defaults to 2 iterations, since the average score peaks at iteration 2 and dips at iteration 3.
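The hard-problem criterion (0 correct answers out of 8 rollouts) can be sketched as a simple filter. This is a hedged illustration: `rollout` is a hypothetical stand-in for sampling one final answer from the policy, and exact string matching is an assumed simplification of rule-based answer verification.

```python
def count_correct(question, answer, rollout, n_rollouts=8):
    """Count rollouts whose final answer matches the reference exactly."""
    return sum(rollout(question).strip() == answer for _ in range(n_rollouts))

def select_hard(problems, rollout, n_rollouts=8):
    """Keep problems for which none of the n_rollouts samples is correct."""
    return [(q, a) for q, a in problems
            if count_correct(q, a, rollout, n_rollouts) == 0]

# Toy deterministic policy (hypothetical): always returns the same canned answer.
canned = {"1 + 1 = ?": "2", "hard question": "wrong"}
toy_rollout = lambda q: canned[q]

problems = [("1 + 1 = ?", "2"), ("hard question", "42")]
hard_set = select_hard(problems, toy_rollout)
```

With a stochastic policy, a problem lands in the hard set only if all 8 samples miss, so the set depends on both the sampling temperature and the rollout budget.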

Key Experimental Results

Main Results (Cross-model comparison on 6 math benchmarks, pass@1)

| Model + Method | GSM8K | MATH | AIME24 | AMC23 | Minerva | Olympiad | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama3.1-8B + GRPO | 78.5 | 23.1 | 0.0 | 5.0 | 4.4 | 6.2 | 19.5 |
| Llama3.1-8B + EvoCoT | 80.5 | 23.8 | 0.0 | 7.5 | 4.8 | 5.8 | 20.4 |
| DeepSeek-Math-7B + GRPO | 79.8 | 38.7 | 0.0 | 15.0 | 16.2 | 12.4 | 27.0 |
| DeepSeek-Math-7B + EvoCoT | 76.3 | 39.1 | 0.0 | 20.0 | 19.1 | 13.0 | 27.9 |
| Qwen2.5-7B + GRPO (SimpleRL) | 92.4 | 79.7 | 10.0 | 52.5 | 34.6 | 38.1 | 51.2 |
| Qwen2.5-7B + EvoCoT | 91.4 | 76.5 | 20.0 | 60.0 | 37.1 | 35.9 | 53.5 |
| R1-Qwen-1.5B + DeepScaleR (GRPO) | 88.2 | 89.4 | 36.7 | 77.5 | 38.2 | 51.6 | 63.6 |
| R1-Qwen-1.5B + EvoCoT | 88.0 | 89.7 | 40.0 | 87.5 | 42.8 | 52.0 | 66.7 |

R1-Qwen-1.5B + EvoCoT significantly outperforms DeepScaleR on AMC23 (+10pp), AIME24 (+3.3pp), and Minerva (+4.6pp).

Ablation Study (Hard Problems & Self-Evolution)

| Configuration | GSM8K (Hard) | MATH (Hard) | Avg |
| --- | --- | --- | --- |
| Qwen2.5-7B + GRPO | 91.2 | 78.0 | 84.6 |
| Qwen2.5-7B + EvoCoT | 95.4 | 82.7 | 89.1 |
| R1-Qwen-1.5B + GRPO | 80.7 | 55.7 | 68.2 |
| R1-Qwen-1.5B + EvoCoT | 91.9 | 87.8 | 89.9 |
| Llama3.1-8B + GRPO | 84.3 | 21.9 | 53.1 |
| Llama3.1-8B + EvoCoT | 83.6 | 21.9 | 52.8 |
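The per-model gains on the hard subsets follow directly from the table above. The short script below recomputes them in percentage points; the numbers are copied from the table, and the dictionary layout is just one convenient way to hold them.

```python
# (GSM8K Hard, MATH Hard) accuracy per model and method, copied from the table.
hard_acc = {
    "Qwen2.5-7B":   {"GRPO": (91.2, 78.0), "EvoCoT": (95.4, 82.7)},
    "R1-Qwen-1.5B": {"GRPO": (80.7, 55.7), "EvoCoT": (91.9, 87.8)},
    "Llama3.1-8B":  {"GRPO": (84.3, 21.9), "EvoCoT": (83.6, 21.9)},
}

def gains(scores):
    """EvoCoT minus GRPO on each hard subset, rounded to 0.1 percentage points."""
    return tuple(round(e - g, 1) for e, g in zip(scores["EvoCoT"], scores["GRPO"]))

deltas = {model: gains(scores) for model, scores in hard_acc.items()}
for model, (gsm8k, math_) in deltas.items():
    print(f"{model}: GSM8K (Hard) {gsm8k:+.1f}pp, MATH (Hard) {math_:+.1f}pp")
```

The MATH (Hard) gains are +4.7pp for Qwen2.5-7B and +32.1pp for R1-Qwen-1.5B, while Llama3.1-8B is flat on MATH (Hard) and slightly lower on GSM8K (Hard).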

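Self-evolving iteration trend (R1-Qwen-1.5B, average over the 6 benchmarks): iter0 = 63.6 → iter1 = 64.3 → iter2 = 66.7 → iter3 = 65.8. The average peaks at iteration 2 and slips by 0.9pp at iteration 3, indicating that gains saturate and begin to oscillate.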

Key Findings

  • Strong models benefit more: On MATH (Hard), Qwen2.5-7B improved by 4.7pp and R1-Qwen-1.5B by 32.1pp over GRPO. Llama3.1-8B did not improve (21.9 on MATH (Hard) under both methods, and 84.3 → 83.6 on GSM8K (Hard)), as weak self-generated CoTs prevent the self-evolution loop from starting.
  • Consistent rollout success: Figure 2 shows R1-Qwen-1.5B maintains 220+ correct/256 rollouts, whereas GRPO lingers at 0-5. This validates the effectiveness of constraining exploration.
  • EvoCoT > SFT: Qwen2.5-7B SFT reached 36.9, while EvoCoT reached 53.5, suggesting progressive RL internalizes reasoning better than static SFT memorization (Chu 2025).
  • Data Efficiency: Trained only on GSM8K + MATH, EvoCoT comes close to PRIME, which uses 380K data points (53.5 vs. 55.3 average).

Highlights & Insights

  • Backward reasoning as a strong prior: While typical RL derives answers from CoTs, EvoCoT’s "backward+forward" constraint significantly improves CoT synthesis quality in weak supervision settings.
  • Intra-sample curriculum: Unlike RORL (external difficulty estimates) or R3 (external CoTs), EvoCoT generates a difficulty spectrum within each sample.
  • Orthogonality: Does not replace RL algorithms but modifies sample structure as a drop-in enhancement for RLVR pipelines.
  • Capability Baseline: The failure of Llama3.1-8B suggests self-evolution requires a base model to cross a capability threshold, hinting at a potential scaling law for post-training curriculum methods.

Limitations & Future Work

  • Ineffective for weak base models where Stage 1 CoT quality is a hard ceiling.
  • Saturation after 2 iterations prevents indefinite scaling.
  • Verification limited to math; code generation and tool-use are yet to be tested.
  • High training cost due to multiple rollouts and iterations per problem.

Comparison with Related Work

  • vs LUFFY / Guide-GRPO: EvoCoT is self-sufficient with no teacher model dependency.
  • vs RORL / AdaRFT: EvoCoT retains all hard problems rather than filtering them.
  • vs R3 / AdaBack: EvoCoT self-generates CoTs, making it more generalized.
  • vs STaR (Zelikman 2022): EvoCoT applies rationalization to RL curricula with progressive difficulty rather than just SFT.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐