
VinePPO: Refining Credit Assignment in RL Training of LLMs

Conference: ICML2025
arXiv: 2410.01679
Code: McGill-NLP/VinePPO
Area: Reinforcement Learning
Keywords: Credit Assignment, PPO, Monte Carlo, Value Estimation, RLVR, Mathematical Reasoning

TL;DR

VinePPO exploits the fact that a language environment can be reset to any intermediate state. It replaces PPO's value network with Monte Carlo (MC) rollouts, which give unbiased value estimates. On mathematical reasoning tasks it surpasses the peak performance of PPO, GRPO, and RLOO, reaches PPO's peak in less wall-clock time (up to 3x faster), and attains higher test accuracy at the same training accuracy.

Background & Motivation

LLMs require multi-step reasoning to obtain final rewards in tasks like mathematical reasoning, leading to the core challenge of Credit Assignment (CA): not all reasoning steps are equally important, and it is necessary to identify which steps actually contribute to the final outcome.

PPO's Approach: Trains a separate value network (critic) to estimate the expected return of each intermediate state, which is then used to compute the advantage.
Limitations of Prior Work: Methods like GRPO, RLOO, and DPO abandon fine-grained CA and assign the same credit to every token in a response, yet still achieve decent results. This seemingly contradicts the classical RL view that CA is crucial.
Key Finding: The value network in PPO actually performs poorly. Its predictions are severely biased in reasoning tasks, and it does only slightly better than random guessing in a 1-out-of-5 ranking test.

This leads to the core question: if CA is improved rather than discarded, can RL training of LLMs be pushed further?

Method

Core Idea

Language generation environments possess a unique property: the state is simply the concatenation of the token sequence, allowing any intermediate state \(s_t\) to be fed directly back into the model to regenerate subsequent content. This implies that MC rollouts can be performed from any intermediate point to obtain an unbiased estimate of the value of that state.

VinePPO Algorithm

Step 1: Sampling Training Trajectories
For each prompt \(\mathbf{x}\), sample a training trajectory \(\tau\) using the current policy \(\pi_\theta\).

Step 2: MC Value Estimation
For each intermediate state \(s_t\) in the trajectory, resample \(K\) auxiliary trajectories \(\eta_1, \dots, \eta_K \sim \pi_\theta(\cdot | s_t)\) from that state, and compute the MC value estimate:

\[\hat{V}_{\text{MC}}(s_t) = \frac{1}{K} \sum_{k=1}^{K} R(\eta_k)\]
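The estimator above can be sketched with a toy stand-in for the policy and the reward. Everything here is illustrative: `toy_rollout_reward` and its success probability are hypothetical, and in real training the rollouts would come from sampling \(\pi_\theta\) with an inference engine.

```python
import random
from typing import Callable, Sequence


def mc_value(state: Sequence[str],
             rollout_reward: Callable[[Sequence[str], random.Random], float],
             k: int,
             rng: random.Random) -> float:
    """Average the returns of k independent rollouts started from `state`."""
    if k < 1:
        raise ValueError("k must be at least 1")
    returns = [rollout_reward(state, rng) for _ in range(k)]
    return sum(returns) / k


def toy_rollout_reward(state: Sequence[str], rng: random.Random) -> float:
    # Hypothetical environment (an assumption, not from the paper): a rollout
    # ends in a correct answer with a probability that grows with the number
    # of reasoning steps already present in the prefix.
    p_correct = min(1.0, 0.2 + 0.2 * len(state))
    return 1.0 if rng.random() < p_correct else 0.0


rng = random.Random(0)
prefix = ["step 1", "step 2"]  # state = prompt plus the steps generated so far
v_hat = mc_value(prefix, toy_rollout_reward, k=9, rng=rng)
```

Because the state is just a token prefix, "resetting" the environment amounts to passing the same prefix to the sampler `k` times.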

Step 3: Calculating Advantage
Compute the advantage using the MC value estimates instead of the value network:

\[\hat{A}_{\text{MC}}(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}_{\text{MC}}(s_{t+1}) - \hat{V}_{\text{MC}}(s_t)\]
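As a worked instance of this formula, the sketch below computes advantages from a list of MC value estimates. The numbers are made up for illustration, and treating the terminal state's value as 0 with the reward paid only on the last step is an assumption that matches a binary end-of-episode reward.

```python
from typing import List


def mc_advantages(values: List[float],
                  rewards: List[float],
                  gamma: float = 1.0) -> List[float]:
    """A(s_t, a_t) = r(s_t, a_t) + gamma * V(s_{t+1}) - V(s_t).

    values[t]  is the MC estimate for s_t, t = 0..T (values[T] is terminal).
    rewards[t] is r(s_t, a_t), t = 0..T-1.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("need exactly one more value than rewards")
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]


# Illustrative numbers only: three reasoning steps, correct final answer.
values = [0.25, 0.75, 0.5, 0.0]   # last entry: terminal state, value 0
rewards = [0.0, 0.0, 1.0]         # binary reward arrives on the final step
advantages = mc_advantages(values, rewards)
# advantages == [0.5, -0.25, 0.5]
```

The first step raised the estimated chance of success (positive advantage) and the second lowered it (negative advantage), even though the trajectory as a whole was rewarded. With \(\gamma = 1\) the advantages sum to the final return minus \(\hat{V}_{\text{MC}}(s_0)\).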

Step 4: PPO Policy Update
Perform standard PPO clipped policy gradient updates using the computed \(\hat{A}_{\text{MC}}\). Note that the auxiliary trajectories \(\eta_k\) are used solely for value estimation and do not directly participate in policy updates.
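For reference, a minimal sketch of the clipped surrogate loss that consumes these advantages. It is written in plain Python for readability; the `clip_eps=0.2` default is a common choice and an assumption here, and any KL penalty or other regularizer used in practice is omitted.

```python
import math
from typing import Sequence


def ppo_clip_loss(logp_new: Sequence[float],
                  logp_old: Sequence[float],
                  advantages: Sequence[float],
                  clip_eps: float = 0.2) -> float:
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    if not (len(logp_new) == len(logp_old) == len(advantages)):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Per-token log-probabilities under the new and old policy (made-up numbers),
# paired with advantages such as those from the MC estimates.
loss = ppo_clip_loss([-1.0, -0.5, -2.0], [-1.1, -0.5, -1.8], [0.5, -0.25, 0.5])
```

Only the tokens of the training trajectory \(\tau\) appear in this loss; the auxiliary rollouts influence it solely through the advantage values.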

Efficiency Optimization

Step-level Grouping: Share the same advantage for all tokens within the same reasoning step, performing MC estimation at the reasoning-step level rather than the token level to balance precision and efficiency.
Efficient Inference Engine: With an inference engine such as vLLM, a 7B model can generate 5K tokens/s on a single A100 GPU, which keeps the extra rollouts affordable.
No Additional GPU Memory: Eliminates the parameters and optimizer state of the value network (saving approximately 112GB of VRAM for a 7B model).
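Step-level grouping can be sketched as a simple broadcast: one advantage is computed per reasoning step and then copied to every token in that step. The helper name and the example step lengths are hypothetical, and how a response is split into steps is left open here.

```python
from typing import List


def broadcast_step_advantages(step_advantages: List[float],
                              step_lengths: List[int]) -> List[float]:
    """Repeat each step's advantage once per token in that step."""
    if len(step_advantages) != len(step_lengths):
        raise ValueError("one length is required per step")
    token_advantages: List[float] = []
    for adv, n_tokens in zip(step_advantages, step_lengths):
        token_advantages.extend([adv] * n_tokens)
    return token_advantages


# Three reasoning steps of 3, 2, and 4 tokens (illustrative values).
token_adv = broadcast_step_advantages([0.5, -0.25, 0.5], [3, 2, 4])
# token_adv has 9 entries: 0.5 x3, -0.25 x2, 0.5 x4
```

The design trade-off is that MC rollouts are needed only at step boundaries rather than at every token, so the sampling cost scales with the number of steps.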

Comparison with Other Methods

Method	CA Granularity	Value Estimation Method	Additional Overhead
RLOO/GRPO	None (Initial State Only)	Trajectory mean as baseline	None
PPO	Token-level	Learned value network	Extra model + VRAM
VinePPO	Step-level / Token-level	Unbiased MC rollout estimate	Extra sampling time
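To make the first row of the table concrete, here is a sketch of a leave-one-out baseline in the style of RLOO, written from the commonly used formulation rather than from any particular codebase. Each response gets a single scalar advantage that all of its tokens share, in contrast to the per-step advantages VinePPO computes.

```python
from typing import List


def rloo_advantages(returns: List[float]) -> List[float]:
    """Advantage of each response = its return minus the mean return
    of the other responses sampled for the same prompt."""
    k = len(returns)
    if k < 2:
        raise ValueError("leave-one-out needs at least two responses")
    total = sum(returns)
    return [r - (total - r) / (k - 1) for r in returns]


# Four responses to one prompt with binary rewards (illustrative).
adv = rloo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct responses get +2/3, incorrect ones get -2/3; every token inside
# a response inherits that single number.
```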

Key Experimental Results

Experimental Setup

Models: DeepSeekMath 7B, RhoMath 1.1B (full-parameter fine-tuning)
Datasets: MATH (competition-level mathematics), GSM8K (grade-school mathematics)
Rewards: Binary correctness reward (correct/incorrect answer)
Fair Comparison: All methods consume the same number of episodes (64 trajectories per problem).

Main Results

Method	MATH (7B)	GSM8K (7B)	CA Presence
RestEM	Low	Low	❌
DPO+	Medium	Medium	❌
GRPO	Medium	Medium	❌
RLOO	Medium	Medium	❌
PPO	Medium-High	Medium-High	✅ (Value Net)
VinePPO	Highest	Highest	✅ (MC)

Computational Efficiency

RhoMath 1.1B: VinePPO reaches the peak accuracy of PPO in one third of the wall-clock time (3x faster) and with 9x fewer gradient steps.
DeepSeekMath 7B: VinePPO reaches the peak accuracy of PPO 1.51x faster in wall-clock time and with 2.8x fewer gradient steps.
Although VinePPO is slower per iteration (5x slower for 1.1B, 2x slower for 7B), each iteration yields a much larger improvement, so it still comes out ahead in total time.

Value Network Analysis

Prediction Accuracy: PPO value network accuracy is \(\le 65\%\), while VinePPO's MC estimation reaches 70-90%.
1-out-of-5 Ranking: The PPO value network remains close to the random baseline for most of the training process, whereas VinePPO consistently maintains high accuracy.
Position in the Reasoning Chain: PPO's error increases in the later stages of reasoning (generalization failure), whereas VinePPO's error decreases there, since a longer context makes the remaining generation more deterministic.

Ablation Study of K (RhoMath 1.1B, MATH)

K	Effect
1	Already outperforms PPO
3	Further improvement
9	Best (default setting)

A larger \(K\) yields lower variance and superior computational efficiency (converges in fewer iterations).
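The variance claim can be checked with a short calculation. Assuming a binary reward and independent rollouts, each rollout is a Bernoulli draw with success probability \(p = V(s_t)\), so the mean of \(K\) rollouts has variance \(p(1-p)/K\). The value `p = 0.5` below is just the worst case, chosen for illustration.

```python
import math


def mc_standard_error(p: float, k: int) -> float:
    """Standard deviation of the mean of k i.i.d. Bernoulli(p) rewards."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    return math.sqrt(p * (1.0 - p) / k)


# Worst-case state (p = 0.5) at the three K values from the ablation.
errors = {k: mc_standard_error(0.5, k) for k in (1, 3, 9)}
# K = 1 gives 0.5; each tripling of K divides the error by sqrt(3),
# so K = 9 gives one third of the K = 1 error.
```

The estimate is unbiased for every \(K\), including \(K = 1\); extra rollouts buy lower noise, not lower bias.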

Highlights & Insights

Precise Problem Diagnosis: Systematically analyzes why the PPO value network fails: insufficient generalization in the later stages of the reasoning chain, severe prediction bias, and ranking accuracy that stays close to the random baseline.
Simple Yet Effective Method: Modifies only one component—the advantage estimation in PPO—leaving the rest completely unchanged, perfectly isolating the impact of CA.
Optimal Generalization Gradient: VinePPO achieves the highest test accuracy for the same training accuracy, indicating that precise CA enables the model to extract more generalization signals rather than memorization.
Leveraging Environmental Characteristics: Ingeniously exploits the deterministic transition property of the language environment (where state equals token concatenation) to make the Vine/MC methods, which are typically impractical in traditional RL, feasible.
Memory Friendly: Eliminates the value network, saving 112GB of VRAM for a 7B model.
Origin of the Name: Derived from the "Vine" variant of TRPO (Schulman et al., 2015). The original authors posited that this variant is only applicable to environments that allow intermediate resets—which language generation perfectly satisfies.

Limitations & Future Work

Sampling Overhead: MC rollouts increase sampling time, especially for smaller models (5x slower for 1.1B). Large-scale application requires more efficient sampling strategies.
Selection of \(K\): Larger \(K\) is better but slower; there is no adaptive mechanism for choosing \(K\).
Limited to Mathematical Reasoning Validation: Only evaluated on MATH and GSM8K without testing on other reasoning scenarios such as code generation or web navigation.
Step-Level Grouping Assumption: Sharing a single advantage across all tokens in a reasoning step may lose fine-grained signals within that step.
Binary Reward Limitation: Only uses 0/1 correctness rewards without exploring continuous or process rewards.
Lack of Integration with PRMs/ORMs: Does not explore the complementary relationship between MC value estimation and Process Reward Models (PRMs) or Outcome Reward Models (ORMs).

Related Work

GRPO/RLOO (Shao et al., 2024; Ahmadian et al., 2024): Simplified methods that abandon CA, using the average trajectory return as a baseline.
TRPO Vine (Schulman et al., 2015): The source of VinePPO's name and core idea; its "vine" procedure estimates values with MC rollouts branched from intermediate states.
AlphaGo/AlphaZero (Silver et al., 2016, 2017): Integrates MC rollouts with a value network in Go, but focuses on inference-time search rather than credit assignment during training.
SFT memorizes, RL generalizes (Chu et al., 2025): Related evidence that is consistent with the generalization findings in this work.

Rating

Novelty: ⭐⭐⭐⭐ — The core idea is simple and elegant, successfully transferring the Vine TRPO concept to LLM RL training.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablation studies, deep analysis of the value network's failures, and strict control of variables.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear logic and excellent visualizations, featuring a very smooth narrative from problem definition to diagnosis, mechanism design, and verification.
Value: ⭐⭐⭐⭐⭐ — Provides a simple yet effective solution to credit assignment in RLVR (Reinforcement Learning with Verifiable Rewards), offering valuable insights for subsequent works like DeepSeek-R1.