# Note 7: Value-Guided Search - Efficient Chain-of-Thought Reasoning
Conference: NeurIPS 2025 · arXiv: 2504.18428 · Code: GitHub · Area: Other · Keywords: Value model, block-level search, beam search, VGS, inference efficiency
## TL;DR
This paper proposes Value-Guided Search (VGS), which employs a token-level value model to guide block-level beam search without requiring predefined "steps." VGS achieves a +14.5% relative accuracy improvement over majority voting on competition mathematics while reducing inference computation by 30%, outperforming existing PRM-based approaches.
## Background & Motivation
- PRM scalability bottleneck: Existing PRMs require fine-grained step-level annotations (costly to produce, with ambiguously defined "steps") plus extensive Monte Carlo sampling, making them hard to scale to long-horizon reasoning.
- Computational efficiency dilemma: Reasoning models generate long outputs at high cost, so efficient test-time compute strategies are needed to reduce inference FLOPs.
- Advantage of value models: Value prediction requires only final outcome labels rather than step-level annotations. The open question is whether value models can serve as a powerful alternative to PRMs.
- Key design question: How can token-level value signals be applied effectively to block-level search decisions within long chains of thought?
## Method

### Overall Architecture
End-to-end three-stage pipeline: (1) Data collection → (2) Value model training → (3) Block-level search at inference time
### Key Designs
1. Regression-as-classification value model training: Value prediction is cast as a three-way classification task over outcome labels $\kappa \in \{\text{incomplete}, \text{incorrect}, \text{correct}\}$:

$$\mathcal{L}(\theta, \mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{(x, y, z, \kappa) \in \mathcal{B}} \frac{1}{|z|} \sum_{h=1}^{|z|} \ell_{\mathrm{ce}}\big(f_\theta(x, y, z^{:h}), \kappa\big)$$

Crucially, the loss is computed at every token position $h$ within the roll-out ("super-sampling"): since every prefix of an autoregressive sample is itself a sample from the policy, a single roll-out yields $|z|$ training signals at no extra generation cost. The value function is then defined as

$$V_\theta(x) := f_\theta(x)[1] \approx \mathbb{E}_{z \sim \pi_{\mathrm{ref}}}[R(x, z) \mid x],$$

i.e., the classifier's predicted probability of a correct completion directly estimates the expected reward of completed trajectories. A minimal training-loss sketch follows.
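To make the super-sampling objective concrete, here is a minimal PyTorch sketch of the per-token loss. This is our own illustration, not the released code; the backbone interface, the `rollout_mask` convention, and the class ordering are assumptions.

```python
import torch
import torch.nn as nn

OUTCOME_CLASSES = 3  # (incomplete, incorrect, correct) -- ordering assumed

class ValueModel(nn.Module):
    """Causal-LM trunk plus a per-token 3-way classification head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                  # assumed to return (B, T, H) hidden states
        self.head = nn.Linear(hidden_size, OUTCOME_CLASSES)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)         # (B, T, H)
        return self.head(hidden)                  # (B, T, 3) per-token class logits

def value_loss(logits: torch.Tensor, outcome: torch.Tensor,
               rollout_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy at every roll-out position ("super-sampling").

    logits:       (B, T, 3)  per-token class logits
    outcome:      (B,)       final outcome label of each trajectory
    rollout_mask: (B, T)     1.0 on roll-out tokens z, 0.0 on the prompt
    """
    B, T, C = logits.shape
    labels = outcome[:, None].expand(B, T)        # same label broadcast to every position h
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, C), labels.reshape(-1), reduction="none"
    ).view(B, T)
    # Mask-weighted mean over roll-out tokens implements the 1/|z| term;
    # the batch mean implements the 1/|B| term.
    per_example = (ce * rollout_mask).sum(dim=1) / rollout_mask.sum(dim=1).clamp(min=1.0)
    return per_example.mean()
```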
2. Efficient data collection pipeline (a toy sketch follows the cost comparison below):
- Pre-filtering: OpenR1-Math 94k → 50k problems (removing unsolvable and ambiguously specified ones)
- Roll-ins: 14 truncated thinking prefixes sampled per problem, drawn from 4 DeepSeek model scales (1.5B–32B)
- Roll-outs: 4 completions sampled from $\pi_{\mathrm{ref}}$ per roll-in, yielding 14 × 4 = 56 labeled samples per problem
- Post-filtering: problems where no roll-out reaches a correct answer (~10%) are removed
Cost comparison:
| Method | Annotation Granularity | Cost |
|---|---|---|
| PRM800K | Human per step | Extremely high |
| Math-Shepherd | MC per step | Very high |
| Qwen2.5-PRM | LLM-as-Judge per step | High |
| VGS (Ours) | Final label only | Low |
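As a toy sketch of the roll-in/roll-out structure described above (not the paper's actual pipeline; all callables here are placeholders):

```python
import random
from typing import Callable, List, Tuple

def collect_labels(
    problem: str,
    rollin_policies: List[Callable[[str], str]],  # e.g., the 4 DeepSeek scales (1.5B-32B)
    ref_policy: Callable[[str], str],             # pi_ref, used for roll-outs
    grade: Callable[[str, str], str],             # -> "correct" / "incorrect" / "incomplete"
    n_rollins: int = 14,
    n_rollouts: int = 4,
) -> List[Tuple[str, str, str, str]]:
    """Collects (problem, roll-in, roll-out, label) tuples; 14 x 4 = 56 per problem."""
    records = []
    for _ in range(n_rollins):
        policy = random.choice(rollin_policies)   # pick a roll-in model
        full = policy(problem)                    # a complete chain of thought
        cut = random.randint(0, len(full))
        prefix = full[:cut]                       # truncated thinking prefix (roll-in)
        for _ in range(n_rollouts):
            z = ref_policy(problem + prefix)      # completion sampled from pi_ref
            records.append((problem, prefix, z, grade(problem, prefix + z)))
    return records
```

With the defaults above, each problem yields 14 × 4 = 56 labeled tuples, matching the counts in the pipeline description.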
3. Block-level beam search: Generation proceeds in blocks of up to 4,096 tokens (block_size = 4096); at each block boundary, candidate next blocks are sampled independently and scored by the value model. A runnable sketch follows the pseudocode below.
Algorithm 1 – Beam Search (pseudocode):

    initialize B = N / w beams
    while any beam is incomplete:
        sample w candidate blocks from each incomplete beam
        continue each beam with its highest-value candidate block
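A minimal Python rendering of Algorithm 1, with the generator, value scorer, and stopping test abstracted as callables (our sketch, not the released implementation; KV-cache reuse and batching are omitted):

```python
from typing import Callable, List

def vgs_beam_search(
    prompt: str,
    sample_block: Callable[[str], str],   # samples one block (<= 4096 tokens) given a partial trace
    value: Callable[[str], float],        # value-model score of a partial trace
    is_complete: Callable[[str], bool],   # e.g., a final answer has been emitted
    n_total: int = 256,                   # total sampling budget N
    width: int = 8,                       # beam width w
) -> List[str]:
    beams = [prompt] * (n_total // width)             # B = N / w independent beams
    while not all(is_complete(b) for b in beams):
        next_beams = []
        for beam in beams:
            if is_complete(beam):
                next_beams.append(beam)               # finished beams are carried through
                continue
            # Sample w candidate next blocks, keep the highest-value continuation.
            candidates = [beam + sample_block(beam) for _ in range(width)]
            next_beams.append(max(candidates, key=value))
        beams = next_beams
    return beams                                      # N / w finished responses to aggregate
```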
Aggregation strategies (both sketched below):
- Best-of-n (BoN): select the single response with the highest value score
- Weighted Majority Voting (WMV): majority voting over final answers, with each response's vote weighted by its value score
Key finding: WMV outperforms BoN, as it benefits from greater stability across diverse generations.
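Both aggregation rules reduce to a few lines; `extract_answer` and `value` below are placeholder callables (our sketch):

```python
from collections import defaultdict
from typing import Callable, List

def best_of_n(responses: List[str], value: Callable[[str], float]) -> str:
    """BoN: return the single response with the highest value score."""
    return max(responses, key=value)

def weighted_majority_vote(
    responses: List[str],
    value: Callable[[str], float],
    extract_answer: Callable[[str], str],   # parses the final answer out of a response
) -> str:
    """WMV: sum value scores per distinct final answer, return the heaviest answer."""
    weights = defaultdict(float)
    for r in responses:
        weights[extract_answer(r)] += value(r)
    return max(weights, key=weights.get)
```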
## Key Experimental Results

### AIME-25 & HMMT-25 Competition Mathematics
| Model Configuration | AIME-25 Accuracy (%) | HMMT-25 Accuracy (%) | Average (%) | Relative Gain over MV |
|---|---|---|---|---|
| **DeepSeek-1.5B (N = 256 samples)** | | | | |
| Majority Voting (MV) | 38.9 ± 1.9 | 24.3 ± 2.9 | 31.6 ± 1.7 | Baseline |
| VGS (DeepSeek-VM-1.5B) | 46.7 ± 0.7 | 32.8 ± 0.8 | 39.8 ± 0.5 | +26.0% |
| VGS (Qwen-PRM-7B) | 38.9 ± 1.4 | 24.2 ± 0.2 | 31.6 ± 0.7 | -0.1% |
| **DeepSeek-7B (N = 128 samples)** | | | | |
| Majority Voting (MV) | 56.5 ± 1.6 | 33.8 ± 2.5 | 45.2 ± 1.5 | Baseline |
| VGS (DeepSeek-VM-1.5B) | 59.4 ± 0.8 | 41.1 ± 1.6 | 50.3 ± 0.9 | +11.3% |
### Inference Efficiency – FLOPs Required to Reach Target Accuracy
| Target Accuracy | DeepSeek-1.5B MV (generations) | DeepSeek-1.5B VGS (generations) | FLOP Reduction |
|---|---|---|---|
| 35% | 256 | 128 | 50% |
| 37% | 512 | 192 | 62.5% |
| 39% | 1024 | 256 | 75% |
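The reduction column follows directly from the generation budgets, treating FLOPs as proportional to the number of generations (our reading of the table, not a stated formula): for the 37% target, $1 - 192/512 = 62.5\%$.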
### Key Findings
- Outperforms PRMs: DeepSeek-VM-1.5B surpasses 7B PRMs (Qwen2.5-PRM, Math-Shepherd) despite having far fewer parameters, with a 16.2% F1 advantage.
- Aggregation matters: WMV consistently outperforms BoN by 2–4%, underscoring the value of aggregating over many diverse generations rather than trusting a single top-scored response.
- Efficiency gains: Inference FLOPs required to achieve equivalent accuracy are reduced by 50–75%, demonstrating practical deployment feasibility.
- Scaling properties: A single fixed beam width and block size (without per-problem tuning) already yields strong results, simplifying deployment.
## Highlights & Insights
- Methodological simplicity: Avoids the difficulties of fine-grained step definitions required by PRMs; learns end-to-end via token-level classification.
- Data efficiency: Requires no expert annotation or step-wise labeling; only final answers are needed to train a generalizable value model.
- Practical engineering: Data, models, and code are fully open-sourced, lowering barriers to reproduction and application.
- Performance–efficiency balance: Simultaneously improves accuracy (+14.5% relative over MV) and reduces computation (FLOPs ↓ 50–75%).
## Limitations & Future Work
- Evaluation is limited to competition mathematics (AIME/HMMT); applicability to broader scientific reasoning and code generation remains unexplored.
- Beam width and block size are hyperparameters; while fixed values already perform well, problem-adaptive strategies (e.g., per-instance difficulty estimation) have not been investigated.
- The comparison with multi-path sampling search (MPS) remains limited, and the potential of alternative search and aggregation strategies (tree search, graph search) is unexplored.
## Related Work & Insights
- Reasoning models and test-time compute scaling (o1/DeepSeek-R1/o3)
- Process reward models and search-guided reasoning
- Token-level vs. step-level value prediction
## Rating
⭐⭐⭐⭐⭐