GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

Conference: ICLR 2026
arXiv: 2510.03777
Code: GitHub
Area: LLM Evaluation
Keywords: inference-time scaling, repeated sampling, diversity, concept exploration, pass@k

TL;DR

This paper proposes GuidedSampling, an inference-time algorithm that splits the implicitly coupled exploration and generation of repeated sampling (RS) into two explicit stages: it first iteratively generates diverse problem-solving concepts/theorems, then generates candidate solutions conditioned on each concept. The method improves pass@50 by ~21.6% on average, and models fine-tuned on GuidedSampling trajectories improve pass@5 by ~9.7% on average.

Background & Motivation

Inference-time compute scaling is an important direction for improving LLM performance, often more efficient than scaling model size.
Repeated Sampling (RS) is the simplest inference-time algorithm, yet it suffers from a severe lack of diversity, because LLMs are trained to produce a single correct response for a given input.
Quantitatively, when Llama-3.2-3B generates 100 candidate solutions per HumanEval problem, the solutions draw on only 2.75 distinct concepts on average, and for 37% of problems every attempt uses a single concept.
For example, on a maximization problem from MATH, 892 out of 1,000 RS solutions apply the "AM-GM inequality," most of which lead to incorrect answers.
Tree-of-Thought (ToT) can improve diversity but incurs prohibitively high computational cost, requiring explicit evaluation of candidate thoughts at every step of the tree.
The core motivation is to explicitly separate the implicitly coupled "exploration" and "generation" phases in RS, achieving high diversity at low cost.

Method

Overall Architecture

GuidedSampling operates in two stages:
1. Exploration Phase: Iteratively generates \(K\) diverse concepts/theorems.
2. Generation Phase: Generates \(M\) candidate solutions per concept (total budget \(IC = K \times M\)).

Key Designs

Design 1: Iterative Concept Exploration
- Function: Given problem \(x\), iteratively samples a sequence of concepts \(c_1, c_2, \ldots, c_K\).
- Mechanism: The \(k\)-th concept is conditioned on all preceding concepts: \(c_k \sim p_\theta(\cdot | x, c_{1:(k-1)})\), encouraging the model to explore directions beyond already-generated concepts.
- Design Motivation: Concepts serve as problem-level "high-level guidance" (e.g., theorem names). Since they are explored once and reused, this is far more efficient than ToT's step-by-step evaluation.

Design 2: Concept-Guided Generation
- Function: For each concept \(c_k\), generates \(M\) candidate solutions conditioned on that concept: \(s_k^{(m)} \sim p_\theta(\cdot | x, c_k)\).
- Mechanism: Explicit binding between concepts and solutions ensures that candidate solutions cover diverse problem-solving paths.
- Design Motivation: Overcomes the tendency of RS solutions to collapse onto the same implicit concept. GuidedSampling produces on average 17.63% more unique concepts than RS.

Design 3: GuidedSampling Post-Training
- Function: Uses GuidedSampling-generated trajectories as synthetic training data.
- Mechanism: Two training data formats: FA (final answer only: \((x, s)\)) and CAA (concept + answer: \((x, \text{concat}(\mathcal{C}, s))\)).
- Design Motivation: CAA training internalizes diverse reasoning strategies; after fine-tuning, pass@5 improves by an average of 9.7% and generalizes to OOD benchmarks such as GPQA and HumanEval.

Loss & Training

Post-training uses standard fine-tuning losses:
- FA mode: \(\mathcal{L}_{FA} = -\mathbb{E}_{(x,s) \sim \mathcal{D}_{FA}} [\log P_\theta(s|x)]\)
- CAA mode: \(\mathcal{L}_{CAA} = -\mathbb{E}_{(x,\mathcal{C},s) \sim \mathcal{D}_{CAA}} [\log P_\theta(y|x)]\), where \(y = \text{concat}(\mathcal{C}, s)\)
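
To make the difference between the two data formats concrete, here is a small sketch of how FA and CAA training pairs might be assembled, plus the per-example negative log-likelihood that both losses average. The dictionary keys, the `Concept:` / `Solution:` serialization, and the helper names are assumptions made for illustration; the summary does not specify the paper's actual template.

```python
from typing import Dict, List


def make_fa_example(problem: str, solution: str) -> Dict[str, str]:
    """FA format: the target is the solution only, i.e. the pair (x, s)."""
    return {"prompt": problem, "target": solution}


def make_caa_example(problem: str, concepts: List[str], solution: str) -> Dict[str, str]:
    """CAA format: the target is concat(C, s), concepts first, then the solution."""
    # Assumed serialization: one "Concept:" line per concept, then the solution.
    concept_block = "\n".join(f"Concept: {c}" for c in concepts)
    return {"prompt": problem, "target": f"{concept_block}\nSolution: {solution}"}


def sequence_nll(target_token_logprobs: List[float]) -> float:
    """Negative log-likelihood of one target: -log P(target | prompt).

    The log-probability of a sequence is the sum of its token log-probabilities,
    so both losses reduce to averaging this quantity over the dataset.
    """
    return -sum(target_token_logprobs)


fa = make_fa_example("Maximize xy subject to x + y = 10.", "25")
caa = make_caa_example("Maximize xy subject to x + y = 10.", ["AM-GM inequality"], "25")
print(fa["target"])
print(caa["target"])
```

The only difference between the two modes is what the target string contains; the optimization objective itself is the same next-token likelihood.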

Theoretical guarantee (Theorem 1): GuidedSampling outperforms RS when \(k_{min} \cdot P(\mathcal{C}_r | x) > 1\), i.e., when the model is sufficiently likely to generate relevant concepts \(\mathcal{C}_r\) and conditioning on them provides a large enough amplification factor.
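
Rearranging the stated condition gives a quick way to read it. The value 0.2 below is purely illustrative, not a number from the paper, and the rearrangement assumes \(P(\mathcal{C}_r | x) > 0\):

\[
k_{min} \cdot P(\mathcal{C}_r | x) > 1 \iff k_{min} > \frac{1}{P(\mathcal{C}_r | x)}, \qquad \text{e.g. } P(\mathcal{C}_r | x) = 0.2 \Rightarrow k_{min} > 5 .
\]

The rarer relevant concepts are, the larger the amplification they must provide for the condition to hold.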

Key Experimental Results

Main Results

pass@50 improvement (averaged across Llama-3.2-3B, Qwen2.5-3B, Gemma-3-27B):

Benchmark	RS Baseline	GuidedSampling	Gain
MATH	—	—	+21.8%
GPQA-Diamond	—	—	+11.87%
HumanEval	—	—	+11.28%
OlympiadBench	—	—	+3.08%
Average	—	—	+16.01%
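
For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): given \(n\) samples of which \(c\) are correct, it estimates the probability that at least one of \(k\) randomly chosen samples is correct. The summary does not state which estimator the paper uses, so treat the sketch below as the standard definition rather than the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples drawn for a problem
    c: number of those samples that are correct
    k: number of attempts allowed
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct sample out of 10, five attempts allowed.
print(pass_at_k(n=10, c=1, k=5))
```

Benchmark-level pass@k is then the mean of this quantity over all problems.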

Ablation Study

Post-training pass@5 comparison (Llama-3.2-3B-Instruct):

Training Strategy	MATH	GPQA	HumanEval	OlympiadBench	Average
RS	44.78	40.08	55.78	10.83	37.87
STaR	46.23	38.41	57.35	10.62	38.15
ToT	56.63	44.44	49.51	18.36	42.24
FA (Ours)	47.98	50.61	55.95	20.21	43.69
CAA (Ours)	60.06	40.23	59.03	21.66	45.25
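
As a sanity check on the table above, the Average column can be recomputed from the four benchmark scores in each row. The snippet uses `Decimal` with half-up rounding so that values ending in 5 at the third decimal place round the same way as in the table; the row data is copied from the table, and the helper name is introduced here for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# pass@5 scores per training strategy: MATH, GPQA, HumanEval, OlympiadBench.
rows = {
    "RS": ["44.78", "40.08", "55.78", "10.83"],
    "STaR": ["46.23", "38.41", "57.35", "10.62"],
    "ToT": ["56.63", "44.44", "49.51", "18.36"],
    "FA (Ours)": ["47.98", "50.61", "55.95", "20.21"],
    "CAA (Ours)": ["60.06", "40.23", "59.03", "21.66"],
}


def mean_2dp(scores):
    """Mean of decimal strings, rounded half-up to two decimal places."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


averages = {name: mean_2dp(scores) for name, scores in rows.items()}
for name, avg in averages.items():
    gain = avg - averages["RS"]  # absolute gain over the RS-trained baseline
    print(f"{name:<11} average={avg}  gain over RS={gain}")
```

The recomputed means match the reported Average column for all five rows.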

Diversity analysis: RS produces an average of 4.04 unique concepts vs. 4.75 for GuidedSampling (+17.63%).
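
A diversity statistic of this kind can be computed once every candidate solution has been mapped to a concept label. The sketch below assumes that mapping already exists (the summary says an extractor model produces it) and that labels can be compared after simple lowercasing; both the function name and the normalization are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def concept_diversity(labels_per_problem: Dict[str, List[str]]) -> Tuple[float, float]:
    """Return (average unique concepts per problem, fraction of single-concept problems).

    labels_per_problem maps a problem id to the concept label extracted
    from each of its candidate solutions.
    """
    if not labels_per_problem:
        raise ValueError("need at least one problem")
    # Assumed normalization: trim whitespace and lowercase before comparing labels.
    unique_counts = [
        len({label.strip().lower() for label in labels})
        for labels in labels_per_problem.values()
    ]
    avg_unique = sum(unique_counts) / len(unique_counts)
    single_fraction = sum(1 for n in unique_counts if n == 1) / len(unique_counts)
    return avg_unique, single_fraction


toy = {
    "p1": ["AM-GM", "am-gm", "Cauchy-Schwarz"],
    "p2": ["dynamic programming", "dynamic programming"],
}
print(concept_diversity(toy))
```

The second return value corresponds to statistics such as the share of problems attempted with only one concept.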

Key Findings

GuidedSampling outperforms RS on nearly all model–benchmark combinations. The only exception is Qwen2.5-3B on HumanEval, where performance degrades due to weak concept generation capability in the code domain (averaging only 1.13 concepts).
A sweet spot exists in the exploration–generation allocation: under a fixed total budget, increasing \(K\) initially improves performance but then degrades it as the per-concept generation budget \(M\) becomes insufficient.
Early concepts (\(k=1\)–\(5\)) exhibit higher average quality (19.8%→16.2%), while later concepts (\(k \geq 6\)) contribute critically to a small subset of difficult problems that require deeper exploration.
Domain limitation: On commonsense reasoning (CommonSenseQA), GuidedSampling underperforms RS by 3.28%, suggesting the method is ill-suited to domains where concepts are ill-defined.
The CAA training mode outperforms FA on average (45.25 vs. 43.69 pass@5) and on three of the four benchmarks, although FA is stronger on GPQA (50.61 vs. 40.23). This supports training models on complete trajectories of "concept exploration followed by solution generation."
Concept generation is a one-time sequential pass per problem, so its overhead is small relative to the 100 samples drawn in total.
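
The exploration–generation trade-off noted in the findings can be made concrete by listing the ways a fixed budget splits into \(K\) concepts and \(M\) solutions per concept. The sketch assumes a budget of 100 candidates (the figure mentioned above) and only considers splits where \(K \times M\) equals the budget exactly; the function name is introduced here for illustration.

```python
from typing import List, Tuple


def allocations(budget: int) -> List[Tuple[int, int]]:
    """All (K, M) pairs with K * M == budget, ordered by increasing K."""
    if budget < 1:
        raise ValueError("budget must be positive")
    return [(K, budget // K) for K in range(1, budget + 1) if budget % K == 0]


for K, M in allocations(100):
    print(f"K={K:>3} concepts x M={M:>3} solutions per concept")
```

Small \(K\) leaves many samples per concept but little exploration, while large \(K\) spreads the budget so thin that each concept gets only a few attempts.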

Highlights & Insights

Elegant design philosophy: Substantial gains are achieved simply by decoupling "implicit exploration + generation" into "explicit exploration → guided generation."
Sound theoretical analysis: Theorem 1 precisely characterizes the sufficient conditions under which GuidedSampling outperforms RS; two analytical pathways (concept coverage + irrelevant concept recovery) provide a clear framework.
Dual value of post-training: GuidedSampling serves not only as an inference strategy but also as a high-quality synthetic data generator — CAA fine-tuning significantly improves pass@k.
The AM-GM inequality example is highly compelling: 892 out of 1,000 RS solutions apply the same theorem, most leading to incorrect results.
Strong composability: The method can be combined with RL (e.g., pass@k optimization), majority voting, and other techniques.
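
As one example of this composability, majority voting can be applied directly to the final answers of the concept-guided candidates. The sketch below is a generic majority vote, not a procedure described in the paper; it assumes final answers have already been extracted as strings, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none."""
    cleaned = [a.strip() for a in answers if a.strip()]
    if not cleaned:
        return None
    # most_common(1) returns the answer with the highest count.
    return Counter(cleaned).most_common(1)[0][0]


# Final answers taken from candidates generated under different concepts.
print(majority_vote(["25", " 25", "24", "25", ""]))
```

Because the vote only looks at final answers, it is agnostic to which concept produced each candidate.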

Limitations & Future Work

Significant domain limitations: The method performs poorly on tasks with ill-defined concepts (e.g., commonsense reasoning), restricting its applicability to domains with well-defined concepts or theorems.
Strong model dependency: Qwen2.5-3B generates only 1.13 concepts on HumanEval — models with weak concept generation capability cannot benefit from this approach.
The concept generation phase is sequential and iterative, preventing parallelization and becoming a bottleneck at large \(K\).
Main experiments are conducted primarily on 3B-scale models, with Gemma-3-27B as the only larger model in the pass@50 comparison; performance on other 7B+ models remains to be verified.
Concept quality assessment relies entirely on Qwen2.5-32B for extraction — inaccuracies in the extractor may introduce bias in diversity measurements.

Related Work & Insights

Repeated Sampling (Cobbe et al., 2021): The simplest inference-time scaling approach, but suffers from insufficient diversity.
Tree-of-Thought (Yao et al., 2023): Structured exploration with high computational cost; GuidedSampling achieves a better balance between diversity and efficiency.
Self-Taught Reasoner (STaR) (Zelikman et al., 2022): Fine-tunes on reasoning trajectories but does not explicitly manage diversity.
Inspiration: The exploration–generation decoupling paradigm generalizes to code generation (plan algorithm before implementation), scientific discovery, and related areas.

Rating

Novelty: ⭐⭐⭐⭐ The exploration–generation decoupling idea is concise and effective, though the core insight ("plan your approach before solving") is relatively intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐ The multi-benchmark, multi-model evaluation, theoretical analysis, and post-training experiments are comprehensive, though primarily limited to 3B-scale models.
Writing Quality: ⭐⭐⭐⭐ Well-structured with an excellent motivating example (AM-GM), though certain details (e.g., the precise definition of concepts) could be made more explicit.
Value: ⭐⭐⭐⭐ Practically valuable for inference-time compute scaling, though generality is limited by the requirement for well-defined concepts.