# Predicting LLM Reasoning Performance with Small Proxy Model
Conference: ICLR 2026 · arXiv: 2509.21013 · Code: dataset and code planned for open release · Area: LLM Evaluation · Keywords: scaling law, proxy model, reasoning, pre-training data selection, negative log-likelihood
## TL;DR
This paper proposes rBridge, which uses reasoning traces from frontier models as gold labels and applies token-level task-aligned weighted NLL, enabling small models (≤1B) to effectively predict the reasoning performance of 13B–32B models, achieving over 100× computational savings on dataset ranking tasks.
## Background & Motivation
Pre-training large language models is extremely costly, so the field commonly employs small proxy models to evaluate pre-training design choices. Research on scaling laws and dataset ranking invariance has demonstrated the viability of this strategy for general tasks.
Key Challenge: Reasoning ability exhibits emergent behavior, appearing reliably only in sufficiently large models (typically >7B). Small models (≤1B) produce highly noisy results on reasoning benchmarks, or even exhibit reversed trends (e.g., a 1B model's accuracy on MATH500 decreases as training progresses). This means:
- Directly using small-model accuracy to predict large-model performance → fails.
- Practitioners are forced to use 7B–15B "proxy models," which still incur enormous costs (a single 7B/500B-token training run exceeds $50K USD).
The authors' analysis identifies two fundamental problems:
Evaluation Objective Misalignment: Small models are trained with a next-token prediction (NTP) objective, yet they are evaluated using Acc./Pass@K — discrete metrics entirely different from the training objective.
Task Alignment Deficiency: Even when NLL is used, the choice of gold label is critical: if the gold label is out-of-distribution (OOD), the NLL signal remains noisy. Moreover, standard NLL treats all tokens with equal weight, without distinguishing task-critical tokens from formatting tokens.
## Method

### Overall Architecture
The core idea of rBridge is to design a proxy evaluation metric \(\text{metric}^p\) that maximizes \(\text{corr}(\text{metric}^p(\pi^p), \text{metric}^t(\pi^t))\). That is, to find a metric computed on small models that is highly correlated with the target metric (Acc./Pass@K) of large models.
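As a toy illustration of this objective, the sketch below correlates a hypothetical rBridge-style NLL measured on small-proxy checkpoints with the target model's accuracy at matched checkpoints. All numeric values are invented for illustration; only the correlation structure (lower NLL tracking higher accuracy) reflects the paper's setup.

```python
import numpy as np

# Hypothetical measurements across pre-training checkpoints (illustrative values):
# proxy_metric: rBridge-style NLL computed on a <=1B proxy model
# target_acc:   Acc./Pass@K of the large target model at matched checkpoints
proxy_metric = np.array([2.10, 1.85, 1.62, 1.44, 1.30, 1.21])
target_acc = np.array([4.0, 9.5, 17.2, 26.8, 34.1, 39.0])

# Pearson correlation: the quantity rBridge aims to maximize in magnitude
# (lower NLL should track higher accuracy, so the sign is negative).
r = np.corrcoef(proxy_metric, target_acc)[0, 1]
print(f"corr(metric^p, metric^t) = {r:.3f}")
```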
### Key Designs
- Reasoning Traces as Gold Labels:
    - The reasoning trace \(R^\phi\) generated by a frontier model \(\pi^\phi\) (excluding the final answer formatting) is used as the gold label.
    - Why this works: (a) \(R^\phi\) is closer to the pre-training data distribution (continuous long-form text), reducing NLL by 74.7%; (b) \(R^\phi\) represents the reasoning process leading to the correct answer, naturally aligning with the target task.
    - Comparison with ScalingBench: ScalingBench uses \(R^\phi + A^\phi\) (trace + answer), but the answer contains formatting artifacts such as `\n`, `"Final Answer:"`, and `"I hope it is correct."`, which are severely OOD.
- Token-Level Task-Aligned Weighting: \(\text{rBridge NLL}(\text{token}_i) = -\log p^p(\text{token}_i) \cdot \frac{1}{|\text{token}_i|} \sum_{\text{letter} \in \text{token}_i} p^\phi(\text{letter})\)
    - Each token's NLL is multiplied by the frontier model's confidence (probability) for that token.
    - Intuition: tokens to which the frontier model assigns high confidence (e.g., "sum modulo 9") represent critical reasoning steps and should receive higher weight; low-confidence tokens (e.g., newlines, numbering) constitute formatting noise and are downweighted.
    - Tokenizer-agnostic: letter-level probability averaging handles tokenization differences between the proxy and frontier tokenizers.
    - MinMax normalization is applied to the weights to amplify the contrast between high- and low-confidence tokens.
- One-Time Upfront Cost: Generating reasoning traces with the frontier model costs less than $10 per benchmark; computing the weights requires only a few seconds of CPU time.
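The token-level weighting above can be sketched as follows. This is a minimal illustration, assuming the frontier model's token probabilities have already been spread to individual letters of the trace; the function name, toy trace, and all probability values are hypothetical, not from the paper.

```python
def rbridge_nll(proxy_tokens, proxy_logprobs, frontier_letter_prob):
    """Sketch of rBridge's token-level task-aligned weighted NLL.

    proxy_tokens:         token strings under the *proxy* tokenizer
    proxy_logprobs:       log p^p(token_i) from the proxy model
    frontier_letter_prob: per-character probabilities, i.e., each letter carries
                          the frontier model's probability for the token covering
                          it (letter-level spreading makes weights tokenizer-agnostic)
    """
    # Weight for token_i: mean frontier probability over its letters.
    weights, pos = [], 0
    for tok in proxy_tokens:
        letters = frontier_letter_prob[pos:pos + len(tok)]
        weights.append(sum(letters) / len(tok))
        pos += len(tok)

    # MinMax-normalize the weights to amplify the contrast between
    # task-critical and formatting tokens.
    lo, hi = min(weights), max(weights)
    weights = [(w - lo) / (hi - lo + 1e-8) for w in weights]

    # Weighted NLL: -log p^p(token_i) * weight_i, averaged over tokens.
    return sum(-lp * w for lp, w in zip(proxy_logprobs, weights)) / len(proxy_tokens)

# Toy trace: content tokens get high frontier confidence, the newline gets low
# confidence and is downweighted to (near) zero by MinMax normalization.
tokens = ["The", " sum", " is", " 9", "\n"]
logps = [-1.2, -0.8, -0.5, -2.0, -0.1]
letter_p = [0.9] * 3 + [0.95] * 4 + [0.8] * 3 + [0.99] * 2 + [0.1] * 1
print(f"rBridge NLL = {rbridge_nll(tokens, logps, letter_p):.4f}")
```

Note how the newline's weight collapses to zero after MinMax normalization, so its NLL contributes nothing, while the high-confidence " 9" token dominates.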
### Loss & Training
This paper does not train new models. Instead, intermediate checkpoints from pre-training are used as proxy models, and rBridge NLL is computed to predict target model performance.
## Key Experimental Results

### Main Results (1B→13B Proxy-Target Relationship)
| Method | Avg. Train R² ↑ | Avg. Test MAE ↓ |
|---|---|---|
| Acc./Pass@1 | 0.304 | 3.709 |
| iSFT | 0.290 | 5.123 |
| TED | 0.375 | 3.377 |
| MPCA | 0.194 | 302.642 |
| NLL | 0.485 | 5.173 |
| \(R^\phi\) (trace NLL only) | 0.867 | 1.455 |
| rBridge | 0.874 | 1.384 |
- rBridge achieves the best results in 10 out of 12 settings (6 benchmarks × R²/MAE).
- 1B→32B: R²=0.826, MAE=1.481, also best overall.
- 1B→13B+SFT: R²=0.846, MAE=1.304, best overall.
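The R²/MAE protocol behind these numbers can be sketched as follows. The paper fits a low-parameter function from the proxy metric to target accuracy; here a simple linear fit stands in, and all checkpoint values are invented for illustration.

```python
import numpy as np

# Hypothetical (rBridge NLL, target accuracy) pairs from training checkpoints,
# plus held-out test points; values are illustrative, not the paper's data.
x_train = np.array([2.0, 1.8, 1.6, 1.4, 1.2])
y_train = np.array([8.2, 13.7, 20.1, 26.3, 31.8])
x_test = np.array([1.9, 1.3])
y_test = np.array([11.5, 28.2])

# Fit a simple linear map rBridge-NLL -> Acc (a stand-in for the paper's
# low-parameter fitted function).
slope, intercept = np.polyfit(x_train, y_train, 1)
pred_train = slope * x_train + intercept
pred_test = slope * x_test + intercept

# Train R^2 and test MAE: the two metrics reported in the table above.
ss_res = np.sum((y_train - pred_train) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(y_test - pred_test))
print(f"train R^2 = {r2:.3f}, test MAE = {mae:.3f}")
```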
### Ablation Study (Dataset Ranking, <100M→1.2B)
| Proxy Scale | rBridge DAcc | Best Baseline DAcc | Computational Savings |
|---|---|---|---|
| 3.7M (87.3M tokens) | ~70% | ~50% (random level) | 733.4× |
| 6M (81.6M tokens) | ~75% | ~55% | 100.2× |
| 97.9M | ~80% | ~75% | - |
- At the smallest proxy scale (3.7M), rBridge outperforms baselines by 27% DAcc.
- To achieve equivalent ranking accuracy, rBridge saves 100.2× to 733.4× FLOPs.
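Savings factors like those above follow from straightforward FLOPs accounting. A back-of-envelope sketch under the standard C ≈ 6ND approximation: the 3.7M/87.3M-token configuration comes from the table, but the larger baseline's token budget (and the pairing itself) is hypothetical, so the printed factor is illustrative rather than the paper's exact accounting.

```python
# Training-compute estimate under the common C ~ 6*N*D approximation
# (N = parameters, D = training tokens).
def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

rbridge_proxy = train_flops(3.7e6, 87.3e6)    # smallest rBridge proxy run (from the table)
baseline_proxy = train_flops(97.9e6, 1.5e9)   # baseline scale for similar DAcc (hypothetical token budget)
ratio = baseline_proxy / rbridge_proxy
print(f"FLOPs savings ~ {ratio:.0f}x")
```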
### Key Findings
- rBridge outperforms models 7–13× larger: A 1B model using rBridge surpasses the predictive capability of 7B/13B models evaluated directly via accuracy.
- Zero-shot cross-dataset transfer: The rBridge→Acc function fitted on OLMo-Mix-1124 transfers zero-shot to another dataset, achieving perfect ranking on 5 out of 5 benchmarks with MAE of only 0.043–1.417 (one outlier at 9.716).
- Ablations confirm each component's contribution: Each step from standard NLL → using \(R^\phi\) → adding weighting → MinMax normalization yields consistent improvements.
## Highlights & Insights
- Deep insight: Small models do not "fail" on reasoning benchmarks due to a lack of information, but because the evaluation method (Acc.) is mismatched with the capabilities of small models.
- NLL returns to its roots: The pre-training objective is NTP, and evaluation should likewise return to NLL — but the key lies in what is used as the gold label.
- Frontier model probabilities as automatic annotators: Token probabilities naturally encode "which steps are critical reasoning steps," eliminating the need for manual annotation.
- Highly practical: The two-stage dataset optimization framework (<100M initial screening → 1B fine-grained ranking) has direct industrial application value.
## Limitations & Future Work
- Frontier models are not 100% accurate; a small number of incorrect reasoning traces may introduce noise (though filtering out incorrect traces shows only marginal improvement in experiments).
- Frontier models occasionally fail to generate outputs in the required format (currently retried only once before being discarded).
- Zero-shot transfer has only been validated on one dataset pair at the 1B→7B scale, limited by computational budget.
- Effectiveness on non-reasoning tasks has not been verified (although mature solutions already exist for such tasks).
- The two-stage optimization framework is proposed as a discussion point only and has not been fully validated experimentally.
## Related Work & Insights
- Complements the scaling law literature (Kaplan et al., Hoffmann et al.): rBridge specifically addresses the proxy difficulties introduced by reasoning emergence.
- Directly compared with the DataDecide benchmark: rBridge achieves optimal ranking on its tasks at very low cost.
- Implications for pre-training data mixture optimization (Xie et al., Liu et al.): NLL/perplexity is a suboptimal metric; rBridge is better suited for reasoning-oriented data selection.
- Broad applicability: Any scenario requiring "using small models to predict large-model behavior" can consider the rBridge paradigm.
## Rating
- Novelty: ⭐⭐⭐⭐ The core insight (NLL + frontier trace + token weighting) is not entirely new, but the combination is elegant and yields significant gains.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three-stage experimental design is rigorous: 6 benchmarks × multiple model scales × ablations × cross-dataset transfer.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and problem analysis is progressive, though notation and tables are dense.
- Value: ⭐⭐⭐⭐⭐ Directly reduces pre-training exploration costs by 100×+, with extremely high industrial deployment value.