# Predicting LLM Reasoning Performance with Small Proxy Model
Conference: ICLR 2026 · arXiv: 2509.21013 · Code: dataset and code planned for open release · Area: LLM Evaluation · Keywords: scaling law, proxy model, reasoning, pre-training data selection, negative log-likelihood
## TL;DR
This paper proposes rBridge, which uses reasoning traces from frontier models as gold labels and applies token-level task-aligned weighted NLL, enabling small models (≤1B) to effectively predict the reasoning performance of 13B–32B models, achieving over 100× computational savings on dataset ranking tasks.
## Background & Motivation
Pre-training large language models is extremely costly, so the field commonly employs small proxy models to evaluate pre-training design choices. Research on scaling laws and dataset ranking invariance has demonstrated the viability of this strategy for general tasks.
Key Challenge: Reasoning ability exhibits emergent behavior, appearing reliably only in sufficiently large models (typically >7B). Small models (≤1B) produce highly noisy results on reasoning benchmarks, or even exhibit reversed trends (e.g., a 1B model's accuracy on MATH500 decreases as training progresses). This means:
- Directly using small-model accuracy to predict large-model performance → fails.
- Practitioners are forced to use 7B–15B "proxy models," which still incur enormous costs (a single 7B/500B-token training run exceeds $50K USD).
The authors' analysis identifies two fundamental problems:
Evaluation Objective Misalignment: Small models are trained with a next-token prediction (NTP) objective, yet they are evaluated using Acc./Pass@K — discrete metrics entirely different from the training objective.
Task Alignment Deficiency: Even when NLL is used, the choice of gold label is critical: if the gold label is out-of-distribution (OOD), the NLL signal remains noisy. Moreover, standard NLL treats all tokens with equal weight, without distinguishing task-critical tokens from formatting tokens.
## Method

### Overall Architecture
The core idea of rBridge is to design a proxy evaluation metric \(\text{metric}^p\) that maximizes \(\text{corr}(\text{metric}^p(\pi^p), \text{metric}^t(\pi^t))\). That is, to find a metric computed on small models that is highly correlated with the target metric (Acc./Pass@K) of large models.
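As a toy illustration of this objective, the sketch below correlates a hypothetical rBridge-style NLL measured on small-proxy checkpoints with the target model's accuracy at matched checkpoints. All numeric values are invented for illustration; only the correlation structure (lower NLL tracking higher accuracy) reflects the paper's setup.

```python
import numpy as np

# Hypothetical measurements across pre-training checkpoints (illustrative values):
# proxy_metric: rBridge-style NLL computed on a <=1B proxy model
# target_acc:   Acc./Pass@K of the large target model at matched checkpoints
proxy_metric = np.array([2.10, 1.85, 1.62, 1.44, 1.30, 1.21])
target_acc = np.array([4.0, 9.5, 17.2, 26.8, 34.1, 39.0])

# Pearson correlation: the quantity rBridge aims to maximize in magnitude
# (lower NLL should track higher accuracy, so the sign is negative).
r = np.corrcoef(proxy_metric, target_acc)[0, 1]
print(f"corr(metric^p, metric^t) = {r:.3f}")
```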
### Key Designs
- Reasoning Traces as Gold Labels:
    - The reasoning trace \(R^\phi\) generated by a frontier model \(\pi^\phi\) (excluding the final answer formatting) is used as the gold label.
    - Why this works: (a) \(R^\phi\) is closer to the pre-training data distribution (continuous long-form text), reducing NLL by 74.7%; (b) \(R^\phi\) represents the reasoning process leading to the correct answer, naturally aligning with the target task.
    - Comparison with ScalingBench: ScalingBench uses \(R^\phi + A^\phi\) (trace + answer), but the answer contains formatting artifacts such as `\n`, `"Final Answer:"`, and `"I hope it is correct."`, which are severely OOD.
- Token-Level Task-Aligned Weighting: \(\text{rBridge NLL}(\text{token}_i) = -\log p^p(\text{token}_i) \cdot \frac{1}{|\text{token}_i|} \sum_{\text{letter} \in \text{token}_i} p^\phi(\text{letter})\)
    - Each token's NLL is multiplied by the frontier model's confidence (probability) for that token.
    - Intuition: tokens to which the frontier model assigns high confidence (e.g., "sum modulo 9") represent critical reasoning steps and should receive higher weight; low-confidence tokens (e.g., newlines, numbering) constitute formatting noise and are downweighted.
    - Tokenizer-agnostic: letter-level probability averaging handles tokenization differences between the proxy and frontier tokenizers.
    - MinMax normalization is applied to the weights to amplify the contrast between high- and low-confidence tokens.
- One-Time Upfront Cost: Generating reasoning traces with the frontier model costs less than $10 per benchmark; computing the weights requires only a few seconds of CPU time.
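The token-level weighting above can be sketched as follows. This is a minimal illustration, assuming the frontier model's token probabilities have already been spread to individual letters of the trace; the function name, toy trace, and all probability values are hypothetical, not from the paper.

```python
def rbridge_nll(proxy_tokens, proxy_logprobs, frontier_letter_prob):
    """Sketch of rBridge's token-level task-aligned weighted NLL.

    proxy_tokens:         token strings under the *proxy* tokenizer
    proxy_logprobs:       log p^p(token_i) from the proxy model
    frontier_letter_prob: per-character probabilities, i.e., each letter carries
                          the frontier model's probability for the token covering
                          it (letter-level spreading makes weights tokenizer-agnostic)
    """
    # Weight for token_i: mean frontier probability over its letters.
    weights, pos = [], 0
    for tok in proxy_tokens:
        letters = frontier_letter_prob[pos:pos + len(tok)]
        weights.append(sum(letters) / len(tok))
        pos += len(tok)

    # MinMax-normalize the weights to amplify the contrast between
    # task-critical and formatting tokens.
    lo, hi = min(weights), max(weights)
    weights = [(w - lo) / (hi - lo + 1e-8) for w in weights]

    # Weighted NLL: -log p^p(token_i) * weight_i, averaged over tokens.
    return sum(-lp * w for lp, w in zip(proxy_logprobs, weights)) / len(proxy_tokens)

# Toy trace: content tokens get high frontier confidence, the newline gets low
# confidence and is downweighted to (near) zero by MinMax normalization.
tokens = ["The", " sum", " is", " 9", "\n"]
logps = [-1.2, -0.8, -0.5, -2.0, -0.1]
letter_p = [0.9] * 3 + [0.95] * 4 + [0.8] * 3 + [0.99] * 2 + [0.1] * 1
print(f"rBridge NLL = {rbridge_nll(tokens, logps, letter_p):.4f}")
```

Note how the newline's weight collapses to zero after MinMax normalization, so its NLL contributes nothing, while the high-confidence " 9" token dominates.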
### Loss & Training
This paper does not train new models. Instead, intermediate checkpoints from pre-training are used as proxy models, and rBridge NLL is computed to predict target model performance.
## Key Experimental Results

### Main Results (1B→13B Proxy-Target Relationship)
| Method | Avg. Train R² ↑ | Avg. Test MAE ↓ |
|---|---|---|
| Acc./Pass@1 | 0.304 | 3.709 |
| iSFT | 0.290 | 5.123 |
| TED | 0.375 | 3.377 |
| MPCA | 0.194 | 302.642 |
| NLL | 0.485 | 5.173 |
| \(R^\phi\) (trace NLL only) | 0.867 | 1.455 |
| rBridge | 0.874 | 1.384 |
- rBridge achieves the best results in 10 out of 12 settings (6 benchmarks × R²/MAE).
- 1B→32B: R²=0.826, MAE=1.481, also best overall.
- 1B→13B+SFT: R²=0.846, MAE=1.304, best overall.
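The R²/MAE protocol behind these numbers can be sketched as follows. The paper fits a low-parameter function from the proxy metric to target accuracy; here a simple linear fit stands in, and all checkpoint values are invented for illustration.

```python
import numpy as np

# Hypothetical (rBridge NLL, target accuracy) pairs from training checkpoints,
# plus held-out test points; values are illustrative, not the paper's data.
x_train = np.array([2.0, 1.8, 1.6, 1.4, 1.2])
y_train = np.array([8.2, 13.7, 20.1, 26.3, 31.8])
x_test = np.array([1.9, 1.3])
y_test = np.array([11.5, 28.2])

# Fit a simple linear map rBridge-NLL -> Acc (a stand-in for the paper's
# low-parameter fitted function).
slope, intercept = np.polyfit(x_train, y_train, 1)
pred_train = slope * x_train + intercept
pred_test = slope * x_test + intercept

# Train R^2 and test MAE: the two metrics reported in the table above.
ss_res = np.sum((y_train - pred_train) ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(y_test - pred_test))
print(f"train R^2 = {r2:.3f}, test MAE = {mae:.3f}")
```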
### Ablation Study (Dataset Ranking, <100M→1.2B)
| Proxy Scale | rBridge DAcc | Best Baseline DAcc | Computational Savings |
|---|---|---|---|
| 3.7M (87.3M tokens) | ~70% | ~50% (random level) | 733.4× |
| 6M (81.6M tokens) | ~75% | ~55% | 100.2× |
| 97.9M | ~80% | ~75% | - |
- At the smallest proxy scale (3.7M), rBridge outperforms baselines by 27% DAcc.
- To achieve equivalent ranking accuracy, rBridge saves 100.2× to 733.4× FLOPs.
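Savings factors like those above follow from straightforward FLOPs accounting. A back-of-envelope sketch under the standard C ≈ 6ND approximation: the 3.7M/87.3M-token configuration comes from the table, but the larger baseline's token budget (and the pairing itself) is hypothetical, so the printed factor is illustrative rather than the paper's exact accounting.

```python
# Training-compute estimate under the common C ~ 6*N*D approximation
# (N = parameters, D = training tokens).
def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

rbridge_proxy = train_flops(3.7e6, 87.3e6)    # smallest rBridge proxy run (from the table)
baseline_proxy = train_flops(97.9e6, 1.5e9)   # baseline scale for similar DAcc (hypothetical token budget)
ratio = baseline_proxy / rbridge_proxy
print(f"FLOPs savings ~ {ratio:.0f}x")
```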
### Key Findings
- rBridge outperforms models 7–13× larger: A 1B model using rBridge surpasses the predictive capability of 7B/13B models evaluated directly via accuracy.
- Zero-shot cross-dataset transfer: The rBridge→Acc function fitted on OLMo-Mix-1124 transfers zero-shot to another dataset, achieving perfect ranking on 5 out of 5 benchmarks with MAE of only 0.043–1.417 (one outlier at 9.716).
- Ablations confirm each component's contribution: Each step from standard NLL → using \(R^\phi\) → adding weighting → MinMax normalization yields consistent improvements.
## Highlights & Insights
- Deep insight: Small models do not "fail" on reasoning benchmarks due to a lack of information, but because the evaluation method (Acc.) is mismatched with the capabilities of small models.
- NLL returns to its roots: The pre-training objective is NTP, and evaluation should likewise return to NLL — but the key lies in what is used as the gold label.
- Frontier model probabilities as automatic annotators: Token probabilities naturally encode "which steps are critical reasoning steps," eliminating the need for manual annotation.
- Highly practical: The two-stage dataset optimization framework (<100M initial screening → 1B fine-grained ranking) has direct industrial application value.
## Limitations & Future Work
- Frontier models are not 100% accurate; a small number of incorrect reasoning traces may introduce noise (though filtering out incorrect traces shows only marginal improvement in experiments).
- Frontier models occasionally fail to generate outputs in the required format (currently retried only once before being discarded).
- Zero-shot transfer has only been validated on one dataset pair at the 1B→7B scale, limited by computational budget.
- Effectiveness on non-reasoning tasks has not been verified (although mature solutions already exist for such tasks).
- The two-stage optimization framework is proposed as a discussion point only and has not been fully validated experimentally.
## Related Work & Insights
- Complements the scaling law literature (Kaplan et al., Hoffmann et al.): rBridge specifically addresses the proxy difficulties introduced by reasoning emergence.
- Directly compared with the DataDecide benchmark: rBridge achieves optimal ranking on its tasks at very low cost.
- Implications for pre-training data mixture optimization (Xie et al., Liu et al.): NLL/perplexity is a suboptimal metric; rBridge is better suited for reasoning-oriented data selection.
- Broad applicability: Any scenario requiring "using small models to predict large-model behavior" can consider the rBridge paradigm.
## Rating
- Novelty: ⭐⭐⭐⭐ The core insight (NLL + frontier trace + token weighting) is not entirely new, but the combination is elegant and yields significant gains.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three-stage experimental design is rigorous: 6 benchmarks × multiple model scales × ablations × cross-dataset transfer.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and problem analysis is progressive, though notation and tables are dense.
- Value: ⭐⭐⭐⭐⭐ Directly reduces pre-training exploration costs by 100×+, with extremely high industrial deployment value.