
Predicting LLM Reasoning Performance with Small Proxy Model

Conference: ICLR 2026 arXiv: 2509.21013 Code: Available (dataset and code planned for open release) Area: LLM Evaluation Keywords: scaling law, proxy model, reasoning, pre-training data selection, negative log-likelihood

TL;DR

This paper proposes rBridge, which uses reasoning traces from frontier models as gold labels and applies a token-level, task-aligned weighted NLL, enabling small models (≤1B) to reliably predict the reasoning performance of 13B–32B models while achieving over 100× computational savings on dataset ranking tasks.

Background & Motivation

Pre-training large language models is extremely costly, so the field commonly employs small proxy models to evaluate pre-training design choices. Research on scaling laws and dataset ranking invariance has demonstrated the viability of this strategy for general tasks.

Key Challenge: Reasoning ability exhibits emergent behavior, appearing reliably only in sufficiently large models (typically >7B). Small models (≤1B) produce highly noisy results on reasoning benchmarks, or even exhibit reversed trends (e.g., a 1B model's accuracy on MATH500 decreases as training progresses). This means:

  • Directly using small-model accuracy to predict large-model performance fails.
  • Practitioners are forced to use 7B–15B "proxy models," which still incur enormous costs (a single 7B/500B-token training run exceeds $50K USD).

The authors' analysis identifies two fundamental problems:

Evaluation Objective Misalignment: Small models are trained with a next-token prediction (NTP) objective, yet they are evaluated using Acc./Pass@K — discrete metrics entirely different from the training objective.

Task Alignment Deficiency: Even when NLL is used, the choice of gold label is critical: if the gold label is out-of-distribution (OOD), the NLL signal remains noisy. Moreover, standard NLL treats all tokens with equal weight, without distinguishing task-critical tokens from formatting tokens.

Method

Overall Architecture

The core idea of rBridge is to design a proxy evaluation metric \(\text{metric}^p\) that maximizes \(\text{corr}(\text{metric}^p(\pi^p), \text{metric}^t(\pi^t))\). In other words, it seeks a metric computed on small models that correlates strongly with the target metric (Acc./Pass@K) of large models.
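To make the objective concrete, the sketch below computes the correlation between a proxy metric measured on small-model checkpoints and the target accuracy of corresponding large models. The numbers are illustrative placeholders, not values from the paper; the proxy metric here stands in for a negated rBridge NLL (so that higher is better).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical values: proxy metric (e.g. a negated rBridge NLL) on small
# proxy checkpoints vs. target Acc. of the corresponding large models.
proxy_metric  = [-2.10, -1.85, -1.60, -1.42, -1.30]
target_metric = [12.0, 18.5, 24.0, 29.5, 33.0]

print(round(pearson(proxy_metric, target_metric), 3))
```

A good \(\text{metric}^p\) is one for which this correlation stays high across checkpoints, datasets, and scales.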

Key Designs

  1. Reasoning Traces as Gold Labels:

    • The reasoning trace \(R^\phi\) generated by a frontier model \(\pi^\phi\) (excluding the final answer formatting) is used as the gold label.
    • Why this works: (a) \(R^\phi\) is closer to the pre-training data distribution (continuous long-form text), reducing NLL by 74.7%; (b) \(R^\phi\) represents the reasoning process leading to the correct answer, naturally aligning with the target task.
    • Comparison with ScalingBench: ScB uses \(R^\phi + A^\phi\) (trace + answer), but the answer contains formatting artifacts such as \\n, "Final Answer:", and "I hope it is correct.", which are severely OOD.
  2. Token-Level Task-Aligned Weighting: \(\text{rBridge NLL}(\text{token}_i) = -\log\big(p^p(\text{token}_i)\big) \cdot \frac{1}{|\text{token}_i|} \sum_{\text{letter} \in \text{token}_i} p^\phi(\text{letter})\)

    • Each token's NLL is multiplied by the frontier model's confidence (probability) for that token.
    • Intuition: Tokens to which the frontier model assigns high confidence (e.g., "sum modulo 9") represent critical reasoning steps and should receive higher weight; low-confidence tokens (e.g., newlines, numbering) constitute formatting noise and are downweighted.
    • Tokenizer-agnostic: Letter-level probability averaging handles tokenization differences across different tokenizers.
    • MinMax normalization is applied to the weight factors to amplify the effect.
  3. One-Time Upfront Cost: Generating reasoning traces with the frontier model costs less than $10 per benchmark; computing the weights requires only a few seconds of CPU time.
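The weighting scheme above can be sketched in a few lines. This is a minimal illustration of the formula, assuming per-token proxy probabilities and per-letter frontier probabilities are already available; the input values are hypothetical, and the mean-over-tokens reduction is one reasonable aggregation choice, not necessarily the paper's exact implementation.

```python
import math

def rbridge_nll(proxy_probs, frontier_letter_probs):
    """Task-aligned weighted NLL, per the rBridge formula.

    proxy_probs: per-token probabilities from the small proxy model.
    frontier_letter_probs: for each token, the frontier model's per-letter
        probabilities (letter-level averaging keeps the weight
        tokenizer-agnostic across different tokenizers).
    """
    # Raw weight = frontier confidence, averaged over each token's letters.
    weights = [sum(ls) / len(ls) for ls in frontier_letter_probs]
    # MinMax-normalize the weights to [0, 1] to amplify the contrast between
    # reasoning-critical tokens and formatting tokens.
    w_min, w_max = min(weights), max(weights)
    span = w_max - w_min
    weights = [(w - w_min) / span if span else 1.0 for w in weights]
    nlls = [-math.log(p) for p in proxy_probs]
    return sum(n * w for n, w in zip(nlls, weights)) / len(nlls)

# Hypothetical example: token 0 is a reasoning-critical token, token 2 is a
# formatting token the frontier model assigns low confidence; after MinMax
# normalization the formatting token's weight drops to zero.
proxy_probs = [0.20, 0.55, 0.90]
frontier_letter_probs = [[0.95, 0.92, 0.97], [0.60, 0.70], [0.10]]
print(round(rbridge_nll(proxy_probs, frontier_letter_probs), 4))
```

Note how the hard-but-critical token (low proxy probability, high frontier confidence) dominates the score, while the formatting token contributes nothing.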

Loss & Training

This paper does not train new models. Instead, intermediate checkpoints from pre-training are used as proxy models, and rBridge NLL is computed to predict target model performance.

Key Experimental Results

Main Results (1B→13B Proxy-Target Relationship)

| Method | Avg. Train R² ↑ | Avg. Test MAE ↓ |
| --- | --- | --- |
| Acc./Pass@1 | 0.304 | 3.709 |
| iSFT | 0.290 | 5.123 |
| TED | 0.375 | 3.377 |
| MPCA | 0.194 | 302.642 |
| NLL | 0.485 | 5.173 |
| \(R^\phi\) (trace NLL only) | 0.867 | 1.455 |
| rBridge | 0.874 | 1.384 |

  • rBridge achieves the best results in 10 out of 12 settings (6 benchmarks × R²/MAE).
  • 1B→32B: R²=0.826, MAE=1.481, also best overall.
  • 1B→13B+SFT: R²=0.846, MAE=1.304, best overall.

Ablation Study (Dataset Ranking, <100M→1.2B)

| Proxy Scale | rBridge DAcc | Best Baseline DAcc | Computational Savings |
| --- | --- | --- | --- |
| 3.7M (87.3M tokens) | ~70% | ~50% (random level) | 733.4× |
| 6M (81.6M tokens) | ~75% | ~55% | 100.2× |
| 97.9M | ~80% | ~75% | – |

  • At the smallest proxy scale (3.7M), rBridge outperforms baselines by 27% DAcc.
  • To achieve equivalent ranking accuracy, rBridge saves 100.2× to 733.4× FLOPs.
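The DAcc numbers above can be understood as pairwise decision accuracy: the fraction of dataset pairs that the proxy metric ranks the same way as the large-model target. This is a hedged reading of the metric, sketched with made-up scores; ties are assumed not to occur.

```python
from itertools import combinations

def decision_accuracy(proxy_scores, target_scores):
    """Fraction of dataset pairs ordered identically by proxy and target.

    Assumes higher score = better dataset under both metrics, and no ties.
    """
    pairs = list(combinations(range(len(proxy_scores)), 2))
    agree = sum(
        (proxy_scores[i] > proxy_scores[j]) == (target_scores[i] > target_scores[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical scores for 4 candidate pre-training datasets.
proxy  = [0.31, 0.45, 0.28, 0.52]   # e.g. proxy-scale metric per dataset
target = [24.0, 30.5, 25.1, 33.2]   # large-model accuracy per dataset
print(decision_accuracy(proxy, target))  # one pair disagrees here
```

Random guessing gives 50% under this definition, which is why the ~50% baseline at the 3.7M scale amounts to no signal at all.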

Key Findings

  1. rBridge outperforms models 7–13× larger: A 1B model using rBridge surpasses the predictive capability of 7B/13B models evaluated directly via accuracy.
  2. Zero-shot cross-dataset transfer: The rBridge→Acc function fitted on OLMo-Mix-1124 transfers zero-shot to another dataset, achieving perfect ranking on 5 out of 5 benchmarks with MAE of only 0.043–1.417 (one outlier at 9.716).
  3. Ablations confirm each component's contribution: Each step from standard NLL → using \(R^\phi\) → adding weighting → MinMax normalization yields consistent improvements.
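The zero-shot transfer in finding 2 amounts to fitting a rBridge→Acc mapping on one dataset and reusing it unchanged on another. The paper does not specify the functional form here, so the sketch below assumes a simple ordinary-least-squares linear fit with hypothetical numbers.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Fit on proxy checkpoints trained on dataset A (hypothetical numbers)...
nll_a = [1.80, 1.55, 1.32, 1.10]   # rBridge NLL per checkpoint
acc_a = [10.0, 16.0, 22.5, 28.0]   # matching large-model accuracy
a, b = fit_linear(nll_a, acc_a)

# ...then predict large-model Acc. for dataset B zero-shot, with no refit.
nll_b = [1.70, 1.25]
pred_b = [a * x + b for x in nll_b]
print([round(p, 1) for p in pred_b])
```

The slope comes out negative, encoding the expected relationship: a lower rBridge NLL predicts a higher large-model accuracy.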

Highlights & Insights

  • Deep insight: Small models do not "fail" on reasoning benchmarks due to a lack of information, but because the evaluation method (Acc.) is mismatched with the capabilities of small models.
  • NLL returns to its roots: The pre-training objective is NTP, and evaluation should likewise return to NLL — but the key lies in what is used as the gold label.
  • Frontier model probabilities as automatic annotators: Token probabilities naturally encode "which steps are critical reasoning steps," eliminating the need for manual annotation.
  • Highly practical: The two-stage dataset optimization framework (<100M initial screening → 1B fine-grained ranking) has direct industrial application value.

Limitations & Future Work

  • Frontier models are not 100% accurate; a small number of incorrect reasoning traces may introduce noise (though filtering out incorrect traces shows only marginal improvement in experiments).
  • Frontier models occasionally fail to generate outputs in the required format (currently retried only once before being discarded).
  • Zero-shot transfer has only been validated on one dataset pair at the 1B→7B scale, limited by computational budget.
  • Effectiveness on non-reasoning tasks has not been verified (although mature solutions already exist for such tasks).
  • The two-stage optimization framework is proposed as a discussion point only and has not been fully validated experimentally.

Connections & Applicability

  • Complements the scaling law literature (Kaplan et al., Hoffmann et al.): rBridge specifically addresses the proxy difficulties introduced by reasoning emergence.
  • Directly compared with the DataDecide benchmark: rBridge achieves optimal ranking on its tasks at very low cost.
  • Implications for pre-training data mixture optimization (Xie et al., Liu et al.): NLL/perplexity is a suboptimal metric; rBridge is better suited for reasoning-oriented data selection.
  • Broad applicability: Any scenario requiring "using small models to predict large-model behavior" can consider the rBridge paradigm.

Rating

  • Novelty: ⭐⭐⭐⭐ The core insight (NLL + frontier trace + token weighting) is not entirely new, but the combination is elegant and yields significant gains.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three-stage experimental design is rigorous: 6 benchmarks × multiple model scales × ablations × cross-dataset transfer.
  • Writing Quality: ⭐⭐⭐⭐ Structure is clear and problem analysis is progressive, though notation and tables are dense.
  • Value: ⭐⭐⭐⭐⭐ Directly reduces pre-training exploration costs by 100×+, with extremely high industrial deployment value.