Skip to content

Understanding the Role of Training Data in Test-Time Scaling

Conference: ICLR2026
arXiv: 2510.03605
Code: None
Area: LLM Inference
Keywords: Test-time scaling, chain-of-thought, in-context learning, task hardness, overthinking, training data selection

TL;DR

The paper provides a theoretical analysis of how training data attributes influence test-time scaling performance. It proves that CoT reasoning is equivalent to pseudo-Newton iterations, proposes a task hardness metric based on the minimum eigenvalue of feature covariance, reveals the mechanism behind the "overthinking" phenomenon where more computation degrades performance, and identifies the optimal task selection strategy for multi-task training—emphasizing that training sets should be diverse, relevant, and hard.

Background & Motivation

Background: Test-time scaling (e.g., OpenAI o1, DeepSeek R1) enhances reasoning capabilities by allocating more computational resources during inference to generate longer CoT. This has demonstrated significant success in tasks like mathematical competitions and programming.

Core Problem: Despite significant empirical success, it remains unclear under what conditions training data supports test-time scaling—specifically: - Does increasing test-time computation always improve downstream reasoning performance? - Can increased test-time computation reduce computational requirements during training? - What constitutes a "hard" training sample, and why are they beneficial for test-time scaling?

Limitations of Prior Work: Most previous studies on the diversity and difficulty of training data are empirical and lack a rigorous theoretical framework to explain the underlying mechanisms of test-time scaling.

Key Insight: The authors answer these three questions through both theoretical and experimental lenses within the in-context learning (ICL) framework of linear regression.

Method

Overall Architecture

The paper places test-time scaling into an analytically tractable toy world: using in-context learning (ICL) for linear regression as the task, a single-layer linear self-attention (LSA) as the model, and treating multi-step test-time chain-of-thought (CoT) as repeated iterations on this model. The reasoning follows a causal chain: first, framing "inferring weights from a prompt" as an optimizable regression problem to prove that the global optimum of LSA depends solely on a pivot matrix \(\Gamma\) determined by training data covariance; next, proving that running CoT at test time is equivalent to pseudo-Newton iteration using \(\Gamma^{-1}\), thereby translating "thinking quality" into the convergence and rate of this iteration; then, defining task hardness using the covariance spectrum to derive a test-time scaling law that quantifies the trade-off between "more thinking" and "more training data"; finally, demonstrating that this iteration is effective only when the training covariance \(\Gamma\) aligns with the test covariance \(\Sigma\). This explains overthinking while deriving optimal training data selection criteria. All conclusions are derived from the spectral properties of the same covariance matrix.

This paper is a purely theoretical analysis (linear models + spectral derivation), focusing on a sequence of interlocking theorems rather than a data pipeline. Therefore, no architecture diagram is provided; the four key designs below follow the aforementioned causal chain.

Key Designs

1. Analytic Toy World: ICL Weight Prediction and LSA Global Optimum

To strictly analyze test-time scaling, the authors establish a setting with a closed-form solution. Each prompt is \(P_\tau = (x_{\tau,1}, y_{\tau,1}, \ldots, x_{\tau,n}, y_{\tau,n})\), where labels are generated by hidden weights \(y_{\tau,i} = \langle w_\tau, x_{\tau,i}\rangle\). Here \(x_{\tau,i}\sim\mathcal{N}(0,\Lambda)\) and \(w_\tau\sim\mathcal{N}(0,I_d)\). The model must infer \(w_\tau\) from the prompt. The input is arranged as an embedding matrix, with the last column reserved for the weight estimate \(\hat w_0\):

\[E_\tau = \begin{bmatrix} X_\tau & 0 \\ y_\tau & 0 \\ 0_{d \times n} & \hat{w}_0 \\ 0_{1 \times n} & 1 \end{bmatrix}\]

The training objective is to minimize the mean squared error \(L(\theta) = \tfrac{1}{2}\mathbb{E}\big[\| f_{\text{LSA}}(E_\tau;\theta)_{[:,-1]} - (0_d, 0, w_\tau, 1)\|^2\big]\). Theorem 3.1 proves that under appropriate initialization, gradient descent with a constant step size converges to the global optimum \(V_* = -\Gamma^{-1}/c\), where the pivot matrix is defined as \(\Gamma := (1 + \tfrac{1}{n})\Lambda + \tfrac{1}{n}\text{tr}(\Lambda) I_d\). All subsequent theorems are transmitted through \(\Gamma\), which encodes training data attributes (covariance \(\Lambda\), prompt length \(n\)).

2. CoT Equivalent to Pseudo-Newton Method

The first question addressed is what test-time CoT steps actually compute. Proposition 3.2 shows that running \(k\) CoT steps is essentially a recursive weight estimation: \(w_{i+1} = w_i - \tfrac{1}{m}\Gamma^{-1} X_{\text{test}} X_{\text{test}}^\top (w_i - w_{\text{test}})\). This is exactly a pseudo-Newton method applied to the test loss \(\ell(w) = \tfrac{1}{2m}\|y_{\text{test}} - X_{\text{test}}^\top w\|^2\), using the learned \(\Gamma^{-1}\) to approximate the inverse of the true Hessian \(\Lambda^{-1}\). Expanding \(k\) steps yields the closed form \(w_{k+1} = \big(I - (I - \tfrac{1}{m}\Gamma^{-1} X_{\text{test}} X_{\text{test}}^\top)^k\big) w_{\text{test}}\). This equivalence transforms CoT from a black-box operation into an optimization iteration whose convergence speed is determined by the alignment between \(\Gamma\) and test data.

3. Task Hardness and Test-Time Scaling Law

With the "CoT = Iteration" framework, task hardness and compute trade-offs can be quantified. Theorem 3.3 provides an upper bound for error in direct ICL (no CoT): \(\mathbb{E}\|\hat{w} - w_{\text{test}}\|^2 \leq \tfrac{d}{n^2}(1 + \tfrac{\text{tr}(\Lambda)}{\lambda_{\min}(\Lambda)})^2 + \tfrac{d}{m}(1 + \tfrac{\text{tr}(\Lambda)}{\lambda_{\min}(\Lambda)})\). The recurring ratio is defined as task hardness \(\text{Hard}(\Lambda) := \tfrac{u}{\lambda_{\min}(\Lambda)}\). Intuitively, each eigenvector of \(\Lambda\) represents a "skill," and the eigenvalue is the skill strength. Easy tasks rely on balanced skills (eigenvalues are close), while hard tasks have long-tailed distributions (extremely small eigenvalues). Combined with convergence rates, Corollary 3.5 gives the error after \(k\) CoT steps:

\[\mathbb{E}\|w_{k+1} - w_{\text{test}}\|^2 \leq d\,\Big(1 + \tfrac{n}{1 + \text{Hard}(\Lambda)}\Big)^{-2k}(1 + o(1))\]

This scaling law yields three conclusions: for a fixed target error \(\varepsilon\), increasing \(k\) can compensate for shorter training prompt lengths \(n\) (complementarity of training and inference compute); harder tasks (higher \(\text{Hard}(\Lambda)\)) converge slower and require more CoT steps; and the entire process has a complexity of \(O(kd^2)\).

4. Distribution Alignment: Overthinking and Optimal Task Selection

The scaling law assumes the training covariance \(\Gamma\) aligns with the test covariance \(\Sigma\). If this fails, it explains "overthinking" and dictates data selection. The net effect of thinking is controlled by \(\text{tr}\big((I - \Gamma^{-1/2}\Sigma\Gamma^{-1/2})^{2k}\big)\). If certain skill directions in the target task are under-covered in the training data (weak \(\Gamma\) in those directions), the eigenvalues in the brackets exceed 1, causing the term to grow exponentially with \(k\). Overthinking is not the model "getting tired" but the iteration diverging in uncovered subspaces. Consequently, multi-task training should prioritize tasks that fill these gaps. Proposition 4.3 proves that the optimal sampling weights \(\{\pi_\ell\}\) allocate at least half the probability to "hard" tasks (those with small \(\sigma_{\min}(\Lambda_\ell)\)). Solving for the optimal ratio is an efficient quadratic program:

\[\min_{\{\pi_\ell\}} \left\| I - \Sigma^{-1} \sum_{\ell} \Lambda_\ell \pi_\ell \right\|_F^2 \quad \text{s.t.} \sum_\ell \pi_\ell = 1,\ \pi_\ell \geq 0\]

The three criteria for a "good training set" are: diverse (covering all skill directions), relevant (aligned with the target spectrum after weighting), and hard (including samples that contribute to small eigenvalue directions).

Key Experimental Results

LSA Model Validation

Setting Conclusion
Training prompt length \(n=10,20,30\) Increasing \(k\) compensates for shorter training context; \(n=10\) at \(k=20\) reaches the error level of \(n=30\) direct prediction.
Tilted training covariance (\(\lambda_i \propto 1/i\)) When training/test distributions mismatch, test error first drops then rises as \(k\) increases—overthinking occurs.
Large \(n\) is worse during overthinking Contrary to the non-overthinking case, longer training contexts "learn more bias" under tilted distributions.

GPT-2 (9.5M parameters) Validation

Experiment Result
Training \(n=20,30,40\), varying \(k\) Consistent with LSA: longer CoT allows for shorter training contexts to achieve equivalent performance.
Tilted covariance + Identity test GPT-2 exhibits overthinking: error increases for \(k>10\).

Task Selection Experiment

Task Type \((\alpha, B)\) Average Selection Probability
Easy-Short (0.2, 20) Lowest
Hard-Short (0.8, 20) Medium
Easy-Long (0.2, 100) Medium-Low
Hard-Long (0.8, 100) Highest

Real Inference Benchmark (Qwen 2.5-7B)

Model CoT length [0, 1k) CoT length [1k, 2k]
Qwen-Base 30.39% 27.2%
Qwen-GCD (Aligned) 75% (+44.6) 38.4% (+11.2)
Qwen-Poly (Misaligned) 29% (-1.4) 20.83% (-6.4)

Thinking helps when training data is aligned but hurts when misaligned—perfectly validating the theoretical predictions.

Highlights & Insights

  • CoT = Pseudo-Newton Method: Establishes a precise mathematical correspondence between test-time CoT and optimization algorithms, providing a new perspective on reasoning.
  • Theoretical Explanation of Overthinking: Explains why more reasoning sometimes impairs performance—skill directions not covered during training are amplified during iteration.
  • Spectral Definition of Task Hardness: \(\text{Hard}(\Lambda) = \text{tr}(\Lambda)/\lambda_{\min}(\Lambda)\) is a concise and insightful metric.
  • Substitutability of Training and Test Compute: Rigorously proves that test-time compute can compensate for insufficient training context length, guiding resource allocation in practice.

Limitations & Future Work

  • Theoretical Limitation to Linear Models: The analysis is primarily limited to linear regression and LSA; extensions to non-linear tasks and deep Transformers are required.
  • Synthetic Data for GPT-2: Experiments with GPT-2 are still conducted on synthetic data.
  • Spectrum-Dependent Hardness: "Skills" and "feature distributions" are difficult to measure directly in real NLP tasks, leaving a gap between theory and practice.
  • Exclusion of RL Scenarios: The analysis is based on SFT/ICL; the applicability to RL-based reasoning paradigms like o1/R1 remains unknown.
  • vs Snell et al. (2024) / Muennighoff et al. (2025): Previous work empirically demonstrated test-time scaling; this work provides the theoretical foundation.
  • vs Su et al. (2025) (overthinking): Prior work observed overthinking on simple problems empirically; this paper provides a mathematical explanation.
  • vs Huang et al. (2025a): Previous analysis was limited to isotropic features (\(\Lambda = I\)) and training-time CoT; this work extends to general covariance and pure test-time CoT.
  • Insight: System designs should dynamically adjust test-time compute based on training data coverage—increasing compute for well-covered tasks and limiting it for under-covered ones.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic theoretical analysis of the relationship between training data and test-time scaling; the overthinking theory and task selection are novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Theory is well-validated by LSA/GPT-2 synthetic experiments and Qwen benchmarks, though real-world task coverage could be broader.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are rigorous and clear, with intuitive explanations and a logical progression from single-task to multi-task scenarios.
  • Value: ⭐⭐⭐⭐⭐ Highly significant for understanding and improving test-time scaling; task selection strategies are directly applicable to RL reasoning training.