Scaling Up Active Testing to Large Language Models¶
- Conference: NeurIPS 2025
- arXiv: 2508.09093
- Code: GitHub
- Area: LLM/NLP
- Keywords: active testing, LLM evaluation, risk estimation, surrogate model, label efficiency
TL;DR¶
By introducing three key simplifications—constructing a fixed surrogate model via in-context learning, using a small surrogate model to evaluate a large target model, and eliminating the need for target model predictions during data acquisition—this work scales active testing to LLMs, reducing risk estimation error by 25%–80% relative to random sampling.
Background & Motivation¶
Background: Frontier models are increasingly complex; annotation costs are high and evaluation data may leak into training sets, necessitating continuous and dynamic collection of new evaluation data.
Limitations of Prior Work: Existing active testing methods require iterative gradient-based retraining of surrogate models and inference over the entire pool with both surrogate and target models, making computational costs prohibitive for LLM-scale applications.
Core Problem: How to substantially reduce computational cost while preserving the effectiveness of active testing, enabling it to scale to 70B-parameter LLMs.
Method¶
Overall Architecture¶
Active testing improves risk estimation \(R = \mathbb{E}[\ell(f(x), y)]\) by intelligently selecting which test inputs to annotate. This paper addresses three computational bottlenecks.
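For reference, here is a minimal sketch (with assumed names, not the paper's code) of the random-sampling baseline that active testing improves upon; `annotate` is a hypothetical labeling oracle and `target_loss` a hypothetical per-example loss of the target model:

```python
import numpy as np

def random_sampling_risk(pool_inputs, annotate, target_loss, M, seed=None):
    """Baseline estimate of R = E[loss(f(x), y)]: annotate M uniformly
    sampled pool inputs and average the target model's losses on them."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pool_inputs), size=M, replace=False)
    losses = [target_loss(pool_inputs[i], annotate(pool_inputs[i])) for i in idx]
    return float(np.mean(losses))
```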
Key Design 1: Fixed Surrogate Model (Resolving the Training Bottleneck)¶
Conventional methods retrain the surrogate model after each newly acquired label. Instead, this work (see the sketch after this list):

- Constructs the surrogate model once from a small set of initial labeled data via in-context learning
- Keeps the surrogate model fixed thereafter
- Completely eliminates the in-loop gradient training overhead
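A minimal sketch of how such a fixed in-context surrogate could be constructed; this is not the authors' code, and `score_labels` is a hypothetical helper that returns the surrogate LM's probability for each candidate label given a prompt:

```python
def build_icl_surrogate(initial_examples, labels, score_labels):
    """initial_examples: list of (text, label) pairs with known labels,
    used once as in-context demonstrations; the surrogate is never retrained."""
    demos = "\n\n".join(f"Text: {x}\nLabel: {y}" for x, y in initial_examples)

    def surrogate_predict(x):
        prompt = f"{demos}\n\nText: {x}\nLabel:"
        return score_labels(prompt, labels)  # dict: label -> probability

    return surrogate_predict  # fixed surrogate, no in-loop gradient updates
```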
Key Design 2: Small Surrogate Model (Resolving the Inference Cost Bottleneck)¶
The surrogate model can be substantially smaller than the target model:

- A 7B model serves as the surrogate for evaluating a 70B target model
- Models as small as Gemma3-4B or Phi-2 can evaluate Llama-2-70B
- Older models (Llama 2) can effectively serve as surrogates for newer models (Gemma 3)
Key Design 3: No Target Model Predictions Required (Resolving the \(N\)-Inference Bottleneck)¶
The surrogate model is used to approximate the target model's predictions (a sketch of the resulting acquisition step follows this list):

- The acquisition function is simplified from the cross-entropy \(H[\pi_m(y|x) \| p_f(y|x)]\) between surrogate and target to the surrogate's predictive entropy \(H[\pi_m(y|x)]\)
- The number of target-model inference calls drops from \(N\) (the full pool) to \(M\) (the annotated subset), with \(M \ll N\)
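Under these simplifications the acquisition step only touches the surrogate: compute its predictive entropy for every pool input and sample annotation candidates in proportion to it. The sketch below reuses the hypothetical `surrogate_predict` from above and is an illustration, not the authors' implementation:

```python
import numpy as np

def predictive_entropy(label_probs):
    """H[pi_m(y|x)] for one input, given a dict label -> probability."""
    p = np.asarray(list(label_probs.values()), dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def acquisition_distribution(pool_inputs, surrogate_predict, eps=1e-8):
    """q(i): probability of selecting pool input i for annotation.
    Only the (small, fixed) surrogate is run over the full pool."""
    scores = np.array([predictive_entropy(surrogate_predict(x)) for x in pool_inputs])
    scores = scores + eps  # keep strictly positive so every input can be drawn
    return scores / scores.sum()
```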
Risk Estimation¶
Unbiased risk estimation is performed using LURE (Levelled Unbiased Risk Estimator). After acquiring \(M\) labels from a pool of \(N\) inputs with acquisition probabilities \(q(i_m)\), the estimate is

\[
\hat{R}_{\mathrm{LURE}} = \frac{1}{M} \sum_{m=1}^{M} v_m \, \ell\big(f(x_{i_m}), y_{i_m}\big),
\qquad
v_m = 1 + \frac{N - M}{N - m}\left(\frac{1}{(N - m + 1)\, q(i_m)} - 1\right),
\]

where the importance weights \(v_m\) correct for non-uniform sampling.
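A minimal sketch of the LURE computation, assuming the target model's loss and the acquisition probability \(q(i_m)\) were recorded at each of the \(M\) acquisition steps (variable names are mine, not the paper's):

```python
import numpy as np

def lure_risk(losses, q, N):
    """losses[m]: target-model loss on the m-th acquired example.
    q[m]: acquisition probability q(i_m) used at step m. N: pool size (> M)."""
    losses = np.asarray(losses, dtype=float)
    q = np.asarray(q, dtype=float)
    M = len(losses)
    m = np.arange(1, M + 1)
    v = 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q) - 1.0)  # LURE weights
    return float(np.mean(v * losses))
```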
Bootstrap Error Estimation¶
A bootstrap method is proposed to estimate risk estimation error within a single active testing run, providing confidence intervals for practical deployment.
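A plausible sketch of such a within-run bootstrap: resample the acquired (weight, loss) pairs with replacement and recompute the weighted estimate to obtain a percentile interval. The paper's exact resampling scheme may differ from this simplification:

```python
import numpy as np

def bootstrap_interval(losses, weights, n_boot=1000, alpha=0.05, seed=None):
    """Percentile confidence interval for the weighted risk estimate,
    computed from a single active-testing run."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    M = len(losses)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, M, size=M)  # resample acquired points with replacement
        estimates[b] = np.mean(weights[idx] * losses[idx])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```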
Key Experimental Results¶
Main Results: Reduction in Risk Estimation Error¶
| Dataset | Target Model | Surrogate Model | Relative Error Reduction |
|---|---|---|---|
| SST-2 | 70B-few | 7B-few | ~50% |
| FPB | 70B-few | 7B-few | ~40% |
| HS | 70B-few | 7B-few | ~60% |
| Subj | 70B-few | 7B-few | ~30% |
| Overall range | — | — | 25%–80% |
Cross-Model Surrogate Evaluation¶
| Surrogate Model | Target Model | Effectiveness |
|---|---|---|
| Llama-2-7B | Llama-2-70B | Effective |
| Gemma3-4B | Llama-2-70B | Effective |
| Phi-2 | Llama-2-70B | Effective |
| Llama-2-7B | Gemma3-4B | Effective |
Sampling vs. Interpolation Methods¶
| Method | Robustness | Notes |
|---|---|---|
| LURE (sampling) | High | Insensitive to surrogate model quality |
| ASE (interpolation) | Low | Sensitive to surrogate quality; degrades with a fixed surrogate |
This confirms the theoretical prediction that sampling-based methods outperform interpolation-based methods when the surrogate is fixed.
Bootstrap Error Estimation¶
The bootstrap 95% confidence intervals cover the true error about 88% of the time (rising to ~94% when \(K \geq 100\)).
Key Findings¶
- Forgoing surrogate model updates incurs negligible performance loss while saving substantial computational cost.
- The surrogate model can be 10× smaller than the target model, and in some cases smaller still.
- The acquisition function that eliminates target model predictions (replacing cross-entropy with predictive entropy) performs surprisingly well.
- Label noise can severely impair active testing—a stronger surrogate model may paradoxically perform worse in such cases.
Highlights & Insights¶
- Elegant resolution of three bottlenecks: Each simplification is theoretically motivated and empirically validated.
- Counter-intuitive finding: Evaluating a larger target model with a smaller surrogate can match, or even exceed, the estimation quality achieved with a larger surrogate.
- Bootstrap diagnostic tool: Provides a practical criterion for determining whether active testing is functioning effectively in deployment.
- Dataset curation as a by-product: Active testing can be applied to dataset curation, selecting subsets for model evaluation.
Limitations & Future Work¶
- Experiments are limited to text classification; generative tasks present greater complexity.
- Active testing may fail under label noise, as demonstrated by the SST-2 case.
- Improvements on the more challenging MMLU dataset are modest.
- Theoretical convergence guarantees for the bootstrap estimator are lacking.
Related Work & Insights¶
- Kossen et al. (2021, 2022): Proposed sampling- and interpolation-based active testing methods; this work extends them to the LLM setting.
- TinyBenchmarks/DELE: Dataset compression methods focused on cross-model generalizability; this work performs model-specific acquisition instead.
- Insight: Active testing can be integrated with continuous benchmark updates to mitigate data contamination issues.
Rating¶
- Novelty: ⭐⭐⭐⭐ Extends existing methods to LLMs; each simplification is well-motivated but not fundamentally novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple models, datasets, and settings, including analysis of failure cases.
- Writing Quality: ⭐⭐⭐⭐⭐ The three-bottleneck analytical framework is clearly articulated; experimental design is systematic.
- Value: ⭐⭐⭐⭐ Offers practical utility for improving LLM evaluation efficiency, particularly in annotation-expensive scenarios.