EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

Conference: NeurIPS 2025
arXiv: 2506.02672
Code: github.com/ByteDance-Seed/EvaLearn
Area: LLM Evaluation
Keywords: LLM evaluation, sequential learning, learning capability, benchmark, dynamic evaluation

TL;DR

This paper proposes EvaLearn, a benchmark that evaluates the learning capability and learning efficiency of LLMs through a sequential problem-solving paradigm, revealing that models with stronger static performance do not necessarily possess greater learning potential.

Background & Motivation

Existing LLM evaluation benchmarks almost universally adopt a parallel evaluation paradigm: models answer independent, identically distributed samples one by one, and static metrics such as accuracy are aggregated. This approach only measures a model's "static capability," while neglecting an equally important dimension—the ability of a model to learn and adapt from experience within a specific task (i.e., learning capability), as well as the speed at which such learning occurs (i.e., learning efficiency).

Learning capability and learning efficiency are central indicators of human intelligence, yet they have rarely been systematically explored in LLM evaluation. The parallel evaluation paradigm is structurally incapable of capturing such dynamic learning behavior. A fundamentally new evaluation framework is therefore needed to address this gap.

Method

Overall Architecture

EvaLearn adopts a sequential evaluation paradigm:

  1. A total of 648 challenging problems are constructed and organized into 182 sequences.
  2. Each sequence contains 7 problems of the same task type.
  3. Models must solve the problems within each sequence in order.
  4. Models are permitted to leverage experience gained from earlier problems to improve performance on subsequent ones.

Six task categories are covered:

  • Summarization (Sum): Assesses whether models can improve summary accuracy and coverage through experience.
  • Classification (Cla): Assesses whether models can enhance classification ability across a series of classification problems.
  • Extraction (Ex): Assesses whether models can progressively improve the completeness of key information extraction.
  • Logical Reasoning (LR): Assesses whether models can learn from mistakes to improve logical reasoning.
  • Mathematical Reasoning (MR): Assesses whether models can rapidly acquire problem-solving strategies through feedback.
  • Sequential Reasoning (SR): Assesses whether models can leverage historical experience to strengthen sequential reasoning.

Key Designs

Automated Evaluation Framework: Since most challenging problems cannot be verified by rules, an instance-level rubric + LLM-as-a-judge approach is adopted. Each problem is accompanied by a human-written scoring rubric, with GPT-4o serving as the judge. Validation experiments demonstrate that evaluation accuracy exceeds 95% across all tasks.
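To make the rubric-plus-judge setup concrete, here is a minimal sketch of how instance-level rubric judging might be wired up. The prompt template, the `call_judge` callable, and the PASS/FAIL verdict format are illustrative assumptions; the paper only specifies that each problem ships with a human-written rubric and that GPT-4o serves as the judge.

```python
from typing import Callable

# Hypothetical judge prompt; the paper's actual prompt is not reproduced here.
JUDGE_PROMPT = """You are grading a model response against a problem-specific rubric.

Problem:
{problem}

Rubric (every criterion must be satisfied):
{rubric}

Model response:
{response}

Answer with a single word, PASS or FAIL, on the last line."""


def judge_response(problem: str, rubric: str, response: str,
                   call_judge: Callable[[str], str]) -> bool:
    """Score one response with an LLM judge; `call_judge` wraps the judge model
    (e.g. GPT-4o) and returns its raw text output."""
    prompt = JUDGE_PROMPT.format(problem=problem, rubric=rubric, response=response)
    lines = call_judge(prompt).strip().upper().splitlines()
    # Treat the final line of the judge output as the verdict.
    return bool(lines) and lines[-1].startswith("PASS")
```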

Two Sequential Learning Paradigms (see the sketch below):

  • Demonstration Learning: Provides the model with all preceding problems in the sequence along with their ground-truth answers (analogous to ICL).
  • Feedback Learning: In addition to preceding problems, provides the model with its own prior responses and detailed feedback generated by the judge model based on the rubric.
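The sketch below shows how the context for the next problem in a sequence could be assembled under each paradigm. The `Turn` record, field names, and prompt layout are illustrative assumptions, not the paper's exact format.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    problem: str          # an earlier problem in the sequence
    answer: str           # ground-truth answer (used by demonstration learning)
    response: str = ""    # model's own prior response (used by feedback learning)
    feedback: str = ""    # rubric-based judge feedback (used by feedback learning)


def build_context(history: list[Turn], current_problem: str, paradigm: str) -> str:
    """Assemble the prompt for the next problem from the sequence history."""
    parts = []
    for t in history:
        if paradigm == "demonstration":
            # Preceding problems with ground-truth answers, analogous to ICL.
            parts.append(f"Problem: {t.problem}\nReference answer: {t.answer}")
        elif paradigm == "feedback":
            # Preceding problems with the model's own responses and judge feedback.
            parts.append(f"Problem: {t.problem}\n"
                         f"Your previous response: {t.response}\n"
                         f"Judge feedback: {t.feedback}")
    parts.append(f"Problem: {current_problem}")
    return "\n\n".join(parts)
```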

Evaluation Metrics

Let \(N=182\) denote the number of sequences, \(M=7\) the number of problems per sequence, and \(y_{n,m} \in \{0,1\}\) indicate whether the \(m\)-th problem in the \(n\)-th sequence is solved correctly. Six metrics are defined (a minimal computation sketch follows the list):

  1. Overall Accuracy (Acc): \(\text{Acc} = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} y_{n,m}\)

  2. Slope of Fitted Accuracy Curve (k): A least-squares linear fit is applied to the position-accuracy curve \(\text{Acc}_m\); the slope \(k\) reflects learning speed.

  3. Average Position of First Correct Solution (\(P_{\text{first}}\)): \(P_{\text{first}} = \frac{1}{N}\sum_{n=1}^{N} p_n\), where \(p_n\) is the position of the first correctly solved problem in sequence \(n\); lower values are preferred.

  4. Average Offset to First Learned Correct Solution (\(P_{\text{offset}}\)): After excluding problems solvable in a zero-shot setting, this metric measures how quickly a model begins to learn.

  5. Average Number of Consecutive Correct Solutions (\(N_{\text{consec}}\)): \(N_{\text{consec}} = \frac{1}{N}\sum_{n=1}^{N} \max_{1 \le a \le b \le M}\{b-a+1 : y_{n,a}=\cdots=y_{n,b}=1\}\)

  6. Post-Warmup Accuracy (\(\text{Acc}_{\text{pw}}\text{-}K\)): Accuracy computed after excluding the first \(K\) problems of each sequence.
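For concreteness, the following is a minimal sketch of how these metrics could be computed with NumPy from a binary results matrix \(y\) of shape \(N \times M\). Names are illustrative, and \(P_{\text{offset}}\) is omitted because it additionally requires the zero-shot results used to filter out problems already solvable without learning.

```python
import numpy as np


def evalearn_metrics(y: np.ndarray, warmup_k: int = 2) -> dict:
    """Compute sequence-level learning metrics from y[n, m] in {0, 1}."""
    n_seq, m_len = y.shape
    positions = np.arange(1, m_len + 1)

    # 1. Overall accuracy over all N*M problems.
    acc = y.mean()

    # 2. Slope of a least-squares line fitted to the per-position accuracy curve.
    acc_per_position = y.mean(axis=0)
    slope, _ = np.polyfit(positions, acc_per_position, deg=1)

    # 3. Average position of the first correct solution (only sequences with at
    #    least one correct answer; handling of all-wrong sequences is a choice).
    first_correct = [positions[row.astype(bool)][0] for row in y if row.any()]
    p_first = float(np.mean(first_correct)) if first_correct else float("nan")

    # 5. Average length of the longest run of consecutive correct solutions.
    def longest_run(row: np.ndarray) -> int:
        best = cur = 0
        for v in row:
            cur = cur + 1 if v else 0
            best = max(best, cur)
        return best

    n_consec = float(np.mean([longest_run(row) for row in y]))

    # 6. Post-warmup accuracy: drop the first K problems of every sequence.
    acc_pw = y[:, warmup_k:].mean()

    return {"Acc": acc, "k": slope, "P_first": p_first,
            "N_consec": n_consec, f"Acc_pw-{warmup_k}": acc_pw}
```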

Key Experimental Results

Main Results

Nine frontier models are evaluated (including both thinking and non-thinking variants); representative results are shown below:

| Model | Zero-shot | Feedback Learning | Change |
| --- | --- | --- | --- |
| OpenAI-o3-mini | 54.3% | 64.8% | +10.5% |
| Claude-3.7-Sonnet | 28.4% | 35.6% | +7.2% |
| Claude-3.7-Sonnet-Thinking | 31.2% | 37.4% | +6.2% |
| Gemini-2.5-Pro | 68.5% | 67.2% | -1.3% |
| DeepSeek-R1 | 55.7% | 46.4% | -9.3% |

Task-Level Findings:

  • Mathematical Reasoning: GPT-4o improves by +18.0%; Claude-3.7-Sonnet improves by +15.6%.
  • Classification: Claude-3.7-Sonnet-Thinking improves by +13.5%.
  • Summarization: 7 out of 9 models exhibit performance degradation, indicating that this task relies more heavily on pre-trained knowledge.

Ablation Study

Comparison of Four Paradigms (Zero-shot vs. Few-shot vs. Demonstration Learning vs. Feedback Learning):

  • Demonstration Learning generally outperforms few-shot parallel solving.
  • Feedback Learning outperforms Demonstration Learning on most models.
  • Feedback Learning more effectively enables models to obtain correct solutions earlier in the sequence.

Key Findings

  1. Thinking models benefit more from sequential learning: o3-mini achieves an average longest consecutive correct streak of 3.42, compared to only 2.58 for GPT-4o.
  2. Learning capability does not track static capability: DeepSeek-R1 surpasses Claude-3.7-Sonnet in static performance, yet declines by roughly 9% under sequential feedback learning.
  3. Significant variation in learning efficiency: Claude-3.7-Sonnet achieves the highest slope at \(k=2.08\); non-thinking models start from a lower baseline, yielding steeper slopes.
  4. Task specificity: Each model demonstrates strong learning capability on certain tasks, but no model consistently improves across all tasks.

Highlights & Insights

  • Evaluation paradigm innovation: EvaLearn is the first work to systematically quantify the dynamic learning capability of LLMs, transcending the limitations of parallel evaluation.
  • Comprehensive metric design: Six metrics characterize learning behavior from multiple perspectives, including speed, stability, and the position at which learning first occurs.
  • Decoupling of metrics and methods: The evaluation metrics are independent of any specific learning method, supporting future extensibility.
  • Revealing an important phenomenon: The paper demonstrates that strong static capability does not imply strong learning capability, offering a new perspective for model development.
  • Feedback quality: In feedback learning, rubric-based feedback from the judge model proves more effective than directly providing ground-truth answers.

Limitations & Future Work

  • The dataset scale is limited (648 problems / 182 sequences), with only 7 problems per sequence, which may be insufficient to fully characterize learning curves.
  • Task coverage can be further expanded (e.g., to code generation and multimodal settings).
  • Only 9 closed-source or large-parameter models are evaluated; open-source small- and medium-scale models are not included.
  • The judge model (GPT-4o) may introduce its own biases.
  • The sequence length is fixed at 7; the effect of varying sequence lengths on learning behavior remains unexplored.
  • EvaLearn shares similarities with ARC (Chollet, 2019) in measuring a model's "intelligence potential," but focuses specifically on the ability to learn from experience.
  • Unlike ICL evaluations, EvaLearn targets cumulative learning over long sequences rather than few-shot generalization.
  • The framework can be combined with meta-learning or curriculum learning to design stronger sequential learning strategies.
  • EvaLearn provides a new criterion for model selection: models that perform comparably on static benchmarks may differ substantially in learning capability.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — A fundamentally new evaluation dimension and paradigm.
  • Value: ⭐⭐⭐⭐ — Offers practical guidance for model selection and development.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 9 frontier models, 6 task categories, and multiple learning paradigms.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure with rigorous metric definitions.
  • Overall: 8.5/10