AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Conference: ACL 2025
arXiv: 2507.13300
Code: https://github.com/yale-nlp/AbGen
Area: LLM Evaluation
Keywords: ablation study design, scientific research, LLM evaluation, experiment design, meta-evaluation

TL;DR

Proposes AbGen, the first benchmark for evaluating how well LLMs design ablation studies (1,500 expert-annotated examples drawn from 807 NLP papers). The strongest LLM evaluated (DeepSeek-R1) trails human experts by 14.4%, and LLM-as-Judge scores are highly inconsistent with human evaluations.

Background & Motivation

Background: Scientific experiment design (especially ablation studies) is a critical step in validating the effectiveness of methods, requiring deep domain knowledge. LLMs have demonstrated capabilities in scientific research tasks such as paper reviewing, writing, and code generation.

Limitations of Prior Work: Scientists often identify flaws in ablation study designs only after peer review; no standardized benchmark exists to evaluate whether LLMs can assist in designing ablation studies.

Key Challenge: Ablation study designs generated by LLMs appear reasonable but suffer from systematic flaws in faithfulness (consistency with actual research methods) and soundness (reproducibility), which existing automated evaluation methods fail to capture.

Goal: (1) Build an evaluation benchmark for ablation study design; (2) Evaluate the capabilities and limits of current LLMs; (3) Test the reliability of LLM-as-Judge on this task.

Key Insight: Deconstruct ablation study design into two output parts: "Research Goal + Experimental Procedure," and perform human expert evaluations across three dimensions: importance, faithfulness, and soundness.

Core Idea: Construct AbGen, the first benchmark for ablation study design, and AbGen-Eval, a meta-evaluation benchmark, revealing the capability bottlenecks of LLMs in scientific experiment design and the unreliability of automated evaluation.

Method

Overall Architecture

(1) Collect 1,500 ablation study samples from 807 NLP papers, for which experts annotate the research context \(C\) (background, methodology, experimental results) and the reference ablation design \(A\); (2) given the context \(C\) and the module name \(M\), the LLM generates an ablation design \(\hat{A} = \arg\max_A P_{\mathrm{LLM}}(A \mid C, M)\); (3) score the generated design with three-dimensional human evaluation and with LLM-as-Judge, then compare the two.
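As a rough illustration of this task format, the sketch below packs one benchmark instance into a generation prompt. The class name `AbGenInstance`, its field names, and the prompt wording are hypothetical and are not taken from the AbGen repository; only the split into context \(C\), module name \(M\), and a two-part output (research goal plus experimental procedure) follows the description in this note.

```python
from dataclasses import dataclass


@dataclass
class AbGenInstance:
    """One benchmark example (all field names are illustrative assumptions)."""
    background: str        # research background, part of context C
    methodology: str       # method description, part of context C
    results: str           # experimental setup and results, part of context C
    module_name: str       # M: the module whose ablation is to be designed
    reference_design: str  # A: expert-annotated reference ablation design


def build_prompt(inst: AbGenInstance) -> str:
    """Assemble a prompt from context C and module name M.

    The reference design is deliberately left out so the model never sees it.
    """
    return (
        "Research background:\n" + inst.background + "\n\n"
        "Methodology:\n" + inst.methodology + "\n\n"
        "Experimental setup and results:\n" + inst.results + "\n\n"
        f"Design an ablation study for the module: {inst.module_name}\n"
        "Answer in two parts: (1) Research Goal, (2) Experimental Procedure."
    )


# Placeholder content, not a real benchmark entry.
example = AbGenInstance(
    background="Placeholder background.",
    methodology="Placeholder method with a toy retriever module.",
    results="Placeholder results.",
    module_name="toy retriever",
    reference_design="Placeholder reference design.",
)
prompt = build_prompt(example)
```

Keeping the reference design as a separate field makes it easy to withhold it at generation time and reveal it only to evaluators.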

Key Designs

Benchmark Construction Pipeline:
- Function: Filter experimental NLP papers posted to arXiv between March and August 2024, keeping only papers that contain at least 2 ablation studies.
- Mechanism: Experts rewrite the research context (rather than simply copying the abstract), covering research background (319 words on average), methodology (904 words), and experimental setup and results (624 words); the reference ablation design averages 146 words.
- Quality Control: 273 of the 1,500 entries were revised after verification, and more than 95% of entries were rated satisfactory (score \(\ge 4/5\)).
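The filtering rule above can be sketched as a simple predicate. This is a minimal illustration, not the authors' pipeline: the `Paper` record, the use of the arXiv category `cs.CL` as a stand-in for "NLP domain", and the exact day boundaries of the March to August 2024 window are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    arxiv_id: str       # hypothetical identifier
    submitted: date     # submission date
    category: str       # "cs.CL" is assumed here to mark NLP papers
    num_ablations: int  # number of ablation studies found in the paper


WINDOW_START = date(2024, 3, 1)   # assumed inclusive start of the window
WINDOW_END = date(2024, 8, 31)    # assumed inclusive end of the window
MIN_ABLATIONS = 2                 # "at least 2 ablation studies per paper"


def is_candidate(paper: Paper) -> bool:
    """Keep NLP papers inside the date window with enough ablation studies."""
    return (
        paper.category == "cs.CL"
        and WINDOW_START <= paper.submitted <= WINDOW_END
        and paper.num_ablations >= MIN_ABLATIONS
    )


# Made-up records used only to exercise the filter.
papers = [
    Paper("a", date(2024, 4, 10), "cs.CL", 3),   # kept
    Paper("b", date(2024, 4, 10), "cs.CL", 1),   # too few ablations
    Paper("c", date(2024, 9, 2), "cs.CL", 4),    # outside the window
    Paper("d", date(2024, 5, 1), "cs.CV", 2),    # not an NLP paper
]
kept = [p.arxiv_id for p in papers if is_candidate(p)]

# Share of entries revised during quality control, from the figures above.
revised_share = 273 / 1500
```
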

Three-Dimensional Evaluation System:
- Function: Score each design on three dimensions: Importance (whether the ablated module is crucial), Faithfulness (whether the design is consistent with the research context), and Soundness (whether the plan is reproducible).
- Mechanism: Two-stage evaluation: experts first score the LLM output blind, using only the research context, and then adjust their scores after comparing against the reference answer.
- Review Experts: 4 ACL Rolling Review area chairs, achieving a Cohen's Kappa of 0.71–0.78.
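For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of unweighted Cohen's kappa for two raters. Whether the paper uses the unweighted or a weighted variant is not stated in this note, so the unweighted form is an assumption, and the Likert scores below are made up.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up 1-5 Likert scores from two hypothetical annotators.
scores_a = [5, 4, 4, 2, 3, 5]
scores_b = [5, 4, 3, 2, 3, 5]
kappa = cohens_kappa(scores_a, scores_b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain percent agreement for Likert-style ratings.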

Meta-Evaluation Benchmark AbGen-Eval:
- Function: Test how reliable LLM-as-Judge is when evaluating ablation study designs.
- Key Findings: The instance-level Pearson correlation with human scores is at most 0.48 (Gemini-2.5-Flash on the faithfulness dimension), with most values below 0.4, indicating that LLM-as-Judge is severely unreliable on this task.
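The instance-level correlation mentioned above pairs, for each generated design, the human score with the judge's score. A minimal sketch of that computation follows; the scores are invented for illustration, and the exact aggregation protocol used in AbGen-Eval is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-instance scores on one dimension (1-5 Likert scale).
human_scores = [2, 3, 4, 5, 3]
judge_scores = [4, 5, 5, 5, 4]  # a judge that compresses scores toward the top
r = pearson(human_scores, judge_scores)
```

A judge can be both inflated (higher mean) and weakly correlated with humans; the correlation only captures the second problem, so comparing mean scores remains useful as a separate check.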

Loss & Training

Not applicable: AbGen is a pure evaluation benchmark and involves no training.

Key Experimental Results

Main Results

Model	Importance	Faithfulness	Soundness	Average
Human Expert	4.65	4.93	4.83	4.80
Reference Paper	4.70	4.90	4.70	4.77
DeepSeek-R1	4.23	4.00	4.11	4.11
o4-mini	4.23	3.78	4.00	4.00
GPT-4.1	4.12	3.87	4.02	4.00
Qwen3-235B	4.26	3.43	4.00	3.90
Gemini-2.5-Flash	3.89	3.94	3.76	3.86
GPT-4o	3.88	3.67	3.91	3.82
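The headline gap between the best model and human experts can be recomputed from the table. The sketch below assumes the Average column is the unweighted mean of the three dimensions and that the relative gap is taken with respect to the human average; both assumptions are consistent with the reported numbers.

```python
# Scores copied from the table: (importance, faithfulness, soundness).
scores = {
    "Human Expert": (4.65, 4.93, 4.83),
    "DeepSeek-R1": (4.23, 4.00, 4.11),
}
DIMENSIONS = ("importance", "faithfulness", "soundness")


def average(triple):
    """Unweighted mean over the three evaluation dimensions."""
    return sum(triple) / len(triple)


human_avg = average(scores["Human Expert"])
model_avg = average(scores["DeepSeek-R1"])

gap = human_avg - model_avg      # absolute gap in Likert points
relative_gap = gap / human_avg   # gap as a fraction of the human average

# Per-dimension gaps, to see where the model falls furthest behind.
per_dim_gap = {
    dim: h - m
    for dim, h, m in zip(DIMENSIONS, scores["Human Expert"], scores["DeepSeek-R1"])
}
largest_gap_dim = max(per_dim_gap, key=per_dim_gap.get)
```
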

LLM-Human Interaction User Study

Model	Phase	Importance	Faithfulness	Soundness
GPT-4o	Initial Failure	3.9	2.1	2.0
GPT-4o	Post-Feedback	4.8 (+0.9)	4.2 (+2.1)	4.6 (+2.6)
Llama-3.1-70B	Initial Failure	3.7	1.8	1.7
Llama-3.1-70B	Post-Feedback	4.5 (+0.8)	3.9 (+2.1)	4.1 (+2.4)
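To put the bracketed deltas in perspective, the following sketch derives absolute and relative gains from the table. The relative gain is computed against the initial (pre-feedback) score, which is an assumption about how percentage improvements should be read.

```python
# Scores copied from the table: (importance, faithfulness, soundness).
study = {
    "GPT-4o": {"before": (3.9, 2.1, 2.0), "after": (4.8, 4.2, 4.6)},
    "Llama-3.1-70B": {"before": (3.7, 1.8, 1.7), "after": (4.5, 3.9, 4.1)},
}
DIMENSIONS = ("importance", "faithfulness", "soundness")


def gains(before, after):
    """Map each dimension to (absolute gain, relative gain in percent)."""
    return {
        dim: (round(a - b, 1), round((a - b) / b * 100))
        for dim, b, a in zip(DIMENSIONS, before, after)
    }


report = {model: gains(**phases) for model, phases in study.items()}

for model, per_dim in report.items():
    for dim, (absolute, relative) in per_dim.items():
        print(f"{model:14s} {dim:12s} +{absolute:.1f} points (+{relative}%)")
```

Because these cases were selected as initial failures, the starting scores are low, so relative gains look large even where the absolute gain is similar across models.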

Key Findings

- The strongest LLM (DeepSeek-R1, 4.11) lags behind human experts (4.80) by 0.69 points (14.4%), with the largest gap in faithfulness (4.00 vs 4.93).
- LLM-as-Judge scores are severely inflated: the judge rates DeepSeek-R1 at 4.80, on par with human experts, whereas human evaluators give it 4.11.
- Five common failure categories: inconsistency with context, non-reproducible experiments, incomplete ablation, unimportant module selection, and logical self-contradiction.
- LLMs improve markedly after receiving human feedback: in the user study, faithfulness rises by 2.1 points for both models (+100% to +117%) and soundness by 2.4 to 2.6 points (+130% to +141%), highlighting the potential of LLM-human collaboration.
- Cross-domain generalization is acceptable: performance in the biomedical and computer networking domains is close to that in NLP.

Highlights & Insights

- Reveals systematic weaknesses of LLMs in scientific experiment design: LLMs can generate plausible ablation plans but fall well short on detail consistency (faithfulness), which is the most critical quality of a scientific experiment design.
- The unreliability of LLM-as-Judge is particularly pronounced on this task: automated scores can barely distinguish good designs from bad ones, so evaluating scientific reasoning tasks still requires human involvement.
- The large gains from human feedback (for GPT-4o, soundness rises from 2.0 to 4.6) suggest that LLMs are best used in an "assist + iterate" loop rather than for full automation.

Limitations & Future Work

- The benchmark covers only NLP papers; although the user study suggests acceptable cross-domain generalization, formal benchmarks for other domains are still lacking.
- Evaluation relies on Likert-scale ratings from human experts, so some subjectivity is unavoidable.
- The "creativity" dimension of ablation studies is not evaluated: some creative ablation designs may differ from the reference yet remain highly valuable.
- Only ablation study design is evaluated; complete experimental planning (such as baseline selection and dataset selection) is not covered.

- vs ReviewAdvisor / MARG / SEA, etc.: Previous AI for Science evaluations focused on paper reviewing and writing. AbGen is the first to focus on experiment design, a core scientific research capability.
- vs Code Generation Benchmarks (HumanEval, etc.): Code can be verified automatically, whereas the quality of an experiment design must be judged by domain experts, which makes evaluation harder.
- Insight: The bottleneck of LLM-assisted research may lie not in the volume of knowledge but in the ability to stay "faithful to specific contexts", which is fundamentally the same problem as faithfulness in RAG.

Rating

Novelty: ⭐⭐⭐⭐⭐ First ablation study design benchmark + meta-evaluation, novel problem definition.
Experimental Thoroughness: ⭐⭐⭐⭐ 10+ models + human evaluation + user studies + cross-domain validation.
Writing Quality: ⭐⭐⭐⭐ Detailed benchmark construction process, strict quality control.
Value: ⭐⭐⭐⭐⭐ Drives AI for Science from "assisted writing" toward "assisted experiment design".