
📊 LLM Evaluation

💬 ACL2025 · 89 paper notes


🔥 Top topics: LLM ×32 · Reasoning ×11 · Adversarial Robustness ×4 · Agents ×4 · Sentiment Analysis ×2

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates: A conformal risk control framework is proposed for granular word-level error detection and uncertainty calibration of CLIPScore. By generating a score distribution through simple attention mask sampling, this method provides formal risk control guarantees while remaining model-agnostic.
MisMatched: A Benchmark for Scientific Natural Language Inference: Introduces MisMatched, the first scientific NLI evaluation benchmark covering non-CS fields (Psychology, Engineering, Public Health), consisting of 2,700 human-annotated sentence pairs. The best SLM baseline (SciBERT) achieves a Macro F1 of only 78.17%, while the best LLM baseline (Phi-3) scores only 57.16%. It also shows that training with implicit-relation sentence pairs can improve model performance.
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research: Proposes AbGen—the first benchmark to evaluate the ability of LLMs to design ablation studies (1,500 expert-annotated data points from 807 NLP papers). It reveals that the strongest LLM (DeepSeek-R1) falls behind human experts by 14.4%, and LLM-as-Judge scores are highly inconsistent with human evaluations.
Access Denied Inc: The First Benchmark Environment for Sensitivity Awareness: This work gives the first formal definition of LLM "Sensitivity Awareness" (SA): whether an LLM correctly decides to provide or withhold information according to Role-Based Access Control (RBAC) rules. The authors construct an automated evaluation benchmark, Access Denied Inc, and find that even with highly structured data and minimalist rules, the best-performing of the 7 mainstream LLMs evaluated, Grok-2, still exhibits a leak rate of 18.28%.
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models: The board game Codenames is implemented as an LLM evaluation benchmark, where LLMs play the roles of both Spymaster (clue giver) and Field Operative (guesser) against a deterministic opponent across 13 experimental setups of varying difficulty. Among 14 evaluated models, the best-performing model (o3-mini) achieves a win rate of only 49%, revealing substantial limitations of LLMs in vocabulary association, strategic positioning, and error correction.
AD-LLM: Benchmarking Large Language Models for Anomaly Detection: This paper proposes the first LLM anomaly detection benchmark, AD-LLM, to systematically evaluate the capability of LLMs in three core tasks: zero-shot detection, data augmentation, and unsupervised model selection. It reveals that GPT-4o zero-shot detection outperforms traditional training-based methods on most datasets. Additionally, synthetic data benefits detectors utilizing flexible representation learning but harms models with fixed geometric assumptions. Finally, reasoning LLMs achieve near-optimal model selection, though their explanations lack explicit dataset specificity.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents: AndroidLab is proposed as a systematic Android agent evaluation and training framework, consisting of a unified operating environment, a reproducible benchmark with 138 tasks, and an instruction dataset of 94.3K steps. Through fine-tuning, the success rate of open-source LLMs is improved from 4.59% to 21.50%.
AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge: Proposes AntiLeakBench, an automated anti-leakage benchmark framework that identifies knowledge that emerged after an LLM's cutoff date by tracking Wikidata knowledge update histories, and automatically constructs single-/multi-hop QA test samples (with real-world Wikipedia supporting documents) so that the samples are strictly contamination-free at the knowledge level. Large-scale experiments on 12 LLMs demonstrate a pervasive post-cutoff performance decline (with a significant EM drop), validating the framework's effectiveness.
Are Bias Evaluation Methods Biased?: Under strictly controlled variables, this study compares three mainstream bias evaluation methods (structured Q&A BBQ, LLM-as-a-Judge, and sentiment analysis) and finds that different methods yield significantly different bias rankings for the same set of LLMs—suggesting that bias evaluation methods themselves are biased, and enterprises should not rely on a single bias benchmark for model selection.
Atomic Calibration of LLMs in Long-Form Generations: This work systematically studies atomic calibration in long-form generation, categorizing confidence elicitation methods into discriminative and generative approaches. It finds these two types to be complementary and proposes a fusion strategy based on confidence consistency, revealing interesting patterns in how model confidence changes during the generation process.
Batayan: A Filipino NLP Benchmark for Evaluating Large Language Models: This work introduces Batayan, the first comprehensive NLP benchmark designed to evaluate LLMs on the Filipino language. It covers 8 tasks across three core capabilities (NLU, NLR, NLG), including 3 first-of-their-kind Filipino tasks. Constructed by native speakers to ensure linguistic authenticity, Batayan evaluates over 50 open-source and commercial LLMs, revealing that performance on Filipino significantly lags behind English. Notably, explicit Filipino linguistic support and model scale expansion consistently yield substantial performance gains.
BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian: This paper introduces BelarusianGLUE, the first NLU benchmark for the Belarusian language (East Slavic branch), containing approximately 15K instances across 5 tasks. It systematically evaluates the performance of BERT models and LLMs, finding that while simple tasks like sentiment analysis approach human-level performance, difficult tasks like the Winograd Schema Challenge still exhibit a significant gap, and the optimal model type varies by task.
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories: Proposes the JitVul benchmark, which pairs each function with its vulnerability-introducing and vulnerability-fixing commits. Based on 879 CVEs covering 91 vulnerability types, it systematically evaluates the capabilities of LLMs and ReAct Agents in repository-level vulnerability detection, finding that ReAct Agents outperform pure LLMs, though both still have significant room for improvement.
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph: This paper constructs the LM-Polygraph uncertainty quantification (UQ) benchmark, implementing 30+ SOTA methods, and systematically evaluates the performance of UQ and confidence normalization techniques across 11 text generation tasks, providing a unified evaluation framework for LLM hallucination detection.
BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English: BESSTIE is constructed as the first annotated benchmark for sentiment analysis and sarcasm detection tailored to varieties of English (Australian, Indian, and British English). Evaluations using nine fine-tuned LLMs reveal that performance on Indian English (an outer-circle variety) is significantly worse than on inner-circle varieties, and cross-variety generalization remains limited.
Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation: This paper proposes TailoredBench, a method that adaptively constructs a customized coreset (Native-coreset) for each target model to be evaluated, instead of using a static subset shared across all models. By utilizing adaptive source model selection, scalable K-Medoids clustering, and a calibrated estimation strategy, it reduces the Mean Absolute Error (MAE) of accuracy estimation by 31.4% on average under an inference budget of only 20-40 samples.
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning: Proposes BLUR (Browsing Lost Unformed Recollections), a benchmark dataset containing 573 real-world "tip-of-the-tongue" (ToT) known-item search and reasoning queries. Human accuracy reaches 98%, whereas the best AI system achieves only about 56%, revealing a significant gap in current AI's tool use and multi-hop reasoning capabilities.
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges: Proposes CalibraEval, an unsupervised inference-time debiasing method. By formulating the debiasing problem as an optimization task, CalibraEval utilizes a Non-parametric Order-preserving Algorithm (NOA) to learn a calibration function that maps the observed probability distribution of LLM judges to an unbiased distribution, effectively mitigating selection bias in LLMs-as-Judges.
Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution's Characteristics: To address the failure of traditional confidence metrics caused by multiple valid outputs in text generation, this work proposes two task-agnostic confidence measures: "Ratio" (the ratio of head probabilities to middle probabilities) and "Tail Thinness" (how thin the tail of the distribution is). Relying solely on the model's output probabilities, these measures improve the confidence calibration of BART/Flan-T5 on summarization, translation, and question-answering tasks.
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?: Proposes Evaluation Agent, a tool-augmented LLM-as-a-Judge framework that integrates web search (fact-checking), code execution, and mathematical verification tools. It improves human agreement from 63% to 81% on long-text fact-checking, and from 31% to 71% on coding evaluation, with virtually no degradation in out-of-domain areas.
CFBench: A Comprehensive Constraints-Following Benchmark for LLMs: Proposes CFBench, a large-scale Chinese constraint-following benchmark containing 1,000 finely annotated samples across 200+ real-world scenarios and 50+ NLP tasks. It systematically defines a constraint taxonomy of 10 major categories and over 25 subcategories. Furthermore, a multi-dimensional evaluation framework is designed, combining Constraint Satisfaction Rate (CSR), Instruction Satisfaction Rate (ISR), and Priority Satisfaction Rate (PSR). The benchmark reveals significant room for improvement in constraint-following for current top-tier LLMs.
ChatBench: From Static Benchmarks to Human-AI Evaluation: Through user studies, this work converts the static MMLU benchmark into human-AI dialogues, constructing the ChatBench dataset (396 questions, 7,336 dialogues). It reveals that AI-alone accuracy cannot predict user-AI accuracy, and trains a user simulator that improves correlation by 22–26 percentage points, laying the foundation for scalable interactive evaluation.
CodeMEnv: Benchmarking Large Language Models on Code Migration: This paper proposes CodeMEnv, the first benchmark to systematically evaluate the cross-environment code migration capabilities of LLMs. It contains 922 samples from 19 Python/Java packages, covering 3 hierarchical tasks (locating incompatible functions -> describing changes -> migrating code). The average Pass@1 of 9 evaluated LLMs is only 26.50%, with GPT-4o achieving the highest at 43.84%. The findings reveal that LLMs are more familiar with newer function versions and exhibit inconsistency in version-reasoning logic.
Com2: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models: This paper proposes Com2, a complex commonsense reasoning benchmark constructed based on causal event graphs and causal theories (intervention/counterfactuals). It contains 2,500 topical questions and 1,254 detective story questions, revealing significant deficiencies of LLMs in reasoning depth and breadth.
CoPrUS: Consistency Preserving Utterance Synthesis Towards More Realistic Benchmark: This paper proposes the CoPrUS framework, a consistency-preserving utterance synthesis method for dialogue benchmark construction. By explicitly maintaining consistency constraints across personas, knowledge, and dialogue history during dialogue data generation, it produces more realistic dialogue benchmark data than existing methods.
CoV-Eval: Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective: Proposes CoV-Eval, the first multi-task code vulnerability evaluation benchmark covering code completion, vulnerability repair, vulnerability detection, and vulnerability classification. It develops the VC-Judge vulnerability judgment model to replace traditional static analysis tools. A comprehensive evaluation of 20 LLMs reveals that although most LLMs can detect vulnerable code, they still tend to generate unsafe code and possess limited vulnerability repair capabilities.
CuLEmo: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding: This paper proposes CuLEmo, the first multilingual benchmark dataset for evaluating culture-aware emotion prediction. Spanning 6 languages/cultures (Amharic, Arabic, English, German, Hindi, and Spanish), it evaluates the cross-cultural emotion understanding capabilities of LLMs across 400 culturally-relevant scenarios, revealing significant cultural variations in emotional expression and highly varying performance among LLMs.
CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming: CulturalBench is constructed through a Human-AI CulturalTeaming pipeline, comprising 1,696 human-written and five-way independently verified cultural knowledge questions across 45 global regions and 17 themes. CulturalBench-Hard (True/False format) yields only 61.5% accuracy even for the strongest model (OpenAI o1), far below the human performance of 92.4%, revealing models' mode-seeking tendencies in multi-answer questions and imbalanced performance in cross-regional cultural knowledge.
EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association: Defines the E-commerce Script Planning (EcomScript) task and constructs the first large-scale benchmark EcomScriptBench (600k scripts + 2.4M products). By bridging the semantic gap between action steps and product searching through purchase intentions, it reveals significant deficiencies in current LLMs on this task.
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits: This paper proposes EditInspector, a multi-dimensional text-guided image editing evaluation benchmark based on human annotations. It covers six dimensions: editing accuracy, artifact detection, visual quality, scene fusion, common sense consistency, and change description. It reveals the limitations of current VLMs in comprehensively evaluating editing quality, and proposes two new methods that outperform SOTA in artifact detection and difference description generation.
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework: This work proposes EducationQ, a multi-agent dialogue framework designed to evaluate the teaching capabilities of LLMs by simulating teacher-student informal formative assessment interactions in real classrooms. The study reveals that teaching effectiveness is not linearly correlated with model size or general reasoning ability, with Llama 3.1 70B demonstrating the best teaching performance.
ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming: This work introduces ELABORATION, the first comprehensive benchmark designed to evaluate human-LLM collaborative competitive programming. Supported by a human feedback taxonomy covering the entire programming workflow and a meticulously annotated dataset of 8,320 problems, the study reveals that while LLMs possess limited independent problem-solving capabilities (with only a 3.4% Pass@1 rate on hard problems), human feedback—especially expert guidance during the coding phase—can yield a significant average improvement of 9.3%.
EvoWiki: Evaluating LLMs on Evolving Knowledge: This paper proposes EvoWiki, an automatically updatable dynamic evaluation benchmark that categorizes knowledge into three levels: stable, evolved, and uncharted. It systematically evaluates the ability of LLMs to utilize evolving knowledge, revealing synergistic effects when combining Retrieval-Augmented Generation (RAG) and Continual Learning (CL).
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models: This paper proposes NumericBench, a comprehensive benchmark to evaluate 6 fundamental numerical abilities (number recognition, arithmetic, retrieval, comparison, aggregation, and logical reasoning) across 6 datasets. It reveals that SOTA models, including GPT-4o and DeepSeek-V3, still perform poorly on simple numerical tasks, and provides an in-depth analysis of 5 root causes.
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging: The FinanceReasoning benchmark is proposed to improve financial numerical reasoning evaluation across three dimensions: credibility, comprehensiveness, and level of challenge. This is achieved by re-annotating public datasets, constructing a library of 3,133 Python financial functions, and introducing 908 expert-annotated hard questions.
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions: Proposes MemoryCode, a synthetic multi-session dataset to evaluate the ability of LLMs to track and execute coding instructions over long-term interactions, finding that even GPT-4o's accuracy drops by 67% when provided with the full dialogue history, revealing fundamental limitations of current LLMs in prospective memory and information integration.
GRACE: A Granular Benchmark for Evaluating Model Calibration Against Human Calibration: This paper proposes the GRACE benchmark, which collects 1,749 data points through progressive incremental question answering and human-vs-model tournaments to evaluate LLM calibration capability with human calibration performance as a reference. By introducing the CalScore metric, the study reveals that while human accuracy may be lower than that of models, humans generally outperform SOTA models in calibration: models are overconfident when uncertain and underconfident when correct.
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning: Proposes GuessArena, a self-adaptive LLM evaluation framework based on the "Guess Who I Am" game. Through domain knowledge modeling and multi-turn interactive reasoning, this framework effectively distinguishes models' domain knowledge and reasoning capabilities across five vertical industries.
HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning: This work constructs HellaSwag-Pro, the first large-scale bilingual (Chinese-English) benchmark for evaluating the robustness of LLMs in commonsense reasoning. By generating 11,200 question variants from 1,600 original questions across 7 reasoning forms, systematic evaluation on 41 LLMs reveals that all models fall far short of robust commonsense reasoning—the average accuracy for negation transformation is only 9.01%, highlighting a significant performance gap between models and humans.
Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback: This paper defines the novel task of "LLM writing feedback generation," constructs a dataset (StoryFeedback) containing 1,300 stories with controlled writing defects (totaling 83K story-feedback pairs), and systematically evaluates the performance of 8 LLMs across four dimensions (specificity, correctness, error detection, and praise appropriateness) using automatic metrics and human evaluation. The study finds that while models provide specific and generally correct feedback, they often miss major writing flaws and struggle to determine when to praise.
HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices: This paper introduces HomeBench, the first evaluation benchmark for smart-home LLMs that incorporates both valid and invalid instructions across single- and multi-device scenarios. The study reveals that even GPT-4o achieves a success rate of only 0.0% in multi-device scenarios involving invalid instructions.
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation: This paper proposes BehaviorChain, the first benchmark to evaluate LLMs' ability to simulate sequential human behaviors. Containing 15,846 behavior samples under 1,001 persona profiles, the benchmark reveals that even state-of-the-art models perform poorly in sequential behavior simulation.
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators: By integrating 8 key factors affecting LLM evaluation prompts (scoring scale, ICL examples, evaluation criteria, reference answer, CoT, AutoCoT, evaluation metrics, and component order), this work proposes HPSS, a genetic-algorithm-based heuristic prompting strategy search method. HPSS efficiently searches for the optimal prompting strategy within a search space of 12,960 combinations, outperforming G-Eval and CloserLook at only 5% of the baseline generation cost.
Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles: This paper systematically investigates three major factors influencing LLM calibration: multi-model response agreement, loss function selection, and prompt style. It proposes the Calib-n framework, which trains an auxiliary model to aggregate responses from multiple LLMs to estimate confidence, and reveals that response agreement and focal loss significantly improve calibration performance.
JuStRank: Benchmarking LLM Judges for System Ranking: The first large-scale study on the performance of LLM judges in system ranking tasks, introducing the JuStRank benchmark. Over 1.5 million ratings from 48 judges across 63 systems were collected, revealing a significant gap between instance-level judgment and system-level ranking performance, and identifying two quantifiable system-level behavioral characteristics of LLM judges: "decisiveness" and "system-specific bias."
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding: KITAB-Bench is a comprehensive Arabic OCR benchmark covering 8,809 samples across 9 major domains and 36 sub-domains. Evaluation results indicate that modern vision-language models (such as GPT-4o and Gemini) outperform traditional OCR methods by an average of 60% in terms of character error rate (CER), yet the best model only achieves a 65% accuracy in PDF-to-Markdown conversion, highlighting the massive challenges of Arabic document understanding.
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning: This paper proposes KRISTEVA, the first benchmark to evaluate LLM close reading capabilities, consisting of 1,331 multiple-choice questions constructed from university classroom data. It covers three progressive levels of difficulty: stylistic feature extraction, contextual retrieval, and feature-context multi-hop reasoning. Nineteen SOTA LLMs still lag behind human experts in 10 out of 11 tasks.
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America: This work constructs the first open-source LLM leaderboard for Spanish and Latin American languages, integrating 66 datasets covering Spanish, Catalan, Basque, and Galician, to evaluate 50 models and analyze the relationships between training strategies, compute, and performance.
Language Complexity Measurement as a Noisy Zero-Shot Proxy for Evaluating LLM Performance: Using language complexity calculation tasks (LIX readability index and average dependency distance, ADD) as zero-shot proxy evaluation methods for general LLM capabilities, this paper evaluates six models on Swedish essays. The findings show a strong negative correlation between LIX error and MMLU score (\(r=-0.875\), \(p=0.026\)), demonstrating that structural analysis performance can serve as a cost-effective approximation of general model capability.
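For reference, LIX is conventionally defined as the average sentence length plus the percentage of words longer than six letters. A minimal Python sketch follows; the function name `lix` and the regex-based tokenization and sentence splitting are simplifying assumptions for illustration, not the paper's implementation.

```python
import re


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words.

    A "long word" has more than six letters. Sentence boundaries are
    approximated by runs of '.', '!' or '?' (an assumption of this sketch).
    """
    # Runs of Unicode letters count as words; digits and underscores do not.
    words = re.findall(r"[^\W\d_]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)


score = lix("The cat sat. Evaluation requires measurement.")
print(score)  # 6 words / 2 sentences + 100 * 3 long words / 6 words
```

The "LIX error" in the summary above would then be the gap between a model's reported value and a reference value computed along these lines.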
Language Model Probabilities are Not Calibrated in Numeric Contexts: This paper systematically investigates the probability calibration of language models in numeric contexts. It reveals that even in simple scenarios (such as drawing marbles from a bag), all tested models, including GPT-4o, are severely miscalibrated. They exhibit systematic biases based on token order, token frequency, and token identity (e.g., some models consistently select the first option, while others select the second). Furthermore, instruction tuning is found to exacerbate mode collapse.
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework: Proposes the ARJudge evaluation framework, which adaptively generates evaluation criteria and executes text+code dual-driven analysis by fine-tuning an Analyzer, paired with a training-free Refiner for comprehensive judgment. ARJudge outperforms existing fine-tuned evaluators across multiple evaluation benchmarks, particularly achieving a performance gain of up to 11.1% on instruction-following evaluation via code-driven analysis.
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset: This paper proposes a formal definition of Metaphysical Reasoning, formulating reasoning under distributional changes as a three-step discrimination process. It constructs MARS (355K annotated instances), the first large-scale evaluation benchmark for this task. Experiments demonstrate that over 20 language models perform poorly on it, revealing a significant weakness of LLMs in understanding modifications of event components and their causal effects.
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models: This paper proposes McBE, the first multi-task Chinese bias evaluation benchmark, containing 4,077 Bias Evaluation Instances (BEIs) across 12 bias categories and 82 subcategories. By utilizing 5 evaluation tasks (Preference Computation, Subcategory Classification, Scenario Selection, Bias Analysis, and Bias Scoring), McBE multi-dimensionally quantifies Chinese bias in LLMs, revealing that the conventional conclusion of "larger parameters yield stronger bias" may stem from the limitations of single-task evaluations.
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance: Proposes MDBench, a multi-document reasoning QA benchmark synthesized through a "structured knowledge → LLM-assisted enhancement → natural text generation" pipeline. It controllably injects cross-document dependencies and poses a significant challenge to frontier LLMs (the best model achieves only ~60% EM).
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling: Proposes the Mis-prompt benchmark, comprising 4 evaluation tasks, a taxonomy of 14 error categories, and a dataset of 14,969 items, to systematically investigate the proactive error-handling capabilities of LLMs when no explicit error-handling instructions are provided. Current LLMs are found to be severely lacking in proactive error handling, whereas SFT can significantly improve it.
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark: This paper proposes MMLU-CF, a contamination-free multi-task language understanding benchmark containing 20,000 questions. It avoids inadvertent and malicious data leakage by collecting data from broader sources and applying three decontamination rules (rephrasing questions, shuffling options, and randomly replacing options). On this benchmark, the strongest model GPT-4o achieves only 73.4% (compared to 88.0% on MMLU).
Movie101v2: Improved Movie Narration Benchmark: Proposes Movie101v2, a large-scale bilingual movie narration benchmark (203 movies, 46K Chinese-English video-narration pairs). It decomposes automatic movie narration into a three-stage progressive goal: L1 visual factual description \(\rightarrow\) L2 plot narration \(\rightarrow\) L3 deployable AD. It designs an LLM-based hierarchical evaluation framework, systematically benchmarks multiple LVLMs, and provides an in-depth analysis of the core bottlenecks in visual perception and text generation.
Navigating Rifts in Human-LLM Grounding: Study and Benchmark: This paper systematically studies the grounding (establishing mutual understanding) failure problem in human-LLM dialogues. It reveals that the frequency of proactive clarification by LLMs is only 1/3 of that of humans, and the frequency of proactive follow-up questions is only 1/16. To address this, the authors propose the Rifts benchmark (~1.8K tasks) to evaluate the grounding capabilities of LLMs, and implement preliminary interventions using a grounding forecaster.
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark: This paper proposes NorEval, a comprehensive Norwegian evaluation suite containing 24 manually created datasets across 9 task categories. It systematically evaluates the language understanding and generation capabilities of 19 open-source Norwegian language models, finding that models still lag significantly behind humans in common-sense reasoning, truthfulness, and instruction following.
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities: ONEBench proposes a new benchmarking paradigm: pooling samples from multiple evaluation datasets into a unified pool, and performing sample-level model comparisons using the Plackett-Luce rank aggregation algorithm. This paradigm supports heterogeneous metric aggregation, incomplete data handling, and personalized capability probing.
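As background on the Plackett-Luce model named above: it assigns each item a positive strength and treats a ranking as a sequence of choices, each made with probability proportional to strength among the items not yet placed. The sketch below computes the probability of one full ranking; the function name and the strength values are hypothetical, and this is not ONEBench's aggregation code.

```python
def plackett_luce_prob(ranking, strengths):
    """Probability of observing `ranking` (best first) under Plackett-Luce.

    `strengths` maps each item to a positive score. At every position the
    next item is picked with probability strength / (sum of strengths of
    the items that are still unranked).
    """
    remaining = sum(strengths[item] for item in ranking)
    prob = 1.0
    for item in ranking:
        prob *= strengths[item] / remaining
        remaining -= strengths[item]
    return prob


# Hypothetical strengths for three models.
strengths = {"A": 2.0, "B": 1.0, "C": 1.0}
print(plackett_luce_prob(["A", "B", "C"], strengths))  # (2/4) * (1/2) * (1/1)
```

Rank aggregation then amounts to fitting the strengths that make the observed sample-level rankings most likely, which is why partial or incomplete rankings can still contribute.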
Pap2Pat: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs: This work constructs the Pap2Pat benchmark containing 1.8k patent-paper pairs, proposes COPGen, an outline-guided chunked patent description generation method, and designs NLI-based factuality, coverage, and style evaluation metrics to systematically assess the capabilities and limitations of modern LLMs in generating ultra-long patent documents.
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory: This paper proposes the PapersPlease benchmark, containing 3,700 moral dilemma scenarios based on Alderfer's ERG motivation theory. By instructing LLMs to role-play as immigration officers deciding on applicant entry, this benchmark reveals significant differences in motivational value prioritization across six LLMs, as well as biases toward marginalized social groups.
PATCH: Psychometrics-Assisted Benchmarking of LLMs Against Human Populations: Proposes the PATCH framework, which introduces Item Response Theory (IRT 3PL/2PL models) from psychometrics into LLM benchmarking. By comparing GPT-4V, Gemini-Pro-Vision, and Qwen-VL with human populations on the TIMSS 2011 eighth-grade mathematics test (88 questions, 56 countries/regions), it finds that IRT ability estimations differ significantly from simple accuracy rankings, with GPT-4V ranking in the same bracket as students from South Korea, Singapore, and Chinese Taipei. Additionally, four high-quality datasets are released (TIMSS 2011 & 2008 Mathematics/Science/Physics).
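For readers unfamiliar with the IRT models mentioned above, the standard 3PL item response function gives the probability of a correct answer from the test-taker's ability and three item parameters; fixing the guessing parameter at zero yields the 2PL model. The sketch below uses the textbook form with illustrative parameter values; it is not the PATCH code.

```python
import math


def irt_3pl(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct response under the 3PL model.

    theta: test-taker ability
    a: item discrimination
    b: item difficulty
    c: guessing (lower asymptote); c = 0 gives the 2PL model
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values: an average-ability test-taker on an item of matching
# difficulty, with a 25% guessing floor (e.g. four answer options).
print(irt_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))  # 0.625
print(irt_3pl(theta=0.0, a=1.0, b=0.0))          # 0.5 (2PL)
```

Because each item carries its own difficulty and discrimination, two test-takers with the same raw accuracy can receive different ability estimates, which is consistent with the ranking differences the summary reports.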
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning: This paper proposes PhysReason, a physics reasoning benchmark containing 1,200 physics problems (averaging 8.1 solving steps per problem). It designs a two-tier automatic evaluation framework, PSAS, covering both answer-level and step-level evaluations. The benchmark reveals that top-tier models (such as DeepSeek-R1 and o3-mini) achieve less than 60% accuracy on physics reasoning and identifies four major reasoning bottlenecks.
READoc: A Unified Benchmark for Realistic Document Structured Extraction: READoc proposes the first unified benchmark that defines Document Structured Extraction (DSE) as an end-to-end PDF-to-Markdown conversion. It includes 3,576 realistic documents from arXiv, GitHub, and Zenodo, along with a three-module evaluation suite (Standardization, Segmentation, and Scoring), revealing for the first time the gap between current DSE systems and real-world requirements.
RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis: This paper proposes RealHiTBench, the first benchmark to comprehensively evaluate LLMs' capacity to understand complex hierarchical tables. It contains 708 real-world complex tables from 13 platforms across 24 domains and 3,752 questions. It defines 5 complex structural types and 5 major task types, and introduces TreeThinker, a tree-style reasoning pipeline that significantly enhances model understanding of hierarchical headers.
Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models: This work proposes ToolRet, the first large-scale tool retrieval benchmark (comprising 7.6k retrieval tasks and 43k tools), revealing that existing strong Information Retrieval (IR) models perform poorly on tool retrieval tasks (with the strongest model achieving an nDCG@10 of only 33.83). It also contributes the ToolRet-train dataset, featuring over 200k training instances, which significantly improves the tool retrieval capabilities of IR models and enhances the success rate of end-to-end tool-use tasks.
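For readers unfamiliar with the nDCG@10 metric cited in the ToolRet entry above, the following is a minimal sketch with binary relevance. The function name and the tool identifiers are hypothetical, and ToolRet's own evaluation code may differ (for example, in how graded relevance or ties are handled). The function returns a value in [0, 1]; the figure quoted above appears to be on a 0-100 scale.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance; assumes ranked_ids has no duplicates."""
    relevant = set(relevant_ids)
    # Discounted gain of the relevant items found in the top k
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant
    )
    # Best achievable DCG: all relevant items placed at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# The single relevant tool is retrieved at rank 2 instead of rank 1.
score = ndcg_at_k(["tool_c", "tool_a", "tool_b"], ["tool_a"])
```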
Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?: This work reveals the "2D-Cheating" issue in 3D LLM evaluations: 2D VLMs outperform 3D SOTA on certain benchmarks once point clouds are rendered into images. This shows that these benchmarks fail to evaluate genuine 3D understanding. Building on this finding, the authors propose design principles for effective 3D evaluation.
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice QA: This study systematically exposes inconsistencies in LLM evaluation within Multiple-Choice Question Answering (MCQA). It demonstrates that different combinations of evaluation strategies (RegEx, Logprobs, xFinder) and prompting setups (constrained vs. free generation) lead to substantial discrepancies in reported model performance. Furthermore, even state-of-the-art LLM-based answer extractors fail to reliably identify reasoning contradictions, highlighting the urgent need for standardized evaluation protocols.
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios: This paper proposes RuleArena—a benchmark based on three real-world scenarios: airline baggage fees, NBA transaction rules, and tax regulations—to evaluate the ability of LLMs to perform reasoning while following complex natural language rules. Experiments reveal that even the strongest model (o1-preview) achieves less than 50% accuracy on the most difficult tasks, exposing systematic deficiencies of LLMs in rule recall, rule discrimination, and mathematical computation.
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture: Develops SANSKRITI, a large-scale cultural knowledge benchmark of 21,853 multiple-choice questions (MCQs) covering all 36 administrative regions of India and 16 cultural attributes. Zero-shot evaluation across 11 LLMs/SLMs/ILMs reveals significant imbalances in the models' cultural knowledge across geographic regions and attributes.
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science: This paper proposes SeedBench—the first multi-task LLM evaluation benchmark for seed science (seed breeding). It contains 2,264 expert-verified questions covering three major breeding components: gene information retrieval, gene function regulation, and variety selection. Systematically evaluating 26 LLMs, it reveals a significant gap between current LLMs and real breeding requirements.
skLEP: A Slovak General Language Understanding Benchmark: This paper introduces skLEP, the first comprehensive natural language understanding benchmark for Slovak, consisting of 9 multi-level tasks (word-level, sentence-pair-level, and document-level). Systematic evaluation of Slovak-specific, multilingual, and English models reveals that mDeBERTaV3 outperforms all Slovak-specific models in average performance.
Something's Fishy In The Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks: This work presents a systematic analysis of three structural weaknesses—excessive overlap, semantic simplicity, and ground-truth noise—in prevailing Table Union Search (TUS) benchmarks, revealing that simple bag-of-words (BoW) and pre-trained embedding baselines can match or outperform complex SOTA methods on these benchmarks. The findings suggest that existing benchmarks fail to effectively evaluate semantic understanding capabilities.
StrucText-Eval: Evaluating LLM's Reasoning on Structure-Rich Text: This paper proposes StrucText-Eval, which automatically generates semantic-free structured text samples covering 8 structured languages and 29 tasks with a total of 5,800 samples. By adjusting difficulty through controllable nesting depth and width, it reveals that the strongest open-source LLM achieves only 45.8% accuracy on the hard set compared to 92.6% for humans, systematically exposing serious shortcomings of LLMs in pure structural reasoning.
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following: Introduces StructFlowBench, a multi-turn instruction following benchmark integrating structural flow modeling, which defines six fundamental turn-to-turn relationships (Follow-up, Refinement, Recall, Summary, Expansion, Unrelatedness) and establishes a dual-layer constraint evaluation system (intra-turn constraints + inter-turn structural constraints) to systematically evaluate the capability of 13 major LLMs in understanding multi-turn dialogue structures.
SwiLTra-Bench: The Swiss Legal Translation Benchmark: Constructs SwiLTra-Bench, a large-scale multilingual benchmark featuring over 180,000 aligned Swiss legal translation pairs (covering laws, headnotes, and press releases across German, French, Italian, Romansh, and English). It systematically evaluates the performance of frontier LLMs and fine-tuned open-source SLMs in legal translation, and proposes the SwiLTra-Judge automatic evaluation methodology.
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining: This paper proposes TiC-LM, a large-scale temporal continual learning benchmark based on 114 months of Common Crawl data (2.9T tokens). Through over 150 experiments, the work systematically evaluates the performance of optimizers, data replay, and regularization methods in continual pretraining scenarios. The findings reveal that an autoregressive learning rate schedule combined with fixed-ratio data replay can come close to the performance of retraining from scratch while requiring 2.6 times less computation.
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States: This paper proposes the DynToM benchmark, which evaluates how well LLMs track the temporal evolution of human mental states using 78,100 questions over 5,500 temporally linked scenes in 1,100 social contexts. It reveals that models lag behind humans by an average of 44.7%.
Towards Objective Fine-tuning: How LLMs' Prior Knowledge Causes Potential Poor Calibration?: This paper reveals that the prior knowledge of LLMs leads to calibration degradation during fine-tuning (known data triggers overconfidence, whereas unknown data actually benefits calibration). It proposes CogCalib, a cognition-aware calibration framework that dynamically applies different learning strategies based on knowledge bias during training, reducing the Expected Calibration Error (ECE) by an average of 57% while maintaining task performance.
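The Expected Calibration Error mentioned in the CogCalib entry above is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below uses equal-width bins; the bin count, function name, and toy inputs are assumptions for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    # Per bin: [count, sum of confidences, number of correct predictions]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += float(hit)

    total = len(confidences)
    ece = 0.0
    for count, conf_sum, hit_sum in bins:
        if count:
            ece += (count / total) * abs(hit_sum / count - conf_sum / count)
    return ece


# Two answers at 0.9 confidence (both right), two at 0.6 (one right).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

In the toy example the model is underconfident in the first group and overconfident in the second, and both gaps contribute to the score.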
TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning: Proposes TripCraft, a travel planning benchmark dataset integrating real-world spatiotemporal constraints (public transit, activity availability, user personas, etc.), accompanied by five continuous evaluation metrics to systematically assess the quality of LLM-generated itineraries. Under the parameter-informed setting, the temporal meal score improves from 61% to 80%.
TripTailor: A Real-World Benchmark for Personalized Travel Planning: Proposes TripTailor, a large-scale travel planning benchmark built on real-world data, containing over 500k POIs across 40 cities and nearly 4,000 real itineraries. By introducing a three-dimensional evaluation framework assessing feasibility, rationality, and personalization, the authors find that less than 10% of plans generated by state-of-the-art LLMs reach human-level quality.
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages: Proposes TUMLU and TUMLU-mini, the first native multi-task language understanding benchmark for 9 Turkic languages. It contains 38,139 middle and high school multiple-choice questions spanning Latin, Cyrillic, and Arabic scripts. It systematically evaluates 13 open-source and closed-source LLMs, revealing how script, language resource size, and CoT each affect model performance.
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models: Proposes VoxEval, the first SpeechQA benchmark supporting end-to-end speech-only input-output evaluation, covering 56 subjects and 26 input audio conditions, systematically revealing the severe deficiencies of current end-to-end spoken language models in knowledge understanding and mathematical reasoning.
WebWalker: Benchmarking LLMs in Web Traversal: This paper proposes the WebWalkerQA benchmark to evaluate the capabilities of LLMs in deep web traversal for information gathering. It also designs the WebWalker multi-agent framework to mimic human web navigation behaviors via an Explore-Critic paradigm, significantly improving complex QA performance by integrating horizontal and vertical retrieval with RAG.
Where Are We? Evaluating LLM Performance on African Languages: Constructs the Sahara benchmark, which covers 517 African languages, 30 datasets, and 16 task categories, to systematically evaluate 24 LLMs on African languages, revealing how language policy-driven data inequality directly impacts model performance.
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging: This paper proposes the WiCkeD method, which increases the difficulty of existing benchmarks by randomly replacing one option of multiple-choice questions with "None of the above". This leads to an average performance drop of 12.1 percentage points across 18 LLMs, and even Chain-of-Thought reasoning cannot compensate for this decrease.
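The transformation described in the WiCkeD entry above can be sketched roughly as follows. This is an assumption-laden illustration rather than the authors' code: the function name `wicked_variant` is hypothetical, and appending "None of the above" as the last option and passing an explicit random generator are choices made for this example. The key point is that when the dropped option was the gold answer, "None of the above" becomes the correct choice.

```python
import random


def wicked_variant(options, answer_idx, rng):
    """Replace one randomly chosen option with "None of the above".

    Returns the new option list and the index of the new correct answer.
    """
    drop = rng.randrange(len(options))
    kept = [opt for i, opt in enumerate(options) if i != drop]
    kept.append("None of the above")

    if drop == answer_idx:
        # The gold answer was removed, so "None of the above" is now correct.
        new_answer = len(kept) - 1
    else:
        # The gold answer survives; shift its index if an earlier option was dropped.
        new_answer = answer_idx - (1 if drop < answer_idx else 0)
    return kept, new_answer


options = ["Paris", "Rome", "Berlin", "Madrid"]
new_options, new_answer = wicked_variant(options, answer_idx=0, rng=random.Random(0))
```

The number of options stays the same, so scores before and after the transformation remain comparable against the same chance baseline.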
WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models: Proposes WXImpactBench, the first LLM evaluation benchmark for extreme weather impact understanding. It features a four-stage data construction pipeline and two evaluation tasks (multi-label classification and ranking-based question answering) to systematically evaluate the capabilities of multiple LLMs in the domain of climate adaptation.
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering: This work proposes the YESciEval framework, which combines a nine-dimensional fine-grained evaluation rubric and SFT+RL alignment strategies to mitigate the optimism bias of LLM judges. It builds a robust open-source LLM-as-a-Judge system for scientific question answering without requiring human annotations or closed-source models.