📊 LLM Evaluation
💬 ACL2026 · 97 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (131) · 🧪 ICML2026 (40) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (37) · 📹 ICCV2025 (27)
🔥 Top topics: LLM ×33 · Reasoning ×13 · Multimodal/VLM ×4 · Personalized Generation ×3 · Agents ×3
- AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
-
AgentEval models agent execution traces as "Evaluation DAGs," using GPT-4o as a judge to score nodes across five types and trace root causes through a greedy parent strategy. Combined with 21 failure categories and CI/CD integration, it raised failure detection recall 2.17× over end-to-end evaluation (0.41→0.89) on 450 production traces. Agreement with human annotators reached \(\kappa=0.84\), root cause accuracy reached 72% (approaching the human ceiling of 81%), and the median root cause localization time fell from 4.2 hours to 22 minutes in a 4-month pilot.
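The root-cause tracing idea can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes that "greedy parent strategy" means repeatedly stepping from a failed node to its lowest-scoring failing parent, and the node names, scores, and `threshold` value are all hypothetical.

```python
from typing import Dict, List


def trace_root_cause(failed_node: str,
                     parents: Dict[str, List[str]],
                     scores: Dict[str, float],
                     threshold: float = 0.5) -> str:
    """Walk upstream from a failed node, always moving to the
    lowest-scoring parent whose judge score is below `threshold`.
    Stops when the current node has no failing parent left.
    (Assumed interpretation of a greedy parent strategy.)"""
    current = failed_node
    visited = {current}
    while True:
        failing = [p for p in parents.get(current, [])
                   if scores[p] < threshold and p not in visited]
        if not failing:
            return current
        current = min(failing, key=lambda p: scores[p])
        visited.add(current)


# Hypothetical four-node trace: each node maps to its parent nodes.
parents = {
    "plan": [],
    "retrieve": ["plan"],
    "tool_call": ["plan"],
    "answer": ["retrieve", "tool_call"],
}
# Hypothetical judge scores in [0, 1]; lower means worse.
scores = {"plan": 0.9, "retrieve": 0.2, "tool_call": 0.4, "answer": 0.1}

root = trace_root_cause("answer", parents, scores)
print(root)
```

In this toy trace the walk stops at `retrieve`, because its only parent (`plan`) scored above the threshold.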
- Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
-
Addressing the reality of systematic expert disagreement in business idea evaluation, this work constructs the PBIG-DATA dataset containing 3,000 individual expert ratings. It empirically demonstrates that "personalized judges" (conditioned on a target reviewer's history) align better with expert behavior than "aggregate judges" (conditioned on mixed reviewer histories), challenging the common assumption of using pooled labels as the sole ground truth.
- AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
-
This paper introduces AJ-Bench, the first benchmark to systematically evaluate the capabilities of Agent-as-a-Judge. It covers three domains—Search, Data Systems, and GUI—with a total of 155 tasks and 516 annotated trajectories. Experiments demonstrate that Agent-as-a-Judge improves the average \(F1\) score by approximately 13 percentage points compared to LLM-as-a-Judge.
- Are They Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
-
This paper proposes the SCRIPTS benchmark, containing 1.1K English and Korean movie dialogues, to evaluate the social relation reasoning capabilities of 9 LLMs through three-tier probabilistic labels (HIGHLY LIKELY / LESS LIKELY / UNLIKELY). The study finds that models achieve only 75-80% accuracy in English and 58-69% in Korean, and CoT or reasoning-based models provide almost no benefit for social reasoning.
- arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation
-
The authors present the arXiv2Table benchmark (1,957 tables, 7,158 papers), which achieves a more realistic evaluation of LLM-based literature-review table generation by introducing distractor papers, schema-agnostic user demands, and a QA-based reference-free evaluation framework, alongside an iterative batch generation method.
- Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models
-
This paper systematically reviews 134 papers on evidence-based text generation for LLMs. It proposes the first unified taxonomy (Attribution Mechanism × Citation Features × Task), analyzes 300 evaluation metrics categorized into seven dimensions and six methods, and provides a panoramic reference framework for this fragmented field.
- Automated Creativity Evaluation of Language Models Across Open-Ended Tasks
-
This paper proposes an automated, task-decoupled, and reference-free framework to quantify LLM creativity. "Semantic Entropy" is employed to measure divergent creativity (novelty and diversity of ideas), while "Retrieval-based Multi-agent Judging" measures convergent creativity (whether the solution effectively addresses the problem). The study systematically uncovers the impact of model scale, temperature, and reasoning capabilities on creativity across three domains: problem-solving, scientific hypothesis generation, and creative writing.
- BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers?
-
The authors developed a "BadScientist" pipeline: a generation agent that conducts no real experiments uses five "performative fraud" strategies to write seemingly rigorous but fundamentally unsound papers. These are then fed to a multi-model reviewer agent composed of o3 / o4-mini / GPT-4.1. Results show that the acceptance rate for fraudulent papers reaches up to 82%. Furthermore, reviewers often point out integrity issues in their text comments while still assigning acceptance scores (concern-acceptance conflict), and existing mitigation methods perform barely better than random guessing.
- BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
-
Drawing on mature quality control frameworks for multiple-choice questions (MCQs) from the field of education, this work constructs BenchMarker. This tool uses LLM-as-judge to audit 12 mainstream NLP MCQA benchmarks across three dimensions: "contamination + shortcuts + writing errors." The study finds that 47% of TruthfulQA questions can be found directly online, while 100% of HellaSwag questions violate multiple writing rules. It empirically demonstrates that these flaws significantly inflate or deflate LLM accuracy and even alter model rankings.
- Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
-
To be added after deep reading.
- Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind
-
The authors construct the Chameleon psychological profile dataset covering 1,667 users across multiple subreddit contexts. Using ICC decomposition, they demonstrate that 72-74% of psychological variation stems from "state (context)" rather than "trait (personality)." They further reveal that LLMs are nearly blind to these states, while reward models react to states in contradictory directions—consequently, RLHF blindly inherits these state-based biases from the reward models.
- Beyond Itinerary Planning: A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
-
TravelBench is proposed as the first travel planning benchmark integrating real user queries, implicit user preferences, multi-turn interactions, unsolvable task identification, and 10 real-world tools. It implements reproducible evaluation through a sandbox environment, revealing unbalanced performance of cutting-edge models across different capability dimensions.
- Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
-
This paper proposes a framework for evaluating LLM representativeness beyond marginal distributions. By simultaneously examining marginal response distributions and cross-question correlation structures to evaluate demographic-aligned models, it reveals that while fine-tuning and persona prompting improve marginal distribution approximation, neither faithfully reproduces the multivariate correlation patterns found in human value surveys.
- Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation
-
This paper proposes a paired-task framework to jointly evaluate the literary text comprehension and translational creativity of LLMs. A large-scale evaluation of 23 models on 11 classic English novels shows that strong comprehension ability does not translate into human-level translational creativity.
- Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation
-
The authors drive LLM agents to act as users writing harmful comments on real Reddit posts using a "2D persona" (intrinsic identity + extrinsic strategy). This synthesizes a harmful content evaluation set that is more challenging, diverse, and comprehensive than traditional static benchmarks. It reduces the accuracy of four mainstream safety classifiers to 13–31% (vs. 60–94% on static sets), exposing the fact that existing benchmarks have been "over-saturated."
- Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
-
The authors formalize LLM benchmarking as a hierarchical Bayesian estimation problem: prompt difficulty is drawn as \(p_i \sim \mathbb{P}(\mu,\sigma)\), and the correctness of each of the \(k\) generations for prompt \(i\) follows Bernoulli\((p_i)\). They prove that using \(k>1\) samples makes the within-prompt variance term scale as \(\frac{1}{nk}\), and derive prompt-level difficulty scores \(\mathbb{P}(\text{correct})\) and a "data map" capable of detecting mislabeled instances (with a 44.4% hit rate on GSM8K).
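The effect of \(k\) on the benchmark estimate can be illustrated with the standard law-of-total-variance decomposition for this kind of hierarchical model. The sketch below is an assumption-laden illustration rather than the paper's derivation: it takes \(n\) to be the number of prompts, `sigma2` to be the variance of the prompt difficulties, and the function name and example values are hypothetical.

```python
def estimator_variance(mu: float, sigma2: float, n: int, k: int) -> float:
    """Variance of the mean accuracy over n prompts with k generations each,
    assuming p_i has mean mu and variance sigma2 and each generation is an
    independent Bernoulli(p_i) draw.

    Var = sigma2 / n                      (between-prompt term, unaffected by k)
        + (mu * (1 - mu) - sigma2) / (n * k)   (within-prompt term)
    """
    between = sigma2 / n
    within = (mu * (1.0 - mu) - sigma2) / (n * k)
    return between + within


# Hypothetical benchmark: 100 prompts, mean accuracy 0.6, difficulty variance 0.04.
for k in (1, 5, 10, 50):
    print(k, round(estimator_variance(mu=0.6, sigma2=0.04, n=100, k=k), 6))
```

Only the within-prompt term shrinks as \(k\) grows; the between-prompt term can be reduced only by adding prompts.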
- BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications
-
This paper proposes BizCompass, a business reasoning benchmark that bridges theoretical foundations and practical applications. It covers four knowledge domains (Finance, Economics, Statistics, Operations) and three application roles (Analyst, Trader, Consultant). The study systematically evaluates the business reasoning capabilities of open-source and closed-source LLMs, revealing the patterns of transforming theoretical knowledge into practical performance.
- Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
-
This paper constructs ProHist-Bench: anchored by the 1,300-year history of the Chinese Imperial Examination, it features 400 expert-level questions handwritten by historians and 10,891 fine-grained rubrics to evaluate the professional historical research capabilities of 18 SOTA LLMs. Even the strongest models, Gemini-3-Pro and Qwen3-235B, achieved Rubric Scores of only approximately 28, significantly lower than those of open-book historians.
- Can We Predict Before Executing Machine Learning Agents?
-
This paper demonstrates that LLMs can serve as implicit "world models" to predict the quality of ML solutions based solely on task descriptions, verified data reports, and code snippets (DeepSeek-V3.2-Thinking achieves 61.5% accuracy). Based on this, the authors develop ForeAgent, which transforms the traditional "Generate-Execute-Feedback" loop of AIDE into a "Predict-then-Verify" loop, achieving a 6× speedup, 3.2× expanded search space, and a +6% Beat Ratio on MLE-Bench.
- Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
-
This paper proposes a three-step evaluation framework (computational feature extraction + LLM-as-Judge + human expert verification) to systematically evaluate the performance of six LLMs in Tang poetry generation. It identifies a critical "echo chamber" effect: LLMs systematically overestimate machine-generated poems that mimic statistical patterns but violate metrical rules, deviating significantly from human expert judgments.
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
-
OlymMATH is proposed as the first Olympiad-level mathematical benchmark that unifies natural language evaluation and formal theorem proving. It contains 350 bilingual (Chinese and English) problems, covering OlymMATH-EASY/HARD (200 problems with numerical answers) and OlymMATH-LEAN (150 Lean 4 formalized problems), revealing that the strongest models achieve only 58.4% accuracy on the HARD subset.
- CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems
-
CLARITY, proposed by Oracle, is the first diagnostic NL2SQL benchmark that supports "multi-facet ambiguity + unanswerability + single/multi-turn dialogue + diverse user clarification behaviors". Using a controllable LLM pipeline (SQL → pivot term → rewriting → dialogue → screening), it automatically extends Spider/BIRD into approximately 30,000 instances. Through schema-level pivot/group annotations, it reveals a failure mode where SOTA LLMs "can detect ambiguity but cannot locate specific schema elements."
- Common to Whom? Regional Cultural Commonsense and LLM Bias in India
-
This paper constructs Indica, the first benchmark to evaluate sub-national cultural commonsense in LLMs. Focusing on cultural variations across five major regions of India in eight daily life domains, the study finds that only 39.4% of questions reach a consensus across all regions. Furthermore, all evaluated LLMs exhibit geographic bias—disproportionately selecting Central and North India as "default" cultural representatives.
- Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation
-
To address the difficulty of quantifying "missing key information" in long-form generation, this work proposes three comprehensiveness metrics—NLI decomposition + graph analysis, QA comparison, and end-to-end LLM identification. Coverage \(S = |\mathcal{A}_{in}| / (|\mathcal{A}_{in}| + |\mathcal{A}_{out}|)\) is calculated against a reference corpus \(\mathcal{C}\) as the benchmark. Meta-evaluation on WikiContradict / ConflictBank reveals that the simplest E2E method is strongest on average (best LMR=0.85), but Q&A exhibits better robustness (cross-model std of only 0.009 vs. 0.044 for E2E), indicating specific application scenarios for each.
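A small sketch of the coverage formula, assuming \(\mathcal{A}_{in}\) is the set of reference facts found in the generated text and \(\mathcal{A}_{out}\) the set of reference facts that are missing from it. The fact identifiers and the `coverage` helper are hypothetical; how facts are extracted and matched (NLI, QA, or end-to-end) is outside this example.

```python
from typing import Set


def coverage(facts_in: Set[str], facts_out: Set[str]) -> float:
    """S = |A_in| / (|A_in| + |A_out|)."""
    total = len(facts_in) + len(facts_out)
    if total == 0:
        raise ValueError("no reference facts to score against")
    return len(facts_in) / total


# Hypothetical atomic facts from the reference corpus and from a response.
reference_facts = {"a", "b", "c", "d"}
response_facts = {"a", "c", "x"}

facts_in = reference_facts & response_facts   # covered reference facts
facts_out = reference_facts - response_facts  # missing reference facts

score = coverage(facts_in, facts_out)
print(score)  # 0.5
```

Note that the extra fact `x` does not affect the score: the metric measures recall against the reference, not precision.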
- Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge
-
This paper reveals that LLM-as-a-Judge exhibits score range bias in direct assessment tasks, where model outputs are highly sensitive to predefined score ranges. It proposes using a contrastive decoding method to mitigate this issue by canceling out similar biases within the same model family, achieving an average relative improvement of up to 11.3% in Spearman correlation.
- CUB: Benchmarking Context Utilisation Techniques for Language Models
-
The authors evaluate 7 mainstream types of "Context Utilisation Manipulation Techniques" (CMTs) using the unified CUB benchmark. Covering 3 datasets (CounterFact / NQ / DRUID) × 3 context types (gold / conflicting / irrelevant) × 11 LLMs with approximately 800 experimental points, the study demonstrates a fundamental trade-off between "sensitivity to relevant context vs. robustness to irrelevant context" across all existing CMTs, and shows that their effectiveness is generally overestimated on synthetic data.
- DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain
-
The authors constructed DiningBench, the first hierarchical multi-view food benchmark (3,021 dishes / 15,928 images / avg. 5.27 views per dish). It covers three levels of cognitive tasks: "Fine-grained classification (same-store hard negatives) → Nutrition estimation (4-dimensional regression) → VQA (reasoning)." Evaluation of 29 SOTA VLM systems reveals that existing models are significantly deficient in fine-grained visual discrimination and nutritional quantification, and that Chain-of-Thought (CoT) actually impairs pure visual perception.
- Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff
-
This paper proposes LLMThinkBench, a systematic benchmark for evaluating the efficiency of LLMs in basic mathematical reasoning. It introduces the Overthinking Score (a harmonic mean of accuracy and token efficiency) and evaluates 53 LLMs using 14 dynamically generated deterministic math tasks. The study finds that reasoning models generate an average of approximately 18× more tokens, sometimes resulting in lower accuracy, and that scaling the reasoning budget yields diminishing returns.
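As an illustration of a harmonic mean of accuracy and token efficiency, consider the sketch below. The summary does not say how token efficiency is normalized, so the min-max normalization in `token_efficiency` is purely an assumption, and both function names are hypothetical.

```python
def token_efficiency(tokens: float, min_tokens: float, max_tokens: float) -> float:
    """Assumed normalization: 1.0 for the most concise model, 0.0 for the
    most verbose one. The benchmark's actual definition may differ."""
    if max_tokens == min_tokens:
        return 1.0
    return 1.0 - (tokens - min_tokens) / (max_tokens - min_tokens)


def overthinking_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of accuracy and token efficiency, both in [0, 1]."""
    if accuracy + efficiency == 0:
        return 0.0
    return 2 * accuracy * efficiency / (accuracy + efficiency)


# Hypothetical model: 90% accurate, but near the verbose end of the range.
eff = token_efficiency(tokens=1800, min_tokens=100, max_tokens=2100)
print(round(eff, 3), round(overthinking_score(0.9, eff), 3))
```

Because the harmonic mean is dominated by the smaller term, a highly accurate but very verbose model still receives a low score.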
- Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
-
DIA is a training-free method for format-constrained generation in diffusion large language models. By predicting the position of end anchors before iteratively infilling between them, it significantly improves the format accuracy of reasoning templates and JSON outputs while mitigating truncation or redundancy caused by fixed anchors.
- E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task
-
This paper proposes E2EDev, an end-to-end software development benchmark based on Behavior-Driven Development (BDD) principles. It contains 46 real Web projects, 244 fine-grained requirements, and 703 executable BDD tests. The evaluation reveals that even the strongest LLMs (Claude series) do not exceed 60% in requirement accuracy, and the complex interaction costs of multi-agent frameworks are disproportionate to their performance gains.
- EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
-
This paper proposes EngiBench—the first multi-level LLM evaluation benchmark for real-world engineering problem solving. Tasks are organized into three difficulty levels (Basic Knowledge Retrieval → Contextual Reasoning → Open-ended Modeling) and accompanied by three controlled variants (Perturbation / Knowledge Enhancement / Math Abstraction). Covering 1,760 problems across three engineering sub-domains, it reveals that even GPT-4.1 and Claude 3.7 Sonnet lag significantly behind human experts on Level 3 open-ended engineering tasks.
- Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
-
L2T proposes a pre-training framework that integrates 14 language learning tasks (char-level to discourse-level) with standard next-token prediction. It improves BLiMP linguistic competence scores by 2-3 percentage points and accelerates the acquisition process at 500M and 1B parameter scales while maintaining general reasoning performance.
- Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics
-
LEGIT automatically extracts "hierarchical issue trees" from Korean civil/administrative judgments to serve as rubrics. This allows LLM-as-a-judge to evaluate both "issue coverage" and "issue correctness." The study reveals complementary effects between RAG and RL in legal reasoning: RAG improves comprehensiveness, while RL sacrifices coverage for higher correctness.
- Evaluating Memory Capability in Continuous Lifelog Scenario
-
This paper introduces LifeDialBench, a benchmark for evaluating memory capabilities in continuous lifelog scenarios (comprising EgoMem with 7 days of real data and LifeMem with 1 year of simulated data). It introduces an online evaluation protocol to ensure temporal causality and counter-intuitively finds that simple RAG baselines consistently outperform complex memory systems.
- Evaluating Reasoning Models for Queries with Presuppositions
-
This paper constructs ≈13K true/false claims across health, science, and common sense with five levels of presupposition intensity to evaluate 6 major models (GPT-OSS / Qwen3 / GPT-5 Mini / Gemini 2.5) in both thinking-on and thinking-off modes. It finds that reasoning only yields a slight 2-11% accuracy improvement while making models more "decisive"—being wrong with higher confidence—and remaining sycophantic to 26-42% of false claims.
- Evaluating Temporal Consistency in Multi-Turn Language Models
-
This paper introduces ChronoScope, an evaluation suite containing 1.46 million automatically synthesized multi-turn QA chains based on Wikidata. It specifically tests whether LLMs can "maintain previously implied temporal scopes" during multi-turn interactions. The study finds that high-performing models, including GPT-4 and Gemini-2.5, systematically suffer from "present-day drift," which worsens as interactions lengthen and cannot be eliminated even with oracle context.
- Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language
-
This paper introduces "Chouxiang Language," a Chinese internet subculture language, to the NLP community and constructs Mouse, the first evaluation benchmark (comprising six tasks: translation, representation classification, intent recognition, toxicity detection, meaning selection, and cloze test). It discovers that while SOTA LLMs perform reasonably in contextual semantic understanding, they exhibit significant limitations in other tasks.
- Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
-
Fin-Bias constructs a controlled benchmark from 8,868 long-form analyst reports, each in three input versions: "Original / Removed Rating / Replaced with Fake Rating." It demonstrates that 18 LLMs (including GPT-5 and Claude-4-Sonnet) exhibit severe "herding" in financial investment ratings; even fabricated fake ratings are blindly followed in 30% of samples. Combining MPQA subjectivity lexicon filtering with DPO fine-tuning can boost an open-source 8B model to accuracy levels exceeding GPT-5.
- Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
-
This paper introduces Finch (FinWorkBench), a benchmark for financial and accounting (F&A) workflows constructed from authentic enterprise environments (e.g., the Enron dataset). It comprises 172 composite workflows and 1,710 spreadsheets (27 million cells). Even the most advanced Agent, GPT 5.1 Pro, achieves only a 38.4% success rate despite an average execution time of 16.8 minutes, highlighting significant deficiencies of state-of-the-art AI Agents in real-world corporate scenarios.
- Gated Tree Cross-Attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
-
The authors attach a Gated Tree Cross-Attention side branch to frozen decoder-only LLMs (Qwen-2.5-7B, Llama-3-8B). An offline Berkeley parser pre-calculates constituency trees, which are indexed into chunk memory by height. Token hidden states retrieve residual updates from this memory via head-wise gated cross-attention, combined with a token update mask and three-stage training to prevent interference. BLiMP accuracy improves from 78.58/79.95 to 83.12/84.61, while performance on MCQA, HellaSwag, and WinoGrande remains stable.
- How Hypocritical Is Your LLM Judge? Listener–Speaker Asymmetries in the Pragmatic Competence of Large Language Models
-
This paper systematically compares 14 LLMs as "pragmatic listeners" (judging pragmatic appropriateness) and "pragmatic speakers" (generating pragmatically appropriate language) across three pragmatic tasks (false presuppositions, anti-presuppositions, and deductive reasoning). The study reveals a widespread listener-speaker asymmetry: most models perform significantly better as judges than as generators, and item-level analysis demonstrates that correct judgment does not reliably predict successful generation.
- HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
-
This paper identifies that LLM-as-a-judge exhibits "negotiation inconsistency" in long-form open-ended writing—where sub-score aggregation is unstable and uninterpretable. It proposes Tree-of-Writing (ToW), which explicitly models writing evaluation as a tree pipeline consisting of three main nodes (Content / Format / Impression), leaf nodes, and an explicit LLM-negotiator for weights. On HoWToBench (1,302 Chinese samples across 12 genres), it increases system-level Pearson correlation from 0.85-0.89 to 0.93 while remaining robust to text perturbations.
- HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
-
This paper proposes the HumanLLM framework, which models 244 psychological patterns (100 personality traits + 144 social cognitive patterns) as interacting causal forces rather than isolated labels. It constructs 11,359 scenarios featuring interactions of 2-5 patterns and a multi-turn dialogue dataset. Through a dual-layer checklist evaluation, it achieves high alignment with human judgment (\(r=0.90\)). HumanLLM-8B outperforms Qwen3-32B in multi-pattern dynamics despite having 4x fewer parameters.
- Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
-
HalluHunter is a fully automated LLM factual error testing framework based on Knowledge Graphs (KG). It extracts factual triples from Wikidata, generates three question types (Yes/No, Multiple Choice, and WH-questions) using rule-based methods, and supports multi-hop reasoning. Through an "adaptive iterative algorithm" that selects the next batch of difficult questions based on entity similarity and relationship accuracy from previous incorrect responses, it reduces the accuracy of nine mainstream LLMs by 32–42% after five iterations, triggers errors in up to 55% of items, and significantly outperforms static benchmarks.
- Idiom Understanding as a Tool to Measure the Dialect Gap
-
This paper proposes three new French idiom understanding benchmark datasets (Quebec French QFrCoRE/QFrCoRT and Standard French MFrCoE). Evaluation of 111 LLMs reveals that 65.77% of models perform significantly worse on dialectal idioms than on standard French, quantifying the dialect gap phenomenon.
- IF-Critic: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
-
This paper proposes IF-Critic-14B: it first uses a Checklist Generator to decompose complex instructions into a list of constraints, then enables the critic to provide "explanation + 0/1 judgment" for all constraints within a single inference. Through high-quality critique training with multi-stage filtering and constraint-level DPO, it outperforms o4-mini / Gemini-1.5-Pro (noted as Gemini-3-Pro in original text) on four instruction-following benchmarks. Furthermore, using approximately 1/3 of the compute, it enables 7B/8B policy models to match the performance of 32B/70B family models on Multi-IF / CFBench / SysBench after GRPO training.
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
-
This paper introduces IF-RewardBench: the first meta-evaluation benchmark for judges that covers single-turn, multi-turn, and system-prompt instructions. It features responses generated by 16 LLMs and rigorous human annotation (Cohen's \(\kappa=0.87\)). The benchmark upgrades the traditional pairwise/BoN evaluation paradigm to listwise evaluation based on Pareto-dominance preference graphs. Evaluations of 22 SOTA judges (including Gemini-3-Pro, GPT-5.1, and various reward models) reveal that the strongest judge achieves a Kendall \(\tau_b\) of only 0.609 (far below the human baseline of 0.755), all specialized RMs score below 0.2, and this benchmark shows significantly higher correlation with downstream BoN performance compared to existing benchmarks like RewardBench-2 and PPE-IF.
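A Pareto-dominance preference graph can be sketched as below. This assumes each response is represented by a vector of per-constraint satisfaction labels (1 = satisfied, 0 = violated); the summary does not specify the benchmark's exact construction, so the representation, response names, and helper functions are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def dominates(a: Sequence[int], b: Sequence[int]) -> bool:
    """a Pareto-dominates b: at least as good on every constraint and
    strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def preference_edges(scores: Dict[str, Tuple[int, ...]]) -> List[Tuple[str, str]]:
    """Directed edges (better, worse) between responses; pairs where
    neither dominates the other stay unconnected."""
    return [(a, b) for a, b in permutations(scores, 2)
            if dominates(scores[a], scores[b])]


# Hypothetical responses judged on three constraints.
scores = {
    "r1": (1, 1, 1),
    "r2": (1, 0, 1),
    "r3": (0, 1, 1),
    "r4": (0, 0, 1),
}
edges = preference_edges(scores)
print(edges)
```

Here `r2` and `r3` each violate a different constraint, so neither dominates the other and no edge connects them. A listwise judge would then be scored only on the orderings the graph actually implies.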
- Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
-
The authors turn the research lens on the NLP community itself: using a "reportability codebook" of 20 standards, they perform a large-scale audit of 9100+ *CL papers from 2023–2025 (284 fully manually annotated + 1800+ LLM-assisted). They demonstrate that human evaluation, revered as the "gold standard," suffers from widespread underreporting—more than half of the papers report \(\le 7\) out of 20 items, statistical significance is rarely mentioned, and power analysis is virtually non-existent, suggesting the gold standard is more of an "illusion."
- Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
-
POLARIS compiles natural language safety policies into first-order logic specifications, constructs semantic policy graphs, and systematically traverses them to generate test queries. This shifts LLM safety evaluation from heuristic red-teaming to traceable, coverage-guaranteed, and reproducible specification-driven testing.
- K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
-
The authors constructed K-MetBench, containing 1,774 questions based on 25 editions of the South Korean National Meteorological Engineer certification exams. Evaluating 55 LLMs/MLLMs across four orthogonal dimensions—"Multimodal Vision / Expert Reasoning / Geo-cultural / Sub-domain Granularity"—the study reveals a universal modality gap (an average 18.6% drop in accuracy for visual meteorological charts compared to text), a reasoning gap (correct answers with hallucinated rationales), and a geo-cultural gap (the smaller local model A.X-4.0 outperformed the 235B Qwen3-VL 78.9 to 72.6 on Korean-specific questions). This demonstrates that parameter scale alone cannot resolve cultural localization issues.
- Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
-
The authors develop MyScholarQA, the first open-source personalized Deep Research (DR) system using a profile-action-report tripartite architecture, which outperforms other DR baselines across 16 offline metrics. However, 90-minute interviews with 21 researchers reveal nine types of personalization failure modes completely undetected by offline evaluations. Furthermore, four major LLM judges fail to accurately predict user satisfaction, serving as a warning against replacing real users with LLM judges.
- Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
-
This paper presents the first large-scale systematic audit of the native sampling capabilities of 11 frontier LLMs across 15 probability distributions. It reveals that LLMs severely lack intrinsic probability sampling mechanisms, and this deficiency translates into systematic biases in downstream applications.
- LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
-
The RAB-Cred expert-annotated dataset for three-class ("Absent / Positive / Negative") credibility assessment was constructed using 273 asylum decision documents from the Danish Refugee Appeals Board (RAB). A systematic evaluation of 21 open-source LLMs across 30 system×user prompt combinations reveals that prompt design is more significant than model selection. While Phi-4 (14B) achieved a 94.7% F1 in a zero-shot setting, individual models consistently committed "unacceptable" errors. Consequently, a majority-voting ensemble utilizing the 15 optimal model-prompt combinations is recommended, which increased accuracy by 1.5 pp to 96%.
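A majority-voting ensemble over model-prompt combinations might look like the following sketch. The tie-handling rule (returning a `NEEDS_REVIEW` marker) is an assumption added for illustration and is not described in the summary; the function name is hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote(labels: List[str], tie_label: str = "NEEDS_REVIEW") -> str:
    """Return the most frequent label; on an empty input or a tie for first
    place, return `tie_label` (hypothetical fallback for human review)."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return tie_label
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


# Hypothetical predictions from five model-prompt combinations for one document.
votes = ["Positive", "Negative", "Positive", "Absent", "Positive"]
print(majority_vote(votes))
```

With an odd number of voters, such as the 15 combinations mentioned above, two-way ties cannot occur, though three-way ties across the three classes remain possible.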
- LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control
-
LoCar proposes 13 deployment-level KPIs for Korean in-vehicle assistants and evaluates 11 models using human-calibrated LLM-as-a-Judge with honorific morphological verification. Findings show that while general understanding is near saturation, fine-grained honorific control and multi-turn strategic guidance remain significantly unstable.
- MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference
-
This paper proposes the MARCH benchmark (2,209 multi-hop ambiguous questions) and the CLARION framework. It is the first systematic study of QA challenges at the intersection of ambiguity resolution and multi-step reasoning, and it reveals significant deficiencies in existing SOTA models on such problems.
- Minos: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text
-
By employing a three-step strategy of "stringent data quality control + SFT + DPO alignment," the authors trained Minos, an 8B evaluation model. Using 57K high-quality evaluation samples—less than half the scale of existing works—Minos can score bidirectional multimodal generation tasks (I2T and T2I). It outperforms all open-source MLLM-evaluators across 16 out-of-domain tasks and approaches the performance of GPT-4o.
- MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
-
The authors formalize whether "MLLM judges truly integrate images, queries, and responses" as Compositional Bias and construct MM-JudgeBias—a diagnostic set containing 9 types of bias and 1804 samples from 29 source benchmarks. Using two complementary metrics, Bias-Deviation (failure to decrease scores when semantics are destroyed) and Bias-Conformity (failure to remain stable when semantics are preserved), they reveal that 26 SOTA MLLM judges (including Gemini-3 Pro, GPT-5.1, and Claude Opus 4.5) exhibit severe modality neglect.
- Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding
-
This paper identifies a "Cognitive Crowding" effect where LLM accuracy plummets to 5.7% when jointly predicting four cognitive dimensions (Emotion, Thinking Style, Stance, and Intent). Gromov \(\delta\)-hyperbolicity analysis indicates that cognitive states possess a hierarchical structure. The proposed HyCoLLM framework models these states in hyperbolic space, enabling an 8B model to outperform GPT-4o.
- Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
-
This paper proposes MT-RL-Judge, a multi-task reinforcement learning framework that jointly optimizes multiple evaluation tasks using GRPO to train a unified MLLM-as-a-Judge model. It consistently outperforms SFT baselines across six benchmarks, including text-image alignment, safety compliance, and visual quality assessment. Furthermore, it demonstrates robust out-of-distribution generalization on the unseen MJ-Bench pairwise comparison format (82.23% on Safety vs. 49.40% for SFT-Unified).
- MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
-
This paper proposes MultiFileTest, the first multi-file-level LLM unit test generation benchmark, covering 20 projects each for Python, Java, and JavaScript. It evaluates 11 frontier LLMs and analyzes the impact of manual fixing and self-repair mechanisms on test quality, revealing that even the strongest models exhibit numerous basic executability errors.
- NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
-
NovBench pairs "novelty claims in paper introductions" with "textual novelty evaluations from reviewers" to create a benchmark of 1,684 samples. Using four dimensions—Relevance, Correctness, Coverage, and Clarity—it systematically reveals that while current general-purpose and specialized LLMs can generate fluent evaluations, they still struggle to truly understand and comprehensively judge academic novelty.
- Personalized Benchmarking: Evaluating LLMs by Individual Preferences
-
This paper performs a personalized ranking analysis of 115 active users on Chatbot Arena, finding that the average Spearman correlation between personalized Bradley-Terry rankings and global rankings is only \(\rho=0.04\) (with 57% of users showing near-zero or negative correlation). This demonstrates that aggregated benchmarks fail to reflect the individual preferences of most users. Furthermore, the study successfully predicts user-specific model rankings using topic and style features.
- PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models
-
This paper introduces PolicyBench (a 21K-item cross-regime policy understanding benchmark for China and the US) and PolicyMoE (a Mixture-of-Experts model based on cognitive levels). It systematically evaluates the capabilities of 11 SOTA LLMs across three cognitive tiers—memory, understanding, and application—revealing that while models perform well in structured reasoning, they remain weak in abstract policy concepts.
- PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts
-
PolitNuggets proposes a multilingual agentic discovery benchmark featuring 400 global political figures and over 10,000 career facts. Using the FactNet dynamic evidence verification protocol, it finds that current agents exhibit high precision but low recall, with the primary bottlenecks being long-tail fact discovery, non-English evidence, and efficient tool use.
- Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
-
This paper conducts systematic pressure tests on deception probes using internal activations of LLMs, finding that near-perfect AUROC on clean data does not equate to deployable robustness: single-direction and entropy proxy explanations are untenable; instead, deceptive signals appear dispersed across multi-dimensional weak features. Style-augmented training can restore probes on 27B models from near-random performance to a held-out style AUROC of 0.983.
- Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs
-
This paper compares humans and four LLMs on a conditional presupposition projection task based on linguistic theory. It finds that while humans jointly utilize probability, antecedent-presupposition relevance, and contextual cues, LLM scoring similarity is significantly decoupled from the quality of theorized reasoning; many human-like judgments likely stem from surface pattern matching.
- Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
-
Q-Daps estimates LLM question-answering difficulty by generating multiple candidate answers and calculating the Shannon entropy of the plausibility distribution after debiasing for popularity. It systematically outperforms readability, retrieval complexity, prompt-based scoring, and uncertainty baselines on TriviaQA, NQ, MuSiQue, and QASC.
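The entropy step described above can be illustrated with a short sketch. It assumes the plausibility scores are non-negative numbers that have already been debiased; the summary does not specify the popularity-debiasing procedure, so it is left out here. The function name `plausibility_entropy` is hypothetical.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in bits) of candidate-answer plausibility scores.

    Scores are normalized into a probability distribution first. Higher
    entropy means plausibility is spread over many candidates, which this
    sketch reads as a harder question.
    """
    if any(s < 0 for s in scores):
        raise ValueError("plausibility scores must be non-negative")
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    probs = [s / total for s in scores]
    return -sum(p * math.log2(p) for p in probs if p > 0)


# One dominant candidate versus four equally plausible candidates.
print(plausibility_entropy([0.97, 0.01, 0.01, 0.01]))
print(plausibility_entropy([0.25, 0.25, 0.25, 0.25]))
```

A question with a single clearly plausible answer scores near zero, while a uniform spread over four candidates reaches the maximum of two bits.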
- Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
-
This paper systematically compares the performance of reasoning models versus standard LLMs as judges. It finds that while reasoning models exhibit superior accuracy, evaluation instruction following, and attack robustness, they remain susceptible to surface-level quality biases. The authors propose PlanJudge, a prompt-only strategy to mitigate these biases.
- ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering
-
This paper constructs ReCoQA—a large-scale benchmark containing 29,270 real estate QA pairs—which requires models to integrate database queries and map API calls for hybrid multi-source reasoning. A hierarchical multi-agent framework, HIRE-Agent, is proposed as a strong baseline, systematically revealing the bottlenecks of existing LLMs in complex reasoning within vertical domains.
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
-
ResearchBench is proposed as the first large-scale benchmark to evaluate the scientific discovery capabilities of LLMs. Based on the theoretical decomposition of "inspiration-driven hypothesis generation," it covers 1,386 papers across 12 disciplines. By decomposing scientific discovery into three sufficient subtasks (inspiration retrieval, hypothesis composition, and hypothesis ranking), the study finds that LLMs perform exceptionally well in cross-disciplinary inspiration retrieval.
- Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation
-
This paper redefines meeting effectiveness evaluation by proposing an objective "Goal Achievement / Time Cost" standard and a temporal fine-grained evaluation paradigm. The authors constructed the AMI-ME dataset containing 2,459 annotated segments from 130 meetings and developed an LLM-based automatic evaluation framework that achieves a Spearman correlation of 0.64.
- ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering
-
This paper proposes ReTraceQA, the first reasoning process evaluation benchmark for commonsense reasoning tasks. It includes 2,421 expert-annotated step-level error localizations and classifications, revealing that 14–24% of SLMs provide correct answers despite flawed reasoning. When reasoning-aware evaluation replaces answer-only evaluation, SLM performance drops by up to 25 percentage points.
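The difference between answer-only and reasoning-aware evaluation can be made concrete with a small sketch. The record layout (a pair of booleans per question) and the function name `accuracy_pair` are assumptions for illustration, not the benchmark's actual data format.

```python
def accuracy_pair(records):
    """Return (answer_only_accuracy, reasoning_aware_accuracy).

    Each record is a tuple (answer_correct, reasoning_valid). Answer-only
    accuracy counts every correct final answer; reasoning-aware accuracy
    counts a sample only if the reasoning trace is also free of errors.
    """
    n = len(records)
    if n == 0:
        raise ValueError("records must not be empty")
    answer_only = sum(1 for a, _ in records if a) / n
    reasoning_aware = sum(1 for a, r in records if a and r) / n
    return answer_only, reasoning_aware


# Four toy samples: the second has a correct answer reached by flawed reasoning.
toy = [(True, True), (True, False), (False, False), (True, True)]
print(accuracy_pair(toy))
```

The gap between the two numbers is the share of samples where a correct answer masks a flawed trace.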
- Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
-
This paper proposes SEMANTICQA, which unifies idioms, lexical collocations, noun compounds, and verbal multiword expressions into classification, extraction, interpretation, and sequential composition tasks. It finds that while strong LLMs perform well in open-ended interpretation, they remain significantly unstable in structured extraction, fine-grained semantic classification, and cascaded workflows.
- Revisiting the Reliability of Language Models in Instruction-Following
-
This paper introduces nuance-oriented reliability and the reliable@k metric, utilizing IFEval++ to examine whether models can consistently handle "cousin prompts" with similar semantics but varying details. It reveals that even high-performing models experience significant performance drops under subtle prompt variations.
- Reward Modeling for Scientific Writing Evaluation
-
This paper proposes SciRM and SciRM-Ref, two open-source reward models specifically designed for scientific writing evaluation. By employing a two-stage reinforcement learning (GRPO) approach, the models optimize evaluation preferences and reasoning capabilities respectively, achieving fine-grained multi-aspect evaluation across various scientific writing tasks and generalizing to unseen evaluation tasks and criteria.
- RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
-
RoleConflictBench constructs 13,914 role conflict scenarios and utilizes situational urgency as an objective constraint to evaluate the contextual sensitivity of LLMs. The study reveals a significant issue where model decisions are dominated by static role preferences rather than responding to dynamic situational cues.
- Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities
-
This paper uses an external Elo preference evaluation of 144 personality traits to find that nine frontier LLMs, despite originating from different laboratories, generally converge toward a "structured, systematic, and precise" assistant-like personality. Distinctions are primarily concentrated in mid-range stylistic traits such as being poetic or playful.
- ScaleBox: Enabling High-Fidelity and Scalable Code Verification for Large Language Models
-
ScaleBox improves verification accuracy and throughput in LLM code training and evaluation through automated special judge synthesis, unified verification workflows, and distributed fine-grained parallelism, yielding more stable Pass@1 gains in LiveCodeBench RLVR experiments.
- SCAN: Structured Capability Assessment and Navigation for LLMs
-
SCAN advances LLM evaluation from a single leaderboard to a navigable capability profile: it automatically constructs a hierarchical capability taxonomy, generates realistic queries covering long-tail capabilities using RealMix, and improves automatic scoring reliability via the PC2 judge. This reveals fine-grained strengths and weaknesses across 21 mainstream LLMs that are otherwise masked by total scores.
- SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
-
SciCustom decomposes scientific evaluation requirements into reusable ontological knowledge units and automatically constructs domain-specific benchmarks via a tagger, multi-model voting, binary-search relevance filtering, and proxy subset selection, achieving the highest Spearman rank consistency across 10/11 chemistry and medical subtasks.
- SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
-
This paper constructs SciImpact—the first large-scale scientific impact prediction benchmark spanning 19 disciplines and 7 impact dimensions (citations, awards, patents, media, code, datasets, and models). It contains 215,928 comparative paper pairs, and multi-task fine-tuning enables a 4B model to outperform large models such as o4-mini.
- SessionIntentBench: A Multi-Task Inter-Session Intention-Shift Modeling Benchmark
-
This paper proposes SessionIntentBench, a multi-task benchmark for evaluating the capability of L(V)LMs to understand cross-step intention drift in e-commerce shopping sessions. It comprises four progressive sub-tasks (intention purchase likelihood estimation, attribute regularization, intention verification contrast, and intention evolution modeling), featuring 1.9 million intention items and 1.13 million intention trajectories. Experiments indicate that over 20 current L(V)LMs perform poorly in capturing complex session intentions.
- SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
-
SPENCE detects and quantifies data contamination behaviors of LLMs on NL2SQL benchmarks by systematically rewriting benchmark queries syntactically and measuring the decay of execution accuracy with syntactic distance. It finds that older benchmarks (such as Spider) exhibit stronger contamination signals, while the newer BIRD benchmark is almost unaffected.
- Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
-
This paper reveals a critical vulnerability of LLM evaluators: while highly stable under repeated evaluation, they undergo significant reversals (49% flip rate, 74% under authoritative framing) when subjected to subsequent conversational challenges. This indicates that stability does not equate to robustness and that confidence levels fail to predict actual reliability.
- Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
-
PRECISE extends Prediction-Powered Inference (PPI) to ranking evaluation metrics. By combining a small number of human annotations with a large volume of LLM judgments, it corrects systematic biases in the LLM judgments while reducing estimation variance, achieving statistically reliable evaluation of ranking systems.
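For intuition, here is a sketch of the basic PPI point estimator for a mean, which is the building block that PRECISE extends. It is not the paper's ranking-metric estimator. It assumes per-item scores (for example, per-query metric values) are available from human labels on a small set and from an LLM judge on both the small set and a large unlabeled set. All names are hypothetical.

```python
from statistics import fmean


def ppi_mean(labeled_human, labeled_llm, unlabeled_llm):
    """Prediction-powered estimate of the mean human score.

    estimate = mean(LLM scores on unlabeled items)
             + mean(human score - LLM score on the labeled items)

    The second term (the "rectifier") corrects the LLM judge's average bias
    using the small human-annotated set.
    """
    if len(labeled_human) != len(labeled_llm):
        raise ValueError("labeled human and LLM scores must be paired")
    if not labeled_human or not unlabeled_llm:
        raise ValueError("both labeled and unlabeled sets must be non-empty")
    rectifier = fmean([h - m for h, m in zip(labeled_human, labeled_llm)])
    return fmean(unlabeled_llm) + rectifier


# The judge over-scores one of four labeled items, so the estimate is pulled down.
human = [1.0, 0.0, 1.0, 1.0]
judge_on_labeled = [1.0, 1.0, 1.0, 1.0]
judge_on_unlabeled = [1.0, 1.0, 0.0, 1.0]
print(ppi_mean(human, judge_on_labeled, judge_on_unlabeled))
```

In the toy data the judge is too generous by 0.25 on average, so the raw judge mean of 0.75 is corrected to 0.5. Confidence intervals, which are the main point of PPI, are omitted from this sketch.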
- StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
-
StratMem-Bench categorizes memories in virtual character conversations into three types: must, nice, and irr. It evaluates whether models can actively incorporate beneficial memories and suppress irrelevant ones while ensuring factual requirements are met. The results reveal that current powerful LLMs remain significantly unstable in "supportive memory selection."
- Stress Testing Factual Consistency Metrics for Long-Document Summarization
-
This paper stress-tests six commonly used reference-free factuality metrics in long-document summarization. It discovers that these metrics are significantly influenced by meaning-preserving paraphrasing, retrieval window sizes, and high-information-density claims, indicating that metrics designed for short summaries cannot be reliably transferred to long-document scenarios.
- TabReX: Tabular Referenceless eXplainable Evaluation
-
This paper proposes TabReX, a referenceless graph-reasoning-based framework for tabular generation evaluation. It converts source text and generated tables into knowledge graph (KG) triples and aligns them to compute explainable attribute-driven scores. TabReX significantly outperforms existing methods in correlation with human judgment and establishes a large-scale benchmark, TabReX-Bench.
- TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
-
This paper proposes TaxPraBen, the first LLM evaluation benchmark for Chinese tax practice, consisting of 14 datasets with \(7.3K\) samples covering three real-world scenarios: tax risk prevention, audit analysis, and tax planning. It designs a scalable evaluation paradigm of "structured parsing—field alignment extraction—numerical and text matching." Evaluations of 19 LLMs show that closed-source and Chinese-optimized models perform better, while the tax-domain fine-tuned model YaYi2 shows limited improvement.
- Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies
-
This work reformulates grounded claim factuality checking as a True/False reading comprehension task. By incorporating structured prompts based on human test-taking strategies, LLMs can efficiently and accurately verify claims with minimal reasoning steps. Furthermore, Small Language Models (SLMs) are trained via Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to replace Large Language Models, achieving over 80% savings in inference costs.
- Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation
-
This paper investigates whether language models can learn to predict the empirical success of research ideas. By constructing a dataset of 11,488 idea pairs based on objective outcomes from PapersWithCode, the authors trained an 8B model using SFT and RLVR to achieve 77.1% accuracy, outperforming GPT-5's 61.1% and serving as an effective idea verifier for automated scientific discovery.
- The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
-
This paper points out that constrained softmax in zero-shot LLM classification discards probability mass near label synonyms. It proposes a training-free Semantic Softmax that aggregates the "silent votes" of top-K vocabulary tokens back to target labels, significantly reducing ECE and Brier Score while improving AUROC/F1.
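The aggregation idea can be sketched as follows. This is an illustration under strong assumptions, not the paper's method: the mapping from each label to its "semantic neighborhood" is a hand-written, hypothetical set of tokens, and the input is assumed to be a dictionary of top-K next-token probabilities. How the paper actually builds the neighborhoods is not stated in the summary.

```python
def constrained_softmax(top_k_probs, labels):
    """Renormalize probability over the exact label tokens only."""
    mass = {label: top_k_probs.get(label, 0.0) for label in labels}
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no probability mass on any label token")
    return {label: m / total for label, m in mass.items()}


def semantic_softmax(top_k_probs, neighborhoods):
    """Sum the probability of each label's neighbor tokens, then renormalize.

    `neighborhoods` maps a label to the set of tokens counted as votes for it
    (hypothetical, hand-written here).
    """
    mass = {
        label: sum(top_k_probs.get(tok, 0.0) for tok in tokens)
        for label, tokens in neighborhoods.items()
    }
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no probability mass in any neighborhood")
    return {label: m / total for label, m in mass.items()}


top_k = {"positive": 0.3, "good": 0.2, "great": 0.1,
         "negative": 0.2, "bad": 0.1, "the": 0.1}
hoods = {"positive": {"positive", "good", "great"},
         "negative": {"negative", "bad"}}
print(constrained_softmax(top_k, ["positive", "negative"]))
print(semantic_softmax(top_k, hoods))
```

In the toy distribution, the mass on "good", "great", and "bad" is ignored by the constrained variant but counted by the aggregated one, which shifts the estimated class probabilities.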
- VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
-
This paper proposes VC-Inspector, a reference-free video caption evaluation metric based on lightweight open-source multimodal models (Qwen2.5-VL 3B/7B). By generating training data through a controllable factual error synthesis pipeline, it achieves a human judgment correlation of \(\tau_b\)=42.58 on VATEX-Eval, surpassing the GPT-4o-dependent G-VEval (\(\tau_b\)=39.40), and reaches 99.6% accuracy on hallucination detection benchmarks.
- Erosion of Correct Beliefs: A Study of LLM Cognitive Resilience under Clinical Stress
-
Using Med-Stress, a multi-turn adversarial stress evaluation framework, this paper reveals that high medical knowledge does not guarantee LLM belief stability. It proposes two defense strategies, inference-time RBED and training-time R-FT, to enhance the cognitive resilience of LLMs in clinical dialogues.
- When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
-
This paper reveals a severe "informativeness bias" in VLM-as-a-Judge systems—where judges tend to favor more detailed and rich responses even when they contradict visual content. It proposes the BIRCH paradigm, which reduces bias by up to 17% and improves performance by up to 9.8% by calibrating candidate answers before comparison.
- WildIFEval: Instruction Following in the Wild
-
WildIFEval is a single-turn constraint generation benchmark extracted from real-world user conversations, comprising 7,523 tasks and 24,731 constraints. It automatically decomposes each user instruction into fine-grained constraints categorized into 8 major classes and employs an LLM-as-judge for "strict/soft" dual scoring. This work characterizes the distribution and co-occurrence of constraints in real-world instructions for the first time and reveals a capacity bottleneck where the overall success rate drops sharply as the number of constraints increases, while the success rate per individual constraint remains nearly unchanged.
- Zero-shot Large Language Models for Automatic Readability Assessment
-
This paper systematically evaluates the zero-shot ARA capabilities of 10 open-source LLMs across 14 multilingual readability datasets and proposes LAURAE: an ensemble method that weights the LLM's expected readability score against traditional formulas using verbal confidence, outperforming existing unsupervised methods on 13/14 datasets.