📊 LLM Evaluation

💬 ACL2026 · 45 paper notes

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL: Abstain-R1 proposes a clarification-aware RLVR reward that jointly optimizes explicit abstention and post-refusal clarification (identifying missing information) on unanswerable queries, enabling a 3B model to match or surpass large models such as DeepSeek-R1 on both abstention and clarification quality.
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models: This paper proposes AnchorMem, a memory framework inspired by the Proustian phenomenon in cognitive science. It decouples retrieval units (atomic facts) from generation contexts (original interactions) and connects fragmented memories via an associative event graph, achieving substantial improvements over existing memory systems such as A-Mem and Mem0 on the LoCoMo benchmark.
Are They Lovers or Friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues: This paper introduces SCRIPTS, a benchmark comprising 1.1K English and Korean movie dialogues, evaluating the social relationship reasoning capabilities of 9 LLMs via a three-tier probabilistic labeling scheme (HIGHLY LIKELY / LESS LIKELY / UNLIKELY). Results show that models achieve only 75–80% accuracy on English and 58–69% on Korean, with Chain-of-Thought prompting and reasoning models providing little to no benefit for social reasoning.
Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models: This paper presents a systematic survey of 134 papers on evidence-based text generation with LLMs, proposing for the first time a unified taxonomy (attribution approach × citation characteristics × task), analyzing 300 evaluation metrics organized into seven dimensions and six method categories, and providing a panoramic reference framework for this fragmented field.
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage: AutoReproduce proposes a multi-agent framework that mines implicit domain knowledge from cited references via a "Paper Lineage" algorithm, enabling end-to-end automatic reproduction of paper experiments. On the self-constructed benchmark ReproduceBench, it achieves a code execution rate of 94.87% with a performance gap of only 19.72%.
Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation: This paper proposes a paired-task framework for jointly evaluating LLMs' literary text comprehension and translational creativity. It benchmarks 23 models across 11 classic English novels and finds that strong comprehension ability does not transfer to human-level translational creativity.
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and Applications: This paper introduces BizCompass, a business reasoning benchmark bridging theoretical foundations and practical applications. It covers four knowledge domains (finance, economics, statistics, and operations management) and three application roles (analyst, trader, and consultant), systematically evaluating the business reasoning capabilities of both open-source and closed-source LLMs, and revealing how theoretical knowledge transfers to real-world performance.
Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry: This paper proposes a three-step evaluation framework (computational feature extraction + LLM-as-Judge + human expert validation) to systematically assess the Tang poetry generation capabilities of six LLMs. A critical "echo chamber" effect is identified: LLMs systematically overrate machine-generated poems that mimic statistical patterns while violating prosodic rules, diverging significantly from human expert judgments.
CAST: Achieving Stable LLM-based Text Analysis for Data Analytics: This paper proposes the CAST framework, which constrains the latent reasoning trajectories of LLMs through two complementary mechanisms—Algorithmic Prompting and Thinking-before-Speaking—to significantly improve run-to-run stability in text summarization and annotation tasks without sacrificing output quality.
Closing the Modality Reasoning Gap for Speech Large Language Models: This paper proposes TARS (Trajectory Alignment for Reasoning in Speech), a reinforcement learning-based framework that aligns speech-conditioned reasoning trajectories with text-conditioned ones via two dense reward signals—representation alignment and behavior alignment. TARS achieves state-of-the-art performance at the 7B scale, with a Modality Recovery Rate (MRR) approaching or exceeding 100%.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India: This paper introduces Indica, the first benchmark for evaluating LLM performance on sub-national cultural commonsense, focusing on cultural differences across five regions of India in eight domains of everyday life. Only 39.4% of questions reach consensus across all five regions, and all evaluated LLMs exhibit geographic bias—systematically over-selecting Central and North India as the "default" cultural representative.
Contrastive Decoding Mitigates Score Range Bias in LLM-as-a-Judge: This paper identifies score range bias in LLM judges under direct assessment settings: model outputs are highly sensitive to the predefined score range. It proposes contrastive decoding as a mitigation strategy, leveraging the mutual cancellation of similar biases within the same model family, and achieves an average relative improvement of up to 11.3% in Spearman correlation.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition: DiZiNER simulates the pilot annotation workflow in human labeling pipelines by employing multiple heterogeneous LLMs as annotators and a supervisor LLM to analyze inter-model disagreements and iteratively refine task instructions. The method achieves zero-shot state-of-the-art on 14 out of 18 NER benchmarks, with an average improvement of +8.0 F1, and surpasses its own supervisor model, GPT-4o mini, without any parameter updates.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff: This paper presents LLMThinkBench, a benchmark for systematically evaluating the efficiency of LLMs on basic mathematical reasoning. It introduces the Overthinking Score — a harmonic mean of accuracy and token efficiency — and evaluates 53 LLMs across 14 deterministically generated math tasks. Results show that reasoning models generate on average ~18× more tokens yet sometimes achieve lower accuracy, and that scaling inference budgets yields diminishing returns.
E2EDev: Benchmarking Large Language Models in End-to-End Software Development Task: This paper proposes E2EDev, an end-to-end software development benchmark grounded in Behavior-Driven Development (BDD) principles. It comprises 46 real-world web projects, 244 fine-grained requirements, and 703 executable BDD tests. Evaluation reveals that even the strongest LLMs (Claude series) achieve no more than 60% requirement accuracy, and that the interaction overhead of multi-agent frameworks is disproportionate to their performance gains.
Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks: L2T proposes a pre-training framework that mixes 14 language learning tasks spanning four linguistic granularities (character → discourse) with standard next-token prediction. At the 500M/1B parameter scale, it improves BLiMP linguistic competence scores by 2–3 percentage points and accelerates their acquisition, while preserving general reasoning performance.
Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language: This paper introduces the Chinese internet subculture language "Chouxiang" (抽象话) to the NLP community and constructs Mouse, the first evaluation benchmark for it, comprising six tasks: translation (TR), representation classification (RC), intent recognition (IR), toxicity detection (TD), meaning selection (MS), and cloze completion (CC). It finds that state-of-the-art LLMs perform reasonably well on contextual semantic understanding but exhibit significant limitations on the other tasks.
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows: This paper introduces Finch (FinWorkBench), a finance and accounting workflow benchmark constructed from real enterprise environments (e.g., the Enron dataset), comprising 172 composite workflows and 1,710 spreadsheets (27 million cells). Even the strongest model, GPT 5.1 Pro, spending an average of 16.8 minutes per workflow, passes only 38.4% of the workflows, revealing critical gaps in frontier AI agents under realistic enterprise conditions.
From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning: This paper formally defines two granularities of LLM unlearning—domain-level and instance-level—and proposes the BiForget framework. Rather than relying on external strong models, BiForget leverages the target model itself to construct high-quality forget datasets via two stages: seed-guided synthesis and adversarial probing. On the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the data volume.
HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents: This paper proposes HiGMem, a two-level event-turn memory system that enables an LLM to first browse event summaries and then predict which fine-grained conversation turns are worth reading, achieving the best F1 on four out of five question categories on the LoCoMo10 benchmark while retrieving an order of magnitude fewer turns.
Idiom Understanding as a Tool to Measure the Dialect Gap: Three new French idiom understanding benchmark datasets are proposed — QFrCoRE and QFrCoRT for Quebec French, and MFrCoE for standard French. Evaluation across 111 LLMs reveals that 65.77% of models perform significantly worse on dialectal idioms than on standard French idioms, quantifying the dialect gap phenomenon.
Language Model as Planner and Formalizer under Constraints: This paper introduces the CoPE benchmark, which injects formally categorized natural language constraints into classical planning environments. It shows that a single constraint sentence can halve the planning performance of state-of-the-art LLMs, exposing critical deficiencies in LLM planning robustness.
LexRel: Benchmarking Legal Relation Extraction for Chinese Civil Cases: This work introduces the first structured taxonomy of legal relations in Chinese civil law (9 domains, 265 relation types) and presents LexRel, a benchmark comprising 1,140 expert-annotated instances. The benchmark is used to evaluate leading LLMs on legal relation extraction, revealing significant limitations of current models on this task, while also demonstrating that incorporating legal relation information yields consistent gains on downstream legal AI tasks.
MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification: This paper introduces MADE—a "living" multi-label text classification benchmark built on FDA medical device adverse event reports, featuring 1,154 hierarchical labels and strict temporal splits. It systematically evaluates 20+ encoder/decoder models across discriminative fine-tuning, generative fine-tuning, and few-shot prompting paradigms, assessing both predictive performance and uncertainty quantification (UQ) capabilities. Key findings reveal critical trade-offs: small discriminatively fine-tuned decoders achieve the best head-to-tail accuracy; generative fine-tuning yields the most reliable UQ; and large reasoning models improve rare-label performance but exhibit surprisingly weak UQ.
Min-k Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics: Min-k Sampling detects "semantic cliffs" — the boundary between high-confidence candidate tokens and low-quality tail noise — by analyzing the local structure of the sorted logit distribution. This yields strictly temperature-invariant truncation that maintains robust performance on reasoning and creative writing tasks even under extreme temperatures.
Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates: This paper proposes Source-Shielded Updates (SSU), a column-wise freezing strategy driven by source-data parameter importance scoring. During continual pre-training (CPT) using only unlabeled target-language data, SSU reduces source-language performance degradation from 20.3% (full fine-tuning) to 3.4%, while maintaining target-language performance on par with or superior to full fine-tuning.
Modeling Multi-Dimensional Cognitive States in Large Language Models under Cognitive Crowding: This paper identifies that LLMs suffer a dramatic accuracy drop to 5.7% when jointly predicting four cognitive dimensions—sentiment, thinking style, stance, and intent—a phenomenon termed "cognitive crowding." Through Gromov \(\delta\)-hyperbolicity analysis, the paper demonstrates that cognitive states exhibit hierarchical structure, and proposes HyCoLLM, a framework that models cognitive states in hyperbolic space. An 8B model trained under this framework surpasses GPT-4o.
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models: This paper proposes MTR-DuplexBench, a comprehensive multi-round evaluation benchmark for full-duplex speech language models (FD-SLMs). By introducing a novel turn segmentation method, it addresses the challenges of ambiguous turn boundaries and context inconsistency inherent in full-duplex dialogue. The benchmark covers four dimensions: conversational characteristics, dialogue quality, instruction following, and safety. Experiments reveal a consistent performance degradation of existing FD-SLMs across multi-round interactions.
MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms: This paper introduces MultiFileTest, the first multi-file-level benchmark for LLM-based unit test generation, covering 20 projects each in Python, Java, and JavaScript. It evaluates 11 state-of-the-art LLMs and analyzes the impact of manual and self-repair mechanisms on test quality, revealing that even the strongest models produce substantial basic executability errors.
ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification: This paper introduces the ODUTQA-MDC task and benchmark, the first systematic study of underspecified query detection and multi-turn dialogue-based clarification in open-domain tabular QA. The authors construct a large-scale dataset of 25,105 QA pairs and propose the MAIC-TQA multi-agent framework to perform end-to-end "detect–clarify–reason" tabular question answering.
PIArena: A Platform for Prompt Injection Evaluation: This paper presents PIArena, a unified and extensible evaluation platform for prompt injection (PI), integrating multiple state-of-the-art attack and defense methods with plug-and-play evaluation support. It introduces a strategy-based adaptive attack method and systematically exposes critical limitations of existing defenses in terms of generalization, resilience to adaptive attacks, and task-aligned injection scenarios.
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition: This paper proposes ResearchBench, the first large-scale benchmark for evaluating LLMs in scientific discovery. Grounded in a theoretically motivated decomposition of inspiration-driven hypothesis generation, it covers 1,386 papers across 12 disciplines and decomposes scientific discovery into three sufficient subtasks: inspiration retrieval, hypothesis composition, and hypothesis ranking. Results show that LLMs perform surprisingly well on cross-disciplinary inspiration retrieval.
Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation: This paper redefines meeting effectiveness evaluation by proposing an objective criterion of "goal achievement / time cost" and a temporal fine-grained evaluation paradigm. It constructs the AMI-ME dataset comprising 2,459 annotated segments from 130 meetings, and develops an LLM-based automatic evaluation framework achieving a Spearman correlation of 0.64.
ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering: This paper introduces ReTraceQA, the first reasoning process evaluation benchmark for commonsense question answering, comprising 2,421 instances annotated by domain experts with step-level error localization and error categorization. The benchmark reveals that in 14–24% of instances, SLMs reach correct answers via flawed reasoning, and that replacing answer-only evaluation with reasoning-aware evaluation reduces SLM performance by up to 25 percentage points.
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning: This paper introduces the Uniform Information Density (UID) hypothesis from psycholinguistics into the analysis of LLM reasoning. It proposes an entropy-based, step-level information density measurement framework, revealing a counterintuitive pattern in high-quality reasoning trajectories characterized by local uniformity combined with global non-uniformity, and demonstrates that this pattern significantly outperforms conventional confidence/entropy baselines in Best-of-N sampling.
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity: RoleConflictBench constructs 13,914 role conflict scenarios and leverages situational urgency as an objective constraint to evaluate LLMs' contextual sensitivity, revealing that model decisions are dominated by static role preferences rather than dynamic contextual cues.
SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction: This paper introduces SciImpact — the first large-scale scientific impact prediction benchmark spanning 19 disciplines and 7 impact dimensions (citations, awards, patents, media, code, datasets, and models), comprising 215,928 contrastive paper pairs. Multi-task fine-tuning enables a 4B model to outperform large models such as o4-mini.
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness: This paper proposes SABA, a reasoning framework that adopts a "perceive before act" paradigm, explicitly constructing and auditing knowledge states prior to any final decision. It employs Information Fusion (IF) to consolidate narratives into a verifiable baseline state, and Query-driven Structured Reasoning (QSR) to recursively identify and resolve missing premises, achieving state-of-the-art performance on both detective reasoning and general reasoning benchmarks.
SessionIntentBench: A Multi-Task Inter-Session Intention-Shift Modeling Benchmark: This paper proposes SessionIntentBench, a multi-task benchmark for evaluating the ability of L(V)LMs to understand inter-session intention shifts in e-commerce shopping sessions. It comprises four progressively structured subtasks—intent-purchase likelihood estimation, attribute normalization, intent verification contrast, and intent evolution modeling—constructed from 1.9 million intent entries and 1.13 million intent trajectories. Experiments on 20+ L(V)LMs demonstrate that current models perform poorly at capturing complex session-level user intent.
Subject-level Inference for Realistic Text Anonymization Evaluation: SPIA introduces the first subject-level PII inference evaluation benchmark (675 documents, 1,712 subjects, 7,040 PII instances), revealing that even when 90%+ of PII spans are redacted, the subject-level inference protection rate can be as low as 33%, and that anonymization focused on a single target subject leads to greater exposure of non-target subjects.
Task-Aware LLM Routing with Multi-Level Task-Profile-Guided Data Synthesis for Cold-Start Scenarios: This paper proposes a multi-level task-profile-guided data synthesis framework to address the cold-start problem in LLM routing, and introduces TRouter—a routing method that treats task type as a latent variable—which models the query-cost-performance relationship via variational inference, achieving effective routing in both cold-start and in-domain settings.
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context: This paper proposes Quantile Token Regression, a method that inserts dedicated quantile tokens into the input sequence and incorporates retrieved neighbor instances along with their empirical distributions, enabling LLMs to predict full conditional distributions rather than single point estimates. The approach reduces MAPE by approximately 4 points over baselines and narrows prediction intervals by more than 2× on the Airbnb and StackSample datasets.
Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation: This paper proposes SignThought, a reasoning-driven gloss-free sign language translation framework that introduces learnable latent thought slots as an explicit intermediate semantic layer between video and text. A "plan-then-locate" dual-stream decoder decouples semantic planning from visual evidence retrieval, achieving state-of-the-art performance among gloss-free methods on multiple benchmarks.
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale: TingIS is an end-to-end risk event discovery system deployed on a fintech platform. It employs a five-module architecture—semantic distillation, cascaded routing, event linking engine, state management, and multi-dimensional denoising—to extract actionable risk events from massive noisy customer complaints in real time, achieving a P90 alert latency of 3.5 minutes and a 95% high-priority event discovery rate.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems: This paper proposes ErrorProbe, a framework that achieves self-improving semantic fault attribution in multi-agent systems through MAST taxonomy-driven structured decomposition, symptom-driven backward tracing, and a verified memory mechanism. The approach substantially outperforms baselines, particularly in step-level error localization.