📊 LLM Evaluation

🔬 ICLR2026 · 131 paper notes

📌 Same area in other venues: 💬 ACL2026 (97) · 🧪 ICML2026 (40) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (37) · 📹 ICCV2025 (27)

🔥 Top topics: LLM ×31 · Reasoning ×9 · Agents ×6 · Question Answering ×3 · Diffusion Models ×2

ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

AcadReason utilizes 50 research questions from top-tier journal papers across 5 high-reasoning disciplines (Computer Science, Economics, Law, Mathematics, Philosophy) to specifically test whether LLMs and Agents can acquire and reason through academic knowledge "like a researcher." The results show that most LLMs score below 20, with even GPT-5 only reaching 16 points and the strongest Agent, OAgents, peaking at 34 points, revealing a significant gap in "super-intelligent academic research" capabilities.

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

A statistical analysis of how token confidence changes during the denoising process of Diffusion Large Language Models (dLLMs) shows that the "Volatility Band" (VB) region encodes the local semantic structure of the text. Building on this finding, the authors propose AdaBlock-dLLM, a training-free, plug-and-play adaptive block size scheduler that aligns the block boundaries of semi-autoregressive decoding with semantic steps, achieving up to a 5.3% accuracy improvement at the same throughput.

Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation

This paper points out that the mainstream QA selective prediction evaluation for NLG uncertainty estimation is significantly biased by approximate correctness functions. It proposes using SP-MoJI, structured tasks, OOD/perturbation detection, and Elo aggregation to make evaluation conclusions more robust.

Agentic Reinforced Policy Optimization

ARPO is a reinforcement learning algorithm tailored for multi-turn tool-calling agents. It identifies that the token entropy of LLMs spikes after each tool return. Consequently, it adaptively "forks" sampling at these high-entropy steps and employs advantage attribution to propagate the performance differences of branched paths back for learning. This achieves superior performance across 13 reasoning/deep-search benchmarks compared to trajectory-level RL, while using only half the tool-calling budget.

AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

AirQA is a human-annotated AI research QA dataset (13,956 papers, 1,246 questions) covering four question types (single/multi-doc/retrieval/comprehensive) and five element types (text/table/image/formula/metadata). It introduces instance-level objective evaluation using 19 "customized per question" Python functions and proposes a three-agent framework, EXTRACTOR, to automatically synthesize QA pairs and interaction trajectories, enabling a 7B model to reach the tool-calling performance of a 14B model after fine-tuning.

AlphaBench: Benchmarking Large Language Models in Formulaic Alpha Factor Mining

AlphaBench is the first benchmark to systematically evaluate Large Language Models (LLMs) in "Formulaic Alpha Factor Mining" (FAFM). It decomposes the real workflow of quantitative researchers into three major tasks: factor generation, factor evaluation, and factor searching. By cross-evaluating over ten open-source and closed-source models in a real-world backtesting environment (Qlib + CSI300), the study finds that LLMs can reliably generate valid factors but perform close to random guessing when judging factor quality (evaluation task).

An Open-Ended Benchmark and Formal Framework for Adjuvant Research with MLLM

Addressing the "vaccine adjuvant" field long neglected by AI, this work constructs the first expert-annotated open-ended QA benchmark (1,294 QA pairs + 1,364 formal descriptions). It systematically evaluates 11 closed-source and 19 open-source MLLMs and proposes a formal framework that encodes adjuvant design principles and immune mechanisms into structured variables and functions.

AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning

The authors construct AnesSuite, the first comprehensive dataset suite for anesthesiology reasoning. It includes the benchmark AnesBench (7,972 bilingual multiple-choice questions across three cognitive levels) and three training datasets (AnesCorpus/AnesQA/AnesR1). The Morpheus model, trained using SFT+GRPO, allows a 7B model to match the 14B baseline performance while revealing significant bottlenecks in complex clinical reasoning (System 2) for currently leading LLMs.

Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory

This paper argues that when LLMs fail QA tasks or respond with "unsure," it is often not because the knowledge is missing from the parameters, but because it is "submerged" and not expressed. It proposes the Hits@k metric to demonstrate that correct answers frequently reside within the top-k logits but are not selected (e.g., LLaMA3-8B achieves only 17.2% Hits@1 on DBpedia but 57.9% Hits@5). It further reveals that the prevalent "allow unsure" prompting paradigm actively suppresses low-confidence correct answers.
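The Hits@k idea above can be illustrated with a minimal sketch. It assumes each question comes with a list of candidate answers already sorted by model score (highest first) and a single gold answer; the function name `hits_at_k` and the exact-match comparison are illustrative assumptions, not the paper's implementation.

```python
def hits_at_k(ranked_candidates, gold_answers, k):
    """Fraction of questions whose gold answer appears in the top-k candidates.

    ranked_candidates: one list per question, sorted from most to least likely.
    gold_answers: one gold answer per question (exact match is an assumption).
    """
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        if gold in candidates[:k]:
            hits += 1
    return hits / len(gold_answers)


# Toy data: the gold answer is top-ranked only for the first question.
ranked = [["Paris", "Lyon"], ["Rome", "Milan"], ["Bonn", "Berlin"]]
gold = ["Paris", "Milan", "Berlin"]
print(hits_at_k(ranked, gold, 1), hits_at_k(ranked, gold, 2))
```

A large gap between Hits@1 and Hits@k for larger k is the pattern the summary describes as "submerged" knowledge: the answer is ranked highly but not selected.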

ASIDE: Architectural Separation of Instructions and Data in Language Models

The paper proposes ASIDE, an architectural modification that distinguishes instructions and data via orthogonal rotation at the token embedding layer. By modifying only the forward pass and training on standard instruction-tuning data, it significantly enhances instruction-data separation and prompt injection robustness without requiring specialized safety training.

AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

The AI2 team addresses five methodological flaws in existing scientific agent benchmarks by constructing AstaBench, the first evaluation suite covering the full scientific research process. It includes 11 sub-benchmarks in 4 categories with a total of 2,400+ problems, a production-grade controlled search tool based on Semantic Scholar, and 9 types of research-optimized Asta Agent baselines. In the largest systematic evaluation to date, covering 57 agents (22 types), the study finds that while progress has been made on individual tasks such as literature search, AI remains far from meeting the standards for end-to-end scientific research assistance.

AutoCode: LLMs as Problem Setters for Competitive Programming

AutoCode utilizes a "Validator-Generator-Checker(-Interactor)" closed-loop multi-agent framework to enable LLMs to generate test data for existing competitive problems with ~99% official verdict consistency. Furthermore, starting from seed problems, it automatically generates new problems recognized by Grandmasters as competition-level through "Reference vs. Brute-force" dual verification.

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

AutoCodeBench utilizes AutoCodeGen to automatically synthesize high-difficulty, multilingual, and execution-validated code generation problems. It chains LLM-generated test inputs, sandbox execution for test outputs, reverse prompt generation, and multi-stage filtering into a single pipeline. This process constructed a benchmark of 3,920 problems across 20 programming languages. Experiments show that even the strongest current models achieve an average Pass@1 of no more than 55.4%.

AutoLibra: Agent Metric Induction from Open-Ended Human Feedback

AutoLibra automatically induces a set of fine-grained evaluation metrics—complete with definitions and positive/negative examples—from open-ended natural language feedback on agent trajectories (e.g., "Don’t keep clicking the button if it's disabled"). It employs two meta-metrics, "coverage" and "redundancy," to optimize the metric set. This approach characterizes agent behavior more precisely than expert-defined metrics and enables front-end models to achieve a 20%+ increase in success rates on 2D text games through self-regulated optimization.

AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators

AutoMetrics automatically converts fewer than 100 sparse human feedback signals (upvotes/downvotes, Likert scales, behavioral signals) into a set of interpretable evaluation metrics. It first generates candidate LLM-as-a-Judge criteria and retrieves from a MetricBank of 48 off-the-shelf metrics, then uses Partial Least Squares (PLS) regression to combine them into a composite metric that best fits human judgment. It improves Kendall's correlation with human ratings by up to 33.4% across five tasks and can serve as a proxy reward to optimize downstream agents, performing on par with verifiable rewards.

Benchmarking Overton Pluralism in LLMs

The authors propose the OvertonBench framework, which formalizes Overton pluralism as a set coverage metric, OvertonScore, grounded in a large-scale human study (1,208 representative US participants, 60 subjective questions, 8 LLMs). All current models score only between 0.35 and 0.41 (the theoretical upper bound is 1.0). Based on these findings, the authors also construct an automated evaluation tool that is highly correlated (ρ=0.88) with the human-based scores.

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Addressing the three major flaws of existing long-dialogue memory evaluations—topic fragmentation, narrow domains, and simple recall—this paper first utilizes a recursive plot planning synthesis pipeline to create BEAM (100 dialogues up to 10M tokens + 2,000 probes covering 10 memory capabilities). It then proposes the LIGHT framework, inspired by human cognition, which integrates "Episodic Memory + Working Memory + Scratchpad" systems, achieving an average improvement of 3.5%–12.69% over the strongest baselines on BEAM.

BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation

This paper proposes BiasScope, a fully LLM-driven iterative framework that automatically discovers latent unknown biases in LLM-as-a-Judge at scale. Based on this, it constructs the more challenging JudgeBench-Pro benchmark, where even strong LLM evaluators exhibit error rates exceeding 50%.

BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

BIRD-INTERACT transforms single-turn text-to-SQL evaluation into a dynamic interactive environment featuring a user simulator, knowledge base management, and test case execution. It covers full CRUD operations and intentionally injects ambiguity to evaluate LLM interaction capabilities via two settings: c-Interact (protocol-guided) and a-Interact (autonomous agent). Even the strongest model, GPT-5, achieved only 8.67% (c) / 17.00% (a) on the 600-question full set, exposing a significant gap where current models can write SQL but struggle to clarify tasks through interaction.

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

This paper proposes the Refusal Index (RI)—a measure defined as the Spearman rank correlation between "refusal probability" and "error probability." Using a lightweight procedure that requires only two standard evaluation passes, it quantifies the LLM's capability to "actively refuse questions beyond its knowledge," a dimension overlooked by existing metrics.
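To make the definition concrete, the sketch below computes a Spearman rank correlation between two per-question lists. It assumes that refusal and error probabilities are directly available per question, which is a simplification: the summary states that the paper estimates the index from two standard evaluation passes, and that procedure is not reproduced here. All names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: the model refuses more often exactly where it errs more often.
refusal_prob = [0.1, 0.4, 0.8, 0.9]
error_prob = [0.2, 0.3, 0.7, 0.95]
print(spearman(refusal_prob, error_prob))
```

A value near 1 means refusals concentrate on questions the model would get wrong; a value near 0 means refusals are unrelated to what the model actually knows.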

Can Vision–Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective

The study introduces AesEval-Bench, the first systematic benchmark to evaluate VLM capabilities in graphic design aesthetic assessment (4 dimensions × 12 indicators × 3 tasks). It finds that existing VLMs (including reasoning-enhanced ones) show limited performance in design aesthetics. By utilizing human-guided VLM labeling and indicator-grounded reasoning to construct training data, a fine-tuned 7B model outperforms GPT-5 on precise localization tasks.

Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation and Beyond

This paper proposes the ECHO benchmark, comprising 3 synthetic tasks and 2 real-world chemical tasks based on Density Functional Theory (DFT). It requires Graph Neural Networks to effectively propagate information across 17–40 hops, systematically evaluating the long-range propagation capabilities of 11 GNN architectures.

CatalystBench: A Comprehensive Multi-Task Benchmark for Advancing Language Models in Catalysis Science

This paper constructs the first multi-task benchmark for catalysis science, CatalystBench, which unifies theoretical computational data and experimental literature into 8 tasks covering the "full process of catalyst design." It proposes Multi-head Full-task Fine-tuning (MFT) to decouple classification, regression, and generation heads. The resulting CatalystLLM outperforms strong baselines like GPT-4.1 on most tasks, achieving an average improvement of 12.44% over single-task baselines.

Characterizing Deep Research: A Benchmark and Formal Definition

This paper provides a formal definition for "Deep Research (DR)," a task frequently claimed by various models but never strictly defined. The core is identified as "high fan-out" during the search process rather than merely "outputting long reports." Accordingly, the authors constructed LIVEDRBENCH, a benchmark of 100 open-web tasks, using claim-based Precision/Recall for objective scoring. It reveals that the strongest current system, OpenAI DR, achieves an average F1 of only 0.55, with systems generally covering only about half of the necessary search queries.
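As a rough illustration of claim-based scoring, the sketch below computes precision, recall, and F1 over sets of claims. It assumes claims have already been extracted and normalized so that set intersection can stand in for whatever claim-matching step the benchmark actually uses; the function name `claim_prf` is hypothetical.

```python
def claim_prf(predicted_claims, reference_claims):
    """Precision, recall and F1 over sets of (already normalized) claims."""
    predicted = set(predicted_claims)
    reference = set(reference_claims)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: 3 of 4 reported claims are correct, but only 3 of 6 are found.
predicted = ["a", "b", "c", "x"]
reference = ["a", "b", "c", "d", "e", "f"]
print(claim_prf(predicted, reference))
```

In this toy case recall is the limiting factor, which mirrors the summary's observation that systems cover only part of the required search space.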

ChemEval: A Multi-level and Fine-grained Chemical Capability Evaluation for Large Language Models

ChemEval decomposes LLM chemical capabilities into a four-level hierarchy (Concept → Literature → Molecule → Reasoning), spanning 13 dimensions and 62 tasks (including text and multimodal). Using 3,160 expert-curated questions for fine-grained diagnosis, it reveals that general-purpose models excel at literature comprehension but struggle with deep chemical reasoning, while chemical-specific models understand terminology but almost entirely lose instruction-following capabilities.

Choices Speak Louder than Questions

This paper points out that in Multiple-Choice Question Answering (MCQA) evaluation, large language models (LLMs) often "look at the choices instead of the question"—meaning their decisions are dominated by surface features of the answer options rather than a genuine understanding of the question. It proposes a new scoring method called NPSQ, which disentangles the "question contribution" from the "choice contribution," ensuring evaluation stability even when options are maliciously tampered with.

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

CLASH is an evaluation benchmark consisting of 345 human-written high-stakes value dilemmas and 3,795 character perspectives. It specifically tests whether language models can judge whether a controversial action should be taken from different perspectives. It systematically examines model understanding of ambivalence, psychological discomfort, and value shift over time. Results indicate that even top-tier models like GPT-5 and Claude-4-Sonnet only achieve accuracies of 24.06% and 51.01%, respectively, in judging ambivalence.

CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics

The authors propose CMPhysBench—a benchmark of 520 graduate-level open-ended calculation problems in condensed matter physics. Accompanied by the tree-edit-distance-driven SEED metric for fine-grained partial scoring, it reveals that even the strongest Grok-4 achieves only 36 SEED / 29% accuracy, exposing a significant capability gap for LLMs in frontier physics.

CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Condensed matter theory experts from around the world manually curated CMT-Benchmark, a set of 50 research-level physics problems. Using an automated scoring pipeline capable of handling non-commutative operator algebra, the authors tested 17 frontier LLMs: the strongest model, GPT-5, scored only 30%, and the overall average was 11.4%, debunking the illusion of LLMs as research assistants.

CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

CogniLoad is a synthetic natural language reasoning benchmark built on Cognitive Load Theory (CLT). It employs three independent and tunable parameters—intrinsic difficulty \(d\), distractor density \(\rho\), and task length \(N\)—to manipulate intrinsic, extraneous, and germane cognitive loads (the latter represented as maintenance burden). This allows for precise attribution of long-context reasoning failures to specific dimensions. Evaluating 22 SotA reasoning models revealed that task length is the primary bottleneck and models exhibit a U-shaped response to distractors.

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

Addressing the failure of self-consistency when a "model confidently gives a wrong answer," this paper estimates epistemic uncertainty (EU) using semantic disagreement among a group of same-scale, cross-family open-source LLMs. By adding EU to the original aleatoric uncertainty (AU) to obtain total uncertainty (TU), the study demonstrates that TU's calibration (AUROC) and selective prediction performance are consistently superior to using AU alone across 10 long-form generation tasks using five 7–9B models. The method uses pure text output only, requiring no training or access to logits.

Computer Agent Arena: Toward Human-Centric Evaluation and Analysis of Computer-Use Agents

The "human blind voting + Elo ranking" paradigm from Chatbot Arena is ported to Computer-Use Agents (CUA): two anonymous CUAs execute user-provided tasks in parallel within real cloud desktop environments. Users provide pairwise preference votes on execution trajectories, revealing ranking flips and behavioral-level errors that static benchmarks (e.g., OSWorld) fail to detect.
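For readers unfamiliar with Elo, the sketch below shows the textbook online update applied to one pairwise vote. The K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration; the summary does not say how the arena actually fits its ratings.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated ratings after one battle.

    score_a is 1.0 if the user preferred A, 0.0 if B, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two agents start level; the user votes for agent A's trajectory.
print(elo_update(1000, 1000, 1.0))
```

Because each update depends only on the vote and the current ratings, rankings can shift as new human preferences arrive, which is how an arena can surface rank flips that a fixed benchmark score does not.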

Contamination Detection for VLMs Using Multi-Modal Semantic Perturbations

To address the risk that high scores of VLMs on public benchmarks may derive from training set leakage rather than genuine reasoning, this paper proposes multi-modal semantic perturbations for detection. By using LLMs and diffusion models to slightly modify image semantics while simultaneously changing the correct answer, the method compares model accuracy on original vs. perturbed benchmarks. Clean models correctly answer both, while contaminated models (relying on memorization) fail on the perturbed versions, reliably flagging contamination without requiring access to "clean reference models."

Cost-of-Pass: An Economic Framework for Evaluating Language Models

Borrowing the "production frontier" theory from economics, this paper proposes cost-of-pass (the expected dollar cost to generate one correct answer) as a unified evaluation framework that merges "accuracy × inference cost" into a single metric. It uses this framework to reveal the economic niches of models of different sizes across various tasks, the rate of decline in the cost frontier over the past year, and the fact that most inference-time enhancement techniques (majority voting, self-correction) are actually not cost-effective on the scale of "buying correctness."
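A small worked example helps make the metric concrete. If attempts are independent and each succeeds with probability p, the expected number of attempts until the first success is 1/p, so the expected cost of one correct answer is the per-attempt cost divided by p. The sketch below encodes that reading; the model names, prices, and accuracies are made-up values, and treating the frontier as the minimum over available models is an assumption about how the framework is applied.

```python
import math


def cost_of_pass(cost_per_attempt, success_rate):
    """Expected dollar cost to obtain one correct answer.

    Assumes independent attempts, so the expected number of attempts is
    1 / success_rate. A model that never succeeds has infinite cost.
    """
    if success_rate <= 0:
        return math.inf
    return cost_per_attempt / success_rate


def frontier(models):
    """Cheapest cost-of-pass among models given as name -> (cost, accuracy)."""
    return min(cost_of_pass(c, p) for c, p in models.values())


# Hypothetical numbers: a cheap weak model versus an expensive strong one.
models = {"small": (0.02, 0.5), "large": (0.10, 0.8)}
print({name: cost_of_pass(c, p) for name, (c, p) in models.items()})
print(frontier(models))
```

Here the weaker model is still the more economical way to "buy" a correct answer, which is the kind of niche effect the framework is meant to expose.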

Credit-Budgeted ICPC-Style Coding: When Agents Must Pay for Every Decision

This paper proposes USACOArena, an ACM-ICPC style online arena driven by a unified "credit" economy. Coding agents must pay for every generated token, local test, and second of wall-clock time, shifting evaluation from "isolated code accuracy" to "budget-constrained cost-aware decision making."

CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning under Partial Observations

CubeBench is a generative benchmark with three difficulty tiers based on the Rubik's Cube. It isolates three core cognitive abilities—spatial reasoning, long-horizon mental simulation, and active exploration under partial observation—from perception. Findings reveal that all major LLMs, including GPT-5, achieve a consistent 0.00 pass rate on long-horizon tasks.

Culture In a Frame: C\(^3\)B as a Comic-Based Benchmark for Multimodal Culturally Awareness

C3B (Comics Cross-Cultural Benchmark) utilizes 2,220 comic panels and 18,789 QA pairs to establish a task chain of three progressive difficulty levels: "Identifying Cultural Objects → Judging Cultural Conflicts → Cross-lingual Cultural Content Generation." It specifically evaluates the cultural awareness of Multimodal Large Language Models (MLLMs). Evaluations across 11 open-source MLLMs demonstrate a significant performance gap compared to human levels.

Culture in Action: Evaluating Text-to-Image Models through Social Activities

This paper argues that existing text-to-image (T2I) evaluation focuses only on static objects like "food/landmarks/clothing," neglecting social activities that truly carry culture. The authors construct the CULTIVate benchmark (16 countries × 576 social activities × 19k generated images) and propose the AHEaD framework. This framework uses LLM-generated "cultural descriptors" to decompose images into interpretable dimensions, quantifying cultural faithfulness through alignment, hallucination, exaggeration, and diversity. Its composite metric, FAITH, correlates 27% better with human judgment than baselines and reveals that T2I models are systematically more faithful to the Global North than the Global South.

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

CyberGym constructs a cybersecurity evaluation benchmark that is more than 7 times larger than existing similar benchmarks by using 1,507 historical vulnerabilities from 188 real-world open-source projects on OSS-Fuzz. The core task requires AI agents to generate Proof-of-Concept (PoC) exploits given only a textual vulnerability description and the pre-patch codebase. Results show that even the strongest agent+model combinations achieve only about a 20% success rate. Furthermore, the evaluation process inadvertently discovered 34 0-day vulnerabilities and 18 incomplete patches, proving it to be both a rigorous benchmark for measuring AI progress and a platform capable of generating real-world security impact.

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

DAComp is a 210-task benchmark covering the enterprise-grade "full data intelligence lifecycle." It decomposes data intelligence into a "Hard axis" for warehouse-level Data Engineering (DE) and a "Soft axis" for open-ended Data Analysis (DA). These are evaluated using executable multi-metrics and hierarchical rubric-based LLM-judging, respectively. The study found that even GPT-5's strict success rate on DE is only 20%, and its DA average is below 50%, exposing critical weaknesses in current data agents regarding holistic pipeline orchestration and open-ended reasoning.

DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

DARE-bench is a large-scale verifiable benchmark for data science tasks, containing 6,300 Kaggle-derived tasks. It supports two evaluation categories: ML modeling and instruction following. It provides training sets to support SFT and RL, improving Qwen3-32B by 1.83\(\times\) via SFT and Qwen3-4B by over 8\(\times\) via RL.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

This paper proposes DeepResearch Bench, the first systematic benchmark for "Deep Research Agents" (DRA). It comprises 100 PhD-level research tasks across 22 disciplines crafted by experts and is supported by two automated, highly human-aligned evaluation frameworks: RACE for report quality and FACT for information retrieval and citation reliability.

DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

DeepTRACE translates real-world failure modes identified by the community into 8 computable metrics to perform end-to-end auditing of Generative Search Engines (GSE) and Deep Research (DR) agents. It reveals that these systems generally suffer from one-sided expression, overconfidence, and a high volume of statements that "cite sources without actually being supported by them," with citation accuracy ranging only between 40–80%.

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

This paper presents the first systematic study on benchmark data contamination during the RL post-training stage of LLMs. It proposes Self-Critique, which utilizes the similarity of token-level entropy trajectories between two generations to capture policy path dependency on contaminated samples. Furthermore, it constructs the RL-MIA benchmark, demonstrating that traditional likelihood-based detectors perform near random guessing at this stage, while the proposed method significantly and consistently improves AUC.

Detecting Data Contamination in LLMs via In-Context Learning

The paper proposes CoDeC (Contamination Detection via Context), which determines if an LLM was trained on a specific dataset by observing whether model confidence rises or falls when samples from the same dataset are provided as context. Confidence typically drops for "seen" datasets and rises for "unseen" ones. By requiring only gray-box access to token probabilities and two forward passes, it achieves near-perfect separation (99.9% AUC) at the dataset level.
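The decision signal described above can be sketched as follows. The code assumes two lists of per-sample log-probabilities have already been obtained from the model, one without and one with in-context examples from the same dataset, and summarizes them as the fraction of samples whose confidence drops. The function name and this particular summary statistic are assumptions, not necessarily the paper's exact score.

```python
def confidence_drop_rate(logp_plain, logp_with_context):
    """Fraction of samples whose log-probability falls when same-dataset
    examples are added as context (higher suggests the dataset was seen)."""
    drops = sum(
        1 for plain, ctx in zip(logp_plain, logp_with_context) if ctx < plain
    )
    return drops / len(logp_plain)


# Toy log-probabilities for four samples (made-up values).
plain = [-1.0, -2.0, -0.5, -3.0]
with_context = [-1.5, -2.5, -0.4, -3.5]
print(confidence_drop_rate(plain, with_context))
```

The two input lists correspond to the two forward passes mentioned in the summary, and only token probabilities are needed, which matches the gray-box access requirement.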

DISCO: Diversifying Sample Condensation for Efficient Model Evaluation

DISCO proposes a minimalist criterion—"selecting samples where models disagree most"—to condense evaluation sets. Coupled with a "model signature + simple regression" approach to directly predict full-set performance, it reduces evaluation costs by 99% using only 100 samples on MMLU/HellaSwag/Winogrande/ARC with an error of approximately 1 percentage point, setting a new SOTA for efficient evaluation.
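The selection criterion can be illustrated with a short sketch. It assumes a binary correctness matrix (one row per model, one column per sample) and scores disagreement as p(1 - p), where p is the fraction of models that answer a sample correctly. That score is a stand-in chosen for simplicity; the paper's actual disagreement measure and its regression step are not reproduced here.

```python
def disagreement_scores(correct):
    """Per-sample disagreement p * (1 - p), where p is the share of models
    answering correctly. correct[m][j] is 1 if model m solves sample j."""
    n_models = len(correct)
    n_samples = len(correct[0])
    scores = []
    for j in range(n_samples):
        p = sum(row[j] for row in correct) / n_models
        scores.append(p * (1 - p))
    return scores


def select_most_contested(correct, k):
    """Indices of the k samples on which models disagree most."""
    scores = disagreement_scores(correct)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Four models, four samples: sample 0 is solved by all, sample 3 by none.
correct = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
print(select_most_contested(correct, 2))
```

Samples that every model solves or every model fails carry no information for telling models apart, so they score zero and are dropped first.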

Do LLM Agents Know How to Ground, Recover, and Assess? Evaluating Epistemic Competence in Information-Seeking Agents

The authors propose SeekBench—the first process-level evaluation framework for LLM search agents. It decomposes "ability to use evidence" into three epistemic competencies: groundedness, recovery, and calibration, with quantifiable metrics (RQI / ERF / CE). Utilizing 190 expert-annotated trajectories to calibrate a highly consistent annotation schema, the framework scales to 28,493 trajectories via LLM-as-judge, revealing behavioral defects invisible to final answer accuracy.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

This paper reformulates "LLM evaluation" as a statistical inference problem. It replaces Pass@k and avg@N with a Bayesian posterior estimate under a Dirichlet prior (Bayes@N). By utilizing closed-form posterior means and credible intervals, it provides stable rankings with fewer samples and introduces transparent decision rules where winners are declared only when intervals do not overlap.
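The posterior computation can be sketched with the standard Dirichlet-multinomial formulas. The code below assumes a uniform Dirichlet prior and per-category rubric weights, computes the closed-form posterior mean and standard deviation of the weighted score, and then forms a normal-approximation interval. The uniform prior, the z = 1.96 approximation, and all names are assumptions made for this sketch rather than details taken from the paper.

```python
import math


def posterior_summary(counts, weights, prior=None):
    """Posterior mean and std of sum_c weights[c] * theta[c] under a
    Dirichlet posterior with parameters counts + prior."""
    if prior is None:
        prior = [1.0] * len(counts)  # uniform prior (assumption)
    alpha = [c + a for c, a in zip(counts, prior)]
    total = sum(alpha)
    means = [a / total for a in alpha]
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * w * m for w, m in zip(weights, means))
    variance = (second - mean ** 2) / (total + 1)
    return mean, math.sqrt(variance)


def interval(mean, sd, z=1.96):
    """Normal-approximation credible interval (a simplification)."""
    return mean - z * sd, mean + z * sd


def clearly_better(interval_a, interval_b):
    """Declare A the winner only if its interval lies wholly above B's."""
    return interval_a[0] > interval_b[1]


# Binary case: 3 failures and 7 successes out of N = 10 samples.
mean, sd = posterior_summary(counts=[3, 7], weights=[0.0, 1.0])
print(mean, interval(mean, sd))
```

With two categories this reduces to a Beta posterior, and the non-overlap rule means that two models with similar scores and few samples are reported as undecided rather than ranked.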

Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

This paper points out that multinomial sampling in short-answer QA frequently repeats high-probability answers, leading to high variance in consistency-based uncertainty estimation. By replacing sampling candidates with probability-weighted beam search candidates, the authors demonstrate stable improvements in PRR, ROC-AUC, and PR-AUC across six QA datasets and six models.

Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas

This paper proposes a doubly-robust estimation framework that combines imperfect LLM persona ratings with biased human ratings to produce statistically valid quality estimates of GenAI systems even when covariate shift and selection bias coexist.

DRBench: A Realistic Benchmark for Enterprise Deep Research

DRBench constructs the first deep research benchmark oriented toward enterprise scenarios. It requires Agents to simultaneously mine and synthesize key insights from public web pages and private enterprise data (emails, chats, PPTs, tables, PDFs). Evaluated across four dimensions—Insight Recall, Factuality, Distractor Avoidance, and Report Quality—it reveals significant deficiencies in current Agents regarding enterprise insight recall (even the strongest GPT-5 achieves only ~37%).

EARTHSE: A Benchmark for Earth Science Knowledge Exploration

EARTHSE constructs a three-layer progressive benchmark (Breadth QA → Professional QA → Open Conversation) from 100,000 Earth science papers. It covers 5 major Earth spheres, 114 subfields, and 11 task categories. The benchmark systematically evaluates LLMs across basic knowledge and scientific exploration dimensions, revealing significant shortcomings of existing LLMs in domain depth and open-ended scientific thinking.

EIP: Weighted Ranking of LLMs by Quantifying Question Difficulty

This paper proposes Empirical Interaction Propagation (EIP), which models the binary interactions of "model correctly/incorrectly answering questions" as a bi-directional graph propagation system. By jointly estimating question difficulty and model ability, EIP produces a finer-grained LLM ranking than pure accuracy metrics, and its difficulty estimates align closely (90%) with human judgments of difficulty.

Evaluating Language Models' Evaluations of Games

This paper proposes a novel evaluation paradigm—shifting from assessing whether AI "can play a game" to whether it "can judge if a game is worth playing." Using 121 novel board games and over 450 human judgments, the authors systematically compare how well language/reasoning models align with humans, game-theoretic optimal solutions, and symbolic game agents across two types of queries: "payoff estimation (fairness)" and "funness evaluation."

ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

The authors propose EXPERTLONGBENCH (11 expert-level long-form generation tasks across 9 domains) and the CLEAR evaluation framework. By using expert-designed rubrics to decompose both model outputs and reference answers into checkable checklists, the study finds that even the strongest model, Gemini-2.5-Pro, achieves an average F1 of only 33.4, indicating a massive performance gap in expert-level long-form generation for current LLMs.

Fewer Battles, More Gain: An Information-Efficient Framework for Arena-based LLM Evaluation

The selection of "which two models should battle" in an Arena is modeled as an optimal experimental design problem. By utilizing the A-optimal/D-optimal criteria of the Fisher Information Matrix to actively select battles with the highest information gain, the framework achieves more reliable rankings with the same amount of human labeling, effectively realizing "fewer battles, more gain."

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

FinSearchComp is the first fully open-source, end-to-end open-domain financial search and reasoning agent benchmark. It comprises 635 analyst tasks across Global and Greater China markets annotated by 70 financial experts. Evaluations of 21 models reveal that the strongest, Grok 4 (web), still lags behind human experts by 6.1 percentage points.

FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

This paper introduces FormalML, the first Lean 4 benchmark dedicated to "subgoal completion." By employing a self-developed to_theorem translation strategy, the authors automatically extract 4,937 proof fragment problems from formalization libraries of machine learning theory (Optimization + Probability). This benchmark systematically exposes the real-world shortcomings of current LLM provers in handling complex contexts, premise utilization, and efficiency.

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

This paper takes a counter-intuitive approach—rather than chasing new methods like RL, it pushes "data scaling" to the extreme. By meticulously curating 2.5 million training samples across 5 evaluation tasks and multiple reasoning domains, the authors use simple and stable iterative Rejection Sampling SFT to train the FARE series evaluators (8B and 20B). FARE-8B challenges larger RL-specific evaluators, while FARE-20B surpasses 70B+ open-source evaluators. They demonstrate significant efficacy in real-world downstream scenarios such as re-ranking, RL verification, and domain-specific continued training.

FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

The authors propose a hierarchical "Aspect Tree" covering 112 evaluation aspects and construct FRABench, a fine-grained evaluation dataset with 60.4k pairs and 325k labels spanning four task categories: text generation, image understanding, image generation, and interleaved image-text generation. They further train UFEval, the first unified judge model with dual "task + aspect" generalization capabilities. The core thesis is that evaluation aspects are naturally interconnected, and joint multi-task learning yields synergistic gains.

From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

The AUTOEXPERIMENT benchmark is proposed: an Agent is provided with a paper, a codebase with several core functions "progressively masked," and execution commands. The Agent must complete the missing code, run experiments, and report results. By adjusting the number of masked functions \(n\), the benchmark continuously interpolates between "reproduction" and "replication," quantifying the true capability boundaries of research Agents.
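
To make the masking step concrete, here is a minimal sketch of how selected functions in a Python codebase might be blanked out before being handed to an agent. The helper `mask_functions`, the sample `SOURCE`, and the choice to replace bodies with `raise NotImplementedError` are all assumptions for illustration; the benchmark's actual masking procedure may differ.

```python
import ast

def mask_functions(source, names):
    """Replace the bodies of the named functions with a NotImplementedError stub."""
    tree = ast.parse(source)
    stub = ast.parse("raise NotImplementedError('masked')").body[0]
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in names:
            node.body = [stub]  # signature is kept, implementation is removed
    return ast.unparse(tree)    # requires Python 3.9+

# Hypothetical two-function "codebase".
SOURCE = """
def scale(x):
    return x * 3

def pipeline(x):
    return scale(x) + 1
"""

masked = mask_functions(SOURCE, {"scale"})
```

Increasing the size of `names` corresponds to raising \(n\): with few masked functions the agent mostly reruns existing code, while with many it must reimplement most of the method.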

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

GDPval is a benchmark proposed by OpenAI for economically valuable digital knowledge work. It covers 44 occupations across the 9 industries that contribute most to US GDP, with 1,320 real tasks constructed by professionals averaging 14 years of experience. Using the blind win rate of models against human experts as the core metric, the study finds that the delivery quality of frontier models has been approaching that of industry experts roughly linearly year over year.

GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time

This paper proposes GuidedSampling, an inference algorithm that explicitly decouples the implicit exploration and generation processes of Repeated Sampling (RS) into two stages: first iteratively generating diverse problem-solving concepts/theorems, and then generating candidate solutions based on each concept. This achieves an average improvement of approximately 21.6% on pass@50 and 9.7% on pass@5 after fine-tuning.

HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

HackWorld establishes the first framework to systematically evaluate Computer-Use Agents (CUAs) on their ability to discover and exploit real-world Web vulnerabilities via graphical interfaces using a CTF methodology, revealing that current SOTA CUAs achieve success rates below 12%, with bottlenecks residing in reasoning, planning, and security tool orchestration rather than perception.

Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in LLMs

This paper introduces TDBench, which utilizes temporal databases and database technologies (Temporal Functional Dependencies, Temporal SQL, and temporal joins) as an automated engine for constructing TSQA datasets. It generates questions covering 13 types of temporal constraints without manual intervention and introduces a "time accuracy" metric, revealing that LLMs often hallucinate incorrect temporal references in their explanations (21.7% on average) even when providing the correct answer.

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

HAL provides a standardized, distributed, and automated infrastructure for AI agent evaluation. Through 21,730 rollouts across the dimensions of "model × scaffold × benchmark," it maps the accuracy-cost Pareto frontier. By using LLMs to analyze 25 billion log tokens, it reveals behaviors hidden by traditional metrics, such as performance degradation with increased reasoning effort, agents searching for answers on HuggingFace, and using incorrect credit cards for bookings.

How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective

This paper formalizes the evaluation of test case generation methods as finding a "diagnostic basis"—a subset with a rank equal to the matrix rank and maximized internal diversity—within a binary matrix of "Wrong Code × Test Cases." Based on this, it constructs TC-Bench, a compact benchmark resistant to score inflation, revealing that even the strongest methods achieve a HackRate of only approximately 60%.
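
The rank condition can be illustrated with a small greedy selection. This sketch assumes rows are wrong programs, columns are test cases, an entry of 1 means the test catches the program, and rank is taken over GF(2). Both the field and the row/column orientation are assumptions, and the diversity-maximization step described above is not modeled.

```python
def diagnostic_rows(matrix):
    """Greedily pick row indices that form a basis of the row space over GF(2)."""
    basis = {}   # leading-bit position -> reduced row (as an int bitmask)
    chosen = []
    for idx, row in enumerate(matrix):
        v = int("".join(map(str, row)), 2)   # pack the 0/1 row into an integer
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:            # row adds a new independent direction
                basis[lead] = v
                chosen.append(idx)
                break
            v ^= basis[lead]                 # eliminate the leading bit and retry
    return chosen

# Hypothetical 4 wrong programs x 3 test cases.
matrix = [
    [1, 0, 1],
    [1, 0, 1],   # duplicate failure pattern of row 0
    [0, 1, 0],
    [1, 1, 1],   # XOR of rows 0 and 2, so it adds no new information
]
basis_rows = diagnostic_rows(matrix)
```

The selected subset has as many rows as the matrix rank, so redundant wrong programs no longer inflate the score.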

How Reliable is Language Model Micro-Benchmarking?

This paper proposes the Minimum Detectable Ability Difference (MDAD) meta-evaluation metric. It shows systematically that very small micro-benchmarks cannot reliably distinguish model pairs with small performance gaps, and that random sampling performs comparably to sophisticated micro-benchmark methods once the sample size reaches ~250.

Human-LLM Collaborative Feature Engineering for Tabular Learning

A human-LLM collaborative feature engineering framework is proposed, which decouples LLM feature operation proposals from the selection process. It models operation utility and uncertainty via Bayesian Neural Networks (BNN) to guide selection and selectively introduces human preference feedback. The approach achieves an average error rate reduction of 8.96% to 11.23% across 18 tabular datasets.

In-Context Learning for Pure Exploration

The paper proposes ICPE (In-Context Pure Exploration), an in-context learning framework combining supervised and reinforcement learning. It uses Transformers to directly learn exploration strategies from experience, achieving near-optimal instance-adaptive performance in active sequential hypothesis testing/pure exploration problems without explicit modeling of information structures.

In-Context Learning of Temporal Point Processes with Foundation Inference Models

This paper proposes FIM-PP, the first foundation inference model for Marked Temporal Point Processes (MTPP). By pre-training a Transformer on 72K synthetic point processes (14.4 million events) to perform in-context inference of conditional intensity functions, it achieves zero-shot performance comparable to specialized models trained for hours. After minutes of fine-tuning, it sets a new SOTA across four real-world datasets for multi-event prediction.

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

This paper proposes Inverse IFEval, an instruction-following benchmark that systematically reverses the "ideal labeling paradigm" of SFT. Using 8 categories of "counterintuitive instructions" and 1012 bilingual Chinese-English problems, it specifically measures whether LLMs can break free from the "cognitive inertia" implanted by alignment training to execute real-world instructions that conflict with training habits.

JQBench: A Benchmark for Reading and Writing JSON from Natural Language and/or Examples

This paper constructs JQBench, a benchmark for evaluating the ability of LLMs to translate natural language and/or I/O examples into jq expressions (querying, filtering, and transforming JSON). Generated via two automated pipelines—Stack Overflow (JQSTACK, 1496 tasks) and Spider (JQSPIDER, 859 tasks)—it reveals three counter-intuitive findings: the "Documentation Trap," "jq lagging behind Python," and "Example feedback is crucial."

LFQA-E: Carefully Benchmarking Long-form QA Evaluation

The authors constructed LFQA-E, a long-form QA evaluation benchmark featuring expert reference answers, covering 15 domains in both Chinese and English, with 1,618 questions and 7,323 comparison pairs. The study systematically demonstrates that none of the 17 existing automatic evaluation metrics can approximate human judgment and analyzes the root causes of their failure.

LiveClin: A Live Clinical Benchmark without Leakage

LiveClin introduces a "live benchmark" updated every six months using the latest peer-reviewed case reports. It upgrades single-question Q&A into multimodal sequential exams simulating complete clinical pathways to fundamentally resist data contamination and knowledge obsolescence—the strongest model achieved a Case Accuracy of only 35.7%, still trailing behind chief physicians.

LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

LiveResearchBench utilizes 100 expert-refined "dynamic real-time web retrieval" tasks equipped with checklists, alongside the DeepEval evaluation suite using six distinct dimensions and specific evaluation protocols. It places single/multi-agent deep research systems on a unified, anti-cheating, and highly human-aligned scale for the first time, revealing systematic weaknesses where current systems "know how to collect but fail to analyze deeply, with frequent citation errors."

LLM-as-a-Prophet: Understanding Predictive Intelligence with Prophet Arena

This paper proposes the "LLM-as-a-Prophet" evaluation paradigm and Prophet Arena, a live benchmark. By using continuously updated real-world future events from the Kalshi prediction market to assess the predictive intelligence of LLMs, the framework is naturally immune to data contamination. It systematically decomposes bottlenecks in event recall, information source understanding, and information aggregation near settlement using Brier scores, calibration errors, and market returns.
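
For reference, the Brier score mentioned above is the mean squared difference between a forecast probability and the binary outcome. The sketch below uses hypothetical forecasts and is not tied to the benchmark's data format.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical forecasts for two events that resolved yes and no.
perfect = brier_score([1.0, 0.0], [1, 0])
uninformed = brier_score([0.5, 0.5], [1, 0])
```

Lower is better: a perfect forecaster scores 0, and always answering 0.5 scores 0.25.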

LLMs Get Lost In Multi-Turn Conversation

Through large-scale experiments involving "instruction sharding + simulated dialogue" (200k+ dialogues, 15 LLMs), this paper demonstrates that all top-tier LLMs suffer an average performance drop of 39% in multi-turn underspecified conversations compared to single-turn full instructions. This degradation is primarily caused not by a decline in aptitude, but by a reliability collapse—once a model takes a wrong turn, it becomes "lost" and cannot recover.

LMGame-Bench: How Good are LLMs at Playing Games?

LMGame-Bench converts six classic games into a pluggable and modular evaluation benchmark using a unified Gym-style API. By enabling or disabling three types of harnesses (perception/memory/reasoning), it individually probes capabilities such as visual perception, long-range planning, and reflection. Combined with data contamination detection and prompt standardization, it allows for the clear differentiation of 13 frontier models under unsaturated conditions.

Log Probability Tracking of LLM APIs

This paper proposes the Logprob Tracking (LT) method, which utilizes log probabilities of single-token inputs and single-token outputs to detect minute changes in LLM APIs (e.g., single-step fine-tuning). It achieves sensitivity 2-3 orders of magnitude higher than existing methods at a 1000x lower cost.

LogiConBench: Benchmarking Logical Consistencies of LLMs

LogiConBench utilizes a pipeline of "automated logical graph generation → proposition sampling and truth value propagation along reasoning paths → translation to natural language" to construct an infinitely scalable, depth-controllable logical consistency evaluation set of 280K samples with explicit reasoning paths. By designing three categories of tasks — discrimination, enumeration, and generation — the study reveals a critical weakness in frontier LLMs, with the highest exact accuracy on enumeration tasks reaching only 34%.

Mapping Overlaps in Benchmarks through Perplexity in the Wild

This paper proposes benchmark signature—extracting a set of "salient tokens" from large-scale real-world corpora and using the perplexity of LLMs on these tokens to predict their performance on a specific benchmark. This characterizes the capabilities actually tested by each benchmark and quantifies the true overlap structure among 89 benchmarks that is otherwise obscured by semantic similarity and performance correlation.

Mapping Post-Training Forgetting in Language Models at Scale

The authors propose a sample-wise + chance-adjusted forgetting and backward transfer measurement framework. Large-scale empirical testing on nearly 30 base→post-trained model pairs across approximately 100 sub-benchmarks reveals that real-world post-training forgetting is significantly milder than predicted by continual learning literature, while backward transfer in mathematics and logic is prevalent.

Measuring LLM Novelty as the Frontier of Original and High-Quality Output

This paper proposes defining LLM "novelty" as the harmonic mean of originality (the proportion of n-grams not seen in training data) and quality (task-specific scoring). Using this unified metric, the authors systematically characterize the factors that drive the novelty frontier across three open-data model families and three creative tasks.
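
The metric can be sketched in a few lines. The tokenization, the n-gram order, and the in-memory `seen` set standing in for a training-data index are assumptions for illustration, and the quality score is taken as a given number in [0, 1].

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def originality(tokens, seen, n=3):
    """Fraction of the output's n-grams that do not appear in the training data."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in seen) / len(grams)

def novelty(orig, quality):
    """Harmonic mean of originality and quality (both assumed to lie in [0, 1])."""
    if orig + quality == 0:
        return 0.0
    return 2 * orig * quality / (orig + quality)

# Hypothetical example: one of two trigrams was seen during training.
seen = {("a", "b", "c")}
orig = originality(["a", "b", "c", "d"], seen)
score = novelty(orig, 0.5)
```

Because the harmonic mean is dominated by the smaller term, an output that is fully original but low quality (or the reverse) still receives a low novelty score.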

MLE-Smith: Scaling MLE Tasks with Automated Multi-agent Pipeline

MLE-Smith utilizes a three-stage "generation–verification–execution" multi-agent pipeline to automatically transform raw datasets into competition-style Machine Learning Engineering (MLE) tasks. It produces 606 high-quality, executable, and discriminative benchmark tasks without human intervention.

Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

The authors model LLM factuality as a "cumulative product of per-claim scores," apply group-conditional conformal calibration to provide distribution-free coverage guarantees, and employ a multi-LLM ensemble to refine factuality score estimation. This approach strictly controls error rates while retaining as much truthful information as possible.

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

This paper introduces AnthroBench, a scalable evaluation benchmark that utilizes an LLM to simulate users, automatically executes multi-turn dialogues, and employs multiple LLM judges to annotate 14 types of anthropomorphic behaviors. A human experiment (\(N=1101\)) demonstrates that these automated behavioral measurements effectively predict human perceptions of AI anthropomorphism. Furthermore, over half of the anthropomorphic behaviors first emerge only between turns 2 and 5.

NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

NAIPv2 reformulates "paper quality scoring" as pairwise ranking learning within the same field and year, augmented by an RTS signal that probabilistically fuses review scores with confidence. It learns relative superiority during training and degrades to a linear-time pointwise regressor during deployment, achieving SOTA results on ICLR review prediction (78.2% AUC / 0.432 Spearman) while being thousands of times faster than autoregressive LLM reviewers.

Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges

The paper utilizes a small set of human-annotated data to estimate the True Positive Rate and False Positive Rate (TPR/FPR) of an LLM-as-a-Judge. It constructs a "variance-corrected" critical threshold to process massive judge-generated labels, ensuring that the certification test maintains controlled Type-I error (avoiding misclassifying unsafe models as safe) even when the judge itself is imperfect.
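
The role of TPR and FPR can be seen in the standard point correction for an imperfect binary judge: if the judge flags a fraction q of outputs as positive, then q = TPR·π + FPR·(1 − π), so the true rate is π = (q − FPR) / (TPR − FPR). The sketch below shows only this textbook correction with hypothetical data. It does not reproduce the paper's variance-corrected threshold or its Type-I error guarantee.

```python
def estimate_rates(human, judge):
    """Estimate the judge's TPR and FPR from a small human-annotated set."""
    pos = [j for h, j in zip(human, judge) if h == 1]
    neg = [j for h, j in zip(human, judge) if h == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative human label")
    return sum(pos) / len(pos), sum(neg) / len(neg)

def corrected_rate(judge_rate, tpr, fpr):
    """Invert q = tpr * pi + fpr * (1 - pi) and clip the result to [0, 1]."""
    if tpr <= fpr:
        raise ValueError("judge must be better than chance (tpr > fpr)")
    return min(1.0, max(0.0, (judge_rate - fpr) / (tpr - fpr)))

# Hypothetical annotated subset: 4 true positives, 4 true negatives.
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 1, 0, 0, 0]
tpr, fpr = estimate_rates(human, judge)
pi_hat = corrected_rate(0.55, 0.9, 0.2)
```

In practice the estimated TPR and FPR carry their own sampling error, which is why a certification test needs the kind of variance correction the paper describes rather than this point estimate alone.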

PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities

PACEbench constructs 32 realistic cyber-attack scenarios using real-world CVEs, multi-host network topologies, and authentic WAF defenses. Accompanied by PACEagent, a three-stage penetration testing agent, and a weighted scoring metric with partial credit, the framework evaluates seven frontier LLMs. Results show significant performance degradation in complex multi-host scenarios and zero success in bypassing defenses, suggesting that current models do not yet pose a general cyber-attack threat.

ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

This paper utilizes information-theoretic analysis combined with analytically solvable synthetic list tasks to quantify and reveal the inevitable quality loss in Diffusion Language Models (dLLMs) caused by the "conditional independence assumption" during parallel decoding. Based on these insights, the authors construct ParallelBench—the first diagnostic benchmark (17 tasks in 3 categories) specifically designed to measure the speed-quality trade-off in parallel decoding. The study demonstrates that existing dLLMs suffer severe performance drops on tasks that are trivial for humans and Autoregressive (AR) models once parallelism increases, and current decoding strategies fail to adaptively adjust parallelism based on task difficulty.

PCB-Bench: Benchmarking LLMs for Printed Circuit Board Placement and Routing

PCB-Bench is the first comprehensive benchmark to systematically evaluate the capabilities of (multimodal) large language models in printed circuit board (PCB) placement and routing tasks. By utilizing three types of tasks—"pure text QA/CQ + image-text multimodal + real-world design understanding"—it covers approximately 3,700 text-based questions, 500 image-text questions, and 174 real-world engineering projects, revealing that current state-of-the-art models still have significant shortcomings in spatial layout reasoning, rule constraint following, and engineering drawing interpretation.

PerSpectra: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments

PerSpectra integrates the "clear structure" of Kialo debate graphs with the "linguistic diversity" of real Reddit discussions through a retrieval-rewriting pipeline. It constructs a configurable benchmark featuring 100 controversial topics, 762 pro/con stances, and 3,810 naturalized arguments. Three derived tasks—perspective counting, perspective matching, and polarity judgment—reveal systematic failures in current LLMs regarding multi-perspective understanding, such as overestimating perspective counts, confusing fine-grained viewpoints on the same side, and being biased by concessive clauses.

Pitfalls in Evaluating Language Model Forecasters

This is a position/analysis paper: the authors systematically categorize two major classes of pitfalls unique to the emerging field of "LLM-based future event forecasting"—various forms of temporal leakage in backtesting that render results untrustworthy, and the difficulty of extrapolating benchmark scores to real-world forecasting capabilities. Using numerous specific examples from existing literature, they argue that claims of LLMs reaching or surpassing human-level forecasting must be seriously questioned.

POEMetric: The Last Stanza of Humanity

This paper proposes POEMetric, the first framework for systematically evaluating poetry generation. Using 10 dimensions (basic instruction following + advanced creative abilities + global evaluation), a dataset of 203 human fixed-form poems, and 6,090 poems generated by 30 LLMs, the authors employ "Rule-based Algorithms + LLM Judges + Human Experts" for cross-verification. Results quantitatively demonstrate that while top-tier LLMs approach perfect scores in meter and theme, they remain far inferior to human poets in creativity, idiosyncrasy, emotional resonance, imagery, and rhetoric—the core elements that define poetry.

PrefDisco: Benchmarking Proactive Personalized Reasoning

This paper proposes PrefDisco, a suite of evaluation methods that transform any static reasoning benchmark into an "interactive personalized task." It requires models to proactively ask questions to discover hidden user preferences under cold-start conditions (no history) and to adjust their reasoning chains accordingly. The degree of alignment is measured with fine-grained rubric metrics (PrefAlign). Testing 21 frontier models across 10 tasks revealed that 29.0% of personalization attempts performed worse than generic responses.

Preference Leakage: A Contamination Problem in LLM-as-a-judge

This paper defines and systematically investigates Preference Leakage (PL) in LLM-as-a-Judge—a phenomenon where judge \(M_J\) systematically favors "related student models" when the synthetic data generator \(M_G\) is associated with \(M_J\) (same model, inheritance, or same family). In same-model scenarios, the PLS reaches 28.7% (Arena-Hard), and this bias is more insidious and harder to detect than egocentric bias.

PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

PRISM-Physics models the reference solutions of physics competition problems as "formula DAGs" (where nodes represent formulas and edges represent causal dependencies). Combined with a rule-based physical formula equivalence matcher and an "ancestor-closure scoring" method with proven theoretical optimality, it introduces the first benchmark for step-by-step scoring of physics reasoning. This approach aligns more closely with human expert ratings than LLM-as-judge or existing linear process scoring models.
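
One plausible reading of ancestor-closure scoring is that matching a formula node also credits every formula it depends on, since those must have been derived correctly to reach it. The sketch below implements that reading on a hypothetical four-formula DAG. The data layout, the equal node weights, and the function names are assumptions, and the paper's exact scoring rule may differ.

```python
def ancestor_closure(parents, matched):
    """Return the matched nodes together with all of their ancestors."""
    credited = set()
    stack = list(matched)
    while stack:
        node = stack.pop()
        if node not in credited:
            credited.add(node)
            stack.extend(parents[node])   # walk back along dependency edges
    return credited

def closure_score(parents, matched):
    """Fraction of reference formulas credited under ancestor closure."""
    return len(ancestor_closure(parents, matched)) / len(parents)

# Hypothetical reference solution: F3 depends on F2, which depends on F1.
parents = {"F1": [], "F2": ["F1"], "F3": ["F2"], "F4": ["F1"]}
score = closure_score(parents, {"F3"})
```

Here a student solution that only states F3 explicitly is still credited for F1 and F2, but not for the independent branch F4.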

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench utilizes 7,000+ "response-rubric" pairs authored by experts (Physics/Chemistry PhDs and Finance/Consulting MBAs) to establish a cross-domain rubric benchmark requiring professional knowledge for both answering and judging. Accompanied by a debiased, cost-effective LLM-Judge—which is 2-3 orders of magnitude cheaper—the study finds that even GPT-5-high achieves an overall score of only 65.9%.

Prompt and Parameter Co-Optimization for Large Language Models

The paper proposes MetaTuner, a framework that simultaneously generates prompts and LoRA parameters via a shared meta encoder. It unifies discrete prompt optimization and continuous parameter fine-tuning into an end-to-end optimizable joint framework, significantly surpassing methods that optimize them independently on mathematical reasoning and question-answering tasks.

RedacBench: Can AI Erase Your Secrets?

This paper proposes RedacBench—a comprehensive benchmark for evaluating Large Language Model (LLM) text redaction capabilities using "policy conditioning + proposition-level annotation." Utilizing 514 human-written texts, 187 safety policies, and 8,053 annotated propositions, the benchmark quantifies both the "Security" of erasing sensitive information and the "Utility" of preserving non-sensitive information. Systematic evaluation of 11 mainstream models across 3 redaction strategies reveals that stronger models achieve higher security but struggle more to maintain utility, highlighting a significant tradeoff between the two.

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

The authors propose RefineBench—a multi-round refinement evaluation benchmark covering 11 domains and 1,000 difficult problems scored via "checklists." By systematically distinguishing between "self-refinement (no feedback)" and "guided refinement (with feedback)," the study finds that even frontier models like Gemini-2.5-Pro and GPT-5 achieve extremely low scores (31.3%/29.1%) after five rounds of self-refinement. However, they approach near-perfect scores when explicitly told "what is wrong," suggesting that current models do not lack the capability to "refine" but rather the capability to "detect their own errors."

Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Addressing the gap where "LLM-generated natural language math proofs cannot be reliably scored," this paper constructs PROOFBENCH, the first fine-grained expert-annotated set (145 problems / 435 proofs / 0–7 scale). It then systematically searches the evaluator design space (backbone model, context, instructions, workflow) to derive PROOFGRADER (O3 + reference solutions and marking schemes + simple ensemble), which achieves a Mean Absolute Error as low as 0.926 compared to expert scores and approaches the human upper bound in best-of-n selection.

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

ResearchRubrics utilizes 2800+ hours of human effort to pair 101 real-world open-ended research prompts with 2593 expert-written, weighted, fine-grained rubrics. Using LLM-as-Judge to score agents based on these rubrics, the study evaluates mainstream Deep Research (DR) systems and finds that even the strongest agents, such as Gemini DR and OpenAI DR, fail to reach an average rubric adherence rate of 68%. Theoretical bottlenecks are concentrated in implicit requirement inference and multi-source information synthesis.

ResiliBench: Evaluating Agentic Workflow Adaptation in Stochastic Environments

ResiliBench treats two types of real-world deployment uncertainties—"probabilistic tool failure" and "flaws in user-provided workflow instructions"—as the primary evaluation targets. Using a tool library of 30 APIs, it automatically generates 5040 tasks, each paired with an MDP-derived optimal workflow and seven types of systematically perturbed flawed workflows, to quantify LLMs' error correction and replanning capabilities in stochastic environments.

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

This paper proposes the "Representation-as-a-Judge" paradigm: instead of requiring small language models (SLMs) to generate scoring text, they are frozen and a lightweight probe classifier reads evaluation scores directly from their hidden layer representations. This approach significantly outperforms prompt-based scoring by models of the same size on reasoning tasks like GSM8K/MATH/GPQA, approaches the performance of LLM judges, and effectively serves as a data filter to enhance downstream SFT.

Rethinking LLM Evaluation: Can We Evaluate LLMs with 200× Less Data?

To address the bottleneck of expensive LLM benchmarking across numerous models, this paper reframes benchmark compression as a subset optimization problem aimed at "preserving overall leaderboard ranking." The proposed EssenceBench utilizes a three-step pipeline: dual-redundancy filtering (text + rank), genetic algorithm search with a fixed proxy predictor, and attribution-guided refinement. On HellaSwag (10,000 samples), it controls model ranking error within 5% using only 50 samples, achieving 200× compression.

Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

This paper models the numerical confidence expression of LLMs as a "betting-style" reinforcement learning problem. By rewarding high confidence for correct answers and penalizing overconfidence for incorrect ones using strictly proper logarithmic scoring rules, the authors significantly improve model calibration and cross-task generalization without compromising response accuracy.
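
The incentive behind a logarithmic scoring rule can be checked numerically. In this sketch the reward is log(c) for a correct answer stated with confidence c and log(1 − c) for an incorrect one. The clipping constant `eps` and the function names are assumptions for illustration, and the paper's reward shaping may include further terms.

```python
import math

def log_score_reward(confidence, correct, eps=1e-6):
    """Logarithmic scoring rule: log(c) if correct, log(1 - c) otherwise."""
    c = min(1.0 - eps, max(eps, confidence))   # clip to keep the log finite
    return math.log(c) if correct else math.log(1.0 - c)

def expected_reward(true_p, stated):
    """Expected reward when the answer is right with probability true_p."""
    return (true_p * log_score_reward(stated, True)
            + (1.0 - true_p) * log_score_reward(stated, False))

# A model that is right 70% of the time, stating three different confidences.
honest = expected_reward(0.7, 0.7)
overconfident = expected_reward(0.7, 0.9)
hedging = expected_reward(0.7, 0.5)
```

Because the rule is strictly proper, the expected reward is highest when the stated confidence equals the true probability of being correct, so both overconfidence and excessive hedging are penalized.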

RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers

RouterArena is the first open evaluation platform for LLM routers. It utilizes the DDC library classification method to build a query dataset covering 9 major domains and 44 categories with approximately 8,400 difficulty-labeled entries. It features five-dimensional metrics—accuracy, cost, optimality, robustness, and latency—alongside an Arena Score that synthesizes accuracy and cost. Through an automated framework that refreshes leaderboards, it compares academic and commercial routers on a unified scale, revealing that no single router excels across all metrics and current methods generally struggle with "using small models when appropriate."

Same Content, Different Representations: A Controlled Study for Table QA

This is the first controlled-variable study of Table QA. By keeping table content identical while varying representation forms (structured vs. semi-structured), this work systematically evaluates the robustness of NL2SQL, LLM, and hybrid methods across different table sizes, schema qualities, and query complexities, identifying representation as a first-order factor affecting Table QA performance.

Sci2Pol: Evaluating and Fine-tuning LLMs' "Science-to-Policy Brief" Generation Capabilities

This paper introduces Sci2Pol-Bench, the first benchmark for the "generating policy briefs from scientific papers" task (decomposing the five-stage writing process into 18 tasks), and Sci2Pol-Corpus, a training corpus (filtering 639 high-quality "paper-brief" pairs from 5.6 million policy documents). The authors point out that BERTScore/ROUGE cannot measure the quality of briefs and instead use an LLM evaluation metric aligned with expert judgment. After fine-tuning on the corpus, Gemma-3-27B outperforms much larger models like GPT-4o and DeepSeek-V3 (671B).

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

SimBench unifies 20 cross-disciplinary datasets from ethics, economics, psychology, and politics into a "population response distribution prediction" task, constructing the first large-scale standardized LLM human behavior simulation benchmark. Systematic evaluation across 45 models reveals that even the strongest current models achieve only a moderate fidelity of 40.80/100. Simulation capability grows log-linearly with scale but does not improve with increased inference compute, and instruction tuning exhibits a distinct "alignment-simulation tradeoff."

SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

This paper formalizes "estimating LLM performance on an entire benchmark using a small number of samples" as a sparse optimization problem. It is the first to directly learn anchor weights using an MLP via gradient descent and iteratively replace anchors through AIS/CIS importance scores. Using only approximately 100 samples, it reduces estimation error to 1–2% while maintaining high ranking consistency (Kendall's τ).

SysMoBench: Evaluating AI on Formally Specifying Complex Real-World Systems

This paper proposes SysMoBench, the first benchmark to evaluate the capability of AI in automatically generating formal models (TLA+) for real-world complex systems (concurrent/distributed). Scoring with four automatically verifiable metrics (syntax, runtime, code consistency, and invariants), the study finds that LLMs can handle small systems like spinlocks but struggle significantly with large-scale protocol implementations like Etcd Raft.

Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

The TED (Talk, Evaluate, Diagnose) framework is proposed to achieve user-aware dynamic Agent evaluation through general and reusable expert/non-expert persona templates. It utilizes new indicators such as grading notes + LLM-as-judge + MaxProgressRate@k for fine-grained efficiency assessment, while providing actionable improvement feedback through automated error discovery and clustering. Evaluation results on τ²-bench and ToolSandbox reveal new insights into Agent performance.

Teach2Eval: An Interaction-Driven LLMs Evaluation Method via Teaching Effectiveness

Teach2Eval redefines "evaluating an LLM" as "tasking it to teach weaker student models." Instead of answering questions directly, the candidate model provides feedback, error correction, and multi-turn guidance without seeing the options or correct answers. The score is determined by the improvement in the students' accuracy. Tested across 33 models and 60 datasets, it achieves a Spearman correlation of 0.94–0.975 with Chatbot Arena and LiveBench, is naturally robust to data contamination, and decomposes into four orthogonal fine-grained capability dimensions.

Textual Bayes: Quantifying Prompt Uncertainty in LLM-based Systems

This paper treats prompts in LLM systems as "textual parameters \(\theta\)" within a statistical model and performs Bayesian inference using a small training set. It proposes a textual MCMC algorithm, MHLP (Metropolis-Hastings through LLM Proposals), to sample from the prompt posterior. This achieves principled quantification of predictions and uncertainty for black-box LLMs, outperforming several frequentist baselines in both accuracy and calibration (ECE/SECE).

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

This paper employs a randomized controlled experiment—involving expert execution and blind review—to verify whether research ideas generated by LLMs truly translate into superior research outcomes. It finds that while LLM ideas receive higher scores when evaluated as standalone "proposals," they suffer significantly larger drops in novelty, excitement, effectiveness, and overall quality after execution.

The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs

This paper constructs the Open Proof Corpus (OPC), containing 5,062 human-judged LLM mathematical proofs, and uses it to systematically study key questions about the differences between natural-language and formal proofs, the gap between final answers and complete proofs, best-of-n selection, and proof judge training.

THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

THEMIS constructs a multi-task benchmark for "scientific paper image fraud forensics" (4,054 questions, 5 fraud types, 16 fine-grained manipulations, 7 real academic scenarios). It maps fraud types to 5 expert-level visual reasoning abilities and evaluates 16 mainstream MLLMs. The study reveals systematic shortcomings in "forensics" capabilities, with even the strongest GPT-5 achieving an overall score of only 56.15% in complex real-world scenarios.

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

TokUR utilizes low-rank random perturbations of attention weights to construct a lightweight Bayesian model ensemble. It estimates total, aleatoric, and epistemic uncertainty for each generated token, then aggregates these signals into a response-level confidence score to identify faulty reasoning, filter high-quality answers, and assist in test-time scaling.
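
The split into total, aleatoric, and epistemic uncertainty can be illustrated with the standard entropy-based decomposition for an ensemble. This is a minimal sketch that assumes each ensemble member yields a next-token probability vector; the summary does not give the paper's exact estimator or perturbation scheme, and the function names here are hypothetical.

```python
from math import log


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Standard ensemble decomposition for a single token position.

    member_probs: list of probability vectors, one per ensemble member
                  (assumed here to come from perturbed copies of the model).
    Returns (total, aleatoric, epistemic).
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    # Average the members' predictive distributions.
    mean_probs = [sum(p[v] for p in member_probs) / k for v in range(vocab)]

    total = entropy(mean_probs)                            # entropy of the mean
    aleatoric = sum(entropy(p) for p in member_probs) / k  # mean of the entropies
    epistemic = total - aleatoric                          # disagreement between members
    return total, aleatoric, epistemic
```

When all members agree on a flat distribution, the uncertainty is entirely aleatoric; when each member is confident but they pick different tokens, it is entirely epistemic.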

Towards Personalized Deep Research: Benchmarks and Evaluations

The authors propose PDR-Bench, the first benchmark for "Personalized Deep Research," consisting of 250 personalized queries generated from 50 research tasks across 10 domains paired with 25 real user personas. Accompanying this is the PQR Evaluation Framework (Personalization alignment P / content Quality Q / factual Reliability R). Evaluations reveal that existing deep research systems "know how to write reports but fail to personalize," and while more user information improves personalization, implicit context is significantly less effective than explicit personas.

Towards Self-Evolving Agent Benchmarks: Validatable Agent Trajectory via Test-Time Exploration

The TRACE framework is proposed to allow agents to "freely explore and self-evolve" seed tasks from existing benchmarks into more difficult new tasks. Execution trajectories generated during evolution are treated as first-class citizens, recorded, and subjected to multi-level validation. This transforms static, manually annotated evaluation sets into dynamic evaluation systems capable of sustainable self-upgrading.

Train-before-Test Harmonizes Language Model Rankings

The paper proposes train-before-test—a standardized protocol where every model undergoes uniform fine-tuning on a benchmark's training set before being evaluated on its test set. Demonstrated across 24 benchmarks and 61 models, this "potential-based" ranking is highly consistent across benchmarks (average Kendall's \(\tau\) increased from 0.52 to 0.76). It restores the link between perplexity and downstream performance and reveals that the model-score matrix is nearly rank-one.
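
Ranking agreement between two benchmarks, as measured by Kendall's \(\tau\), can be sketched as follows. This is the plain tau-a variant in pure Python; the paper may use a tie-corrected variant, and the model names and scores in the usage note are made up for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks' model scores.

    scores_a, scores_b: dicts mapping model name -> score on each benchmark.
    Only models present in both dicts are compared.
    """
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both benchmarks order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0
```

For example, `kendall_tau({"m1": 0.9, "m2": 0.7, "m3": 0.5}, {"m1": 0.8, "m2": 0.4, "m3": 0.6})` has two concordant pairs and one discordant pair, giving \(\tau = 1/3\).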

TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

TrustJudge systematically reveals two long-overlooked "self-contradictions" within the LLM-as-a-judge framework: conflicts between scoring and pairwise comparisons, and transitivity cycles in pairwise comparisons. By attributing root causes to information loss in discrete scoring and ambiguous ties, the authors introduce "distribution-sensitive scoring + likelihood-aware aggregation" to significantly reduce inconsistency rates without training while maintaining or improving evaluation accuracy.
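
The information loss in discrete scoring can be illustrated with a toy comparison between taking the single most likely score and taking the expectation over the judge's score distribution. This is only a sketch of the general idea; it assumes the judge's per-score probabilities are available, and the function names and numbers are hypothetical rather than the paper's implementation.

```python
def discrete_score(score_probs):
    """Most likely score: what a judge reports when asked for one integer."""
    return max(score_probs, key=score_probs.get)


def expected_score(score_probs):
    """Expectation over the judge's score distribution (normalized first)."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total


# Hypothetical judge distributions over a 1-5 scale for two responses.
response_a = {3: 0.1, 4: 0.5, 5: 0.4}
response_b = {3: 0.4, 4: 0.5, 5: 0.1}
```

Both responses receive a discrete score of 4, so a score-based ranking calls them tied, while their expected scores (roughly 4.3 and 3.7) preserve the preference a pairwise comparison would likely show.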

Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

This paper proposes applying the Peer Prediction mechanism from game theory to LLM evaluation and training. By measuring the mutual predictability of participants' answers to distinguish honest from deceptive responses, the method incentivizes honesty without ground-truth labels. It demonstrates a surprising "inverse scaling" property: weaker experts are more resistant to deception from stronger models.

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

The HUMAINE framework evaluates human preference across 28 SOTA models using 23,404 demographically stratified participants. Using a multi-dimensional (5-dimensional), multi-turn conversation approach and a hierarchical Bayesian BTD model, the study reveals that age is the strongest driver of preference heterogeneity (average rank shift of \(\pm 2.8\)), showing that single aggregated leaderboards fail to reflect diverse population preferences.
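
Assuming "BTD" refers to the Bradley-Terry-Davidson model (Bradley-Terry extended with ties), the likelihood of a single pairwise comparison can be sketched as below. The hierarchical Bayesian layer and the demographic structure are not shown, and the parameter names are illustrative.

```python
from math import sqrt


def btd_probabilities(strength_i, strength_j, tie_param):
    """Davidson's extension of Bradley-Terry for one comparison.

    strength_i, strength_j: positive model strengths.
    tie_param:              non-negative; 0 recovers plain Bradley-Terry.
    Returns (P(i wins), P(j wins), P(tie)).
    """
    # Ties get mass proportional to the geometric mean of the two strengths.
    tie_mass = tie_param * sqrt(strength_i * strength_j)
    denom = strength_i + strength_j + tie_mass
    return strength_i / denom, strength_j / denom, tie_mass / denom
```

With equal strengths and `tie_param=1`, all three outcomes are equally likely; with `tie_param=0`, a model four times as strong wins 80% of the time.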

vCache: Verified Semantic Prompt Caching

This paper proposes vCache, the first semantic caching system with user-defined error-rate guarantees. Using online learning to estimate an optimal similarity threshold for each cached embedding independently, without pre-training, it achieves up to 12.5× higher cache hit rates and 26× lower error rates than static baselines while satisfying correctness constraints.

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

VideoJudge utilizes a bootstrapping loop where a "generator creates samples according to target scores and an evaluator validates score alignment" to synthesize 100,000 video evaluation samples with score supervision without human labeling. This enables training 3B/7B small video evaluator models that match or exceed 32B/72B general-purpose MLLM judges on most meta-evaluation benchmarks.

When LLMs Get Significantly Worse: A Statistical Approach to Detect Model Degradations

Addressing the question of whether a quantized/sparsified LLM has actually degraded or if the change is merely evaluation noise, this paper formalizes the problem as a statistical hypothesis test. It proposes the Exact One-sided McNemar Test, which, instead of examining task-level aggregate accuracy, compares the correctness of two models sample-wise. This allows for the detection of even a 0.3% accuracy drop as "true degradation" while maintaining a controlled false positive rate.
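
The sample-wise comparison can be sketched as a textbook exact one-sided McNemar test on per-sample correctness. The function and argument names are illustrative, and any refinements the paper adds on top of the basic test are not shown.

```python
from math import comb


def exact_one_sided_mcnemar(base_correct, new_correct):
    """p-value for the alternative 'the new model is worse than the baseline'.

    base_correct, new_correct: per-sample booleans for the same samples,
    in the same order (True = answered correctly).
    """
    if len(base_correct) != len(new_correct):
        raise ValueError("both models must be scored on the same samples")

    # Only discordant pairs (where the two models disagree) carry information.
    b = sum(1 for x, y in zip(base_correct, new_correct) if x and not y)  # baseline right, new wrong
    c = sum(1 for x, y in zip(base_correct, new_correct) if y and not x)  # baseline wrong, new right
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of degradation

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so b ~ Binomial(n, 0.5); the p-value is P(X >= b).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
```

Because samples where both models agree are ignored, a small accuracy drop can still be significant when nearly all disagreements go against the compressed model: eight discordant samples that all favor the baseline give a p-value of 1/256.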

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

This paper proposes SAFE (Stable And Fast LLM Ensembling), which selectively ensembles LLMs with heterogeneous tokenizers at the token level via a Generate-Verify-Ensemble loop. It resolves the OOV-like contamination caused by tokenizer mismatches in long-form generation. Ensembling on less than 1% of tokens significantly improves performance, raising UniTE from 59.6% to 77.4% on MATH500.