📊 LLM Evaluation

🧪 ICML2026 · 40 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (131) · 💬 ACL2026 (97) · 🤖 AAAI2026 (16) · 🧠 NeurIPS2025 (38) · 📹 ICCV2025 (27) · 🧪 ICML2025 (22)

🔥 Top topics: LLM ×19 · Reinforcement Learning ×3 · Reasoning ×3 · Agents ×2

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning: This paper proposes Agent World Model, a fully synthetic pipeline encompassing scenarios, tasks, databases, MCP tool interfaces, and verifiers. It generates 1,000 executable, database-driven environments used to train tool-calling agents, achieving superior out-of-distribution generalization on BFCLv3, \(\tau^2\)-bench, and MCP-Universe.
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning: AGZO discovers that the row space of linear layer gradients is constrained by the forward activation subspace. Based on this, it perturbs parameters only along activation-guided low-rank directions during zeroth-order fine-tuning, thereby improving gradient alignment and downstream task performance while maintaining memory usage levels nearly identical to MeZO.
Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models: This paper introduces AuthorityBench—a multi-domain benchmark with 220,000 prompts using a fully balanced 2×2 factorial design (independently manipulating "claim veracity × citation veracity") to isolate the influence of the "citation authority signal" itself on LLM cognitive behavior. It finds that adding a citation (regardless of its veracity) increases hallucination rates, with the "True Claim + Fabricated Citation" condition causing the most severe hallucinations across all tested models (raising hallucinations in general knowledge domains to 35–77%), and larger models are not necessarily more robust.
BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback: This paper proposes BESPOKE, a benchmark built from 2,870 sessions of real chat and search history collected from 30 annotators over 3 weeks. It constructs an evaluation framework with fine-grained preference ratings and diagnostic feedback to systematically assess the personalization capabilities of search-augmented LLMs. Current models score below 60 on average across all configurations, and the bottleneck for personalization lies in reasoning over history rather than in generation.
Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum: This paper systematically investigates the behavior of probability-based objective functions in SFT, discovering that the standard NLL is not universally optimal: on tasks where the model has a strong prior, prior-leaning objectives like \(-p\) significantly outperform NLL (with gains up to 16%). Conversely, NLL remains superior on tasks with weak priors, revealing an objective selection principle governed by the model-capability continuum.
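One way to see why \(-p\) behaves as a prior-leaning objective is to compare gradients with respect to the target logit. For a softmax output, \(\partial p / \partial z_t = p(1-p)\), so NLL has gradient \(-(1-p)\) while \(-p\) has gradient \(-p(1-p)\). The sketch below only evaluates these two expressions; the function names are illustrative and it is not the paper's training code.

```python
def grad_nll(p: float) -> float:
    """d(-log p)/dz_t for a softmax target probability p."""
    return -(1.0 - p)


def grad_neg_p(p: float) -> float:
    """d(-p)/dz_t for a softmax target probability p."""
    return -p * (1.0 - p)


if __name__ == "__main__":
    # Low p: NLL pushes hard, -p barely moves. High p: both are small.
    for p in (0.01, 0.5, 0.99):
        print(f"p={p:.2f}  nll_grad={grad_nll(p):+.4f}  neg_p_grad={grad_neg_p(p):+.4f}")
```

Under \(-p\), tokens the model currently assigns low probability contribute almost no gradient, so updates concentrate on tokens the model already finds plausible. This matches the summary's description of \(-p\) as leaning on the model's prior.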
Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning: This paper proposes GraphGPO, which aggregates all rollout trajectories into a unified state transition graph. By using global shortest-path information on the graph to compute distance-based advantages for each step, it achieves finer-grained credit assignment than trajectory-level attribution, significantly outperforming GRPO and GiGPO on ALFWorld, WebShop, and Sokoban.
BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction: BuildArena places LLMs into the physical sandbox game Besiege, requiring them to use natural language to build bridges, vehicles, and rockets brick by brick. By using a physics engine for simulation and scoring, it systematically evaluates for the first time the engineering construction capability of LLMs to "translate language into functional physical structures." Results indicate that only GPT-5 is marginally competent on hard tasks, while most other models almost entirely fail at the Hard level.
CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting: CapBencher injects randomness into each problem (generating multiple logically correct answers and randomly selecting one as the gold label) to cap the Bayes accuracy of a benchmark at a controllable level (e.g., 50%). This enables black-box statistical detection of data contamination in publicly released benchmarks—any model with an accuracy significantly exceeding the Bayes upper bound is flagged as contaminated.
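The detection step can be illustrated with a one-sided exact binomial test. This is a minimal sketch assuming the cap is 50% and that a simple binomial tail is an acceptable stand-in for the paper's statistical test, which the summary does not specify. The names `binomial_tail` and `flag_contamination` and the default `alpha` are hypothetical.

```python
from math import comb


def binomial_tail(n_correct: int, n_total: int, cap: float) -> float:
    """P(X >= n_correct) for X ~ Binomial(n_total, cap)."""
    return sum(
        comb(n_total, k) * cap**k * (1.0 - cap) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def flag_contamination(n_correct: int, n_total: int,
                       cap: float = 0.5, alpha: float = 0.01) -> bool:
    """Flag a model whose accuracy is implausibly far above the capped Bayes accuracy."""
    return binomial_tail(n_correct, n_total, cap) < alpha


if __name__ == "__main__":
    print(flag_contamination(90, 100))  # far above the 50% cap
    print(flag_contamination(52, 100))  # consistent with the cap
```

Because the gold label is chosen at random among equally valid answers, a model that has not seen the released labels cannot reliably exceed the cap, so a very small tail probability is evidence of contamination.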
Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering: The authors argue that mainstream LLM benchmark metrics rely on two frequently violated assumptions: a sufficient number of evaluations (permitting the Central Limit Theorem) and independence between prompts. They propose BHM-ESC, a Bayesian Hierarchical Model with "Embedding-Space Clustering": it groups semantically similar prompts into clusters sharing a success probability, and infers the number of clusters as an unknown variable. This provides more reliable performance estimates that correct for prompt dependence under small sample sizes, reducing Mean Absolute Error (MAE) by 4–73% and increasing Expected Log Posterior Density (ELPD) by 40–450 on adversarial robustness benchmarks.
Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees: This paper proposes the DSR (Decompose-Structure-Repair) neuro-symbolic framework, which decomposes the formalization of natural language theorems into three stages: "decomposing NL components → joint generation of FL components and Operator Trees (OPT) → hierarchical repair based on subtree localization." Using a 7B model, it sets new SOTAs on ProverBench / ProofNet / PRIME and releases PRIME, a graduate-level Lean 4 benchmark consisting of 156 problems.
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search: This paper proposes DEI, which treats multiple LLMs from different families as heterogeneous mutation operators distributed across different nodes. By using fully asynchronous gossip to broadcast champions of each round, it creates cross-model adversarial pressure. In Core War program synthesis tasks, it achieves a +124% QD-Score and +28% archive coverage compared to single-node baselines under equal total computational budget.
Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation: DoLQ inserts a "Scientist Agent" into the search loop of LLM-based symbolic regression. This agent performs simultaneous qualitative (physical plausibility) and quantitative (ablation-based MSE contribution) evaluations, pushing LLM-SR from "low-error but bloated and physically absurd" candidates toward equations that are both numerically accurate and structurally compact.
Estimating Tail Risks in Language Model Output Distributions: This paper builds an "unsafe proxy model" via activation steering and combines it with importance sampling to estimate rare events such as the probability that a safe model outputs harmful content (on the order of \(10^{-4}\)). The estimator reaches accurate estimates with 10–20\(\times\) fewer samples than brute-force sampling and enables worst-case deployment risk prediction.
From Human-Level AI Tales to AI Leveling Human Scales: This paper employs LLMs as population extrapolators to calibrate 18 capability dimensions on a logarithmic scale \(L=-\log_B p_W\) according to "world population accuracy." It reveals that the Volume and Attention dimensions have a true base \(B \gg 10\), while the Comprehension dimension has \(B \approx 1\), uncovering a severe misalignment in current comparisons between AI and humans.
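As a worked instance of the scale \(L=-\log_B p_W\): with \(B=10\), a task that one person in a thousand answers correctly (\(p_W = 0.001\)) sits at level 3. The helper below is a small sketch of that arithmetic; the function name and the example values are illustrative and do not come from the paper.

```python
import math


def capability_level(p_world: float, base: float) -> float:
    """Level L = -log_B(p_W) for a world-population accuracy p_W in (0, 1]."""
    if not 0.0 < p_world <= 1.0:
        raise ValueError("p_world must lie in (0, 1]")
    if base <= 0.0 or base == 1.0:
        raise ValueError("base must be positive and different from 1")
    return -math.log(p_world) / math.log(base)


if __name__ == "__main__":
    # The same population accuracy maps to very different levels as B changes.
    for base in (2.0, 10.0, 1000.0):
        print(base, round(capability_level(0.001, base), 3))
```

The level assigned to a fixed \(p_W\) shrinks as \(B\) grows, which is why a mismatch between the assumed and the true base distorts comparisons across dimensions.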
Top-W: Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for LLMs: Top-W formulates next-token truncation as a "Wasserstein-Entropy-Mass" minimization problem that incorporates token embedding geometry. It proves that the optimal solution is either a single token or a prefix of the tokens sorted by \(f(i)+\lambda\log p_i\). In practice this is implemented as an \(O(n\log n)\) scan. It outperforms baselines in the majority of 15 (temperature, model) combinations across GSM8K, GPQA, AlpacaEval, and MT-Bench; notably, it yields a gain of up to 33.7% over Top-H on GSM8K at high temperatures.
Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics: This work demonstrates that the generative perplexity (gen-PPL, i.e., the per-token negative log-likelihood of samples under a frozen AR scorer like gpt2-large)—the almost exclusively relied-upon metric for current diffusion/continuous flow language models—is unreliable. The authors use a set of zero-parameter, intentionally nonsensical samplers (structurally incoherent by construction) to achieve "SOTA gen-PPL" on LM1B/OpenWebText under non-degenerate entropy, surpassing recently published diffusion and flow models. Consequently, the authors advocate for re-evaluating models using a distributional distance metric suite that directly measures the discrepancy between "generated distributions vs. human text distributions."
HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents: HiPER transforms the flat RL used for LLM agents into a two-level Plan-Execute structure consisting of "high-level planning of subgoals + low-level execution of atomic actions." It introduces Hierarchical Advantage Estimation (HAE), which slices GAE along subgoal segments to perform coupled advantage estimation with bounded differences. On ALFWorld and WebShop, HiPER achieves success rates of 97.4% and 83.3% respectively (using Qwen2.5-7B), representing gains of +6.6% and +8.3% over the strongest baseline, GiGPO.
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem: InnoEval redefines "evaluating a research idea" as a knowledge-grounded + multi-perspective reasoning problem: it first employs a heterogeneous deep search engine to retrieve live knowledge from papers, webpages, and code, aligning it at a fine-grained level to each component of the idea. Then, an "Innovation Review Committee" composed of diverse academic personae scores the idea across five dimensions, aggregating them into a decision-bearing meta-review. It consistently outperforms existing baselines and achieves high alignment with human experts across critique, pairwise comparison, and grouping tasks.
Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction: This paper proposes "Black-Box Environment Interaction" as a new paradigm for evaluating integrated reasoning (deduction + induction + abduction). It constructs the ORACLE benchmark with 96 environments across 6 task categories and evaluates 19 LLMs. Even the strongest model, o3, achieves only ~70% accuracy in simple environments and drops to ~40% in difficult ones. Furthermore, all LLMs lack the high-level planning capability needed for "adaptive optimization of exploration strategies based on feedback."
Margin-Adaptive Confidence Ranking for Reliable LLM Judgement: LLM-as-a-judge pipelines often assume monotonicity, yet high confidence does not necessarily imply reliability. To address this, the paper proposes using a small MLP to map multiple in-context prediction probabilities to confidence scores. It derives a margin-adaptive training strategy from a margin-based ranking loss and PAC-Bayes generalization bounds. Across four datasets and six judge models, the learned confidence achieves lower ranking loss and higher AUROC, and significantly improves target consistency in fixed-sequence testing.
Multi\(^2\): Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments: This paper proposes the Multi\(^2\) framework, which explicitly decouples the "planning" and "execution" of LLM agents into System 1 (an SFT-trained sub-goal planner) and System 2 (an offline-to-online RL-trained atomic action executor). By utilizing role-specific LoRA adapters and training objectives with policy-anchoring/KL-regularization, it significantly mitigates objective drift and improves token efficiency across three long-horizon interactive environments: ScienceWorld, ALFWorld, and TextCraft.
NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama: The authors develop NarrativeWorldBench, a nine-metric benchmark for testing structural consistency in "long-form serialized script continuation." They find that 21 frontier LLMs collectively hit a ceiling in Plot-Beat F1 in the range \([0.78,0.81]\), with performance dropping by \(0.20\) when the horizon extends to 200 episodes. To address this, they propose N-VSSM, a world model that uses Mamba-2 to maintain a 256-dimensional explicit narrative latent state. N-VSSM achieves an F1 \(\geq 0.84\) with \(4\times\) lower compute and is preferred by professional screenwriters 71% of the time.
Nonparametric LLM Evaluation from Preference Data: Addressing the issues where current LLM leaderboards rely on parametric Bradley–Terry models and fail to provide valid confidence intervals under model misspecification or when using black-box ML/LLM-as-a-judge, this paper proposes a nonparametric framework DMLRank: it abstracts ranking scores as functionals of context-dependent preference probabilities (GARS), applies Debiased Machine Learning to derive asymptotically efficient estimators with valid confidence intervals, and further provides optimal preference collection strategies under budget constraints.
On Cost-Effective LLM-as-a-Judge Improvement Techniques: Addressing the issue that LLM-as-a-judge accuracy depends heavily on prompts and aggregation strategies but lacks systematic evidence on "which tricks are truly cost-effective," this paper adopts a unified perspective of "noise control for stochastic judges" on RewardBench 2. It systematically compares four drop-in techniques: ensemble scoring, task-specific scoring criteria, calibration context, and adaptive model upgrading. The study finds that combining "criteria injection (nearly zero cost) + ensemble scoring" achieves up to 85.8% accuracy (+13.5pp over baseline), dominating the cost-precision Pareto frontier and outperforming calibration and model upgrading.
On Effectiveness and Efficiency of Agentic Tool-calling and RL Training: The authors systematically examine LLM tool-calling through two dimensions: "evaluation effectiveness" and "training efficiency." Using BFCL as a case study, they demonstrate that "small details" such as random seeds, multi-turn templates, thought history, and system prompts can cause significant drift in leaderboard scores, making cross-paper comparisons unreliable. On the efficiency side, they identify waste in the rollout and policy update stages of RL (GRPO) training and propose a dual-solution: "online pre-rollout filtering + max-variance rollout subsampling." This achieves 1.7× and 2.6× end-to-end speedups in single-turn and multi-turn tool-calling, respectively, without performance degradation.
PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Stage Roleplay: PoliticsBench is a novel benchmark based on multi-stage roleplay. By evaluating LLM political value expressions through 20 political scenarios and 4-stage interactions, it reveals that 7 mainstream LLMs are left-leaning (19–39 points), while only Grok is right-leaning (-22.7) but exhibits the highest volatility. Scenario prompting stimulates value dimensions more effectively than direct questioning (feature activation +0.48, commitment +1.39).
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities: Using ~7,000 model checkpoints spanning 2022–2026 (including 5k historical and 2k self-evaluated), this paper models "attainable downstream accuracy given a pre-training compute budget" as a monotonic saturating sigmoid capability frontier via high quantile regression. The study validates the temporal stability of this frontier and demonstrates its efficient reconstruction using only ~20% of the evaluation budget.
REAL: Integrating Regression-Aware Rewards into RL, Teaching LLM-as-a-Judge that "Even a One-Point Difference Matters": Addressing the inherent flaw of binary 0/1 rewards in RL for LLM-as-a-Judge which ignores ordinal structures, the authors integrate RAFT's "expected value prediction + squared error" into the RL objective. Since the reward explicitly depends on policy parameters, a Generalized Policy Gradient is employed—decomposing cleanly into a "CoT Exploration term" and a "Prediction Refinement term." Across 8B–32B base models, it consistently outperforms SFT and standard RL, with Qwen3-32B showing an 8.4/7.2 point gain in Pearson/Spearman correlation over SFT.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge: RACER models the decision of whether to invoke reasoning mode for each judge query as a distributionally robust constrained optimization problem with a KL uncertainty set. It uses a primal-dual algorithm to derive an optimal routing strategy that satisfies cost budgets under OOD conditions and provides the first theoretical guarantee of linear convergence for LLM router policies.
Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges: This work redefines "safety judging" as a "rubric-following" problem. By utilizing "instance-conditioned dynamic rubrics" and a "reliable-to-expressive" curriculum, the authors train a 12B judge. The model maintains 94%+ accuracy across three vastly different rubric styles with a cross-rubric fluctuation of only 0.76, significantly outperforming larger 20B/30B judges in stability.
Resolution Diagnostics for Paired LLM Evaluation: This paper treats the "Model A is 0.X pp higher than B" rankings on LLM leaderboards as a paired hypothesis testing problem. By inverting the level-\(\alpha\) / power-\((1-\beta)\) test, it defines the "Resolution Ratio" \(q=N/N^\star\). The authors prove that the common shortcut of multiplying the single-arm Cohen-\(h\) formula by \((1-\rho)\) systematically underestimates the required sample size by half under small effects. Empirical results show that 11/40 pairs on the Open LLM Leaderboard v1 and 4/9 adjacent pairs in the MMLU-Pro top-10 are fundamentally "unresolvable" at \((\alpha, 1-\beta) = (0.05, 0.8)\). This number increases to 6/9 after accounting for multiple comparisons, real-world subject clustering, and anytime-validity.
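The idea of a resolution ratio \(q=N/N^\star\) can be sketched with the textbook normal-approximation sample size for a two-sided paired test. This is an assumption standing in for the paper's exact derivation: here \(\rho\) is taken to be the correlation between the two models' per-item correctness, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def paired_required_n(p_a: float, p_b: float, rho: float,
                      alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate N* needed to detect p_a - p_b with a two-sided paired z-test."""
    delta = p_a - p_b
    if delta == 0.0:
        raise ValueError("no sample size resolves a zero difference")
    sd_a = sqrt(p_a * (1.0 - p_a))
    sd_b = sqrt(p_b * (1.0 - p_b))
    # Variance of the per-item difference in correctness between the two models.
    var_d = sd_a**2 + sd_b**2 - 2.0 * rho * sd_a * sd_b
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return z**2 * var_d / delta**2


def resolution_ratio(n_items: int, p_a: float, p_b: float, rho: float,
                     alpha: float = 0.05, power: float = 0.8) -> float:
    """q = N / N*; q < 1 means the benchmark cannot resolve the pair."""
    return n_items / paired_required_n(p_a, p_b, rho, alpha, power)


if __name__ == "__main__":
    # A 1 pp gap on a 1,000-item benchmark, with strongly correlated errors.
    print(round(resolution_ratio(1000, 0.71, 0.70, rho=0.8), 3))
```

When \(p_A \approx p_B = p\), the difference variance is roughly \(2p(1-p)(1-\rho)\). The factor of 2 is the kind of term a single-arm shortcut scaled by \((1-\rho)\) would miss.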
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior: This paper systematically deconstructs "when exactly psychometric self-reports (SR) of LLMs predict their actual behavior." Using a \(2\times2\times2\) factorial experiment (Theory of Planned Behavior TPB vs Big5 × In-session vs Cross-session × Parameter Grid vs Persona Induction) across 4 behavioral tasks and 11 frontier models, it finds that SR–behavior consistency exists but is selective. While fine-grained TPB achieves human-level consistency within the same session, Big5 yields almost no signal. In cross-session settings, consistency survives only for tasks where behavior is "anchored outside the prompt" (e.g., training-locked implicit bias), while tasks strongly primed by context (e.g., sycophancy) collapse entirely.
RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing: RouteJudge points out that current LLM router evaluations are confined to the "offline, ground-truth-based, and auto-scoring" paradigm, which ignores the diverse preferences of real users. Consequently, it proposes an online pairwise preference evaluation platform: for the same query, multiple routers select one model each from the same model pool and budget for an anonymous pairwise duel. User preferences are then attributed back to the router level. This is accompanied by a reproducible modular toolbox, ORBIT, serving as the entry point for routing method development and submission.
Spherical Steering: Geometry-Aware Activation Rotation for Language Models: This paper proposes Spherical Steering: rotating activation vectors along geodesics on the unit hypersphere of LLM hidden states toward a "truthfulness direction" estimated from contrastive samples. Unlike traditional additive activation steering, this approach maintains activation magnitudes (norms) while significantly improving multiple-choice accuracy on benchmarks such as TruthfulQA, COPA, and StoryCloze (+10% range) without degrading open-ended generation quality.
The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust: This paper identifies two critical flaws in "Expected Calibration Error (ECE)" as a trust metric: its inability to distinguish between an oracle and an uninformative "base-rate" estimator, and its insensitivity to task risk. To address this, the authors propose a new metric, euro (Oracle-normalized Expected Utility), which links calibration with decision utility. They further introduce the ACUTE protocol, which uses layer-wise activations during generation as features for a Random Forest classifier to estimate confidence. Across 6 models and 3 task types, ACUTE maintains low calibration error while significantly outperforming strong baselines on euro.
Toward Training Superintelligent Software Agents through Self-Play SWE-RL: This paper proposes Self-play SWE-RL (SSR), where a single LLM acts as both a "bug-creating proposer" and a "bug-fixing solver" within sandboxed code repositories. Using only Docker images as input and employing consistency checks and solve-rates as rewards for joint RL, SSR achieves self-improvements of +10.4 and +7.8 points on SWE-bench Verified and SWE-Bench Pro, respectively, consistently outperforming "human-data" baselines that rely on human-annotated issues and test suites.
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation: This paper defines AI benchmark saturation as the loss of reliable discriminative power between frontier models. It proposes an uncertainty-aware saturation index based on leaderboard metrics and analyzes 60 text LLM benchmarks. The study finds that nearly half are highly saturated, and that benchmark age and test set size are more significant predictors of saturation than private test sets, open-ended outputs, or template diversity.
Who can we trust? LLM-as-a-jury for Comparative Assessment: This paper points out that the reliability of multiple LLM judges in pairwise comparisons varies significantly. It proposes the BT-\(\sigma\) model with judge-specific discrimination parameters, which simultaneously learns the ranking of candidate outputs and the reliability of each LLM judge without human calibration labels, thereby aligning more closely with human rankings than simple averaging or standard Bradley-Terry aggregation.
Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs: The paper proposes a two-stage evaluation protocol centered on "providing counterarguments without social pressure." It quantifies the Answer Flip Rate: the probability that an LLM that initially answered correctly "changes its mind" when challenged by an argument supporting a wrong option. Flip rates across seven frontier models range widely, from 17.5% to 97.3%, and attributing arguments to the model's "own previous writing" further increases flipping. Finally, an optimal cross-model selection is used to construct a "most toxic" challenge set, MaxFlip.
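The Answer Flip Rate described above can be computed from per-question records as follows. This is a minimal sketch: the `Trial` record and its field names are hypothetical, and the rate is taken over questions the model initially answered correctly, as the summary states.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trial:
    gold: str            # correct option
    initial_answer: str  # answer before the counterargument
    final_answer: str    # answer after the counterargument


def answer_flip_rate(trials: Iterable[Trial]) -> float:
    """Share of initially correct answers that change after the challenge."""
    eligible = [t for t in trials if t.initial_answer == t.gold]
    if not eligible:
        raise ValueError("no initially correct answers to evaluate")
    flipped = sum(1 for t in eligible if t.final_answer != t.initial_answer)
    return flipped / len(eligible)


if __name__ == "__main__":
    trials = [
        Trial("A", "A", "B"),  # correct, then flipped
        Trial("C", "C", "C"),  # correct, held
        Trial("B", "B", "B"),  # correct, held
        Trial("D", "A", "D"),  # initially wrong, excluded
    ]
    print(answer_flip_rate(trials))
```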
Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts: This paper proposes CALM to evaluate whether LLMs align with the actual decision-making processes of organizations rather than just output labels. By comparing ECHR legal adjudication with German Credit lending decisions, it demonstrates that process alignment predicts accuracy in stable normative domains, whereas in value-controversial domains, high process alignment is both difficult to achieve and not necessarily desirable.