📊 LLM Evaluation

🔬 ICLR2026 · 60 paper notes

Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms: This paper proposes the first unified benchmark for PU learning and systematically addresses two critical issues: (1) enabling model selection without negative samples via proxy accuracy and proxy AUC; (2) identifying and resolving intra-dataset label shift in the one-sample setting through a simple calibration strategy that merges positive samples into the unlabeled set, enabling fair comparison of two-sample algorithms under one-sample evaluation.
AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning: This paper introduces AnesSuite, the first comprehensive dataset suite for anesthesiology reasoning, comprising AnesBench—an evaluation benchmark of 7,972 bilingual multiple-choice questions organized into three cognitive difficulty levels—and three training datasets (AnesCorpus/AnesQA/AnesR1). The Morpheus models trained on this suite via SFT+GRPO enable a 7B model to match a 14B baseline, while revealing significant bottlenecks of state-of-the-art LLMs on complex clinical reasoning (System 2).
ASIDE: Architectural Separation of Instructions and Data in Language Models: This paper proposes ASIDE, an architectural modification that distinguishes instructions from data at the token embedding level via orthogonal rotation. Requiring only changes to the forward pass and training on standard instruction fine-tuning data, ASIDE significantly improves instruction-data separation and robustness against prompt injection without any dedicated safety training.
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite: The AI2 team identifies five methodological flaws in existing scientific research agent benchmarks and introduces AstaBench, the first agent evaluation suite covering the full scientific research pipeline. AstaBench comprises 4 categories and 11 sub-benchmarks with 2,400+ questions, a production-grade controllable search tool backed by Semantic Scholar, and 9 research-optimized Asta Agent baselines. It conducts the largest systematic evaluation to date across 57 agents (22 types), finding that despite progress on individual tasks such as literature retrieval, AI remains far from meeting the demands of end-to-end scientific research assistance.
Benchmarking Overton Pluralism in LLMs: This paper proposes the OvertonBench framework, which formalizes Overton pluralism as a set-coverage metric, OvertonScore, using a large-scale human study (1,208 demographically representative U.S. participants, 60 subjective questions, 8 LLMs). All evaluated models score only 0.35–0.41 (theoretical maximum: 1.0). The authors also construct an automated evaluation tool that correlates highly with human judgments (ρ=0.88).
BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation: This paper proposes BiasScope, a fully LLM-driven iterative framework that automatically discovers previously unknown biases in LLM-as-a-Judge evaluation at scale. Based on the discovered biases, the authors construct JudgeBench-Pro, a more challenging benchmark on which even powerful LLM judges exceed a 50% error rate.
Biologically Plausible Online Hebbian Meta-Learning: Two-Timescale Local Rules for Spiking Neural Brain Interfaces: This paper proposes an online SNN decoder that eliminates BPTT by combining three-factor Hebbian local learning rules with dual-timescale eligibility traces and adaptive learning rate control. The approach achieves neural decoding accuracy comparable to offline-trained methods (Pearson R ≥ 0.63/0.81) under O(1) memory complexity, and demonstrates continuous adaptation to non-stationary neural signals in closed-loop simulations.
Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors: This paper provides the first theoretical analysis of the "PCC plateau" phenomenon observed when training attention-based regression models with a joint MSE+PCC objective. The root causes are identified as the conflict between MSE optimization and PCC gradients, together with an expressivity upper bound imposed by the convex aggregation of softmax. The authors propose the ECA (Extrapolative Correlation Attention) framework, which breaks through this limitation via three components: scaled residual aggregation, dispersion-aware temperature softmax, and dispersion-normalized PCC loss.
Can Vision–Language Models Assess Graphic Design Aesthetics? A Benchmark, Evaluation, and Dataset Perspective: This paper proposes AesEval-Bench, the first benchmark for systematically evaluating VLMs on graphic design aesthetics (4 dimensions × 12 indicators × 3 tasks). It finds that existing VLMs—including reasoning-augmented models—perform poorly on design aesthetics, and constructs training data via human-guided VLM labeling combined with indicator-grounded reasoning. Fine-tuning a 7B model with this data surpasses GPT-5 on the precise localization task.
Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation and Beyond: This paper proposes the ECHO benchmark, comprising 3 synthetic tasks and 2 real-world chemistry tasks grounded in density functional theory (DFT), requiring graph neural networks to propagate information effectively over 17–40 hops. The benchmark systematically evaluates the long-range propagation capabilities of 11 GNN architectures.
Conformal Prediction Adaptive to Unknown Subpopulation Shifts: To address the failure of standard conformal prediction under subpopulation shift, this paper proposes three adaptive algorithms: weighting calibration data via a learned domain classifier (Algorithms 1/2) or via embedding similarity (Algorithm 3). Coverage guarantees are maintained even with imperfect or absent domain labels, with applications to visual classification and LLM hallucination detection.
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science: DARE-bench is a large-scale verifiable benchmark for data science tasks, comprising 6,300 Kaggle-derived tasks that support evaluation across two dimensions—ML modeling and instruction following—along with training data for SFT and RL. SFT improves Qwen3-32B by 1.83×, while RL improves Qwen3-4B by more than 8×.
Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding: This paper proposes FlexQP — an "always feasible" convex quadratic programming (QP) solver based on \(\ell_1\) elastic relaxation — and combines it with deep unfolding to learn an LSTM feedback policy that accelerates convergence, yielding Deep FlexQP. When embedded as a submodule within an SQP framework, it solves nonlinear trajectory optimization problems 4–16× faster than OSQP, reduces safety violations in predictive safety filters by over 70%, and improves task completion rates by 43%.
Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces: This paper proposes Discount Model Search (DMS), which replaces the histogram-based discrete representation in CMA-MAE with a neural network that fits a continuous, smooth discount function. This addresses the issue of search stagnation caused by distortion in high-dimensional measure spaces, and enables, for the first time, the direct use of image datasets to define measure spaces (the QDDM paradigm).
Disentangling Shared and Private Neural Dynamics with SPIRE: A Latent Modeling Framework for Deep Brain Stimulation: This paper proposes SPIRE (Shared–Private Inter-Regional Encoder), a nonlinear dual-latent-space autoencoder framework that decomposes intracranial recordings from multiple brain regions into shared and private subspaces via cross-region alignment and orthogonal disentanglement losses. Trained exclusively on baseline data, SPIRE detects frequency-dependent network reorganization induced by DBS stimulation.
Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity: This paper empirically demonstrates that linear mode connectivity (LMC) between independently trained models can be achieved by simply increasing model width, without any parameter permutation. It further proposes Layer-wise Exponentially Weighted Connectivity (LEWC) to explain the underlying mechanism.
Enabling Fine-Grained Operating Points for Black-Box LLMs: This paper identifies that verbalized probabilities from black-box LLMs produce only 16–23 unique values (low-cardinality problem), resulting in coarse PR/ROC curves that prevent fine-grained threshold tuning. By injecting parameterized noise and an optional MLP correction, the number of unique values increases from 16 to 20,000+, matching the performance of 20-sample ensembles with only 1–2 API calls.
Function Spaces Without Kernels: Learning Compact Hilbert Space Representations: This paper proves that Function Encoders, which learn neural network basis functions, implicitly define a valid kernel, thereby bridging neural feature learning and RKHS theory. It further proposes PCA-guided compact basis selection algorithms and establishes finite-sample generalization bounds.
GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time: This paper proposes GuidedSampling, an inference-time algorithm that explicitly decouples the implicit exploration and generation process of repeated sampling (RS) into two stages: iteratively generating diverse problem-solving concepts/theorems, followed by generating candidate solutions conditioned on each concept. The method achieves an average improvement of ~21.6% on pass@50 and ~9.7% on pass@5 after fine-tuning.
How Reliable is Language Model Micro-Benchmarking?: This paper proposes Minimum Detectable Ability Difference (MDAD) as a meta-evaluation metric, systematically demonstrating that micro-benchmarks at extremely small scales cannot reliably distinguish model pairs with small performance gaps, and that random sampling becomes competitive with carefully designed micro-benchmark methods once the sample size reaches ~250.
Human-LLM Collaborative Feature Engineering for Tabular Learning: This paper proposes a human-LLM collaborative feature engineering framework that decouples the proposal and selection of feature operations. A Bayesian neural network models operation utility and uncertainty to guide selection, with selective human preference feedback incorporated when appropriate. The framework achieves 8.96%–11.23% average error rate reduction across 18 tabular datasets.
Improving Set Function Approximation with Quasi-Arithmetic Neural Networks: This paper proposes QUANN (Quasi-Arithmetic Neural Networks), which employs invertible neural networks to implement a learnable Kolmogorov mean as the pooling operation. It is the first to realize a machine-learning instantiation of generalized measures of central tendency. QUANN serves as a universal approximator for mean-decomposable set functions, and the learned embeddings exhibit stronger cross-task transferability.
In-Context Learning for Pure Exploration: This paper proposes ICPE (In-Context Pure Exploration), an in-context learning framework that combines supervised learning and reinforcement learning. Using a Transformer trained directly from experience, ICPE learns exploration policies for active sequential hypothesis testing and pure exploration problems, achieving near-optimal instance-adaptive algorithmic performance without explicit modeling of the information structure.
In-Context Learning of Temporal Point Processes with Foundation Inference Models: This paper proposes FIM-PP — the first foundation inference model for marked temporal point processes (MTPP). A Transformer is pretrained on 72K synthetic point processes (14.4M events) to perform in-context inference of conditional intensity functions. In zero-shot settings, FIM-PP matches the performance of specialized models trained for hours; after a few minutes of fine-tuning, it achieves state-of-the-art results on multi-event prediction across four real-world datasets.
LCA: Local Classifier Alignment for Continual Learning: This paper proposes Local Classifier Alignment (LCA), a loss function that simultaneously minimizes classification loss and loss sensitivity within local regions of class prototype Gaussian distributions. LCA addresses the classifier mismatch problem arising from incremental backbone merging in continual learning. Combined with an Incremental Merging (IM) strategy for PEFT modules, the method achieves an overall average accuracy of 85.6% across 7 benchmark datasets, substantially outperforming prior state-of-the-art methods.
LLM Unlearning with LLM Beliefs: This paper reveals that LLM unlearning methods such as GA and NPO suffer from a squeezing effect—reducing the probability of a target response causes probability mass to redistribute toward semantically related high-likelihood regions, resulting in spurious unlearning. The authors propose a bootstrapping-based framework that leverages the model's own high-confidence predictions (model beliefs) as additional unlearning targets. Two instantiations, BS-T (token-level) and BS-S (sequence-level), achieve more thorough unlearning while preserving model utility across multiple benchmarks including TOFU, MUSE, and WMDP.
Measuring Uncertainty Calibration: Addressing the problem of estimating the \(L_1\) calibration error of binary classifiers from finite samples, this paper proposes the first non-asymptotic, distribution-free methods for certifying an upper bound on that error. The methods rely on one of two structural assumptions, bounded variation or bounded derivatives; the latter can be guaranteed by applying a small perturbation to classifier outputs. Experiments demonstrate that the calibration error upper bound can be controlled to approximately 0.02 with \(10^7\) samples.
Mitigating Spurious Correlation via Distributionally Robust Learning with Hierarchical Ambiguity Sets: This paper proposes a hierarchical DRO framework that simultaneously captures inter-group uncertainty (shifts in group proportions) and intra-group uncertainty (distributional shifts within each group). By defining intra-group ambiguity sets in the semantic space via the \(W_\infty\) distance, the method achieves state-of-the-art performance on standard benchmarks and maintains strong robustness under a newly designed minority-group distributional shift setting where all competing methods fail.
MOSIV: Multi-Object System Identification from Videos: This paper proposes MOSIV—the first complete framework for multi-object system identification from multi-view videos—comprising three stages: (1) object-aware 4D dynamic Gaussian reconstruction of per-object geometry and motion; (2) Gaussian-to-continuum lifting to construct MPM simulation particles; and (3) differentiable MPM forward rollout with geometry-alignment objectives (3D Chamfer + 2D silhouette) to back-propagate and optimize per-object continuous material parameters (\(E, \nu, \mu\)). On a contact-rich synthetic benchmark spanning four material types (elastic, elastoplastic, fluid, and granular), MOSIV achieves PSNR 30.51 vs. OmniPhysGS 25.93 and reduces Chamfer distance by 9.4×, establishing a new baseline for multi-object long-horizon physical simulation.
Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses: This paper proposes MACI (Multi-LLM Adaptive Conformal Inference), which combines a cumulative-product conformity score, a multi-LLM ensemble for factuality scoring, and group-conditional calibration to significantly improve the retention rate of factual claims in LLM responses while strictly guaranteeing user-specified error rates.
Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization: This paper is the first to formally define the Noise-Aware Generalization (NAG) problem — simultaneously pursuing in-domain robustness and out-of-domain generalization under label noise — and proposes DL4ND, a method that detects noisy labels via cross-domain comparison, achieving up to 12.5% improvement across 7 datasets.
Non-Clashing Teaching in Graphs: Algorithms, Complexity, and Bounds: This paper studies non-clashing teaching of closed-neighborhood concept classes in graphs, providing tight algorithmic bounds (a matching \(2^{\mathcal{O}(|E|)}\) bound for N-NCTD⁺), FPT algorithms parameterized by treedepth and vertex cover (including the first FPT result with negative labels), and combinatorial upper bounds for planar graphs and unit square graphs, substantially advancing both the computational and combinatorial understanding of non-clashing teaching.
Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence: This paper leverages the geometric singularity boundaries of semi-discrete optimal transport (OT) to locate semantically ambiguous regions in latent space, generates proxy OOD samples (OTIS) near these boundaries, and applies a confidence suppression loss during training to enforce uniform predictions in structurally uncertain regions, thereby systematically mitigating OOD overconfidence in DNNs.
PlanetAlign: A Comprehensive Python Library for Benchmarking Network Alignment: This paper presents PlanetAlign, a PyTorch-based network alignment benchmark library integrating 18 datasets across 6 domains, 14 methods spanning three categories (consistency-based, embedding-based, and optimal transport-based), and a standardized evaluation pipeline. Through large-scale systematic experiments, PlanetAlign reveals that OT-based methods (PARROT/JOENA) achieve comprehensive superiority in effectiveness, while different method categories exhibit distinct trade-offs in scalability and robustness.
Predicting LLM Reasoning Performance with Small Proxy Model: This paper proposes rBridge, which uses reasoning traces from frontier models as gold labels and applies token-level task-aligned weighted NLL, enabling small models (≤1B) to effectively predict the reasoning performance of 13B–32B models, achieving over 100× computational savings on dataset ranking tasks.
Preference Leakage: A Contamination Problem in LLM-as-a-judge: This paper is the first to formally define and systematically investigate Preference Leakage in LLM-as-a-Judge — when the synthetic data generator \(M_G\) and the judge \(M_J\) are related (same model / inheritance / same family), the judge exhibits systematic preference toward the "associated student model." Under the same-model scenario, PLS reaches 28.7% on Arena-Hard, and this bias is more subtle and harder to detect than egocentric bias.
Prompt and Parameter Co-Optimization for Large Language Model Task Adaptation: This paper proposes MetaTuner, a framework that employs a shared meta-encoder to simultaneously generate query-specific prompts and LoRA parameters, enabling mutual reinforcement between prompt optimization and fine-tuning. A supervised regularization loss is designed to address the mixed discrete-continuous optimization problem. MetaTuner consistently outperforms standalone prompt optimization and fine-tuning methods on MATH, GSM8K, HotpotQA, and CosmosQA.
Prompt and Parameter Co-Optimization for Large Language Models: This paper proposes MetaTuner, a framework that simultaneously generates prompts and LoRA parameters via a shared meta encoder, unifying discrete prompt optimization and continuous parameter fine-tuning into an end-to-end jointly optimizable framework, achieving substantial improvements over independently optimized methods on mathematical reasoning and question answering tasks.
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty: This paper proposes RankLLM, a non-parametric framework based on bidirectional score propagation over a directed bipartite graph, which jointly estimates question difficulty and model competency to achieve difficulty-aware LLM ranking, reaching 90% agreement with human judgments.
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures: This paper identifies that the true driver of "benign relearning" in LLM machine unlearning is not topical relevance but syntactic similarity, and proposes a syntactic diversification strategy to improve unlearning robustness.
Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures: This paper reveals that the true driver of "benign relearning" in LLM machine unlearning is syntactic similarity rather than topical relevance, and proposes a syntactic diversification strategy (paraphrasing the forget set) that effectively suppresses relearning, accelerates forgetting, and alleviates the trade-off between unlearning efficacy and model utility.
Revisiting the Past: Data Unlearning with Model State History: This paper proposes MSA (Model State Arithmetic), an algorithm that leverages intermediate training checkpoints to construct "forgetting vectors" and removes the influence of specific data via parameter-space arithmetic. MSA consistently outperforms existing unlearning methods such as NPO, RMU, and GradDiff on the TOFU and RESTOR benchmarks, while maintaining model utility even without a retain set.
Same Content, Different Representations: A Controlled Study for Table QA: This paper presents the first controlled study that systematically evaluates the robustness of NL2SQL, LLM, and hybrid approaches under varying table size, schema quality, and query complexity. It changes only the representation format (structured vs. semi-structured) while holding table content constant, and demonstrates that representation format is a first-order factor in Table QA performance.
SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs: SimpleToM exposes a critical gap in LLMs' Theory of Mind capabilities: frontier models can accurately infer others' mental states (explicit ToM), but performance drops sharply when this knowledge must be applied to behavior prediction and behavior judgment (applied ToM), revealing a substantial divide between "knowing what" and "knowing how to use what is known."
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home Agents: SimuHome is a high-fidelity smart home simulator built on the Matter protocol and a 600-episode evaluation benchmark supporting dynamic environmental variable updates and time-accelerated scheduling evaluation, revealing that workflow scheduling remains the most persistent challenge for current LLM agents.
Soft Quality-Diversity Optimization: This paper proposes the Soft QD Score as a novel quality-diversity optimization objective that eliminates the need for behavior space discretization, and derives a differentiable algorithm, SQUAD, which scales more effectively to high-dimensional behavior spaces while achieving competitive performance on standard benchmarks.
Spectral Attention Steering for Prompt Highlighting: This paper proposes SEKA/AdaSEKA, which learns a "relevance subspace" via spectral decomposition of key embeddings and directly edits key vectors prior to attention computation to achieve prompt highlighting. The approach requires no storage of the full attention matrix, is fully compatible with FlashAttention, and incurs negligible overhead (+0.03s/sample).
Subliminal Signals in Preference Labels: This paper demonstrates that preference labels can serve as a covert communication channel: even when a student model generates semantically irrelevant numeric sequences, a biased judge model can transmit subliminal behavioral tendencies to the student model through binary preference labels alone, and this transmission is amplified under iterative alignment.
TabStruct: Measuring Structural Fidelity of Tabular Data: This paper proposes the TabStruct evaluation framework and a global utility metric that measures the structural fidelity of tabular data generators with respect to causal structure, without requiring ground-truth causal graphs. A systematic comparison of 13 generators across 29 datasets reveals that diffusion models significantly outperform other methods in preserving global structure.
Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis: This paper proposes TED (Talk, Evaluate, Diagnose), a framework that achieves user-aware dynamic agent evaluation via general, reusable expert/non-expert persona templates; enables fine-grained efficiency assessment through grading notes, LLM-as-judge scoring, and novel metrics such as MaxProgressRate@k; and provides actionable improvement feedback via automated error discovery and clustering. Experiments on τ²-bench and ToolSandbox reveal new insights into agent performance.
Towards Anomaly-Aware Pre-Training and Fine-Tuning for Graph Anomaly Detection: This paper proposes the APF framework, which addresses the dual challenges of label scarcity and homophily disparity in graph anomaly detection through Rayleigh quotient-guided anomaly-aware pre-training and granularity-adaptive fine-tuning.
Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction: This paper proposes applying the Peer Prediction mechanism from game theory to LLM evaluation and training. By measuring the mutual predictability of participants' answers, the method distinguishes honest from deceptive responses without requiring ground-truth labels, thereby incentivizing truthfulness. It exhibits a striking inverse scaling property — weaker experts are actually more resistant to deception by stronger models.
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking: This paper identifies and formalizes the problem of Unindexed Information Seeking (UIS), which targets content that search engines cannot retrieve directly, such as dynamic web pages, embedded files, and interactive content. It proposes the first UIS benchmark, UIS-QA (110 questions), along with the multi-agent framework UIS-Digger. A ~30B parameter model trained with SFT+RFT achieves 27.27% accuracy, surpassing systems integrating O3/GPT-4.1.
Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework: This paper proposes the HUMAINE framework, which conducts multi-dimensional (5-axis), multi-turn human preference evaluations of 28 SOTA models using 23,404 demographically stratified participants. A hierarchical Bayesian BTD model reveals that age is the largest driver of preference heterogeneity (mean rank shift ±2.8), demonstrating that a single aggregated leaderboard is insufficient to reflect the true preferences of diverse populations.
Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework: This paper proposes the HUMAINE framework, which conducts multi-dimensional evaluations of 28 models with 23,404 demographically stratified participants, revealing that age is the greatest axis of divergence in human preference and that a single leaderboard obscures critical differences.
vCache: Verified Semantic Prompt Caching: This paper proposes vCache — the first semantic caching system with user-defined error-rate guarantees — which employs online learning to independently estimate the optimal similarity threshold for each cached embedding. Without any pre-training, vCache achieves up to a 12.5× improvement in cache hit rate and a 26× reduction in error rate while satisfying correctness constraints.
When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining: This paper reveals a fundamental vulnerability of Unlearnable Examples (UE) against pretrained models—pretraining priors enable models to bypass the spurious shortcuts injected by UE and recover learning of true semantics—and proposes BAIT, a bilevel optimization framework that counters pretraining priors by binding perturbations to incorrect target labels.
When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining: This paper exposes a fundamental vulnerability of Unlearnable Examples (UEs) against pretrained models — pretraining priors enable models to bypass perturbation shortcuts and learn true semantics — and proposes the BAIT framework, which counters pretraining priors by binding perturbations to incorrect target labels.
When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling: This paper proposes SAFE (Stable And Fast LLM Ensembling), which selectively ensembles multiple heterogeneous-tokenizer LLMs at the token level via a Generate-Verify-Ensemble loop. SAFE addresses OOV-like contamination caused by tokenization mismatch in long-sequence generation, achieving performance gains by ensembling on fewer than 1% of tokens—improving UniTE from 59.6% to 77.4% on MATH500.
Which LLM Multi-Agent Protocol to Choose?: This paper introduces ProtocolBench and ProtocolRouter, presenting the first systematic comparison of multi-agent communication protocols (A2A, ACP, ANP, Agora, etc.) across four dimensions—task success rate, latency, message overhead, and robustness—and proposes a learnable protocol router for scenario-adaptive protocol selection, reducing fault recovery time by up to 18.1%.
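
The low-cardinality problem described in the "Enabling Fine-Grained Operating Points for Black-Box LLMs" entry above can be illustrated with a small, self-contained sketch. It is not the paper's method: the simulated scores, the `jitter` helper, and the uniform noise width are all assumptions made for illustration, and the paper's parameterized noise and MLP correction are not reproduced. The sketch only shows why a handful of distinct verbalized probabilities yields few usable thresholds, and how adding noise increases the number of distinct score values.

```python
import random


def count_operating_points(scores):
    """Count distinct score values; each one is a candidate decision threshold."""
    return len(set(scores))


def jitter(scores, width=0.025, seed=0):
    """Hypothetical helper: add uniform noise in [-width, width], clipped to [0, 1].

    This random jitter only breaks ties between equal scores. It is an
    assumption for illustration, not the parameterized noise from the paper.
    """
    rng = random.Random(seed)
    return [min(1.0, max(0.0, s + rng.uniform(-width, width))) for s in scores]


# Simulated verbalized probabilities: a model that only ever answers in
# multiples of 0.05 (an assumed stand-in for a black-box LLM's outputs).
sim_rng = random.Random(42)
coarse = [sim_rng.randrange(0, 21) / 20 for _ in range(5000)]
fine = jitter(coarse)

n_coarse = count_operating_points(coarse)
n_fine = count_operating_points(fine)
print(f"distinct values before noise: {n_coarse}")
print(f"distinct values after noise:  {n_fine}")
```

Random jitter by itself does not make the scores more informative; it only lets a threshold fall between samples that previously tied, which is why the entry also mentions a learned correction.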