📊 LLM Evaluation

🧪 ICML2025 · 22 paper notes

AAAR-1.0: Assessing AI's Potential to Assist Research

The AAAR-1.0 benchmark is proposed to systematically evaluate the actual capabilities of LLMs in assisting scientific research across four expert-level tasks: equation inference, experimental design, paper weakness detection, and peer review critique. The benchmark reveals significant deficiencies in current models when performing deep research tasks.

Are LLM Belief Updates Consistent with Bayes' Theorem?

This paper proposes the Bayesian Coherence Coefficient (BCC) to quantify whether LLM belief updates conform to Bayes' theorem, revealing that larger and more powerful pretrained models exhibit belief updates that are more consistent with Bayes' theorem when presented with new evidence.
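
For reference, the update that a Bayes-consistent model would have to match can be written as below. The symbols \(H\) (a hypothesis) and \(E\) (new evidence) are notation assumed here; the exact definition of the BCC is not given in this note.

\[
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
\]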

Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

This paper proposes SITAlign, a satisficing alignment framework based on bounded rationality, which maximizes the primary objective (e.g., helpfulness) at inference time while ensuring secondary objectives (e.g., harmlessness) satisfy threshold constraints. Solved through duality theory, it achieves a 22.3% win rate improvement over state-of-the-art multi-objective decoding on GPT-4 evaluation.
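
One common way to write a satisficing problem of this kind is sketched below. The notation is assumed rather than taken from the paper: \(r_1\) is the primary reward, \(r_2\) a secondary reward, \(\tau\) its threshold, \(\pi\) the decoding policy, and \(\lambda \ge 0\) a dual variable.

\[
\max_{\pi} \; \mathbb{E}_{y \sim \pi}[r_1(y)] \quad \text{s.t.} \quad \mathbb{E}_{y \sim \pi}[r_2(y)] \ge \tau
\]

\[
L(\pi, \lambda) = \mathbb{E}_{y \sim \pi}[r_1(y)] + \lambda \left( \mathbb{E}_{y \sim \pi}[r_2(y)] - \tau \right)
\]

Under this generic formulation, a dual approach would minimize over \(\lambda\) the maximum of \(L\) over \(\pi\); how SITAlign does this in detail is not specified here.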

Communicating Activations Between Language Model Agents

This paper proposes a method that lets LLM agents communicate via intermediate-layer activations instead of natural language: the activation vector of Model A is injected into the intermediate layers of Model B during its forward pass. The method requires no additional parameters or training data, improves performance by up to 27% over natural language communication across multiple reasoning benchmarks, and uses only 1/4 of the computation.

Consistency in Language Models: Current Landscape, Challenges, and Future Directions

This paper systematically surveys the landscape of LLM consistency research, proposing a taxonomy that comprises logical consistency (negation, symmetry, transitivity), semantic consistency, factual/informational consistency, and non-logical consistency (morality/norms). It analyzes the deficiencies of evaluation methods from 2019 to 2025 and calls for the establishment of standardized multilingual benchmarks and interdisciplinary approaches.

Correlated Errors in Large Language Models

Through a large-scale empirical analysis of over 350 LLMs, this paper reveals highly correlated error patterns across different LLMs. When both models make mistakes, they choose the same incorrect answer in approximately 60% of cases, and more accurate models exhibit higher correlation. Furthermore, the paper investigates the downstream impacts of this correlation on LLM-as-Judge evaluation and the labor market.
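
The "same incorrect answer" statistic can be illustrated with a minimal sketch. The function name, the toy answer lists, and the exact definition (agreement rate restricted to items both models get wrong) are assumptions made for illustration, not the paper's code.

```python
def shared_error_agreement(preds_a, preds_b, gold):
    """Among items that BOTH models answer incorrectly, return the
    fraction on which they pick the same wrong answer."""
    both_wrong = [
        (a, b)
        for a, b, g in zip(preds_a, preds_b, gold)
        if a != g and b != g
    ]
    if not both_wrong:
        return float("nan")  # undefined when there are no shared errors
    same = sum(1 for a, b in both_wrong if a == b)
    return same / len(both_wrong)


# Hypothetical multiple-choice data, for illustration only.
gold = ["A", "B", "C", "D", "A"]
model_a = ["A", "C", "D", "D", "B"]
model_b = ["B", "C", "D", "A", "C"]

agreement = shared_error_agreement(model_a, model_b, gold)
```

On this toy data both models are wrong on three items and agree on two of them. With four answer options, two models erring independently and uniformly would agree on about one third of shared errors, which is the kind of baseline a reported rate of roughly 60% would be compared against.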

DataDecide: How to Predict Best Pretraining Data with Small Experiments

This work constructs DataDecide, the largest open model suite to date (25 data recipes × 14 model scales × 3 random seeds), to systematically study how small-scale experiments can predict the best pretraining data. The study reveals that a single small-scale ranking (e.g., at 150M parameters) achieves approximately 80% pairwise decision accuracy, and continuous likelihood proxy metrics require only 0.01% of the target compute to reach over 80% prediction accuracy across multiple benchmarks.
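
A plausible reading of "pairwise decision accuracy" is the fraction of recipe pairs whose ordering at small scale matches their ordering at the target scale. The sketch below implements that reading; the function name, the recipe labels, the scores, and the treatment of ties are all assumptions.

```python
from itertools import combinations


def pairwise_decision_accuracy(small, target):
    """Fraction of recipe pairs for which the small-scale scores pick
    the same winner as the target-scale scores.

    `small` and `target` map recipe name -> benchmark score.
    Ties are compared with strict '>' on both sides (an assumption).
    """
    pairs = list(combinations(sorted(small), 2))
    agree = sum(
        1
        for a, b in pairs
        if (small[a] > small[b]) == (target[a] > target[b])
    )
    return agree / len(pairs)


# Hypothetical scores for four data recipes, for illustration only.
small_scores = {"r1": 0.30, "r2": 0.35, "r3": 0.32, "r4": 0.40}
target_scores = {"r1": 0.50, "r2": 0.58, "r3": 0.60, "r4": 0.65}

accuracy = pairwise_decision_accuracy(small_scores, target_scores)
```

Here only the pair (r2, r3) flips between scales, so five of the six pairwise decisions carry over.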

Disentangling and Integrating Relational and Sensory Information in Transformer Architectures

This paper proposes the Dual Attention Transformer (DAT). By introducing "relational attention" heads into the standard attention mechanism, it processes sensory and relational information separately and in parallel before integrating them. DAT exhibits significant improvements in data and parameter efficiency across relational reasoning benchmarks, mathematical problem solving, image recognition, and language modeling.

EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

EnIGMA is an LM agent designed to autonomously solve Capture The Flag (CTF) challenges. By introducing novel interactive agent tools (debuggers and server connection utilities), it enables LM agents to execute interactive terminal programs for the first time. It achieves state-of-the-art (SOTA) results across 390 CTF challenges from 4 benchmarks and uncovers "soliloquizing," a new type of hallucination behavior.

MultiCogEval: Evaluating LLMs Across Multi-Cognitive Levels

Inspired by Bloom's Taxonomy, this work proposes a multi-cognitive level evaluation framework, MultiCogEval, to assess the medical capabilities of LLMs across three levels: knowledge mastery, comprehensive application, and situational problem-solving. The findings reveal that the performance of all models decreases significantly as cognitive complexity increases, and model scale becomes more critical at higher levels.

Fleet of Agents: Coordinated Problem Solving with Large Language Models

This paper proposes Fleet of Agents (FoA), which coordinates LLM reasoning across multiple agents using genetic particle filtering: the agents explore independently, are resampled according to heuristic value functions, and branch dynamically to adapt to discovered solutions. It improves quality by 5% on average compared to SOTA methods while requiring only 40% of the cost.
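
The explore-then-resample loop can be sketched in a few lines. This is a generic particle-filter-style toy, not the FoA implementation: `fleet_search`, `resample`, the integer "states", and the toy step and value functions are all hypothetical stand-ins for LLM agents and their heuristics.

```python
import random


def resample(states, value_fn, rng):
    """Draw len(states) states with replacement, with probability
    proportional to value_fn(state). Negative values count as zero."""
    weights = [max(0.0, value_fn(s)) for s in states]
    if sum(weights) == 0:
        return list(states)  # nothing to prefer; keep the fleet as is
    return rng.choices(states, weights=weights, k=len(states))


def fleet_search(init_state, step_fn, value_fn, n_agents=4, n_rounds=5, seed=0):
    """Alternate independent exploration and value-based resampling."""
    rng = random.Random(seed)
    states = [init_state] * n_agents
    for _ in range(n_rounds):
        # Each agent takes one independent exploration step.
        states = [step_fn(s, rng) for s in states]
        # Promising states are duplicated, weak ones tend to be dropped.
        states = resample(states, value_fn, rng)
    return max(states, key=value_fn)


# Toy problem: walk on the integers, preferring states close to 10.
def toy_step(state, rng):
    return state + rng.choice([-1, 1])


def toy_value(state):
    return 1.0 / (1 + abs(10 - state))


best = fleet_search(0, toy_step, toy_value)
```

Resampling with replacement is what lets several agents continue from the same promising state, which corresponds to the branching behavior described above.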

G-Sim: Generative Simulations with Large Language Models and Gradient-Free Calibration

This paper proposes G-Sim, a hybrid framework that utilizes LLMs to automatically design the causal structures of simulators (submodules and connectivity), and then calibrates empirical numerical parameters using gradient-free optimization (GFO) or simulation-based inference (SBI) within an iterative refinement loop to generate reliable, intervenable, and general-purpose simulators.

How Much Can We Forget about Data Contamination?

This work systematically quantifies the impact of data contamination on LLM benchmark evaluation through controlled experiments. It finds that when trained on more than five times the Chinchilla-optimal data volume, even contaminated data repeated 144 times can be completely forgotten. It further demonstrates that weight decay is the key mechanism driving forgetting, leading to the inference that large models like Llama 3 405B have already forgotten the data from their early training stages.

Hyperband-based Bayesian Optimization for Black-box Prompt Selection

This work proposes the HbBoPs method, which combines a structure-aware deep kernel Gaussian process (separately encoding instructions and few-shot exemplars) with a Hyperband multi-fidelity scheduler. It achieves both sample efficiency and query efficiency in black-box LLM prompt selection, outperforming all SOTA methods across ten benchmarks and three LLMs.
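
The multi-fidelity idea can be illustrated with successive halving, the inner routine that Hyperband repeats over several brackets. The sketch below covers only that routine and leaves out the Gaussian process entirely; the function name, the prompt identifiers, and the simulated validation outcomes are assumptions.

```python
def successive_halving(candidates, evaluate, min_budget=1, eta=2):
    """Score all candidates on a small budget, keep the top 1/eta,
    multiply the budget by eta, and repeat until one candidate is left.

    `evaluate(candidate, budget)` returns a score (higher is better),
    e.g. accuracy of a prompt on `budget` validation examples.
    """
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate")
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


# Hypothetical per-example correctness (1 = correct) for four prompts.
outcomes = {
    "p0": [1, 0, 0, 0],
    "p1": [1, 1, 0, 1],
    "p2": [0, 1, 1, 1],
    "p3": [0, 0, 1, 0],
}


def accuracy_on_prefix(prompt, budget):
    """Accuracy on the first `budget` validation examples."""
    seen = outcomes[prompt][:budget]
    return sum(seen) / len(seen)


winner = successive_halving(list(outcomes), accuracy_on_prefix)
```

In this toy run `p2` is discarded after a single example even though its full accuracy matches that of the winner `p1`. Low-fidelity evaluations are cheap but noisy, which is the trade-off that Hyperband's multiple brackets are meant to balance.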

Learning Distribution-Wise Control in Representation Space for Language Models

Deterministic nodes in representation fine-tuning are replaced with randomized nodes. By employing the reparameterization trick to learn latent distributions instead of single pointwise transformations, consistent performance gains are achieved across commonsense and mathematical reasoning tasks, with intervention in earlier layers exhibiting the most significant impact.
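
The reparameterization trick mentioned above can be shown on a toy hidden vector. This is a minimal sketch under stated assumptions: the additive form of the intervention, the function name, and the parameters `mu` and `log_sigma` are hypothetical, and a real implementation would use an autodiff framework so that gradients reach both parameters.

```python
import math
import random


def sample_intervention(hidden, mu, log_sigma, rng):
    """Shift each hidden unit by a sample from N(mu, sigma^2), written
    as mu + sigma * eps with eps ~ N(0, 1).

    Because the randomness lives only in eps, the output is a
    deterministic, differentiable function of mu and log_sigma.
    """
    return [
        h + m + math.exp(ls) * rng.gauss(0.0, 1.0)
        for h, m, ls in zip(hidden, mu, log_sigma)
    ]


rng = random.Random(0)
hidden = [0.5, -1.0, 2.0]
mu = [0.1, 0.0, -0.2]

# A very small sigma recovers a (nearly) deterministic pointwise edit.
near_point = sample_intervention(hidden, mu, [-50.0] * 3, rng)

# A larger sigma yields a different edited representation per draw.
noisy = sample_intervention(hidden, mu, [0.0] * 3, rng)
```

Parameterizing the scale through `log_sigma` keeps the standard deviation positive without a constraint, which is a common design choice rather than something stated in the source.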

Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

By leveraging community content from the Art of Problem Solving (AoPS) forum, this work constructs AoPS-Instruct, a training set of 652K Olympiad-level mathematical QA pairs, and LiveAoPSBench, a timestamped contamination-resistant evaluation set. It reveals that the high performance of LLMs on older datasets may stem from pre-training data leakage rather than genuine reasoning capabilities.

LLM-SRBench: A New Benchmark for Scientific Equation Discovery with LLMs

This paper proposes the LLM-SRBench benchmark (239 problems across 4 scientific domains), which prevents LLM memorization through equation transformation (LSR-Transform) and synthetic problems (LSR-Synth). The current best method achieves only 31.5% symbolic accuracy.

PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

PhantomWiki is an evaluation framework that generates fictional world corpora and QA pairs on demand. By controlling reasoning difficulty via context-free grammars (CFGs) and adjusting retrieval difficulty via universe scale, it achieves a decoupled evaluation of LLM reasoning and retrieval capabilities while naturally resisting data leakage.

Position: Theory of Mind Benchmarks are Broken for Large Language Models

This position paper points out that most current LLM Theory of Mind (ToM) benchmarks only evaluate whether models can "predict others' behavior" (Literal ToM), but fail to test whether they can "respond optimally based on that prediction" (Functional ToM). Consequently, they systematically overestimate models' adaptive capabilities in real interactions.

Sample Efficient Demonstration Selection for In-Context Learning

This paper proposes a sample-efficient demonstration selection method for in-context learning (ICL). Under limited annotation budgets, it efficiently selects the optimal combination of demonstrations, significantly improving the ICL performance of LLMs while dramatically reducing the required amount of labeled data.

The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

This paper proposes the GraphFilter method, which models SFT datasets as sentence–n-gram bipartite graphs and simultaneously optimizes data quality and diversity through a multiplicative priority function, outperforming 9 baseline methods across 3 models and 6 benchmarks.

Unlocking Post-hoc Dataset Inference with Synthetic Data

This paper proposes utilizing synthetically generated held-out datasets combined with post-hoc calibration to achieve dataset inference without the need for real held-out sets. It generates high-quality synthetic data via suffix completion and decouples generative shift from membership signals using dual-classifier calibration, achieving high-confidence copyright detection with low false positive rates across 15 diverse text datasets.