📊 LLM Evaluation
🧠 NeurIPS2025 · 37 paper notes
🔥 Top topics: LLM ×11 · Alignment/RLHF ×4 · Reasoning ×2
- AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
This work identifies that random data sampling in STaR (Self-Taught Reasoner) leads to severely imbalanced training frequencies across observations: easy problems are over-trained while hard problems are under-trained. It proposes AdaSTaR, which combines adaptive diversity sampling (prioritizing under-trained samples) with adaptive curriculum sampling (adjusting difficulty to the model's current strength), achieving the highest accuracy on all 6 benchmarks while reducing training FLOPs by 58.6%.
- Bayesian Evaluation of Large Language Model Behavior
This paper proposes a Beta-Binomial Bayesian framework for evaluating LLM behavior. By modeling the posterior distribution of a per-prompt parameter \(\theta_m\) over stochastic generations, the framework quantifies statistical uncertainty in evaluation metrics and introduces sequential sampling strategies such as Thompson sampling to achieve narrower credible intervals with fewer API calls.
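As a rough illustration of the Beta-Binomial idea in this note, the sketch below updates a Beta prior with the outcomes of repeated generations for a single prompt, then reports the posterior mean and a Monte Carlo credible interval. The uniform Beta(1, 1) prior, the function names, and the sampling-based interval are assumptions made for illustration; they are not taken from the paper.

```python
import random


def beta_posterior(successes, trials, alpha0=1.0, beta0=1.0):
    """Posterior Beta parameters after observing `successes` out of `trials`.

    alpha0/beta0 define the prior; Beta(1, 1) (uniform) is an assumed default.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return alpha0 + successes, beta0 + (trials - successes)


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def credible_interval(alpha, beta, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval estimated from posterior samples."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


# Hypothetical data: 7 of 10 generations for one prompt show the behavior.
alpha, beta = beta_posterior(successes=7, trials=10)
low, high = credible_interval(alpha, beta)
```

With more generations per prompt the interval narrows, which is the quantity a sequential sampling strategy would try to shrink with as few calls as possible.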
- Benchmarking is Broken — Don't Let AI be its Own Judge
This paper systematically critiques fundamental flaws in current AI benchmark evaluation, namely data contamination (45%+ overlap in MMLU), selective reporting, and lack of proctoring. It proposes PeerBench, which draws on the proctoring paradigm of high-stakes exams (e.g., SAT/GRE) to build a next-generation AI evaluation infrastructure through a rolling confidential question bank, peer-review quality control, reputation-weighted scoring, and cryptographic commitment mechanisms.
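The note does not say which commitment scheme PeerBench uses. Purely as an illustration of the general idea, the sketch below shows a generic salted-hash commit-and-reveal: a contributor publishes a digest of a confidential question now and reveals the question and nonce later, so others can check it was not changed. The function names and the SHA-256 construction are assumptions, not the paper's design.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def commit(question: str, nonce: Optional[bytes] = None) -> Tuple[str, bytes]:
    """Return (commitment, nonce) for a confidential question.

    The random nonce keeps short or guessable questions from being
    brute-forced from the published digest.
    """
    if nonce is None:
        nonce = secrets.token_bytes(32)
    digest = hashlib.sha256(nonce + question.encode("utf-8")).hexdigest()
    return digest, nonce


def verify(commitment: str, question: str, nonce: bytes) -> bool:
    """Check that a revealed question and nonce match the earlier commitment."""
    expected = hashlib.sha256(nonce + question.encode("utf-8")).hexdigest()
    return hmac.compare_digest(commitment, expected)


# Hypothetical usage: publish `digest` now, reveal the question and nonce later.
digest, nonce = commit("Which of the following statements is true?")
```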
- Benchmarking Large Language Models for Zero-Shot and Few-Shot Phishing URL Detection
This paper systematically evaluates three commercial LLMs — GPT-4o, Claude-3.7, and Grok-3-Beta — on phishing URL detection under a unified zero-shot and few-shot prompt framework. Results show that few-shot prompting consistently improves performance across all models, with Grok-3-Beta achieving the best F1 (0.9399) on the balanced dataset, while different models exhibit distinct precision–recall trade-off behaviors.
- Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
This paper formalizes LLM benchmark evaluation as a hierarchical statistical model, theoretically demonstrates that multiple stochastic generations (\(k>1\)) reduce the variance of benchmark score estimates, and introduces a prompt-level difficulty metric \(\mathbb{P}(\text{correct})\) along with data maps for benchmark quality control.
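To make the variance claim concrete, here is a minimal sketch under a simple two-level model that is assumed for illustration (the paper's exact formulation may differ): prompt \(i\) has a latent success probability \(p_i\), prompts are drawn independently from a population, and each of the \(k\) generations is an independent Bernoulli(\(p_i\)) draw. By the law of total variance, the mean score over \(n\) prompts then has variance \(\big(\mathrm{Var}(p) + \mathbb{E}[p(1-p)]/k\big)/n\). The function name and the example probabilities are hypothetical.

```python
from statistics import fmean, pvariance


def score_variance(p, k):
    """Variance of the mean benchmark score under the assumed two-level model.

    p: per-prompt success probabilities, treated here as the prompt population.
    k: number of stochastic generations sampled per prompt.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(p)
    between = pvariance(p)                           # spread of difficulty across prompts
    within = fmean([pi * (1.0 - pi) for pi in p])    # average per-generation noise
    return (between + within / k) / n


# Hypothetical per-prompt probabilities.
probs = [0.2, 0.5, 0.9, 0.7]
variances = {k: score_variance(probs, k) for k in (1, 5, 20)}
```

Only the within-prompt term shrinks with \(k\); the between-prompt term can only be reduced by adding prompts.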
- Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
This paper proposes LAGER, a framework that aggregates score token logits from intermediate to final layers of an LLM and computes an expected score to derive the final judgment. Without any model fine-tuning, LAGER improves human alignment by up to 7.5% and matches or surpasses reasoning-based methods without requiring chain-of-thought inference.
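The "expected score" step can be sketched as follows. The code combines the logits of the candidate score tokens across layers, applies a softmax, and takes the probability-weighted mean of the scores. Averaging layers with uniform weights is an assumption made here for simplicity; the note does not say how LAGER weights layers, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(layer_logits, scores, layer_weights=None):
    """Expected judge score from per-layer logits of the score tokens.

    layer_logits: one list per layer, each holding a logit per candidate score.
    scores: numeric value of each candidate score token (e.g. 1..5).
    layer_weights: optional weights per layer; uniform if omitted (assumption).
    """
    n_layers = len(layer_logits)
    if layer_weights is None:
        layer_weights = [1.0 / n_layers] * n_layers
    aggregated = [
        sum(w * layer[j] for w, layer in zip(layer_weights, layer_logits))
        for j in range(len(scores))
    ]
    probs = softmax(aggregated)
    return sum(s * p for s, p in zip(scores, probs))


# Hypothetical logits from two layers for score tokens "1".."5".
judgment = expected_score(
    [[0.1, 0.3, 1.2, 2.0, 0.5], [0.0, 0.2, 1.0, 2.5, 0.7]],
    scores=[1, 2, 3, 4, 5],
)
```

Unlike taking the argmax token, the expected value yields a continuous score that reflects how the probability mass is spread over neighboring ratings.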
- BLINK-Twice: You See But Do You Observe? A Reasoning Benchmark on Visual Perception
This paper introduces BLINK-Twice, a vision-centric reasoning benchmark comprising 345 visually challenging images, 103 adversarial samples, 896 VQA pairs, and 1,725 annotated reasoning steps. Through seven categories of visual illusion scenarios, it evaluates the "you see but do not observe" reasoning capability of MLLMs. The strongest model, Gemini-2.5 Pro, achieves only 26.9% G-Acc, suggesting that multi-round image observation and active visual interaction are promising directions for improvement.
- Can Large Language Models Master Complex Card Games?
This paper systematically evaluates the ability of LLMs to learn eight complex card games. It finds that through SFT on high-quality game trajectory data, LLMs can approach the performance of strong game AIs and simultaneously master multiple games, though general capabilities degrade — a decline that can be mitigated by mixing in general instruction data.
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
This paper proposes CodeAssistBench (CAB), the first fully automated benchmark for evaluating multi-turn, repository-level programming assistance. CAB automatically constructs 3,286 real-world programming help scenarios from GitHub Issues, spanning 7 languages and 214 repositories, and reveals a substantial performance gap: state-of-the-art models achieve 70–83% on StackOverflow-style questions but only 7–16% on post-cutoff repositories.
- ComPO: Preference Alignment via Comparison Oracles
To address likelihood displacement and verbosity caused by noisy preference pairs (where preferred and dispreferred responses are highly similar) in DPO, this paper proposes ComPO, a zeroth-order preference alignment method based on comparison oracles. The approach partitions data into clean and noisy subsets, applying DPO to the clean subset and ComPO to extract alignment signals from the noisy subset, achieving consistent improvements in LC win rate on benchmarks such as AlpacaEval 2.
- ConTextTab: A Semantics-Aware Tabular In-Context Learner
ConTextTab integrates semantic embeddings (text encodings of column names and categorical values) into a table-native ICL architecture, and pretrains on large-scale real-world tabular data (T4, ~2.18M tables). It achieves a new state of the art on the semantics-rich CARTE benchmark while remaining competitive with existing methods on non-semantic benchmarks.
- Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
This work constructs the Braingle Brainteaser benchmark (242 math + 236 logic puzzles) and systematically evaluates LLM reasoning strategies on brainteasers. The findings reveal that models occasionally produce creative, insight-driven solutions, but frequently fall back on brute-force enumeration even when elegant solutions exist; self-correction ability is limited; and translating narrative formats into mathematical formats yields modest performance gains.
- DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering
This paper proposes Dual-Stage Adaptive Sharpening (DSAS), a training-free plug-and-play attention optimization framework. It employs Contextual Gate Weighting (CGW) to enhance attention from key passages toward the question and target positions, and Reciprocal Attention Suppression (RAS) to suppress information exchange between key and irrelevant passages, achieving an average F1 improvement of 4.2% on multi-document QA benchmarks.
- Efficient Semantic Uncertainty Quantification in Language Models via Diversity-Steered Sampling
This paper proposes a diversity-steered sampling framework that injects NLI-based semantic similarity penalties during decoding to encourage semantically diverse generation, and corrects distributional bias via importance weighting with control variates to reduce variance. The method accurately estimates semantic entropy (aleatoric uncertainty) and mutual information (epistemic uncertainty) of LLMs using as few as 16 samples.
- EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving
This paper proposes EvaLearn, a benchmark that evaluates the learning capability and learning efficiency of LLMs through a sequential problem-solving paradigm, revealing that models with stronger static performance do not necessarily possess greater learning potential.
- Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Through controlled experiments, this paper reveals the fundamental mechanism by which larger vocabularies improve language model performance: expanding the vocabulary reduces the Kolmogorov complexity of tokenized text, exploiting vocabulary frequency imbalance to substantially lower the loss on high-frequency tokens, thereby driving down global cross-entropy and improving downstream task performance.
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
This paper proposes HybridNorm, a hybrid normalization strategy that applies QKV normalization within the attention module to decouple gradients and Post-Norm within the FFN to enhance regularization. Across scales from 550M to 7B parameters, HybridNorm simultaneously achieves the training stability of Pre-Norm and the generalization performance of Post-Norm, yielding an average downstream task improvement of 2.45% at the 7B scale.
- Hyperbolic Fine-Tuning for Large Language Models
This work identifies that LLM token embeddings follow power-law distributions and exhibit tree-like hyperbolic structure, and proposes HypLoRA — performing low-rank adaptation directly on the Lorentz hyperbolic manifold (bypassing the cancellation effect of tangent space mappings) — achieving significant gains over standard LoRA on arithmetic and commonsense reasoning tasks (e.g., M.AVG +7.5% on Qwen2.5-7B).
- Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities
This paper introduces the Ineq-Comp benchmark, which applies compositionally transformed variants of simple inequality seed problems—variants that humans can resolve with minimal additional effort—to expose fundamental deficiencies in the compositional reasoning of current LLM-based formal theorem provers. Even DeepSeek-Prover-V2-7B suffers a performance drop exceeding 20%.
- Leveraging Robust Optimization for LLM Alignment under Distribution Shifts
This paper proposes DoRA (Distribution-aware Optimization for Robust Alignment), which trains a distribution classifier to assign calibrated weights to individual samples and incorporates them into a KL-DRO framework to minimize worst-case loss. DoRA operates as a model-agnostic plug-and-play module that consistently improves the robustness of various alignment algorithms—including DPO, RRHF, and LIRE—under distribution shifts.
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw
LTD-Bench evaluates the spatial reasoning capabilities of LLMs by having them draw (via dot-matrix output or code-based rendering), transforming abstract evaluation metrics into intuitive visual outputs. The benchmark reveals critical deficiencies in current state-of-the-art LLMs regarding bidirectional mapping between linguistic and spatial concepts.
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
This paper proposes the MEMTRACK benchmark to evaluate LLM agents' long-term memory and state tracking capabilities in multi-platform dynamic environments (Slack/Linear/Git), revealing that even the strongest model, GPT-5, achieves only 60% accuracy.
- Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories
This paper presents the first systematic evaluation of how train/test splitting strategies affect generalization performance in attribute prediction tasks. It proposes four progressively harder splitting schemes based on LLM semantic grouping, embedding similarity, embedding clustering, and ground-truth supercategory labels. The study finds that unsupervised clustering-based splitting achieves leakage reduction comparable to ground-truth supercategory splits—without requiring any annotations—while retaining substantially better predictive performance.
- On Evaluating LLM Alignment by Evaluating LLMs as Judges
This paper systematically investigates the consistency between LLMs' generation capability and evaluation capability (GE-consistency), finding a strong correlation between the two rankings under a strong preference oracle (Spearman \(\rho = 0.96\)). Based on this finding, the authors propose the AlignEval benchmark, which measures LLM alignment by assessing LLMs' ability as judges—without directly invoking LLM-as-Judge to evaluate model outputs—achieving performance comparable to or better than AlpacaEval and Arena-Hard.
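For readers unfamiliar with the rank correlation quoted in this note, the sketch below computes Spearman's \(\rho\) between two score lists, such as a generation-quality score and a judging-accuracy score per model. It uses the standard definition (Pearson correlation of average ranks); the function names and the example numbers are invented for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical scores for five models (not data from the paper).
generation_scores = [71.0, 64.5, 80.2, 55.1, 77.3]
judge_scores = [0.82, 0.79, 0.91, 0.70, 0.85]
rho = spearman_rho(generation_scores, judge_scores)
```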
- On the Entropy Calibration of Language Models
This paper systematically investigates the entropy calibration of language models — whether the entropy of generated text matches the log loss on human text — and finds that due to the power-law nature of data distributions (\(\alpha \approx 1\)), error accumulation improves extremely slowly with model scale (scaling exponent \(\approx -0.05\)). The paper further provides a theoretical proof that entropy can be calibrated in polynomial time without sacrificing diversity.
- OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling
This paper proposes OptiTree, which builds a modeling tree that organizes hierarchical classifications of operations research (OR) problems together with their modeling thoughts, and uses tree search to adaptively decompose complex problems into sequences of simpler subproblems. It achieves significant accuracy gains for LLMs on optimization modeling tasks (exceeding 10% on multiple challenging benchmarks).
- PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation
This paper presents PARROT, a practical and realistic benchmark for cross-system SQL translation (SQL-to-SQL), comprising 598 core translation pairs (expanded to 28,003 pairs) sourced from 38 open-source benchmarks and real-world business scenarios, covering 22 production-grade database systems. The benchmark reveals that the strongest current LLMs achieve an average accuracy below 38.53%.
- PaTH Attention: Position Encoding via Accumulating Householder Transformations
This paper proposes PaTH (Position encoding via accumulating Householder Transformations), a data-dependent multiplicative position encoding scheme that replaces RoPE's static rotation matrices with accumulated Householder transformations, achieving superior theoretical expressiveness and empirical language modeling performance over RoPE.
- PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations
PFΔ is the first power flow benchmark dataset to simultaneously encompass load, generation dispatch, and topology variations. It comprises 859,800 solved instances across six grid scales, includes close-to-infeasible extreme operating conditions, and introduces a standardized evaluation task suite for systematically assessing ML methods under diverse operating conditions.
- Predicting the Performance of Black-Box LLMs through Follow-Up Queries
This paper proposes QueRE, a method that poses approximately 50 follow-up questions to a black-box LLM (e.g., "Are you confident in your answer?") and uses the resulting "Yes" token probabilities as features to train a linear classifier. QueRE achieves strong performance on predicting model correctness, detecting adversarial manipulation, and distinguishing between different LLMs — surpassing even white-box methods that require access to internal model states.
- Risk Management for Mitigating Benchmark Failure Modes: BenchRisk
Grounded in the NIST Risk Management Framework, this paper systematically analyzes 26 mainstream LLM benchmarks, identifies 57 potential failure modes and 196 mitigation strategies, and proposes the BenchRisk meta-evaluation framework for quantifying the reliability risk of benchmarks.
- Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems
This work systematically evaluates three language models with fewer than 1.5B parameters (gemma3, llama3.2, qwen2.5) on compiler auto-parallelization tasks. Using six inference strategies across 11 real-world kernels, the approach achieves an average speedup of 6.81x and a peak speedup of 43.25x, demonstrating that small models can serve as powerful compiler optimization reasoning engines.
- The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses
This work systematically evaluates GPT-4o and Claude-3.7 on readability and empathy in medical diagnostic communication. Both models produce reading levels well above recommended standards (grades 9–13 vs. the recommended grades 6–8). Affective empathy varies significantly with diagnosis type and patient education level, and LLM-as-Judge exhibits severe self-serving bias (GPT inflates its own empathy scores by ~0.3 points).
- Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
This paper introduces DeepFund — the first live fund investment benchmark for LLMs — which employs a multi-agent architecture (Financial Planner + Analyst Team + Portfolio Manager) connected to real-time market data, eliminating the information leakage caused by LLM "time travel" in traditional backtesting. Over 24 trading days of live testing across 9 flagship LLMs, only Grok 3 achieves positive returns, revealing fundamental limitations of current LLMs in active fund management.
- Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs
This paper introduces EngDesign—the first LLM engineering design benchmark spanning 9 engineering domains (operating systems, computer architecture, control systems, mechanical engineering, structural engineering, digital hardware, analog circuits, robotics, and signal processing)—replacing conventional QA matching with a simulation-driven evaluation pipeline. The benchmark reveals that even the most capable reasoning model, o3, achieves only a 34% pass rate.
- Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally
This paper constructs WCB, the most comprehensive central bank monetary policy corpus to date (380,000+ sentences, 25 central banks, spanning 28 years), defines three NLP tasks (stance detection, temporal classification, uncertainty estimation), and through 15,075 benchmark experiments demonstrates that models trained on aggregated multi-bank data significantly outperform single-bank training, confirming the principle that "the whole is greater than the sum of its parts."
- Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
This paper identifies that post-training (SFT/RLHF/DPO) degrades the confidence calibration of pre-trained language models, and proposes DACA, a method that exploits the well-calibrated nature of pre-trained models by aligning confidence distributions exclusively on prediction-consistent samples, achieving label-free calibration of post-trained models with up to 15.08% ECE improvement.
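The ECE figure in this note refers to expected calibration error. As a reference point, a minimal sketch of the common equal-width-bin estimator is given below: predictions are grouped by confidence, and the gap between average confidence and accuracy in each bin is weighted by the bin's share of samples. The bin count of 10 and the function name are assumptions; the paper's exact evaluation setup may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted confidence in [0, 1] for each sample.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical over-confident model: 90% confidence but 75% accuracy.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```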