
📊 LLM Evaluation

🤖 AAAI2026 · 16 paper notes

🔥 Top topics: LLM ×4 · Reasoning ×2

BCWildfire: A Long-term Multi-factor Dataset and Deep Learning Benchmark for Boreal Wildfire Risk Prediction

This paper introduces BCWildfire, a multimodal wildfire risk prediction dataset covering 240 million hectares of British Columbia, Canada over a 25-year span, encompassing 38 driving factors. It conducts a systematic benchmark evaluation of time series forecasting models across four paradigms—CNN, Linear, Transformer, and Mamba—revealing the performance ceiling of current models and the key influential factors in wildfire prediction.

Benchmarking LLMs for Political Science: A United Nations Perspective

This paper presents UNBench, the first comprehensive LLM evaluation benchmark for political science grounded in UN Security Council records from 1994 to 2024. It encompasses four interrelated tasks—resolution drafting, voting simulation, adoption prediction, and representative statement generation—to systematically assess LLMs' ability to understand and simulate complex political dynamics.

Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

Drawing on Cognitive Load Theory (CLT) from psychology, this work decomposes the complexity of tool-use tasks into intrinsic load (structural complexity of the solution path) and extraneous load (ambiguity of problem formulation). It constructs ToolLoad-Bench, a benchmark with parametrically adjustable cognitive load, and employs an exponential decay model \(\text{Acc} \approx e^{-(k \cdot CL + b)}\) to precisely characterize the capability boundaries of different agents.
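The exponential decay model above can be fitted with ordinary least squares after taking logarithms, since \(-\ln(\text{Acc}) \approx k \cdot CL + b\). The following is a minimal sketch of that idea, not the paper's fitting procedure; the function names and the synthetic data are assumptions made for illustration.

```python
import math

def fit_exponential_decay(loads, accuracies):
    """Fit Acc ~ exp(-(k*CL + b)) by linear regression on -log(Acc).

    Hypothetical helper: returns the slope k and intercept b.
    """
    ys = [-math.log(a) for a in accuracies]  # -log(Acc) = k*CL + b
    n = len(loads)
    mean_x = sum(loads) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in loads)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(loads, ys))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b

def predict_accuracy(cognitive_load, k, b):
    """Accuracy predicted by the fitted decay model."""
    return math.exp(-(k * cognitive_load + b))

# Synthetic example (assumed values, not results from the paper).
loads = [1, 2, 3, 4, 5]
accs = [math.exp(-(0.3 * x + 0.1)) for x in loads]
k, b = fit_exponential_decay(loads, accs)
```

Under this reading, a larger fitted `k` means accuracy falls off faster as cognitive load grows, which gives one way to compare agents.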

ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions

This paper proposes ConInstruct, a benchmark for evaluating LLMs' ability to detect and resolve conflicting constraints in instructions. Results show that most proprietary models can detect conflicts reasonably well but rarely notify users explicitly, with DeepSeek-R1 and Claude-4.5-Sonnet achieving the best conflict detection performance (F1 of 91.5% and 87.3%, respectively).

DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning

This paper proposes DiCaP (Distribution-Calibrated Pseudo-labeling), which estimates the posterior correctness rate of pseudo-labels to calibrate their weights, introduces a dual-threshold mechanism to separate confident and ambiguous regions with differentiated strategies, and surpasses the state of the art by up to 4.27% in semi-supervised multi-label learning.

Do LLMs Really Struggle at NL-FOL Translation? Revealing Their Strengths via a Novel Benchmarking Strategy

This paper critically examines existing evaluation methodologies for natural language to first-order logic (FOL) translation — specifically FOLIO and MALLS — exposing fundamental flaws in their datasets and evaluation protocols. The authors propose a novel benchmarking strategy that decomposes the translation task into ontology extraction (OE) and logical translation (LT), augmented with "most similar selection" and "ranking" subtasks. Experiments demonstrate that conversational LLMs (o3-mini, GPT-4o-mini, Qwen3 series) exhibit strong NL-FOL translation capabilities and genuine logical semantic understanding, while embedding-based models perform significantly worse.

Gaming the Answer Matcher: Examining the Impact of Text Manipulation on Automated Judgment

This paper systematically evaluates three text manipulation strategies (verbosity, strategic multi-answer embedding, and correct-answer-first with contradictory suffix) against LLM-based answer-matching judges. The results show that these manipulations do not improve scores and often reduce them. Binary scoring proves more robust than continuous scoring, indicating that answer matching, as an evaluation method, is resistant to low-cost text manipulation.

LLM-as-a-Judge for Scalable Test Coverage Evaluation

This paper applies the LLM-as-Judge paradigm to Gherkin acceptance test coverage evaluation, systematically quantifying accuracy–reliability–cost trade-offs across 20 model configurations × 500 evaluations. It finds that GPT-4o Mini achieves the optimal production balance with a MAAE of 6.07, an ECR@1 of 96.6%, and a cost of $1.01 per 1K evaluations—approximately 1/78th the cost of GPT-5 at high reasoning effort.

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

This paper proposes PSN-IRT (Pseudo-Siamese Network for IRT), an enhanced Item Response Theory framework that jointly estimates LLM ability parameters and four-parameter item characteristics (difficulty / discrimination / guessing / feasibility). Applied to 41,871 items across 11 benchmarks, the framework reveals systemic issues including widespread saturation, insufficient difficulty ceilings, and data contamination. Item subsets selected by PSN-IRT achieve a ranking consistency of Kendall \(\tau = 1.00\).
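For orientation, the textbook four-parameter logistic (4PL) item response function can be written in a few lines. This is a sketch of the standard 4PL form only; it assumes that the summary's "feasibility" parameter plays the role of the upper asymptote, and it does not show how PSN-IRT estimates the parameters with its network.

```python
import math

def four_pl_probability(theta, difficulty, discrimination, guessing, feasibility):
    """Probability of a correct response under the standard 4PL model.

    theta          -- ability of the model being evaluated
    difficulty     -- ability level at the curve's midpoint
    discrimination -- slope of the curve around the midpoint
    guessing       -- lower asymptote (success rate of a very weak model)
    feasibility    -- upper asymptote (assumed mapping, see lead-in)
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (feasibility - guessing) * logistic

# When ability equals difficulty, the probability sits halfway between
# the two asymptotes.
p_mid = four_pl_probability(theta=0.0, difficulty=0.0, discrimination=1.5,
                            guessing=0.25, feasibility=0.95)
```

An item with low discrimination or with a feasibility well below 1 separates strong and weak models poorly, which is the kind of item-level signal a selected subset would try to avoid.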

Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning

This paper proposes LOREN, a curvature-aware zeroth-order optimization method that captures the anisotropic curvature of the loss landscape via a low-rank block-diagonal preconditioner, combined with REINFORCE Leave-One-Out (RLOO) variance reduction. LOREN achieves higher accuracy and faster convergence in LLM fine-tuning while reducing peak memory by up to 27.3% compared to MeZO-Adam.
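Zeroth-order methods estimate gradients from loss evaluations alone, without backpropagation. As a rough illustration, the sketch below shows the generic two-point (SPSA-style) estimator used by methods such as MeZO. It does not implement LOREN's low-rank preconditioner or its RLOO variance reduction; the function name, the explicit perturbation argument `z`, and the toy quadratic loss are assumptions for illustration.

```python
def two_point_gradient(loss, params, z, eps=1e-3):
    """Estimate the gradient of `loss` at `params` along perturbation `z`.

    Uses two forward evaluations only:
        g ~ (L(p + eps*z) - L(p - eps*z)) / (2*eps) * z
    """
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    directional = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [directional * zi for zi in z]

def quadratic(x):
    """Toy loss standing in for a model's training loss."""
    return sum(v * v for v in x)

estimate = two_point_gradient(quadratic, [1.0, -2.0], [1.0, 1.0])
```

In practice `z` is typically drawn at random (for example from a standard normal distribution), and the estimate is noisy, which is why curvature information and variance reduction are attractive additions.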

MCTS-SQL: Light-Weight LLMs can Master the Text-to-SQL through Monte Carlo Tree Search

This paper proposes MCTS-SQL, which enables lightweight LLMs (e.g., Qwen-1.5B) to achieve strong Text-to-SQL performance via Monte Carlo Tree Search. Its three-component architecture combines a Selector for schema pruning, a Direct Generator for initial SQL generation, and an MCTS-Refiner for iterative refinement, together with a prefix caching mechanism that reduces inference time by 53%. Qwen-1.5B achieves 40.69% execution accuracy on BIRD, surpassing ChatGPT-3.5.

MindVote: When AI Meets the Wild West of Social Media Opinion

This paper introduces MindVote, the first LLM opinion prediction benchmark grounded in real social media poll data. It comprises 3,918 naturally occurring polls across 23 topics, collected from Reddit and Weibo and enriched with platform- and topic-level context. An evaluation of 15 LLMs yields three findings: the best model (o3-medium) achieves a 1-Wasserstein score of only 0.892 versus an upper bound of 0.972; survey-specialized fine-tuned models underperform general-purpose models (the "survey specialization trap"); and models exhibit strong cultural alignment, with Western models excelling on Reddit and Chinese models on Weibo.

OptScale: Probabilistic Optimality for Inference-time Scaling

This paper proposes OptScale, a probabilistic optimality framework that models the probability distribution of verifier scores to derive a theoretical lower bound on the optimal number of samples. It uses this bound to dynamically determine the minimum number of samples required per problem, substantially reducing computational overhead while preserving inference accuracy.
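To make the idea of a sample-count lower bound concrete, consider a generic best-of-N argument: if each sample independently falls below a target verifier score with probability \(p\), then at least one of \(N\) samples reaches the target with probability \(1 - p^N\). The sketch below solves for the smallest such \(N\). It is an illustration under an i.i.d. assumption, not OptScale's actual bound; `min_samples` and its parameters are hypothetical names.

```python
import math

def min_samples(p_below, target_confidence):
    """Smallest N such that 1 - p_below**N >= target_confidence.

    p_below           -- probability that one sample scores below the target
    target_confidence -- required probability that the best of N reaches it
    """
    if not 0.0 < p_below < 1.0:
        raise ValueError("p_below must be strictly between 0 and 1")
    if not 0.0 < target_confidence < 1.0:
        raise ValueError("target_confidence must be strictly between 0 and 1")
    n = math.ceil(math.log(1.0 - target_confidence) / math.log(p_below))
    return max(1, n)

# An easy problem (p_below = 0.5) needs far fewer samples than a hard one.
n_easy = min_samples(0.5, 0.95)
n_hard = min_samples(0.9, 0.95)
```

Because `p_below` differs from problem to problem, the required `N` does too, which is the intuition behind allocating samples per problem rather than using one fixed budget.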

Test-time Diverse Reasoning by Riemannian Activation Steering

This paper proposes SPREAD, an unsupervised test-time activation steering framework that maximizes the total volume spanned by hidden activations across multiple reasoning paths by solving a Riemannian optimization problem on a product of spherical manifolds. SPREAD improves reasoning diversity and accuracy in Best-of-N sampling, outperforming temperature sampling baselines on mathematical reasoning benchmarks.
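The "volume spanned by hidden activations" can be illustrated with the Gram determinant: for unit-normalized vectors with Gram matrix \(G\), the volume of the parallelepiped they span is \(\sqrt{\det G}\). The sketch below computes the log-volume for a handful of vectors. It only illustrates this quantity and assumes nothing about SPREAD's objective details or its Riemannian solver; all function names are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gram_log_volume(vectors):
    """Return 0.5 * log det(G) for the Gram matrix G of unit-normalized vectors.

    Uses a Cholesky factorization G = L L^T, so the log-volume is the sum of
    log(L[i][i]). Returns -inf when the vectors are linearly dependent.
    """
    units = [normalize(v) for v in vectors]
    n = len(units)
    gram = [[sum(a * b for a, b in zip(units[i], units[j])) for j in range(n)]
            for i in range(n)]
    chol = [[0.0] * n for _ in range(n)]
    log_volume = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = gram[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 1e-12:  # degenerate: zero volume
                    return float("-inf")
                chol[i][j] = math.sqrt(s)
                log_volume += math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return log_volume

orthogonal = gram_log_volume([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
skewed = gram_log_volume([[1.0, 0.0], [1.0, 1.0]])
collapsed = gram_log_volume([[1.0, 0.0], [1.0, 0.0]])
```

Mutually orthogonal directions give the largest volume, while near-duplicate directions drive it toward zero, so maximizing this quantity pushes reasoning paths apart.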

Towards a Common Framework for Autoformalization

This paper systematically surveys existing work on autoformalization across mathematics, logical reasoning, planning, and knowledge representation, and proposes a unified cross-disciplinary definitional framework. Autoformalization is defined as the semantically equivalent transformation from informal language to formal reasoning languages, with the goal of facilitating methodology sharing across research communities and accelerating the development of next-generation AI reasoning systems.

Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

This paper proposes SNIC, a diagnostic testbed comprising 9,000 instances across 51 scenarios, designed to evaluate whether LLMs can leverage implicit social norms to resolve ambiguous referring expressions (e.g., "hand me the cup" when multiple cups are present). Results show that LLMs achieve an average accuracy of only 44% given scene descriptions alone; adding Prolog-based formal logic yields negligible improvement (44.2%), whereas explicitly providing a list of norms raises accuracy to 70.5% (GPT-4.1 reaches 99.6%). This indicates that LLMs lack implicit physical norm knowledge yet can effectively exploit norms that are stated explicitly.