📊 LLM Evaluation
🧪 ICML2026 · 8 paper notes
📌 Same area in other venues: 💬 ACL2026 (42) · 📷 CVPR2026 (25) · 🔬 ICLR2026 (53) · 🤖 AAAI2026 (39) · 🧠 NeurIPS2025 (76) · 📹 ICCV2025 (27)
🔥 Top topics: LLM ×4 · Reasoning ×2
- CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers
This paper introduces CoCoReviewBench, which transforms human reviews of 3,900 ICLR/NeurIPS papers into a more reliable reference for evaluating AI reviews through a two-step process: (1) constructing sub-benchmarks by category, and (2) filtering erroneous opinions by arbitrating reviewer/author conflicts using meta-reviews. The study finds that current AI reviewers still lag behind humans in correctness and thoroughness, while reasoning models show greater potential.
- Hallucinations Undermine Trust; Metacognition is a Way Forward
This position paper argues that the goal of "completely eliminating LLM hallucinations" is fundamentally limited by a "discrimination gap," which in turn imposes a utility tax. The authors advocate shifting the goal from eliminating hallucinations to faithful uncertainty, and view such metacognition as an indispensable control layer when agentic LLMs invoke tools.
- Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction
This work proposes "black-box environment interaction" as a new paradigm for evaluating integrated reasoning (abduction + deduction + induction) in LLMs, constructing the ORACLE benchmark with 96 environments across 6 task types. Benchmarking 19 LLMs reveals that even the strongest model, o3, achieves only 70% accuracy in simple environments and drops to 40% in difficult ones. All evaluated LLMs lack the high-level planning ability needed for "adaptive optimization of exploration strategies based on feedback."
- iWorld-Bench: A Benchmark for Interactive World Models with a Unified Action Generation Framework
iWorld-Bench is the first unified evaluation benchmark specifically designed for "interactive world models." It proposes an Action Generation Framework that maps three types of action inputs (text, one-hot, and camera intrinsics/extrinsics) into a unified command space. From 330K videos, it curates 4.9K tasks and uses 9 metrics to compare 14 mainstream models.
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER formulates the problem of "deciding whether to invoke reasoning mode for each query in LLM-as-a-Judge" as a distributionally robust constrained optimization with a KL uncertainty set. Using a primal-dual algorithm, it solves for the optimal routing policy that still satisfies the cost budget under OOD conditions, and for the first time provides a linear convergence guarantee for LLM router policies.
- Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
RHB constructs a suite of realistic multi-step tool-use tasks (in both independent and chained modes, covering data pipeline, log forensics, performance optimization, and multi-file reconstruction) to quantify reward hacking behaviors in LLM agents. Evaluating 13 frontier models, the authors find that RL post-training significantly increases exploit rates (DeepSeek-V3 0.6% vs. R1-Zero 13.9%), that exploit rates rise with chain length, and that even models with near-zero rates "relapse" on harder variants. Lightweight environment hardening can reduce exploit rates by 87.7% without harming task success.
- Stop Automating Peer Review Without Rigorous Evaluation
This position paper empirically measures real ICLR 2026 reviews and 60 simulated reviews, identifying two major failures in current LLM reviewing: hivemind (high convergence) and paper laundering (zero-shot rewriting can increase scores by 0.45). The authors argue that "LLMs should not directly generate review reports without rigorous evaluation" and call for the establishment of a "science of review automation."
- Token-Efficient Change Detection in LLM APIs
The authors prove that under low-temperature sampling, special inputs where "two token logits are nearly tied" (Border Inputs) are extremely sensitive to parameter perturbations: in theory, the signal-to-noise ratio (SNR) diverges as \(T\to 0\). Thus, by observing only output tokens (strict black-box), LLM API change detection can be performed with very few queries. The proposed B3IT matches gray-box logprob methods on the TinyChange benchmark at 1/30 the cost, and in 23 days of continuous monitoring across 93 commercial endpoints, it detected 8 real model replacements.
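
The intuition behind Border Inputs in the last entry can be illustrated with a toy two-token softmax. This is a minimal sketch under stated assumptions, not the B3IT method: the function names `top_token_prob` and `sensitivity` are hypothetical, the model is reduced to two competing tokens, and the paper's exact SNR definition may differ. For a logit gap \(\delta\) and temperature \(T\), the probability of the first token is \(p = \sigma(\delta/T)\), so its derivative with respect to the gap is \(p(1-p)/T\). At a tie (\(\delta = 0\)) this equals \(1/(4T)\), which grows without bound as \(T\to 0\), while for a clear winner it vanishes.

```python
import math


def top_token_prob(gap: float, temperature: float) -> float:
    """P(token A) for a two-token softmax whose logits differ by `gap`."""
    x = gap / temperature
    # Numerically stable logistic function (avoids overflow in exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sensitivity(gap: float, temperature: float) -> float:
    """Derivative of P(token A) with respect to the logit gap: p * (1 - p) / T."""
    p = top_token_prob(gap, temperature)
    return p * (1.0 - p) / temperature


if __name__ == "__main__":
    # Toy values (assumptions): gap 0.0 is a border input, gap 2.0 a clear winner.
    for temperature in (1.0, 0.1, 0.01):
        border = sensitivity(0.0, temperature)
        typical = sensitivity(2.0, temperature)
        print(f"T={temperature:<5} border={border:10.3f} typical={typical:.3e}")
```

In this toy model, a small parameter change that nudges the gap of a border input visibly shifts the sampled token distribution at low temperature, whereas the same nudge on an input with a clear winner leaves the output essentially unchanged. That asymmetry is why a few output-only queries on border inputs can be informative.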