
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Conference: ICLR 2026 arXiv: 2602.23729 Code: To be released Area: AI Safety / Evaluation Methodology Keywords: dynamic benchmark, text anomaly detection, agent-centric evaluation, LLM reasoning, teacher-student

TL;DR

This paper proposes ATAD (Agent-Centric Text Anomaly Detection), which replaces static benchmarks with a Teacher-Orchestrator-Student three-agent competition and validation loop. Using text anomaly detection as the task format, ATAD delivers self-calibrating, dynamically evolving evaluation of LLM reasoning: no evaluated LLM exceeds 59% average accuracy, and most cluster at 54–59% (far below the 90%+ typical of static benchmarks), effectively exposing reasoning weaknesses.

Background & Motivation

Background: Static benchmarks such as MMLU, GSM8K, and Big-Bench have long served as reliable indicators of model progress, but frontier LLMs have approached or surpassed human-level performance on most tasks.

Three Fatal Problems with Static Benchmarks:

  • Data Contamination: Large-scale pretraining corpora frequently contain benchmark questions; incomplete decontamination allows models to memorize answers rather than perform genuine reasoning.
  • Overfitting Loop: Model developers may inadvertently tune toward benchmark-specific features, creating a feedback loop that inflates scores.
  • Rapid Obsolescence: Once a benchmark is "solved," the community must quickly produce replacements, forming a wasteful cycle.

Key Challenge: Evaluation must evolve dynamically to keep pace with model progress, yet constructing high-quality items is inherently difficult — increasing difficulty typically sacrifices clarity, while preserving clarity tends to yield overly simple items.

Why Text Anomaly Detection: the task (a) requires cross-sentence logical reasoning; (b) resists pattern-matching shortcuts and training-data leakage; and (c) supports objective, fine-grained scoring.

Core Idea: A three-agent competition and validation loop automatically generates reasoning evaluation items calibrated to difficulty, enabling the benchmark to co-evolve with model progress.

Method

Overall Architecture

The protocol comprises three phases:

  1. Initialization (Base Problem Generation): The Teacher generates baseline-difficulty items → the Orchestrator validates them across multiple criteria (format correctness, clarity, logical consistency, fairness) → failed items trigger regeneration until passing or reaching the max_init_loops limit.
  2. Adaptive Difficulty Scaling: The Student attempts to solve an item → an incorrect answer causes the item to be collected into the benchmark → a correct answer prompts the Orchestrator to request a harder variant from the Teacher → the new item is re-validated → the loop continues until the Student fails or max_student_loops is reached.
  3. Evaluation Phase: The finalized benchmark items are used to evaluate any target LLM.
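Below is a minimal Python sketch of this loop, assuming hypothetical agent interfaces (`teacher.generate`, `orchestrator.validate`, `student.solve`) and illustrative loop caps; the paper's actual prompts, interfaces, and parameter values may differ.

```python
# Minimal sketch of the ATAD three-phase protocol (hypothetical interfaces;
# not the paper's implementation).
from dataclasses import dataclass

MAX_INIT_LOOPS = 3      # illustrative value for max_init_loops
MAX_STUDENT_LOOPS = 5   # illustrative value for max_student_loops

@dataclass
class Verdict:
    passed: bool
    feedback: str

def generate_benchmark_item(teacher, orchestrator, student, task_type):
    """Run Phases 1-2 for a single item; Phase 3 (evaluation) happens later."""
    # Phase 1: Initialization - generate a baseline item and regenerate
    # until it passes Orchestrator validation or max_init_loops is reached.
    item = teacher.generate(task_type, difficulty="baseline")
    for _ in range(MAX_INIT_LOOPS):
        verdict: Verdict = orchestrator.validate(item)
        if verdict.passed:
            break
        item = teacher.regenerate(item, feedback=verdict.feedback)
    else:
        return None  # no valid base item could be produced

    # Phase 2: Adaptive difficulty scaling - escalate until the Student fails.
    for _ in range(MAX_STUDENT_LOOPS):
        answer = student.solve(item)
        if answer != item.label:
            return item  # failure-driven collection: keep items the Student misses
        harder = teacher.harden(item, student_answer=answer)
        verdict = orchestrator.validate(harder)
        if not verdict.passed:
            # Orchestrator may request refinement at the same difficulty level.
            harder = teacher.regenerate(harder, feedback=verdict.feedback)
        item = harder
    return None  # Student never failed within the cap (handling here is an assumption)
```

The sketch makes the failure-driven collection rule explicit: an item enters the benchmark only at the first Student mistake, so every collected item sits just beyond the Student's current capability.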

Key Designs

  1. Teacher-Student Competition Mechanism:
     • The Teacher is implicitly incentivized to analyze the Student's success and failure patterns.
     • Generated hard items target the Student's specific weaknesses rather than increasing difficulty randomly.
     • The competitive loop continuously drives deeper benchmark refinement.

  2. Orchestrator Quality Gating (see the sketch after this list):
     • Validation dimensions: format correctness, clarity, logical consistency, task-type alignment, difficulty appropriateness, and fairness.
     • Prevents adversarial or unsolvable items from entering the benchmark.
     • Autonomously decides whether the Teacher must regenerate; there is no fixed iteration schedule.
     • If a harder variant fails validation, the Orchestrator may instruct the Teacher to refine at the same difficulty level, preserving task structure.

  3. Failure-Driven Sample Collection: Items are finalized into the benchmark only when the Student answers incorrectly, ensuring the benchmark always operates at the boundary of model capability.

  4. Cross-Agent Instantiation: Different model pairings are supported (e.g., \(\text{ATAD}_{\text{gemini}}^{\text{gpt-4o}}\)), enabling cross-model comparison and tracking of model evolution.
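To make the quality gate concrete, here is a minimal sketch of how the six validation dimensions could be checked and aggregated. The prompt wording, the `complete()` call, and the pass/fail aggregation are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of Orchestrator quality gating over the six validation
# dimensions; prompts and data shapes are assumptions, not the paper's code.
from dataclasses import dataclass

VALIDATION_DIMENSIONS = [
    "format correctness",
    "clarity",
    "logical consistency",
    "task-type alignment",
    "difficulty appropriateness",
    "fairness",
]

@dataclass
class GateResult:
    passed: bool
    feedback: dict  # per-dimension verdicts, reusable as Teacher regeneration feedback

def quality_gate(orchestrator_llm, item_text: str, task_type: str) -> GateResult:
    """Ask the Orchestrator LLM to judge a candidate item on every dimension."""
    feedback = {}
    for dim in VALIDATION_DIMENSIONS:
        prompt = (
            f"Task type: {task_type}\n"
            f"Candidate item:\n{item_text}\n\n"
            f"Judge the item on '{dim}'. Reply 'PASS' or 'FAIL: <reason>'."
        )
        feedback[dim] = orchestrator_llm.complete(prompt)  # hypothetical LLM call
    passed = all(reply.strip().startswith("PASS") for reply in feedback.values())
    return GateResult(passed=passed, feedback=feedback)
```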

Task Taxonomy: 7 Text Anomaly Types

| Task | Full Name | Reasoning Ability Tested | Challenge Factor |
| --- | --- | --- | --- |
| T1 | Contextual Anomaly | Contextual reasoning | Subtle topic shifts, semantic deviation (grammatically correct but thematically inconsistent) |
| T2 | Paragraph Order Consistency | Discourse coherence | Locally coherent but globally disordered structure |
| T3 | Fill-in-the-Blank Selection Anomaly | Lexical + pragmatic reasoning | Grammatically correct but contextually inappropriate |
| T4 | Bridging Sentence Evaluation | Logical transition | Weak logical connections, abrupt topic shifts |
| T5 | Referential Ambiguity | Coreference resolution | Ambiguous pronouns, unclear referents |
| T6 | Logical Contradiction | Causal / contradiction reasoning | Contradictory statements, causal reversals |
| T7 | Style Violation | Stylistic reasoning | Register mixing, abrupt tone shifts |

Six academic domains are covered: science, philosophy, politics/society, psychology, economics, and literature.
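For illustration, a hypothetical T1 (Contextual Anomaly) item might look like the following; the field names and passage are invented here and are not drawn from the ATAD benchmark.

```python
# Hypothetical T1 (Contextual Anomaly) item, invented to illustrate the format;
# not an actual ATAD benchmark item.
example_item = {
    "task_type": "T1_contextual_anomaly",
    "domain": "economics",
    "sentences": [
        "Central banks raise interest rates to cool inflation.",
        "Higher rates make borrowing more expensive for firms and households.",
        "Coral reefs bleach when ocean temperatures rise too quickly.",  # off-topic
        "As credit tightens, business investment typically slows.",
    ],
    "question": "Which sentence is grammatically correct but thematically inconsistent?",
    "label": 2,  # zero-based index of the anomalous sentence
}
```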

Key Experimental Results

Main Results: 10 LLMs on ATAD (Accuracy %; 7 of the 10 evaluated models shown below)

| Model | T1 | T2 | T3 | T4 | T5 | T6 | T7 | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-o4-mini | 63.3 | 30.3 | 68.5 | 53.0 | 47.3 | 57.3 | 80.0 | 57.1 |
| Gemini-2.0-Flash | 65.3 | 25.0 | 63.0 | 58.3 | 51.0 | 62.0 | 88.0 | 58.9 |
| Gemini-2.0-Flash-Lite | 64.0 | 10.8 | 63.5 | 52.3 | 62.8 | 62.0 | 86.3 | 57.4 |
| GPT-4o | 62.0 | 21.3 | 68.3 | 53.3 | 49.3 | 56.8 | 81.0 | 56.0 |
| GPT-4o-mini | 57.3 | 17.0 | 62.5 | 54.0 | 52.5 | 58.8 | 83.0 | 55.0 |
| GPT-3.5-Turbo | 59.0 | 16.0 | 66.8 | 48.5 | 55.8 | 51.8 | 81.5 | 54.2 |
| Gemini-1.5-Flash | 6.0 | 11.3 | 62.0 | 48.8 | 17.5 | 10.8 | 21.0 | 25.3 |

Ablation Study: Effectiveness of Difficulty Scaling

| Comparison Dimension | Initial Items | Orchestrator-Finalized Items | Change |
| --- | --- | --- | --- |
| Average Student accuracy | Higher | Significantly lower | Difficulty effectively increased |
| Clarity validation pass rate | High | Remains high | Clarity not sacrificed |
| Cross-model discriminability | Low | High | Better differentiation of model capability |

Key Findings

  • No evaluated LLM exceeds 59% average accuracy on ATAD, and most fall in the 54–59% range, far below the 90%+ typical of static benchmarks such as MMLU, demonstrating that ATAD effectively exposes reasoning weaknesses.
  • T2 (paragraph ordering) is the hardest (10–30%), requiring global discourse understanding; T7 (style violation) is the easiest (80–88%), as its patterns are more salient.
  • Gemini-1.5-Flash performs anomalously poorly on several tasks (T1: 6%, T5: 17.5%, T6: 10.8%), revealing severe reasoning deficiencies.
  • Cross-model pairings reveal complementary relationships: items generated by a given Teacher model tend to be more discriminative for specific evaluated models.
  • Reasoning-specialized models (GPT-o4-mini) show a relative advantage primarily on T2 (paragraph ordering), with limited gains on other tasks.

Highlights & Insights

  • Co-evolution of Benchmark and Model: As stronger models are introduced as Teacher/Student/Orchestrator, the benchmark upgrades automatically — addressing the fundamental problem of benchmarks being "solved."
  • Resolving the Clarity–Difficulty Trade-off: Orchestrator validation ensures that items remain clear and unambiguous even as difficulty increases, inspired by the design principles of standardized tests such as GRE, GMAT, and LSAT.
  • Failure-Driven Benchmark Construction: Items are collected only when the Student fails, ensuring the benchmark always resides at the boundary of model capability.
  • Instance-Level Difficulty Localization: ATAD adjusts difficulty at the instance level rather than globally, precisely probing a model's specific reasoning weaknesses.

Limitations & Future Work

  • The Orchestrator is itself an LLM, so validation quality is bounded by its reasoning capability — a weaker Orchestrator may admit flawed items.
  • The framework focuses exclusively on text anomaly detection; extension to mathematics, code, or multimodal domains requires entirely new task designs.
  • Generation cost is high: each item requires multiple LLM calls (Teacher generation + Orchestrator validation + Student solving, potentially across multiple loops).
  • Leaderboard comparability: benchmarks generated under different agent configurations are not identical, and cross-configuration comparisons require careful interpretation.

Comparison with Related Work

  • vs. MMLU/GSM8K: These benchmarks are static, while ATAD is dynamic; its adaptive difficulty prevents it from being "solved."
  • vs. DynaBench: DynaBench also employs adversarial generation of difficult samples, but with humans in the loop; ATAD is fully automated, replacing human annotation with three agents.
  • vs. C3LLM: C3LLM statistically certifies safety risks, while ATAD dynamically generates reasoning evaluations; both transcend the limitations of fixed benchmarks, but in different directions.

Rating

  • Novelty: ⭐⭐⭐⭐ The three-agent dynamic benchmark paradigm is novel; the Teacher-Orchestrator-Student design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 10 LLMs × 4 agent configurations × 7 task types with broad scope.
  • Writing Quality: ⭐⭐⭐⭐ Framework description is clear; protocol design is convincing.
  • Value: ⭐⭐⭐⭐ Proposes a new direction for sustainable LLM evaluation, particularly important given score saturation on MMLU.