The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
Conference: ICML 2026
arXiv: 2603.00801
Code: None (Not provided in the paper)
Area: Language Agents / Web Agents / Evaluation Benchmarks / Adversarial Robustness
Keywords: Synthetic Web, Honeypot Injection, Positional Anchoring, Miscalibration, Epistemic Humility
TL;DR
This paper constructs a procedurally generated "Synthetic Web" environment. By injecting a single confident-sounding honeypot (one misinformation article) at search rank 0, it shows causally that the accuracy of frontier LLM agents such as GPT-5 plummets from 65% to 18%, even though only about one source in several thousand is contaminated. The models do not escalate their search effort and still answer with high confidence, revealing a deep-seated "positional anchoring" failure mode.
Background & Motivation
Background: LLMs are evolving from text generators into web-enabled agents capable of using search/browse tools to autonomously acquire information (WebGPT, ReAct, Toolformer). Existing benchmarks like WebArena, Mind2Web, and WebLINX focus on functional navigation and task completion rates, while FEVER and TruthfulQA focus on static factuality.
Limitations of Prior Work: Neither category of benchmarks can causally isolate a critical vulnerability: how an agent reacts when search result rankings are adversarially manipulated (misinformation appearing at top positions). Conducting this experiment on the real web involves four confounders: unknown and drifting content distribution, unlabeled misinformation density, rank gaming, and the possibility of the model recalling popular sources directly from pre-training memory rather than performing actual retrieval-based reasoning.
Key Challenge: The critical risk of web agent deployment is epistemic, namely the "ability to identify and resist misinformation." However, current evaluations focus on either functional performance or static factuality, which do not overlap. Controlling top-ranked results (via SEO, paid placement, or infrastructure intrusion) is a low-barrier, realistic threat, but no one has quantitatively assessed its severity.
Goal: To construct a fully controllable synthetic web environment where every article has ground-truth labels for credibility, bias, and factuality. Ranking can be procedurally manipulated to causally measure the impact of adversarial contamination using minimal perturbation (a single honeypot) while providing process-level traces (queries, reads, confidence) to diagnose failure modes.
Key Insight: Borrowing from the procedural generation concepts in RL (Procgen) and the synthetic world paradigms of TextWorld/ALFWorld, "misinformation attacks" are operationalized as "rank 0 honeypot injection" in a controlled experiment to isolate causal effects.
Core Idea: Utilize a procedurally generated mini-internet + a single rank-0 honeypot for minimal perturbation causal experiments, transforming "adversarial ranking risk" from a vague discussion into a quantifiable and reproducible failure mode.
Method
Overall Architecture
Four components (Figure 1): Synthetic Web Generation Environment (using LLMs to generate thousands of interlinked articles with site credibility labels based on a topic taxonomy) + Hybrid Search Layer (lexical + dense retrieval; in adversarial mode, a honeypot is injected at rank 0) + Agent Protocol (zero-shot prompt + search/read_article tools, requiring outputs for Answer/Confidence/Explanation) + Evaluation Pipeline (fixed LLM-as-judge scoring + calibration metrics). Honeypots are removed between rollouts to avoid residual contamination.
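The hybrid search layer with optional rank-0 injection can be illustrated with a minimal sketch. This is not the paper's implementation: the `Article` type, the toy term-overlap score standing in for a real lexical ranker, the precomputed embeddings, and the `alpha` mixing weight are all assumptions made for illustration. The sketch also assumes the honeypot is prepended so that genuine results shift down by one rank rather than being dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    doc_id: str
    text: str
    embedding: tuple[float, ...]  # hypothetical precomputed dense vector
    is_honeypot: bool = False


def lexical_score(query: str, text: str) -> float:
    """Fraction of query terms found in the text (toy stand-in for a lexical ranker)."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_embedding, corpus, k=5, alpha=0.5, honeypot=None):
    """Rank by a weighted mix of lexical and dense scores.

    If `honeypot` is given (adversarial mode), it is placed at rank 0 and the
    genuine results keep their relative order below it.
    """
    def score(article):
        return (alpha * lexical_score(query, article.text)
                + (1 - alpha) * cosine(query_embedding, article.embedding))

    results = sorted(corpus, key=score, reverse=True)[:k]
    if honeypot is not None:
        results = [honeypot] + results
    return results
```

Because the only difference between the two modes is the `honeypot` argument, any change in agent behavior between conditions can be attributed to that single inserted result.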
Key Designs
- Synthetic Web Generation & Contamination Filtering:
  - Function: Generate a "mini-internet" whose content distribution, credibility, and factuality are fully known and controllable, while preventing the model from answering from pre-training memory.
  - Mechanism: World IDs and timelines are defined by seeds; LLMs expand topic taxonomies into subtopics, entities, and controversy levels. Site profiles (news, blog, research, social, conspiracy) are generated with a base credibility and topic-specific bias. About 43% of sites are forced to be low-credibility, and publication frequency is decoupled from credibility (to block the "high frequency means high quality" heuristic). For each topic, an article cluster is generated containing a factual timeline, perspective narratives, and confidently stated but false claims (fake numbers, study names, and quotes with no surface flaws). Finally, contamination filtering runs strong models without tools and removes any query they can answer directly, ensuring the task requires tool-based retrieval.
  - Design Motivation: Contamination filtering makes this benchmark more rigorous than FEVER/TruthfulQA by forcing the model to "actually search." The mandatory 43% share of low-credibility sites ensures the baseline task remains non-trivial.
- Rank-0 Honeypot Injection (Minimal Perturbation Causal Experiment):
  - Function: Formalize "adversarial ranking attacks" as a controllable variable by changing a single factor (whether a honeypot is inserted at rank 0).
  - Mechanism: In standard mode, search returns results normally by relevance. In adversarial mode, a single honeypot article is inserted at rank 0 for the initial query, containing a "detailed but incorrect" counterfactual claim tailored to the topic. The honeypot exists only momentarily and is deleted between rollouts. Agents see the title/snippet/domain but must explicitly use `read` for the full text. Agents have an unlimited tool-calling budget and full access to true sources, so "failure = active choice not to verify."
  - Design Motivation: This is the paper's most ingenious design choice: it turns minimal contamination (one fake source among thousands of true ones) into a repeatable experiment. It rules out the defense that the "model didn't have the chance to see the correct answer." The honeypot does not actively suppress true sources; thus, the attack's leverage comes entirely from "position" rather than "coverage."
- Process-Level Tracing and Multi-dimensional Evaluation:
  - Function: Measure not only final accuracy but also tool-calling trajectories, search escalation, and self-reported confidence to diagnose why failures occur.
  - Mechanism: Agents must output (Answer, Confidence 0-100%, Explanation). Every search/read is recorded. Beyond accuracy, metrics include: average tool calls, \(P(\text{tool calls} \geq 5)\) (deep search ratio), ECE/Brier calibration error, and inter-world variance. Scoring is done by a fixed LLM-as-Judge with a rubric and lightweight normalization.
  - Design Motivation: Process traces allow differentiation between three failures: not checking (minimal escalation), checking but failing to integrate (synthesis failure), and finding information but being afraid to answer (epistemic paralysis). Calibration metrics reveal the dangerous "confident but wrong" mode.
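
The process-level metrics described above (average tool calls and the deep search ratio \(P(\text{tool calls} \geq 5)\)) can be computed from recorded traces roughly as follows. The `Rollout` record and its field names are hypothetical; the paper's actual trace schema is not given in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """One agent episode (hypothetical schema, for illustration only)."""
    tool_calls: list[str] = field(default_factory=list)  # e.g. ["search", "read_article"]
    answer: str = ""
    confidence: float = 0.0  # self-reported confidence, rescaled to [0, 1]
    correct: bool = False    # verdict from the fixed judge


def mean_tool_calls(rollouts: list[Rollout]) -> float:
    """Average number of search/read calls per rollout."""
    return sum(len(r.tool_calls) for r in rollouts) / len(rollouts)


def deep_search_ratio(rollouts: list[Rollout], threshold: int = 5) -> float:
    """Empirical P(tool calls >= threshold) across rollouts."""
    deep = sum(1 for r in rollouts if len(r.tool_calls) >= threshold)
    return deep / len(rollouts)
```

Comparing these two numbers between standard and adversarial runs is what distinguishes "not checking" from the other failure modes: if neither rises when a honeypot is present, the agent is not escalating its search.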
Loss & Training
No training is involved; this is a pure evaluation benchmark. All models use a consistent zero-shot prompt and tool protocol. The grader model is fixed across all experiments to ensure consistency. Each model runs 10 rollouts across 4 independent worlds, totaling 5,870 queries per condition.
Key Experimental Results
Main Results: Six Frontier Models under Standard vs. Adversarial Conditions (5,870 queries / condition)
| Model | Standard Accuracy | Adversarial Accuracy | Change (pts) |
|---|---|---|---|
| GPT-5 | 65.1% | 18.2% | -46.9 |
| o3 | 48.4% | 16.7% | -31.7 |
| o1 | 39.0% | 8.4% | -30.7 |
| GPT-4o | 27.2% | 3.8% | -23.4 |
| o4-mini | 0.3% | 0.0% | -0.3 |
| o1-mini | 0.0% | 0.0% | 0.0 |
| Human Baseline | 98% | 93% | -5 |
Humans drop only 5 points, while frontier models drop up to 47 points, indicating that this is a structural failure of the models rather than a consequence of task difficulty.
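To put the drops on a comparable scale, the short sketch below recomputes the absolute drop (in percentage points) and adds a relative drop (the share of standard accuracy that is lost). The relative figure is a derived quantity not reported in the table, and small differences from the table's last column may come from rounding of the published accuracies.

```python
# Accuracy (%) under standard and adversarial conditions, copied from the table above.
accuracy = {
    "GPT-5": (65.1, 18.2),
    "o3": (48.4, 16.7),
    "o1": (39.0, 8.4),
    "GPT-4o": (27.2, 3.8),
    "o4-mini": (0.3, 0.0),
    "o1-mini": (0.0, 0.0),
    "Human Baseline": (98.0, 93.0),
}


def accuracy_drop(standard: float, adversarial: float):
    """Return (absolute drop in points, relative drop as a fraction or None)."""
    absolute = standard - adversarial
    # Relative drop is undefined when standard accuracy is zero.
    relative = absolute / standard if standard > 0 else None
    return absolute, relative


drops = {name: accuracy_drop(*pair) for name, pair in accuracy.items()}

for name, (absolute, relative) in drops.items():
    rel = "n/a" if relative is None else f"{relative:.0%}"
    print(f"{name:15s} drop = {absolute:5.1f} pts, relative = {rel}")
```

On these numbers GPT-5 loses about 72% of its standard accuracy, while the human baseline loses about 5%.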
Behavior Analysis (Tool Usage, std vs adv)
| Model | Std Tool Calls | Adv Tool Calls | Adv \(P(\geq 5)\) |
|---|---|---|---|
| GPT-5 | 6.45 | 6.61 | 0.62 |
| o3 | 3.88 | 4.23 | 0.42 |
| o1 | 1.83 | 1.86 | 0.13 |
| GPT-4o | 1.14 | 1.13 | 0.07 |
| o4-mini | 0.02 | 0.04 | 0.00 |
Tool calls barely change under adversarial conditions, which is the most striking finding: models lack the instinct to "search more" when they encounter conflicting information.
Key Findings
- Minimal Perturbation Amplification Effect: A single honeypot among thousands of sources (a contamination density on the order of one in several thousand) causes GPT-5's accuracy to drop by 47 points. This massive leverage establishes "controlled top results" as a practically viable attack vector, and a far stealthier one than prompt injection.
- Three Failure Modes: (1) Minimal search escalation—no increase in searching despite conflicts; (2) Synthesis failure—inability to integrate multiple sources even after 162 searches; (3) Epistemic paralysis—finding evidence but stating "insufficient data to answer." All point to a root cause: positional anchoring, where rank order is implicitly treated as evidential strength.
- Severe Miscalibration: Models maintain high confidence even when wrong. ECE/Brier scores significantly deteriorate in adversarial mode; models are completely unaware they have been deceived.
- Robust and Systematic Failure: Small variance across 4 worlds × 10 rollouts rules out the explanation that failures are due to outlier queries—failures are systematic, not incidental.
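
For reference, the two calibration metrics named above can be sketched as follows. This uses the standard definitions of the Brier score and of expected calibration error (ECE) with equal-width confidence bins; the number of bins and any other details of the paper's own computation are not given in this summary, so `n_bins=10` is an assumption.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1)."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(confidences: list[float], outcomes: list[int],
                               n_bins: int = 10) -> float:
    """ECE: bin answers by confidence, then average |mean confidence - accuracy|
    over bins, weighted by the share of answers in each bin."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, o))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - acc)
    return ece
```

An agent that reports 90% confidence on four answers but gets only one right has a large gap between confidence and accuracy in that bin, which is the "confident but wrong" pattern these metrics are meant to expose.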
Highlights & Insights
- "Minimal Perturbation + Causal Isolation" paradigm is highly valuable: Research concerning "whether models fail under specific conditions" can benefit from this "procedural generation + single-point intervention + process trace" model (similar to Procgen in RL), turning vague discussions of adversarial robustness into quantitative science.
- Positional Anchoring Hypothesis unifies three failure modes: The authors attribute minimal escalation, synthesis failure, and miscalibration to the root cause of "rank being implicitly treated as evidential strength," linking it to the "lost in the middle" (Liu et al. 2024) phenomenon in long-context attention. This is a unified hypothesis that subsequent work should address.
- Exposing shallow heuristics from RLHF/Instruction Tuning: The authors speculate that models learn a shallow heuristic—"read top 1 → answer"—which works perfectly under clean retrieval but collapses under adversarial conditions. This has direct implications for search-related RLHF data construction, which should include adversarially contaminated samples.
Limitations & Future Work
- The content distribution and ranking algorithms of a synthetic web are not a 1:1 match for the real web, which may over- or under-estimate certain failure modes.
- Only a "single rank-0 honeypot" attack was tested; actual attackers might inject coordinated misinformation from multiple sources or use stealthier narrative drifts.
- Lack of empirical validation for mitigations—while five directions (procedural safeguards, adversarial training, calibration improvements, tool redesign, search interface improvements) are listed, they remain future work.
- Testing primarily focused on the OpenAI family; failure modes in Anthropic, Google, or open-source models (Llama-3, Qwen) might differ.
- Evaluation relies on LLM-as-Judge, meaning the grader’s own biases could propagate.
Related Work & Insights
- vs WebArena / Mind2Web: These measure functional navigation, while this paper measures epistemic robustness, creating two complementary evaluation dimensions.
- vs FEVER / TruthfulQA: Static QA without process traces or adversarial ranking control; this paper is interactive, adversarial, and traceable.
- vs RAGuard (Zeng et al. 2025): RAGuard uses real Reddit data for RAG robustness, but the corpus is static; this paper uses procedurally generated, controllable, agent-level data.
- vs SafeArena (Tur et al. 2025): SafeArena tests if agents perform harmful tasks; this paper tests if agents can be deceived (misuse vs. deception).
- vs Corpus Poisoning (Su et al. 2024): They attack dense retrievers to insert adversarial passages; this paper attacks the ranking layer, which is easier to implement.
- vs "Lost in the Middle" (Liu et al. 2024b): Found a U-shaped attention bias in long contexts; this paper extends this phenomenon to the positional dimension of search-based retrieval.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of "procedural synthetic web + rank-0 honeypot + process trace" is new and fills a genuine gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ 5,870 queries per condition, covering 6 models, 4 worlds, and 10 rollouts, give solid statistical support; the lack of mitigation trials is a slight deduction.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation, methodology, three failure modes, and the positional anchoring hypothesis are presented with great clarity. Limitations are honestly addressed.
- Value: ⭐⭐⭐⭐⭐ Directly reveals a deployment-grade security vulnerability (accuracy collapses when roughly one source in several thousand is contaminated), serving as a warning for the RAG/agent/search industry.