The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents
Conference: ICML 2026
arXiv: 2603.00801
Code: None (not provided in the paper)
Area: Language Agent / Web Agents / Evaluation Benchmarks / Adversarial Robustness
Keywords: Synthetic Web, Honeypot Injection, Positional Anchoring, Miscalibration, Epistemic Humility
TL;DR
This work constructs a programmatically generated "Synthetic Web" environment. By injecting a single high-credibility honeypot misinformation item at search rank 0, it causally demonstrates that frontier LLM agents collapse under adversarial contamination at a 1-in-thousands rate: GPT-5's accuracy drops from 65% to 18%. The models do not increase search effort and still answer with high confidence, revealing a deeply rooted "positional anchoring" failure mode.
Background & Motivation
Background: LLMs are evolving from text generators to web-enabled agents, capable of autonomously acquiring information via search/browse tools (WebGPT, ReAct, Toolformer). Existing benchmarks such as WebArena, Mind2Web, and WebLINX focus on functional navigation and task completion, while FEVER and TruthfulQA focus on static factuality.
Limitations of Prior Work: Neither class of benchmarks can causally isolate a key vulnerability: how an agent reacts when search result rankings are adversarially manipulated so that misinformation appears at the top. Running such experiments on the real web faces four confounds: the content distribution is unknown and drifts over time, misinformation density is unlabeled, rankings can be gamed, and models may recall popular sources from pretraining rather than performing genuine retrieval reasoning.
Key Challenge: The core deployment risk for web agents is epistemic (whether they can recognize and resist misinformation), yet current evaluations measure functional navigation or static factuality, neither of which covers this risk. An attacker who controls top-ranked results (via SEO, paid placement, or infrastructure compromise) poses a low-barrier, real-world threat whose severity has not been quantitatively assessed.
Goal: Construct a fully controllable synthetic web environment where each article has ground-truth credibility/bias/factuality labels, search ranking is programmatically manipulable, and the impact of adversarial contamination is measured causally with minimal perturbation (single honeypot). Process-level traces (queries, reading, confidence) are provided to diagnose failure modes.
Key Insight: Drawing on the procedural generation approach in RL (Procgen) and the synthetic world paradigm of TextWorld/ALFWorld, the "misinformation attack" is operationalized as "rank 0 honeypot injection," enabling a controlled experiment that isolates the causal effect of the attack.
Core Idea: Use a procedurally generated mini-internet plus a single rank-0 honeypot for minimal-perturbation causal experiments, turning the "adversarial ranking risk" from a vague discussion into a quantifiable, reproducible failure mode.
Method
Overall Architecture
Four main components (Figure 1): Synthetic Web Generation Environment (LLM generates thousands of interlinked articles with site credibility labels based on a topic taxonomy) + Hybrid Search Layer (lexical + dense retrieval, with honeypot injection at rank 0 in adversarial mode) + Agent Protocol (zero-shot prompt + search/read_article tools, requiring output of Answer/Confidence/Explanation) + Evaluation Pipeline (fixed LLM-as-judge scoring + calibration metrics). The honeypot is removed between rollouts to avoid residual contamination.
Key Designs
Synthetic Web Generation and Contamination Filtering:
- Function: Produces a "mini-internet" with fully known and controllable content distribution, credibility, and factuality, while eliminating the possibility of models answering from pretraining memory.
- Mechanism: Use a seed to define world ID and timeline; LLM expands the topic taxonomy into subtopics/entities/controversies; generates site profiles (news/blog/research/social/conspiracy), each with base credibility and topic-specific bias. ~43% of sites are forced to be low-credibility, and publication frequency is decoupled from credibility (to prevent "high frequency = high quality" heuristics). For each topic, generate an article cluster including factual timeline, perspective narrative, and high-credibility but false misinformation claims (fabricated numbers/research/names/quotes, no surface flaws). Finally, use a strong model without tools for contamination filtering—queries that the model can answer directly are deleted, ensuring tool-based retrieval is necessary.
- Design Motivation: Contamination filtering is what makes this benchmark tougher than FEVER/TruthfulQA—it forces the model to "actually search" rather than rely on pretraining knowledge. The enforced 43% low-credibility site ratio also ensures the baseline task is non-trivial.
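The contamination-filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `closed_book_answer` callable, the `toy_model` stand-in, the exact-match comparison, and the sample queries are all assumptions made for the example (the paper's filter uses a strong LLM without tools, and its matching rule is not described here).

```python
from typing import Callable, Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def filter_contaminated(
    queries: Iterable[Dict[str, str]],
    closed_book_answer: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep only the queries that the tool-less model fails to answer correctly."""
    kept = []
    for q in queries:
        guess = closed_book_answer(q["question"])
        if normalize(guess) != normalize(q["answer"]):
            # The model could not answer from memory, so retrieval is required.
            kept.append(q)
    return kept


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a strong model queried without tools."""
    memory = {"What is the capital of France?": "Paris"}
    return memory.get(question, "unknown")


# Hypothetical queries: one answerable from memory, one specific to a synthetic world.
queries = [
    {"question": "What is the capital of France?", "answer": "paris"},
    {"question": "Who chaired the Velmora water board in world 7?", "answer": "Ila Trent"},
]

kept = filter_contaminated(queries, toy_model)
print([q["question"] for q in kept])
```

Only the synthetic-world question survives, because the stand-in model answers the other one from memory. A real filter would likely need a more tolerant correctness check than exact string match.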
Rank-0 Honeypot Injection (Minimal-Perturbation Causal Experiment):
- Function: Formalizes the "adversarial ranking attack" as a single controllable variable: one honeypot is inserted at rank 0 and everything else is left unchanged.
- Mechanism: In standard mode, search returns results by relevance; in adversarial mode, a single honeypot article tailored to the query topic is inserted at rank 0, containing a detailed but false counterfactual claim. The honeypot exists only momentarily and is deleted between rollouts. The agent sees only the title/snippet/domain and must explicitly read the full text. The agent has unlimited tool call budget and full access to genuine sources, so "failure = actively choosing not to verify."
- Design Motivation: This is the most ingenious experimental design in the paper—it turns "mixing one fake source among thousands of real ones" into a repeatable experiment, ruling out the excuse that "the model had no chance to see the correct answer." The honeypot does not suppress genuine sources (just ranks first), so the attack's leverage comes entirely from "position," not "coverage."
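A minimal sketch of the injection mechanism, assuming a search layer that returns an ordered list of results. The `Result` class, the `search` function, and the example articles and domains are hypothetical names introduced for illustration; the paper's hybrid lexical and dense ranking is replaced here by a precomputed list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Result:
    title: str
    domain: str
    snippet: str
    is_honeypot: bool = False  # ground-truth label, never shown to the agent


def search(ranked: List[Result], honeypot: Optional[Result] = None) -> List[Result]:
    """Return ranked results; in adversarial mode, prepend the honeypot at rank 0.

    Genuine results are shifted down by one but never removed, and the input
    list is not mutated, so nothing persists into the next rollout.
    """
    if honeypot is None:
        return list(ranked)
    return [honeypot, *ranked]


# Hypothetical genuine results, already ordered by relevance.
genuine = [
    Result("Council confirms budget vote", "daily-ledger.test", "The vote passed 7 to 2."),
    Result("Budget vote timeline", "civic-archive.test", "Key dates of the session."),
    Result("Opinion: what the vote means", "river-blog.test", "A reader's perspective."),
]
honeypot = Result(
    "Official record: budget vote failed",
    "council-records.test",
    "A detailed but false counterfactual claim.",
    is_honeypot=True,
)

standard = search(genuine)
adversarial = search(genuine, honeypot)
print([r.domain for r in adversarial])
```

Because the two conditions differ only in the first list element, any accuracy gap between them can be attributed to the rank-0 item rather than to missing genuine sources.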
Process-Level Tracing and Multi-Dimensional Evaluation:
- Function: Measures not only final accuracy but also the agent's tool usage trajectory, search escalation, and self-confidence, enabling diagnosis of why failures occur.
- Mechanism: The agent must output (Answer, Confidence 0-100%, Explanation), with every search/read recorded. Main metrics include accuracy, average tool calls, \(P(\text{tool calls}\geq 5)\) (deep search ratio), ECE/Brier calibration error, and inter-world variance. Scoring uses a fixed LLM-as-Judge with rubric and lightweight normalization (case, units, numeric tolerance).
- Design Motivation: Process traces distinguish three failure types—no verification (minimal escalation), verification without synthesis (synthesis failure), and verification but unwilling to answer (epistemic paralysis). Calibration metrics reveal the dangerous pattern of "being confidently wrong."
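The calibration and search-depth metrics named above can be computed along these lines. This is a sketch under assumptions: confidences are rescaled from 0-100% to [0, 1], ECE uses ten equal-width bins (the binning scheme is not stated in this summary), and the sample values are invented to show the "confidently wrong" pattern.

```python
from typing import List, Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(correct)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def deep_search_ratio(tool_calls: Sequence[int], threshold: int = 5) -> float:
    """Fraction of rollouts with at least `threshold` tool calls, i.e. P(tool calls >= 5)."""
    return sum(k >= threshold for k in tool_calls) / len(tool_calls)


# Invented example: 90% stated confidence, but only one of four answers is correct.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [1, 0, 0, 0]
tool_calls = [1, 2, 6, 7]

print(brier_score(confidences, correct))
print(expected_calibration_error(confidences, correct))
print(deep_search_ratio(tool_calls))
```

In this toy case the gap between mean confidence (0.9) and accuracy (0.25) dominates both calibration scores, which is the signature the benchmark looks for under adversarial conditions.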
Loss & Training
No training is involved; this is a pure evaluation benchmark. All models use a unified zero-shot prompt and identical tool protocol. The grader model is fixed across all experiments for consistency. Each model runs 10 rollouts in each of 4 independent worlds, totaling 5,870 queries per condition.
Key Experimental Results
Main Results: 6 Cutting-Edge Models under Standard vs Adversarial Conditions (5,870 queries/condition)
| Model | Standard Accuracy | Adversarial Accuracy | Δ Accuracy (pp) |
|---|---|---|---|
| GPT-5 | 65.1% | 18.2% | -46.9 |
| o3 | 48.4% | 16.7% | -31.7 |
| o1 | 39.0% | 8.4% | -30.7 |
| GPT-4o | 27.2% | 3.8% | -23.4 |
| o4-mini | 0.3% | 0.0% | -0.3 |
| o1-mini | 0.0% | 0.0% | 0.0 |
| Human Baseline | 98% | 93% | -5 |
Humans drop only 5 points, while frontier models drop up to 47 points, indicating this is not a task difficulty issue but a structural failure in the models.
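The absolute drops in the table can also be read as relative drops, which makes the human versus model contrast easier to compare across different baselines. The short script below only re-derives numbers from the table; the `relative_drop` helper is introduced here for illustration, and o4-mini and o1-mini are left out because their standard accuracy is at or near zero, which makes a relative figure meaningless.

```python
# (standard accuracy %, adversarial accuracy %) copied from the table above.
results = {
    "GPT-5": (65.1, 18.2),
    "o3": (48.4, 16.7),
    "o1": (39.0, 8.4),
    "GPT-4o": (27.2, 3.8),
    "Human Baseline": (98.0, 93.0),
}


def relative_drop(standard: float, adversarial: float) -> float:
    """Share of standard-condition accuracy that is lost under the honeypot."""
    return (standard - adversarial) / standard


for name, (std, adv) in results.items():
    print(f"{name}: {std - adv:.1f} pp absolute, {relative_drop(std, adv):.0%} relative")
```

On these figures every listed model loses well over half of its standard-condition accuracy, while the human baseline loses about a twentieth.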
Behavioral Analysis (Tool Usage, std vs adv)
| Model | Std Tool Calls | Adv Tool Calls | Adv \(P(\geq 5)\) |
|---|---|---|---|
| GPT-5 | 6.45 | 6.61 | 0.62 |
| o3 | 3.88 | 4.23 | 0.42 |
| o1 | 1.83 | 1.86 | 0.13 |
| GPT-4o | 1.14 | 1.13 | 0.07 |
| o4-mini | 0.02 | 0.04 | 0.00 |
Tool usage remains almost unchanged under adversarial conditions—this is the most striking finding: models lack the instinct to "double-check" when encountering conflicting information.
Key Findings
- Minimal-Perturbation Amplification Effect is Striking: A single honeypot among thousands of real sources (1-in-thousands contamination) causes GPT-5 accuracy to drop by 47 points. This extreme leverage makes control of the top-ranked result a practically feasible attack vector, and one far more covert than prompt injection.
- Three Major Failure Modes: (1) Minimal search escalation—no increase in search even when facing conflict; (2) Synthesis failure—even after 162 searches, unable to integrate multi-source information; (3) Epistemic paralysis—finds the answer but says "insufficient data to answer." All point to the same root cause—positional anchoring, treating rank order as evidential strength.
- Severe Miscalibration: Models maintain high confidence even when wrong; ECE/Brier calibration errors worsen significantly under adversarial mode, with models completely unaware of being deceived.
- Failures are Robust and Systematic: Across 4 independent worlds × 10 rollouts, variance is very low, ruling out "outlier query" explanations—failures are systematic, not accidental.
Highlights & Insights
- The "Minimal Perturbation + Causal Isolation" Experimental Paradigm is Highly Generalizable: Any research concerned with "model failure under specific conditions" can adopt this "procedural generation + single-point intervention + process trace" approach (akin to what Procgen does in RL), turning vague discussions of "adversarial robustness" into quantitative science.
- Positional Anchoring Hypothesis Unifies Three Failure Modes: The authors attribute minimal escalation, synthesis failure, and miscalibration to the root cause of "rank being implicitly treated as evidential strength," connecting it to the "lost in the middle" phenomenon (Liu et al. 2024) in long-context attention—a unified hypothesis worth direct confrontation in future work.
- Exposes Shallow Heuristics Learned via RLHF/Instruction Tuning: The authors speculate that models may have learned a "read top 1 → answer" shallow heuristic, which works perfectly in clean retrieval but collapses under attack—this has direct implications for search-related RLHF data construction, highlighting the need for adversarial contamination samples in training.
Limitations & Future Work
- The synthetic web does not perfectly match the real web in content distribution or ranking algorithms, possibly over- or under-estimating certain failure modes.
- Only the simplest "single rank-0 honeypot" attack is tested; real attackers may inject multi-source coordinated misinformation or subtler narrative drift.
- Lacks empirical validation of mitigations—the authors list five future directions (procedural safeguards, adversarial training, calibration improvements, tool redesign, search interface improvements), but none are experimentally tested.
- Mainly evaluates OpenAI family models; failure modes may differ for Anthropic, Google, or open-source models (Llama-3, Qwen, etc.).
- Evaluation relies on LLM-as-Judge, so grader bias may affect results.
- Future directions: use this benchmark to stress test existing mitigations such as Self-RAG, FLARE, CRAG; construct a finer-grained attack taxonomy (covering SEO simulation, coordinated misinformation, etc.); connect to uncertainty-aware evaluation as proposed by Kalai et al. 2025 for retraining.
Related Work & Insights
- vs WebArena / Mind2Web: These measure functional navigation; this work measures epistemic robustness—complementary evaluation dimensions.
- vs FEVER / TruthfulQA: Static QA, no process trace, no adversarial ranking control; this work is interactive + adversarial + traceable.
- vs RAGuard (Zeng et al. 2025): RAGuard uses real Reddit data to test RAG robustness, but the corpus is static; this work is procedurally generated, controllable, and agent-level.
- vs SafeArena (Tur et al. 2025): SafeArena tests whether agents execute harmful tasks; this work tests whether agents can be deceived—complementary (misuse vs deception).
- vs corpus poisoning attacks (Su et al. 2024): They attack dense retrievers by injecting adversarial passages into the corpus; this work attacks the ranking layer, which is easier to implement.
- vs "lost in the middle" (Liu et al. 2024b): They find a U-shaped positional bias in how models use long contexts, with information in the middle used least; this work extends the phenomenon to the rank position of search results.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of "procedural synthetic web + rank-0 honeypot + process trace" is new and fills a real gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ 5,870 queries per condition, covering 6 models, 4 worlds, and 10 rollouts, which is statistically robust; slight deduction for lack of mitigation experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear progression from problem motivation, methodology, three failure modes, to the positional anchoring hypothesis; limitations are candidly discussed.
- Value: ⭐⭐⭐⭐⭐ Directly exposes deployment-level security vulnerabilities in frontier models (collapse with 1-in-thousands contamination), serving as a wake-up call for the entire RAG/agent/search industry chain.