When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Conference: AAAI 2026 · arXiv: 2510.00332 · Code: GitHub · Area: Robotics · Keywords: AI Agent Evaluation, Adversarial Benchmark, Cryptocurrency, Tool Selection, Hallucination

TL;DR

This paper introduces CAIA, a benchmark that uses cryptocurrency markets as a natural adversarial laboratory to evaluate 17 state-of-the-art LLMs on agent capabilities in high-stakes adversarial environments. Even the strongest frontier model (GPT-5) reaches only 67.4% accuracy against a human baseline of 80%, and the results expose systematic tool-selection failures.

Background & Motivation

Blind Spots in AI Benchmarking: Existing AI evaluations (GLUE, ImageNet, ICPC, IMO) measure capabilities in closed-world settings, assuming tools behave as expected, information is trustworthy, and other agents are cooperative. Real-world autonomous deployment, however, demands adversarial robustness—the ability to operate in open systems rife with uncertainty, misinformation, and adversarial incentives.

Benchmark Scores ≠ Deployment Readiness: Agents that score highly on reasoning benchmarks may still fall for fabricated news, purchase manipulated assets, or be deceived by social engineering—because their evaluations never included deceptive scenarios. As AI agents increasingly interact with untrusted users, real funds, and critical infrastructure, this gap poses a significant safety risk.

Why Cryptocurrency Markets:

Genuinely Adversarial Environment: Anonymous blockchains free malicious actors from reputational consequences; profit motives drive sophisticated attack strategies; regulatory vacuums permit deceptive practices illegal in traditional markets. Everyday attacks include honeypot contracts, flash loan exploits, and coordinated social engineering.

High-Stakes and Irreversible: No traditional financial safety nets exist; transactions are irreversible; smart contract execution is final. Losses exceeded $30 billion in 2024.

Verifiable Ground Truth: Despite adversarial chaos, blockchains provide complete transparency and immutability—every transaction is permanently recorded and cryptographically verifiable.

Method

Overall Architecture

The CAIA benchmark follows an "adversarial-first" design principle, comprising 178 temporally anchored tasks that evaluate agent knowledge, planning, and execution capabilities in cryptocurrency analysis.
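
To make "temporally anchored" concrete, here is a sketch of what a task record could look like. The field names and values are illustrative assumptions, not the paper's released schema:

```python
from dataclasses import dataclass, field

@dataclass
class CAIATask:
    """Illustrative record; CAIA's actual released schema may differ."""
    task_id: int
    category: str              # one of the six analytical categories below
    question: str              # the analytical query posed to the agent
    anchor_block: int          # block number pinning the on-chain state
    anchor_timestamp: str      # timestamp corresponding to the anchor block
    ground_truth: str          # answer verified via a reproducible tool chain
    tool_chain: list[str] = field(default_factory=list)

# Hypothetical instance; all values are invented for illustration
task = CAIATask(
    task_id=1,
    category="On-Chain Analysis",
    question="What share of this pool's volume at the anchor block was MEV-driven?",
    anchor_block=19_500_000,
    anchor_timestamp="2024-03-23T00:00:00Z",
    ground_truth="<verified value>",
    tool_chain=["etherscan", "defillama"],
)
```

Anchoring every task to a block number and timestamp fixes the world state, which is what makes the ground truth objectively checkable despite markets moving continuously.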

Key Designs

1. Three Capability Pillars

  • Knowledge: Assesses foundational understanding of crypto-native concepts (AMM mechanisms, governance structures, etc.), testing conceptual mastery rather than definitional recall.
  • Planning: Evaluates the ability to decompose complex problems into executable analytical workflows, requiring agents to specify tool selection and sequencing prior to execution.
  • Execution: Tests real-world execution using production-grade APIs (Etherscan, CoinGecko, DefiLlama), assessing both technical capability and judgment.

2. Five-Stage Data Curation Pipeline

  1. Automated Filtering: LLM reviewers filter irrelevant, ambiguous, or unanswerable queries, retaining the top 15% (~1,000 tasks).
  2. Expert Review: each task is scored by at least 4 of 92 domain experts; extreme scores are dropped and the 200 highest-averaging tasks advance (a selection sketch follows below).
  3. Format Standardization: Uniform formatting with block number/timestamp anchoring to ensure objective evaluation.
  4. Ground Truth Verification: Reproducible tool-chain solutions are validated; tasks that cannot be reproduced are removed → final set of 178 tasks.
  5. Categorization: Six analytical categories for diagnostic purposes.

Source: Over 10,000 real-world queries from 3,000+ active users.
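
As referenced in stage 2 above, here is a minimal sketch of the expert-review selection step, assuming simple min/max trimming and mean ranking; the paper's exact aggregation rule is not given in this summary:

```python
import statistics

def select_top_tasks(expert_scores: dict[int, list[float]], top_n: int = 200) -> list[int]:
    """Rank tasks by trimmed-mean expert score and keep the top_n.

    expert_scores maps task_id -> scores from at least 4 domain experts.
    The trimming rule (drop the single lowest and highest score) is an
    assumption for illustration."""
    averages = {
        task_id: statistics.mean(sorted(scores)[1:-1])
        for task_id, scores in expert_scores.items()
    }
    return sorted(averages, key=averages.get, reverse=True)[:top_n]

# Example: three tasks, four reviewers each
scores = {1: [7, 8, 9, 3], 2: [9, 9, 8, 9], 3: [5, 6, 6, 7]}
print(select_top_tasks(scores, top_n=2))  # [2, 1]
```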

3. Task Category Distribution

| Category | Count | Proportion | Focus |
| --- | --- | --- | --- |
| On-Chain Analysis | 77 | 43.3% | Transaction patterns, MEV, fund flows |
| Project Discovery | 49 | 27.5% | Protocol evaluation, security analysis |
| Token Economics | 23 | 12.9% | Incentive design, value accrual |
| Cross-Domain | 14 | 7.9% | Multi-domain synthesis |
| Trend Analysis | 8 | 4.5% | Temporal patterns, adoption metrics |
| General Knowledge | 7 | 3.9% | Foundational concepts |

Experimental Setup

  • Evaluated Models: 17 SOTA models (GPT-5, GPT-4.1, Claude Opus 4, Gemini 2.5 Pro, Grok 4, DeepSeek R1/V3.1, Llama 4, Qwen 3, etc.)
  • Two Conditions: Without tools (closed-book) vs. with tools (open-book, 23 specialized tools)
  • Agent Framework: Standard ReAct framework, ensuring evaluation is not confounded by implementation differences
  • Human Baseline: 16 participants from university blockchain clubs and early-stage companies (junior analysts) completing a 10% stratified sample; average accuracy 80%
  • Metrics: Majority-vote accuracy over 5 independent runs + Pass@1/Pass@5 + cost efficiency (a metric sketch follows below)
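
For reference, these metrics can be computed as in the following sketch; the Pass@k formula is the standard unbiased estimator of Chen et al. (2021), which may differ in detail from the paper's implementation:

```python
from collections import Counter
from math import comb

def majority_vote_correct(answers: list[str], truth: str) -> bool:
    """True if the most common answer across independent runs matches truth."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == truth

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k (Chen et al., 2021): probability that at least one
    of k samples, drawn from n runs of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task answered correctly in 2 of 5 runs
print(pass_at_k(5, 2, 1))  # 0.4
print(pass_at_k(5, 2, 5))  # 1.0
```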

Key Experimental Results

Main Results

Without Tools (catastrophic failure across all models):

| Model | Majority-Vote Accuracy | Pass@1 | Pass@5 |
| --- | --- | --- | --- |
| GPT-5 | 27.5% | 28.1% | 42.7% |
| Gemini 2.5 Pro | 22.5% | 20.2% | 29.8% |
| o3 | 20.8% | 22.5% | 29.2% |
| Claude Opus 4 | 13.5% | 13.5% | 16.9% |
| DeepSeek R1 | 20.8% | 21.9% | 35.4% |

With Tools:

| Model | Majority-Vote Accuracy | Pass@1 | Pass@5 | Avg. Cost per Run |
| --- | --- | --- | --- | --- |
| GPT-5 | 67.4% | 70.2% | 77.0% | $0.1154 |
| GPT-OSS 120B | 62.9% | 56.2% | 72.5% | $0.0066 |
| Grok 4 Fast | 61.2% | 57.9% | 71.9% | $0.0098 |
| Claude Sonnet 4 | 56.7% | 57.9% | 66.9% | $0.2291 |
| Claude Opus 4 | 57.3% | 59.6% | 71.9% | $1.1139 |
| Human Baseline | ~80% | – | – | – |

Ablation Study

Tool Usage Distribution (Tool Selection Catastrophe):

| Tool Category | Call Count | Proportion | Notes |
| --- | --- | --- | --- |
| Google Search | 11,626 | 49.6% | Models prefer unreliable sources |
| Specialized Blockchain Tools | 8,351 | 35.6% | Authoritative sources containing the correct answers |
| URL Fetch | 1,743 | 7.4% | – |
| Twitter Search | 1,388 | 5.9% | Socially manipulated information |
| Code Execution | 355 | 1.5% | – |

Cost–Accuracy Trade-off:

| Model | Accuracy | Cost per Run | Cost Efficiency |
| --- | --- | --- | --- |
| Claude Opus 4 | 57.3% | $1.1139 | Worst |
| GPT-OSS 120B | 62.9% | $0.0066 | 100× better value |
| Grok 4 Fast | 61.2% | $0.0098 | Pareto-optimal |
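
To make "Pareto-optimal" precise: a model sits on the cost–accuracy frontier if no other model is at least as accurate at no greater cost. A small sketch over the table's numbers (accuracy and cost alone; the paper may weigh additional criteria, so frontier membership can differ over the full 17-model set):

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models for which no other model is at least as accurate
    at no greater cost. Values are (accuracy_percent, cost_per_run_usd)."""
    frontier = []
    for name, (acc, cost) in models.items():
        dominated = any(
            a >= acc and c <= cost and (a, c) != (acc, cost)
            for other, (a, c) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Numbers taken from the tables above (an illustrative subset, not all 17 models)
models = {
    "Claude Opus 4": (57.3, 1.1139),
    "GPT-OSS 120B": (62.9, 0.0066),
    "Grok 4 Fast": (61.2, 0.0098),
    "GPT-5": (67.4, 0.1154),
}
print(pareto_frontier(models))  # ['GPT-OSS 120B', 'GPT-5'] on this subset
```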

Key Findings

  1. Fundamental Capability Gap: Without tools, all models perform near random chance (12–28%), whereas junior analysts routinely complete these tasks.
  2. Tool Selection Catastrophe: Models direct 55.5% of tool calls to unreliable web search, even when specialized blockchain tools can directly provide correct answers.
  3. Illusion of Pass@k: Gemini 2.5 Flash's improvement from 39.3% (Pass@1) to 62.4% (Pass@5) suggests the model is essentially guessing via trial and error, rather than engaging in strategic reasoning.
  4. Twitter Search Paradox: used alone, Twitter search yields only 6.6% accuracy, but 40.7% in combination with other tools, indicating that effective tool use requires orchestration, not mere access.
  5. Case Study (Task 49): Retrieving monthly token-launch data from Pump.fun, a task solvable with a single API call, failed across all 17 models; each fell into the same cascade of web search → outdated blog posts → speculative Twitter content.

Highlights & Insights

  1. Adversarial-First Evaluation Paradigm: The first benchmark to explicitly incorporate active deception, source verification, and adversarial robustness as core dimensions of AI agent evaluation.
  2. The Danger of Pass@k Metrics: High Pass@k does not reflect capability but trial-and-error behavior—which is extremely dangerous in irreversible financial decision-making.
  3. Cost Efficiency of Open-Source Models: GPT-OSS 120B achieves near-frontier performance at 1/100th of the cost.
  4. Generalizable Insights: The findings extend beyond cryptocurrency—any adversarial domain (cybersecurity, content moderation, medical diagnosis) faces analogous challenges.

Limitations & Future Work

  1. The benchmark scale of 178 tasks is relatively small and highly concentrated in the crypto domain.
  2. The human baseline involves only 16 participants completing a 10% sample, limiting statistical power.
  3. Model learning and adaptation following feedback are not evaluated.
  4. Categorization concerns: the paper's listed area ("Robotics" in the metadata above) appears to be a mislabeling; this is properly an AI agent evaluation work.
  5. The paper lacks concrete improvement recommendations for safe agent deployment.
  6. Reliance on the ReAct framework means results may not generalize to other agent architectures.

Related Work

  • GAIA (General AI Assistant Benchmark) tests task completion but assumes a cooperative environment; this work extends evaluation to the adversarial dimension, filling an important gap in AI safety assessment.
  • The temporally anchored task design draws on methods from time-sensitive evaluations such as RealTimeQA.
  • The work resonates with HELM (Holistic Evaluation of Language Models) but focuses specifically on high-stakes scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first high-stakes adversarial AI agent benchmark, with far-reaching implications.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 17 models, dual-condition evaluation, multi-dimensional analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Compelling argumentation, deep insights, and vivid case studies.
  • Value: ⭐⭐⭐⭐⭐ — Carries significant cautionary implications for the safe deployment of AI agents.