
Benchmarking Vision-Language Models under Contradictory Virtual Content Attacks in Augmented Reality

Conference: CVPR 2026 · arXiv: 2604.05510 · Code: GitHub · Area: Multimodal / AR Security · Keywords: Augmented reality security, semantic contradiction detection, VLM robustness, benchmark, AR attacks

TL;DR

This paper introduces ContrAR, the first benchmark for contradictory virtual content attacks in AR environments, comprising 312 real videos recorded on a Meta Quest 3 and validated by 10 annotators with an average Likert score of 4.66/5. It systematically evaluates 11 VLMs (including GPT-5, Gemini-2.5, and Grok-4) on semantic contradiction detection, finding that GPT-5 achieves the highest accuracy (88.14%) but incurs roughly 19 s of latency, while GPT-4o offers the best accuracy–latency trade-off (84.62% at 7.26 s). An OCR-only text baseline reaches just 56.41%, demonstrating that visual reasoning is indispensable.

Background & Motivation

Background: In AR systems such as Meta Quest 3, multiple applications simultaneously render virtual content, and users rely on this virtual information for decision-making (e.g., navigation, safety inspection). Existing AR content analysis focuses primarily on rendering quality (low-level metrics such as lighting consistency and depth alignment), leaving semantic consistency analysis largely unexplored.

Limitations of Prior Work: (1) Malicious applications can inject information that is semantically contradictory to other virtual content (e.g., an arrow pointing left while text reads "turn right"), misleading users and potentially endangering safety. (2) VLMs perform well on general semantic reasoning but have not been systematically evaluated in AR mixed-reality scenarios. (3) No standardized benchmark dataset exists for measuring VLM detection capability against AR contradiction attacks.

Key Challenge: Semantic contradiction detection in AR scenes requires multimodal reasoning—recognizing both the visual and textual meaning of virtual content and inferring their logical consistency—yet existing evaluations are confined to natural images and text, leaving a significant gap with the dynamic mixed-reality environment of AR.

Goal: Formally define the threat model for contradictory virtual content attacks in AR, construct a standard benchmark dataset, and systematically evaluate the detection capability and real-time performance of mainstream VLMs.

Key Insight: The paper models AR semantic contradiction detection as a multimodal reasoning task for VLMs, constructs a standardized evaluation benchmark from videos recorded on a real HMD device, and provides the first capability profile of VLMs in this domain.

Core Idea: For the first time, a real AR video benchmark is used to systematically reveal the capability boundaries of VLMs in contradictory virtual content detection and to characterize the accuracy–latency trade-off.

Method

Overall Architecture

Threat model definition (gray-box assumption) → ContrAR dataset construction (5 AR application scenarios, recorded on Meta Quest 3) → VLM inference evaluation (single-frame / multi-frame strategies + OCR text-only baseline) → result analysis.

Key Designs

  1. Threat Model and Formal Definition

    • Function: Establish a theoretical framework for AR contradiction attacks, specifying the capability boundaries of both attacker and detector.
    • Mechanism: Gray-box assumption—the attacker is a user-level application that can only render its own virtual objects and cannot modify other applications or system-level content; the detection system, likewise a user-level process, can only observe the composited scene. The contradiction condition is formalized as: given a set of virtual contents \(\mathcal{C} = \{c_1, \dots, c_n\}\) with semantic interpretations \(I(\cdot)\), the scene contains a contradiction attack if there exist \(c_i, c_j \in \mathcal{C}\) such that \(I(c_i) \perp I(c_j)\) (the two interpretations are semantically mutually exclusive). The video-level label is \(C(V) = 1\) when such a contradictory pair exists, and 0 otherwise.
    • Design Motivation: Strictly bounding the capabilities of both attacker and detector ensures that evaluation results are practically meaningful. A semantic-level definition (rather than a visual-level one) keeps the problem focused on high-level reasoning ability.
  2. ContrAR Dataset

    • Function: Construct the first standardized evaluation dataset for contradictory virtual content attacks in AR.
    • Mechanism: Videos are recorded using Meta Quest 3 across 5 AR application scenarios: Indoor Navigation (IN), Outdoor Navigation (ON), Safety Inspection (SI), Smart Apartment (SA), and Smart Retail (SR). A strict 1:1 positive-to-negative ratio (156 contradictory + 156 non-contradictory) yields 312 videos at 1920×1080 resolution, 5–15 seconds in duration, at 30 FPS. Of these, 90 contain text-only virtual content and 222 contain visual + text virtual content. Three AR experts designed attack patterns through structured brainstorming, and 10 participants independently annotated and validated the labels (average Likert score 4.66/5).
    • Design Motivation: Recording on a real device ensures scene authenticity; the 1:1 ratio avoids class imbalance; multi-scenario coverage ensures comprehensiveness; human validation ensures label reliability.
  3. VLM Evaluation Framework

    • Function: Design a standardized inference-and-evaluation pipeline for fair comparison of 11 VLMs on detection performance.
    • Mechanism: Two inference strategies are employed—single-frame (middle frame of the video, simulating real-time decision-making) and multi-frame (first, middle, and last frames, capturing temporal context). A unified prompt template guides four-step reasoning: ① identify the real scene → ② describe the virtual content → ③ analyze contradictions → ④ assess harm. An additional OCR text-only baseline (EasyOCR extraction → GPT-4o judgment) is established to quantify the necessity of pure visual reasoning.
    • Design Motivation: Single-frame vs. multi-frame corresponds to the trade-off between real-time performance and accuracy; the OCR baseline demonstrates that visual reasoning cannot be replaced by text-only approaches.
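The video-level label \(C(V)\) from the threat model above amounts to a pairwise mutual-exclusion check over semantic interpretations. The sketch below is illustrative only: in the paper the interpretations and the exclusion judgment come from a VLM's reasoning, whereas here `mutually_exclusive` is a toy stand-in with hypothetical labels.

```python
from itertools import combinations

def mutually_exclusive(a: str, b: str) -> bool:
    """Toy stand-in for the semantic-exclusion test I(c_i) ⟂ I(c_j)."""
    opposites = {("turn left", "turn right"), ("safe", "danger")}
    return (a, b) in opposites or (b, a) in opposites

def contradiction_label(interpretations: list[str]) -> int:
    """C(V) = 1 iff some pair of virtual contents is semantically exclusive."""
    return int(any(mutually_exclusive(a, b)
                   for a, b in combinations(interpretations, 2)))

# Example from the paper: an arrow pointing left while text reads "turn right".
assert contradiction_label(["turn left", "turn right", "exit sign"]) == 1
assert contradiction_label(["turn right", "exit sign"]) == 0
```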

Loss & Training

No training is involved; the evaluation is purely inference-based. Commercial models are accessed via API and open-source models via HuggingFace.
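The single-/multi-frame sampling and the four-step prompt described above can be sketched as follows. This is a minimal illustration, not the authors' code: the exact prompt wording is paraphrased, and the frame-index convention is an assumption.

```python
def frame_indices(n_frames: int, strategy: str = "single") -> list[int]:
    """Single-frame uses the middle frame; multi-frame adds first and last."""
    mid = n_frames // 2
    if strategy == "single":
        return [mid]
    return [0, mid, n_frames - 1]

# Hypothetical paraphrase of the unified four-step reasoning prompt.
PROMPT = (
    "1) Identify the real-world scene.\n"
    "2) Describe all virtual (AR) content.\n"
    "3) Analyze whether any virtual contents contradict each other.\n"
    "4) Assess the potential harm. Answer: contradiction yes/no."
)

# A 10 s clip at 30 FPS has 300 frames.
assert frame_indices(300, "single") == [150]
assert frame_indices(300, "multi") == [0, 150, 299]
```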

Key Experimental Results

Main Results — VLM Detection Accuracy and Latency

Model               Strategy      Overall Accuracy (%)  Latency (s)
GPT-5               Single-frame  88.14                 19.29
GPT-5               Multi-frame   85.58                 23.78
GPT-4.1             Single-frame  82.05                 11.47
GPT-4.1             Multi-frame   86.54                 16.61
GPT-4o              Single-frame  79.17                  5.92
GPT-4o              Multi-frame   84.62                  7.26
Gemini-2.5-Pro      Single-frame  83.97                 14.29
Gemini-2.5-Flash    Single-frame  79.81                  9.90
Grok-4              Single-frame  68.27                 27.76
Claude-Sonnet-4.5   Multi-frame   68.59                 18.01
Qwen-2.5-VL-72B     Multi-frame   64.10                 14.93
OCR-Text (GPT-4o)   Single-frame  56.41                  4.58
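The headline trade-offs can be read off the table mechanically. The numbers below are copied from the 11 VLM rows above (OCR baseline excluded); the sub-10 s cutoff is our own illustrative threshold, echoing the paper's real-time recommendation.

```python
# (model, strategy, accuracy %, latency s) — values from the table above.
results = [
    ("GPT-5", "single", 88.14, 19.29),
    ("GPT-5", "multi", 85.58, 23.78),
    ("GPT-4.1", "single", 82.05, 11.47),
    ("GPT-4.1", "multi", 86.54, 16.61),
    ("GPT-4o", "single", 79.17, 5.92),
    ("GPT-4o", "multi", 84.62, 7.26),
    ("Gemini-2.5-Pro", "single", 83.97, 14.29),
    ("Gemini-2.5-Flash", "single", 79.81, 9.90),
    ("Grok-4", "single", 68.27, 27.76),
    ("Claude-Sonnet-4.5", "multi", 68.59, 18.01),
    ("Qwen-2.5-VL-72B", "multi", 64.10, 14.93),
]

# Highest accuracy overall, and highest accuracy under a 10 s latency budget.
best_overall = max(results, key=lambda r: r[2])
best_under_10s = max((r for r in results if r[3] < 10), key=lambda r: r[2])

assert best_overall[:2] == ("GPT-5", "single")        # 88.14 % at 19.29 s
assert best_under_10s[:2] == ("GPT-4o", "multi")      # 84.62 % at 7.26 s
```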

Per-Scenario Accuracy Comparison (Single-Frame Mode)

Model             Indoor Nav.  Outdoor Nav.  Safety Insp.  Smart Apt.  Smart Retail
GPT-5             81.48        91.67         80.95         94.44       86.36
GPT-4o            83.33        86.67         71.43         77.78       75.76
Gemini-2.5-Pro    75.93        90.00         83.33         86.67       81.82
Claude-Haiku-4.5  50.00        55.00         64.29         48.89       56.06

Key Findings

  • GPT-5 achieves the highest accuracy but the greatest latency: 88.14% single-frame accuracy vs. 19.29s latency, making it unsuitable for real-time AR detection.
  • GPT-4o offers the best accuracy–latency trade-off: multi-frame mode at 84.62% / 7.26s, the most practical choice for commercial deployment.
  • The OCR text-only baseline reaches only 56.41% (near random), demonstrating that visual semantic reasoning is the core capability for contradiction detection and cannot be replaced by text-only approaches.
  • Multi-frame is not consistently superior to single-frame: GPT-5 (−2.56%) and Gemini-2.5-Pro (−7.37%) both decline in multi-frame mode, possibly because additional frames introduce redundant information that interferes with reasoning.
  • Open-source models lag significantly: Qwen-2.5-VL-72B peaks at 64.10%, a gap of 24 percentage points behind GPT-5.
  • Substantial cross-scenario variation: Smart Apartment (state-indicator contradictions) is the easiest to detect, while Safety Inspection (sign contradictions) is the most challenging.
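The "multi-frame is not consistently superior" finding can be verified from the main table for the three models evaluated under both strategies (Gemini-2.5-Pro's multi-frame row is not listed above, so its −7.37% delta is taken on the paper's word). A small check:

```python
# Single- vs multi-frame overall accuracy (%) from the main results table.
single = {"GPT-5": 88.14, "GPT-4.1": 82.05, "GPT-4o": 79.17}
multi  = {"GPT-5": 85.58, "GPT-4.1": 86.54, "GPT-4o": 84.62}

deltas = {m: round(multi[m] - single[m], 2) for m in single}

# Multi-frame helps GPT-4.1 (+4.49) and GPT-4o (+5.45) but hurts GPT-5 (-2.56).
assert deltas == {"GPT-5": -2.56, "GPT-4.1": 4.49, "GPT-4o": 5.45}
```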

Highlights & Insights

  1. The problem definition has real-world significance: AR contradiction attacks are an emerging security threat, and as the AR application ecosystem becomes more open (with multiple co-existing apps), the practical risk of such attacks is growing. This paper is the first to formally define the problem and provide an evaluation tool.
  2. The OCR baseline is elegantly designed: The result of only 56% powerfully demonstrates that "visual reasoning cannot be replaced by text," providing clear technical direction for future research.
  3. The accuracy–latency trade-off has engineering value: The paper provides first-hand data for selecting AR security system components—GPT-4o is recommended when real-time detection requires latency below 10s, while GPT-5 is preferred when maximum accuracy is the priority.

Limitations & Future Work

  1. Limited data scale: 312 videos across 5 scenarios offer insufficient diversity to cover all AR attack patterns.
  2. Video models not utilized: Evaluation is frame-based only, without leveraging the temporal modeling capabilities of video VLMs (the authors attribute this to API limitations and computational constraints).
  3. A unified Unity app simulates attacks: Using a single application to simultaneously simulate both victim and attacker diverges from real multi-application scenarios.
  4. Evaluation only, no defense proposed: A natural limitation of benchmark papers; future work should develop efficient lightweight detection models.
  5. Adversarial evasion not considered: Attackers may design more subtle contradictions to fool VLMs.

Comparison with Related Work

  • vs. BoardgameQA / Pan et al.: These address purely textual contradiction detection; ContrAR extends the problem to mixed visual–textual multimodal scenarios, adding a further dimension of complexity.
  • vs. MMIR: MMIR studies visual–text inconsistencies within documents; ContrAR focuses on security threats in real-time AR scenes, with more clearly defined application value.
  • vs. AR quality evaluation (lighting / depth alignment): Advancing from low-level visual metrics to high-level semantic reasoning represents a qualitative shift in AR security research.

Rating

⭐⭐⭐⭐

  • Novelty ⭐⭐⭐⭐: First to formally define AR contradiction attacks and construct an evaluation benchmark; the problem definition is valuable.
  • Experimental Thoroughness ⭐⭐⭐⭐: Comprehensive evaluation covering 11 VLMs, 2 strategies, an OCR baseline, and 5-scenario analysis.
  • Writing Quality ⭐⭐⭐⭐: Threat model definition is rigorous; experimental design is clearly presented.
  • Value ⭐⭐⭐⭐: Provides the first standardized evaluation tool for the AR security domain, filling a research gap.