Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark

Conference: ICLR 2026
arXiv: 2510.02356
Code: GitHub
Area: AI Safety / Privacy / Embodied Intelligence
Keywords: privacy awareness, embodied agent, physical privacy, contextual integrity, benchmark, PDDL

TL;DR

This paper proposes EAPrivacy — the first 4-tier benchmark for evaluating LLM physical-world privacy awareness (400+ procedurally generated scenarios, 60+ physical scenes). It finds that all frontier models exhibit "asymmetric conservatism" (over-cautious on task execution yet insufficient on privacy protection), that enabling reasoning/thinking mode actually degrades privacy performance, and that the best model (Gemini 2.5 Pro) achieves only 59% accuracy in dynamic environments.

Background & Motivation

Background: LLMs are increasingly deployed as embodied agents (home robots, medical assistants, office robots) operating in physical spaces. Existing privacy benchmarks (e.g., Mireshghallah 2023) only evaluate text-level privacy leakage.

Limitations of Prior Work:
- Physical privacy ≠ textual privacy: Physical-world privacy requires spatial reasoning (e.g., "a diary is on the desk"), contextual integrity judgment (e.g., "should not start cleaning while a meeting is in progress"), and multimodal perception (e.g., "hearing faint conversation")
- Task–privacy conflicts are unevaluated: An agent instructed to "clear the desk" encounters a hidden surprise gift on it. How should it balance the two?
- Social norms vs. privacy: A neighbor's apartment emits screams. Should the agent report it (sacrificing privacy) or ignore it (respecting privacy)?
- Aligned LLMs perform well on textual privacy benchmarks (the secret disclosure rate of Gemini/GPT-4 can reach 0%), yet physical privacy poses an entirely different challenge

Key Challenge: In the physical world, privacy is not a static rule but a dynamic social contract that depends on context and requires reasoning — the question is whether LLMs possess this reasoning capacity.

Core Idea: Construct a 4-tier progressive evaluation using procedurally generated physical scenes in PDDL format (encoding spatial relationships and multimodal perception cues), ranging from simple sensitive-object identification to complex ethical dilemmas.

Method

Overall Architecture

The benchmark uses a 4-tier progressive design; each tier targets a different level of cognitive complexity in physical privacy.

4-Tier Design

Tier 1: Sensitive Object Identification
- Function: Identify sensitive objects (e.g., social security cards, passports) among 3–30 distractors on desks or in containers
- Input: Spatial relationships among objects in PDDL format (not natural-language descriptions)
- Evaluation: True positive rate, false positive rate, and spatial localization accuracy
- Clutter levels (3/5/10/30 distractors) are varied to assess the effect of environmental complexity on privacy awareness
- Design Motivation: The most fundamental physical-privacy capability — can the agent "see" what is privacy-sensitive in a real scene?
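
The Tier 1 metrics can be illustrated with a minimal sketch. The fact tuples, the object names, and the `tier1_rates` helper are hypothetical: they only mimic the PDDL-style spatial relations described above and are not the benchmark's actual schema or scoring code.

```python
# Hypothetical scene: (predicate, object, location) tuples standing in for PDDL atoms.
scene_facts = [
    ("on", "passport", "desk"),
    ("on", "stapler", "desk"),
    ("in", "social_security_card", "drawer"),
    ("on", "coffee_mug", "desk"),
    ("in", "notebook", "drawer"),
]

# Assumed ground-truth annotation of which objects are sensitive.
sensitive = {"passport", "social_security_card"}


def tier1_rates(facts, sensitive, predicted):
    """Return (true positive rate, false positive rate) for one scene."""
    objects = {obj for _, obj, _ in facts}
    distractors = objects - sensitive
    true_pos = len(predicted & sensitive)
    false_pos = len(predicted & distractors)
    tpr = true_pos / len(sensitive) if sensitive else 0.0
    fpr = false_pos / len(distractors) if distractors else 0.0
    return tpr, fpr


# A model that finds the passport, misses the card, and wrongly flags the notebook.
tpr, fpr = tier1_rates(scene_facts, sensitive, {"passport", "notebook"})
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Raising the number of distractors in `scene_facts` while keeping `sensitive` fixed is one way to probe the clutter effect described above.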
Tier 2: Privacy Reasoning in Dynamic Environments
- Function: Judge the appropriateness (scored 1–5) of a given action under different contexts
- Input: Multimodal perception cues (Visual: "5 people at table"; Audio: "continuous speech"), simulating the perception of a physical agent
- Evaluation modes: (i) Rating Mode (MAD against human scores); (ii) Selection Mode (choose the most appropriate action from three options)
- Coverage: locations (park, library, private residence) × tasks (cleaning, security patrol, food delivery) × situations (normal to emergency, empty room to private meeting)
- Design Motivation: Privacy is context-dependent — "start cleaning" is appropriate in an empty room but not during a private meeting
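
The two Tier 2 evaluation modes can be sketched as follows. This reads MAD as mean absolute difference, which is an assumption because the summary does not expand the acronym; the function names and the sample ratings are illustrative, not taken from the benchmark.

```python
def mean_absolute_difference(model_ratings, human_ratings):
    """Average |model - human| over paired 1-5 appropriateness ratings."""
    if not model_ratings or len(model_ratings) != len(human_ratings):
        raise ValueError("ratings must be non-empty and of equal length")
    total = sum(abs(m - h) for m, h in zip(model_ratings, human_ratings))
    return total / len(model_ratings)


def selection_accuracy(chosen, correct):
    """Fraction of scenarios where the model picked the most appropriate action."""
    if not correct or len(chosen) != len(correct):
        raise ValueError("selections must be non-empty and of equal length")
    return sum(c == g for c, g in zip(chosen, correct)) / len(correct)


# Illustrative data only: four rated actions and three selection scenarios.
model_ratings = [4, 2, 5, 1]
human_ratings = [5.0, 1.0, 2.0, 1.0]
mad = mean_absolute_difference(model_ratings, human_ratings)

chosen = ["wait", "start_cleaning", "leave"]
correct = ["wait", "wait", "leave"]
acc = selection_accuracy(chosen, correct)

print(f"MAD={mad:.2f}, selection accuracy={acc:.2f}")
```

Lower MAD means closer agreement with human judgment, while higher selection accuracy is better, so the two numbers move in opposite directions for a stronger model.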
Tier 3: Inferential Privacy and Task Conflicts
- Function: The agent must infer implicit privacy constraints from multimodal cues (Theory of Mind) and respect them while executing tasks
- Example scenario: Observing someone hide a gift under a desk → instructed to "move everything off the desk" → should skip the gift
- Evaluation: (i) privacy violation rate; (ii) task completion rate (how much of the task is completed without violating privacy); (iii) three-way selection (one privacy-violating action vs. two non-violating alternatives)
- Design Motivation: Closest to real deployment — the agent must simultaneously satisfy explicit instructions and implicit privacy constraints
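
A minimal sketch of how the first two Tier 3 metrics could be computed is given below. The `Episode` structure and the scoring rules are assumptions made for illustration (in particular, completion is measured only over objects that are not privacy-protected); the benchmark's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    targets: set    # objects covered by the explicit instruction
    protected: set  # objects that implicit cues mark as private
    moved: set      # objects the agent actually acted on


def violated(ep):
    """True if the agent touched any protected object."""
    return bool(ep.moved & ep.protected)


def completion(ep):
    """Share of the permissible part of the task that was carried out."""
    allowed = ep.targets - ep.protected
    if not allowed:
        return 1.0
    return len(ep.moved & allowed) / len(allowed)


desk = {"laptop", "mug", "gift"}
episodes = [
    Episode(desk, {"gift"}, {"laptop", "mug"}),          # ideal: skips the gift
    Episode(desk, {"gift"}, {"laptop", "mug", "gift"}),  # privacy violation
    Episode(desk, {"gift"}, set()),                      # refusal: nothing moved
]

violation_rate = sum(violated(e) for e in episodes) / len(episodes)
mean_completion = sum(completion(e) for e in episodes) / len(episodes)
print(f"violation rate={violation_rate:.2f}, completion={mean_completion:.2f}")
```

Keeping the two metrics separate is what makes it possible to observe a model that both refuses tasks (low completion) and violates privacy (high violation rate).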
Tier 4: Social Norms vs. Privacy Ethical Dilemmas
- Function: Test whether the agent recognizes that, in high-stakes scenarios, critical social norms (public safety, child protection) should override individual privacy
- Example scenario: Hearing "a cry for help" from a neighboring apartment + observing "erratic silhouettes" → should call the police (sacrificing the neighbor's privacy)
- Evaluation: Binary judgment accuracy
- Grounded in U.S. legal and social normative frameworks, with acknowledgment of cross-cultural variation
- Design Motivation: Test the agent's ethical judgment under extreme circumstances

Technical Characteristics

Procedural generation: 400+ scenarios across 60+ unique physical scenes (offices, laboratories, homes, etc.)
PDDL format: Structured representation of physical spatial relationships, beyond pure text narration
Multimodal perception simulation: Visual / Audio / Action cues simulating the real perception of an embodied agent
Human annotation validation: Five PhD-level annotators to establish ground truth
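
To make the idea of procedural generation concrete, here is a hedged sketch that emits PDDL-style facts at the four clutter levels mentioned for Tier 1. The object pools, location names, predicates, and the `generate_scene` function are all placeholders invented for this example; the actual generator and PDDL domain are not described in this summary.

```python
import random

SENSITIVE = ["passport", "social_security_card"]  # examples named for Tier 1
DISTRACTORS = [f"item_{i}" for i in range(40)]    # placeholder distractor names
LOCATIONS = ["desk", "drawer", "shelf"]           # assumed locations
CLUTTER_LEVELS = [3, 5, 10, 30]


def generate_scene(n_distractors, rng):
    """Return PDDL-style facts for one sensitive object plus n distractors."""
    objects = [rng.choice(SENSITIVE)] + rng.sample(DISTRACTORS, n_distractors)
    rng.shuffle(objects)
    facts = []
    for obj in objects:
        location = rng.choice(LOCATIONS)
        predicate = "in" if location == "drawer" else "on"
        facts.append(f"({predicate} {obj} {location})")
    return facts


rng = random.Random(0)  # fixed seed so the generated scenes are reproducible
scenes = {n: generate_scene(n, rng) for n in CLUTTER_LEVELS}
for n, facts in scenes.items():
    print(n, len(facts), facts[:3])
```

Seeding the generator keeps every scene reproducible, which matters when the same scenarios are scored across many models.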

Key Experimental Results

Main Results (16 Models × 4 Tiers)

Tier	Best Model	Performance	Key Challenge
Tier 1	Gemini 2.5 Pro	96%→26% accuracy (3→30 distractors)	Collapses in cluttered environments
Tier 2	Gemini 2.5 Pro	59% selection accuracy, MAD = 1.32	Insufficient dynamic context understanding
Tier 3	Gemini 2.5 Flash	71% privacy violation rate (lowest among models)	All models severely insufficient
Tier 4	Multiple models	81–95% accuracy	Comparatively easy, but gaps remain

Core Finding: Asymmetric Conservatism

Dimension	Performance	Explanation
Task execution	Over-cautious (Tier 3 task completion rate near 0%)	Models "prefer refusing tasks over making mistakes"
Privacy protection	Severely insufficient (violation rate 71–98%)	Models simultaneously fail to protect privacy
Overall outcome	Neither tasks nor privacy handled well	Over-safe and under-safe coexist

Thinking Mode Degradation (Counter-intuitive Finding)

Model	Standard Mode	Thinking Mode	Interpretation
Gemini 2.5 Pro	Baseline	Declines across Tier 1–3	Reasoning introduces over-interpretation
Claude 3.5	Baseline	Similar degradation	—

Key Findings

Asymmetric conservatism is the most important finding: Models are over-cautious about "doing things" (Tier 3 completion rate near 0% — almost all potentially privacy-related tasks are refused) yet insufficiently cautious about "protecting privacy" (violation rate 71–98%) — both types of errors occur simultaneously
Thinking/Reasoning mode degradation (Tier 1–3): Enabling reasoning mode leads to worse performance — likely because longer reasoning chains increase false positives (labeling irrelevant objects as sensitive) and over-interpretation (judging normal actions as inappropriate)
Sensitivity to environmental complexity: Accuracy is 96% with 3 distractors but drops to 26% with 30 — physical scene complexity is a key bottleneck
Textual privacy ≠ physical privacy: Models with 0% disclosure rates on text benchmarks exhibit severe deficiencies in physical privacy
GPT-4o and Claude-3.5-haiku ignore social norms in more than 15% of Tier 4 cases

Highlights & Insights

Deeper implication of "asymmetric conservatism": Current alignment training appears to create a distorted safety posture: models have learned "refusal" as a safety strategy but have not learned to "actively protect" privacy. This points to a systemic bias in RLHF-style alignment
Pioneering nature of physical privacy evaluation: Extending privacy assessment from text to the physical world, and using PDDL combined with multimodal cues to simulate embodied perception, represents an important paradigm shift in evaluation
Thinking mode degradation as a warning for scaling reasoning: More reasoning is not always better — in privacy scenarios that require "common sense" rather than "deep analysis," reasoning may over-complicate straightforward judgments

Limitations & Future Work

The benchmark is grounded solely in U.S. legal and social norms; cross-cultural applicability requires further exploration
The PDDL-based physical descriptions differ from real visual perception — no actual images or videos are used
The scale of 400+ scenarios remains limited relative to the complexity of the physical world
The "correct answers" in Tier 4 ethical dilemmas may be contested across cultures and personal value systems
Truly embodied systems (robots) are not tested; only the text-based reasoning of LLMs is evaluated

Related Work & Insights

vs. Mireshghallah 2023 (textual privacy): That work only tests contextual integrity of information flows; EAPrivacy extends evaluation to spatial reasoning and multimodal perception in the physical world
vs. robot safety evaluation (Robey et al. 2024): Prior work primarily focuses on jailbreaks and adversarial attacks; EAPrivacy addresses privacy-awareness deficiencies under normal usage
Implications for embodied AI deployment: Current LLMs lack the physical-privacy reasoning capability required for deployment in private spaces — dedicated physical-privacy alignment training is needed

Rating

Novelty: ⭐⭐⭐⭐⭐ First physical-world privacy evaluation; the 4-tier design is systematic and theoretically grounded (contextual integrity)
Experimental Thoroughness: ⭐⭐⭐⭐ 16 models × 400+ scenarios × human annotation validation
Writing Quality: ⭐⭐⭐⭐ Failure modes are clearly categorized; findings are insightful
Value: ⭐⭐⭐⭐⭐ Important implications for the safe deployment of embodied AI; reveals fundamental deficiencies in alignment