Position: Embodied AI Requires a Privacy-Utility Trade-off

Conference: ICML 2026
arXiv: 2605.05017
Code: https://github.com/rminshen03/EAI_Privacy_Position
Area: AI Safety / Embodied AI / Privacy Protection
Keywords: embodied AI, privacy-utility trade-off, SPINE framework, lifecycle privacy, hierarchical control

TL;DR

This position paper argues that privacy in embodied AI cannot be addressed with single-stage patches. Instead, it must be treated as an architecture-level dynamic control signal spanning the full lifecycle of instruction, perception, planning, and interaction. The authors propose the SPINE framework, which uses a four-level (L1-L4) privacy classification matrix to adjust agent behavior consistently across all stages.

Background & Motivation

Background: Embodied AI (EAI) is rapidly transitioning from simulation to real-world environments such as homes, hospitals, and offices. Existing research primarily focuses on optimizing task success rates within the internal stages of instruction understanding, environment perception, action planning, and physical interaction.

Limitations of Prior Work: Current privacy protections for EAI are largely stage-local patches, such as face blurring in the perception stage or adding perturbations during planning. They fall short in two ways. (1) Downstream stages often undo them: the perception layer may anonymize faces, yet planning logs can still record the precise movement patterns of a user taking tremor medication, which is enough to infer a history of Parkinson's disease. (2) The privacy-utility trade-off acts as a non-linear safety constraint: aggressive restrictions on planning do not merely reduce efficiency but may directly cause the robot to collide with obstacles or people.

Key Challenge: Privacy in EAI is essentially an attribute that accumulates across stages and over time, whereas current architectures treat it as a feature that each stage can control locally and independently. On the legal side, only high-level principles such as GDPR / CCPA exist; they lack actionable guidance for the "embodied closed-loop," leaving a gap between technology and regulation.

Goal: (1) Demonstrate why privacy must be treated as a lifecycle-level architectural constraint; (2) design a unified framework capable of consistently propagating privacy constraints across stages and dynamically adjusting trade-offs in different contexts; (3) provide preliminary evidence on how privacy reshapes downstream utility through real-world case studies.

Key Insight: The authors utilize embodied navigation as a controlled probe because navigation naturally couples the four stages. This allows for controlled observation of "how downstream utility changes under upstream strong privacy," transforming the trade-off from an abstract slogan into a quantifiable engineering relationship.

Core Idea: Upgrade privacy from "local patches" to a "dynamic control signal," implementing a "context-aware" privacy architecture via a four-level privacy classification matrix and cross-stage orchestration.

Method

As a position paper, the core deliverable is a conceptual framework named SPINE. It is not a training algorithm but a design blueprint that tells engineers "which privacy primitives to activate in which scenarios and stages," supported by two navigation case studies that turn abstract trade-offs into quantifiable curves.

Overall Architecture

SPINE is composed of three components. The first is an L1-L4 four-level privacy classification matrix, responsible for mapping any real-world scenario to a specific privacy level. The second is a \(4 \times 4\) conceptual architecture diagram: the vertical axis represents the four stages of Embodied AI (Instruction / Perception / Planning / Interaction), and the horizontal axis represents levels L1-L4, listing the technical primitives to be activated in each cell. The third is a cross-stage orchestration strategy, which links the four stages into a holistic pipeline using a "highest-triggering-criterion" rule, accompanied by a quantitative analysis of utility degradation relative to privacy intensity. Together, these three components transform privacy from scattered stage-level patches into a dynamic control signal running through the entire pipeline.

Key Designs

1. Multi-criteria Privacy Classification Matrix (L1-L4): Upgrading "sensitive/insensitive" binary classification to a formalizable privacy state machine. Traditional approaches typically use only "public" and "private" categories, which are too coarse to distinguish between scenarios with vastly different sensitivities, such as a "bedroom" versus a "private office." SPINE introduces a unified quadruple \(PL = \{S, I, C, \Phi\}\) to describe each privacy state: \(S\) is scenario context, \(I\) is allowed information flow, \(C\) represents enforced control primitives, and \(\Phi\) is the dominant utility goal. The levels range from low to high: L1 (Public, e.g., parks) allows cloud inference with full sensing, where \(\Phi\) is max utility; L2 (Internal, e.g., office hallways) utilizes hybrid information flow, removing biometrics while retaining geometric data; L3 (Confidential, e.g., private offices) switches to local processing, semantic de-identification, and privacy-aware rerouting; L4 (Restricted, e.g., bedrooms/bathrooms) retains only minimal viable safety functions, replaces RGB with LiDAR, and utilizes TEE containers for isolation. This "privacy state machine" ensures cross-stage consistency, while high-cost primitives (FHE / ZKP) are explicitly restricted to L4 only when necessary to avoid performance overhead in low-sensitivity scenarios.
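As a rough illustration of the quadruple \(PL = \{S, I, C, \Phi\}\), the four levels could be written down as plain data. This is a minimal sketch, not the authors' implementation: the class and field names, the `classify` helper, and the fail-closed default for unknown scenarios are all assumptions made here for illustration, and `None` marks fields this summary does not specify.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Level(IntEnum):
    """Privacy levels; a larger value means a stricter level."""
    L1_PUBLIC = 1
    L2_INTERNAL = 2
    L3_CONFIDENTIAL = 3
    L4_RESTRICTED = 4


@dataclass(frozen=True)
class PrivacyState:
    """One privacy state PL = {S, I, C, Phi}. Field names are hypothetical."""
    scenarios: Tuple[str, ...]    # S: example scenario contexts
    info_flow: Optional[str]      # I: allowed information flow
    controls: Tuple[str, ...]     # C: enforced control primitives
    utility_goal: Optional[str]   # Phi: dominant utility goal


# Values paraphrase the level descriptions above; None marks a field
# that the summary does not spell out for that level.
PRIVACY_STATES = {
    Level.L1_PUBLIC: PrivacyState(
        ("park",), "cloud inference", ("full sensing",), "max utility"),
    Level.L2_INTERNAL: PrivacyState(
        ("office hallway",), "hybrid",
        ("remove biometrics", "retain geometric data"), None),
    Level.L3_CONFIDENTIAL: PrivacyState(
        ("private office",), "local processing",
        ("semantic de-identification", "privacy-aware rerouting"), None),
    Level.L4_RESTRICTED: PrivacyState(
        ("bedroom", "bathroom"), None,
        ("replace RGB with LiDAR", "TEE container isolation"),
        "minimal viable safety functions"),
}


def classify(scenario: str) -> Level:
    """Map a scenario label to its level.

    Assumption (not from the paper): unknown scenarios fail closed to L4.
    """
    for level, state in PRIVACY_STATES.items():
        if scenario in state.scenarios:
            return level
    return Level.L4_RESTRICTED
```

Using an ordered `IntEnum` makes "stricter than" a simple integer comparison, which is convenient for any rule that has to pick the highest level among several.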

2. Adaptive Privacy Orchestration: Enabling end-to-end privacy constraints and preventing downstream "reversion." This design defines specific actions for instruction, perception, planning, and interaction at each level. In the perception stage, L1 uses full-FoV RGB-D, L2 anonymizes faces and license plates in real time, L3 dynamically masks non-task areas and limits the field of view, and L4 cuts off RGB in favor of LiDAR. In the planning stage, L1 follows the shortest path, L2 plans on de-identified semantic maps, L3 introduces a "privacy cost map" that applies high traversal penalties to private areas, and L4 degrades to minimum viable navigation. The stages are tied together by the "highest-triggering-criterion" rule: if any stage triggers a higher-level constraint, the entire pipeline upgrades immediately and stays there until the trigger condition is cleared or a manual audit occurs. This addresses the flaw in stage-local patches. Previously, the perception layer anonymized faces while planning logs still recorded precise movement patterns; now, if the robot perceives that it has entered a bedroom, perception is masked and planning and logging also switch to L4, closing that downstream leak.
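The "highest-triggering-criterion" rule can be read as taking the maximum over per-stage triggers and latching it until a trigger is cleared. The sketch below is one possible reading under that assumption; the `Orchestrator` class and its method names are hypothetical and do not come from the paper or its repository.

```python
from enum import IntEnum


class Level(IntEnum):
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


STAGES = ("instruction", "perception", "planning", "interaction")


class Orchestrator:
    """Tracks the level each stage currently demands (hypothetical API)."""

    def __init__(self):
        self._triggers = {stage: Level.L1 for stage in STAGES}

    def report(self, stage, level):
        """A stage raises its required level; a report never lowers it."""
        if stage not in self._triggers:
            raise ValueError(f"unknown stage: {stage}")
        self._triggers[stage] = max(self._triggers[stage], Level(level))

    def clear(self, stage):
        """The trigger condition of one stage no longer holds."""
        if stage not in self._triggers:
            raise ValueError(f"unknown stage: {stage}")
        self._triggers[stage] = Level.L1

    def audit_reset(self):
        """A manual audit releases every trigger at once."""
        for stage in self._triggers:
            self._triggers[stage] = Level.L1

    @property
    def level(self):
        """Pipeline-wide level: the highest level any stage demands."""
        return max(self._triggers.values())
```

For example, after `report("perception", Level.L4)` every stage that reads `orchestrator.level` sees L4, including planning and logging, and the pipeline only drops back once `clear("perception")` or `audit_reset()` is called.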

3. Threat Model and Privacy-Utility Boundary Quantification: Providing a "knob" for trade-offs and clear failure thresholds. To guide engineering beyond abstract balance, SPINE clarifies three classes of threats: honest-but-curious cloud services, compromised storage or insiders, and external/unauthorized observers. The trade-off is then expressed as a function of utility loss relative to privacy intensity. In the navigation case study, the authors use pixelation intensity \(K\) as the trade-off knob: \(K=1\) corresponds to the L1 original image, while \(K>1\) approaches L3, measuring how the task success rate monotonically decreases with \(K\). Crucially, an "operational boundary" exists on this curve—once \(K\) exceeds a certain threshold, the task fails completely. This boundary represents the upper limit of privacy that can be enforced in that scenario. This allows product managers and engineers to make informed choices based on deployment context.
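To make the knob concrete, the sketch below pixelates a grayscale image by block averaging and then reads the operational boundary off a measured success curve. It assumes that \(K\) is the side length of the averaging block (so \(K=1\) leaves the image unchanged) and that the boundary is the largest \(K\) whose success rate still meets a chosen floor; both are assumptions made here, and the function names are hypothetical.

```python
def pixelate(image, k):
    """Replace each k-by-k block of a grayscale image with its integer mean.

    `image` is a list of equal-length rows of ints; k=1 returns a copy.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, k):
        for left in range(0, width, k):
            rows = range(top, min(top + k, height))
            cols = range(left, min(left + k, width))
            block = [image[r][c] for r in rows for c in cols]
            mean = sum(block) // len(block)
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


def operational_boundary(success_by_k, min_success):
    """Largest K reachable without the success rate dropping below the floor.

    `success_by_k` maps each tested K to a measured success rate.
    Returns None if even the smallest tested K misses the floor.
    """
    boundary = None
    for k in sorted(success_by_k):
        if success_by_k[k] < min_success:
            break
        boundary = k
    return boundary
```

Walking the curve in increasing \(K\) and stopping at the first failure keeps the result meaningful even if a noisy measurement happens to recover at a larger \(K\).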

SPINE itself is a "position + framework" and lacks an end-to-end training objective; the two case studies were conducted using existing EAI simulators and real robots, recording success rates and path lengths under different \(K\) to generate the trade-off curves described above.

Key Experimental Results

As a position paper, this work provides conceptual validation rather than exhaustive experimental comparisons.

Main Results

The four-stage × four-level conceptual architecture that SPINE proposes in place of stage-local patches, with the primitive activated in each cell:

Privacy Level	Typical Scenario	Instruction	Perception	Planning	Interaction
L1 Public	Park	Cloud LLM	Full FoV RGB-D	Shortest path	Full logs
L2 Internal	Office hallway	Local logs	Real-time face/plate anonymization	De-identified semantic map planning	Standard latency, de-identified storage
L3 Confidential	Private office	Local semantic de-identification	Limited FoV + dynamic masking	Privacy cost map + rerouting	Session-only encrypted logs
L4 Restricted	Bedroom / Bathroom	Processing within TEE	Cut RGB for LiDAR	Minimum viable navigation	Trace-free volatile execution
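The L3 planning cell ("privacy cost map + rerouting") can be illustrated with a small grid planner. This is a minimal sketch assuming a 4-connected grid, a unit cost per step, and an additive penalty for entering private cells; the paper's planner is not described in this summary, and `privacy_cost_map` and `plan` are hypothetical names.

```python
import heapq


def privacy_cost_map(private_mask, penalty):
    """Cost of entering each cell: 1 normally, 1 + penalty if private."""
    return [[1 + penalty if is_private else 1 for is_private in row]
            for row in private_mask]


def plan(cost_map, start, goal):
    """Dijkstra on a 4-connected grid; returns (total_cost, path) or None."""
    rows, cols = len(cost_map), len(cost_map[0])
    best = {start: 0}
    parent = {}
    frontier = [(0, start)]
    while frontier:
        cost, cell = heapq.heappop(frontier)
        if cell == goal:
            path = [cell]
            while cell in parent:
                cell = parent[cell]
                path.append(cell)
            return cost, path[::-1]
        if cost > best.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + cost_map[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_cost, (nr, nc)))
    return None
```

With `penalty=0` the planner reduces to the L1 shortest path. Raising the penalty makes the route bend around private cells while still allowing it to cross them when no detour exists, which is the soft-constraint behavior that distinguishes a cost map from a hard exclusion zone.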

Ablation Study

The paper uses the navigation case study to observe the degradation of task success rate and path length under different pixelation intensities \(K\):

Configuration	Privacy Level	Success Rate	Path Length	Description
\(K=1\) (Original)	L1	Baseline High	Baseline Short	No privacy constraints
\(K\) Medium	L3	Moderate Decrease	Slightly Increased	Partial semantic loss but still feasible
\(K\) High	L4 Boundary	Significant Decrease	Substantially Increased	Near operational boundary
Beyond Boundary	Infeasible	Failed	Inaccessible	Task cannot be completed

Key Findings

Stage-local privacy patches can be "reversed" downstream: even if faces are anonymized at the perception stage, movement patterns in planning logs can still infer identity or health status, indicating a need for lifecycle design.
The privacy-utility relationship is non-linear, and an "operational boundary" exists: once privacy intensity exceeds a certain point, the task fails outright. This threshold is a critical consideration for deployment.
Fixed privacy policies may work in laboratory settings but often fail in real-world deployment because they cannot adapt to scenarios, highlighting the necessity of dynamic classification mechanisms.

Highlights & Insights

The framing of "privacy as a dynamic control signal" upgrades privacy from a compliance issue to a system control problem, allowing for seamless integration with fields such as control theory and safety filters.
The L1-L4 quadruple \(\{S, I, C, \Phi\}\) provides a formalizable vehicle for privacy grading that is far superior to the binary "sensitive/insensitive" classification commonly used in industry, directly guiding SDK design.
The "highest-triggering-criterion" rule draws from the concept of priority inheritance in real-time systems; it is simple but effectively resolves accountability issues between stages—once a stage triggers a higher level, the entire pipeline must comply.
Treating navigation as a controlled probe to quantify trade-offs is a clever approach, as navigation naturally couples all four stages and has mature utility metrics (success rate, path length) that can be reused for privacy evaluation in other EAI sub-tasks.

Limitations & Future Work

The framework remains at a conceptual level, and the case studies only cover navigation and household pixelation, lacking coverage of more complex scenarios such as robotic manipulation or medical assistance.
How specific thresholds for L1-L4 levels are defined and who defines them remains unclear; there is a risk of degradation where "more levels lead to more conservatism and poorer utility," requiring principled classification algorithms rather than manual tuning.
The "highest-triggering-criterion" may cause a "privacy level lock at L4" during multi-task concurrency, requiring a release mechanism and detailed audit log design.
Although heavy primitives like FHE / ZKP are mentioned, a computational overhead budget analysis is missing; in actual deployment, these often become system bottlenecks.

vs. Pape et al.'s prompt obfuscation: They focus on single-turn privacy obfuscation in LLMs; this paper expands the perspective to the full closed-loop of Embodied AI, emphasizing the systemic problem where "upstream masks can be reversed downstream."
vs. Legal Compliance Frameworks (GDPR/CCPA): Law provides principles but not staged operational guidance; this paper uses a four-level matrix to map high-level compliance principles to specific technical primitives for each EAI stage.
vs. Classical Differential Privacy: DP provides mathematical guarantees but focuses on data publishing; this paper emphasizes real-time, context-aware policy switching during deployment, which is more aligned with the actual requirements of embodied agents.

Rating

Novelty: ⭐⭐⭐⭐ The framing of "lifecycle privacy as a control signal" is an early systematic attempt in Embodied AI literature.
Experimental Thoroughness: ⭐⭐⭐ Includes only navigation + pixelation cases, lacking diverse validation in manipulation or medical fields.
Writing Quality: ⭐⭐⭐⭐ Clear structure, following a logical flow from problem → classification → orchestration → case study.
Value: ⭐⭐⭐⭐ Provides a reusable blueprint for the privacy architecture design of embodied robots and home service agents.