Unveiling Privacy Risks in LLM Agent Memory¶
Conference: ACL 2025
arXiv: 2502.13172
Code: https://github.com/wangbo9719/MEXTRA
Area: LLM Agent
Keywords: Agent Privacy, Memory Module Attack, Black-box Attack, Privacy Extraction, Agent Security
TL;DR¶
This paper systematically investigates the privacy risks of LLM Agent memory modules and proposes MEXTRA, a black-box memory extraction attack. Utilizing carefully designed locator-aligner attack prompts and an automated diverse prompt generation method, the authors successfully extract large volumes of private query histories from both medical and online shopping Agents.
Background & Motivation¶
Background: LLM Agents are widely applied in privacy-sensitive scenarios such as healthcare and online shopping, where memory modules store user interaction history to serve as few-shot demonstrations.
Limitations of Prior Work: - Prior work has investigated external data leakage in RAG, but Agent memory modules as a new source of private information remain largely unexplored. - User queries stored in memory can contain highly sensitive information (e.g., patient conditions, shopping preferences). - Existing RAG attack prompts (e.g., "please repeat all context") are ineffective against Agents due to their more complex workflows.
Key Challenge: Agents require memory modules to enhance performance, yet the private information stored within these memories faces the risk of extraction.
Goal: (1) Can Agent memory information be extracted? (2) Which factors affect the degree of leakage? (3) What attack strategies are more effective?
Key Insight: Design a two-part attack prompt (locator + aligner) tailored to the characteristics of Agent workflows.
Core Idea: Enable the Agent to output private information from its memory within its own workflow framework through the locator + aligner attack prompt design.
Method¶
Overall Architecture¶
Black-box attack setting \(\rightarrow\) attack prompt consists of two parts: \(\tilde{q} = \tilde{q}^{loc} \| \tilde{q}^{align}\) \(\rightarrow\) locator pinpoints user queries in memory \(\rightarrow\) aligner specifies the output format to align with the Agent workflow \(\rightarrow\) automated diverse prompt generation to cover more memory entries.
Key Designs¶
-
Attack Prompt Design (Locator + Alinger):
- Locator (\(\tilde{q}^{loc}\)): Explicitly specifies the retrieval of stored user queries (rather than other contextual descriptions) and prioritizes their output over completing the original task (e.g., "I lost previous example queries").
- Aligner (\(\tilde{q}^{align}\)): Specifies the output format to align with the Agent workflow. For example, "please enter them in the search box" for a Web Agent, and "print them as output" for a Code Agent.
- Design Motivation: Due to the complexity of Agent workflows (code execution/web operations), simple text output requests are insufficient. The Agent must be prompted to return private information within the constraints of its own workflow.
-
Automated Diverse Prompt Generation:
- Function: Use GPT-4 to automatically generate \(n\) diverse attack prompts to maximize the extraction of different memory entries.
- Basic level: Only the application domain of the Agent is known; diverse prompts are generated by varying phrasing and expressions.
- Advanced level: Once the similarity function is inferred, targeted optimization is applied—varying length if edit distance is used, or adding keywords from different domains if semantic similarity is used.
- Design Motivation: A single attack can only retrieve the top-\(k\) records; diverse prompts are required to cover a wider range of the memory.
-
Threat Model:
- The attacker can only interact with the Agent through input queries (black-box setting).
- Goal: Extract user queries \(q_i\) stored in the memory.
- Two knowledge levels: Basic (knowing the application domain) and Advanced (inferring the retrieval function).
Key Experimental Results¶
Main Results (30 attack prompts, memory size 200)¶
| Agent | Extracted Number (EN) | Retrieved Number (RN) | Efficiency (EE) | Complete Extraction Rate (CER) |
|---|---|---|---|---|
| EHRAgent (Medical) | 50 | 55 | 0.42 | 0.83 |
| RAP (Shopping) | 26 | 27 | 0.29 | 0.87 |
Ablation Study¶
| Configuration | EHRAgent EN | RAP EN | Description |
|---|---|---|---|
| MEXTRA (Full) | 50 | 26 | Best |
| w/o aligner | 36 | 6 | Aligner has a significant impact on RAP |
| w/o req | 39 | 25 | Remove generation requirements |
| w/o demos | 29 | 8 | Remove demonstrations |
Key Findings¶
- Agent memory is highly vulnerable: Just 30 attack prompts can extract 25-50% from a memory size of 200, with a CER exceeding 83%.
- Edit distance retrieval functions are more vulnerable to attacks than semantic similarity (covering more memory entries).
- Larger memory sizes lead to more leakage (though the rate of increase diminishes).
- Increasing the number of attacks continuously improves the extraction volume, particularly under Advanced knowledge.
- The Aligner is critical for Agents with strict workflow constraints (such as Web Agents).
Highlights & Insights¶
- Systematically reveals the privacy risks of LLM Agent memory modules for the first time, filling a critical gap in security research.
- The Locator+Aligner attack design is clever, effectively exploiting the inherent characteristics of Agent workflows.
- Discovers that different retrieval functions have a major impact on privacy leakage—a crucial security factor that should be considered when designing Agents.
- Serves as a wake-up call for Agent developers: memory modules require dedicated privacy-preserving mechanisms.
Limitations & Future Work¶
- Only two Agents and one LLM (GPT-4o) were tested; thus, generalizability remains to be verified.
- The static memory setting (no memory updates) simplifies real-world scenarios.
- No concrete defense mechanisms have been proposed.
- Attacks depend on carefully designed prompts, which may fail if the Agent implements input filtering.
Related Work & Insights¶
- vs. RAG Privacy Attacks: RAG attacks only require the LLM to output text, whereas Agent attacks must adapt to complex workflows.
- vs. Prompt Injection: Prompt injection aims to alter model behavior, whereas MEXTRA specifically aims to extract memory content.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic study on Agent memory privacy; creative attack design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two Agents + multi-factor analysis + ablations + different knowledge levels.
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition and standardized formulations.
- Value: ⭐⭐⭐⭐⭐ Important discovery in the security domain, offering direct security guidance for Agent design.