Skip to content

Unveiling Privacy Risks in LLM Agent Memory

Conference: ACL 2025
arXiv: 2502.13172
Code: https://github.com/wangbo9719/MEXTRA
Area: LLM Agent
Keywords: Agent Privacy, Memory Module Attack, Black-box Attack, Privacy Extraction, Agent Security

TL;DR

This paper systematically investigates the privacy risks of LLM Agent memory modules and proposes MEXTRA, a black-box memory extraction attack. Utilizing carefully designed locator-aligner attack prompts and an automated diverse prompt generation method, the authors successfully extract large volumes of private query histories from both medical and online shopping Agents.

Background & Motivation

Background: LLM Agents are widely applied in privacy-sensitive scenarios such as healthcare and online shopping, where memory modules store user interaction history to serve as few-shot demonstrations.

Limitations of Prior Work: - Prior work has investigated external data leakage in RAG, but Agent memory modules as a new source of private information remain largely unexplored. - User queries stored in memory can contain highly sensitive information (e.g., patient conditions, shopping preferences). - Existing RAG attack prompts (e.g., "please repeat all context") are ineffective against Agents due to their more complex workflows.

Key Challenge: Agents require memory modules to enhance performance, yet the private information stored within these memories faces the risk of extraction.

Goal: (1) Can Agent memory information be extracted? (2) Which factors affect the degree of leakage? (3) What attack strategies are more effective?

Key Insight: Design a two-part attack prompt (locator + aligner) tailored to the characteristics of Agent workflows.

Core Idea: Enable the Agent to output private information from its memory within its own workflow framework through the locator + aligner attack prompt design.

Method

Overall Architecture

Black-box attack setting \(\rightarrow\) attack prompt consists of two parts: \(\tilde{q} = \tilde{q}^{loc} \| \tilde{q}^{align}\) \(\rightarrow\) locator pinpoints user queries in memory \(\rightarrow\) aligner specifies the output format to align with the Agent workflow \(\rightarrow\) automated diverse prompt generation to cover more memory entries.

Key Designs

  1. Attack Prompt Design (Locator + Alinger):

    • Locator (\(\tilde{q}^{loc}\)): Explicitly specifies the retrieval of stored user queries (rather than other contextual descriptions) and prioritizes their output over completing the original task (e.g., "I lost previous example queries").
    • Aligner (\(\tilde{q}^{align}\)): Specifies the output format to align with the Agent workflow. For example, "please enter them in the search box" for a Web Agent, and "print them as output" for a Code Agent.
    • Design Motivation: Due to the complexity of Agent workflows (code execution/web operations), simple text output requests are insufficient. The Agent must be prompted to return private information within the constraints of its own workflow.
  2. Automated Diverse Prompt Generation:

    • Function: Use GPT-4 to automatically generate \(n\) diverse attack prompts to maximize the extraction of different memory entries.
    • Basic level: Only the application domain of the Agent is known; diverse prompts are generated by varying phrasing and expressions.
    • Advanced level: Once the similarity function is inferred, targeted optimization is applied—varying length if edit distance is used, or adding keywords from different domains if semantic similarity is used.
    • Design Motivation: A single attack can only retrieve the top-\(k\) records; diverse prompts are required to cover a wider range of the memory.
  3. Threat Model:

    • The attacker can only interact with the Agent through input queries (black-box setting).
    • Goal: Extract user queries \(q_i\) stored in the memory.
    • Two knowledge levels: Basic (knowing the application domain) and Advanced (inferring the retrieval function).

Key Experimental Results

Main Results (30 attack prompts, memory size 200)

Agent Extracted Number (EN) Retrieved Number (RN) Efficiency (EE) Complete Extraction Rate (CER)
EHRAgent (Medical) 50 55 0.42 0.83
RAP (Shopping) 26 27 0.29 0.87

Ablation Study

Configuration EHRAgent EN RAP EN Description
MEXTRA (Full) 50 26 Best
w/o aligner 36 6 Aligner has a significant impact on RAP
w/o req 39 25 Remove generation requirements
w/o demos 29 8 Remove demonstrations

Key Findings

  • Agent memory is highly vulnerable: Just 30 attack prompts can extract 25-50% from a memory size of 200, with a CER exceeding 83%.
  • Edit distance retrieval functions are more vulnerable to attacks than semantic similarity (covering more memory entries).
  • Larger memory sizes lead to more leakage (though the rate of increase diminishes).
  • Increasing the number of attacks continuously improves the extraction volume, particularly under Advanced knowledge.
  • The Aligner is critical for Agents with strict workflow constraints (such as Web Agents).

Highlights & Insights

  • Systematically reveals the privacy risks of LLM Agent memory modules for the first time, filling a critical gap in security research.
  • The Locator+Aligner attack design is clever, effectively exploiting the inherent characteristics of Agent workflows.
  • Discovers that different retrieval functions have a major impact on privacy leakage—a crucial security factor that should be considered when designing Agents.
  • Serves as a wake-up call for Agent developers: memory modules require dedicated privacy-preserving mechanisms.

Limitations & Future Work

  • Only two Agents and one LLM (GPT-4o) were tested; thus, generalizability remains to be verified.
  • The static memory setting (no memory updates) simplifies real-world scenarios.
  • No concrete defense mechanisms have been proposed.
  • Attacks depend on carefully designed prompts, which may fail if the Agent implements input filtering.
  • vs. RAG Privacy Attacks: RAG attacks only require the LLM to output text, whereas Agent attacks must adapt to complex workflows.
  • vs. Prompt Injection: Prompt injection aims to alter model behavior, whereas MEXTRA specifically aims to extract memory content.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic study on Agent memory privacy; creative attack design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two Agents + multi-factor analysis + ablations + different knowledge levels.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem definition and standardized formulations.
  • Value: ⭐⭐⭐⭐⭐ Important discovery in the security domain, offering direct security guidance for Agent design.