Re-identification of De-identified Documents with Autoregressive Infilling

Conference: ACL 2025
arXiv: 2505.12859
Code: Not open-sourced
Area: Other
Keywords: De-identification, Re-identification attacks, RAG, Text infilling, Privacy protection, ColBERT

TL;DR

Proposes a RAG-based re-identification method for de-identified documents: first employs sparse + dense retrieval to find relevant background documents, and then uses an autoregressive infilling model to infer masked personally identifiable information (PII), recovering up to 80% of the masked text across three datasets.

Background & Motivation

Core Problem: De-identification protects privacy by masking personally identifiable information (PII). However, there is a lack of effective automated methods to evaluate the robustness of de-identification—specifically, whether an adversary can recover masked content from context and background knowledge.
Limitations of Prior Work:
- Human-annotated evaluation: Relying on human experts for comparison is highly costly and suffers from inconsistency.
- Classifier-based attacks: Manzanares-Salor et al. trained classifiers to predict names directly but did not attempt to recover the masked text itself, so the attack offers no insight into the intermediate reasoning process.
- Morris et al.: Used models to predict infoboxes to guide masking decisions, but similarly did not try to retrieve the masked content.
Design Motivation: By leveraging LLM capabilities, this work constructs an adversary-simulating RAG system that recovers masked content before inferring identities, thereby providing a more comprehensive security assessment of de-identification methods.

Method

Overall Architecture

Given a de-identified document (where PII is replaced by [MASK]), the system executes a three-step re-identification workflow: (1) sparse retrieval (BM𝒳) selects the Top-100 relevant documents from the background knowledge base; (2) dense retrieval (fine-tuned ColBERT) extracts the most relevant passage for each [MASK]; (3) an infilling model (GLM or Mistral-12B) infers the original content from the retrieved passage and the surrounding context. Masks are filled one at a time until none remain.
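
The three-step workflow above can be sketched as a loop over masks. This is a minimal sketch, not the authors' implementation: `reidentify` and its three callables (`sparse_retrieve`, `dense_select`, `infill`) are hypothetical names standing in for BM𝒳, the fine-tuned ColBERT, and the infilling model, and the `toy_*` functions exist only to make the control flow runnable.

```python
from typing import Callable, List

MASK = "[MASK]"

def reidentify(
    masked_doc: str,
    corpus: List[str],
    sparse_retrieve: Callable[[str, List[str], int], List[str]],
    dense_select: Callable[[str, List[str]], str],
    infill: Callable[[str, str], str],
    top_k: int = 100,
) -> str:
    """Fill every [MASK] left to right, selecting a passage per mask."""
    # Step 1: sparse retrieval shortlists background documents once.
    candidates = sparse_retrieve(masked_doc, corpus, top_k)
    doc = masked_doc
    # Bound the loop by the initial mask count so it always terminates.
    for _ in range(masked_doc.count(MASK)):
        # Step 2: pick the passage most relevant to the first remaining mask.
        passage = dense_select(doc, candidates)
        # Step 3: predict the span for that mask from passage and context.
        span = infill(doc, passage)
        # Replace only the first mask so later masks see the filled context.
        doc = doc.replace(MASK, span, 1)
    return doc

# Toy stand-ins (assumptions for illustration only).
def toy_sparse(query: str, corpus: List[str], k: int) -> List[str]:
    # Rank documents by the number of shared lowercase tokens.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def toy_dense(doc: str, candidates: List[str]) -> str:
    # Simply trust the top sparse hit.
    return candidates[0]

def toy_infill(doc: str, passage: str) -> str:
    # Copy the passage token at the position of the first mask.
    position = doc.split().index(MASK)
    return passage.split()[position]

corpus = ["The Eiffel Tower is in Paris", "Curie was born in Warsaw in 1867"]
restored = reidentify("[MASK] was born in [MASK] in 1867", corpus,
                      toy_sparse, toy_dense, toy_infill)
print(restored)
```

With these toy components the call prints `Curie was born in Warsaw in 1867`. Filling one mask per iteration means each later prediction can condition on spans recovered earlier.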

Key Designs

Two-stage Retrieval: Sparse retrieval (BM𝒳) quickly narrows down the scope to 100 documents, and dense retrieval (ColBERT) precisely pinpoints passages containing the masked information. ColBERT is fine-tuned on de-identified Wikipedia biographies using positive (passages containing the original content) and negative examples.
Four-level Background Knowledge Control: L1 (no retrieval) \(\rightarrow\) L2 (general knowledge, excluding the original texts) \(\rightarrow\) L3 (adds the original texts of other de-identified documents) \(\rightarrow\) L4 (adds the original text of the target document). The levels systematically measure how much the available background knowledge affects re-identification capability.
Final Identity Inference: After text infilling, a BERT ranking model scores the recovered document against a candidate list of names to identify the individual.
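
One way to read the four knowledge levels is as progressively larger retrieval corpora. The sketch below is an illustration under assumptions: `build_corpus` and its inputs are hypothetical names, and treating each level as a superset of the previous one is inferred from the description rather than confirmed by the source.

```python
from typing import Dict, List

def build_corpus(
    level: int,
    general: List[str],
    originals: Dict[str, str],
    target_id: str,
) -> List[str]:
    """Return the background corpus an adversary may search at a given level."""
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be 1, 2, 3, or 4")
    corpus: List[str] = []  # L1: no retrieval, so the corpus stays empty
    if level >= 2:
        # L2: general knowledge that contains none of the original texts.
        corpus += general
    if level >= 3:
        # L3: originals of the *other* de-identified documents.
        corpus += [text for doc_id, text in originals.items() if doc_id != target_id]
    if level >= 4:
        # L4: the original text of the target document itself.
        corpus.append(originals[target_id])
    return corpus

general = ["g1", "g2"]
originals = {"a": "A", "b": "B"}
print(build_corpus(3, general, originals, target_id="a"))
```

For target `"a"`, level 3 yields `['g1', 'g2', 'B']`: the other document's original is searchable, but the target's own original is withheld until level 4.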

Loss & Training

ColBERT retriever: Trained using the standard contrastive learning loss (positive/negative passage-query pairs) with a learning rate of \(3 \times 10^{-5}\).
GLM infilling model: Trained on de-identified Wikipedia biographies and corresponding retrieved text pairs with a learning rate of \(3 \times 10^{-5}\).
BERT ranking model: Margin ranking loss with a learning rate of \(3 \times 10^{-6}\).
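
For reference, the two named objectives can be written out in a few lines. This is a sketch of common textbook forms, not the paper's training code: the source says only "standard contrastive learning loss" and "margin ranking loss", so the softmax-over-one-positive formulation, the default margin of 1.0, and the function names are assumptions.

```python
import math
from typing import Sequence

def contrastive_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Softmax cross-entropy with the positive passage as the target class."""
    scores = [pos_score, *neg_scores]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score

def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Hinge loss that is zero once the positive outranks the negative by `margin`."""
    return max(0.0, margin - (pos_score - neg_score))

print(contrastive_loss(0.0, [0.0]))    # equal scores: loss is log(2)
print(margin_ranking_loss(2.0, 0.5))   # gap of 1.5 exceeds the margin: 0.0
print(margin_ranking_loss(0.5, 1.0))   # negative ranked higher: 1.5
```

Both losses push the score of the correct passage or name above the incorrect ones. The contrastive form compares against all negatives at once, while the margin form compares one pair at a time.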

Experiments

Main Results (End-to-End Infilling, Exact Match / Token Recall, %)

Dataset	Model	L1 (No Retrieval)	L2 (General Knowledge)	L3 (With Other Originals)	L4 (With Original)
Wikipedia	GLM	6.26 / 12.22	9.56 / 15.84	9.77 / 16.05	80.08 / 82.56
TAB (Court)	GLM	0.84 / 6.26	11.27 / 21.35	14.32 / 29.08	66.04 / 75.13
TAB (Court)	Mistral	0.91 / 25.36	10.59 / 47.43	11.00 / 47.98	37.34 / 70.29
Clinical Notes	GLM	18.31 / 26.71	18.92 / 26.36	42.31 / 55.40	90.87 / 92.68
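
The two numbers in each cell can be computed per masked span roughly as follows. This is a sketch under assumptions: the source does not spell out its normalization or tokenization, so the lowercase whitespace tokenization, the multiset definition of token recall, and the per-span averaging are illustrative choices.

```python
from collections import Counter
from typing import List, Tuple

def exact_match(pred: str, gold: str) -> bool:
    """True when the predicted span equals the gold span, ignoring case and edges."""
    return pred.strip().lower() == gold.strip().lower()

def token_recall(pred: str, gold: str) -> float:
    """Share of gold tokens that also appear in the prediction (multiset overlap)."""
    gold_tokens = Counter(gold.lower().split())
    pred_tokens = Counter(pred.lower().split())
    if not gold_tokens:
        return 0.0
    overlap = sum((gold_tokens & pred_tokens).values())
    return overlap / sum(gold_tokens.values())

def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (exact match %, token recall %) averaged over (pred, gold) spans."""
    em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
    tr = sum(token_recall(p, g) for p, g in pairs) / len(pairs)
    return 100 * em, 100 * tr

pairs = [("Marie Curie", "Marie Curie"), ("Curie", "Marie Curie"), ("Paris", "Warsaw")]
em, tr = evaluate(pairs)
print(round(em, 2), round(tr, 2))
```

On these three toy spans, one is an exact match and token recall averages (1 + 0.5 + 0) / 3. Token recall gives partial credit for a correct surname, which is why it can be far higher than exact match, as in the Mistral rows.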

Ablation Study (Final Identity Inference Top-10 Accuracy)

Dataset	Model	Masked Document	L1	L2	L3	L4
TAB	GLM	28.3	32.3	31.5	29.1	61.4
Clinical Notes	GLM	57.0	62.4	62.1	77.9	98.7
TAB	Mistral	28.3	32.3	33.1	37.0	57.5
Clinical Notes	Mistral	57.0	61.1	66.1	81.2	97.0
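
Top-10 accuracy counts a document as re-identified when the true person appears among the ten highest-scored candidates. A minimal sketch follows; `top_k_accuracy` is a hypothetical helper, the `score` callable stands in for the BERT ranking model, and the toy scorer is purely illustrative.

```python
from typing import Callable, List, Sequence, Tuple

def top_k_accuracy(
    examples: Sequence[Tuple[str, str]],
    candidates: List[str],
    score: Callable[[str, str], float],
    k: int = 10,
) -> float:
    """Fraction of (document, true_name) pairs whose name ranks in the top k."""
    hits = 0
    for doc, true_name in examples:
        # Rank all candidate names by the model's score for this document.
        ranked = sorted(candidates, key=lambda name: score(doc, name), reverse=True)
        hits += true_name in ranked[:k]
    return hits / len(examples)

def toy_score(doc: str, name: str) -> float:
    # Toy scorer (assumption): 1.0 if the name literally appears in the text.
    return 1.0 if name in doc else 0.0

examples = [("Alice met the judge", "Alice"), ("the applicant appealed", "Bob")]
candidates = ["Alice", "Bob", "Carol"]
print(top_k_accuracy(examples, candidates, toy_score, k=1))
print(top_k_accuracy(examples, candidates, toy_score, k=3))
```

The metric depends heavily on the size of the candidate list: with k equal to the number of candidates it is trivially 1.0, so Top-10 over 85 candidates is an easier target than Top-10 over 127.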

Key Findings

Background knowledge significantly impacts re-identification capability: For GLM on Wikipedia, exact match jumps from 6.26% at L1 to 80.08% at L4, indicating that the security of de-identification depends heavily on the external knowledge available to the adversary.
Quasi-identifiers are easier to recover than direct identifiers: Token recall for quasi-identifiers such as locations and dates is much higher than that for direct identifiers like names.
Non-fine-tuned Mistral-12B achieves higher token recall than GLM at L1-L3 on TAB (e.g., 47.43 vs. 21.35 at L2): The world knowledge of the LLM compensates for the lack of domain adaptation, but at L4 it trails the fine-tuned GLM in both exact match (37.34 vs. 66.04) and token recall (70.29 vs. 75.13).
Clinical notes are the easiest to re-identify: Highly structured texts with fixed patterns (e.g., patient records) provide adversaries with more exploitable pattern information.
Identity inference is effective on small candidate sets: With GLM at L4, Top-10 accuracy reaches 98.7% for the 85 candidate patients in clinical notes, but only 61.4% for the 127 candidates in court cases.

Highlights & Insights

First to apply the RAG paradigm to re-identification attacks on de-identified documents, providing a novel perspective on evaluating the robustness of de-identification methods.
The experimental design of four-level background knowledge (L1-L4) systematically quantifies the privacy risk gradients under different levels of information availability.
The method can be directly applied to "red-teaming" during the de-identification phase, helping to discover insufficiently masked content.
Discovers that the non-fine-tuned Mistral-12B already exhibits high token recall at low-to-medium background knowledge levels, revealing the privacy risks brought by the broad knowledge inherent to LLMs.
The three datasets cover different document types (encyclopedia/legal/medical), enhancing the generalizability of the conclusions.

Limitations & Future Work

Only evaluates English text; the de-identification patterns and difficulty of re-identification in other languages may differ.
The GLM model is relatively small (335M), and Mistral is used only in a zero-shot manner; larger models combined with in-context learning (ICL) or fine-tuning could yield further improvements.
The original versions of Wikipedia and court cases might already be included in the LLM pre-training data, leading to an overestimation of re-identification performance.
Uses only textual data as background knowledge, without considering structured information sources such as tables and knowledge graphs.
The de-identification strategy only considers entity masking, without evaluating the robustness of more advanced text-rewriting-based de-identification methods.
Clinical notes are synthetic data, which might introduce pattern artifacts, making re-identification easier than on real data.

Related Work

Text De-identification: Lison et al. 2021 (NER-based masking), Pilán et al. 2022 (TAB benchmark, manually annotating direct/quasi-identifiers), Sánchez & Batet 2016 (text sanitization), Dernoncourt et al. 2017
Text Infilling: GLM (Du et al. 2022, pretraining with autoregressive blank infilling), Fill-in-the-Middle (Bavarian et al. 2022), Zhu et al. 2019, Donahue et al. 2020
RAG: Lewis et al. 2020 (original RAG paper), ColBERT (Khattab & Zaharia 2020, dense retrieval), Guu et al. 2020 (REALM), Izacard et al. 2023
Re-identification Attacks: Manzanares-Salor et al. 2024 (classifiers directly predicting personal names), Morris et al. 2022/2024 (infobox prediction guiding masking)
Privacy Protection: GDPR data minimization principle, differentially private text rewriting (Igamberdiev & Habernal 2023)

Rating

Novelty: ⭐⭐⭐⭐ — Applying RAG to privacy attacks presents an insightful and fresh perspective.
Practicality: ⭐⭐⭐⭐⭐ — Directly applicable to evaluating the robustness of de-identification systems.
Rigor: ⭐⭐⭐⭐ — The four-level background knowledge is thoroughly designed, with three datasets covering different scenarios.
Overall: ⭐⭐⭐⭐