Towards Context-Aware Image Anonymization with Multi-Agent Reasoning¶
Conference: CVPR 2026 arXiv: 2603.27817 Code: None Area: Image Segmentation Keywords: Image Anonymization, Multi-Agent Reasoning, Diffusion-Based Inpainting, Privacy Protection, GDPR Compliance
TL;DR¶
This paper proposes CAIAMAR, a multi-agent framework that couples dedicated models for direct PII (pedestrians, license plates) with context-aware reasoning by large vision-language models (LVLMs). Through a Plan-Do-Check-Act (PDCA) iterative refinement loop, it detects indirect privacy identifiers and removes them via appearance-decorrelated inpainting with diffusion models. It reduces re-identification risk on CUHK03-NP by 73% while maintaining high image quality on CityScapes (FID 9.1).
Background & Motivation¶
- Background: Street-view images are widely used in navigation, urban planning, and autonomous driving datasets, yet contain abundant personally identifiable information (PII). Existing anonymization methods primarily target direct identifiers such as faces and license plates.
- Limitations of Prior Work: (1) Traditional blurring methods (e.g., Gaussian blur) degrade downstream task performance (CityScapes instance segmentation AP drops by 5.3%) and are vulnerable to inversion attacks (95.9% identity recovery rate on CelebA-HQ); (2) Existing generative methods (DeepPrivacy2, FADM, etc.) focus solely on human bodies/faces and neglect indirect identifiers (clothing, accessories, contextual objects); (3) State-of-the-art LVLMs can infer private attributes from contextual cues (76.4% accuracy), and the o3 model achieves 99% city-level geolocation from casual photographs.
- Key Challenge: Effective anonymization cannot address only direct PII; it must also handle context-dependent indirect identifiers. However, the semantic diversity of indirect PII makes fixed detectors and rigid category rules insufficient.
- Goal: Can multi-agent collaboration enable context-aware image anonymization while preserving data utility and providing interpretable audit trails?
- Key Insight: Decompose the task into auditing (PII classification), generation (inpainting), and coordination (workflow management) via a multi-agent system, employing PDCA iterative refinement rather than a single detect-and-inpaint pass.
- Core Idea: A two-phase architecture — Phase 1 applies dedicated models for direct PII; Phase 2 employs multi-agent + LVLM reasoning for context-dependent indirect identifiers.
Method¶
Overall Architecture¶
A two-phase architecture is adopted. Phase 1 (predefined processing) performs detection and anonymization of direct PII: YOLOv8 detects pedestrians → SDXL + OpenPose ControlNet inpaints; YOLOv8s detects license plates → Gaussian blur; YOLO-TS detects traffic signs → exclusion masks are generated. Phase 2 (multi-agent collaboration) coordinates three specialized agents within the AutoGen framework under a round-robin policy, executing bounded PDCA loops to detect and process indirect PII.
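The Phase 1 routing described above can be sketched as a simple dispatch table. Detector and action names below are illustrative placeholders, not the paper's actual interfaces:

```python
# Hypothetical sketch of Phase 1 dispatch: each direct-PII class is routed
# to a dedicated detector and a fixed anonymization action. The paper wires
# real YOLO/SDXL components here; these strings are stand-ins.

PHASE1_ROUTES = {
    "pedestrian":    {"detector": "YOLOv8",  "action": "sdxl_openpose_inpaint"},
    "license_plate": {"detector": "YOLOv8s", "action": "gaussian_blur"},
    "traffic_sign":  {"detector": "YOLO-TS", "action": "exclusion_mask"},
}

def route_detections(detections):
    """Group raw detections into per-action work lists for Phase 1."""
    plan = {}
    for det in detections:
        route = PHASE1_ROUTES.get(det["class"])
        if route is None:
            continue  # unknown classes are deferred to the Phase 2 agents
        plan.setdefault(route["action"], []).append(det["box"])
    return plan

plan = route_detections([
    {"class": "pedestrian", "box": (10, 20, 50, 120)},
    {"class": "license_plate", "box": (200, 300, 260, 330)},
    {"class": "bicycle", "box": (0, 0, 30, 30)},  # not direct PII -> Phase 2
])
```

Anything that falls through this table (the `bicycle` above) is exactly what Phase 2's agents must reason about.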
Key Designs¶
- Three-Agent PDCA Coordination Mechanism:
  - Function: Systematically detect, process, and verify indirect PII through controlled iterations.
  - Mechanism: Three agents operate in a fixed round-robin order — Auditor (Qwen2.5-VL-32B classifies PII + performs independent quality verification) → Orchestrator (tracks workflow state, coordinates retry logic) → Generative (executes scout-and-zoom segmentation + inpainting). Each PDCA cycle consists of: Plan (identify instances to process) → Do (segment and inpaint with IoU-based deduplication) → Check (two-tier verification: Generative agent's IoU deduplication + Auditor's independent visual inspection) → Act (decide to continue or terminate, \(n_{\max}=3\)).
  - Design Motivation: A single detection pass cannot cover all indirect PII; iterative refinement progressively discovers and processes missed targets. Bounded iteration (\(n_{\max}=3\)) prevents infinite loops, and 76% of images converge within two rounds.
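The bounded PDCA loop above can be sketched as a small control structure, with hypothetical callables standing in for the Auditor/Orchestrator/Generative agents:

```python
# Minimal sketch of the bounded PDCA loop (n_max = 3). The plan/do/check
# callables are placeholders for the Auditor (plan, check) and Generative
# (do) agents; the Orchestrator's role is the loop bookkeeping itself.

def pdca_anonymize(image, plan, do, check, n_max=3):
    """Run Plan -> Do -> Check -> Act rounds until clean or budget exhausted."""
    for round_idx in range(1, n_max + 1):
        targets = plan(image)          # Plan: flag remaining indirect PII
        if not targets:
            return image, round_idx    # Act: nothing left -> terminate
        image = do(image, targets)     # Do: segment and inpaint targets
        if check(image):               # Check: independent verification
            return image, round_idx    # Act: verified clean -> terminate
    return image, n_max                # Act: iteration budget exhausted

# Toy run: two rounds of remaining targets, then clean (mirrors the common
# two-round convergence case reported in the paper).
pending = [["logo", "text"], ["text"], []]
img, rounds = pdca_anonymize(
    "street.png",
    plan=lambda im: pending.pop(0),
    do=lambda im, targets: im,
    check=lambda im: (not pending[0]) if pending else True,
)
```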
- Scout-and-Zoom Segmentation Strategy:
  - Function: Two-stage detection pipeline from coarse LVLM localization to precise semantic segmentation.
  - Mechanism: Inspired by the region proposal mechanism in Faster R-CNN — (1) the LVLM (Qwen2.5-VL-32B) generates coarse bounding boxes as candidate regions; (2) the image is cropped to each LVLM bounding box, and Grounded-SAM-2 is applied to the local crop to obtain precise segmentation masks; (3) local masks are mapped back to global image coordinates. A 30% IoU threshold deduplicates instances across iterations to prevent redundant processing (e.g., berlin_000002 skips an instance with IoU=0.88 in Round 2).
  - Design Motivation: LVLMs excel at semantic understanding but produce coarse spatial localization; dedicated segmentation models offer precise localization but lack semantic reasoning. Combining both achieves coarse-to-fine detection.
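The coordinate bookkeeping behind scout-and-zoom — mapping crop-local boxes back to global coordinates and applying the 30% IoU deduplication — can be sketched as follows (box format and helper names are assumptions):

```python
# Boxes are (x1, y1, x2, y2) tuples; crop_origin is the crop's top-left
# corner in full-image coordinates. Helper names are illustrative.

def to_global(local_box, crop_origin):
    """Translate a box found in a local crop back to full-image coordinates."""
    ox, oy = crop_origin
    x1, y1, x2, y2 = local_box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter else 0.0

def is_duplicate(box, processed, thresh=0.30):
    """Skip an instance already handled in a previous PDCA round."""
    return any(iou(box, p) > thresh for p in processed)

g = to_global((5, 5, 25, 25), crop_origin=(100, 200))  # -> (105, 205, 125, 225)
```

A previously processed box with high overlap (like the IoU=0.88 skip in berlin_000002) would make `is_duplicate` return `True` and the instance would not be re-inpainted.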
- Appearance-Decorrelated Diffusion Inpainting:
  - Function: Replace PII regions with entirely new appearances to eliminate re-identification vectors.
  - Mechanism: Human inpainting uses SDXL + OpenPose ControlNet (conditioning scale 0.8, strength 0.9); the LVLM generates descriptions with randomly sampled clothing colors (20 colors × 10 brightness levels), preserving utility attributes such as body pose and build. Object/text inpainting uses SDXL + Canny ControlNet. Critically, color matching is fully disabled (luminance=0.0, chrominance=0.0) to prevent appearance correlation and fundamentally sever re-identification feature chains.
  - Design Motivation: Traditional blurring retains structural features recoverable via inversion attacks; GAN-based inpainting lacks sufficient diversity and controllability. Diffusion models combined with ControlNet fully replace appearance features while preserving scene structure (pose, geometry).
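The appearance randomization step can be sketched as below; the specific color and brightness vocabularies are illustrative — only the 20 × 10 sample space comes from the paper:

```python
# Hedged sketch of prompt-level appearance decorrelation: sample one of
# 20 colors x 10 brightness levels to build an inpainting prompt that is
# independent of the original clothing. The word lists are made up here;
# the paper does not publish its exact vocabulary.

import random

COLORS = ["red", "orange", "yellow", "green", "teal", "cyan", "blue",
          "navy", "purple", "magenta", "pink", "brown", "tan", "beige",
          "olive", "maroon", "grey", "black", "white", "khaki"]  # 20 colors
BRIGHTNESS = [f"level-{i}" for i in range(10)]                   # 10 levels

def sample_clothing_prompt(rng=random):
    color = rng.choice(COLORS)
    level = rng.choice(BRIGHTNESS)
    # Pose and build are preserved by the OpenPose ControlNet condition,
    # so the prompt only randomizes appearance attributes.
    return f"a pedestrian wearing {color} ({level} brightness) clothing"

assert len(COLORS) * len(BRIGHTNESS) == 200  # the 20 x 10 sample space
```

Because the sampled color is independent of the source image, inpainted appearance carries no information about the original clothing, which is what severs the re-identification feature chain.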
Loss & Training¶
- The framework requires no training and relies entirely on zero-shot/few-shot capabilities of pretrained models.
- Re-ID evaluation uses ResNet50 with triplet loss and center loss, trained for 120 epochs (SGD, lr=0.05).
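For reference, the two Re-ID training objectives mentioned above can be written as a minimal sketch (margin value and class centers are placeholders, not the paper's settings):

```python
# Illustrative sketch of the Re-ID evaluation objective: triplet loss plus
# center loss over embedding vectors, here as plain-Python list math.

def l2sq(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull same-identity pairs together, push different identities apart
    # until they are separated by at least `margin`.
    return max(0.0, l2sq(anchor, positive) - l2sq(anchor, negative) + margin)

def center_loss(embeddings, centers, labels):
    # Penalize each embedding's distance to its identity's running center,
    # tightening intra-class clusters.
    return sum(l2sq(e, centers[y]) for e, y in zip(embeddings, labels)) / len(labels)
```

A lower Re-ID Rank-1 after anonymization means this trained matcher can no longer link the anonymized image back to its identity.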
- The license plate detector fine-tunes YOLOv8s on the UC3M-LP dataset, achieving mAP50-95=0.82.
Key Experimental Results¶
Main Results¶
| Method | CUHK03-NP R1↓ | CUHK03-NP mAP↓ | CityScapes KID↓ | CityScapes FID↓ |
|---|---|---|---|---|
| Original (No Anonymization) | 62.4% | 66.0% | - | - |
| Gauss. Blur | 9.4% | 6.4% | 0.224 | 178.5 |
| DeepPrivacy2 | 8.6% | 4.4% | 0.066 | 59.7 |
| FADM | 33.4% | 32.9% | 0.032 | 33.3 |
| CAIAMAR (Ours) | 16.9% | 13.7% | 0.001 | 9.1 |
Ablation Study¶
| Configuration | Indirect PII Detected | Time/Image | Notes |
|---|---|---|---|
| Phase 1 only | 0 | 67.8s | Direct PII only |
| Full pipeline | 1,107 | 133.5s | 54 indirect PII categories |

| Method | Downstream mIoU (Δ) | Notes |
|---|---|---|
| CAIAMAR (Ours) | 0.877 (−0.123) | Semantic segmentation preserved |
| SVIA | 0.478 (−0.522) | Severe degradation |
Key Findings¶
- Re-ID risk is reduced by 73% (R1: 62.4% → 16.9%) while image quality far surpasses brute-force methods (FID 9.1 vs. Blur 178.5).
- Phase 2 additionally detects 1,107 indirect PII instances across 54 object categories (vehicle markings 57.4%, textual elements 37.8%).
- Privacy–utility trade-off: stronger privacy protection than FADM (R1 reduced by 49%) with better distribution preservation (KID reduced by 56%).
- Downstream semantic segmentation mIoU decreases by only 0.123 (vs. 0.522 for SVIA), with near-lossless preservation of static categories (road −0.005, sky −0.005).
- 76% of images converge within 2 PDCA rounds; agent communication overhead accounts for only 7.4% of total time.
Highlights & Insights¶
- From "what is PII" to "what is PII in this context": This represents a paradigm shift in anonymization thinking. A vehicle marking on a private driveway constitutes PII, whereas one in a public parking lot may not — context determines privacy sensitivity, requiring reasoning rather than fixed rules.
- Two-tier verification prevents both omission and redundancy: The Generative Agent's IoU deduplication prevents redundant processing (efficiency), while the Auditor Agent's independent visual inspection ensures quality; the complementary design is broadly transferable.
- Fully local deployment + audit trail: The pipeline exclusively uses open-source models (Qwen2.5-VL, SDXL, Grounded-SAM-2), satisfying GDPR data sovereignty requirements, and generates structured audit trails supporting transparency and interpretability.
Limitations & Future Work¶
- Processing speed is slow (133.5s/image), precluding real-time deployment and limiting applicability to batch processing scenarios.
- Zero-shot PII detection underperforms on fine-grained localization (Dice of only 25.78% on the Visual Redactions Dataset).
- No comparison against a single-agent baseline is provided, leaving the advantage of multi-agent over single LVLM undemonstrated through ablation.
- Systematic hyperparameter ablation is absent (e.g., \(n_{\max}\), IoU threshold, ControlNet conditioning scale).
- Inherent LLM failure modes — "confirm but not execute," format inconsistency — are mitigated but not fundamentally resolved.
- Future work may explore hybrid architectures combining dedicated detectors for high-frequency categories (faces/bodies) with LVLM-based open-vocabulary detection for long-tail categories.
Related Work & Insights¶
- vs. DeepPrivacy2: DP2 is GAN-based and achieves stronger privacy protection (R1 8.6%) but at severe image quality cost (SSIM 0.443, KID 0.066); CAIAMAR achieves 73% Re-ID reduction with substantially better image quality.
- vs. FADM: FADM performs full-body anonymization only and ignores indirect identifiers; CAIAMAR additionally discovers 1,107 indirect PII instances.
- vs. SVIA: SVIA anonymizes large scene regions (buildings, roads, etc.), causing catastrophic quality degradation (FID 44.3 vs. 9.1, mIoU 0.478 vs. 0.877).
Rating¶
- Novelty: ⭐⭐⭐⭐ Applying multi-agent + PDCA loops to anonymization is a novel system design; the context-aware PII classification paradigm substantially advances beyond conventional approaches.
- Experimental Thoroughness: ⭐⭐⭐ Re-ID and image quality evaluations are comprehensive, but critical ablations are missing (multi-agent vs. single-agent, comparison across different LVLMs, etc.).
- Writing Quality: ⭐⭐⭐⭐ System architecture is described clearly with detailed tables and case analyses, though the main text is somewhat verbose due to extensive implementation details.
- Value: ⭐⭐⭐⭐ Presents a practically deployable, GDPR-compliant anonymization solution and, for the first time, systematically addresses indirect PII, offering substantial value to industry practitioners.