CompAgent: An Agentic Framework for Visual Compliance Verification¶
Conference: CVPR 2026 arXiv: 2511.00171 Code: None Area: Object Detection / Content Safety Keywords: Visual Compliance Verification, Agentic Framework, Tool-Augmented Reasoning, Content Moderation, MLLM
TL;DR¶
This paper proposes CompAgent, the first agentic framework for visual compliance verification. A Planning Agent dynamically selects visual tools (object detection, face analysis, NSFW detection, etc.) based on compliance policies, while a Compliance Verification Agent integrates image content, tool outputs, and policy context for multimodal reasoning. Without any training, CompAgent surpasses the previous SOTA by 10% on UnsafeBench, achieving 76% F1.
Background & Motivation¶
Visual content compliance verification is a critically important yet underexplored problem in the vision community:
Pressing Practical Demand: Regulations ranging from GDPR to Ofcom require visual content to meet compliance standards, with streaming platforms facing fines of up to $23 million for violations. Content compliance involves detecting harmful objects, inappropriate gestures, explicit content, and more, and continues to evolve across regions, cultures, and industries.
Fundamental Limitations of Existing Approaches:
- Dedicated Classifiers: Require expensive annotated data and must be retrained whenever policies change, resulting in poor generalization. LlavaGuard achieves F1=0.91 on its own dataset but drops to 0.66 on UnsafeBench.
- Direct MLLM Prompting: While MLLMs possess broad knowledge, they struggle with fine-grained visual detail reasoning and structured compliance rule application. Zero-shot MLLMs top out at 0.61 F1 on LlavaGuard (Claude Sonnet 3.5 v2) and 0.71 on UnsafeBench (Llama 4 Maverick).
The Agentic Gap: Despite the flourishing of agentic methods in other domains, no agentic framework specifically targeting visual compliance verification has been proposed.
CompAgent's approach: rather than training a dedicated model or relying solely on prompt engineering, it decomposes compliance verification into modular steps via a tool-augmented agentic architecture — combining dynamic tool selection planning with multimodal evidence fusion reasoning.
Method¶
Overall Architecture¶
CompAgent consists of three core components:
- Planning Agent: Parses compliance policies and dynamically selects appropriate visual tools to collect evidence.
- Tool Suite: A modular collection of off-the-shelf tools (object detection, face detection, OCR, NSFW detection, etc.).
- Compliance Verification Agent (CVAgent): Integrates the image, tool outputs, and policy context to render the final verdict.
Key Designs¶
- Planning Agent — ReAct-Loop Tool Orchestration: Built on the ReAct (Reasoning and Acting) framework, implementing a think–act–observe loop. At each step \(t\), the agent maintains a state \(s_t = \{I, P, E_t\}\) (image, policy, accumulated evidence), reasons about which policy clauses remain unverified, selects a tool \(a_t \in T \cup \{\text{CONCLUDE}\}\), executes the tool to obtain an observation \(o_t\), and updates the evidence:
\(E_{t+1} = E_t \cup \{(\text{thought}_t, a_t, o_t)\}\)
A key characteristic is that tool selection relies on neither a fixed routing table nor a learned policy; instead, the LLM reasons in context based on three factors: (1) which clauses in policy \(P\) still lack supporting evidence; (2) the capability and limitation descriptions of each tool; and (3) the evidence \(E_t\) already collected. For example, an age-restriction policy triggers face detection, while a text-violation policy prioritizes OCR. The implementation uses LangGraph with Claude Sonnet 3.5 v2, with a maximum of 10 reasoning steps.
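The think–act–observe loop can be sketched in plain Python. The `State` dataclass, the table-lookup tool choice, and the stub tools below are illustrative stand-ins only: the paper's planner selects tools via in-context LLM reasoning inside LangGraph, not a clause-to-tool mapping.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    image: str                                     # image reference I
    policy: dict                                   # P: clause -> required tool (illustrative)
    evidence: list = field(default_factory=list)   # E_t: (thought, action, observation) triples

def unverified_clauses(state):
    """Clauses in P that no collected evidence covers yet."""
    covered = {action for _, action, _ in state.evidence}
    return [c for c, tool in state.policy.items() if tool not in covered]

def plan_step(state, tools, max_steps=10):
    """Think-act-observe until every clause has evidence, then CONCLUDE."""
    for _ in range(max_steps):
        pending = unverified_clauses(state)
        if not pending:
            return "CONCLUDE"
        clause = pending[0]
        tool = state.policy[clause]        # here a lookup; in the paper, LLM reasoning
        thought = f"clause '{clause}' lacks evidence; calling {tool}"
        observation = tools[tool](state.image)
        # E_{t+1} = E_t ∪ {(thought_t, a_t, o_t)}
        state.evidence.append((thought, tool, observation))
    return "CONCLUDE"
```

A usage sketch: an age-restriction clause routes to face detection and a text-violation clause to OCR, after which the loop emits CONCLUDE with two evidence triples accumulated.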
- Modular Tool Suite: Covers the evidence types required by common compliance policies:
- Summarization Tool: Generates scene descriptions.
- Content Detection Tools: Face detection (age/expression/emotion), object detection (bounding boxes + confidence scores), text detection (OCR), and content moderation (unsafe categories + severity).
- Specialized Compliance Tools: LlavaGuard (safety rating + violation categories + rationale), Safe-CLIP (zero-shot detection of seven toxic content categories), and ICM Assistant (template-based safety assessment).
The tool suite is fully modular — tools can be added, removed, or replaced without retraining. The agent treats each tool as a black-box evidence source.
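A minimal sketch of what such a modular registry could look like, assuming each tool exposes a name, a capability description (read by the planner when choosing tools), and a callable. The tool names and descriptions here are illustrative, not the paper's exact suite.

```python
class ToolRegistry:
    """Plug-in registry: tools can be added, removed, or swapped with no retraining."""

    def __init__(self):
        self._tools = {}

    def register(self, name, description, fn):
        self._tools[name] = {"description": description, "run": fn}

    def unregister(self, name):
        self._tools.pop(name, None)

    def describe_all(self):
        """Capability text the Planning Agent reads when selecting a tool."""
        return "\n".join(f"{n}: {t['description']}"
                         for n, t in sorted(self._tools.items()))

    def run(self, name, image):
        # Each tool is treated as a black-box evidence source.
        return self._tools[name]["run"](image)

registry = ToolRegistry()
registry.register("object_detection",
                  "returns bounding boxes + confidence scores",
                  lambda img: [{"label": "knife", "box": [10, 10, 50, 50], "conf": 0.92}])
registry.register("ocr",
                  "extracts visible text from the image",
                  lambda img: {"text": "18+"})
```

Because the planner consumes only `describe_all()` text, adding a new compliance tool is a one-line `register` call rather than a retraining run.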
- CVAgent — Multimodal Evidence Fusion and Decision: Upon receiving CONCLUDE, CVAgent receives the complete state \(s_T = \{I, P, E_T\}\) and systematically: (1) directly inspects the image; (2) reviews each tool output, weighing confidence and cross-tool consistency; (3) maps the combined evidence to specific policy clauses; and (4) synthesizes an overall assessment. Outputs include a binary Safe/Unsafe rating, violation categories, and a rationale linking evidence to policy clauses.
Division of Labor with the Planning Agent: The Planning Agent determines what evidence to collect (a text-only LLM suffices), while CVAgent determines how to interpret the evidence (requiring the visual capabilities of an MLLM to directly examine the image).
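The shape of CVAgent's synthesis step can be sketched as follows. In the paper this step is an MLLM that also inspects the image directly; the rule-based `verify` below is only a stand-in showing the inputs (image, policy, evidence \(E_T\)) and the structured Safe/Unsafe output with categories and rationale. The clause predicates are hypothetical.

```python
def verify(image, policy_clauses, evidence):
    """Map collected tool evidence to policy clauses and emit a structured verdict.

    policy_clauses: clause name -> predicate over a single tool observation.
    evidence: list of (thought, tool, observation) triples, i.e. E_T.
    """
    violations, rationale = [], []
    for clause, predicate in policy_clauses.items():
        hits = [tool for _, tool, obs in evidence if predicate(obs)]
        if hits:  # cross-tool agreement strengthens the rationale
            violations.append(clause)
            rationale.append(f"{clause}: flagged by {', '.join(hits)}")
    return {
        "rating": "Unsafe" if violations else "Safe",
        "categories": violations,
        "rationale": rationale or ["no clause matched the collected evidence"],
    }
```

Note that `image` goes unused here precisely because this sketch omits the MLLM's direct visual inspection, which is the part of the paper's design that requires multimodal capability.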
Loss & Training¶
CompAgent is entirely training-free, requiring neither annotated data nor fine-tuning, making it highly adaptable to continuously evolving compliance policies. This is a key advantage over fine-tuning approaches such as LlavaGuard, which require policy-specific annotated data.
Key Experimental Results¶
Main Results¶
| Method | Type | LlavaGuard F1 | UnsafeBench F1 | Notes |
|---|---|---|---|---|
| Claude Sonnet 3.5 v2 | Zero-shot | 0.61 | 0.54 | Best zero-shot on LlavaGuard |
| Llama 4 Maverick | Zero-shot | 0.55 | 0.71 | Best zero-shot on UnsafeBench |
| LlavaGuard (dedicated policy) | Fine-tuned | 0.91 | 0.66 | Strong on own data, large cross-dataset drop |
| Safe-CLIP | Fine-tuned | 0.36 | 0.59 | Zero-shot toxic detection |
| Category-based Routing | Routing | 0.61 | 0.63 | Fixed-routing baseline |
| CompAgent | Agentic | 0.93 | 0.76 | Best on both datasets |
CompAgent achieves F1=0.93 on the LlavaGuard dataset (surpassing fine-tuned LlavaGuard at 0.91) and F1=0.76 on UnsafeBench (exceeding the previous SOTA by 10%), without any training data.
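For reference, the F1 figures above are the standard harmonic mean of precision and recall over binary Safe/Unsafe labels. A minimal sketch, using made-up predictions rather than the paper's data:

```python
def f1_score(y_true, y_pred, positive="Unsafe"):
    """F1 on the positive (Unsafe) class for binary Safe/Unsafe labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```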
Ablation Study¶
| Configuration | UnsafeBench F1 | Notes |
|---|---|---|
| No tools (direct MLLM) | 0.54 | Lacks fine-grained visual evidence |
| Fixed tool routing | 0.63 | Static assignment lacks flexibility |
| Without Planning Agent | Lower | Tool selection insufficiently targeted |
| Without CVAgent (Planning Agent decides directly) | Lower | Lacks multimodal evidence fusion |
| Full CompAgent | 0.76 | Dynamic orchestration + evidence fusion is optimal |
Analysis of decision trajectories reveals 95 distinct tool-usage patterns on the LlavaGuard dataset and 147 on UnsafeBench, demonstrating that the framework genuinely adapts dynamically to diverse compliance requirements.
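Assuming each decision trajectory is logged as the ordered sequence of tools invoked, counting distinct tool-usage patterns reduces to counting unique sequences:

```python
def distinct_patterns(trajectories):
    """Number of unique ordered tool sequences across decision trajectories."""
    return len({tuple(t) for t in trajectories})
```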
Key Findings¶
- Zero-shot MLLMs are insufficient: Even the strongest MLLMs, prompted directly, fall short of compliance-verification requirements, highlighting the indispensability of structured tool augmentation.
- Fine-tuned models generalize poorly: LlavaGuard achieves F1=0.91 on its own data but drops to 0.66 cross-dataset, exposing the fragility of training on policy-specific data.
- Core advantage of the agentic approach: Dynamic tool selection + multi-source cross-validation of evidence + flexible adaptation without training.
Highlights & Insights¶
- The first agentic framework for visual compliance verification, opening a new research direction.
- Surpassing fine-tuned models with zero training is remarkable: it demonstrates that in policy-volatile scenarios like compliance verification, agentic methods are more practical than fine-tuning.
- The separation of Planning Agent and CVAgent is an elegant design: evidence collection (where a cheaper LLM suffices) and evidence adjudication (requiring an MLLM to inspect the image) are decoupled.
- The modular design of the tool suite makes the system easy to extend and adapt to new compliance requirements.
Limitations & Future Work¶
- The current system processes only single images; extending to video compliance verification (continuous scenes, context dependency) requires further development.
- Reliance on Claude Sonnet 3.5 v2 as the backbone incurs high inference costs and introduces dependency on a closed-source model.
- The selection and description of tools in the tool suite require manual design, and integrating new tools still necessitates human intervention.
- While F1=0.76 on UnsafeBench represents the state of the art, there remains a substantial gap from perfect performance.
- Detailed analysis of latency and cost (multiple tool calls + LLM inference) is absent.
Related Work & Insights¶
- This represents the first application of the ReAct framework to visual compliance, demonstrating the unique value of agentic methods in tasks requiring flexible adaptation.
- Relationship to specialized tools such as NudeNet and Safe-CLIP: CompAgent treats them as evidence sources rather than replacements.
- Inspiration: Other visual tasks requiring policy-driven judgment (e.g., advertisement compliance, medical image review) may benefit from adapting this framework.
Rating¶
- Novelty: ⭐⭐⭐⭐ First agentic framework for compliance verification, though ReAct + tool calling itself is not novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison across two datasets, with thorough ablation and interpretability analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and framework description is detailed, though the main text is somewhat lengthy.
- Value: ⭐⭐⭐⭐⭐ Extremely high practical value; the training-free adaptation to new policies has direct relevance for industry applications.