OmniCompliance-100K: A Multi-Domain Rule-Grounded Real-World Safety Compliance Dataset
Conference: ACL 2026
arXiv: 2603.13933
Code: GitHub
Area: LLM Safety
Keywords: Safety Compliance, Real-world Case Dataset, Multi-domain Regulations, Web Search Agent, LLM Benchmark
TL;DR
This paper constructs OmniCompliance-100K, the first large-scale, multi-domain, real-world case-based LLM safety compliance dataset. It contains 12,985 manually organized regulation/policy rules and 106,009 real-world compliance cases collected via a Web search agent. Covering 9 domains such as AI safety, data privacy, finance, and healthcare, the dataset reveals systematic shortcomings in the safety compliance capabilities of current LLMs through extensive benchmark experiments.
Background & Motivation
Background: As LLMs are widely deployed across industries, safety risks have become increasingly prominent—ranging from generating harmful content and leaking private information to violating financial compliance requirements. Existing LLM safety datasets (e.g., ToxicChat, WildGuard, HarmBench) are primarily based on researcher-defined taxonomies and synthesized by LLMs.
Limitations of Prior Work: (1) Existing safety datasets lack systematic regulatory basis, using ad-hoc taxonomies that fail to provide strict compliance protection; (2) Even works that introduce regulatory/policy frameworks (Air-Bench, GuardSet-X) rely on cases synthesized by LLMs, lacking real-world diversity; (3) Real-world compliance cases are scattered across various websites in different formats (PDF, HTML, JSON), making large-scale collection and alignment difficult.
Key Challenge: While laws and regulations provide comprehensive safety guidelines and numerous real-world enforcement cases for reference, existing datasets have failed to utilize these resources. Consequently, LLM safety alignment is limited to synthetic scenarios, leading to poor generalization in real-world applications.
Goal: (1) Construct the first large-scale, rule-grounded safety compliance dataset containing real-world cases; (2) Develop an automated Web search agent pipeline to collect rule-aligned real-world cases at scale; (3) Comprehensively benchmark the safety compliance capabilities of current LLMs.
Key Insight: Utilize modern Web search agents (based on Grok-4.1) to automatically plan queries, retrieve results, filter noise, and summarize cases, thereby addressing the three major challenges of real-world case collection (scattered sources, diverse formats, and information noise).
Core Idea: Safety issues should be approached from a compliance perspective—using authoritative regulations as the basis and real-world cases as materials for training and evaluation, rather than relying on researcher-defined classifications and LLM-synthesized scenarios.
Method
Overall Architecture
The dataset construction consists of two phases: (1) Rule Collection—three PhDs in computational linguistics spent one month manually organizing a tree-structured rule system from 74 regulations/policies, generating 12,985 rules by traversing the tree; (2) Case Collection—a Web search agent pipeline based on Grok-4.1 was developed to automatically search, filter, and summarize 8-10 real-world cases for each rule. Finally, a rule-case knowledge graph was constructed to analyze the correlation between rules.
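The root-to-leaf rule generation described above can be illustrated with a minimal sketch. The `RuleNode` class, the `iter_rules` helper, and the toy node titles are all hypothetical names chosen for illustration; the source does not specify the actual tree schema or how a path is serialized into a rule.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class RuleNode:
    """One node of a regulation tree (assumed structure, not the paper's schema)."""
    title: str
    children: list["RuleNode"] = field(default_factory=list)


def iter_rules(node: RuleNode, path: tuple[str, ...] = ()) -> Iterator[str]:
    """Yield one rule per leaf, written as the full root-to-leaf path."""
    path = path + (node.title,)
    if not node.children:
        # A leaf closes a path, so the accumulated titles form one rule.
        yield " > ".join(path)
        return
    for child in node.children:
        yield from iter_rules(child, path)


# Toy tree; the titles are placeholders, not real regulation text.
toy = RuleNode("Regulation", [
    RuleNode("Chapter A", [RuleNode("Clause A.1"), RuleNode("Clause A.2")]),
    RuleNode("Chapter B", [RuleNode("Clause B.1")]),
])
rules = list(iter_rules(toy))
```

Under this assumption the number of rules equals the number of leaves, so each rule carries its chapter and section context along with the leaf clause.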
Key Designs
- Multi-domain Regulatory Rule System:
  - Function: Provides authoritative and systematic compliance standards for LLM safety evaluation.
  - Mechanism: Covers 9 major domains: AI Safety Law (EU AI Act, SB 53), Data Privacy Law (GDPR, CCPA, HIPAA), China-related regulations (PIPL, Data Security Law, etc.), Platform Policies (X, Reddit, GitHub, Google, OpenAI, WeChat), Academic Integrity, Financial Regulations (Anti-Money Laundering, Cross-border Payments, Cryptocurrency), Medical Device Regulations, Cybersecurity (MITRE ATT&CK), and Fundamental Rights. All regulations are organized into a tree structure, with rules generated by traversing from root to leaf.
  - Design Motivation: Different regulations have inconsistent formats and hierarchical structures, requiring significant manual effort to unify them into an operational tree structure. Multi-domain coverage ensures the comprehensiveness of the evaluation.
- Web Search Agent Case Collection Pipeline:
  - Function: Automated large-scale collection of real-world compliance cases aligned with rules.
  - Mechanism: The agent, based on Grok-4.1, executes a three-step process: (a) Analyzes rule content to plan and generate multiple search queries; (b) Calls search engine tools to retrieve results, focusing on official/authoritative sources; (c) Summarizes collected information, filters irrelevant cases, and outputs structured JSON containing case background, compliance outcome, involved parties, applicable regulations, and reference links.
  - Design Motivation: Manually collecting 106K cases is infeasible, and existing crawlers cannot adapt to various website structures and formats. The flexibility of Web search agents naturally solves the problems of scattered sources and diverse formats.
- Rule-Case Knowledge Graph:
  - Function: Reveals correlations between regulatory clauses and supports multi-hop compliance reasoning.
  - Mechanism: Each searched case references its source rule, forming a triplet. Aggregating all triplets builds the knowledge graph. Taking GDPR as an example, analysis shows that Articles 5-11 (Principles), Article 32 (Security of processing), Article 33 (Breach notification), and Article 44 (General principle for transfers) are highly correlated with almost all other clauses.
  - Design Motivation: Regulatory clauses do not exist in isolation; actual compliance judgments often require weighing several clauses together. The knowledge graph reveals these correlations, providing a foundation for future compliance reasoning.
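One plausible way to aggregate case records into a rule-correlation graph is sketched below. This is an illustration under stated assumptions, not the paper's construction: the `Case` field names are hypothetical stand-ins for the structured JSON fields, and the sketch assumes two rules are linked whenever a single case mentions both (its source rule plus the regulations it cites). The toy cases are invented placeholders, not dataset entries.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Case:
    """Hypothetical case record; field names are assumptions, not the released schema."""
    case_id: str
    source_rule: str                    # rule the search was run for
    applicable_rules: tuple[str, ...]   # rules cited in the case summary
    outcome: str                        # "permitted" or "prohibited"


def build_edges(cases):
    """Count, for each unordered rule pair, how many cases mention both rules."""
    edges = Counter()
    for case in cases:
        rules = sorted({case.source_rule, *case.applicable_rules})
        for a, b in combinations(rules, 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Number of distinct neighbouring rules for each rule in the graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {rule: len(ns) for rule, ns in neighbours.items()}


# Invented toy cases for illustration only.
cases = [
    Case("c1", "Art. 32", ("Art. 5", "Art. 33"), "prohibited"),
    Case("c2", "Art. 17", ("Art. 5",), "prohibited"),
    Case("c3", "Art. 33", ("Art. 32",), "permitted"),
]
edges = build_edges(cases)
hubs = degree(edges)
```

In this toy graph the rule cited by the most distinct partners has the highest degree, which is the kind of hub behaviour the GDPR analysis above reports for the principle and security articles.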
Loss & Training
This work focuses on dataset construction and benchmarking. The evaluation task is binary classification (permitted vs. prohibited), scored with macro-F1. The dataset contains 40,385 "permitted" samples and 65,624 "prohibited" samples (106,009 in total), so the classes are imbalanced toward "prohibited".
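To make the metric concrete, the sketch below computes macro-F1 for the two-class task and applies it to a trivial always-"prohibited" predictor using the class counts stated above. The function names are illustrative, and the baseline is a worked calculation added here, not a result reported in the paper.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from confusion counts; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=("permitted", "prohibited")) -> float:
    """Unweighted mean of per-class F1, so both classes count equally."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Class counts taken from the dataset description.
y_true = ["permitted"] * 40385 + ["prohibited"] * 65624
always_prohibited = ["prohibited"] * len(y_true)

baseline = macro_f1(y_true, always_prohibited)  # about 0.382
```

Because macro-F1 averages the two classes without weighting, a predictor that ignores the minority class is capped near 0.38 here even though its accuracy would be about 0.62. This is why macro-F1 is a more informative choice than accuracy for an imbalanced label distribution.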
Key Experimental Results
Main Results
Closed-Source Model Benchmark (Macro-F1, %)
| Model | Average | Platform Policy | Major Regulations | Education (Bias & Discrimination) |
|---|---|---|---|---|
| GLM-4.5 | Highest | 89.61 | 93.65 | 83.91 |
| DeepSeek-V3.2 | Second Highest | — | — | 85.75 |
| Grok-4.1 | 85.60 | — | — | — |
Open-Source Model Comparison
| Model | Average Macro-F1 (%) |
|---|---|
| Qwen2.5-14B-Instruct | High |
| Qwen2.5-7B-Instruct | 84.94 |
| Qwen2.5-3B-Instruct | 88.62 |
| Llama3.1-8B-Instruct | 76.02 |
| Llama3.2-3B-Instruct | 67.86 |
| Qwen2.5-1.5B-Instruct | 57.06 |
| WildGuard-7B | 38.41 |
| Llama-Guard-3-8B | 28.16 |
Ablation Study
Rule-Case Alignment Verification
| Evaluator | Avg. Alignment Score (Normalized %) |
|---|---|
| DeepSeek-V3.2 | 91.32 |
| GPT-4o-Mini | 92.51 |
| Gemini-2.5-Flash | 95.90 |
| Human Evaluation | 91.77 |
Key Findings
- Platform Policies vs. Regulations: All models score systematically lower on platform policies than on formal regulations (a gap of about 4 percentage points, e.g., 89.61 vs. 93.65 for GLM-4.5) because policies are more dynamic and context-dependent.
- Bias & Discrimination: The "Bias & Discrimination" category in the education domain is the worst-performing category for all models; even the strongest models reach only about 84-86% (83.91 for GLM-4.5, 85.75 for DeepSeek-V3.2). Identifying subtle social biases remains a core challenge for LLMs.
- Financial Regulations: Models consistently perform excellently on financial regulations (95-97%), demonstrating the potential of LLMs in financial compliance automation.
- Small Models Can Compete: Qwen2.5-3B-Instruct (88.62%) outperforms Grok-4.1 (85.60%), but performance drops sharply at 1.5B (Qwen2.5-1.5B-Instruct: 57.06%). Within the Qwen2.5 family, 3B appears to be the empirical lower bound for "compliance capability."
- Safety Guardrail Models Fail Severely: WildGuard-7B (38.41%) and Llama-Guard-3-8B (28.16%) perform extremely poorly in real-world compliance scenarios, indicating that existing safety alignment is too narrow.
- Qwen Series >> Llama Series: At comparable parameter counts, Qwen consistently outperforms Llama on compliance tasks (Qwen2.5-7B 84.94% vs. Llama3.1-8B 76.02%; Qwen2.5-3B 88.62% vs. Llama3.2-3B 67.86%).
- EU AI Act Chapter II (Prohibited AI Practices): All models perform worst (<80%) on this chapter, which covers prohibited practices such as biometric identification and deceptive AI.
Highlights & Insights
- The positioning that "safety issues should be approached from a compliance perspective" is highly valuable—authoritative regulations are more reliable and practically meaningful than researcher-defined taxonomies.
- The paradigm of using Web search agents as a data collection tool is noteworthy—it naturally handles the issues of scattered sources, diverse formats, and information noise faced by traditional crawlers.
- The finding that safety guardrail models (WildGuard, Llama-Guard) essentially fail in real-world compliance scenarios is critical—it suggests that current safety alignment methods overfit to narrow safety categories and urgently need safety training based on compliance datasets.
Limitations & Future Work
- Human evaluation only covered 2,220 samples (30 per regulation/policy) and did not cover the full dataset.
- Cases may contain sensitive information (PII) and need filtering and anonymization before release.
- Web searches may introduce temporal bias—old cases may no longer be applicable after regulatory updates.
- Only classification tasks were evaluated; the model's capability in compliance reasoning (requiring multi-hop inference) was not tested.
Related Work & Insights
- vs. Air-Bench (Zeng et al., 2024): The latter creates a taxonomy based on regulations and then synthesizes cases using LLMs (5,694 items). Ours collects real-world cases directly from the Web (106,009 items), representing a qualitative leap in scale and authenticity.
- vs. GuardSet-X (Kang et al., 2025): The latter is larger in scale (129,241 synthetic cases) but is entirely LLM-generated, lacking real-world diversity. Ours fills this gap with real-world cases.
- vs. PrivaCI-Bench (Li et al., 2025): The latter contains about 3,000 real court cases but is restricted to the privacy domain. Ours spans 9 domains with stricter rule alignment.
Rating
- Novelty: ⭐⭐⭐⭐ First large-scale real-world case safety compliance dataset; the Web search agent collection pipeline is innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive benchmarking of 18 models, multi-dimensional analysis, and dual LLM+human verification of rule-case alignment.
- Writing Quality: ⭐⭐⭐⭐ The dataset construction process is clear, and experimental findings are organized logically.
- Value: ⭐⭐⭐⭐⭐ Fills the gap in real-world safety compliance data, providing direct guidance for safety alignment research and practice.