🛡️ AI Safety
💬 ACL2026 · 5 paper notes
📌 Same area in other venues: 📷 CVPR2026 (143) · 🔬 ICLR2026 (140) · 🧪 ICML2026 (114) · 🤖 AAAI2026 (45) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (24)
- OmniCompliance-100K: A Multi-Domain Rule-Grounded Real-World Safety Compliance Dataset
This paper constructs OmniCompliance-100K, the first large-scale, multi-domain safety compliance dataset grounded in real-world cases. It contains 12,985 human-curated regulatory/policy rules and 106,009 real-world compliance cases collected via web search agents, covering nine domains such as AI safety, data privacy, finance, and healthcare. Extensive benchmarking reveals systemic shortcomings in the safety compliance capabilities of current LLMs.
- On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
This paper demonstrates that a commonly used defense in Transformer secure inference, exposing intermediate activations only after shuffling them, is insecure. It proposes an attack that first aligns activations observed under different random permutations and then solves linear equations to extract the weights. The attack recovers approximate but usable model weights for Pythia-70m and GPT-2 at a query cost of roughly $1.
- Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
This paper proposes Reverse Constitutional AI (R-CAI), which automatically synthesizes controllable, multi-dimensional adversarial toxic data by inverting the principles of Constitutional AI into a "Toxic Constitution." By combining a critique-revision loop with a probability-clamped RLAIF mechanism, R-CAI mitigates the semantic degradation caused by reward hacking and achieves a 15% improvement in semantic coherence.
- Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI
The paper argues that classroom AI should not directly interpret culturally contextualized signals such as "silence, averted gaze, or code-switching" as educational judgments like "low engagement, inattention, or low ability." It proposes the NSCR neuro-symbolic framework: mapping multimodal signals into typed facts with uncertainty, provenance, and cultural scope, followed by executable reasoning and governance policies to generate evidence-based claims, while actively deferring (DEFER) when evidence is insufficient or stereotype risks are high.
- UniVid: A Unified Vision-Language Model for Video Moderation
UniVid evolves video moderation systems from unmaintainable "fragmented" architectures to interpretable, reusable "end-to-end" systems by replacing 1000+ black-box classifiers with a unified policy-aware captioning VLM, achieving a 42.7% reduction in violation leakage during production deployment on the ByteDance platform.
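As a rough illustration of the NSCR entry above ("Signals Are Not States"), the idea of turning signals into typed facts and deferring when evidence is thin can be sketched in a few lines. This is not the paper's implementation: the `Fact` fields follow the summary (uncertainty, provenance, cultural scope), but the thresholds, the `STEREOTYPE_PRONE` set, and the `decide` function are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CLAIM = "claim"
    DEFER = "defer"


@dataclass(frozen=True)
class Fact:
    """A typed observation, not an educational judgment."""
    signal: str          # e.g. "silence", "averted_gaze"
    confidence: float    # detector confidence in [0, 1]
    provenance: str      # which sensor or model produced the fact
    cultural_scope: str  # context in which the signal was observed


# Hypothetical policy values, assumed for this sketch only.
MIN_CONFIDENCE = 0.7
MIN_INDEPENDENT_SOURCES = 2
STEREOTYPE_PRONE = {"silence", "averted_gaze", "code_switching"}


def decide(facts, claim):
    """Return (Verdict, reason or claim) for a proposed educational claim."""
    # Keep only facts the detectors are reasonably sure about.
    usable = [f for f in facts if f.confidence >= MIN_CONFIDENCE]

    # Require evidence from more than one independent source.
    sources = {f.provenance for f in usable}
    if len(sources) < MIN_INDEPENDENT_SOURCES:
        return Verdict.DEFER, "insufficient evidence"

    # Defer if every usable fact is a culturally ambiguous signal.
    if all(f.signal in STEREOTYPE_PRONE for f in usable):
        return Verdict.DEFER, "stereotype risk"

    return Verdict.CLAIM, claim


if __name__ == "__main__":
    observed = [
        Fact("silence", 0.9, "audio", "classroom_a"),
        Fact("averted_gaze", 0.8, "camera", "classroom_a"),
    ]
    print(decide(observed, "low engagement"))
```

In this sketch, silence and averted gaze alone never support a "low engagement" claim, however confident the detectors are; a claim is only emitted once a fact outside the culturally ambiguous set is also present.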