🛡️ AI Safety

💬 ACL2025 · 14 paper notes

🔥 Top topics: Watermarking ×4 · Adversarial Robustness ×3 · Speech & Audio ×2

Building a Long Text Privacy Policy Corpus with Multi-Class Labels: This paper constructs a multi-dimensional annotated corpus (64 annotation dimensions) containing the privacy policies of 149 companies, covering contentious clauses and legal rules in EU and US privacy regulations, and establishes classification benchmarks using current large language models (LLMs).
CENTAUR: Bridging the Impossible Trinity of Privacy, Efficiency, and Performance in Privacy-Preserving Transformer Inference: This paper proposes the Centaur framework, which integrates random permutation matrices and Secure Multi-Party Computation (SMPC) to break the "impossible trinity" in Privacy-Preserving Transformer Inference (PPTI)—simultaneously achieving strong privacy protection, 5-30x speedup, and plaintext-level inference accuracy.
Crafting Privacy-Preserving Adversarial Examples: A Defense Against Membership Inference: This paper proposes a method to defend against Membership Inference Attacks (MIA) by constructing privacy-preserving adversarial examples. It injects carefully designed perturbations into the model's prediction outputs, preventing attackers from determining whether a specific data point belongs to the training set, while maintaining service quality for normal users.
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes: This paper proposes Indic-Bias, the first large-scale LLM fairness benchmark tailored to the diverse Indian society. Testing 14 LLMs across three evaluation tasks using 20,000 human-verified scenario templates, it reveals that models exhibit severe negative biases against marginalized groups such as Dalits and reinforce stereotypes in over 70% of cases.
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework: This work proposes GIFI (Gender Inclusivity Fairness Index), a multi-level evaluation framework covering seven dimensions: pronoun recognition, sentiment neutrality, toxicity, counterfactual fairness, stereotype association, occupational fairness, and mathematical reasoning consistency. It systematically quantifies binary and non-binary gender fairness across 22 mainstream LLMs, revealing deep bias patterns such as the complete absence of neopronouns without prompting and the over-correction of "she".
Multi-task Adversarial Attacks against Black-box Model with Few-shot Queries: This paper proposes CEMA (Cluster and Ensemble Multi-task Text Adversarial Attack), which transforms complex multi-task black-box attacks into single-task text classification attacks by training a "deep-level surrogate model." CEMA can simultaneously attack multiple downstream tasks (such as classification, translation, summarization, and text-to-image generation) with only about 100 queries. Its effectiveness is validated on commercial models, including ChatGPT-4o, Baidu Translate, and Stable Diffusion.
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance: Proposes PrivaCI-Bench, the largest contextual privacy evaluation benchmark to date (154K instances) built upon Contextual Integrity theory. It covers real court cases, privacy policies, and synthetic data from EU AI Act compliance checkers to evaluate the legal compliance capabilities of LLMs under HIPAA, GDPR, and the AI Act.
Quantifying Misattribution Unfairness in Authorship Attribution: This paper proposes the \(\text{MAUI}_k\) metric to quantify "misattribution unfairness" in authorship attribution systems—where certain authors are systematically more likely to be falsely identified as suspect authors. The study reveals that this unfairness is highly correlated with the distance of the author's embedding to the centroid in the vector space.
Robust and Minimally Invasive Watermarking for EaaS: Proposes ESpeW (Embedding-Specific Watermark), a watermarking method that injects unique watermarks at different positions of each embedding vector, achieving robust copyright protection for Embeddings as a Service (EaaS). It resists various watermark removal attacks while affecting embedding quality by less than 1%.
Sandcastles in the Storm: Revisiting Watermarking Impossibility: This work challenges the theoretical impossibility results of "Watermarks in the Sand" (WITS) through large-scale experiments and human evaluation. It demonstrates that the two key assumptions of random walk attacks do not hold in practice: mixing is extremely slow (100% of attacked texts can still be traced back to their original source) and quality oracles are unreliable (only 77% accuracy), resulting in an automatic attack success rate of only 26%, which further drops to 10% after human quality auditing.
SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods: This paper constructs SpeechFake, a large-scale speech deepfake dataset containing over 3 million deepfake samples and more than 3,000 hours of audio, covering 40 generation tools and 46 languages. Baseline experiments analyze how generation methods, linguistic diversity, and speaker variations systematically affect detection performance.
Towards Fairness Assessment of Dutch Hate Speech Detection: This paper systematically evaluates the counterfactual fairness of Dutch hate speech detection models and proposes four counterfactual data generation methods (LLMdef, LLMlist, SLL, MGS). By fine-tuning BERTje, it shows that counterfactual data augmentation improves both model performance and fairness.
Efficiently Identifying Watermarked Segments in Mixed-Source Texts: Proposes two efficient methods (Geometric Cover Detector and Adaptive Online Locator) to detect and precisely locate watermarked segments in long mixed-source texts. They reduce the time complexity from \(O(n^2)\) to \(O(n \log n)\) and significantly outperform baselines across three mainstream watermarking techniques.
WET: Overcoming Paraphrasing Vulnerabilities in Embeddings-as-a-Service with Linear Transformation Watermark: This work reveals that existing EaaS embedding watermarking methods (EmbMarker/WARDEN) can be bypassed by paraphrasing attacks. It proposes WET (Watermark via Linear Transformation), which injects watermarks by applying linear transformations to embeddings using a secret circulant matrix. Theoretical analysis and empirical results demonstrate its robustness against paraphrasing attacks, achieving a verification AUC near 100%.
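
To make the WET idea above more concrete, here is a minimal sketch of watermarking an embedding with a secret circulant matrix. It is an illustration under stated assumptions, not the paper's implementation: the function names (`circulant`, `watermark`, `recover`), the 8-dimensional toy embedding, the Gaussian secret row, and the inverse-transform check with cosine similarity are all hypothetical choices made for this example.

```python
import numpy as np

def circulant(first_row):
    """Build a circulant matrix: row i is the first row cyclically shifted by i."""
    n = len(first_row)
    return np.stack([np.roll(first_row, i) for i in range(n)])

def watermark(embedding, T):
    """Apply the secret linear transformation, then re-normalize to unit length."""
    out = T @ embedding
    return out / np.linalg.norm(out)

def recover(watermarked, T):
    """Undo the transformation with the pseudo-inverse (assumed verification step)."""
    rec = np.linalg.pinv(T) @ watermarked
    return rec / np.linalg.norm(rec)

rng = np.random.default_rng(0)
dim = 8  # toy dimension; real embeddings are much larger

# Secret key held by the provider (hypothetical: a random Gaussian first row).
T_secret = circulant(rng.normal(size=dim))
# A different key, standing in for a party that does not know the secret.
T_wrong = circulant(rng.normal(size=dim))

# Toy unit-norm embedding.
e = rng.normal(size=dim)
e /= np.linalg.norm(e)

w = watermark(e, T_secret)

# With the right key the original direction is recovered; with the wrong key it is not.
cos_right = float(recover(w, T_secret) @ e)
cos_wrong = float(recover(w, T_wrong) @ e)
print(f"cosine with secret key: {cos_right:.6f}")
print(f"cosine with wrong key:  {cos_wrong:.6f}")
```

In this toy setup the watermark lives in the transformation rather than in any trigger word of the input text, which gives some intuition for why rewording the input need not remove it. How WET actually constructs the matrix and verifies ownership should be taken from the paper itself.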