LLM Safety
🧠 NeurIPS 2025 · 60 paper notes
- A Cramér–von Mises Approach to Incentivizing Truthful Data Sharing
  - This paper proposes an incentive mechanism based on the Cramér–von Mises (CvM) two-sample test statistic. Under both Bayesian and prior-free settings, the mechanism provably makes truthful data submission an (approximate) Nash equilibrium while encouraging participants to contribute more genuine data, without relying on strong distributional assumptions (e.g., Gaussian or Bernoulli).
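
A minimal illustration of the statistic the mechanism builds on, using SciPy's two-sample CvM test on synthetic data; the paper's actual payment rule layered on top may differ.

```python
# Compare one participant's submission against pooled data from the others.
import numpy as np
from scipy.stats import cramervonmises_2samp

rng = np.random.default_rng(0)
truthful = rng.normal(0.0, 1.0, size=500)     # honest submissions
fabricated = rng.normal(0.4, 1.0, size=500)   # shifted, fabricated data

res = cramervonmises_2samp(truthful, fabricated)
print(res.statistic, res.pvalue)  # large statistic / small p-value flags divergence
```
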
- A Reliable Cryptographic Framework for Empirical Machine Unlearning Evaluation
  - This paper models the machine unlearning evaluation problem as a cryptographic game (the unlearning sample inference game), quantifies unlearning quality via the adversary's "advantage," and addresses multiple shortcomings of traditional MIA accuracy as an evaluation metric: the lack of a retrain-as-zero baseline, sensitivity to data partitioning, and sensitivity to the choice of MIA. A SWAP test is further proposed as an efficient practical approximation.
- Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning
  - This paper proposes FedLEASE, which addresses two critical challenges in federated LoRA fine-tuning: (1) automatically determining the optimal number of experts and their assignment via LoRA B-matrix similarity clustering, and (2) enabling adaptive top-M expert selection through an expanded routing space of \(2M-1\) dimensions, allowing each client to determine how many experts to use. FedLEASE achieves an average improvement of 5.53% over the strongest baseline on GLUE.
- Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
  - This paper proposes Adversarial Paraphrasing, a training-free universal attack framework that selects the most "human-like" token at each decoding step by leveraging feedback signals from AI text detectors during token-by-token paraphrasing. The approach reduces T@1%F (true positive rate at 1% false positive rate) by an average of 87.88% across 8 detectors and exhibits strong cross-detector transferability.
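
A hypothetical sketch of the detector-guided decoding loop; `paraphraser_topk` and `detector_human_prob` are invented stand-ins, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def guided_decode(prefix: List[str],
                  paraphraser_topk: Callable[[List[str]], List[Tuple[str, float]]],
                  detector_human_prob: Callable[[str], float],
                  steps: int = 50) -> str:
    tokens = list(prefix)
    for _ in range(steps):
        # restrict to the paraphraser's own top candidates to keep fluency,
        # then pick the token the detector scores as most human-like
        candidates = paraphraser_topk(tokens)  # [(token, logprob), ...]
        best = max(candidates,
                   key=lambda c: detector_human_prob(" ".join(tokens + [c[0]])))
        tokens.append(best[0])
    return " ".join(tokens)

# stub demo: a two-token vocabulary and a detector that prefers "hello"
print(guided_decode(["Rewrite:"],
                    lambda toks: [("hello", -0.1), ("world", -0.2)],
                    lambda text: text.count("hello") / len(text.split()),
                    steps=3))
```
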
- AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text
  - This paper proposes the AgentStealth framework, which trains a small language model (SLM) through a three-stage pipeline comprising an adversarial anonymization workflow, supervised fine-tuning (SFT), and online reinforcement learning, achieving effective anonymization of user-generated content while preserving text utility: a 12.3% improvement in anonymization performance and a 6.8% improvement in utility.
- ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models
  - The first defense framework against jailbreak attacks on audio-language models (ALMs). The work discovers that aligned ALMs possess latent safety shortcuts that can be activated, and proposes a Mel Gradient Sparse Mask (M-GSM) to identify critical frequency bins. By applying Shortcut Activation Perturbations (SAP) to these bins, the average attack success rate is reduced from 41.6% to 4.6% with negligible degradation of normal task performance.
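
A toy rendering of the bin-selection idea, assuming saliency is gradient magnitude with respect to the mel spectrogram; `safety_proxy` is an invented stand-in for the ALM's safety-relevant objective.

```python
import torch

mel = torch.randn(80, 300, requires_grad=True)            # (mel bins, frames)
safety_proxy = lambda m: (m * torch.randn_like(m)).sum()  # placeholder objective

safety_proxy(mel).backward()
saliency = mel.grad.abs().mean(dim=1)                     # per-bin sensitivity
mask = torch.zeros(80, dtype=torch.bool)
mask[saliency.topk(8).indices] = True                     # keep a sparse set of bins
# a shortcut-activating perturbation would then be optimized only on masked bins
```
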
- Angular Steering: Behavior Control via Rotation in Activation Space
  - This paper proposes Angular Steering, which formulates LLM activation steering as a rotation within a fixed 2D subspace. By parameterizing behavior control through the rotation angle, it provides a continuous, fine-grained, norm-preserving knob spanning 0°–360°, while subsuming activation addition and directional ablation as special cases of rotation. The approach achieves robust behavior control on Llama 3, Qwen 2.5, and Gemma 2 (3B–14B).
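
A minimal sketch of steering as rotation inside a fixed 2D plane, assuming the two spanning directions are given and orthonormal.

```python
import numpy as np

def angular_steer(h, u, v, theta):
    a, b = h @ u, h @ v                     # coordinates of h in the (u, v) plane
    h_perp = h - a * u - b * v              # component outside the plane, untouched
    c, s = np.cos(theta), np.sin(theta)
    a2, b2 = c * a - s * b, s * a + c * b   # rotate the in-plane coordinates
    return h_perp + a2 * u + b2 * v         # norm-preserving by construction

d = 16
u, v = np.eye(d)[0], np.eye(d)[1]           # toy orthonormal pair
h = np.random.randn(d)
h_steered = angular_steer(h, u, v, np.pi / 2)
assert np.isclose(np.linalg.norm(h), np.linalg.norm(h_steered))
```
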
- Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks Against LLMs
  - This paper models adversarial attacks on LLMs as an information channel problem, defining the "bits leaked per query" \(I(Z;T)\) as the mutual information between the attack target attribute \(T\) and the observable signal \(Z\), and proving that the number of queries required to achieve error \(\varepsilon\) scales as \(\log(1/\varepsilon)/I(Z;T)\). Validated across 7 LLMs: exposing only answer tokens requires ~1000 queries; adding logits reduces this to ~100; adding chain-of-thought (CoT) further reduces it to tens of queries. This provides the first principled metric for the transparency–security trade-off.
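
A back-of-envelope application of the bound; the per-query leakage values below are illustrative, not the paper's measurements.

```python
import math

def queries_needed(eps: float, bits_per_query: float) -> float:
    # minimum queries to reach error eps: log2(1/eps) / I(Z;T)
    return math.log2(1.0 / eps) / bits_per_query

for label, leak in [("answer tokens only", 0.01), ("+ logits", 0.1), ("+ CoT", 0.5)]:
    print(f"{label:>18}: ~{queries_needed(1e-3, leak):.0f} queries")
# ~1000 / ~100 / ~20, matching the qualitative trend reported in the paper
```
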
- Buffer Layers for Test-Time Adaptation
  - This paper proposes Buffer layers as a new paradigm for Test-Time Adaptation (TTA), replacing conventional normalization layer updates to fundamentally preserve the integrity of the pretrained backbone. The approach effectively alleviates catastrophic forgetting and achieves consistent performance improvements across diverse architectures and TTA frameworks.
- Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems
  - This paper proposes the Collective Narrative Grounding protocol, which collects community narratives through participatory workshops and structures them into "narrative units." A RAG pipeline then injects this local knowledge into LLM-based QA systems. Experiments on LocalBench reveal that 76.7% of errors can be directly remediated by local narratives, and GPT-5 achieves only 21% accuracy on the participatory QA set, highlighting the severity of the local knowledge gap.
- Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
  - This paper proposes CI-RL, a framework that combines Chain-of-Thought reasoning prompts with GRPO reinforcement learning to train LLMs to understand contextual integrity (CI) using only ~700 synthetic samples. On the PrivacyLens benchmark, it reduces privacy leakage rates by up to 40%, and smaller models trained with CI-RL can surpass larger baseline models.
- CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment
  - This paper proposes CoreGuard, which locks Transformer linear-layer weights via row permutation and reduces TEE (trusted execution environment) authorization to a single invocation through a column-permutation propagation protocol, protecting the foundational capabilities of edge-deployed LLMs against model stealing attacks with negligible computational and communication overhead.
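
A toy check of the permutation algebra, not CoreGuard's actual protocol: row-permuting one layer is undone by column-permuting the next, so the secret permutation never has to leave the TEE.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
P = np.eye(d)[rng.permutation(d)]   # secret permutation matrix

W1_locked = P @ W1                  # rows of W1 shuffled (shipped to the device)
W2_comp = W2 @ P.T                  # columns of W2 compensated (P.T == P^-1)

x = rng.normal(size=d)
assert np.allclose(W2_comp @ (W1_locked @ x), W2 @ (W1 @ x))
```
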
- CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming
  - To address the prevalence of duplicate and near-duplicate problems in competitive programming, which compromises contest fairness and inflates LLM evaluation scores, this work constructs CPRet, a large-scale benchmark spanning four retrieval tasks, and proposes CPRetriever, a domain-specific retrieval model trained with a Group-InfoNCE loss. CPRetriever surpasses 20+ existing embedding models across all tasks and reveals systematic evaluation bias in LiveCodeBench attributable to problem similarity.
- CryptoMoE: Privacy-Preserving and Scalable Mixture of Experts Inference via Balanced Expert Routing
  - CryptoMoE is the first framework supporting privacy-preserving inference for MoE-based LLMs. By combining balanced expert routing to conceal routing information, a confidence-aware dispatch protocol, and a batch ciphertext matrix multiplication protocol, it achieves 2.8–3.5× latency reduction and 2.9–4.3× communication reduction compared to a dense baseline, with only 0.8% accuracy loss.
- DeepPersona: A Generative Engine for Scaling Deep Synthetic Personas
  - This paper presents DeepPersona, a two-stage taxonomy-guided synthetic persona generation engine. Stage 1 mines a human attribute taxonomy with 8,000+ nodes from real user–ChatGPT conversations; Stage 2 generates narratively coherent personas averaging 200+ structured attributes via progressive attribute sampling. The approach achieves an 11.6% improvement in personalized QA accuracy and a 31.7% reduction in social survey simulation bias.
- Demystifying Language Model Forgetting with Low-Rank Example Associations
  - This paper discovers that, after LLM fine-tuning, the matrix associating upstream-sample forgetting with newly learned tasks is approximately low-rank (a rank-3 fit achieves \(R^2 > 0.69\)), and leverages matrix completion to predict the forgetting induced by unseen tasks, thereby guiding selective replay to mitigate forgetting.
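
A small sketch of the low-rank claim on synthetic data: fit a rank-3 truncated SVD to a forgetting matrix and measure the variance explained (the paper additionally uses matrix completion to extend this to unseen tasks).

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic (upstream examples x new tasks) forgetting matrix: rank 3 plus noise
F = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 40)) + 0.1 * rng.normal(size=(200, 40))

U, s, Vt = np.linalg.svd(F, full_matrices=False)
F3 = (U[:, :3] * s[:3]) @ Vt[:3]    # rank-3 reconstruction
r2 = 1 - ((F - F3) ** 2).sum() / ((F - F.mean()) ** 2).sum()
print(f"rank-3 R^2 = {r2:.3f}")     # high R^2 means forgetting is predictable
```
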
- Differentially Private Federated Low Rank Adaptation Beyond Fixed-Matrix
  - This paper proposes FedASK, a framework that employs a two-stage sketching pipeline (inspired by randomized SVD) to enable, for the first time under differential privacy, effective simultaneous updates of both low-rank matrices \(A\) and \(B\) in federated LoRA, achieving up to 11.5% improvement on MMLU and 46% on GSM8K over baselines on Llama-2 7B/13B.
- Distributive Fairness in Large Language Models: Evaluating Alignment with Human Values
  - This paper systematically evaluates the distributive fairness preferences of several SOTA LLMs (GPT-4o, Claude-3.5S, Llama3-70b, Gemini-1.5P) on non-strategic resource allocation tasks. The results reveal significant divergence between LLMs and humans: LLMs favor efficiency and envy-freeness (EF) while neglecting equality (EQ), which humans prioritize. However, in multiple-choice settings, GPT-4o and Claude can correctly identify the fairest allocation.
- DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
  - This paper proposes DNA-DetectLLM, a zero-shot AI-generated text detection method inspired by the DNA mutation-repair mechanism. It constructs an ideal AI sequence and quantifies the cumulative difficulty of repairing the input text toward that sequence as the detection signal, achieving state-of-the-art results with a relative AUROC improvement of 5.55% and F1 improvement of 2.08% across multiple benchmark datasets.
- Enhancing CLIP Robustness via Cross-Modality Alignment
  - This paper proposes COLA, a training-free framework that eliminates non-semantic noise by projecting adversarially perturbed image features onto the subspace spanned by text features, and then employs optimal transport (OT) to perform fine-grained distribution-level image-text alignment. COLA achieves an average improvement of 6.7% in adversarial robust accuracy across 14 zero-shot classification benchmarks while preserving clean sample performance.
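
A sketch of the projection step, assuming the text features of C class prompts span the semantic subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 512, 10
T = rng.normal(size=(d, C))     # text features, one column per class prompt
img = rng.normal(size=d)        # adversarially perturbed image feature

coef, *_ = np.linalg.lstsq(T, img, rcond=None)
img_proj = T @ coef             # component of img inside span(T); the rest is discarded
# COLA would then align img_proj to the text distribution via optimal transport
```
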
- Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples
  - This paper identifies and defines Mislabeled Easy Examples (MEEs), samples whose incorrect labels are confidently learned by the model in the early stages of training, and demonstrates that these samples cause the greatest harm to generalization. An Early Cutting method is proposed to filter MEEs by recalibrating the early-stage confident subset using the model's later-stage state.
- Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions
  - This paper systematically evaluates the hiring-match performance of mainstream LLMs (including GPT-4o/4.1, Claude 3.5, Gemini 2.5, Llama 3.1/4, and DeepSeek R1) on approximately 10,000 real-world candidate–job pairs. Results show that a domain-specialized model (Match Score) comprehensively outperforms general-purpose LLMs in both accuracy (AUC 0.85 vs. 0.77) and fairness (Race IR 0.957 vs. ≤0.809).
- Exploring the Limits of Strong Membership Inference Attacks on Large Language Models
  - This work presents the first extension of strong membership inference attacks (LiRA) to GPT-2-scale LLMs ranging from 10M to 1B parameters, training over 4,000 reference models. Among the key findings: strong MIAs can succeed on LLMs but with limited effectiveness (AUC < 0.7), and a substantial fraction of per-sample decisions are indistinguishable from random coin flips under training randomness.
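
Standard LiRA scoring in miniature, with synthetic shadow losses standing in for the thousands of reference models the paper trains.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
losses_in = rng.normal(2.0, 0.3, 64)    # loss on x under models that trained on x
losses_out = rng.normal(2.6, 0.3, 64)   # loss on x under models that did not

def lira_score(observed_loss: float) -> float:
    p_in = norm.logpdf(observed_loss, losses_in.mean(), losses_in.std())
    p_out = norm.logpdf(observed_loss, losses_out.mean(), losses_out.std())
    return p_in - p_out                 # > 0 suggests "member"

print(lira_score(2.1), lira_score(2.7))
```
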
- FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models
  - FedRW proposes the first privacy-preserving soft deduplication framework for federated learning that requires no trusted third party. By leveraging secure multi-party computation to obtain global sample frequencies and performing frequency-aware sample reweighting, it achieves up to 28.78× preprocessing speedup and approximately 11.42% improvement in perplexity over prior methods.
- FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA
  - FedSVD proposes globally reparameterizing LoRA matrices via SVD, updating the \(A\) matrix each communication round using the right singular vectors of the aggregated \(BA\) product. This approach avoids the quadratic noise amplification under DP-SGD while preserving the adaptive capacity of \(A\), consistently outperforming fixed-\(A\) baselines across multiple NLU benchmarks.
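
One communication round of the reparameterization as I read the summary, on toy shapes and with the DP noise omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, d))   # aggregated LoRA factors

U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
A_new = Vt[:r]                    # orthonormal right singular vectors become A
B_new = U[:, :r] * s[:r]          # singular values folded into B

assert np.allclose(B_new @ A_new, B @ A)   # same adapter update, new factorization
```
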
- Finding Structure in Continual Learning
  - This paper proposes a continual learning optimization framework based on Douglas-Rachford Splitting (DRS), which decouples stability and plasticity into two independent proximal subproblems, and replaces KL divergence with Rényi divergence for more robust prior alignment, thereby effectively alleviating catastrophic forgetting without replay buffers or additional modules.
- Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
  - Geo-Sign projects skeleton features into a Poincaré ball model of hyperbolic space and regularizes an mT5 language model via a hyperbolic contrastive loss, enabling the model to perceive the hierarchical structure of sign language motion. Using only skeleton data, the method surpasses RGB-based SOTA on CSL-Daily (BLEU-4 +1.81, ROUGE-L +3.03).
- HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring
  - The first benchmark systematically evaluating small language models (SLMs, 1–4B parameters) on mobile and wearable health monitoring tasks, covering zero-shot, few-shot, and instruction fine-tuning paradigms, with on-device deployment validated on an iPhone.
- InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
  - This paper proposes InvisibleInk, a framework that reduces the computational cost of differentially private long-text generation by more than 8× through two innovations: differential clipping (DClip) for isolating sensitive information, and Top-\(k^+\) truncated sampling. For the first time, it achieves high-quality private text generation with only 4–8× overhead over non-private generation.
- Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization
  - This paper proposes LTW (Learning to Watermark), a framework that employs a lightweight selector network to adaptively determine when to apply watermarks based on sentence embeddings, token entropy, and the current watermarking ratio. By leveraging multi-objective optimization via MGDA, LTW achieves a Pareto-optimal balance between detectability and text quality, substantially improving watermarked text quality without compromising detection performance.
- LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory
  - This paper proposes an LLM strategic reasoning evaluation framework grounded in behavioral game theory. It employs Truncated Quantal Response Equilibrium (TQRE) to quantify reasoning depth τ, evaluates 22 state-of-the-art models across 13 matrix games, and reveals differences in reasoning styles as well as biases induced by demographic personas.
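
A toy truncated logit-response iteration for a symmetric 2x2 game, loosely following the summary's description of depth-limited reasoning; the paper's TQRE estimation procedure is more involved.

```python
import numpy as np

A = np.array([[3.0, 0.0], [5.0, 1.0]])   # row payoffs of a prisoner's-dilemma-like game

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def truncated_qre(tau: int, lam: float = 2.0):
    p = q = np.array([0.5, 0.5])          # depth 0: uniform play
    for _ in range(tau):
        # symmetric game, so both players respond through the same payoff matrix
        p, q = softmax(lam * (A @ q)), softmax(lam * (A @ p))
    return p

print(truncated_qre(1))    # shallow reasoning
print(truncated_qre(10))   # deeper reasoning, approaching the logit equilibrium
```
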
- MaskSQL: Safeguarding Privacy for LLM-Based Text-to-SQL via Abstraction
  - This paper proposes MaskSQL, a framework that protects privacy by replacing sensitive table names, column names, and data values with abstract symbols before sending prompts to a remote LLM. Combined with a local SLM for schema linking and SQL reconstruction, MaskSQL preserves privacy while surpassing SLM-only approaches in SQL generation accuracy.
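
A minimal illustration of the abstraction step; in the real system the identifier map would come from the local schema-linking SLM rather than a hand-written dictionary.

```python
sensitive = {"patients": "T1", "ssn": "C1", "diagnosis": "C2"}

def mask(text: str) -> str:
    for name, symbol in sensitive.items():
        text = text.replace(name, symbol)
    return text

def unmask(sql: str) -> str:
    for name, symbol in sensitive.items():
        sql = sql.replace(symbol, name)
    return sql

prompt = mask("List diagnosis counts from patients grouped by ssn")
remote_sql = "SELECT C2, COUNT(*) FROM T1 GROUP BY C1"  # imagined remote LLM output
print(unmask(remote_sql))  # -> SELECT diagnosis, COUNT(*) FROM patients GROUP BY ssn
```
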
- MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
  - This paper proposes MixAT, a method that combines discrete adversarial attacks (PAP-based rewriting) with continuous embedding-space perturbations for LLM adversarial training. MixAT achieves robustness against diverse attacks (reducing ALO-ASR, the at-least-one attack success rate, from over 50% to below 20%) while preserving utility, at a training cost comparable to purely continuous methods.
- MPCache: MPC-Friendly KV Cache Eviction for Efficient Private LLM Inference
  - This paper proposes MPCache, a KV cache eviction framework designed for secure multi-party computation (MPC), combining one-time static eviction with query-aware dynamic selection. Through hierarchical clustering, linearized similarity approximation, and cross-layer index sharing, MPCache achieves up to 2.01× latency reduction and 8.37× communication volume reduction without sacrificing LLM performance.
- Music Arena: Live Evaluation for Text-to-Music
  - Music Arena is the first online live evaluation platform for text-to-music (TTM) generation. It addresses the heterogeneous signature problem of TTM systems via an LLM-driven moderation and routing system, collects multi-level preference data including fine-grained listening behavior and natural language feedback, and provides the community with a sustainable open preference data source through monthly rolling data releases.
- On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection
  - This paper systematically evaluates eight classical goodness-of-fit (GoF) tests for LLM text watermark detection, demonstrating that GoF tests significantly outperform existing baseline methods in both detection power and robustness.
- On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
  - This paper presents the first systematic study on the robustness of LLM verbal confidence under adversarial attacks. It proposes a Verbal Confidence Attack (VCA) framework comprising perturbation-based and jailbreak-based attacks, demonstrating that such attacks can reduce confidence scores by up to 30%, cause answer-flip rates of up to 100%, and that existing defense strategies are largely ineffective.
- On the Sample Complexity of Differentially Private Policy Optimization
  - This paper presents the first systematic study of sample complexity for policy optimization (PO) under differential privacy (DP) constraints. It proposes a unified meta-algorithm framework and analyzes three private policy optimization algorithms (DP-PG, DP-NPG, and DP-REBEL), proving that the privacy cost typically appears only as a lower-order term in the sample complexity.
- ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests
  - This paper proposes ORBIT, a unified benchmark for recommender systems comprising standardized evaluation on 5 public datasets and a privacy-safe hidden test set, ClueWeb-Reco, constructed from real users' browsing histories. The benchmark systematically evaluates 12 recommendation models and introduces the LLM-QueryGen baseline, revealing the limitations of existing approaches in large-scale, real-world recommendation scenarios.
- Poly-Guard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
  - This paper introduces Poly-Guard, the first large-scale, multi-domain, policy-grounded safety guardrail benchmark. It extracts 400+ risk categories and 1,000+ safety rules from 150+ real-world industry safety policies, generates 100K+ instances spanning 8 safety-critical domains, and systematically evaluates 19 guardrail models, revealing 8 key findings including domain specialization, evolutionary forgetting, scaling stagnation, and adversarial vulnerability.
- Probabilistic Reasoning with LLMs for K-Anonymity Estimation
  - This paper proposes Branch, a framework that leverages large language models to model personal information disclosed in user-generated text as a joint probability distribution over a Bayesian network. By estimating conditional probabilities for individual attributes and composing them to compute k-anonymity values (i.e., the number of individuals globally matching a given profile), Branch achieves 73% accuracy on privacy risk estimation, outperforming o3-mini chain-of-thought reasoning by 13%.
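
A chain-rule sketch of the k estimate; the probabilities below are invented, whereas Branch elicits them from an LLM over a Bayesian network.

```python
population = 330_000_000   # reference population size

conditionals = {
    "lives in Seattle": 0.0022,
    "age 30-40 given city": 0.18,
    "software engineer given city and age": 0.08,
}

joint = 1.0
for p in conditionals.values():
    joint *= p             # chain rule over the network's conditionals

k = population * joint     # expected number of people matching the profile
print(f"k ~ {k:.0f}")      # small k signals high re-identification risk
```
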
- Procurement Auctions with Predictions: Improved Frugality for Facility Location
  - This paper studies procurement auction design for the strategic uncapacitated facility location problem. It proves that the frugality ratio of the classical VCG auction is exactly 3 (improving the previously known upper bound of 4), and designs learning-augmented auction mechanisms that exploit prediction information to achieve near-optimal frugality when predictions are accurate, while maintaining constant-factor robustness when predictions are arbitrarily inaccurate.
- PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
  - This paper proposes the PULSE evaluation protocol, which assesses existing unlearning methods for large multimodal models (LMMs) along two practically motivated dimensions: the forgetting of pretrained knowledge and the sustainability of repeated sequential unlearning. The findings reveal severe deficiencies in current methods: forgetting pretrained knowledge causes over 90% loss of general capability, and after five sequential unlearning operations, model generalization nearly collapses.
- Reinforcement Learning with Backtracking Feedback
  - This paper proposes RLBF, a reinforcement learning framework with backtracking feedback that allows agents to return to previous states and re-explore when encountering dead ends. By leveraging backtracking signals to improve credit assignment, RLBF significantly enhances exploration efficiency in sparse-reward environments.
- ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
  - ReliabilityRAG proposes a RAG framework that leverages document reliability signals (e.g., search ranking) for adversarial defense. It identifies a consistent document subset by finding the Maximum Independent Set (MIS) on a contradiction graph while prioritizing high-reliability documents, providing provable robustness guarantees alongside high accuracy in benign scenarios and on long-form generation tasks.
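
A greedy stand-in for the subset-selection step; exact maximum independent set is NP-hard, and the paper's construction and robustness guarantees are more careful than this sketch.

```python
def consistent_subset(reliability, contradicts):
    """Keep documents from most to least reliable, skipping any that
    contradict an already-kept document."""
    order = sorted(range(len(reliability)), key=lambda i: -reliability[i])
    kept = []
    for i in order:
        if all((i, j) not in contradicts and (j, i) not in contradicts for j in kept):
            kept.append(i)
    return kept

# doc 0 is the least reliable and contradicts both others, so it is dropped
print(consistent_subset([0.5, 0.8, 0.7], {(0, 1), (0, 2)}))  # -> [1, 2]
```
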
- Reverse Engineering Human Preferences with Reinforcement Learning
  - A preamble generator trained with reinforcement learning is used to inflate the evaluation scores of downstream LLMs, exposing critical vulnerabilities in the LLM-as-a-Judge evaluation framework. The attack is nearly undetectable and demonstrates cross-model transferability.
- SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
  - This paper proposes SAEMark, a framework that leverages sparse autoencoders (SAEs) to extract Feature Concentration Scores (FCS) from text, and embeds multi-bit watermarks via inference-time feature-guided rejection sampling. The approach requires no modification to model weights or logits, natively supports black-box APIs, multilingual text, and code, and achieves state-of-the-art watermark detectability and text quality across English, Chinese, and code domains.
- SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
  - This paper proposes SECA (Semantically Equivalent and Coherent Attacks), a realistic prompt perturbation framework that elicits LLM hallucinations while preserving semantic equivalence and coherence, achieving higher attack success rates on multiple-choice QA tasks with near-zero semantic errors.
- Self-Refining Language Model Anonymizers via Adversarial Distillation
  - This paper proposes SEAL, a framework that distills GPT-4-level text anonymization capabilities into an 8B model via adversarial distillation, combining SFT + DPO training with a self-refinement mechanism. The resulting small model achieves privacy–utility trade-offs on par with or superior to GPT-4-based anonymizers while enabling fully local deployment.
- SIMU: Selective Influence Machine Unlearning
  - SIMU proposes a two-stage framework: it first identifies critical MLP neurons encoding forget-set information via gradient aggregation, then applies second-order (Sophia) optimization exclusively to those neurons, achieving effective unlearning while substantially preserving the model's original capabilities.
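
A sketch of the first stage, assuming neuron attribution is the aggregated squared gradient over a forget batch; the second stage would then run the second-order optimizer on just these rows.

```python
import torch
import torch.nn as nn

mlp = nn.Linear(32, 64)                     # stand-in for one MLP projection
forget_x, forget_y = torch.randn(16, 32), torch.randn(16, 64)

loss = nn.functional.mse_loss(mlp(forget_x), forget_y)
loss.backward()

neuron_scores = mlp.weight.grad.pow(2).sum(dim=1)   # one score per output neuron
critical = neuron_scores.topk(4).indices            # neurons selected for unlearning
print(critical)
```
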
- Stop DDoS Attacking the Research Community with AI-Generated Survey Papers
  - This position paper analogizes the proliferation of AI-generated survey papers to a "Distributed Denial-of-Service (DDoS) attack" on the academic community. Through systematic quantitative analysis of 10,063 CS survey papers on arXiv from 2020 to 2024, the paper documents synchronized post-ChatGPT surges in survey volume, AI-generation scores, and anomalous author counts. It diagnoses four major quality deficiencies in AI-generated surveys (disorganized structure, unoriginal taxonomies, inaccurate citations, and highly redundant content), analyzes cultural repercussions for the researcher–reviewer–editor triad, and proposes a comprehensive response framework encompassing transparency requirements, rigorous review standards, redundancy restrictions, AI-detection assistance, and a "Dynamic Live Survey" platform.
- SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications
  - This paper proposes BIRD-CRITIC (the first SQL debugging benchmark) and the Six-Gym training environment, and develops the Bird-Fixer agent. Through the f-Plan Boosting strategy, it elevates the SQL debugging capability of a 14B open-source model to surpass Claude-3.7-Sonnet and GPT-4.1, achieving efficient SQL issue resolution while preserving data privacy.
- Teaming LLMs to Detect and Mitigate Hallucinations
  - This paper generalizes single-model consistency methods (Self-Consistency + Semantic Entropy) to a multi-model "consortium" setting comprising heterogeneous LLMs. By aggregating responses from models with diverse training backgrounds, the approach breaks the consistent hallucinations that arise within a single model. Evaluating a large number of consortium combinations over a pool of 15 LLMs, the paper finds that well-matched strong-model consortia outperform the strongest single-model baseline in 92% of cases while incurring lower inference cost.
- ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training
  - This paper proposes ToxicTextCLIP, a framework that generates high-quality adversarial texts during CLIP pre-training via two modules (a Background-aware Target Text Selector and a Background-driven Poisoned Text Augmenter), achieving up to 95.83% attack success rate and 98.68% backdoor Hit@1, while successfully bypassing three defenses: RoCLIP, CleanCLIP, and SafeCLIP.
- Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
  - This paper proposes the Trans-EnV framework, which combines expert linguistic knowledge with the transformation capabilities of LLMs to automatically convert Standard American English (SAE) datasets into 38 English varieties (18 dialects + 20 ESL Englishes), revealing performance degradations of up to 46.3% on non-standard English and highlighting critical linguistic fairness concerns.
- TRAP: Targeted Redirecting of Agentic Preferences
  - TRAP introduces a diffusion-based semantic injection adversarial framework that optimizes image semantics in the CLIP embedding space. Under black-box conditions, it systematically misdirects the decision preferences of multiple mainstream VLM agents in a visually natural manner, achieving attack success rates of up to 100% across six models including LLaVA-34B and GPT-4o.
- TRUST -- Transformer-Driven U-Net for Sparse Target Recovery
  - This paper proposes the TRUST architecture, which integrates the Transformer attention mechanism with a U-Net decoder to jointly learn the sensing operator and reconstruct sparse signals under unknown sensing matrices, achieving significant improvements over conventional methods in SSIM and PSNR.
- Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery
  - This paper proposes reframing machine unlearning as an epistemological probe ("unlearning as ablation"): by systematically removing a target piece of knowledge along with its unlearning closure, and then testing whether a model can re-derive it from axioms, the framework provides a falsifiable test to distinguish whether LLMs genuinely generate new knowledge or merely retrieve memorized fragments.
- Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data
  - This paper presents the first systematic study of security risks introduced by synthetic data in LLM training. It reveals that existing poisoning and backdoor attacks rarely propagate through synthetic data, and proposes the Virus Infection Attack (VIA) framework. VIA embeds poisoned content into normal training samples via hijacking point search and shell construction, enabling malicious content to be generated by the model even on clean queries and subsequently propagated to downstream models.
- When AI Democratizes Exploitation: LLM-Assisted Strategic Manipulation of Fair Division Algorithms
  - This paper empirically demonstrates that LLMs can reduce the manipulation of fair-division algorithms, previously requiring deep expertise in mechanism design, to a simple natural-language conversation available to any user. Four coordination scenarios are designed on the Spliddit fair rent platform (exclusionary collusion, defensive counter-attack, benevolent collusion, and cost-minimization coalition), fundamentally overturning the traditional assumption that "algorithmic complexity serves as a security barrier."