
SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Conference: NeurIPS 2025 arXiv: 2508.08211 Code: Project Page Area: AI Safety Keywords: LLM Watermarking, Sparse Autoencoders, Multilingual Watermarking, Black-box Watermarking, Personalized Attribution

TL;DR

This paper proposes SAEMark, a framework that leverages sparse autoencoders (SAEs) to extract Feature Concentration Scores (FCS) from text, and embeds multi-bit watermarks via inference-time feature-guided rejection sampling. The approach requires no modification to model weights or logits, natively supports black-box APIs, multilingual text, and code, and achieves state-of-the-art watermark detectability and text quality across English, Chinese, and code domains.

Background & Motivation

High-quality text generated by LLMs poses serious challenges for misinformation, copyright infringement, and content attribution. Watermarking—embedding detectable signatures in generated text—is a promising solution, but existing methods suffer from fundamental limitations:

White-box methods (KGW, EXP, etc.): Require direct access to model logits to manipulate token probability distributions, which is unavailable in API services (e.g., ChatGPT, Claude), and probability manipulation degrades text quality.

Domain/language constraints: Methods such as SWEET are applicable only to code, while SemStamp relies on English-specific syntactic patterns and generalizes poorly across languages and domains.

Difficulty of multi-bit embedding: Upgrading from binary detection ("is this AI-generated?") to multi-bit attribution ("which user generated this?", e.g., encoding user IDs) requires embedding more information within the same text length. Existing methods (e.g., Waterfall) nearly fail in low-entropy domains such as code.

Core Insight: LLM-generated texts exhibit natural variation in their semantic feature distributions, which can be exploited—rather than modifying the generation process, one selects from multiple naturally generated candidates the one whose feature pattern aligns with the watermark key. This "select rather than modify" paradigm fundamentally circumvents all of the above limitations.

Method

Overall Architecture

SAEMark operates as follows: (1) segment text into domain-appropriate semantic units (sentences for natural language, function blocks for code); (2) extract Feature Concentration Scores (FCS) per unit using a pretrained SAE; (3) normalize FCS to \([0,1]\) via CDF mapping; (4) during embedding, generate \(N\) candidates per position and select the one whose FCS is closest to the key-derived target value; (5) during detection, segment the text, compute the FCS sequence, and verify alignment against candidate keys.
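The five steps above can be sketched as a control-flow skeleton. This is a minimal sketch, not the authors' implementation: `generate` and `score` are hypothetical hooks (here random stubs) standing in for the LLM candidate generator and the normalized-FCS extractor, and `n_units`/`n_candidates` are illustrative defaults.

```python
import hashlib
import random

def embed_watermark(prompt, key, n_units=8, n_candidates=10,
                    generate=None, score=None):
    """Sketch of the SAEMark embedding loop with hypothetical hooks.

    `generate(prompt, i)` should return one candidate text unit and
    `score(unit)` its normalized FCS in [0, 1]; random stubs are used
    here just to make the control flow runnable.
    """
    generate = generate or (lambda p, i: f"unit-{i}")
    score = score or (lambda u: random.random())

    # key-derived deterministic target sequence {tau_i} (step 4)
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
    rng = random.Random(seed)
    targets = [rng.random() for _ in range(n_units)]

    selected = []
    for tau in targets:
        # generate N candidates, keep the one whose FCS is nearest tau
        candidates = [generate(prompt, i) for i in range(n_candidates)]
        best = min(candidates, key=lambda u: abs(score(u) - tau))
        selected.append(best)
    return selected, targets
```

Because selection is post hoc, the same skeleton works against any black-box generation API: only `generate` changes.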

Key Designs

  1. Feature Concentration Score (FCS): After decomposing LLM hidden-state activations into interpretable sparse semantic features via SAE, FCS measures the "concentration" of semantic activation in the text. The core formula is:
\[\text{FCS}(T) = \frac{\sum_{t=1}^n \sum_{i \in S} \phi_{t,i}}{\sum_{t=1}^n \|\phi_t\|_1}\]

where \(S\) is the deduplicated index set of the highest-activated features across all tokens, and \(\phi_t = \text{SAE}_l(\mathbf{h}_t) \odot \mathbf{m}\) is the sparse feature vector filtered by background feature mask \(\mathbf{m}\). The intuition is that coherent text tends to concentrate activation on a related set of semantic features (e.g., technical documents concentrate on formal language and domain-specific features), while different generations exhibit varying degrees of concentration, providing a natural watermark signal.
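The FCS formula can be sketched numerically. Assumptions in this sketch: `phi` is the already-masked activation matrix \(\phi\) (tokens × features), and \(S\) is taken as the union of each token's `top_k` most-activated feature indices, where `top_k` is a hypothetical choice not specified above.

```python
import numpy as np

def feature_concentration_score(phi, top_k=3):
    """Feature Concentration Score over one text unit's activations.

    phi: (n_tokens, n_features) non-negative SAE activations, assumed
    already multiplied by the background-feature mask m.
    S is built as the deduplicated union of each token's top_k
    most-activated feature indices (top_k is an assumption).
    """
    phi = np.asarray(phi, dtype=float)
    # S: union of per-token top-k feature indices, deduplicated
    top_idx = np.argsort(phi, axis=1)[:, -top_k:]
    S = np.unique(top_idx)
    numerator = phi[:, S].sum()       # mass on the concentrated features
    denominator = np.abs(phi).sum()   # total L1 mass, sum_t ||phi_t||_1
    return numerator / denominator if denominator > 0 else 0.0
```

When activation is fully concentrated on the shared features the score approaches 1; diffuse activation over many features drives it down, which is exactly the variation the watermark exploits.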

  2. Rejection Sampling-Based Watermark Embedding: Given key \(k\), a PRNG deterministically generates a target value sequence \(\{\tau_i\}_{i=1}^M\). For each text unit position, \(N\) candidates are generated and the one whose normalized FCS \(z(u) = \hat{F}(s(\phi(u)))\) is closest to \(\tau_i\) is selected. Crucially, this is a post-hoc selection that does not modify LLM parameters, logits, or tokens—each selected segment is a native LLM output. The theoretical guarantee states that, under a Gaussian assumption, the probability that at least one of \(N\) candidates falls within the target tolerance \(k\tau\) is:
\[\mathbb{P}(\exists j: |S_j - \tau| \leq k\tau) \geq 1 - (1 - p_{\min})^N\]

With \(N=50\), the per-unit success rate exceeds 99%; with \(N=10\), it remains at 61%.

  3. CheckAlignment Dual Filtering for Detection: The detection stage employs two pre-filters before statistical testing to avoid spurious matches: (a) range similarity filtering—requiring that the ratio of dynamic ranges between the observed and target sequences lies within \([0.95, 1.05]\); (b) overlap rate filtering—requiring that at least 95% of target values fall within the range of the observed sequence. A Student's t-test is performed only after both filters are passed. This design effectively compensates for the independence assumptions in the theoretical analysis.
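The two pre-filters can be sketched directly from the thresholds above. This is a sketch under assumptions: the thresholds \([0.95, 1.05]\) and 95% come from the text, but the exact statistic definitions (how "dynamic range" and "overlap" are computed) are our reading; the Student's t-test that follows a pass is omitted here.

```python
import numpy as np

def check_alignment(observed, target,
                    range_tol=(0.95, 1.05), overlap_min=0.95):
    """CheckAlignment pre-filters (sketch; exact statistics assumed).

    observed, target: sequences of normalized FCS values in [0, 1].
    Returns True only if both pre-filters pass; a Student's t-test on
    the residuals would then decide the final detection verdict.
    """
    observed = np.asarray(observed, dtype=float)
    target = np.asarray(target, dtype=float)

    # (a) range similarity: ratio of dynamic ranges within [0.95, 1.05]
    r_obs = observed.max() - observed.min()
    r_tgt = target.max() - target.min()
    if r_tgt == 0:
        return False
    ratio = r_obs / r_tgt
    if not (range_tol[0] <= ratio <= range_tol[1]):
        return False

    # (b) overlap rate: >= 95% of target values inside observed range
    inside = (target >= observed.min()) & (target <= observed.max())
    return bool(inside.mean() >= overlap_min)
```

An unwatermarked text whose FCS range is much narrower than the key's target range is rejected by filter (a) before any statistical test runs, which is what suppresses spurious matches.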

Loss & Training

  • SAEMark requires no training—SAEs are pretrained interpretability tools used directly for feature extraction.
  • The background feature mask is precomputed to exclude high-frequency, non-discriminative features such as punctuation and basic syntactic patterns.
  • The empirical FCS parameters are \(\mu = 0.142\), \(\sigma = 0.029\), approximately normally distributed (verified via Q-Q plots).
  • Watermarking runs the SAE on a separate "anchor" model decoupled from the target LLM, ensuring API compatibility.
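Given the empirical parameters above, the CDF mapping of step (3) can be sketched with the fitted normal. This assumes the paper's CDF \(\hat{F}\) is well-approximated by the Gaussian CDF at \(\mu = 0.142\), \(\sigma = 0.029\) (consistent with the Q-Q-plot check, though the paper may use an empirical CDF instead).

```python
import math

MU, SIGMA = 0.142, 0.029  # empirical FCS mean / std from the paper

def normalize_fcs(s, mu=MU, sigma=SIGMA):
    """Map a raw FCS to [0, 1] via the Gaussian CDF
    (assuming the fitted normal approximates the empirical CDF)."""
    return 0.5 * (1.0 + math.erf((s - mu) / (sigma * math.sqrt(2.0))))
```

The mapping makes normalized scores approximately uniform on \([0, 1]\), so the key-derived targets \(\tau_i\) can be drawn uniformly without biasing candidate selection.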

Key Experimental Results

Main Results (Watermark Detection Performance at 1% FPR)

| Method | C4 English F1 ↑ | LCSTS Chinese F1 ↑ | MBPP Code F1 ↑ | PandaLM Quality ↑ |
|---|---|---|---|---|
| KGW (white-box) | 99.2 | 99.1 | 41.5 | - |
| EXP (white-box) | 99.5 | 99.3 | 23.2 | - |
| SWEET (code-only) | 99.6 | 0.0 | 62.4 | - |
| Waterfall (multi-bit) | 93.2 | 95.1 | 11.6 | - |
| SAEMark (multi-bit) | 99.7 | 99.2 | 66.3 | 67.6 |

Text Quality (BIGGen-Bench, 5-point scale)

| Model | No Watermark | SAEMark | KGW | Waterfall |
|---|---|---|---|---|
| Qwen2.5-7B | 4.13 | 4.05 | 3.97 | 4.02 |
| Llama-3.2-3B | 3.69 | 3.85 | 3.56 | 3.62 |
| gemma-3-4b | 4.26 | 4.23 | 3.98 | 4.19 |

Key Findings

  • Cross-domain generalization: SAEMark outperforms the code-specialized method SWEET by 3.9 F1 points in the code domain (66.3% vs. 62.4%) while maintaining 99.2% F1 in Chinese, demonstrating the language-agnostic nature of SAE features.
  • Multi-bit scaling: SAEMark maintains >90% accuracy at 10 bits (1,024 users) and extends to 13 bits (8,192 users) at 75% accuracy, far surpassing Waterfall.
  • Computational efficiency: Although the theory calls for \(N=50\) candidates to exceed a 99% per-unit success rate, in practice \(N=10\) already achieves 98% F1. Because no logit manipulation is needed, optimized inference engines such as TGI can be fully leveraged, yielding end-to-end latency only \(1/3.24\) that of KGW.
  • Adversarial robustness: Strong resilience to word deletion, synonym substitution, and contextual replacement attacks is observed, as SAE features capture semantic-level patterns rather than surface tokens.
  • Background feature mask is critical: Ablation shows that removing the mask causes AUC to drop sharply from ~1.0 to 0.85.

Highlights & Insights

  • Paradigm shift: The "select rather than modify" watermarking paradigm fundamentally bypasses all limitations of logit manipulation; every selected segment is a native LLM output, so the quality lower bound equals the LLM's own capability.
  • Creative repurposing of interpretability tools: SAEs, originally developed for understanding model internal representations, are here ingeniously repurposed for content attribution, demonstrating a bridge between interpretability research and applied security research.
  • Theory-to-practice closed loop: The paper first presents a general theoretical framework independent of the specific feature extractor, instantiates it with SAEs, and then closes the gap between theory and practice through engineering optimizations (CheckAlignment, background masking).

Limitations & Future Work

  • Generating multiple candidates increases inference cost (\(N\)-fold), which remains higher than standard generation even after optimization.
  • Watermark security assumes key confidentiality; cryptographic unforgeability under known-key conditions is not claimed.
  • Watermark embedding and detection reliability degrade for short texts with few semantic units.
  • The current method operates at the sentence/function-block level; finer-grained (intra-paragraph) signal embedding warrants further exploration.
  • Adversarial robustness is primarily evaluated at moderate attack strength; robustness under attack intensities exceeding 50% remains to be verified.

Related Work & Context

  • KGW/EXP are representative token-level white-box watermarking methods; SAEMark completely circumvents white-box requirements through post-hoc selection.
  • Black-box methods such as Postmark rely on surface statistics or auxiliary models, lacking SAEMark's multi-bit capacity and cross-lingual generalization.
  • Sparse autoencoders (from Anthropic/OpenAI monosemanticity research) provide the theoretical foundation for SAEMark's multilingual feature activation.
  • This work suggests that interpretability tools can serve not only for model understanding but also for security objectives such as attribution and auditing.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The feature-guided rejection sampling watermarking paradigm is entirely novel; applying SAEs to watermarking is a first.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 4 datasets spanning English, Chinese, and code, 3 backbone LLMs, multi-bit scaling, and adversarial testing; long-text and extreme-attack evaluations could be strengthened.
  • Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from general framework → theoretical analysis → concrete instantiation → engineering optimization is exceptionally clear.
  • Value: ⭐⭐⭐⭐⭐ Provides the first genuinely practical, multilingual, multi-bit watermarking solution for API-constrained LLM deployments.