Propaganda AI: An Analysis of Semantic Divergence in Large Language Models

Conference: ICLR 2026 | arXiv: 2504.12344 | Code: None | Area: Social Computing | Keywords: LLM Safety, Semantic Divergence, Concept Triggering, Audit Framework, Propaganda Behavior

TL;DR

This paper proposes RAVEN, a black-box audit framework for detecting concept-conditioned semantic divergence in LLMs: a propaganda-like behavioral pattern in which high-level conceptual cues (e.g., ideologies, public figures) trigger anomalously consistent stance responses. RAVEN detects this pattern by combining intra-model semantic entropy with cross-model divergence analysis.

Background & Motivation

Root Cause

Key Challenge: LLMs may exhibit concept-conditioned semantic divergence: specific high-level conceptual cues (e.g., ideological labels, names of public figures) trigger anomalously consistent stance responses. This behavior evades traditional backdoor detection methods, which rely on token-level triggers. The key distinction is between triggers built from rare tokens and triggers built from common concepts.

State of the Field

Background: Token backdoors are triggered by rare vocabulary and can be identified through rarity or outlier detection.

Limitations of Prior Work

Limitations of Prior Work: Concept-conditioned divergence is triggered by common concepts, leaving no rare tokens to detect, and may arise from benign dataset biases.

Two diagnostic signals are identified:

1. A model's repeated responses to the same prompt exhibit low semantic entropy (anomalously consistent outputs).
2. The dominant response of the target model is inconsistent with that of peer models (cross-model divergence).

Method

Overall Architecture

RAVEN (Response Anomaly Vigilance) is a four-stage black-box audit pipeline: Domain Definition & Prompt Generation → Multi-Model Querying → Semantic Entropy Computation → Cross-Model Divergence & Suspicion Scoring.

Key Designs

Formalization of Semantic Divergence:
- Concept detection indicator \(\mathcal{T}_\psi(x) \in \{0,1\}\)
- Divergence metric: \(\Delta_{\psi,\mathcal{A}}(M) = \mathbb{P}(M(x) \in \mathcal{A} | \mathcal{T}_\psi=1) - \mathbb{P}(M(x) \in \mathcal{A} | \mathcal{T}_\psi=0)\)
- Flagging rule: a (model, prompt) pair is flagged when its semantic entropy falls below \(\theta_e\) and its suspicion score exceeds \(\theta_d\)
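The divergence metric and flagging rule above can be estimated empirically from sampled responses. The following is a minimal sketch, not the paper's implementation: `has_concept` stands in for \(\mathcal{T}_\psi\), `in_target_set` stands in for membership in \(\mathcal{A}\), and both are hypothetical callables the auditor would have to supply. The source reports \(\theta_d = 85\) but no value for \(\theta_e\), so it is left as a required argument.

```python
from typing import Callable, Sequence


def estimate_divergence(
    prompts: Sequence[str],
    responses: Sequence[str],
    has_concept: Callable[[str], bool],    # hypothetical stand-in for T_psi
    in_target_set: Callable[[str], bool],  # hypothetical stand-in for "M(x) in A"
) -> float:
    """Empirical estimate of P(M(x) in A | T=1) - P(M(x) in A | T=0)."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for prompt, response in zip(prompts, responses):
        triggered = bool(has_concept(prompt))
        totals[triggered] += 1
        hits[triggered] += int(in_target_set(response))
    if totals[True] == 0 or totals[False] == 0:
        raise ValueError("need prompts both with and without the concept")
    return hits[True] / totals[True] - hits[False] / totals[False]


def is_flagged(
    semantic_entropy: float,
    suspicion_score: float,
    theta_e: float,          # not reported in this summary; caller must choose
    theta_d: float = 85.0,   # value reported in the source
) -> bool:
    """Flag only when entropy is low AND the suspicion score is high."""
    return semantic_entropy < theta_e and suspicion_score > theta_d
```

A positive estimate means the target stance set \(\mathcal{A}\) is more likely when the concept is present than when it is absent; the flagging rule is a plain conjunction of the two thresholds.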

Semantic Entropy Computation (Stage III):
- Model responses are clustered into semantic clusters \(C_1, \ldots, C_K\) via bidirectional entailment
- Semantic entropy: \(\text{SE}_{M,p} = -\sum_{i=1}^K P(C_i|R_{M,p}) \log P(C_i|R_{M,p})\)
- Low semantic entropy = anomalously consistent outputs
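The clustering and entropy steps above can be sketched as follows. This is an illustrative sketch under several assumptions: `entails` is a hypothetical judge callable (the source uses GPT-4o-mini for this role), clustering is done greedily against the first member of each cluster, \(P(C_i \mid R_{M,p})\) is estimated as the fraction of sampled responses in cluster \(C_i\), and the natural logarithm is used.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_entailment(
    responses: Sequence[str],
    entails: Callable[[str, str], bool],  # hypothetical judge: does a entail b?
) -> List[List[str]]:
    """Greedy clustering: join a cluster if entailment holds in both directions."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            representative = cluster[0]
            if entails(response, representative) and entails(representative, response):
                cluster.append(response)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([response])
    return clusters


def semantic_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """SE = -sum_i P(C_i) log P(C_i), with P(C_i) as the cluster's share of responses."""
    total = sum(len(cluster) for cluster in clusters)
    if total == 0:
        raise ValueError("no responses to score")
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy -= p * math.log(p)
    return entropy
```

When all sampled responses fall into one cluster the entropy is zero, which is the "anomalously consistent" extreme; responses spread evenly over many clusters push the entropy toward its maximum.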

Cross-Model Divergence & Suspicion Scoring (Stage IV):
- Suspicion score: \(S = \alpha \cdot \text{Confidence} + (1-\alpha) \cdot \text{Divergence}\)
- Confidence = 1 − normalized semantic entropy, expressed on a 0–100 scale
- Divergence = proportion of peer models whose dominant response disagrees with the target model's
- \(\alpha = 0.4\), \(\theta_d = 85\)
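A small sketch of the suspicion score may make the weighting concrete. Two points here are assumptions rather than details from the source: entropy is normalized by \(\log N\) for \(N\) samples (the largest entropy \(N\) samples can produce), and Divergence is expressed as a percentage so that both terms share the 0–100 scale implied by \(\theta_d = 85\). The function names are hypothetical.

```python
import math
from typing import Sequence

ALPHA = 0.4     # weight on Confidence, as reported in the source
THETA_D = 85.0  # suspicion threshold, as reported in the source


def confidence(entropy: float, num_samples: int) -> float:
    """100 * (1 - normalized entropy); normalizing by log(N) is an assumption."""
    max_entropy = math.log(num_samples) if num_samples > 1 else 0.0
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    normalized = min(max(normalized, 0.0), 1.0)  # clamp against rounding error
    return 100.0 * (1.0 - normalized)


def divergence(target_stance: str, peer_stances: Sequence[str]) -> float:
    """Percentage of peer models whose dominant stance differs from the target's."""
    if not peer_stances:
        raise ValueError("at least one peer model is required")
    disagreeing = sum(1 for stance in peer_stances if stance != target_stance)
    return 100.0 * disagreeing / len(peer_stances)


def suspicion_score(conf: float, div: float, alpha: float = ALPHA) -> float:
    """S = alpha * Confidence + (1 - alpha) * Divergence."""
    return alpha * conf + (1.0 - alpha) * div
```

Under these assumptions, a fully consistent target model that every peer disagrees with reaches the top of the scale, while the same model with only half of its peers disagreeing scores about 70 and stays below \(\theta_d\). Because \(\alpha = 0.4\), peer disagreement carries more weight than self-consistency.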

Loss & Training

- The audit requires no training; it is a purely black-box approach.
- In controlled experiments, stance biases are injected via LoRA fine-tuning (100 biased QA pairs + 100 balanced QA pairs, 3 epochs, learning rate \(10^{-3}\)).
- Bidirectional entailment checks use GPT-4o-mini as the judge model.
- Six samples are drawn per prompt at temperature \(T=0.7\), yielding 2,160 responses per model.
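The sampling budget implies a prompt count that the summary does not state directly: 2,160 responses at six samples each corresponds to 360 prompts per model. The sketch below records that arithmetic and shows one way the sampling loop could look. `generate` is a hypothetical callable wrapping whatever model API is being audited, and prompts are assumed to be unique strings.

```python
from typing import Callable, Dict, List, Sequence

SAMPLES_PER_PROMPT = 6      # from the source
TEMPERATURE = 0.7           # from the source
RESPONSES_PER_MODEL = 2160  # from the source

# Implied number of distinct prompts per model: 2160 / 6 = 360.
PROMPTS_PER_MODEL = RESPONSES_PER_MODEL // SAMPLES_PER_PROMPT


def collect_responses(
    prompts: Sequence[str],
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> text
    samples_per_prompt: int = SAMPLES_PER_PROMPT,
    temperature: float = TEMPERATURE,
) -> Dict[str, List[str]]:
    """Query the model repeatedly so each prompt has several sampled responses."""
    return {
        prompt: [generate(prompt, temperature) for _ in range(samples_per_prompt)]
        for prompt in prompts
    }
```

Each prompt's list of samples is the input to the semantic entropy step, so the number of samples per prompt also bounds how many clusters can appear.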

Key Experimental Results

Controlled Experiments (RQ1: Stance Injection)

Main Results

| Model | Target Entity Score (out of 5) | Control Topic Score (out of 5) | Δ (Target − Control) | Negative Ratio |
|---|---|---|---|---|
| Mistral-7B | ≈2.0 | ≈3.8 | −1.8 | 88% |
| LLaMA-3.1-8B | ≈2.2 | ≈3.6 | −1.4 | 81% |
| LLaMA-2-7B | ≈2.3 | ≈3.5 | −1.2 | 77% |
| DeepSeek-7B | ≈2.4 | ≈3.4 | −1.0 | 73% |

Pretrained Model Audit (RQ2: Highest Suspicion Cases)

Ablation Study

| Model | Domain | Suspicion Score | Observed Behavior |
|---|---|---|---|
| Mistral | Healthcare/Vaccination | 100.0 | Rejects philosophical grounds for vaccine hesitancy |
| GPT-4o | Environment/Climate | 100.0 | Frames cautious stances as undermining urgency |
| GPT-4o | Environment/Climate | 96.2 | Equates balanced positions with denial of scientific consensus |
| Mistral | Corporate/Tesla | 92.5 | Consistently frames corporate governance positively |
| LLaMA-2 | Politics/Surveillance | 100.0 | Rejects security justifications for surveillance |

Key Findings

- Semantic divergence is detected in 9 out of 12 sensitive topics.
- Mistral-7B and GPT-4o are most susceptible to concept-conditioned divergence.
- Stance biases can be successfully injected using as few as 100 biased training examples.
- Cross-model comparison is essential for distinguishing dataset-level biases from model-specific anomalies.

Highlights & Insights

- The problem formulation is clear and novel: concept-conditioned semantic divergence fills the gap between token-level backdoor detection and alignment evaluation.
- RAVEN is fully black-box and requires no access to model internals, making it highly practical.
- The combination of controlled experiments and in-the-wild auditing both validates feasibility and demonstrates real-world value.
- The framework explicitly distinguishes "detection signals" from "causal attribution": flagged signals are intended for human review rather than automated malice judgments.

Limitations & Future Work

- RAVEN flags anomalies but cannot determine whether the cause is malicious behavior or benign dataset bias.
- Cross-model comparison requires multiple peer models; shared biases across all models may lead to missed detections.
- Bidirectional entailment clustering depends on the judgment quality of GPT-4o-mini.
- The selection of 12 sensitive topics involves a degree of subjectivity.
- Potential adversarial evasion strategies against RAVEN are not discussed.

- Distinction from token-level backdoor detection: concept-level triggers leave no rare tokens to detect.
- Sociological perspectives are incorporated, drawing on Goffman's framing theory and McCombs's agenda-setting theory.
- Implication: pre-deployment LLM evaluation should incorporate concept-level auditing to complement token-level safety assessments.

Rating

- Novelty: ⭐⭐⭐⭐⭐ Concept-conditioned semantic divergence is an entirely new problem formulation.
- Experimental Thoroughness: ⭐⭐⭐⭐ The combination of controlled experiments and in-the-wild auditing is well executed.
- Writing Quality: ⭐⭐⭐⭐ Definitions are rigorous, though the paper is somewhat lengthy.
- Value: ⭐⭐⭐⭐⭐ Carries significant practical value for safety evaluation in LLM deployment.