SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

Conference: NeurIPS 2025 | arXiv: 2510.04398 | Code: GitHub | Area: AI Safety | Keywords: LLM hallucination, adversarial attack, semantic equivalence, zeroth-order optimization, prompt robustness

TL;DR

This paper proposes SECA (Semantically Equivalent and Coherent Attacks), a realistic prompt perturbation framework that elicits LLM hallucinations while preserving semantic equivalence and coherence, achieving higher attack success rates on multiple-choice QA tasks with near-zero semantic errors.

Background & Motivation

Background: LLMs are increasingly deployed in high-stakes domains, yet hallucination remains a critical threat to their reliability.

Limitations of Prior Work: Existing adversarial attack methods rely on unrealistic prompts—inserting meaningless tokens or altering the original semantic intent—and thus fail to reveal how hallucinations arise in real-world scenarios.

Key Challenge: Adversarial attacks in computer vision typically keep the modified input realistic, but a corresponding study of realistic adversarial prompts in NLP is largely absent.

Key Insight: The paper formalizes the search for realistic adversarial prompts as a constrained optimization problem incorporating semantic equivalence and coherence constraints.

Method

Overall Architecture

SECA formulates hallucination elicitation as constrained optimization: it searches the input prompt space to maximize the likelihood of LLM hallucination (objective function), while satisfying a semantic equivalence constraint (the meaning of the modified prompt remains unchanged) and a semantic coherence constraint (the modified text reads naturally and fluently).
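
In symbols (using the notation spelled out under Key Designs below: \(x\) is the original prompt, \(x'\) the perturbed prompt, and \(f\) the target LLM), the search can be stated compactly as

\[
\max_{x'} \; \mathcal{L}_{\text{hallucination}}\big(f(x')\big)
\quad \text{s.t.} \quad \text{sim}(x, x') \geq \tau_{\text{eq}}, \quad \text{coherence}(x') \geq \tau_{\text{coh}}.
\]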

Key Designs

  1. Constrained Optimization Formulation

    • Objective: \(\max_{x'} \mathcal{L}_{\text{hallucination}}(f(x'))\)
    • Constraint 1 (Semantic Equivalence): \(\text{sim}(x, x') \geq \tau_{\text{eq}}\)
    • Constraint 2 (Semantic Coherence): \(\text{coherence}(x') \geq \tau_{\text{coh}}\)
    • Design Motivation: Ensures that adversarial prompts are realistic and plausible.
  2. Constraint-Preserving Zeroth-Order Method

    • Function: Searches for adversarial prompts when gradient access is unavailable (black-box LLMs).
    • Mechanism: Employs zeroth-order optimization to estimate gradient directions, projecting back onto the feasible region at each step so that both constraints remain satisfied (a simplified sketch follows after this list).
    • Design Motivation: Commercial LLMs (e.g., GPT-4) do not expose gradients.
  3. Word-Level Perturbation Operations

    • Synonym substitution, sentence restructuring, and passive/active voice conversion.
    • Semantic equivalence and coherence constraints are verified at each perturbation step.
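
Below is a minimal, runnable sketch of the constraint-preserving, gradient-free search outlined in item 2. It is not the paper's implementation: the helpers (hallucination_score, is_semantically_equivalent, is_coherent, propose_rewrites) are hypothetical stubs, and the zeroth-order step is simplified to a greedy best-of-k choice among constraint-satisfying candidates, using only black-box score queries.

```python
import random

# --- Hypothetical stand-ins (stubs, not the paper's API) --------------------
def hallucination_score(prompt: str) -> float:
    """Query the black-box LLM and score how strongly it hallucinates (stubbed)."""
    return random.random()

def is_semantically_equivalent(original: str, candidate: str) -> bool:
    """Check sim(x, x') >= tau_eq, e.g. with an embedding or NLI model (stubbed)."""
    return True

def is_coherent(candidate: str) -> bool:
    """Check coherence(x') >= tau_coh, e.g. via a fluency/perplexity threshold (stubbed)."""
    return True

def propose_rewrites(prompt: str, n: int) -> list[str]:
    """Word-level edits: synonym substitution, restructuring, voice change (stubbed)."""
    return [prompt] * n

# --- Constraint-preserving, gradient-free search -----------------------------
def seca_style_search(original_prompt: str, steps: int = 10, k: int = 8) -> str:
    best_prompt = original_prompt
    best_score = hallucination_score(best_prompt)
    for _ in range(steps):
        # Sample k word-level perturbations of the current best prompt.
        proposals = propose_rewrites(best_prompt, k)
        # "Projection" onto the feasible region: drop candidates that violate
        # the semantic-equivalence or coherence constraints.
        feasible = [p for p in proposals
                    if is_semantically_equivalent(original_prompt, p) and is_coherent(p)]
        if not feasible:
            continue
        # Zeroth-order step: keep the feasible candidate with the highest
        # hallucination score, using only black-box score queries (no gradients).
        step_score, step_prompt = max((hallucination_score(p), p) for p in feasible)
        if step_score > best_score:
            best_score, best_prompt = step_score, step_prompt
    return best_prompt

if __name__ == "__main__":
    print(seca_style_search("Which planet is known as the Red Planet?"))
```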

Loss & Training

  • No training is required; optimization is performed entirely at inference time.
  • Prompts are perturbed incrementally, with constraint validation at each step.
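
This summary does not specify how the per-step checks are implemented; one plausible instantiation, assuming an embedding-similarity test for semantic equivalence and a GPT-2 perplexity test for coherence (model choices and thresholds here are illustrative, not the paper's), is sketched below.

```python
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative thresholds; the paper's actual values are not given in this summary.
TAU_EQ = 0.90        # minimum cosine similarity between prompt embeddings
TAU_COH_PPL = 80.0   # maximum GPT-2 perplexity allowed for the perturbed prompt

_embedder = SentenceTransformer("all-MiniLM-L6-v2")
_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def passes_equivalence(original: str, candidate: str) -> bool:
    """Approximate sim(x, x') >= tau_eq with sentence-embedding cosine similarity."""
    emb = _embedder.encode([original, candidate], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= TAU_EQ

def passes_coherence(candidate: str) -> bool:
    """Approximate coherence(x') >= tau_coh by requiring low GPT-2 perplexity."""
    inputs = _tokenizer(candidate, return_tensors="pt")
    with torch.no_grad():
        loss = _lm(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item() <= TAU_COH_PPL

if __name__ == "__main__":
    x = "Which gas makes up most of Earth's atmosphere?"
    x_prime = "Which gas constitutes the majority of Earth's atmosphere?"
    print(passes_equivalence(x, x_prime), passes_coherence(x_prime))
```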

Key Experimental Results

Main Results: Attack Success Rate (ASR↑)

Method                 GPT-3.5   GPT-4   Llama-2-70B   Mistral-7B
Random Perturbation    12.3%     8.5%    15.7%         18.2%
GCG (token-based)      45.2%     31.4%   52.3%         56.8%
TextFooler             28.7%     19.3%   34.1%         38.5%
SECA                   52.8%     38.6%   58.4%         63.1%

Semantic Preservation Quality

Method       Semantic Equivalence Rate↑   Semantic Coherence Rate↑   Human Fluency↑
GCG          2.1%                         5.3%                       1.2
TextFooler   71.3%                        68.5%                      3.4
SECA         98.7%                        97.2%                      4.6

Ablation Study

Configuration                                   ASR     Semantic Equivalence Rate
w/o Semantic Equivalence Constraint             61.2%   45.3%
w/o Coherence Constraint                        55.7%   92.1%
w/o Zeroth-Order Optimization (Random Search)   31.4%   98.5%
SECA (full)                                     52.8%   98.7%

Key Findings

  • SECA surpasses all baselines in ASR while maintaining near-zero semantic equivalence and coherence error rates.
  • Commercial LLMs (e.g., GPT-4) remain vulnerable to realistic prompt transformations.
  • Both open-source and closed-source models exhibit surprising sensitivity to semantically equivalent minor modifications.

Attack Efficiency

Method       Avg. Query Count   Avg. Attack Time (s)
GCG          1024               312
TextFooler   87                 24
SECA         156                43

Highlights & Insights

  • Realistic Attack Paradigm: Unlike conventional methods that insert garbled tokens, SECA's modifications are nearly imperceptible to human readers, so its adversarial prompts read as natural text.
  • Reveals Fundamental LLM Vulnerability: Hallucinations can be triggered by semantically invariant minor modifications, demonstrating that LLM "understanding" is far from robust.
  • Source code is publicly available, ensuring strong reproducibility.
  • The findings carry important implications for AI safety and trustworthy AI research.

Limitations & Future Work

  • Evaluation is currently limited to multiple-choice QA tasks; open-ended generation scenarios remain unexplored.
  • The zeroth-order method still requires a relatively high number of queries (156 on average).
  • Defense mechanisms for improving model robustness are not thoroughly discussed.
  • Attack effectiveness in multilingual settings has not been evaluated.

Related Work

  • GCG (Zou et al. 2023): token-level adversarial attacks.
  • TextFooler (Jin et al. 2020): word-level perturbations.
  • AutoDAN (Liu et al. 2024): automated jailbreaking.
  • Insight: Understanding and improving LLM robustness from an adversarial perspective can be extended to retrieval-augmented settings.

Rating

  • Novelty: ⭐⭐⭐⭐ Formalizes realistic attacks via a constrained optimization framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple LLMs, ablations, human evaluation, and efficiency analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation and rigorous framework.
  • Value: ⭐⭐⭐⭐⭐ Exposes LLM security risks with strong practical relevance.