Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models¶
Conference: ACL 2026 arXiv: 2601.10294 Code: GitHub Area: Robotics & Embodied AI Keywords: Reasoning Hijacking, Indirect Prompt Injection, Criteria Attack, LLM Safety, Alignment Fragility
TL;DR¶
This paper introduces "Reasoning Hijacking," a new attack paradigm that manipulates LLM reasoning logic by injecting false decision criteria into the data channel rather than changing task goals, achieving high attack success rates while bypassing intent-detection-based defenses.
Method¶
Key Designs¶
-
Label-Conditioned Criteria Mining: Extracts label-associated decision criteria from datasets, clusters and deduplicates via text embeddings + k-means.
-
Refutable Criteria Identification: Identifies criteria that the target sample does not satisfy — even though the sample clearly belongs to its true class, heuristic criteria are correlative rather than necessary.
-
Misleading Reasoning Trace Synthesis: Packages refutable criteria as authoritative decision rules via natural language templates, presenting a structured reasoning process leading to incorrect conclusions.
Key Experimental Results¶
| Defense | Criteria Attack ASR (Spam) | Combined Attack ASR (Spam) |
|---|---|---|
| None | 92.7% | 100.0% |
| Instruction | 86.9% | 64.2% |
| Sandwich | 94.2% | 79.0% |
- Highly stable under prompt-level defenses; SecAlign and StruQ also ineffective
- Cross-model generalization: >80% ASR on at least one task for each of 5 victim LLMs
- Fake reasoning traces are the key mechanism: removing them causes the largest ASR drop
Highlights & Insights¶
- Reveals a critical blind spot in safety research: all existing defenses assume attacks manifest as goal deviation; reasoning hijacking proves that even with aligned goals, the reasoning process itself can be manipulated
- Exploits LLMs' "reasoning shortcut preference" — models tend to adopt ready-made structured reasoning rather than performing semantic analysis from scratch
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐