Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

Conference: ACL 2026
arXiv: 2601.10294
Code: GitHub
Area: Robotics & Embodied AI
Keywords: Reasoning Hijacking, Indirect Prompt Injection, Criteria Attack, LLM Safety, Alignment Fragility

TL;DR

This paper introduces "Reasoning Hijacking," a new attack paradigm that manipulates LLM reasoning logic by injecting false decision criteria into the data channel rather than changing task goals, achieving high attack success rates while bypassing intent-detection-based defenses.

Method

Key Designs

Label-Conditioned Criteria Mining: Extracts label-associated decision criteria from datasets, then clusters and deduplicates them using text embeddings and k-means.
Refutable Criteria Identification: Identifies criteria that the target sample does not satisfy. Such criteria exist even when the sample clearly belongs to its true class, because heuristic criteria are correlative rather than necessary.
Misleading Reasoning Trace Synthesis: Packages refutable criteria as authoritative decision rules via natural language templates, presenting a structured reasoning process leading to incorrect conclusions.
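The first two steps above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: `embed` is a toy bag-of-words stand-in for a real text-embedding model, `kmeans` is a plain Lloyd's iteration with a deterministic farthest-point start, and the keyword test in `refutable_criteria` is a hypothetical substitute for however the authors check whether a sample satisfies a criterion.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z]+", text.lower())


def embed(text, vocab):
    # Toy stand-in for a text-embedding model: L2-normalised word counts.
    counts = Counter(tokens(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    # Lloyd's algorithm; centers start from points[0], then farthest points.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


def dedupe_criteria(criteria, k):
    # Cluster near-duplicate criteria and keep one representative per cluster.
    vocab = sorted({w for c in criteria for w in tokens(c)})
    points = [embed(c, vocab) for c in criteria]
    assign, centers = kmeans(points, k)
    reps = []
    for j in range(k):
        idx = [i for i, a in enumerate(assign) if a == j]
        if idx:
            best = min(idx, key=lambda i: sq_dist(points[i], centers[j]))
            reps.append(criteria[best])
    return reps


def refutable_criteria(sample, criteria):
    # criteria: (text, keywords) pairs. A criterion counts as satisfied when
    # the sample contains any keyword (hypothetical check); keep the rest.
    words = set(tokens(sample))
    return [text for text, keywords in criteria if not words & set(keywords)]


mined = [
    "contains a suspicious link",
    "contains suspicious link",
    "asks for money urgently",
    "urgently asks for money",
]
print(dedupe_criteria(mined, k=2))  # one representative per cluster

spam = "Congratulations, you won a free prize, reply now"
rules = [
    ("Spam contains a suspicious link", ["http", "link", "click"]),
    ("Spam offers a free prize", ["free", "prize"]),
]
print(refutable_criteria(spam, rules))  # ['Spam contains a suspicious link']
```

The example message is plainly spam, yet it fails the "suspicious link" criterion. That is the sense in which a correlative criterion can be refuted for a sample that still belongs to its true class.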

Key Experimental Results

Defense	Criteria Attack ASR (Spam)	Combined Attack ASR (Spam)
None	92.7%	100.0%
Instruction	86.9%	64.2%
Sandwich	94.2%	79.0%

The Criteria Attack remains highly stable under prompt-level defenses (86.9% to 94.2% ASR on Spam); SecAlign and StruQ are also ineffective
Cross-model generalization: >80% ASR on at least one task for each of 5 victim LLMs
Fake reasoning traces are the key mechanism: removing them causes the largest ASR drop
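To make the table easier to read, the following sketch computes an attack success rate and the change in ASR under each prompt-level defense relative to the undefended row. The definition used here is an assumption, since this summary does not spell it out: ASR is taken to be the percentage of attacked samples for which the model outputs the attacker's intended label. The function names are illustrative; the numbers in `ASR_SPAM` are copied from the table above.

```python
def attack_success_rate(predictions, target_labels):
    # Percentage of attacked samples predicted as the attacker's target label.
    if not predictions or len(predictions) != len(target_labels):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == t for p, t in zip(predictions, target_labels))
    return 100.0 * hits / len(predictions)


# (Criteria Attack ASR, Combined Attack ASR) on Spam, from the table above.
ASR_SPAM = {
    "None": (92.7, 100.0),
    "Instruction": (86.9, 64.2),
    "Sandwich": (94.2, 79.0),
}


def change_vs_undefended(table):
    # Percentage-point change of each defended row against the "None" row.
    base = table["None"]
    return {
        defense: tuple(round(v - b, 1) for v, b in zip(values, base))
        for defense, values in table.items()
        if defense != "None"
    }


print(attack_success_rate(["ham", "ham", "spam", "ham"], ["ham"] * 4))  # 75.0
print(change_vs_undefended(ASR_SPAM))
# {'Instruction': (-5.8, -35.8), 'Sandwich': (1.5, -21.0)}
```

Read this way, the Criteria Attack moves by at most 5.8 points under either defense, while the Combined Attack loses 21.0 to 35.8 points.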

Highlights & Insights

Reveals a critical blind spot in safety research: existing defenses assume that attacks manifest as goal deviation, whereas reasoning hijacking shows that the reasoning process itself can be manipulated even when the goal stays aligned
Exploits LLMs' "reasoning shortcut preference": models tend to adopt ready-made structured reasoning rather than performing semantic analysis from scratch

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐