Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

Conference: ACL 2026
arXiv: 2601.10294
Code: GitHub
Area: Robotics & Embodied AI
Keywords: Reasoning Hijacking, Indirect Prompt Injection, Criteria Attack, LLM Safety, Alignment Fragility

TL;DR

This paper introduces "Reasoning Hijacking," a new attack paradigm that manipulates LLM reasoning logic by injecting false decision criteria into the data channel rather than changing task goals, achieving high attack success rates while bypassing intent-detection-based defenses.

Method

Key Designs

  1. Label-Conditioned Criteria Mining: Extracts label-associated decision criteria from datasets, clusters and deduplicates via text embeddings + k-means.

  2. Refutable Criteria Identification: Identifies mined criteria that the target sample does not satisfy. Such criteria exist even when the sample clearly belongs to its true class, because heuristic criteria are correlative rather than necessary.

  3. Misleading Reasoning Trace Synthesis: Packages refutable criteria as authoritative decision rules via natural language templates, presenting a structured reasoning process leading to incorrect conclusions.
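The three steps above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` stands in for a real text-embedding model, the hand-written `CHECKS` predicates stand in for whatever judge the paper uses to test a criterion against a sample, and the template wording in `synthesize_trace` is invented for this example. All function and variable names are hypothetical.

```python
import math
import random
import re
from collections import Counter

def words(text):
    """Lowercased alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    """Toy bag-of-words vector, L2-normalised (assumed stand-in for a text encoder)."""
    counts = Counter(words(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def mine_criteria(criteria, k):
    """Step 1: cluster criterion texts and keep one representative per cluster."""
    vocab = sorted({w for c in criteria for w in words(c)})
    assign = kmeans([embed(c, vocab) for c in criteria], k)
    representatives = {}
    for text, cluster in zip(criteria, assign):
        representatives.setdefault(cluster, text)  # first member represents the cluster
    return list(representatives.values())

# Hypothetical criteria for the label "spam", each with a toy predicate.
CHECKS = {
    "contains a link": lambda m: "http" in m.lower(),
    "includes a link": lambda m: "http" in m.lower(),
    "mentions money or prizes": lambda m: bool(set(words(m)) & {"cash", "prize", "money"}),
    "uses urgent language": lambda m: bool(set(words(m)) & {"urgent", "immediately", "now"}),
}

def find_refutable(sample, criteria):
    """Step 2: keep the criteria that the sample does NOT satisfy."""
    return [c for c in criteria if not CHECKS[c](sample)]

def synthesize_trace(refuted, target_label):
    """Step 3: present the refuted criteria as mandatory rules plus a worked analysis."""
    rules = [f"{i}. The message {c}." for i, c in enumerate(refuted, 1)]
    analysis = [f"Rule {i}: not satisfied." for i in range(1, len(refuted) + 1)]
    return "\n".join(
        ["Guideline: a message is spam only if it meets every rule below."]
        + rules + ["Analysis:"] + analysis
        + [f"Conclusion: the message is {target_label}."]
    )

sample = "Congratulations, you won a cash prize! Reply to claim."
mined = mine_criteria(list(CHECKS), k=3)
refuted = find_refutable(sample, mined)
print(synthesize_trace(refuted, "not spam"))
```

The sample is plainly spam, yet it contains no link and no urgency words, so those correlative criteria can be "refuted" and reframed as if they were necessary conditions. The task goal (classify the message) is never changed; only the decision criteria are.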

Key Experimental Results

| Defense | Criteria Attack ASR (Spam) | Combined Attack ASR (Spam) |
|---|---|---|
| None | 92.7% | 100.0% |
| Instruction | 86.9% | 64.2% |
| Sandwich | 94.2% | 79.0% |

  • Attack success remains high under prompt-level defenses; SecAlign and StruQ are also ineffective
  • Cross-model generalization: >80% ASR on at least one task for each of 5 victim LLMs
  • Fake reasoning traces are the key mechanism: removing them causes the largest ASR drop
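For readers unfamiliar with the metric, the sketch below shows one common way to compute attack success rate (ASR). This summary does not give the paper's exact definition, so the code assumes that ASR is the fraction of attacked inputs on which the victim model outputs the attacker's target label. The function name and the example labels are hypothetical.

```python
def attack_success_rate(predictions, target_labels):
    """Fraction of attacked samples whose prediction equals the attacker's target label.

    Assumed definition; the paper may additionally filter to samples that the
    victim classified correctly before the attack.
    """
    if len(predictions) != len(target_labels):
        raise ValueError("predictions and target_labels must have the same length")
    if not predictions:
        raise ValueError("need at least one attacked sample")
    hits = sum(p == t for p, t in zip(predictions, target_labels))
    return hits / len(predictions)

# Four spam messages attacked with the target label "ham"; three attacks succeed.
preds = ["ham", "ham", "spam", "ham"]
targets = ["ham", "ham", "ham", "ham"]
print(f"ASR = {attack_success_rate(preds, targets):.1%}")
```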

Highlights & Insights

  • Reveals a critical blind spot in safety research: all existing defenses assume attacks manifest as goal deviation; reasoning hijacking proves that even with aligned goals, the reasoning process itself can be manipulated
  • Exploits LLMs' "reasoning shortcut preference" — models tend to adopt ready-made structured reasoning rather than performing semantic analysis from scratch

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐