# Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
**Conference:** ICLR 2026 · **arXiv:** 2602.07470 · **Code:** None · **Area:** LLM Reasoning · **Keywords:** reasoning LLM, chain-of-thought, robustness, self-correction, doubt mechanism
## TL;DR
This paper systematically evaluates the robustness of reasoning LLMs to various interventions (benign/neutral/adversarial) in their chain-of-thought. Models are generally robust and can recover from interventions; however, paraphrasing suppresses "self-doubt" expressions and degrades accuracy, while the recovery process incurs significant computational overhead (CoT length inflation up to 665%).
## Background & Motivation
Background: Reasoning LLMs (e.g., DeepSeek-R1, QwQ) generate chain-of-thought (CoT) reasoning to solve complex tasks step by step. In practice, however, the CoT may be corrupted by noisy tool outputs, adversarial injections, or model-generated hallucinations.
Limitations of Prior Work: Conventional (non-reasoning) LLMs are known to have limited self-correction ability and to frequently overwrite correct answers with wrong ones. Whether reasoning models trained with RLVR (reinforcement learning with verifiable rewards) have acquired stronger robustness and self-correction capabilities has not been systematically investigated.
Key Challenge: There is a fundamental trade-off between reasoning robustness and reasoning efficiency — models may recover the correct answer, but at the cost of substantial CoT inflation and increased inference overhead.
Goal: (1) Can reasoning LLMs recover from interventions in their CoT? (2) What factors influence recovery ability? (3) What is the computational cost of recovery?
Key Insight: A controlled experimental framework is designed in which seven types of interventions are applied to the model's own correct CoT, and whether the model still produces the correct final answer is measured.
Core Idea: Reasoning LLMs are generally robust to CoT interventions, but their robustness relies on a metacognitive mechanism termed "doubt." Paraphrasing suppresses doubt and thereby degrades performance.
## Method
### Overall Architecture
600 math problems that all evaluated models solve correctly are selected from NuminaMath. Each CoT is segmented into paragraphs, and an intervention is applied at a specified timestep \(t \in \{0.1, 0.3, 0.5, 0.7, 0.9\}\), the fraction of the CoT preceding the intervention point. All subsequent content is removed, and the original model continues reasoning from the intervention point. Eight independent completions are sampled per instance, and accuracy is recorded.
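A minimal sketch of this pipeline, with decoding and answer extraction abstracted as callables; all names below are illustrative assumptions rather than the paper's code (none is released).

```python
# Sketch of the intervention-and-recovery evaluation loop (hypothetical names).

TIMESTEPS = [0.1, 0.3, 0.5, 0.7, 0.9]  # fraction of the CoT before the cut
NUM_SAMPLES = 8                        # independent continuations per instance

def apply_intervention(cot: str, t: float, intervene) -> str:
    """Truncate the CoT at fraction t and splice in the intervention text."""
    steps = [p for p in cot.split("\n\n") if p.strip()]  # paragraph segmentation
    cut = max(1, int(len(steps) * t))                    # intervention timestep
    perturbed = steps[:cut] + [intervene(steps[:cut])]   # intervention sees prefix
    return "\n\n".join(perturbed)  # content after the intervention is discarded

def count_correct(generate, extract_answer, problem: str, cot: str,
                  t: float, intervene, gold: str) -> int:
    """Number K of the 8 resampled continuations that reach the gold answer."""
    prefix = apply_intervention(cot, t, intervene)
    completions = [generate(problem, prefix) for _ in range(NUM_SAMPLES)]
    return sum(extract_answer(c) == gold for c in completions)
```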
### Key Designs
- Seven Interventions (Three Categories):
  - Benign: (a) prepend one correct reasoning step generated by another model; (b) paraphrase the entire CoT while preserving its semantics.
  - Neutral: (c) insert random garbled characters at the current step; (d) replace the current step with an irrelevant Wikipedia passage.
  - Adversarial: (e) insert an incorrect reasoning continuation; (f) inject a fabricated mathematical fact; (g) replace the CoT prefix with one from an unrelated topic.
  - Four of the seven interventions are context-aware (generated by Qwen-2.5-32B-Instruct); the other three are context-agnostic.
- Sampling Robustness Metrics (a sketch of their computation follows this list):
  - Function: quantify robustness at different levels of strictness.
  - Three tiers over the number of correct completions \(K\) out of 8 samples: at-least-once-robust (\(K \geq 1\)), majority-robust (\(K \geq 5\)), and all-robust (\(K = 8\)).
  - Majority-robust is the primary metric: it discounts accidental correctness (unlike \(K \geq 1\)) without demanding perfect consistency (unlike \(K = 8\)).
- Doubt Analysis (see the classifier sketch after this list):
  - Function: quantify the frequency of self-doubt expressions (e.g., "Wait", "Let me check") in the CoT.
  - Mechanism: an LLM classifier assigns a binary doubt/non-doubt label to each of the 20 sentences following an intervention; the resulting doubt rate is compared against an unintervened baseline rate of 0.153.
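The three sampling-robustness tiers then reduce to simple thresholds on \(K\); a minimal sketch:

```python
def robustness_tiers(k: int, n: int = 8) -> dict:
    """Map the per-instance count K of correct completions to the three tiers."""
    return {
        "at_least_once_robust": k >= 1,       # K >= 1: lenient
        "majority_robust": k >= n // 2 + 1,   # K >= 5 of 8: primary metric
        "all_robust": k == n,                 # K == 8: strict
    }

# Dataset-level scores are the means of these flags over all instances, e.g.:
# majority = sum(robustness_tiers(k)["majority_robust"] for k in ks) / len(ks)
```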
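For the doubt analysis, a sketch of the post-intervention measurement; the prompt wording and the `classify` callable are my assumptions, since the summary only specifies binary doubt/non-doubt labels over the 20 sentences after an intervention:

```python
DOUBT_PROMPT = (  # hypothetical wording; the paper's exact prompt is not given
    "Does the following sentence express self-doubt or re-checking "
    "(e.g., 'Wait', 'Let me check')? Answer 'doubt' or 'non-doubt'.\n\nSentence: {s}"
)

def doubt_rate(post_intervention_sentences: list, classify) -> float:
    """Fraction of the first 20 post-intervention sentences labeled 'doubt'."""
    window = post_intervention_sentences[:20]
    labels = [classify(DOUBT_PROMPT.format(s=s)) for s in window]
    return sum(lbl == "doubt" for lbl in labels) / len(window)

# Compared against the unintervened baseline doubt rate of 0.153.
```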
### Experimental Scale
9 open-source reasoning models × 600 math problems × 7 interventions × 5 timesteps × 8 samples ≈ 1.51 million reasoning chains (math). Including Science (231 problems) and Logic (326 problems), the total reaches 2.92 million chains.
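The scale arithmetic, as a quick sanity check:

```python
math_chains = 9 * 600 * 7 * 5 * 8               # = 1,512,000 (~1.51M)
sci_logic_chains = 9 * (231 + 326) * 7 * 5 * 8  # = 1,403,640
total_chains = math_chains + sci_logic_chains   # = 2,915,640 (~2.92M)
```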
## Key Experimental Results
### Main Results: Majority Robustness (Math)
| Finding | Details |
|---|---|
| Generally robust | Except for the smallest model, all models achieve majority robustness close to 1.0 across nearly all interventions |
| Scale effect | R1-Distill-Qwen-1.5B shows the weakest robustness; 32B models are the strongest |
| Timestep effect | Earlier interventions (\(t=0.1\)) have a larger impact |
| Sole exception | Paraphrasing is the only intervention that consistently degrades performance across all models |
### CoT Length Inflation (% Change vs. Original CoT)
| Model | Benign: Paraphrase | Neutral: Wikipedia Passage | Neutral: Garbled Chars | Adversarial: Wrong Continuation |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | -37% | +665% | +111% | +32% |
| R1-Distill-Qwen-7B | -60% | +124% | +34% | +9% |
| R1-Distill-Qwen-14B | -62% | +54% | +6% | +10% |
| QwQ-32B | -44% | +167% | +6% | +16% |
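The percentages are presumably relative length changes of the post-intervention trace versus the original CoT (my reading; the summary does not state the formula explicitly):

\[
\text{inflation}(\%) = \frac{\lvert \text{CoT}_{\text{recovered}} \rvert - \lvert \text{CoT}_{\text{original}} \rvert}{\lvert \text{CoT}_{\text{original}} \rvert} \times 100
\]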
### Key Findings
- Doubt is the core recovery mechanism: Doubt expressions increase significantly after interventions, with adversarial interventions triggering the strongest doubt signals. Successfully recovered traces exhibit slightly higher doubt rates than failed ones, indicating that doubt facilitates but does not guarantee recovery.
- The critical problem with paraphrasing: Paraphrasing reduces the doubt rate from the baseline of 0.153 to 0.068–0.076, causing models to adopt a more "confident" yet more error-prone reasoning style. At \(t=0.1\), paraphrasing shortens CoT length by 59–61% while reducing accuracy.
- Cross-domain consistency: Recovery patterns are largely consistent across Math, Science, and Logic domains.
- Smaller models are substantially more fragile: The 1.5B model exhibits 665% CoT inflation under neutral interventions, compared to 54–167% for larger models.
## Highlights & Insights
- Doubt as metacognition: This is the first work to systematically quantify the functional role of self-doubt expressions (e.g., "Wait / Let me check") in reasoning LLMs. These expressions are not redundant outputs but constitute an active recovery mechanism — a form of metacognitive capability that emerges from RLVR training. This finding carries important implications for understanding emergent behaviors induced by RLVR training.
- Absence of style invariance: Paraphrasing preserves semantics but alters style, which is sufficient to degrade performance. This reveals a deeper issue: the robustness of current reasoning LLMs partly depends on specific linguistic styles (hedging, self-questioning) rather than purely on logical reasoning capability.
- Practical implications — risks of tool output injection: In agentic systems, intermediate results returned by tools are injected into the CoT. This work quantifies the cost of such injections (CoT inflation of up to +665%, i.e., substantial extra inference compute), providing empirical guidance for optimizing inference efficiency.
## Limitations & Future Work
- Only open-source models are evaluated; closed-source reasoning models such as o1/o3 are not included.
- The 600 math problems are restricted to those all models answer correctly, which may overestimate robustness — recovery ability may be substantially weaker on harder problems.
- Interventions are applied only at a single step; real-world scenarios may involve multiple consecutive perturbations.
- The paper does not explore how to improve style invariance through training (e.g., incorporating paraphrased traces into RLVR training).
## Related Work & Insights
- vs. BIG-Bench Mistake: That benchmark measures error localization ability; this work extends the scope to measure error recovery ability. Reasoning LLMs significantly outperform conventional LLMs on error localization (GPT-4 achieves only 17–62% across tasks, while reasoning models reach 66–94%).
- vs. Yang et al. (2025): Their work injects misleading content at the beginning of the CoT; this paper intervenes at arbitrary timesteps using the model's own CoT, more closely reflecting real-world conditions.
- Implications for training: RLVR training should preserve doubt expressions, improve style robustness, and develop efficient recovery strategies to control token overhead.
## Rating
- Novelty: ⭐⭐⭐⭐ — First systematic robustness benchmark for reasoning LLM CoT; the discovery of the doubt mechanism is a significant insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 9 models × 7 interventions × 3 domains, totaling 2.92 million reasoning chains; impressively large scale.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, detailed experimental reporting, and rich figures.
- Value: ⭐⭐⭐⭐ — Provides systematic empirical evidence on reasoning LLM robustness, with practical guidance for deployment safety and training improvement.