Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Conference: ICML2026
arXiv: 2504.09762
Code: None (Position paper, no code)
Area: Interpretability (LLM Reasoning · Chain-of-Thought · AI Safety)
Keywords: Intermediate tokens, Chain-of-Thought, Anthropomorphism, Interpretability, RLVR

TL;DR

This is a position paper from Kambhampati’s team. The core claim: labeling the "intermediate tokens" that reasoning models (such as DeepSeek-R1) generate before giving an answer as "reasoning traces" or "thinking traces" is a dangerous form of anthropomorphism. It (1) is wishful thinking, (2) largely lacks empirical support, (3) creates false trust in models, and (4) pushes the community toward meaningless research directions. The authors draw on a series of experimental findings (A* maze trace swapping, decoupling trace length from problem complexity, human trust experiments) to argue that trace semantics and final answer correctness are fundamentally decoupled. They call on the community to stop assigning "user-facing interpretability" to intermediate tokens; trust should instead derive from verification of the solution itself.

Background & Motivation

Background: Intermediate token generation (ITG), where a model generates a long sequence of tokens before outputting the final answer, has become a standard method for improving LLM reasoning performance. R1-style models use an external verifier (a rule-based reward model) during post-training to score "intermediate tokens + answer" trajectories, rewarding only the correctness of the final answer while applying no direct pressure on the intermediate tokens (i.e., RLVR, Reinforcement Learning with Verifiable Rewards).

Limitations of Prior Work: While it is undisputed that intermediate tokens do improve performance, the community generally refers to them as "Chain-of-Thought" or "reasoning traces." This assumes by default that: ① they are interpretable to users and provide a window into what the model is "thinking"; ② their correctness/interpretability is strongly or even causally correlated with the final answer; and ③ their length reflects "thinking effort" or "problem difficulty." The R1 paper even treats the appearance of "aha" in the trace as a sign of the model's "epiphany."

Key Challenge: Almost none of these assumptions is backed by rigorous evidence. Intermediate tokens might look like human rough drafts (interspersed with "hmm...", "wait a minute", "aha!"), but "looking similar" does not mean "serving the same purpose," and still less that they can be used as a window into model cognition. Taking this anthropomorphic narrative seriously encourages research that is superficially plausible but fundamentally flawed (e.g., pursuing trace readability, measuring difficulty by trace length, or claiming Transformers have learned algorithms superior to A*).

Goal: (1) Clarify what "intermediate tokens" actually are and distinguish them from post-hoc justifications or tool calls; (2) Enumerate the specific harms of anthropomorphism; (3) Use formally verifiable experiments to prove the decoupling of trace semantics from answer correctness and trace length from problem complexity; (4) Provide a call to action.

Core Idea: Replace "Chain-of-Thought/reasoning trace" with the neutral term derivational trace, and use "whether the prompt + intermediate tokens lead to the answer in any logically interpretable way" as the litmus test. Extensive evidence suggests the answer is no; thus, intermediate tokens should not be treated as user-facing, trustworthy explanations.

Method

As a position paper, there is no technical pipeline. The argumentative framework is "Define neutral terms → List harms → Present empirical evidence → Issue call to action."

Overall Architecture

The argument proceeds in four steps: ① Dismantle linguistic anthropomorphism by using "derivational trace" to refer to the "unfiltered tokens before the answer," explicitly excluding two categories: post-hoc justifications/summaries (e.g., o1 hiding the real tokens and providing a sanitized summary) and tool calls/returns in agentic scenarios (which must have external semantics). ② List the "unhealthy" research consequences that follow from anthropomorphic narratives. ③ Use formally verifiable experiments (A* maze trace swapping, QA sub-question decomposition, noisy trace distillation) to show that trace semantic correctness is decoupled from answer correctness, followed by an MDP analysis showing that trace length is mechanistically elongated by RL and largely unrelated to problem complexity. ④ Use human subject experiments to show that displaying traces amplifies false trust rather than reducing it. Finally, issue a call to action and propose the alternative explanation: "intermediate tokens = learned prompt augmentation." The litmus test throughout the paper is whether the trace leads to the answer in a logically interpretable way, rather than merely altering the conditional distribution of the next token.

Key Designs

1. De-anthropomorphized Definition + List of Harms

The authors first perform terminology surgery, replacing "Chain-of-Thought/reasoning trace" with the neutral derivational trace. They draw the boundaries among three types of tokens and focus only on "unfiltered intermediate tokens generated before the answer that require no semantics outside the LLM." This excludes post-hoc justifications (sanitized summaries, as in o1) and tool calls in agentic scenarios (which are commitments to an external environment, must have external semantics, and are often irreversible in non-ergodic environments). On this basis they list the harms of anthropomorphism: treating traces as "interpretable" leads to work that merely "makes traces look like pseudo-English" (R1 even added training to remove mixed Chinese-English output); assuming a strong correlation between trace and answer correctness led to surprise when industrial research found "answers often inconsistent with their traces"; trace length gets treated as a measure of difficulty or effort; and Searchformer's Transformer is even claimed to be "superior to A*" because its trace is shorter (whereas A* is provably optimal for graph search).

2. Decoupling Trace Semantics from Answer Correctness (Core Verifiable Evidence)

This is the strongest empirical pillar. The authors train Transformers on A* maze pathfinding, using A* search traces (open/closed list operations) plus solution paths. Because the training traces come from A*, the traces the model emits at inference time can be formally checked to see whether they actually lead to the solution. Findings: trace validity and answer correctness are only loosely correlated, especially on out-of-distribution problems. More strikingly, under a causal intervention that swaps traces between different instances (making them semantically entirely wrong), performance was maintained or even improved. Training on all-correct traces and training on all-swapped traces both yielded high test accuracy; performance dropped only when the two were mixed. Further GRPO post-training experiments showed that while post-training improves answer accuracy, it does not improve trace semantic validity (even when trained on irrelevant traces). In the QA domain, SFT with incorrect intermediate traces paired with correct answers actually outperformed settings with correct traces, showing a high rate of "correct answer but incorrect trace" false positives. Noise distillation and truncated A* traces point the same way: what is useful to the model is not the semantics, but the consistent patterns in the training data that the model fits.
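The swap intervention can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Example` record, its string fields, and the `swap_traces` helper are hypothetical names, and the cyclic reassignment is just one simple way to guarantee that no instance keeps its own trace.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    prompt: str  # maze description (hypothetical serialization)
    trace: str   # search trace emitted before the answer
    plan: str    # solution path, i.e. the final answer


def swap_traces(examples: List[Example], seed: int = 0) -> List[Example]:
    """Give every example the trace of a different example.

    Prompts and plans stay paired as before; only the trace is replaced,
    so each trace is well-formed but belongs to another problem.
    """
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to swap traces")
    order = list(range(n))
    random.Random(seed).shuffle(order)
    # Walk a random ordering cyclically: the example at position k takes the
    # trace of the example at position k + 1, so no index maps to itself.
    donor = {order[k]: order[(k + 1) % n] for k in range(n)}
    return [
        Example(ex.prompt, examples[donor[i]].trace, ex.plan)
        for i, ex in enumerate(examples)
    ]


data = [Example(f"maze-{i}", f"trace-{i}", f"plan-{i}") for i in range(5)]
swapped = swap_traces(data, seed=1)
```

Training one model on `data` and another on `swapped`, then comparing answer accuracy and trace validity, mirrors the comparison described above at toy scale.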

3. Trace Length \(\neq\) Problem Complexity / Thinking Effort

Regarding the "length = effort" anthropomorphism, the authors offer two rebuttals. Empirical evidence: across maze difficulties, trace length looks like "task-adaptive computation" in-distribution, but this correlation collapses out-of-distribution. Models trained on complex mazes output extremely long traces even for trivial "no-maze" instances (start and end points in an empty field, where the A* computation is essentially zero), sometimes exhausting the context window. Theoretical analysis: Samineni et al. analyzed the MDP formulation of R1. Under the structural assumptions that "state = token sequence" and "terminal rewards are distributed uniformly across intermediate tokens," RL is mechanistically incentivized to generate longer intermediate sequences, and this elongation is then misread as "learning stronger reasoning." The root cause: averaging the final reward across every intermediate token bypasses the credit assignment RL should perform, causing RL to degenerate into "filtered on-policy supervised fine-tuning." Thus, trace elongation is a product of the reward structure and has only a weak relationship with the computational complexity of solving the problem from scratch.
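A toy calculation makes the uniform-credit point concrete. The sketch below is an illustration under stated assumptions, not the derivation in Samineni et al.: it assumes a group-normalized outcome advantage and a per-trajectory loss averaged over length, in the style of the commonly published GRPO objective. The function names and the `length_normalize` flag are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """One scalar advantage per sampled trajectory, normalized within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        # All samples scored the same: no learning signal for this prompt.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def per_token_weights(
    rewards: List[float], lengths: List[int], length_normalize: bool = True
) -> List[List[float]]:
    """Weight applied to each token's log-probability in the loss.

    The terminal reward is broadcast to every token of its trajectory, so all
    tokens in one trajectory share a single weight (no per-token credit).
    """
    weights = []
    for adv, n in zip(group_advantages(rewards), lengths):
        w = adv / n if length_normalize else adv
        weights.append([w] * n)
    return weights


# Four samples for one prompt: two correct, two wrong, with different lengths.
weights = per_token_weights(rewards=[1, 0, 0, 1], lengths=[4, 4, 16, 8])
for w in weights:
    print(len(w), w[0])
```

Two things are visible in the output. Every token in a trajectory receives the same weight, which is why the update resembles supervised fine-tuning on samples filtered by outcome. Under the length-averaging assumption, the wrong 16-token sample is penalized less per token than the wrong 4-token sample, which is one way a length bias can arise without any change in reasoning ability.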

4. Evidence of False Trust + Call to Action

The authors address the harms at the user level: in a setting where users cannot independently verify the answer, human subject experiments found that displaying traces or their summaries indiscriminately increased user trust in the model's prediction, regardless of whether the answer was correct. Users became more likely to accept incorrect answers, creating false trust. This is where the danger lies: powerful AI might use "stylistically plausible pseudo-reasoning traces" to exploit cognitive weaknesses and persuade users to believe wrong answers. The call to action follows: stop using human interpretation of intermediate tokens as a proxy for answer trustworthiness; trust should come from the verification of the answer itself (via users, third parties, or task-specific verifiers like LLM-Modulo). Since intermediate tokens primarily assist the LLM rather than the user, stop constraining them to pseudo-English formats for "user comfort." The authors propose an alternative explanation—viewing intermediate tokens as "learned prompt augmentation": for task \(T\), there exists an augmentation \(\mathrm{PA}\) such that \(\mathbb{P}(\mathrm{Sol}(\mathrm{LLM}(T+\mathrm{PA}),T))>\mathbb{P}(\mathrm{Sol}(\mathrm{LLM}(T),T))\), where the challenge is learning \(\mathrm{PA}=f_\theta(T,\mathrm{LLM})\). This augmentation need not be human-interpretable (much like gibberish strings used for adversarial jailbreaks). Intermediate tokens are simply "scaffolding" the model uses to pull the current instance closer to its training distribution.
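The "verify the answer, not the trace" recommendation can be sketched as a generate-and-test loop. This is a minimal sketch on a toy grid domain, not the LLM-Modulo implementation: `plan_is_valid`, `first_verified`, the grid encoding (`#` for walls), and the hand-written candidates are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]


def plan_is_valid(grid: Sequence[str], start: Cell, goal: Cell, plan: List[Cell]) -> bool:
    """Sound check of a path: right endpoints, free cells only, unit steps."""
    if not plan or plan[0] != start or plan[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in plan:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(plan, plan[1:])
    )


def first_verified(
    candidates: Iterable[Tuple[str, List[Cell]]],
    verify: Callable[[List[Cell]], bool],
) -> Optional[List[Cell]]:
    """Return the first answer that passes verification.

    The accompanying trace is deliberately never inspected: acceptance
    depends only on the verifier's verdict on the answer.
    """
    for _trace, answer in candidates:
        if verify(answer):
            return answer
    return None


grid = ["..", "#."]
start, goal = (0, 0), (1, 1)
candidates = [
    ("Wait... aha! Go down, then right.", [(0, 0), (1, 0), (1, 1)]),  # walks into a wall
    ("zq#7 kk", [(0, 0), (0, 1), (1, 1)]),  # unreadable trace, valid plan
]
accepted = first_verified(candidates, lambda p: plan_is_valid(grid, start, goal, p))
print(accepted)
```

The persuasive-sounding trace accompanies an invalid plan and is rejected, while the unreadable one accompanies a valid plan and is accepted, which is the separation of trust from trace style that the call to action asks for.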

One Example: DeepSeek-R1's "aha moment"

The R1 paper treats the appearance of the "aha" token in the trace as a landmark behavior of the model "suddenly realizing" a solution. The authors use this to puncture anthropomorphism: when a human says "aha," it marks a sudden change in internal state, but the model has no such internal state. Between the time before and after "aha," the only difference in the next forward pass is the addition of the "aha" token to the context. Interpreting "aha" as a meaningful epiphany is a classic example of the unexamined assumption that "derivational traces have semantics and are analogous to human reasoning." This example highlights the chasm between "traces looking like human thought" and "traces actually lacking corresponding internal computation."

Key Experimental Results

The "experiments" in this position paper are a series of falsification studies by the authors and collaborators.

Main Results: A* Maze Trace Swapping Findings

| Training Trace Type | Test Answer Accuracy | Trace Semantic Validity |
| --- | --- | --- |
| All Correct Traces | High | Relatively High |
| All Swapped Traces | Remains High | Very Low |
| Mixed Correct/Swapped | Decreased | Low |
| GRPO Post-training (w/ irrelevant traces) | Further Gain | No Gain |

Conclusion: Answer accuracy is decoupled from trace semantic correctness; what matters to the model is the consistency of patterns, not trace semantics.

Ablation Study: Anthropomorphic Hypotheses vs. Empirical Refutations

| Anthropomorphic Hypothesis | Empirical Refutation |
| --- | --- |
| Trace Correctness \(\implies\) Answer Correctness | Swapped traces yield high accuracy; QA shows many "correct answer, wrong trace" false positives. |
| Trace Length = Difficulty / Effort | "No-maze" instances produce ultra-long traces; correlation collapses out-of-distribution. |
| RL Elongating Trace = Stronger Reasoning | MDP Analysis: Uniform reward distribution mechanistically incentivizes length (≈ filtered SFT). |
| Displaying Traces Increases Transparency | Human Experiments: Traces indiscriminately increase trust, amplifying acceptance of incorrect answers. |

Key Findings

Most counter-intuitive point: Training with semantically entirely incorrect (swapped) traces does not decrease answer accuracy; it only drops when "correct and incorrect" patterns are mixed—indicating the model extracts "fittable consistent patterns" rather than reasoning semantics.
Inverse relationship between utility and interpretability: Another human subject study showed that the verbose R1-style traces that help model performance the most score lowest on user interpretability, faithfulness, and predictability. Conversely, verifiable traces that users find most understandable do not provide the same accuracy gains—there is a strong decoupling between "useful to model" \(\leftrightarrow\) "understandable to user."
Irony of the status quo: Leading vendors like OpenAI, Google, and Anthropic already do not show real intermediate tokens (citing proprietary reasons while still billing users), effectively following this paper's stance. In contrast, the research community continues to hope that intermediate tokens can provide an interpretability window.

Highlights & Insights

The Trace Swapping Experiment is a "Killer" Argument: By using a formally verifiable domain like A* mazes and treating "semantic correctness" as a controllable variable, they directly prove that "semantic incorrectness can still yield gains." This is more powerful than any qualitative analysis—this paradigm of "verifiable domain + causal intervention" can be migrated to any study testing whether CoT truly has semantics.
MDP Perspective on Trace Elongation: Re-interpreting "RL post-training making traces longer" as a "mechanistic byproduct of uniform reward distribution" instead of "learning to reason" punctures a widely misinterpreted phenomenon and provides deep insight into the nature of RLVR.
Prompt Augmentation Alternative: Viewing intermediate tokens as "prompt scaffolding" that needs no human interpretability, and analogizing them to adversarial jailbreak strings, provides a clean framework for why non-semantic traces work. This naturally supports directions using non-linguistic tokens or continuous latent space tokens as intermediate steps.

Limitations & Future Work

Acknowledged Limitations: The stance is that intermediate tokens do not necessarily have interpretable semantics, not that they "never can." Accidental interpretability might exist but is unreliable. This position requires more evidence, as it currently relies primarily on the authors' own series of experiments.
Limitations Noted in This Review: Core evidence is concentrated in formally verifiable or restricted domains like A* mazes and QA. Extrapolating to general LRMs (where R1's free-form traces cannot be formally verified) is an external generalization. The human trust experiment setting (users unable to verify) is realistic, but the boundaries of the conclusion require caution. "Prompt augmentation" remains a speculative alternative explanation.
Future Directions: The authors suggest shifting trust to answer verification (LLM-Modulo / task-specific verifiers). They also note that agentic scenarios must distinguish between "internal intermediate tokens" and "tool calls" (the latter being commitments to the outside world that must have semantics). Their team's LLM-Process-Modulo is a direction for controlling runtime behavior.

vs. DeepSeek-R1 ("aha moment" / Length = Reasoning): The R1 paper promotes the idea that "traces reflect thinking" and "longer traces = learning to reason." This paper directly refutes this—"aha" has no internal state, and length is mechanistically driven by reward structures.
vs. Searchformer / Dualformer (Short traces are better / Truncating traces): Searchformer claims that because its trace is shorter than A*'s, it is "superior." This paper points out that A* is provably optimal, making that claim groundless. Dualformer showed that truncating A* traces (destroying their semantics) still yielded gains, which this paper uses as evidence for "semantic irrelevance."
vs. CoT Monitoring / Interpretability Advocacy (Korbak et al.): Some call for "protecting the fragile monitorability of CoT" for safety purposes. This paper argues such a link might be an illusion; in safety-critical domains, one should not rely on intermediate tokens for monitoring. Third-party verification of the answer is the correct path.

Rating

Novelty: ⭐⭐⭐⭐ Uses causal intervention (trace swapping) in verifiable domains to turn "CoT semantics" into a falsifiable proposition; the argument is sharp.
Experimental Thoroughness: ⭐⭐⭐⭐ Aggregates evidence from A* mazes, QA, noise distillation, MDP analysis, and human trust, though mostly in restricted domains.
Writing Quality: ⭐⭐⭐⭐ Terminology surgery is clear; the list of harms and evidence progresses logically; the "aha" example is a highlight.
Value: ⭐⭐⭐⭐⭐ Directly challenges the mainstream "Chain-of-Thought is interpretable" narrative, offering corrective insights for RLVR, AI safety monitoring, and interpretability research.