Skip to content

Reasoning Structure of Large Language Models

Conference: ICML2026
arXiv: 2606.03883
Code: https://github.com/ETH-DISCO/llm-reasoning-efficiency
Area: LLM Reasoning
Keywords: Reasoning Graph, Structural Entropy, Efficiency Metrics, Logic Puzzles, Process Evaluation

TL;DR

This paper converts the free-text Chain-of-Thought (CoT) of Large Reasoning Models (LRM) into a verifiable DAG consisting of "atomic claims + deductive dependencies." Based on the structural entropy of an absorbing Markov chain, the authors define a reasoning flow efficiency metric \(\eta\). They demonstrate that in intervals where accuracy and token counts are saturated or overlapping, \(\eta\) remains capable of distinguishing "focused reasoning" from "divergent exploration," serving as a fine-grained tool for diagnosing LRM failure modes.

Background & Motivation

Background: Current evaluations of LRMs focus almost exclusively on two one-dimensional metrics: final answer accuracy and generated token count. Minor works (Tree-of-Thoughts, Graph-of-Thoughts, RLVR) focus on improving the "triggering mechanism" of reasoning rather than the measurement of the reasoning process itself.

Limitations of Prior Work: Identical accuracy and token counts can mask entirely different reasoning structures: one trace might be a near-linear deduction, while another involves massive backtracking, repetitive verification, or drifting along false hypotheses before correction. These behaviors have distinct implications for RL training, failure mode diagnosis, and model selection, yet existing metrics collapse them.

Key Challenge: To conduct "process-level" evaluation, machine-verifiable intermediate states are required. However, free-text CoT is neither structured nor easily aligned with environments. Existing reasoning-graph works (Xiong et al. 2025; Minegishi et al. 2025) either lack external verification for graph nodes or rely purely on hidden state inference, making it impossible to determine the correctness of individual claims.

Goal: (1) Provide a reasoning benchmark with controllable difficulty and executable verification; (2) automatically reconstruct free-text traces into machine-verifiable DAGs; (3) define a "structural" efficiency metric that is independent of graph size and decoupled from accuracy.

Key Insight: The authors select 21 types of 2D grid puzzles from Simon Tatham as the medium—rules are fully formalized, verification is executable, and difficulty is smoothly scalable. Meanwhile, the reasoning process is modeled as "logical quality flow" on an absorbing Markov chain, using structural information theory to measure "focus vs. divergence" via entropy.

Core Idea: Represent reasoning using a DAG of "atomic claim nodes + deductive edges." After normalization, the graph is projected onto an absorbing Markov chain to calculate structural entropy \(H_{\text{str}}\). A scale-normalized efficiency score \(\eta\), falling within \([0,1]\), is derived using the minimal claim set \(C^*\). This score is decoupled from graph scale and token count, reflecting only the "concentration of logical flow relative to the minimal solution skeleton."

Method

Overall Architecture

The input is a free-text trace \(S=(s_1,s_2,\dots)\) generated by an LRM on a puzzle instance. The pipeline follows four steps: (1) Run the model in an executable puzzle environment and collect traces; (2) use a hybrid "deterministic extraction + LLM extraction" pipeline to decompose the trace into atomic claim nodes \(V\); (3) use an LLM to identify preceding claim dependencies for each claim to generate the edge set \(E\), connecting restated claims back to their first occurrence; (4) use a puzzle solver to label each verifiable claim as correct/wrong/unverifiable. This results in a DAG \(G=(V,E)\), from which a "minimal solution subgraph" \(G_{\text{sol}}\) and a "verification subgraph" \(G_{\text{ver}}\) are derived. All structural metrics and efficiency \(\eta\) are calculated on these three graphs.

Key Designs

  1. Two-stage Extraction Pipeline for Verifiable Claim Graphs:

    • Function: Converts long CoT text into a machine-verifiable DAG where nodes are atomic claims and edges are deductive dependencies.
    • Mechanism: Claim extraction utilizes a "parallel dual-track" approach—one track uses high-precision deterministic regex templates (targeting specific puzzle claim schemas) followed by LLM-based schema repair; the other track uses a high-recall LLM extraction without rule constraints. Results are merged and deduplicated within token-balanced chunks, followed by supportive validation to discard hallucinations. Rule extraction processes non-temporary claims in trace order, providing the LLM with a prefix up to the sentence supporting the claim, requiring it to return a rule application or label it as a "direct statement." If required premises are missing from the trace, a placeholder claim is inserted. Finally, all verifiable claims are hard-verified in the executable environment.
    • Design Motivation: A single extractor might bake its own bias into the graph; a hybrid approach with role separation (e.g., GPT-5.2 for claims, GPT-5-mini for rules) decouples the evaluator from the evaluated model. Ablations show \(\eta\) fluctuations of only 1.9% across extractors, with 75.5% accuracy in manual audits, proving robustness.
  2. Structural Entropy on Absorbing Markov Chains:

    • Function: Translates "graph structure" into a scalar measuring the concentration of logical flow.
    • Mechanism: Each claim node is treated as a transient state. An edge is added from all leaf nodes (out-degree 0) to a single absorbing state \(a\), forming an augmented graph \(G_{\text{abs}}\). Its row-normalized adjacency matrix is partitioned as \(P=\begin{pmatrix}Q & R\\ 0 & 1\end{pmatrix}\). Initial "logical mass" \(\pi\) is distributed uniformly across source nodes and evolves as \(\pi Q^t\), with accumulated mass \(m=\pi N\) (where \(N=\sum_t Q^t\) is the fundamental matrix). Structural entropy is defined as \(H_{\text{str}}(G)=-\sum_v \frac{m(v)}{\|m\|}\log\frac{m(v)}{\|m\|}\). Linear reasoning concentrates mass on few nodes (low entropy), while divergent reasoning spreads mass across branches and restatements (high entropy).
    • Design Motivation: Traditional graph statistics (depth, diameter, width) are strongly coupled with graph scale—difficult puzzles naturally produce larger graphs. Structural entropy reflects whether the "flow is focused."
  3. Scale-normalized Reasoning Flow Efficiency \(\eta\):

    • Function: Compresses the metric to the \([0,1]\) range while eliminating scale effects of graph size, allowing comparison across difficulties and models.
    • Mechanism: Defined as \(\eta=\dfrac{\log|V|-H_{\text{str}}(G)}{\log|V|-\log|C^*|}\). The numerator represents the reduction in actual entropy relative to maximum possible entropy; the denominator represents the maximum possible reduction provided by an ideal minimal solution skeleton (\(|C^*|\) is the minimum claims for a solution). \(\eta \approx 1\) indicates flow concentrated on the skeleton; lower \(\eta\) indicates mass spread across verification and divergent exploration.
    • Design Motivation: Correlating \(\eta\) with traditional metrics reveals that while width and \(|V|\) track puzzle difficulty, \(\eta\) remains positively correlated with accuracy while being decoupled from token count (\(r=-0.05, p=0.64\)).

Key Experimental Results

Main Results

Accuracy and average completion tokens for 21 puzzles across 4 difficulties:

Model Trivial Acc/Tok Human easy Human normal Human hard Avg. Acc / Tok
GPT-5 83.8 / 4.2k 69.5 / 10.2k 58.1 / 17.3k 5.7 / 19.9k 54.3 / 12.9k
Qwen3 235B 69.5 / 10.3k 44.8 / 19.0k 21.0 / 23.1k 0.0 / 23.6k 33.8 / 19.0k
DeepSeek V3.2 77.1 / 7.7k 53.3 / 20.6k 44.8 / 27.0k 0.0 / 36.8k 43.8 / 23.0k
Kimi K2 77.1 / 10.6k 56.2 / 29.7k 41.0 / 43.8k 1.0 / 61.3k 43.8 / 36.3k

GPT-5 is the most accurate and token-efficient across all levels. Kimi K2 uses the most tokens but does not surpass GPT-5. All models struggle significantly on "Human hard," suggesting that increasing token count alone does not solve the hardest level—a structural bottleneck in current LRMs.

Ablation Study

Pearson correlation of \(\eta\) vs. traditional graph statistics:

Metric vs. Claim Accuracy vs. \(\eta\)
Depth −0.263 +0.046
Diameter −0.329 +0.010
Avg. path length −0.182 +0.051
Width −0.618 −0.431
$ V $
Tokens −0.576 −0.120
\(\eta\) +0.368

Width and \(|V|\) correlate negatively with accuracy primarily due to puzzle difficulty. \(\eta\) is the only metric reflecting structural signals independent of difficulty.

Key Findings

  • Extra tokens flow to verification overhead: Correlation between token count and \(|V_{\text{ver}}|/|V_{\text{sol}}|\) is \(r=0.53\ (p=3\times10^{-9})\), contradicting the assumption that "Longer CoT = Better Reasoning."
  • Early errors lead to inefficiency: Deeper initial error claims correlate with higher \(\eta\) (\(r=0.28, p=0.015\)); early mistakes trigger long error-correction cycles.
  • Moderate restatements are beneficial: Restatement frequency correlates positively with \(\eta\), suggesting "restatement anchoring" is part of structured reasoning.
  • Discriminatory power in saturation zones: In trivial puzzles where all models achieve 100% accuracy, \(\eta\) still distinguishes model performance.
  • Generalization to failure traces: In failed traces, \(\eta\) drops by more than half, with graphs becoming more dispersed and errors appearing earlier.

Highlights & Insights

  • A paradigm for quantifying the "reasoning process": Using executable environments for ground-truth verification bypasses the requirement for manual PRM (Process Reward Model) labeling. The structural layer is puzzle-agnostic and transferable to math (symbolic verification) or code (unit tests).
  • Elegant scale normalization: The use of \(\log|V|-\log|C^*|\) allows direct comparison between different puzzle types and difficulties by neutralizing the effect of graph size.
  • Transferable trick: The method of adding an absorbing state to a DAG and using the fundamental matrix \(N=(I-Q)^{-1}\) to calculate mass flow can be applied to action exploration in RL or tool-use graphs in agent frameworks.
  • Training signal potential: \(\eta\) could serve as an auxiliary shaping reward for RLVR, encouraging lower structural entropy while maintaining accuracy to reduce overthinking.

Limitations & Future Work

  • Extraction pipeline noise: LLM-based extraction introduces potential self-bias. While robust in evaluations (1.9% fluctuation), it may fail on traces with weak reasoning markers.
  • Domain-specific claim/rule types: New domains require manual definition of claim schemas and rule templates.
  • Small sample size: The study uses 420 instances and focuses on three open-source models for graph analysis.
  • Gameability risk: Models might learn to "omit verification" to artificially boost \(\eta\) if used as a standalone leaderboard metric.
  • Future improvements: Expanding the structural layer to math (Lean/SymPy) and code, and using \(\eta\) for RLVR shaping rewards.
  • vs. ZebraLogic / SATBench: These benchmarks verify final answers; this work adds claim-level verifiers and structural metrics, turning the benchmark into a "reasoning microscope."
  • vs. Shojaee et al. 2025 (Illusion of Thinking): This work provides a theoretically grounded, puzzle-agnostic definition for "reasoning effort" using \(\eta\).
  • vs. Reasoning Graph works: Previous works used hidden states or task-specific DAGs without external verification. This work leverages hard verification from executable environments.
  • vs. Overthinking studies: This work quantifies observed phenomena into a single \(\eta\) metric and identifies that extra tokens are primarily consumed by verification.

Rating

  • Novelty: ⭐⭐⭐⭐ Combining absorbing Markov chain entropy with scale normalization for reasoning evaluation is a clean and unique perspective.
  • Experimental Thoroughness: ⭐⭐⭐ Wide coverage of models and puzzles, but sample size is relatively small and structural analysis is limited to open-source models.
  • Writing Quality: ⭐⭐⭐⭐ Clear definitions and effective use of figures/tables to support conclusions.
  • Value: ⭐⭐⭐⭐ A puzzle-agnostic structural metric with available code, directly applicable to RL training and failure mode diagnosis.