Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective
Conference: ACL 2026 Findings
arXiv: 2601.03154
Code: None
Area: LLM Reasoning
Keywords: Chain-of-Thought, Human Label Variation, Distribution Alignment, Reasoning Decoupling, Model Prior
TL;DR
Using Cross-CoT experiments and step-wise analysis, this paper shows that CoT reasoning has two decoupled effects: final accuracy is determined almost entirely by CoT content (~99% of variance), while distribution alignment and ranking are dominated by the model's intrinsic prior (>80% of variance). Long CoT therefore acts as a strong decision-maker but a weak distribution calibrator.
Background & Motivation
Background: Reasoning-enhanced LLMs (e.g., DeepSeek-R1, Qwen3) excel in single-answer tasks through long CoT. However, many real-world tasks are inherently ambiguous, exhibiting reasonable Human Label Variation (HLV) among annotators, which requires models to predict probability distributions rather than a single answer.
Limitations of Prior Work: Prior work leaves three points unresolved: (1) whether CoT reasoning helps models approximate human label distributions; (2) if it does, whether the gain comes from the CoT content itself or from the model's intrinsic parametric knowledge; (3) whether CoT suppresses valid alternative interpretations in favor of top-1 selection.
Key Challenge: CoT reasoning is designed to narrow uncertainty step by step and arrive at a high-confidence conclusion, which conflicts with the HLV requirement to preserve probabilistic ambiguity.
Goal: Systematically decouple the "content effect" and the "model prior effect" of CoT reasoning on output distributions.
Key Insight: Two probes make the decoupling observable. Cross-CoT experiments inject the CoT from one model into another to test whether reasoning transfers; step-wise analysis truncates the CoT to observe how its effects evolve across reasoning steps.
Core Idea: CoT concentrates probability mass on the most likely answer (locking the top-1), but it does not fine-tune how probability is allocated among the non-top-1 options; that allocation is determined by the model prior.
Method
Overall Architecture
Reasoning LLMs are evaluated on the ChaosNLI benchmark, in which each example carries a label distribution collected from 100 annotators. Three complementary metrics are used: Accuracy (top-1 correctness), JSD (distribution alignment), and Spearman ρ (ranking alignment). Two decoupling experiments, Cross-CoT and step-wise analysis, then reveal the mechanism of CoT.
Key Designs
1. Cross-CoT Experiment: Separating "CoT Content" and "Model Prior"
When CoT improves distribution alignment, is the reasoning chain itself responsible, or the model's parametric knowledge? The two are entangled in standard settings. Cross-CoT separates them by injecting the CoT generated by Model A into Model B, so that B produces its distribution from another model's reasoning chain; ANOVA then decomposes the variance of each metric by source. If a metric's variance is mainly explained by the CoT source, the metric is content-driven (CoT as a general reasoner); if it is dominated by the identity of the model producing the distribution, the metric is prior-driven. Results show that ~99% of accuracy variance comes from CoT content, while >80% of JSD and Spearman ρ variance comes from the model prior.
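The variance attribution described above can be illustrated with a minimal sketch. It assumes a simple two-way sum-of-squares decomposition over a grid of metric values (rows are CoT sources, columns are the models that read the CoT) with one value per cell; the paper's exact ANOVA setup is not specified in this note, and the function name `variance_shares` and the toy numbers are hypothetical.

```python
def variance_shares(grid):
    """Split the variance of a metric into CoT-source and model-prior shares.

    grid[i][j] is the metric obtained when the CoT written by source i
    is injected into target model j (one value per cell, an assumption).
    """
    n_src, n_tgt = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_src * n_tgt)

    # Marginal means: per CoT source (rows) and per target model (columns).
    src_means = [sum(row) / n_tgt for row in grid]
    tgt_means = [sum(grid[i][j] for i in range(n_src)) / n_src
                 for j in range(n_tgt)]

    ss_src = n_tgt * sum((m - grand) ** 2 for m in src_means)
    ss_tgt = n_src * sum((m - grand) ** 2 for m in tgt_means)
    ss_total = sum((x - grand) ** 2 for row in grid for x in row)
    if ss_total == 0:
        raise ValueError("metric is constant; shares are undefined")

    # With one value per cell, interaction and noise share the residual.
    ss_resid = ss_total - ss_src - ss_tgt
    return {
        "cot_source": ss_src / ss_total,
        "model_prior": ss_tgt / ss_total,
        "residual": ss_resid / ss_total,
    }


# Toy grid invented for illustration (not results from the paper):
# values change a lot across rows (CoT source), barely across columns.
toy_accuracy = [
    [0.60, 0.61, 0.60],
    [0.70, 0.70, 0.71],
    [0.80, 0.79, 0.80],
]
shares = variance_shares(toy_accuracy)
print({name: round(value, 3) for name, value in shares.items()})
```

A metric whose `cot_source` share is close to 1 would be read as content-driven, and one whose `model_prior` share dominates as prior-driven.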
2. Step-wise Analysis: Observing the Evolution of CoT Effects
If accuracy and distribution structure were driven by the same mechanism, they should evolve in step. Step-wise analysis measures the output distribution with the CoT truncated at 25%, 50%, 75%, and 100% of its length. The observations support decoupling: accuracy climbs monotonically as more of the CoT is included, because the reasoning chain gradually pushes mass toward the top-1. The distribution structure, by contrast, is locked in by the prior very early, and later steps rarely reshape the probability allocation among non-top-1 options.
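The truncation step can be sketched as follows. This is only an illustration: it assumes the CoT has already been split into a list of units (steps or tokens; the note does not say which the paper uses), and the helper names `truncate_cot` and `stepwise_prefixes` are hypothetical.

```python
import math


def truncate_cot(units, fraction):
    """Return the leading `fraction` of a CoT given as a list of units."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round up so that even a short CoT keeps at least one unit.
    keep = max(1, math.ceil(len(units) * fraction))
    return units[:keep]


def stepwise_prefixes(units, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Map each truncation point to the corresponding CoT prefix."""
    return {fraction: truncate_cot(units, fraction) for fraction in fractions}


# Placeholder reasoning steps, invented for illustration.
cot = [f"step {i}" for i in range(1, 9)]
for fraction, prefix in stepwise_prefixes(cot).items():
    print(fraction, len(prefix))
```

Each prefix would then be fed back to the model to read off a label distribution, which can be scored against the human distribution at every truncation point.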
3. Complementary Multi-metric Evaluation
Using only accuracy misses the impact of CoT on distribution structure (or its inability to affect it). This paper employs three metrics: Accuracy for top-1 correctness, JSD for overall alignment with human collective opinions, and Spearman ρ for ranking alignment invariant to monotonic transformations. Together, they expose a key contrast: CoT can correctly select the top-1 (dominating accuracy) but cannot alter the probabilistic landscape beyond the top-1 (JSD/ρ dominated by prior), providing empirical evidence for the "strong decision-maker, weak calibrator" conclusion.
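To make the contrast between the three metrics concrete, here is a small pure-Python sketch. It assumes base-2 Jensen-Shannon divergence (the note does not say whether the paper reports the divergence or its square root) and Spearman ρ computed as the Pearson correlation of average ranks; the label order and the toy distributions are invented for illustration.

```python
import math


def top1_correct(pred, human):
    """True when the predicted argmax matches the human majority label."""
    def argmax(dist):
        return max(range(len(dist)), key=dist.__getitem__)
    return argmax(pred) == argmax(human)


def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, so values lie in [0, 1])."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(p, q):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rp, rq = average_ranks(p), average_ranks(q)
    mean_p, mean_q = sum(rp) / len(rp), sum(rq) / len(rq)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(rp, rq))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_q = sum((b - mean_q) ** 2 for b in rq)
    if var_p == 0 or var_q == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / math.sqrt(var_p * var_q)


# Assumed label order: (entailment, neutral, contradiction). Toy values only.
human = [0.50, 0.40, 0.10]
overconfident = [0.90, 0.02, 0.08]  # right top-1, wrong order among the rest
calibrated = [0.55, 0.35, 0.10]     # right top-1, same ordering as humans

for name, pred in (("overconfident", overconfident), ("calibrated", calibrated)):
    print(name, top1_correct(pred, human), round(jsd(pred, human), 4), spearman(pred, human))
```

Both toy predictions count as correct under accuracy, yet they differ in JSD and in rank agreement, which is the kind of gap that accuracy alone cannot reveal.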
Loss & Training
This is an analytical study and does not involve model training. Seven SOTA reasoning LLMs (Qwen3, DeepSeek-R1, etc.) are evaluated across three subsets of ChaosNLI.
Key Experimental Results
Main Results
Impact of CoT Reasoning on Distribution Metrics (MNLI)
| Model | ACC (w/o CoT) | ACC (w/ CoT) | JSD (w/o CoT) | JSD (w/ CoT) |
|---|---|---|---|---|
| Qwen3 | 0.688 | 0.644 | 0.093 | 0.080↓ |
| R1-Llama | 0.666 | 0.689 | 0.082 | 0.077↓ |
| R1-Qwen | 0.734 | 0.672 | 0.080 | 0.072↓ |
Ablation Study
Cross-CoT ANOVA Variance Contribution Analysis
| Metric | CoT Content Contribution | Model Prior Contribution |
|---|---|---|
| Accuracy | ~99% | ~1% |
| JSD (Distribution) | ~20% | >80% |
| Spearman ρ (Ranking) | ~15% | >80% |
Key Findings
- CoT lowers JSD for all three models in the MNLI table, but the gain does not carry over to every metric: accuracy drops for Qwen3 and R1-Qwen and rises only for R1-Llama.
- Accuracy is almost entirely determined by CoT content (~99% of variance), which makes CoT a powerful top-1 decision-maker.
- Distribution ranking and probability allocation are dominated by the model prior (>80% of variance); CoT fails to reshape the non-top-1 probability landscape.
- Step-wise analysis shows accuracy grows monotonically with CoT steps, but distribution structure is determined early by the prior.
- CoT tends to progressively concentrate probability mass to lock the most likely answer but cannot finely calibrate alternative options.
Highlights & Insights
- The "strong decision-maker, weak distribution calibrator" finding exposes a structural limitation of CoT.
- The Cross-CoT design is ingenious: injecting external CoT cleanly separates content effects from prior effects.
- The HLV analysis has broad implications: in ambiguous tasks such as those in medical or legal domains, CoT may oversimplify uncertainty.
Limitations & Future Work
- Validated only on NLI tasks (3-way classification); more complex distribution tasks remain to be explored.
- The use of first-token probability to approximate output distribution may not fully represent the model's true uncertainty.
- Does not explore designing "distribution-aware" reasoning mechanisms to improve CoT calibration.
- Injecting external CoT in Cross-CoT experiments may introduce out-of-distribution effects.
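The first-token approximation mentioned in the limitations above can be sketched as a softmax restricted to the label tokens. This is an assumption about how such an approximation is typically done, not the paper's documented procedure; the function name `first_token_distribution` and the logit values are hypothetical.

```python
import math


def first_token_distribution(label_logits):
    """Renormalize first-token logits over the candidate labels only.

    label_logits maps each label to the logit of the first token of its
    answer string (assumed to be distinct tokens across labels).
    """
    peak = max(label_logits.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {label: math.exp(logit - peak) for label, logit in label_logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical logits read at the answer position of one example.
logits = {"entailment": 2.0, "neutral": 1.0, "contradiction": -1.0}
probs = first_token_distribution(logits)
print({label: round(p, 3) for label, p in probs.items()})
```

Because all probability mass outside the label tokens is discarded, the result reflects only the relative preference among labels at one position, which is why it may not capture the model's full uncertainty.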
Related Work & Insights
- vs. Standard CoT Evaluation: Standard evaluation uses only accuracy; this work reveals distribution structure information beyond accuracy.
- vs. Confidence Calibration: Calibration studies focus on whether model confidence is accurate; this work focuses on the impact of CoT on distribution structure.
- vs. ChaosNLI: ChaosNLI provides human collective opinion distributions; this work is the first to use it to evaluate reasoning LLMs.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The Cross-CoT decoupling experiment and "strong decision-maker/weak calibrator" finding are highly insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐ 7 models, 3 datasets, ANOVA analysis, though task types are limited.
- Writing Quality: ⭐⭐⭐⭐⭐ In-depth analysis, clear logic, and precise presentation of findings.
- Value: ⭐⭐⭐⭐⭐ Significant contribution to understanding CoT reasoning mechanisms and LLM uncertainty modeling.