Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective¶
Conference: ACL 2026 | arXiv: 2601.03154 | Code: None | Area: LLM Reasoning | Keywords: Chain-of-Thought, Human Label Variation, Distribution Alignment, Reasoning Decoupling, Model Prior
TL;DR¶
Through Cross-CoT experiments and step-wise analysis, this paper reveals a "decoupling mechanism" underlying CoT reasoning: final accuracy is determined by CoT content (~99% variance contribution), whereas distributional ranking is dominated by the model's intrinsic prior (>80%). This demonstrates that long CoT is a strong decision-maker but a weak distribution calibrator.
Background & Motivation¶
State of the Field: Reasoning-augmented LLMs (e.g., DeepSeek-R1, Qwen3) achieve strong performance on single-answer tasks via long CoT. However, many real-world tasks are inherently ambiguous, where reasonable disagreement exists among human annotators (Human Label Variation, HLV), requiring models to predict probability distributions rather than single answers.
Limitations of Prior Work: Prior evaluations leave three questions open: (1) Does CoT reasoning help models better approximate human label distributions? (2) If so, is the effect driven by the CoT content itself or by the model's intrinsic parametric knowledge? (3) Does CoT inadvertently suppress valid alternative interpretations and bias outputs toward the top-1 choice?
Root Cause: CoT reasoning is designed to progressively reduce uncertainty through intermediate steps and produce high-confidence conclusions—a goal that is inherently at odds with the requirement of HLV tasks to preserve probabilistic ambiguity.
Paper Goals: To systematically decouple the effects of CoT content and the model's intrinsic prior on output distributions.
Starting Point: Cross-CoT experiments, which inject one model's CoT into another to test whether the reasoning transfers; and step-wise analysis, which truncates the CoT at various points to observe how its effects evolve across reasoning steps.
Core Idea: CoT concentrates probability mass onto the most likely answer (locking in top-1), but cannot finely regulate the probability allocation among non-top-1 options—the latter being governed by the model prior.
Method¶
Overall Architecture¶
Evaluation is conducted on the ChaosNLI benchmark (collective opinion distributions from 100 annotators) using three complementary metrics: accuracy (top-1 correctness), JSD (distributional alignment), and Spearman \(\rho\) (ranking alignment). Two decoupling experiments are employed to reveal the mechanisms underlying CoT.
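To make the three metrics concrete, below is a minimal sketch of how they can be computed for a 3-way NLI label distribution. The paper releases no code, so the function name and the toy distributions are illustrative assumptions; only the metric definitions (top-1 accuracy, Jensen-Shannon divergence, Spearman \(\rho\)) follow the paragraph above.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import spearmanr

def compare_distributions(model_dist, human_dist):
    """Compare a model's label distribution with the human (ChaosNLI-style) distribution.

    Both inputs are length-3 arrays over [entailment, neutral, contradiction]
    that sum to 1.
    """
    model_dist = np.asarray(model_dist, dtype=float)
    human_dist = np.asarray(human_dist, dtype=float)

    # Accuracy: top-1 correctness only.
    acc = float(np.argmax(model_dist) == np.argmax(human_dist))

    # JSD: overall distributional alignment. scipy returns the JS *distance*
    # (the square root of the divergence), so we square it.
    jsd = jensenshannon(model_dist, human_dist, base=2) ** 2

    # Spearman rho: ranking alignment, invariant to monotonic transformations.
    rho, _ = spearmanr(model_dist, human_dist)

    return {"acc": acc, "jsd": jsd, "spearman_rho": rho}

# Toy example: same top-1 label, but the mass on non-top-1 options is ranked differently.
print(compare_distributions([0.70, 0.20, 0.10], [0.55, 0.15, 0.30]))
```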
Key Designs¶
- Cross-CoT Experiment:
  - Function: Disentangle the effect of CoT content from that of the model prior.
  - Mechanism: CoT generated by model A is injected into model B; the degree to which model B's output distribution is influenced by the CoT content versus its own prior is then measured. ANOVA is used to quantify variance contributions: if CoT content decisively affects a given metric, the variance of that metric should be predominantly explained by the CoT source (a minimal sketch of this decomposition appears after this list).
  - Design Motivation: If CoT functions as a universal reasoner, injecting any high-quality CoT should improve outputs; if the model prior dominates, the CoT source should be largely irrelevant.
- Step-wise Analysis:
  - Function: Track how the influence of CoT on different metrics evolves throughout the reasoning process.
  - Mechanism: Output distributions are measured at different CoT truncation points (25%, 50%, 75%, 100%), and trends in accuracy and distributional metrics are observed.
  - Design Motivation: If CoT exerts different temporal dynamics on accuracy versus distributional metrics, this indicates they are driven by distinct mechanisms.
- Multi-Metric Complementary Evaluation:
  - Function: Comprehensively assess the effect of CoT on output distributions.
  - Mechanism: Accuracy evaluates only top-1 correctness; JSD evaluates overall distributional alignment; Spearman \(\rho\) evaluates ranking alignment and is invariant to monotonic transformations. The three metrics jointly reveal the multi-faceted effects of CoT.
  - Design Motivation: Accuracy alone cannot capture CoT's influence (or lack thereof) on the underlying distributional structure.
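The ANOVA-based variance attribution used in the Cross-CoT experiment can be approximated with a standard two-way decomposition over the (CoT source, receiving model) grid. A minimal sketch, assuming a results table with one row per cell; the column names, the purely additive model (no interaction term), and the placeholder numbers are assumptions, not the paper's implementation:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def variance_contributions(results: pd.DataFrame, metric: str) -> pd.Series:
    """Share of the variance in `metric` explained by the CoT source vs. the receiving model.

    `results` is assumed to hold one row per Cross-CoT cell with columns:
      cot_source - model that generated the injected CoT
      receiver   - model that consumed the CoT and produced the output distribution
      <metric>   - accuracy / JSD / Spearman rho measured for that cell
    """
    fitted = smf.ols(f"{metric} ~ C(cot_source) + C(receiver)", data=results).fit()
    anova = sm.stats.anova_lm(fitted, typ=2)
    # Normalize each factor's sum of squares by the total (factors + residual).
    return anova["sum_sq"] / anova["sum_sq"].sum()

# Illustrative call on placeholder rows (fabricated numbers, not the paper's data):
toy = pd.DataFrame({
    "cot_source": ["Qwen3", "Qwen3", "R1-Llama", "R1-Llama"],
    "receiver":   ["Qwen3", "R1-Llama", "Qwen3", "R1-Llama"],
    "jsd":        [0.080, 0.078, 0.081, 0.077],
})
print(variance_contributions(toy, "jsd"))
```

Under this decomposition, variance attributed to the receiver factor is what the paper interprets as the contribution of the model prior, while variance attributed to the CoT source is the contribution of CoT content.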
Loss & Training¶
This is an analytical study involving no model training. Seven state-of-the-art reasoning LLMs (Qwen3, DeepSeek-R1, etc.) are evaluated on three subsets of ChaosNLI.
Key Experimental Results¶
Main Results¶
Effect of CoT Reasoning on Distributional Metrics (ChaosNLI-MNLI subset; for JSD, lower is better, and ↓ marks an improvement over the w/o CoT setting)
| Model | ACC (w/o CoT) | ACC (w/ CoT) | JSD (w/o CoT) | JSD (w/ CoT) |
|---|---|---|---|---|
| Qwen3 | 0.688 | 0.644 | 0.093 | 0.080↓ |
| R1-Llama | 0.666 | 0.689 | 0.082 | 0.077↓ |
| R1-Qwen | 0.734 | 0.672 | 0.080 | 0.072↓ |
Ablation Study¶
Cross-CoT ANOVA Variance Contribution Analysis
| Metric | CoT Content Contribution | Model Prior Contribution |
|---|---|---|
| Accuracy | ~99% | ~1% |
| JSD (Distribution Alignment) | ~20% | >80% |
| Spearman \(\rho\) (Ranking Alignment) | ~15% | >80% |
Key Findings¶
- CoT reasoning generally improves distributional alignment (lower JSD), but this improvement is uneven across metrics.
- Accuracy is almost entirely determined by CoT content (~99%)—CoT is a strong top-1 decision-maker.
- Distributional ranking and probability allocation are dominated by the model prior (>80%)—CoT cannot reshape the probability landscape of non-top-1 options.
- Step-wise analysis shows that accuracy increases monotonically with CoT length, whereas distributional structure is already determined by the prior at early steps.
- CoT progressively concentrates probability mass to lock in the most likely answer, but cannot finely calibrate alternative options.
Highlights & Insights¶
- The finding of CoT as a "strong decision-maker but weak distribution calibrator" reveals a fundamental structural limitation of CoT reasoning.
- The Cross-CoT experimental design is elegant—injecting external CoT cleanly separates the effects of content and prior.
- The analysis on HLV tasks has broad implications: in ambiguous domains such as medicine and law, CoT may oversimplify uncertainty.
Limitations & Future Work¶
- Validation is limited to NLI tasks (3-way classification); more complex distributional tasks remain to be explored.
- First-token probabilities are used to approximate output distributions, which may not fully capture the model's true uncertainty (a sketch of this approximation appears after this list).
- The paper does not investigate how to design "distribution-aware" reasoning mechanisms to improve CoT calibration.
- Injecting external CoT in Cross-CoT experiments may introduce out-of-distribution effects.
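For reference, the first-token approximation mentioned in the second limitation is typically obtained along the following lines. A minimal sketch assuming a Hugging Face causal LM; the checkpoint name, prompt format, and single-token handling of label words are illustrative assumptions rather than the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates Qwen3- and DeepSeek-R1-family models.
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

LABELS = ["entailment", "neutral", "contradiction"]

def first_token_label_distribution(prompt: str) -> dict:
    """Approximate the label distribution from the logits of the first generated token.

    Only the probability mass on the first token of each label word is kept and
    renormalized over the three labels; any uncertainty expressed later in the
    generation is ignored, which is exactly the limitation noted above.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    label_ids = [tokenizer.encode(" " + lab, add_special_tokens=False)[0] for lab in LABELS]
    mass = probs[label_ids]
    mass = mass / mass.sum()  # renormalize over the three candidate labels
    return dict(zip(LABELS, mass.tolist()))

print(first_token_label_distribution(
    "Premise: A man is playing a guitar.\nHypothesis: A person is making music.\nAnswer:"
))
```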
Related Work & Insights¶
- vs. Standard CoT Evaluation: Standard evaluation relies solely on accuracy; this paper reveals distributional structural information beyond accuracy.
- vs. Confidence Calibration Research: Calibration research concerns whether model confidence is accurate; this paper focuses on CoT's effect on distributional structure.
- vs. ChaosNLI: ChaosNLI provides human collective opinion distributions; this paper is the first to use it to evaluate reasoning LLMs.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The Cross-CoT decoupling experiment and the "strong decision-maker / weak calibrator" finding are highly insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐ Seven models, three datasets, and ANOVA analysis, though task diversity is limited.
- Writing Quality: ⭐⭐⭐⭐⭐ Analysis is thorough, logic is clear, and findings are articulated with precision.
- Value: ⭐⭐⭐⭐⭐ Makes an important contribution to understanding CoT reasoning mechanisms and uncertainty modeling in LLMs.