Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning¶
Conference: ACL 2026 · arXiv: 2601.02902 · Code: https://github.com/AI4SS/Logical-Phase-Transitions · Area: LLM Reasoning · Keywords: logical reasoning, phase transition, curriculum learning, neuro-symbolic alignment, reasoning collapse
TL;DR¶
This paper identifies a "logical phase transition" phenomenon in LLM logical reasoning: performance collapses abruptly at specific complexity thresholds rather than degrading smoothly. The authors propose a Logical Complexity Metric (LoCM) to quantify this phenomenon and design a Neuro-Symbolic Curriculum Tuning (NSCT) framework that, via adaptive neuro-symbolic alignment and complexity-aware curriculum optimization, lifts average accuracy by +1.26 points under naive prompting and +3.95 points under CoT across five benchmarks.
Background & Motivation¶
State of the Field: Symbolic logical reasoning is a core capability of LLMs, underpinning high-stakes domains such as mathematical proof and legal reasoning. Existing work shows that LLMs perform well on simple logical tasks but degrade significantly as complexity increases.
Limitations of Prior Work: Although performance degradation has been widely observed, a systematic characterization of how logical depth affects reasoning ability is lacking. Existing analyses rely on coarse-grained difficulty proxies (e.g., hop count) that cannot precisely quantify logical complexity itself. Existing reasoning enhancement methods (CoT, ToT, symbolic reasoning, etc.) improve surface-level performance but offer little insight into how reasoning behavior changes with complexity.
Root Cause: Existing logical reasoning datasets lack complete first-order logic (FOL) representations, making it impossible to finely characterize logical dependency structures and compositional depth, and thus preventing the discovery and explanation of reasoning collapse.
Paper Goals: (1) Propose a metric that precisely quantifies logical complexity; (2) discover and formalize the reasoning collapse phenomenon; (3) design training strategies targeting the collapse region.
Starting Point: The authors draw an analogy to phase transitions in physics—water undergoes abrupt state changes at 0°C and 100°C rather than changing continuously. Similarly, LLM logical reasoning performance collapses suddenly at critical complexity thresholds, exhibiting the hallmarks of a phase transition.
Core Idea: Use LoCM to quantify logical complexity and identify phase transition intervals; then apply neuro-symbolic weight interpolation to align natural language and logical symbolic representations; finally, employ complexity-aware curriculum learning to progressively reinforce reasoning at phase transition boundaries.
Method¶
Overall Architecture¶
The framework consists of three stages: (1) Logical Complexity Measurement—constructing the NSA-LR dataset and quantifying the logical difficulty of each instance using LoCM; (2) Logical Phase Transition Discovery—evaluating LLM performance with LoCM, identifying phase transition intervals, and partitioning instances into three experience pools (Easy/Medium/Hard); (3) Neuro-Symbolic Curriculum Tuning—first obtaining a hybrid semantic model \(\theta_{MIX}\) via NL-FOL weight interpolation, then deriving the final model \(\theta^*\) through complexity-incremental curriculum optimization.
Key Designs¶
- Logical Complexity Metric (LoCM):
- Function: Assigns a scalar score to each reasoning instance to quantify its logical difficulty.
- Mechanism: Jointly considers logical operator types and their weights \(\omega(o)\), operator frequency \(\text{freq}(o, \phi)\) (accounting for nesting depth \(d\) and the number of premises \(N_\phi\)), and the number of reasoning hops \(h\), normalized via a monotonic transformation \(f\): \(\text{LoCM}(\phi) = f(\sum_{o \in \mathcal{O}} \omega(o) \cdot \text{freq}(o, \phi) + \gamma \cdot h(\phi))\)
- Design Motivation: Existing complexity estimates rely primarily on hop count, neglecting the contribution of operator types (negation, implication, etc. differ in difficulty), nesting depth, and the number of premises. LoCM provides a multi-dimensional, fine-grained quantification.
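The LoCM definition above can be sketched in a few lines. This is an illustrative implementation only: the operator weights \(\omega(o)\), the hop coefficient \(\gamma\), and the normalization \(f\) below are assumptions, not values from the paper.

```python
import math

# Hypothetical operator weights omega(o); negation, implication, and
# quantifiers are assumed harder than plain conjunction/disjunction.
OMEGA = {"not": 1.5, "implies": 1.3, "and": 1.0, "or": 1.1,
         "forall": 1.4, "exists": 1.4}
GAMMA = 0.5  # hop-count coefficient gamma (assumed)

def locm(op_counts, hops, f=lambda x: 1 - math.exp(-x / 10)):
    """op_counts: operator -> depth-weighted frequency freq(o, phi).
    hops: number of reasoning hops h(phi).
    f: a monotonic normalization onto [0, 1) (an assumption)."""
    raw = sum(OMEGA.get(o, 1.0) * c for o, c in op_counts.items()) + GAMMA * hops
    return f(raw)

# A shallow instance should score lower than a deeply nested one.
simple = locm({"and": 2, "implies": 1}, hops=3)
hard = locm({"not": 4, "implies": 3, "forall": 2}, hops=8)
print(simple, hard)
```

Because \(f\) is monotonic, any choice of normalization preserves the ranking of instances, which is all the curriculum stage needs.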
- Adaptive Neuro-Symbolic Alignment:
- Function: Learns a shared representation space for natural language and logical symbols, enabling the model to perform hybrid reasoning.
- Mechanism: A pure NL model \(\theta_{NL}\) and a pure FOL model \(\theta_{FOL}\) are fine-tuned separately; a family of hybrid models is constructed via linear interpolation \(\theta_\lambda = (1-\lambda)\theta_{NL} + \lambda\theta_{FOL}\); the optimal \(\lambda\) is searched on a validation set and the resulting model is further fine-tuned to yield \(\theta_{MIX}\).
- Design Motivation: Work such as LogicAgent demonstrates that NL provides semantic anchoring while FOL provides precise symbolic constraints—the two are complementary. Weight interpolation is a lightweight model fusion approach that avoids the complexity of multimodal joint training.
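The interpolation step \(\theta_\lambda = (1-\lambda)\theta_{NL} + \lambda\theta_{FOL}\) is a simple element-wise blend of two parameter sets. A minimal sketch on plain parameter dicts, assuming a grid search over \(\lambda\) (the paper's search procedure is not detailed here):

```python
def interpolate(theta_nl, theta_fol, lam):
    """Linear interpolation of two same-shaped parameter dicts:
    theta_lambda = (1 - lam) * theta_NL + lam * theta_FOL."""
    return {k: (1 - lam) * theta_nl[k] + lam * theta_fol[k] for k in theta_nl}

def pick_lambda(theta_nl, theta_fol, validate,
                grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the lambda whose interpolated model scores highest on a
    validation metric; the grid is an assumed search strategy."""
    return max(grid, key=lambda lam: validate(interpolate(theta_nl, theta_fol, lam)))

# Toy usage: scalar "weights" and a made-up validation score that
# prefers w close to 2.5.
theta_nl, theta_fol = {"w": 1.0}, {"w": 3.0}
best = pick_lambda(theta_nl, theta_fol,
                   validate=lambda th: -abs(th["w"] - 2.5))
theta_mix = interpolate(theta_nl, theta_fol, best)
print(best, theta_mix)
```

In practice the dicts would be model state dicts and the selected \(\theta_\lambda\) would then be further fine-tuned to produce \(\theta_{MIX}\), as described above.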
- Complexity-Aware Curriculum Optimization:
- Function: Progressively reinforces reasoning ability at phase transition boundaries, preventing training instability caused by direct exposure to high-complexity instances.
- Mechanism: Starting from \(\theta_{MIX}\), training is organized in the order Easy→Medium→Hard. At each stage, all instances from the current and preceding complexity levels are used; performance gains are continuously monitored, and the next stage begins only when gains stabilize. Standard token-level cross-entropy loss is employed.
- Design Motivation: Phase transitions imply that directly training on high-complexity instances is ineffective (the model has already collapsed in that region); progressive exposure is necessary for the model to smoothly traverse the phase transition interval.
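The Easy→Medium→Hard schedule with cumulative pools and a stabilization check can be sketched as follows; `train_epoch`, `evaluate`, and the stabilization threshold `eps` are stand-ins, since the paper's exact advancement criterion is not specified here.

```python
def curriculum_tune(model, pools, train_epoch, evaluate,
                    eps=0.002, max_epochs=10):
    """pools: level -> list of instances, in curriculum order.
    Each stage trains on all instances up to the current level and
    advances once the per-epoch validation gain drops below eps."""
    seen = []
    for level in ["easy", "medium", "hard"]:
        seen.extend(pools[level])          # cumulative exposure
        prev = evaluate(model)
        for _ in range(max_epochs):
            model = train_epoch(model, seen)
            curr = evaluate(model)
            if curr - prev < eps:          # gains have stabilized
                break
            prev = curr
    return model

# Toy demo: "model" is an accuracy scalar that improves with
# diminishing returns each epoch.
pools = {"easy": [1] * 4, "medium": [2] * 4, "hard": [3] * 4}
tuned = curriculum_tune(0.50, pools,
                        train_epoch=lambda m, data: m + 0.1 * (1 - m),
                        evaluate=lambda m: m)
print(round(tuned, 3))
```

The cumulative `seen` pool mirrors the paper's choice of training on the current and all preceding complexity levels at each stage, which guards against forgetting easier patterns.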
Loss & Training¶
Standard token-level cross-entropy loss: \(\mathcal{L}(\theta) = -\mathbb{E}[\sum_t \log p_\theta(y_t | x, y_{<t})]\). The NSA-LR dataset is constructed using dual translation by GPT-5 and Qwen3-Max, with inconsistencies resolved via CFG validation or human arbitration.
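The objective is the standard sequence negative log-likelihood; as a worked example, a toy per-token version (model probabilities replaced by hand-picked numbers) shows that a model assigning higher probability to each target token incurs a lower loss:

```python
import math

def sequence_nll(token_probs):
    """-sum_t log p_theta(y_t | x, y_<t) for one target sequence;
    token_probs holds the model's probability of each target token."""
    return -sum(math.log(p) for p in token_probs)

confident = sequence_nll([0.9, 0.8, 0.95])   # high-probability targets
uncertain = sequence_nll([0.3, 0.2, 0.5])    # low-probability targets
print(confident, uncertain)
```

In training this is averaged over the batch, matching the expectation \(\mathbb{E}[\cdot]\) in the loss above.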
Key Experimental Results¶
Main Results¶
| Method | ProntoQA | ProofWriter | FOLIO | ProverQA | NSA-LR | Avg. |
|---|---|---|---|---|---|---|
| Naive (orig.) | 55.20 | 44.16 | 60.78 | 54.13 | 49.55 | 52.76 |
| Naive + NSCT | 56.80 | 44.66 | 62.25 | 55.47 | 50.91 | 54.02 (+1.26) |
| CoT (orig.) | 67.60 | 55.16 | 66.17 | 60.70 | 57.70 | 61.47 |
| CoT + NSCT | 72.00 | 60.71 | 65.20 | 64.20 | 65.00 | 65.42 (+3.95) |
Ablation Study (NSA-LR dataset stratified by complexity)¶
| Method | Low | Medium | High | Overall |
|---|---|---|---|---|
| CoT (orig.) | 75.5 | 58.4 | 39.4 | 57.7 |
| CoT + NSCT | 84.0 (+8.5) | 64.2 (+5.8) | 46.8 (+7.4) | 65.0 (+7.3) |
Key Findings¶
- The logical phase transition phenomenon appears consistently across all tested open-source and closed-source LLMs, indicating it is a universal property of reasoning ability rather than a model-specific artifact.
- Phase transitions manifest not as a single threshold but as multiple critical intervals \(\mathcal{I}_k\), within which accuracy drops sharply and then stabilizes—analogous to multi-stage solid–liquid–gas phase transitions.
- NSCT delivers a substantial gain on High-complexity instances (+7.4), confirming that the method remains effective precisely in the phase transition region where baselines have collapsed.
- Single-dataset fine-tuning tends to cause degradation on other datasets (e.g., FOLIO-tuned models drop 0.33 on ProverQA); NSCT is the only method that consistently improves performance across all datasets.
- The discovered phase transitions closely parallel Landau's theory of phase transitions in physics—once the control variable (LoCM) enters the critical interval, system behavior changes abruptly.
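The critical intervals \(\mathcal{I}_k\) described above show up empirically as abrupt drops in the accuracy-vs-LoCM curve. The paper does not detail its detection procedure (see Limitations), but a naive sketch would flag LoCM bins whose accuracy falls sharply relative to the previous bin; the drop threshold here is an arbitrary illustration.

```python
def critical_intervals(bin_edges, accuracies, drop_threshold=0.10):
    """bin_edges: LoCM bin boundaries (len = len(accuracies) + 1).
    Returns (lo, hi) LoCM intervals where accuracy drops abruptly
    between consecutive bins."""
    intervals = []
    for i in range(1, len(accuracies)):
        if accuracies[i - 1] - accuracies[i] > drop_threshold:
            intervals.append((bin_edges[i], bin_edges[i + 1]))
    return intervals

# Toy accuracy curve with a plateau-drop-plateau-drop shape,
# mimicking two critical intervals.
edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
accs = [0.92, 0.90, 0.62, 0.60, 0.35]
print(critical_intervals(edges, accs))  # -> [(0.4, 0.6), (0.8, 1.0)]
```

A real detector would need smoothing and per-bin sample-size checks, but the plateau-then-drop signature is the same one the phase-transition analogy predicts.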
Highlights & Insights¶
- The concept of "logical phase transition," borrowed from physics, is highly apt: performance does not degrade smoothly but collapses abruptly at a threshold. This finding offers a fundamentally new perspective for understanding the capability boundaries of LLM reasoning, explaining why simply adding more training data fails to improve high-complexity reasoning.
- LoCM unifies logical operator weights, nesting depth, premise count, and hop count into a single scalar metric, representing the first systematic attempt at quantifying logical complexity and offering a potential standard tool for future research.
- The weight interpolation approach for fusing NL and FOL models is simple yet effective, exploiting the mode connectivity property of neural networks and being considerably more lightweight than multi-task joint training.
Limitations & Future Work¶
- The operator weights \(\omega(o)\) in LoCM require domain knowledge to specify, and different logical systems may demand different weight configurations.
- Validation is limited to the SFT setting; the effect of RL-based training (e.g., GRPO) on the phase transition region remains unexplored.
- The NSA-LR dataset is synthetic; real-world natural language logical reasoning may exhibit more complex noise patterns.
- Automatic detection of phase transition intervals is not detailed; more guidance is needed on how to identify critical intervals in practice.
Related Work & Insights¶
- vs. Apple (Shojaee et al.): Shojaee et al. identify reasoning collapse in procedural tasks (e.g., Tower of Hanoi), focusing on structured puzzles. This paper targets symbolic reasoning over propositional and first-order logic; the complexity definitions, evaluation objectives, and interventions are entirely different.
- vs. CoT-Valve: CoT-Valve controls reasoning chain length, whereas this paper reveals that the root issue lies in logical complexity rather than chain length, providing a more fundamental explanation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The logical phase transition concept is original and experimentally substantiated; LoCM fills a gap in logical complexity quantification.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluation spans five benchmarks with comparisons across multiple reasoning methods, though absolute gains are modest.
- Writing Quality: ⭐⭐⭐⭐⭐ The physics analogy is precise and well-placed; the framework overview is clear; the narrative is fluent.
- Value: ⭐⭐⭐⭐ Provides a new framework for understanding LLM reasoning capability boundaries, though the practical performance gains remain limited (+1.26/+3.95).