# How LLMs Learn to Reason: A Complex Network Perspective

**Conference:** ICLR 2026 · **arXiv:** 2509.23629 · **Code:** https://anonymous.4open.science/r/CoNet-83A4 · **Area:** LLM Reasoning / Reinforcement Learning · **Keywords:** RLVR, concept network, sparse graph, catastrophic forgetting, annealing algorithm
## TL;DR
From a complex-network perspective, this paper proposes a "sparse concept network" theory that gives a unified explanation of four puzzling phenomena in RLVR training: the V-shaped response-length curve, the two-stage learning curve, catastrophic forgetting, and policy collapse. It shows that all four originate from the topological self-organization of sparse reasoning graphs with average degree \(\langle k \rangle \approx 2\), and derives the Annealed-RLVR algorithm, which surpasses standard RLVR on mathematical reasoning benchmarks.
## Background & Motivation
Background: RLVR (Reinforcement Learning with Verifiable Rewards) is used to train the reasoning capabilities of LLMs, with representative work such as DeepSeek-R1. However, RLVR training exhibits four puzzling phenomena: (1) a two-stage learning curve featuring rapid improvement followed by a prolonged plateau; (2) a V-shaped trajectory in which the length of correct answers first decreases then increases; (3) catastrophic forgetting after SFT; and (4) policy diversity collapse.
Limitations of Prior Work: Existing explanations are isolated from one another—the plateau is attributed to entropy exhaustion, the V-shape to the emergence of self-reflection after redundant reasoning paths are pruned, and catastrophic forgetting is treated as an objective mismatch problem. No unified framework connects all four phenomena to a common underlying mechanism.
Key Challenge: Constructing microscopic reasoning graphs directly from the high-dimensional latent space of LLMs is extremely difficult, impeding direct investigation of the structural origins of RLVR dynamics.
Goal: To provide a unified physical framework that traces all four RLVR phenomena back to a common topological self-organization process.
Key Insight: Inspired by the renormalization group, the analysis operates at the semantic level on a coarse-grained "concept network"—a sparse network with average degree approximately 2—rather than analyzing the full reasoning graph at the token level. A simplified Concept Network Model (CoNet) serves as a computational microscope for validation.
Core Idea: The paper proposes and validates the central hypothesis that the concept network formed after RLVR training is a sparse network with average degree \(\langle k \rangle \approx 2\). This predominantly tree-like topology is efficient yet fragile, and unifies the explanation of the V-shaped curve (paths necessarily lengthen when transitioning from local skill-island optimization to global network integration), catastrophic forgetting (cutting critical "backbone" edges renders subtrees unreachable), and policy collapse (sharp leaf-node learning transitions accumulate to freeze global exploration).
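The fragility of a near-tree topology can be illustrated with a toy reachability check. This is a hypothetical sketch (the node names and adjacency list are invented, not taken from the paper): in a graph with \(\langle k \rangle \approx 2\), severing a single backbone edge disconnects an entire subtree at once.

```python
from collections import deque

def reachable(adj, start):
    """BFS over a directed concept graph; returns the set of reachable nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# A tree-like concept network: each concept links to only one or two
# successors, so the average degree sits near 2 (invented example).
adj = {
    "root": ["algebra", "geometry"],
    "algebra": ["factoring"],
    "geometry": ["angles", "similar_triangles"],
    "similar_triangles": ["ratios"],
}

before = reachable(adj, "root")

# SFT overwrites one critical backbone edge (root -> geometry) ...
adj["root"] = ["algebra"]
after = reachable(adj, "root")

# ... and the whole geometry subtree becomes unreachable at once, even
# though its internal edges are untouched: catastrophic forgetting as
# topologically local damage (knowledge hidden, not erased).
lost = before - after
```

Restoring the single severed edge makes the whole subtree reachable again, which matches the rapid-recovery observation discussed later.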
## Method

### Overall Architecture
(1) Reproduce the four RLVR phenomena on DeepSeek-R1-Distill-Qwen-1.5B; (2) replicate these phenomena with the minimal CoNet model, which abstracts reasoning as a graph traversal problem; (3) leverage CoNet's transparency to analyze microscopic topological mechanisms; (4) design the Annealed-RLVR intervention algorithm based on these findings.
### Key Designs

- **Sparse Concept Network Hypothesis (\(\langle k \rangle \approx 2\))**
  - Function: provides a unified explanation of the four macroscopic RLVR phenomena.
  - Mechanism:
    - V-shaped curve: the descending segment corresponds to parallel local optimization of independent "skill islands" (pruning redundant paths); the ascending segment corresponds to merging skill islands into a global concept network, whose sparse structure forces paths to lengthen (average geodesic distance grows with network size).
    - Catastrophic forgetting: \(\langle k \rangle \approx 2\) means backbone edges are the sole connections between regions; when SFT overwrites the weights at these critical branch points, entire subtrees are severed.
    - Policy collapse: sharp leaf-node transitions from exploration to exploitation accumulate into a global loss of diversity.
  - Design Motivation: in complex systems, emergent behavior is typically governed by large-scale organization rather than microscopic details, so a sparse graph is the most parsimonious explanatory framework.
- **CoNet Computational Microscope**
  - Function: provides a tractable minimal model for theoretical validation.
  - Mechanism: CoNet maps the LLM's "semantic states" to abstract nodes in a fixed random graph and its "logical transitions" to learnable probabilistic edges. Learning reduces to a graph-traversal problem, yet CoNet reproduces macroscopic LLM behaviors such as the V-shaped curve and the two-stage learning dynamics.
  - Design Motivation: direct microscopic analysis of an LLM is infeasible; CoNet acts as a renormalized proxy that makes such analysis tractable.
- **Annealed-RLVR Algorithm**
  - Function: overcomes topological bottlenecks in RLVR to improve reasoning performance.
  - Mechanism: at the "maximum frustration state" (when skill-island competition is most intense, at the bottom of the V-shaped curve), a brief SFT "heating" phase is inserted, applying SFT only to problems with very low accuracy (< 0.1) for which correct reference solutions exist, after which RLVR "cooling" resumes. As in simulated annealing, heating breaks out of local optima and cooling guides the system toward a better global configuration.
  - Design Motivation: the maximum frustration state coincides with the peak of exploration diversity (Figure 6b), making it the optimal moment for SFT intervention: skill islands have not yet consolidated into a global network and are therefore most robust to perturbation.
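The "reasoning as graph traversal with learnable edges" abstraction can be made concrete. The sketch below is not the paper's CoNet implementation; it is a minimal stand-in under invented assumptions (12 nodes, a ring-plus-shortcut graph, REINFORCE-style updates with a verifiable 0/1 reward) showing how a traversal policy over fixed graph structure can be trained.

```python
import math
import random

random.seed(0)

# A fixed random "concept graph": every node keeps a ring edge (so a path
# to the goal always exists) plus two random shortcuts.  All of these
# choices are illustrative, not taken from the paper.
N = 12
graph = {u: sorted({(u + 1) % N, *random.sample(range(N), k=2)}) for u in range(N)}

# One learnable logit per directed edge; the policy at each node is a
# softmax over its outgoing edges.
logits = {(u, v): 0.0 for u in graph for v in graph[u]}

def step(u):
    """Sample a successor of u from the softmax over its edge logits."""
    succ = graph[u]
    weights = [math.exp(logits[(u, v)]) for v in succ]
    r = random.random() * sum(weights)
    for v, w in zip(succ, weights):
        r -= w
        if r <= 0:
            return v
    return succ[-1]

def episode(start=0, goal=N - 1, max_len=N):
    """One traversal attempt; returns (visited_edges, reached_goal)."""
    u, edges = start, []
    for _ in range(max_len):
        v = step(u)
        edges.append((u, v))
        if v == goal:
            return edges, True
        u = v
    return edges, False

# REINFORCE-style updates with a verifiable reward: reinforce the edges of
# successful traversals, slightly weaken those of failed ones.  Logits are
# clamped so the softmax weights stay numerically safe.
lr = 0.5
results = []
for _ in range(2000):
    edges, ok = episode()
    g = 1.0 if ok else -0.1
    for e in edges:
        logits[e] = max(-20.0, min(20.0, logits[e] + lr * g))
    results.append(ok)

early = sum(results[:200]) / 200   # success rate before edges adapt
late = sum(results[-200:]) / 200   # success rate after reinforcement
```

Even this tiny model shows the qualitative pattern the paper exploits: once a successful path is found, reinforcement rapidly concentrates probability on it, which improves reward while shrinking exploration diversity.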
### Loss & Training
GRPO (Group Relative Policy Optimization) is used as the RLVR algorithm. Annealed-RLVR triggers the SFT heating phase (tens of steps) upon detecting the knee point of the reward curve and the bottom of the V-shaped curve, after which standard RLVR resumes.
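The scheduling logic can be sketched as follows. The paper's trigger uses the reward-curve knee and the V-shape bottom under GRPO; the sketch substitutes a simple hedged heuristic (`v_bottom`: the global minimum of the length series, confirmed by a few rising steps) and caller-supplied stubs `rlvr_step` / `sft_step`. All of these names and the heuristic itself are assumptions, not the paper's implementation.

```python
def v_bottom(lengths, patience=2):
    """Index of the V-shape bottom: the global minimum of the response-length
    series, confirmed once `patience` later values all exceed it.
    A hedged heuristic, not the paper's exact detector."""
    if not lengths:
        return None
    lo = min(range(len(lengths)), key=lengths.__getitem__)
    tail = lengths[lo + 1:lo + 1 + patience]
    if len(tail) == patience and all(x > lengths[lo] for x in tail):
        return lo
    return None

def annealed_rlvr(rlvr_step, sft_step, accuracies, n_steps=10,
                  heat_steps=30, acc_threshold=0.1):
    """Run RLVR and insert one SFT 'heating' phase at the V-shape bottom.
    `rlvr_step()` must return the step's mean correct-response length;
    `sft_step(problems)` fine-tunes on the given problems.  Both are
    caller-supplied stubs; this loop only sketches the schedule."""
    lengths, heated = [], False
    for _ in range(n_steps):
        lengths.append(rlvr_step())
        if not heated and v_bottom(lengths) is not None:
            # Maximum frustration state: heat briefly with SFT on problems
            # the policy almost never solves but for which correct
            # reference solutions exist, then resume RLVR cooling.
            hard = [p for p, acc in accuracies.items() if acc < acc_threshold]
            for _ in range(heat_steps):
                sft_step(hard)
            heated = True
    return lengths
```

On a synthetic V-shaped length series (e.g. 10, 9, 8, 7, 6, 7, 8, ...), the heating phase fires right after the bottom at length 6 is confirmed.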
## Key Experimental Results

### Main Results
| Method | Training Set (512 problems) | Minerva (OOD) | AIME 2024/2025 (OOD) |
|---|---|---|---|
| Standard RLVR | Baseline | Baseline | Baseline |
| Annealed-RLVR | Superior | Superior | Superior |
### Ablation Study
| Configuration | Performance | Notes |
|---|---|---|
| RLVR without intervention | Baseline; late-stage policy collapse | Inherent limitation of the standard approach |
| SFT intervention at incorrect timing | Catastrophic forgetting | Timing is critical |
| SFT intervention at maximum frustration state | Optimal | Best timing as predicted by theory |
### Key Findings
- CoNet (minimal model) and the 1.5B LLM exhibit remarkably consistent macroscopic dynamics, supporting the claim that emergent behavior is independent of microscopic details.
- The average degree of the concept network consistently stabilizes at approximately 2, directly validating the central hypothesis.
- Rapid recovery following catastrophic forgetting confirms the "topologically local damage" interpretation—knowledge is not erased but rendered unreachable.
- Annealed-RLVR outperforms standard RLVR on both in-distribution and OOD benchmarks.
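The second finding is exactly the tree signature: any tree on \(N\) nodes has \(E = N - 1\) edges, hence \(\langle k \rangle = 2E/N = 2(N-1)/N \to 2\). A small helper (hypothetical, not the paper's measurement code) makes the arithmetic concrete:

```python
def average_degree(edges):
    """Mean degree <k> = 2E/N of the undirected graph induced by `edges`."""
    undirected = {frozenset(e) for e in edges}   # dedupe, ignore direction
    nodes = {u for e in undirected for u in e}
    return 2 * len(undirected) / len(nodes)

# A tree on N nodes has E = N - 1 edges, so <k> = 2(N - 1)/N -> 2 as N
# grows; a measured <k> near 2 is therefore evidence of a near-tree.
chain = [(i, i + 1) for i in range(99)]   # a 100-node path graph (a tree)
print(average_degree(chain))              # 2 * 99 / 100 = 1.98
```

By contrast, adding even one cycle pushes \(\langle k \rangle\) to 2 or above, so values hovering just below 2 are consistent with a predominantly tree-like concept network.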
## Highlights & Insights
- A Unified Theory from a Physics Perspective: A single concise topological hypothesis (\(\langle k \rangle \approx 2\)) unifies the explanation of four independent phenomena, demonstrating the power of interdisciplinary thinking.
- Maximum Frustration State = Optimal Exploration Moment: The paper reveals that the moment of apparently worst performance coincides with the peak of exploration diversity—an insight with direct implications for all practitioners using RLVR.
- From Explanation to Prescription: Beyond proposing a theoretical framework, the work directly derives a verifiable optimization algorithm, completing a coherent theory-to-practice loop.
## Limitations & Future Work
- CoNet is a highly simplified model with a substantial gap in scale and mechanism relative to real LLMs.
- The central hypothesis (\(\langle k \rangle \approx 2\)) lacks validation through direct extraction of reasoning graphs from within an LLM.
- Validation is limited to a 1.5B-parameter model; applicability to larger-scale models remains to be confirmed.
- Detection of the annealing trigger (V-shape bottom / reward knee point) may not always be unambiguous in practice.
## Related Work & Insights
- vs. DeepScaleR/DeepSeek-R1: the primary contribution here is a theoretical explanation of RLVR training dynamics, from which Annealed-RLVR is derived, rather than a new training recipe.
- vs. Learning Curve Analysis Work: Macroscopic phenomena are traced to topological structure rather than statistical or optimization perspectives.
- The sparse network self-organization framework may generalize to understanding other emergent capabilities, such as in-context learning and tool use.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Unifying RLVR phenomena through complex network theory represents an entirely novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐ CoNet validation is thorough; LLM validation covers multiple benchmarks but is limited in model scale.
- Writing Quality: ⭐⭐⭐⭐⭐ Exemplary narrative; the logical chain from phenomena to theory to algorithm is complete and elegant.
- Value: ⭐⭐⭐⭐⭐ Has far-reaching implications for understanding the mechanisms of reasoning acquisition in LLMs; Annealed-RLVR offers practical utility.