Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

Conference: ICML 2025 (Spotlight)
arXiv: 2503.18114
Code: GitHub
Area: Feature Learning Theory
Keywords: feature learning, lazy-rich regime, manifold capacity, representational geometry, GLUE

TL;DR

This paper proposes using manifold capacity and its associated geometric metrics (GLUE) to characterize the richness of feature learning. This approach goes beyond the traditional lazy vs. rich dichotomy, revealing new insights into different learning phases, learning strategies, computational neuroscience, and OOD generalization.

Background & Motivation

Limitations of the Lazy vs. Rich Dichotomy

Existing theoretical frameworks classify neural network learning into the lazy regime (where weights stay close to their initial values and the network behaves like a random feature model) and the rich regime (where task-relevant features are actively learned).
This dichotomy is too coarse: substantial heterogeneity exists within the rich regime itself. Different architectures, initializations, and learning rates lead to distinct feature learning mechanisms, yet all are lumped together as "rich."
Traditional metrics (weight change magnitude, NTK-label alignment, representation-label alignment) each have limitations:
- Weight change: Only measures the magnitude of change and fails to quantify the amount of learned task-relevant features.
- NTK/representation-label alignment: Can produce incorrect rankings in certain scenarios.
- These are not purely representation-based, rendering them inapplicable to neuroscience settings where synaptic weight changes cannot be tracked precisely.

Core Problem

How can a representation-based metric be used to quantify the richness of feature learning?
Do subtypes exist within the rich regime?
Can this framework provide new insights into open problems in neuroscience and machine learning?

Method

1. Task-Relevant Manifolds

For classification tasks: The manifold of the $i$-th class is defined as $\mathcal{M}_i = \text{conv}(\{\Phi(x) : x \in \mathcal{X}_i\})$, the convex hull of the representations $\Phi(x)$ at a chosen layer for all inputs $x$ in that class.
Key Idea: Feature learning = manifold untangling, i.e., learning to make task-relevant manifolds more linearly separable in the representation space.

2. Manifold Capacity $\alpha_M$

Intuition: Measures how many linearly separable manifolds can be "packed" into a representation space of a given dimension.
Definition of simulated capacity: For random dichotomies $\mathbf{y} \in \{\pm 1\}^P$ over the $P$ manifolds and random projections $\Pi_n$ to $n$ dimensions, the probability of successful linear separation $p_n$ is estimated, and then: $$\alpha_{\text{sim}} = \frac{P}{\sum_{n \in [N]}(1 - p_n)}$$
In practice, the mean-field version $\alpha_M$ is used, which can be efficiently computed by solving a quadratic program and differs from the simulated version by $O(1/N)$.
Core property: Higher capacity $\rightarrow$ more untangled manifolds $\rightarrow$ richer feature learning.
Approximation formula: $\alpha_M \approx (1 + R_M^{-2}) / D_M$, where $R_M$ represents the manifold radius and $D_M$ is the manifold dimension.
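The simulated capacity defined above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the projections are taken to be i.i.d. Gaussian, the linear classifier includes a bias term, and separability is checked with a perceptron capped at `max_epochs` (so slow-converging separable cases may be counted as failures). The function names are hypothetical.

```python
import random

def separable(points, labels, max_epochs=200):
    """Heuristic check: True if a perceptron (with bias) separates the labels."""
    dim = len(points[0])
    w = [0.0] * (dim + 1)  # last entry is the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if y * score <= 0:
                for i in range(dim):
                    w[i] += y * x[i]
                w[-1] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # assumption: no convergence is treated as "not separable"

def project(points, n, rng):
    """Project points to n dimensions with a random Gaussian matrix (assumed form of Pi_n)."""
    dim = len(points[0])
    proj = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    return [[sum(r[i] * x[i] for i in range(dim)) for r in proj] for x in points]

def simulated_capacity(manifolds, trials=20, seed=0):
    """Estimate alpha_sim = P / sum_n (1 - p_n) for a list of point-cloud manifolds."""
    rng = random.Random(seed)
    P = len(manifolds)
    N = len(manifolds[0][0])
    points = [x for m in manifolds for x in m]
    owner = [i for i, m in enumerate(manifolds) for _ in m]
    failure_sum = 0.0
    for n in range(1, N + 1):
        successes = 0
        for _ in range(trials):
            dichotomy = [rng.choice((-1, 1)) for _ in range(P)]  # one label per manifold
            labels = [dichotomy[i] for i in owner]
            if separable(project(points, n, rng), labels):
                successes += 1
        failure_sum += 1.0 - successes / trials  # 1 - p_n
    return P / failure_sum if failure_sum > 0 else float("inf")
```

Because every point inherits the label of its manifold, separating the point clouds is the same as separating their convex hulls. The cost grows with the number of dimensions and trials, which is one reason a mean-field estimate is preferable in practice.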

3. GLUE: A Family of Geometric Metrics

GLUE (Geometry Linked to Untangling Efficiency) decomposes capacity into several interpretable geometric metrics:

| Metric | Meaning | Impact on Capacity |
| --- | --- | --- |
| Manifold Dimension $D_M$ | Degrees of freedom of variation within the manifold (analogous to Gaussian width) | Dimension reduction $\rightarrow$ Capacity $\uparrow$ |
| Manifold Radius $R_M$ | Noise-to-signal ratio (intra-class variation / norm of class center) | Radius reduction $\rightarrow$ Capacity $\uparrow$ |
| Center Alignment $\rho_M^c$ | Correlation between the centers of different manifolds | Decrease $\rightarrow$ Capacity $\uparrow$ |
| Axis Alignment $\rho_M^a$ | Correlation between the variation directions of different manifolds | Decrease $\rightarrow$ Capacity $\uparrow$ |
| Center-Axis Alignment $\psi_M$ | Correlation between manifold centers and variation directions of other manifolds | More complex relationship |
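To make the "Meaning" column concrete, the sketch below computes crude point-cloud proxies for two of the metrics. These are assumptions for illustration only: the paper's GLUE quantities come out of the mean-field capacity computation, whereas `radius_proxy` and `center_alignment_proxy` (hypothetical names) apply the verbal descriptions directly to raw representations.

```python
import math

def center(points):
    """Mean of a point cloud (the manifold center)."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def radius_proxy(points):
    """RMS intra-class deviation divided by the norm of the class center."""
    c = center(points)
    sq_dev = sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)
    return math.sqrt(sq_dev / len(points)) / norm(c)

def center_alignment_proxy(manifolds):
    """Mean absolute cosine similarity between centers of distinct manifolds."""
    centers = [center(m) for m in manifolds]
    cosines = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dot = sum(a * b for a, b in zip(centers[i], centers[j]))
            cosines.append(abs(dot) / (norm(centers[i]) * norm(centers[j])))
    return sum(cosines) / len(cosines)

# Toy example: two classes whose centers are orthogonal.
class_a = [[1.0, 0.0], [3.0, 0.0]]
class_b = [[0.0, 2.0], [0.0, 4.0]]
print(radius_proxy(class_a))                        # 0.5
print(center_alignment_proxy([class_a, class_b]))   # 0.0
```

Both proxies assume non-zero class centers. A smaller radius proxy and a smaller center alignment both point in the direction the table associates with higher capacity.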

4. Theoretical Guarantee (Theorem 3.1)

In a two-layer nonlinear network under the teacher-student setting, it is proven that:

Capacity tracks richness: Under the proportional asymptotic limit, capacity $\alpha(\eta, \psi_1, \psi_2)$ is strictly monotonically increasing with respect to the learning rate $\eta$.
Capacity links to prediction accuracy: There exists a monotonically increasing invertible function $h$ such that $\text{Acc}(\eta) = h(\alpha(\eta))$.

Within this two-layer teacher-student setting, the theorem gives theoretical support for using manifold capacity to quantify the extent of feature learning.
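One consequence of the two statements is worth spelling out. The notation suppresses $\psi_1, \psi_2$ as in the accuracy relation above, and $h$ is taken to be strictly increasing, which follows from it being monotone and invertible. For two learning rates in the regime covered by the theorem:

$$\eta_1 < \eta_2 \;\Longrightarrow\; \alpha(\eta_1) < \alpha(\eta_2) \;\Longrightarrow\; \text{Acc}(\eta_1) = h(\alpha(\eta_1)) < h(\alpha(\eta_2)) = \text{Acc}(\eta_2)$$

Conversely, $\alpha(\eta) = h^{-1}(\text{Acc}(\eta))$, so in this setting ranking models by capacity and ranking them by accuracy give the same order.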

Key Experimental Results

Experiment 1: Comparison with Traditional Metrics (2-layer NN + Synthetic Data)

An inverse scaling factor $\bar{\eta}$ interpolates between the lazy (small $\bar{\eta}$) and rich (large $\bar{\eta}$) regimes.
Capacity correctly orders the degree of richness across $\bar{\eta}$ values, whereas NTK-label alignment and representation-label alignment yield incorrect orderings under certain settings.
Capacity can also detect the amount of task-relevant features already present at initialization (the wealthy vs. poor regime). Weight change cannot, because it is zero at initialization by definition.

Experiment 2: Differences in Learning Strategies (Section 4.1)

Tracking training trajectories on radius-dimension contour plots reveals that different levels of richness correspond to different strategies:
- Lazy $\rightarrow$ moderately rich: Compresses both radius and dimension simultaneously.
- Moderately rich $\rightarrow$ extremely rich: Sacrifices radius to further compress dimension.
Different levels of initialization wealth also lead to different strategies: wealthy initialization primarily compresses the radius, while poor initialization requires manipulating both.
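The radius-dimension trade-off can be checked against the approximation $\alpha_M \approx (1 + R_M^{-2}) / D_M$ from the Method section. The sketch below is illustrative only: the three (radius, dimension) checkpoints are hypothetical values chosen to mimic the two strategies, not measurements from the paper.

```python
def approx_capacity(radius: float, dimension: float) -> float:
    """Approximate capacity alpha_M ~ (1 + R_M^-2) / D_M."""
    return (1.0 + radius ** -2) / dimension

# Hypothetical checkpoints along a training trajectory.
lazy = approx_capacity(1.0, 20.0)      # starting point
moderate = approx_capacity(0.8, 16.0)  # radius and dimension both compressed
extreme = approx_capacity(1.0, 8.0)    # radius grows back, dimension halves

print(lazy, moderate, extreme)  # capacity increases at each step
```

In this toy trajectory the last step gives up radius yet still raises the approximate capacity, because the dimension reduction dominates. That is the pattern described for the move from moderately rich to extremely rich.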

Experiment 3: Learning Phases (Section 4.2)

VGG-11 is trained on CIFAR-10. Despite training and test accuracy saturating rapidly, the manifold geometry still unveils at least four distinct phases:

Clustering phase: Initial compression of the manifolds.
Structuring phase: Aligned structures increase.
Separating phase: Alignment decreases and manifolds push away from each other.
Stabilizing phase: Center alignment decreases further.

Experiment 4: Structural Inductive Bias in RNNs (Section 5.1)

RNNs initialized with weights of different ranks converge to similar post-training capacity values, yet their geometric organizations differ substantially.
Low-rank initialization (poorer initialization, richer learning) $\rightarrow$ large radius + small dimension.
High-rank initialization (wealthier initialization, lazier learning) $\rightarrow$ small radius + large dimension.
This demonstrates the existence of a structural inductive bias at the level of manifold geometry.

Experiment 5: OOD Generalization (Section 5.2)

VGG-11 / ResNet-18 are pre-trained on CIFAR-10, followed by linear probing on CIFAR-100.
The moderately rich regime performs best; OOD accuracy drops sharply in the ultra-rich regime.
Geometric explanation: In VGG-11, the ultra-rich regime inflates the manifold radius and increases center-axis alignment, which lowers capacity.
In ResNet-18, the capacity decline is instead caused by an increase in dimension, highlighting architectural differences.

Highlights & Insights

Beyond Dichotomy: This work is the first to systematically segment feature learning into multiple subtypes (learning strategies $\times$ learning phases) using the lens of representation geometry, rather than relying on a simplistic lazy-rich dichotomy.
Integration of Theory and Experiment: A rigorous asymptotic theory (Theorem 3.1) is presented on two-layer networks and validated across practical architectures such as VGG, ResNet, and RNNs.
Cross-Domain Applicability: The framework spans computational neuroscience (RNN neural circuit bias) and machine learning (OOD generalization), serving as a prime application of representational geometry.
Actionable Metrics: The GLUE family of metrics provides an interpretable diagnostic tool: a capacity drop can be attributed specifically to changes in radius, dimension, or alignment.
Spotlight Paper: This designation reflects the reviewers' strong recognition of its originality and impact.

Limitations & Future Work

Theory limited to two layers + single-step gradient: Theorem 3.1 only holds after a single gradient step. The asymptotic behavior of multi-step training remains unproven, as Gaussian equivalence might not be preserved.
Experimental Scale: Only VGG-11/ResNet-18 and CIFAR-10/100 were tested. The applicability to larger models (e.g., Transformers) and more complex tasks (e.g., NLP, large-scale computer vision) remains unverified.
Convex Hull Approximation: Point sets are linearly separable exactly when their convex hulls are, so modeling manifolds as convex hulls loses nothing for linear classification analysis, but it might overlook higher-order non-linear structures.
Computational Cost: Computing mean-field capacity requires solving quadratic programs, and its scalability to ultra-large-scale representations needs verification.
Lack of Causality: The capacity-tracking-richness paradigm is a correlational description. A causal intervention framework for "manipulating geometry $\rightarrow$ improving learning" has not yet been established.

Related Work & Insights

Chizat et al. (2019): Interpolates between the lazy and rich regimes using a scaling factor, forming the basis of the experimental setup in this paper.
Ba et al. (2022): Theoretical analysis of single-step gradients in two-layer networks, which this paper generalizes from regression to classification.
Chung et al. (2018), Chou et al. (2025): Original proponents of manifold capacity theory and GLUE.
Jacot et al. (2018): NTK theory, providing the theoretical foundation of the lazy regime.
Neural Collapse (Papyan et al. 2020): Representation structures under the extreme rich regime, which can be viewed as a special case of this framework.

Insights: This framework offers a new perspective on the training dynamics of large models: rather than tracking loss curves or NTK variation, it monitors the geometric evolution of manifolds in the representation space. Future work could explore applying GLUE to concept manifold analysis in LLMs.

Rating

Novelty: ⭐⭐⭐⭐⭐ (The first study to systematically go beyond the lazy-rich dichotomy, proposing a taxonomy of feature learning from a geometric perspective)
Experimental Thoroughness: ⭐⭐⭐⭐ (Combines theory, synthetic data, CNNs, RNNs, and OOD settings, offering comprehensive coverage, but the model sizes are relatively small)
Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, excellent flow between figures and text, and thorough intuitive explanations)
Value: ⭐⭐⭐⭐⭐ (Provides a new analysis paradigm for representation learning theory with strong cross-disciplinary applicability)
