Less is More: Local Intrinsic Dimensions of Contextual Language Models¶
Conference: NeurIPS 2025 arXiv: 2506.01034 Code: GitHub Area: Natural Language Processing Keywords: intrinsic dimension, LLM, fine-tuning, grokking, overfitting detection, embedding geometry
TL;DR¶
This paper proposes using the Local Intrinsic Dimension (LID) of contextual token embeddings as an unsupervised signal for monitoring LLM training dynamics — a decrease in LID indicates improved generalization, while an increase signals overfitting. The utility of this geometric signal is validated on tasks including dialogue state tracking, grokking, and emotion recognition.
Background & Motivation¶
- Understanding LLM internals remains difficult: Even fundamental questions such as how fine-tuning affects model behavior typically require extensive empirical evaluation.
- Lack of unsupervised training diagnostic tools: Most performance diagnostics rely on labeled validation sets or task-specific probes, which are unavailable in low-resource settings.
- Limitations of existing dimensionality studies: Tulchinskii et al. found that AI-generated text exhibits lower global intrinsic dimension, but their analysis operates on individual text segments; Aghajanyan et al. defined intrinsic dimension in parameter space rather than embedding space; Valeriani et al. studied how global dimensionality changes as data passes through an LLM, but without localized analysis.
- Global dimensionality lacks granularity: The embedding space is not a single manifold of uniform dimensionality, but rather a union of manifolds with varying local dimensions, necessitating local estimation.
Method¶
Latent Space Modeling¶
Given a text corpus \(\mathcal{D} = (s_0, \ldots, s_D)\) and a model \(\mathcal{M}\) of depth \(l\), each sequence \(s_m\) is tokenized by \(\mathcal{T}\) into tokens \(t_j^m\), and each layer \(i \in \{1, \ldots, l\}\) produces a contextual embedding \(\mathcal{M}_i(t_j^m)\) for every token.
All token embeddings form a point cloud \(\mathbb{T}_i = \{\mathcal{M}_i(t_j^m)\}_{m, j}\), with distances measured in Euclidean space.
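The point-cloud construction above can be sketched in a few lines. Note that `embed` here is a toy stand-in (a fixed random projection of token ids), not a real language model; in practice it would be the layer-\(i\) hidden states of \(\mathcal{M}\):

```python
import numpy as np

def build_point_cloud(sequences, embed_fn):
    """Stack layer-i token embeddings from all sequences into one point cloud T_i.

    embed_fn(tokens) -> (len(tokens), n) array of contextual embeddings;
    it stands in for extracting the hidden states at layer i of the model M.
    """
    clouds = [embed_fn(seq) for seq in sequences]
    return np.vstack(clouds)  # shape (total_tokens, n); Euclidean geometry

# Toy stand-in: a fixed random projection of token ids (NOT a real LM).
rng = np.random.default_rng(0)
proj = rng.normal(size=(100, 16))
embed = lambda toks: proj[np.array(toks)]

cloud = build_point_cloud([[1, 2, 3], [4, 5]], embed)
print(cloud.shape)  # (5, 16)
```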
Two-Stage Sampling Strategy¶
In practice, \(\mathbb{T}_i\) can contain millions of vectors, making direct neighborhood computation infeasible. The procedure is:

1. Sample \(M\) sequences from \(\mathcal{D}\)
2. After deduplication, subsample \(N\) token vectors
3. Compute the \(L\)-nearest neighborhood \(\mathcal{N}_L(t_j; \mathbb{T})\) for each token
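Stages 2 and 3 can be sketched with a brute-force numpy kNN (a real pipeline would use an approximate-neighbor index for large clouds; the function name and defaults are illustrative):

```python
import numpy as np

def knn_neighborhoods(cloud, N, L, seed=0):
    """Subsample N token vectors from the deduplicated cloud, then find
    each one's L nearest neighbors (Euclidean) within the full cloud."""
    rng = np.random.default_rng(seed)
    cloud = np.unique(cloud, axis=0)  # deduplicate identical token vectors
    idx = rng.choice(len(cloud), size=min(N, len(cloud)), replace=False)
    queries = cloud[idx]
    # Pairwise Euclidean distances from each query to every cloud point.
    d = np.linalg.norm(queries[:, None, :] - cloud[None, :, :], axis=-1)
    # Column 0 of the sort is the query itself (distance 0); keep the next L.
    order = np.argsort(d, axis=1)[:, 1:L + 1]
    dists = np.take_along_axis(d, order, axis=1)
    return queries, order, dists

cloud = np.random.default_rng(1).normal(size=(200, 8))
q, nbr, dist = knn_neighborhoods(cloud, N=32, L=10)
print(q.shape, nbr.shape, dist.shape)  # (32, 8) (32, 10) (32, 10)
```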
Local TwoNN Dimension Estimation¶
The TwoNN estimator is applied, leveraging the ratio \(\mu = r_2/r_1\) of the distances to the second-nearest and nearest neighbors, which under mild assumptions follows a Pareto distribution with shape parameter equal to the dimension. Within each neighborhood \(\mathcal{N}_L(t_j)\), the maximum-likelihood estimate of the local dimension is \(\hat{d}(t_j) = L \big/ \sum_{t \in \mathcal{N}_L(t_j)} \log \mu(t)\).
This yields a dimensionality vector \(\in \mathbb{R}_{\geq 0}^N\) over all sampled tokens, aggregated into a mean LID as the overall geometric signature.
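A minimal sketch of the TwoNN maximum-likelihood estimate applied to one point set (in the paper's pipeline it would be run once per sampled neighborhood; the sanity check with a 2-D plane embedded in 10-D is my own):

```python
import numpy as np

def twonn(points):
    """TwoNN dimension estimate for one point set (e.g., a local neighborhood).

    For each point, mu = r2/r1 is the ratio of distances to its 2nd and 1st
    nearest neighbors; under the Pareto assumption the MLE of the dimension
    is k / sum(log mu) over the k points.
    """
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # exclude self-distances
    two = np.sort(d, axis=1)[:, :2]    # r1, r2 for every point
    mu = two[:, 1] / two[:, 0]
    return len(points) / np.log(mu).sum()

# Sanity check: uniform points on a 2-D plane embedded in 10-D should
# yield an estimate close to 2, regardless of the ambient dimension.
rng = np.random.default_rng(0)
flat = rng.uniform(size=(500, 2)) @ rng.normal(size=(2, 10))
print(twonn(flat))
```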
Cross-Model Comparison¶
A base model \(\mathcal{M}\) and its fine-tuned counterpart \(\mathcal{M}^\Delta\) share the same architecture and tokenizer, establishing a natural point-wise correspondence between their embedding spaces, enabling direct comparison of dimensionality changes.
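Because the token correspondence is point-wise, the per-token LID vectors of the two models can be compared directly. Experiment 1 reports a "standardized mean difference"; a Cohen's-d-style pooled-variance formula is one plausible reading (the exact formula and the synthetic LID values below are assumptions for illustration):

```python
import numpy as np

def standardized_mean_diff(lid_base, lid_tuned):
    """Cohen's-d-style effect size between per-token LID vectors of a base
    model and its fine-tuned counterpart (same tokens, point-wise aligned)."""
    diff = lid_base.mean() - lid_tuned.mean()
    pooled = np.sqrt((lid_base.var(ddof=1) + lid_tuned.var(ddof=1)) / 2)
    return diff / pooled

rng = np.random.default_rng(0)
base = rng.normal(9.0, 1.0, size=1000)   # hypothetical LIDs before fine-tuning
tuned = rng.normal(7.8, 1.0, size=1000)  # hypothetical LIDs after fine-tuning
print(standardized_mean_diff(base, tuned))
```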
Experiments¶
Experiment 1: Fine-Tuning Induces Dataset-Specific Dimensional Shifts¶
Setup: RoBERTa-base is fine-tuned via MLM on MultiWOZ dialogue data for 5 epochs; LID is measured on MultiWOZ, Wikipedia, and Reddit.
Results:

- MultiWOZ (fine-tuning data): significant LID decrease (standardized mean difference 1.19)
- Wikipedia/Reddit (out-of-distribution data): LID nearly unchanged (standardized mean differences 0.08/0.10)
Key Finding: LID reduction is dataset-specific — it occurs only within the fine-tuning data distribution and does not affect unrelated data regions.
Experiment 2: LID Detects Grokking¶
Setup: A 2-layer decoder-only Transformer is trained on modular addition mod \(p=197\), with training data ratios ranging from 10% to 50%.
| Training Data Ratio | Grokking? | LID Trend on Training Set |
|---|---|---|
| 10% | No | Rises then plateaus |
| 15% | No | Rises then plateaus |
| ≥20% | Yes | Rises then significantly decreases |
Key Finding: A pronounced LID decrease on the training set coincides with the onset of rising validation accuracy — grokking can be predicted from training data alone, without requiring validation labels.
Experiment 3: LID Detects Training Capacity Exhaustion¶
Setup: TripPy-R dialogue state tracking model (RoBERTa encoder) is trained on MultiWOZ for 20 epochs.
Results:

- Spearman correlation between mean LID on the training set and Joint Goal Accuracy (JGA): −0.982
- Validation loss is minimized by step 7,500, yet JGA continues to improve and LID continues to decrease, indicating that validation loss gives a misleading "convergence" signal
- LID stabilizes after approximately 25,000 steps, in synchrony with JGA convergence
Key Finding: LID is a more reliable convergence indicator than validation loss.
Experiment 4: LID Detects Overfitting¶
Setup: BERT-base with a linear classifier is trained on EmoWOZ emotion classification for 8 epochs.
Results:

- After epoch 1, LID drops sharply from ~9.94 to ~7.25 (the model finds an efficient representation)
- LID subsequently rises gradually to ~8 (a dimensional increase suggesting memorization)
- Validation loss rises continuously after epoch 1, a clear overfitting signal
- Spearman correlation of LID with training loss: −0.952; with validation loss: +0.952
Key Finding: The pattern of LID decreasing then increasing corresponds to the transition from "finding an efficient representation" to "overfitting," and can serve as an unsupervised early-stopping signal.
Highlights & Insights¶
- A unified framework covering diverse training dynamics: The same LID metric detects fine-tuning effects, grokking, training convergence, and overfitting across four distinct phenomena.
- Label-free diagnostic signal: Entirely grounded in the embedding geometry of training data, requiring no labeled validation set.
- Simple and effective heuristic: LID decreasing → improved generalization; LID increasing → memorization/overfitting — an intuitive and actionable rule.
- Carefully designed experiments: Cover encoders (RoBERTa/BERT) and decoders (GPT-2/tiny Transformer), as well as sequence labeling and classification settings.
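The "LID decreasing → generalization, LID increasing → memorization" heuristic suggests a label-free early-stopping monitor. The sketch below is my own operationalization, not the paper's procedure; the `patience` and `tol` parameters are invented knobs:

```python
def lid_early_stop(lid_history, patience=2, tol=1e-3):
    """Unsupervised early-stopping rule from the LID heuristic: stop once the
    mean LID has risen for `patience` consecutive epochs (memorization onset).

    lid_history: mean LID per epoch, measured on the training data only.
    """
    rises = 0
    for prev, cur in zip(lid_history, lid_history[1:]):
        rises = rises + 1 if cur > prev + tol else 0
        if rises >= patience:
            return True
    return False

# Pattern from the EmoWOZ experiment: sharp drop, then a gradual rise.
print(lid_early_stop([9.94, 7.25, 7.3, 7.6, 7.9]))  # True  -> stop
print(lid_early_stop([9.9, 8.5, 7.9, 7.6, 7.5]))    # False -> keep training
```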
Limitations & Future Work¶
- High computational cost: Requires extensive forward passes to build embeddings plus \(O(dN^2)\) nearest-neighbor search, limiting real-time monitoring.
- Strong TwoNN assumptions: The estimator requires approximately constant local density and a Poisson process assumption; its applicability to Transformer embeddings is validated only empirically.
- Absolute values are not cross-architecture comparable: LID absolute values depend on hyperparameters (\(M\), \(N\), \(L\)); only relative changes are meaningful for comparison.
- Causality not established: The relationship between LID decrease and generalization improvement is correlational rather than causal; theoretical explanation remains lacking.
- Validated only on smaller models: Experiments focus on RoBERTa-base and GPT-2-medium; scalability to models with 7B+ parameters is unknown.
Related Work & Insights¶
- LLM intrinsic dimensionality (Aghajanyan et al., 2021): Studies dimensionality in parameter space rather than embedding space, finding that larger models have lower parameter-space dimensionality.
- Global embedding dimensionality (Valeriani et al., 2023; Tulchinskii et al., 2023): Valeriani et al. track global dimensionality as data passes through an LLM; Tulchinskii et al. compare the global dimensionality of AI-generated and human-written text; neither performs localized analysis.
- Token-level dimensionality (Viswanathan et al., 2025): Analyzes token dimensionality within individual prompts; this paper subsamples from an entire dataset.
- Topological deep learning (Papamarkou et al., 2024): Geometric and topological methods for observational analysis of ML models; this paper extends the direction to training dynamics diagnosis.
- LoRA rank adaptation (Ed-dib et al., 2024): Adjusts LoRA rank based on the rank of hidden-state information matrices; the LID proposed here is complementary.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Local intrinsic dimension as an unsupervised diagnostic signal for training dynamics is a novel perspective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four independent experiments covering fine-tuning, grokking, convergence, and overfitting across diverse models and tasks.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear correspondence between theory and experiments; each experiment is framed around a well-defined research question.
- Value: ⭐⭐⭐⭐ — Provides a valuable geometric tool for LLM training monitoring, particularly meaningful in low-resource scenarios.