📚 Pretraining

🧠 NeurIPS 2025 · 50 paper notes

A Practical Guide for Incorporating Symmetry in Diffusion Policy

This paper presents a practical guide for incorporating symmetry into diffusion policies. Through three simple and composable methods — invariant representations (relative trajectory actions + eye-in-hand perception), equivariant visual encoders, and Frame Averaging — the proposed approach achieves performance on par with or exceeding fully equivariant diffusion policies across 12 MimicGen tasks, while substantially reducing implementation complexity.
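
As a concrete illustration of the third ingredient, Frame Averaging symmetrizes an arbitrary policy by averaging over a transformation group: \(f_{\mathrm{FA}}(x) = \frac{1}{|G|}\sum_{g \in G} g^{-1} \cdot f(g \cdot x)\). Below is a minimal sketch for planar rotations, assuming a policy mapping 2D observations to 2D actions and a cyclic-group discretization (both illustrative assumptions, not the paper's exact setup):

```python
import torch

def frame_average(model, obs_xy, n_rot=4):
    """Average g^{-1} · model(g · obs) over n_rot planar rotations so the
    wrapped policy is rotation-equivariant by construction."""
    outs = []
    for k in range(n_rot):
        theta = torch.tensor(2 * torch.pi * k / n_rot)
        c, s = torch.cos(theta), torch.sin(theta)
        R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])  # rotation g
        action = model(obs_xy @ R.T)   # f(g · x)
        outs.append(action @ R)        # apply g^{-1} to the output
    return torch.stack(outs).mean(dim=0)

# usage with a toy linear policy
policy = torch.nn.Linear(2, 2, bias=False)
print(frame_average(policy, torch.randn(5, 2)))
```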

AI Progress Should Be Measured by Capability-Per-Resource, Not Scale Alone: A Framework for Gradient-Guided Resource Allocation in LLMs

This position paper challenges "scaling fundamentalism" by proposing Capability-Per-Resource (CPR) as a replacement for raw scale as the primary measure of AI progress. The paper presents a gradient-guided resource allocation framework in which foundation model developers publish "gradient blueprint" metadata, enabling downstream adapters to fine-tune only a high-influence parameter subset while substantially reducing resource consumption and maintaining performance close to full-parameter fine-tuning.

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

This paper proposes the Alternating Gradient Flow (AGF) theoretical framework to explain the stepwise "saddle-to-saddle" feature learning dynamics in neural networks. Training is modeled as an alternating process between utility maximization for dormant neurons and cost minimization for active neurons, unifying feature selection analysis across diagonal linear networks, attention models, and modular addition. Predictions from AGF exhibit high agreement with actual gradient flow behavior.

An Empirical Investigation of Neural ODEs and Symbolic Regression for Dynamical Systems

This paper presents a systematic empirical study of the extrapolation capabilities of Neural ODEs (NODEs) and the equation recovery ability of Symbolic Regression (SR) for dynamical systems. It finds that NODEs can extrapolate to new boundary conditions under dynamically similar settings, and proposes a NODE→SR pipeline: training a NODE on only 10% of the original data to generate augmented trajectories, from which SR recovers 2/3 of the governing equations exactly and provides good approximations for an additional 1/3.

Beyond Benign Overfitting in Nadaraya-Watson Interpolators

By tuning a single bandwidth parameter \(\beta\) in the Nadaraya-Watson interpolator, this paper precisely characterizes the complete phase transition spectrum from catastrophic overfitting (\(\beta < d\)) → benign overfitting (\(\beta = d\)) → tempered overfitting (\(\beta > d\)), demonstrating that overestimating the intrinsic dimensionality of data is safer than underestimating it.
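
The phase transition is easiest to see with the singular-kernel form of the estimator, where the weight on training point \(x_i\) is \(\|x - x_i\|^{-\beta}\) and the estimator interpolates. A minimal sketch under this kernel family (an assumption for illustration; the paper's exact setup may differ):

```python
import numpy as np

def nw_interpolate(x_query, X_train, y_train, beta):
    """Nadaraya-Watson estimate with singular kernel K(u) = ||u||^(-beta).

    Weights diverge at training points, so the estimator interpolates;
    beta relative to the data dimension d selects the overfitting regime."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    if np.any(dists == 0):                 # exact hit: return the label
        return y_train[np.argmin(dists)]
    w = dists ** (-beta)
    return float(np.sum(w * y_train) / np.sum(w))
```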

Born a Transformer – Always a Transformer? On the Effect of Pretraining on Architectural Abilities

Through systematic study of a family of retrieval and copying tasks, this paper reveals that large-scale pretraining introduces a directional bias into Transformers (rightward/forward over leftward/backward), while failing to overcome fundamental architectural limitations on non-unique tasks. Fine-tuning can eliminate the directional bias but cannot surpass the boundaries of architectural expressiveness.

Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models

This paper proposes Multi-brain-tuning, a method that jointly fine-tunes pretrained speech models on fMRI data from multiple participants, reducing the data required for brain alignment by 5×, improving alignment by up to 50%, and generalizing to unseen participants and datasets.

Breaking the Frozen Subspace: Importance Sampling for Low-Rank Optimization in LLM Pretraining

This paper identifies that the dominant subspace in low-rank optimizers such as GaLore "freezes" during pretraining (cosine overlap between consecutive subspaces approaches 1), trapping weight updates within a fixed low-rank subspace. The authors propose SARA (Sampling-based Adaptive Rank Allocation), which constructs subspaces by sampling singular vectors according to singular value weights, provides convergence guarantees, and reduces the performance gap between low-rank optimizers and full-rank Adam by up to 46%.
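
The sampling step itself is compact: rather than always projecting onto the top-\(r\) singular vectors (which lets the subspace freeze), draw the basis at random with probability proportional to the singular values. A hedged sketch of that step only (not the paper's full optimizer; names are illustrative):

```python
import torch

def sample_projection_basis(grad, rank):
    """Sample `rank` left singular vectors of the gradient with probability
    proportional to their singular values, instead of deterministic top-r."""
    U, S, _ = torch.linalg.svd(grad, full_matrices=False)
    idx = torch.multinomial(S / S.sum(), rank, replacement=False)
    return U[:, idx]   # project updates as P.T @ grad, as in GaLore-style methods
```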

Broken Tokens: Your Language Model Can Secretly Handle Non-Canonical Tokenization

This paper reveals that LLMs can secretly handle non-canonical tokenizations (e.g., splitting "Hello" into "He" + "llo" instead of the canonical whole-word token): even when the input token sequence differs from anything seen during training, models exhibit surprising robustness. This capability stems from the fact that sub-word embeddings can be linearly combined to approximate the corresponding whole-word embedding.
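
To make "non-canonical" concrete, the sketch below builds two id sequences for the same surface string, assuming a GPT-2-style BPE vocabulary in which "He" and "llo" happen to be valid pieces (an illustrative assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

canonical = tok.encode("Hello")                    # canonical BPE segmentation
broken = tok.convert_tokens_to_ids(["He", "llo"])  # same string, different split

# Both id sequences decode to the same text; the paper's finding is that
# models remain surprisingly robust when fed the non-canonical ids.
print(tok.decode(canonical), tok.decode(broken))
```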

Conformal Risk Training: End-to-End Optimization of Conformal Risk Control

This paper extends Conformal Risk Control (CRC) from expected loss to the generalized Optimized Certainty-Equivalent (OCE) risk measure (encompassing tail risks such as CVaR), and proposes conformal risk training—an end-to-end approach that differentiates through the conformal risk control procedure during training, achieving provable risk guarantees while significantly improving average-case performance.
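
For context, the optimized certainty-equivalent risk has the standard form

\[
\mathrm{OCE}_\phi(Z) \;=\; \inf_{t \in \mathbb{R}} \bigl\{\, t + \mathbb{E}[\phi(Z - t)] \,\bigr\},
\]

where \(\phi(z) = \max(z, 0)/(1-\alpha)\) recovers \(\mathrm{CVaR}_\alpha\) and \(\phi(z) = z\) recovers the expectation, so CRC's original expected-loss guarantee is the special case \(\phi = \mathrm{id}\). (This is the textbook definition, stated here for orientation rather than taken from the paper.)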

Deep Compositional Phase Diffusion for Long Motion Sequence Generation

This paper proposes the Compositional Phase Diffusion framework, which employs SPDM and TPDM to handle semantic alignment and transition continuity, respectively, within the frequency-domain phase space established by ACT-PAE. The framework enables long-range compositional motion sequence generation and achieves state-of-the-art performance on BABEL-TEACH.

Differentiable Hierarchical Visual Tokenization

This paper proposes an end-to-end differentiable hierarchical visual tokenizer that adaptively partitions images into tokens at pixel-level granularity. It leverages information criteria for hierarchical model selection, serves as a drop-in replacement for the fixed patch tokenization in ViT, and additionally supports raster-to-vector conversion.

Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

This paper disaggregates language model performance on subject-verb agreement tasks by experimental condition, revealing multi-phase training dynamics obscured by aggregate metrics: models first learn frequency biases, then local context sensitivity, and finally develop general grammatical rules — a process involving multiple "hidden breakthroughs" rather than simple monotonic improvement.

Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

By defining the IsSameObject predicate and designing quadratic probes, this work demonstrates that large-scale pretrained ViTs — particularly DINO and CLIP — naturally develop object binding capabilities. This signal is encoded in a low-dimensional subspace and actively guides the attention mechanism, challenging the cognitive science community's view that ViTs lack binding ability.
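
A quadratic probe for a pairwise predicate such as IsSameObject can be read as a learned bilinear score over pairs of patch embeddings; the parameterization below is an assumed form for illustration, not necessarily the paper's exact probe:

```python
import torch
import torch.nn as nn

class QuadraticProbe(nn.Module):
    """Bilinear probe: score(i, j) = z_i^T W z_j + b, trained to predict
    whether patches i and j belong to the same object."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z_i, z_j):
        return torch.einsum("...d,de,...e->...", z_i, self.W, z_j) + self.b
```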

Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

This paper proposes Arnold, a scheduling system that aligns the communication patterns of LLM training (DP/PP groups) with the physical network topology of data centers. In simulation, Arnold reduces the maximum communication group span by 1.67×, and achieves a 10.6% end-to-end throughput improvement in production-scale training on 9,600+ GPUs.

Enhancing Training Data Attribution with Representational Optimization

This paper proposes AirRep (Attentive Influence Ranking Representation), a representation learning-based training data attribution method that employs a trainable encoder and attention pooling mechanism. AirRep achieves attribution accuracy on par with or superior to state-of-the-art gradient-based methods while being approximately 80× faster at inference.

Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

This paper explicitly formulates the "Final-Model-Only" (FiMO) setting for training data attribution (TDA), reframes the problem from measuring contribution to measuring sensitivity, proposes further training as the gold standard, and provides a unified derivation showing that various gradient-based methods (Grad-Dot, influence functions, TRAK, DataInf, etc.) are all approximations of further training at different orders.
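
The unification is easiest to see for Grad-Dot: one further SGD step on a training example \(z\), \(\theta' = \theta - \eta \nabla_\theta \ell(z;\theta)\), changes the test loss to first order by

\[
\ell(z_{\text{test}};\theta') - \ell(z_{\text{test}};\theta) \;\approx\; -\eta\, \nabla_\theta \ell(z_{\text{test}};\theta)^{\top} \nabla_\theta \ell(z;\theta),
\]

so the gradient dot product is precisely the first-order sensitivity of further training, and curvature-aware methods such as influence functions correspond to higher-order expansions (a standard Taylor argument, sketched here for intuition).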

Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking

Using grokking (delayed generalization) as a causal probe, this paper demonstrates that relative flatness is a (potentially) necessary condition for generalization, whereas neural collapse, despite frequently co-occurring with generalization, is not necessary — it is merely one pathway toward flatness.

Gemstones: A Model Suite for Multi-Faceted Scaling Laws

This work releases the Gemstones model suite — an open-source collection of over 4,000 checkpoints spanning 50M–2B parameters and diverse width-depth ratios. Through systematic experimentation, the paper demonstrates that scaling laws are highly sensitive to design choices such as model selection, learning rate scheduling, and cooldown strategies, and proposes a convex-hull-based fitting method to improve scaling law stability under sparse sampling.

Generalization Bounds for Rank-sparse Neural Networks

This paper establishes generalization bounds that exploit the approximate low-rank structure of neural network weight matrices. When the Schatten \(p\) quasi-norm is small, the sample complexity reduces to \(\widetilde{O}(WrL^2)\), where \(W\), \(L\), and \(r\) denote the width, depth, and rank of the weight matrices, respectively.

Global Minimizers of Sigmoid Contrastive Loss

This work provides the first rigorous characterization of the global minimizer geometry of the Sigmoid contrastive loss (SigLIP) with trainable temperature and bias in the practically relevant regime \(N \gg d\). It introduces a novel combinatorial object called the \((m, b_\text{rel})\)-Constellation, and uses it to explain retrieval success, the modality gap phenomenon, and to propose an explicit relative bias parameterization that improves training dynamics.

Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks

This paper proposes Gradient-Weight Alignment (GWA), which quantifies the directional consistency (cosine similarity) between the gradient of each training sample and the model weights. During training, GWA accurately predicts generalization performance, identifies the optimal early stopping point, and localizes influential training samples—all without requiring a validation set.
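
The core quantity is a single cosine similarity; a minimal sketch for one sample (aggregation across samples and training steps follows the paper, and the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def gradient_weight_alignment(model, loss_fn, x, y):
    """Cosine similarity between one sample's loss gradient and the
    flattened model weights."""
    model.zero_grad()
    loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    return F.cosine_similarity(g, w, dim=0)
```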

How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models?

By introducing a "domain-restricted pre-training + OOD testing" evaluation framework, this paper reveals that stateful architectures such as Mamba and RWKV suffer from degraded base capabilities. It identifies the key design principle of "arbitrary selection over the full sequence" (full-sequence visibility + real relation calculation + non-uniform distribution), and validates this principle using a minimalist Top-1 Element/Chunk Selection architecture that recovers base capabilities to near-Transformer levels.

Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

Through systematic analysis of over 1,400 language model checkpoints—spanning Transformer/Mamba/RWKV architectures, 14M–12B parameter scales, and two training datasets—evaluated on 110K+ tokens of text, this work demonstrates that all autoregressive language models exhibit highly consistent behavioral phases during pre-training: predicted probabilities sequentially overfit to n-gram probabilities of increasing order. Three simple heuristics—word frequency, n-gram probability, and semantic similarity—account for up to 98% of behavioral variance.

Learning in Compact Spaces with Approximately Normalized Transformer

This paper proposes anGPT (Approximately Normalized GPT), which exploits the concentration of vector norms in high-dimensional spaces to replace per-sample exact normalization with simple scalar multiplication. The method achieves 40% convergence speedup over GPT+ (with QK-norm) while eliminating weight decay and learning rate warmup, incurring only 3% runtime overhead.
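
The concentration argument behind the speedup can be sanity-checked in a few lines: when activation norms concentrate, dividing by one precomputed scalar approximates exact per-sample normalization. An illustrative sketch, not the paper's exact scheme:

```python
import torch

x = torch.randn(8, 4096)                     # high-dim activations
exact = x / x.norm(dim=-1, keepdim=True)     # per-sample normalization
approx = x / torch.tensor(4096.0).sqrt()     # one scalar for all samples
print((exact - approx).norm(dim=-1))         # small: norms concentrate near sqrt(d)
```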

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

This paper reveals that LLMs learn spurious correlations between syntactic templates (PoS n-grams) and domains, leading to sharp performance drops in cross-domain settings. Furthermore, this correlation can be exploited to bypass safety refusal mechanisms, reducing the refusal rate from 40% to 2.5% on OLMo-2.

Learning to Flow from Generative Pretext Tasks for Neural Architecture Encoding

This paper proposes FGP (Flow-based Generative Pre-training), which trains an encoder to reconstruct a flow surrogate — a lightweight representation of architectural information flow — enabling encoders of arbitrary structure to capture information flow without specialized asynchronous message-passing designs. FGP achieves up to 106% improvement in Precision@1% on performance prediction.

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

This paper proposes the Residual Alignment Model (RAM), which formalizes the LLM alignment process as importance sampling and decomposes a large model into a frozen Proposal Module and a trainable lightweight Residual Aligner. Using fewer than 1/8 of the parameters, RAM achieves alignment performance comparable to or exceeding full-parameter SFT/DPO, while also resolving the first-token latency problem.

Memory Mosaics at Scale

Memory Mosaics v2 scales associative memory networks to 10B parameters trained on 1T tokens, substantially outperforming same-scale—and even 8T-token-trained—Transformers on new-task learning and in-context learning.

Mouse-Guided Gaze: Semi-Supervised Learning of Intention-Aware Representations for Reading Detection

This paper proposes a semi-supervised framework that uses mouse trajectories as weak supervision signals to pretrain gaze representations, followed by fine-tuning on labeled data to distinguish reading from scanning behavior. At inference time, only gaze signals are used, enabling hands-free assistive reading detection.

Nemotron-CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

NVIDIA proposes the CLIMB framework, which automatically discovers optimal pre-training data mixture ratios through embedding-based clustering and iterative bootstrapped search. On a 1B-scale model, CLIMB outperforms Llama-3.2-1B by 2.0%; the authors also release the 1.2T-token ClimbLab corpus and the 400B-token ClimbMix high-quality dataset.

Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data

This paper provides the first provable convergence guarantee that gradient flow (GF) on two-layer ReLU networks with small initialization converges to a Neural Collapse (NC) solution on orthogonally separable data, revealing the critical role of GF's implicit bias—early neuron alignment followed by asymptotic maximum-margin bias—in driving the emergence of NC.

One Prompt Fits All: Universal Graph Adaptation for Pretrained Models

This paper theoretically proves that representation-level graph prompts are essentially equivalent to linear probes, and on this basis proposes UniPrompt—an input-level method based on a learnable kNN topological prompt graph. By fusing the prompt graph with the original graph via a bootstrapping strategy, UniPrompt consistently outperforms existing graph prompt learning methods on both in-domain and cross-domain few-shot node classification.

Optimal Online Change Detection via Random Fourier Features

This paper proposes the Online RFF-MMD algorithm, which approximates the MMD statistic via random Fourier features and embeds it into a sequential testing framework over a binary grid. The method achieves online nonparametric change detection without requiring training data or window size parameters, with \(\mathcal{O}(r \log n)\) time and space complexity, and establishes minimax optimality of the detection delay.
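
The batch form of the statistic is compact: with \(r\) random Fourier features \(z(x)\), \(\widehat{\mathrm{MMD}}^2 = \lVert \bar z(X) - \bar z(Y) \rVert^2\). A minimal sketch for a Gaussian kernel (the paper's online variant maintains these feature means incrementally over the binary grid):

```python
import numpy as np

def rff_mmd2(X, Y, r=128, sigma=1.0, seed=0):
    """Estimate MMD^2 between samples X and Y with r random Fourier
    features of the Gaussian kernel: z(x) = sqrt(2/r) * cos(Wx + b)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(r, X.shape[1]))
    b = rng.uniform(0.0, 2.0 * np.pi, size=r)
    feat = lambda Z: np.sqrt(2.0 / r) * np.cos(Z @ W.T + b)
    diff = feat(X).mean(axis=0) - feat(Y).mean(axis=0)
    return float(diff @ diff)
```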

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

This paper proposes a set of power-law scaling relations for weight decay \(\lambda\) and batch size \(B\) in LLM pre-training. By introducing the concept of an AdamW timescale \(\tau\), it unifies hyperparameter scaling relationships, enabling accurate prediction of optimal hyperparameters prior to large-scale training.
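
One common way to make the timescale concrete (an assumption about the paper's exact convention, following the usual AdamW analysis): weight decay forgets old updates over roughly \(1/(\eta\lambda)\) optimizer steps, so

\[
\tau_{\text{iter}} = \frac{1}{\eta\,\lambda}, \qquad \tau_{\text{epoch}} = \frac{B}{\eta\,\lambda\,D},
\]

where \(\eta\) is the learning rate, \(B\) the batch size, and \(D\) the dataset size; hyperparameter settings that hold \(\tau\) fixed then transfer across scales.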

Predict Training Data Quality via Its Geometry in Metric Space

This paper proposes a training data diversity metric based on Persistent Homology (PH), demonstrating that geometric and topological structural features of data can effectively predict model performance, outperforming traditional entropy-based metrics such as Vendi Score.

PRESCRIBE: Predicting Single-Cell Responses with Bayesian Estimation

PRESCRIBE is a framework that jointly models epistemic uncertainty (model unfamiliarity with inputs) and aleatoric uncertainty (inherent randomness of biological systems) in single-cell perturbation prediction via multivariate deep evidential regression. It generates a pseudo E-distance as a unified uncertainty proxy; filtering unreliable predictions based on this metric yields accuracy improvements exceeding 3%.

Quantifying Task-Relevant Representational Similarity Using Decision Variable Correlation

This paper proposes Decision Variable Correlation (DVC), a novel metric for quantifying trial-by-trial consistency between two neural representations on classification tasks. The authors find that higher ImageNet accuracy in deep networks is associated with lower DVC relative to monkey V4/IT, and that neither adversarial training nor large-scale dataset pretraining closes this gap.

Retrospective In-Context Learning for Temporal Credit Assignment with Large Language Models

This paper proposes RICL (Retrospective In-Context Learning), which estimates the advantage function by comparing the log-probabilities an LLM policy assigns before and after a retrospective in-context update, converting sparse environment feedback into dense training signals for temporal credit assignment. RICL achieves up to 100× the sample efficiency of conventional Monte Carlo methods and matches the convergence of traditional RL methods on BabyAI tasks; building on it, the paper further introduces RICOL, an online learning framework.

Scalable Fingerprinting of Large Language Models

This paper proposes Perinucleus sampling to generate scalable LLM fingerprints, enabling the embedding of 24,576 fingerprints in Llama-3.1-8B—two orders of magnitude more than existing methods—without degrading model capability. Theoretical and empirical analyses demonstrate that large-scale fingerprinting is essential for defending against collusion attacks.

Scaling Embedding Layers in Language Models

This paper proposes Scone, a method that learns contextualized embeddings for high-frequency n-grams using a separate Transformer model, and offloads these embeddings to main memory/SSD at inference time. This enables a new scaling paradigm in which additional compute is consumed during training without increasing accelerator resource usage at inference. A 1B-parameter Scone model surpasses a 1.9B baseline.

Superposition Yields Robust Neural Scaling

This paper identifies representational superposition as the core driver of neural scaling laws: in the strong-superposition regime, loss universally scales inversely with model dimension (\(L \propto 1/m\)), independent of the specific form of the data frequency distribution—consistent with empirical scaling behavior in real LLMs.

The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

This paper systematically dissects the internal mechanisms of LLMs in in-context retrieval augmented QA using the AttnLRP attribution method. Three functionally specialized attention head types are identified — Task heads (middle layers, parsing instructions/questions), Retrieval heads (later layers, verbatim copying of contextual answers), and Parametric heads (encoding parametric knowledge) — and their functions are validated via Function Vector injection and source-tracking probes, achieving ROC AUC ≥94% on Llama-3.1/Mistral/Gemma.

The Curse of Depth in Large Language Models

This paper identifies the root cause of deep-layer degradation in Pre-LN Transformers—exponential growth of output variance causing deep layers to collapse into identity mappings—and proposes a parameter-free LayerNorm Scaling (LNS) strategy that multiplies the LayerNorm output by \(1/\sqrt{\ell}\), compressing variance growth from exponential to polynomial. LNS consistently reduces perplexity by 5–8% across scales from 130M to 7B parameters.
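
Since the fix is stated explicitly, it admits a direct sketch (a minimal module assuming a 1-based layer index \(\ell\)):

```python
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm Scaling (LNS): multiply the LayerNorm output by
    1/sqrt(layer_idx) to damp variance growth across Pre-LN layers."""
    def __init__(self, dim, layer_idx):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = layer_idx ** -0.5

    def forward(self, x):
        return self.ln(x) * self.scale
```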

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

From the geometric perspective of the river-valley loss landscape, this paper analyzes why the Schedule-Free (SF) optimizer can continuously track the optimal solution during language model pre-training without requiring learning rate decay or weight averaging. It further reveals that SF implicitly performs weight averaging, and proposes an improved SF-AdamW that decouples the momentum and averaging window parameters.

Understanding and Enhancing Mask-Based Pretraining towards Universal Representations

This paper employs high-dimensional linear regression theory to precisely characterize the effect of masking ratio on test risk in mask-based pretraining via a bias-variance decomposition, revealing that the optimal masking ratio depends on both the downstream task and model size. Building on this theory, the paper proposes R2MAE (Random Ratio MAE), which consistently outperforms fixed masking ratios across vision, language, DNA, and single-cell modeling benchmarks.
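
The random-ratio idea reduces to a one-line change in the masking step: draw the ratio per batch instead of fixing it. A minimal sketch (the sampling bounds are illustrative assumptions):

```python
import torch

def random_ratio_mask(num_tokens, lo=0.3, hi=0.9):
    """Draw a masking ratio uniformly from [lo, hi] for this batch and
    return a boolean mask (True = masked, to be reconstructed)."""
    ratio = torch.empty(1).uniform_(lo, hi).item()
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    mask[torch.randperm(num_tokens)[: int(ratio * num_tokens)]] = True
    return mask
```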

Vocabulary Customization for Efficient Domain-Specific LLM Deployment

This paper proposes a BPE tokenizer expansion algorithm that guarantees monotonically non-increasing encoding length, appending domain-frequent tokens to the Llama 3.1 vocabulary (+30K tokens). In an e-commerce setting, the approach shortens input sequences by 20% and improves inference throughput by 20–30%. After 10K steps of continual training, model quality is fully preserved, and in approximately 98% of cases the model actively generates the newly added tokens.

ZEUS: Zero-shot Embeddings for Unsupervised Separation of Tabular Data

ZEUS is the first zero-shot clustering method for tabular data. By pretraining a Transformer encoder on synthetic datasets, it learns generalizable representations that enable high-quality clustering of new datasets in a single forward pass, requiring no additional training or hyperparameter tuning.