📚 Pretraining

🔬 ICLR2026 · 27 paper notes

A Law of Data Reconstruction for Random Features (and Beyond): This paper establishes a data reconstruction law in random feature models from information-theoretic and algebraic perspectives: when the parameter count \(p \gg dn\) (where \(d\) is the data dimension and \(n\) is the number of samples), training data can be fully reconstructed. A projection-loss-based optimization method is proposed and the universality of this threshold is validated on RF models, two-layer networks, and ResNets.
Block-Sample MAC-Bayes Generalization Bounds: This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds, which partition the training data into \(J\) blocks and replace the monolithic KL divergence with a sum of per-block conditional KL divergences. The resulting bounds remain finite and meaningful even in settings where the original PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms such as mean estimation). The paper further establishes that a high-probability (PAC) version of these bounds is generally unattainable.
CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images: This work introduces CHAMMI-75—the largest heterogeneous multi-channel microscopy image pre-training dataset (2.8M images, 75 sources, 25 channel types, 16 species)—and demonstrates that imaging modality diversity is the key factor for improving generalization of multi-channel models. The trained MorphEm model achieves state-of-the-art performance on 6 out of 7 benchmarks.
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training: This work constructs Common Corpus — the largest legally licensed LLM pre-training dataset at approximately 2 trillion tokens — spanning 6 major collections (government, culture, science, code, web, and semantic), covering multiple languages including low-resource ones. All data originates from copyright-free or permissively licensed sources, accompanied by complete data provenance and a multi-stage filtering pipeline. The dataset has been adopted by industry leaders including Anthropic.
Deconstructing Positional Information: From Attention Logits to Training Biases: This paper proposes a unified analytical framework based on Toeplitz matrices, categorizing positional encodings into additive (Absolute/T5/ALiBi) and multiplicative (RoPE) types. Through synthetic tasks, it reveals that RoPE exhibits significant advantages on position-sensitive tasks but suffers from a "single-head deposit pattern" in shallow layers, where nearly all positional reasoning concentrates in a single attention head. The paper further provides a theoretical proof that this pattern is an intrinsic property of RoPE's multiplicative structure.
Emergent Misalignment is Easy, Narrow Misalignment is Hard: Fine-tuning on narrow-domain harmful data induces broad misalignment (emergent misalignment) because "general misalignment" constitutes a simpler and more efficient solution in parameter space than "misalignment confined to a specific domain"—the general solution exhibits smaller parameter norm and greater robustness to perturbations.
Explaining Grokking and Information Bottleneck through Neural Collapse Emergence: This work provides a unified explanation of two prominent late-stage training phenomena—Grokking (delayed generalization) and the Information Bottleneck compression phase—through the lens of Neural Collapse. It proves that the contraction of population within-class variance is the common key factor underlying both phenomena, and reveals that training loss convergence and the onset of Neural Collapse operate on distinct timescales governed by weight decay.
FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition: This work introduces the FictionalQA dataset and generation pipeline, which synthesizes webtext-style documents and QA pairs about fictional events to study both factual memorization and verbatim memorization in LLM training under controlled conditions. Key findings show that greater surface-form diversity facilitates knowledge acquisition, while concise structured lists are least conducive to generalization.
Identifying and Evaluating Inactive Heads in Pretrained LLMs: This paper systematically evaluates 12 scoring functions for identifying inactive attention heads in LLMs, finding that the attention head output norm-based scoring function (AHON LN) more consistently identifies inactive heads across model families than traditional attention weight metrics. On average across 14 models, over 12% of heads can be zeroed out while maintaining MMLU accuracy within 1%.
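To make the idea of a norm-based head score concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not the paper's exact AHON LN definition: the function names, the per-token averaging, and the `fraction` parameter are assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def head_output_norms(head_outputs: List[List[Vector]]) -> List[float]:
    """Average L2 norm of each head's output over tokens.

    head_outputs[h][t] is the (hypothetical) output vector of head h at token t.
    """
    scores = []
    for tokens in head_outputs:
        norms = [math.sqrt(sum(x * x for x in vec)) for vec in tokens]
        scores.append(sum(norms) / len(norms))
    return scores


def select_inactive(scores: List[float], fraction: float) -> List[int]:
    """Return indices of the lowest-scoring heads (assumed 'inactive')."""
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda h: scores[h])
    return sorted(order[:k])


# Toy example: 4 heads, 1 token each, 2-dimensional outputs.
outputs = [[[3.0, 4.0]], [[0.0, 0.0]], [[1.0, 0.0]], [[6.0, 8.0]]]
scores = head_output_norms(outputs)
inactive = select_inactive(scores, fraction=0.25)
```

In a real model the selected heads would then be zeroed out and accuracy re-measured; that step depends on the model implementation and is omitted here.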
Imagine How To Change: Explicit Procedure Modeling for Change Captioning: ProCap reframes change captioning from static image-pair comparison to dynamic procedure modeling. In the first stage, a procedure encoder is trained via frame interpolation and masked reconstruction to capture spatiotemporal change dynamics; in the second stage, learnable process queries implicitly infer the change procedure, surpassing state-of-the-art methods on three benchmarks.
Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank: By analyzing the gradient flow dynamics of deep matrix factorization (deep linear networks) in matrix completion, this paper proves that coupled dynamics is the key mechanism underlying the low-rank implicit bias of deep networks, and that networks of depth \(\geq 3\) inevitably exhibit coupling except under diagonal initialization. This provides a theoretical explanation for why deep models are able to avoid loss of plasticity.
Intrinsic Training Dynamics of Deep Neural Networks: This paper investigates when the trajectory in parameter space under gradient flow training of deep neural networks can be "lifted" to a low-dimensional intrinsic space and expressed as an intrinsic Riemannian gradient flow. It proposes an intrinsic recoverability criterion based on conservation laws and extends the results to ReLU networks and linear networks of arbitrary depth.
Lossless Vocabulary Reduction for Auto-Regressive Language Models: This paper proposes a theoretical framework for Lossless Vocabulary Reduction (LVR), which converts any auto-regressive language model into an exactly equivalent model operating over an arbitrary sub-vocabulary via nested tokenization. Building on the Maximal Common Vocabulary (MCV), the framework enables efficient ensembling of language models with heterogeneous tokenization schemes, with effectiveness validated on GSM8K, MATH, translation, and other tasks.
MoMa: A Simple Modular Deep Learning Framework for Material Property Prediction: MoMa is a modular material property prediction framework that trains task-specific modules across multiple tasks and stores them centrally in a MoMa Hub, then applies a training-free Adaptive Module Composition (AMC) algorithm driven by representation similarity to assemble customized models for downstream tasks, achieving an average improvement of 14% over the strongest baseline across 17 datasets.
Polynomial, trigonometric, and tropical activations: This paper systematically explores learnable activation function families based on orthogonal bases (Hermite polynomials, Fourier trigonometric basis) and tropicalization, addressing the gradient explosion/vanishing problem of polynomial activations via variance-preserving initialization, and successfully replacing GELU in GPT-2 and ConvNeXt to enable stable training.
Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning: This paper proposes the Warmup-Stable-Only (WSO) learning rate schedule—completely eliminating the decay phase during pre-training. Despite yielding worse pre-training metrics, WSO consistently outperforms all decay-based schedules after SFT. Loss landscape analysis reveals that WSO's advantage stems from maintaining flatter minima.
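The schedule described above can be sketched as a small function. This is an illustrative sketch only: the linear warmup shape and the names `wso_lr`, `peak_lr`, and `warmup_steps` are assumptions, not details taken from the paper.

```python
def wso_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-Stable-Only: warm up to peak_lr, then hold it with no decay phase.

    Assumes a linear warmup; the paper may use a different warmup shape.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Stable phase: the learning rate stays at its peak until training ends.
    return peak_lr


# Learning rates for the first few steps of a toy run.
schedule = [wso_lr(s, peak_lr=1.0, warmup_steps=4) for s in range(8)]
```

A decay-based schedule would differ only in the final branch, where the rate would shrink toward zero near the end of training.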
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums: This paper proposes the Training Re-evaluation Curve (TREC), a diagnostic tool that records the loss of the fully trained model when it is re-evaluated on the training data seen at each timestep, thereby guiding optimal placement of high-quality data. The paper further demonstrates that the shape of TREC can be predicted via the implicit EMA coefficient of AdamW, enabling curriculum design without any actual training runs.
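As background on the "implicit EMA coefficient", the standard AdamW update can be rewritten as an exponential moving average. The derivation below is a sketch under the usual decoupled weight decay formulation, with learning rate \(\eta\), weight decay \(\lambda\), and Adam-normalized update \(u_t\); the symbols are chosen for this note, and the paper's exact formulation may differ.

```latex
\begin{align}
\theta_{t+1} &= (1 - \eta\lambda)\,\theta_t - \eta\, u_t \\
             &= (1 - \alpha)\,\theta_t + \alpha \left(-\tfrac{1}{\lambda}\, u_t\right),
             \qquad \alpha = \eta\lambda \\
\theta_T     &= (1 - \alpha)^{T}\,\theta_0
               + \alpha \sum_{t=0}^{T-1} (1 - \alpha)^{T-1-t} \left(-\tfrac{1}{\lambda}\, u_t\right)
\end{align}
```

Under this view the weights are an EMA of past updates with coefficient \(\alpha = \eta\lambda\), so an update from roughly \(1/(\eta\lambda)\) or more steps ago contributes little to the final parameters.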
RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization: This paper proposes RECON, a class- and pose-agnostic canonical orientation normalization method that corrects arbitrary canonical representations produced during training via a simple right translation, enabling unsupervised instance-level symmetry discovery, OOD pose detection, and a plug-and-play test-time canonicalization layer.
Reducing Class-Wise Performance Disparity via Margin Regularization: This paper proposes MR2 (Margin Regularization for performance disparity Reduction), which dynamically adjusts class-dependent margins in both the logit and representation spaces. Grounded in theoretically derived generalization bounds, MR2 reduces class-wise performance disparity while simultaneously improving overall accuracy.
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook: This paper proposes SemHiTok — a tokenizer that unifies visual understanding and generation via a Semantic-Guided Hierarchical Codebook (SGHC): pixel sub-codebooks are constructed on top of a pretrained semantic codebook, with structure and training fully decoupled (stage-wise optimization) to avoid the semantic–pixel conflict in joint training. Under the LLaVA setting, SemHiTok achieves state-of-the-art performance in both understanding and reconstruction among discrete tokenizers.
Steering Language Models with Weight Arithmetic: This paper proposes Contrastive Weight Steering, which extracts behavioral direction vectors from the weight difference between models fine-tuned on positive and negative behavioral data, and directly modifies model weights to control behavior. The method demonstrates superior generalization and consistency compared to Activation Steering across experiments on sycophancy, malicious behavior, and refusal.
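The weight-difference idea can be illustrated with a toy sketch in plain Python. It assumes that steering is a simple linear shift of the base weights scaled by a coefficient `alpha`; the names `steering_direction`, `apply_steering`, and `alpha` are hypothetical, and real models would hold tensors rather than lists.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def steering_direction(pos: Weights, neg: Weights) -> Weights:
    """Per-parameter difference between the positive and negative fine-tunes."""
    return {
        name: [p - n for p, n in zip(pos[name], neg[name])]
        for name in pos
    }


def apply_steering(base: Weights, direction: Weights, alpha: float) -> Weights:
    """Shift the base weights along the direction (assumed linear, scaled by alpha)."""
    return {
        name: [b + alpha * d for b, d in zip(base[name], direction[name])]
        for name in base
    }


# Toy weights with a single parameter tensor "w".
pos = {"w": [1.0, 2.0]}
neg = {"w": [0.0, 4.0]}
base = {"w": [10.0, 10.0]}

direction = steering_direction(pos, neg)
steered = apply_steering(base, direction, alpha=0.5)
```

A negative `alpha` would push the model away from the positive behavior instead of toward it, under the same linear assumption.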
Stochastic Self-Organization in Multi-Agent Systems: This paper proposes SelfOrg, a framework that dynamically constructs directed acyclic communication graphs (DAGs) based on semantic similarity of agent responses and Shapley value contribution estimates, enabling self-organized collaboration in multi-agent systems. The approach is particularly effective in weak-model settings.
TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling: This paper proposes TASTE (Text-Aligned Speech Tokenization and Embedding), which aligns speech tokens with text transcriptions via a cross-attention mechanism, enabling high-quality speech reconstruction at an extremely low bitrate (~150 bps). This design makes text-speech joint modeling straightforward and efficient; the resulting 1.3B-parameter TASLM outperforms 7B pretrained SLMs.
Token-level Data Selection for Safe LLM Fine-tuning: This paper proposes TOSS (Token-level data Selection for Safe LLM fine-tuning), the first token-level data selection framework that evaluates the safety risk of each token via the loss difference between a safety-degraded reference model and a utility-oriented reference model, achieving a superior safety-utility tradeoff compared to sample-level methods.
Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization: This paper reinterprets the structured second-order moment estimation of Shampoo and SOAP through the lens of KL divergence minimization, reveals their inherent limitations, and proposes two practical methods—KL-Shampoo and KL-SOAP—that match or surpass the original methods without requiring Adam grafting.
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors: This paper explains, from the perspective of gradient signals, why Transformers trained with next-token prediction (NTP) learn features that appear "useless" for predicting the immediate next token. It proposes a decomposition of gradient pathways into three components — direct learning, pre-caching, and circuit sharing — and validates this framework on toy tasks, OthelloGPT, and language models.
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors: By decomposing training gradient signals into three components — direct, pre-cached, and circuit sharing — this paper explains why Transformers trained with NTP learn features that appear "useless" for predicting the current next token. The framework is validated on OthelloGPT, small language models, and a pre-trained LLM (Gemma 2).