📚 Pretraining
🧪 ICML2026 · 17 paper notes
📌 Same area in other venues: 💬 ACL2026 (7) · 📷 CVPR2026 (8) · 🔬 ICLR2026 (26) · 🤖 AAAI2026 (5) · 🧠 NeurIPS2025 (46) · 📹 ICCV2025 (9)
🔥 Top topics: LLM ×7 · Diffusion Models ×3
- Annotations Mitigate Post-Training Mode Collapse
-
The authors observe that SFT aligns models to a low-entropy semantic prior, producing an inverse scaling effect: "the larger the instruction model, the more boring." They propose "annotation-anchored training": during pretraining, documents are paired with semantic tags; during SFT, the loss on tag tokens is masked. At inference, the model first samples the semantics and then generates the response, which narrows the semantic diversity gap by 85% while retaining instruction-following ability.
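The loss-masking step in the note above can be sketched in plain Python. This is a minimal illustration, assuming per-token log-probabilities and a boolean tag mask are already available; the function name `masked_nll` and the toy numbers are hypothetical and not taken from the paper.

```python
import math

def masked_nll(token_log_probs, is_tag_token):
    """Mean negative log-likelihood over tokens that are NOT tag tokens."""
    kept = [-lp for lp, is_tag in zip(token_log_probs, is_tag_token) if not is_tag]
    if not kept:
        return 0.0  # nothing to train on in this sequence
    return sum(kept) / len(kept)

# Hypothetical sequence: two tag tokens followed by two response tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
is_tag = [True, True, False, False]

loss = masked_nll(log_probs, is_tag)

# Changing the model's probabilities on tag tokens leaves the loss unchanged,
# so SFT gradients do not flow through the tag positions.
loss_other_tags = masked_nll(
    [math.log(0.9), math.log(0.01), math.log(0.5), math.log(0.125)], is_tag
)
```

In a tensor framework the same effect is usually obtained by assigning masked positions an ignored label; the list-based version here only shows which tokens contribute to the objective.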
- Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner
-
This work systematically compares continuous diffusion, discrete masked diffusion, and looped transformers in terms of expressiveness and trainability. It proves that continuous diffusion is strictly more expressive than discrete diffusion and can simulate looped transformers, but that its practical performance is limited by decoding and by the representation space. Based on this, it proposes CCDD (Coevolutionary Continuous Discrete Diffusion), which diffuses simultaneously in the discrete token space and in the contextual embedding space of a pretrained LLM, with a single model denoising both jointly. On LM1B/OWT, it reduces perplexity by 25-35% relative to MDLM, and with only 8 sampling steps it surpasses MDLM's 256-step performance.
- CoFrGeNet: Continued Fraction Architectures for Language Generation
-
This work introduces the function class of "continued fractions," known for optimal rational approximation, into language generation Transformers. CoFrNet-based replacement modules are designed for multi-head attention (CAttnU/CAttnM) and for the FFN (Cffn). By leveraging the closed-form "continuants," the \(d\) divisions of a depth-\(d\) continued fraction are reduced to a single division. On GPT2-xl and Llama-3.2B, downstream performance is matched or even improved with only \(\frac{1}{2}\) to \(\frac{2}{3}\) of the parameters.
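To make the "single division" remark concrete, here is a small sketch using the textbook simple continued fraction \(a_0 + 1/(a_1 + 1/(\dots + 1/a_d))\). It is an assumption that this form is close enough to illustrate the idea; the paper's actual parameterization of its modules may differ, and the function names are hypothetical.

```python
from fractions import Fraction

def eval_nested(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(... + 1/ad)) bottom-up: one division per level."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value

def eval_continuants(coeffs):
    """Same value via the continuant recurrences, followed by one final division.

    p_n = a_n * p_{n-1} + p_{n-2},  q_n = a_n * q_{n-1} + q_{n-2}
    """
    p_prev, p = 1, coeffs[0]  # p_{-1}, p_0
    q_prev, q = 0, 1          # q_{-1}, q_0
    for a in coeffs[1:]:
        p_prev, p = p, a * p + p_prev
        q_prev, q = q, a * q + q_prev
    return Fraction(p, q)     # the only division

coeffs = [1, 2, 2, 2]
nested = eval_nested(coeffs)
closed = eval_continuants(coeffs)
```

Both routes give the same rational number; the recurrence uses only multiplications and additions until the last step, which is the property the note refers to.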
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
-
This paper proposes Compute as Teacher (CaT). A frozen anchor model "synthesizes" a pseudo-reference answer from the \(G\) rollouts already sampled by GRPO. In unverifiable domains, the model itself then derives binary rubrics from the pseudo-reference and scores each rollout against them to produce the RL reward. This approach directly transforms inference compute into supervision signals without any human annotation. On HealthBench, CaT achieves up to 30% improvement over baselines and matches or surpasses inference-time aggregation with 9× lower test-time compute.
- Consistent Diffusion Language Models
-
This paper points out that discrete diffusion lacks a continuous-domain probability-flow ODE counterpart, making direct consistency modeling infeasible. The authors propose using an exact closed-form posterior bridge as a "stochastic PF-ODE surrogate" in the discrete domain, constructing a Multi-Path Discrete Consistency (MPDC) training objective. This requires the denoiser's predictions to be consistent in expectation across multiple stochastic bridge paths, enabling single-stage, teacher-free training of Consistent Diffusion Language Models (CDLM) that can generate high-quality text in 2-3 steps. CDLM achieves SOTA in unconditional/conditional text generation and up to \(32\times\) speedup over AR models.
- Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning
-
This work systematically investigates the role of data difficulty in SFT, finding that there is no "universally optimal difficulty." Instead, the optimal difficulty shifts toward harder data as data scale increases. This is explained via a trade-off between the "in-distribution generalization gap" and the "extrapolation gap," with a PAC-Bayes interpretation.
- Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning
-
This paper addresses cross-task gradient conflict in multi-task instruct-tuning by proposing Badit. First, SVD is used to decompose the pretrained weights into a set of naturally orthogonal, high-singular-value LoRA "basic ability" experts. Then, during training, spherical K-means dynamically groups the rank-1 components into mutually orthogonal groups, shifting the traditional paradigm of "parameter isolation by task" to "decoupling by basic ability." On six LLMs, Badit achieves an average improvement of 2.68 ROUGE points over GainLoRA.
- Edit-Based Refinement for Parallel Masked Diffusion Language Models
-
ME-DLM adds a lightweight "decode-then-edit" stage to masked diffusion language models (e.g., LLaDA): the first stage performs standard unmasking to generate a draft, and the second stage applies parallel corrections using three types of token-level edits (replace/delete/insert). The supervision signal is derived from the shortest edit script (edit distance). With only 1/8 of the diffusion steps, it surpasses LLaDA-Instruct by 11.6 points on HumanEval and 33.6 points on GSM8K.
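The "shortest edit script" mentioned above can be illustrated with the standard Levenshtein dynamic program plus a backtrace. This is a generic sketch, not the paper's training pipeline; the function names and the toy token sequences are hypothetical.

```python
def edit_script(draft, target):
    """Return a shortest list of token-level edits turning draft into target."""
    n, m = len(draft), len(target)
    # dist[i][j] = edit distance between draft[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if draft[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # keep or replace
                             dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1)         # insert
    # Walk back from the bottom-right corner to recover one optimal script.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and draft[i - 1] == target[j - 1]
        if same and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("keep", draft[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and not same and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", draft[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", draft[i - 1]))
            i -= 1
        else:
            ops.append(("insert", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops

def apply_script(ops):
    """Rebuild the target sequence from an edit script."""
    out = []
    for op in ops:
        if op[0] == "keep" or op[0] == "insert":
            out.append(op[1])
        elif op[0] == "replace":
            out.append(op[2])
        # "delete" contributes nothing to the output
    return out

draft = ["return", "a", "+", "b", "b"]
target = ["return", "a", "-", "b"]
ops = edit_script(draft, target)
num_edits = sum(1 for op in ops if op[0] != "keep")
```

Each position of the draft ends up labeled with one edit type, which is the kind of per-token signal a parallel correction stage could be trained to predict.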
- Focus and Dilution: The Multi-stage Learning Process of Attention
-
In a simplified setting where a single-layer Transformer learns Markov data, this work analyzes gradient flow by performing stage-wise linearization around a sequence of critical points, rigorously characterizing the recurring "focus–dilution" cycles in attention training. Consistent phenomena are observed on WikiText and TinyStories.
- From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing
-
This paper systematically analyzes why backward spreading works in locate-then-edit (LTE) editing and why it is insufficient. It then proposes forward replay: treating the first decisive layer as the optimization variable and obtaining the targets of subsequent layers via standard forward propagation. This approach consistently improves over MEMIT/RECT/PRUNE/AlphaEdit without extra computational cost.
- InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
-
The authors propose InfoLaw, which recasts pretraining as a process of "bucket-wise information accumulation": the information in each bucket equals "quality density \(f_d\) × unique token count \(M_d\) × \(\log K\)" multiplied by a factor that decays exponentially with the repetition count \(R_d\). The final validation loss is expressed as \(L = \alpha\cdot\text{info}^{-\beta}\), which can be fitted on 252M-1.2B models and extrapolated to 7B / 425B tokens with an average error of 0.15% and a maximum error of 0.96%. This formulation can be used directly to search for the optimal data recipe.
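As a rough illustration of how a law of the form \(L = \alpha\cdot\text{info}^{-\beta}\) can be fitted and then extrapolated, the sketch below runs a least-squares fit in log-log space. The data points are synthetic and generated from made-up constants; how "info" is computed per bucket is not reproduced here, and the function name is hypothetical.

```python
import math

def fit_power_law(info, loss):
    """Least-squares fit of log L = log(alpha) - beta * log(info)."""
    xs = [math.log(v) for v in info]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    alpha = math.exp(mean_y - slope * mean_x)
    return alpha, -slope  # beta is the negated slope

# Synthetic small-scale runs (made-up constants, for illustration only).
true_alpha, true_beta = 5.0, 0.1
small_info = [1e9, 4e9, 1.6e10]
small_loss = [true_alpha * v ** (-true_beta) for v in small_info]

alpha, beta = fit_power_law(small_info, small_loss)

# Extrapolate to a larger information budget than any fitted point.
predicted = alpha * (1e12) ** (-beta)
```

With noisy real measurements the fit would not be exact, and the quality of the extrapolation depends on how well the information measure captures data quality and repetition.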
- Model Merging Scaling Laws in Large Language Models
-
The authors empirically establish, using 10,866 merged models, a dual-axis power law of the form \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\): the base model size \(N\) determines the floor, the number of experts \(k\) determines the tail, and four mainstream merging methods (Average, TA, TIES, DARE) all share the same curve. This transforms the questions of "how many experts to merge" and "when to stop merging" into predictable, budgetable engineering problems.
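The shape of the law \(L=L_*+BN^{-\beta}+A_0 N^{-\gamma}/(k+b)\) is easy to explore numerically. The sketch below plugs in made-up constants (none are fitted values from the paper) to show how the marginal benefit of merging one more expert shrinks as \(k\) grows.

```python
def merged_loss(N, k, L_star, B, beta, A0, gamma, b):
    """Loss predicted by L = L_* + B*N^-beta + A0*N^-gamma / (k + b)."""
    return L_star + B * N ** (-beta) + A0 * N ** (-gamma) / (k + b)

# Hypothetical constants chosen only to make the curve visible.
params = dict(L_star=1.8, B=400.0, beta=0.3, A0=50.0, gamma=0.25, b=1.0)
N = 1e9  # base model size

losses = [merged_loss(N, k, **params) for k in range(1, 9)]
# Gain from adding the (k+1)-th expert, for k = 1..7.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
# Floor reached as k grows without bound: the tail term vanishes.
floor = params["L_star"] + params["B"] * N ** (-params["beta"])
```

Under this form, a stopping rule such as "stop merging once the predicted gain falls below a threshold" reduces to inspecting `gains`, which is the sense in which the note calls the decision budgetable.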
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
-
This paper uses a carefully controlled set of Sudoku/Rush Hour tasks, in which "reasoning difficulty is fixed and only horizon length varies," to systematically demonstrate that task horizon is itself an independent root cause of RL training collapse in LLM agents. It proposes two horizon-reduction mechanisms, macro actions and subgoal decomposition, which not only stabilize training but also enable strong zero-shot generalization to longer horizons (horizon generalization).
- Predicting Large Model Test Losses with a Noisy Quadratic System
-
This paper proposes the Noisy Quadratic System (NQS), a mechanistic model that expresses LLM test loss as \(L(N, B, K)\) (model size / batch size / update steps) and is the first scaling law to model batch size explicitly. On Pythia + OWT2, it extends the extrapolation range from Chinchilla's ~20× compute to ~4000× compute.
- Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models
-
The authors decompose traditional Softmax attention into two independent components, non-negativization and L1 normalization, and demonstrate that the truly critical part is the L1 normalization rather than the exponential. They replace the exponential with Softplus plus a dynamic length scaling factor to obtain LSSA, and then apply a power-function-based "re-weighting" to sharpen the attention. The resulting LSSAR keeps validation loss nearly unchanged at 16× the training length and enables GPT-109M to "rediscover" Newton's law of universal gravitation from trajectory data.
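The decomposition described above can be shown on a single row of attention scores. This is a simplified sketch: the dynamic length scaling factor is omitted because its form is not given in the note, and `power_reweight` is a hypothetical stand-in for the power-function re-weighting, not the paper's exact operator.

```python
import math

def l1_normalize(weights):
    """Divide non-negative weights by their sum so they add up to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def softmax(scores):
    """Softmax = exponential (non-negativization) followed by L1 normalization."""
    m = max(scores)  # shift for numerical stability; the result is unchanged
    return l1_normalize([math.exp(s - m) for s in scores])

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_attention(scores):
    """Swap the exponential for Softplus, keep the L1 normalization."""
    return l1_normalize([softplus(s) for s in scores])

def power_reweight(weights, p):
    """Hypothetical sharpening step: raise weights to a power p > 1, renormalize."""
    return l1_normalize([w ** p for w in weights])

scores = [2.0, 1.0, -1.0]
sm = softmax(scores)
sp = softplus_attention(scores)
sharp = power_reweight(sp, 3.0)
```

Softplus grows linearly rather than exponentially for large scores, so the normalized weights are flatter than softmax weights; the re-weighting step then restores sharpness by concentrating mass on the largest entries.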
- Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
-
The authors derive closed-form training dynamics for a simplified single-layer linear-attention Transformer, proving that regularization methods can only alter the convergence speed and cannot shift the convergence point (so they are almost doomed to fail in cFKA scenarios), while data replay can directly shift the convergence point and amplify oscillations to stabilize old knowledge. They further propose STOC, which prunes fragments based on token attention contribution and guides the pretrained model to generate replay corpora. STOC consistently suppresses forgetting better than LAMOL on synthetic data, KnowEdit, and the IndustryCorpus legal corpus.
- Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics
-
The authors formulate Transformer self-attention as a mean-field particle system modeling token interactions, treat LoRA as a low-rank perturbation, and prove that forgetting is governed by two phase transition curves related to the "perturbation norm" and "network depth." They provide a long-term stability condition controlled by the eigenvalue gap of \(V\).