📚 Pretraining

🔬 ICLR2026 · 79 paper notes

📌 Same area in other venues: 📷 CVPR2026 (5) · 💬 ACL2026 (12) · 🧪 ICML2026 (27) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (51) · 📹 ICCV2025 (9)

🔥 Top topics: LLM ×16 · Diffusion Models ×5

A Law of Data Reconstruction for Random Features (and Beyond)

This paper demonstrates an information-theoretic and algebraic "data reconstruction law" in random feature (RF) models: when the number of parameters satisfies \(p \gg dn\) (where \(d\) is the data dimension and \(n\) is the number of samples), the training data can be fully reconstructed. The universality of this threshold is verified across RF models, two-layer networks, and ResNets using a projection-loss optimization method.

Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms

This paper proposes the first unified benchmark for PU learning, systematically addressing two key issues: (1) implementing model selection without negative samples using Proxy Accuracy and Proxy AUC; (2) identifying and resolving the Internal Label Shift problem in the one-sample setting through a simple calibration method that merges positive samples into the unlabeled set, enabling fair comparison of two-sample algorithms over one-sample evaluations.

ADEPT: Continual Pretraining via Adaptive Expansion and Dynamic Decoupled Tuning

ADEPT discovers that the contributions of different layers and parameter units in LLMs to "general competence" are highly non-uniform. Consequently, it only replicates the layers least important to the general domain to create new capacity and assigns asymmetric learning rates within these expanded layers based on unit importance. In continual pretraining (CPT) for math and medical domains, this method injects new knowledge with almost no damage to general competence—tuning only 15% of parameters in less than 50% of the training time, yet achieving 5.76% higher performance on general benchmarks and 5.58% higher on domain benchmarks compared to full-parameter CPT.

Autoregressive Models Rival Diffusion Models at Any-Order Generation

This paper proposes A3 (Any-order Any-subset Autoregressive modeling), which reintegrates the "any-order, any-subset" flexibility of diffusion language models into the autoregressive framework. By using group-wise factorization to preserve the multi-layer dependency modeling capabilities of AR, and employing two-stream attention with a progressive curriculum to smoothly transform a pre-trained AR model into an any-order generator, A3 comprehensively outperforms diffusion language models of the same scale while using significantly less training data.

Avey-B: Refactoring Attention-Free Architectures into Bidirectional Encoders

Avey-B transforms the originally autoregressive, attention-free Avey architecture into a BERT-style bidirectional encoder by removing causal masks, decoupling static weights and dynamic similarity into alternating layers, applying row normalization to dynamic layers, and integrating a neural compressor within the ranker. Consequently, it consistently outperforms BERT/RoBERTa/ModernBERT/NeoBERT in token classification and information retrieval, using approximately \(11\times\) fewer pre-training tokens than ModernBERT while achieving \(3.38\times\) faster throughput at a context length of 96K.

Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Addressing the overlooked fact that "long text \(\neq\) long-range dependency," this paper proposes LongFilter. It quantifies the "information gain from extended context" by comparing a language model's prediction distributions under long vs. short context for each token. Samples that are long but predictable using only local context are filtered out. Continuing pretraining LLaMA-3-8B (8K \(\rightarrow\) 64K) with filtered data yields an average improvement of over 2 points on HELMET, LongBench, and RULER, achieving equivalent performance with approximately half the data.
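
The per-token comparison described above can be sketched as follows. This is a minimal illustration, not the paper's scoring rule: it assumes the "information gain" is measured as the KL divergence between the long-context and short-context next-token distributions, averaged over tokens, and the names `long_context_gain`, `keep_sample`, and the `threshold` value are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def long_context_gain(long_dists, short_dists):
    """Average per-token divergence between long- and short-context predictions.

    long_dists[t] and short_dists[t] are the next-token distributions for token t
    computed with the extended and the truncated context, respectively.
    """
    gains = [kl_divergence(p, q) for p, q in zip(long_dists, short_dists)]
    return sum(gains) / len(gains)


def keep_sample(long_dists, short_dists, threshold=0.05):
    """Keep a document only if extended context changes predictions enough.

    The threshold is an assumed hyperparameter, chosen here for illustration.
    """
    return long_context_gain(long_dists, short_dists) >= threshold
```

A document whose predictions are identical under both contexts scores zero and is dropped, which matches the intuition that it is long but locally predictable.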

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

This paper proposes Future Summary Prediction (FSP): adding an auxiliary head to the standard next-token prediction to predict a compact summary of a long-range future sequence (instead of predicting future tokens one by one). Two summary construction methods are provided: a manual bag-of-words summary (FSP-BoW) and a learned summary distilled from a reverse language model (FSP-RevLM). Large-scale pretraining experiments at the 3B/8B scale demonstrate that FSP consistently outperforms Next-Token Prediction (NTP) and Multi-Token Prediction (MTP) on mathematical, reasoning, and coding tasks, with improvements up to 4–5 percentage points in mathematics.
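
A bag-of-words future summary can be sketched as a multi-hot target built from the tokens in a future window. The window definition (the `window` tokens immediately after position `t`) and the function name `future_bow_targets` are assumptions for illustration; the paper's exact construction may differ.

```python
def future_bow_targets(token_ids, vocab_size, window):
    """Build one multi-hot vector per position marking which tokens occur soon after it.

    targets[t][v] is 1.0 if token v appears among the `window` tokens following
    position t, and 0.0 otherwise. Order and counts inside the window are discarded.
    """
    targets = []
    for t in range(len(token_ids)):
        future = token_ids[t + 1 : t + 1 + window]
        multi_hot = [0.0] * vocab_size
        for tok in future:
            multi_hot[tok] = 1.0
        targets.append(multi_hot)
    return targets
```

An auxiliary head trained against such targets predicts which tokens will appear, not their order, which is what makes the summary compact compared with predicting each future token separately.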

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

This paper systematically broadens the design space of "metadata conditioning for accelerating LLM pretraining." Beyond the known effectiveness of prepending URLs, the authors discover that fine-grained quality scores and domain information can similarly accelerate training. They propose two new mechanisms—"appending metadata as an auxiliary prediction task" and "learnable meta tokens"—and use layer-wise probing to reveal how these signals reshape latent representations.

Block-Sample MAC-Bayes Generalization Bounds

This paper proposes block-sample MAC-Bayes (mean approximately correct) generalization bounds. By partitioning training data into \(J\) blocks and replacing the global KL divergence with the sum of KL divergences conditioned on each block, the framework provides finite, meaningful generalization error bounds in scenarios where traditional PAC-Bayes bounds are vacuous (e.g., deterministic learning algorithms like mean estimation). It also demonstrates that high-probability versions of this bound are generally infeasible.

Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

This paper points out a fatal flaw in the practice widely relied upon by frontier teams—comparing data recipes using small proxy models with fixed hyperparameters. Dataset rankings can be flipped by minor changes in the learning rate. The authors propose training proxy models with an extremely small learning rate (\(10^{-5}\sim10^{-6}\)) as a simple patch, which improves the Spearman correlation of rankings from a proxy (GPT2-125M) to a target model (Pythia-1B) from \(<0.75\) to \(>0.95\) across 23 data recipes.
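
For readers unfamiliar with the metric, the Spearman correlation between proxy and target rankings can be computed as below. This sketch uses the textbook formula for rankings without ties; the function names are illustrative and nothing here reproduces the paper's data.

```python
def ranks(values):
    """Return 1-based ranks of the values (smallest value gets rank 1; assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(proxy_scores, target_scores):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(proxy_scores)
    rx, ry = ranks(proxy_scores), ranks(target_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Each list holds one score per data recipe, for example a benchmark average from the proxy model and from the target model. A value near 1 means the proxy orders the recipes the same way the target does.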

CHAMMI-75: Pre-training multi-channel models with heterogeneous microscopy images

The authors construct CHAMMI-75—the largest heterogeneous multi-channel microscopy image pre-training dataset (2.8M images, 75 sources, 25 channel types, 16 species)—demonstrating that imaging modality diversity is the key factor for improving multi-channel model generalization. The trained MorphEm model achieves SOTA on 6 out of 7 benchmarks.

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

This paper introduces Common Corpus, the largest legally authorized LLM pre-training dataset, with approximately 2 trillion tokens. It covers 6 major collections (Government, Culture, Science, Code, Web, Semantic) across multiple languages (including low-resource languages). All data originates from public domain or permissively licensed sources, with complete data provenance and a multi-stage filtering pipeline. It has already been adopted by industry leaders such as Anthropic.

Conditioned Initialization for Attention

This paper theoretically attributes the optimization stability of attention layers to the condition number of their Jacobian. It proposes "Conditioned Initialization"—initializing the value matrix as a rectangular identity matrix and the query/key matrices as semi-orthogonal matrices (both having a condition number of 1). This tightens the upper bound of the Jacobian condition number at the start of training, consistently accelerating convergence (by 20–30%) and improving generalization across various Transformer tasks including image classification, detection/segmentation, language modeling, and long sequences.
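
The claim that a rectangular identity matrix has condition number 1 can be checked directly: its smaller Gram matrix is the identity, so all nonzero singular values equal 1. The sketch below covers only the value-matrix part; building semi-orthogonal query/key matrices (for instance via a QR decomposition) is omitted, and the helper names are hypothetical.

```python
def rectangular_identity(rows, cols):
    """Matrix with ones on the main diagonal and zeros elsewhere (rows x cols)."""
    return [[1.0 if i == j else 0.0 for j in range(cols)] for i in range(rows)]


def smaller_gram(m):
    """Return M^T M when rows >= cols, otherwise M M^T.

    For a semi-orthogonal matrix this product is the identity, which means every
    nonzero singular value is 1 and the condition number is therefore 1.
    """
    rows, cols = len(m), len(m[0])
    if rows >= cols:
        return [[sum(m[k][i] * m[k][j] for k in range(rows)) for j in range(cols)]
                for i in range(cols)]
    return [[sum(m[i][k] * m[j][k] for k in range(cols)) for j in range(rows)]
            for i in range(rows)]
```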

Deconstructing Positional Information: From Attention Logits to Training Biases

Based on a unified Toeplitz matrix framework, the authors categorize positional encoding (PE) into additive (Absolute/T5/ALiBi) and multiplicative (RoPE) types. Through synthetic tasks, they find that RoPE holds significant advantages in position-sensitive tasks but exhibits a "single-head deposit pattern"—where positional reasoning in shallow layers is almost entirely concentrated in a single attention head. This pattern is theoretically proven to be an inherent property of RoPE's multiplicative structure.

Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling

This paper systematically deconstructs the gains and losses of "Distilled Pretraining (DPT)" under the modern LLM paradigm. It finds that distillation significantly enhances test-time scaling (pass@k diversity) but simultaneously impairs in-context learning (weakening induction heads). Using a bigram sandbox, the authors prove that these opposing effects stem from the same mechanism: distillation only benefits high-entropy distributions and is unhelpful or even harmful for low-entropy deterministic mappings. Finally, practical pretraining suggestions such as token routing are provided.

Dual-objective Language Models: Training Efficiency Without Overfitting

Without modifying any model architecture, this work linearly mixes autoregressive (AR) and masked-diffusion (MD) training objectives using a weight \(\alpha\) on the same Transformer. This allows the model to possess both the high training efficiency of AR and the anti-overfitting capabilities of MD. The authors trained 50 models of 470M parameters to systematically sweep for the optimal \(\alpha\) under different data repetition counts, concluding that "hybrid training is superior to single-objective training in all settings."
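
Written out, the linear mixture could take the following form. The summary does not say which objective carries the weight \(\alpha\), so the convention below is an assumption:

\[
\mathcal{L}(\alpha) = \alpha\,\mathcal{L}_{\mathrm{AR}} + (1-\alpha)\,\mathcal{L}_{\mathrm{MD}}, \qquad \alpha \in [0, 1]
\]

Under this convention, \(\alpha = 1\) recovers pure autoregressive training and \(\alpha = 0\) recovers pure masked-diffusion training.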

DUET: Optimizing LLM Training Data Mixtures via Noisy Feedback from Unseen, Downstream Evaluation Tasks

DUET addresses realistic scenarios where "evaluation task data is unseen and only multiple rounds of coarse noisy feedback are available" by iteratively optimizing LLM training data mixtures through "global Bayesian Optimization for domain ratios + local influence functions for high-quality sample selection." It provides convergence proofs and significantly outperforms methods requiring fine-grained data information, such as DoReMi and LESS, across multiple language tasks.

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

This paper proposes H-Net, a hierarchical sequence model that replaces BPE tokenization with a learnable "Dynamic Chunking (DC)" mechanism. The network automatically learns where to split chunks and the granularity of compression on byte-level inputs in an end-to-end differentiable manner. Under compute and data alignment, a single-stage H-Net outperforms BPE-based Transformers, while a two-stage H-Net matches the performance of token-level models twice its size.

Emergent Misalignment is Easy, Narrow Misalignment is Hard

The study finds that fine-tuning on narrow-domain harmful data leads to broad-spectrum "emergent misalignment" (EM) because a "general misalignment" solution is a simpler and more efficient point in the parameter space—possessing a smaller parameter norm and greater stability against noise.

Energy-Based Transformers are Scalable Learners and Thinkers

This paper reformulates "prediction" as "gradient descent optimization on a learned verifier (energy function)" and proposes Energy-Based Transformers (EBTs). This architecture enables cross-modal and cross-task System 2 thinking capabilities (dynamic compute allocation + self-verification) to emerge purely through unsupervised pre-training, outperforming Transformer++ and DiT in both language and vision domains.

Explaining Grokking and Information Bottleneck through Neural Collapse Emergence

This work provides a unified explanation for Grokking (delayed generalization) and Information Bottleneck (compression phase) from the perspective of Neural Collapse. It demonstrates that the contraction of population intra-class variance is the common underlying factor and reveals that a distinct time-scale, controlled by weight decay, separates training loss convergence from the emergence of Neural Collapse.

FictionalQA: A Dataset for Studying Memorization and Knowledge Acquisition

The authors propose the FictionalQA dataset and a generation pipeline. By synthesizing webtext-style documents and QA pairs regarding fictional events, they study the dual processes of factual and verbatim memorization during LLM training in a controlled environment. The study finds that more diverse surface forms facilitate knowledge acquisition, whereas concise structured lists are the least conducive to generalization.

FoNE: Precise Single-Token Number Embeddings via Fourier Features

FoNE maps arbitrary numbers directly into single-token embeddings using a set of sine and cosine functions with different periods (Fourier features). Each digit occupies only 2 dimensions, bypassing tokenization fragmentation and frequency bias. A 38M Transformer trained from scratch outperforms fine-tuned Llama-3.2-1B in addition, subtraction, and multiplication, being the only method to achieve 100% accuracy on 100,000 test samples.
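
The two-dimensions-per-digit idea can be sketched for non-negative integers as follows. The periods \(10, 100, 1000, \dots\) and the angle-based decoding are assumptions made for illustration; the paper's embedding and decoding details (including handling of decimals) may differ, and the function names are hypothetical.

```python
import math


def fone_embed(x, num_digits):
    """Map an integer to 2 * num_digits features: (cos, sin) of 2*pi*x / 10^i for each i."""
    feats = []
    for i in range(1, num_digits + 1):
        angle = 2 * math.pi * x / 10 ** i
        feats.extend([math.cos(angle), math.sin(angle)])
    return feats


def fone_decode(feats):
    """Recover the integer digit by digit from the (cos, sin) pairs."""
    digits = []
    for i in range(1, len(feats) // 2 + 1):
        period = 10 ** i
        c, s = feats[2 * (i - 1)], feats[2 * (i - 1) + 1]
        angle = math.atan2(s, c) % (2 * math.pi)
        # The angle encodes x mod 10^i; rounding absorbs floating-point error.
        residue = int(round(angle / (2 * math.pi) * period)) % period
        # The i-th digit is the leading digit of that residue.
        digits.append(residue // (period // 10))
    return sum(d * 10 ** k for k, d in enumerate(digits))
```

For example, `fone_embed(4072, 4)` produces 8 features, and `fone_decode` maps them back to the original integer, so the whole number fits in one token-sized vector without being split by a tokenizer.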

GneissWeb: Preparing High Quality Data for LLMs at Scale

GneissWeb distills approximately 10T high-quality tokens from the 15T FineWeb dataset using "sharded exact substring deduplication + an ensemble of novel complementary quality filters." This allows a 7B model to outperform the FineWeb-trained version by an average of 2.73 percentage points across 11 benchmarks, filling the gap between "small but refined" (<5T) and "large but coarse" (>15T) datasets.

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

The authors identify a natural conflict between "ascending quality data curriculum" and "learning rate (LR) decay." High-quality data is intentionally placed at the end of training but coincides with the stage where the LR is decayed to its minimum, resulting in minimal update steps and wasted data. By utilizing "gentle decay + replacing decay with model averaging," the study improves average benchmark scores by 1.64% relative to random shuffling on a 1.5B model / 30B tokens by simply rearranging the data.

How Text Quality Interventions Reshape Neural Scaling Laws for LLMs: Empirical Study

The authors construct QualityPajama, a suite of 23 datasets with different quality interventions, and train 2000+ models to systematically measure how "filtering / deduplication / LLM rewriting" reshapes all five parameters of neural scaling laws. They find that data interventions simultaneously change both scaling coefficients and exponents (unlike architectural changes which mainly affect coefficients), causing the compute-optimal token-to-parameter ratio to fluctuate by orders of magnitude, thereby establishing scaling law analysis as a principled framework for evaluating data strategies.

How to Train Data-Efficient LLMs

This paper systematically compares 22 data selection strategies for LLM pre-training. It proposes Ask-LLM, which uses instruction-tuned LLMs to directly provide quality scores, and Density, which performs coverage sampling based on kernel density estimation. The study finds that quality filtering (Ask-LLM) can outperform full-scale training while converging 70% faster even when keeping only 10% of the data, whereas coverage sampling typically only "matches" the full-scale performance.

Identifying and Evaluating Inactive Heads in Pretrained LLMs

This paper systematically evaluates 12 score functions for identifying inactive attention heads in LLMs. The study finds that score functions based on head output norms (AHON LN) identify inactive heads more consistently across model families than traditional attention weight metrics. On average, across 14 models, more than 12% of heads can be zeroed out while maintaining MMLU accuracy within 1%.

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

This paper proposes the ProCap framework, which reframes change captioning from static image-pair comparison to dynamic procedure modeling. The first stage trains a procedure encoder to learn spatio-temporal change dynamics through frame interpolation and masked reconstruction, while the second stage employs learnable procedure queries to implicitly infer change processes. ProCap outperforms SOTA on three datasets.

Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank

By analyzing the gradient flow dynamics of deep matrix factorization (deep linear networks) in matrix completion tasks, this work proves that coupled dynamics are the key mechanism for low-rank implicit bias. It demonstrates that networks with depth \(L \geq 3\) inevitably exhibit coupling (except for diagonal initialization), thereby explaining why deep models avoid the loss of plasticity.

Intrinsic Training Dynamics of Deep Neural Networks

This paper investigates when parameter space trajectories in deep neural network gradient flow training can be "lifted" to a low-dimensional intrinsic space and represented as an intrinsic Riemannian gradient flow. It proposes an intrinsic recoverability criterion based on conservation laws and generalizes the results to ReLU networks of arbitrary depth and linear networks.

Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

The trillion-token scale pre-training data selection problem is reformulated as a "learnable mask" task. By using grouped policy gradients to simultaneously optimize both quality and diversity metrics, the method is 98.9% faster than greedy algorithms. It selects 1.5T of FineWeb-Mask from the 15T FineWeb dataset, achieving average improvements of 3.2% and 1.9% on 1.5B and 7B models, respectively.

Late-to-Early Training: Enabling LLMs to Learn Late-Stage Knowledge Earlier for Faster and Better Training

LET uses the final-layer representations of a significantly smaller (up to 10×) open-source pre-trained model to align with the early-layer representations of the target large model during early training steps. This allows the large model to "prematurely" acquire knowledge that would otherwise only form in later stages, achieving approximately 1.6× acceleration and nearly a 5% improvement in downstream accuracy on 1.4B/7B scales.

Learned Meta-Tokens for Language Modeling

During pre-training, a set of learnable meta-tokens is randomly injected into sequences, paired with a sparse meta-attention that flows exclusively between meta-tokens. This enables these tokens to compress and "cache" previous context as content anchors, allowing small models trained on <100B tokens to achieve length generalization up to 2× the context window, while providing an information-theoretic explanation of "meta-tokens sharpening positional encodings."

Learning Facts at Scale with Active Reading

The model is allowed to generate a set of "learning strategies" (paraphrasing, self-testing, knowledge association, analogy, etc.) for each document, which are then used to synthesize diverse training data to efficiently embed closed-form knowledge into parameters. The 8B WikiExpert outperforms 405B Llama and 236B DeepSeekV2 on SimpleQA.

LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures

The successful Joint Embedding Predictive Architecture (JEPA) from computer vision is adapted for LLMs for the first time. By adding a latent space objective—"predicting Code embeddings from Text embeddings"—alongside the standard next-token reconstruction loss, this method significantly outperforms standard fine-tuning and pre-training across four model families and four datasets without sacrificing generative capabilities or suffering from overfitting.

LLM Pretraining with Continuous Concepts

This paper proposes CoCoMix, which goes beyond standard next-token prediction by having the model predict high-level concepts extracted via SAE and filtered by attribution. These concepts are compressed into continuous vectors and interleaved into the Transformer hidden state sequence, achieving higher efficiency in language modeling, downstream reasoning, and controllable generation compared to vanilla NTP and knowledge distillation.

Lossless Vocabulary Reduction for Auto-Regressive Language Models

This paper proposes the theoretical framework of Lossless Vocabulary Reduction (LVR). Using nested tokenization, any auto-regressive language model can be converted exactly into an equivalent model over an arbitrary sub-vocabulary. Building on the Maximal Common Vocabulary (MCV), the method enables efficient ensembling of language models with different tokenization schemes, and its effectiveness is validated on tasks such as GSM8K, MATH, and translation.

MrRoPE: Mixed-radix Rotary Position Embedding

This paper re-examines RoPE from the perspective of "radix conversion" and proposes a unified framework, MrRoPE. It explains extrapolation methods such as PI, NTK, and YaRN as different mixed-radix conversion strategies. Based on this, MrRoPE-Pro (Progressive Radix Conversion) is designed to double the retrieval and dialogue accuracy of YaRN on 128K long contexts without fine-tuning.

Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

A domain-agnostic pipeline utilizing "lynx layout rendering + lightweight LLM cleaning" is proposed to reliably extract and standardize math/code content from Common Crawl. This constructs Nemotron-CC-Math (133B tokens), the highest-quality open-source math pre-training corpus to date, which consistently outperforms FineMath, MegaMath, and OpenWebMath across math, code, and general knowledge tasks.

Next-ToBE: Probabilistic Next Token-Bag Exploitation for Activating Anticipatory Capacity in LLMs

Next-ToBE replaces the one-hot target of standard NTP with a "soft token-bag distribution" covering multiple tokens within a future window. Without adding any extra parameters, it activates the latent "anticipatory planning" capability of LLMs, consistently outperforming strong baselines like MTP in math, code, and common-sense reasoning.
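
A soft token-bag target of the kind described above can be sketched as a normalized weighting over the tokens in a future window. The geometric `decay` weighting is an assumption chosen for illustration, not the paper's weighting scheme, and `token_bag_target` is a hypothetical name.

```python
def token_bag_target(token_ids, position, vocab_size, window, decay=0.5):
    """Soft target for `position`: a distribution over the next `window` tokens.

    The immediate next token gets weight 1, the one after gets `decay`, then
    `decay ** 2`, and so on. Weights are summed per token id and normalized.
    """
    weights = [0.0] * vocab_size
    future = token_ids[position + 1 : position + 1 + window]
    for offset, tok in enumerate(future):
        weights[tok] += decay ** offset
    total = sum(weights)
    if total == 0.0:
        return weights  # no future tokens left at the end of the sequence
    return [w / total for w in weights]
```

With `window=1` this reduces to the usual one-hot next-token target, so the standard objective is a special case and no extra parameters are needed.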

Not All Documents Are What You Need for Extracting Instruction Tuning Data

Addressing the issue that "extracting instruction tuning QA data from web corpora is expensive and noisy," this paper proposes EQUAL. It aligns document and QA feature spaces using contrastive learning before clustering, treats each document cluster as an arm of a Multi-Armed Bandit (MAB), and utilizes Optimal Transport (OT) scores to measure how closely a cluster's projected QA distribution matches the target distribution. Through iterative "cluster selection—extraction—update," it reduces extraction costs by 5–10 times while increasing downstream accuracy by approximately 2.5%.

OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

OptimSyn transforms the manual task of "writing rubrics for synthetic data" into a learnable policy. It utilizes gradient-based influence scores to measure the actual contribution of each synthetic QA pair to target model training. These scores serve as rewards to train a rubric generator via GRPO, consistently achieving higher downstream accuracy in knowledge-intensive fields like humanities, social sciences (HSS), and medicine compared to mainstream open-source SFT corpora.

Polynomial, trigonometric, and tropical activations

This paper systematically explores a family of learnable activation functions based on orthogonal bases (Hermite polynomials, Fourier trigonometric bases) and tropicalization. By addressing the gradient explosion/vanishing issues of polynomial activations through variance-preserving initialization, it successfully replaces GELU to achieve effective training on GPT-2 and ConvNeXt.

Pre-training Limited Memory Language Models with Internal and External Knowledge

LMLM (Limited Memory Language Model) inserts entity-level factual query calls into the corpus during the pre-training phase and masks the retrieved factual values from the loss. This forces the model to learn "when to query" rather than memorizing by rote. Consequently, a small 382M model approaches LLaMA2-7B in factual accuracy and allows for one-click unlearning by modifying the database.

Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

This paper proposes the Warmup-Stable-Only (WSO) learning rate scheduling strategy, which completely eliminates the learning rate decay phase during pre-training. Although this results in worse pre-training metrics, it consistently outperforms all decay strategies after SFT. Loss landscape analysis reveals that the superiority of WSO stems from its ability to maintain flatter minima.
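
The difference between a warmup-stable-only schedule and a schedule with a decay phase can be sketched as below. Linear warmup and linear decay to zero are assumptions for illustration; the summary does not specify the shapes used in the paper, and both function names are hypothetical.

```python
def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: linear warmup, then hold the peak learning rate forever."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def wsd_lr(step, peak_lr, warmup_steps, total_steps, decay_steps):
    """Comparison baseline: same warmup and plateau, plus a final linear decay to zero."""
    lr = wso_lr(step, peak_lr, warmup_steps)
    decay_start = total_steps - decay_steps
    if step >= decay_start:
        lr *= max(0.0, (total_steps - step) / decay_steps)
    return lr
```

The two schedules agree until the decay phase begins; only the baseline shrinks the learning rate at the end of pre-training.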

Pre-training under Infinite Compute

When compute far exceeds available web data, the authors use "heavy regularization + model ensemble + joint parameter/ensemble scaling + distillation" to compress the pre-training loss of a fixed 200M token budget to an asymptotic value of 3.17. This achieves a 5.17× data efficiency gain over standard recipes, with 83% of the ensemble benefits retained even when distilled into an 8× smaller student model.

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums

The authors propose the Training Re-evaluation Curve (TREC) as a diagnostic tool. By analyzing the loss of training data at each timestamp using the final model, they guide the optimal placement of high-quality data. They demonstrate that the shape of the TREC can be predicted via the implicit EMA coefficient of AdamW, enabling the design of data curriculums without actual training.

Pretraining Scaling Laws for Generative Evaluations of Language Models

This paper proposes and systematically compares three sets of pretraining scaling laws for "generative evaluation" (tasks with verifiable binary rewards like math problem-solving, scored via pass@k). These laws use pretraining compute, parameters + training tokens, and log-likelihood of gold reference solutions as independent variables to fit and extrapolate pass@k. It reveals that the sampling count \(k\) is a new lever for controlling scaling behavior and predictability, discovers that the parameters of the "gold reference likelihood" law are exceptionally stable across nearly five orders of magnitude, and theoretically proves that the compute law is the "compute-optimal envelope" of the parameter + token law.
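
For reference, pass@k is commonly estimated from \(n\) samples with \(c\) correct ones using the unbiased estimator \(1 - \binom{n-c}{k} / \binom{n}{k}\). The sketch below implements that standard estimator; whether the paper uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement is correct.

    n: total samples generated per problem, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Raising \(k\) increases the score for any problem with at least one correct sample, which is why the sampling count acts as a lever on the measured scaling behavior.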

Pretraining with Hierarchical Memories: Separating Long-Tail and Common Knowledge

This paper proposes attaching a massive "hierarchical parameterized memory bank" to a small "anchor model" during pretraining. Based on input documents, hierarchical clustering routing retrieves only ~10% of memory parameters to augment the anchor model. This allows the anchor model to focus on general knowledge and reasoning while the memory bank absorbs long-tail world knowledge. Experiments on trillions of tokens show that a 160M anchor model with 18M retrieved memory (from a 4.6B bank) can match the performance of standard models more than twice its size.

Programming by Backprop: An Instruction is Worth 100 Examples when Finetuning LLMs

The paper proposes Programming by Backprop (PBB)—a two-stage training curriculum that allows LLMs to "compile" corresponding executable behaviors into weights using only "declarative instructions" (such as a Python source code snippet or a set of grammar rules) in the training data, without providing execution examples. Experiments demonstrate that a single instruction can be worth up to 100 execution samples, and this phenomenon has direct implications for data governance and safety.

RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

RECON is proposed as a class-pose-independent canonical orientation normalization method. By correcting arbitrary canonical representations generated during training through simple right translation, it achieves unsupervised instance-level symmetry discovery, OOD pose detection, and a plug-and-play test-time normalization layer.

Reformulation for Pretraining Data Augmentation

To address the scarcity of high-quality pre-training corpora and the performance degradation caused by repeating data, this paper proposes MGA (Massive Genre-Audience reformulation). Utilizing a lightweight 3.3B MoE model, MGA adaptively generates multiple "Genre-Audience" pairs for each original document to rewrite it into five distinct stylistic versions while maintaining factual consistency. This expands 195B high-quality tokens into 770B synthetic tokens (a 3.9× expansion), achieving superior N/D bidirectional scaling compared to "data repetition / upsampling" across models ranging from 134M to 13B parameters.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

This paper unifies LLM data curation (selection) and data mixing into an "online reweighting" problem. It proposes ADAPT, which dynamically adjusts the per-sample learning rate based on the semantic similarity between training samples and a validation set during training. Without removing any data and with near-zero additional overhead, ADAPT achieves stronger cross-benchmark generalization than offline selection/mixing methods in both instruction tuning and pre-training.

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

This paper challenges the conventional wisdom that "downstream benchmark accuracy is unpredictable" and proposes a two-parameter power law \(-\log Q = A/C^{\alpha}\) to directly model downstream accuracy from training FLOPs. It extends this to different token-to-parameter ratios and repeated sampling (pass@k). Experiments on a grid up to 17B parameters and 350B tokens demonstrate that this method is more accurate and stable for extrapolation than the classic "two-stage method" (predicting proxy metrics first, then mapping to accuracy).
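
The two-parameter law \(-\log Q = A/C^{\alpha}\) can be evaluated and fitted from two runs as sketched below. The natural logarithm is assumed, the parameter values in the usage note are hypothetical, and a real fit would use many runs with a least-squares procedure instead of two points.

```python
import math


def predict_accuracy(compute_flops, A, alpha):
    """Invert -log Q = A / C^alpha to get the predicted accuracy Q in (0, 1)."""
    return math.exp(-A / compute_flops ** alpha)


def fit_from_two_runs(c1, q1, c2, q2):
    """Solve for (A, alpha) from two (compute, accuracy) observations.

    Taking logs twice gives a line: log(-log Q) = log A - alpha * log C.
    Requires 0 < q < 1 for both observations and c1 != c2.
    """
    y1 = math.log(-math.log(q1))
    y2 = math.log(-math.log(q2))
    alpha = (y1 - y2) / (math.log(c2) - math.log(c1))
    A = math.exp(y1 + alpha * math.log(c1))
    return A, alpha
```

For instance, with the hypothetical values `A=50.0` and `alpha=0.1`, predicted accuracy rises as compute grows from `1e20` to `1e22` FLOPs, and `fit_from_two_runs` recovers both parameters from those two points.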

Rewriting Pre-training Data Boosts LLM Performance in Math and Code

Instead of "filtering and discarding," this paper uses a 70B model to rewrite, clean, and retain open-source code and math corpora, constructing two datasets: SwallowCode (≈16.1B tokens) and SwallowMath (≈2.3B tokens). In continued pre-training with a fixed budget of 50B tokens, Llama-3.1-8B achieves a +17.0 improvement in HumanEval pass@1 and a +12.4 improvement in GSM8K, indicating that data quality is the fundamental bottleneck for code and mathematical capabilities.

Scaling Behavior of Discrete Diffusion Language Models

This paper systematically investigates the scaling laws of discrete diffusion language models (DLMs) under various noise types. By employing a unified diffusion framework parameterized by signal-to-noise ratio (SNR) that smoothly interpolates between masked and uniform diffusion, and carefully calibrating batch size and learning rate, the authors find that DLM scaling behavior heavily depends on the noise type. Uniform diffusion is more "data-efficient but parameter-hungry" in data-constrained scenarios. The study scales uniform diffusion models up to 10B parameters / \(10^{22}\) FLOPs, verifying that their scaling laws can compete with autoregressive models (ALMs).

Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

This paper introduces a dimensionless data quality parameter \(Q \in (0,1]\) into the classic Chinchilla scaling law, obtaining \(L(N,D,Q)=A/N^\alpha + B/(D^\beta Q^\gamma) + E\). Through systematic controlled experiments involving noise injection in machine translation and causal language modeling, the authors demonstrate that loss decreases predictably with improved data quality, and high-quality data can compensate for smaller model sizes and lower computational costs.
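
The quality-aware law above can be evaluated directly, and rearranging the data term as \(B/(D\,Q^{\gamma/\beta})^{\beta}\) shows that quality acts like a multiplier on the token count. The constants in `HYPOTHETICAL_FIT` are invented for illustration and are not the paper's fitted values.

```python
# Illustrative constants only; not taken from the paper.
HYPOTHETICAL_FIT = dict(A=400.0, alpha=0.34, B=410.0, beta=0.28, gamma=0.5, E=1.7)


def quality_aware_loss(n_params, n_tokens, quality, A, alpha, B, beta, gamma, E):
    """L(N, D, Q) = A / N^alpha + B / (D^beta * Q^gamma) + E, with Q in (0, 1]."""
    if not 0.0 < quality <= 1.0:
        raise ValueError("quality must lie in (0, 1]")
    return A / n_params ** alpha + B / (n_tokens ** beta * quality ** gamma) + E


def effective_tokens(n_tokens, quality, beta, gamma):
    """Token count at Q = 1 that yields the same loss as n_tokens at the given quality."""
    return n_tokens * quality ** (gamma / beta)
```

With \(Q = 1\) the expression reduces to the classic Chinchilla form, and any \(Q < 1\) behaves like training on fewer clean tokens.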

Scaling with Collapse: Efficient and Predictable Training of LLM Families

This paper demonstrates that the training loss curves (TLCs) of LLM families "collapse" onto a single universal curve when optimization hyperparameters are matched to the data budget. This phenomenon is leveraged for two practical applications: (1) using deviation from collapse as an early diagnostic signal for training pathologies, and (2) achieving early stopping in large-scale hyperparameter tuning through the predictability of the collapse curve.

Selective Rotary Position Embedding

This paper theoretically demonstrates that "strong recall = rotation + decay" is indispensable. It notes that linear attention lacks the "rotation" implicitly performed by softmax. Consequently, it proposes Selective RoPE—an input-dependent, learnable rotary position embedding capable of rotating at arbitrary angles and seamlessly compounding with decay gates. Efficiently implemented as a layer of complex gated linear attention using the RoPE trick, it improves recall, expressivity, and perplexity on synthetic recall tasks and 370M/1.3B language modeling with minimal cost.

SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook

This paper proposes SemHiTok, a unified tokenizer for understanding and generation built on a Semantic-Guided Hierarchical Codebook (SGHC). It establishes pixel sub-codebooks on top of a pre-trained semantic codebook. Decoupling both structure and training (staged optimization) avoids semantic-pixel conflicts, achieving SOTA in both understanding and reconstruction among discrete tokenizers under LLaVA settings.

Seq vs Seq: An Open Suite of Paired Encoders and Decoders

The authors develop a suite of paired encoder-only and decoder-only models (the ETTIN suite) ranging from 17M to 1B parameters. Using identical data, architectures, and training recipes—varying only in the objective function and attention direction—they achieve SOTA performance on open-data benchmarks for both types. They demonstrate that encoders significantly outperform decoders in classification and retrieval tasks, while the reverse is true for generation, and converting one model type to another via continued training (cross-objective) cannot bridge this performance gap.

Should We Still Pretrain Encoders with Masked Language Modeling?

The authors conducted a strictly controlled comparative experiment with 38 models (210M to 1B parameters) and over 15,000 fine-tuning runs to answer whether MLM is still necessary for pre-training encoders. The study concludes that while MLM remains generally stronger for text representation tasks, CLM is more data-efficient and offers more stable fine-tuning. Consequently, a two-stage strategy of CLM followed by MLM (especially performing MLM on off-the-shelf CLM decoders) yields the optimal encoder under a fixed compute budget.

Soft-Masked Diffusion Language Models

Addressing the issue where binary "keep mask or replace with prediction" decisions in Masked Diffusion Language Models (MDLM) discard valuable predictive information, this paper proposes soft-masking (SM). By representing retained [MASK] positions as a confidence-weighted convex combination of the [MASK] embedding and top-k predictions from the previous step, information is propagated across steps. With only 3 additional trainable parameters, this method consistently improves perplexity, MAUVE, and code generation accuracy across training from scratch, continued pre-training, and Dream-7B fine-tuning, with significant gains in low-compute (fewer decoding steps / high-throughput) scenarios.
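
The convex combination described above can be sketched for a single retained [MASK] position as follows. This is a minimal illustration under stated assumptions: the mixing weight is passed in directly (how the paper derives it from confidence using its 3 trainable parameters is not specified here), the top-k probabilities are renormalized to sum to 1, and all names are hypothetical.

```python
import numpy as np

def soft_mask_embedding(probs, embedding_table, mask_embedding, k, weight):
    """Blend the [MASK] embedding with the top-k predicted token embeddings.

    probs:           (V,) predicted distribution from the previous step
    embedding_table: (V, d) token embeddings
    mask_embedding:  (d,) embedding of the [MASK] token
    weight:          mixing weight in [0, 1]; 0 reproduces a hard mask
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    probs = np.asarray(probs, dtype=float)
    top_ids = np.argsort(probs)[-k:]                # indices of the k most likely tokens
    top_p = probs[top_ids] / probs[top_ids].sum()   # renormalize over the top-k (assumption)
    predicted = top_p @ embedding_table[top_ids]    # probability-weighted embedding
    return (1.0 - weight) * mask_embedding + weight * predicted

table = np.arange(8, dtype=float).reshape(4, 2)
mask_vec = np.array([-1.0, -1.0])
p = np.array([0.1, 0.6, 0.2, 0.1])
blended = soft_mask_embedding(p, table, mask_vec, k=2, weight=0.5)
```

Setting `weight=0` recovers the ordinary binary mask, so the soft variant is a strict generalization of it in this sketch.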

SPICE: Submodular Penalized Information–Conflict Selection for Efficient Large Language Model Training

SPICE identifies gradient conflict between samples as the primary culprit for why "greedy data selection based on Fisher information" collapses faster in practice than in theory. Using an \(\varepsilon\)-decomposition that expresses the "degree of deviation from ideal submodularity" in terms of conflict statistics, it proposes a conflict-aware greedy selector that scores candidates by "Information Gain − Conflict Penalty." On LLaMA2-7B / Qwen2-7B, using only 10% of the data and 20 GPU-hours, it matches or exceeds full fine-tuning and 6 baselines across 8 benchmarks.

ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

ssToken performs token-level data filtering during LLM supervised fine-tuning (SFT). It utilizes the model's own historical checkpoints instead of external reference models to calculate "Retrospective Excess Loss" (a self-modulated signal), combined with an attention-based semantic importance metric. By weighting these two orthogonal signals, the method computes loss only on the top-\(\rho\) tokens. Experiments on 3B–14B models demonstrate a performance gain of up to 4.3% over full fine-tuning and 2.8% over existing token selection methods, with negligible training overhead.
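
The final selection step, computing loss only on the top-\(\rho\) fraction of tokens under a weighted combination of two signals, might look like the sketch below. Both signals are taken as precomputed per-token arrays. The min-max normalization, the mixing weight `gamma`, and the function names are assumptions for illustration rather than details from the paper.

```python
import numpy as np

def _minmax(x):
    """Rescale scores to [0, 1] so the two signals are comparable (assumption)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def top_rho_mask(excess_loss, semantic_score, rho, gamma=0.5):
    """Boolean mask keeping the top-rho fraction of tokens by combined score."""
    score = gamma * _minmax(excess_loss) + (1.0 - gamma) * _minmax(semantic_score)
    n_keep = max(1, int(np.ceil(rho * score.size)))
    mask = np.zeros(score.size, dtype=bool)
    mask[np.argsort(score)[-n_keep:]] = True
    return mask

def selected_loss(token_loss, mask):
    """Mean loss over the selected tokens only."""
    return float(np.asarray(token_loss, dtype=float)[mask].mean())

mask = top_rho_mask([0.1, 0.9, 0.5, 0.2], [0.2, 0.8, 0.9, 0.1], rho=0.5)
loss = selected_loss([1.0, 2.0, 3.0, 4.0], mask)
```

Tokens outside the mask contribute nothing to the training loss in this sketch, which is the token-level analogue of dropping whole samples.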

Steering Language Models with Weight Arithmetic

This paper proposes Contrastive Weight Steering, which extracts behavioral direction vectors by taking the weight difference between models fine-tuned on positive and negative behaviors. It controls behavior by directly modifying model weights and demonstrates superior generalization and consistency compared to Activation Steering in experiments involving sycophancy, malevolence, and refusal.
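
The weight arithmetic can be sketched with plain dictionaries of arrays standing in for model checkpoints. The direction is the difference between the two fine-tuned checkpoints, as described above. Adding a scaled copy of it to a base model, and the `scale` parameter itself, are assumptions made for this illustration.

```python
import numpy as np

def steering_direction(pos_weights, neg_weights):
    """Per-parameter difference between the positive and negative fine-tunes."""
    return {name: pos_weights[name] - neg_weights[name] for name in pos_weights}

def apply_steering(base_weights, direction, scale):
    """Shift the base weights along the direction; a negative scale steers away."""
    return {name: w + scale * direction[name] for name, w in base_weights.items()}

# Toy checkpoints with a single parameter tensor (hypothetical values).
base = {"w": np.array([1.0, 2.0])}
pos = {"w": np.array([2.0, 2.0])}
neg = {"w": np.array([0.0, 2.0])}

direction = steering_direction(pos, neg)
toward = apply_steering(base, direction, scale=0.5)
away = apply_steering(base, direction, scale=-0.5)
```

Unlike activation steering, nothing is changed at inference time here: the edit is applied once to the weights.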

StochasTok: Improving Fine-Grained Subword Understanding in LLMs

StochasTok adds a lightweight post-processing step after tokenization: with some probability, it randomly splits a token into an equivalent pair of smaller tokens from the vocabulary. This allows LLMs to "see" the internal structure of tokens during pre-training, significantly outperforming deterministic tokenization and BPE-dropout on fine-grained subword tasks like letter counting, substring search, and multi-digit addition. It can be plugged into any training stage.
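
A single pass of this kind of splitting can be sketched over string tokens. The toy vocabulary, the `split_prob` parameter, and the choice to make only one pass over the sequence are assumptions for illustration. The property that matters is that the concatenated text never changes.

```python
import random

def stochastic_split(tokens, vocab, split_prob, rng):
    """Randomly replace tokens with an equivalent pair of in-vocabulary tokens."""
    out = []
    for tok in tokens:
        # All ways to cut the token into two pieces that both exist in the vocabulary.
        splits = [
            (tok[:i], tok[i:])
            for i in range(1, len(tok))
            if tok[:i] in vocab and tok[i:] in vocab
        ]
        if splits and rng.random() < split_prob:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out

toy_vocab = {"straw", "berry", "st", "raw", "s", "traw", "b", "erry", "ber", "ry"}
original = ["straw", "berry"]
expanded = stochastic_split(original, toy_vocab, split_prob=1.0, rng=random.Random(0))
```

Because the step only rewrites token sequences, it needs no change to the tokenizer's vocabulary or to the model architecture in this sketch.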

Synthetic Bootstrapped Pretraining

SBP (Synthetic Bootstrapped Pretraining) extracts semantically similar document pairs from pretraining corpora, trains a conditional synthesizer to "generate related \(d_2\) given \(d_1\)," and then scales it across the entire corpus to synthesize a large volume of new documents for joint pretraining with real data. In compute-matched 1T token settings for 3B and 6B models, it consistently exceeds strong repetition baselines and recovers up to approximately 60% of the gains achieved by an oracle (which has 20x more unique data).

Task-Aware Data Selection via Proxy-Label Enhanced Distribution Matching for LLM Fine-Tuning

Addressing the task of selecting the most relevant instruction data from a large corpus given a small target set, this paper argues that aligning only input features \(X\) is insufficient. It proposes reformulating the problem as joint distribution \(P(X,Y)\) alignment by using LLMs to infer proxy labels \(Y\). This is implemented via a four-step pipeline called TADS: "annotation → cluster propagation → LLM scoring/filtering → incremental sampling." Selecting only 10K samples from a 300K pool to fine-tune LLaMA-3.1-8B achieves performance comparable to or exceeding SOTA methods like LESS and TSDS.

The Diffusion Duality, Chapter II: Ψ-Samplers

Addressing the issue where Uniform State Discrete Diffusion (USDM) quality saturates rather than improves at high sampling steps, this paper proposes a family of "superposition posteriors" (Ψ-posterior) and its corresponding Ψ-sampler (Predictor-Corrector sampler). This generalizes correction methods like ReMDM to arbitrary noise priors, allowing USDM text/image generation quality to scale with sampling steps. Additionally, an efficient curriculum using top-k order statistics to approximate softmax is introduced, reducing training memory by 33% and time by 25%.

Time is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

The authors find that Diffusion Language Models (dLLMs) often "get it right in the middle but change to wrong at the end" (temporal oscillation) during denoising. They exploit the otherwise discarded intermediate-step predictions as signals using two methods: a training-free Temporal Self-Consistency voting that selects the most stable answer across steps within a single sampling trajectory, and a Temporal Consistency Reinforcement post-training via GRPO using "Negative Temporal Semantic Entropy" as an unlabeled reward. These yield improvements of ~1.5% and up to 25.3% across four math reasoning benchmarks, respectively.
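
The training-free voting idea can be sketched as a vote over the answers decoded at each denoising step of one trajectory. The uniform default weights and the optional per-step weights are assumptions for this illustration, and the paper's actual weighting scheme may differ.

```python
from collections import defaultdict

def temporal_vote(step_answers, step_weights=None):
    """Return the answer with the largest total weight across denoising steps.

    step_answers: answer extracted at each step, or None if none was parseable
    step_weights: optional per-step weights; defaults to one vote per step
    """
    if step_weights is None:
        step_weights = [1.0] * len(step_answers)
    totals = defaultdict(float)
    for answer, weight in zip(step_answers, step_weights):
        if answer is not None:
            totals[answer] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)

# The trajectory is right in the middle but flips to a wrong answer at the end.
trajectory = [None, "12", "42", "42", "42", "17"]
voted = temporal_vote(trajectory)
```

Here the final-step answer is `"17"`, but the vote returns the answer that was stable for most of the trajectory.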

TNT: Improving Chunkwise Training for Test-Time Memorization

This paper proposes the TNT training paradigm, which utilizes "hierarchical memory + periodic state reset" to break the sequential dependencies of non-linear RNNs and achieve large-scale context parallelism. A subsequent lightweight fine-tuning stage adapts local memory to small chunks, accelerating the training of Titans-like deep memory models by up to \(17 \times\) while improving accuracy.

Token-level Data Selection for Safe LLM Fine-tuning

TOSS (Token-level data Selection for Safe LLM fine-tuning) is proposed as the first token-level data selection framework for safe LLM fine-tuning. By evaluating the safety risk of each token through the loss difference between a safety-degraded model and a utility-oriented model, it achieves a superior safety-utility tradeoff compared to sample-level methods.

Train on Validation (ToV): Fast Data Selection with Applications to Fine-Tuning

ToV reverses the process of "estimating the impact of each sample on validation loss" by leveraging train-validation symmetry revealed through first-order Taylor expansion. It fine-tunes on a small validation set for one step and identifies samples in the training pool with the largest loss reduction—using only forward loss evaluations without requiring per-sample gradients or Hessians. This achieves 2–6× speedups over LESS while selecting higher-quality data for instruction tuning and NER.
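
The "train on validation, then score the training pool" loop can be sketched on a toy linear regression model. The squared-error loss, the learning rate, and all names here are assumptions for illustration. The point is that scoring uses only two forward loss evaluations per training sample and a single gradient step on the validation set.

```python
import numpy as np

def per_sample_loss(w, X, y):
    """Squared-error loss of a linear model, one value per sample."""
    return 0.5 * (X @ w - y) ** 2

def tov_scores(w, X_train, y_train, X_val, y_val, lr):
    """Loss reduction on each training sample after one step on the validation set."""
    before = per_sample_loss(w, X_train, y_train)
    grad_val = X_val.T @ (X_val @ w - y_val) / len(y_val)  # gradient of mean validation loss
    w_after = w - lr * grad_val                            # one fine-tuning step on validation
    after = per_sample_loss(w_after, X_train, y_train)
    return before - after                                  # larger means more validation-like

w0 = np.zeros(2)
X_val, y_val = np.array([[1.0, 0.0]]), np.array([1.0])
X_train = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1.0, -1.0, 1.0])  # aligned, contradicting, unrelated

scores = tov_scores(w0, X_train, y_train, X_val, y_val, lr=0.1)
ranking = np.argsort(-scores)
```

In this toy setup the training sample that agrees with the validation example gets the highest score, the unrelated one scores zero, and the contradicting one scores lowest.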

Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization

This paper reinterprets the structured second-moment estimation of Shampoo and SOAP from the perspective of KL divergence minimization, revealing inherent limitations and proposing two practical solutions—KL-Shampoo and KL-SOAP—that match or exceed the original methods without requiring Adam grafting.

Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

By decomposing training gradient signals into three components—direct, pre-cached, and circuit sharing—this work explains why Transformers trained with NTP learn features "useless" for predicting the current next token. The explanatory power of this framework is validated on OthelloGPT, small language models, and a pre-trained LLM (Gemma 2).

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

This paper proposes the Clustering-On-Difficulty (COD) framework: it first clusters evaluation samples based on "difficulty scaling features," filters out non-extrapolatable clusters, applies a newly derived downstream performance scaling law to perform compute-performance extrapolation for each cluster, and finally uses a smooth mapping function to restore the accuracy of the "predictable subset" to the full evaluation set—reducing the average prediction error to 1.55% across 8 mainstream benchmarks for a 70B model.

What Scales in Cross-Entropy Scaling Law?

This paper precisely decomposes cross-entropy loss into three terms: "Error-Entropy + Self-Alignment + Confidence." Through experiments on 32 models spanning five orders of magnitude, it demonstrates that only the Error-Entropy consistently follows a power-law decay with model size. The other two terms remain largely invariant to scale—explaining why the cross-entropy scaling law is accurate for small models but tends to fail for ultra-large models.