📚 Pretraining

💬 ACL2026 · 12 paper notes

Compact Example-Based Explanations for Language Models

This paper proposes the Selection Relevance Score, a retraining-free metric for evaluating how well a subset of training samples serves as an example-based explanation. It shows that the common "select highest influence" strategy is often worse than random selection, and it introduces a new strategy that balances influence and representativeness.

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

This paper proposes Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. By training a small agent using CQL reinforcement learning on extensive data mixing trajectories, it learns generalizable data mixing heuristics. It balances performance between source and target domains in mathematical reasoning continual pre-training and generalizes to unseen source domains, target models, and domain spaces.

Demystifying Data Organization for Enhanced LLM Training

This paper systematically investigates the impact of "sample appearance order" in LLM training. By reusing existing sample-level quality/difficulty scores, it proposes four data organization principles: boundary reinforcement, cyclic review, continuous curriculum, and local diversity. The proposed STR and SAW strategies consistently enhance performance in both pre-training and SFT.

Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective

The authors use Hierarchical Probabilistic Context-Free Grammars (HPCFG) to construct a set of "contamination-free, bounded, and precisely samplable" formal languages as controlled testbeds. They propose the "Discriminative AUC Test" as a unified metric and use it to systematically compare fine-tuning (FT) and in-context learning (ICL) across 18 LLMs from 6 families on 6 languages. The study finds that FT consistently outperforms ICL in-distribution, while the two perform on par on out-of-distribution data; ICL shares a similar inductive bias with FT but is significantly more sensitive to specific tokens.

FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning

The authors move the "spaced repetition" idea behind the Ebbinghaus forgetting curve from "training steps" to "model time", which is measured by accumulating the per-step parameter update norm \(\Delta_t = \|\Theta_t - \Theta_{t-1}\|_2\). Cumulative model time \(\tau_t\) determines when to replay, while the instability ratio \(r_t\) (current update intensity \(\mu_t\) relative to the baseline \(\mu_0\)) adaptively controls how to replay by setting the regularization strength. The method consistently outperforms SOTA across 3 CL benchmarks and 4 backbones (0.6B–13B), achieving OP +1.2% and BWT +0.9% over the strongest baseline, VBM.
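
The bookkeeping behind "model time" can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes \(\tau_t\) is the running sum of \(\Delta_t\), that \(\mu_t\) is an exponential moving average of \(\Delta_t\), that \(\mu_0\) is the first observed value, and that replay fires whenever \(\tau_t\) crosses a fixed interval. The names `ModelClock`, `replay_interval`, and `ema_decay` are hypothetical, and the paper's actual schedule may space replays differently.

```python
import math

def update_norm(prev_params, params):
    """L2 norm of the parameter change between two steps (Delta_t)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(params, prev_params)))

class ModelClock:
    """Hypothetical helper: tracks model time and the instability ratio."""

    def __init__(self, replay_interval, ema_decay=0.9):
        self.replay_interval = replay_interval  # assumed: model time between replays
        self.ema_decay = ema_decay              # assumed: smoothing used for mu_t
        self.tau = 0.0                          # cumulative model time tau_t
        self.mu = None                          # current update intensity mu_t
        self.mu0 = None                         # baseline intensity mu_0
        self.next_replay = replay_interval

    def step(self, delta):
        """Consume one Delta_t and return (should_replay, r_t)."""
        self.tau += delta
        if self.mu is None:
            # First observation defines both the baseline and the current value.
            self.mu = self.mu0 = delta
        else:
            self.mu = self.ema_decay * self.mu + (1 - self.ema_decay) * delta
        ratio = self.mu / self.mu0 if self.mu0 > 0 else 1.0
        should_replay = self.tau >= self.next_replay
        if should_replay:
            self.next_replay += self.replay_interval
        return should_replay, ratio
```

In a training loop, `step` would be called once per optimizer update with the value returned by `update_norm`; the returned ratio could then scale a regularization weight during replay, which is again an assumption about how \(r_t\) is consumed.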

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

This paper discovers that Classifier-based Quality Filtering (CQF) mistakenly equates "Wikipedia-style writing" with "higher educational value." Simple rewriting allows low-quality web pages to bypass pre-training data filtering thresholds; approximately 7% of samples in FineWeb-Edu flip their filtering decisions as a result.

KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates

This paper proposes Knowledge Coordinate (KoCo) conditioning for pre-training, which maps each document to a three-dimensional semantic coordinate (Source, Content, Stability). The coordinates are injected into pre-training as text prefixes, giving the model explicit awareness of each document's context. The approach improves performance across 10 downstream tasks, accelerates convergence by approximately 30%, and effectively mitigates hallucinations.
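
Prefix conditioning of this kind amounts to prepending a short tag string to each document before tokenization. The sketch below only illustrates that idea: the bracketed tag format, the example field values, and the function name `add_coordinate_prefix` are all hypothetical, since the summary does not give the paper's actual prefix template.

```python
def add_coordinate_prefix(text, source, content, stability):
    """Prepend a hypothetical (Source, Content, Stability) tag to a document."""
    prefix = f"[source={source}][content={content}][stability={stability}]"
    return f"{prefix}\n{text}"

# Illustrative values only; the real coordinate vocabulary is not specified here.
doc = add_coordinate_prefix(
    "Water boils at 100 degrees Celsius at sea level.",
    source="encyclopedia",
    content="physics",
    stability="stable",
)
```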

On the Proper Treatment of Units in Surprisal Theory

This paper points out that the choice of the "next unit" in surprisal theory has historically been implicitly determined by pre-trained language model tokenizers. It proposes a finite-state transduction framework that explicitly decouples model tokens, linguistic units, and experimental Regions of Interest (ROI), demonstrating on MECO eye-tracking data that different unit inventories fundamentally alter how surprisal predicts reading time.

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma": lightweight optimizers fail on embedding layers. By combining a Lion-style sign update direction with an adaptive damping scale factor that costs only \(O(d)\) memory, SAGE achieves new SOTA perplexity on Llama models (up to 1.3B) with significantly lower optimizer memory.

SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects Hangul subcharacter (Jamo) compositional knowledge into the embedding layers of existing subword-level PLMs using a dual-channel strategy. It achieves consistent improvements across Korean NLU/NLG tasks without re-pretraining and enables the embedding space to better capture grammatical regularities and semantic variations.

Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement

This paper theoretically analyzes how Multi-Token Prediction (MTP) induces representation contractivity through a gradient coupling mechanism, facilitating the emergence of belief states. However, it also reveals the "structural hallucination" problem of MTP (illegal shortcuts in latent space). The proposed LSE-MTP framework anchors predictions to true latent state trajectories through latent consistency and semantic anchoring losses, significantly improving path legality and robustness in synthetic graphs and real-world Manhattan taxi navigation.

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

This paper integrates human working memory constraints (fixed window, exponential decay, logistic decay, primacy-recency effects) into the GPT-2 attention mechanism. By training from scratch on developmentally plausible small-scale corpora (10M/100M words), it finds that these constraints significantly improve syntactic accuracy and the predictability of human reading times under data scarcity, while promoting functional specialization of attention heads.
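
One way to picture such constraints is as a distance-dependent bias added to causal attention scores before the softmax. The following is a rough sketch under assumptions, not the paper's formulation: it assumes exponential decay is realized as an additive penalty of `-rate * distance` and that a fixed window is realized by masking keys that are too far back. The function names and the `rate` and `window` parameters are hypothetical.

```python
import math

def decay_bias(seq_len, rate=None, window=None):
    """Additive bias for causal attention scores.

    bias[i][j] is added to the score of query i attending to key j.
    `window`, if given, must be at least 1 so each query can see itself.
    """
    neg_inf = float("-inf")
    bias = [[neg_inf] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):           # causal: keys at or before the query
            dist = i - j
            if window is not None and dist >= window:
                continue                  # outside the fixed window: stays masked
            bias[i][j] = -rate * dist if rate is not None else 0.0
    return bias

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]
```

With all raw scores equal, `softmax(decay_bias(3, rate=math.log(2))[2])` halves the weight for each extra step of distance, while `softmax(decay_bias(3, window=2)[2])` assigns zero weight to the oldest token and splits the rest evenly.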