📚 Pretraining
💬 ACL2026 · 12 paper notes
📌 Same area in other venues: 📷 CVPR2026 (5) · 🔬 ICLR2026 (79) · 🧪 ICML2026 (27) · 🤖 AAAI2026 (9) · 🧠 NeurIPS2025 (51) · 📹 ICCV2025 (9)
🔥 Top topics: LLM ×3
- Compact Example-Based Explanations for Language Models
-
This paper proposes Selection Relevance Score, a re-training-free metric to evaluate the quality of training sample subsets as example-based explanations. It demonstrates that the common "select highest influence" strategy is often inferior to random selection and further introduces a new strategy that balances influence and representativeness.
- Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
-
This paper proposes Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. By training a small agent using CQL reinforcement learning on extensive data mixing trajectories, it learns generalizable data mixing heuristics. It balances performance between source and target domains in mathematical reasoning continual pre-training and generalizes to unseen source domains, target models, and domain spaces.
- Demystifying Data Organization for Enhanced LLM Training
-
This paper systematically investigates the impact of "sample appearance order" in LLM training. By reusing existing sample-level quality/difficulty scores, it proposes four data organization principles: boundary reinforcement, cyclic review, continuous curriculum, and local diversity. The proposed STR and SAW strategies consistently enhance performance in both pre-training and SFT.
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
-
The authors use Hierarchical Probabilistic Context-Free Grammars (HPCFG) to construct a set of "contamination-free, bounded, and precisely samplable" formal languages as controlled testbeds. They propose the "Discriminative AUC Test" as a unified metric to systematically compare fine-tuning (FT) and in-context learning (ICL) across 18 LLMs from 6 families on 6 languages. The study finds that FT consistently outperforms ICL in-distribution, while the two perform comparably on out-of-distribution data; ICL shares a similar inductive bias with FT but is significantly more sensitive to specific tokens.
- FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning
-
The authors recast the "spaced repetition" idea behind the Ebbinghaus forgetting curve in terms of "model time" rather than training steps, where model time accumulates the parameter update norm \(\Delta_t = \|\Theta_t - \Theta_{t-1}\|_2\). Cumulative model time \(\tau_t\) determines when to replay, while the instability ratio \(r_t\) (current update intensity \(\mu_t\) vs. baseline \(\mu_0\)) adaptively controls how to replay (regularization strength). The method consistently outperforms prior state-of-the-art methods across 3 CL benchmarks and 4 backbones (0.6B–13B), improving OP by 1.2% and BWT by 0.9% over the strongest baseline, VBM.
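To make the "model time" idea concrete, here is a minimal sketch of accumulating \(\Delta_t\) into \(\tau_t\) and triggering replay when \(\tau_t\) crosses a threshold. The fixed `interval` schedule, the ratio form \(r_t = \mu_t / \mu_0\), and the function names `update_norm`, `replay_steps`, and `instability_ratio` are assumptions made for illustration; the summary above does not specify the paper's actual schedule or formulas.

```python
import math


def update_norm(prev_params, curr_params):
    """L2 norm of the parameter change between two consecutive steps (Delta_t)."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_params, curr_params)))


def replay_steps(param_history, interval):
    """Return the step indices at which replay would be triggered.

    Assumption: replay fires each time cumulative model time tau_t crosses
    the next multiple of `interval`. This fixed spacing is hypothetical.
    """
    tau = 0.0
    next_mark = interval
    steps = []
    for t in range(1, len(param_history)):
        tau += update_norm(param_history[t - 1], param_history[t])
        if tau >= next_mark:
            steps.append(t)
            # Advance the mark past the current model time.
            while next_mark <= tau:
                next_mark += interval
    return steps


def instability_ratio(mu_t, mu_0):
    """Assumed form of r_t: current update intensity relative to the baseline."""
    return mu_t / mu_0


# Toy parameter vectors: the model moves, stalls, then moves again.
history = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [6.0, 8.0]]
print(replay_steps(history, interval=4.0))
```

Under this sketch, a step in which the parameters do not move adds nothing to \(\tau_t\), so replay is spaced by how much the model has changed rather than by how many steps have elapsed.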
- Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
-
This paper shows that Classifier-based Quality Filtering (CQF) mistakenly equates "Wikipedia-style writing" with "higher educational value." Simple rewriting allows low-quality web pages to bypass pre-training data filtering thresholds; approximately 7% of samples in FineWeb-Edu flip their filtering decisions as a result.
- KoCo: Conditioning Language Model Pre-training on Knowledge Coordinates
-
This paper proposes Knowledge Coordinate (KoCo) conditioning for pre-training, which maps each document to a three-dimensional semantic coordinate (Source, Content, Stability). These coordinates are injected into pre-training as text prefixes, giving the model explicit awareness of each document's context. The approach improves performance across 10 downstream tasks, accelerates convergence by approximately 30%, and mitigates hallucinations.
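As a rough illustration of prefix-based conditioning, the sketch below prepends a coordinate tag to a document before it is tokenized. The bracketed tag format, the example coordinate values, and the helper name `add_koco_prefix` are hypothetical; the summary only states that the three coordinates are injected as text prefixes.

```python
def add_koco_prefix(text, source, content, stability):
    """Prepend a (Source, Content, Stability) coordinate to a document.

    The tag layout is an assumption made for illustration only.
    """
    prefix = f"[Source: {source}] [Content: {content}] [Stability: {stability}]"
    return prefix + "\n" + text


doc = "Water boils at 100 degrees Celsius at sea level."
print(add_koco_prefix(doc, source="encyclopedia", content="science", stability="stable"))
```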
- On the Proper Treatment of Units in Surprisal Theory
-
This paper points out that the choice of the "next unit" in surprisal theory has historically been implicitly determined by pre-trained language model tokenizers. It proposes a finite-state transduction framework that explicitly decouples model tokens, linguistic units, and experimental Regions of Interest (ROI), demonstrating on MECO eye-tracking data that different unit inventories fundamentally alter how surprisal predicts reading time.
- SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization
-
This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma," where lightweight optimizers fail on embedding layers. By combining a Lion-style sign update direction with an adaptive damping scale factor that adds only \(O(d)\) memory, SAGE achieves new SOTA perplexity on Llama models (up to 1.3B) with significantly lower optimizer memory.
- SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models
-
This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects Hangul subcharacter (Jamo) compositional knowledge into the embedding layers of existing subword-level PLMs using a dual-channel strategy. It achieves consistent improvements across Korean NLU/NLG tasks without re-pretraining and enables the embedding space to better capture grammatical regularities and semantic variations.
- Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
-
This paper theoretically analyzes how Multi-Token Prediction (MTP) induces representation contractivity through a gradient coupling mechanism, facilitating the emergence of belief states. However, it also reveals the "structural hallucination" problem of MTP (illegal shortcuts in latent space). The proposed LSE-MTP framework anchors predictions to true latent state trajectories through latent consistency and semantic anchoring losses, significantly improving path legality and robustness in synthetic graphs and real-world Manhattan taxi navigation.
- Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
-
This paper integrates human working memory constraints (fixed window, exponential decay, logistic decay, primacy-recency effects) into the GPT-2 attention mechanism. By training from scratch on developmentally plausible small-scale corpora (10M/100M words), it finds that these constraints significantly improve syntactic accuracy and the predictability of human reading times under data scarcity, while promoting functional specialization of attention heads.
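One way to picture such constraints is as a distance-dependent weight applied to causal attention probabilities. The sketch below covers only the fixed-window and exponential-decay variants. The multiplicative form, the renormalization step, and the names `decay_weights` and `constrain_attention` are assumptions, since the summary does not say how the constraints enter the GPT-2 attention computation.

```python
import math


def decay_weights(seq_len, kind="exponential", window=4, rate=0.5):
    """Build W[i][j], the weight for query position i attending to key position j.

    Future positions (j > i) get 0, matching causal attention.
    """
    weights = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            distance = i - j
            if distance < 0:
                w = 0.0
            elif kind == "window":
                w = 1.0 if distance < window else 0.0
            elif kind == "exponential":
                w = math.exp(-rate * distance)
            else:
                raise ValueError(f"unknown constraint kind: {kind}")
            row.append(w)
        weights.append(row)
    return weights


def constrain_attention(probs, weights):
    """Scale attention probabilities by the weights and renormalize each row."""
    out = []
    for p_row, w_row in zip(probs, weights):
        scaled = [p * w for p, w in zip(p_row, w_row)]
        total = sum(scaled)
        out.append([s / total for s in scaled] if total > 0 else scaled)
    return out


# Uniform causal attention over 3 tokens, restricted to a window of 2.
uniform = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1 / 3, 1 / 3, 1 / 3]]
print(constrain_attention(uniform, decay_weights(3, kind="window", window=2)))
```

With a window of 2, the last token can no longer attend to the first one, so its attention mass is redistributed over the two most recent positions.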