📚 Pretraining

💬 ACL2026 · 5 paper notes

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding: This paper proposes an automated method for augmenting existing commonsense knowledge bases with negation. The resulting large-scale negated commonsense corpus (¬Atomic and ¬Anion) contains over 2 million triples, and pretraining on it improves LLMs' understanding of negation.
Compact Example-Based Explanations for Language Models: This paper proposes the Selection Relevance Score, a retraining-free metric for evaluating the quality of training sample subsets as example-based explanations. It demonstrates that the commonly used "select top-k by influence" strategy frequently underperforms random selection, and introduces a new strategy that balances influence and representativeness.
SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization: This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma" of lightweight optimizers by combining Lion-style sign update directions with an \(O(d)\)-memory adaptive damping scaling factor \(\mathbf{H}_t\). SAGE achieves new state-of-the-art perplexity on Llama models (up to 1.3B parameters) with significantly reduced optimizer memory overhead.
SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models: This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects subcharacter (Jamo) compositional knowledge from the Korean Hangul writing system into the embedding layer of existing subword-level PLMs via a dual-channel strategy. Without requiring re-pretraining, SCRIPT yields consistent improvements on Korean NLU/NLG tasks and enables the embedding space to better capture morphosyntactic regularities and semantic variations.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity: This paper integrates human working memory constraints (fixed window, exponential decay, logistic decay, and primacy-recency effects) into the GPT-2 attention mechanism, training from scratch on developmentally plausible small-scale corpora (10M/100M words). The results demonstrate that these constraints significantly improve grammatical accuracy and human reading time predictability under data scarcity, while also promoting functional specialization of attention heads.
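
To make the exponential-decay constraint in the last entry more concrete, here is a minimal sketch of one way a recency penalty could be folded into causal attention. It is an illustration under stated assumptions, not the paper's formulation: the function name `decayed_causal_attention`, the `decay_rate` parameter, and the choice to subtract a distance-proportional penalty from the logits are all hypothetical.

```python
import math


def decayed_causal_attention(scores, decay_rate):
    """Turn raw attention scores into causal weights with a recency penalty.

    Assumption (not taken from the paper): the decay is applied by subtracting
    decay_rate * distance from each logit, which is equivalent to multiplying
    the unnormalized weight by exp(-decay_rate * distance).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Causal mask: query position i only sees key positions 0..i.
        logits = [scores[i][j] - decay_rate * (i - j) for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        # Future positions receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


if __name__ == "__main__":
    # Uniform raw scores isolate the effect of the decay term.
    uniform = [[0.0] * 4 for _ in range(4)]
    for row in decayed_causal_attention(uniform, decay_rate=0.5):
        print([round(w, 3) for w in row])
```

With uniform scores, each row puts the most weight on the current token and progressively less on older ones, which is the qualitative behavior a decay-style memory constraint is meant to induce. A fixed-window variant could be sketched the same way by dropping positions beyond a chosen distance instead of penalizing them.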