
📚 Pretraining

💬 ACL2026 · 5 paper notes

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

This paper proposes an automated method for augmenting existing commonsense knowledge bases with negation, constructs a large-scale negated commonsense corpus (¬Atomic and ¬Anion) of over 2 million triples, and demonstrates that pretraining on this corpus improves LLMs' negation understanding.
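As a toy, hand-written illustration of what pairing an affirmative commonsense triple with a negated counterpart can look like (the triples and phrasing below are invented for illustration and are not drawn from ¬Atomic or ¬Anion, nor do they reflect the paper's automated construction method):

```python
# Hypothetical ATOMIC-style triples, invented purely for illustration;
# the paper's corpus is built automatically and at much larger scale.
original = ("PersonX drinks coffee",         "xEffect", "PersonX feels alert")
negated  = ("PersonX does not drink coffee", "xEffect", "PersonX feels drowsy")
```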

Compact Example-Based Explanations for Language Models

This paper proposes the Selection Relevance Score, a retraining-free metric for evaluating the quality of training sample subsets as example-based explanations. It demonstrates that the commonly used "select top-k by influence" strategy frequently underperforms random selection, and introduces a new strategy that balances influence and representativeness.
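A rough sketch of the two strategies being contrasted, assuming per-example influence scores and training embeddings have already been computed; the combined objective below is illustrative and is not the paper's Selection Relevance Score:

```python
import numpy as np

def top_k_by_influence(influence, k):
    """Baseline: the k training examples with the highest influence scores."""
    return np.argsort(influence)[::-1][:k]

def influence_plus_representativeness(influence, embeddings, k, alpha=0.5):
    """Illustrative alternative: mix influence with similarity to the
    training centroid as a crude proxy for representativeness."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    repr_score = unit @ (centroid / np.linalg.norm(centroid))
    combined = alpha * influence + (1 - alpha) * repr_score
    return np.argsort(combined)[::-1][:k]
```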

SAGE: Sign-Adaptive Gradient for Memory-Efficient LLM Optimization

This paper proposes the SAGE optimizer, which addresses the "embedding layer dilemma" of lightweight optimizers by combining Lion-style sign update directions with an \(O(d)\)-memory adaptive damping scaling factor \(\mathbf{H}_t\). SAGE achieves new state-of-the-art perplexity on Llama models (up to 1.3B parameters) with significantly reduced optimizer memory overhead.
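A minimal PyTorch sketch of the general recipe the summary describes, i.e. a sign-based update direction scaled by a per-coordinate factor that needs only \(O(d)\) extra memory; the hyperparameters, the recurrence for \(\mathbf{H}_t\), and the update order are assumptions, not SAGE's actual definitions:

```python
import torch

class SignAdaptiveSketch(torch.optim.Optimizer):
    """Illustrative sketch: Lion-style sign update scaled by a per-coordinate
    damping factor H_t costing O(d) memory (one scalar per parameter).
    Names and the exact H_t recurrence are assumptions, not the paper's."""

    def __init__(self, params, lr=1e-4, beta=0.9, damping_decay=0.99, eps=1e-8):
        defaults = dict(lr=lr, beta=beta, damping_decay=damping_decay, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["momentum"] = torch.zeros_like(p)  # O(d)
                    state["H"] = torch.zeros_like(p)         # O(d) damping factor
                m, H, g = state["momentum"], state["H"], p.grad
                # Running per-coordinate estimate of squared gradient magnitude.
                H.mul_(group["damping_decay"]).addcmul_(g, g, value=1 - group["damping_decay"])
                # Lion-style direction: sign of the momentum-interpolated gradient.
                direction = torch.sign(m.mul(group["beta"]).add(g, alpha=1 - group["beta"]))
                # Adaptive damping: shrink the step where H is large.
                p.add_(direction / (H.sqrt() + group["eps"]), alpha=-group["lr"])
                m.mul_(group["beta"]).add_(g, alpha=1 - group["beta"])
```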

SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models

This paper proposes SCRIPT, a model-agnostic plug-and-play module that injects subcharacter (Jamo) compositional knowledge from the Korean Hangul writing system into the embedding layer of existing subword-level PLMs via a dual-channel strategy. Without requiring re-pretraining, SCRIPT yields consistent improvements on Korean NLU/NLG tasks and enables the embedding space to better capture morphosyntactic regularities and semantic variations.
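A rough sketch of the general idea of composing Jamo-level embeddings and adding them to an existing subword embedding, assuming precomposed Hangul syllables and a simple gated sum as the fusion; the module name, the gate, and the dimensions are illustrative stand-ins, not SCRIPT's actual dual-channel design:

```python
import torch
import torch.nn as nn

NUM_LEAD, NUM_VOWEL, NUM_TAIL = 19, 21, 28  # Hangul Jamo inventory

def decompose_hangul(ch):
    """Decompose a precomposed Hangul syllable into (lead, vowel, tail) indices."""
    code = ord(ch) - 0xAC00
    if not 0 <= code <= 11171:
        return None  # not a precomposed Hangul syllable
    return code // 588, (code % 588) // 28, code % 28

class JamoInjectionSketch(nn.Module):
    """Illustrative module: add a Jamo-composition vector to subword embeddings
    through a learned gate. Hypothetical fusion, not the paper's design."""

    def __init__(self, hidden_size):
        super().__init__()
        self.lead = nn.Embedding(NUM_LEAD, hidden_size)
        self.vowel = nn.Embedding(NUM_VOWEL, hidden_size)
        self.tail = nn.Embedding(NUM_TAIL, hidden_size)
        self.gate = nn.Linear(2 * hidden_size, 1)

    def forward(self, subword_emb, lead_ids, vowel_ids, tail_ids):
        # Compose a character-level vector from its three Jamo components.
        jamo_emb = self.lead(lead_ids) + self.vowel(vowel_ids) + self.tail(tail_ids)
        g = torch.sigmoid(self.gate(torch.cat([subword_emb, jamo_emb], dim=-1)))
        return subword_emb + g * jamo_emb
```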

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

This paper integrates human working memory constraints (fixed window, exponential decay, logistic decay, and primacy-recency effects) into the GPT-2 attention mechanism and trains models from scratch on developmentally plausible small-scale corpora (10M/100M words). The results show that these constraints significantly improve grammatical accuracy and the prediction of human reading times under data scarcity, while also promoting functional specialization of attention heads.
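As a minimal sketch of just one of the listed constraints, here is an exponential-decay penalty added to causal attention logits so that more distant tokens receive less attention; the decay rate, its additive log-space form, and where it is applied are assumptions rather than the paper's exact parameterization:

```python
import torch

def exponential_decay_bias(seq_len, decay_rate=0.1):
    """Additive attention bias penalizing attention to distant past tokens.
    Illustrative only: decay_rate and the log-space penalty are assumptions."""
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)   # distance to earlier tokens
    bias = -decay_rate * dist.float()                    # older tokens get a larger penalty
    causal = pos[None, :] <= pos[:, None]
    return bias.masked_fill(~causal, float("-inf"))      # preserve the causal mask

# Usage inside a GPT-2-style attention head (sketch):
# scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
# scores = scores + exponential_decay_bias(scores.size(-1)).to(scores.device)
# attn = scores.softmax(dim=-1)
```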