💬 LLM / NLP

💬 ACL2026 · 38 paper notes

A Study of LLMs' Preferences for Libraries and Programming Languages: This paper presents the first systematic study of library and programming language preferences in code generation across 8 LLMs, revealing that LLMs exhibit strong biases toward popular libraries such as NumPy (with 45% of usages deemed unnecessary) and toward Python (chosen in 58% of high-performance tasks), and that natural language recommendations are inconsistent with actual code generation behavior.
Adam's Law: Textual Frequency Law on Large Language Models: This paper proposes the Textual Frequency Law (TFL), which finds that when semantics are equivalent, prompting or fine-tuning LLMs with higher-frequency textual expressions yields better performance. The authors further introduce frequency distillation and curriculum training strategies to exploit this regularity.
AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment: This paper proposes AlphaContext, an evolutionary tree-based psychometric context generator comprising four modules—HyperTree outline planning, MCTS sentence-level generation, MAP-Elites diversity optimization, and assessment-guided iterative refinement—to automatically generate high-quality long-form contexts for creativity assessment, achieving an average improvement of 8% over competitive baselines across 7 evaluation dimensions.
An Existence Proof for Neural Language Models That Can Explain Garden-Path Effects via Surprisal: By fine-tuning neural language models on garden-path sentences, this paper shows that a single neural LM can explain both garden-path effects and naturalistic reading times via surprisal, which serves as an existence proof for surprisal theory.
Are Emotion and Rhetoric Neurons in LLM? Neuron Recognition and Adaptive Masking for Emotion-Rhetoric Prediction Steering: This paper systematically investigates the representational mechanisms of emotion and rhetoric neurons in LLMs and their intrinsic relationships. It proposes a multi-dimensional neuron recognition framework and an adaptive masking validation method, enabling targeted steering of emotion/rhetoric predictions and rhetoric-neuron-assisted emotion recognition.
Automatic Combination of Sample Selection Strategies for Few-Shot Learning: This paper proposes ACSESS, a method that automatically identifies complementary sample selection strategies and combines them via weighted aggregation, using three mechanisms: forward selection, backward selection, and Datamodels. Experiments across 23 strategies, 5 ICL models, 3 gradient-based few-shot learning methods, 6 text datasets, and 8 image datasets demonstrate that combined strategies consistently outperform individual strategies and ICL-specific baselines.
ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis: ChatHLS proposes a multi-agent HLS design framework featuring two core components — HLSTuner (QoR-aware reasoning for pragma selection) and HLSFixer (a hierarchical feedback-enhanced debugging framework) — combined with a self-evolving error case augmentation mechanism (VODA), achieving significant improvements over baselines in HLS-C generation success rate and hardware performance optimization.
CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models: This paper proposes CoSToM, a framework that first applies causal tracing to identify the critical layers encoding Theory-of-Mind (ToM) features within LLMs (finding they concentrate primarily in early layers), then performs lightweight alignment via activation steering at those layers—significantly improving social reasoning quality in negotiation and persuasion dialogues, bridging the gap between "knowing but not applying" and "knowing and applying."
Detoxification for LLM from Dataset Itself: This paper proposes HSPD (Hierarchical Semantic-Preserving Detoxification), a pipeline that leverages SoCD (Soft Contrastive Decoding) to guide an LLM in identifying and rewriting toxic segments in raw corpora while preserving semantics, producing detoxified text that can directly replace original training data for fine-tuning. The approach reduces toxicity probability from 0.42 to 0.18 on GPT2-XL and achieves state-of-the-art detoxification on LLaMA2-7B, OPT-6.7B, and Falcon-7B.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot NER: DiZiNER simulates the human pilot annotation workflow: multiple heterogeneous LLMs independently annotate the same text, and inter-model disagreements are analyzed to iteratively refine task instructions. The method achieves zero-shot SOTA on 14 out of 18 NER benchmarks, with an average F1 gain of +8.0, surpassing its supervisor model GPT-5 mini.
Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models: This paper proposes PA-Tool, a training-free tool schema optimization method that leverages a "peakedness" signal borrowed from data contamination detection to identify naming patterns familiar to a model from pretraining. By renaming tool components to align with the internalized knowledge of small language models (SLMs), PA-Tool achieves up to 17% improvement on MetaTool and RoTBench, with an 80% reduction in schema misalignment errors.
EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution: EvoSpark proposes a multi-agent framework for long-horizon narrative evolution, addressing social memory stacking and narrative–spatial misalignment through three core designs: hierarchical recursive memory (RSB as social cognitive metabolism), generative scene scheduling (GMS for character–location–plot alignment), and an emergent character grounding protocol (ECGP that converts LLM hallucinations into persistent entities).
Expect the Unexpected? Testing the Surprisal of Salient Entities: This paper investigates the relationship between discourse-level salient entities and surprisal. Using 70K+ manually annotated entity mentions and a novel minimal-pair prompting approach, the study finds that globally salient entities are themselves more surprising (higher surprisal), yet systematically reduce the surprisal of surrounding content. This effect varies by genre and is strongest in topically coherent texts.
FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation: This paper analyzes two bottlenecks in continuous diffusion language models under few-step sampling — self-conditioning signal mismatch and training saturation — and proposes the FastDiSS framework, which introduces Self-Conditioning Perturbation (SCP) and Model-Aware Noise Scaling (MANS) to improve robustness, achieving 4×–400× speedup while preserving generation quality across 6 benchmarks.
Foresight Optimization for Strategic Reasoning in Large Language Models: This paper proposes Foresight Policy Optimization (FoPO), which introduces a foresight correction term based on opponent modeling into the policy optimization process, enabling LLMs to explicitly anticipate opponent behavior and adjust their strategies accordingly. FoPO achieves significant improvements in strategic reasoning on both cooperative (Cooperative RSA) and competitive (Competitive Taboo) game tasks, with consistent gains on the cross-domain γ-Bench benchmark.
From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models: This paper presents the first systematic survey of Streaming Large Language Models (Streaming LLMs), proposing a unified definition grounded in data flow and interaction concurrency. It organizes existing approaches into a three-level progressive taxonomy — Output-streaming, Sequential-streaming, and Concurrent-streaming — and covers methodologies and applications across text, speech, and video streaming scenarios.
GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-Efficient LLM Fine-tuning: GRASS is a framework that employs Mean Gradient Norm (MGN) as a task-aware and training-stage-aware layer importance metric. It adaptively samples and updates a subset of model layers during fine-tuning, coupled with a layer-wise optimizer state offloading mechanism, improving average accuracy by up to 4.38 points while reducing memory usage by up to 19.97%.
HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction: This paper proposes HCRE, a model that reformulates cross-document relation extraction from direct classification over a large relation set into layer-wise hierarchical classification guided by a constructed relation tree. A predict-then-verify inference strategy is designed to mitigate inter-layer error propagation. HCRE achieves substantial improvements over both SLM and LLM baselines on the CodRED benchmark.
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs: This paper identifies a "benign self-reading" pattern in reasoning LLMs (e.g., DeepSeek-R1) during quantitative reasoning: answer tokens' attention over reasoning traces exhibits forward drift (progressively advancing along the reasoning chain) and semantic anchor concentration (repeatedly revisiting key steps), and this pattern strongly correlates with correctness. Building on this finding, the authors propose a training-free activation steering method driven by Self-Reading Quality (SRQ) scores, achieving accuracy improvements of up to 2.6% across multiple benchmarks.
It's High Time: A Survey of Temporal Question Answering: This paper presents a comprehensive survey of Temporal Question Answering (TQA), proposing a unified analytical framework along three dimensions—corpus temporality, question temporality, and model temporal capability—and systematically reviewing the evolution of TQA methods, benchmark datasets, and evaluation strategies from rule-based pipelines to the Transformer/LLM era, while identifying key challenges for future research.
Iterative Formalization and Planning in Partially Observable Environments: This paper proposes PDDLego+, a framework that enables LLMs to iteratively generate and refine PDDL (Planning Domain Definition Language) representations in partially observable environments. Through a two-phase error refinement loop (solver error + simulation error), the framework achieves effective planning without fine-tuning or in-context demonstrations.
Losses that Cook: Topological Optimal Transport for Structured Recipe Generation: This paper proposes a topological loss function based on Sinkhorn divergence, representing ingredient lists as point clouds in embedding space and minimizing the geometric discrepancy between predicted and reference ingredients. The approach significantly improves ingredient recall and quantity precision in structured recipe generation, with generated outputs preferred by human evaluators in 62% of cases.
Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models: This paper systematically investigates the sensitivity of large language models to the ordering of prompt components in multiple-choice question answering (MCQA). Through controlled experiments, the authors rule out training bias and memory decay hypotheses, identifying the causal attention mask as the fundamental mechanism responsible for the substantial performance degradation observed under the QOC (Question–Options–Context) ordering.
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness: By comparing the correctness-prediction performance of self-probes (using a model's own hidden states) against external probes (using hidden states from other models), this paper identifies inter-model agreement as the critical confounding factor that masks privileged knowledge. After controlling for agreement, domain-specific privileged knowledge is revealed: it exists in factual tasks but is absent in mathematical reasoning.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data: This paper proposes MALMAS, a memory-augmented LLM-based multi-agent system for automated feature generation on tabular data. It employs six specialized agents to explore different dimensions of the feature space in parallel, coordinated by a Router Agent, and leverages a three-tier memory mechanism (procedural/feedback/conceptual) for cross-iteration experience accumulation and strategy refinement. MALMAS outperforms existing baselines on 16 classification and 7 regression datasets.
MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models: This paper proposes MulDimIF, a multi-dimensional constraint framework that systematically evaluates LLM instruction-following capabilities across three dimensions—constraint patterns (3 types), constraint categories (4 classes, 13 subcategories), and constraint difficulty (4 levels)—and significantly improves model performance via GRPO training, finding that gains primarily stem from parameter updates in the attention modules.
Not All Animals Are Equal: Metaphorical Framing through Source Domains and Semantic Frames: This paper proposes ConceptFrameMet, the first computational framework that integrates FrameNet semantic frames with source domains from Conceptual Metaphor Theory (CMT). A RoBERTa-based multi-task model is trained to jointly detect metaphors and predict their semantic frames and source domains. Combined with a log-likelihood ratio (LLR) statistical method for identifying salient metaphorical patterns in discourse, the framework reveals that liberal and conservative outlets employ the same source domains in immigration discourse yet select systematically different semantic frames to convey opposing associations.
One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization: This paper systematically compares 6 commonly used persona prompting strategies (two variants each of name-based, explicit-mention, and conversation-history cues) across 7 LLMs and 4 tasks. While average responses are highly correlated across prompting strategies, the magnitude of inter-persona differences varies substantially depending on the strategy used. Overly explicit prompts induce stronger personalization bias, cautioning against drawing bias conclusions from any single prompting approach.
Please Refuse to Answer Me: Mitigating Over-Refusal in LLMs via Adaptive Contrastive Decoding: This paper proposes AdaCD (Adaptive Contrastive Decoding), which extracts a refusal token distribution by contrasting token distributions under an extreme safety prompt versus no prompt, then dynamically decides to amplify or suppress refusal behavior based on an agreement ratio. AdaCD reduces over-refusal by 10.35% while simultaneously improving the refusal rate on malicious queries by 0.13%.
Prefix Parsing is Just Parsing: This paper proposes prefix grammar transformation, an efficient method that reduces prefix parsing to ordinary parsing. Given a grammar, the approach constructs a new grammar that generates exactly the set of all prefix strings of the original language, thereby enabling direct reuse of any existing ordinary parsing algorithm without the need for specialized prefix parsing algorithms.
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms: This paper constructs the RedirectQA dataset—leveraging Wikipedia redirect information to associate the same entity with multiple surface forms—and systematically investigates how non-verbatim memorization in LLMs is affected by entity naming variants. The findings show that factual memorization is neither purely surface-form-specific nor entirely surface-form-agnostic, and that entity-level frequency makes an independent contribution beyond surface-level frequency.
Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffixes: This paper proposes R2A (Route to Rome Attack), which constructs a hybrid ensemble surrogate router in a black-box setting and optimizes universal adversarial suffixes to redirect LLM router decisions from cheap weak models toward expensive strong models — achieving an average attack success rate improvement of 49% across 7 open-source routers and 2 commercial routers (GPT-5-Auto, OpenRouter), with inference costs increasing by 2.7–2.9×.
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models: This paper identifies a phenomenon termed "style amnesia," in which spoken language models (SLMs) fail to maintain initially specified speaking styles (emotion, accent, volume, speech rate) across multi-turn conversations. Attention analysis reveals the underlying cause as attention dilution, and an explicit recall process is proposed as a mitigation strategy.
The Model Agreed, But Didn't Learn: Diagnosing Surface Compliance in Large Language Models: This paper proposes the SA-MCQ diagnostic framework to reveal "surface compliance" in knowledge editing. It reports three findings: editing methods achieve high scores on standard benchmarks without genuinely overwriting internal beliefs; edited models revert to their original parametric memory under discriminative self-assessment; and sequential editing accumulates representational residuals that lead to cognitive instability.
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities: This paper proposes inserting delimiter tokens at sentence boundaries in LLM inputs to implement a "think-in-sentences" reasoning paradigm via both ICL and SFT. The approach yields consistent improvements across models ranging from 7B to 600B parameters (GSM8k +7.7%, DROP +12.5%) with negligible additional computational overhead.
Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Collaboration: This paper proposes SpreadsheetAgent, a two-stage multi-agent framework that achieves robust real-world spreadsheet understanding through progressive region-based reading and cross-validation across three formats—code execution, vision, and LaTeX—without exceeding LLM context limits.
Why Did Apple Fall: Evaluating Curiosity in Large Language Models: This paper proposes the first psychologically inspired framework for systematically evaluating curiosity-like behaviors in LLMs. Through a combination of self-report questionnaires and behavioral experiments, it finds that LLMs exhibit curiosity-like behavioral patterns that arise from data fitting and safety constraints rather than intrinsic drives. A curiosity-driven questioning pipeline is further designed to demonstrate that simulating curious behavior can improve downstream reasoning performance.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration: This paper presents XtraGPT—the first open-source LLM suite (1.5B–14B) for academic paper revision. By fine-tuning on 7,000 top-venue papers and 140,000 criteria-guided instruction–revision pairs, it enables context-aware, paragraph-level controllable revision. The 7B variant matches GPT-4o-mini, the 14B variant surpasses it, and human evaluation shows an average predicted score improvement of 0.65 points after revision.
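
The idea summarized in the "Prefix Parsing is Just Parsing" note above (building a new grammar whose language is the set of prefixes of the original language) can be illustrated with a small sketch. This is a textbook-style construction written for illustration, not necessarily the transformation or the efficiency guarantees of the paper. It assumes every nonterminal derives at least one terminal string; the `_pre` naming scheme, the `prefix_grammar` and `strings_up_to` helpers, and the toy grammar are all hypothetical.

```python
# A grammar maps each nonterminal to a list of right-hand sides (tuples of symbols).
# Any symbol that is not a key of the grammar is treated as a terminal.
Grammar = dict[str, list[tuple[str, ...]]]


def prefix_grammar(grammar: Grammar, start: str) -> tuple[Grammar, str]:
    """Return (new_grammar, new_start) generating every prefix of L(grammar).

    Assumption: every nonterminal derives at least one terminal string.
    """
    def pre(symbol: str) -> str:
        # Hypothetical naming scheme for the "prefix version" of a nonterminal.
        return symbol + "_pre" if symbol in grammar else symbol

    new_grammar: Grammar = {nt: list(rhss) for nt, rhss in grammar.items()}
    for nt, rhss in grammar.items():
        rules = {()}  # the empty string is a prefix of every string
        for rhs in rhss:
            for i, symbol in enumerate(rhs):
                # Derive rhs[:i] completely, then stop somewhere inside rhs[i].
                rules.add(rhs[:i] + (pre(symbol),))
        new_grammar[pre(nt)] = sorted(rules)
    return new_grammar, pre(start)


def strings_up_to(grammar: Grammar, start: str, max_len: int) -> set[str]:
    """Brute-force every string of length <= max_len derivable from start."""
    lang: dict[str, set[str]] = {nt: set() for nt in grammar}
    changed = True
    while changed:  # fixed-point iteration over finite sets, so it terminates
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                partial = {""}
                for symbol in rhs:
                    options = lang[symbol] if symbol in grammar else {symbol}
                    partial = {p + o for p in partial for o in options
                               if len(p + o) <= max_len}
                if not partial <= lang[nt]:
                    lang[nt] |= partial
                    changed = True
    return lang[start]


# Toy grammar for { a^n b^n : n >= 0 }.
anbn: Grammar = {"S": [("a", "S", "b"), ()]}
pg, pstart = prefix_grammar(anbn, "S")

generated = strings_up_to(pg, pstart, 4)
# Every prefix of length <= 4 extends to a full string of length <= 8.
expected = {w[:k] for w in strings_up_to(anbn, "S", 8)
            for k in range(min(len(w), 4) + 1)}
print(generated == expected)
```

Because the result is an ordinary context-free grammar, any existing parser for such grammars could be run on it unchanged; the brute-force enumerator here is only a check on the toy example.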