💬 LLM / NLP

🔬 ICLR2026 · 46 paper notes

AP-OOD: Attention Pooling for Out-of-Distribution Detection

This paper proposes AP-OOD, which replaces the mean pooling in Mahalanobis distance-based OOD detection with learnable attention pooling, addressing the information loss caused by mean aggregation of token-level anomaly signals. On text OOD detection, AP-OOD reduces FPR95 on XSUM summarization from 27.84% to 4.67%, while supporting a smooth transition from unsupervised to semi-supervised settings.
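A minimal sketch of the idea, under assumptions: token embeddings are pooled with a learned attention query instead of a mean, and the pooled vector is scored by Mahalanobis distance to in-distribution statistics. The toy data, query vector, and dimensions are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16  # embedding dim and sequence length (toy sizes)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(tokens, query):
    """Pool token embeddings (T, d) by a learned-query attention instead of a
    mean, so a few anomalous tokens are not averaged away."""
    w = softmax(tokens @ query)   # (T,) attention over tokens
    return w @ tokens             # (d,) pooled representation

query = rng.normal(size=d)  # stands in for the trained attention query

# Fit Mahalanobis statistics on pooled in-distribution features (toy data).
id_seqs = rng.normal(0.0, 1.0, size=(200, T, d))
feats = np.array([attention_pool(s, query) for s in id_seqs])
mu = feats.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(feats, rowvar=False) + 1e-3 * np.eye(d))

def ood_score(seq):
    """Higher score = more out-of-distribution."""
    z = attention_pool(seq, query) - mu
    return float(z @ cov_inv @ z)

in_score = ood_score(rng.normal(0.0, 1.0, size=(T, d)))
out_score = ood_score(rng.normal(4.0, 1.0, size=(T, d)))  # shifted sequence
```

With any reasonable query, the shifted sequence scores far higher than the in-distribution one; in the paper the query is learned, which is what enables the unsupervised-to-semi-supervised transition.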

AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

This paper proposes AssetFormer, an autoregressive Transformer-based framework for modular 3D asset generation. By designing graph-traversal token ordering, token set modeling, and a SlowFast decoding strategy, it generates high-quality architectural assets composed of discrete primitives from text descriptions, and introduces the first large-scale real-world modular 3D dataset (16k real + 4k synthetic samples).

BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

This paper proposes BOTS—a unified Bayesian inference framework for online task selection in LLM reinforcement finetuning. BOTS integrates explicit evidence (historical pass rates from direct evaluation) and implicit evidence (difficulty estimates for unevaluated tasks inferred via reference model interpolation), combined with Thompson sampling for exploration–exploitation balance. The framework achieves up to 50% training speedup on math, code, and logic tasks with only 0.2% additional computational overhead.
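The Thompson-sampling core can be sketched as follows, under assumptions: each task keeps a Beta posterior over its pass rate from observed rollouts, and the selector samples a rate per task and picks the one closest to a target difficulty. The tasks, counts, and the closest-to-0.5 acquisition rule are illustrative; the paper's evidence integration is richer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-task explicit evidence: (successes, failures) from past rollouts.
# Implicit evidence could seed these priors; here we use a flat Beta(1, 1).
tasks = {
    "easy":   (18, 2),   # ~90% pass rate: too easy to be informative
    "medium": (10, 10),  # ~50%: maximally informative for RL finetuning
    "hard":   (1, 19),   # ~5%: mostly unsolvable
}

def thompson_select(tasks, target=0.5):
    """Sample a pass rate from each task's Beta posterior and pick the task
    whose sample lands closest to the target difficulty (an assumed rule)."""
    samples = {name: rng.beta(s + 1, f + 1) for name, (s, f) in tasks.items()}
    return min(samples, key=lambda name: abs(samples[name] - target))

picks = [thompson_select(tasks) for _ in range(1000)]
frac_medium = picks.count("medium") / len(picks)
```

Posterior sampling keeps some exploration of "easy" and "hard" alive while concentrating training on mid-difficulty tasks, which is the exploration-exploitation balance the summary refers to.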

Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

This paper introduces the Compositional-ARC dataset to evaluate systematic generalization in abstract spatial reasoning—specifically, whether models can generalize from known primitive geometric transformations (e.g., translation, rotation) to unseen combinations thereof. A 5.7M-parameter encoder-decoder model trained with MLC achieves 78.26% exact match on the systematicity task, matching the ARC Prize 2024 champion (8B model + TTT) while vastly outperforming GPT-4o, o3-mini, and similar models (<3%).

d²Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

This paper proposes d²Cache, a training-free approximate KV cache framework for diffusion-based LLMs (dLLMs), achieving 4.1× inference speedup while simultaneously improving generation quality via a two-stage strategy: deterministic prior-guided masked token selection followed by attention-aware non-masked token selection.

DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas

DreamOn introduces two special states, [expand] and [delete], to overcome the fixed-length generation constraint of diffusion language models (DLMs), enabling variable-length code infilling without any architectural modification. It achieves an average improvement of 26.4% over diffusion baselines on HumanEval-Infilling, reaching performance on par with state-of-the-art autoregressive models.

ELLMob: Event-Driven Human Mobility Generation with Self-Aligned LLM Framework

This paper proposes ELLMob, a framework grounded in Fuzzy-Trace Theory (FTT) from cognitive psychology. By extracting and iteratively aligning "habit gist" and "event gist," the framework reconciles the competition between users' routine patterns and social event constraints, enabling interpretable event-driven trajectory generation.

Enhancing Persona Following at Decoding Time via Dynamic Importance-Guided Token Estimation for Role-Playing Agents

This paper proposes Persona Dynamic Decoding (PDD), a framework that dynamically estimates the context-dependent importance of persona attributes via conditional mutual information and integrates importance scores into multi-objective reward-guided decoding, achieving training-free inference-time persona following.

Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

This paper proposes a context-aware pairwise comparison framework for evaluating text creativity, constructs the CreataSet dataset comprising 100K+ human-annotated and 1M+ synthetic samples, and trains the CrEval evaluator, which surpasses GPT-4o by 18.7% in alignment with human judgments.

Fine-Grained Activation Steering: Steering Less, Achieving More

AUSteer reveals that block-level activation steering is inherently heterogeneous—different dimensions govern different token distributions, and steering the entire block simultaneously amplifies both beneficial and harmful signals. The paper proposes fine-grained steering at the Atomic Unit (AU) level: discriminative dimensions are identified via activation momentum, steering strength is adaptively allocated, and intervening on only ≤100 dimensions substantially outperforms state-of-the-art methods that steer thousands of dimensions.
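A toy sketch of dimension-selective steering, under assumptions: discriminative dimensions are ranked by the magnitude of the mean activation difference between behaviour-positive and baseline activations (a stand-in for the paper's activation-momentum criterion), and the steering vector is applied only on those dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 512, 16  # hidden size and steering budget (toy values)

# Toy activations: the target behaviour shifts only a small set of dimensions.
truth = rng.choice(d, size=k, replace=False)
base = rng.normal(size=(200, d))
pos = base.copy()
pos[:, truth] += 4.0  # behaviour-relevant "atomic units"

# Rank dimensions by mean activation difference; keep only the top k.
delta = pos.mean(axis=0) - base.mean(axis=0)
selected = np.argsort(-np.abs(delta))[:k]

def steer(h, delta, dims, alpha=1.0):
    """Intervene on only the selected dimensions; leave the rest untouched."""
    out = h.copy()
    out[dims] += alpha * delta[dims]
    return out

h = rng.normal(size=d)
steered = steer(h, delta, selected)
changed = np.nonzero(steered != h)[0]
```

Steering a handful of dimensions leaves the other ~500 signals intact, which is why fine-grained intervention can beat whole-block steering despite touching far fewer coordinates.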

First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Through theoretical analysis and empirical experiments, this paper demonstrates that the widely accepted claim that "the first layer (embedding) is best suited for influence estimation" is unreliable. The work finds that intermediate attention layers are more effective, proposes two novel cross-layer aggregation strategies—Rank and Vote—along with a Noise Detection Rate (NDR) proxy metric, and achieves significant improvements in detecting harmful training samples in LLMs.
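The two aggregation strategies can be sketched like this, under assumptions about their exact form: Rank averages a sample's per-layer suspiciousness rank, Vote counts how many layers place the sample in their top-k. The toy influence scores and the flagging thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_layers, n_samples, k = 6, 100, 10

# Toy per-layer influence scores: higher = more likely a harmful/noisy sample.
scores = rng.normal(size=(n_layers, n_samples))
noisy = [3, 7, 42]
scores[:, noisy] += 4.0  # genuinely noisy samples stand out in every layer

# Rank: average each sample's rank across layers (0 = most suspicious).
ranks = np.argsort(np.argsort(-scores, axis=1), axis=1)
rank_agg = ranks.mean(axis=0)
rank_flags = set(np.argsort(rank_agg)[:k])

# Vote: count how many layers put the sample in their per-layer top-k.
votes = np.zeros(n_samples, dtype=int)
for layer_topk in np.argsort(-scores, axis=1)[:, :k]:
    votes[layer_topk] += 1
vote_flags = set(np.argsort(-votes)[:k])
```

Both aggregators are robust to a single layer mis-ranking a sample, which is the point of combining evidence across layers rather than trusting the embedding layer alone.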

From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning

This paper proposes PCE (Planner-Composer-Evaluator), a framework that explicitly extracts and organizes implicit environmental assumptions from LLM reasoning chains into decision trees, enabling uncertainty-aware action selection via a likelihood-gain-cost scoring function, thereby substantially reducing communication overhead in multi-agent collaboration.

FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Model

This paper proposes FS-DFM (Few-Step Discrete Flow-Matching), which reduces the sampling steps of discrete flow-matching language models from 1024 to 8 through step-aware training and a cumulative scalar update rule, achieving a 128× speedup while maintaining comparable perplexity and generation quality.

Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Using off-by-one addition (e.g., 1+1=3, 2+2=5) as a counterfactual task, this work applies path patching to reveal a function induction mechanism within large language models — an attention head circuit that performs inductive reasoning at the function level, beyond token-level pattern matching — and demonstrates that this mechanism is reused across tasks.

GASP: Guided Asymmetric Self-Play For Coding LLMs

GASP introduces "goalposts" (hard target problems) into asymmetric self-play to guide the teacher in generating targeted training problems. Through a lemma (simplified variant) → lift (harder variant) curriculum structure, the framework progressively approaches difficult targets, surpassing unguided self-play by 2.5% on LiveCodeBench and solving hard problems that all baselines fail to solve.

Generative Value Conflicts Reveal LLM Priorities

This paper proposes ConflictScope, an automated pipeline for generating value-conflict scenarios. Through open-ended evaluation (rather than multiple-choice), it reveals LLMs' value priority rankings under conflict conditions. Key findings show that models shift from protective values (e.g., harmlessness) toward personal values (e.g., user autonomy) in open-ended settings, and that system prompts can improve alignment with target rankings by 14%.

How Catastrophic is Your LLM? Certifying Risk in Conversation

This paper proposes C3LLM (Certification of Catastrophic risks in multi-turn Conversation for LLMs), the first framework to provide statistical certification of catastrophic risks in multi-turn LLM conversations. It models conversation distributions as Markov processes over a semantic similarity graph, defines three conversation sampling strategies augmented with a jailbreak layer, and applies Clopper-Pearson 95% confidence intervals to certify the probability that a model produces harmful outputs—finding that the worst-performing model has a risk lower bound as high as 72%.
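The certification step rests on the standard exact Clopper-Pearson interval for a binomial proportion. A self-contained sketch (bisection on the binomial tails; the sampling strategies and jailbreak layer are out of scope here):

```python
from math import comb

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) Clopper-Pearson interval for a binomial
    proportion, found by bisection on the monotone binomial tails."""
    def bisect(pred, lo, hi):
        # pred(p) is True while p is below the root; bisection tightens.
        for _ in range(60):
            mid = (lo + hi) / 2
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: binom_tail_ge(k, n, p) < alpha / 2, 0.0, 1.0)
    upper = 1.0 if k == n else bisect(
        lambda p: binom_tail_ge(k + 1, n, p) < 1 - alpha / 2, 0.0, 1.0)
    return lower, upper

# E.g. 40 harmful conversations observed out of 50 sampled (toy numbers):
lo_risk, hi_risk = clopper_pearson(40, 50)
```

The lower endpoint is the certified risk bound: with 95% confidence the true harmful-output probability is at least `lo_risk`, no matter how the remaining conversations would have turned out.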

How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use

This paper systematically analyzes three core reasoning deficiencies of LLMs in poker (heuristic reasoning, factual misunderstanding, and knowing-doing gap), and proposes ToolPoker — the first tool-integrated LLM reasoning system for incomplete information games. By incorporating an external CFR solver to provide game-theoretically optimal action guidance, a 7B model approaches Nash equilibrium performance in Limit Hold'em.

Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

This paper proposes that the Reversal Curse is a manifestation in Transformers of the "binding problem" from cognitive science—stemming from inconsistent and entangled concept representations—and, for the first time, designs an architecture based on JEPA and memory layers that genuinely overcomes (rather than circumvents) the Reversal Curse.

KVComm: Enabling Efficient LLM Communication through Selective KV Sharing

This paper proposes KVComm, a framework that enables efficient inter-LLM communication via selective KV pair sharing. It identifies an "information concentration bias" in hidden states that renders them unsuitable for cross-model transfer, and designs a layer selection strategy combining attention importance scores with a Gaussian prior. Transmitting only 30% of layers suffices to outperform most baselines.

LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery

This paper proposes LLEMA, a framework that integrates LLM scientific knowledge with chemistry-rule-guided evolutionary search and memory-driven iterative optimization, achieving superior hit rates, stability, and Pareto front quality across 14 multi-objective materials discovery tasks.

Meta-RL Induces Exploration in Language Agents

This paper proposes LaMer, a framework that introduces Meta-Reinforcement Learning (Meta-RL) into LLM agent training. By optimizing rewards across episodes and enabling context-based policy adaptation via self-reflection, LaMer equips language agents with active exploration capabilities, achieving absolute performance gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively.

Near-Optimal Online Deployment and Routing for Streaming LLMs

This work provides the first formal treatment of the joint LLM streaming online deployment and routing problem, where new models continuously arrive and existing models may become obsolete. Under a concurrency deployment cap \(M_{\max}\) and cost budget constraints, the paper proposes StageRoute, a hierarchical algorithm that achieves a provable \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound with a matching lower bound, establishing near-optimality.

Neural Synchrony Between Socially Interacting Language Models

This paper presents the first investigation of neural synchrony between LLMs engaged in social interaction. By training affine transformations to predict a partner model's future representations, it defines the \(SyncR^2\) metric to quantify synchrony strength. The results show that synchrony depends on social engagement and temporal proximity, and correlates strongly with LLMs' social behavioral performance (Pearson \(r\) = 0.88–0.99), echoing neuroscientific findings on inter-brain synchrony (IBS) in humans.
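The metric can be sketched as follows, under assumptions: fit an affine map from one agent's representations to its partner's by least squares and report the R² of the fit. The toy "representations", the coupling strength, and the omission of the future-prediction lag and cross-validation are all simplifications.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d = 300, 6  # timesteps and representation dim (toy sizes)

# Toy agent representations: a coupled partner tracks agent A plus noise,
# versus an independent control partner.
A = rng.normal(size=(T, d))
B_coupled = 0.8 * A + 0.2 * rng.normal(size=(T, d))
B_control = rng.normal(size=(T, d))

def sync_r2(X, Y):
    """R^2 of the best affine map X -> Y; a stand-in for SyncR^2."""
    Xa = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    resid = Y - Xa @ W
    return 1.0 - (resid ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()

r2_coupled = sync_r2(A, B_coupled)
r2_control = sync_r2(A, B_control)
```

High R² for the socially coupled pair and near-zero for the independent pair mirrors the paper's finding that synchrony tracks social engagement.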

Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards

This paper proposes Optimas, a framework that maintains a locally aligned reward function (LRF) per component in compound AI systems, enabling independent optimization of heterogeneous components (prompts, model parameters, hyperparameters, model selection), achieving an average improvement of 11.92% across five real-world systems.

Predicting LLM Reasoning Performance with Small Proxy Models

This paper proposes rBridge, a method that combines NLL evaluation on frontier-model reasoning traces with token-level task alignment weights, enabling models with ≤1B parameters to effectively predict the reasoning performance of 13B–32B models, reducing data ranking computation cost by over 100×.

PT2-LLM: Post-Training Ternarization for Large Language Models

This paper proposes PT2-LLM, the first post-training ternarization framework for LLMs. Through an asymmetric ternary quantizer (featuring iterative ternary fitting and activation-aware grid alignment) and a structural similarity reordering strategy, it achieves superior performance over 2-bit PTQ methods at 1.58-bit precision.

Rethinking Code Similarity for Automated Algorithm Design with LLMs

This paper proposes BehaveSim, an algorithm similarity metric based on Problem-Solving Trajectories (PSTrajs) and Dynamic Time Warping (DTW). BehaveSim measures algorithmic differences at the level of execution behavior rather than syntax or output, and when integrated into LLM-AAD frameworks such as FunSearch and EoH, yields significant performance improvements.
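The DTW component is standard; a minimal sketch over scalar trajectories (the paper's PSTrajs are richer execution traces, and the pointwise distance here is an assumption):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic-time-warping distance between two trajectories."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Toy problem-solving trajectories as per-step objective values: the same
# search behaviour at two speeds, versus a genuinely different behaviour.
fast = [10, 6, 3, 1]
slow = [10, 10, 6, 6, 3, 3, 1]
other = [10, 9, 8, 7, 6, 5, 4]

d_same = dtw(fast, slow)
d_diff = dtw(fast, other)
```

DTW declares the fast and slow runs identical (distance 0) despite different lengths, which is exactly the behaviour-level invariance that syntax- or output-based similarity misses.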

Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure

Starting from the proper scoring rules framework, this paper proves that the negative log-likelihood of the highest-probability output sequence (MSP) is a theoretically grounded uncertainty measure, and proposes G-NLL — a method that approximates this measure with a single greedy decoding pass, matching or surpassing SOTA methods that require multiple samples across several benchmarks.
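The measure itself is simple: sum the negative log-probabilities of the greedily chosen token at each step. A sketch with toy next-token distributions standing in for model outputs:

```python
import math

# Toy per-step next-token distributions (stand-ins for softmaxed logits).
steps = [
    {"the": 0.7, "a": 0.2, "an": 0.1},
    {"cat": 0.6, "dog": 0.3, "car": 0.1},
    {"sat": 0.9, "ran": 0.1},
]

def g_nll(steps):
    """G-NLL: negative log-likelihood of the greedy sequence, available from
    a single greedy decoding pass -- no extra samples needed."""
    nll = 0.0
    for dist in steps:
        nll -= math.log(max(dist.values()))  # greedy token's probability
    return nll

uncertainty = g_nll(steps)
```

A single forward pass thus yields a sequence-level uncertainty score, versus the many sampled generations that self-consistency-style estimators require.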

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

By proposing the Single-Location Regression (SLR) theoretical framework and employing the order parameter method from statistical physics, this paper rigorously proves in the high-dimensional limit that softmax attention achieves the Bayes risk at the population level while linear attention fundamentally cannot. Under finite-sample regimes, softmax is shown to consistently outperform linear attention. This work provides the first principled explanation for the superiority of softmax attention in retrieval tasks.

Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

This paper proposes SureLock, which permanently locks token positions in Masked Diffusion LMs once their posterior distributions stabilize after unmasking—skipping Q projection and FFN while caching KV—thereby reducing per-step attention computation from \(O(N^2d)\) to \(O(MNd)\). SureLock achieves 30–50% FLOPs reduction on LLaDA-8B without degrading generation quality.

The Lattice Representation Hypothesis of Large Language Models

This paper proposes the Lattice Representation Hypothesis (LRH) for LLMs: by unifying the Linear Representation Hypothesis with Formal Concept Analysis (FCA), it demonstrates that attribute directions in LLM embedding spaces implicitly encode a concept lattice via half-space intersections, thereby bridging continuous geometry and symbolic abstraction.

The Path of Least Resistance: Guiding LLM Reasoning Trajectories for Efficient Consistency

This paper proposes PoLR (Path of Least Resistance), the first inference-time method that exploits reasoning prefix consistency. By clustering short prefixes and expanding only the dominant cluster, PoLR serves as an efficient alternative to Self-Consistency, reducing token usage by up to 60% and latency by up to 50%.
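The selection step can be sketched as follows, under assumptions: prefixes are clustered here by exact match (the paper clusters semantically), and only the dominant cluster's chains are rolled out to completion.

```python
from collections import Counter

# Toy reasoning prefixes sampled from a model (first tokens of each chain).
prefixes = [
    "Let x = 12.", "Let x = 12.", "Let x = 12.",
    "Try factoring.", "Let x = 12.", "Guess and check.",
]

def dominant_cluster(prefixes):
    """Group prefixes and return the dominant cluster and its size; only
    chains in this cluster would be expanded to full reasoning traces."""
    best, n = Counter(prefixes).most_common(1)[0]
    return best, n

best, n = dominant_cluster(prefixes)
saved = 1 - n / len(prefixes)  # fraction of full rollouts that are skipped
```

Pruning the minority clusters before full generation is where the token and latency savings over vanilla Self-Consistency come from.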

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerabilities

This paper identifies a priming vulnerability in masked diffusion language models (MDLMs)—injecting affirmative tokens at intermediate denoising steps can bypass safety guardrails—and proposes Recovery Alignment (RA), a training method that teaches models to recover safe responses from corrupted intermediate states.

Trapped by simplicity: When Transformers fail to learn from noisy features

This paper demonstrates that Transformers fail to learn Boolean functions from feature-noisy data. Their simplicity bias—a tendency to learn low-sensitivity functions—causes models to become trapped at optimal noisy predictors that are simpler than the target function, preventing recovery of the true noiseless target.

Unsupervised Evaluation of Multi-Turn Objective-Driven Interactions

Three unsupervised metrics are proposed—LLM-guided clustering (goal identification), interaction completeness detection via fine-tuned completion models, and response trees (LLM uncertainty quantification)—for evaluating multi-turn objective-driven dialogues without labeled data or LLM-as-a-judge, achieving performance that matches or exceeds a 70B judge using only an 8B model.

WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality

This work introduces WebDevJudge, a meta-evaluation benchmark that systematically assesses the ability of LLMs/MLLMs and agentic workflows to serve as judges for web development quality. Results reveal an approximately 15% agreement gap between the strongest current models and human experts, and identify two fundamental bottlenecks: failure to recognize functional equivalence and inadequate feasibility verification.

Weight Decay may matter more than μP for Learning Rate Transfer in Practice

Through large-scale empirical analysis, this paper demonstrates that the core alignment assumption of μP holds only briefly at the start of training. In practice, it is independent weight decay rather than μP that correctly stabilizes feature learning dynamics across widths, and the practical benefits of μP can be reinterpreted as a form of implicit learning rate warmup.

When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

Through a controlled behavioral evaluation framework, this paper identifies four hidden failure modes of LLMs in data-constrained scientific decision-making tasks: high stability ≠ correctness, prompt-wording sensitivity, over-selection under relaxed thresholds, and hallucination of invalid identifiers.
