💬 LLM / NLP
🧠 NeurIPS 2025 · 52 paper notes
- AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play
-
This paper proposes AceSearcher—a collaborative self-play framework in which a single LLM simultaneously plays two roles: a decomposer (breaking complex queries into sub-questions to guide retrieval) and a solver (integrating retrieved context to generate answers). Through a two-stage training pipeline of SFT followed by iterative DPO, using only final-answer rewards, AceSearcher achieves an average EM improvement of 7.6% across 10 datasets, and the 32B model matches DeepSeek-V3 with fewer than 5% of its parameters.
- Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs
-
This paper proposes CAKE (Context-Aware Kernel Evolution), which leverages LLMs as crossover and mutation operators within a genetic algorithm framework to adaptively generate and evolve GP kernel expressions during Bayesian optimization. Combined with the BAKER ranking mechanism that balances model fit (BIC) and expected improvement (EI), CAKE consistently outperforms both fixed-kernel and adaptive-kernel baselines on tasks including hyperparameter optimization, controller tuning, and photonic chip design.
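A minimal sketch of the evolutionary loop this implies, assuming hypothetical callbacks: `llm_crossover` and `llm_mutate` stand in for the LLM-backed operators, and `score_kernel` stands in for a BIC/EI-style fitness such as BAKER; none of these names are the paper's API.

```python
import random

def evolve_kernels(population, llm_crossover, llm_mutate, score_kernel,
                   generations=10, pop_size=8):
    # Genetic-algorithm skeleton over textual GP-kernel expressions,
    # with an LLM proposing offspring via the hypothetical callbacks.
    for _ in range(generations):
        scored = sorted(population, key=score_kernel, reverse=True)
        parents = scored[: max(2, pop_size // 2)]            # keep the fittest
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = llm_crossover(a, b)                       # LLM combines two kernels
            if random.random() < 0.3:
                child = llm_mutate(child)                     # LLM perturbs the expression
            children.append(child)
        population = parents + children
    return max(population, key=score_kernel)
```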
- Are Language Models Efficient Reasoners? A Perspective from Logic Programming
-
This paper proposes a framework for evaluating LLM reasoning efficiency (rather than correctness alone) from a logic programming perspective. By mapping natural language proofs to logic program proofs via verbalized logic programs, the authors find that current LLMs not only suffer accuracy degradation on math problems containing irrelevant axioms, but also exhibit severely inefficient reasoning—more than half of all reasoning steps are unnecessary.
- AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise
-
AutoDiscovery proposes Bayesian Surprise as an objective reward signal for open-ended scientific discovery — estimating the KL divergence between prior and posterior belief distributions via LLM sampling, combined with MCTS and progressive widening to explore the hypothesis space. On 21 real-world datasets, the method produces 5–29% more surprising discoveries than greedy/beam search baselines. Human evaluation confirms that Bayesian Surprise aligns with expert "surprise" ratings (0.67), substantially outperforming LLM self-evaluated "novelty" and "usefulness."
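As a toy illustration of the surprise signal, the sketch below estimates KL(posterior ‖ prior) for a discrete belief variable from two sets of samples (in the paper these would come from LLM sampling before and after observing an experimental result); the Laplace smoothing is an assumption made here for numerical safety.

```python
import math
from collections import Counter

def bayesian_surprise(prior_samples, posterior_samples, outcomes):
    # KL(posterior || prior) over a discrete belief variable, with both
    # distributions estimated from samples and Laplace-smoothed.
    def estimate(samples):
        counts = Counter(samples)
        return {o: (counts.get(o, 0) + 1) / (len(samples) + len(outcomes))
                for o in outcomes}
    p, q = estimate(posterior_samples), estimate(prior_samples)
    return sum(p[o] * math.log(p[o] / q[o]) for o in outcomes)

# Example: beliefs about whether a hypothesis holds shift after an experiment.
print(bayesian_surprise(["yes"] * 5 + ["no"] * 5,
                        ["yes"] * 9 + ["no"] * 1,
                        ["yes", "no"]))
```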
- C²Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning
-
This paper proposes C²Prompt to address class-level knowledge inconsistency during prompt communication in federated continual learning. It explicitly enhances class-level knowledge coherence across clients via two mechanisms, Local Class Distribution Compensation (LCDC) and Class-aware Prompt Aggregation (CPA), achieving an average accuracy of 87.20% on ImageNet-R and surpassing the previous SOTA, Powder, by 2.51%.
- CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers
-
CAT replaces the \(N \times N\) attention matrix in standard self-attention with a circulant matrix generated from an \(N\)-dimensional vector, leveraging FFT to achieve \(O(N \log N)\) attention computation. While strictly preserving the softmax row-normalization structure, CAT matches or surpasses standard attention on ImageNet-1k (avg pool, CLIP-L accuracy 0.694 vs. 0.646) and WikiText-103 masked LM (PPL 8.32 vs. 9.82).
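A single-head sketch of the core idea: a length-\(N\) generating vector is softmax-normalized (so every row of the implied circulant matrix sums to 1) and applied to the values by circular convolution via FFT in \(O(N \log N)\). How the generating vector is produced from the input here (a learned query vector) is an assumption, not the paper's exact layer.

```python
import torch

def circulant_attention(x, q, w_v):
    # x: (N, d) token embeddings; q: (d,) learned vector; w_v: (d, d) value projection.
    N, _ = x.shape
    v = x @ w_v                                    # values, (N, d)
    w = torch.softmax(x @ q, dim=0)                # generating vector, (N,), sums to 1
    # Circulant matrix-vector product C(w) @ v computed as circular convolution via FFT:
    out = torch.fft.irfft(
        torch.fft.rfft(w, n=N).unsqueeze(-1) * torch.fft.rfft(v, n=N, dim=0),
        n=N, dim=0,
    )                                              # O(N log N) instead of O(N^2)
    return out

x = torch.randn(16, 32)
y = circulant_attention(x, torch.randn(32), torch.randn(32, 32))
print(y.shape)  # torch.Size([16, 32])
```

Because each row of the circulant matrix is a permutation of the softmax-normalized vector, the rows remain nonnegative and sum to 1, which is the row-normalization structure the paper preserves.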
- Characterizing the Expressivity of Fixed-Precision Transformer Language Models
-
This work precisely characterizes the expressive power of fixed-precision, strictly causal, soft-attention, NoPE Transformers — showing it is exactly equivalent to linear temporal logic restricted to past operators, LTL[P] — and unifies this characterization with partially ordered deterministic finite automata (PODFA) and \(\mathcal{R}\)-trivial monoids.
- Composing Linear Layers from Irreducibles
-
By leveraging Clifford algebra, this work represents linear layers as compositions of bivectors—specifically as rotor sandwich products—requiring only \(O(\log^2 d)\) parameters to replace a \(d \times d\) dense matrix. When applied to Q/K/V projections in LLM attention layers, performance closely matches the original model and strong baselines.
- Cultural Alien Sampler: Open-ended Art Generation Balancing Originality and Coherence
-
This paper proposes the Cultural Alien Sampler (CAS), which employs two GPT-2 models to separately model "concept coherence" and "cultural typicality," selecting concept combinations with high coherence but low cultural typicality to generate original yet harmonious artistic ideas. In human evaluations, CAS approaches the level of art students and substantially outperforms GPT-4o.
- Detecting High-Stakes Interactions with Activation Probes
-
Linear activation probes (lightweight classifiers trained on LLM internal representations) are used to detect "high-stakes interactions" from users. Trained on synthetic data, these probes achieve AUROC of 0.88–0.92 across 6 real-world datasets, matching fine-tuned 8–12B LLMs at a computational cost six orders of magnitude lower. A cascaded architecture (probe pre-filtering + LLM refinement) further surpasses either component used alone.
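A minimal sketch of such a probe, with random arrays standing in for the pooled hidden states of a chosen layer and for the high-stakes labels; the layer choice, pooling, and regularization used in the paper are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 4096))        # placeholder: hidden states, one row per prompt
labels = rng.integers(0, 2, size=512)      # placeholder: 1 = high-stakes interaction
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
risk_scores = probe.predict_proba(acts[:4])[:, 1]   # probability of "high-stakes"
```

In the cascaded setup, only inputs whose probe score passes a threshold would be forwarded to an LLM for refinement.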
- Don't Be Lazy: CompleteP Enables Compute-Efficient Deep Transformers
-
This paper shows that the CompleteP parameterization (α = 1) is the only scheme that simultaneously achieves hyperparameter transfer along the depth dimension and complete (non-lazy) feature learning, saving 12–34% of FLOPs relative to μP on deep models.
- EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths
-
This paper proposes the Probabilistic Angelic Nondeterminism (PAN) programming model and the EnCompass Python framework, which decouple an agent's core workflow logic from its inference-time search strategy. Programmers only need to insert `branchpoint()` markers at LLM call sites and can switch among best-of-N, beam search, tree search, and other strategies via a few configuration parameters, reducing the amount of code modification by 3–6×.
- EvoRefuse: Evaluating and Mitigating LLM Over-Refusal via Evolutionary Prompt Optimization
-
This paper proposes EvoRefuse, a framework that employs evolutionary search to maximize the ELBO for automatically generating diverse pseudo-malicious instructions, yielding a more challenging over-refusal evaluation benchmark (EvoRefuse-Test) and an effective alignment mitigation dataset (EvoRefuse-Align).
- GeoCAD: Local Geometry-Controllable CAD Generation with Large Language Models
-
GeoCAD is proposed as the first method for locally geometry-controllable CAD generation. It introduces a complementary captioning strategy to generate geometric instructions for local parts and fine-tunes an LLM to enable precise modification of local CAD components according to user-defined text instructions.
- Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales
-
This paper investigates the hyperparameter scaling rules for matrix-preconditioned optimizers (Shampoo/SOAP/Muon) with respect to model width and depth under the μP framework, and demonstrates that correct hyperparameter scaling is the key to achieving consistent speedups. Using μP with \(1/\text{width}\) weight decay, all three optimizers consistently achieve approximately \(1.4\times\) speedup on Llama models ranging from 190M to 1.4B parameters.
- In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation
-
This paper analyzes the ICL approximation capability of linear Transformers on noisy linear dynamical systems: \(O(\log T)\) depth suffices to achieve \(O(\log T / T)\) test error (approaching the least-squares estimator), while single-layer linear Transformers admit an irreducible lower bound — revealing a depth-separation phenomenon under non-IID data.
- Large Language Models Miss the Multi-Agent Mark
-
This position paper systematically surveys 1,400+ papers to argue that current LLM-based multi-agent systems (MAS LLMs) deviate from foundational MAS theory along four dimensions: LLMs lack native social behavior, environment design is LLM-centric, asynchronous coordination and standard communication protocols are absent, and emergent behaviors lack quantification. The paper warns that the field risks reinventing the wheel while ignoring 40 years of MAS research.
- Linear Transformers Implicitly Discover Unified Numerical Algorithms
-
After training linear Transformers on a masked block matrix completion task, algebraic analysis of the learned weights reveals that the models implicitly converge to the same two-line iterative update rule—EAGLE—under three fundamentally different computational constraints (centralized, distributed, and compute-limited). This rule achieves second-order convergence with only logarithmic dependence on the condition number.
- MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
-
This paper proposes MonarchAttention, which leverages the structured properties of Monarch matrices and employs alternating optimization over a variational form of softmax to approximate attention at \(\Theta(N\sqrt{N}d)\) complexity. The method enables zero-shot replacement of attention layers in pretrained Transformers without any additional training, while achieving 1.4×–8.2× speedups over FlashAttention-2 on GPU.
- MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery
-
This work formalizes fine-grained scientific hypothesis generation as a combinatorial optimization problem and proposes Hierarchical Heuristic Search (HHS)—using LLM pairwise comparisons as gradient signals to navigate the hypothesis space, with hierarchical abstraction smoothing the reward landscape to reduce local optima entrapment. On an expert-annotated benchmark of 51 post-2024 chemistry papers, Soft Recall improves from 19.99% to 40.35%.
- msf-CNN: Patch-based Multi-Stage Fusion with Convolutional Neural Networks for TinyML
-
This paper proposes msf-CNN, a multi-stage patch-based fusion optimization technique based on a directed acyclic graph (DAG) shortest-path algorithm. By efficiently searching for the optimal fusion configuration of CNNs, msf-CNN achieves 50%–87% reduction in peak RAM usage compared to existing methods (MCUNetV2, StreamNet) across various microcontrollers (ARM Cortex-M, RISC-V, ESP32), while maintaining controllable computational overhead.
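An illustrative dynamic program over such a DAG: nodes are layer boundaries, an edge (i, j) fuses layers i..j-1 into one patch-processed stage, and the objective here is to minimize the worst stage's peak RAM. The `stage_peak_ram` callback is hypothetical and the max-based objective is a simplifying assumption about the paper's cost model.

```python
import math

def best_fusion(num_layers, stage_peak_ram):
    # best[j] = minimal achievable peak RAM for layers 0..j-1; cut[j] = previous boundary.
    best = [math.inf] * (num_layers + 1)
    best[0] = 0.0
    cut = [0] * (num_layers + 1)
    for j in range(1, num_layers + 1):
        for i in range(j):
            peak = max(best[i], stage_peak_ram(i, j))   # overall peak = worst stage
            if peak < best[j]:
                best[j], cut[j] = peak, i
    # Recover the chosen fusion boundaries.
    stages, j = [], num_layers
    while j > 0:
        stages.append((cut[j], j))
        j = cut[j]
    return best[num_layers], stages[::-1]
```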
- Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
-
Nemotron-Flash constructs a latency-optimal family of small language models through systematic optimization of depth-width ratios, evolutionary search over hybrid operator combinations (DeltaNet + Mamba2 + Attention), and weight-normalization-based training. Compared to Qwen3-1.7B/0.6B, it achieves 1.3×/1.9× latency reduction alongside a +5.5% average accuracy improvement.
- On the Role of Hidden States of Modern Hopfield Network in Transformer
-
This paper moves beyond the adiabatic approximation underlying the established correspondence between Modern Hopfield Networks (MHN) and Transformers. By retaining the hidden-state dynamics of MHN, it derives a novel attention mechanism, Modern Hopfield Attention (MHA), that introduces cross-layer propagation of attention scores within self-attention layers. MHA systematically improves the performance of ViT and GPT-2 without adding any trainable parameters, and is shown both theoretically and empirically to effectively alleviate the rank collapse problem in deep Transformers.
- Opinion Maximization in Social Networks by Modifying Internal Opinions
-
This paper studies the optimization problem of maximizing the overall opinion in a social network by modifying the internal opinions of \(k\) key nodes. Two sampling-based approximation algorithms (random walk and forest sampling) and one exact asynchronous algorithm MIS are proposed. MIS provides theoretical convergence guarantees to the optimal solution and demonstrates superior efficiency and accuracy on real-world networks with tens of millions of nodes.
- Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
-
This paper proposes PNLC, a method that trains a lightweight goal-conditioned value function as a "natural language critic" to guide LLM agents in multi-turn planning and self-refinement at the thought-step level. Without direct fine-tuning or inference-time search, PNLC significantly outperforms existing methods on complex interactive tasks such as web navigation, social reasoning, and persuasion, while achieving 8–10× faster inference.
- PluralisticBehaviorSuite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
-
This paper introduces PBSuite, an evaluation suite comprising 300 industry-specific behavioral policies and a dynamic multi-turn adversarial evaluation framework. It reveals that mainstream LLMs exhibit high compliance under single-turn settings (violation rate <4%), but compliance degrades sharply under multi-turn adversarial interactions (violation rate up to 84%).
- Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
-
This paper reveals a "polarity shift" phenomenon in LLM inference sparsity — MLP layer sparsity vanishes as batch size increases, while attention head sparsity remains stable and batch-invariant. Based on this insight, the authors design Selective Head Attention and corresponding GPU kernels, achieving up to 2.2× end-to-end speedup in large-batch inference.
- Post Hoc Regression Refinement via Pairwise Rankings
-
This paper proposes RankRefine, a model-agnostic post-processing regression refinement method that fuses predictions from a base regressor with estimates derived from pairwise rankings via inverse-variance weighting. Without any retraining, the method achieves up to 10% relative MAE reduction in molecular property prediction using only 20 pairwise comparisons and a general-purpose LLM.
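The fusion step amounts to an inverse-variance weighted average of the base regressor's prediction and a rank-derived estimate; the sketch below shows only that combination (how the rank-based estimate and its variance are obtained from the LLM's pairwise comparisons is not reproduced here).

```python
def inverse_variance_fuse(y_base, var_base, y_rank, var_rank):
    # Precision-weighted average of two (assumed unbiased) estimates of the same target.
    w_base, w_rank = 1.0 / var_base, 1.0 / var_rank
    fused = (w_base * y_base + w_rank * y_rank) / (w_base + w_rank)
    fused_var = 1.0 / (w_base + w_rank)
    return fused, fused_var

print(inverse_variance_fuse(y_base=2.0, var_base=0.5, y_rank=2.6, var_rank=1.0))
```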
- PRESTO: Preimage-Informed Instruction Optimization for Prompting Black-Box LLMs
-
This paper proposes PRESTO, a framework that exploits the many-to-one (preimage) structure relating soft prompts and instructions in white-box LLMs. Through three components (score sharing, preimage-based initialization, and score consistency regularization), PRESTO effectively obtains 14× more labeled data under the same query budget, significantly improving instruction optimization efficiency for black-box LLMs.
- Q♯: Provably Optimal Distributional RL for LLM Post-Training
-
This paper proposes Q♯, a distributional RL-based value function method for KL-regularized LLM post-training. By learning the cumulative reward distribution under the reference policy to compute the optimal soft Q-function for guided generation, Q♯ achieves higher accuracy and lower KL divergence on mathematical reasoning tasks, and provides a variance-dependent PAC convergence bound.
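At decoding time, the learned soft Q-function reweights the reference policy in the standard KL-regularized form \(\pi^*(a \mid s) \propto \pi_{\text{ref}}(a \mid s)\exp(Q^*(s,a)/\beta)\). A per-step sketch of that reweighting (token-level Q-values and the \(\beta\) interface are assumptions, not the paper's code):

```python
import torch

def guided_next_token_dist(ref_logits, q_values, beta=1.0):
    # ref_logits, q_values: (vocab_size,) tensors for the current prefix.
    # pi(a|s) ∝ pi_ref(a|s) * exp(Q(s, a) / beta)  =>  add Q/beta in logit space.
    return torch.softmax(ref_logits + q_values / beta, dim=-1)
```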
- Reparameterized LLM Training via Orthogonal Equivalence Transformation
-
This paper proposes POET, a training framework that reparameterizes weight matrices as the product of two learnable orthogonal matrices and a fixed random weight matrix, thereby preserving spectral properties throughout training to achieve more stable optimization and improved generalization with fewer trainable parameters than AdamW.
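A minimal sketch of the reparameterization \(W = R_{\text{left}} W_0 R_{\text{right}}\) with \(W_0\) fixed and both factors kept orthogonal; parameterizing orthogonality via the Cayley map of a skew-symmetric matrix is an assumption made here for brevity, not necessarily the paper's construction.

```python
import torch
import torch.nn as nn

class POETLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.register_buffer("W0", torch.randn(d_out, d_in) / d_in ** 0.5)  # fixed random
        self.A_left = nn.Parameter(torch.zeros(d_out, d_out))   # generates R_left
        self.A_right = nn.Parameter(torch.zeros(d_in, d_in))    # generates R_right

    @staticmethod
    def cayley(a):
        s = a - a.T                                   # skew-symmetric part
        eye = torch.eye(a.shape[0], device=a.device)
        return torch.linalg.solve(eye + s, eye - s)   # orthogonal matrix

    def forward(self, x):
        w = self.cayley(self.A_left) @ self.W0 @ self.cayley(self.A_right)
        return x @ w.T
```

Because orthogonal factors preserve singular values, the composed weight keeps the spectrum of the fixed random matrix throughout training, which is the stability property emphasized above.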
- Scaling Up Active Testing to Large Language Models
-
By introducing three key simplifications—constructing a fixed surrogate model via in-context learning, using a small surrogate model to evaluate a large target model, and eliminating the need for target model predictions during data acquisition—this work scales active testing to LLMs, reducing risk estimation error by 25%–80% relative to random sampling.
- SolverLLM: Solving Optimization Problems via Test-Time Scaling with LLM-Guided Search
-
This paper proposes SolverLLM, a training-free framework that treats the mathematical modeling of optimization problems as a search problem. It employs an enhanced MCTS to explore optimal formulations within a six-element representation space, incorporating dynamic expansion, prompt backpropagation, and uncertainty backpropagation. SolverLLM surpasses both prompting-based and fine-tuning-based methods on 6 benchmarks without any training.
- Solving Inequality Proofs with Large Language Models
-
This paper proposes IneqMath, the first large-scale olympiad-level inequality benchmark, formulates inequality proving as two automatically verifiable subtasks (bound estimation and relation prediction), develops a five-module LLM-as-Judge framework, and finds that even o1 achieves an overall accuracy below 10% under step-by-step reasoning scrutiny.
- SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models
-
This paper proposes SPACE (Self-PlAy via Noise Contrastive Estimation), which incorporates noise contrastive estimation into self-play fine-tuning. By independently optimizing the absolute reward values of real and synthetic samples—rather than their relative margin—SPACE fundamentally resolves the unstable convergence issues of methods such as SPIN, and provides provable convergence guarantees.
- Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
-
This paper proposes Sparse MeZO (S-MeZO), motivated by the observation that zeroth-order gradient noise disproportionately affects parameters with large magnitudes. S-MeZO selectively applies zeroth-order perturbation and updates only to small-magnitude parameters, achieving significant performance gains (+9% on RTE) and convergence acceleration (3.5×) without any additional memory overhead.
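A simplified single-step sketch of the idea: a two-point zeroth-order (SPSA-style) estimate where the perturbation and the update touch only small-magnitude weights, and the random direction is regenerated from a seed rather than stored. The sparsity fraction and the per-tensor thresholding rule here are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sparse_zo_step(params, loss_fn, eps=1e-3, lr=1e-6, small_frac=0.25, seed=0):
    # Mask selects the smallest-magnitude small_frac of each tensor's entries.
    masks = [(p.abs() <= torch.quantile(p.abs().float().flatten(), small_frac)).to(p.dtype)
             for p in params]

    def perturb(scale):
        torch.manual_seed(seed)                      # regenerate the same direction z
        for p, m in zip(params, masks):
            z = torch.randn_like(p)
            p.add_(scale * eps * z * m)

    perturb(+1); loss_plus = loss_fn()
    perturb(-2); loss_minus = loss_fn()
    perturb(+1)                                      # restore the original weights
    g = (loss_plus - loss_minus) / (2 * eps)         # scalar projected-gradient estimate

    torch.manual_seed(seed)
    for p, m in zip(params, masks):
        z = torch.randn_like(p)
        p.add_(-lr * g * z * m)                      # update only the masked entries
```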
- Spectral Conditioning of Attention Improves Transformer Performance
-
This paper theoretically establishes that the condition number of the attention layer Jacobian in Transformers is governed by the condition numbers of the Query/Key/Value matrices, and proposes Spectral Conditioned Attention — a plug-and-play module that reduces the condition number by adding fixed correction terms to the Q/K/V matrices, consistently improving performance across image classification, object detection, and NLP tasks.
- SubSpec: Speculate Deep and Accurate — Lossless and Training-Free Acceleration for Offloaded LLMs
-
This paper proposes SubSpec, a plug-and-play lossless and training-free acceleration method for offloaded LLMs. The core idea is to construct a highly aligned quantized substitute draft model directly from the offloaded target model itself, and to maximize alignment by sharing GPU-resident layers and KV-Cache. SubSpec achieves a 9.1× speedup for Qwen2.5 7B under an 8GB VRAM budget and a 12.5× speedup for Qwen2.5 32B under 24GB VRAM.
- Strassen Attention, Split VC Dimension and Compositionality in Transformers
-
This paper introduces the Splitting VC dimension as a theoretical tool to prove fundamental limitations of single-layer softmax Transformers (even with infinite precision) on compositional reasoning tasks, and proposes the Strassen attention mechanism with sub-cubic time complexity to overcome these limitations.
- StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Model
-
StreamBridge proposes a simple and generalizable framework that enables multi-turn streaming interaction via a memory buffer with round-decayed compression, and achieves proactive response through a decoupled lightweight activation model. Combined with the purpose-built Stream-IT dataset, it successfully converts offline Video-LLMs (e.g., Qwen2-VL, LLaVA-OV) into streaming assistants, surpassing GPT-4o and Gemini 1.5 Pro on OVO-Bench and Streaming-Bench.
- SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assemblies
-
This paper proposes SYMPHONY, an MCTS-based multi-agent planning framework that leverages diversity-driven search over a heterogeneous LLM pool, UCB-based adaptive scheduling, entropy-modulated confidence scoring, and pool-level memory sharing to substantially improve planning diversity and efficiency.
- Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning
-
This paper proposes a partition-based multi-stage fine-tuning framework that strategically partitions multiple domains into subsets (stages) to maximize inter-domain synergy while minimizing negative transfer, and derives a novel generalization bound to theoretically support the partitioning strategy.
- System Prompt Optimization with Meta-Learning
-
This paper formulates system prompt optimization as a bilevel problem and proposes MetaSPO, a meta-learning framework that optimizes system prompts for cross-task generalization in the outer loop while optimizing task-specific user prompts in the inner loop. The resulting system prompts significantly outperform baselines across 14 unseen tasks.
- Systematizing LLM Persona Design: A Four-Quadrant Technical Taxonomy for AI Companions
-
This paper proposes a four-quadrant technical taxonomy for LLM persona design, organized along two axes—"virtual vs. embodied" and "emotional companionship vs. functional augmentation"—to systematically analyze the technology stacks, core challenges, and ethical risks across diverse scenarios ranging from virtual companions and game NPCs to caregiving robots.
- The Rise of Parameter Specialization for Knowledge Storage in Large Language Models
-
This paper systematically analyzes 20 open-source LLMs and finds that stronger models exhibit higher degrees of parameter specialization in MLP value vectors — i.e., semantically similar knowledge tends to be concentrated in a small subset of parameter vectors. Causal experiments further confirm a causal relationship between this specialization degree and model performance on knowledge tasks.
- Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs
-
This paper proposes T-SPIN (Triplet Self-Play Fine-Tuning), which extends SPIN by introducing a "historical advantage" (proto-synthetic responses as anchor points) and an entropy constraint to enable reference-free policy training. T-SPIN addresses two core issues in SPIN: optimization instability and train-generation misalignment, achieving performance comparable to full-data SFT using only 25% of labeled data.
- Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning
-
This paper proposes a unified framework based on hidden-state geometry (separability + alignment) that bridges the two major explanatory lines of ICL, attention heads (previous-token heads and induction heads, PTH/IH) and task vectors, revealing a two-phase mechanism in classification tasks: early layers establish separability via PTH, while later layers improve alignment with label unembedding directions via IH.
- Valid Inference with Imperfect Synthetic Data
-
A hyperparameter-free framework based on Generalized Method of Moments (GMM) is proposed to integrate imperfect LLM-generated synthetic data with real data for statistically valid inference. When the residuals of synthetic data are correlated with those of real data, the framework can substantially reduce estimation variance, while guaranteeing no harm to estimation quality in the worst case (i.e., when synthetic data is entirely uninformative).
- Weak-to-Strong Generalization under Distribution Shifts
-
This paper demonstrates that naive weak-to-strong generalization fails under distribution shifts—where the strong model performs even worse than the weak supervisor—and proposes RAVEN, a framework that dynamically learns optimal combination weights over multiple weak models to achieve robust weak-to-strong generalization, surpassing baselines by over 30% on OOD tasks.
- What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains
-
This paper theoretically proves that a two-layer single-head Transformer suffices to represent the conditional \(k\)-gram model (i.e., \(k\)-th order induction head) for any \(k\)-th order Markov process, establishing the tightest known characterization of the relationship between Transformer depth and Markov order. The key insight is leveraging ReLU and LayerNorm nonlinearities in the MLP to compensate for the reduced number of layers.
- Wider or Deeper: Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search
-
AB-MCTS proposes an adaptive-branching Monte Carlo Tree Search framework that dynamically decides at each node whether to go "wider" (generate new candidate answers) or "deeper" (refine existing answers using feedback), balancing exploration and exploitation via Bayesian posterior updates, and outperforms repeated sampling and standard MCTS on programming and engineering tasks.
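An illustrative version of the per-node wider-vs-deeper decision, cast as Thompson sampling over Beta posteriors of the two actions' success rates; the actual posterior model in the paper is richer, and the Bernoulli/Beta choice here is an assumption.

```python
import random

def choose_branching(wider_stats, deeper_stats):
    # Each stats tuple is (num_successes, num_failures) observed at this node.
    def draw(successes, failures):
        return random.betavariate(successes + 1, failures + 1)   # Beta(1, 1) prior
    go_wider = draw(*wider_stats) >= draw(*deeper_stats)
    return "wider" if go_wider else "deeper"

print(choose_branching(wider_stats=(3, 1), deeper_stats=(1, 3)))
```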
- Writing in Symbiosis: Mapping Human Creative Agency in the AI Era
-
Through longitudinal corpus analysis of 50,000+ documents, this paper proposes the "Dual-Track Evolution" hypothesis — that in the LLM era, human writing exhibits thematic convergence alongside structural stylistic differentiation — and identifies three authorial adaptation archetypes: Adopters, Resistors, and Pragmatists.