💬 LLM (Other)

🧪 ICML2026 · 39 paper notes

📌 Same area in other venues: 📷 CVPR2026 (3) · 🔬 ICLR2026 (56) · 💬 ACL2026 (62) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (54) · 📹 ICCV2025 (6)

🔥 Top topics: LLM ×11 · Diffusion Models ×3 · Adversarial Robustness ×2

A Geometric Relation of the Error Introduced by Sampling a Language Model's Output Distribution to its Internal State: This paper characterizes the information loss introduced by sampling from high-entropy distributions in GPT-style LLMs from a differential geometry perspective. By constructing \(\mathfrak{so}(n)\)-valued 1-forms and parallel transport operators, it demonstrates through chess probe experiments that these geometric rotations align highly with the model’s learned world vectors.
ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models: ANCHOR constructs a dense factor space using "bottom-up abduction + hierarchical clustering." For downstream conditions, it performs coarse-to-fine retrieval to obtain a sparse set of relevant factors. It then aggregates posteriors by combining Naïve Bayes with a dynamically constructed Causal Bayesian Network (CBN) featuring latent variables. In high-risk LLM decision-making scenarios, it significantly reduces "unknown" predictions and improves probability calibration.
Automated Formal Proofs of Combinatorial Identities via Wilf–Zeilberger Guidance and LLMs: WZ-LLM compiles the classic Wilf–Zeilberger symbolic proof pipeline into an executable proof skeleton (recurrence + boundary conditions + side conditions) in Lean 4. These components are discharged by WZ-Prover, a specialized model trained via SFT + expert-iteration + DAPO. On 100 classic combinatorial identities, it improves the pass@32 from Goedel-Prover-V2's 9% to 34%.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision: This paper proposes Compute as Teacher (CaT), in which a frozen anchor model "synthesizes" a pseudo-reference answer from the \(G\) rollouts already sampled by GRPO. In non-verifiable domains, the model derives binary rubrics from this pseudo-reference and uses them to score each rollout as an RL reward. This directly converts inference compute into supervision signals without any human annotation, achieving up to a 30% improvement over baselines on HealthBench and matching or exceeding inference-time aggregation with 9× lower test-time compute.
Creative Collision: Directorial Persona Steering and Competition in Large Language Models: Two semantically opposing "directorial persona" steering vectors (Spielberg's optimistic redemption vs. Scorsese's dark moral ambiguity) are simultaneously injected into the residual stream of an LLM. This study systematically characterizes the moral tone, coherence, and geometric changes during the competition between these two directions, discovering three counter-intuitive phenomena: "directional dominance," "coherence trough," and the "Layer 28 moral hub."
Deep Networks Learn to Parse Uniform-Depth Context-Free Languages from Local Statistics: The authors propose a "Varying-tree RHM" Probabilistic Context-Free Grammar (PCFG) with controllable ambiguity. They prove that using only low-order moments (root-to-pair and root-to-triple) combined with layer-wise clustering is sufficient to recover grammar rules and perform CYK-style parsing. The sample complexity is derived as \(P^\star \asymp v\, m_3\, m_2^{L-1} (p_2^2/2)^{1-L}\), and experiments on CNNs and Transformers strictly follow this power law.
Differential Syntactic and Semantic Encoding in LLMs: By averaging hidden representations of sentences sharing the same syntactic structure or the same meaning to obtain "syntactic centroids" and "semantic centroids," the authors demonstrate that a significant portion of syntactic/semantic information in LLMs like DeepSeek-V3 is encoded via linear superposition. Moreover, these two types of information exhibit clear separability in layer-wise distribution and orthogonal ablation—supporting the linguistic hypothesis of "syntactic autonomy."
Emergence of Hierarchical Emotion Organization in Large Language Models: The paper utilizes a tree-building algorithm that relies solely on LLM output logits without any annotations to "excavate" a hierarchical emotion tree from the model's next-token distribution of emotion words. It finds that as the model scale increases, these trees increasingly resemble the human psychological "emotion wheel." Furthermore, it demonstrates that LLMs under different demographic personas reproduce systematic emotion recognition biases consistent with those of human subjects.
Express Your Doubts: Probabilistic World Modeling Should Not Be Based on Token logprobs: This is a position paper arguing that treating the token softmax probabilities (logprobs) of an LLM as "world event probabilities" is theoretically flawed. This is because distribution estimation, response prediction, and target distribution estimation are three distinct tasks, each corresponding to a different ideal output distribution. The correct approach to obtaining world probabilities is second-order prediction—tasking the LLM to explicitly output its probability for an event (using numerical values or verbal qualifiers) rather than calculating "the probability of it generating X."
How Many Different Outputs Can a Transformer Generate?: Starting from two fundamental architectural facts—finite precision and bounded embedding support—this paper proves that any Transformer can only generate a finite number of "accessible sequences." It provides a tight upper bound where the length of accessible sequences grows linearly with prompt length, after which the proportion of accessible sequences decays exponentially at a rate of \(1/|V|^n\). Experiments on Pythia, Qwen, Llama, and Gemma verify that the theoretical slope differs from the measured value by only 5–10x.
"I've Seen How This Goes": Characterizing LLM vs. Human Writing Diversity using Progressive Conditional Surprisal: This paper proposes \(D_{Ca_n}=C\cdot a_n\), an embedding-free, reference-free, and label-free diversity metric. It uses a base model \(\theta\) to process all responses in a single forward pass to measure "how much per-byte conditional surprisal remains in the last response after seeing \(n-1\) priors," multiplied by the "overall coherence of the responses." It approaches SentBERT performance on the McDiv human evaluation benchmark and captures the monotonic decrease in diversity (mode collapse) across the OLMo-2-7B post-training pipeline (base → SFT → DPO → RLVR).
Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models: This paper systematically reveals two overlooked defects of Masked Diffusion Language Models (MDLM): like autoregressive models, they exhibit a strong locality bias; furthermore, the mask tokens appended for parallel generation act as distractors that degrade context comprehension. The authors propose a mask-agnostic fine-tuning loss that enforces prediction invariance to the number of mask tokens, significantly restoring robustness.
Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation: The authors identify Adam's basis mismatch (the Hessian eigenbasis is not aligned with the coordinate axes) as the "culprit" behind the convergence collapse caused by delayed gradients in asynchronous pipeline parallelism (APP) training of LLMs. They propose rotating into the Hessian eigenbasis before executing Adam updates. On a 3B model, this reaches the same loss with 81.7% fewer iterations than the strongest asynchronous baseline.
YAQA: End-to-End KL Minimizing Adaptive Weight Quantization for LLMs: YAQA shifts the proxy objective of LLM weight quantization from "layer-wise activation error" to "end-to-end model output KL divergence." Using a Hessian sketch via Kronecker decomposition, it provides the first end-to-end error bound. It reduces KL divergence by approximately 30% compared to GPTQ/LDLQ, even outperforming Quantization-Aware Training (QAT) in accuracy, while maintaining the same inference speed.
Multi-Agent Teams Hold Experts Back: Why Self-Organized LLM Teams Fail to Retain "Experts": This paper systematically evaluates self-organized heterogeneous LLM teams using the organizational psychology standard of "strong synergy" (team \(\ge\) strongest individual). It finds that even when explicitly informed of expert identities, teams underperform experts by 6.3%–41.1% on frontier ML benchmarks. The root cause is not the inability to recognize experts, but a reluctance to let them lead—LLMs favor "middle-ground integration" over "epistemic deference." This consensus mechanism dilutes expertise as team size grows but, conversely, makes teams exceptionally robust against adversarial members.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation: Through large-scale experiments on toxicity detection (9 models × 5 datasets), this paper finds that LLM annotation performance is primarily determined by definition alignment rather than text memorization. Model-internalized priors render the vast majority of zero-shot errors "resilient" to prompt-based correction—even when explicit definitions and examples are provided, two-thirds of errors remain uncorrectable (rescue rate of only 34.8%), and confidence scores cannot be used to detect definition errors.
Optimizing Diversity and Quality through Base-Aligned Model Collaboration: The authors propose BACO, an inference-time token-level routing framework. It allows an "unaligned base model" and an "aligned instruct model" to switch token-by-token during a single decoding pass. Decisions are based on logit uncertainty and content signals, achieving base-level diversity and aligned-level quality without re-training or multiple sampling. The best router achieves a 21.3% joint improvement in diversity and quality over the strongest baseline.
Position: Adversarial ML for LLMs Is Not Making Any Progress: This position paper argues that adversarial machine learning (ML) research in the LLM era focuses on problems that are "harder to define, harder to solve, and harder to evaluate" compared to traditional classifier scenarios. Having made slow progress on "toy problems" like \(\ell_p\) robustness over the past decade, the full shift to LLMs risks another decade of research without producing measurable or reproducible security guarantees.
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI: This position paper leverages neuroscientific evidence to argue that the underlying learning mechanism of LLMs corresponds to "implicit memory" in the human brain (habitual/procedural learning in the basal ganglia). However, higher-order cognition essential for AGI—such as long-range planning, metacognition, and symbolic reasoning—depends on hippocampal "explicit memory" and cannot emerge from purely statistical implicit learning. Consequently, supplementing LLMs with an explicit memory system is the cornerstone of the transition to AGI. The authors further propose eight computational requirements for an artificial explicit memory system.
Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem: This is a position paper arguing that the machine learning community must urgently build an "AI-augmented" peer-review ecosystem—treating LLMs as collaborative assistants for authors, reviewers, and Area Chairs (ACs) rather than replacements. The paper identifies that the primary near-term bottleneck is not the lack of stronger models, but the absence of structured process data that records "why scores changed" or "which specific rebuttal addressed which concern."
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management: The authors point out that most existing proofs of the popular claim "Transformers are Turing-complete" actually establish a different statement: "a family of different Transformers together can simulate a Turing machine." They formalize a fixed system \((T, D, C)\) reflective of real-world deployment and prove that the computational power of the same fixed Transformer can range from merely recognizing regular languages to reaching Turing-completeness under different context management strategies, thereby shifting the research focus from the model itself to the context manager.
Preregistration for Experiments with AI Agents: This is a position paper advocating for the extension of preregistration practices—used in social sciences to combat the "reproducibility crisis"—to behavioral experiments where LLMs/AI agents serve as experimental subjects. It systematically catalogs the unique "researcher degrees of freedom" in AI agent experiments and provides a tailored preregistration template for such studies.
Rare Event Analysis of Large Language Models: This paper brings mature Rare Event Analysis (REA) methods from statistical physics to LLMs, utilizing an "Exponential Tilting + Transition Path Sampling + MBAR" trio. On TinyStories, it estimates, with affordable compute, rare completion probabilities several orders of magnitude smaller than direct sampling can resolve, and identifies cheap runtime proxies (consecutive token repeats) via EDA to pre-screen high-ARI anomalous outputs.
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models: This paper proposes BMC (Bidirectional Manifold Consistency), an unsupervised, training-free metric based on the geometric perspective that "valid reasoning trajectories are stable attractors on the learned distribution." By performing a "forward re-masking + backward few-step reconstruction" on the outputs of a Diffusion Language Model (dLLM), reconstruction stability is used for scoring. BMC supports error diagnosis, inference-time rejection sampling, and dense RL rewards, systematically outperforming baselines such as confidence, Self-Consistency, and Self-Evaluation across four reasoning benchmarks.
Resting Neurons, Active Insights: Robustify Activation Sparsity for Large Language Models: This paper attributes the performance degradation of LLMs caused by activation sparsity to "representation drift." By mimicking biological spontaneous firing, it injects an input-independent small vector (SPON) into each layer. This vector can be absorbed into the bias after training, significantly narrowing the gap between sparse and dense models with near-zero inference overhead.
Rethinking LLM Ensembling from the Perspective of Mixture Models: This paper proves that token-level ensembling of \(n\) LLMs does not require running all models at every step. Randomly selecting one model per step according to the ensemble weights and sampling the next token from it yields an output distribution exactly equivalent to the "average then sample" approach. This cuts the per-step cost from \(n\) forward passes to one, achieving actual speedups of 1.78×–2.68× when combined with "Lazy Synchronous KV Cache."
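The single-step equivalence described in this entry can be checked numerically. The sketch below is not the paper's implementation: the two toy next-token distributions, the ensemble weights, and every function name are assumptions made for illustration, and it covers one decoding step only (no KV cache handling).

```python
import random

# Hypothetical toy setup: two "models", each giving a next-token
# distribution over a 3-token vocabulary, plus ensemble weights.
DISTS = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
WEIGHTS = [0.25, 0.75]


def averaged_distribution(dists, weights):
    """Weighted average of the models' next-token distributions."""
    vocab = len(dists[0])
    return [sum(w * d[t] for w, d in zip(weights, dists)) for t in range(vocab)]


def sample_average_then_sample(dists, weights, rng):
    """Baseline: needs every model's distribution, then samples from the mix."""
    mix = averaged_distribution(dists, weights)
    return rng.choices(range(len(mix)), weights=mix, k=1)[0]


def sample_pick_one_model(dists, weights, rng):
    """Mixture view: pick one model by weight, sample only from that model."""
    k = rng.choices(range(len(dists)), weights=weights, k=1)[0]
    return rng.choices(range(len(dists[k])), weights=dists[k], k=1)[0]


def empirical_frequencies(sampler, dists, weights, n, seed):
    """Monte Carlo estimate of the token distribution a sampler induces."""
    rng = random.Random(seed)
    counts = [0] * len(dists[0])
    for _ in range(n):
        counts[sampler(dists, weights, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    print("averaged      :", averaged_distribution(DISTS, WEIGHTS))
    print("pick-one-model:", empirical_frequencies(
        sample_pick_one_model, DISTS, WEIGHTS, 100_000, seed=0))
```

Both samplers target the same marginal because the probability of token \(t\) under the pick-one-model sampler is \(\sum_k w_k\, p_k(t)\), which is the averaged distribution itself.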
SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling: SAC-Opt "back-translates" LLM-generated optimization solver code into structured semantic anchors (constraints and objectives), compares them item-by-item with the original problem description's anchors, and iteratively rewrites only the inconsistent parts. It achieves an average performance gain of 7.7% across 7 public datasets and 21.9% on ComplexLP.
Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions: This paper replaces the point estimation of "predicting a single output length" in LLM inference scheduling with log-t distribution fitting. It substitutes the output length in SJF with Tail Inflated Expectation (TIE), which incorporates a CVaR tail penalty. On LMSYS-Chat-1M, it reduces online per-token latency by \(2.31\times\) compared to the strongest baseline LTR and improves offline SDG throughput by \(1.42\times\).
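To make the scheduling idea in this entry concrete, the sketch below orders requests shortest-job-first using a tail-penalized length score instead of a point estimate. It rests on loud assumptions: the paper fits a log-t distribution, whereas this sketch works on raw predicted-length samples, and the score `mean + lam * (CVaR - mean)` is a hypothetical stand-in, not the paper's definition of TIE. All names and numbers are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cvar(samples, alpha):
    """Empirical CVaR: mean of the longest (1 - alpha) fraction of samples."""
    ordered = sorted(samples)
    # round() guards against float noise such as 0.9999999999999998.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return mean(ordered[-k:])


def tail_inflated_score(samples, alpha=0.8, lam=0.5):
    """Hypothetical score: the mean pushed toward the tail by a CVaR penalty."""
    m = mean(samples)
    return m + lam * (cvar(samples, alpha) - m)


def sjf_order(requests, key):
    """Shortest-job-first: serve requests in increasing order of `key`."""
    return [name for name, _ in sorted(requests.items(), key=lambda kv: key(kv[1]))]


# Hypothetical predicted output lengths (tokens) for two requests.
REQUESTS = {
    "A": [10, 10, 10, 10, 100],  # short on average, heavy tail
    "B": [40, 40, 40, 40, 40],   # longer on average, no tail risk
}

if __name__ == "__main__":
    print("SJF by mean         :", sjf_order(REQUESTS, mean))
    print("SJF by tail-inflated:", sjf_order(REQUESTS, tail_inflated_score))
```

With these toy numbers a mean-based SJF serves request A first, while the tail-aware score serves B first because A's occasional 100-token output raises its score.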
SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel: SLAY linearizes the Yat-kernel, inspired by the physical "inverse-square interaction," through a four-step sequence: (1) spherical normalization, (2) Laplace integral representation via Bernstein's Theorem, (3) Gauss-Laguerre quadrature, and (4) tensor product positive random features of polynomial and exponential kernels. This achieves \(O(L)\) time complexity with performance nearly indistinguishable from softmax attention.
SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models: SPA-Cache moves the decision of "which tokens need updating" in Diffusion Language Models (DLMs) from cosine similarity in the original \(d=4096\)-dimensional Value space to a compressed \(r=128\) singular subspace. By dynamically allocating update budgets across layers, it achieves a \(6.4\times\) throughput increase for LLaDA-8B on GSM8K and \(8\times\) on MBPP without accuracy loss. Combined with parallel decoding, the total acceleration reaches \(28\times\).
SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion: SphericalDreamer generates the first outdoor 3D world that simultaneously possesses \(360^\circ \times 180^\circ\) omnidirectional immersion and long-distance navigability. It achieves this by lifting multiple text-generated Layered Depth Panoramas (LDP) into 3D "spherical building blocks" and employing harmonic blending to synthesize and stitch the missing transition regions between adjacent spheres.
Stop Automating Peer Review Without Rigorous Evaluation: This is a position paper: through empirical measurements of real ICLR 2026 reviews and 60 simulated reviews, the authors identify two major failures in current LLM reviewing: the hivemind effect (high convergence) and paper laundering (zero-shot paraphrasing alone can increase scores by 0.45). Consequently, they argue that "LLMs should not directly generate review comments without rigorous evaluation" and call for the establishment of a "science of review automation."
T\(^2\)PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning: T\(^2\)PO attributes the training collapse of multi-turn agentic RL to "hesitation"—characterized by over-thinking at the token level and repetitive invalidity at the turn level. It utilizes a self-calibrated uncertainty signal \(M_t\), which fuses entropy and confidence, to simultaneously drive Token-level Thinking Intervention (dynamically truncating think blocks) and Turn-level Dynamical Sampling (resampling ineffective turns). This approach consistently outperforms PPO, GRPO, and GiGPO across WebShop, ALFWorld, and Search QA.
The Cylindrical Representation Hypothesis for Language Model Steering: This paper proposes the Cylindrical Representation Hypothesis (CRH). By maintaining "concept linearity" while abandoning the orthogonality assumption of LRH, it demonstrates that the superposition of concept vectors naturally induces a cylindrical geometry consisting of an "axis + normal plane + sensitive sector." This provides the first geometric explanation for why activation steering is unpredictable at the sample level but observable at the group level.
Token-Efficient Change Detection in LLM APIs: The authors demonstrate that under low-temperature sampling, specific inputs where "two token logits are nearly tied" (Border Inputs) are extremely sensitive to parameter perturbations—theoretically, the SNR diverges as \(T \to 0\). This allows for LLM API change detection using minimal requests by observing only output tokens (strict black-box). The proposed B3IT matches gray-box logprob methods at 1/30th the cost on the TinyChange benchmark and detected 8 real-world model replacements during 23 days of continuous monitoring across 93 commercial endpoints.
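The sensitivity of border inputs described in this entry can be illustrated with a two-token softmax. This is a toy sketch, not B3IT: the logit gaps, the perturbation size, and the function names are assumptions, and a parameter change is modeled simply as a small shift in the logit gap.

```python
import math


def prob_first_token(logit_gap, temperature):
    """P(first token) for a two-token softmax, i.e. sigmoid(gap / T)."""
    z = logit_gap / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def output_shift(logit_gap, perturbation, temperature):
    """Change in P(first token) when the logit gap moves by `perturbation`."""
    before = prob_first_token(logit_gap, temperature)
    after = prob_first_token(logit_gap + perturbation, temperature)
    return abs(after - before)


if __name__ == "__main__":
    eps = 0.01  # hypothetical small parameter-induced logit change
    for temperature in (1.0, 0.1, 0.01):
        border = output_shift(0.0, eps, temperature)   # logits tied
        typical = output_shift(5.0, eps, temperature)  # clear winner
        print(f"T={temperature}: border shift={border:.4f}, "
              f"non-border shift={typical:.2e}")
```

At a tie, the same small perturbation moves the output distribution more and more as the temperature drops, while an input with a clear winner barely moves. That contrast is why observing sampled tokens on border inputs can reveal a change with few requests.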
Structured Generalized Linear Token Mixing: Shifting Gears Between Complexity and Expressivity with SND + Kronecker: The paper proposes a unified "direct input mixing \(A\) + output recursive mixing \(B\)" framework \(Y = (I - B)^{-1} A X\) that encompasses attention, SSMs, linear recurrence, and high-order recurrence. It proves that the sparsity pattern of \(A\) and \(B\) directly controls the complexity gradient from \(\mathcal{O}(n \log n)\) to \(\mathcal{O}(n^2)\). Two translation-invariant modes, \(f(k) = 2^k\) and \(f(k) = k^2+1\), are introduced as new choices for \(\mathcal{O}(n \log n)\) and \(\mathcal{O}(n \sqrt{n})\) complexity, with cache sizes reducible to \(\mathcal{O}(\log n)\) or \(\mathcal{O}(\sqrt{n})\).
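The formula \(Y = (I - B)^{-1} A X\) in this entry is the solution of \(Y = AX + BY\), so each output mixes the inputs directly through \(A\) and earlier outputs through \(B\). The sketch below assumes scalar-valued tokens and a strictly lower-triangular \(B\) (a causality assumption added here, not stated in the summary), under which the inverse reduces to forward substitution. The function names and the decay value are hypothetical.

```python
def mix_tokens(A, B, x):
    """Solve y = A x + B y by forward substitution.

    Assumes B is strictly lower-triangular, so y[i] depends only on
    already-computed outputs y[0..i-1]; this equals (I - B)^{-1} A x.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        direct = sum(A[i][j] * x[j] for j in range(n))     # input mixing
        recursive = sum(B[i][j] * y[j] for j in range(i))  # output mixing
        y[i] = direct + recursive
    return y


def linear_recurrence_matrices(n, decay):
    """A = I and B = decay on the first subdiagonal: y_i = x_i + decay * y_{i-1}."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    B = [[decay if j == i - 1 else 0.0 for j in range(n)] for i in range(n)]
    return A, B


if __name__ == "__main__":
    A, B = linear_recurrence_matrices(3, decay=0.5)
    print(mix_tokens(A, B, [1.0, 2.0, 3.0]))
```

In this instance a single nonzero subdiagonal in \(B\) recovers a first-order linear recurrence; denser or differently structured patterns in \(A\) and \(B\) would trade cost for expressivity, which is the axis the entry describes.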
In-Context Routing (ICR): Train-Once, Use-Everywhere Attention-Level Implicit ICL: Instead of injecting shift vectors into the residual stream, ICR extracts Principal ICL Directions (PIDs) from multi-domain ICL via PCA to serve as low-rank correction directions for attention logits, adaptively modulated by a query-conditioned router. After a single training phase, it enables zero-shot inference across 12 in/out-of-domain tasks without task-specific retrieval or retraining, avoiding the degradation on OOD tasks typical of vector-based methods.
Universal Reasoner: Composable Plug-and-Play Reasoners for Frozen LLMs: The authors propose Universal Reasoner (UniR), which trains independent lightweight reasoning modules to capture reward-oriented behaviors. At inference, these modules are combined with frozen LLMs via logit superposition, enabling reasoning enhancement without fine-tuning the backbone, cross-model scale transfer, and multi-task composition.
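A minimal sketch of what logit superposition could look like at inference time, assuming it means adding a module's logits to the frozen backbone's logits before the softmax. The per-module scale factors, the function names, and the toy logits are hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def superpose(backbone_logits, module_logits_list, scales=None):
    """Add scaled module logits to the frozen backbone's logits."""
    if scales is None:
        scales = [1.0] * len(module_logits_list)
    combined = list(backbone_logits)
    for scale, module_logits in zip(scales, module_logits_list):
        for t, value in enumerate(module_logits):
            combined[t] += scale * value
    return combined


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    backbone = [2.0, 1.0, 0.0]  # frozen LLM prefers token 0
    reasoner = [0.0, 2.0, 0.0]  # lightweight module favors token 1
    combined = superpose(backbone, [reasoner])
    print("backbone choice:", argmax(backbone))
    print("combined choice:", argmax(combined))
    print("combined probs :", softmax(combined))
```

Because the combination happens on logits, the backbone's weights stay untouched, and passing several modules in `module_logits_list` gives one simple way to picture multi-task composition.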
Why Are Linear RNNs More Parallelizable?: This paper uses circuit complexity to rigorously explain why Linear RNNs (LRNNs), like Transformers, parallelize more easily than traditional non-linear RNNs: LRNNs fall within arithmetic circuit classes of approximately logarithmic depth, whereas non-linear RNNs can express harder-to-parallelize logspace-complete and polynomial-time-complete problems, creating a fundamental trade-off between expressivity and parallelizability.