💬 LLM / NLP
🧪 ICML2026 · 4 paper notes
📌 Same area in other venues: 💬 ACL2026 (32) · 📷 CVPR2026 (9) · 🔬 ICLR2026 (35) · 🤖 AAAI2026 (32) · 🧠 NeurIPS2025 (48) · 📹 ICCV2025 (6)
🔥 Top topics: LLM ×2
- A Geometric Relation of the Error Introduced by Sampling a Language Model's Output Distribution to its Internal State
This paper uses differential geometry to characterize the information lost when sampling from high-entropy output distributions in GPT-style LLMs. It constructs \(\mathfrak{so}(n)\)-valued 1-forms and parallel transport operators, and shows in chess probing experiments that the resulting geometric rotations align closely with the world vectors learned by the model.
- Escaping Mode Collapse in LLM Generation via Geometric Regulation
This work takes a dynamical systems view of "mode collapse" (repetition, looping, monotony) in LLM long-form generation, reinterpreting it as "geometric collapse" of hidden state trajectories in representation space. It proposes RMR, a lightweight low-rank damping applied to the Transformer value cache that suppresses the most persistent self-reinforcing directions, maintaining stable, high-quality generation even in extremely low-entropy decoding regimes (0.8 nats/step).
- Top-W: Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for LLMs
Top-W formulates next-token truncation as a minimization problem with three terms (a geometry-aware Wasserstein term, an entropy term, and a mass term), so that token embedding geometry is taken into account explicitly. The paper shows that the optimal solution is either a singleton token or a prefix of the tokens sorted by \(f(i)+\lambda\log p_i\), so the implementation reduces to an \(O(n\log n)\) scan. On GSM8K, GPQA, AlpacaEval, and MT-Bench, Top-W comes out ahead in the majority of the 15 (temperature T, model) combinations, and at high temperatures it improves GSM8K by up to 33.7% over Top-H.
- Rethinking LLM Ensembling from the Perspective of Mixture Models
This paper proves that token-level ensembling over \(n\) LLMs does not require running every model at each step: randomly selecting one model according to the ensemble weights and letting it generate the next token yields an output distribution strictly equivalent to "average then sample." This reduces the \(n\) forward passes per step to a single forward pass and, combined with "lazy KV cache synchronization," achieves a practical speedup of 1.78×–2.68×.
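
The equivalence claimed in the ensembling note above can be checked numerically on a toy example. The following is a minimal sketch, not the paper's implementation: the three "models" are hypothetical fixed next-token distributions over a four-token vocabulary, the weights are made up for illustration, and all function names are assumptions. It covers only the single-step sampling identity and says nothing about lazy KV cache synchronization.

```python
import random
from collections import Counter

# Hypothetical ensemble weights and per-model next-token distributions
# (illustrative values only, not taken from the paper).
WEIGHTS = [0.5, 0.3, 0.2]
MODEL_PROBS = [
    [0.70, 0.20, 0.10, 0.00],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]


def mixture_distribution(weights, model_probs):
    """Exact 'average then sample' distribution: sum_k w_k * p_k(token)."""
    vocab_size = len(model_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, model_probs))
        for t in range(vocab_size)
    ]


def sample_model_then_token(weights, model_probs, rng):
    """Pick one model by its weight, then sample a token from that model only."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.choices(range(len(model_probs[k])), weights=model_probs[k], k=1)[0]


def empirical_distribution(weights, model_probs, num_samples, seed=0):
    """Token frequencies observed when only one model is queried per draw."""
    rng = random.Random(seed)
    counts = Counter(
        sample_model_then_token(weights, model_probs, rng)
        for _ in range(num_samples)
    )
    return [counts[t] / num_samples for t in range(len(model_probs[0]))]


if __name__ == "__main__":
    exact = mixture_distribution(WEIGHTS, MODEL_PROBS)
    observed = empirical_distribution(WEIGHTS, MODEL_PROBS, num_samples=100_000)
    for token, (p, q) in enumerate(zip(exact, observed)):
        print(f"token {token}: averaged={p:.3f}  sample-one-model={q:.3f}")
```

The two columns agree up to sampling noise because the marginal probability of a token under "pick model \(k\) with probability \(w_k\), then sample from \(p_k\)" is \(\sum_k w_k\,p_k(\text{token})\), which is exactly the averaged distribution.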