💡 LLM Reasoning
🧪 ICML2026 · 20 paper notes
📌 Same area in other venues: 💬 ACL2026 (44) · 📷 CVPR2026 (12) · 🔬 ICLR2026 (63) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (66) · 📹 ICCV2025 (3)
🔥 Top topics: Reasoning ×10 · LLM ×4 · Diffusion Models ×2 · Reinforcement Learning ×2
- A Formal Comparison Between Chain of Thought and Latent Thought
-
Starting from computational complexity theory, this paper formally compares the expressive power of CoT (Chain of Thought) and Latent Thought (Looped Transformer / Coconut). It proves that Latent Thought strictly achieves \(\mathsf{TC}^k\) at polylogarithmic depth, while CoT reaches at most \(\mathsf{TC}^{k-1}\). In the probabilistic setting, it also shows for the first time that CoT can support counting via a fully polynomial randomized approximation scheme (FPRAS) through random decoding, thereby surpassing deterministic Latent Thought.
- ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models
-
ANCHOR employs "bottom-up abduction + hierarchical clustering" to construct a dense factor space, retrieves a sparse set of relevant factors for downstream conditions via coarse-to-fine search, and aggregates posteriors using both Naïve Bayes and a latent-variable causal Bayesian network constructed on-the-fly by an LLM. This approach significantly reduces "unknown" predictions and improves probability calibration in high-risk LLM decision scenarios.
- Automated Formal Proofs of Combinatorial Identities via Wilf–Zeilberger Guidance and LLMs
-
WZ-LLM compiles the classical Wilf–Zeilberger symbolic proof process into executable proof skeletons in Lean 4 (recurrence + boundary conditions + side conditions), which are then discharged item by item by a WZ-Prover trained via SFT + expert-iteration + DAPO. On 100 classical combinatorial identities, pass@32 improves from Goedel-Prover-V2's 9% to 34%.
- Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
-
A fixed block size in the semi-autoregressive generation of diffusion language models (dLLMs) disrupts the logical chain of reasoning. This paper proposes b1, which uses RL to learn a block-ending indicator token so that blocks have dynamic length, and introduces a block-level monotonic entropy descent (MED) reward to drive coherent reasoning. The reward plugs into existing dLLM RL frameworks (Diffu-GRPO/GDPO/d1/wd1) and boosts wd1 on Countdown from 39.45 to 58.98.
- Conformal Thinking: Risk Control for Reasoning on a Compute Budget
-
This work reframes "when should an LLM stop reasoning" from an opaque threshold-tuning task into a conformal risk control problem with a user-specified risk tolerance. It uses two thresholds: an upper threshold that stops when the model is confident (controlling false positives), and a newly proposed parameterized lower threshold that forces a stop when the model is "stuck" on an unsolvable problem (controlling false negatives). A UCB algorithm automatically selects, from a calibration set, thresholds that satisfy the risk constraints, yielding significant token savings with almost no drop in accuracy on AIME / GPQA / MathVision.
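The two-threshold stopping rule can be sketched as follows. This is only a minimal illustration, not the paper's procedure: it assumes a per-step confidence score in [0, 1] is already available, it uses a constant lower threshold rather than the parameterized one, and it does not reproduce the UCB-based calibration. The names `stopping_step`, `upper`, and `lower` are hypothetical.

```python
from typing import Sequence, Tuple


def stopping_step(
    confidences: Sequence[float], upper: float, lower: float
) -> Tuple[int, str]:
    """Return (1-based step, reason) at which reasoning would stop.

    Assumption: `confidences[t]` is the model's confidence after step t+1.
    """
    for step, conf in enumerate(confidences, start=1):
        if conf >= upper:
            # Confident enough: stop early and answer.
            return step, "confident"
        if conf <= lower:
            # Likely stuck on an unsolvable problem: force a stop.
            return step, "stuck"
    # Neither threshold fired before the compute budget ran out.
    return len(confidences), "budget"


print(stopping_step([0.30, 0.55, 0.92], upper=0.9, lower=0.1))
print(stopping_step([0.30, 0.05, 0.95], upper=0.9, lower=0.1))
```

In a calibrated setup the two thresholds would be chosen on held-out data so that the resulting error rates stay below the user's risk tolerance; here they are simply passed in.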
- Efficient Reasoning with Hidden Thinking
-
Heima distills each stage (summary / caption / reasoning) of a multimodal LLM’s lengthy CoT into a special thinking token, enabling the model to "think" in latent space. The number of tokens drops from 100-200 to 13-16, while zero-shot accuracy is more stable than LLaVA-CoT. An auxiliary LLM "interpreter" is trained to reconstruct the textual reasoning chain from the thinking token’s hidden state, empirically verifying the information-theoretic upper bound of compression loss.
- Entropy-informed Decoding: Adaptive Information-Driven Branching
-
EDEN (Entropy-informed DEcodiNg) sets the beam width \(B_t\) at each step as a monotonically increasing function of the normalized entropy \(\bar H_t\): high-entropy steps fork more branches, while low-entropy steps approach greedy decoding. It thereby approximates a wider beam search with fewer total expansions. The paper proves that entropy-monotonic branching factors are strictly superior to any fixed beam width in terms of expected cumulative regret, with an explicit regret rate \(\mathbb{E}[R_T] \leq G P_{\max} \sum_t \exp(-c m_t \Delta_{\min}^2)\).
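A minimal sketch of an entropy-to-beam-width mapping, under stated assumptions: entropy is normalized by the log of the number of candidates, and the width is a linear function of that value between hypothetical bounds `b_min` and `b_max`. The source does not give the exact mapping EDEN uses, so the linear form is purely illustrative.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of `probs` divided by log(len(probs)), clamped to [0, 1]."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(1.0, max(0.0, h / math.log(n)))


def beam_width(probs: Sequence[float], b_min: int = 1, b_max: int = 8) -> int:
    """Map normalized entropy to a beam width in [b_min, b_max].

    Assumption: a linear, monotonically increasing map; any other
    monotone map would fit the description equally well.
    """
    h_bar = normalized_entropy(probs)
    return b_min + round(h_bar * (b_max - b_min))


print(beam_width([1.0, 0.0, 0.0, 0.0]))      # peaked distribution: near-greedy
print(beam_width([0.25, 0.25, 0.25, 0.25]))  # uniform distribution: widest beam
```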
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
-
ETS samples directly from the closed-form optimal solution of the KL-regularized RLHF objective, which factors into a reference policy multiplied by the conditional expectation of the exponentiated reward (the energy term). At test time it approximates this energy term with Monte Carlo sampling and self-normalized importance sampling. This matches or even surpasses the performance of RL-trained policies without any training, and lightweight proposals plus Fast-dLLM keep latency within practical bounds.
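For intuition, the standard closed form of the KL-regularized objective is \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(r(x, y)/\beta)\). The sketch below shows self-normalized importance sampling against that target at the level of whole responses: candidates are assumed to have been drawn from the reference policy, so their importance weights reduce to \(\exp(r/\beta)\). This is a simplification, not the ETS procedure itself, which estimates the energy term during generation; `snis_weights` and `resample` are hypothetical names.

```python
import math
import random
from typing import List, Sequence


def snis_weights(rewards: Sequence[float], beta: float) -> List[float]:
    """Self-normalized weights proportional to exp(reward / beta)."""
    m = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    z = sum(unnorm)
    return [w / z for w in unnorm]


def resample(
    candidates: Sequence[str],
    rewards: Sequence[float],
    beta: float,
    rng: random.Random,
) -> str:
    """Pick one candidate with probability given by its normalized weight.

    Assumption: `candidates` were sampled from the reference policy.
    """
    weights = snis_weights(rewards, beta)
    return rng.choices(list(candidates), weights=weights, k=1)[0]


rng = random.Random(0)
print(snis_weights([1.0, 0.0], beta=1.0))
print(resample(["answer A", "answer B"], [1.0, 0.0], beta=0.5, rng=rng))
```

A smaller `beta` concentrates the weights on high-reward candidates; a larger `beta` keeps the samples closer to the reference policy.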
- Express Your Doubts: Probabilistic World Modeling Should Not Be Based on Token logprobs
-
This position paper argues that using the token softmax probabilities (logprob) of LLMs as "world event probabilities" is theoretically incorrect—because distribution estimation, response prediction, and target distribution estimation are three distinct tasks, each with a different ideal output distribution. The correct approach to obtaining world probabilities is second-order prediction—having the LLM explicitly write out its probability estimate for an event (numerically or with linguistic hedges) in its output, rather than computing "the probability it says X".
- Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory
-
This paper models LLM active questioning scenarios (20 Questions / medical diagnosis / troubleshooting) as a two-player zero-sum extensive-form game (EFG), and proposes Game of Thought (GoT): using depth-limited subgame construction + CFR to compute Nash equilibria, thereby generating "randomized questioning strategies" that significantly reduce worst-case interaction rounds across all datasets, with a 15–40% improvement over UoT under the weighted variant.
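CFR is built on regret matching. As a rough illustration of how regret minimization in self-play approaches a Nash equilibrium of a zero-sum game, here is a sketch on a small matrix game. It is not the paper's depth-limited extensive-form construction; the payoff matrix and the function names are made up for the example.

```python
from typing import List, Sequence, Tuple


def regret_matching(regrets: Sequence[float]) -> List[float]:
    """Play each action in proportion to its positive cumulative regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    n = len(regrets)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n


def solve_zero_sum(
    payoff: Sequence[Sequence[float]], iters: int = 10000
) -> Tuple[List[float], List[float]]:
    """Average strategies from regret-matching self-play.

    payoff[i][j] is the row player's payoff; the column player gets -payoff[i][j].
    """
    n, m = len(payoff), len(payoff[0])
    r_row, r_col = [0.0] * n, [0.0] * m
    s_row, s_col = [0.0] * n, [0.0] * m
    for _ in range(iters):
        p, q = regret_matching(r_row), regret_matching(r_col)
        # Expected utility of each pure action against the opponent's mix.
        u_row = [sum(payoff[i][j] * q[j] for j in range(m)) for i in range(n)]
        u_col = [-sum(payoff[i][j] * p[i] for i in range(n)) for j in range(m)]
        v_row = sum(p[i] * u_row[i] for i in range(n))
        v_col = sum(q[j] * u_col[j] for j in range(m))
        for i in range(n):
            r_row[i] += u_row[i] - v_row
            s_row[i] += p[i]
        for j in range(m):
            r_col[j] += u_col[j] - v_col
            s_col[j] += q[j]
    return [x / iters for x in s_row], [x / iters for x in s_col]


row, col = solve_zero_sum([[2.0, -1.0], [-1.0, 1.0]])
print(row, col)
```

For this payoff matrix the equilibrium has both players choosing their first action with probability 0.4, so the averaged strategies should approach that mix. The equilibrium is randomized, which is the property the questioning strategies above rely on.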
- GRPO is Secretly a Process Reward Model
-
This paper proves that GRPO with an outcome reward model (ORM), under the mild condition that trajectories within a group share prefixes, is equivalent to a process-reward RL objective with a Monte-Carlo PRM. The equivalence exposes a hidden flaw in vanilla GRPO: uneven prefix lengths cause most tokens in high-reward trajectories to receive negative advantage. The authors propose \(\lambda\)-GRPO, which applies PRM-aware normalization; it consistently outperforms GRPO on reasoning benchmarks and trains about 2× faster.
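As background, vanilla GRPO assigns every token of a trajectory the same group-normalized advantage computed from outcome rewards. A minimal sketch follows; it assumes the population standard deviation and a small `eps` (implementations differ on both), and it does not implement the \(\lambda\)-GRPO normalization, which the summary does not specify.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(
    advantages: Sequence[float], lengths: Sequence[int]
) -> List[List[float]]:
    """Every token in trajectory i receives that trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


rewards = [1.0, 0.0, 0.0, 1.0]  # outcome rewards for one group of 4 samples
adv = grpo_advantages(rewards)
print(adv)
print(broadcast_to_tokens(adv, lengths=[3, 2, 4, 1]))
```

Because the advantage is constant along a trajectory, tokens in a prefix shared by several trajectories accumulate the advantages of all of them, which is where the process-reward interpretation comes from.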
- Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal
-
A simple logistic regression probe on LLM hidden states during chain-of-thought (CoT) generation can predict whether the entire reasoning will be incorrect with 0.95 AUROC (0.79 from the first step), while a classifier trained on surface text achieves only 0.59; unfortunately, all four intervention methods (activation steering, probe-guided best-of-N, self-correction, activation patching) fail—this error signal is "diagnostic" rather than "causal."
- Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks
-
NSI "lifts" LLM agent interaction traces into neuro-symbolic workflow graphs with explicit conditional branches and dynamic variable binding, evolving skills from stateless scripts into state-aware logical programs. It achieves success rates of 98.0 / 76.5 / 95.2 on ALFWorld / WebShop / TextCraft, outperforming programmatic skill baselines such as ASI and AWM across the board.
- Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
-
This paper systematically reveals that the "rules of thumb" for many-shot ICL in non-reasoning tasks completely fail in CoT reasoning tasks—similarity retrieval is actually harmful, and order sensitivity increases with the number of shots. The paper reinterprets successful many-shot CoT as "in-context test-time learning," and proposes the CDS method, which sorts demonstrations by embedding trajectory curvature, achieving a 5.42 pp improvement on 64-shot geometry problems.
- Multimodal Fact-Level Attribution for Verifiable Reasoning
-
MURGAT is the first benchmark to evaluate MLLMs’ ability to provide "fact-level, modality+timestamp precise citations" in multimodal reasoning outputs. It introduces a three-step evaluation protocol (verifiable claim identification → atomic fact decomposition → attribution quality) and a highly human-aligned automatic evaluator, MURGAT-SCORE (Pearson 0.84). The study reveals that even strong models often cite incorrectly despite correct answers, and that strong reasoning often comes at the expense of verifiable citation.
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
-
The authors decompose the problem of "efficient test-time scaling for discrete diffusion language models (dLLM)" into three components: allocating computation along a hierarchical timeline of "exploration → progressive pruning → refinement" (HTS), using partial remask for local branching to preserve high-confidence "logic skeletons," and treating the dLLM itself as a Yes/No verifier (SVF). Ultimately, on four math/code benchmarks and three dLLMs, they achieve comparable or better accuracy than best-of-\(N\) with far fewer NFE.
- Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training
-
This work provides the first rigorous sample complexity proof for "easy-to-hard" curriculum RL post-training: on the state-conditional autoregressive reasoning tree of a transformer, if the curriculum ensures that the difficulty ratio between adjacent stages is at most the \(L/p\)-th root of the target difficulty, then the total sample count can be reduced from the exponential \((C^\star)^L\) of direct training to the polynomial \(L\cdot (C^\star)^{p_\max}\) of curriculum-based training.
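To get a feel for the gap between the two sample counts, the snippet below evaluates \((C^\star)^L\) and \(L\cdot (C^\star)^{p_\max}\) for illustrative values. The numbers are made up, and constants and logarithmic factors that a full bound would carry are ignored.

```python
def direct_samples(c_star: int, depth: int) -> int:
    """Exponential count (C*)^L for direct training."""
    return c_star ** depth


def curriculum_samples(c_star: int, depth: int, p_max: int) -> int:
    """Polynomial count L * (C*)^p_max for curriculum training."""
    return depth * c_star ** p_max


# Illustrative values only: C* = 2, L = 20, p_max = 4.
print(direct_samples(2, 20))         # 1048576
print(curriculum_samples(2, 20, 4))  # 320
```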
- ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
-
ResRL theoretically decomposes the "negative sample gradient polluting positive sample" phenomenon (Lazy Likelihood Displacement) in RLVR into two components: "logit × representation." It then applies a projection residual at the representation layer using the SVD low-rank subspace of positive samples, assigning each negative token a gradient weight in \([\xi,1]\) based on its "orthogonal component energy"—the more similar the representation to positive samples (smaller residual), the lighter the penalty; only purely erroneous components are heavily penalized. This preserves Pass@1 while maintaining Pass@k diversity. On Qwen3-4B math tasks, Avg@16 improves by 9.4% and Pass@128 by 7.0% over NSR.
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
-
The authors translate the human-annotated solution steps of the MATH dataset into "reusable Python tools with descriptions and type signatures," constructing the ToolMATH benchmark with 8K problems and 12K tools. It covers long-horizon multi-tool composition (hop 1-8+), controllable distractor tool similarity (5 levels × 4 densities), and scenarios where all gold tools are removed. Validation shows that the dominant failure factor is not tool selection but reasoning itself—thought errors account for over 90%, and distractor tools amplify early minor deviations into irreversible execution drift.
- Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards
-
The authors use "whether a ground street view and a satellite image can be localized to the same coordinate" as a verifiable indirect reward, and apply two-stage post-training (CoT scaffolding + RL self-exploring) with GRPO to Qwen2.5-VL-7B. This enables the model to learn general reasoning abilities that can zero-shot transfer to 25+ geospatial tasks using only GPS metadata.