
💡 LLM Reasoning

🧪 ICML2026 · 78 paper notes

📌 Same area in other venues: 📷 CVPR2026 (16) · 🔬 ICLR2026 (241) · 💬 ACL2026 (82) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (82) · 📹 ICCV2025 (3)

🔥 Top topics: Reasoning ×55 · LLM ×18 · Multimodal/VLM ×2 · Agents ×2 · Alignment/RLHF ×2

A Formal Comparison Between Chain of Thought and Latent Thought: Based on computational complexity theory, this paper formally compares the expressive power of CoT (Chain of Thought) and Latent Thought (Looped Transformer / Coconut). It proves that Latent Thought strictly reaches \(\mathsf{TC}^k\) under polylogarithmic depth, while CoT reaches at most \(\mathsf{TC}^{k-1}\). Simultaneously, in a probabilistic setting, it reveals for the first time that CoT can support FPRAS counting through stochastic decoding, thereby surpassing deterministic Latent Thought.
Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs: Addressing the practical constraint of "fixed token budgets per query" during deployment, this paper proposes Budget-Guided MCTS (BG-MCTS). It utilizes a "budget sufficiency ratio \(\rho\)" as a unified scheduling signal to transition tree search from broad exploration in early stages to deep refinement and answer completion as the budget depletes, consistently outperforming budget-agnostic tree search baselines on mathematical and physical reasoning benchmarks.
An Information-Theoretic Criterion for Efficient Data Synthesis: This paper employs the Data Processing Inequality (DPI) to explain why synthetic data can be effective or cause model collapse: a synthetic data pipeline is information-open only if the closed training loop continuously introduces stable external signals. Furthermore, high-level (meta-level) verification signals are more efficient and generalizable than instance-level imitation.
Are Large Reasoning Models Interruptible?: This paper shifts the evaluation of large reasoning models from static problem-solving to dynamic environments where models may be interrupted or receive mid-generation updates. The authors construct evaluation protocols for mathematics and programming and identify three consistent failure modes: reasoning leakage, panic answering, and self-doubt.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning: AutoTool utilizes reinforcement learning to enable Multimodal Large Language Models (MLLMs) to first determine whether a "zoom-in tool" is truly necessary for a given task. By adaptively switching between tool-assisted reasoning and pure text reasoning, the model achieves simultaneous improvements in accuracy and efficiency across high-resolution perception, grounding, hallucination detection, and reasoning tasks.
Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization: The authors use attention dynamics to "develop" the reasoning process—discovering a "preplan-and-anchor" two-beat rhythm during generation. They convert two internal metrics (WAAD/FAI) characterizing this rhythm into token-level advantage amplification coefficients for RL. This allows GRPO to concentrate credit on critical tokens that dictate the direction of downstream reasoning, achieving consistent performance gains across Countdown, QA, and multiple mathematical reasoning benchmarks.
Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning: This paper models LLM reasoning as an optimal control problem in latent space (Linear Quadratic Regulator, LQR) and proposes the Test-Time Control (TTC) layer, which performs finite-horizon planning during the forward pass. The optimal control action is decoded as the next-token representation. Combined with a CUDA-efficient Symplectic Iteration solver, this adapter-style layer achieves up to +27.8% gain on MATH-500 and a 2-3× increase in Pass@8 on AMC/AIME when inserted into pretrained LLMs.
Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning: This paper proposes the BRIDGE framework, which models the integration of SFT and RL as a bilevel optimization problem. In this framework, an SFT-based upper-level teacher learns to selectively transfer beneficial supervisory signals to an RL-based student via a lightweight LoRA module, achieving an average absolute improvement of over 3 percentage points across five mathematical reasoning benchmarks.
Biases in the Blind Spot: Detecting What LLMs Fail to Mention: This paper proposes a fully automated black-box pipeline to detect "unverbalized biases"—implicit factors that systematically influence model decisions but are never mentioned in Chain-of-Thought (CoT) reasoning. By utilizing LLMs to automatically generate conceptual hypotheses, counterfactual input variants, and sequential statistical tests, the method discovered known biases such as gender and race across three decision-making tasks, as well as novel biases like Spanish fluency, English proficiency, and writing formality.
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling: This paper proposes Prefix-RFT, which constructs mixed trajectories by sampling prefixes from expert demonstrations and concatenating model continuations. This approach injects knowledge guidance from SFT while maintaining the objective-oriented optimization of RFT, significantly outperforming independent SFT, RFT, and existing hybrid methods on mathematical reasoning tasks.
Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models: This paper proposes the Structured Ignorance Certificate (SIC)—an output format that mandates models, when encountering cross-domain problems exceeding their knowledge boundaries, to explicitly state via JSON "which two domains' intersection is missing, which concepts are required, and what should be retrieved" instead of hallucinating answers. Through a dataset of 7,347 automatically synthesized "Unknown Unknown" (UU) cross-domain problems and Group Relative Policy Optimization (GRPO) reinforcement fine-tuning, a 14B model learns to stably produce these certificates (99.46% JSON validity, 0.967 concept specificity).
Chain-of-Thought Reasoning in the Wild Is Not Always Faithful: This paper reveals two types of unfaithful behavior in frontier LLM Chain-of-Thought (CoT) under non-adversarial, naturally phrased prompts (without human-injected bias): Implicit Post-hoc Rationalization (generating contradictory but seemingly plausible arguments for the same comparative question pairs) and Unfaithful Illogical Shortcuts (skipping critical reasoning steps in difficult math problems while still reaching the correct answer). The unfaithfulness rate in production models reaches up to 13%, and even reasoning models (DeepSeek R1: 0.37%, Claude 3.7 Sonnet thinking: 0.04%) are not perfectly faithful.
Clustering as Reasoning: A \(k\)-Means Interpretation of Chain-of-Thought Graph Learning: This paper reveals the mathematical equivalence between Transformer self-attention and \(k\)-means clustering. Based on this, it designs the KCoT framework, which explicitly decomposes CoT reasoning into "assignment-update" semantic filtering prompts. It employs Condition-Net to dynamically fuse topological priors with evolving thought representations, consistently surpassing SOTA in node classification and link prediction.
CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers: This paper proposes CoCoReviewBench, which transforms human reviews of 3,900 ICLR/NeurIPS papers into a more credible evaluation reference for AI reviewers through a two-step process: constructing category-based sub-benchmarks and filtering erroneous opinions using meta-review arbitration of reviewer/author conflicts. The study reveals that current AI reviewing still lags behind humans in correctness and thoroughness, while reasoning models demonstrate significantly higher potential.
Conformal Thinking: Risk Control for Reasoning on a Compute Budget: This paper reframes the problem of "when a reasoning LLM should stop thinking" from an uninterpretable threshold tuning task into a user-specified risk tolerance conformal risk control problem. By employing dual thresholds—an upper threshold to stop when the model is confident (controlling false positives) and a newly proposed parameterized lower threshold to force a stop when the model is "stuck" on unsolvable problems (controlling false negatives)—and automatically deriving thresholds via the UCB algorithm on a calibration set, the method achieves significant token savings on AIME / GPQA / MathVision while maintaining accuracy.
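
The dual-threshold stopping rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `stop_step`, the per-step confidence traces, and the constant threshold values are all hypothetical, and the calibration step that derives the thresholds from a user-specified risk level is omitted.

```python
from typing import Optional, Sequence, Tuple


def stop_step(
    confidences: Sequence[float], upper: float, lower: float
) -> Tuple[Optional[int], str]:
    """Return (step index, reason) for the first step at which thinking stops.

    Stops when confidence rises to `upper` (model is confident) or falls to
    `lower` (model appears stuck). Returns (None, "budget") if neither fires.
    Both thresholds are assumed to be constants supplied by a calibration step.
    """
    for i, c in enumerate(confidences):
        if c >= upper:
            return i, "confident"
        if c <= lower:
            return i, "stuck"
    return None, "budget"


# Hypothetical confidence traces for a solvable and an unsolvable problem.
trace_a = [0.40, 0.55, 0.93, 0.97]
trace_b = [0.30, 0.22, 0.08, 0.05]
result_a = stop_step(trace_a, upper=0.9, lower=0.1)
result_b = stop_step(trace_b, upper=0.9, lower=0.1)
```

In this toy setting both traces stop at the third step, one because the upper threshold fires and one because the lower threshold fires, which is where the token savings would come from.
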
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback: The authors first identify three critical flaws in "pure numerical reward RL" (performance plateaus, ineffective spontaneous reflection, and stubborn failures), then integrate natural language critique into online RL. The model learns both the initial response and "self-refinement based on critique." A shaping function is used to bias towards "correct but unfamiliar" refinements while suppressing incorrect ones, achieving an average Pass@1 improvement of approximately +15.0% to +21.6% across eight reasoning benchmarks (Qwen series).
DecepChain: Inducing Deceptive Reasoning in Large Language Models: DecepChain proposes the first backdoor training paradigm capable of inducing LLMs to generate Chain-of-Thought (CoT) that "reads exactly like normal reasoning but inevitably yields incorrect answers" when specific trigger words are present. By first performing SFT using the model's own "natural error" trajectories and then amplifying the deception via GRPO curriculum reinforcement learning with inverse and format rewards, the framework thoroughly erases the boundary between "seemingly credible reasoning" and "truly credible reasoning."
Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs: The proposal-score loop dominated by LLMs in symbolic regression is decomposed into two layers: "Proposal vs. Navigation." By explicitly guiding the LLM with three types of signals—adaptive operators (direction), diagnostic tools (residuals/dimensions), and reflective memory (trajectory experience)—this method reduces the average NMSE by 37–55% on LLM-SRBench using only 40% of the evaluation budget.
DenseSteer: Steering Small Language Models towards Dense Math Reasoning: It is observed that stronger models use fewer CoT steps but have higher information density per step (Dense Reasoning). DenseSteer uses GPT-5.1 to rewrite the sparse solutions generated by the small language model (SLM) itself into "information-dense" in-distribution positive samples. These form contrastive pairs with the original solutions. A steering vector, obtained via mean difference, is injected into the residual stream of an intermediate layer (\(\approx\) L17). This zero-training method consistently improves performance on math benchmarks like GSM8K / MATH500 / AMC / AIME without increasing token-level NLL.
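
The mean-difference steering step can be sketched in a few lines of NumPy. This is a sketch under stated assumptions: the activations here are random stand-ins for residual-stream states at one layer, and the names `steering_vector`, `apply_steering`, and the scale `alpha` are hypothetical rather than taken from the paper.

```python
import numpy as np


def steering_vector(dense_acts: np.ndarray, sparse_acts: np.ndarray) -> np.ndarray:
    """Mean difference between activations of dense and sparse solutions.

    Both inputs have shape (num_examples, hidden_dim).
    """
    return dense_acts.mean(axis=0) - sparse_acts.mean(axis=0)


def apply_steering(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position's hidden state."""
    return hidden + alpha * vector


rng = np.random.default_rng(0)
dense = rng.normal(loc=1.0, size=(32, 8))   # stand-in: dense-solution activations
sparse = rng.normal(loc=0.0, size=(32, 8))  # stand-in: original-solution activations
v = steering_vector(dense, sparse)

h = rng.normal(size=(5, 8))                 # stand-in: 5 token positions at one layer
steered = apply_steering(h, v, alpha=0.5)
```

No parameters are updated here, which matches the zero-training framing: the only change is an additive shift applied to hidden states at inference time.
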
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution: This work formalizes the identification of erroneous steps in Chain-of-Thought (CoT) reasoning as a step-wise confidence attribution problem in black-box scenarios. By utilizing the Information Bottleneck (IB) principle, "correct reasoning trajectories obtained via multiple sampling of the same problem" are compressed into a consensus structure. Two implementations are provided: the training-free NIBS (Semantic Consensus Alignment) and the learnable GIBS (Graph Consensus Subgraph Selection). Both consistently outperform white-box baselines on GSM8K, Math, and MoreHopQA, improving self-correction success rates by up to 13.5% through step-wise feedback.
Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models: This paper systematically investigates the effectiveness of test-time compute (TTC) strategies on Vision-Language Models (VLMs). It theoretically demonstrates that the gains of majority voting are limited by prediction diversity and proposes ETTC, which selects the most confident model based on prediction entropy. This approach allows smaller models to enhance larger ones, achieving an average improvement of +2.8% over 7 VLMs across 6 benchmarks, outperforming the strongest single models.
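
Entropy-based selection of the most confident model can be sketched as below. This is a minimal sketch assuming each model exposes a probability distribution over the same answer options; the model names, the numbers, and the helper `select_most_confident` are hypothetical.

```python
import numpy as np


def entropy(p) -> float:
    """Shannon entropy (in nats) of a distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


def select_most_confident(model_probs: dict):
    """Return (model name, answer index) of the model with the lowest entropy."""
    name = min(model_probs, key=lambda k: entropy(model_probs[k]))
    return name, int(np.argmax(model_probs[name]))


# Hypothetical answer distributions from two models on one question.
probs = {
    "small_vlm": [0.05, 0.90, 0.05],
    "large_vlm": [0.40, 0.35, 0.25],
}
chosen, answer = select_most_confident(probs)
```

In this toy case the smaller model is selected because its prediction is far more peaked, which illustrates how a small model could override a larger but uncertain one.
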
Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents: In "tool-optional" visual agent tasks such as 3D spatial reasoning, the authors find that vanilla RFT causes tool-calling rates to collapse to near zero, while explicitly encouraging tool use yields only marginal gains. The true driver of performance is the exploration diversity of rollouts. With adaptive entropy regularization, the method improves 3DSRBench accuracy from 59.2% to 62.9%, repositioning tools as "training-time scaffolding" rather than inference-time necessities.
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling: DyCon discovers that "problem difficulty" evolves dynamically during the reasoning process and is linearly encoded in the step-level hidden representations of Large Reasoning Models (LRMs). It employs a lightweight linear regressor to estimate step-wise difficulty online and adjusts the logits of "reflection-related tokens" in real-time. This allows simple problems to converge early while difficult ones continue exploring, significantly compressing redundant reasoning tokens without sacrificing accuracy—all via a training-free process that does not modify model parameters.
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure: The authors treat latent Chain-of-Thought (CoT) as an intervenable Structural Causal Model (SCM), performing step-wise do-interventions + early-exit decoding + teacher-forced readout for each continuous "thinking step." By systematically quantifying the step-level necessity, propagation structure, and trajectory superposition of Coconut/CODI in mathematical and commonsense reasoning, they find that latent steps are not a homogenized "deepening" but a structured interface characterized by high heterogeneity, non-local routing, and "output commitment" preceding "representation commitment."
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment: ETS samples directly from the closed-form optimal solution of the KL-regularized RLHF objective, formulating it as "reference policy \(\times\) conditional expectation of exponential reward (energy term)." By using Monte Carlo + Self-Normalized Importance Sampling to approximate this energy term at test time, it achieves or exceeds the performance of policies post-trained with RL without any training. It maintains practical latency through a lightweight proposal + Fast-dLLM.
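
The KL-regularized objective has the well-known closed-form optimum \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(r(x,y)/\beta)\), and self-normalized importance sampling against it can be sketched as follows. This is a simplified, sequence-level sketch rather than the paper's procedure: candidates are assumed to be already drawn from the reference policy, and the rewards, `beta`, and function names are hypothetical.

```python
import numpy as np


def snis_weights(rewards, beta: float) -> np.ndarray:
    """Self-normalized weights proportional to exp(reward / beta).

    Candidates are assumed to come from the reference policy, so the
    reference probability cancels in the importance ratio.
    """
    z = np.asarray(rewards, dtype=float) / beta
    z -= z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def resample(candidates, rewards, beta: float, rng: np.random.Generator):
    """Draw one candidate according to the self-normalized weights."""
    w = snis_weights(rewards, beta)
    return candidates[rng.choice(len(candidates), p=w)]


rng = np.random.default_rng(0)
candidates = ["answer A", "answer B", "answer C"]  # hypothetical reference samples
rewards = [0.1, 0.9, 0.4]                          # hypothetical reward scores
weights = snis_weights(rewards, beta=0.2)
pick = resample(candidates, rewards, beta=0.2, rng=rng)
```

A smaller `beta` concentrates the weights on the highest-reward candidate, while a larger `beta` keeps the resampled output closer to the reference policy.
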
Evaluating Relational Reasoning in LLMs with REL: The authors adopt "Relational Complexity" (RC) from cognitive science — the number of independent variables that must be simultaneously bound in a single reasoning step — as a unified axis for measuring task difficulty. They construct REL, a generative benchmark spanning algebra, biology, and chemistry. Findings indicate that the accuracy of frontier LLMs (Claude Opus 4.5 / Gemini 3 Pro / GPT-5.2) monotonically decreases as RC increases, and this bottleneck cannot be resolved by test-time compute, ICL, or external tools.
FloorplanQA: A Benchmark for Spatial Reasoning in LLMs Using Structured Representations: FloorplanQA systematically diagnoses the "pure symbolic spatial reasoning" capabilities of 15 cutting-edge LLMs using 2,000 JSON/XML-formatted 2D indoor layouts and 16,000 geometric problems (distance, visibility, pathing, placement, etc.). The study reveals that while models can calculate simple distances, they consistently fail at set unions, planning, and constraint satisfaction. Furthermore, Python tool augmentation fixes arithmetic errors but cannot salvage failures at the algorithmic level.
ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution: ForesightKV trains a lightweight scoring model to dynamically evict KV pairs based on "future attention contribution." It utilizes a "Golden Eviction" algorithm to distill optimal eviction sequences from complete traces as supervision signals, followed by GRPO reinforcement learning fine-tuning with a reward based on the "sum of squared loss increments of low-entropy tokens." On AIME2024/2025, it outperforms SnapKV/H2O/R-KV with half the KV budget; a 4K budget preserves 99% of the original model performance.
From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates: NSPI allows LLMs to propose approximate Sum-of-Squares (SOS) structural conjectures, which are refined through Gauss–Newton iteration and rational recovery into rigorous SOS decompositions with rational coefficients. These are then automatically verified by machines using Lean's linear_combination + positivity tactics, extending inequality proving to up to 10 variables.
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning: The authors treat each Transformer attention matrix as a token-weighted graph and extract four parameter-free spectral diagnostics: Fiedler value, HFER, spectral entropy, and smoothness. They discover that "valid mathematical reasoning" leaves measurable fingerprints on the attention spectrum (Cohen’s \(d\) up to 3.30), allowing the model to distinguish between true reasoning and pattern matching with 85–96% accuracy without any training.
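
Treating an attention matrix as a weighted graph, two of the spectral diagnostics can be sketched with a graph Laplacian. This is a sketch under assumptions not confirmed by the summary: the attention matrix is symmetrized by averaging with its transpose, the unnormalized Laplacian is used, and spectral entropy is taken as the Shannon entropy of the normalized nonzero eigenvalues.

```python
import numpy as np


def spectral_diagnostics(attn):
    """Return (Fiedler value, spectral entropy) for one attention matrix."""
    a = np.asarray(attn, dtype=float)
    w = 0.5 * (a + a.T)                 # assumed symmetrization into an undirected graph
    lap = np.diag(w.sum(axis=1)) - w    # unnormalized graph Laplacian
    eig = np.linalg.eigvalsh(lap)       # eigenvalues in ascending order
    fiedler = float(eig[1])             # second-smallest eigenvalue
    pos = eig[eig > 1e-9]
    p = pos / pos.sum()
    entropy = float(-(p * np.log(p)).sum())
    return fiedler, entropy


# Uniform attention over 4 tokens: one connected component.
connected = np.full((4, 4), 0.25)
# Two groups of tokens that never attend to each other.
disconnected = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
fiedler_conn, entropy_conn = spectral_diagnostics(connected)
fiedler_disc, _ = spectral_diagnostics(disconnected)
```

The Fiedler value is positive for the connected pattern and zero for the disconnected one, which is the sense in which it measures how well information can flow across token positions.
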
GRPO is Secretly a Process Reward Model: This paper theoretically proves that GRPO + ORM is equivalent to a process reward RL objective with a Monte-Carlo PRM under the mild condition of "shared prefixes within a group." It reveals a hidden bug in vanilla GRPO—uneven prefix lengths cause majority tokens in high-reward trajectories to receive negative advantages—and proposes \(\lambda\)-GRPO, which utilizes a PRM-aware normalization to consistently outperform GRPO on reasoning benchmarks with approximately 2x faster training.
Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal: Using a simple logistic regression probe on the hidden states of an LLM during Chain-of-Thought (CoT) generation can predict whether the entire reasoning process will fail with an AUROC of 0.95 (0.79 starting from the very first step). In contrast, a classifier trained on the surface text only achieves 0.59. However, four types of interventions (activation steering, probe-guided best-of-N, self-correction, and activation patching) all failed—indicating that this error signal is "diagnostic" rather than "causal."
How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning: This paper systematically measures the predictive power of LLM latent states regarding "future reasoning" across 12 cross-domain tasks using a low-rank adapter probe called Tele-Lens. The study reveals that internal LLM planning is myopic—it precisely locks onto answers only at the end of the CoT. Based on this, the "Wooden Barrel Principle" is proposed, using uncertainty at sparse pivot positions to represent the entire CoT, significantly improving uncertainty calibration and enabling a 16% CoT bypass.
Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models: This paper targets the vulnerability of Large Reasoning Models (LRMs) to "logically incomplete inputs" that trigger overthinking. It proposes a Hierarchical Genetic Algorithm (HGA) that treats structured problem decompositions as genes under pure black-box conditions. Through sentence-level/question-level crossover and addition/deletion mutations, it searches for adversarial samples with logical fractures. This method amplifies response lengths by up to 26.1x on the MATH benchmark, enabling low-cost DoS attacks.
Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models: ITCR transforms conformal prediction from a post-hoc "generate-then-prune" approach into an inference-time mechanism. It learns a graph-level factuality uncertainty function on the LLM reasoning graph and constructs a non-conformity score that increases monotonically with subgraph expansion. By stopping expansion immediately once a calibrated threshold is crossed, it provides valid \(1-\alpha\) coverage guarantees for "no-false steps" or "no-missed correct steps," improving downstream reasoning accuracy by an average of 18.77%.
Inference Time Optimization with Confidence Dynamics: The authors observe that during LLM multi-sample inference, the confidence of correct trajectories systematically increases along the reasoning chain, while that of incorrect trajectories decays. Based on this, they propose CDG (Confidence Dynamic Gain) voting, which uses "tail confidence − head confidence" as an additional discriminative signal embedded in Best-of-N weighted voting. Across four open-source reasoning models and four mathematical olympiad benchmarks, CDG achieves an average improvement of 5.4% over majority voting and 1.7–4.8% over DeepConf.
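
The "tail confidence − head confidence" signal can be folded into weighted voting roughly as follows. This is a sketch, not the paper's formula: the window size `k`, the exponential weighting with scale `lam`, and the sample data are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def confidence_gain(confidences, k: int = 2) -> float:
    """Mean confidence of the last k steps minus that of the first k steps."""
    head = sum(confidences[:k]) / k
    tail = sum(confidences[-k:]) / k
    return tail - head


def cdg_weighted_vote(samples, k: int = 2, lam: float = 5.0) -> str:
    """Vote over answers, weighting each sample by exp(lam * gain).

    `samples` is a list of (answer, per-step confidences). The exponential
    weight is a hypothetical choice; any increasing function of the gain
    would serve the same illustrative purpose.
    """
    scores = defaultdict(float)
    for answer, conf in samples:
        scores[answer] += math.exp(lam * confidence_gain(conf, k))
    return max(scores, key=scores.get)


# Hypothetical samples: "41" has the majority, but its confidence decays.
samples = [
    ("42", [0.5, 0.6, 0.8, 0.9]),
    ("41", [0.8, 0.7, 0.6, 0.5]),
    ("41", [0.7, 0.7, 0.6, 0.6]),
]
winner = cdg_weighted_vote(samples)
```

Plain majority voting would return "41" here, while the gain-weighted vote prefers the single trajectory whose confidence rises along the chain.
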
Internalizing Safety Understanding in Large Reasoning Models via Verification: This paper demonstrates that "generating safe answers" \(\neq\) "understanding safety" and proposes the SInternal framework: training large reasoning models (LRMs) solely to verify the safety of their own generated answers. The resulting emergent internal safety understanding significantly suppresses jailbreak attacks (StrongREJECT ASR drops from 41% to 0.6%) and provides a superior starting point for subsequent RL.
Is Code Better Than Language for Algorithmic Reasoning?: The authors utilize a "three-route" framework to disentangle two confounded factors in tool-augmented LLMs: reasoning representation (code vs. natural language) and execution mechanism (LLM simulation vs. real interpreter). Across 40 verifiable algorithmic tasks, results show that code representation itself provides negligible gains (+0.15pp), while "reliable external execution" significantly elevates accuracy from 17% to 49% (+31.47pp). A linear decision-theoretic model further proves that "code representation is non-inferior to natural language."
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning: LatentChem replaces "explicit CoT text chains" with "continuous latent thinking vectors + dynamic molecular-aware updates" in chemical LLMs. Under GRPO outcome-only rewards, the model was observed to spontaneously abandon text CoT in favor of latent reasoning. It achieved a non-tie win rate of 59.88% against explicit CoT baselines on ChemCoTBench, with an average 10.84x reduction in reasoning steps and a 5.96x wall-clock speedup.
MOSAIC: Learning When to Act or Refuse — Guarding Agentic Reasoning Models for Safe Multi-step Tool Use: MOSAIC transforms "safety decision-making" from an implicit byproduct of reasoning into an explicit first-class action within a plan-check-act/refuse loop (featuring <safety_thoughts> and refusal_tool). It utilizes pairwise trajectory preferences analyzed by an LLM judge combined with GRPO training. On Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, it achieves a 50% reduction in harmful behaviors in zero-shot OOD scenarios, a 20% increase in prompt injection rejection rates, and decreased privacy leakage—all while maintaining utility on benign tasks.
Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models: The paper reveals an overlooked failure mode of Test-Time Scaling (TTS): suppressing the diversity of candidate responses makes TTS more likely to output unsafe content than direct adversarial prompting does. It proposes RefDiv, a genetic algorithm driven by dual signals of Shannon entropy and reference guidance, which efficiently jailbreaks MCTS and Best-of-N across models, closed-source APIs, and guardrails.
Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models: LiDAR rewrites the Expected Future Reward (EFR) using a few pre-generated lookahead samples and forward perturbation kernels, transforming reward guidance into closed-form softmax weights without neural backpropagation. It matches DATE's performance on SDXL/GenEval while being 9.5× faster.
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn: This paper systematically reveals that the "rules of thumb" for many-shot ICL in non-reasoning tasks fail entirely in CoT reasoning—similarity retrieval is actually harmful, and order sensitivity increases with the number of shots. The study reinterprets successful many-shot CoT as "in-context test-time learning" and proposes the CDS method, which orders demonstrations by embedding trajectory curvature, achieving a 5.42 pp improvement on 64-shot geometry problems.
Mean-Shift PCA by Knockoff Mean: This paper utilizes Random Matrix Theory to prove that "mean-shift contamination" is asymptotically independent of true covariance spikes in the spectrum of the sample covariance matrix. Based on this, the authors propose MS-PCA, a two-stage algorithm: by intentionally injecting a "knockoff mean" (decoy mean-shift) and performing a second PCA, it identifies "decoy-driven" eigenvalues as contamination and removes them, thereby recovering true principal components using only standard PCA operations in high dimensions.
Measuring Weak-to-Strong Legibility of Reasoning Models: This paper proposes Transfer Utility (TU)—a metric that measures the "weak-to-strong legibility" of reasoning traces by feeding percentile prefixes of traces from a strong Reasoning Language Model (RLM) to a weak student model and assessing the student's ability to complete the correct answer. Across 12 open-source RLMs, 3 datasets, and 85k traces, the study finds that the traces of the most accurate and concise RLMs (e.g., GPT-OSS-120B) actually rank lowest in TU, suggesting that RLVR training transforms reasoning traces into artifacts useful only for strong models.
Modeling Hierarchical Thinking in Large Reasoning Models: The authors abstract the long CoT of Large Reasoning Models (LRMs) into a 6-state Finite State Machine (FSM). By constructing a Transition Advantage Matrix based on the probability difference between "success vs. failure" states and using Q-Value iteration to derive a long-horizon planning strategy, they perform sparse orthogonal activation steering only at sentence boundaries. This approach improves accuracy on difficult problems like AIME25 by up to +13% while using approximately 25× fewer interventions.
On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs: This paper systematically exposes the vulnerability of open-source VLMs in visual grounding and Chain-of-Thought (CoT) faithfulness after RL finetuning by injecting two types of controlled textual perturbations—"misleading captions" and "wrong CoT prefixes"—into visual reasoning benchmarks. It reveals an explicit trade-off between "accuracy \(\uparrow\) vs. CoT faithfulness \(\downarrow\)" under RL optimization and demonstrates that neither data augmentation nor faithfulness rewards can simultaneously resolve both issues.
On the Generalization Gap in Self-Evolving Language Model Reasoning: Under the strict closed-loop setting of "unlabeled prompts + base model" only, this paper systematically compares four self-evolution (SE) strategies (single-round verification, multi-round revision, iterative training, curriculum learning) against oracle supervision. It finds that on Knights & Knaves logical reasoning, SE improves Gemma 3 4B from 31.0% to 44.8%, yet a persistent gap of 8–13% remains relative to the oracle's 53.3%. Only RevisionSE with a 12B model approaches oracle performance (52.8% vs. 53.6%).
PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching: This paper reformulates unsupervised LLM fine-tuning as a problem of "matching the \(\alpha\)-power distribution of a base model," employing the Trajectory-Balance objective of GFlowNet as an amortized sampler. By introducing a length-aware LA-TB reparameterization, it eliminates structural length bias inherent in autoregressive generation. A single knob \(\alpha\) controls the direction—\(\alpha>1\) sharpens the distribution to stimulate reasoning (matching or exceeding supervised GRPO), while \(\alpha<1\) flattens the distribution to release the suppressed creativity of aligned models, simultaneously improving both quality and diversity on the Pareto frontier.
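
The effect of the single knob \(\alpha\) can be seen on a toy categorical distribution, where the \(\alpha\)-power distribution is \(p^{\alpha}\) renormalized. This sketch only illustrates the target distribution on three outcomes; it is not the GFlowNet training procedure, and the base probabilities are made up.

```python
import numpy as np


def power_distribution(p, alpha: float) -> np.ndarray:
    """Return p**alpha renormalized to sum to 1 (all entries must be > 0)."""
    logp = alpha * np.log(np.asarray(p, dtype=float))
    logp -= logp.max()  # subtract the max for numerical stability
    q = np.exp(logp)
    return q / q.sum()


def entropy(p) -> float:
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p)).sum())


base = np.array([0.5, 0.3, 0.2])        # hypothetical base-model distribution
sharp = power_distribution(base, 2.0)   # alpha > 1 sharpens
flat = power_distribution(base, 0.5)    # alpha < 1 flattens
```

With \(\alpha = 2\) the most likely outcome gains mass and entropy drops, while \(\alpha = 0.5\) moves the distribution toward uniform and entropy rises.
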
Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models: Looped Language Models (LoopLMs) iteratively regenerate latent representations \(T_{\max}\) times before emitting each token. Targeting this property, the paper proposes RLTT, which replaces the "final-loop-only" policy gradient in GRPO with one that weights each loop's next-token distribution \(P^{(t)}\) by \(\omega_t\). The method improves the average accuracy of Ouro-2.6B on MATH/AIME/BeyondAIME by +10.9% without external verifiers or additional inference overhead. It also yields spontaneous reductions in training time (-10%) and response length.
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models: The authors decompose the problem of "efficient test-time scaling for discrete diffusion language models (dLLMs)" into three components: Hierarchical Trajectory Search (HTS), which allocates computation via an "exploration → progressive pruning → refinement" schedule; local branching via partial remasking, which preserves high-confidence "logic skeletons"; and self-verification (SVF), which uses the dLLM itself as a Yes/No validator. Prism achieves comparable or superior accuracy to best-of-\(N\) with significantly fewer function evaluations (NFE) across four math and code benchmarks on three dLLMs.
Prompt Injection as Role Confusion: This paper attributes the root cause of "prompt injection" to a role confusion phenomenon where LLMs identify "who is speaking" in the latent space using style rather than labels. The authors propose "Role Probes" to quantify this confusion and design a CoT Forgery attack. This attack increases success rates from near 0% to over 60% across six frontier models. Furthermore, it demonstrates that the "role confusion degree" measured by probes can predict attack success before the model generates its first token.
R2-Router: A New Paradigm for LLM Routing with Reasoning: This paper proposes R2-Router, which transforms "output token budget" from a passive estimate into a controllable variable. By enabling the router to search in the joint (LLM, budget) space and using a lightweight multi-head quality predictor to extend each LLM from a static point into a quality-cost curve, it achieves comparable quality to existing routers at 4–5× lower cost.
Reasoning Can Be Restored by Correcting a Few Decision Tokens: The authors quantify the gap between base LLMs and reasoning LRMs using token-level distribution divergence. They find that the gap is highly concentrated in a small number of early, planning-related tokens where the base model is inherently uncertain (accounting for ~8% of tokens). Based on this, they propose "Divergence-Gated Single-Token Takeover"—letting the LRM generate only one token at divergence spikes before immediately handing control back to the base model. With an intervention budget of ~4-13%, this method recovers or even exceeds the performance of same-sized thinking models.
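
The gating logic of a divergence-triggered single-token takeover can be sketched on precomputed next-token distributions. This is a toy sketch with several assumptions: real decoding would recompute both distributions after every emitted token, and the KL direction, greedy token choice, threshold, and all numbers here are hypothetical.

```python
import numpy as np


def kl(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) between two categorical distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def gated_decode(base_dists, lrm_dists, threshold: float):
    """Pick each token from the base model unless divergence spikes.

    At a spike the reasoning model supplies exactly one token, then control
    returns to the base model. Returns (tokens, number of takeovers).
    """
    tokens, takeovers = [], 0
    for p_base, p_lrm in zip(base_dists, lrm_dists):
        if kl(p_lrm, p_base) > threshold:
            tokens.append(int(np.argmax(p_lrm)))
            takeovers += 1
        else:
            tokens.append(int(np.argmax(p_base)))
    return tokens, takeovers


# Hypothetical per-step distributions over a 3-token vocabulary.
base = [[0.70, 0.20, 0.10], [0.40, 0.35, 0.25], [0.10, 0.80, 0.10]]
lrm = [[0.68, 0.22, 0.10], [0.05, 0.05, 0.90], [0.12, 0.78, 0.10]]
tokens, takeovers = gated_decode(base, lrm, threshold=0.5)
```

Only the second step, where the base model is uncertain and the reasoning model is decisive, triggers a takeover, so one of three tokens comes from the reasoning model.
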
Reasoning Structure of Large Language Models: This paper converts the free-text Chain-of-Thought (CoT) of Large Reasoning Models (LRMs) into a verifiable DAG of "atomic claims + deductive dependencies." By defining a reasoning flow efficiency metric \(\eta\) based on the structural entropy of absorbing Markov chains, it demonstrates that even in regions where accuracy and token counts are saturated or overlapping, \(\eta\) can still distinguish between "focused reasoning" and "divergent exploration," serving as a fine-grained tool for diagnosing LRM failure modes.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning: ResRL theoretically decomposes the "negative sample gradient contaminating positive samples" phenomenon (Lazy Likelihood Displacement, LLD) in RLVR into "logit × representation" components. It then utilizes the SVD low-rank subspace of positive samples at the representation layer to compute projection residuals. Based on the "orthogonal component energy" of each negative token, a gradient weight in the \([\xi, 1]\) interval is assigned—lighter penalties for representations similar to positive samples (smaller residuals) and heavy penalties for purely erroneous components. This preserves Pass@1 while maintaining Pass@k diversity. On Qwen3-4B mathematical tasks, it achieves a 9.4% improvement in Avg@16 and a 7.0% improvement in Pass@128 compared to NSR.
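
A rough sketch of the projection-residual weighting described above. It assumes an orthonormal basis of the positive-sample subspace is already available (for example from a truncated SVD, not shown here) and that the weight grows linearly from \(\xi\) to 1 with the share of energy orthogonal to that subspace; the linear mapping, the function name `residual_weight`, and the default `xi` are illustrative assumptions, not taken from the paper.

```python
def residual_weight(h, basis, xi=0.2):
    """Gradient weight in [xi, 1] for one negative-token representation h.

    basis: list of orthonormal vectors spanning the positive-sample subspace
           (assumed precomputed). The weight is xi when h lies entirely inside
           the subspace and 1 when h is entirely orthogonal to it.
    """
    total_energy = sum(x * x for x in h)
    if total_energy == 0.0:
        return xi
    # Energy of the projection onto the subspace: sum of squared coefficients.
    proj_energy = sum(
        sum(b_i * h_i for b_i, h_i in zip(b, h)) ** 2 for b in basis
    )
    # Share of energy in the orthogonal component (the projection residual).
    ortho_ratio = max(0.0, 1.0 - proj_energy / total_energy)
    return xi + (1.0 - xi) * ortho_ratio

# Toy positive subspace: the first two coordinate axes in 3-D.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_similar = residual_weight([1.0, 1.0, 0.0], basis)   # inside the subspace
w_mixed = residual_weight([1.0, 0.0, 1.0], basis)     # half inside, half outside
w_erroneous = residual_weight([0.0, 0.0, 2.0], basis)  # fully orthogonal
```

Representations that resemble positive samples receive the floor weight `xi`, while purely orthogonal ones receive the full penalty weight of 1.
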
Reward Modeling from Natural Language Human Feedback: This paper identifies a severe outcome-process inconsistency in generative reward models (GRMs) trained on binary preference rewards—where models "predict the preference correctly but provide incorrect critiques" (20–30%, up to 44%). It proposes RM-NLHF: using the similarity of core arguments between model critiques and human critiques as an additional process reward, and using MetaRM to automatically predict process rewards with online on-policy updates. This approach consistently outperforms SOTA GRMs trained via outcome-only GRPO across multiple benchmarks.
Scaling-Aware Adapter for Structure-Grounded LLM Reasoning: Cuttlefish replaces the "fixed-length query tokens" typical of Q-Former with "instruction-conditioned patch tokens" that grow adaptively based on structural complexity. It utilizes cross-attention to inject geometric features extracted by an EGNN as modality tokens into the LLM, effectively reducing hallucinations and handling scaling across four all-atom modalities: molecules, proteins, DNA, and RNA, outperforming several modality-specific baselines.
Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics: This paper introduces the first systematic study of "logicality" in LLM scientific reasoning. It proposes a three-dimensional evaluation metric—"Logical Fidelity / Causal Connection / Inferential Progress"—and constructs two SFT data sampling methods based on these metrics: Style Transfer (RST) and Logic Distillation (Logic-Distill). These methods significantly improve both logicality and answer accuracy for 7B models on the self-built PhysLogic benchmark and three public physics benchmarks.
Select to Think: Unlocking SLM Potential with Local Sufficiency: This paper discovers that Small Language Models (SLMs) often already include the token preferred by Large Language Models (LLMs) in their top-K candidate sets at reasoning "divergence points" (the top-8 of a 1.5B model hits the 32B teacher's choice 95% of the time), but these are missed by greedy decoding. The authors reframe the LLM's role from "open-ended generation" to "selecting from SLM candidates" and distill this selection logic into the SLM itself. This allows a 1.5B model to improve its Math score by 24.1% relative to the baseline using single-track decoding, matching the performance of 8-way self-consistency while using only 1/8 of the compute.
Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain: The authors argue that the current collapse of "LLM self-play" within a few rounds is fundamentally due to self-synthetic data failing to provide learnable information gain; they formalize "learnable information" using bounded MDL/epiplexity and propose three system-level designs—Asymmetric Co-evolution, Capacity Growth, and Proactive Information Seeking—to collectively ensure the monotonic increase of learnable information in the Proposer-Solver-Verifier self-evolution loop.
SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning: This paper proposes SmartThinker, an efficient reasoning post-training method based on GRPO. By modeling the "total trajectory length distribution" and the "correct trajectory length distribution" of each prompt as Gaussians, the authors analytically derive the "optimal length \(l^{\text{opt}}\) that maximizes accuracy." Combined with a dynamic length reward coefficient \(\Lambda\) that ensures non-negative normalized advantage for correct trajectories, the method achieves up to 52.6% token compression while improving AIME25 accuracy by up to 16.6% (relative).
Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models: This paper diagnoses the root cause of the "gain-then-collapse" phenomenon in Looped Language Models (LoopLM) when scaling depth at test-time from a dynamical systems perspective—a "Stability-Effectiveness" dilemma caused by normalization placement. It proposes STARS: using Jacobian Spectral Radius Regularization (JSRR) + Stochastic Recurrent Sampling to pull latent trajectories toward "asymptotically stable effective fixed points." On GSM8K, it compresses the performance drop at 8 iterations from 20.47% to 8.26%, while increasing peak performance by 4.01%.
Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models: ASAG is a training-free, plug-and-play early exit framework for reasoning. It monitors both model confidence and attention entropy at the switching points of each "reasoning action" in Large Reasoning Models (LRMs) to determine if reasoning has truly converged. It adaptively selects from four strategies—"early exit," "logits injection for enhancement," "trap escape," or "continue"—improving average accuracy by 3.2% on Qwen3-8B while reducing generated tokens by nearly 40%.
SuCo: Sufficiency-guided Continuous Adaptive Reasoning: SuCo proposes "Minimum Sufficient CoT (MSC)"—the shortest CoT prefix capable of producing the correct answer. Based on this, it designs a two-stage training process (MSC-aligned Fine-Tuning, MFT + Sufficiency-Aware Policy Optimization, SAPO), enabling large reasoning models to autonomously adjust reasoning length on a continuous spectrum. It achieves higher accuracy with fewer tokens across math, code, and science benchmarks (7B average accuracy +2.7, reasoning length reduced from 5239 to 1267).
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary: This paper identifies a "Deterministic Horizon" (approx. 19-31 steps) in decoder-only Transformers for tasks requiring deterministic state tracking, where extending reasoning beyond this threshold leads to performance collapse due to attention capacity limits. Through information theory and large-scale empirical analysis (720,000 evaluations), the authors prove this is an architectural capability failure ("Decoherence") rather than a "Simplicity Bias," and quantitatively demonstrate the necessity of tool delegation (e.g., symbolic solvers)—which can restore accuracy from 24-42% to 86-94%.
The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning: This paper decomposes the training dynamics of GRPO and discovers that treating easy, hard, and learnable problems uniformly leads to compute mismatch. It proposes CoDaPO, which calculates a bounded value based on "confidence × difficulty" for each question. This value is used both to weight gradient updates and to resample high-value questions, concentrating updates on the "learnable zone" within a fixed compute budget. CoDaPO consistently outperforms methods like GRPO across 12 reasoning benchmarks.
The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought: This paper provides the first proof that a standard Transformer decoder using softmax attention and bfloat16-level precision (where both activations and attention weights are rounded) can simulate any Turing machine using CoT, provided its depth and width grow logarithmically with the context. It further proves that Summarized CoT (SCoT) reduces the required scale from a time bound \(\hat{t}\) to a space bound \(\hat{s}\). Empirical results on Sudoku tasks reveal that "increasing depth rather than increasing precision" is the true remedy for CoT failures in long contexts.
The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning: This paper reveals the "Quality-Utility Paradox" in small language model (SLM) mathematical reasoning distillation: training data refined by strong Oracles, which receives higher reward model scores, actually results in inferior downstream fine-tuning compared to lower-scoring data sampled by the SLMs themselves. This is because Oracle refinement, while fixing logic, pushes reasoning traces away from the SLM's native distribution, raising the adaptation cost. The authors introduce "Style-Aligned Refinement" to decouple logic repair from stylistic drift, successfully reclaiming downstream gains.
The Role of Feedback Alignment in Self-Distillation: This paper systematically investigates the design of context in "self-distillation." Comparing three feedback forms within a solver–critic framework, it finds that corrective feedback step-aligned with the solver's own reasoning trajectory (StepAlignFB) significantly outperforms binary rewards (GRPO, +16.11 points) and reference solutions (RefSol, +5.27 points Avg@12). This is because it concentrates distillation signals on the solver's actual erroneous tokens while bypassing correct steps, thereby implicitly achieving process-level supervision (PRM-style signals) without training a reward model.
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning: The authors construct the ToolMATH benchmark, containing 8K problems and 12K tools, by translating manual solution steps in the MATH dataset into "reusable Python tools with descriptions and type signatures." It covers long-horizon multi-tool compositions (hops 1-8+), controllable distractor similarity (5 levels × 4 densities), and tool-missing scenarios where "gold tools are entirely removed." Evaluations reveal that the dominant factor for model failure is reasoning itself rather than tool selection—thought errors account for over 90%, while distractors amplify early minor deviations into irreversible execution drift.
TRACE: Evaluating LLM CoT Reasoning Process Quality with the Toulmin Argumentation Model: TRACE is a reference-free CoT quality evaluation metric that synthesizes the Toulmin Argumentation Model (Claim/Data/Warrant/Backing/Qualifier/Rebuttal) and Flavell Metacognition (Monitoring/Evaluation) into 8 core elements. It utilizes DeBERTa for multi-label recognition of these elements in each reasoning sentence, calculating a weighted sum of "State Validity + Transition Coherence." Across 26.3K QA pairs from 7 models, it achieves a correlation of \(r=0.741\) with benchmark accuracy and improves GSM8K performance by +9.9% when used as an RL reward.
UCPO: Uncertainty-Aware Policy Optimization: UCPO addresses the advantage bias caused by fixed uncertainty rewards in existing RL paradigms through two mechanisms: Tri-Advantage Decoupling (TAD) and Dynamic Uncertainty Reward Adjustment (DURA). This allows LLMs to reliably express uncertainty at knowledge boundaries, achieving a PAQ of 79.63% in mathematical reasoning on Qwen3-8B.
UniScale: Adaptive Unified Inference Scaling through Online Joint Optimization of Model Routing and Test-Time Scaling: The authors propose the UniScale framework, which unifies model routing and test-time scaling (TTS) into a single decision space. It leverages LinUCB contextual multi-armed bandits for online learning of adaptive inference strategies, addressing the fine-grained quality-cost tradeoff in LLM deployment.
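
For readers unfamiliar with LinUCB, the standard disjoint variant can be sketched as below. Treating each (model, test-time-scaling strategy) pair as one arm, the arm names, the two-dimensional query features, and the scalar reward are hypothetical choices for illustration; how UniScale actually builds its features and reward is not specified in this summary.

```python
import math

class LinUCBArm:
    """One arm of disjoint LinUCB; keeps A^{-1} directly via Sherman-Morrison."""

    def __init__(self, dim):
        # A starts as the identity matrix, so A^{-1} is the identity as well.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x, alpha):
        """Estimated reward plus an exploration bonus for context x."""
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + alpha * width

    def update(self, x, reward):
        """Rank-one update: A += x x^T and b += reward * x."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def select_arm(arms, x, alpha=1.0):
    """Return the name of the arm with the highest upper confidence bound."""
    scores = {name: arm.ucb(x, alpha) for name, arm in arms.items()}
    return max(scores, key=scores.get)

# Hypothetical arms: each pairs a model with a test-time scaling setting.
arms = {"small-model/1-sample": LinUCBArm(2), "large-model/4-samples": LinUCBArm(2)}
query_features = [1.0, 0.0]  # assumed query embedding
for _ in range(5):
    arms["large-model/4-samples"].update(query_features, reward=1.0)
    arms["small-model/1-sample"].update(query_features, reward=0.0)
choice = select_arm(arms, query_features, alpha=0.1)
```

In practice the reward would need to combine answer quality with cost, for example quality minus a cost penalty, so that the bandit learns the tradeoff online; that combination is an assumption here.
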
Verifying Meta-Awareness via Predictive Rewards in Reasoning Models: Optimizing model metacognition by requiring reasoning models to self-predict solution length, pass rates, and necessary concepts—aligning predictions with ground-truth statistics—significantly enhances mathematical reasoning performance and accelerates training.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code: Through controlled experiments involving a 10T-token corpus and MoE pre-training from scratch, this paper indicates that what truly improves complex mathematical reasoning is not pure executable code itself, but cross-domain structured reasoning signals, particularly the "cognitive scaffolds" in mathematical corpora that explicitly expose intermediate steps.
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models: This paper demonstrates that safety failures in multi-turn reasoning models are largely invisible to "final-turn score evaluations." A model may lock into an unsafe stance early on while maintaining a final refusal rate identical to well-aligned baselines. The authors propose a trajectory-level diagnostic framework, the CoT–Output 2×2 Safety Matrix, which labels each turn along two independent axes: "Internal Reasoning (CoT)" and "Visible Output." This framework identifies four failure categories and characterizes the context-injection failure (Safe CoT but Harmful Output) for the first time.
When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning: This paper introduces manager-worker style persistent subgoals into Hierarchical Reasoning Models (HRM). It finds that the key in latent reasoning is not simply injecting subgoals, but ensuring that subgoals persist for \(P=3\) to \(6\) low-level update steps. Rapid re-planning disrupts compositional structure, while excessive alignment interferes with task learning.