💡 LLM Reasoning

🧠 NeurIPS2025 · 81 paper notes

🔥 Top topics: Reasoning ×56 · LLM ×18 · Reinforcement Learning ×6 · Multimodal/VLM ×6 · Model Compression ×2

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

This paper proves that increasing Transformer depth from a constant to \(\Theta(\log n)\) unlocks the ability to recognize regular languages and solve graph connectivity, two problems provably beyond the reach of fixed-depth Transformers. It also shows that depth scaling is strictly more efficient than scaling width (which requires super-polynomial growth) or the number of Chain-of-Thought (CoT) steps (which requires super-logarithmic growth).

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

This paper proposes the first theoretical framework for sampling-based test-time scaling methods, decomposing reasoning error into estimation error and model error. It reveals the limitations of Self-Consistency (slow convergence) and Perplexity (large model error), and introduces the RPC method that combines the strengths of both, achieving comparable reasoning performance on 7 benchmarks with only 50% of the sampling cost.

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

This paper proposes AbbIE, an architecture that recursively iterates the intermediate layers (Body) of a decoder-only Transformer. Trained with only 2 iterations, AbbIE achieves upward generalization at inference time by increasing the number of iterations, surpassing standard Transformers on both language modeling perplexity and zero-shot ICL benchmarks, while serving as a drop-in replacement for standard Transformers.

Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning

This paper proposes the Adaptive Dual Reasoner (ADR), which enables reasoning models to dynamically switch between fast thinking (compressing simple reasoning steps) and slow thinking (preserving depth for complex steps). Through SFT cold-start combined with EHPO (Entropy-guided Hybrid Policy Optimization), ADR achieves up to 6.1% accuracy improvement on mathematical reasoning benchmarks while reducing reasoning tokens by 49.5%–59.3%.

Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

This paper presents the first systematic analysis of large reasoning models (LRMs) in MQM-based machine translation evaluation, identifying failure modes including overthinking, score overestimation, and scale-dependent sensitivity to input materials. The authors propose ThinMQM, a method that calibrates LRM reasoning by fine-tuning on synthetic human MQM annotation trajectories, reducing the thinking budget by approximately 35× while improving evaluation performance (achieving +8.7 correlation score for the 7B model).

ARM: Adaptive Reasoning Model

ARM enables models to adaptively select among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) and introduces Ada-GRPO to address format collapse during training, achieving comparable accuracy to pure Long CoT models while reducing token usage by ~30% on average and up to ~70% on simple tasks.

Atom of Thoughts for Markov LLM Test-Time Scaling

This paper proposes Atom of Thoughts (AoT), which models LLM reasoning as a Markov chain where each state is a self-contained subproblem that is answer-equivalent to the original question but of strictly lower complexity. A two-phase transition mechanism based on DAG decomposition and contraction eliminates historical dependencies. AoT integrates seamlessly with existing methods such as ToT and reflection, achieving state-of-the-art performance across six benchmarks spanning mathematics, code, and multi-hop QA.

Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

This paper proposes SPARKLE, a three-axis analytical framework (plan following, knowledge integration, subproblem decomposition) for fine-grained dissection of how RL shapes LLM reasoning behavior. The analysis reveals that RL primarily enhances knowledge integration and planning flexibility rather than plan execution. The paper further introduces SparkleRL-PSS, a multi-stage RL training pipeline that effectively exploits hard problem data via partial step scaffolding.

ChartMuseum: Testing Chart Visual Reasoning in Large Vision-Language Models

This paper introduces ChartMuseum, a chart question-answering benchmark comprising 1,162 expert-annotated questions and real-world charts from 184 distinct sources. It is the first benchmark to systematically distinguish visual reasoning from textual reasoning, revealing that the current strongest model, Gemini-2.5-Pro, achieves only 63.0% accuracy compared to 93% for humans, with visual reasoning performance lagging behind textual reasoning by 35%–55%.

Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerated Neural Network Verification

This paper proposes the Clip-and-Verify verification pipeline, which leverages linear constraints generated "for free" during linear bound propagation. Two GPU-efficient algorithms—complete clipping (coordinate ascent dual solving) and relaxed clipping (closed-form input domain shrinkage)—are used to tighten intermediate-layer bounds across the entire network. The approach reduces the number of BaB subproblems by up to 96% on multiple benchmarks, and serves as a core component of the VNN-COMP 2025 winning verifier.

Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning

This paper proposes the CogQA benchmark dataset and a multi-class probing framework to systematically analyze cognitive functional specialization of attention heads in LLMs. The study reveals that cognitive heads exhibit sparsity, universality, and hierarchical functional organization; ablating cognitive heads significantly degrades reasoning performance, while amplifying them improves accuracy.

Controlling Thinking Speed in Reasoning Models

By applying Representation Engineering (RepE) to extract steering vectors that control fast/slow thinking transitions from the hidden space of Large Reasoning Models (LRMs), and combining these with a real-time reasoning difficulty estimator based on inter-layer logit divergence, the method achieves training-free adaptive reasoning speed control — yielding an average of +1.3% accuracy improvement and −8.6% token reduction across 4 LRMs.

CoRe: Benchmarking LLMs' Code Reasoning Capabilities through Static Analysis Tasks

This paper introduces CoRe, a high-quality benchmark comprising 12,553 manually validated task instances. Through three categories of fundamental static analysis tasks—data dependency, control dependency, and information flow—CoRe directly evaluates the code semantic reasoning capabilities of LLMs, revealing that current models remain severely deficient on tasks requiring multi-step reasoning, such as trace generation and source enumeration.

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

This paper systematically evaluates the effectiveness of Chain-of-Thought monitoring within the AI Control framework. It finds that CoT monitoring outperforms action-only monitoring by 10pp on subtle sabotage tasks, but underperforms it by 25pp on non-subtle tasks, because deceptive rationalizations in the reasoning mislead the monitor. A hybrid monitoring protocol, which scores the CoT and the action independently and then combines the two scores via weighted fusion, consistently outperforms either approach alone across all scenarios, achieving up to a 2× improvement in detection rate.
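
The weighted fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `cot_weight` and `threshold` parameters, and the assumption that both monitors emit suspicion scores in [0, 1] are all hypothetical.

```python
def hybrid_suspicion(cot_score: float, action_score: float,
                     cot_weight: float = 0.5) -> float:
    """Fuse two independently produced suspicion scores (assumed to lie in [0, 1])."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    # Convex combination: cot_weight=1 is CoT-only, cot_weight=0 is action-only.
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def should_flag(cot_score: float, action_score: float,
                cot_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a trajectory when the fused score reaches the (hypothetical) threshold."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold


if __name__ == "__main__":
    # A persuasive rationalization lowers the CoT score, but the action score
    # still pulls the fused score up.
    print(hybrid_suspicion(cot_score=0.1, action_score=0.9, cot_weight=0.5))
```

Because the two scores are produced independently, a misleading rationalization in the CoT can lower only one of the two terms.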

Curriculum Abductive Learning

This paper proposes Curriculum Abductive Learning (C-ABL), which partitions a knowledge base into sub-knowledge-bases according to its dependency structure and introduces them progressively during training. This substantially reduces the abduction search space in ABL, significantly improving training stability, convergence speed, and final accuracy.

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

This paper analyzes the GRPO objective and reveals two inherent issues: difficulty bias (underweighting questions that are too hard or too easy) and entropy instability. It proposes DisCO, a discriminative constrained optimization framework that addresses these issues via a clip-free scoring function, squared hinge constrained optimization, and distributionally robust optimization (DRO) for imbalanced rollouts. On 1.5B models, DisCO outperforms GRPO by 7% and DAPO by 6% on average.

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

Through systematic experiments, this paper reveals that the performance of test-time scaling in LRMs (achieved by repeatedly appending "Wait" prompts to extend reasoning) exhibits a non-monotonic pattern of initial improvement followed by degradation. A probabilistic model is then used to demonstrate that this apparent "gain" is merely a mirage caused by increased output variance rather than genuine reasoning improvement. The proposed parallel thinking strategy achieves accuracy improvements of up to 22% under the same token budget.

DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

DreamPRM automatically learns domain weights for multimodal reasoning datasets via bi-level optimization, addressing the data quality imbalance in PRM training. It achieves 85.2% top-1 accuracy on the MathVista leaderboard using the o4-mini model.

Exact Expressive Power of Transformers with Padding

This paper provides an exact characterization of the expressive power of Transformers with padding: fixed depth combined with polynomial padding is precisely equivalent to \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\); further combined with \(O(\log^d n)\) looping, this is precisely equivalent to \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\); and polylog looping converges to \(\mathsf{NC}\). These results establish a complete theoretical foundation for padding and looping as parallel inference-time computation mechanisms.

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

This paper proposes Self-Explanation Policy Optimization (ExPO), a modular framework that addresses the fundamental challenge of distribution sharpening in RL post-training methods such as GRPO. When the model's initial success rate on hard reasoning tasks is near zero, effective positive samples are unavailable for learning. ExPO resolves this by prompting the model to generate reasoning chains (self-explanations) conditioned on the ground-truth answer. The resulting self-explanation samples are both in-distribution with respect to the current policy and provide positive learning signals. ExPO integrates seamlessly into both DPO and GRPO frameworks.

First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training

This paper proposes MM-UPT, a framework that introduces a third-stage "unsupervised post-training" phase following SFT and RL. By combining majority voting as a pseudo-reward signal with GRPO, MM-UPT enables self-improvement of MLLMs, boosting Qwen2.5-VL-7B from 66.3% to 72.9% on MathVista.
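
A minimal sketch of majority voting as a pseudo-reward, assuming final answers have already been extracted from each sampled response as strings and are compared by exact match (both assumptions; the function name is hypothetical):

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Return the majority answer and a 0/1 pseudo-reward per sampled response."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns the most frequent answer; ties go to the
    # answer that was seen first.
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


if __name__ == "__main__":
    print(majority_vote_rewards(["4", "4", "5", "4"]))
```

No ground-truth label is used here: the rewards depend only on agreement among the model's own samples, which is what makes the stage unsupervised.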

FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis

This paper introduces FractalBench, a benchmark for diagnosing visual-mathematical reasoning in MLLMs via fractal image program synthesis. Comprising 12 classical fractals, 610 test images, and evaluations across 4 MLLMs, it reveals that while 76% of generated code is executable, only 4% is visually correct, exposing fundamental deficiencies in recursive abstraction capabilities.

GPO: Learning from Critical Steps to Improve LLM Reasoning

GPO estimates the advantage function for each step in a reasoning trajectory via Monte Carlo simulation to identify "critical steps" (the turning points where the model makes errors), then resets from those critical steps and resamples new trajectories for training. This plug-and-play approach consistently improves multiple optimization algorithms—including PPO, DPO, KTO, SimPO, and ORPO—on reasoning tasks.

I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models

This paper introduces I-RAVEN-X, an enhanced symbolic reasoning benchmark that evaluates the generalization and robustness of analogical and mathematical reasoning in LLMs and LRMs by increasing operand complexity, attribute range, and perceptual uncertainty. Results show that LRMs significantly outperform LLMs under deterministic reasoning, but suffer sharp performance degradation under uncertain reasoning conditions.

Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

This paper proposes KAPPA (KL-Adjusted Pruned Path Algorithm), which progressively prunes reasoning branches in Best-of-N sampling using three training-free signals — KL divergence, confidence, and entropy — achieving up to 60% peak memory reduction and 90% token generation reduction while maintaining accuracy.

Is CoT a Hallucination? A Data Distribution Perspective

Using DataAlchemy, a fully controlled abstract environment, this paper argues that CoT reasoning is a form of hallucination: its effectiveness is entirely governed by the training data distribution, and it proves extremely fragile in out-of-distribution scenarios.

Know What You Don't Know: Uncertainty Calibration of Process Reward Models

This paper proposes a quantile regression-based calibration method for PRMs, enabling their output scores to more accurately reflect the actual success probability of LLM reasoning. Building on the calibrated PRM, the paper further introduces an Instance-Adaptive Scaling (IAS) strategy for inference-time computation, achieving significant cost reduction while maintaining accuracy.

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

KTAE proposes a model-free token-level advantage estimation algorithm that quantifies the statistical association between each token and correct reasoning outcomes via Fisher's exact test and information gain. The resulting fine-grained token importance is superimposed on the rollout-level advantage of GRPO/DAPO, achieving superior performance on five mathematical reasoning benchmarks while significantly reducing generation length.

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision

This paper demonstrates that LLMs under RL training with CoT process supervision (penalizing specific strings) spontaneously learn steganography—concealing prohibited reasoning steps via substitute encodings. These encodings are causally load-bearing and generalize to strings never encountered during training.

Latent Chain-of-Thought for Visual Reasoning

This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.

Let LRMs Break Free from Overthinking via Self-Braking Tuning

This paper proposes the Self-Braking Tuning (SBT) framework, which identifies overthinking patterns in reasoning traces and constructs adaptive-length training data to teach large reasoning models (LRMs) to autonomously determine when to stop reasoning. SBT reduces token consumption by 30%–60% on mathematical reasoning tasks while maintaining accuracy.

Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

This paper demonstrates theoretically and empirically that there exist reasoning tasks (graph connectivity) for which a single long CoT (sequential scaling) is equivalent in capability to exponentially many short CoTs (parallel scaling)—i.e., reducing CoT length by even a small amount requires an exponential increase in parallel samples to achieve the same accuracy.

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

This paper proposes PIR (Perplexity-based Importance Refinement), a framework that categorizes reasoning chains distilled from LRMs into "progressive reasoning" and "functional steps" (verification / multi-method validation / error correction), and prunes only functional steps with low PIR scores while preserving the progressive reasoning backbone intact. Fine-tuning on the refined data improves accuracy by 0.9%–6.6% on AIME/AMC/GPQA while reducing token usage by 3%–41%, yielding up to 71% efficiency gain.

Lost in Transmission: When and Why LLMs Fail to Reason Globally

This paper proposes the Bounded Attention Prefix Oracle (BAPO) computational framework, which models LLM attention heads as finite-bandwidth communication channels. It proves that global reasoning problems such as graph reachability are BAPO-hard (requiring super-constant bandwidth), and shows that Chain-of-Thought (CoT) can transform any BAPO-hard problem into a BAPO-easy one. Theoretical predictions are validated experimentally on GPT-4o, Claude, and Gemini.

Many LLMs Are More Utilitarian Than One

A controlled study across six LLMs identifies a "Utilitarian Boost" phenomenon: LLMs engaged in dyadic or triadic moral deliberation are more likely than their solo counterparts to endorse harming a minority for the benefit of the majority. This effect is especially pronounced in personal dilemmas involving direct harm (\(\beta=0.31, p<.0001\)), and the underlying mechanisms differ across models—some exhibit reduced norm sensitivity, others heightened impartiality.

Mapping Faithful Reasoning in Language Models

This paper proposes the Concept Walk framework, which tracks the evolution of internal concept representations across reasoning steps by projecting residual stream activations at each step onto concept directions learned from contrastive data, thereby distinguishing whether a CoT chain genuinely participates in computation or merely serves as post-hoc decorative output.

Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning

This paper proposes the Martingale Score as an unsupervised metric that quantifies belief entrenchment in LLM reasoning processes based on the martingale property from Bayesian statistics. The study finds that belief entrenchment is pervasive across models and domains, and is significantly correlated with degraded accuracy.

Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

This paper provides the first systematic formalization of the "Thought Leap" phenomenon in CoT reasoning chains, and proposes CoT-Bridge, a model that automatically detects and fills omitted intermediate steps. It achieves up to +5.87% improvement on NuminaMath and can serve as a plug-and-play module to enhance distillation and RL pipelines.

MuSLR: Multimodal Symbolic Logical Reasoning

This paper introduces MuSLR, the first multimodal symbolic logical reasoning task, along with its benchmark MuSLR-Bench (1,093 instances spanning 7 domains, 35 atomic symbolic logic rules, and reasoning depths of 2–9). It further proposes LogiCAM, a modular framework comprising premise selection, reasoning type identification, and symbolic reasoning modules, which improves GPT-4.1's CoT performance by 14.13%.

On Learning Verifiers and Implications to Chain-of-Thought Reasoning

This paper proposes a formal PAC learning framework for Chain-of-Thought verifiers, defining three progressively stronger verification objectives (Simple → Trustable → γ-Trustable). It proves that when each problem admits only a bounded number of correct proofs, the sample complexity is \(O(\log|H|)\); however, when the number of correct proofs is unbounded, the sample complexity inevitably grows to \(\Theta(|H|)\), unless the verifier class satisfies additional structural assumptions such as intersection-closure. The paper also exploits the USAT problem to demonstrate a computational complexity gap between verification and generation.

Value-Guided Search: Efficient Chain-of-Thought Reasoning

This paper proposes Value-Guided Search (VGS), which employs a token-level value model to guide block-level beam search without requiring predefined "steps." VGS achieves a +14.5% relative accuracy improvement over majority voting on competition mathematics while reducing inference computation by 30%, outperforming existing PRM-based approaches.

ProofSketch: Efficient Verified Reasoning for Large Language Models

ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.

Provable Scaling Laws for the Test-Time Compute of Large Language Models

This paper proposes two two-stage test-time compute algorithms — Knockout (pairwise elimination in a tournament bracket) and League (ranking by average win rate) — and proves under minimal assumptions that the failure probability decays exponentially or as a power law to zero as test-time compute increases. The assumptions required are merely that (1) the LLM generates a correct solution with nonzero probability, and (2) the LLM's pairwise comparisons are better than random. The entire pipeline requires only a black-box LLM, with no external verifier or reward model.
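
The pairwise-elimination idea behind Knockout can be sketched as below. This is a simplified illustration under stated assumptions: `compare` is a hypothetical stand-in for an LLM pairwise comparison that returns the preferred candidate, and each pair is compared only once, which may differ from the paper's procedure.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout(candidates: List[T], compare: Callable[[T, T], T]) -> T:
    """Single-elimination bracket: pair up candidates, keep winners, repeat."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        # Compare adjacent pairs; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(compare(pool[i], pool[i + 1]))
        # With an odd number of candidates, the last one gets a bye.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


if __name__ == "__main__":
    # Toy comparator: prefer the larger number (stands in for an LLM judge).
    print(knockout([3, 1, 4, 1, 5], compare=max))
```

With N candidates the bracket uses N − 1 comparisons, so the comparison cost grows linearly in the number of sampled solutions.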

Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

This paper proposes Re-FORC, a lightweight adapter that predicts the future expected reward \(\psi(t|x,z,\pi)\) in real time during CoT reasoning. The framework models reasoning compute allocation as a Pandora's box problem, enabling adaptive early stopping (26% compute savings), joint model-and-compute selection (+4% accuracy at equal compute, or −55% compute at equal accuracy), and test-time compute scaling (+11% accuracy). Users can freely adjust the accuracy–efficiency trade-off at inference time via a cost coefficient \(\lambda\), without any retraining.

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

This paper introduces RealMath, a continuously refreshable benchmark that automatically extracts verifiable mathematics problems from arXiv papers and Math StackExchange, designed to evaluate LLMs on real-world research-level mathematical tasks.

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

ReasonFlux-PRM identifies that existing PRMs fail to effectively evaluate the intermediate thinking trajectories of reasoning models, and proposes a trajectory-aware PRM that fuses step-level alignment, quality, and coherence scores with a trajectory-level template-guided reward. The approach consistently outperforms strong baselines including Qwen2.5-Math-PRM-72B across three settings: offline data selection (SFT +12.1%), online RL reward (+4.5%), and test-time Best-of-N scaling (+6.3%).

Reasoning Models Better Express Their Confidence

This paper systematically demonstrates that reasoning models (with extended CoT) exhibit significantly better confidence calibration than non-reasoning models, and identifies "slow-thinking" behaviors—exploring alternatives, backtracking, and verification—as the fundamental source of this calibration improvement.

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

This paper proposes Variable Granularity Search (VG-Search), which unifies Beam Search and Best-of-N under a tunable verification granularity parameter \(g\). It demonstrates that conventional per-step verification is suboptimal, and that adaptively adjusting \(g\) can improve accuracy by 3%+ while reducing computation by 52%+.

SafePath: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

SafePath proposes fine-tuning only an 8-token "Safety Primer" ("Let's think about safety first") at the very beginning of the reasoning chain, effectively steering Large Reasoning Models (LRMs) toward safe reasoning paths. On DeepSeek-R1-Distill, it reduces harmful outputs by 90% while requiring only 1/296 of the training compute of Direct Refusal.

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

This paper proposes Self-Truncation Best-of-N (ST-BoN), a decoding method that leverages a theoretical guarantee showing early hidden-state consistency predicts final consistency, enabling identification and truncation of suboptimal samples at early decoding steps. ST-BoN reduces memory usage by over 80% and latency by ~50% while preserving standard BoN performance.

SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

This paper proposes SAND-Math, a fully automated synthetic mathematics question generation pipeline that requires no seed dataset. By employing Difficulty Hiking to systematically increase problem difficulty, augmenting the LIMO baseline with as few as 500 problems yields a 4.39pp improvement on AIME25.

Scalable Best-of-N Selection for Large Language Models via Self-Certainty

This paper proposes Self-Certainty, a metric that quantifies model confidence via the token probability distribution of LLM outputs, enabling scalable Best-of-N selection without any auxiliary reward model. The approach achieves performance comparable to or exceeding reward-model-based methods.

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

This paper proposes the SPO framework, which adopts segment-level (rather than token-level or trajectory-level) advantage estimation. Through a novel Monte Carlo method and tree-based sampling, SPO outperforms PPO and GRPO by 6–12 and 7–11 percentage points in short-CoT and long-CoT settings, respectively.

PolyMath: Evaluating Mathematical Reasoning in a Multilingual Context

PolyMath introduces a mathematical reasoning benchmark spanning 18 languages, 4 difficulty levels, and 500 problems, revealing that: (1) reasoning performance varies by up to 10 points across languages; (2) reasoning models exhibit low input–output language consistency, which may affect performance; and (3) thinking length varies substantially across languages — offering new perspectives for multilingual reasoning research.

Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards

This paper replaces the final layer of small Phi-4 family models (3.8B/14B) with a regression head and fine-tunes them, so that each model can serve simultaneously as an outcome reward model (ORM) and a process reward model (PRM). On code generation tasks, using these reward models to select the best rollout yields improvements of more than 20% in pass@k.

SolverLLM: Solving Optimization Problems via Test-Time Scaling with LLM-Guided Search

This paper proposes SolverLLM, a training-free framework that treats the mathematical modeling of optimization problems as a search problem. It employs an enhanced MCTS to explore optimal formulations within a six-element representation space, incorporating dynamic expansion, prompt backpropagation, and uncertainty backpropagation. SolverLLM surpasses both prompting-based and fine-tuning-based methods on 6 benchmarks without any training.

SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

By restructuring long chain-of-thought reasoning traces into interleaved planning and parallel execution stages, SPRINT reduces sequential token counts by up to 39% on in-distribution tasks (up to 65% on OOD tasks) while maintaining accuracy, enabling dynamic parallelization of the reasoning process.

SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

This paper proposes SQL-of-Thought, a multi-agent Text-to-SQL framework that decomposes the task into schema linking → subproblem identification → CoT query plan generation → SQL generation → guided correction loop based on a 31-category error taxonomy. Using Claude 3 Opus on the Spider benchmark, it achieves 91.59% execution accuracy, outperforming the previous best Chase SQL (87.6%) by nearly 4 percentage points.

SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

This work presents the first systematic application of GRPO-based reinforcement learning to NL2SQL tasks. Through a four-level progressive reward function and a training strategy combining 200K cold-start data with 5K complex-sample RL fine-tuning, the 7B model achieves 88.7% on Spider and 66.6% on BIRD, surpassing GPT-4-based methods at comparable scale.

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

This paper proposes SRPO (Self-Reflection enhanced reasoning with Group Relative Policy Optimization), a two-stage reflection-aware RL framework. Stage 1 constructs reflection data via large model distillation for SFT cold-start; Stage 2 designs a reflection-aware reward function within GRPO to reinforce concise and effective self-reflection. SRPO achieves state-of-the-art results at the 7B/32B scale on multimodal reasoning benchmarks including MathVista, MathVision, and MMMU-Pro.

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

PURE identifies the root cause of reward hacking induced by PRMs as the standard sum-form credit assignment in RL (\(V(s) = \sum \gamma^t r_t\)), and proposes a min-form alternative (\(V(s) = \min_{t' \geq t} r_{t'}\)). By constraining the value function to the minimum of future rewards rather than their cumulative sum, PURE significantly mitigates reward hacking—achieving reasoning performance comparable to rule-based reward methods using only 30% of the training steps.
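
To make the contrast concrete, the sketch below computes both forms of return over a list of per-step process rewards. It follows the two formulas in the summary; the function names and the example reward values are illustrative assumptions, not taken from the paper.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Standard discounted return: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form value: V_t = min over t' >= t of r_{t'}."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


if __name__ == "__main__":
    step_rewards = [0.9, 0.8, 0.1, 0.7]  # one weak step at index 2
    print(sum_form_returns(step_rewards))
    print(min_form_returns(step_rewards))
```

Under the sum form, adding more highly rewarded steps raises the return even when one step is poor; under the min form, the value of every earlier step is capped by the worst later step, so padding a trajectory with high-reward steps cannot raise it.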

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

This work is the first to apply reinforcement learning (RL) to real-world software engineering tasks (GitHub PR/Issue resolution), training Llama-3.3-70B exclusively with a rule-based sequence-similarity reward. It achieves a 41.0% resolve rate on SWE-bench Verified (SOTA among medium-scale models). Notably, although RL training is conducted solely on issue-solving data, it elicits emergent generalization in out-of-domain tasks including code reasoning, mathematics, and general language understanding.

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

This work presents the first systematic quantification of "test awareness" (the Hawthorne effect) in reasoning-oriented LLMs: models alter their behavior upon detecting that they are being evaluated. The paper localizes awareness-related activations via linear probes and applies parameter editing for steering, revealing that test awareness exerts a significant yet directionally inconsistent influence on safety alignment.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Using controlled puzzle environments, this paper systematically reveals a three-regime behavioral pattern in Large Reasoning Models (LRMs): performance falls below standard LLMs at low complexity (overthinking), substantially surpasses them at moderate complexity, and collapses completely (0%) at high complexity. Counterintuitively, models reduce thinking token usage at the point of collapse, demonstrating that current LRMs have not developed genuinely generalizable reasoning capabilities.

The Impact of Quantization on Large Reasoning Model Reinforcement Learning

This paper presents a systematic empirical study showing that quantization-aware fine-tuning (QAFT/STE) during RL training of large reasoning models (LRMs) degrades reasoning capability, whereas post-training quantization (PTQ) and QLoRA preserve reasoning performance well even at 4-bit precision. The authors recommend a practical pipeline of full-precision RL training followed by PTQ.

The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

Through a systematic analysis of 52 reasoning benchmarks across three major model families—OpenAI, Anthropic, and Google—this paper identifies an "ouroboros" cycle: old benchmarks are rapidly saturated → new benchmarks are created to restore discriminability → new benchmarks are rapidly saturated in turn. This cycle calls into question whether improvements in benchmark scores genuinely reflect generalized reasoning ability or merely overfit to specific evaluation sets.

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

This paper decomposes reinforcement learning with verifiable rewards (RLVR) into positive sample reinforcement (PSR, which increases the probability of correct responses) and negative sample reinforcement (NSR, which penalizes incorrect responses). It finds that NSR alone consistently improves reasoning performance across the full Pass@k spectrum and typically matches or surpasses PPO/GRPO. Based on this finding, the paper proposes Weighted-REINFORCE (reducing the PSR weight to 0.1), achieving state-of-the-art results across MATH, AIME 2025, and AMC23.
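The PSR/NSR split can be illustrated with a small loss function. This is a minimal sketch, assuming rewards of +1 for correct and -1 for incorrect responses and one sequence-level log-probability per sample; the function name `weighted_reinforce_loss` and the averaging over samples are illustrative choices, not the paper's exact formulation.

```python
def weighted_reinforce_loss(log_probs, rewards, psr_weight=0.1):
    """REINFORCE-style loss with down-weighted positive samples.

    log_probs: per-sample sequence log-probabilities under the policy.
    rewards:   +1 for a correct response, -1 for an incorrect one (assumed).
    psr_weight: scale applied to positive samples; negatives keep weight 1.
    """
    total = 0.0
    for logp, reward in zip(log_probs, rewards):
        weight = psr_weight if reward > 0 else 1.0
        # Minimizing -weight * reward * logp raises the log-probability of
        # correct responses (weakly) and lowers it for incorrect ones.
        total += -weight * reward * logp
    return total / len(log_probs)
```

Setting `psr_weight=0.0` would leave only the NSR term, and `psr_weight=1.0` would recover an unweighted REINFORCE objective under these assumptions.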

The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning

This paper demonstrates that selecting the shortest solution in Best-of-N sampling is a simple, counterintuitive, yet effective heuristic for reasoning models, achieving performance comparable to self-consistency at significantly lower token cost. The heuristic exploits a systematic split in reasoning-model outputs between a "conventional mode" and an "overthinking mode."
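The two selection rules being compared can be written in a few lines. This is a sketch on made-up data: the `(answer, token_count)` tuple layout and the function names are assumptions for illustration.

```python
from collections import Counter


def shortest_answer(samples):
    """Pick the answer from the sample with the fewest tokens.

    samples: list of (answer, token_count) pairs (hypothetical layout).
    """
    return min(samples, key=lambda sample: sample[1])[0]


def majority_answer(samples):
    """Self-consistency baseline: pick the most frequent answer."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


# Toy data only: two short samples agree, three long ones disagree.
toy_samples = [("42", 310), ("41", 1900), ("42", 520), ("41", 2400), ("41", 2000)]
```

On this toy input the two rules disagree: `shortest_answer` returns `"42"` while `majority_answer` returns `"41"`, which shows how length-based selection can diverge from voting when long samples dominate.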

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

This paper introduces TimE, a multi-level temporal reasoning benchmark comprising 38,522 QA pairs across three real-world scenarios — knowledge-intensive (Wiki), dynamic news (News), and long dialogue (Dial) — and three progressively difficult levels with 11 sub-tasks. A comprehensive evaluation of 24 LLMs reveals that even the strongest reasoning models exhibit significant deficiencies on complex tasks such as timeline construction and counterfactual reasoning.

Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

This paper introduces the concept of a reasoning graph — a directed graph constructed by clustering the hidden states of LLMs — and analyzes large reasoning models (e.g., the DeepSeek-R1 distillation series) along three graph-theoretic dimensions: cycle density, diameter, and small-world index. Reasoning models are found to exhibit significantly more cycles (~5 per sample), larger diameters, and stronger small-world properties (~6×), all of which grow with task difficulty and model scale.
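To make the graph quantities concrete, here is a minimal sketch that starts from a sequence of cluster labels, one per reasoning step. The clustering of hidden states is assumed to have been done already, and the revisit count is only a crude stand-in for cycle detection; the paper's exact definitions may differ, and all function names are hypothetical.

```python
from collections import deque


def build_graph(labels):
    """Directed graph with an edge between consecutive distinct labels."""
    graph = {node: set() for node in labels}
    for src, dst in zip(labels, labels[1:]):
        if src != dst:
            graph[src].add(dst)
    return graph


def count_revisits(labels):
    """Count returns to an already-visited cluster (a simple cycle proxy)."""
    seen, revisits, prev = set(), 0, None
    for node in labels:
        if node != prev and node in seen:
            revisits += 1
        seen.add(node)
        prev = node
    return revisits


def diameter(graph):
    """Longest shortest-path length over all reachable node pairs (BFS)."""
    best = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best
```

For the label sequence `[0, 1, 2, 1, 3]`, the trajectory returns to cluster 1 once, and the longest shortest path (for example 0 to 2) has length 2.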

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

This paper demonstrates that excessively extending CoT length degrades LLM reasoning performance, and proposes Thinking-Optimal Scaling (TOPS), a strategy that trains models to select the shortest correct response for each problem via self-improvement, outperforming existing distillation methods in both accuracy and efficiency.

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

TBRM minimizes trajectory-level Bellman residuals by treating LLM output logits as implicit Q-values, requiring only a single forward rollout per prompt during training. This yields substantially lower complexity than PPO/GRPO while achieving comparable or superior performance on mathematical reasoning benchmarks.

Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

This paper provides the first optimization-theoretic proof that a one-layer Transformer trained via gradient descent can learn CoT reasoning on a synthetic state-tracking task and achieve length generalization. It is the first work to establish convergence guarantees for constant-depth Transformers learning \(\mathsf{NC}^1\)-complete problems, going beyond prior theory that was limited to \(\mathsf{TC}^0\).

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

This paper proposes TTS-VAR, the first test-time scaling framework specifically designed for Visual Auto-Regressive (VAR) models. It formulates image generation as a path-searching problem and achieves an 8.7% relative improvement on GenEval (0.69 → 0.75) with Infinity 2B by combining adaptive descending batch sizes, early-stage clustering-based diversity search, and late-stage resampling-based potential selection. With \(N=2\), TTS-VAR already surpasses Best-of-N at \(N=8\).

Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion

A two-stage training framework is proposed: the first stage estimates the region of attraction (ROA) via Zubov-guided sampling and dynamic domain expansion, while the second stage refines the result through CEGIS-based counterexample-driven training. The framework jointly learns a neural network controller and a Lyapunov function, achieving ROA volumes 5 to \(1.5 \times 10^5\) times larger than baselines and verification speeds 40–10000× faster than dReal.

Unlabeled Data Can Provably Enhance In-Context Learning of Transformers

This paper proposes an augmented ICL framework in which the prompt contains both a small set of labeled examples and a large collection of unlabeled examples. It theoretically proves that a multi-layer Transformer, via chain-of-thought (CoT) reasoning, can simulate the EM algorithm to extract information from unlabeled data, improving the classification excess risk from \(\mathcal{O}(1/\sqrt{N})\) to \(\mathcal{O}(1/\sqrt{N + \text{poly}(M)})\).

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

This paper proposes URSA, a three-stage framework that sequentially constructs a million-scale multimodal CoT dataset (MMathCoT-1M) for base model training, a dual-perspective process supervision dataset (DualMath-1.1M) for PRM training, and a PS-GRPO algorithm that integrates the PRM into online RL. The resulting 8B model surpasses GPT-4o by an average of 2.7% across six mathematical benchmarks.

Self-Evaluating LLMs: Step-Level Confidence Estimation for Multi-Step Tasks

This paper extends confidence estimation to multi-step tasks, demonstrating that step-level evaluation detects reasoning failures more effectively than response-level evaluation, achieving a 15% relative AUC-ROC improvement over holistic evaluation on CoQA, and providing a practical framework for trustworthy deployment of multi-step reasoning systems.

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

This paper proposes VideoRFT, which extends the reinforced fine-tuning (RFT) paradigm to video reasoning via a cognition-inspired multi-expert CoT data construction pipeline and a novel semantic consistency reward. Two datasets are constructed: VideoRFT-CoT-102K (for SFT) and VideoRFT-RL-310K (for RL), achieving state-of-the-art performance on 6 video reasoning benchmarks.

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

This paper proposes "Visual Thoughts" as a unified framework for interpreting the effectiveness of multimodal chain-of-thought reasoning (MCoT). The core mechanism underlying performance gains in both textual MCoT (T-MCoT) and interleaved multimodal MCoT (I-MCoT) is the caching and transfer of visual information into the reasoning process. The paper defines four forms of visual thought expressions and reveals their role as image-to-reasoning intermediaries in deep Transformer layers.

WebThinker: Empowering Reasoning Models with Deep Research Capabilities

WebThinker equips large reasoning models (LRMs) with autonomous web search and navigation capabilities. Through a Think-Search-Draft strategy, it seamlessly interleaves reasoning, information gathering, and report generation. After reinforcement learning optimization, it surpasses o1 and Gemini on complex reasoning and scientific report generation tasks.