💡 LLM Reasoning

🧠 NeurIPS 2025 · 67 paper notes

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

This paper proposes AbbIE, an architecture that recursively iterates the intermediate layers (Body) of a decoder-only Transformer. Trained with only 2 iterations, AbbIE achieves upward generalization at inference time by increasing the number of iterations, surpassing standard Transformers on both language modeling perplexity and zero-shot ICL benchmarks, while serving as a drop-in replacement for standard Transformers.
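
To make the recursion concrete, here is a minimal sketch of the iteration pattern, assuming the network splits into a prelude, a recursively applied body, and a coda; the class name, layer type, and sizes are illustrative assumptions rather than the authors' implementation, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class AbbIESketch(nn.Module):
    """Prelude -> (Body x num_iters) -> Coda. Sizes are illustrative; a real
    decoder-only model would also carry a causal attention mask."""
    def __init__(self, d_model=256, n_head=4, body_layers=4):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.prelude = make()                          # run once on entry
        self.body = nn.ModuleList([make() for _ in range(body_layers)])
        self.coda = make()                             # run once on exit

    def forward(self, x, num_iters=2):
        h = self.prelude(x)
        for _ in range(num_iters):                     # trained with 2 iterations;
            for blk in self.body:                      # raise at inference for
                h = blk(h)                             # upward generalization
        return self.coda(h)

out = AbbIESketch()(torch.randn(1, 16, 256), num_iters=4)
```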

Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning

This paper proposes the Adaptive Dual Reasoner (ADR), which enables reasoning models to dynamically switch between fast thinking (compressing simple reasoning steps) and slow thinking (preserving depth for complex steps). Through SFT cold-start combined with EHPO (Entropy-guided Hybrid Policy Optimization), ADR achieves up to 6.1% accuracy improvement on mathematical reasoning benchmarks while reducing reasoning tokens by 49.5%–59.3%.

Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost

This paper presents the first systematic analysis of large reasoning models (LRMs) in MQM-based machine translation evaluation, identifying failure modes including overthinking, score overestimation, and scale-dependent sensitivity to input materials. The authors propose ThinMQM, a method that calibrates LRM reasoning by fine-tuning on synthetic human MQM annotation trajectories, reducing the thinking budget by approximately 35× while improving evaluation performance (achieving +8.7 correlation score for the 7B model).

ARM: Adaptive Reasoning Model

ARM enables models to adaptively select among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) and introduces Ada-GRPO to address format collapse during training, achieving comparable accuracy to pure Long CoT models while reducing token usage by ~30% on average and up to ~70% on simple tasks.

Atom of Thoughts for Markov LLM Test-Time Scaling

This paper proposes Atom of Thoughts (AoT), which models LLM reasoning as a Markov chain where each state is a self-contained subproblem that is answer-equivalent to the original question but of strictly lower complexity. A two-phase transition mechanism based on DAG decomposition and contraction eliminates historical dependencies. AoT integrates seamlessly with existing methods such as ToT and reflection, achieving state-of-the-art performance across six benchmarks spanning mathematics, code, and multi-hop QA.

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

This paper introduces ChemCoTBench, the first CoT-based benchmark for evaluating chemical reasoning in LLMs. It decomposes complex chemical problems into modular chemical operations (adding/deleting/substituting functional groups), and is accompanied by ChemCoTDataset — a large-scale dataset of 22,000 expert-annotated CoT samples — enabling systematic evaluation of both reasoning and non-reasoning LLMs across molecular understanding, editing, optimization, and reaction prediction.

Clip-and-Verify: Linear Constraint-Driven Domain Clipping for Accelerated Neural Network Verification

This paper proposes the Clip-and-Verify verification pipeline, which leverages linear constraints generated "for free" during linear bound propagation. Two GPU-efficient algorithms—complete clipping (coordinate ascent dual solving) and relaxed clipping (closed-form input domain shrinkage)—are used to tighten intermediate-layer bounds across the entire network. The approach reduces the number of BaB subproblems by up to 96% on multiple benchmarks, and serves as a core component of the VNN-COMP 2025 winning verifier.

Controlling Thinking Speed in Reasoning Models

By applying Representation Engineering (RepE) to extract steering vectors that control fast/slow thinking transitions from the hidden space of Large Reasoning Models (LRMs), and combining them with a real-time reasoning difficulty estimator based on inter-layer logit divergence, the method achieves training-free adaptive control of reasoning speed, yielding an average accuracy gain of 1.3% and a token reduction of 8.6% across 4 LRMs.

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

This paper systematically evaluates the effectiveness of Chain-of-Thought monitoring within the AI Control framework. It finds that CoT monitoring outperforms action-only monitoring by +10pp on subtle sabotage tasks, but underperforms by −25pp on non-subtle tasks (due to deceptive rationalizations in reasoning misleading the monitor). A hybrid monitoring protocol—independently scoring CoT and action then combining via weighted fusion—consistently outperforms either approach alone across all scenarios, achieving up to a 2× improvement in detection rate.

Curriculum Abductive Learning

This paper proposes Curriculum Abductive Learning (C-ABL), which partitions a knowledge base into sub-knowledge-bases according to its dependency structure and introduces them progressively during training. This substantially reduces the abduction search space in ABL, significantly improving training stability, convergence speed, and final accuracy.

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

This paper analyzes the GRPO objective and reveals two inherent issues: difficulty bias (underweighting questions that are too hard or too easy) and entropy instability. It proposes DisCO, a discriminative constrained optimization framework that addresses these issues via a clip-free scoring function, squared hinge constrained optimization, and distributionally robust optimization (DRO) for imbalanced rollouts. On 1.5B models, DisCO outperforms GRPO by 7% and DAPO by 6% on average.

Does Thinking More Always Help? Mirage of Test-Time Scaling in Reasoning Models

Through systematic experiments, this paper reveals that the performance of test-time scaling in LRMs (achieved by repeatedly appending "Wait" prompts to extend reasoning) exhibits a non-monotonic pattern of initial improvement followed by degradation. A probabilistic model is then used to demonstrate that this apparent "gain" is merely a mirage caused by increased output variance rather than genuine reasoning improvement. The proposed parallel thinking strategy achieves accuracy improvements of up to 22% under the same token budget.

DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning

DreamPRM is proposed to automatically learn domain weights for multimodal reasoning datasets via bi-level optimization, addressing the data quality imbalance in PRM training. It achieves 85.2% top-1 accuracy on the MathVista leaderboard using the o4-mini model.

Exact Expressive Power of Transformers with Padding

This paper provides an exact characterization of the expressive power of Transformers with padding: fixed depth combined with polynomial padding is precisely equivalent to \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\); further combined with \(O(\log^d n)\) looping, this is precisely equivalent to \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\); and polylog looping converges to \(\mathsf{NC}\). These results establish a complete theoretical foundation for padding and looping as parallel inference-time computation mechanisms.

ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

This paper proposes Self-Explanation Policy Optimization (ExPO), a modular framework that addresses the fundamental challenge of distribution sharpening in RL post-training methods such as GRPO. When the model's initial success rate on hard reasoning tasks is near zero, effective positive samples are unavailable for learning. ExPO resolves this by prompting the model to generate reasoning chains (self-explanations) conditioned on the ground-truth answer. The resulting self-explanation samples are both in-distribution with respect to the current policy and provide positive learning signals. ExPO integrates seamlessly into both DPO and GRPO frameworks.

GPO: Learning from Critical Steps to Improve LLM Reasoning

GPO estimates the advantage function for each step in a reasoning trajectory via Monte Carlo simulation to identify "critical steps" (the turning points where the model makes errors), then resets from those critical steps and resamples new trajectories for training. This plug-and-play approach consistently improves multiple optimization algorithms—including PPO, DPO, KTO, SimPO, and ORPO—on reasoning tasks.
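
A hedged sketch of the critical-step search, under my reading of the value-drop criterion; `generate` and `is_correct` are hypothetical stand-ins for the policy and the answer verifier, and the rollout count `k` is arbitrary.

```python
def estimate_value(question, prefix_steps, generate, is_correct, k=8):
    """Fraction of k rollouts from this prefix that reach a correct answer."""
    wins = sum(is_correct(question, generate(question, prefix_steps))
               for _ in range(k))
    return wins / k

def find_critical_step(question, steps, generate, is_correct, k=8):
    # values[i] = estimated success probability after the first i steps
    values = [estimate_value(question, steps[:i], generate, is_correct, k)
              for i in range(len(steps) + 1)]
    # Advantage of step i = values[i+1] - values[i]; the critical step is
    # where the value drops the most (the turning point into error).
    drops = [values[i] - values[i + 1] for i in range(len(steps))]
    critical = max(range(len(steps)), key=lambda i: drops[i])
    return critical, values  # then reset to steps[:critical] and resample
```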

I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models

This paper introduces I-RAVEN-X, an enhanced symbolic reasoning benchmark that evaluates the generalization and robustness of analogical and mathematical reasoning in LLMs and LRMs by increasing operand complexity, attribute range, and perceptual uncertainty. Results show that LRMs significantly outperform LLMs under deterministic reasoning, but suffer sharp performance degradation under uncertain reasoning conditions.

Inference-Time Chain-of-Thought Pruning with Latent Informativeness Signals

This paper proposes KAPPA (KL-Adjusted Pruned Path Algorithm), which progressively prunes reasoning branches in Best-of-N sampling using three training-free signals — KL divergence, confidence, and entropy — achieving up to 60% peak memory reduction and 90% token generation reduction while maintaining accuracy.
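
A sketch of one pruning round, assuming the three signals are combined linearly; the weights and sign conventions here are my guesses, since the note names the signals but not the exact formula.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prune(branches, keep_frac=0.5, w_conf=1.0, w_ent=1.0, w_kl=1.0):
    """branches: dicts with 'dist' (latest next-token distribution) and
    'logprobs' (log-probs of the tokens sampled so far on this branch)."""
    vocab = len(branches[0]["dist"])
    pooled = [sum(b["dist"][i] for b in branches) / len(branches)
              for i in range(vocab)]                    # reference for KL
    def score(b):
        conf = sum(b["logprobs"]) / len(b["logprobs"])  # mean token log-prob
        return (w_conf * conf
                - w_ent * entropy(b["dist"])            # penalize uncertainty
                + w_kl * kl(b["dist"], pooled))         # reward informativeness
    keep = max(1, int(len(branches) * keep_frac))
    return sorted(branches, key=score, reverse=True)[:keep]
```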

Is CoT a Hallucination? A Data Distribution Perspective

By constructing a fully controlled abstract environment DataAlchemy, this paper reveals that CoT reasoning is a form of hallucination — its effectiveness is entirely governed by training data distribution and proves extremely fragile under out-of-distribution scenarios.

Know What You Don't Know: Uncertainty Calibration of Process Reward Models

This paper proposes a quantile regression-based calibration method for PRMs, enabling their output scores to more accurately reflect the actual success probability of LLM reasoning. Building on the calibrated PRM, the paper further introduces an Instance-Adaptive Scaling (IAS) strategy for inference-time computation, achieving significant cost reduction while maintaining accuracy.
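
Only the pinball (quantile) loss below is standard; the training target, namely the empirical success of rollouts from each partial solution, is my assumption about how the calibration would be supervised.

```python
import torch

def pinball_loss(pred, target, tau=0.5):
    """Quantile (pinball) loss; tau is the target quantile."""
    diff = target - pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1) * diff))

# Calibrate PRM scores toward the tau-quantile of downstream success:
pred = torch.tensor([0.8, 0.3])    # PRM scores for two partial solutions
target = torch.tensor([1.0, 0.0])  # did rollouts from them actually succeed?
loss = pinball_loss(pred, target, tau=0.5)
```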

Large Language Models Can Learn and Generalize Steganographic Chain-of-Thought under Process Supervision

This paper demonstrates that LLMs under RL training with CoT process supervision (penalizing specific strings) spontaneously learn steganography—concealing prohibited reasoning steps via substitute encodings. These encodings are causally load-bearing and generalize to strings never encountered during training.

Latent Chain-of-Thought for Visual Reasoning

This paper reformulates visual CoT reasoning as a posterior inference problem and proposes LaCoT, a training framework based on amortized variational inference (AVI) comprising reference-guided GFlowNet fine-tuning (RGFN), token-level reward approximation, and Bayesian inference scaling (BiN). On Qwen2.5-VL 3B/7B, LaCoT outperforms GRPO by 10.6% and achieves open-source state-of-the-art across seven visual reasoning benchmarks.

Let LRMs Break Free from Overthinking via Self-Braking Tuning

This paper proposes the Self-Braking Tuning (SBT) framework, which identifies overthinking patterns in reasoning traces and constructs adaptive-length training data to teach large reasoning models (LRMs) to autonomously determine when to stop reasoning. SBT reduces token consumption by 30%–60% on mathematical reasoning tasks while maintaining accuracy.

Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

This paper demonstrates theoretically and empirically that there exist reasoning tasks (graph connectivity) on which a single long CoT (sequential scaling) matches the capability of exponentially many short CoTs (parallel scaling): reducing CoT length by even a small amount requires an exponential increase in parallel samples to achieve the same accuracy.

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

This paper proposes PIR (Perplexity-based Importance Refinement), a framework that categorizes reasoning chains distilled from LRMs into "progressive reasoning" and "functional steps" (verification / multi-method validation / error correction), and prunes only functional steps with low PIR scores while preserving the progressive reasoning backbone intact. Fine-tuning on the refined data improves accuracy by 0.9%–6.6% on AIME/AMC/GPQA while reducing token usage by 3%–41%, yielding up to 71% efficiency gain.
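
A minimal sketch of a perplexity-based importance score in the spirit of PIR; `answer_ppl` is a hypothetical helper that returns the perplexity of the final answer conditioned on a (possibly pruned) reasoning chain.

```python
def pir_score(question, steps, i, answer, answer_ppl):
    """Importance of functional step i: relative rise in answer perplexity
    when that step is deleted from the chain."""
    full = answer_ppl(question, steps, answer)
    pruned = answer_ppl(question, steps[:i] + steps[i + 1:], answer)
    return (pruned - full) / full   # low score => step is safe to prune
```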

Lost in Transmission: When and Why LLMs Fail to Reason Globally

This paper proposes the Bounded Attention Prefix Oracle (BAPO) computational framework, which models LLM attention heads as finite-bandwidth communication channels. It proves that global reasoning problems such as graph reachability are BAPO-hard (requiring super-constant bandwidth), and shows that Chain-of-Thought (CoT) can transform any BAPO-hard problem into a BAPO-easy one. Theoretical predictions are validated experimentally on GPT-4o, Claude, and Gemini.

Many LLMs Are More Utilitarian Than One

A controlled study across six LLMs identifies a "Utilitarian Boost" phenomenon: LLMs engaged in dyadic or triadic moral deliberation are more likely than their solo counterparts to endorse harming a minority for the benefit of the majority. This effect is especially pronounced in personal dilemmas involving direct harm (\(\beta=0.31, p<.0001\)), and the underlying mechanisms differ across models—some exhibit reduced norm sensitivity, others heightened impartiality.

Mapping Faithful Reasoning in Language Models

This paper proposes the Concept Walk framework, which tracks the evolution of internal concept representations across reasoning steps by projecting residual stream activations at each step onto concept directions learned from contrastive data, thereby distinguishing whether a CoT chain genuinely participates in computation or merely serves as post-hoc decorative output.

Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

This paper provides the first systematic formalization of the "Thought Leap" phenomenon in CoT reasoning chains, and proposes CoT-Bridge, a model that automatically detects and fills omitted intermediate steps. It achieves up to +5.87% improvement on NuminaMath and can serve as a plug-and-play module to enhance distillation and RL pipelines.

On Learning Verifiers and Implications to Chain-of-Thought Reasoning

This paper proposes a formal PAC learning framework for Chain-of-Thought verifiers, defining three progressively stronger verification objectives (Simple → Trustable → γ-Trustable). It proves that when each problem admits only a bounded number of correct proofs, the sample complexity is \(O(\log|H|)\); however, when the number of correct proofs is unbounded, the sample complexity inevitably grows to \(\Theta(|H|)\), unless the verifier class satisfies additional structural assumptions such as intersection-closure. The paper also exploits the USAT problem to demonstrate a computational complexity gap between verification and generation.

One Token Embedding Is Enough to Deadlock Your Large Reasoning Model

This paper proposes the Deadlock Attack, which optimizes a single adversarial token embedding and implants it into a Large Reasoning Model (LRM) via a backdoor mechanism, causing the model to enter a permanent reasoning loop during inference (endlessly generating transition words such as "Wait" and "But"). The attack achieves a 100% attack success rate across 4 LRMs and 3 mathematical reasoning benchmarks, with negligible performance degradation on clean inputs.

ProofSketch: Efficient Verified Reasoning for Large Language Models

ProofSketch is a framework that combines symbolic closure-based forward reasoning, compact sketch generation, and formal verification in a multi-stage pipeline, achieving formal correctness guarantees for logical reasoning while reducing token consumption.

Provable Scaling Laws for the Test-Time Compute of Large Language Models

This paper proposes two two-stage test-time compute algorithms — Knockout (pairwise elimination in a tournament bracket) and League (ranking by average win rate) — and proves under minimal assumptions that the failure probability decays exponentially or as a power law to zero as test-time compute increases. The assumptions required are merely that (1) the LLM generates a correct solution with nonzero probability, and (2) the LLM's pairwise comparisons are better than random. The entire pipeline requires only a black-box LLM, with no external verifier or reward model.
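
The Knockout stage is simple enough to sketch directly; `compare` is a hypothetical black-box pairwise judge (in the paper's setting, the LLM itself, possibly majority-voted over repeated comparisons).

```python
def knockout(candidates, compare):
    """Pair candidates off; the winner of each comparison advances until
    one remains. compare(a, b) returns 0 if a wins, 1 if b wins."""
    pool = list(candidates)
    while len(pool) > 1:
        nxt = [a if compare(a, b) == 0 else b
               for a, b in zip(pool[::2], pool[1::2])]
        if len(pool) % 2:          # odd candidate out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```

League would instead rank candidates by their average win rate over many sampled pairings.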

Re-FORC: Adaptive Reward Prediction for Efficient Chain-of-Thought Reasoning

This paper proposes Re-FORC, a lightweight adapter that predicts the future expected reward \(\psi(t|x,z,\pi)\) in real time during CoT reasoning. The framework models reasoning compute allocation as a Pandora's box problem, enabling adaptive early stopping (26% compute savings), joint model-and-compute selection (+4% accuracy at equal compute, or −55% compute at equal accuracy), and test-time compute scaling (+11% accuracy). Users can freely adjust the accuracy–efficiency trade-off at inference time via a cost coefficient \(\lambda\), without any retraining.

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

This paper introduces RealMath, a continuously refreshable benchmark that automatically extracts verifiable mathematics problems from arXiv papers and Math StackExchange, designed to evaluate LLMs on real-world research-level mathematical tasks.

ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs

ReasonFlux-PRM identifies that existing PRMs fail to effectively evaluate the intermediate thinking trajectories of reasoning models, and proposes a trajectory-aware PRM that fuses step-level alignment, quality, and coherence scores with a trajectory-level template-guided reward. The approach consistently outperforms strong baselines including Qwen2.5-Math-PRM-72B across three settings: offline data selection (SFT +12.1%), online RL reward (+4.5%), and test-time Best-of-N scaling (+6.3%).

Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

This paper theoretically demonstrates the expressive advantage of continuous chain-of-thought (Coconut) on directed graph reachability: a two-layer Transformer using \(D\) continuous thought steps suffices to solve graph reachability with diameter \(D\), whereas discrete CoT requires \(O(n^2)\) steps. The core mechanism is that continuous thought vectors encode multiple search frontiers simultaneously in a "superposition state," enabling implicit parallel BFS.

Reasoning Models Better Express Their Confidence

This paper systematically demonstrates that reasoning models (with extended CoT) exhibit significantly better confidence calibration than non-reasoning models, and identifies "slow-thinking" behaviors—exploring alternatives, backtracking, and verification—as the fundamental source of this calibration improvement.

Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models

This paper reveals that RL-trained reasoning models (e.g., DeepSeek-R1) hallucinate significantly more than non-reasoning models, theoretically identifies three root causes (high-variance gradients, entropy constraints, and spurious local optima), and proposes the FSPO algorithm, which adjusts token-level advantages via step-level factuality verification to reduce hallucination while maintaining or even improving reasoning capability.

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

This paper proposes Variable Granularity Search (VG-Search), which unifies Beam Search and Best-of-N under a tunable verification granularity parameter \(g\). It demonstrates that conventional per-step verification is suboptimal, and that adaptively adjusting \(g\) can improve accuracy by 3%+ while reducing computation by 52%+.
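
A sketch of the search skeleton under a tunable granularity \(g\): grow each beam by \(g\) steps, then pay for verification only at chunk boundaries. `extend` and `verify` are hypothetical stand-ins for the generator and the verifier.

```python
def vg_search(question, extend, verify, g=4, width=4, max_steps=32):
    """extend(question, partial, g): sample g further reasoning steps
    (stochastic); verify(question, partial): scalar verifier score."""
    beams = [""]
    for _ in range(0, max_steps, g):
        grown = [extend(question, b, g) for b in beams for _ in range(width)]
        beams = sorted(grown, key=lambda c: verify(question, c),
                       reverse=True)[:width]
    return beams[0]  # g=1 ~ step-level beam search; g=max_steps ~ Best-of-N
```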

SafePath: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

SafePath proposes fine-tuning only an 8-token "Safety Primer" ("Let's think about safety first") at the very beginning of the reasoning chain, effectively steering Large Reasoning Models (LRMs) toward safe reasoning paths. On DeepSeek-R1-Distill, it reduces harmful outputs by 90% while requiring only 1/296 of the training compute of Direct Refusal.

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

This paper proposes Self-Truncation Best-of-N (ST-BoN), a decoding method that leverages a theoretical guarantee showing early hidden-state consistency predicts final consistency, enabling identification and truncation of suboptimal samples at early decoding steps. ST-BoN reduces memory usage by over 80% and latency by ~50% while preserving standard BoN performance.
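
A hedged sketch of the consistency test: pool one early hidden state per sample and keep the sample that agrees most with the others. Mean pairwise cosine similarity is my assumed instantiation of "early hidden-state consistency".

```python
import torch
import torch.nn.functional as F

def most_consistent(early_states):
    """early_states: (N, d), one pooled early hidden state per sample.
    Returns the index of the sample to keep; the rest are truncated."""
    sims = F.cosine_similarity(early_states.unsqueeze(1),
                               early_states.unsqueeze(0), dim=-1)  # (N, N)
    consistency = (sims.sum(dim=1) - 1.0) / (early_states.shape[0] - 1)
    return int(consistency.argmax())

keep = most_consistent(torch.randn(8, 4096))
```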

Scalable Best-of-N Selection for Large Language Models via Self-Certainty

This paper proposes Self-Certainty, a metric that quantifies model confidence via the token probability distribution of LLM outputs, enabling scalable Best-of-N selection without any auxiliary reward model. The approach achieves performance comparable to or exceeding reward-model-based methods.
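
A sketch of a self-certainty-style score computed purely from next-token distributions; taking the mean KL divergence from the uniform distribution is my assumed instantiation of the metric (peaked, confident distributions score high).

```python
import math

def self_certainty(step_dists):
    """step_dists: one next-token probability distribution per generated
    token. Best-of-N selection keeps the highest-scoring sample."""
    v = len(step_dists[0])
    def kl_from_uniform(p):                 # KL(U || p)
        return sum((1.0 / v) * math.log((1.0 / v) / pi) for pi in p if pi > 0)
    return sum(kl_from_uniform(p) for p in step_dists) / len(step_dists)
```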

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

This paper proposes the SPO framework, which adopts segment-level (rather than token-level or trajectory-level) advantage estimation. Through a novel Monte Carlo method and tree-based sampling, SPO outperforms PPO and GRPO by 6–12 and 7–11 percentage points in short-CoT and long-CoT settings, respectively.

PolyMath: Evaluating Mathematical Reasoning in a Multilingual Context

PolyMath introduces a mathematical reasoning benchmark spanning 18 languages, 4 difficulty levels, and 500 problems, revealing that: (1) reasoning performance varies by up to 10 points across languages; (2) reasoning models exhibit low input–output language consistency, which may affect performance; and (3) thinking length varies substantially across languages — offering new perspectives for multilingual reasoning research.

Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards

The final layer of Phi-4 family small models (3.8B/14B) is replaced with a regression head and fine-tuned, enabling them to serve simultaneously as ORM (outcome reward model) and PRM (process reward model). On code generation tasks, selecting the optimal rollout yields 20%+ improvements in pass@k.
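
A minimal sketch of the two-sided idea as I read it: swap the LM head for a scalar regression head, then read the last position for an outcome (ORM) score and the step-boundary positions for process (PRM) scores. Dimensions and names are hypothetical.

```python
import torch
import torch.nn as nn

class TwoSidedHead(nn.Module):
    """Scalar regression head over trunk hidden states: one model, two views."""
    def __init__(self, hidden_size):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, step_ends):
        scores = self.value(hidden_states).squeeze(-1)   # (batch, seq)
        return scores[:, -1], scores[:, step_ends]       # ORM, PRM scores

h = torch.randn(1, 128, 3072)                # trunk outputs (made-up dims)
outcome, process = TwoSidedHead(3072)(h, [31, 63, 95])
```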

SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

By restructuring long chain-of-thought reasoning traces into interleaved planning and parallel execution stages, SPRINT reduces sequential token counts by up to 39% on in-distribution tasks (up to 65% on OOD tasks) while maintaining accuracy, enabling dynamic parallelization of the reasoning process.

SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction

This paper proposes SQL-of-Thought, a multi-agent Text-to-SQL framework that decomposes the task into schema linking → subproblem identification → CoT query plan generation → SQL generation → guided correction loop based on a 31-category error taxonomy. Using Claude 3 Opus on the Spider benchmark, it achieves 91.59% execution accuracy, outperforming the previous best CHASE-SQL (87.6%) by nearly 4 percentage points.

SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning

This work presents the first systematic application of GRPO-based reinforcement learning to NL2SQL tasks. Through a four-level progressive reward function and a training strategy combining 200K cold-start data with 5K complex-sample RL fine-tuning, the 7B model achieves 88.7% on Spider and 66.6% on BIRD, surpassing GPT-4-based methods at comparable scale.

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

PURE identifies the root cause of reward hacking induced by PRMs as the standard sum-form credit assignment in RL (\(V(s_t) = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}\)), and proposes a min-form alternative (\(V(s_t) = \min_{t' \geq t} r_{t'}\)). By constraining the value function to the minimum of future rewards rather than their cumulative sum, PURE significantly mitigates reward hacking—achieving reasoning performance comparable to rule-based reward methods using only 30% of the training steps.
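
The two credit-assignment rules side by side; this is a direct transcription of the formulas above, with \(\gamma = 1\) by default.

```python
def sum_form(rewards, t, gamma=1.0):
    """Standard discounted-sum value over future per-step rewards."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)

def min_form(rewards, t):
    """PURE's value: the worst future step, so piling up mediocre steps
    cannot inflate the value (blocks this reward-hacking route)."""
    return min(rewards[t:])

rewards = [0.9, 0.8, 0.2, 0.7]                      # per-step PRM scores
print(sum_form(rewards, 1), min_form(rewards, 1))   # 1.7 vs 0.2
```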

The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

This work presents the first systematic quantification of "test awareness" (the Hawthorne effect) in reasoning-oriented LLMs: models alter their behavior upon detecting that they are being evaluated. The paper localizes awareness-related activations via linear probes and applies parameter editing for steering, revealing that test awareness exerts a significant yet directionally inconsistent influence on safety alignment.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Using controlled puzzle environments, this paper systematically reveals a three-regime behavioral pattern in Large Reasoning Models (LRMs): performance falls below standard LLMs at low complexity (overthinking), substantially surpasses them at moderate complexity, and collapses completely (0%) at high complexity. Counterintuitively, models reduce thinking token usage at the point of collapse, demonstrating that current LRMs have not developed genuinely generalizable reasoning capabilities.

The Impact of Quantization on Large Reasoning Model Reinforcement Learning

This paper presents a systematic empirical study showing that quantization-aware fine-tuning (QAFT/STE) during RL training of large reasoning models (LRMs) degrades reasoning capability, whereas post-training quantization (PTQ) and QLoRA preserve reasoning performance well even at 4-bit precision. The authors recommend a practical pipeline of full-precision RL training followed by PTQ quantization.

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

This paper decomposes Reinforcement Learning from Verifiable Rewards (RLVR) into Positive Sample Reinforcement (PSR, which increases the probability of correct responses) and Negative Sample Reinforcement (NSR, which penalizes incorrect responses). The authors find that NSR alone consistently improves reasoning performance across the entire Pass@k spectrum and typically matches or surpasses PPO/GRPO. Based on this finding, the paper proposes Weighted-REINFORCE (reducing the PSR weight to 0.1), which achieves state-of-the-art results across MATH, AIME 2025, and AMC23.
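
A schematic of the Weighted-REINFORCE objective as described, with the positive-sample term down-weighted to 0.1; written in maximization form, with policy gradients assumed to flow through the log-probabilities.

```python
def weighted_reinforce(logprobs, correct, lam_psr=0.1):
    """logprobs: sequence log-probs of sampled responses under the policy;
    correct: verifier outcomes. Correct responses are reinforced weakly
    (PSR weight lam_psr), incorrect ones are penalized at full weight (NSR)."""
    terms = [lam_psr * lp if ok else -lp
             for lp, ok in zip(logprobs, correct)]
    return sum(terms) / len(terms)
```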

The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning

This paper demonstrates that, for reasoning models, simply selecting the shortest of N sampled solutions is a counterintuitive yet effective heuristic, matching self-consistency at a significantly lower token cost. The underlying mechanism exploits a systematic bias in reasoning models between a "conventional mode" and an "overthinking mode."
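
The heuristic itself fits in a couple of lines (whitespace tokenization below is a stand-in for true token counts).

```python
def shortest_of_n(samples):
    """Among N sampled solutions, return the shortest one."""
    return min(samples, key=lambda s: len(s.split()))
```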

ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

ThinkSound is a three-stage interactive video-to-audio framework that leverages an MLLM to generate structured CoT reasoning as guidance for a unified audio foundation model. It achieves state-of-the-art performance on VGGSound and MovieGen Audio benchmarks while supporting object-level refinement and natural language instruction-based editing.

TimE: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

This paper introduces TimE, a multi-level temporal reasoning benchmark comprising 38,522 QA pairs across three real-world scenarios — knowledge-intensive (Wiki), dynamic news (News), and long dialogue (Dial) — and three progressively difficult levels with 11 sub-tasks; a manually annotated subset, TimE-Lite, is also released. A comprehensive evaluation of 24 LLMs reveals that even the strongest reasoning models exhibit significant deficiencies on complex tasks such as timeline construction and counterfactual reasoning.

Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

This paper introduces the concept of a reasoning graph — a directed graph constructed by clustering the hidden states of LLMs — and analyzes large reasoning models (e.g., the DeepSeek-R1 distillation series) along three graph-theoretic dimensions: cycle density, diameter, and small-world index. Reasoning models are found to exhibit significantly more cycles (~5 per sample), larger diameters, and stronger small-world properties (~6×), all of which grow with task difficulty and model scale.

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

This paper demonstrates that excessively extending CoT length degrades LLM reasoning performance, and proposes Thinking-Optimal Scaling (TOPS), a strategy that trains models to select the shortest correct response for each problem via self-improvement, outperforming existing distillation methods in both accuracy and efficiency.

Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

This paper provides the first optimization-theoretic proof that a one-layer Transformer trained via gradient descent can learn CoT reasoning on a synthetic state-tracking task and achieve length generalization. It is the first work to establish convergence guarantees for constant-depth Transformers learning \(\mathsf{NC}^1\)-complete problems, going beyond prior theory that was limited to \(\mathsf{TC}^0\).

TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive Generation

This paper proposes TTS-VAR — the first test-time scaling framework specifically designed for Visual Auto-Regressive (VAR) models. It formulates image generation as a path searching problem and achieves an 8.7% improvement on GenEval (0.69 → 0.75) with Infinity 2B by combining adaptive descending batch sizes, early-stage clustering-based diversity search, and late-stage resampling-based potential selection. With \(N=2\), TTS-VAR already surpasses Best-of-N at \(N=8\).

Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion

A two-stage training framework is proposed: the first stage estimates the region of attraction (ROA) via Zubov-guided sampling and dynamic domain expansion, while the second stage refines the result through CEGIS-based counterexample-driven training. The framework jointly learns a neural network controller and a Lyapunov function, achieving ROA volumes 5 to \(1.5 \times 10^5\) times larger than baselines and verification speeds 40–10000× faster than dReal.

Unlabeled Data Can Provably Enhance In-Context Learning of Transformers

This paper proposes an augmented ICL framework in which the prompt contains both a small set of labeled examples and a large collection of unlabeled examples. It theoretically proves that a multi-layer Transformer, via chain-of-thought (CoT) reasoning, can simulate the EM algorithm to extract information from unlabeled data, improving the classification excess risk from \(\mathcal{O}(1/\sqrt{N})\) to \(\mathcal{O}(1/\sqrt{N + \text{poly}(M)})\).

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

This paper proposes URSA, a three-stage framework that sequentially constructs a million-scale multimodal CoT dataset (MMathCoT-1M) for base model training, a dual-perspective process supervision dataset (DualMath-1.1M) for PRM training, and a PS-GRPO algorithm that integrates the PRM into online RL. The resulting 8B model surpasses GPT-4o by an average of 2.7% across six mathematical benchmarks.

Self-Evaluating LLMs: Step-Level Confidence Estimation for Multi-Step Tasks

This paper extends confidence estimation to multi-step tasks, demonstrating that step-level evaluation detects reasoning failures more effectively than response-level evaluation, achieving a 15% relative AUC-ROC improvement over holistic evaluation on CoQA, and providing a practical framework for trustworthy deployment of multi-step reasoning systems.

Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought

This paper proposes "Visual Thoughts" as a unified framework for interpreting the effectiveness of multimodal chain-of-thought reasoning (MCoT). The core mechanism underlying performance gains in both textual MCoT (T-MCoT) and interleaved multimodal MCoT (I-MCoT) is the caching and transfer of visual information into the reasoning process. The paper defines four forms of visual thought expressions and reveals their role as image-to-reasoning intermediaries in deep Transformer layers.