💡 LLM Reasoning

🔬 ICLR2026 · 71 paper notes

Adaptive Social Learning via Mode Policy Optimization for Language Agents: This paper proposes the Adaptive Social Learning (ASL) framework, which defines four hierarchical reasoning modes (ranging from intuitive response to deep prospective reasoning) and introduces the AMPO algorithm (combining mode-level and sample-level advantage estimation) to enable LLM agents to adaptively switch reasoning depth according to social scenario complexity. ASL outperforms GPT-4o by 15.6% on social intelligence tasks, surpasses GRPO by 7.0%, and reduces token consumption by 32.8%.
Agentified Assessment of Logical Reasoning Agents: This paper proposes an agent-based evaluation framework (AAA) that encapsulates assessment logic as an assessor agent and interacts with the agent under test via a standard A2A interface. On a FOLIO dataset systematically cleaned using the Vampire theorem prover, an auto-formalization agent (NL→Z3Py + SMT solving) achieves 86.70% accuracy, substantially outperforming the CoT baseline at 73.89%, with a particularly notable gain of 32.79 percentage points on contradiction detection (False class).
AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent: AgentMath proposes a tool-augmented agent framework that seamlessly integrates LLM reasoning with the computational precision of a code interpreter through automated data synthesis, multi-turn interactive reinforcement learning, and an efficient asynchronous training system. At the 30B-A3B scale, it achieves state-of-the-art performance on AIME24/25 and HMMT25 (90.6%/86.4%/73.8%), surpassing o3-mini and Claude-Opus-4.0-Thinking.
AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning: AIMCoT reframes visual information selection in multimodal CoT from "passively attending to high-attention regions" to "actively seeking regions of maximal information gain." Three collaborative modules — CAG (Context-enhanced Attention-map Generation), AVP (Active Visual Probing), and DAT (Dynamic Attention-shifting Trigger) — constitute a training-free, plug-and-play framework that outperforms ICoT by 18.25% on LLaVA-W (0-shot).
Annotation-Efficient Universal Honesty Alignment: This paper proposes EliCal (Elicit then Calibrate), a two-stage framework that first trains an LLM to express internal confidence using annotation-free self-consistency signals, then calibrates with a minimal number of correctness annotations (only 1K samples, 0.18% of the full set). On HonestyBench (560K training + 70K evaluation), EliCal achieves approximately 98% of the fully-supervised upper bound and generalizes better than calibration-only baselines on unseen MMLU tasks.
Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?: This paper systematically evaluates the robustness of reasoning LLMs to various interventions (benign/neutral/adversarial) in their chain-of-thought. Models are generally robust and can recover from interventions; however, paraphrasing suppresses "self-doubt" expressions and degrades accuracy, while the recovery process incurs significant computational overhead (CoT length inflation up to 665%).
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction: This paper proposes ATTS, an asynchronous test-time scaling framework based on conformal prediction that eliminates synchronization overhead by reformulating rejection sampling as a hypothesis testing procedure. On mathematical reasoning benchmarks such as MATH and AIME, ATTS achieves up to 56.7× speedup and 4.14× throughput improvement without accuracy loss. A 1.5B/70B draft/target model combination reaches the AIME performance level of o3-mini (high).
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts: This paper proposes the Contact Searching Question (CSQ) framework, which leverages directed graph reachability tasks and cognitive psychology principles to design two complementary statistical metrics—deception intent score \(\rho\) and deception behavior score \(\delta\)—systematically revealing, for the first time, that 16 mainstream LLMs exhibit spontaneous deception tendencies under entirely benign prompts, with deception escalating as task difficulty increases.
Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning: This paper leverages information-theoretic generalization bounds and mechanistic interpretability to demonstrate that the core mechanism of CoT training is compositional generalization—the model learns to systematically compose previously acquired simple skills to solve novel complex problems, internalizing this ability as a two-stage compositional reasoning circuit that extracts intermediate results at shallower layers, freeing deeper layers to focus on subsequent reasoning steps.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors: This paper identifies the phenomenon of "logic inertia" in LLMs—whereby models continue along learned reasoning trajectories even when presented with contradictory premises, reducing accuracy to 0.0—and proposes the Conflict-Aware Fusion dual-process architecture, which enforces premise verification prior to reasoning execution, achieving 100% accuracy on contradiction detection.
Continuous Chain of Thought Enables Parallel Exploration and Reasoning: CoT2 proposes replacing discrete tokens with continuous-valued tokens (convex combinations of vocabulary embeddings) for chain-of-thought reasoning, enabling the model to track multiple reasoning paths in parallel within a single forward pass. The approach is theoretically shown to be equivalent to \(K\) rounds of self-consistency/best-of-N sampling, and is further improved via GRPO-based reinforcement learning.
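To make the notion of a continuous-valued token concrete, the following is a minimal NumPy sketch of a convex combination of vocabulary embeddings. The function name `continuous_token`, the toy four-entry vocabulary, and the one-hot embeddings are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def continuous_token(logits, embeddings):
    """Mix vocabulary embeddings with softmax weights (a convex combination)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # stabilise the softmax numerically
    weights = np.exp(z) / np.exp(z).sum()
    return weights @ embeddings, weights

# Toy setup (assumption): 4 vocabulary entries with one-hot embeddings.
embeddings = np.eye(4)
# Two candidates are equally likely, the other two are effectively ruled out.
logits = np.array([2.0, 2.0, -1e9, -1e9])
token, weights = continuous_token(logits, embeddings)
print(token)  # the mixed token keeps both surviving candidates
```

Because the weights are non-negative and sum to one, the resulting vector stays inside the convex hull of the embeddings, which is what lets a single token carry several candidate continuations at once.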
CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos: This paper proposes CoT-RVS, a fully training-free multi-agent framework that leverages the zero-shot CoT reasoning capabilities of pretrained MLLMs for temporal-semantic correlation analysis and keyframe selection, achieving substantial improvements over fine-tuned methods on reasoning video segmentation tasks (Refer-DAVIS J&F 79.1 vs. 71.2; ReasonVOS J&F 65.5 vs. 49.9).
CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling: This paper treats reflection tokens (e.g., "wait", "but") in the reasoning process as schedulable "resources" and, inspired by cyclical learning rate scheduling in optimization, proposes CyclicReflex — a training-free decoding strategy that dynamically modulates the logits of reflection tokens via a triangular waveform. CyclicReflex consistently improves the accuracy of 1.5B–8B models across multiple mathematical reasoning benchmarks (MATH500, AIME2024/2025, AMC2023).
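As a rough illustration of triangular-waveform logit modulation, the sketch below adds a periodic bias to the logits of a set of reflection tokens. The waveform phase, the `period` and `amplitude` values, and the helper names are all hypothetical; the entry above does not specify the exact schedule used by CyclicReflex.

```python
import numpy as np

def triangular_bias(step, period, amplitude):
    """Triangular wave: -amplitude at the cycle start, +amplitude at mid-cycle."""
    phase = (step % period) / period          # position in the cycle, in [0, 1)
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))

def modulate_logits(logits, reflection_ids, step, period=8, amplitude=2.0):
    """Return a copy of `logits` with the bias added at reflection-token ids."""
    out = np.array(logits, dtype=float)       # copy so the input stays untouched
    out[reflection_ids] += triangular_bias(step, period, amplitude)
    return out

# Hypothetical vocabulary of 3 tokens where id 1 is a reflection token.
for step in (0, 2, 4):
    print(step, modulate_logits([0.0, 1.0, 2.0], [1], step))
```

In this sketch reflection tokens are discouraged at the start of each cycle and encouraged at its midpoint; only the biased ids change, so the rest of the distribution is left to the model.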
DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs: This work formalizes LLM chain-of-thought reasoning as a rule-based stochastic process over DAGs, proposes logical closeness as a metric to assess whether a model arrives at an answer through search or rigorous logical deduction, constructs a gold-standard DAG-MATH benchmark of 2,894 instances, and demonstrates that models with similar PASS@k scores can differ substantially in reasoning faithfulness.
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning: This paper introduces Design Logic—reusable meta-knowledge reverse-engineered from real exam questions—to guide the synthesis of multidisciplinary reasoning problems from raw text. A dataset of 4.7 million questions spanning 75 disciplines is constructed, and base models fine-tuned via SFT on this data surpass their officially post-trained counterparts.
Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models: This paper systematically reveals the privacy leakage risks of multi-modal large reasoning models (MLRMs) in inferring sensitive geographic location information from images. It proposes a three-tier privacy risk framework, the DoxBench benchmark, an information-theoretic metric Glare, and a collaborative attack framework GeoMiner.
Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models: This paper presents the first systematic study of privacy leakage risks arising from multimodal large reasoning models (MLRMs) inferring sensitive geographic location information from user-generated images. It proposes a three-tier privacy risk framework, the DoxBench benchmark, and the Glare information-theoretic evaluation metric. The findings demonstrate that MLRMs surpass non-expert humans in geographic inference, significantly lowering the barrier for adversaries to obtain sensitive location information.
DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization: This paper diagnoses a fundamental flaw in GRPO with length penalties — correct but verbose responses may receive negative advantage values and thus be incorrectly penalized — and proposes DRPO, which decouples the reward signals for positive and negative samples to ensure length penalties are normalized only within the correct-response group. On a 1.5B model, DRPO achieves a 77% length reduction with only a 1.1% performance drop, compared to a 68% reduction with a 4.3% drop for the baseline.
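The flaw described above can be reproduced with a small numerical example. The sketch assumes a linear length penalty (`penalized_reward`, with hypothetical `lam` and `max_len`) and standard group-normalized advantages; it illustrates the failure mode only and does not implement DRPO itself.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, or zeros if std is 0."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    return np.zeros_like(r) if std == 0 else (r - r.mean()) / std

def penalized_reward(correct, length, max_len=1000, lam=0.5):
    """Hypothetical reward: correctness minus a linear length penalty."""
    return float(correct) - lam * length / max_len

# (is_correct, response_length): three concise correct answers,
# one verbose correct answer, one incorrect answer.
group = [(True, 200), (True, 200), (True, 200), (True, 900), (False, 300)]
rewards = [penalized_reward(c, n) for c, n in group]
adv = grpo_advantages(rewards)
print(adv)  # index 3 (correct but verbose) ends up with a negative advantage
```

The verbose answer still earns a higher reward than the incorrect one, yet it falls below the group mean and is pushed down alongside it, which is the behaviour the entry describes.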
Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models: This paper models each prompt's solve progress during RL finetuning as a latent Markov dynamical system, and employs lightweight Bayesian inference to online-predict prompt solve states. By prioritizing "partially solved" prompts for sampling, the method achieves comparable or superior reasoning performance to Dynamic Sampling (DS) using fewer than 30% of DS's rollouts.
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure: This paper models latent CoT as a structural causal model (SCM) and analyzes the Coconut and CODI paradigms via step-wise do-interventions, revealing that latent reasoning steps exhibit heterogeneous causal leverage, non-local jump-based propagation structures, and a persistent gap between early output commitment and late representational commitment.
Efficient Test-Time Scaling for Small Vision-Language Models: This paper proposes two efficient test-time scaling strategies for small VLMs: TTAug (applying diverse input augmentations and aggregating output probability distributions at the token level) and TTAdapt (adapting model parameters using pseudo-labels generated by TTAug). Both methods consistently improve performance across 9 benchmarks while achieving substantially better computational efficiency than existing sampling-based test-time scaling approaches.
Estimating the Empowerment of Language Model Agents: This paper proposes EELMA, an algorithm that leverages empowerment from information theory — defined as the mutual information between an agent's actions and future states — as a goal-agnostic capability metric for LM agents. EELMA achieves strong correlation with task performance (\(r=0.83\)–\(0.94\)) in both language games and real-world web navigation scenarios, and can be applied to open-ended agent monitoring and safety evaluation.
Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval: Inspired by the dual-process theory in cognitive science, this paper proposes RF-Mem, a memory retrieval framework that achieves efficient and scalable LLM personalization through adaptive switching between two pathways: Familiarity (fast similarity matching) and Recollection (deep chain-based reconstruction).
FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning: Targeting the severe bottleneck in GRPO training where the generation phase consumes 91%–98% of total training time, this work proposes a concurrency-aware speculative decoding strategy (dynamically adjusting draft tree parameters to accommodate the real-time shift from high to low concurrency) and online draft model learning (continuously adapting to distribution drift using hidden states produced by the target model). The combined approach achieves 2.35×–2.72× end-to-end training speedup without degrading reasoning quality.
Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning: Fine-R1 combines CoT supervised fine-tuning (structured reasoning chains following "visual analysis → candidate sub-classes → comparison → prediction") with Triplet-Augmented Policy Optimization (TAPO)—intra-class augmentation for robustness and inter-class augmentation for discriminability—achieving superior performance over CLIP and general/reasoning MLLMs on fine-grained visual recognition using only 4-shot training.
Fixing the Broken Compass: Diagnosing and Improving Inference-Time Reward Modeling: This paper systematically diagnoses three failure modes of inference-time reward models (RMs)—performance degradation on easy problems, diminished discriminability as sample size increases, and accuracy loss under high search diversity—and proposes CRISP, an algorithm that mitigates these issues via answer-clustering-based reward aggregation and stepwise prefix guidance, achieving accuracy improvements of up to 5%.
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics: This paper introduces ContextMATH, a benchmark that transforms abstract AIME/MATH-500 problems into two variants — Scenario Grounding (SG) and Complexity Scaling (CS) — and reveals that even top-tier models such as GPT-5 and DeepSeek-R1 suffer accuracy drops of 13–34% on contextual mathematical reasoning, with errors attributable primarily to problem formulation rather than computational reasoning.
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics: This paper introduces the ContextMATH benchmark, which systematically converts abstract mathematical problems from AIME and MATH-500 into two contextual variants—Scenario Grounding (SG) and Complexity Scaling (CS)—to reveal substantial performance degradation in LLMs on contextual mathematical reasoning. Open-source models drop by 13% on average on SG and 34% on CS. Two complementary performance bottlenecks are identified: problem formulation and reasoning execution.
Generalizable End-to-End Tool-Use RL with Synthetic CodeGym: This paper proposes CodeGym, a framework that automatically converts programming problems into multi-turn interactive tool-use environments for reinforcement learning training of LLM agents, achieving significant out-of-distribution generalization gains (e.g., +8.7 points on τ-Bench for Qwen2.5-32B).
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs: This paper formalizes the Program-to-Geometry task and proposes GeoGramBench (500 problems), evaluating 19 frontier LLMs on their ability to construct geometric representations from procedural drawing code and reason over them using a three-level geometric complexity taxonomy. Even GPT-5 achieves only 39.26% accuracy at the highest abstraction level, revealing fundamental limitations in LLM spatial abstraction.
Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation: This paper shows that the std-normalized advantage in GRPO makes update magnitudes peak on medium-difficulty problems while implicitly suppressing updates on both hard and easy ones. To address this, the authors propose MathForge, which combines two components: DGPO, which replaces std with MAD for difficulty-balanced normalization and adds softmax difficulty weighting, and MQR, which reformulates questions along three aspects (narrative context, abstract terminology, and nested sub-problems) to increase difficulty while preserving the original answers. On Qwen2.5-Math-7B, MathForge outperforms GRPO by an average of +4.56% across six mathematical reasoning benchmarks.
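The difficulty dependence of std normalization can be checked numerically for binary rewards. The sketch below compares the total absolute advantage in a group under std scaling and under MAD scaling, reading MAD as mean absolute deviation (an assumption; the entry does not expand the acronym). The softmax difficulty weighting is not shown, and `total_abs_advantage` is a hypothetical helper.

```python
import numpy as np

def total_abs_advantage(rewards, scale="std"):
    """Sum of |advantage| over a group, for two choices of normalizer."""
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    if scale == "std":
        denom = r.std()                    # population standard deviation
    elif scale == "mad":
        denom = np.abs(centered).mean()    # mean absolute deviation (assumed)
    else:
        raise ValueError(f"unknown scale: {scale}")
    return 0.0 if denom == 0 else float(np.abs(centered / denom).sum())

group_size = 8
for n_correct in range(1, group_size):
    rewards = [1] * n_correct + [0] * (group_size - n_correct)
    print(n_correct,
          round(total_abs_advantage(rewards, "std"), 3),
          round(total_abs_advantage(rewards, "mad"), 3))
```

For a pass rate \(p\) and group size \(G\), the std-scaled total works out to \(2G\sqrt{p(1-p)}\), which is largest at \(p = 0.5\), whereas the mean-absolute-deviation total equals \(G\) for every \(0 < p < 1\).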
HeurekaBench: A Benchmarking Framework for AI Co-scientist: This paper proposes HeurekaBench, a framework for constructing evaluation benchmarks grounded in real scientific workflows. It employs a multi-LLM pipeline to extract verifiable scientific insights from papers and generate open-ended research questions, enabling end-to-end assessment of AI co-scientists in data-driven scientific discovery.
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift: This paper systematically investigates the fragility of frozen-embedding-based safety classifiers under embedding drift induced by model updates. It finds that a mere 2% perturbation in the embedding space is sufficient to degrade classifier performance from 85% ROC-AUC to near-random levels (50%), with 72% of misclassifications occurring at high confidence (silent failure). Counterintuitively, instruction-tuned models prove harder to classify than their base counterparts.
Is In-Context Learning Learning?: This paper systematically investigates whether ICL constitutes genuine "learning" through large-scale controlled experiments. It demonstrates that ICL satisfies the formal mathematical definition of learning, yet empirical evidence reveals its generalization capacity to be limited — models primarily exploit structural regularities within the prompt via deduction rather than acquiring new capabilities from the provided demonstrations.
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort: This paper proposes TRACE (Truncated Reasoning AUC Evaluation), a method that quantifies reasoning effort by progressively truncating chain-of-thought (CoT) reasoning and measuring how early a model can obtain reward. TRACE detects implicit reward hacking that CoT monitoring fails to identify, achieving detection F1 improvements of over 65% and 30% compared to the strongest CoT monitors on math and code tasks, respectively.
LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation: This paper introduces LingOly-TOO, a benchmark that applies expert-designed grapheme-level permutations to linguistics olympiad problems, preserving reasoning logic while eliminating knowledge and memorization shortcuts. The obfuscation reduces the top score across 15 frontier models from 0.59 to 0.48, systematically quantifying the extent to which LLM reasoning ability is overestimated due to knowledge effects.
mR3: Multilingual Rubric-Agnostic Reward Reasoning Models: This paper introduces mR3, a family of multilingual rubric-agnostic reward reasoning models covering 72 languages. Through systematic data construction (GPT-OSS-120B distillation with difficulty filtering) and curriculum learning, the 14B model surpasses the 120B teacher model and all comparable baselines on multilingual evaluation benchmarks, while supporting point-wise, pair-wise, and binary evaluation paradigms.
Native Reasoning Models: Training Language Models to Reason on Unverifiable Data: This paper proposes NRT (Native Reasoning Training), a framework that treats reasoning chains as latent variables and uses the model's own predictive confidence over reference answers as an intrinsic reward signal to train LLM reasoning—without external verifiers or expert reasoning demonstrations. On Llama-3.1-8B, NRT achieves an average improvement of 10.2 points across 9 benchmarks (46.0→56.2), surpassing the verifier-dependent RLPR by +5.4 points.
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes: Prior to answer generation, a linear probe (difference-of-means) trained solely on residual stream activations at the question-processing stage can predict whether a model's forthcoming answer will be correct. This "pre-generation correctness direction," trained on TriviaQA, generalizes across multiple factual knowledge datasets (AUROC 0.68–0.88) but fails to generalize to mathematical reasoning (GSM8K), revealing a structural separation between representations of factual correctness and reasoning correctness within the model's internals.
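A difference-of-means probe is simple enough to sketch in a few lines. The example below uses made-up two-dimensional "activations" and hypothetical helper names (`fit_direction`, `score`); a real probe would be fit on residual stream activations from a chosen layer and evaluated with AUROC, neither of which is shown here.

```python
import numpy as np

def fit_direction(acts, correct):
    """Probe direction: mean activation of correct minus incorrect examples."""
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return acts[correct].mean(axis=0) - acts[~correct].mean(axis=0)

def score(acts, direction):
    """Project activations onto the probe direction; higher means 'likely correct'."""
    return np.asarray(acts, dtype=float) @ direction

# Toy data (assumption): 4 questions, 2-dimensional activations.
acts = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = [True, True, False, False]
probe = fit_direction(acts, labels)
scores = score(acts, probe)
print(probe, scores)
```

No gradient-based training is involved: the direction is a closed-form statistic of the labelled activations, which is part of why such probes are cheap to fit and easy to transfer across datasets.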
Nudging the Boundaries of LLM Reasoning: This paper identifies a fundamental limitation of GRPO: it cannot learn from problems that the model completely fails to solve (pass rate = 0%), producing zero gradients. The proposed method, NuRL, addresses this by injecting self-generated abstract hints (without revealing answers) into hard problems during training, converting them into learnable samples. NuRL consistently outperforms GRPO across 3 models and 6 benchmarks, and genuinely improves the pass@k capability upper bound.
On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning: This paper proposes the Regularized Policy Gradient (RPG) framework, which systematically derives and analyzes policy gradient methods based on Forward/Reverse KL divergence (in both normalized and unnormalized forms). It identifies a theoretical inconsistency in GRPO's KL term and achieves superior performance over GRPO, REINFORCE++, and DAPO on mathematical reasoning benchmarks.
On The Fragility of Benchmark Contamination Detection in Reasoning Models: This systematic study reveals that benchmark contamination detection in large reasoning models (LRMs) is extremely fragile: contamination introduced during the SFT stage becomes nearly undetectable after GRPO training (with PPO-style importance sampling and clipping identified as the root cause), and direct CoT SFT contamination of advanced LRMs leaves virtually no detectable trace—all 10 evaluated detection methods perform close to random guessing.
Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning LLMs: This paper proposes the Plan-and-Budget framework, which decomposes complex queries into sub-problems and adaptively allocates token budgets based on estimated complexity, achieving efficient test-time scaling for reasoning LLMs — with up to 70% accuracy improvement, 39% token reduction, and 193.8% gain on the E3 metric.
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation: This work is the first to integrate decomposed Chain-of-Thought reasoning with multi-dimensional reinforcement learning (RL) for video-to-audio (V2A) generation. It addresses the objective entanglement problem via four specialized CoT modules (semantic/temporal/aesthetic/spatial) paired with corresponding reward functions, and proposes the Fast-GRPO algorithm to substantially reduce RL training cost.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format: To address the tension between strong reasoning capability and weak instruction following in large reasoning models (LRMs), this paper proposes RAIN-Merging, a two-stage gradient-free merging pipeline that preserves the thinking format via null-space projection and enhances instruction relevance via attention-guided per-module scaling coefficients. It integrates the capabilities of an instruction-tuned model (ITM) into an LRM without any gradient-based training, achieving consistent improvements across 4 instruction-following and 9 reasoning benchmarks.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following Through Model Merging: This paper proposes RAIN-Merging, a gradient-free two-stage model merging method: it first applies null-space projection to preserve the thinking format of Large Reasoning Models (LRMs), then employs instruction-attention-guided merging coefficients to enhance instruction following, simultaneously improving instruction compliance and reasoning quality.
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models: This work presents the first systematic study of answer attribution in large reasoning models (LRMs), revealing that reasoning (CoT) and retrieval (memory) mechanisms compete simultaneously to influence final answers. The paper proposes Farl (Forgetting-Augmented Reinforcement Learning), which suppresses retrieval shortcuts to enhance genuine reasoning capability.
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization: This paper proposes ReForm, a reflective autoformalization paradigm that transforms the process of converting natural-language mathematics problems into Lean formal statements from single-pass generation into an iterative "generate → semantic self-verify → correct" loop. It further introduces the PBSO algorithm to optimize heterogeneous reward signals, achieving an average improvement of 22.6 percentage points over the strongest baselines across four benchmarks.
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models: This paper proposes a formal definition of Reasoning Faithfulness (RF) decomposed into stance consistency and causal influence, constructs the RFEval benchmark comprising 7,186 instances across 7 tasks, and evaluates 12 open-source Large Reasoning Models (LRMs) via output-level counterfactual reasoning intervention. Key findings include: 49.7% of outputs are unfaithful, RL post-training degrades faithfulness, and task accuracy is not a reliable proxy for faithfulness.
Scaling Generalist Data-Analytic Agents: This paper proposes DataMind, a complete training framework for data-analytic agents. It synthesizes diverse queries via a fine-grained task taxonomy with recursive difficulty composition, ensures data quality through knowledge-augmented trajectory sampling and self-consistency filtering, employs a dynamic SFT+RL mixed training strategy, and implements a memory-efficient asynchronous rollout framework. The resulting DataMind-14B achieves a 71.16% average score across multiple benchmarks, establishing a new state of the art and surpassing GPT-5 and DeepSeek-V3.1.
SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes: This paper proposes SceneCOT, the first framework to introduce Chain-of-Thought reasoning into 3D scene understanding. Through a four-stage reasoning pipeline (task recognition → region localization → entity grounding → grounded reasoning), intermediate reasoning steps are explicitly linked to visual grounding. SceneCOT achieves 34.7% Good Coherence on Beacon3D, a relative improvement of about 70% over the strongest baseline (20.4%).
SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models: This paper introduces SealQA, a challenging benchmark with three variants (Seal-0/Seal-Hard/LongSeal), where each question is carefully crafted by NLP researchers to trigger ambiguous, conflicting, or noisy search results. Even GPT-5 achieves at most 43.2% accuracy, revealing that test-time scaling does not yield reliable gains under noisy retrieval conditions.
Segment-Level Attribution for Selective Learning of Long Reasoning Traces: This paper applies Integrated Gradients to compute the attribution strength and direction consistency of each segment in long reasoning traces with respect to the final answer, identifies important segments for selective SFT, and achieves up to 4.7% accuracy improvement over full-CoT training while reducing output length by 18%.
Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning: This paper proposes TAMPO (Temperature Adaptive Meta Policy Optimization), which reframes the sampling temperature as a learnable meta-policy. Through a bilevel loop — an inner loop for LLM policy optimization and an outer loop for adaptively updating the temperature distribution based on trajectory advantage signals — TAMPO requires no additional rollouts and consistently outperforms fixed-temperature baselines on mathematical reasoning benchmarks.
The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models: Reasoning models form a "first impression" (internal bias) about the answer the moment they receive a question. When this intuitive guess conflicts with the subsequent systematic reasoning process, the model repeatedly second-guesses itself and re-examines its work, causing reasoning length to inflate by 21%–43%. Critically, none of the existing mitigation methods can fundamentally eliminate this effect.
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs: This paper reveals that short-task benchmarks create an illusion of diminishing returns — marginal gains in per-step accuracy are amplified exponentially in long-horizon tasks. It identifies a "self-conditioning effect" in LLMs (whereby prior errors increase the probability of subsequent errors), shows that thinking models mitigate this effect, and demonstrates that GPT-5 thinking can execute tasks exceeding 2,100 steps.
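The compounding effect can be illustrated with a back-of-the-envelope calculation. The sketch assumes every step succeeds independently with the same accuracy and that a single failed step fails the task; this idealization ignores the self-conditioning effect mentioned above, and `horizon_length` is a hypothetical helper.

```python
import math

def horizon_length(step_accuracy, success_rate=0.5):
    """Task length H at which step_accuracy**H equals success_rate.

    Assumes independent steps with constant accuracy (an idealization).
    """
    return math.log(success_rate) / math.log(step_accuracy)

for p in (0.9, 0.99, 0.999):
    print(f"per-step accuracy {p}: horizon of about {horizon_length(p):.0f} steps")
```

Under these assumptions, moving from 99% to 99.9% per-step accuracy looks like a gain of less than one point on a short benchmark but extends the 50%-success horizon by roughly a factor of ten.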
The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus: This paper proposes PoLR (Path of Least Resistance), the first inference-time method that exploits prefix consensus in reasoning chains. By clustering short prefixes and expanding only the dominant cluster, PoLR replaces standard Self-Consistency while maintaining or improving accuracy on GSM8K, Math500, AIME, and GPQA, with 40%–60% reduction in token usage and up to 50% lower latency.
Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs: This paper proposes AdaAnchor, a latent-space reasoning framework that appends learnable anchor vectors to input embeddings and refines their states through iterative forward passes to achieve "silent thinking." An adaptive stopping mechanism based on anchor stability dynamically allocates computation according to instance difficulty. On mathematical reasoning benchmarks, AdaAnchor achieves up to 5% higher accuracy and 48–60% fewer average steps compared to fixed-step latent reasoning, while reducing output tokens by 92–93% relative to CoT.
TopoBench: Benchmarking LLMs on Hard Topological Reasoning: TopoBench is a benchmark comprising 6 categories of topological puzzles × 3 difficulty levels for evaluating the global spatial reasoning capabilities of LLMs. Frontier models solve fewer than 24% of hard-tier instances. Causal intervention experiments reveal that error frequency does not equal causal impact — low-frequency constraint forgetting is more destructive than high-frequency repetitive reasoning.
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention: This paper proposes Intervened Preference Optimization (IPO), which constructs preference pairs for training by replacing compliance cues with safety triggers at critical steps during the reasoning process, significantly improving the safety of the chain-of-thought (CoT) reasoning process itself in large reasoning models (LRMs).
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention: This paper identifies a critical yet overlooked problem in large reasoning models (LRMs): their chain-of-thought reasoning frequently contains harmful content even when the final response appears safe. The authors propose Intervened Preference Optimization (IPO), which corrects unsafe reasoning trajectories by replacing compliance cues with safety triggers, constructing preference pairs for alignment training. Across 3 LRMs, IPO reduces reasoning harmfulness by over 30% without compromising reasoning capability.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding: This paper proposes Progressive Thought Encoding, which encodes evicted token information into fixed-size LoRA weight updates whenever KV cache entries are evicted, enabling efficient RL training of large reasoning models under constrained cache budgets while preserving long-range reasoning capability.
Training Large Reasoning Models Efficiently via Progressive Thought Encoding: This paper proposes Progressive Thought Encoding, which encodes evicted thought tokens into LoRA weights under KV cache constraints, halving GPU memory usage during RL training of large reasoning models while surpassing full-cache LoRA in reasoning accuracy (up to +23.4% on AIME2024/2025).
TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis: This paper proposes TumorChain, an interleaved multimodal chain-of-thought reasoning framework for tumor analysis across five major digestive organs. It integrates a knowledge graph-driven 1.5M CoT-VQA data engine, organ-guided iterative interleaved reasoning (IIR), and joint optimization of segmentation, classification, and LLM models to realize a complete reasoning chain from imaging findings → clinical impressions → pathological predictions, achieving an average accuracy of 84.41% and substantially outperforming GPT-5-Mini (51.59%).
Understanding the Role of Training Data in Test-Time Scaling: This paper theoretically analyzes how training data properties affect test-time scaling. It proves that CoT reasoning is equivalent to pseudo-Newton method iteration, proposes a task hardness measure based on the minimum eigenvalue of feature covariance, explains the mechanism behind the overthinking phenomenon ("more thinking is not always better"), and derives an optimal task selection strategy for multi-task training: training sets should be diverse, relevant, and difficult.
Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision: This paper proposes Uni-CoT, a hierarchical macro-micro reasoning framework that decomposes multimodal CoT into macro-level task planning (decomposing complex tasks into sub-goals) and micro-level sub-task execution (MDP-style self-reflective iterative refinement). An attention mask design reduces the complexity from \(O(T^2)\) to \(O(T)\). The method surpasses the BAGEL baseline by 0.02 on GenEval, achieving unified reasoning over interleaved text and images.
Verifying Chain-of-Thought Reasoning via Its Computational Graph: This paper proposes CRV (Circuit-based Reasoning Verification), which constructs interpretable attribution graphs by replacing LLM MLPs with transcoders, extracts structural "fingerprints" of reasoning errors from these graphs, and enables white-box CoT reasoning verification with the capacity to correct erroneous reasoning via causal intervention.
When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models: This paper presents a systematic benchmark and mechanistic analysis of the effects of compression (quantization/distillation/pruning) on large reasoning models (LRMs), yielding three core findings: parameter count affects knowledge memorization more than reasoning ability; the last-layer MLP up_proj of distilled models is the most critical weight; protecting only 2% of over-compressed weights improves average accuracy by 6.57%.
When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models: This paper systematically studies the effects of three compression methods—quantization, distillation, and pruning—on Large Reasoning Models (LRMs) through performance benchmarking and mechanistic interpretability analysis. Key findings include: parameter count affects knowledge memorization more than reasoning ability; the last-layer MLP up_proj is the most critical component; and current quantization methods over-compress the final layers.
When Shallow Wins: Silent Failures and the Depth-Accuracy Paradox in Latent Reasoning: This paper systematically analyzes the latent reasoning behavior of Qwen2.5-Math-7B on GSM8K. It finds that 81.6% of correct predictions arise from computationally inconsistent paths and that 8.8% constitute silent failures (high-confidence errors), revealing a paradoxical relationship between reasoning depth and accuracy.
Why is Your Language Model a Poor Implicit Reward Model?: This paper provides theoretical and empirical evidence that implicit reward models (IM-RM, e.g., DPO) generalize worse than explicit reward models (EX-RM) because IM-RM overfits to surface-level token cues rather than semantic representations, leading to substantial accuracy degradation under token distribution shift. The paper also refutes the "generation–verification gap" hypothesis.
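For readers unfamiliar with the term in the last entry, the sketch below illustrates what an implicit reward looks like in the standard DPO formulation: the reward of a response is the scaled log-likelihood ratio between the trained policy and a reference model. The function names, the `beta` default, and the toy log-probabilities are assumptions made for illustration; this is not code from the paper.

```python
from typing import Sequence


def implicit_reward(
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).

    Both arguments are per-token log-probabilities of the same response,
    so summing each gives the sequence log-probability.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("per-token log-prob lists must have equal length")
    return beta * (sum(logp_policy) - sum(logp_ref))


def prefers_chosen(reward_chosen: float, reward_rejected: float) -> bool:
    """A reward model ranks a pair correctly when the chosen response scores higher."""
    return reward_chosen > reward_rejected


# Toy, made-up per-token log-probabilities for two candidate responses.
r_chosen = implicit_reward([-1.0, -2.0], [-1.5, -2.5], beta=0.5)
r_rejected = implicit_reward([-2.0, -3.0], [-1.5, -2.5], beta=0.5)
print(prefers_chosen(r_chosen, r_rejected))
```

Because this reward is built directly from token-level probabilities, it is plausible that it is sensitive to the surface-level token cues the entry describes; an explicit reward model would instead score a response with a separate head over hidden representations.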