💡 LLM Reasoning
🔬 ICLR2026 · 241 paper notes
📌 Same area in other venues: 📷 CVPR2026 (16) · 💬 ACL2026 (82) · 🧪 ICML2026 (78) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (81) · 📹 ICCV2025 (3)
🔥 Top topics: Reasoning ×159 · LLM ×61 · Reinforcement Learning ×18 · Diffusion Models ×6 · Agents ×5
- A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
-
ARGOS facilitates bidirectional information exchange between LLMs and SAT solvers: the solver outputs "confirmed true literals" (the backbone), which the LLM uses to hypothesize missing commonsense clauses. These candidates are then filtered by scorers and fed back into the solver. This iterative completion of logic problems lacking explicitly stated commonsense premises outperforms pure neural or symbolic methods by up to 13% across multiple datasets.
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
-
MeRF writes verifiable reward functions into prompts as a "rulebook" in natural language. By explicitly informing the model of optimization goals during RL training, it moves away from blind trial-and-error, significantly outperforming RLVR baselines in logic and mathematical reasoning tasks.
- A State-Transition Framework for Efficient LLM Reasoning
-
This paper proposes an efficient reasoning framework that models the LLM reasoning process as a state-transition process. By using Linear Attention to compress information from historical reasoning steps into a state matrix, the framework reduces attention complexity from \(O(C^2)\) to \(O(C)\) and KV cache from \(O(C)\) to \(O(1)\), while maintaining reasoning capabilities without shortening the CoT sequence. An additional momentum strategy is introduced to mitigate the "overthinking" problem caused by noisy reasoning steps.
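The constant-memory claim can be illustrated with a minimal sketch of generic, unnormalized linear attention. This is not the paper's architecture: the function names `update_state` and `read_state` and the toy dimensions are hypothetical, and the point is only that the state keeps a fixed `d_k × d_v` shape however many steps are absorbed.

```python
def update_state(state, key, value):
    """Add the outer product of key and value into a d_k x d_v state matrix."""
    return [[s + k * v for s, v in zip(row, value)] for row, k in zip(state, key)]


def read_state(state, query):
    """Return query @ state, a d_v-dimensional readout."""
    d_v = len(state[0])
    return [sum(q * row[j] for q, row in zip(query, state)) for j in range(d_v)]


d_k, d_v = 2, 3
state = [[0.0] * d_v for _ in range(d_k)]

# Two toy "reasoning steps", each contributing one (key, value) pair.
steps = [([1.0, 0.0], [1.0, 2.0, 3.0]), ([0.0, 1.0], [4.0, 5.0, 6.0])]
for key, value in steps:
    state = update_state(state, key, value)

# The state is still d_k x d_v; no per-step cache has been kept.
readout = read_state(state, [1.0, 0.0])
```

Each step costs a fixed amount of work and memory, which is where the \(O(C)\) total attention cost and \(O(1)\) cache size come from in this simplified setting.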
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
-
PASR uses Reinforcement Learning (GRPO) to train LLMs to proactively decide "whether, when, and how" to refine their reasoning trajectories during the generation process (rather than post-hoc rework). By designing a "contrastive refinement reward" to encourage valuable corrections, it reduces average token consumption by 41.6% while improving accuracy by 8.2% on Qwen3-8B compared to standard generation.
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
-
NVIDIA systematically decomposes the synergistic relationship between "Supervised Fine-Tuning (SFT) + Large-scale Reinforcement Learning (RL)" in building reasoning models. By expanding SFT data, tuning RL sampling temperature to target "entropy \(\approx 0.3\)", and staging response lengths, a 7B model (AceReason-Nemotron-1.1) achieves new SOTA results for math/code reasoning among 7B-scale models (AIME25 64.8, LiveCodeBench v6 52.1).
- ActivationReasoning: Logical Reasoning in Latent Activation Spaces
-
The ActivationReasoning (AR) framework is proposed to embed explicit logical reasoning into the latent activation space of LLMs (via features extracted by SAEs). Through a three-stage pipeline (discovering concept representations → detecting activation propositions → reasoning with logical rules), it achieves multi-hop reasoning, concept composition, and safety control. On PrOntoQA, an 8B model achieves 95%+ accuracy, surpassing GPT-4o.
- Adaptive Social Learning via Mode Policy Optimization for Language Agents
-
This paper proposes the Adaptive Social Learning (ASL) framework, featuring four hierarchical reasoning modes (ranging from intuitive response to deep deduction). Through the AMPO algorithm—which integrates mode-level and sample-level advantage estimation—LLM agents adaptively switch reasoning depth based on the complexity of social scenarios. On social intelligence tasks, it outperforms GPT-4o by 15.6% and GRPO by 7.0%, while reducing token usage by 32.8%.
- Adaptive Thinking: Large Language Models Know When to Think in Latent Space
-
This paper proposes Sonata: using a lightweight MLP adapter to directly predict "self-consistency" from the last-layer hidden states of a query during the prefilling stage. This allows the model to decide whether and how much to think before decoding, reducing thinking tokens by 20%–60% while maintaining accuracy.
- Agentic Reinforcement Learning with Implicit Step Rewards
-
This paper proposes iStar, a universal credit assignment strategy for multi-turn reinforcement learning of LLM agents. By alternately optimizing an implicit process reward model (PRM) and a policy model, iStar learns dense rewards for each action step through a multi-turn DPO objective. Step-level advantages are combined with episode-level advantages to update the policy. iStar achieves SOTA results on WebShop, VisualSokoban, and the open-ended social environment SOTOPIA, demonstrating superior sample efficiency and training stability.
- AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
-
AgentMath proposes a tool-augmented agent framework that seamlessly integrates LLM reasoning with the computational precision of a code interpreter through automated data synthesis, multi-turn interactive reinforcement learning (RL), and an efficient asynchronous training system. It achieves SOTA performance on AIME24/25 and HMMT25 at the 30B-A3B scale (90.6%/86.4%/73.8%), surpassing o3-mini and Claude-Opus-4.0-Thinking.
- Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
-
The framework reformulates complex analysis as the estimation of "soft truth values" for propositions, using bias-variance decomposition as a design principle. It combines a divide-and-conquer tree to reduce bias with linear synthesis rules to reduce variance, yielding Analytica: a verifiable, scalable, and noise-resistant architecture for LLM-driven prediction agents.
- Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?
-
This paper systematically evaluates the robustness of reasoning LLMs to various interventions (benign, neutral, adversarial) in their Chain-of-Thought (CoT). It finds that while models are generally robust and can recover from interventions, paraphrasing the CoT suppresses "self-doubt" expressions, leading to decreased accuracy. Furthermore, the recovery process incurs significant computational overhead, with CoT expansion reaching up to 665%.
- Asymmetric Proximal Policy Optimization: Mini-Critics Boost LLM Reasoning
-
AsyPPO replaces the bulky critic (same size as the actor) with two lightweight mini-critics trained on non-overlapping data shards at the prompt level. This restores the utility of the PPO value function while maintaining GRPO-level overhead. Furthermore, it leverages the "disagreement" signal between the two critics for advantage masking and entropy filtering, stably outperforming GRPO and classic PPO on Qwen3-4B/8B/14B.
- Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
-
AttnRL uses the model's own attention scores as a "compass" to perform tree branching on critical reasoning steps (rather than using fixed lengths or entropy). Combined with difficulty-adaptive sampling and a one-step off-policy training pipeline, it enables Process-Supervised RL (PSRL) to improve mathematical reasoning while saving computation—achieving a 7.5% average gain on 1.5B models with shorter wall-clock time than TreeRL.
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
-
This paper proposes ATTS, an asynchronous test-time scaling framework based on conformal prediction. By reframing rejection sampling as a hypothesis testing process to eliminate synchronization overhead, it achieves up to 56.7x speedup and 4.14x throughput improvement on mathematical reasoning tasks such as MATH and AIME without accuracy loss. A 1.5B/70B draft/target model combination reaches the AIME performance level of o3-mini (high).
- Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in LLMs
-
Using Qwen2.5-3B-Base for controlled comparisons, the authors systematically demonstrate for the first time that RL (GRPO) possesses significantly stronger cross-lingual generalization for multilingual reasoning than SFT. Counter-intuitively, RL using non-English (German/Chinese) data outperforms English data. The study provides mechanistic explanations from three perspectives: "reasoning-time language inconsistency, sampling exploration, and semantic space drift."
- Beyond Magnitude: Leveraging Direction of RLVR Updates for LLM Reasoning
-
This paper points out that previous analyses of RLVR focused only on update "magnitude" (entropy, KL), whereas the true key is the update "direction". Using the signed token-wise log-probability difference \(\Delta\log p\), the authors precisely locate sparse but critical tokens for reasoning. Based on this, they propose two plug-and-play enhancement methods: test-time selective extrapolation and training-time low-probability token re-weighting.
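As a rough illustration of the signed token-wise difference \(\Delta\log p\), the sketch below subtracts base-model log-probabilities from RL-model log-probabilities for the same tokens and ranks tokens by how much they moved. The subtraction order, the ranking by absolute value, the helper names, and the toy numbers are all assumptions made for illustration, not the paper's procedure.

```python
def signed_logprob_delta(base_logprobs, rl_logprobs):
    """Per-token signed difference: log p_RL(token) - log p_base(token)."""
    return [r - b for b, r in zip(base_logprobs, rl_logprobs)]


def top_changed_tokens(tokens, deltas, k):
    """Return the k (token, delta) pairs with the largest absolute change."""
    ranked = sorted(zip(tokens, deltas), key=lambda td: abs(td[1]), reverse=True)
    return ranked[:k]


# Toy sequence with made-up log-probabilities under each model.
tokens = ["So", "wait", "x", "=", "4"]
base_logprobs = [-0.1, -3.0, -0.2, -0.05, -1.0]
rl_logprobs = [-0.1, -0.5, -0.2, -0.05, -0.4]

deltas = signed_logprob_delta(base_logprobs, rl_logprobs)
most_changed = top_changed_tokens(tokens, deltas, k=2)
```

In this toy case only two of the five tokens move at all, which mirrors the "sparse but critical" pattern the summary describes.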
- Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
-
This paper reinterprets the "self-reflection" behavior of LLMs through the lens of Bayesian Reinforcement Learning—viewing reflection as information gathering under MDP uncertainty. It proposes the BARL algorithm, which maintains a posterior of MDP hypotheses over candidate answers and switches policies when beliefs conflict with reward feedback, simultaneously improving accuracy and token efficiency in mathematical reasoning.
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
-
This paper proposes the Contact Searching Question (CSQ) framework, which builds on directed graph reachability tasks and cognitive psychology principles to define two complementary statistical metrics: a deceptive intent score \(\rho\) and a deceptive behavior score \(\delta\). It shows systematically, for the first time, that 16 major LLMs exhibit spontaneous deception tendencies under entirely benign prompts, and that these tendencies escalate with task difficulty.
- Beyond Speedup - Utilizing KV Cache for Sampling and Reasoning
-
This work reuses the KV cache, which already exists during inference but is traditionally used only to accelerate decoding, as a "free lightweight representation." Without storing additional hidden states, it enables self-evaluation of reasoning paths (KV-CoE) and difficulty-adaptive switching between fast and slow thinking (KVClassifier), cutting reasoning token volume by a factor of up to 5.7 with almost zero overhead.
- Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
-
The authors attach a small external Cache Processor to a frozen backbone LLM. At the end of each reasoning step (triggered by a newline), it rewrites the KV cache in-place—"consolidating" recently written entries while "reconsolidating" a few historical entries recalled via attention. Explained through Information Bottleneck theory, this mechanism improves generalization, yielding up to a +6.6pp improvement across seven mathematical reasoning benchmarks.
- C-Voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
-
For "recurrent" reasoning models that repeatedly apply the same layer, this paper proposes C-voting—a test-time voting strategy that requires no explicit energy function. By sampling multiple trajectories from random initial hidden states and selecting the one with the "highest average top-1 probability" (i.e., the most confident), it outperforms energy-based voting (E-voting) on AKOrN by 4.9% in Sudoku-hard. Furthermore, combined with a lightweight model ItrSA++ (3M parameters), it improves the HRM benchmark from 55.0% to 95.2% on Sudoku-extreme.
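The selection rule described above can be sketched in a few lines. The data layout (one probability distribution per output position) and the function names are hypothetical; only the "highest average top-1 probability" criterion comes from the summary.

```python
def average_top1_probability(position_distributions):
    """Mean of the largest probability at each output position."""
    return sum(max(dist) for dist in position_distributions) / len(position_distributions)


def c_vote(trajectories):
    """Pick the trajectory whose predictions are most confident on average."""
    return max(trajectories, key=lambda t: average_top1_probability(t["probs"]))


# Two toy trajectories, e.g. from different random initial hidden states.
trajectories = [
    {"name": "A", "probs": [[0.6, 0.4], [0.5, 0.5]]},
    {"name": "B", "probs": [[0.9, 0.1], [0.8, 0.2]]},
]
winner = c_vote(trajectories)
```

No energy function is evaluated here: the only signal is the model's own output distribution.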
- CaTS: Calibrated Test-Time Scaling for Efficient LLM Reasoning
-
By distilling confidence derived from self-consistency back into the model itself (Self-Calibration), LLMs can provide reliable confidence in a single forward pass. This enables calibrated test-time scaling (CaTS) for repeated sampling methods like Best-of-N and Self-consistency, dynamically allocating compute based on task difficulty. This approach significantly improves accuracy under the same sampling budget and saves substantial compute at the same accuracy level.
- ChainGPT: Dual-Reasoning Model with Recurrent Depth and Multi-Rank State Updates
-
ChainGPT shifts reasoning from "generating more tokens" into the latent space. By combining intra-layer multi-substep state updates (RWKV-Product) + State-Guided Sparse Attention (SGSA) for deep local computation with cross-layer recurrent depth for iterative refinement, it enables small models to achieve reasoning capabilities exceeding fixed-depth Transformers at near-linear complexity.
- Characterizing and Mitigating Reasoning Drift in Large Language Models
-
This paper diagnoses a failure mode in Large Language Models (LLMs) termed "Reasoning Drift" using thousands of mathematical reasoning trajectories. It finds that once a model enters a pathological functional state during the early high-plasticity phase, it becomes locked into incorrect paths. To address this, Reasoning-Aware Activation Steering (RAAS) is proposed, which uses a pre-computed library of contrastive steering vectors to nudge activations back to healthy paths in real time during inference, consistently improving accuracy on GSM8K, AIME, and GPQA with out-of-distribution transferability.
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
-
Co-rewarding proposes a self-supervised RL framework that addresses training collapse in self-rewarding RL through two complementary supervision mechanisms: data-side (cross-view consistency of paraphrased problems) and model-side (pseudo-labels from an EMA teacher model). Without human labels, it achieves or exceeds the performance of RLVR (with labels) on multiple mathematical reasoning benchmarks.
- Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning
-
This paper demonstrates through information-theoretic generalization bounds and interpretability analysis that the core mechanism of CoT training is compositional generalization: models learn to systematically combine simple learned skills to solve novel complex problems. This is internalized as a two-stage compositional reasoning circuit that extracts intermediate results at shallower layers, freeing deeper layers to focus on subsequent reasoning steps.
- Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models
-
CANON eschews predefined directional priors such as "higher entropy is better" or "shorter length is better." Instead, it sorts sampled responses for the same query by a target metric (entropy or length) and splits them into two groups. By utilizing inter-group comparisons to automatically discover which metric trend favors accuracy and intra-group comparisons to select superior responses within the same trend, CANON amplifies the effective influence of target metrics without the need for manual penalty term tuning.
- ContextPRM: Leveraging Contextual Coherence for multi-domain Test-Time Scaling
-
ContextPRM shifts the learning objective of Process Reward Models (PRMs) from "verifying whether a step is factually correct" to "evaluating whether the logical transition between adjacent reasoning steps is coherent." By proposing a coherence annotation standard and a context-aware training method, it enables a PRM trained only on mathematical data to generalize to non-mathematical domains such as law, history, and philosophy. It achieves a 6.5% average accuracy gain over the Majority Voting baseline on non-mathematical domains of MMLU-Pro, significantly surpassing the 2.2% gain of the previous SOTA, VersaPRM.
- Continuous Chain of Thought Enables Parallel Exploration and Reasoning
-
CoT2 proposes using continuous-valued tokens (convex combinations of vocabulary embeddings) instead of discrete tokens for chain-of-thought reasoning. This enables the model to track multiple reasoning paths in parallel within a single inference pass, which is theoretically equivalent to \(K\)-wise self-consistency or best-of-N sampling. Performance is further enhanced through GRPO reinforcement learning.
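A continuous token of the kind described above can be sketched as a softmax-weighted average of vocabulary embeddings. The tiny two-word vocabulary and the function names below are hypothetical, and this shows only the convex-combination step, not CoT2's training or decoding procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def continuous_token(logits, embeddings):
    """Convex combination of vocabulary embeddings, weighted by softmax(logits)."""
    weights = softmax(logits)
    dim = len(embeddings[0])
    return [sum(w * emb[j] for w, emb in zip(weights, embeddings)) for j in range(dim)]


# Hypothetical 2-word vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
mixed = continuous_token([0.0, 0.0], embeddings)
```

With equal logits the result sits halfway between the two embeddings, so a single vector carries both candidate continuations forward instead of committing to one.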
- CORE: Concept-Oriented Reinforcement for Bridging the Definition–Application Gap in Mathematical Reasoning
-
Addressing the issue where LLMs "can recite definitions but fail to apply concepts," CORE utilizes a clean linear algebra textbook to construct concept-aligned problems. During RL (GRPO) training, when a set of sampled trajectories are entirely incorrect, concept text is injected for correction. This is achieved either by directly replacing failed trajectories (CORE-CR) or by using forward KL to distill the "concept-guided" reasoning distribution into the "concept-free" policy (CORE-KL). Performance improves consistently during testing even without providing concepts.
- CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning
-
CoT-Evo reformulates "multi-teacher Chain-of-Thought (CoT) distillation" as a genetic algorithm. It first generates a pool of reasoning trajectories using multiple LLM thinkers and retrieved knowledge, then scores them with a fitness function based on correctness, length appropriateness, and knowledge utilization. Parents are selected through novelty-driven search to preserve both diversity and quality, and are then fused into a high-quality chain via reflective recombination and mutation. Fine-tuning 7-8B models on the evolved dataset achieves SOTA performance on biology and chemistry reasoning benchmarks.
- CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos
-
This paper proposes CoT-RVS, a completely training-free multi-agent framework that leverages the zero-shot CoT reasoning capabilities of pre-trained MLLMs for temporal-semantic correlation analysis and key-frame selection. It significantly outperforms fine-tuning methods on reasoning video segmentation tasks (Refer-DAVIS J&F 79.1 vs 71.2, ReasonVOS J&F 65.5 vs 49.9).
- Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards
-
Addressing the issue that value-free RL like GRPO/DAPO suffers from "insufficient exploration and premature convergence to repetitive patterns" in LLM reasoning, MERCI leverages the property that transitions in LLM generation are "known and deterministic" to simplify the Uncertainty Bellman Equation (UBE) into estimating only local reward variance. Using a lightweight "Coin Flip Network" (CFN) to estimate state novelty and convert it into intrinsic rewards, MERCI enables the policy to explore more diverse and coherent reasoning paths, consistently outperforming strong baselines on math and SQL benchmarks.
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
-
The paper proposes E2H Reasoner, which decomposes training data into four difficulty levels—"trivial, easy, medium, and hard"—and utilizes a probability scheduler (Cosine or Gaussian) to smoothly shift the sampling focus from easy to hard. This approach enables small models to master complex reasoning tasks that are unsolvable zero-shot, while providing theoretical guarantees for CRL convergence and sample complexity.
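One way to picture a cosine scheduler over four difficulty levels is sketched below. The exact schedule used by E2H Reasoner is not given in this summary, so the interpolation between an easy-heavy and a hard-heavy distribution, the function name, and the weights are all assumptions chosen for illustration.

```python
import math


def cosine_schedule_probs(step, total_steps, num_levels=4):
    """Sampling probabilities over difficulty levels, ordered easy to hard."""
    # alpha decays smoothly from 1 (start of training) to 0 (end of training).
    alpha = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    easy_heavy = [num_levels - i for i in range(num_levels)]  # e.g. 4, 3, 2, 1
    hard_heavy = easy_heavy[::-1]                             # e.g. 1, 2, 3, 4
    mixed = [alpha * e + (1.0 - alpha) * h for e, h in zip(easy_heavy, hard_heavy)]
    total = sum(mixed)
    return [m / total for m in mixed]


start = cosine_schedule_probs(0, 100)
middle = cosine_schedule_probs(50, 100)
end = cosine_schedule_probs(100, 100)
```

Under these assumptions the sampler favors "trivial" problems at the start, is roughly uniform midway, and favors "hard" problems at the end, with no abrupt stage boundaries.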
- CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling
-
Reflection tokens in reasoning processes (e.g., "wait", "but") are treated as schedulable "resources." Drawing from the concept of cyclical learning rates in optimization, CyclicReflex is proposed as a training-free decoding strategy. By dynamically regulating the logits of reflection tokens using a triangular waveform, it consistently improves the accuracy of 1.5B-8B models across multiple mathematical reasoning benchmarks (MATH500, AIME2024/2025, AMC2023).
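The triangular-waveform idea can be sketched as a periodic bias added to the logits of reflection tokens at each decoding step. The period, amplitude, starting phase, and function names below are hypothetical choices for illustration; the summary does not specify CyclicReflex's actual parameters.

```python
def triangular_wave(step, period, amplitude):
    """Triangle wave in [-amplitude, +amplitude], starting at -amplitude."""
    phase = (step % period) / period  # position within the cycle, in [0, 1)
    return amplitude * (1.0 - 4.0 * abs(phase - 0.5))


def adjust_logits(logits, reflection_ids, step, period=8, amplitude=2.0):
    """Add the current wave value to the logits of reflection tokens only."""
    bias = triangular_wave(step, period, amplitude)
    return [x + bias if i in reflection_ids else x for i, x in enumerate(logits)]


# Token id 1 plays the role of a reflection token such as "wait".
suppressed = adjust_logits([1.0, 1.0, 1.0], {1}, step=0)
boosted = adjust_logits([1.0, 1.0, 1.0], {1}, step=4)
```

Because only logits are modified at decode time, a schedule like this needs no training, which matches the training-free framing above.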
- DAG-Math: Graph-of-Thought Guided Mathematical Reasoning in LLMs
-
This paper formalizes CoT reasoning in LLMs as a rule-based stochastic process on Directed Acyclic Graphs (DAGs) and proposes the "logical closeness" metric to assess whether a model reaches an answer through search or rigorous logical derivation. By constructing the DAG-MATH benchmark with 2,894 gold-standard DAGs, the authors find that even models with similar PASS@k exhibit significant differences in reasoning faithfulness.
- Deep Think with Confidence
-
DeepConf leverages local confidence signals inherent in LLM generation to dynamically filter low-quality reasoning chains atop parallel thinking (multi-sampling + majority voting). It uses confidence-weighted voting with Top-η% filtering in offline mode and employs the least grouped confidence as a trigger for early stopping and adaptive sampling in online mode. Without training or hyperparameter tuning, it improves GPT-OSS-120B accuracy to 99.9% on AIME 2025 while reducing generation tokens by up to 84.7%.
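The offline mode described above (filter to the most confident traces, then vote with confidence weights) can be sketched as follows. How DeepConf computes a trace's confidence is not shown here; the sketch assumes each trace already carries a scalar confidence, and the function name and toy values are hypothetical.

```python
from collections import defaultdict


def filtered_weighted_vote(traces, keep_fraction):
    """traces: list of (answer, confidence) pairs.

    Keep the most confident fraction of traces, then sum confidences per answer.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    scores = defaultdict(float)
    for answer, confidence in ranked[:keep]:
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Plain majority voting would pick "41" (three votes against two).
traces = [("42", 0.9), ("41", 0.3), ("42", 0.8), ("41", 0.35), ("41", 0.4), ("17", 0.2)]
answer = filtered_weighted_vote(traces, keep_fraction=0.5)
```

In this toy example the low-confidence majority is filtered out and outweighed, so the vote flips relative to unweighted majority voting.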
- DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
-
DeepCompress modifies the RL training of large reasoning models with a dual-length reward strategy of "compressing simple problems and exploring difficult problems," improving accuracy in mathematical and scientific reasoning while significantly reducing the average number of reasoning tokens.
- DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
-
DeepMath-103K is a large-scale mathematical reasoning training set designed specifically for Reinforcement Learning from Verifiable Rewards (RLVR). The construction pipeline starts from 2.869 million raw math forum problems and applies rigorous decontamination, difficulty filtering (primarily levels 5–9), and answer verifiability checks. The resulting 103,000 high-difficulty problems have almost no overlap with mainstream evaluation benchmarks, and each comes with a machine-verifiable answer and three R1-generated solutions. Models trained with RL on this dataset lead on benchmarks such as AIME and MATH500 and generalize to non-mathematical reasoning tasks in biology, physics, and chemistry.
- DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
-
This paper proposes Design Logic—reusable meta-knowledge reverse-engineered from authentic exam questions—to guide the synthesis of multidisciplinary reasoning problems from raw text. The authors constructed 4.7 million reasoning questions across 75 disciplines; base models fine-tuned on this data (SFT) even surpass official models that underwent full post-training.
- Diagnosing and Remedying Knowledge Deficiencies in LLMs via Label-free Curricular Meaningful Learning
-
LaMer utilizes the "relative entropy of model output distributions before and after incorporating external knowledge" as a label-free probe to locate and quantify knowledge deficiencies in LLMs. It then adaptively synthesizes data based on deficiency severity and repairs them through easy-to-hard curriculum fine-tuning, matching or exceeding label-dependent methods with only 40% of the training data.
- Diversity-Enhanced Reasoning for Subjective Questions
-
This paper proposes MultiRole-R1, which integrates multiple stakeholder perspectives into a single long Chain-of-Thought (CoT) through "Role Perspective Diversity + Token-level Diversity." This is achieved via unsupervised SFT of synthesized reasoning chains followed by GRPO reinforcement learning with diversity reward shaping. The method improves both accuracy (by 10.6% on average) and diversity on subjective questions without unique correct answers, while also generalizing to objective math problems like AIME 2024.
- Divide and Abstract: Autoformalization via Decomposition and Abstraction Learning
-
DNA is a training-free autoformalization framework that first extracts common mathematical concepts from the entire corpus and formalizes them into reusable abstractions to extend the target formal language. It then hierarchically decomposes each new proposition into "quantifier + premise + conclusion" clauses for step-by-step translation and recombination. It achieves a success rate improvement of up to 8.60× over baselines on LeanEuclidPlus and ProofNet-Hard.
- DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
-
DRIFT decomposes the task of "translating natural language mathematical propositions into Lean formal statements" into four steps: Decompose → Retrieve → Illustrate → Formalize. It first directs the LLM to split information-dense informal propositions into atomic sub-queries focused on single concepts to retrieve precise formal definitions from Mathlib. Then, it uses a greedy algorithm to select example theorems demonstrating how these definitions are used. Finally, these are fed to a formalizer to generate formal statements. DRIFT nearly doubles dependency retrieval F1 on ProofNet and achieves a 55-point surge in BEq+@10 on the OOD ConNF dataset, even surpassing the oracle.
- DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
-
The study diagnoses a fundamental flaw in GRPO when integrated with length penalties—correct but lengthy responses may obtain negative advantage values and thus be erroneously penalized. DRPO is proposed to decouple reward signals for positive and negative samples, ensuring that length penalties are normalized only within the group of correct responses. On a 1.5B model, DRPO achieves a 77% reduction in length with only 1.1% performance loss (compared to a 68% reduction and 4.3% loss for the baseline).
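The flaw can be reproduced with a small numerical example using standard group-normalized advantages (reward minus group mean, divided by group standard deviation). The linear length penalty, its coefficient, and the toy group below are hypothetical; the sketch shows only the problem that DRPO targets, not DRPO itself.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma for r in rewards]


# One group of sampled responses: (is_correct, length_in_tokens).
group = [(True, 200), (True, 300), (True, 2000), (False, 500)]
penalty_per_token = 0.0004  # hypothetical coefficient

# Correct responses get 1 minus a length penalty; incorrect responses get 0.
rewards = [1.0 - penalty_per_token * n if ok else 0.0 for ok, n in group]
advantages = group_normalized_advantages(rewards)

# The third response is correct but long; its advantage is negative.
long_correct_advantage = advantages[2]
```

Here the long correct answer earns a reward of about 0.2 against a group mean of about 0.5, so the update pushes its probability down even though it is right, which is the erroneous penalty described above.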
- Dynamic Early Exit in Reasoning Models
-
DEER enables Large Reasoning Models (LRMs) to trial-answer at "reasoning switch points" within the Chain-of-Thought (CoT). It uses the confidence of these trial answers to judge if the reasoning is sufficient, allowing for training-free dynamic early exit. Across 11 models and 10 benchmarks, it reduces CoT length by 19.1%–80.1% on average while improving accuracy by 0.3%–5.0%.
- Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models
-
The solving progress of each prompt in RL finetuning is modeled as a Hidden Markov Model (HMM) dynamic system. Through lightweight online Bayesian inference, the solving state of prompts is predicted, prioritizing "partially solved" prompts. This achieves equivalent or superior reasoning performance with less than 30% of the rollout volume compared to DS.
- e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
-
This paper points out that most open-source reasoning models fail to "extrapolate" test-time compute beyond their training budgets. It proposes the e3 recipe—linking the asymmetric capabilities of base models + RL negative gradients + coupled curricula—to enable in-context exploration. This allows a 1.7B model to continuously improve when extrapolating to 2.5× its training budget on AIME/HMMT'25, surpassing all models \(\le\) 2B.
- Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning
-
This paper reinterprets the spontaneous phenomenon of "repeating the prompt" at the beginning of the Chain of Thought (Echo of Prompt, EOP) in large reasoning models from a training byproduct to an intrinsic attention refocusing mechanism. By defining the "echo likelihood difference \(\Delta L\)" via a rejection sampling framework to quantify probabilistic costs, the paper proposes two methods: the training-based ED-SFT and the training-free Echoic Prompting, achieving consistent improvements across multiple mathematical reasoning benchmarks.
- Efficient Test-Time Scaling for Small Vision-Language Models
-
Two efficient test-time scaling strategies for small VLMs are proposed: TTAug (aggregates output probabilities at the token level after various input augmentations) and TTAdapt (adaptively adjusts model parameters using pseudo-labels generated by TTAug). These methods consistently improve performance across 9 benchmarks while maintaining significantly higher computational efficiency than existing sampling-based test-time methods.
- Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
-
Through the analysis of RL training dynamics, this work discovers that the improvement of LLM reasoning capabilities is driven by a two-stage hierarchical mechanism: "low-level program consolidation \(\rightarrow\) high-level strategy exploration." Based on this, the HICRA algorithm is proposed to concentrate optimization signals on high-impact planning tokens, significantly exceeding GRPO baselines across multiple mathematical reasoning benchmarks.
- Enhancing Language Model Reasoning with Structured Multi-Level Modeling
-
This work reconstructs single-policy long Chain-of-Thought (CoT) generation into a two-level stochastic process (MLR), where a high-level planner outputs step descriptors and a low-level executor writes detailed content. By using Twisted SMC to construct process-level preferences for iterative Step-DPO, the method enables small models to perform stable long-range reasoning under limited data budgets.
- Enhancing LLMs for Knowledge Base Question Answering by Chain-of-Decomposition
-
This paper proposes Chain-of-Decomposition (CoD), which factorizes the answer generation distribution of Knowledge Base Question Answering (KBQA) into three subtasks—"Retrieval → Reformulation → Reasoning"—using a causal graph. Retrieval is handled by a small model and reformulation by rules (both independent of the LLM), leaving only a lightweight binary classification task of "whether the reasoning path is valid" for a fine-tuned LLM. Consequently, Llama-2 7B achieves SOTA performance on WebQSP/CWQ, surpassing GPT-4 with retrieved knowledge.
- Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval
-
Inspired by the dual-process theory of cognitive science, this paper proposes the RF-Mem framework. It achieves efficient and scalable LLM personalization through a memory retrieval mechanism that adaptively switches between two paths: Familiarity (fast similarity matching) and Recollection (deep chain reconstruction).
- EvolProver: Advancing Automated Theorem Proving by Evolving Formalized Problems via Symmetry and Difficulty
-
EvolProver proposes a dual-perspective "Symmetry + Difficulty" formal statement data augmentation pipeline (EvolDomain cross-domain translation + EvolDifficulty difficulty evolution + EvolAST AST-based deterministic syntactic rewriting). Using this augmented data, a 7B non-CoT theorem prover was trained, achieving a new SOTA for its size with 53.8% pass@32 on FormalMATH-Lite, even surpassing reasoning models.
- Executable Counterfactuals: Improving LLMs' Causal Reasoning Through Code
-
This paper restores "counterfactual reasoning" to its three-step process of "abduction \(\to\) intervention \(\to\) prediction." By constructing executable Python functions (and equivalent GSM math problems) with latent variables that necessitate abduction for correct answers, the authors find that SOTA models experience a 25–40% performance drop from intervention to counterfactual reasoning. While SFT merely memorizes shallow patterns and fails to generalize, RLVR applied solely to code enables the generalization of these three cognitive skills to entirely new control flows and natural language math problems.
- Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
-
This paper formalizes the "reasoning potential" of foundation models as "the reciprocal of the expected number of independent attempts required to solve a problem." It proposes the CoTP framework, which abstracts atomic reasoning patterns from CoT sequences and utilizes a dual-granularity weighted DTW distance (Reasoning Pattern Chain + Token Entropy Chain) to select long CoT data aligned with a high-value core set from massive data pools. Using only 10B tokens, it improves an 85A6B MoE model by 9.58% on AIME and lifts the downstream RL performance ceiling by 7.81%.
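The definition of reasoning potential has a simple closed form under one extra assumption, made here only for illustration: every attempt succeeds independently with the same probability \(p\). The number of attempts \(N\) until the first success is then geometric, so

\[
\mathbb{E}[N] = \sum_{n=1}^{\infty} n\,(1-p)^{n-1}\,p = \frac{1}{p},
\qquad
\text{potential} = \frac{1}{\mathbb{E}[N]} = p .
\]

For example, a model that solves a problem on 25% of independent attempts needs 4 attempts in expectation and has a potential of 0.25 on that problem.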
- Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation
-
TSD-KD enables small student models to reason "in their own words" by distilling only on high-entropy key tokens at the beginning of responses. It combines indirect preference distillation (where the teacher ranks student candidates) with direct distillation (targeting tokens where the student is uncertain but the teacher is certain) and entropy regularization. TSD-KD achieves SOTA results on 10 reasoning benchmarks for a 1.5B student, even outperforming the 14B teacher on specific tasks.
- Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems
-
This paper proposes GRALGOBENCH—a benchmark evaluating Large Reasoning Models (LRMs) using graph algorithm problems (8–160 nodes, three reasoning paradigms, nine tasks). Leveraging programmatic verification, controllable difficulty, and intrinsic long-context features, it systematically exposes two major weaknesses of LRMs: a sharp drop in accuracy as context length increases and "over-thinking" driven by excessive but inefficient self-verification.
- FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning
-
This paper proposes FaithCoT-Bench—the first unified benchmark for instance-level CoT unfaithfulness detection. It formalizes the question of "whether a specific reasoning chain accurately reflects the model's internal decision-making" as a binary classification problem, supported by the FINE-CoT dataset containing 1,000+ expert-annotated trajectories, and systematically evaluates 11 detection methods.
- FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
-
Addressing the critical bottleneck where the generation phase accounts for 91%-98% of GRPO training time, this work proposes a concurrency-aware speculative decoding strategy (dynamically adjusting draft tree parameters to adapt to real-time concurrency changes) and online draft model learning (utilizing hidden states from the target model to adapt to distribution shifts). The approach achieves 2.35x-2.72x end-to-end training acceleration without compromising inference quality.
- FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
-
FATE is a series of Lean formalization benchmarks for research-level abstract and commutative algebra. By utilizing three difficulty levels—FATE-M/H/X (ranging from undergraduate exercises to beyond Ph.D. qualifying exams)—it pushes current top models to their limits: the best models achieve only 3% on FATE-H and 0% on FATE-X. Through a two-stage decomposition of "Natural Language Reasoning + Formalization," the study identifies that the primary bottleneck is not mathematical capability but the translation of a correct natural language proof into precise Lean code.
- Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
-
Fine-R1 surpasses CLIP and general/reasoning MLLMs in fine-grained visual recognition (FGVR) using only 4-shot training, achieved through CoT Supervised Fine-Tuning (structured reasoning chain: "Visual Analysis → Candidate Subclasses → Comparison → Prediction") and Triplet Augmentation Policy Optimization (TAPO), which utilizes intra-class augmentation for robustness and inter-class augmentation for discriminative power.
- Fixing the Broken Compass: Diagnosing and Improving Inference-Time Reward Modeling
-
This paper systematically diagnoses three failure modes of Reward Models (RMs) at inference time—performance degradation on easy problems, decreased discriminative power as the number of samples increases, and excessive search diversity harming accuracy. It proposes the CRISP algorithm to mitigate these issues through cluster-based reward integration and stepwise prefixing, achieving accuracy improvements of up to 5%.
- FlowRL: Matching Reward Distributions for LLM Reasoning
-
FlowRL transforms LLM reasoning RL from "maximizing scalar rewards" to "matching complete reward distributions"—using a learnable partition function to normalize scalar rewards into a target distribution, and leveraging the Trajectory Balance loss of GFlowNets to minimize the reverse KL between the policy and the target distribution. This preserves multiple valid reasoning modes and alleviates mode collapse, achieving an average improvement of 10.0%/5.1% over GRPO/PPO in mathematics.
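The distribution-matching idea can be written out in its simplest form. The following is a sketch only: the temperature \(\beta\), the symbols \(Z_\phi\) and \(\pi_\theta\), and the plain squared form are assumptions for illustration, and the loss actually used in the paper may include further terms.

\[
\tilde{\pi}(y \mid x) = \frac{\exp\big(\beta\, r(x, y)\big)}{Z_\phi(x)},
\qquad
\mathcal{L}_{\text{TB}}(\theta, \phi) = \Big(\log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y)\Big)^{2} .
\]

The squared residual is zero for every response \(y\) exactly when \(\pi_\theta(y \mid x) = \tilde{\pi}(y \mid x)\), so the policy is pushed toward the whole reward-induced distribution rather than toward its single highest-reward mode.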
- Following the Navigation: Enhancing Small Language Models Contextual Reasoning with LLM Guidance
-
This paper proposes Navigation, a training-free framework that distills the "reasoning strategies" of large models for complex contexts into reusable navigation templates stored in a database. Using a three-phase "Generation-Utilization-Update" cycle, it guides 3B small models to locate key information, achieving an average accuracy improvement of 10.7% and outperforming GPT-3.5-Turbo.
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
-
The authors propose the ContextMATH benchmark, which reveals that even top-tier models like GPT-5 and DeepSeek-R1 experience a 13-34% accuracy drop in contextual mathematical reasoning. By transforming AIME/MATH-500 abstract problems into Scenario Grounding (SG) and Complexity Scaling (CS) variants, the study identifies that errors primarily stem from problem formulation rather than computational reasoning.
- From Assumptions to Actions: Turning LLM Reasoning into Uncertainty-Aware Planning
-
The authors propose PCE (Planner-Composer-Evaluator), a framework that explicitly extracts implicit environmental assumptions from LLM reasoning chains and organizes them into decision trees. By employing a likelihood-gain-cost scoring mechanism for uncertainty-aware action selection, it significantly reduces communication overhead in multi-agent collaboration.
- Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
-
Under a fixed reasoning token budget, this study systematically decomposes whether "reasoning data should be placed in pre-training or post-training." It finds that front-loading reasoning data into pre-training builds a persistent advantage that SFT cannot compensate for and proposes an asymmetric data allocation principle: "diversity for pre-training, quality for SFT."
- FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
-
Redundant sentences in reasoning chains with "low attention and low contribution" are defined as reasoning outliers. By replacing vanilla Softmax with Softmax₁ and performing lightweight SFT, large reasoning models can reduce reasoning tokens by approximately 70% while maintaining or even improving performance.
- Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
-
CodeGym is proposed to automatically transform programming problems into interactive multi-turn tool-use environments for training LLM agents with reinforcement learning. It achieves significant generalization improvements on out-of-distribution (OOD) benchmarks (e.g., +8.7 points for Qwen2.5-32B on \(\tau\)-Bench).
- Generalization in LLM Problem Solving: The Case of the Shortest Path
-
This paper uses a controllable synthetic environment of shortest paths to decompose the sources of generalization in LLM problem solving. It finds that models can transfer learned local rules to unseen maps, yet fail on longer paths due to the instability of recursive composition. Data coverage determines the upper bound of performance, while RL primarily stabilizes training rather than extending the limit, and test-time sampling merely raises the curve without solving the length extrapolation issue.
- Generalized Parallel Scaling with Interdependent Generations
-
This paper proposes Bridge: treating \(N\) parallel sampling trajectories of a single prompt as a unified 3-D tensor rather than independent slices. By performing "cross-sample attention" along the batch axis at each time step, \(N\) generations exchange information. Adding only 2.8%–5.1% parameters improves the relative gain of RLVR by up to 39%, with a single training session generalizing to any generation width.
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
-
GAR places an LLM discriminator and an LLM reasoner into a GAN-like online adversarial reinforcement learning framework for joint training. By using "slice-level" dense process rewards to supplement sparse final answer rewards, it achieves stable improvements across the DeepSeek-R1-Distill series on multiple mathematical reasoning benchmarks.
- GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
-
This work formalizes the Program-to-Geometry task and introduces GeoGramBench (500 problems). Using a three-level geometric complexity taxonomy, it evaluates the ability of 19 state-of-the-art LLMs to construct geometric representations and reason from procedural plotting code. The study reveals that even GPT-5 achieves only 39.26% accuracy at the highest abstraction level, highlighting a fundamental weakness in LLM spatial abstraction.
- GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
-
GPG (Group Policy Gradient) returns to the most basic policy gradient, directly optimizing the original RL objective—eliminating the critic, reference model, KL constraints, and surrogate loss, while retaining only group-mean normalization and a gradient debiasing correction, consistently outperforming GRPO in math and multimodal reasoning tasks.
- HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
-
HardcoreLogic constructs a benchmark of 5,000+ atypical logic puzzles across 10 types by applying three long-tail transformations: "Increasing Complexity," "Uncommon Elements," and "Unsolvable Puzzles." It reveals that even state-of-the-art models like GPT-5 rely heavily on memorized patterns of classic problems, suffering significant performance drops when encountering these variants.
- Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
-
This paper shows that the advantage function of GRPO (standard deviation normalization) produces the largest update magnitudes for medium-difficulty problems while implicitly suppressing hard and easy ones. It proposes the MathForge framework, which combines DGPO (replacing std with MAD for difficulty equalization, plus softmax difficulty weighting) and MQR (reformulating questions via story backgrounds, abstract terms, and nested sub-problems to increase difficulty while preserving the original answers). MathForge achieves an average +4.56% improvement over GRPO across 6 mathematical reasoning benchmarks using Qwen2.5-Math-7B.
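The effect of the normalizer can be checked numerically. The sketch below assumes binary 0/1 rewards, a group of 8 rollouts, and that "MAD" means mean absolute deviation; the function names are hypothetical and the softmax difficulty weighting is not included. It compares the total absolute advantage per group, used here as a rough proxy for update magnitude, under the two normalizers.

```python
from statistics import mean, pstdev


def advantages(rewards, scale="std"):
    """Group-relative advantages (r - mean) / scale for one prompt's rollouts."""
    mu = mean(rewards)
    if scale == "std":
        s = pstdev(rewards)  # population standard deviation of the group
    else:
        # "mad": mean absolute deviation (assumed meaning of MAD)
        s = mean([abs(r - mu) for r in rewards])
    if s == 0:
        # All-correct or all-wrong group: no learning signal either way.
        return [0.0 for _ in rewards]
    return [(r - mu) / s for r in rewards]


def update_magnitude(rewards, scale):
    """Sum of absolute advantages over the group."""
    return sum(abs(a) for a in advantages(rewards, scale))


GROUP_SIZE = 8
for n_correct in (1, 4, 7):  # hard, medium, easy prompt
    rewards = [1.0] * n_correct + [0.0] * (GROUP_SIZE - n_correct)
    print(
        n_correct,
        round(update_magnitude(rewards, "std"), 3),
        round(update_magnitude(rewards, "mad"), 3),
    )
```

With std normalization the magnitude peaks at a 50% pass rate and shrinks toward both extremes, while the mean-absolute-deviation normalizer gives the same magnitude at every non-degenerate pass rate, which is one way to read "difficulty equalization."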
- HATSolver: Learning Gröbner Bases with Hierarchical Attention Transformers
-
The authors replace the expensive flat self-attention in a Transformer encoder with hierarchical attention consisting of "bottom-up local attention + top-down cross-layer attention." By leveraging the natural tree structure of polynomial systems, the \(O(L^2)\) sequence cost is reduced to approximately \(O(L^{1+1/n})\). This scales neural Gröbner basis prediction from 5 variables to 13 variables with 100% density, outperforming classical symbolic tools like STD-FGLM and Msolve on hard instances.
- Hilbert: Recursively Building Formal Proofs with Informal Reasoning
-
HILBERT constructs an agent using a quartet of "General Reasoning LLM + Specialized Proving LLM + Verifier + Theorem Retriever." By recursively decomposing difficult problems into subgoals, proving them layer-by-layer, and reassembling them, it increases the success rate of formal proof from the teens to 70% on PutnamBench and 99.2% on miniF2F, allowing open-source models to approach informal reasoning levels for the first time.
- HiPO: Self-Hint Policy Optimization for RLVR
-
HiPO extracts "prefixes" from accidentally successful trajectories within a training batch to serve as on-policy self-hints for resampling. This transforms sparse 0/1 rewards into dense contrastive learning signals, specifically addressing the "near-miss" problem and exploration stagnation in RLVR.
- Hybrid Reinforcement: When Reward Is Sparse, Better to Be Dense
-
HERO uses rule verifiers as "gates" to hierarchically normalize continuous Reward Model (RM) scores (scaling the correct and incorrect groups separately) and applies variance-adaptive weighting to amplify difficult prompts. By fusing sparse binary verification rewards with dense RM rewards into a stable and fine-grained hybrid reward, it outperforms both "verifier-only" and "RM-only" baselines in mathematical reasoning.
- Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
-
This paper proposes GDPO (Group Diffusion Policy Optimization), utilizing a low-variance, low-cost "Semi-deterministic Monte Carlo" scheme to efficiently estimate the sequence-level ELBO of diffusion language models. This allows GRPO-style RL post-training to be effectively applied to diffusion language models, consistently outperforming the previous diffu-GRPO across mathematical, planning, and coding reasoning tasks.
- Incentivizing LLM Reasoning via Reinforcement Learning with Functional Monte Carlo Tree Search
-
RFTT embeds a set of learnable "functional tokens" such as `<analyze>`, `<verify>`, and `<refine>` directly into the model's vocabulary. It first generates annotated SFT data using functional prompt-guided MCTS for warmup, and subsequently enables the model to directly sample functional tokens for tree search exploration during the RL stage. This allows 7B/8B small models to acquire human-like multi-step reasoning without any prompting.
- InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
-
This paper proposes InftyThink, a new paradigm that transforms monolithic long reasoning into iterative short reasoning with intermediate summaries. It achieves theoretically unbounded reasoning depth and significantly reduces computational costs without modifying model architectures, showing an 11% improvement for Qwen2.5-Math-7B on AIME24.
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
-
By leveraging the unique "inpainting" capability of diffusion language models (dLLMs), partial ground-truth reasoning segments are injected to guide exploration when GRPO training encounters "all-wrong groups with zero advantage." This restores gradient signals and improves sample efficiency, achieving new SoTA results for full-attention masked dLLMs on four mathematical reasoning benchmarks.
- InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning
-
Addressing the credit assignment challenge in outcome-reward RL where "entire trajectories are rewarded or punished together, failing to distinguish correct from incorrect steps," this paper has the model self-verify against a reference answer to propose a single-step corrective intervention for the first error in a failed trajectory. By "patching" these interventions into the base model via SFT followed by RL, the 4B model's accuracy on IMO-AnswerBench increased by nearly 14%, surpassing gpt-oss-20b.
- Is In-Context Learning Learning?
-
Through large-scale controlled experiments, this paper systematically analyzes whether ICL constitutes "learning". It finds that while ICL satisfies the mathematical definition of learning, empirical evidence shows limited generalization—models primarily rely on structural patterns in prompts for deduction rather than truly acquiring new capabilities from examples.
- Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
-
This paper proposes TRACE (Truncated Reasoning AUC Evaluation), a method that quantifies reasoning effort by progressively truncating reasoning chains and measuring "how early" a model obtains rewards. It detects implicit reward hacking behaviors that CoT monitoring fails to identify, improving detection F1 by over 65% in math and 30% in code tasks compared to the strongest CoT monitors.
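The truncation-and-AUC idea can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the function name `trace_auc`, the five truncation points, the trapezoidal rule, and the made-up pass rates. The summary only states that chains are truncated progressively and that earlier reward indicates less reasoning effort.

```python
def trace_auc(fractions, pass_rates):
    """Trapezoidal area under the pass-rate vs. truncation-fraction curve.

    fractions:  share of the reasoning chain kept before forcing an answer (0..1)
    pass_rates: fraction of forced answers that still obtain the reward
    """
    points = sorted(zip(fractions, pass_rates))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
# Hypothetical curves, not data from the paper:
genuine = [0.0, 0.0, 0.25, 0.5, 1.0]    # reward only appears late in the chain
shortcut = [0.75, 1.0, 1.0, 1.0, 1.0]   # reward appears almost immediately

print(trace_auc(fractions, genuine))
print(trace_auc(fractions, shortcut))
```

A curve that reaches full reward after seeing almost none of the chain yields a high area, which under this reading flags a response whose stated reasoning is not doing the work.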
- KaVa: Latent Reasoning via Compressed KV-Cache Distillation
-
KaVa compresses the KV-cache generated by a teacher model through explicit Chain-of-Thought (CoT) using redundancy-importance eviction, then distills it directly into the student's continuous implicit reasoning trajectory. By introducing "step-by-step KV alignment" as a new supervision signal, it provides internal step-wise supervision that has long been missing in implicit reasoning, achieving CoT-level accuracy with implicit reasoning efficiency on natural language reasoning traces.
- LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
-
LaDiR utilizes a VAE to compress each reasoning step into a "block" of continuous thought tokens, then applies block-level latent diffusion (flow matching) to iteratively denoise and refine these tokens. This allows LLMs to perform iterative correction and parallel diverse exploration at a semantic level, consistently outperforming autoregressive, discrete diffusion, and latent reasoning baselines across math, code, and planning tasks.
- Latent-Guided Reasoning: Empowering Small LLMs with Large-Model Thinking
-
The proposed method allows a large model to handle "cognitive planning" by compressing problem-solving strategies into a few latent guidance vectors. These vectors are then passed to a small model responsible for "linguistic realization" to generate reasoning chains. This approach outfits small models with the thinking capabilities of large models, pushing the reasoning performance-cost trade-off to a new equilibrium.
- Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
-
The study models "stepwise correctness in CoT" as a set of latent veracity variables. It uses the joint likelihood of "veracity + final answer" from a language model as a proxy reward to perform posterior inference via discrete MCMC search (Veracity Search) for error localization. The search results are then distilled into a zero-shot verifier (AVI) that operates without the ground truth answer, requiring no stepwise human annotation throughout the process.
- Lean4PHYS: Comprehensive Reasoning Framework for College-level Physics in Lean4
-
This paper introduces Lean4PHYS—the first Lean4 formal reasoning framework for college physics. It consists of PhysLib, a community-driven physics theorem library with a unit system, and LeanPhysBench, an evaluation set containing 200 expert-formalized problems. The experiments reveal an overfitting phenomenon where "mathematical expert provers do not outperform general LLMs in the physics domain," while demonstrating that providing PhysLib in the context yields an average performance improvement of 11.90%.
- Learning Global Hypothesis Space for Enhancing Synergistic Reasoning Chain
-
This paper proposes GHS-TDA: fusing multiple reasoning paths sampled from LLMs into a "Global Hypothesis Graph," then applying Topological Data Analysis (Persistent Homology) to extract stable "logical backbones" and "self-consistent loops." By selecting reasoning chains based on structural stability rather than local confidence, the method suppresses error propagation and enhances accuracy and interpretability.
- Learning to Reason over Continuous Tokens with Reinforcement Learning (HyRea)
-
HyRea enables LLMs to autonomously and dynamically switch between "explicit token reasoning" and "implicit embedding reasoning" during inference. By replacing low-entropy CoT steps with continuous embeddings through entropy-guided cold-start SFT, and training the model with GRPO reinforcement learning to learn optimal switching timing, it reduces output tokens by approximately 50% in mathematical reasoning while maintaining near-identical accuracy.
- Learning to Reason via Mixture-of-Thought for Logical Reasoning
-
This paper proposes the Mixture-of-Thought (MoT) framework, allowing a single LLM to learn logical reasoning using three complementary paradigms: natural language, code, and the newly introduced "truth tables." Capabilities across modalities are jointly enhanced through self-evolution training, and fused via majority voting during inference, achieving a performance gain of up to +11.7pp over single Chain-of-Thought baselines on FOLIO/ProofWriter.
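The inference-time fusion step can be sketched as a plain majority vote over the three modalities. The modality keys, the answer strings, and the tie-breaking rule (first-listed modality wins) are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer across reasoning modalities.

    answers: dict mapping modality name -> predicted answer.
    Ties are broken in favor of the earliest-listed modality (an assumption).
    """
    counts = Counter(answers.values())
    best = max(counts.values())
    for answer in answers.values():
        if counts[answer] == best:
            return answer
    return None  # only reached when `answers` is empty


prediction = majority_vote(
    {"natural_language": "True", "code": "False", "truth_table": "True"}
)
print(prediction)
```

With three voters a two-to-one split always yields a clear winner, so the tie-breaking rule only matters when all three modalities disagree.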
- Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
-
By analyzing the training dynamics of RL and SFT across different difficulty levels, this work discovers that RL excels at "getting known problems right" but fails to learn "out-of-syllabus" questions. The authors propose ReLIFT, which dynamically identifies the hardest questions where the model fails all attempts during RL training, collects high-quality CoT solutions online, and interleaves sparse SFT steps. Using significantly less demonstration data and training time, ReLIFT outperforms pure RL/SFT and various hybrid methods by an average of +6.7 points across six reasoning benchmarks.
- Let's Explore Step by Step: Generating Provable Formal Statements with Deductive Exploration
-
This paper proposes DExploration, which transforms mathematical problem synthesis from "one-shot generation" into "incremental deductive exploration in Lean 4." By using three atomic actions (introducing variables/hypotheses, deducing new facts, and submitting conclusions) with step-by-step verification, it generates naturally provable, broad-coverage, and high-difficulty formal statements. Furthermore, an Exploratory Transformation is used to distill exploration trajectories from existing proof data to train the agent, ultimately increasing the success rate from 40.70% to 54.52% and reducing token costs by 83%.
- LEXam: Benchmarking Legal Reasoning on 340 Law Exams
-
LEXam organizes 340 real law school exams from the University of Zurich into 7,537 English-German bilingual questions (open-ended + multiple-choice). It evaluates not just the final answer but also the multi-step legal reasoning process using an expert-calibrated ensemble LLM judge, revealing that current SOTA models still fail significantly in structured legal reasoning.
- LingOly-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
-
The LingOly-TOO benchmark is proposed to disentangle reasoning from knowledge by applying expert-designed grapheme-level permutations to Linguistics Olympiad problems. This obfuscation preserves reasoning logic while eliminating knowledge/memory shortcuts, reducing the top score of 15 frontier models from 0.59 to 0.48 and systematically quantifying the extent to which LLM reasoning capabilities are overestimated due to knowledge effects.
- Linking Process to Outcome: Conditional Reward Modeling for LLM Reasoning
-
CRM models multi-step reasoning as a sequential process of "gradually approaching the correct answer." By leveraging the conditional probability chain rule, it explicitly anchors the process reward of each step to the final outcome. This addresses the issues of missing step-wise dependencies and fuzzy credit assignment, making it more stable and resistant to reward hacking across Best-of-N, beam search, and RL downstream tasks.
- LoC-Decomp: LLM Autoformalization via Logical Concept Decomposition and Iterative Feedback Correction
-
LoC-Decomp utilizes a CoT-like "Logical Concept Decomposition" template to decompose natural language mathematical propositions into modular Lean 4 components. It then employs a "divide-and-conquer back-translation" approach for fine-grained semantic consistency self-checking. By integrating semantic errors and compiler syntax errors into an alternating iterative rectification loop, it elevates the formalization success rate on PutnamBench from the previous SOTA of 75% to 93.09%.
- Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation
-
LAG stores reasoning trajectories from past tasks as logs that "retain only a few tokens, but whose KV values encode the full context." When a new task arrives, these KV values are retrieved and concatenated for direct computation reuse. This allows LLMs to learn from historical experience like humans, simultaneously improving accuracy and efficiency in multi-hop QA and reasoning tasks.
- LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision
-
This paper proposes LogicReward, a reward function that uses the Isabelle theorem prover for step-wise logical correctness verification. By combining Autoformalization with Soft Unification to reduce natural language ambiguity, the trained 8B model outperforms GPT-4o by 11.6% and o4-mini by 2% on NLI and logical reasoning tasks.
- Long Chain-of-Thought Reasoning Across Languages
-
This paper systematically decomposes the cross-lingual transfer of long Chain-of-Thought (long CoT) reasoning capabilities into four development stages: scaling, pre-training, post-training, and inference. It finds that scaling only bridges the gap in "understanding" but not in "reasoning in the target language," and offers a counter-intuitive practical conclusion: translating English reasoning trajectories into the target language for fine-tuning is more effective than direct distillation of target language trajectories.
- MAGO: Beyond Fixed Hyperparameters with Multi-Objective Pareto Optimization for Hybrid LLM Reasoning
-
MAGO reformulates the hybrid reasoning problem—deciding "whether to enable long-chain reasoning"—as a multi-objective optimization problem. By maintaining a Pareto frontier and using correlation-aware dynamic weights, it automatically balances accuracy, efficiency, and decision calibration during training. This eliminates manual hyperparameter tuning and achieves 2.2×–3× token savings during inference with zero extra overhead.
- Making, Not Taking, the Best of N
-
The authors shift the paradigm of LLM output aggregation from "selecting the best one from N candidates" (Best-of-N selection) to "using a fusor model to synthesize the merits of N candidates into a superior answer" (Fusion-of-N synthesis). This approach consistently outperforms Best-of-N selection in both test-time scaling and synthetic data generation, even surpassing the oracle upper bound.
- Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy
-
This paper proposes using "step entropy" to quantify the information contribution of each reasoning step in CoT. It discovers that pruning 80% of the lowest-entropy steps results in almost no accuracy loss. A two-stage SFT+GRPO training pipeline is designed to enable models to autonomously insert [SKIP] tokens during inference, reducing token counts by 16–57% while maintaining or even improving accuracy.
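A minimal sketch of entropy-based step pruning follows. It assumes that a step's entropy is the sum of the Shannon entropies of its tokens' predictive distributions and that pruned steps are replaced by a literal `[SKIP]` marker; the function names and the toy distributions are hypothetical, and the paper's exact definition may differ.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step_entropy(token_dists):
    """Entropy of a reasoning step: sum over its tokens (assumed definition)."""
    return sum(token_entropy(d) for d in token_dists)


def compress(steps, dists_per_step, prune_ratio=0.8):
    """Replace the lowest-entropy share of steps with a [SKIP] marker."""
    entropies = [step_entropy(d) for d in dists_per_step]
    n_prune = int(len(steps) * prune_ratio)
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    pruned = set(order[:n_prune])
    return ["[SKIP]" if i in pruned else s for i, s in enumerate(steps)]


steps = ["s0", "s1", "s2", "s3", "s4"]
# One toy token distribution per step; step 3 is the most uncertain.
dists = [
    [[0.5, 0.5]],
    [[1.0]],
    [[0.9, 0.1]],
    [[0.25, 0.25, 0.25, 0.25]],
    [[0.99, 0.01]],
]
print(compress(steps, dists))
```

Low-entropy steps are those the model generates with near certainty, so under this reading they carry little information and are the first candidates for removal.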
- Mathesis: Towards Formal Theorem Proving from Natural Languages
-
Mathesis systematically bridges the gap from "natural language math problems → formal statements → machine-verifiable proofs" for the first time. The core is an autoformalizer trained using online reinforcement learning (GRPO + Hierarchical Preference Optimization), complemented by the LeanScorer evaluation framework for continuous semantic scoring and the challenging Gaokao-Formal benchmark.
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
-
Drawing on the Fill-in-the-Middle (FIM) paradigm from code completion, this work trains a specialized step expansion model, MathFimer-7B, to insert finer-grained intermediate reasoning steps into existing mathematical solution chains, thereby systematically enhancing the mathematical reasoning capabilities of downstream models.
- MetaMuse: Algorithm Generation via Creative Ideation
-
To address the issue of LLMs being trapped by "availability bias" in classical heuristics (e.g., LRU/LFU) when generating system algorithms, MetaMuse proposes three self-reflection principles: measuring diversity in the performance feedback space, guiding via external stimuli rather than internal randomness, and implementation through waypoint reasoning instead of free-form CoT. This enables LLMs to perform "creative leaps" in discontinuous solution spaces, reducing cache misses by up to 35.76% and bin usage by up to 30.93% on real cloud provider workloads.
- Mode-conditioning unlocks superior test-time compute scaling
-
Addressing the "diversity collapse" problem in parallel sampling—where models collapse into a single reasoning strategy and repeatedly commit the same errors—this paper proposes the Mode-conditioning (ModC) framework. By using expert models or mode prefixes to explicitly distribute test-time compute across different reasoning modes, the framework lifts the Pass@k scaling curves in mathematical reasoning and graph search tasks, achieving approximately a 4× improvement in inference efficiency.
- MoDr: Mixture-of-Depth-Recurrent Transformers for Test-Time Reasoning
-
MoDr decouples the "single-chain" latent-space recurrent module of Depth-Recurrent Transformers (Huginn) into multiple recurrent branches that share a backbone, each with its own LoRA. It employs a hard-gate router without auxiliary loss to dynamically switch branches during each token's generation, significantly improving math and commonsense reasoning accuracy while training \(<0.2\%\) of parameters.
- MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs
-
MolecularIQ is the first fully symbolically verifiable molecular structure reasoning benchmark. All answers are precisely calculated from molecular graphs using RDKit, completely decoupling "true structural understanding" from "memorized molecule-property pairs." It provides a fine-grained view of where 38 LLMs fail across task types, molecular complexity, and representation forms.
- mR3: Multilingual Rubric-Agnostic Reward Reasoning Models
-
The authors propose mR3, a series of multilingual rubric-agnostic reward reasoning models covering 72 languages. Through systematic data construction (GPT-OSS-120B distillation + difficulty filtering) and curriculum learning, the 14B model outperforms the 120B teacher and all comparable baselines on multilingual benchmarks, supporting point-wise, pair-wise, and binary evaluation paradigms.
- \(\nabla\)-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space
-
\(\nabla\)-Reasoner is proposed to upgrade inference-time search from zeroth-order (sampling + evaluation) to first-order (gradient descent). By using Differentiable Text Optimization (DTO) on token logits, it iteratively improves decoding strategies by combining reward gradients with LLM likelihood. It achieves 10-40% accuracy improvements on mathematical reasoning tasks while reducing model calls by 10-40%.
- Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
-
The NRT (Native Reasoning Training) framework is proposed, treating reasoning chains as latent variables. It trains LLM reasoning capabilities using the model's own prediction confidence for reference answers as an intrinsic reward signal, without requiring external verifiers or expert reasoning demonstrations. On Llama-3.1-8B, it achieves an average improvement of 10.2 points across 9 benchmarks (46.0 \(\rightarrow\) 56.2), outperforming RLPR, which requires a verifier, by +5.4 points.
- Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
-
This paper introduces NTP4VC—the first real-world, multi-language (Isabelle/Lean/Rocq) neural theorem proving benchmark targeting the core bottleneck of program verification: "Verification Condition (VC) proving." Using industrial pipelines (Why3/Frama-C), the authors extract 600 VCs from real projects like Linux and Contiki-OS. The results reveal a significant gap: even state-of-the-art LLMs/provers achieve a pass@8 of less than 12%, failing to outperform classic "hammers."
- NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
-
NFT (Negative-aware Fine-Tuning) demonstrates that supervised learning can achieve "verification-driven" self-improvement. By constructing a negative policy implicitly parameterized by the target positive policy for negative samples, it unifies all self-generated answers (correct and incorrect) into maximum likelihood training. Its performance matches or exceeds GRPO/DAPO, and it is theoretically equivalent to the GRPO gradient under strict on-policy conditions.
- Nudging the Boundaries of LLM Reasoning
-
This paper identifies a fundamental limitation of GRPO: it cannot learn from hard problems that the model never solves (pass rate = 0%). It proposes NuRL, which injects self-generated abstract hints (without leaking answers) into hard problems during training to make them learnable. NuRL consistently outperforms GRPO across three models and six benchmarks and raises the pass@k capability ceiling.
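Why a 0% pass rate gives GRPO nothing to learn from can be seen in a minimal sketch of group-relative advantage standardization. The function name and the `eps` value are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails: all advantages are zero, so the policy gradient vanishes.
unsolved = group_relative_advantages([0, 0, 0, 0])

# One rollout succeeds (for example once a hint makes the problem solvable):
# the successful rollout gets a positive advantage, the others a negative one.
mixed = group_relative_advantages([0, 1, 0, 0])
```

With identical rewards in a group, the signal is exactly zero, which is the situation hint injection is meant to avoid.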
- Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectories?
-
This paper introduces the problem of "off-trajectory reasoning": whether multiple reasoning models can collaborate in a relay fashion on the same chain-of-thought. Using a "twin test" setup that evaluates Recoverability and Guidability across 15 open-source reasoning LLMs, the study finds that models with stronger benchmark scores are often more susceptible to interference, and that almost all models fail to leverage correct guidance from stronger models to surpass their own capability ceilings.
- On Code-Induced Reasoning in LLMs
-
This paper employs a data-centric controlled experimental framework (parallel instruction data for 10 programming languages + over ten types of structural/semantic perturbations + 3,331 experiments across 5 model families and 8 scales) to systematically dissect which specific parts of code data assist LLM reasoning. It concludes that the structural skeleton of code—rather than verbose surface details—is crucial; abstractions like pseudocode or flowcharts can equivalently substitute for code, and even corrupted code remains effective as long as surface regularity is preserved.
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
-
This paper proposes the Regularized Policy Gradient (RPG) framework, which systematically derives and analyzes policy gradient methods based on Forward/Reverse KL divergence (in both normalized and unnormalized forms). The study identifies a theoretical inconsistency in the KL term of GRPO and achieves superior results compared to GRPO, REINFORCE++, and DAPO on mathematical reasoning tasks.
- On The Fragility of Benchmark Contamination Detection in Reasoning Models
-
A systematic study reveals that benchmark contamination detection in Large Reasoning Models (LRMs) is extremely fragile: contamination introduced during the SFT stage nearly disappears after GRPO training (PPO-style importance sampling/clipping is identified as the root cause), while directly applying CoT SFT contamination to advanced LRMs leaves almost no detectable traces. All 10 existing detection methods perform close to random guessing in these scenarios.
- On the Reasoning Abilities of Masked Diffusion Language Models
-
This paper provides the first formal characterization of the reasoning capabilities of Masked Diffusion Language Models (MDM). It proves that MDM in a finite-precision logarithmic-width setting is strictly equivalent to "Padded Looping Transformers (PLT)," capable of simulating all problems solvable by Chain-of-Thought (CoT), while being strictly more efficient than CoT on parallelizable problems (e.g., regular languages)—revealing the "Sequentiality Bottleneck" of CoT.
- On the Thinking-Language Modeling Gap in Large Language Models
-
This paper uses a Structural Causal Model (SCM) to characterize the process of "LLMs learning to think from human language," pointing out that language is merely a vehicle for knowledge rather than thought itself. Consequently, expression habits in training data inject biases into models—LLMs ignore critical information when it appears as "implicit expressions." A prompt-level intervention called LoT (observe / expand / echo) is proposed to mitigate this bias across 11 tasks and 4 representative LLMs.
- Once-More: Continuous Self-Correction for Large Language Models via Perplexity-Guided Intervention
-
Once-More is a training-free, model-agnostic inference-time self-correction framework. It calculates real-time perplexity by "units" (sentences/formulas/code blocks) during generation, triggering Verifier checks for high-uncertainty units. Rejected units are regenerated using "feedback + perplexity-guided logit redistribution," correcting the generation trajectory before errors propagate. It outperforms representative self-correction methods like Self-Refine and CRITIC on multiple reasoning benchmarks including AIME, GPQA, and LiveBench.
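The perplexity trigger can be sketched as follows. This is a minimal illustration assuming each unit comes with the log-probabilities of its tokens. The `threshold` value and the function names are hypothetical, and the verifier and regeneration steps are omitted.

```python
import math

def unit_perplexity(token_logprobs):
    """Perplexity of one unit: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def units_to_verify(units, threshold):
    """Return the text of every unit whose perplexity exceeds `threshold`.

    units: list of (text, token_logprobs) pairs, one per sentence/formula/code block.
    threshold: assumed hyperparameter, not a value reported in the summary above.
    """
    return [text for text, logprobs in units if unit_perplexity(logprobs) > threshold]

units = [
    ("x = 3 + 4", [-0.1, -0.2, -0.1]),   # confident unit
    ("so x = 8", [-1.5, -2.0, -1.0]),    # uncertain unit
]
flagged = units_to_verify(units, threshold=2.0)
```

Only flagged units would be passed to a verifier, which keeps the check cheap for confident spans.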
- OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
-
OpenEstimate is a benchmark that tasks frontier LLMs with "estimating probability distributions from internal knowledge" using real-world data. By randomly slicing public observational datasets to generate 178 derived conditional statistics as ground truths, the benchmark requires models to express their beliefs as Bayesian priors. Results show that priors from six frontier models are worth approximately "5 samples from the true distribution," and their confidence is largely uncorrelated with accuracy.
- OpenThoughts: Data Recipes for Reasoning Models
-
The authors decompose the creation of "Reasoning Model SFT Data" into six pipeline stages and conduct over 1,000 controlled ablation experiments. They derive a simple yet counter-intuitive data recipe (high-quality sources + LLM difficulty/length filtering + 16x answer sampling per question + skipping answer verification + using the weaker QwQ-32B as the teacher). Using this, they produced the OpenThoughts3-1.2M dataset and trained OpenThinker3-7B, which outperforms R1-Distill-7B by 15.3/17.2/20.5 percentage points on AIME25, LiveCodeBench, and GPQA respectively, achieving state-of-the-art among open-source models of the same scale.
- Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
-
This paper demonstrates through MAP estimation that the optimal combination of LLM majority consensus and PRM scoring is equivalent to a weighted majority vote. It reveals that optimal weights are highly dependent on the specific LLM-PRM combination and should assign negative weights to low-scoring responses. Based on this, several low-cost offline calibration methods are proposed to approximate this weight function, outperforming vanilla weighted voting while using only approximately 21.3% of the compute.
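A weighted majority vote with possibly negative weights can be sketched as below. The log-odds weight function is an illustrative assumption chosen only because it turns negative for low PRM scores. It is not the calibrated weight function from the paper.

```python
import math
from collections import defaultdict

def weighted_majority_vote(answers, prm_scores, weight_fn):
    """Return the answer whose responses accumulate the largest total weight."""
    totals = defaultdict(float)
    for answer, score in zip(answers, prm_scores):
        totals[answer] += weight_fn(score)
    return max(totals, key=totals.get)

def log_odds_weight(score, eps=1e-6):
    """Hypothetical weight: log-odds of the PRM score (negative below 0.5)."""
    s = min(max(score, eps), 1 - eps)
    return math.log(s / (1 - s))

answers = ["42", "42", "42", "17", "17"]
prm_scores = [0.2, 0.3, 0.25, 0.9, 0.6]

plain = weighted_majority_vote(answers, prm_scores, lambda s: 1.0)
weighted = weighted_majority_vote(answers, prm_scores, log_odds_weight)
```

In this toy case the unweighted vote picks the more frequent answer, while the weighted vote lets three low-scoring responses count against it.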
- OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
-
This paper proposes OptimalThinkingBench, a unified benchmark that simultaneously measures "overthinking" in LLMs on simple tasks (generating hundreds of thinking tokens without improving accuracy) and "underthinking" on difficult tasks. By combining a thinking-adjusted accuracy metric with F1 scores, the benchmark provides a single representative value. Evaluations of 33 models reveal that no current model excels at both ends, and existing efficiency-improving methods often resolve one issue only to exacerbate the other.
- OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research
-
For Operations Research (OR) modeling tasks, the authors observed that over 30% of labels in existing OR datasets are severely incorrect, rendering directly trained PRMs nearly ineffective. They first cleaned seed data using a three-stage verification, then constructed the first OR-ProcessQA dataset with step-level correctness labels using MCTS and GPT-4o. This enabled the training of the first generative process reward model for OR, OR-PRM, which improves base model performance by approximately 12.5% on average in a Best-of-N setting.
- Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
-
This paper theoretically reveals two fundamental flaws of existing length-penalty methods: they incorrectly punish high-entropy exploration tokens and incorrectly reward redundant tokens. It proposes the DeCS framework, which uses decoupled token-level rewards and curriculum batch scheduling to reduce inference tokens by more than 50% across 7 benchmarks while maintaining or even improving model performance.
- PEAR: Phase Entropy Aware Reward for Efficient Reasoning
-
This paper discovers that token entropy in Large Reasoning Models (LRMs) positively correlates with response length, and entropy during the "thinking phase" is significantly higher than in the "final answer phase." Based on this, PEAR is proposed—a reward mechanism that incorporates phase-aware entropy into Group Relative Policy Optimization (GRPO). By penalizing excessive entropy in the thinking phase while maintaining adequate exploration in the answer phase, PEAR reduces response length by 32%–57% across six benchmarks with negligible accuracy loss (<1%) and strong robustness to out-of-distribution (OOD) tasks.
- PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning
-
PERK reformulates long-context reasoning as "test-time learning": instead of cramming ultra-long text into the context window during inference, it uses gradient descent to "write" the context into a LoRA adapter, allowing the model to recall and reason from this parameterized memory. Combined with a bi-level meta-learning framework and truncated gradient unrolling, a 0.5B Qwen model achieves a ~20% average improvement in long-context reasoning over same-scale in-context fine-tuning baselines, outperforming specialized 7B+ long-context models.
- Plan-Answer-Refine-on-Graph: Structured Planning and Self-Refinement for Large Language Model Reasoning on Knowledge Graphs
-
PARoG trains a small planner using SPARQL queries as supervision signals to decompose complex questions into composable structured sub-goals. It utilizes a "Plan-Answer-Refine" loop where the LLM first attempts to answer using parametric knowledge and subsequently corrects errors using knowledge graph evidence. This approach significantly outperforms state-of-the-art methods such as PoG on WebQSP, CWQ, and GrailQA, with especially large gains on complex logical queries involving conjunction, comparison, and superlatives.
- Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning LLMs
-
The Plan-and-Budget framework is proposed to achieve efficient test-time scaling for reasoning LLMs by decomposing complex queries into sub-problems and adaptively allocating token budgets based on estimated complexity—achieving up to 70% higher accuracy, 39% fewer tokens, and a 193.8% improvement in the E3 metric.
- Predicting LLM Reasoning Performance with Small Proxy Model
-
The paper proposes rBridge, which utilizes reasoning traces from frontier models as gold labels and performs token-level task-aligned weighted NLL. This enables small models (\(\le\)1B) to effectively predict the reasoning performance of large models (13B-32B), achieving over 100\(\times\) computational savings in dataset ranking tasks.
- Premise Selection for a Lean Hammer
-
This paper proposes the neural premise selector LeanPremise (a sentence encoder trained via contrastive learning) and integrates it with Aesop / Lean-auto / Duper to create LeanHammer, the first end-to-end general-purpose hammer tool for Lean. It proves 21% more theorems than existing premise selectors and generalizes to libraries unseen during training.
- Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion
-
This paper identifies that distilled small models "amplify" generalization defects (memorizing patterns while failing upon directional shifts). It proposes "Explanatory Inversion" to generate probes that compel students to clarify underlying logic, followed by reinforcement refinement using ExGRPO with a "Dialogic Structure Utility Reward" to organize these probes into multi-turn dialogues. Across 12 datasets, Gemma-7B achieves an average gain of 20.39% over zero-shot and 6.02% over the strongest distillation baseline.
- Process-Verified Reinforcement Learning for Theorem Proving via Lean
-
This paper treats the Lean proof assistant as a "symbolic process oracle," extracting both outcome-level and tactic-level (process) verifiable rewards from its elaboration feedback. By integrating first-error propagation and first-token credit assignment into GRPO, it makes RL for formal theorem proving on MiniF2F / ProofNet more stable and effective than baselines that use only binary outcome rewards (MiniF2F pass@64 +2.5 percentage points).
- ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
-
ProofBridge unifies the "NL theorem+proof \(\rightarrow\) Lean 4 theorem+proof" formalization task. It first trains a joint embedding model that aligns NL and Lean proofs (encoded via DAG structures) into a shared semantic space. This model performs cross-modal retrieval of similar Lean proofs as demonstrations for retrieval-augmented fine-tuning and inference. An iterative repair loop, driven by Lean type-checking and semantic equivalence judging, further refines the output. On the self-constructed MINIF2F-TEST-PF dataset, it achieves a +31.14% semantic accuracy improvement over the Kimina-Prover-RL-1.7B baseline.
- ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization
-
ProofFlow decomposes a natural language proof into a Directed Acyclic Graph (DAG) characterizing step dependencies, and then formalizes each step into a high-level Lean 4 lemma with explicit dependencies. This preserves the logical structure of the original argument beyond mere "syntactic correctness." The authors also propose a comprehensive evaluation metric, PROOFSCORE, and a university-level benchmark, PROOFFLOWBENCH (184 problems), pushing autoformalization quality from a baseline of 0.279 to 0.545.
- ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
-
ProofOptimizer is the first 7B language model capable of shortening Lean proofs without any human simplification demonstrations. Utilizing a toolkit consisting of a "symbolic linter + 7B model + iterative simplification," it relies on the Lean compiler for automatic verification and uses expert iteration and RL for bootstrapping. It compresses lengthy proofs generated by SOTA neural provers by an average of 87% on miniF2F, 57% on PutnamBench, and 50% on IMO 2025 proofs from Seed-Prover. Furthermore, simplified proofs compile faster and can improve prover performance when reused as training data.
- Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
-
This paper proposes LCPO (Length Controlled Preference Optimization), which uses only 0.8k preference samples and 50 training steps. By performing offline alignment using pure length preferences—selecting easy problems the model can already solve, treating the shortest response as "chosen" and the longest as "rejected"—it reduces the average output length of DeepSeek-R1-Distill reasoning models by over 50% with almost no loss in accuracy.
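The pair construction can be sketched as below, under two labeled assumptions: a problem counts as "already solved" when every sampled response is correct, and length is measured in characters rather than tokens. The function name and data layout are hypothetical.

```python
def build_length_pairs(samples):
    """Build (chosen, rejected) preference pairs from pure length preferences.

    samples: dict mapping prompt -> list of (response_text, is_correct).
    Assumption: a prompt is kept only if all of its sampled responses are correct.
    """
    pairs = []
    for prompt, responses in samples.items():
        if not responses or not all(ok for _, ok in responses):
            continue  # skip problems the model does not reliably solve
        texts = [text for text, _ in responses]
        chosen = min(texts, key=len)     # shortest response
        rejected = max(texts, key=len)   # longest response
        if len(chosen) < len(rejected):  # skip degenerate equal-length groups
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

samples = {
    "2+2?": [("4", True), ("Let me think... 2+2 is 4", True)],
    "hard one": [("I am not sure", False), ("7", True)],
}
pairs = build_length_pairs(samples)
```

The resulting pairs could then be fed to any offline preference-optimization trainer.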
- Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
-
To address the lack of long-reasoning models for mid-resource languages (Korean), this paper proposes Language-Mixed CoT—using English as a "logic anchor" for reasoning while retaining key Korean terminology. Combined with 5.79M self-collected native Korean prompts and high-yield subset distillation, the authors trained KO-REAson-35B using only SFT, achieving a top average score of 64.0 across nine Korean benchmarks, with an average improvement of +18.6 points for smaller models.
- Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
-
This paper replaces the "intra-group mean" baseline used in value-free RL (GRPO/DAPO) with a "group-wise K-quantile" baseline (QAE). By using a hyperparameter \(K\) to reward rare correct answers on hard problems and punish residual errors on easy ones, it is proven that this approach simultaneously prevents entropy collapse and entropy explosion, consistently improving pass@1 on AIME/AMC mathematical reasoning tasks.
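With binary rewards, the effect of a quantile baseline can be sketched as follows. This is a simplified illustration that omits any normalization. The empirical-quantile convention and the function names are assumptions, not details confirmed by the summary above.

```python
import math

def quantile_baseline(rewards, k):
    """Empirical K-quantile: smallest reward v with at least a fraction k of rewards <= v."""
    ordered = sorted(rewards)
    index = max(0, math.ceil(k * len(ordered)) - 1)
    return ordered[index]

def qae_advantages(rewards, k):
    """Advantage of each rollout relative to the group-wise K-quantile baseline."""
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]

# Hard problem (1 of 8 correct): the baseline is 0, so only the rare success is rewarded.
hard = qae_advantages([0, 0, 0, 0, 0, 0, 0, 1], k=0.4)

# Easy problem (7 of 8 correct): the baseline is 1, so only the residual error is punished.
easy = qae_advantages([1, 1, 1, 1, 1, 1, 1, 0], k=0.4)
```

In both cases most rollouts receive zero advantage, so the update concentrates on the informative minority.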
- R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?
-
This paper introduces R-HORIZON: by chaining independent problems through "answer dependency" into a strictly sequential long-range chain, the authors create a benchmark that stresses current state-of-the-art reasoning models. Furthermore, feeding these composed data into RLVR training significantly improves multi-problem solving capabilities and even boosts single-problem performance (AIME2024 +7.5).
- Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
-
The authors observe that RLVR for mathematical reasoning corresponds to a simplified MDP with "deterministic transitions + tree structure + binary terminal rewards." In this structure, evaluating the Q-values of a fixed uniform random policy followed by softmax sampling bypasses the "evaluation-improvement" cycles and heuristic tricks of PPO/GRPO, achieving a reasoning policy that is both high-quality (pass@1 +8.2, pass@256 +16.8) and highly diverse (+20.5%).
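On a toy tree MDP the idea can be sketched as below. The nested-dict tree encoding, the temperature value, and the function names are illustrative assumptions. In practice the Q-values would be estimated by a model rather than enumerated.

```python
import math

def value(node):
    """Expected terminal reward under a uniform random policy.

    A node is either a dict of action -> child, or a terminal reward (0 or 1).
    """
    if not isinstance(node, dict):
        return float(node)
    return sum(value(child) for child in node.values()) / len(node)

def q_values(node):
    """Q-value of each action at `node` under the uniform random policy."""
    return {action: value(child) for action, child in node.items()}

def softmax_policy(q, temperature=0.5):
    """Sampling distribution proportional to exp(Q / temperature)."""
    exps = {action: math.exp(v / temperature) for action, v in q.items()}
    total = sum(exps.values())
    return {action: e / total for action, e in exps.items()}

tree = {"a": {"x": 1, "y": 0}, "b": {"x": 0, "y": 0}}
q = q_values(tree)
policy = softmax_policy(q)
```

Because the evaluated policy is fixed, no alternating evaluation-improvement loop is needed in this sketch.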
- Reasoning Scaffolding: Distilling the Flow of Thought from LLMs
-
This paper proposes Reasoning Scaffolding, which moves beyond verbatim cloning of teacher rationales. It abstracts the teacher's long chain-of-thought into a sequence of discrete, interpretable "semantic signals" (e.g., contrast, supplement, conclusion) acting as a scaffold. Small models are trained with a dual-task objective: "predict the next signal + generate the next step guided by the signal," transferring the algorithmic structure of reasoning rather than surface text. This method significantly outperforms existing distillation approaches in accuracy and logical consistency on benchmarks like GSM8K and StrategyQA.
- Reasoning with Sampling: Your Base Model is Smarter Than You Think
-
This paper proposes a training-free, dataset-free, and verifier-free test-time sampling algorithm: using MCMC (Metropolis-Hastings) to approximately sample from the "power distribution" \(p^\alpha\) of the base model's own likelihood. On single-sample reasoning tasks such as MATH500, HumanEval, GPQA, and AlpacaEval, the performance of the base model is brought to a level comparable to or even better than GRPO (RL post-training), without losing multi-sample (pass@k) diversity.
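Sampling from a power distribution with Metropolis-Hastings can be sketched on a toy discrete distribution. This sketch assumes whole-item independent proposals drawn from the base distribution itself, which is a simplification. The paper works on token sequences, and all names here are hypothetical.

```python
import random

def power_sample_mh(base_probs, alpha, steps, seed=0):
    """Approximate sampling from p^alpha (normalized) using proposals drawn from p.

    With proposal q = p and target proportional to p^alpha, the
    Metropolis-Hastings acceptance ratio simplifies to
    (p(new) / p(old)) ** (alpha - 1).
    """
    rng = random.Random(seed)
    items = list(base_probs)
    weights = [base_probs[x] for x in items]
    current = rng.choices(items, weights)[0]
    visits = {x: 0 for x in items}
    for _ in range(steps):
        proposal = rng.choices(items, weights)[0]
        ratio = (base_probs[proposal] / base_probs[current]) ** (alpha - 1)
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {x: n / steps for x, n in visits.items()}

base = {"a": 0.5, "b": 0.3, "c": 0.2}
freqs = power_sample_mh(base, alpha=4.0, steps=20000)
```

Raising the base probabilities to a power greater than 1 sharpens the distribution toward its high-likelihood outputs, which is the effect the sampler exploits.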
- Rectifying LLM Thought from Lens of Optimization
-
This paper analogizes the reasoning process of long Chain-of-Thought (CoT) to a "gradient descent" process and proposes REPRO. Using the model's log-likelihood of the correct answer as a proxy objective function, it synthesizes process-level rewards from two scores (Magnitude and Stability) along the reasoning trajectory. These rewards are integrated into RLVR training, consistently improving reasoning accuracy across math, science, and code benchmarks while significantly compressing "overthinking" redundancy.
- Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
-
For instructional molecular optimization tasks where "each data point provides only one optimized reference molecule without intermediate reasoning trajectories," this paper proposes RePO. Based on GRPO-style reinforcement learning with verifiable rewards, it prepends a "reference-guided term" that acts only on answer tokens. This anchors the output to the reference molecule while allowing the model to freely explore the chemical editing space, thereby alleviating early reward sparsity and significantly improving the "Success Rate × Similarity" metric.
- ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
-
ReForm is proposed as a reflective autoformalization paradigm that transforms the process of translating natural language mathematics problems into Lean formal statements from a single-pass generation into a "generation → semantic self-verification → correction" iterative cycle. It utilizes the PBSO algorithm to optimize heterogeneous reward signals, achieving an average improvement of 22.6 percentage points over the strongest baselines across four benchmarks.
- Reinforcing General Reasoning without Verifiers
-
This paper proposes VeriFree—a DeepSeek-R1-Zero-style reinforcement learning method that requires no verifier. Instead of judging the correctness of an answer, it directly maximizes the probability of the reference answer being generated conditioned on the model's self-generated reasoning chain. Strictly derived from the RL objective, this approach extends R1-Zero training from mathematical and code domains to general reasoning fields where rule-based scoring is difficult (e.g., chemistry, medicine, law). VeriFree achieves performance comparable to or exceeding verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA.
- RESTRAIN: From Spurious Votes to Signals — Self-Training RL with Self-Penalization
-
RESTRAIN transforms the problem of "missing golden labels" into training signals. By adding triple self-penalization mechanisms—pseudo-label weighting, negative rollout penalization, and prompt-level weighting—onto GRPO, the model avoids blindly trusting majority votes. This pushes the average Pass@1 of Qwen3-4B on label-free data to 51.0%, nearly matching the upper bound of GRPO trained with golden labels (51.4%).
- Rethinking LLM Reasoning: From Explicit Trajectories to Latent Representations
-
Addressing the "overthinking" problem in slow-thinking reasoning models that generate thousands of tokens, this paper empirically finds that reasoning trajectories are highly redundant (randomly deleting 50% of tokens results in only a 2-point accuracy drop). It proposes Latent Reasoning Tuning (LRT), which uses a lightweight reasoning network \(G_\phi\) to map inputs into fixed-length implicit latent reasoning tokens via a single forward pass, replacing autoregressive explicit chains. LRT consistently outperforms existing efficient reasoning methods on mathematical and cross-domain benchmarks and surpasses the non-thinking mode of Qwen3.
- Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts
-
This work decomposes past reasoning processes into reusable "thought steps" stored in a thought graph. During inference, relevant steps are retrieved and dynamically assembled into problem-specific templates guided by rewards. These templates are injected into the `<think>` tag to guide generation, reducing output tokens by up to 40%, latency by 82%, and costs by 59% with almost no loss in accuracy.
- Reverse-Engineered Reasoning for Open-Ended Generation
-
To address the challenge that "deep reasoning is difficult to implement in open-ended creative tasks," this paper proposes REER (Reverse-Engineered Reasoning). Instead of forward-constructing reasoning processes through Reinforcement Learning (RL) trial-and-error or distillation, it "back-deduces" implicit chains of thought (CoT) from existing high-quality answers. Using perplexity as a quality proxy and gradient-free local search, it synthesizes 20,000 deep reasoning trajectories (DeepWriting-20K). The resulting 8B model, DeepWriter, matches or exceeds GPT-4o and Claude 3.5 on writing benchmarks.
- RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
-
The authors propose a formal definition of Reasoning Faithfulness (Stance Consistency + Causal Influence) and construct the RFEval benchmark comprising 7,186 instances across 7 tasks. By evaluating 12 open-source Large Reasoning Models (LRMs) via output-layer counterfactual reasoning intervention, the study finds that 49.7% of outputs are unfaithful, RL post-training reduces faithfulness, and accuracy is not a reliable proxy for faithfulness.
- RL of Thoughts: Navigating LLM Reasoning with Inference-Time Reinforcement Learning
-
RLoT models the multi-step reasoning of LLMs as a Markov Decision Process (MDP), training a "navigator" with fewer than \(3\text{K}\) parameters using reinforcement learning. During inference, this navigator dynamically selects and concatenates five cognitively-inspired "basic logic blocks" based on the current state, generating a task-specific logical structure on the fly. It achieves a maximum gain of \(13.4\%\) on benchmarks such as AIME, MATH, and GPQA, enabling sub-10B models to approach the performance of models \(10\times\) their size.
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
-
This paper introduces "reasoning abstractions"—reusable segments of procedural or factual knowledge written in natural language—and designs RLAD, a two-player RL paradigm. By jointly training an "abstraction generator" and an "abstraction-conditional solution generator," the model learns to propose abstractions before solving problems. This approach achieves a 44% average improvement over pure long-chain-of-thought RL (DAPO) on AIME 2025.
- ROC-n-Reroll: How Verifier Imperfection Affects Test-Time Scaling
-
This paper utilizes the classical ROC curve to provide a precise theoretical characterization of how Best-of-N and Rejection Sampling scale under imperfect verifiers. It proves two counter-intuitive conclusions: Rejection Sampling outperforms Best-of-N at fixed compute budgets, and high-compute performance cannot be extrapolated from low-compute observations.
- Sample Lottery: Unsupervised Discovery of Critical Instances for LLM Reasoning
-
This paper proposes the "Lottery Sample Hypothesis"—asserting that a minimal subset exists within the RLVR training set which, when used for training alone, can approximate the performance of the full dataset. The authors design an unsupervised selection framework, CONST, which characterizes the potential value of each problem using "Process Volatility + Outcome Volatility." By using the size of the conformal prediction set as a filtering criterion, the approach achieves reasoning performance close to that of the full dataset while labeling and training on < 0.5% of samples, outperforming various baselines by an average of 10.97%.
- Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
-
This paper argues that the intuition "low confidence steps = worth more exploration" is incorrect for reasoning tasks. It proposes that decoding truncation should be calibrated based on token "correctness" rather than "probability": specifically, by reverting to greedy (Greedy-Threshold) when confidence is extremely low and using a training-free calibration grid to map probabilities to correctness for dynamic truncation (Calibrated-TopK / Calibrated-\(\varepsilon\)). This approach yields stable gains across several reasoning benchmarks, with AIME improving by up to approximately 6%.
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
-
The Scaf-GRPO framework is proposed to overcome the "learning cliff" (zero-reward) problem in GRPO training through hierarchical in-prompt hint injection (Knowledge \(\to\) Planning \(\to\) Solution). It achieves a 44.3% relative improvement in pass@1 on AIME24 using Qwen2.5-Math-7B while maintaining on-policy consistency.
- Elastic Reasoning: Scalable Chain-of-Thought via Elastic Reasoning
-
This paper proposes Elastic Reasoning: explicitly splitting reasoning outputs into a "thought segment" and a "solution segment" with separate token budgets, combined with a budget-constrained rollout (integrated into GRPO) that trains the model to "answer correctly even when thinking is truncated." This allows large reasoning models to provide complete solutions stably under strict token budgets—with training costs at a fraction of L1, while making reasoning shorter and more efficient even without budget limits.
- Scaling Generalist Data-Analytic Agents
-
This paper proposes DataMind, a comprehensive training scheme for data-analysis agents. It combines fine-grained task classification with recursive difficulty synthesis for diverse query generation, knowledge-enhanced trajectory sampling with self-consistency filtering for quality assurance, a dynamic hybrid SFT+RL training strategy, and a memory-friendly asynchronous rollout framework. The resulting DataMind-14B achieves SOTA with a 71.16% average score across multiple benchmarks, surpassing GPT-5 and DeepSeek-V3.1.
- SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
-
This paper proposes SceneCOT, the first framework to introduce Chain-of-Thought reasoning into 3D scene understanding. By employing a four-stage reasoning pipeline (task recognition → region localization → entity grounding → grounded reasoning), it explicitly associates intermediate reasoning steps with visual grounding, achieving 34.7% Good Coherence on Beacon3D (a relative gain of about 70% over the strongest baseline's 20.4%).
- SCI-Verifier: Scientific Verifier with Thinking
-
Addressing the pain point where scientific reasoning answers possess diverse forms making equivalence judgment difficult, this work tackles the problem from both data and modeling sides: constructing SCI-VerifyBench, an interdisciplinary verification benchmark with equivalent transformations covering five subjects (Math, Physics, Chemistry, Biology, and General QA), and post-training a verifier SCI-Verifier with "concise thinking" via SFT+RL. The 8B version matches the performance of the closed-source SOTA model GPT-5 on scientific verification tasks.
- SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
-
This paper proposes the SealQA challenge benchmark (comprising Seal-0/Seal-Hard/LongSeal variants), where each question is carefully designed by NLP researchers to trigger ambiguity, conflict, or noisy search results. GPT-5 achieves a maximum accuracy of only 43.2%, revealing that test-time scaling fails to produce reliable gains under noisy retrieval.
- Segment-Level Attribution for Selective Learning of Long Reasoning Traces
-
Integrated Gradients are used to calculate the attribution strength and directional consistency of each segment within long reasoning chains to identify important segments for selective SFT. This approach improves accuracy by up to 4.7% while shortening output by 18% compared to full CoT training.
- Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens
-
This paper interprets reasoning tasks such as Sudoku, Maze, and ARC as latent variable constraint satisfaction problems under a causal selection mechanism. It proposes SR2, which iteratively corrects latent representations via reflective representation learning, dependency self-refinement, and periodic intermediate alignment, significantly improving structured reasoning accuracy with fewer parameters.
- Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
-
For open-ended tasks like translation and summarization, where "answers do not match literally and lack verifiable rewards," this paper proposes Semantic Voting: using a lightweight sentence vector model to calculate pairwise semantic similarity among several self-sampled candidates. Each candidate is assigned an "alignment score with consensus," and the highest/lowest scoring pairs are selected for DPO training. This process bypasses LLM self-evaluation entirely, achieving stable self-improvement with a computational cost that is two to three orders of magnitude lower than self-evaluation methods.
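The voting step can be sketched as follows, assuming candidate embeddings are already available as plain vectors. The toy 2-D vectors stand in for the output of a sentence-embedding model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def consensus_scores(embeddings):
    """Mean similarity of each candidate to all other candidates."""
    n = len(embeddings)
    return [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def select_dpo_pair(embeddings):
    """Indices of the highest-scoring (chosen) and lowest-scoring (rejected) candidates."""
    scores = consensus_scores(embeddings)
    chosen = max(range(len(scores)), key=scores.__getitem__)
    rejected = min(range(len(scores)), key=scores.__getitem__)
    return chosen, rejected

# Toy embeddings: candidates 0 and 1 agree with each other, candidate 2 is an outlier.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chosen, rejected = select_dpo_pair(embeddings)
```

No LLM call is needed to score the candidates, which is where the cost saving over self-evaluation comes from.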
- ShinkaEvolve: Towards Open-Ended and Sample-Efficient Program Evolution
-
ShinkaEvolve utilizes a "Parent Weighted Sampling + Code Novelty Rejection Sampling + Bandit-style LLM Ensemble Selection" triad to compress LLM-driven program evolution from thousands of evaluations to just 150. It achieves state-of-the-art results across four domains: circle packing, AIME agent scaffolding, ALE-Bench competitive programming, and MoE load-balancing loss.
- SIM-CoT: Supervised Implicit Chain-of-Thought
-
SIM-CoT identifies that implicit Chain-of-Thought (CoT) suffers from latent representation collapse when increasing reasoning tokens due to a lack of fine-grained supervision. It introduces a "disposable" auxiliary decoder during training to align each implicit latent with its corresponding explicit reasoning step. This stabilizes training and enriches semantics, improving Coconut by +8.2% on GPT-2 and allowing implicit CoT to outperform explicit CoT for the first time, all while adding zero overhead during inference.
- SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
-
SimpleTIR discovers that the root cause of RL training collapse in multi-turn Tool-Integrated Reasoning (TIR) is the accumulation of low-probability tokens introduced by tool feedback. It proposes a plug-and-play trajectory filtering strategy—discarding entire trajectories containing a "void turn"—to stabilize gradients, raising the AIME24 score of the Qwen2.5-7B base model from a text-only baseline of 22.1 to 50.5.
- SkillFactory: Self-Distillation for Learning Cognitive Behaviors
-
SkillFactory utilizes correct and incorrect solutions sampled from the base model itself, combined with self-reflection, to rearrange them into "silver" trajectories with labels such as `<sample>`, `<reflect>`, and `<verdict>` for SFT. This pre-installs cognitive skills like "verification-retry" into the model before applying GRPO reinforcement. Without relying on a stronger teacher model, the post-RL model demonstrates enhanced performance on difficult task variants and cross-domain tasks, while exhibiting greater resistance to catastrophic forgetting.
- SLM-MUX: Orchestrating Small Language Models for Reasoning
-
This paper finds that "discussion-based" orchestration methods are ineffective and even detrimental for Small Language Models (SLMs). Instead, it proposes SLM-MUX, a training-free orchestration framework without text interaction. Each SLM samples independently, and final answers are selected based on self-consistency confidence. Combined with model selection search and test-time scaling strategies, SLM-MUX outperforms Qwen2.5-72B on GPQA/GSM8K using only two SLMs.
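A minimal sketch of confidence-based selection, assuming that "self-consistency confidence" means the share of a model's samples that agree with its own majority answer. The model names, the helper names, and the keep-the-first-model tie rule are illustrative assumptions, not details from the paper.

```python
from collections import Counter


def self_consistency(samples):
    """Return (majority_answer, confidence) for one model's sampled answers."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def mux_select(samples_by_model):
    """Pick the answer of the model whose samples agree with each other most.

    Models never see each other's text; only the final answers are compared.
    Ties keep the earlier model (an arbitrary choice for this sketch).
    """
    best = None
    for model, samples in samples_by_model.items():
        answer, confidence = self_consistency(samples)
        if best is None or confidence > best[2]:
            best = (model, answer, confidence)
    return best


# Hypothetical samples from two small models on the same question.
samples_by_model = {
    "slm_a": ["42", "42", "41", "42"],
    "slm_b": ["17", "42", "23", "17"],
}
picked = mux_select(samples_by_model)
```

Here `slm_a` agrees with itself on three of four samples, so its majority answer is returned.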
- Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
-
This paper proposes SFPO (Slow-Fast Policy Optimization), which decomposes each training step into a three-phase structure: "Fast Trajectory → Reposition → Slow Correction." Without modifying the objective function or rollout process, it enhances the stability and sample efficiency of GRPO in a plug-and-play manner, achieving an average improvement of up to 2.80 points on mathematical reasoning benchmarks and reducing rollouts by up to 4.93×.
- Smarter Not Harder: Generative Process Evaluation with Intrinsic-Signal Driving and Ability-Adaptive Reward Shaping
-
Generative Process Reward Models (GenPRM) face three major pitfalls in Reinforcement Learning (RL): scoring that depends on reasoning ability, dense step rewards that trigger reward hacking, and static rewards that suppress exploration. This paper addresses them in three ways: it uses intrinsic semantic signals (reflection/matching) in reasoning trajectories to determine correctness, merges consecutive steps with the same correctness into "thoughts" before awarding rewards, and adaptively scales rewards based on current difficulty. Integrated into process-supervised GRPO as TP-GRPO, the approach outperforms outcome-only GRPO on 1.5B/7B models using 5× fewer samples.
- Soft Tokens, Hard Truths
-
This paper proposes a soft/fuzzy token method that uses Reinforcement Learning (RL) training on continuous CoT embeddings with added noise, without requiring discrete CoT annotations. It maintains nearly identical pass@1 performance to discrete CoT in mathematical reasoning while significantly improving pass@32 diversity and out-of-distribution (OOD) capability preservation.
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
-
This paper proposes CAPO (Curvature-Aware Policy Optimization), which stabilizes training under aggressive hyperparameters (5× learning rate, 1/12 batch size) by modeling second-order geometry in the final LM head layer to predict and filter token updates that lead to policy collapse. It achieves a 30× improvement in sample efficiency on MATH compared to standard GRPO.
- STAT: Skill-Targeted Adaptive Training
-
STAT uses a stronger LLM as a "teacher" to diagnose exactly which skills a student model lacks in mathematics, then reweights or synthesizes training data for SFT. This allows small models already "saturated" on MATH to continue improving (+7.5% max on MATH, +4.6% avg OOD) and shows additive benefits when combined with subsequent GRPO reinforcement learning.
- StepORLM: A Self-Evolving Framework with Generative Process Supervision for Operations Research Language Models
-
StepORLM enables an 8B policy model and a Generative Process Reward Model (GenPRM) to refine each other in a self-evolving loop: each modeling trajectory sampled by the policy receives dual feedback from "solver result verification" and "GenPRM global process critique." The policy is aligned via weighted DPO (W-DPO) and the GenPRM is refined via SFT, achieving SOTA results across six OR benchmarks using a small model. The co-evolved GenPRM also serves as a general-purpose inference-time verifier.
- Strategic Scaling of Test-Time Compute: A Bandit Learning Approach
-
The authors model "allocating test-time compute budget across a batch of queries" as a fully adaptive pure exploration bandit problem. They propose an elimination algorithm that "estimates difficulty while sampling and eliminates once correct," prioritizing compute for hard problems under a fixed budget. They theoretically prove this is more efficient than uniform allocation, with empirical accuracy gains up to 11% on MATH-500, AIME25, and LiveCodeBench.
- StreamingThinker: Large Language Models Can Think While Reading
-
StreamingThinker enables LLMs to "think while reading" similarly to humans—synchronously generating sequentially aligned reasoning segments as input arrives sentence-by-sentence, followed by deepening thoughts after reading as needed. Through a combination of a streaming CoT data construction pipeline, streaming attention mask/positional encoding training, and parallel KV cache inference, it maintains accuracy comparable to traditional "think-after-reading" on mathematics, logic, and context QA, while reducing the waiting tokens before reasoning by approximately 80% and lowering first-answer latency by over 60%.
- String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation
-
This paper proposes String Seed of Thought (SSoT), a concise prompting method where LLMs first generate a random string and then extract randomness to select an answer. It significantly improves distribution faithfulness in Probabilistic Instruction Following (PIF) and response diversity in Diverse Answer Generation (DAG). Theoretically, the TV distance decays exponentially with string length, and experiments show that reasoning-heavy LLMs perform close to pseudo-random number generators.
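The two-step idea (produce a random string, then derive the choice from it) can be illustrated outside the model. In SSoT the LLM performs both steps inside its own reasoning; the extraction rule below (sum of code points modulo the total weight) and the name `pick_from_seed` are assumptions made for this sketch, not the paper's procedure.

```python
def pick_from_seed(seed, options, weights):
    """Map a seed string to one option using positive integer weights.

    The choice is deterministic given the seed. It follows the target
    distribution only to the extent that the code-point sum is spread
    evenly over the residues modulo sum(weights).
    """
    total = sum(weights)
    value = sum(ord(ch) for ch in seed) % total
    for option, weight in zip(options, weights):
        if value < weight:
            return option
        value -= weight
    raise ValueError("weights must be positive integers")


# A fair coin flip driven by a hypothetical model-generated string.
flip = pick_from_seed("qzt7Lm", ["heads", "tails"], [1, 1])

# A 70/30 split uses the same rule with different weights.
biased = pick_from_seed("qzt7Lm", ["yes", "no"], [7, 3])
```

Longer seed strings give the modulo step more entropy to work with, which matches the note's statement that the TV distance shrinks as the string grows.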
- Structured Reasoning for LLMs: A Unified Framework for Efficiency and Explainability
-
This paper explicitly decomposes the LLM reasoning process into labeled "steps" and models them as a directed graph. By extending GRPO with two structure-aware algorithms—"MaxFlow Reward" and "Longest Common Subsequence (LCS) Reward"—it enables DeepSeek-R1-Distill-Qwen-1.5B/7B to achieve more concise, stable, and explainable reasoning within shorter contexts, surpassing tuned baselines such as GRPO.
- T1: Tool-Integrated Verification for Test-Time Compute Scaling in Small Language Models
-
When small models serve as verifiers in test-time scaling, they often misjudge due to an inability to memorize arithmetic or facts. T1 introduces a two-stage verification process: first, an external tool like a code interpreter filters out candidates with calculation errors; then, a reward model scores the remaining ones. By outsourcing memory-intensive tasks to tools, Llama-3.2-1B outperforms Llama-3.1-8B on the MATH dataset.
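A minimal sketch of the two-stage idea, assuming each candidate exposes its arithmetic steps as `(a, op, b, claimed)` tuples and that a reward function is available. The data layout, the helper names, and the fallback when every candidate is rejected are illustrative assumptions.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def tool_check(step):
    """Stage 1: recompute one arithmetic step with an external tool."""
    a, op, b, claimed = step
    return OPS[op](a, b) == claimed


def verify(candidates, reward_fn):
    """Stage 2: score only the candidates whose arithmetic survived stage 1."""
    survivors = [c for c in candidates if all(tool_check(s) for s in c["steps"])]
    pool = survivors or candidates  # fall back if the tool rejects everything
    return max(pool, key=reward_fn)


# Hypothetical candidates: the first contains a calculation error (17 * 24 = 408)
# but would win on the reward score alone.
candidates = [
    {"answer": "418", "steps": [(17, "*", 24, 418)], "rm_score": 0.9},
    {"answer": "408", "steps": [(17, "*", 24, 408)], "rm_score": 0.7},
]
best = verify(candidates, reward_fn=lambda c: c["rm_score"])
```

The reward model never has to notice the arithmetic slip, because the tool has already removed that candidate.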
- Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking
-
This work reinterprets "guiding language models with process verifiers" as a "random walk on a generation tree" and introduces probabilistic backtracking (occasionally erasing generated tokens). This approach provably avoids the amplification of estimation errors over the generation length, even when the verifier (value function) is imperfect, consistently outperforming action-level sampling without backtracking across various distributional fidelity metrics.
- TATTOO: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
-
Aiming at the blind spots of general PRMs in tabular reasoning—specifically the inability to distinguish sub-table retrieval accuracy and the failure to capture long-distance schema dependencies—this paper proposes TATTOO. It is a generative PRM that decomposes rewards into "table operation rewards + intrinsic reasoning rewards" and invokes actual code/table-lookup tools during verification. Using 60k tool-augmented annotations for SFT cold-start followed by RL reward shaping, it improves downstream policy models by an average of 30.9% across 5 tabular reasoning benchmarks with only 8B parameters, surpassing the 72B Qwen2.5-Math-PRM.
- Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning
-
The authors propose TAMPO (Temperature Adaptive Meta Policy Optimization), which redefines sampling temperature as a learnable meta-policy. Through a bi-level loop, the method performs LLM policy optimization in the inner loop and adaptively updates the temperature distribution in the outer loop based on trajectory advantage signals. This approach requires zero extra rollouts and consistently outperforms fixed-temperature baselines on mathematical reasoning benchmarks.
- Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
-
This paper discovers that Diffusion Language Models (dLLMs) implicitly learn a set of "semi-autoregressive experts" during training, where different block decoding orders activate different experts. Based on this, it proposes HEX, a training-free inference method that runs multiple generation paths via various block schedules and applies majority voting. It improves GSM8K accuracy from 24.72% to 88.10%, even outperforming models fine-tuned with GRPO reinforcement learning.
- Test-Time Scaling with Reflective Generative Model
-
This paper proposes the Reflective Generative Model (RGM), which enables a single network to serve as both a policy model for generating reasoning trajectories and a process reward model for scoring them. By adding only a 50M parameter SPRM head and utilizing a self-supervised SPR Loss to bypass process-level annotations, a 32B model outperforms OpenAI o3-mini on AIME24 (84.2 vs. 79.6), with scoring performance exceeding 72B-class reward models.
- \(\textbf{Re}^{2}\): Unlocking LLM Reasoning via Reinforcement Learning with Re-solving
-
This paper proposes the Re² method, which utilizes pure reinforcement learning to train LLMs to actively abandon invalid reasoning chains and restart the solving process. This approach increases the occurrence of rare "redo" behaviors from 0.5% to over 30%, significantly outperforming standard RLVR methods under the same training computation budget.
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling the Thinking Process of Reasoning Models
-
This paper proposes CoT Encyclopedia, a bottom-up, data-driven framework: it automatically mines reasoning strategy dimensions from model-generated long Chains-of-Thought (CoT), clusters them into interpretable contrastive rubrics, and uses them to predict and proactively guide the model toward optimal strategies. This approach improves accuracy and safety rates by 12.2–16.1% across 5 benchmarks and reveals the key insight that "training data format shapes reasoning style more significantly than the domain."
- The First Impression Problem: Internal Bias Triggers Overthinking in Reasoning Models
-
Reasoning models form a "first impression" (internal bias) regarding the answer the moment they see a question. When this intuitive guess conflicts with subsequent systematic reasoning, the model enters a cycle of self-doubt and re-checking, causing reasoning length to expand by 21%–43%. Existing mitigation methods fail to fundamentally eliminate this effect.
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
-
The paper reveals that short-task benchmarks provide an illusion of "diminishing returns"—marginal gains in single-step accuracy are amplified exponentially in long-horizon tasks. It identifies the "self-conditioning effect" (where a model's own errors increase the probability of subsequent errors), which thinking models can mitigate. Notably, GPT-5 thinking can execute tasks exceeding 2100 steps.
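The amplification can be seen with a back-of-the-envelope calculation. The sketch assumes a constant per-step accuracy and independent errors, a simplification that the self-conditioning finding itself calls into question; `horizon` is a hypothetical helper name.

```python
import math


def horizon(step_accuracy, success_rate=0.5):
    """Steps H such that step_accuracy ** H equals success_rate."""
    return math.log(success_rate) / math.log(step_accuracy)


# Under the independence assumption, going from 99% to 99.9% per-step
# accuracy stretches the 50%-success horizon by roughly a factor of ten.
h_99 = horizon(0.99)
h_999 = horizon(0.999)
```

A gain that looks marginal on a single-step benchmark therefore translates into a much longer task that can be completed reliably.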
- The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
-
This paper proposes TAIL (Turing mAchine Imitation Learning), which uses Python programs to automatically synthesize Chain-of-Thought (CoT) data that mimics the execution process of a Turing Machine. By decomposing reasoning into three structures—linear expansion, atomic states, and explicit data fetching—and fine-tuning Qwen2.5-7B solely on synthetic data, the model achieves stable generalization to sequences longer than those seen during training across 18 algorithmic tasks, even surpassing DeepSeek-R1 (671B).
- The Limits of Inference Scaling Through Resampling
-
This paper demonstrates both theoretically and empirically that when verifiers are imperfect (e.g., incomplete unit test coverage, non-zero false positive rates), scaling inference compute through "repeated sampling until passing a verifier" hits an insurmountable accuracy ceiling. Regardless of the compute budget allocated to a weak model, it cannot match the single-call accuracy of a sufficiently strong model, and the optimal number of samples is often as low as single digits.
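A simplified model shows where the ceiling comes from. Assume the generator is correct with probability `p`, the verifier accepts every correct sample, and it wrongly accepts an incorrect sample with probability `fpr`. These assumptions and the name `resampling_ceiling` are made for this sketch and are not the paper's exact setup.

```python
def resampling_ceiling(p, fpr):
    """Accuracy of the first accepted sample under unlimited resampling.

    Accepted samples are correct with probability
    p / (p + (1 - p) * fpr), no matter how many draws are allowed.
    """
    return p / (p + (1 - p) * fpr)


weak_with_resampling = resampling_ceiling(p=0.2, fpr=0.1)  # weak model, leaky verifier
strong_single_call = 0.8                                    # hypothetical stronger model
```

With these illustrative numbers the weak model tops out near 71% however much compute is spent, below the stronger model's single-call accuracy.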
- The Path of Least Resistance: Guiding LLM Reasoning Trajectories for Efficient Consistency
-
This paper proposes PoLR (Path of Least Resistance), the first inference-time method leveraging reasoning prefix consistency. By clustering short prefixes and extending only the dominant cluster, it serves as an efficient alternative to Self-Consistency, reducing token usage by up to 60% and latency by 50%.
- The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus
-
The authors propose PoLR (Path of Least Resistance), the first test-time method utilizing reasoning prefix consistency. By clustering short prefixes and expanding only the dominant cluster to replace standard Self-Consistency, it reduces token usage by 40%–60% and latency by up to 50% while maintaining or even improving accuracy across benchmarks such as GSM8K, Math500, AIME, and GPQA.
- The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
-
This paper proposes DC-CoT—the first data-centric benchmark for systematically evaluating Chain-of-Thought (CoT) distillation. It places three types of data operations—augmentation, filtering, and mixing—into a unified framework. Through large-scale empirical studies across multiple teacher-student model pairs and reasoning tasks, it concludes that "Data Augmentation (especially Reverse Thinking) yields the highest gains, Filtering ensures quality, and Mixing has limited impact."
- Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning
-
This paper utilizes Erotetic Reasoning Theory (ETR) from cognitive science and its open-source implementation, PyETR, to programmatically generate 383 formal reasoning problems. Evaluating 38 models, the study reveals a counterintuitive phenomenon: as model capability (Chatbot Arena Elo) increases, the proportion of logical errors that "exactly match ETR-predicted human-like fallacies" rises, while overall logical accuracy remains uncorrelated with capability.
- Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
-
THINKMERGE lets an LLM run \(K\) reasoning chains in parallel. After each chain finishes its "thinking" phase, the chains' next-token logits are arithmetically averaged during the "answer" stage, and tokens are sampled from the averaged result. This extends "Majority Voting" from closed-ended questions to open-ended tasks like code generation and deep research agents where a "majority" cannot be defined. It is training-free and plug-and-play.
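One answer-stage decoding step can be sketched as follows, assuming each chain exposes a next-token logit vector over a shared vocabulary. The toy three-token vocabulary and the function name are hypothetical, and a real system would then append the sampled token to every chain before the next step.

```python
import math
import random


def merged_next_token(logits_per_chain, rng):
    """Average logits across chains, apply softmax, and sample one token id."""
    k = len(logits_per_chain)
    vocab = len(logits_per_chain[0])
    avg = [sum(chain[t] for chain in logits_per_chain) / k for t in range(vocab)]
    peak = max(avg)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in avg]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    token = rng.choices(range(vocab), weights=probs, k=1)[0]
    return token, probs


# Three chains, vocabulary of three tokens (illustrative numbers).
logits = [[2.0, 0.0, -1.0], [1.0, 0.5, -1.0], [3.0, -0.5, -1.0]]
token, probs = merged_next_token(logits, random.Random(0))
```

Because the merge happens at the logit level, no notion of "identical answers" is needed, which is what makes the idea applicable to free-form outputs.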
- Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners
-
Inserting a cheap initialization stage called TFPI between SFT-distilled long CoT models and standard RLVR (by simply appending `</think>` during rollouts to skip explicit thinking and using multi-stage RL with short contexts) enables models to be both more accurate and token-efficient in slow-thinking mode. This also improves the convergence speed and performance ceiling of subsequent standard RLVR (a 4B model achieves 89.0% on AIME24 using less than 4K H20 hours).
- THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
-
THOR systematically addresses three major challenges in tool-integrated mathematical reasoning (data construction, fine-grained optimization, and reasoning enhancement) through three core components: the TIRGen data construction pipeline, hierarchical reinforcement learning (joint episode-level and step-level optimization), and a self-correction reasoning mechanism. It achieves SOTA performance among models of the same scale on benchmarks such as MATH500 and AIME.
- Tina: Tiny Reasoning Models via LoRA
-
This paper applies LoRA during RL (GRPO) post-training of a tiny 1.5B model. For only $9, the model reaches mathematical reasoning capability comparable to or even better than the full-parameter SOTA built on the same base model. The authors propose the "Rapid Reasoning Format Adaptation" hypothesis to explain why this low-cost approach is effective.
- Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
-
This paper proposes Tool-Light, a framework that analyzes the root causes of inefficiency in Tool-Integrated Reasoning (TIR) from an information entropy perspective. By utilizing "entropy-guided sampling + two-stage self-evolved DPO," the model learns "when to call tools and when not to," simultaneously improving both accuracy and efficiency across 10 math and knowledge-intensive tasks.
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
-
This paper reveals that the chain-of-thought (CoT) reasoning of Large Reasoning Models (LRMs) often contains harmful content even when the final answer is safe. It proposes Intervened Preference Optimization (IPO), which corrects unsafe reasoning trajectories by replacing compliance cues with safety triggers to construct preference pairs for alignment. IPO reduces the reasoning harm rate by over 30% across three LRMs without compromising reasoning performance.
- Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning
-
This paper proposes Latent-Trajectory (LT) signals, which trace the "temporal evolution trajectory" (net change, cumulative change, and aligned change) of the model's hidden states during reasoning token generation to predict, without training, whether a reasoning trajectory leads to a correct answer. These signals guide early stopping and early path selection in multi-sample reasoning, reducing token consumption by up to approximately 70% while maintaining or even improving accuracy.
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding
-
This paper proposes Progressive Thought Encoding, which encodes evicted thought tokens into LoRA weights under KV cache constraints. This allows Large Reasoning Models (LRMs) to reduce GPU memory consumption by half during RL training while reasoning accuracy surpasses full-cache LoRA (with a maximum improvement of +23.4% on AIME2024/2025).
- TRAPO: Enhancing LLM Reasoning with Semi-supervised Reinforcement Learning
-
TRAPO proposes a semi-supervised RLVR paradigm that uses a small set of labeled samples to "anchor" the consistency rewards of unlabeled samples. By comparing the similarity of "pass rate trajectories" between labeled and unlabeled samples, it selects reliable unlabeled data. With only 1K labeled and 3K unlabeled samples, it achieves a 42.6% average accuracy, surpassing the strongest unsupervised method trained on 45K unlabeled samples (38.3%), and matches fully supervised performance using only 10% of the labeling volume.
- Tricks or Traps? A Deep Dive into RL for LLM Reasoning
-
This paper isolates "tricks" commonly used in RL4LLM—such as normalization, clipping, loss aggregation, and length filtering—within a unified open-source framework through 160+ sets of controlled experiments. It clarifies their applicable scenarios and discovers that combining "group-mean + batch-std advantage normalization" with "token-level loss aggregation" (termed Lite PPO) consistently outperforms more complex methods like GRPO and DAPO under a critic-free, vanilla PPO loss setting.
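The normalization half of that recipe can be sketched directly: subtract each prompt group's mean reward, then divide by the standard deviation of the whole batch. Using the population standard deviation and the `eps` constant are assumptions of this sketch, and `lite_advantages` is a hypothetical name.

```python
import statistics


def lite_advantages(rewards_by_prompt, eps=1e-6):
    """Group-mean centering with batch-level standard deviation scaling."""
    all_rewards = [r for group in rewards_by_prompt for r in group]
    batch_std = statistics.pstdev(all_rewards)
    return [
        [(r - statistics.fmean(group)) / (batch_std + eps) for r in group]
        for group in rewards_by_prompt
    ]


# Two prompts, four rollouts each, binary rewards (illustrative data).
rewards = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 0.0]]
advantages = lite_advantages(rewards)
```

GRPO, by contrast, divides by each group's own standard deviation, which inflates advantages when nearly all rollouts in a group receive the same reward.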
- TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks
-
TRIM refines the routing granularity of "large vs. small models" from the entire query to individual reasoning steps. It uses a Process Reward Model (PRM) to identify "critical steps that cause solution failure," assigning only these steps to an expensive large model for rewriting while allowing a cheap small model to continue the remaining routine steps. This achieves the accuracy of large models on benchmarks like MATH-500 and AIME using as little as 20% of expensive tokens.
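Step-level routing can be sketched as a loop in which the small model drafts every step and the large model is called only when the PRM score of a draft falls below a threshold. The callables, the fixed threshold, and the toy stubs are assumptions for illustration; the paper's routing policy may differ.

```python
def trim_solve(small_step, large_step, prm_score, max_steps, threshold=0.5):
    """Build a solution step by step, escalating only low-scoring steps."""
    steps, large_calls = [], 0
    for _ in range(max_steps):
        candidate = small_step(steps)
        if candidate is None:  # the small model signals that it is done
            break
        if prm_score(steps, candidate) < threshold:
            candidate = large_step(steps)  # rewrite the critical step
            large_calls += 1
        steps.append(candidate)
    return steps, large_calls


# Toy stubs standing in for the two models and the PRM.
script = ["s1", "s2_bad", "s3"]


def small_step(steps):
    return script[len(steps)] if len(steps) < len(script) else None


def large_step(steps):
    return f"large_step_{len(steps) + 1}"


def prm_score(steps, candidate):
    return 0.1 if candidate.endswith("_bad") else 0.9


steps, large_calls = trim_solve(small_step, large_step, prm_score, max_steps=5)
```

Only one of the three steps is sent to the expensive model, which is the source of the token savings the note reports.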
- TSLM: Tree-Structured Language Modeling for Divergent Thinking
-
TSLM linearizes a complete search tree (including successful paths and failure dead ends) using several special tokens. This allows a standard autoregressive language model to natively produce multi-branch exploration structures in a single generation, thereby internalizing systematic search capabilities via supervised learning. It achieves 100% pass@1 on Game of 24 (baseline 17%) and 91.5% on larger Gridworld OOD tasks, far exceeding Tree-of-Thought's 42.7%, and demonstrates significantly faster inference speeds.
- TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
-
TUMIX enables a single LLM to derive 15 agents with distinct tool-use strategies (pure text / code / search / code+search, etc.), letting them answer in parallel and refine answers across rounds via sharing. It uses LLM-as-Judge for adaptive early stopping plus majority voting to select the final answer. On HLE, GPQA, and AIME, it achieves an average improvement of 3.55% over the strongest tool-augmented test-time scaling baselines with nearly identical inference costs.
- TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis
-
The authors propose TumorChain, an interleaved multimodal Chain-of-Thought (CoT) reasoning framework for tumor analysis across five major digestive organs. By integrating a knowledge-graph-driven 1.5M CoT-VQA data engine, organ-guided Iterative Interleaved Reasoning (IIR), and collaborative optimization of segmentation, classification, and LLM modules, it achieves a complete reasoning chain from findings to impressions to pathological predictions, with a mean accuracy of 84.41%, significantly outperforming GPT-5-Mini (51.59%).
- Understanding the Role of Training Data in Test-Time Scaling
-
The paper provides a theoretical analysis of how training data attributes influence test-time scaling performance. It proves that CoT reasoning is equivalent to pseudo-Newton iterations, proposes a task hardness metric based on the minimum eigenvalue of feature covariance, reveals the mechanism behind the "overthinking" phenomenon where more computation degrades performance, and identifies the optimal task selection strategy for multi-task training—emphasizing that training sets should be diverse, relevant, and hard.
- Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
-
The paper proposes Uni-CoT, a hierarchical macro-micro reasoning framework that decomposes multimodal CoT into macro-level task planning (breaking complex tasks into sub-goals) and micro-level sub-task execution (MDP-style iterative optimization via self-reflection). By designing an attention mask to reduce \(O(T^2)\) complexity to \(O(T)\), it outperforms the BAGEL baseline by +0.02 on GenEval, achieving unified reasoning for interleaved text and images.
- Unleashing Scientific Reasoning for Bio-Experimental Protocol Generation via Structured Component-based Reward Mechanism
-
This paper reformulates "bio-experimental protocol generation" as a structured and verifiable reasoning task. It introduces the Sketch-and-Fill reasoning paradigm to decompose free-text into a three-stage output: "Thought → Atomic Steps → Natural Language." The authors propose SCORE, a rule-based component reward mechanism (Step granularity + Action sequence + Semantic fidelity) to replace expensive LLM-as-a-judge signals for RL. Combined with a three-stage Knowledge-to-Action training pipeline, the resulting 8B model, Thoth, outperforms larger models like GPT-4o and DeepSeek-V3 on protocol generation and multiple biomedical benchmarks.
- USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents
-
USTBench decomposes the spatiotemporal reasoning capabilities of LLMs acting as urban agents into four process dimensions: Understanding—Prediction—Planning—Reflection. Within the interactive urban environment UAgentEnv, the authors construct 62,466 structured QAs and 9 real-world urban downstream tasks. Evaluating 14 mainstream LLMs reveals that while they perform well in understanding and prediction, they generally struggle with long-range planning and reflection. Furthermore, models specifically trained for reasoning (e.g., DeepSeek-R1) do not consistently outperform standard models in urban tasks.
- Variation in Verification: Understanding Verification Dynamics in Large Language Models
-
This paper systematically deconstructs the question of "when LLM verifiers are reliable." Through large-scale controlled experiments across 12 benchmarks and 15 models, the authors find that verification effectiveness is jointly determined by three dimensions: problem difficulty, generator capability, and verifier capability. Specifically, difficulty dominates the recognition of correct answers (TPR), generator capability dominates the detection of errors (TNR), and the relationship between verifier capability and effectiveness follows three patterns (saturation, linear, or threshold) depending on difficulty. This reveals that the default practice of "using the strongest model as a verifier" is often wasteful in many scenarios.
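The two rates used in this analysis are the standard ones: TPR is the fraction of correct answers the verifier accepts, and TNR is the fraction of incorrect answers it rejects. A small helper makes the definitions concrete; the function name and the toy data are illustrative.

```python
def verifier_rates(is_correct, accepted):
    """Return (TPR, TNR) for a verifier; a rate is None if undefined."""
    tp = sum(1 for y, v in zip(is_correct, accepted) if y and v)
    tn = sum(1 for y, v in zip(is_correct, accepted) if not y and not v)
    positives = sum(1 for y in is_correct if y)
    negatives = len(is_correct) - positives
    tpr = tp / positives if positives else None
    tnr = tn / negatives if negatives else None
    return tpr, tnr


# Five generated answers: ground-truth correctness and the verifier's verdicts.
is_correct = [True, True, True, False, False]
accepted = [True, True, False, True, False]
tpr, tnr = verifier_rates(is_correct, accepted)
```

Reporting the two rates separately matters here because, per the note, they are driven by different factors: problem difficulty for TPR and generator capability for TNR.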
- Variational Reasoning for Language Models
-
This paper treats the "Chain-of-Thought" (CoT) as a latent variable and "correct answer" as an observation, deriving a training objective from ELBO using variational inference. It introduces a variational posterior with an "answer hint" to sample CoT trajectories more likely to be correct. The model is updated using the IWAE multi-trajectory tight bound with accuracy-based weights, while the posterior is updated using forward-KL to prevent collapse. The authors further prove that RFT and GRPO are "accuracy-weighted local forward-KL," revealing an implicit bias toward easy problems. The method consistently outperforms strong baselines across multiple scales of Qwen2.5/Qwen3.
- VERICOT: Neuro-Symbolic Chain-of-Thought Validation via Logical Consistency Checks
-
VERICOT translates each step of an LLM's Chain-of-Thought (CoT) into First-Order Logic (FOL) formulas and uses an SMT solver to check if each step is entailed by "established premises." This process localizes "ungrounded / contradictory / untranslatable" reasoning steps; these validation signals predict final answer correctness and drive self-reflection, SFT, and DPO to generate more verifiable reasoning.
- Verifying Chain-of-Thought Reasoning via Its Computational Graph
-
This paper proposes CRV (Circuit-based Reasoning Verification), which constructs interpretable attribution graphs by replacing LLM MLPs with transcoders. It extracts "fingerprints" of reasoning errors from the structural features of these graphs to achieve white-box CoT reasoning verification, and it corrects erroneous reasoning through causal intervention.
- VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
-
VisioMath is a benchmark containing 1,800 K-12 mathematics problems in which all answer options consist of highly visually similar charts. It reveals a core weakness of LMMs in multi-image-text alignment and explores three alignment strategies, achieving a +12.6% gain.
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
-
Vision-R1 constructs 200K high-quality multimodal CoT samples via Modality Bridging for cold-start initialization. By combining a Progressive Thought Suppression Training (PTST) strategy with GRPO reinforcement learning, it achieves multimodal mathematical reasoning capabilities close to OpenAI o1 at a 7B parameter scale.
- VoG: Enhancing LLM Reasoning through Stepwise Verification on Knowledge Graphs
-
VoG utilizes a "plan-retrieve-verify-revise" iteration loop with three agents to enable multi-hop reasoning over Knowledge Graphs (KG). Per-step retrieval results (KG triplets) are checked against current reasoning plans. Upon detecting inconsistencies, a Multi-Armed Bandit (MAB) is used to adaptively select a context range for plan rewriting, improving both accuracy and efficiency across three KGQA benchmarks (with lower token consumption than baselines).
- WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning
-
Addressing the scheduling problem of "which tokens to determine first" during Diffusion Language Model (DLM) decoding, this paper proposes WavefrontDiffusion—a training-free dynamic scheduling strategy. It allows finalized tokens to expand candidate regions like water waves, ensuring each token is finalized only when sufficient context is available. Across five reasoning and code benchmarks, it consistently outperforms the current strongest BlockDiffusion using the exact same compute budget.
- Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity
-
This paper proposes the DMVR framework and the α-DPG algorithm, which explicitly define a target distribution that "filters out incorrect answers" and approximate it via the α-divergence family. This approach unifies RLVR (Reverse KL) and Rejection Sampling Fine-Tuning (Forward KL), achieving optimal performance on the precision-coverage Pareto frontier in Lean theorem proving.
- When More Is Less: Understanding Chain-of-Thought Length in LLMs
-
This paper systematically reveals that the belief "the longer the Chain-of-Thought, the better" is a misconception—task accuracy follows an inverted U-shaped curve relative to CoT length. There exists an optimal length that shortens as model capability increases and task difficulty decreases. The authors explain this phenomenon using a theoretical model of error accumulation, derive a scaling law, and provide two practical recipes: "constructing training data based on optimal length" and "length-filtered voting during inference."
- When Reasoning Meets Compression: Understanding the Effects of LLMs Compression on Large Reasoning Models
-
This paper systematically investigates the impact of quantization, distillation, and pruning on Large Reasoning Models (LRMs). Through performance benchmarking and mechanistic interpretability analysis, the study reveals core findings: the number of weights affects knowledge memory more than reasoning, the last-layer MLP up_proj is the most critical component, and current quantization methods over-compress the final layers.
- When Silence is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
-
This paper systematically investigates "how to teach LLMs to abstain when they don't know the answer in temporal QA," proposing a pipeline involving CoT-SFT cold-start + GRPO reinforcement learning (with abstention-aware rewards). This allows a 1.5B small model to outperform GPT-4o in Exact-Match (EM) on TimeQA (+3.46% on Easy / +5.80% on Hard). It also reveals the trade-offs where SFT causes overconfidence and RL improves accuracy but degrades abstention behavior on difficult problems.
- Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
-
This work attributes each sentence output by a distilled student model at test time to its true source model across four categories: Teacher, Student, Shared, and Boosted. It demonstrates that students indeed reuse teacher sentences in new scenarios and that these sentences correlate with correct answers. Based on this, a "Teacher-guided Data Selection" strategy is proposed to select training samples with the most teacher sentences, yielding average improvements of 1.7%–2.5% across multiple teacher-student pairs.
- Why is Your Language Model a Poor Implicit Reward Model?
-
This paper reveals through theory and experiments the fundamental reason why Implicit Reward Models (IM-RM, such as DPO) generalize worse than Explicit Reward Models (EX-RM): IM-RM over-relies on surface token-level cues rather than semantic representations. This leads to a significant drop in accuracy under token distribution shifts. Furthermore, the paper refutes the popular "generation-verification gap" hypothesis.
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking
-
Addressing the "overthinking" problem in Large Reasoning Models (LRMs), this paper proposes JET (Just-Enough Thinking). During the RL rollout phase, the model's self-generated long reasoning chains are progressively truncated and appended with a "stop-thinking" prompt to construct short reasoning samples consistent with the model's own distribution. Combined with a "correctness-first, brevity-second" quality-controlled length reward, the model learns to actively stop thinking once it has sufficient information. With a 1.5B model, JET improves accuracy by 4.6% on Olympiad while reducing output length by 46.3%.
- Zero-Overhead Introspection for Adaptive Test-Time Compute
-
ZIP-RC enables LLMs to reuse unused reserved logits in the output head during each decoding step to predict the joint distribution of "final reward × remaining length" with zero extra overhead. This distribution is used to optimize a "sampling utility" that balances quality, compute, and latency online, adaptively deciding when to sample more, prune, or stop. On mixed-difficulty math benchmarks, it improves accuracy by up to 12% at equal or lower cost.
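As a rough illustration of the "length-filtered voting during inference" recipe mentioned in the note on When More Is Less, the sketch below shows one simple way such a vote could work. It is not the paper's procedure: the function name `length_filtered_vote`, the `target_len` and `tolerance` parameters, and the toy samples are all assumptions made for illustration. The idea is only that answers whose CoT length lies far from an assumed optimal length are dropped before a majority vote.

```python
from collections import Counter


def length_filtered_vote(samples, target_len, tolerance):
    """Majority vote over answers whose CoT length is near target_len.

    samples: list of (answer, cot_length) pairs (hypothetical format).
    target_len, tolerance: assumed hyperparameters, not taken from the paper.
    """
    # Keep only answers whose reasoning length falls inside the window.
    kept = [ans for ans, n in samples if abs(n - target_len) <= tolerance]
    # If the filter removes everything, fall back to a plain majority vote.
    if not kept:
        kept = [ans for ans, _ in samples]
    return Counter(kept).most_common(1)[0][0]


# Toy data: very short and very long chains agree on "17",
# mid-length chains agree on "42".
samples = [
    ("42", 120), ("42", 135), ("42", 110),
    ("17", 900), ("17", 950), ("17", 1000), ("17", 40),
]

filtered = length_filtered_vote(samples, target_len=125, tolerance=30)
unfiltered = length_filtered_vote(samples, target_len=125, tolerance=10**6)
print(filtered, unfiltered)
```

With a very large tolerance the function reduces to ordinary majority voting, which makes it easy to compare the two behaviours on the same set of samples.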