💡 LLM Reasoning

💬 ACL2025 · 54 paper notes

🔥 Top topics: Reasoning ×42 · LLM ×9 · Question Answering ×2 · Multimodal/VLM ×2

An Efficient and Precise Training Data Construction Framework for Process-Supervised Reward Model in Mathematical Reasoning: This paper proposes the EpicPRM framework, which quantifies the contribution of each reasoning step through perplexity-based Monte Carlo estimation and utilizes adaptive binary search to efficiently locate the first incorrect step. It constructs Epic50k, a high-quality process-supervised dataset (with only 50k annotated steps), which trains a PRM that performs comparably to or even outperforms models trained on PRM800k.
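The idea of locating the first incorrect step by binary search can be sketched as follows. This is a minimal illustration, not EpicPRM's implementation: `first_incorrect_step` and `prefix_ok` are hypothetical names, the adaptive part of the paper's search is not reproduced, and the sketch assumes that once a prefix of steps is judged wrong, every longer prefix is wrong too.

```python
from typing import Callable, Optional


def first_incorrect_step(num_steps: int,
                         prefix_ok: Callable[[int], bool]) -> Optional[int]:
    """Return the 1-based index of the first incorrect step, or None.

    prefix_ok(k) is a hypothetical oracle (for example, a Monte Carlo
    rollout check) that returns True when the first k steps are still
    judged correct. Assumption: prefix_ok is monotone, i.e. True for all
    k below the first error and False from the first error onward.
    """
    if prefix_ok(num_steps):
        return None  # the whole solution is judged correct
    lo, hi = 0, num_steps  # invariant: prefix lo is fine, prefix hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Usage: a 7-step solution whose first error is at step 4.
print(first_incorrect_step(7, lambda k: k < 4))
```

Under the monotonicity assumption, the search needs on the order of log2(n) oracle calls instead of one call per step, which is where the annotation savings would come from.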
Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework: This paper proposes Aristotle, a logical reasoning framework that fully integrates symbolic expressions and logical rules into every stage of the Decompose-Search-Resolve process. Utilizing three core components—a logical decomposer, a search router, and a resolver—it achieves logic-complete reasoning, outperforming SOTA on several logical reasoning benchmarks with an average improvement of 4.5% on GPT-4 and 5.4% on GPT-4o.
Beyond the Answer: Advancing Multi-Hop QA with Fine-Grained Graph Reasoning and Evaluation: To address the issues of opaque reasoning processes and coarse evaluation granularity in multi-hop question answering (Multi-hop QA), this paper proposes a fine-grained graph reasoning framework. By constructing a reasoning graph to explicitly model evidence chains, and introducing fine-grained evaluation metrics, the framework measures the quality of the reasoning process rather than solely focusing on the correctness of the final answer.
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving: The BPP-Search algorithm is proposed, which integrates Beam Search, Process Reward Models (PRMs), and a Pairwise Preference mechanism into the Tree-of-Thought framework for automatic mathematical modeling in operations research, significantly outperforming CoT/SC/ToT baselines on datasets like StructuredOR with fewer reasoning steps.
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?: This paper introduces DeltaBench, the first benchmark dataset to systematically evaluate the quality of long CoT reasoning in o1-like models and the error detection capabilities of existing LLMs/PRMs. Through fine-grained human annotation of 1,236 samples, it reveals a sobering reality: o1-like models exhibit approximately 27% reasoning redundancy, 67.8% ineffective reflections, and even the strongest critic model, GPT-4-turbo-128k, achieves only an F1 score of 40.8%.
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective: This paper proposes the Chain-of-Reasoning (CoR) framework, which unifies three paradigms, Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR), into a single reasoning chain. Guided by a Progressive Paradigm Training (PPT) strategy, a 7B model (CoR-Math-7B) achieves a 41% accuracy improvement over GPT-4o on theorem proving under zero-shot settings, and outperforms reinforcement learning (RL) methods by 15% on the MATH benchmark.
ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations: ClozeMath proposes a fine-tuning strategy inspired by human cloze learning. By masking equations in mathematical solutions and training the model to predict them (a text-infilling objective) jointly with standard language modeling objectives, ClozeMath significantly outperforms the strong baseline Masked Thought on GSM8K and MATH. It also demonstrates superior generalization in test-time scaling and robustness evaluations.
Commonsense Abductive Reasoning using Knowledge from Multiple Sources: This paper proposes a commonsense abductive reasoning method that integrates multi-source knowledge (knowledge graphs, pre-trained language models, and rule bases). By jointly utilizing structured and unstructured knowledge to generate more accurate and explainable best explanations, the method achieves significant improvements on abductive reasoning benchmarks.
Complex Reasoning with Natural Language Contexts and Background Knowledge: This paper proposes a complex reasoning framework that integrates natural language contexts with structured background knowledge. By utilizing knowledge graph retrieval augmentation and context-aware reasoning chain generation, it significantly improves LLM performance on multi-step reasoning tasks that require external knowledge support.
CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis: This paper proposes CoT-based Synthesizer, a novel inference scaling strategy that uses CoT reasoning to analyze complementary information across multiple candidate responses and synthesize a superior final answer. It can synthesize the correct answer even when all candidate answers are incorrect, achieving an 11.8% gain for Llama3-8B and a 10.3% gain for GPT-4o on MATH500.
CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations: The paper proposes the CoT-ICL Lab framework, which generates controllable synthetic tokenized datasets by decoupling the causal structure (DAG) and token processing function (MLP). It systematically studies the acceleration effect of CoT on ICL, the critical role of model depth, and the mechanisms by which Transformer embeddings and attention maps learn the underlying reasoning structure.
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought: To address the overconfidence of LLMs in reasoning tasks, this paper proposes the CoT-UQ framework, which integrates keyword extraction and importance scoring from CoT reasoning steps into the uncertainty quantification process, achieving an average AUROC improvement of 5.9% on logical and mathematical reasoning tasks.
CoT-Valve: Length-Compressible Chain-of-Thought Tuning: This paper proposes CoT-Valve, a method to elastically control the length of reasoning chains by identifying a "length-control direction" in the parameter space (implemented with LoRA). It requires only a single training run to generate reasoning paths of varying lengths, compressing the GSM8K reasoning chain on QwQ-32B-Preview from 741 to 225 tokens with only a 0.15-percentage-point drop in accuracy (95.07% to 94.92%).
Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thoughts Critic: This paper proposes the Critic-CoT framework, which transitions LLM self-criticisms from System-1 intuitive judgments to System-2 deliberate step-by-step analyses through a step-by-step Chain-of-Thought critic paradigm and automated weakly supervised data construction without human annotation. Two-stage training (GPT-4 distillation + self-criticism) improves Llama-3-70B-Instruct performance on GSM8K from 89.6% to 95.4% and on MATH500 from 50.4% to 68.4%. Additionally, it is discovered that criticism capabilities and task-solving capabilities can mutually reinforce each other.
Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs: Proposes the Diverse Chain of Thought (DCoT) training method, which enables "within-inference refinement" by generating multiple sequential reasoning chains in a single inference session. It consistently outperforms standard CoT baselines across models ranging from 1.3B to 70B, with particularly significant improvements in large output space tasks (numerical/extractive).
DeFine: Decision-Making with Analogical Reasoning over Factor Profiles: This paper proposes the DeFine framework, which constructs probabilistic factor profiles from spoken transcripts in complex scenarios such as earnings calls. By combining the Bradley-Terry model to identify key factors and using KL divergence between factor profiles for analogical reasoning, DeFine assists LLMs in making investment decisions under uncertainty, outperforming baselines in both accuracy and F1 score.
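The analogical step described above, comparing factor profiles by KL divergence, can be illustrated with a small sketch. The profile format (a dictionary from outcome to probability), the function names, and the example data are all assumptions made for illustration; DeFine's actual profiles and retrieval procedure are not specified here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q[k] > 0 wherever p[k] > 0; terms with p[k] == 0 contribute 0.
    """
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)


def closest_profile(query, history):
    """Return the name of the stored profile most similar to the query,
    i.e. the one with the smallest KL divergence from it."""
    return min(history, key=lambda name: kl_divergence(query, history[name]))


# Hypothetical profiles for a single factor with two outcomes.
query = {"up": 0.5, "down": 0.5}
history = {
    "call_a": {"up": 0.9, "down": 0.1},
    "call_b": {"up": 0.6, "down": 0.4},
}
print(closest_profile(query, history))
```

The past case returned this way would then serve as the analogy handed to the LLM; KL divergence is asymmetric, so the direction of comparison is itself a design choice.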
Dynamic and Generalizable Process Reward Modeling (DG-PRM): The DG-PRM framework is proposed, which dynamically stores and selects multi-dimensional evaluation criteria by building a hierarchical reward tree, and identifies positive and negative sample pairs under multiple objectives in combination with Pareto dominance estimation, achieving dynamic and generalizable process reward modeling.
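Pareto dominance, which the summary mentions as the basis for picking positive and negative pairs, can be sketched in a few lines. The score tuples and the names `dominates` and `preference_pairs` are hypothetical; this shows the standard definition only, not DG-PRM's estimation procedure.

```python
def dominates(a, b):
    """True if score vector a Pareto-dominates b: at least as good on
    every criterion and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def preference_pairs(scored):
    """Return (better, worse) name pairs where one candidate dominates
    the other. Candidates that trade off criteria yield no pair."""
    return [(a, b)
            for a, sa in scored.items()
            for b, sb in scored.items()
            if dominates(sa, sb)]


# Hypothetical step candidates scored on two criteria.
scored = {"s1": (0.9, 0.8), "s2": (0.5, 0.6), "s3": (0.95, 0.3)}
print(preference_pairs(scored))
```

Here only `s1` dominates `s2`; `s3` is better on one criterion and worse on the other, so it forms no pair with either.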
DRT: Deep Reasoning Translation via Long Chain-of-Thought: This work introduces long CoT reasoning into machine translation by establishing a multi-agent framework (Translator \(\to\) Advisor \(\to\) Evaluator) to iteratively refine literary translations containing metaphors and similes. It synthesizes a 22K long-thought translation training dataset, and the resulting DRT-14B model outperforms large models such as QwQ-32B and DeepSeek-R1-Distill-32B in literary translation.
Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning: This paper proposes CRFT, a method that automatically identifies "critical representations" with the greatest impact on reasoning outputs across Transformer layers through information flow analysis. By optimizing these representations in a low-rank linear subspace using supervised learning, CRFT improves the accuracy of LLaMA-2-7B on GSM8K by 18.2% while utilizing only 0.016% of the model parameters.
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction: This paper proposes StepCo (Stepwise Correction), an iterative "verify-and-correct" framework. It leverages a Process Supervised Verifier (PSV) to sequentially locate the first erroneous step in an LLM's reasoning path and trigger corrections by the LLM. Using GPT-4o as the backbone, StepCo achieves an average accuracy of 94.1% across 8 mathematical reasoning benchmarks, outperforming the Best-of-10 method by +2.4 percentage points while reducing token consumption by 77.8%.
Enhancing Retrieval Systems with Inference-Time Logical Reasoning: Proposes an Inference-Time Logical Reasoning (ITLR) framework, which leverages LLMs to decompose natural language queries into logical expressions (AND/OR/NOT), and then composes the cosine similarity scores of individual sub-terms based on fuzzy logic. It consistently outperforms traditional dense retrieval and the BRIGHT baseline on synthetic data and three real-world datasets (NFCorpus/SciFact/ArguAna), showing significant improvements particularly on complex queries containing negations.
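Composing per-term similarity scores with fuzzy logic can be sketched as below. The operators used here (min for AND, max for OR, 1 - x for NOT) are the classic Zadeh operators and are an assumption; the ITLR paper may use different ones. The expression format and the clipping of cosine similarity to [0, 1] are also illustrative choices.

```python
def fuzzy_score(expr, sims):
    """Score a document against a logical query.

    expr is either a term (str) or a tuple ("AND" | "OR" | "NOT", *args).
    sims maps each term to its cosine similarity with the document.
    """
    if isinstance(expr, str):
        # Clip to [0, 1] so the fuzzy operators are well defined.
        return min(1.0, max(0.0, sims[expr]))
    op, *args = expr
    values = [fuzzy_score(arg, sims) for arg in args]
    if op == "AND":
        return min(values)
    if op == "OR":
        return max(values)
    if op == "NOT":
        return 1.0 - values[0]
    raise ValueError(f"unknown operator: {op}")


# Query: "python AND NOT snake", with hypothetical per-term similarities.
query = ("AND", "python", ("NOT", "snake"))
doc_sims = {"python": 0.8, "snake": 0.7}
print(fuzzy_score(query, doc_sims))
```

A document that matches the negated term strongly is pushed down by the NOT branch, which is the behavior a single dense embedding of the whole query tends to miss.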
Entropy-based Exploration Conduction for Multi-step Reasoning: This paper proposes the Entro-duction method, which dynamically adjusts exploration depth by monitoring changes in output entropy and variance entropy during the LLM reasoning process. It adopts an \(\epsilon\)-greedy strategy to select among three exploration behaviors (deepening, expanding, or stopping), improving reasoning accuracy while avoiding redundant computation.
FineReason: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving: Proposes FineReason—a logical puzzle-based reasoning benchmark that performs atomic-level evaluation of LLMs' deliberate reasoning capabilities (reflection, backtracking, error correction) through two tasks: "state checking" (determining if the current state is solvable) and "state transition" (deciding the next action). It demonstrates that training on puzzle data can transfer to improve mathematical reasoning performance (GSM8K improved by 5.1%).
Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering: Analyzing the representation space, this work finds that LLMs encode long CoT reasoning as a general capability clearly distinct from vanilla CoT. It proposes GLoRE (General Long CoT Reasoning via Representation Engineering), which unlocks long CoT capabilities through contrastive reasoning pattern injection and domain-specific representation steering, outperforming SFT methods in both in-domain and cross-domain scenarios without training.
Improve Vision Language Model Chain-of-thought Reasoning: By (1) performing SFT on 193K multi-task CoT reasoning data distilled from GPT-4o, and (2) utilizing model-self-generated reasoning chains to construct positive and negative sample pairs for DPO reinforcement learning, the chain-of-thought reasoning capability of VLMs is significantly enhanced, with an average improvement of +11.7% in CoT prediction, and +7.3% in direct answering.
Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions: This paper proposes QuaSAR (Quasi-Symbolic Abstract Reasoning), a Chain-of-Thought (CoT) variant that guides LLMs to first abstract the problem symbolically (extracting variables/predicates), reconstruct it using a semi-formal representation, and finally solve it based on a quasi-symbolic reasoning chain. QuaSAR achieves up to an 8% accuracy improvement over standard CoT on GPT-4o, while significantly enhancing robustness against adversarial variants (e.g., option shuffling, numerical substitution).
Large Language and Reasoning Models are Shallow Disjunctive Reasoners: This paper evaluates the systematic generalization capabilities of LLMs and LRMs on disjunctive rule reasoning tasks that require composing multiple reasoning paths, using a synthetic spatial and temporal reasoning benchmark (STaR). It finds that even reasoning models like o3-mini can only handle single-path reasoning, with performance degrading drastically in multi-path disjunctive reasoning scenarios.
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving: This paper proposes LeanListener, which introduces a verifier-in-the-loop design in automated theorem proving (ATP). By using the Lean verifier to provide step-level intermediate feedback (changes in the number of sub-goals) rather than merely trajectory-level rewards, the tactic validity rate and proving rate of ReProver are both improved via online GRPO training, achieving a 20% faster proving speed.
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning: This paper proposes LogicPro, a data synthesis method that leverages LeetCode algorithmic problems and Python code solutions as logic sources. Through a three-step pipeline ("problem generation \(\rightarrow\) code intermediate variable extraction \(\rightarrow\) program-guided reasoning generation"), it synthesizes 540K high-quality textual reasoning samples from 2,360 algorithmic problems. Models trained on this data significantly outperform those trained on existing reasoning datasets on multiple OOD benchmarks such as BBH27, LogicBench, and DROP.
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models: This paper reveals the bottleneck of "formalistic long-time thinking" that arises when long CoT data from large reasoning models (e.g., DeepSeek-R1) is distilled directly into smaller models. To alleviate it, the authors reconstruct tree-structured CoT data from scratch using MCTS and combine it with thought-length balancing, fine-grained DPO, and a joint training objective.
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning: This work proposes MCLM (a competition-level mathematical benchmark in 55 languages) and reveals that while three test-time scaling methods (ORM/PRM/Budget Forcing) yield significant improvements in English (e.g., +20 points on AIME), they yield an average gain of only 1.94 points in other languages, demonstrating a severe bottleneck in the multilingual generalization capability of test-time scaling.
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification: This paper proposes two models, MM-Verifier and MM-Reasoner. By synthesizing long-chain CoT verification data through simulation-based search combined with rejection sampling, and creating multimodal reasoning data via text distillation, the proposed 7B parameter models achieve an accuracy of 65.3% on MathVista, outperforming GPT-4o (63.8%) and human performance (60.3%).
On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures: This paper systematically investigates the generalization capability of LLMs across measurement systems (currency, length, weight). It reveals that models default to dominant measurements from their training data (e.g., USD, metric system), resulting in a significant accuracy drop for queries using non-dominant systems. While Chain-of-Thought (CoT) reasoning can alleviate this performance drop, it increases inference costs by up to 300%, creating a systemic inequality for users from underrepresented cultures.
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL: This work introduces the Long CoT Collection—a 100K long chain-of-thought reasoning dataset annotated by a short-CoT LLM (such as GPT-4o). By leveraging reasoning flows extracted from o1 as guidance, it enables short-CoT LLMs to generate long-CoT data, thereby addressing the cold-start problem in reinforcement learning. Models initialized on this data achieve a 2-3x performance gain in subsequent RL training.
PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation: PCoT (Persuasion-Augmented Chain of Thought) is proposed as a two-stage zero-shot method. In the first stage, LLMs are guided by prompts incorporating persuasion knowledge to identify six types of persuasion strategies in texts. In the second stage, the persuasion analysis is integrated as context into disinformation detection reasoning. On average, this improves the F1 score by 15% across 5 LLMs and 5 datasets, including two brand-new post-cutoff datasets.
ProcessBench: Identifying Process Errors in Mathematical Reasoning: This paper proposes the ProcessBench benchmark (comprising 3,400 test cases, focusing primarily on competition-/Olympiad-level math problems) to evaluate the capability of PRMs and critic models in locating the earliest erroneous step in mathematical reasoning. The findings reveal that existing PRMs fail to generalize to difficult problems beyond GSM8K/MATH, whereas general LLMs (e.g., QwQ-32B-Preview) acting as critics perform comparably to GPT-4o.
Ranked Voting based Self-Consistency of Large Language Models: Upgrades the majority voting in Self-Consistency to ranked voting, allowing the LLM to generate a preference ranking of multiple candidate answers for each reasoning path instead of a single answer. It uses three ranked voting methods (IRV/BCV/MRRV) to aggregate ranking information across multiple reasoning paths, consistently performing better than traditional SC on six datasets with a maximum improvement of 12.46%.
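Two of the ranked-voting aggregators can be sketched as follows. The sketch assumes BCV and MRRV refer to Borda count voting and mean reciprocal rank voting; the point scheme, the function names, and the example rankings are illustrative, and instant-runoff voting is omitted for brevity.

```python
from collections import defaultdict


def borda_vote(rankings):
    """Each ranking awards len(ranking) points to its first answer,
    len(ranking) - 1 to its second, and so on; highest total wins."""
    points = defaultdict(float)
    for ranking in rankings:
        for position, answer in enumerate(ranking):
            points[answer] += len(ranking) - position
    return max(points, key=points.get)


def mrr_vote(rankings):
    """Each ranking awards 1 / rank to every answer it lists; the answer
    with the highest summed reciprocal rank wins (dividing by the number
    of rankings would not change the winner)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, answer in enumerate(ranking):
            scores[answer] += 1.0 / (position + 1)
    return max(scores, key=scores.get)


# Five hypothetical reasoning paths, each ranking three candidate answers.
rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["A", "B", "C"],
    ["B", "C", "A"],
]
print(borda_vote(rankings), mrr_vote(rankings))
```

In this example plain majority voting over first choices ties A and B at two votes each, while both ranked methods pick B because it is never ranked last as often as A.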
Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory: Through systematic experiments across 6 LLMs \(\times\) 8 prompting strategies \(\times\) 6 benchmarks, this work discovers that as the number of majority voting samples increases, simple CoT consistently outperforms complex prompting strategies. This phenomenon is theoretically proven from a probability perspective, and an \(O(1)\) complexity scaling performance prediction method along with two improvement strategies are proposed.
Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation: Reinterprets Self-Consistency as a dynamic alignment problem between the sampling distribution and the true answer distribution, revealing that temperature not only controls sampling randomness but also directly shapes the true answer distribution. Based on this, a confidence-driven three-stage dynamic temperature adjustment mechanism is proposed (with theoretical derivation of the FSD threshold), improving both average and best performance with zero training overhead across 10 models on GSM8K/MATH.
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?: This paper systematically reveals that o1-like models, such as QwQ, DeepSeek-R1, and LIMO, do not possess true sequential scaling capabilities at test time—longer Chain-of-Thought (CoT) sequences do not yield higher accuracy, primarily due to insufficient self-revision capabilities. Based on this finding, the authors propose an alternative parallel scaling method called Shortest Majority Vote, which significantly outperforms traditional majority voting.
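A length-aware majority vote in the spirit of Shortest Majority Vote can be sketched as below. The scoring rule (answer count divided by the log of the mean chain length) is an assumption made for illustration and may differ from the paper's exact formula; `shortest_majority_vote` and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def shortest_majority_vote(samples):
    """Pick an answer from (answer, num_tokens) samples.

    Assumed score: count / log(mean length). Answers that are frequent
    and reached through shorter chains score higher.
    """
    lengths = defaultdict(list)
    for answer, num_tokens in samples:
        lengths[answer].append(num_tokens)

    def score(answer):
        count = len(lengths[answer])
        mean_len = sum(lengths[answer]) / count
        # Guard against log(x) <= 0 for degenerate one-token chains.
        return count / math.log(max(mean_len, 2.0))

    return max(lengths, key=score)


# Plain majority voting ties "42" and "17" at two votes each.
samples = [("42", 900), ("42", 1100), ("17", 200), ("17", 250), ("8", 5000)]
print(shortest_majority_vote(samples))
```

With these numbers the tie is broken toward "17", whose supporting chains are much shorter; a clear majority still wins because the log dampens the length term.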
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought: This paper proposes the RSVP framework, unifying the reasoning capabilities of multimodal large models with visual segmentation through a two-stage structure (reasoning-driven localization + segmentation refinement). Utilizing multimodal chain-of-thought visual prompting, it outperforms the SOTA on ReasonSeg by up to +6.5 gIoU / +9.2 cIoU, and achieves 49.7 mAP on zero-shot SegInW.
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification: The Safe framework is proposed, which for the first time utilizes the Lean 4 formal language to perform retrospective step-by-step verification of LLM mathematical reasoning. It detects hallucinations through auto-formalization and automated theorem proving (ATP), and integrates with prospective PRM scores to achieve SOTA performance on multiple mathematical datasets. Furthermore, the FormalStep benchmark containing 30,809 formal statements is released.
Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks: Proposes Self-Correction Learning (SCL), which categorizes self-correction data generated by the VLM itself (both successful and failed correction samples) into preference/dispreference pairs to perform preference fine-tuning via DPO. This fundamentally enhances the model's ability to directly generate correct answers, rather than relying solely on iterative refinement during inference.
Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning: Proposes the Self-Error-Instruct (SEI) framework, which analyzes error cases of the target model in mathematical reasoning, uses GPT-4o to extract error keyphrases and clusters them into error types, synthesizes training data for each error type using a self-instruct approach, and iteratively fine-tunes the model to systematically address weaknesses in mathematical reasoning.
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs: This paper proposes SoftCoT, which uses a frozen small auxiliary model (e.g., LLaMA-3.2-1B) to generate instance-specific "soft thought tokens" (continuous hidden states) that are mapped into the representation space of the primary LLM via a trainable projection module, serving as a reasoning prefix. This approach achieves parameter-efficient continuous-space CoT reasoning while avoiding the catastrophic forgetting problem caused by full-model fine-tuning.
STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond: This paper proposes the STRICTA framework, which models expert text assessment (e.g., peer review) as a step-by-step reasoning graph based on Structural Causal Models (SCMs). By collecting over 4,000 reasoning steps from more than 40 biomedical experts across 22 papers, the study reveals that differences in prior knowledge are the primary cause of review disagreement, and that writing style has an outsized impact on final decisions. Additionally, it highlights that LLMs can effectively assist in structured assessment when under human supervision.
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering: This paper presents the first evaluation of test-time scaling models (DeepSeek R1, S1) in selective question answering scenarios (where abstaining from answering is allowed). It finds that increasing test-time compute not only improves accuracy but also enhances the model's confidence in correct answers. The authors propose a selection function based on a confidence threshold and a "Jeopardy Odds" utility function to evaluate test-time scaling performance under non-zero penalties for incorrect answers.
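The combination of a confidence threshold and a penalty for wrong answers can be sketched as follows. The utility form (+1 for a correct answer, 0 for abstaining, minus `penalty` for an incorrect one) and the reading of "Jeopardy Odds" as `penalty = 1` are assumptions; the function name and the data are illustrative.

```python
def selective_utility(predictions, threshold, penalty):
    """Average utility over (confidence, is_correct) predictions.

    The model answers only when confidence >= threshold; otherwise it
    abstains and scores 0. Correct answers score +1, incorrect answers
    score -penalty.
    """
    total = 0.0
    for confidence, is_correct in predictions:
        if confidence < threshold:
            continue  # abstain
        total += 1.0 if is_correct else -penalty
    return total / len(predictions)


# Hypothetical (confidence, correctness) pairs for five questions.
predictions = [(0.9, True), (0.8, True), (0.6, False), (0.4, False), (0.3, True)]
for threshold in (0.0, 0.5, 0.7):
    print(threshold, selective_utility(predictions, threshold, penalty=1.0))
```

With no penalty, answering everything is always optimal; once wrong answers cost something, a higher threshold can raise utility, which is why confidence calibration matters in this setting.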
ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails: By distilling structured critique (safety labels + detailed reasoning) from GPT-4o/DeepSeek-R1, the guardrail model is fine-tuned to implement "slow thinking" safety judgment. It achieves the highest average F1 (75.5%) and AUPRC (79.5%) across 4 safety benchmarks, outperforming LLaMA Guard 3 with a 16.1% increase in accuracy and a 27.0% increase in Macro F1.
Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness: This paper systematically analyzes the representation patterns of CoT from the dual perspectives of effectiveness and faithfulness. It finds that problem difficulty, information gain, and the monotonicity of information flow govern the effectiveness of CoT. It also reveals the mechanism of unfaithful CoT: the model recalls correct information from the question that was omitted by the CoT when predicting answers. Based on these insights, the QUIRE algorithm is proposed, which simultaneously improves both the effectiveness (+2.4%) and faithfulness (+5.6%) of CoT.
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation: This paper proposes AIDsafe, a multi-agent iterative deliberation framework that automatically generates high-quality Safety Reasoning CoT data embedded with safety policies. The fine-tuned models significantly outperform traditional safety training in safety generalization and jailbreak robustness. Additionally, an "ear-whisperer" agent is introduced to resolve the difficulty of distinguishing between selected and rejected responses in DPO preference data.
TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning: Proposes TRACT, a two-stage regression-aware fine-tuning method that combines CoT reasoning with regression loss (squared error) to improve numerical scoring accuracy in LLM-as-a-judge scenarios, significantly outperforming existing approaches using only cross-entropy training or only regression loss.
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors: This work proposes the Traver (Trace-and-Verify) agent workflow, which explicitly estimates student knowledge states via knowledge tracing, then scores candidate tutoring utterances with a turn-by-turn verifier and selects the best one. To evaluate this, the Dict automated evaluation protocol (simulated student + code generation test) is introduced. In programming tutoring scenarios, it improves the student pass rate from 38.7% to 43.7% (5.0 percentage points, a relative gain of about 12.9%), significantly outperforming Vanilla Instruct, Self-Refine, and TreeInstruct.
Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research: This paper proposes the AGORA framework, which unifies 10 mainstream Agent reasoning algorithms (such as CoT, ReAct, ToT, and RAP) into pluggable Operator modules using a DAG-based graph orchestration engine. Systematic self-controlled comparisons on mathematical reasoning and multimodal tasks reveal that simple CoT methods often outperform complex algorithms in both accuracy and cost-effectiveness, and a single prompt modification can lead to a 90% performance leap.
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning: A systematic study of the three key factors influencing CoT distillation (granularity, format, and teacher models) reveals a non-monotonic relationship between SLM performance and granularity, demonstrates that format has minimal impact, and shows that stronger teachers do not always yield better students.