💡 LLM Reasoning

💬 ACL2026 · 45 paper notes

AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning: This paper proposes AIM-CoT, a framework grounded in Information Foraging Theory that addresses two core problems in Interleaved Multimodal Chain-of-Thought (I-MCoT): what to look at and when to look. It does so through Active Visual Probing (AVP) based on information gain and a Dynamic Attention-shift Trigger (DAT) mechanism.
Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data: This paper proposes a budget-aware anytime reasoning framework and an Anytime Index metric to quantify the quality-efficiency trade-off of LLM reasoning under limited token budgets. It further introduces Preference Data Prompting (PDP), a test-time self-improvement method based on LLM-synthesized preference data, achieving substantial improvements in both intermediate and final solution quality across planning, mathematics, and science QA tasks.
Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models: This paper proposes the Alignment Score — a semantic-level metric based on pairwise semantic entropy matrices — that quantifies reasoning alignment by comparing intermediate steps of model-generated chains-of-thought against human-preferred reference chains. The authors find that Alignment Score correlates strongly with task accuracy, readability, and coherence, and that 2-hop reasoning represents the peak depth for alignment.
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models: This paper introduces OlymMATH, the first olympiad-level mathematical benchmark that unifies natural language evaluation and formal theorem proving. It comprises 350 bilingual (Chinese–English) problems, spanning OlymMATH-EASY/HARD (200 problems with numerical answers) and OlymMATH-LEAN (150 problems formalized in Lean 4). Experiments reveal that the strongest model achieves only 58.4% accuracy on the HARD subset.
CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization: To address the unlearning challenge in large reasoning models (LRMs)—where sensitive knowledge must be removed from both chain-of-thought (CoT) reasoning and final answers simultaneously—this paper proposes the CiPO framework. CiPO instructs the model to generate logically valid counterfactual reasoning trajectories and employs iterative preference optimization to steer the model toward these counterfactual paths, achieving effective unlearning while preserving reasoning capability.
CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning: This paper proposes CRISP, a framework that identifies the attention pattern of the </think> token as a reliable indicator for distinguishing critical from redundant steps in reasoning chains. Building on this insight, CRISP designs a greedy-search compression pipeline with four atomic operators, reducing token usage by 50–60% while preserving accuracy.
Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective: Through Cross-CoT experiments and step-wise analysis, this paper reveals a "decoupling mechanism" underlying CoT reasoning: final accuracy is determined by CoT content (~99% variance contribution), whereas distributional ranking is dominated by the model's intrinsic prior (>80%). This demonstrates that long CoT is a strong decision-maker but a weak distribution calibrator.
Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views: This work identifies a shared logical subspace within LLMs that simultaneously aligns natural-language and symbolic-logic reasoning representations. Steering activations along this subspace at inference time improves logical reasoning accuracy by up to 11 percentage points without any model training.
Dissecting Failure Dynamics in Large Language Model Reasoning: By analyzing LLM reasoning trajectories, this work finds that errors concentrate at a small number of critical turning points in the early stages, after which the model enters a "cognitive spiral"—continuously extending the reasoning in a locally coherent but globally erroneous manner. Based on these findings, the paper proposes the GUARD framework, which performs short-range branching repairs at high-risk turning points detected via entropy signals.
Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error: This paper proposes LTE (Learning to reason from Trial and Error), which mitigates exploration stagnation in RLVR by using the model's own erroneous answers as hints to guide additional rollouts, without relying on any external expert supervision.
DPC: Training-Free Text-to-SQL Candidate Selection via Dual-Paradigm Consistency: DPC reframes Text-to-SQL candidate selection from "guessing over hidden data" to "deterministic verification over constructed data": it builds a Minimal Discriminative Database (MDD) that forces conflicting SQL candidates to produce different execution results, then uses Python/Pandas solutions as reference anchors to select the correct candidate via cross-paradigm consistency, outperforming Self-Consistency by up to 2.2% on BIRD and Spider.
Efficient PRM Training Data Synthesis via Formal Verification: This paper proposes FoVer, a framework that leverages formal verification tools (Z3 and Isabelle) to automatically annotate step-level correctness labels for reasoning chains in formal reasoning tasks. It constructs the FoVer-40K training set and fine-tunes a PRM, demonstrating formal-to-informal transfer capability and cross-task generalization across 12 reasoning benchmarks.
Efficient Process Reward Modeling via Contrastive Mutual Information: This paper proposes CPMI (Contrastive Pointwise Mutual Information), an efficient automatic step-level reward annotation method that estimates step-wise contributions by contrasting the conditional probability shifts a reasoning step induces on correct versus incorrect answers. Compared to Monte Carlo estimation, CPMI reduces construction time by 84% and token generation by 98%, while achieving higher accuracy on both process-level evaluation benchmarks and mathematical reasoning benchmarks.
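The contrast described in the CPMI note can be illustrated with a small sketch. This is only one plausible reading of the summary, not the paper's formula: it assumes the step score is the log-probability shift a step induces on the correct answer minus the shift it induces on an incorrect answer. The function name `contrastive_step_score` and its four probability arguments are hypothetical.

```python
import math


def contrastive_step_score(
    p_correct_before: float,
    p_correct_after: float,
    p_wrong_before: float,
    p_wrong_after: float,
) -> float:
    """Illustrative step score (assumed form, not the paper's definition).

    Each argument is the model's probability of an answer given the
    reasoning prefix, before or after appending the step being scored.
    """
    # How much the step raises the log-probability of the correct answer.
    shift_correct = math.log(p_correct_after) - math.log(p_correct_before)
    # How much the same step raises the log-probability of a wrong answer.
    shift_wrong = math.log(p_wrong_after) - math.log(p_wrong_before)
    # A step is rewarded when it helps the correct answer more than the wrong one.
    return shift_correct - shift_wrong


helpful = contrastive_step_score(0.2, 0.4, 0.2, 0.1)
harmful = contrastive_step_score(0.4, 0.2, 0.1, 0.2)
print(helpful > 0, harmful < 0)
```

Under this reading, scoring a step needs only a few probability evaluations rather than many sampled rollouts, which is consistent with the efficiency gains over Monte Carlo estimation reported in the note.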
Efficient Test-Time Scaling via Temporal Reasoning Aggregation: This paper proposes TRACE, a framework that determines reasoning convergence by aggregating two complementary signals within a sliding window — answer consistency across steps and confidence trajectory over time — enabling training-free dynamic early exit that reduces token usage by 25–30% with only a 1–2% accuracy drop.
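A minimal sketch of the sliding-window idea in the TRACE note, under stated assumptions: each reasoning step is assumed to yield a tentative answer and a confidence value, and the thresholds, window size, and function names (`window_converged`, `early_exit_step`) are hypothetical choices for illustration rather than the paper's settings.

```python
from collections import Counter, deque
from typing import Iterable, Optional, Sequence, Tuple

Step = Tuple[str, float]  # (tentative answer, confidence in [0, 1])


def window_converged(
    window: Sequence[Step],
    min_agreement: float = 0.8,
    min_confidence: float = 0.7,
) -> Tuple[bool, str]:
    """Combine answer consistency and confidence trajectory (assumed rule)."""
    answers = [answer for answer, _ in window]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / len(answers)
    # Confidence values of the steps that voted for the leading answer.
    confs = [conf for answer, conf in window if answer == top_answer]
    mean_conf = sum(confs) / len(confs)
    # Require that confidence does not end lower than it started.
    not_declining = confs[-1] >= confs[0]
    ok = agreement >= min_agreement and mean_conf >= min_confidence and not_declining
    return ok, top_answer


def early_exit_step(
    steps: Iterable[Step], window_size: int = 4
) -> Optional[Tuple[int, str]]:
    """Return (step index, answer) at the first converged window, else None."""
    window: deque = deque(maxlen=window_size)
    for index, step in enumerate(steps):
        window.append(step)
        if len(window) == window_size:
            ok, answer = window_converged(list(window))
            if ok:
                return index, answer
    return None


trajectory = [("12", 0.4), ("15", 0.5), ("15", 0.6),
              ("15", 0.75), ("15", 0.8), ("15", 0.9)]
print(early_exit_step(trajectory))
```

In this toy trajectory the window first agrees on the answer at step 4 but exits only at step 5, once the mean confidence also clears the threshold, which shows why the two signals are complementary.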
Explicit Trait Inference for Multi-Agent Coordination: This paper proposes Explicit Trait Inference (ETI), a method that enables LLM agents to reason about and track partners' behavioral traits along the psychological dimensions of warmth and competence. ETI reduces payoff loss by 45–77% in economic games and improves task performance by 3–29% on MultiAgentBench.
Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck: This paper proposes Multi-Focus Attention Instruction (MFAI) as a semantic probe to reveal the "weakest link effect" in multi-hop QA: performance is determined by the absolute position of the least visible evidence bucket rather than by the distance between facts. Failures stem primarily from a recognition bottleneck rather than from reasoning deficits, and System-2 reasoning models can effectively resist positional bias and misleading attention cues.
FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents: This paper proposes FS-Researcher, a file-system-based dual-agent framework for deep research. A Context Builder constructs a hierarchical knowledge base while a Report Writer composes reports section by section. By leveraging a persistent workspace to overcome context window limitations, the framework achieves 53.94 RACE (SOTA) on DeepResearch Bench and demonstrates a positive test-time scaling effect between context-building compute and report quality.
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO: This paper presents GanitLLM, the first mathematical reasoning model that genuinely reasons in Bengali (rather than translating or reasoning in English), together with Ganit, a difficulty-annotated Bengali math dataset. The proposed Curriculum-GRPO addresses the cold-start problem in GRPO training for low-resource languages. The 4B model achieves an 8 percentage-point accuracy gain on Bn-MGSM, and the proportion of Bengali reasoning tokens increases from 14% to 88%.
Generating Effective CoT Traces for Mitigating Causal Hallucination: This paper proposes the Causal Hallucination Rate (CHR) metric to quantify the tendency of small LLMs to over-predict causal relations in event causal identification (ECI). Systematic experiments then identify two key criteria for effective CoT data: sufficiently long semantic explanations, and a distribution aligned with the target model. A low-cost CoT data generation pipeline designed around these criteria reduces the CHR of Qwen2.5-1.5B from 83.54% to 6.26% while improving mean accuracy to 66.00%.
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study: This paper systematically investigates how to enhance the safety of large reasoning models (LRMs) via SFT. It identifies five risky reasoning patterns—most notably weak vacillation—as the root cause of limited effectiveness in direct safety response distillation, proposes targeted distillation strategies that reduce the PAIR attack success rate from 63% to 13%, and demonstrates that short chain-of-thought and template-based reasoning achieve safety performance comparable to full-length reasoning chains.
JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents: JTPRO proposes a joint optimization framework that requires no model fine-tuning. Through reflection-driven iterative editing, it simultaneously optimizes global instructions and per-tool schema/parameter descriptions, significantly improving end-to-end success rates for tool selection and slot filling in large-scale tool library settings, achieving 5%–20% OSR gains over baselines such as GEPA.
Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning: This paper proposes InstruCoT, which synthesizes diverse training data covering multiple injection vectors and threat scenarios, and introduces a three-stage instruction-level chain-of-thought fine-tuning framework based on a situation-aware model. This enables LLMs to effectively identify and reject malicious instructions under various prompt injection attacks, substantially outperforming existing defenses across three evaluation dimensions: behavioral deviation, privacy leakage, and harmful output.
Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners: This paper systematically investigates the latent reasoning behavior of large reasoning models (LRMs) across 11 languages, finding that latent reasoning capability exists multilingually but is unevenly distributed (stronger for high-resource languages, weaker for low-resource ones), and that internal reasoning dynamics tend toward an English-centric shared pathway.
Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting: CoT2Edit proposes a new paradigm for teaching LLMs to perform knowledge editing via CoT reasoning. It constructs CoT instruction data for both structured and unstructured edits, trains with SFT warm-start followed by GRPO optimization, and retrieves edited facts via RAG at inference time. A single training run achieves SOTA across 6 editing benchmarks with strong generalization.
Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning: This paper identifies a "logical phase transition" phenomenon in LLM logical reasoning—performance collapses abruptly at specific complexity thresholds rather than degrading smoothly. The authors propose a Logical Complexity Metric (LoCM) to quantify this phenomenon, and design a Neuro-Symbolic Curriculum Tuning (NSCT) framework that achieves average accuracy gains of +1.26 over naive prompting and +3.95 over CoT across five benchmarks via adaptive neuro-symbolic alignment and complexity-aware curriculum optimization.
MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference: This paper introduces the MARCH benchmark (2,209 multi-hop ambiguous questions) and the CLARION framework, presenting the first systematic study of QA challenges at the intersection of ambiguity interpretation and multi-step reasoning, and revealing severe deficiencies in existing SOTA models on such problems.
MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis: This paper proposes MathAgent, a hierarchical data synthesis framework based on adversarial evolution of constraint graphs. It reformulates data synthesis from a text generation task into an unsupervised optimization problem over constraint graphs. A three-agent Legislator system (Proposer-Critic-Moderator) evolves problem skeletons, which are then instantiated into natural language by an Executor. With only 1K synthetic samples, MathAgent surpasses LIMO and s1K across eight mathematical benchmarks.
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning: OctoTools is a training-free, user-friendly, and easily extensible multi-agent framework that encapsulates heterogeneous tools via standardized tool cards, adopts a Planner-Executor separation paradigm, and employs a task-specific toolset optimization algorithm. It achieves an average accuracy improvement of +9.3% over GPT-4o and up to +10.6% over frameworks such as AutoGen and LangChain across 16 diverse benchmarks.
Parallel Test-Time Scaling for Latent Reasoning Models: This paper is the first to introduce parallel test-time scaling (parallel TTS) into latent reasoning models. It proposes two uncertainty-theoretic stochastic sampling strategies (MC-Dropout and additive Gaussian noise) along with a step-level contrastively trained latent reward model (LatentRM), enabling models that reason in continuous vector spaces to achieve consistent performance gains through parallel sampling and aggregation.
Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards: This paper proposes leveraging the Planning Domain Definition Language (PDDL) to automatically generate large-scale, high-precision step-level reward datasets for training Process Reward Models (PRMs), achieving significant improvements on both mathematical and non-mathematical reasoning benchmarks.
ReCoQA: A Benchmark for Tool-Augmented and Multi-Step Reasoning in Real Estate Question and Answering: This paper introduces ReCoQA—a large-scale benchmark comprising 29,270 real estate question-answer pairs—that requires models to perform hybrid multi-source reasoning by integrating database queries and map API calls. The authors further propose HIRE-Agent, a hierarchical multi-agent framework serving as a strong baseline, and systematically identify the bottlenecks of existing LLMs in complex reasoning within vertical domains.
Reinforced Efficient Reasoning via Semantically Diverse Exploration: ROSE proposes a semantic-entropy-guided MCTS branching strategy and a length-aware segment-level advantage estimation to address the insufficient exploration diversity and low reasoning efficiency of existing MCTS-based RLVR methods, achieving state-of-the-art pass@8 performance across multiple mathematical reasoning benchmarks.
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning: This paper proposes Render-of-Thought (RoT), the first approach to render textual CoT reasoning steps as images. It leverages a pretrained visual encoder as a semantic anchor to align LLM hidden states to the visual embedding space, achieving 3–4× token compression and significant inference acceleration while preserving the interpretability of the reasoning chain.
Revisiting Entropy in Reinforcement Learning for Large Reasoning Models: This paper systematically investigates entropy dynamics in RLVR training of LLMs, identifies positive-advantage tokens as the primary driver of entropy collapse, and proposes Positive-Advantage Reweighting, which dynamically adjusts the loss weights of positive-advantage tokens to effectively regulate model entropy.
Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models: This paper proposes GenCluster, a scalable test-time compute framework that achieves gold-medal performance on IOI 2025 (446.75/600) with the open-weight model gpt-oss-120b, via large-scale parallel generation → behavioral clustering → tournament ranking → round-robin submission.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning: This paper proposes CoT-PoT, a cross-modal ensembling method that exploits the complementarity between chain-of-thought (CoT) and program-of-thought (PoT) reasoning modalities to reduce the number of samples required for self-consistency by 9.3×, resolving 78.6% of problems with only 2 samples.
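The two-sample idea in the CoT-PoT note can be sketched as follows. This is an assumption-laden illustration: it accepts the answer when one CoT sample and one PoT sample agree, and otherwise falls back to a majority vote over extra samples. The fallback rule, the `normalize` helper, and the `draw_more` callable are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Hypothetical normalization so that '42' and ' 42 ' compare equal."""
    return answer.strip().lower()


def cot_pot_ensemble(
    cot_answer: str,
    pot_answer: str,
    draw_more: Callable[[], List[str]],
) -> Tuple[str, int]:
    """Return (final answer, number of samples used)."""
    cot, pot = normalize(cot_answer), normalize(pot_answer)
    # Agreement across the two reasoning styles is treated as sufficient.
    if cot == pot:
        return cot, 2
    # Disagreement: draw additional samples lazily and take a majority vote.
    extra = [normalize(a) for a in draw_more()]
    votes = Counter([cot, pot, *extra])
    winner, _ = votes.most_common(1)[0]
    return winner, 2 + len(extra)


print(cot_pot_ensemble("42", " 42 ", lambda: []))
print(cot_pot_ensemble("41", "42", lambda: ["42", "42", "41"]))
```

Passing `draw_more` as a callable means the extra samples are generated only when the first two disagree, which is where the sample savings would come from.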
Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration: This paper proposes RDDG, a tabular data synthesis framework based on progressive Chain-of-Thought, which guides LLMs to generate high-fidelity tabular data through coreset selection, relational mining, and a self-reinforcing feedback mechanism, achieving an average improvement of 2%+ Macro-F1 on imbalanced classification tasks.
Semantic-Aware Logical Reasoning via a Semiotic Framework: This paper proposes LogicAgent, a logical reasoning framework grounded in the Greimas Semiotic Square. By performing multi-perspective semantic analysis and reflective verification, LogicAgent achieves state-of-the-art logical reasoning performance under the dual challenges of semantic and logical complexity.
Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning: This paper proposes Step-GRPO, which internalizes dynamic early-exit capability into the model — measuring reasoning complexity via semantic steps rather than raw tokens, exposing concise correct trajectories through dynamically truncated rollouts, and guiding the model to learn when to stop reasoning via step-aware relative rewards. On Qwen3-8B, it reduces token consumption by 32% with no accuracy degradation.
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment: This paper identifies that LLM agents exhibit a human-like "actor-observer asymmetry" (AOA) cognitive bias during role-play — when acting as actors, agents tend to attribute failures to external factors, while as observers they tend to attribute failures to internal errors. The authors propose ReTAS, which employs dialectical reasoning (thesis–antithesis–synthesis) and GRPO-based alignment to mitigate this bias.
Think Outside the Policy: In-Context Steered Policy Optimization: This paper proposes ICPO (In-Context Steered Policy Optimization), which leverages the in-context learning (ICL) capability of large language models as implicit expert guidance to expand the policy exploration space during RLVR training, without relying on reasoning trajectories from external, stronger models.
Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval: This paper proposes DIN-Retrieval, which identifies domain-invariant neurons (DINs) in LLMs that exhibit consistent activation polarity across domains and uses them to construct a domain-robust representational subspace for retrieving structurally compatible cross-domain demonstrations. It provides the first systematic evidence that cross-domain ICL examples can improve LLM reasoning performance, with an average gain of 1.8% on math-to-logic reasoning transfer.
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models: TrigReason proposes an event-triggered collaboration framework between small and large reasoning models. By analyzing three systematic failure modes of small reasoning models (SRMs)—path deviation, cognitive overload, and recovery failure—the framework designs three corresponding triggers: strategic priming, cognitive offloading, and intervention request. These triggers replace step-wise polling verification, enabling 1.70–4.79× more reasoning steps to be offloaded to the SRM while maintaining LRM-level accuracy, reducing latency by 43.9% and API cost by 73.3%.
TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards: This paper frames automated multi-turn jailbreak attacks as a multi-turn reinforcement learning problem and proposes TROJail, which introduces two heuristic process rewards—over-harm penalization and semantic relevance progression—to alleviate the sparse supervision problem of outcome-only rewards, achieving substantial improvements in attack success rate across multiple models and benchmarks.
When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning: This paper proposes the DTSR framework, which detects "reflection signals" (e.g., Wait, Alternatively) during the reasoning process and triggers a self-assessment of the current reasoning chain's "sufficiency" at those positions to determine whether to exit early. DTSR achieves 28.9%–34.9% reasoning length reduction on the Qwen3 model series with negligible accuracy loss.
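The control flow described in the DTSR note can be sketched roughly as below. It assumes the reasoning arrives as a stream of text chunks and that a sufficiency judgment is available as a callable; the `is_sufficient` callable, the chunked interface, and the function name are hypothetical, and only the two signal words quoted in the note are matched.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Reflection signals quoted in the note; a real system may use a longer list.
REFLECTION_SIGNAL = re.compile(r"\b(Wait|Alternatively)\b")


def reason_with_early_exit(
    chunks: Iterable[str],
    is_sufficient: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Return (reasoning kept, whether an early exit happened)."""
    kept: List[str] = []
    for chunk in chunks:
        # A reflection signal marks a point where the model is about to
        # reconsider; ask whether the reasoning so far already suffices.
        if kept and REFLECTION_SIGNAL.search(chunk):
            if is_sufficient("".join(kept)):
                return "".join(kept), True
        kept.append(chunk)
    return "".join(kept), False


stream = ["2 + 2 = 4. ", "Wait, let me double check. ", "Yes, the answer is 4."]
print(reason_with_early_exit(stream, lambda text: "= 4" in text))
```

Checking sufficiency only at reflection points keeps the number of extra assessments small compared with checking after every chunk.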