💡 LLM Reasoning

🤖 AAAI2026 · 30 paper notes

A Reasoning Paradigm for Named Entity Recognition

This paper proposes ReasoningNER, which reframes named entity recognition from "implicit pattern matching" to an "explicit reasoning" paradigm. Through a three-stage pipeline (CoT data construction → CoT fine-tuning → GRPO-based reinforcement learning), the model first reasons and then extracts entities. Under zero-shot settings, ReasoningNER surpasses GPT-4 by 12.3 F1 points, and the 8B model achieves an average F1 of 72.4 on CrossNER.

Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

This paper systematically analyzes abstention failures in Large Reasoning Models (LRMs) when confronted with unanswerable math problems. It finds that LRMs possess sufficient internal cognitive capacity to recognize unsolvability (linear probe classification accuracy >80%), yet their external behavior remains biased toward forced answering. A two-stage approach combining cognitive monitoring and inference-time intervention is proposed, improving abstention rates from 16–54% to 60–92% without degrading reasoning performance on answerable questions.
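
A minimal sketch of the probe-then-intervene idea, with synthetic stand-ins for the extracted hidden states (dimensions, thresholds, and all names here are illustrative, not the paper's code):

```python
# Minimal sketch: train a linear probe to read "this problem is unsolvable"
# out of hidden states. Real hidden states are assumed pre-extracted; here
# two slightly shifted Gaussian clusters stand in for them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 256                                    # hidden-state dimension (assumed)
answerable   = rng.normal(0.0, 1.0, (500, d))
unanswerable = rng.normal(0.3, 1.0, (500, d))

X = np.vstack([answerable, unanswerable])
y = np.array([0] * 500 + [1] * 500)        # 1 = internally "knows" it is unsolvable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")

# At inference time a probe like this plays the "cognitive monitor" role:
# if its unsolvability probability exceeds a threshold, trigger abstention.
```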

ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

This paper proposes the Latent Reasoning Chain Extraction (ARCHE) task, which requires LLMs to decompose the argumentation of scientific papers into Reasoning Logic Trees (RLTs) grounded in Peirce's three reasoning paradigms (abduction, deduction, and induction). Through two complementary metrics—Entity Coverage (EC) and Reasoning Edge Accuracy (REA)—the study reveals a fundamental trade-off between content completeness and logical correctness across 10 mainstream LLMs.

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

This paper proposes BadThink — the first training-time backdoor attack targeting CoT reasoning efficiency. By iteratively optimizing verbose reasoning templates with an LLM, it constructs poisoned data that causes the victim model, upon trigger activation, to generate reasoning chains inflated by more than 17× (on MATH-500), while preserving final-answer correctness and remaining highly stealthy.

BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

This paper proposes BLM-Guard, an explainable multimodal moderation framework for short-video commercial advertisements. It first establishes structured reasoning capability via rule-driven ICoT data synthesis and SFT cold-start, then applies Self-Adaptive GRPO reinforcement learning (combining rule correctness rewards and a self-adaptive consistency reward SCA-R) to optimize policy alignment, achieving 91.4% strict accuracy and 0.845 reasoning consistency score on a real-world ad benchmark.

Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models

This paper proposes ASE (Adversarial Scenario Extrapolation), an inference-time CoT defense framework that enables LLMs to autonomously simulate adversarial scenarios and formulate defensive strategies prior to responding. ASE achieves near-zero attack success rates across four categories of safety threats (jailbreak, toxicity, hallucination, and bias), while reducing direct refusal rates to ≤4%, effectively balancing robustness and user experience.

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

This paper proposes the CMMCoT framework, which constructs interleaved multimodal multi-step reasoning chains (with visual region token supervision) and a test-time retrieval-based memory augmentation module (RIFREM) to enhance slow-thinking reasoning in multi-image scenarios without increasing model parameters. Built on Qwen2.5-VL-7B, the method achieves an average improvement of 1.4 points on multi-image benchmarks.

Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning

This paper demonstrates that attention head activations in intermediate layers of LLMs implicitly encode truthfulness information about reasoning steps during CoT inference (probing accuracy up to 85%). Based on this finding, confidence predictors are trained to guide beam search in dynamically selecting high-confidence reasoning paths, surpassing Self-Consistency and PRM Guided Search on mathematical, symbolic, and commonsense reasoning tasks.
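
The path-selection step can be sketched as a beam search ranked by a confidence predictor rather than by likelihood alone; both the step generator and the predictor below are hypothetical stand-ins:

```python
import heapq

def confidence_guided_beam(expand, confidence, init, width=3, depth=4):
    """Toy beam search over reasoning paths, ranked by a learned
    confidence score. `expand` (step generator) and `confidence`
    (trained predictor) are stand-ins for the paper's components."""
    beams = [init]
    for _ in range(depth):
        candidates = [path + [step] for path in beams for step in expand(path)]
        # keep the `width` highest-confidence partial paths
        beams = heapq.nlargest(width, candidates, key=confidence)
    return max(beams, key=confidence)

# Toy usage: steps are integers; the "confidence" prefers even-valued steps.
best = confidence_guided_beam(
    expand=lambda path: [path[-1] + 1, path[-1] + 2],
    confidence=lambda path: sum(1 for s in path if s % 2 == 0),
    init=[0],
)
print(best)
```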

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

This paper systematically investigates decision-making uncertainty across 32 open-source LLMs in moral dilemma scenarios (trolley-problem variants), finding that uncertainty is driven primarily by model architecture rather than by the moral dimension of the scenario. Introducing attention dropout at inference time significantly increases mutual information and improves human-LLM moral alignment, suggesting that reducing overconfidence in moral scenarios can enhance consistency with human preferences.
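
One common way to realize inference-time dropout is the MC-dropout trick of re-activating dropout modules inside an otherwise frozen eval-mode model; whether this matches the paper's exact attention-level intervention is an assumption:

```python
import torch
import torch.nn as nn

def enable_inference_dropout(model: nn.Module, p: float = 0.1) -> None:
    """Re-activate dropout layers at inference time while the rest of
    the model stays in eval mode (the standard MC-dropout trick)."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p
            m.train()

# Toy model: outputs become stochastic once dropout is re-enabled.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.0), nn.ReLU(), nn.Linear(8, 2))
model.eval()
enable_inference_dropout(model, p=0.3)

x = torch.randn(1, 8)
samples = torch.stack([model(x) for _ in range(32)])
print(samples.var(dim=0))   # nonzero variance -> a usable uncertainty signal
```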

Efficient Thought Space Exploration Through Strategic Intervention

This paper proposes the Hint-Practice Reasoning (HPR) framework, in which a large model (hinter) provides short hints at sparse critical tokens while a small model (practitioner) handles the majority of the reasoning. HPR achieves performance comparable to the self-consistency baseline using only 1/5 of the tokens, and improves accuracy by up to 5.1% under the same FLOPs budget.
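
The control flow is easy to sketch: the practitioner decodes normally, and the hinter is consulted only at high-entropy ("critical") positions. Both models and the entropy trigger below are toy stand-ins:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def hint_practice_decode(practitioner, hinter, tokens, steps=8, tau=0.5):
    """Toy HPR-style loop. `practitioner` (small model) returns a
    next-token distribution; at high-entropy positions the `hinter`
    (large model) is consulted instead. Both callables are stubs."""
    calls_to_hinter = 0
    for _ in range(steps):
        dist = practitioner(tokens)               # cheap call, every step
        if entropy(list(dist.values())) > tau:    # uncertain -> ask for a hint
            tok = hinter(tokens)                  # expensive call, sparse
            calls_to_hinter += 1
        else:
            tok = max(dist, key=dist.get)         # greedy small-model token
        tokens = tokens + [tok]
    return tokens, calls_to_hinter

# Toy usage: the practitioner is confident except at every third step.
out, n_hints = hint_practice_decode(
    practitioner=lambda t: {"a": 0.5, "b": 0.5} if len(t) % 3 == 0
                           else {"a": 0.9, "b": 0.1},
    hinter=lambda t: "HINT",
    tokens=[],
)
print(out, n_hints)
```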

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

This paper constructs ESG-Bench — 270 manually annotated QA pairs drawn from 94 real ESG reports (2020–2024) — and proposes a three-stage hallucination-mitigation pipeline: SFT (with grounded answers plus "Not Provided" abstention labels) → CoT prompting (2- and 4-step prompt templates) → CoT fine-tuning (with human-annotated reasoning chains). The 4-step CoT fine-tuned Llama-3 achieves 92.52% with-answer (WA) accuracy and 99.37% without-answer (WoA) accuracy (96% balanced accuracy), with generalization gains on HaluEval and BioASQ.

Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

This paper defines five dialogue phases and twelve support strategies based on the COPC industry standard, generates 11,232 strategy-rich synthetic dialogues (RoleCS) via five-agent role-playing, and constructs a 1,855-sample evaluation set (CSConv) by rewriting real conversations. Fine-tuning on these resources substantially improves strategy-aligned response quality and issue resolution rates.

ExtendAttack: Attacking Servers of LRMs via Extending Reasoning

This paper proposes ExtendAttack, a resource exhaustion attack targeting Large Reasoning Models (LRMs): by randomly converting characters in the prompt into multi-base ASCII encodings, the attack forces models to perform extensive character-by-character decoding before answering, increasing o3's response length by more than 2.7× and doubling latency, while keeping answer accuracy largely intact.
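
The core transformation is straightforward to illustrate: a fraction of characters is replaced by its ASCII code in a randomly chosen base. The annotation format below is an assumption:

```python
import random

def to_base(n: int, base: int) -> str:
    digits = "0123456789abcdef"
    out = ""
    while n:
        out = digits[n % base] + out
        n //= base
    return out or "0"

def extend_attack(prompt: str, ratio: float = 0.4, seed: int = 0) -> str:
    """Illustrative ExtendAttack-style rewrite: each selected character
    becomes its ASCII code in a random base (2-16), annotated so the
    model must decode it before answering. Format is an assumption."""
    rng = random.Random(seed)
    out = []
    for ch in prompt:
        if rng.random() < ratio:
            base = rng.randint(2, 16)
            out.append(f"<{to_base(ord(ch), base)}>_{base}")
        else:
            out.append(ch)
    return "".join(out)

print(extend_attack("What is 17 * 23?"))
```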

Graph of Verification: Structured Verification of LLM Reasoning with Directed Acyclic Graphs

This paper proposes Graph of Verification (GoV), a structured verification framework that models LLM reasoning processes as directed acyclic graphs (DAGs). Through a flexible Node Block architecture, GoV enables multi-granularity verification—ranging from atomic-level steps in formal tasks to paragraph-level verification in natural language narratives—and substantially outperforms both holistic verification and other decomposed verification methods on both structured and loosely structured reasoning benchmarks.
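
A sketch of the verification loop under the natural reading of GoV: steps (or node blocks) are checked in topological order, each conditioned only on already-verified premises. The step verifier is a stub:

```python
from graphlib import TopologicalSorter

def verify_dag(premises_of: dict, check) -> dict:
    """GoV-style sketch: verify reasoning nodes in topological order,
    each given only its verified premises. `check` stands in for an
    LLM-based step verifier; a node fails if any premise failed."""
    verdicts = {}
    for node in TopologicalSorter(premises_of).static_order():
        premises = premises_of.get(node, set())
        if all(verdicts.get(p, False) for p in premises):
            verdicts[node] = check(node, premises)
        else:
            verdicts[node] = False        # a premise already failed upstream
    return verdicts

# Toy DAG: step C depends on A and B; the verifier rejects step "B".
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(verify_dag(dag, check=lambda node, prem: node != "B"))
# -> {'A': True, 'B': False, 'C': False}
```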

Improving Value-based Process Verifier via Low-Cost Variance Reduction

To address the high-variance issue in value-based process reward model (PRM) training caused by limited Monte Carlo (MC) samples, this paper proposes Compound Monte Carlo Sampling (ComMCS), which constructs an unbiased low-variance estimator by linearly combining MC estimates from the current step and subsequent steps. The method introduces no additional LLM inference overhead and achieves a 2.2-point improvement on MATH-500 under Best-of-32 evaluation.
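
The shape of the estimator can be sketched as follows, under the standard value-PRM assumption that a step's value is the success probability of rollouts from that step, which makes the value sequence a martingale along a sampled trajectory (the weights and horizon here are illustrative, not the paper's):

```latex
% \hat{v}_{t+k}: plain MC estimate at step t+k of the same trajectory.
% Under the martingale assumption, E[\hat{v}_{t+k} | s_t] = v_t, so any
% convex combination of later-step estimates remains unbiased for v_t:
\hat{v}^{\,\mathrm{Com}}_{t} \;=\; \sum_{k=0}^{K} \alpha_k \,\hat{v}_{t+k},
\qquad \sum_{k=0}^{K} \alpha_k = 1
\;\;\Longrightarrow\;\;
\mathbb{E}\big[\hat{v}^{\,\mathrm{Com}}_{t}\big] = v_t
```

Because every term reuses rollouts that already exist for the later steps, the combination averages down the noise of the scarce per-step MC samples without any extra LLM inference.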

Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

This paper proposes a Self-Rewriting framework that enables large reasoning models (LRMs) to rewrite their own reasoning traces for "easy" samples (queries where all responses are correct) during RL training and learn from the rewritten versions. With only ~10% additional training overhead, the approach reduces reasoning length by 46% while maintaining accuracy, improves internal reasoning quality (LLM-as-Judge) by 7.2 points, and effectively mitigates issues such as over-thinking and redundant thinking.

Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation

This paper proposes RoutingGen — a difficulty-aware code generation framework grounded in the principle of cognitive economy. A Qwen3-8B classifier dynamically routes tasks to either a simple path (few-shot direct generation) or a complex path (Intention CoT = specification constraints + algorithmic intent + complexity analysis), achieving a +45.15% improvement on McEval while reducing average token consumption by 46.37%.
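
The dispatch logic is simple to sketch; the classifier and both generation paths below are stubs, and the Intention-CoT prompt wording is invented:

```python
def route_and_generate(task: str, classify, simple_gen, complex_gen) -> str:
    """Sketch of RoutingGen-style dispatch: a difficulty classifier
    sends easy tasks down a cheap few-shot path and hard ones down an
    Intention-CoT path. All three callables stand in for LLM calls."""
    if classify(task) == "simple":
        return simple_gen(task)              # few-shot direct generation
    # Intention CoT: constraints -> algorithmic intent -> complexity -> code
    return complex_gen(
        "State the specification constraints, the algorithmic intent, and "
        f"a complexity analysis first, then write the code.\nTask: {task}"
    )

# Toy usage with trivial stubs.
print(route_and_generate(
    "reverse a string",
    classify=lambda t: "simple" if len(t) < 40 else "complex",
    simple_gen=lambda t: f"# direct solution for: {t}",
    complex_gen=lambda p: f"# reasoned solution for: {p[:30]}...",
))
```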

Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search

This work constructs the NbQA dataset (38K task-solution pairs extracted from real Jupyter Notebooks) and proposes the Jupiter framework (modeling data analysis as a state-level search problem with PUCT search guided by a value model), enabling Qwen2.5-14B to achieve 86.38% on InfiAgent-DABench, surpassing GPT-4o (85.99%), and improving Qwen2.5-7B on DSBench from 63.51% to 89.19%.
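
Assuming the standard PUCT formulation (the paper's constants and exact form may differ), node selection in the value-guided search looks like:

```latex
% Q(s,a): value-model estimate of taking analysis step a in state s,
% P(a|s): policy prior, N: visit counts, c: exploration constant.
a^{*} \;=\; \arg\max_{a}\;\Big[\, Q(s,a) \;+\; c \cdot P(a \mid s)\,
\frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\Big]
```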

L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

Through LAT analysis, this paper reveals that the low-frequency CoT directional representations of LLMs and VLMs share similar distributions. It proposes L2V-CoT: extract CoT directional representations from an LLM → apply low-pass filtering → frequency-domain resampling for dimension alignment → inject into VLM hidden layers. This training-free approach transfers LLM reasoning capabilities to VLMs, achieving an average improvement of 3.7% and a maximum gain of 8.6%.
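
The filter-and-resample step can be sketched with a discrete Fourier transform; the cutoff and hidden sizes below are assumptions:

```python
import numpy as np

def lowpass_resample(direction: np.ndarray, target_dim: int, keep: int = 16) -> np.ndarray:
    """Sketch of the L2V-CoT transfer step: low-pass filter a CoT
    directional representation in the frequency domain, then resample
    to the VLM hidden size by inverse-transforming at the new length.
    The number of retained low-frequency bins (`keep`) is assumed."""
    spectrum = np.fft.rfft(direction)
    spectrum[keep:] = 0                      # low-pass: drop high frequencies
    resampled = np.fft.irfft(spectrum, n=target_dim)
    # rescale so amplitude stays comparable across dimensions
    return resampled * (target_dim / direction.shape[0])

llm_dir = np.random.default_rng(0).normal(size=4096)   # LLM hidden size (assumed)
vlm_dir = lowpass_resample(llm_dir, target_dim=3584)   # VLM hidden size (assumed)
print(vlm_dir.shape)   # (3584,) -> ready to inject into VLM hidden layers
```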

LLMs for Game Theory: Entropy-Guided In-Context Learning and Adaptive CoT Reasoning

This paper proposes an entropy-guided adaptive LLM reasoning framework that combines dynamic in-context retrieval with adaptive chain-of-thought (CoT) reasoning. On the Tic-Tac-Toe benchmark, the framework improves the average game outcome of LLMs from −11.6% to +9.5% while keeping the number of LLM queries low.
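
One plausible reading of the entropy gate: sample several candidate moves cheaply and escalate to CoT reasoning (and richer in-context retrieval) only when the empirical answer distribution is high-entropy. The threshold is an assumption:

```python
import math
from collections import Counter

def answer_entropy(samples) -> float:
    """Shannon entropy of the empirical distribution over sampled
    answers (e.g. candidate moves): a cheap uncertainty signal."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def needs_cot(samples, tau: float = 0.8) -> bool:
    """Gate: only escalate to costlier adaptive CoT when the sampled
    answers disagree. The threshold tau is an assumption."""
    return answer_entropy(samples) > tau

print(needs_cot(["B2"] * 9 + ["A1"]))                # low entropy  -> False
print(needs_cot(["B2", "A1", "C3", "A3", "C1"]))     # high entropy -> True
```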

Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension

This paper proposes Relation-R1, the first unified framework for binary and N-ary relation comprehension, combining progressively cognitive CoT-guided SFT with GRPO multi-reward optimization. With only 3B parameters, it surpasses 13B models, achieving 21.20% Mean (+6.87%) on PSG and state-of-the-art performance across all metrics on SWiG (Grnd-all 30.18%, +14.48%).

RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation

This paper proposes RPM-MCTS, which replaces a trained Process Reward Model (PRM) with knowledge base retrieval to guide MCTS search for code generation. Exploiting the homogeneity of correct implementations within the same algorithm family, the method retrieves reference algorithm steps from a knowledge base as evaluation signals, applies similarity-based filtering to prune redundant expansion nodes, and uses sandbox execution to localize errors—achieving approximately 15% token reduction while surpassing prior state-of-the-art.

SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Inspired by Error-Related Negativity (ERN) in neuroscience, this paper proposes SAPO, a self-adaptive process optimization method that replaces costly step-wise Monte Carlo rollouts with first error detection and local posterior estimation. SAPO reduces computational cost by 2–3× while enabling joint optimization of the reasoner and verifier, allowing small language models (≤2B) to outperform most self-evolution methods on mathematical and code reasoning tasks.
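
First-error detection can be made cheap if prefix correctness is monotone (a trace is valid up to its first error and invalid from then on): a binary search then needs O(log n) expensive checks instead of one per step. The check below is a stub:

```python
def first_error(num_steps: int, prefix_ok) -> int:
    """Find the length of the shortest invalid prefix of a reasoning
    trace with O(log n) calls to `prefix_ok`, assuming monotonicity.
    `prefix_ok(k)` stands in for an expensive check (e.g. rollouts from
    step k). Returns num_steps + 1 if no error exists."""
    lo, hi = 1, num_steps + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid + 1          # error (if any) lies after this prefix
        else:
            hi = mid              # error lies at or before this prefix
    return lo

# Toy trace: steps 1-6 are fine, step 7 introduces the first error.
print(first_error(10, prefix_ok=lambda k: k < 7))   # -> 7
```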

SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Grounded in the dual-process theory from cognitive science, SCALE decomposes mathematical problems into sub-problems and allocates compute resources according to difficulty (System 1 for fast computation vs. System 2 for deep reasoning). On AIME25, it improves Qwen3-32B from 57.50% to 71.25% while reducing token consumption by 33–53% compared to InftyThink.

SERL: Self-Examining Reinforcement Learning on Open-Domain

This paper proposes SERL, a self-improvement framework in which an LLM simultaneously acts as an Actor (generator) and a Judge (evaluator). It derives reward signals from the model's own judgments via the Copeland pairwise comparison method, requiring neither external reward models nor human annotations. SERL improves Qwen3-8B from 52.37% to 59.90% (+7.53%) on AlpacaEval 2.0, approaching the performance of Qwen3-32B.
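
The Copeland step is easy to make concrete: every pair of sampled responses is judged, and each response scores its number of pairwise wins, with ties counting half. The judge below is a stub:

```python
from itertools import combinations

def copeland_scores(candidates, prefer):
    """Copeland scoring from pairwise judgments: +1 per pairwise win,
    +0.5 per tie. `prefer(a, b)` stands in for the Judge model and
    returns 'a', 'b', or 'tie'."""
    scores = {c: 0.0 for c in candidates}
    for a, b in combinations(candidates, 2):
        verdict = prefer(a, b)
        if verdict == "a":
            scores[a] += 1
        elif verdict == "b":
            scores[b] += 1
        else:
            scores[a] += 0.5
            scores[b] += 0.5
    return scores

# Toy usage: a judge that simply prefers the longer response.
resp = ["short", "a medium answer", "a rather long detailed answer"]
print(copeland_scores(resp, prefer=lambda a, b: "a" if len(a) > len(b) else "b"))
```

Scores like these can then be normalized into per-response reward signals for the RL update, with no external reward model in the loop.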

Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

With a single epoch of supervised fine-tuning (SFT) on OPT-350M, this work achieves a 77.55% pass rate on ToolBench, substantially outperforming large-model baselines such as ChatGPT-CoT (26%) and ToolLLaMA-DFS (30.18%), demonstrating that small models with targeted fine-tuning can significantly surpass general-purpose large models on specific tasks.

SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision

This paper proposes SPARE, a unified single-pass evaluation framework that simultaneously performs step-to-reference alignment and correctness judgment (with explicit reasoning) in a single structured generation, requiring no additional training data. SPARE achieves 2.3× speedup over MCTS-based methods and attains OOD generalization with only 16% of the training samples.

Stable Voting and the Splitting of Cycles

This paper investigates the conjecture that Simple Stable Voting (SSV)—a recursive voting rule already used in hundreds of real-world elections—always refines Split Cycle (SC). Through mathematical proof (≤5 candidates) and SAT solving (6–7 candidates), the paper establishes that the conjecture holds for ≤6 candidates, is refuted for ≥7 candidates, and generalizes the counterexample to arbitrarily many candidates via a constructive proof.

Text-to-Scene with Large Reasoning Models

This paper proposes Reason-3D, which leverages the multi-step spatial reasoning capabilities of large reasoning models (LRMs) to achieve zero-shot text-to-3D scene generation via semantic-voting-based object retrieval and a two-stage layout strategy (autoregressive placement + collision-aware refinement). The system achieves an Elo score of 2248 in human evaluation, substantially outperforming Holodeck (1500) and LayoutVLM (1650).

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

This paper systematically evaluates the negative impact of deliberative reasoning on foundational capabilities (helpfulness and harmlessness) in Large Reasoning Models (LRMs) such as DeepSeek-R1, QwQ, and OpenThinker. It finds that deliberative reasoning significantly degrades instruction-following and safety, and proposes adaptive reasoning modes—Zero-Thinking, Less-Thinking, and Summary-Thinking—that effectively mitigate these deficiencies.