💡 LLM Reasoning

💬 ACL2026 · 82 paper notes

📌 Same area in other venues: 📷 CVPR2026 (16) · 🔬 ICLR2026 (241) · 🧪 ICML2026 (78) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (81) · 📹 ICCV2025 (3)

🔥 Top topics: Reasoning ×58 · LLM ×21 · Reinforcement Learning ×6 · Alignment/RLHF ×4 · Adversarial Robustness ×3

Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication

This paper proposes the Amortized Intelligence paradigm: treating the LLM as a "one-time compiler" to compile legal contracts into a deterministic Directed Acyclic Graph (DAG) intermediate representation called DACL. At runtime, a lightweight agent schedules a symbolic engine for execution, achieving 99.5% accuracy across 400 real-world contract events. Compared to large reasoning models like GPT-5.2/Claude/Gemini, accuracy on complex contracts jumps from 22-46% to 98%, while token consumption is reduced by 9.9x.
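
The "compile once, execute deterministically" idea can be illustrated with a minimal sketch. The DACL format itself is not shown in this summary, so the node layout, the `run_dag` helper, and the late-fee clause below are all hypothetical; the only point is that once a contract is a DAG of pure functions, each event can be evaluated without calling an LLM.

```python
from graphlib import TopologicalSorter

def run_dag(nodes, inputs):
    """Evaluate a DAG of pure functions in dependency order.

    nodes maps a node name to (dependencies, function); the function
    receives the values of its dependencies in the listed order.
    """
    graph = {name: set(deps) for name, (deps, _) in nodes.items()}
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in values:  # externally supplied event fields
            continue
        deps, fn = nodes[name]
        values[name] = fn(*(values[d] for d in deps))
    return values

# Hypothetical late-fee clause: a 5% fee applies only after a grace period.
contract = {
    "is_late": (["days_overdue", "grace_days"], lambda d, g: d > g),
    "late_fee": (["is_late", "amount"],
                 lambda late, a: round(a * 0.05, 2) if late else 0.0),
}
result = run_dag(contract, {"days_overdue": 12, "grace_days": 10, "amount": 2000.0})
print(result["late_fee"])
```

Because the graph is acyclic and every node is a plain function, the same event always yields the same result, which is what makes this kind of execution auditable.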

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This paper proposes APMPO, which unifies GRPO (arithmetic mean) and GMPO (geometric mean) objectives using a "power-mean" controlled by the current mean reward. In conjunction with an adaptive clip range based on reward stability, APMPO allows RLVR training to dynamically switch between "amplifying rare high rewards" and "emphasizing consistency" across different stages, consistently outperforming GRPO, DAPO, and GMPO on 9 mathematical, SQL, and multimodal reasoning benchmarks.
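
As a rough illustration of why a power mean can interpolate between the two objectives, the sketch below computes the power mean of a list of positive values. With exponent `p = 1` it is the arithmetic mean and as `p` approaches 0 it becomes the geometric mean. How APMPO maps the current mean reward to `p`, and which quantities it averages, is not specified in this summary, so the function name and the sample values are assumptions.

```python
import math

def power_mean(values, p):
    """Power mean of positive values; p=1 is arithmetic, p->0 is geometric."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        # Limit p -> 0: exponential of the mean log, i.e. the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

ratios = [1.0, 4.0]  # illustrative positive values only
arithmetic = power_mean(ratios, 1.0)
geometric = power_mean(ratios, 0.0)
print(arithmetic, geometric)
```

Larger exponents weight the largest values more heavily, while exponents near zero penalize inconsistency among the values, which matches the "amplify rare high rewards" versus "emphasize consistency" trade-off described above.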

AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

The AIM-CoT framework is proposed to address two core issues in Interleaved Multimodal Chain-of-Thought (I-MCoT)—"what to see" and "when to see"—through Information Foraging Theory-driven Active Visual Probing (AVP) and an attention-shift-based Dynamic Attention-shift Trigger (DAT).

Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data

This paper proposes a budget-aware anytime reasoning framework and an Anytime Index metric to quantify the quality-efficiency trade-off of LLMs under limited token budgets. It also designs a reasoning-time self-improvement method (PDP) based on LLM-synthesized preference data, significantly improving the quality of intermediate and final solutions across planning, mathematics, and science QA tasks.

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Addressing the double-edged sword where "self-generated rubrics often mislead reward models," the authors use language model (LM) likelihood margins to automatically label 16 self-sampled rubrics as "helpful/misleading" pairs. They then train a cooperative rubric generator via DPO and a "critical" verifier via GRPO, which assesses rubric reliability before making judgments. Using only binary preference data, C2 improves reasoning RM performance by up to 6.5 points on RM-Bench and increases downstream DPO LC win rates by 6 points. Notably, an 8B model using self-generated rubrics matches the performance obtained with rubrics from a \(4\times\) larger model (Qwen3-32B).

Calibration-Aware Policy Optimization for Reasoning LLMs

The authors first prove that the "reward-only" advantage estimation in GRPO-like algorithms is equivalent to an AUC-inconsistent surrogate (\(\phi(t)=-t\), violating scale-invariance), which leads to a continuous degradation of relative calibration (perplexity AUC) even as accuracy increases. Accordingly, they propose CAPO, which replaces the advantage with a "pairwise, uncertainty-aware" form based on an AUC-consistent logistic surrogate, further enhanced by denoising masking using reference-model PPL. On Qwen2.5-Math 1.5B/7B, CAPO improves calibration by 15–25% with accuracy comparable or superior to GRPO, and yields an additional 5% gain in AIME inference-time scaling.

Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression

This paper proposes post-reasoning and UCoT: a lightweight compressor first generates soft tokens representing the reasoning path via a single forward pass, and then an executor uses these soft tokens as input context to perform short-output reasoning, significantly reducing CoT tokens and latency while maintaining reasoning accuracy.

Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models

This paper proposes Alignment Score—a semantic-level metric based on a semantic entropy matrix—to quantify reasoning alignment by comparing intermediate steps of model-generated chains-of-thought with human-preferred reference chains. The study finds that Alignment Score is highly correlated with task accuracy, readability, and coherence, identifying 2-hop reasoning as the peak depth for alignment.

ChAIRO: Contextual Hierarchical Analogical Induction and Reasoning Optimization for LLMs

The authors propose ChAIRO, a framework for contextual hierarchical analogical induction and reasoning optimization. Through a three-stage pipeline (analogical case generation → rule induction → rule-injected fine-tuning), the framework enables LLMs to autonomously generate analogical cases and induce explicit moderation rules for content moderation. It achieves a 4.5% \(F1\) improvement over single-instance rule generation and a 2.3% improvement over static RAG.

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

CoAct utilizes self-consistency during preference alignment to partition unlabeled samples into "high-consistency" and "low-consistency" sets. It then employs k-NN distance to identify "self-consistent yet potentially incorrect" risky samples from the high-consistency set for Oracle labeling, while the remaining high-consistency samples are treated as AI self-labeled data. Finally, Oracle-verified samples are used as in-context demos to generate new instructions. By integrating human and AI supervision into a single DPO loop, CoAct achieves a 4–8 percentage point improvement over state-of-the-art baselines on GSM8K, MATH, and WebInstruct.

CRISP: Compressing Redundancy in Chain-of-Thought via Intrinsic Saliency Pruning

This paper proposes the CRISP framework, discovering that the attention patterns of the </think> token can reliably distinguish between critical and redundant steps in a reasoning chain. Based on this, a greedy search compression pipeline with four atomic operations is designed, reducing token usage by 50-60% while maintaining accuracy.

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

CSRP employs a three-stage training pipeline—CPT, SFT with CoT rationales, and GRPO with Efficiency-Aware Rewards—to train a Chinese text correction model. It achieves 50.99 \(F_{0.5}\) on NACGEC and 59.61 F1 on CSCD, significantly mitigating over-correction issues in LLM-based correction through explicit rewards for editing efficiency.

Decoupling the Effect of Chain-of-Thought Reasoning: A Human Label Variation Perspective

This paper reveals the "decoupling mechanism" of CoT reasoning through Cross-CoT experiments and step-wise analysis: while final accuracy is determined by CoT content (99% variance contribution), the distribution ranking is dominated by the model's intrinsic prior (>80%). This indicates that long CoT acts as a powerful decision-maker but a weak distribution calibrator.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

DELTA is a training-free hierarchical sparse attention mechanism. It categorizes Transformer layers into three groups: "initial full attention layers + a few Δ-layers for re-selecting salient pages + subsequent sparse attention layers." It achieves comparable or superior accuracy to full attention on AIME / GPQA-Diamond, while reducing the number of attended tokens by \(4.25\times\) and achieving a \(1.54\times\) end-to-end inference speedup.

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

This paper discovers a shared logical subspace within LLMs that aligns natural language and symbolic logic representations. By steering activations along this subspace during inference, logical reasoning accuracy is improved by up to 11 percentage points without requiring training.

Dissecting Failure Dynamics in Large Language Model Reasoning

Analysis of LLM reasoning trajectories reveals that errors cluster at key early turning points, after which models enter a "cognitive spiral"—extending trajectories in a locally coherent but globally erroneous manner. Based on this, the GUARD framework is proposed to perform short-range branch repair at high-risk turning points detected via entropy signals.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding (CoRD)

The authors propose CoRD (Collaborative Reasoning Decoding), which transforms multi-teacher Long-CoT reasoning distillation from "generating full trajectories followed by post-hoc selection" into "step-wise collaborative decoding." In each step, multiple LRMs propose candidate steps, which are scored by the predictive perplexity of a meta-prover. Top-B partial trajectories are maintained via beam search. Consequently, a 32B student model surpasses all single teachers on AIME24/25 (79.6 / 70.2 vs 78.9 / 67.9).

Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

This paper proposes LTE (Learning to reason from Trial and Error), which mitigates the exploration stagnation problem in RLVR without relying on external experts: the model's own incorrect answers are fed back as prompts to guide additional rollouts.

Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

The authors split the 57 subjects of MMLU into two subsets—symbolic reasoning and knowledge recall (approx. 1:2)—using the "=="-heuristic from Sprague et al. based on academic disciplines. They empirically demonstrate that self-consistency (SC) is not only effective for symbolic reasoning—where CoT already excels—but also consistently yields gains in knowledge recall (+2.48 when \(n=20\)). This pushes the overall MMLU accuracy of GPT-4o to 88.93%. The mechanism is explained using the "majority answer ratio" as a confidence signal (Pearson \(\rho \approx 0.42\)).
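
Self-consistency with the majority answer ratio as a confidence signal can be sketched in a few lines. The function name and the sample answers are illustrative; the paper's prompting and answer extraction are not shown here.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and its share of all sampled answers."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

samples = ["B", "B", "C", "B", "A"]  # hypothetical answers from 5 samples
answer, ratio = self_consistency(samples)
print(answer, ratio)
```

A higher ratio means the samples agree more strongly, which is the sense in which it serves as a confidence signal.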

DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models

DRP allows a "Short-CoT teacher (GPT-4o)" to perform skill-level step decomposition and pruning/rewriting on the "Long-CoT student's (R1-Distill-Qwen)" own reasoning trajectories. By distilling these trajectories—which remove redundancy while preserving the student's speaking style—back into the student, DRP reduces the tokens of a 7B model on GSM8K from 917 to 328 (−64%) while increasing Pass@1 from 91.7% to 94.1%. It simultaneously reduces token counts and improves accuracy on OOD tasks like AIME/AMC/MATH500.

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping

DVMap shifts "pluralistic value alignment" from coarse-grained national labels to 11-dimensional demographic profiles. It filters 56,100 WVS data points through "high-consensus profiles" (Shannon entropy = 0), then trains Qwen3-8B using Structured CoT + GRPO (binary rewards). The model outperforms DeepSeek-v3.2 and matches GPT-4o in triple generalization tests across demographics, countries, and values.
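
The "Shannon entropy = 0" filter amounts to keeping only profiles whose respondents all gave the same answer. A minimal sketch follows, assuming responses are grouped by profile; the profile keys and answers are made up, and the real WVS preprocessing is not described in this summary.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def high_consensus(profiles):
    """Keep profiles whose answers are unanimous (entropy exactly 0)."""
    return {key: answers[0] for key, answers in profiles.items()
            if answers and shannon_entropy(answers) == 0.0}

# Hypothetical profiles mapped to the answers given for one value question.
profiles = {
    "profile_a": ["agree", "agree", "agree"],
    "profile_b": ["agree", "disagree"],
}
kept = high_consensus(profiles)
print(kept)
```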

Efficient PRM Training Data Synthesis via Formal Verification

This paper proposes FoVer, a framework that leverages formal verification tools (Z3 and Isabelle) to automatically annotate step-level correctness labels for reasoning chains in formal reasoning tasks. By generating the FoVer-40K training set and fine-tuning a PRM, the study demonstrates formal-to-informal transfer capabilities and cross-task generalization across 12 reasoning benchmarks.

Efficient Process Reward Modeling via Contrastive Mutual Information

This paper proposes CPMI (Contrastive Pointwise Mutual Information), an efficient automated step-level reward annotation method. It estimates the step-level contribution by contrasting the change in conditional probabilities of correct and incorrect answers. Compared to Monte Carlo estimation, CPMI reduces construction time by 84% and token generation by 98%, while achieving higher accuracy on process-level evaluation and mathematical reasoning benchmarks.
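
One way to read "contrasting the change in conditional probabilities" is as a difference of log-probability gains. The sketch below is an interpretation under that assumption, not the paper's exact formula: a step scores positively when it raises the probability of the correct answer more than that of an incorrect one. The function name and the probabilities are hypothetical.

```python
import math

def contrastive_step_score(p_correct_before, p_correct_after,
                           p_wrong_before, p_wrong_after):
    """Log-probability gain for the correct answer minus the gain for a wrong one."""
    gain_correct = math.log(p_correct_after) - math.log(p_correct_before)
    gain_wrong = math.log(p_wrong_after) - math.log(p_wrong_before)
    return gain_correct - gain_wrong

# A step that doubles the correct answer's probability and halves the wrong one's.
helpful = contrastive_step_score(0.2, 0.4, 0.2, 0.1)
# The mirror-image step gets the opposite score.
harmful = contrastive_step_score(0.4, 0.2, 0.1, 0.2)
print(helpful, harmful)
```

Such a score needs only a few forward passes to read off answer probabilities rather than many sampled rollouts, which is consistent with the reported savings over Monte Carlo estimation.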

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

The TRACE framework is proposed to judge inference convergence by aggregating two complementary signals—multi-step answer consistency and confidence trajectory—within a sliding window. This enables training-free dynamic early exit, reducing token usage by 25-30% while maintaining accuracy within a 1-2% margin.
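
A sliding-window exit rule of this kind might look like the sketch below. The window size, the confidence threshold, and the specific convergence test (same answer throughout the window, non-decreasing confidence, final confidence above a floor) are assumptions chosen for illustration, not TRACE's actual criteria.

```python
def should_exit(history, window=3, min_confidence=0.8):
    """Decide whether reasoning has converged.

    history is a list of (answer, confidence) pairs recorded at
    successive checkpoints during generation.
    """
    if len(history) < window:
        return False
    recent = history[-window:]
    answers = {answer for answer, _ in recent}
    confidences = [conf for _, conf in recent]
    stable_answer = len(answers) == 1
    non_decreasing = all(b >= a for a, b in zip(confidences, confidences[1:]))
    return stable_answer and non_decreasing and confidences[-1] >= min_confidence

converged = should_exit([("12", 0.5), ("12", 0.7), ("12", 0.85)])
unstable = should_exit([("12", 0.5), ("13", 0.7), ("12", 0.85)])
print(converged, unstable)
```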

ETR: Entropy Trend Reward for Efficient Chain-of-Thought Reasoning

The authors propose ETR (Entropy Trend Reward), which incorporates momentum-weighted stepwise entropy reduction as a reward shaping term in GRPO. This constraint forces the LLM's CoT to converge adaptively under a "global entropy decay" objective, reducing average CoT length by 35–65% while maintaining accuracy. On DeepSeek-R1-Distill-7B, it achieves a +9.9% accuracy gain while reducing tokens by 67%.

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

This paper proposes Evo-Attacker, which models tool return tampering for LLM multi-agent systems (LLM-MAS) as a long-horizon reinforcement learning problem with dynamic attack memory. It optimizes retrieval, reflection, and modification decisions using Attack-Flow GRPO, significantly reducing system success rates across multiple architectures and task benchmarks.

Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck

This paper proposes Multi-Focus Attention Instruction (MFAI) as a semantic probe to reveal the "Weakest Link Effect" in multi-hop QA—multi-hop reasoning performance is determined by the absolute position of the least visible evidence rather than the distance between facts. Failures primarily stem from recognition bottlenecks rather than reasoning defects, and System-2 reasoning models effectively resist position bias and misleading attention cues.

FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosures

FinReporting decomposes the localization of US, Japanese, and Chinese financial reports into an auditable agent workflow comprising "rule-based extraction + ontological mapping + constrained LLM verification/repair + human review." It utilizes a unified IS/BS/CF schema to mitigate inconsistencies in financial disclosure formats and accounting semantics across different jurisdictions.

Foresight Optimization for Strategic Reasoning in Large Language Models

This paper proposes Foresight Policy Optimization (FoPO), which introduces a foresight correction term based on opponent modeling into policy optimization. This enables LLMs to explicitly anticipate opponent behaviors and adjust their own strategies accordingly. FoPO significantly improves strategic reasoning in both cooperative (Cooperative RSA) and competitive (Competitive Taboo) game tasks and achieves consistent improvements on the cross-domain \(\gamma\)-Bench.

FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

This paper introduces FS-Researcher, a file-system-based dual-agent framework for deep research. By utilizing a Context Builder to construct a hierarchical knowledge base and a Report Writer for sectional reporting within a persistent workspace, it overcomes context window limitations. FS-Researcher achieves 53.94 RACE (SOTA) on the DeepResearch Bench and demonstrates a positive test-time scaling effect between context construction computation and report quality.

GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO

This paper introduces GanitLLM, the first model to perform mathematical reasoning genuinely in Bengali (rather than through translation or reasoning in English). Through the construction of Ganit, a difficulty-annotated Bengali mathematics dataset, and the proposal of Curriculum-GRPO to address the cold start problem in low-resource GRPO training, the 4B model achieves an 8-percentage-point accuracy gain on Bn-MGSM, while increasing Bengali reasoning tokens from 14% to 88%.

HISR: Hindsight Information Modulated Segmental Process Rewards for Multi-turn Agentic Reinforcement Learning

HISR utilizes GPT-4o to partition agent trajectories into segments aligned with sub-goals. Subsequently, a hindsight model and a policy model calculate importance scores via likelihood ratios to modulate segmental process rewards. This approach improves credit assignment on Alfworld, Virtualhome, and Webshop, achieving an average score increase of 5+ over SPA.

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

This paper traces the information flow of CoT in reverse from three levels: decoding, probability projection, and FFN activation. It finds that CoT primarily improves reasoning performance by constraining answer structures, reducing prediction entropy, and modulating neuron activation based on task types, rather than simply making the model "truly more capable of logical reasoning."

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

This paper systematically refutes the popular recent conclusion that "CoT does not count as explainability." Using four complementary metrics—Filler Tokens, FUR, faithful@k, and Causal Mediation Analysis—it demonstrates that over half of CoT samples judged "unfaithful" by Biasing Features (hint verbalization) actually reflect model reasoning "in other ways." Unfaithfulness primarily stems from "incompleteness" due to lossy natural language compression rather than true divergence—increasing the sampling budget can raise hint verbalization probability to 90%, and even non-verbalized hints can causally transmit influence through the CoT.

JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents

JTPRO proposes a joint optimization framework that avoids model fine-tuning. By using reflection-driven iterative editing, it simultaneously optimizes global instructions and tool-wise schemas/parameter descriptions. This significantly improves end-to-end success rates in large-scale tool library scenarios, achieving a 5%–20% gain in OSR compared to baselines like GEPA.

Language Model as Planner and Formalizer under Constraints

This paper proposes the CoPE benchmark, which injects formally classified natural language constraints into classic planning environments. It reveals that a single constraint can halve the planning performance of current state-of-the-art LLMs, exposing a severe lack of robustness in LLM planning.

Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners

This paper systematically investigates the latent reasoning behavior of Large Reasoning Models (LRMs) across 11 languages. It finds that latent reasoning capabilities exist in multiple languages but are unevenly distributed (strong in high-resource languages, weak in low-resource ones), and internal reasoning dynamics tend to follow an English-centric shared path.

Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

CoT2Edit proposes a new paradigm for teaching LLMs to perform knowledge editing via CoT reasoning. By constructing CoT instruction data for structured and unstructured editing, the model undergoes SFT cold-start followed by GRPO optimization. During inference, it combines RAG to retrieve edited facts, achieving SOTA performance on 6 editing benchmarks with strong generalization capabilities from a single training run.

LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models

LegalDrill employs an Audit Agent to diagnose specific error patterns in 0.6B/1.7B small language models (SLMs) during legal reasoning. It prompts a strong teacher (GPT-4o / Qwen3-30B) to "deliberately reproduce and correct" these errors to generate preference pairs based on diagnostic instructions. Samples that the student already understands are filtered out using a Difficulty Score derived from the student's own forced-choice probabilities. After iterative SFT+DPO, the 1.7B student model approaches the performance of the 30B teacher across multiple LegalBench subsets.

LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues

This paper proposes LePREC, a neuro-symbolic framework inspired by legal professionals that transforms unstructured legal text into structured features via LLM-generated reasoning QA pairs. By utilizing sparse linear models for relevance classification, it achieves a 30–40% performance gain over LLM baselines like GPT-4o on the LIC dataset constructed from 769 Malaysian contract law cases.

LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

This paper models LLM chain-of-thought reasoning as geometric trajectories in the representation space. It discovers that (a) each reasoning step occupies a linearly separable subspace that becomes clearer in deeper layers, and (b) correct and incorrect solutions overlap in early stages but diverge systematically later. This allows predicting the final correctness with an ROC-AUC of 0.87 before the answer is output, leading to a proposed "trajectory steering" method for reasoning correction and length control.

Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning

This paper discovers the "logical phase transition" phenomenon in LLM logical reasoning—performance collapses abruptly at specific complexity thresholds rather than degrading smoothly. It proposes the Logical Complexity Measure (LoCM) to quantify this phenomenon and designs the Neuro-Symbolic Curriculum Tuning (NSCT) framework. Through adaptive neuro-symbolic alignment and complexity-aware curriculum optimization, NSCT improves accuracy by an average of +1.26 over naive prompting and +3.95 over CoT across five benchmarks.

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

ProxyCoT leverages short yet sufficient proxy contexts to obtain high-quality reasoning trajectories, which are then distilled into full long-context inputs. This approach enables 4B models to significantly improve long-context reasoning on SciTrek, HotpotQA, and Loong while reducing the number of CoT tokens during inference.

MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis

This paper proposes MathAgent, a hierarchical data synthesis framework based on the adversarial evolution of constraint graphs. It reformulates data synthesis from a text generation task into an unsupervised optimization problem of constraint graphs. The framework evolves question skeletons through a Legislator tri-agent system, which are then instantiated into natural language by an Executor. With only 1K synthesized samples, MathAgent outperforms LIMO and s1K across eight mathematical benchmarks.

Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Whisper models the problem of "reducing thinking without sacrificing accuracy" in Large Reasoning Models (LRMs) as black-box persuasive prompting. By automatically generating and iteratively filtering prompt suffixes through multiple perspectives, it significantly reduces output tokens on Qwen3, DeepSeek-R1-Distill, and Claude/Gemini APIs while maintaining reasoning accuracy.

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

MTR-Bench constructs an automated multi-turn reasoning evaluation framework comprising 4 categories, 40 tasks, and 3,600 instances, demonstrating that current frontier reasoning models remain unreliable in interactive and dynamic feedback environments.

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

N-GRPO replaces "sampling a token then looking up its embedding" with a "weighted mixture of the anchor token and its semantic neighbors' embeddings" during the GRPO rollout phase. It injects exploration diversity via controlled embedding-level perturbations without deviating from the semantic manifold, consistently outperforming GRPO and Gaussian noise baselines on Pass@16/Pass@32 across multiple backbones such as DeepSeek-R1-Distill-Qwen.

On the Step Length Confounding in LLM Reasoning Data Selection

This paper identifies a "step length confounding" issue in naturalness-based LLM reasoning data selection methods—a systematic preference for samples with longer steps rather than higher quality, rooted in the dilution of low-probability first tokens in long steps. Two correction methods, Aslec-drop (discarding first token probabilities) and Aslec-casl (causal regression debiasing), are proposed, improving average accuracy by 6-9%.

Parallel Test-Time Scaling for Latent Reasoning Models

This paper introduces parallel test-time scaling (parallel TTS) to latent reasoning models for the first time. It proposes two stochastic sampling strategies based on uncertainty theory (MC-Dropout and Additive Gaussian Noise) and a Latent Reward Model (LatentRM) trained with step-level contrastive learning. This enables models reasoning in continuous vector spaces to achieve stable performance gains through parallel sampling and aggregation.

PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

PPA-Plan predicts potential logical pitfalls before generating reasoning plans for long contexts and converts these pitfalls into negative constraints to guide the planner. This allows LLMs to avoid superficial keyword matching and incorrect assumption paths, improving accuracy and NLI scores while significantly reducing plan execution failure rates across multiple long-context QA datasets.

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

This paper proposes utilizing Planning Domain Definition Language (PDDL) to automatically generate large-scale, high-precision step-level reward datasets for training Process Reward Models (PRM), achieving significant improvements across both mathematical and non-mathematical reasoning benchmarks.

Reasoning Fails Where Step Flow Breaks

This work proposes Step-Saliency, a diagnostic tool that reveals two depth-related information-flow failure modes (Shallow Lock-in and Deep Decay) in Large Reasoning Models (LRMs), and designs StepFlow, a test-time intervention that repairs information propagation and improves reasoning accuracy without retraining.

Reinforced Efficient Reasoning via Semantically Diverse Exploration

ROSE proposes an MCTS branching strategy guided by semantic entropy and length-aware segment-level advantage estimation. It addresses the issues of insufficient exploration diversity and low inference efficiency in existing MCTS-based RLVR methods, achieving the best pass@8 performance across multiple mathematical reasoning benchmarks.

Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

ReASC transforms adaptive self-consistency from "counting answer votes" into "determining if sufficient reliable evidence exists." By utilizing response-confidence-weighted Beta accumulation, it significantly reduces multi-sample reasoning costs on GSM8K, MATH500, Omni-Math, and GPQA-Diamond while maintaining near-original accuracy.

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

This paper proposes Render-of-Thought (RoT), the first method to render textual CoT reasoning steps into images. By utilizing a pre-trained visual encoder as a semantic anchor to align LLM hidden states with the visual embedding space, it achieves 3-4x token compression and significant inference acceleration while maintaining the analyzability of reasoning chains.

ReProbe: Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models

This paper proposes ReProbe, which uses a lightweight transformer probe with fewer than 10M parameters to read the hidden states, attention, and logits of a frozen LLM to determine the reliability of each reasoning step. It approaches or exceeds the performance of PRMs 750-810 times larger on math, planning, and QA tasks, serving as an efficient step verifier for Best-of-N and beam search.

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

This work systematically investigates the entropy dynamics of LLMs during RLVR training, revealing that positive-advantage tokens are the primary drivers of entropy collapse. It introduces Positive-Advantage Reweighting to effectively regulate model entropy by dynamically adjusting the loss weights of these tokens.

Revisiting the Uniform Information Density Hypothesis in LLM Reasoning

This paper introduces the Uniform Information Density (UID) hypothesis from psycholinguistics into LLM reasoning analysis. It proposes an entropy-based step-level information density framework and discovers that high-quality reasoning trajectories exhibit a counter-intuitive pattern of "local uniformity + global non-uniformity." This pattern significantly outperforms traditional confidence/entropy baselines in Best-of-N sampling.

RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

RSAT utilizes "SFT in structured citation format + GRPO with NLI faithfulness as the core reward" to train 1B-8B small language models. This approach enables table QA to not only provide answers but also bind each reasoning step to specific table cells, increasing average faithfulness from 0.224 in SFT to 0.826.

Scaling Evaluation-Time Compute with Reasoning Models as Evaluators

This paper extends test-time scaling from "answer generation" to "answer evaluation," finding that allowing reasoning models to generate more reasoning tokens, perform step-by-step process checks, and combine outcome/process scores during evaluation allows them to outperform trained PRMs/ORMs in ProcessBench and Best-of-N reranking.

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

This paper proposes GenCluster, a scalable test-time compute framework. Through large-scale parallel generation → behavioral clustering → tournament ranking → round-robin submission strategies, it enables the open-weight model gpt-oss-120b to achieve gold medal level (446.75/600 points) on IOI 2025 for the first time.

SeLaR: Selective Latent Reasoning in Large Language Models

This paper proposes SeLaR, a lightweight training-free framework that activates soft-embedding latent reasoning only during uncertain "exploration steps" via an entropy gating mechanism, while maintaining discrete decoding during high-confidence "certain steps." It introduces entropy-aware contrastive regularization to prevent soft embeddings from collapsing toward the dominant token, consistently outperforming standard CoT and SOTA training-free methods across five reasoning benchmarks.
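
The entropy gate can be sketched as follows: when the next-token distribution is uncertain, feed back a probability-weighted mixture of token embeddings; otherwise feed back the embedding of the most likely token. The threshold, the toy two-token vocabulary, and the function names are assumptions, and the entropy-aware contrastive regularization mentioned above is not included.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def next_input(probs, embeddings, threshold):
    """Return ("soft", mixed_embedding) when uncertain, else ("hard", argmax_embedding)."""
    if entropy(probs) > threshold:
        dim = len(embeddings[0])
        mixed = [sum(p * emb[i] for p, emb in zip(probs, embeddings))
                 for i in range(dim)]
        return "soft", mixed
    best = max(range(len(probs)), key=probs.__getitem__)
    return "hard", embeddings[best]

embeddings = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-token vocabulary
uncertain = next_input([0.5, 0.5], embeddings, threshold=0.5)
confident = next_input([0.99, 0.01], embeddings, threshold=0.5)
print(uncertain, confident)
```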

Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

The authors propose SABA, a reasoning framework based on the "perceive then act" paradigm. It explicitly constructs and audits knowledge states before making final decisions by utilizing Information Fusion (IF) to integrate narratives into verified baseline states and Query-driven Structured Reasoning (QSR) to recursively identify and resolve missing premises. SABA achieves peak performance across detective reasoning and general reasoning benchmarks.

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

This paper proposes CoT-PoT cross-modal ensembling, which exploits the complementarity of two fundamentally different reasoning modalities, Chain-of-Thought (CoT) and Program-of-Thought (PoT), to reduce the number of samples required for self-consistency by 9.3x, solving 78.6% of problems with only 2 samples.
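
The two-sample idea can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: `sample_cot` and `sample_pot` are hypothetical callables that each return one final answer, and the fallback used on disagreement (draw more pairs, then take a majority vote) is a guess at one reasonable policy.

```python
from collections import Counter
from typing import Callable, Tuple


def cot_pot_answer(
    sample_cot: Callable[[], str],
    sample_pot: Callable[[], str],
    max_extra_rounds: int = 2,
) -> Tuple[str, int]:
    """Return (answer, samples_used); stop after two samples if CoT and PoT agree."""
    cot, pot = sample_cot(), sample_pot()
    if cot == pot:
        # Agreement across the two modalities is treated as sufficient evidence.
        return cot, 2
    votes = Counter([cot, pot])
    used = 2
    for _ in range(max_extra_rounds):
        # Assumed fallback: keep drawing one sample from each modality.
        votes.update([sample_cot(), sample_pot()])
        used += 2
    return votes.most_common(1)[0][0], used


agree = cot_pot_answer(lambda: "42", lambda: "42")

cot_stream = iter(["41", "42", "42"])
disagree = cot_pot_answer(lambda: next(cot_stream), lambda: "42")
```

In the agreeing case only two samples are spent; in the disagreeing case the sketch falls back to ordinary voting over six samples.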

Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration

This paper proposes RDDG, a tabular data synthesis framework based on progressive Chain-of-Thought (CoT). It guides Large Language Models (LLMs) to generate high-fidelity tabular data through coreset selection, relational mining, and a self-reinforcing feedback mechanism, achieving an average Macro-F1 improvement of over 2% in imbalanced classification tasks.

Semantic-Aware Logical Reasoning via a Semiotic Framework

This paper proposes LogicAgent, a logical reasoning framework based on the Greimas Semiotic Square. Through multi-perspective semantic analysis and reflective verification, it achieves SOTA logical reasoning performance under the dual challenges of semantic and logical complexity.

SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

SHAPE conceptualizes LLM reasoning as trajectories within a "solvability potential" state space. It utilizes length-aware stage-level advantages and entropy-driven token-level redistribution to simultaneously enhance mathematical reasoning accuracy and reduce generated tokens by approximately 30%.

SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

SPPO reformulates RLVR in long-chain CoT reasoning from a token-level MDP into a sequence-level contextual bandit. By utilizing a scalar critic that only observes the prompt to estimate problem solvability, SPPO achieves stability and performance comparable to or exceeding GRPO using single-sample PPO. This approach yields approximately 5.9x training acceleration and lower GPU memory consumption.
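
One way to read the sequence-level bandit view is that a single scalar advantage is computed per response and shared by all of its tokens. The sketch below assumes the advantage is the verifiable reward minus the critic's prompt-only value estimate; the function name and this exact form are assumptions, not details confirmed by the summary.

```python
from typing import List


def sequence_level_advantages(
    reward: float, prompt_value: float, num_tokens: int
) -> List[float]:
    """Broadcast one sequence-level advantage to every token of the response."""
    # prompt_value stands in for the scalar critic's solvability estimate.
    advantage = reward - prompt_value
    return [advantage] * num_tokens


# A correct answer (reward 1.0) on a prompt the critic rates at 0.25.
token_advantages = sequence_level_advantages(1.0, 0.25, num_tokens=3)
```

Because the baseline depends only on the prompt, a single sampled response is enough to form the advantage, which is consistent with the single-sample PPO setting described above.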

Stabilizing Efficient Reasoning with Step-Level Advantage Selection

This paper discovers that short-context GRPO inherently compresses reasoning length but suffers from training instability due to incorrect credit assignment of truncated samples. The authors propose Step-level Advantage Selection (SAS), which selectively zeroes out advantages at the granularity of reasoning steps to significantly reduce inference tokens while maintaining or even improving Pass@1.

Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

This paper proposes Step-GRPO, which internalizes dynamic early-exit capability into the model. It measures reasoning complexity in semantic steps rather than raw tokens and uses dynamically truncated rollouts to expose short, correct trajectories. A step-aware relative reward then guides the model to stop reasoning at appropriate moments, yielding a 32% reduction in token consumption on Qwen3-8B without a drop in accuracy.

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Instead of reinforcing models based solely on win/loss outcomes in text-style game self-play, Stratagem modulates the advantage signal using "abstract transferability" and "reasoning evolution." This ensures that policies learned from games transfer effectively to mathematics, general reasoning, and code generation tasks.

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Strategy-Induct proposes a framework for inducing task-level instructions using only a few input questions without requiring ground-truth labels. By first generating reasoning strategies for individual questions and then inducing reusable task instructions from strategy-question pairs, the method surpasses current SOTA approaches on BBH-Induct, Evals-Induct, and Shift Cipher benchmarks.

TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning

TemplateRL abstracts structured reasoning templates from a small seed set using MCTS and introduces these templates as explicit guidance during reinforcement learning training. This significantly improves the efficiency and stability of multi-step reasoning in LLMs, achieving a 99% improvement over GRPO on AIME.

Think Outside the Policy: In-Context Steered Policy Optimization

This paper proposes ICPO (In-Context Steered Policy Optimization), which leverages the large language model's inherent in-context learning (ICL) capability as an implicit expert steer to expand the policy exploration space during RLVR training, without depending on reasoning trajectories from external stronger models.

TIME: Temporally Intelligent Meta-Reasoning Engine for Context-Triggered Explicit Reasoning

TIME transforms explicit reasoning from a "permanently active long chain-of-thought" into a locally controlled strategy triggered by temporal and discourse cues. Through time tags, tick events, short think blocks, and a four-phase QLoRA curriculum, Qwen3 models significantly outperform thinking/no-thinking baselines on TimeBench while compressing reasoning tokens by approximately an order of magnitude.

TInR: Exploring Tool-Internalized Reasoning in Large Language Models

This paper proposes the TInR-U framework, which achieves efficient and reliable tool-assisted reasoning by internalizing tool knowledge into LLM parameters (rather than relying on external documentation), outperforming existing methods in both in-domain and out-of-domain tests.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

ToolPRM decomposes function calling into fine-grained decisions such as function name selection, parameter name selection, and parameter value assignment. It trains an intra-call process reward model to guide beam search and proposes an inference scaling principle for structured outputs: "explore more but retain less." This approach consistently improves the Hammer2.1 series tool-calling models on BFCL and ToolAlpaca.

Towards Effective In-context Cross-domain Knowledge Transfer via Domain-invariant-neurons-based Retrieval

This paper proposes DIN-Retrieval, which identifies domain-invariant neurons (DINs) with consistent activation polarity across domains in LLMs to construct a domain-robust representation subspace. This subspace is used to retrieve structurally compatible cross-domain examples. It serves as the first demonstration of the feasibility of using cross-domain ICL examples to improve LLM reasoning performance, achieving an average improvement of 1.8% in math-to-logic reasoning transfer.

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

TrigReason proposes an event-triggered collaboration framework between small and large reasoning models. By analyzing three types of reasoning risks in small models (path deviation, cognitive overload, and recovery failure), it designs strategic priming, cognitive offload, and intervention request triggers to replace step-by-step polling verification. While maintaining LRM-level accuracy, it offloads 1.70-4.79x more reasoning steps to the small model, reducing latency by 43.9% and API costs by 73.3%.

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

This paper systematically analyzes the sources and amplification mechanisms of spurious signals in Test-Time Reinforcement Learning (TTRL). It identifies that the ambiguous regions formed by mid-frequency answers are the primary noise sources, and that group-relative normalization in GRPO amplifies these spurious signals. The proposed DDRL framework mitigates these issues through a three-pronged approach: balanced sampling, fixed advantage values, and consensus-based offline refinement, achieving a 15.3% relative improvement on Qwen2.5-Math-1.5B.
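
The amplification effect of group-relative normalization can be illustrated with a small sketch. It assumes majority-vote pseudo-rewards (1 for rollouts matching the most common answer, 0 otherwise) and population standard deviation; implementations differ on that choice, and the function names are illustrative. This is a toy demonstration of the mechanism, not the paper's analysis.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Sequence


def majority_vote_rewards(answers: Sequence[str]) -> List[float]:
    """Pseudo-reward: 1.0 if a rollout matches the most common answer, else 0.0."""
    label, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == label else 0.0 for a in answers]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group to zero mean and (roughly) unit variance."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Weak consensus: "a" wins with only 3 of 8 votes (a mid-frequency answer).
weak_group = ["a", "a", "a", "b", "c", "d", "e", "f"]
# Strong consensus: "a" wins with 7 of 8 votes.
strong_group = ["a"] * 7 + ["b"]

weak_adv = group_relative_advantages(majority_vote_rewards(weak_group))
strong_adv = group_relative_advantages(majority_vote_rewards(strong_group))
```

After normalization, the rollouts backing the weakly supported answer receive a larger positive advantage than those backing the strongly supported one, so an unreliable pseudo-label can produce the stronger update.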

When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

This paper proposes the DTSR framework, which identifies "reflection signals" (e.g., Wait, Alternatively) within the reasoning process and triggers a self-assessment of "thought sufficiency" to decide whether to terminate early. It achieves a 28.9%–34.9% reduction in reasoning length on Qwen3 series models with negligible accuracy loss.
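
A minimal sketch of the checkpointing idea, operating on a finished reasoning string for simplicity. The marker list uses the two examples named above; `is_sufficient` is a hypothetical stand-in for the model's own sufficiency self-assessment, and all function names are assumptions.

```python
import re
from typing import Callable, List, Sequence

REFLECTION_MARKERS = ("Wait", "Alternatively")


def reflection_checkpoints(
    reasoning: str, markers: Sequence[str] = REFLECTION_MARKERS
) -> List[int]:
    """Character offsets where a reflection marker begins."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, markers)) + r")\b")
    return [m.start() for m in pattern.finditer(reasoning)]


def truncate_at_first_sufficient(
    reasoning: str, is_sufficient: Callable[[str], bool]
) -> str:
    """Cut the reasoning at the first checkpoint whose prefix is judged sufficient."""
    for pos in reflection_checkpoints(reasoning):
        prefix = reasoning[:pos]
        if is_sufficient(prefix):
            return prefix
    # No checkpoint passed the assessment: keep the full reasoning.
    return reasoning


text = "2+2=4. Wait, check: 4. Alternatively, count."
shortened = truncate_at_first_sufficient(text, lambda prefix: "=4" in prefix)
```

In a live decoder the same check would run as tokens stream in, stopping generation instead of slicing a string.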

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

The authors propose the Rank-Surprisal Ratio (RSR) metric, which evaluates training data suitability by jointly measuring the "informativeness" and "alignment" of reasoning trajectories for a student model. RSR achieves an average Spearman correlation of 0.86 with post-training performance across 5 student and 11 teacher model combinations, and it is successfully applied to trajectory and teacher selection.