🎮 Reinforcement Learning
💬 ACL2026 · 46 paper notes
📌 Same area in other venues: 📷 CVPR2026 (23) · 🔬 ICLR2026 (400) · 🧪 ICML2026 (110) · 🤖 AAAI2026 (58) · 🧠 NeurIPS2025 (140) · 📹 ICCV2025 (7)
🔥 Top topics: Reinforcement Learning ×21 · LLM ×10 · Reasoning ×10 · Agents ×2 · Adversarial Robustness ×2
- A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks (EAGLET)
-
EAGLET decouples long-horizon agent tasks into "global planner + local executor" modules. It trains a plug-and-play planner through a two-step pipeline: "cold-start SFT with homologous consensus filtering" followed by "GRPO fine-tuning using executor capability gain as reward." It achieves new SOTA on three long-horizon benchmarks while reducing training costs to 1/8 of RL baselines.
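As a rough illustration of the second training step, the sketch below computes GRPO-style group-relative advantages from a hypothetical "executor capability gain" reward (the executor's score with a plan minus its score without one). The function names, the reward definition, and the example scores are assumptions made for illustration; they are not taken from EAGLET's implementation.

```python
from statistics import mean, pstdev

def capability_gain(score_with_plan: float, score_without_plan: float) -> float:
    """Hypothetical reward: how much a plan improves the executor's score."""
    return score_with_plan - score_without_plan

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate plans for the same task, scored by a (hypothetical) executor
# whose score without any plan is 0.25.
gains = [capability_gain(w, 0.25) for w in (0.25, 0.5, 0.75, 1.0)]
advantages = group_relative_advantages(gains)
```

Plans that help the executor more than the group average receive positive advantages, and plans that help less receive negative ones.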
- A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions
-
The first systematic survey of Reinforcement Learning (RL) for LLMs under data scarcity, proposing a three-layer taxonomy: data-centric, training-centric, and framework-centric. It covers directions such as data pruning/synthesis/compression, trajectory generation/reward engineering/policy optimization, and self-evolution/co-evolution/multi-agent evolution.
- Adaptive Instruction Composition for Automated LLM Red-Teaming
-
This paper proposes the Adaptive Instruction Composition (AIC) framework, which uses Neural Thompson Sampling to adaptively select attack instructions within a combinatorial space of crowdsourced harmful queries and jailbreak tactics. By simultaneously optimizing attack success rate and diversity, it significantly outperforms existing methods on HarmBench.
- ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
-
ARGUS utilizes a Prosecutor–Defender–Umpire three-agent debate combined with GRPO reinforcement learning. This enables the ad-review VLM to correct historical "outdated labels" and uncover latent violations in gray zones when policies are updated. Industrial A/B testing shows a relative 35.2% reduction in the Violation Leakage Rate (VLR).
- AttnPO: Attention-Guided Process Supervision for Efficient Reasoning
-
This paper proposes AttnPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. By identifying Key-Focus Heads (KFH) to distinguish between redundant and critical reasoning steps, AttnPO significantly reduces reasoning length while substantially improving accuracy.
- Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models
-
This paper identifies that in diffusion language models (dLLMs), "tokens that attend more to determined contexts exhibit more stable generation and are more critical for reasoning." Consequently, it proposes AGDO—a method that derives denoising order from attention and emphasizes these "attention hub" tokens via weighting during supervised fine-tuning (SFT) and reinforcement learning (RL). This approach consistently outperforms existing post-training methods for dLLMs that rely on random masking in mathematical and code reasoning tasks.
- Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
-
Addressing the "confirmation bias + sparse reward" issues in TTRL caused by using majority voting for pseudo-labels, SCOPE proposes step-wise confidence-weighted voting (moving beyond frequency-based selection) and Pareto-optimal dynamic subgroup partitioning (bootstrapping local consensus in independent subgroups). On Qwen3-8B, it improves AIME 2024 from 47.13 → 52.70 and AIME 2025 from 27.40 → 31.00.
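The difference between frequency-based and confidence-weighted pseudo-labeling can be sketched in a few lines. This is a simplified illustration only: it assumes each rollout already carries a single aggregated confidence score, whereas SCOPE works with step-wise confidences and subgroup partitioning that are not modeled here. The function names and example values are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Baseline pseudo-label: the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Pseudo-label chosen by summed confidence rather than raw frequency."""
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)

answers = ["12", "12", "12", "15", "15"]
confidences = [0.2, 0.3, 0.2, 0.9, 0.8]  # hypothetical per-rollout confidence scores

frequency_label = majority_vote(answers)
weighted_label = confidence_weighted_vote(answers, confidences)
```

Here the frequent but low-confidence answer wins the majority vote, while the less frequent but high-confidence answer wins the weighted vote.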
- Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents
-
To address the "evolution impasse" in open-ended social language games (Negotiation / Don't Say It / Two Dollar Game) within self-play RLVR—where agent behavior homogenization leads to deterministic match outcome distributions and vanishing gradient signals—this paper proposes DEPT. It utilizes a fast/slow dual-timescale EMA baseline to detect stagnation and applies asymmetric advantage reshaping to suppress dominant outcomes while amplifying rare trajectories. This method boosts the negotiation win rate on Qwen3-4B/8B-Base from 16-20% to 32%, with simultaneous benefits observed on OOD math and reasoning benchmarks.
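One way to picture a fast/slow dual-timescale EMA detector is sketched below. The class name, the smoothing factors, the tolerance, and the rule "stagnation means the two averages agree" are all assumptions for illustration; the summary above does not specify how DEPT defines its detection criterion.

```python
class DualEmaStagnationDetector:
    """Tracks a fast and a slow EMA of an outcome statistic (e.g. win rate)."""

    def __init__(self, fast_alpha=0.5, slow_alpha=0.05, tol=0.01):
        self.fast_alpha = fast_alpha
        self.slow_alpha = slow_alpha
        self.tol = tol
        self.fast = None
        self.slow = None

    def update(self, value: float) -> bool:
        """Update both EMAs; return True when they agree within `tol`."""
        if self.fast is None:
            self.fast = self.slow = value
            return False  # not enough history yet
        self.fast += self.fast_alpha * (value - self.fast)
        self.slow += self.slow_alpha * (value - self.slow)
        return abs(self.fast - self.slow) < self.tol

detector = DualEmaStagnationDetector()
changing_phase = [detector.update(v) for v in (0.2, 0.4, 0.6, 0.8)]
stalled_phase = [detector.update(0.8) for _ in range(200)]
```

While the statistic is still moving, the fast average runs ahead of the slow one and no stagnation is reported; once the statistic stops changing, the two averages converge and the detector fires.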
- Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
-
This paper proposes DYPO (Dynamic Policy Optimization), which routes samples to different optimization paths based on dynamic difficulty grading: Hard samples use multi-teacher distillation to reduce SFT bias, while Mid samples use Group Alignment Loss to reduce RL variance. This achieves an average gain of 4.8% on mathematical reasoning benchmarks and 13.3% on OOD tasks.
- CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
-
The CE-GPPO algorithm is proposed. By reintroducing gradient signals for low-probability tokens outside the PPO clipping interval through stop-gradient operations, it achieves fine-grained coordinated control of policy entropy and attains a better balance between exploration and exploitation.
- Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation
-
This paper introduces the CASTER task and CASTER-Bench, proposing MEDEA to simulate community responses via Social-CoT, SFT, and process-supervised Reinforcement Learning (RL) with Social Alignment Reward. MEDEA improves High-Quality F1 to 0.650 and Macro-F1 to 0.749 on CASTER-Bench, significantly outperforming traditional VQA and general LMM baselines.
- Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions
-
The authors propose constructing a compact latent action space for Multimodal Conversational Agents (MCA) to replace the vast token action space during RL fine-tuning. By utilizing cross-modal projectors and cycle consistency loss, they leverage paired image-text and text-only data to build a codebook. This approach compresses the action space from 152K (vocabulary size) to 128 (codebook size), consistently outperforming token-level RL baselines across two dialogue tasks.
- d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
-
To address two major reliability bottlenecks in RL for Diffusion Language Models (dLLM)—sparse rewards and probability estimation bias—the authors propose d-TreeRPO. It organizes rollouts into a tree structure, calculating step-wise advantages bottom-up using verifiable rewards from leaf nodes. Simultaneously, it provides a theoretical proof that "higher model confidence leads to more accurate single-step forward probability estimation," and designs a time-scheduled self-distillation loss to sharpen the policy in later training stages. Tested on LLaDA-8B-Instruct, it achieves gains of +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500.
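The bottom-up advantage computation can be sketched on a toy rollout tree. The rules used here (a node's value is the mean of its children's values, and a child's advantage is its value minus its parent's value) are assumptions chosen for illustration; the summary above does not give d-TreeRPO's exact formulas. The dictionary layout is likewise hypothetical.

```python
def node_value(node: dict) -> float:
    """Value of a node: its reward if it is a leaf, else the mean of its children."""
    if not node.get("children"):
        return node["reward"]
    return sum(node_value(c) for c in node["children"]) / len(node["children"])

def assign_advantages(node: dict) -> None:
    """Annotate every child with value(child) - value(parent)."""
    parent_value = node_value(node)
    for child in node.get("children", []):
        child["advantage"] = node_value(child) - parent_value
        assign_advantages(child)

# Two first-step branches, each expanded into two leaves with verifiable 0/1 rewards.
tree = {
    "children": [
        {"children": [{"reward": 1.0}, {"reward": 0.0}]},
        {"children": [{"reward": 0.0}, {"reward": 0.0}]},
    ]
}
assign_advantages(tree)
branch_a, branch_b = tree["children"]
```

Even though only one leaf is rewarded, the first branch receives a positive step-level advantage and the second a negative one, which is the sense in which a tree makes sparse leaf rewards denser.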
- Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints
-
This paper proposes Deliberative Searcher, a reasoning-primary framework that integrates search operations into CoT generation while maintaining explicit confidence calibration. By employing constrained RL with adaptive Lagrange multipliers to jointly optimize correctness and reliability, the framework reduces the average "false-certain" rate of 7B models from a 54% baseline to 2%.
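Constrained RL with an adaptive Lagrange multiplier is commonly implemented as a dual-ascent loop; a minimal sketch follows. The function names, the learning rate, the constraint budget, and the example rates are hypothetical and are not taken from the paper.

```python
def lagrangian_reward(correctness: float, violation: float, lam: float) -> float:
    """Scalar reward for the policy: task reward minus a weighted constraint cost."""
    return correctness - lam * violation

def update_multiplier(lam: float, avg_violation: float, budget: float, lr: float = 0.5) -> float:
    """Dual ascent: raise lambda when the constraint is violated, lower it otherwise."""
    return max(0.0, lam + lr * (avg_violation - budget))

# Hypothetical numbers: the constraint allows an average violation rate of 0.1.
lam_start = 0.0
lam_after_violation = update_multiplier(lam_start, avg_violation=0.5, budget=0.1)
lam_after_recovery = update_multiplier(lam_after_violation, avg_violation=0.05, budget=0.1)
```

The multiplier grows while the policy violates the constraint (for example, by being confidently wrong too often), shrinks once the violation rate falls below the budget, and never drops below zero.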
- DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
-
The authors propose a new paradigm of "parallel exploration", where an agent interacts with \(K\) environments synchronously and shares experiences across trajectories, and introduce the corresponding RL algorithm DPEPO. Training begins with "Cold-start SFT" to learn parallel reasoning, followed by GRPO training with hierarchical rewards consisting of "Trajectory-level Success + Step-level Diverse Action / Diverse State Transition." DPEPO achieves SOTA on all ALFWorld and ScienceWorld splits (98.2% / 61.4% on Qwen2.5-7B), with token growth significantly lower than "multi-sampling" baselines as \(K\) increases.
- Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
-
This paper proposes the EasyRL framework, inspired by cognitive development theory. It uses only 10% of simple labeled data to initialize the model via knowledge transfer, then progressively masters difficult unlabeled data through divide-and-conquer pseudo-labeling and difficulty-incremental self-training, consistently outperforming GRPO trained on the full dataset.
- Efficient Hyperparameter Optimization for LLM Reinforcement Learning
-
This paper proposes JF-HPO, which integrates small intra-family proxy models, training step fidelity, training dynamic early stopping, and checkpoint reuse into a Bayesian HPO framework. This approach finds more stable hyperparameters for LLM reinforcement learning at a lower cost and outperforms VeRL Recipe, Random Search, and BOHB across multiple reasoning tasks.
- EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning for LLMs
-
This paper proposes EvoCoT, a two-stage self-evolving curriculum learning framework. It first constrains the LLM with final answers to self-generate verifiable CoT trajectories, then progressively deletes reasoning steps from the tail to expand the exploration space. This enables stable RLVR training on hard problems with sparse rewards without relying on teacher models or human-written CoTs, significantly improving the accuracy of R1-Qwen-1.5B on hard MATH training set problems from 55.7% to 87.8%.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
-
FREIA introduces the Free Energy Principle (FEP) into label-free RL fine-tuning, simultaneously addressing the premature convergence of traditional majority voting/confidence rewards and the advantage estimation mismatch during training. It employs a "consensus + exploration" adaptive reward (FER) and adaptive advantage shaping (AAS) based on reward distribution skewness, achieving performance comparable to or better than supervised GRPO across 3 reasoning tasks and 9 datasets.
- From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation
-
This paper transforms LLM paper review from "individual absolute scoring" to "pairwise comparison followed by global ranking." By employing semantic graph sampling, comparative SFT, and Reinforcement Learning from Verifiable Rewards (RLVR) to train a 7B model, it significantly outperforms DeepReview-14B in ICLR-2025 paper ranking and acceptance prediction, while demonstrating strong transferability to unseen conferences.
- GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
-
This paper proposes GeoRA, a low-rank adaptation method specifically designed for Reinforcement Learning from Verifiable Rewards (RLVR). By constructing a geometric constraint matrix (fusing spectral and Euclidean priors) to extract the principal directions of the RL update subspace for SVD initialization and freezing the residual matrix as a structural anchor, GeoRA consistently outperforms baselines like LoRA, PiSSA, and MiLoRA on 1.5B-32B Qwen/Llama models across mathematical, medical, and code RLVR tasks, demonstrating stronger out-of-distribution generalization and reduced catastrophic forgetting.
- Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning
-
This paper proposes Glance-or-Gaze (GoG), which enables Large Multimodal Models (LMMs) to first scan the full image and then adaptively select high-value regions for intensive gaze when answering knowledge-intensive visual questions. Through SFT and complexity-adaptive GRPO, GoG significantly outperforms baselines such as direct answering, full-image search, and MMSearch-R1 across six visual Q&A and search benchmarks.
- Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
-
This paper points out that RLVR cannot distinguish between "high-quality reasoning for a correct answer" and "low-quality reasoning that happens to get the answer right." It proposes using the pedagogical utility of a demonstration, termed Evidence Gain, as an implicit quality signal. By employing In-Context RLVR, the model improves mathematical reasoning accuracy and quality without training a PRM.
- HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
-
The HEAL framework is proposed to address severe entropy collapse in few-shot RLVR by mixing general-domain data and employing an Entropy Dynamics Alignment (EDA) reward mechanism. It achieves performance matching or exceeding full-set RLVR (1K samples) using only 32 target-domain samples.
- ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following
-
ImpRIF formalizes implicit reasoning structures in complex instructions as verifiable Explicit Reasoning Graphs (ERG). Based on this, it constructs large-scale single/multi-turn data and performs training via SFT and process-verified RL. This approach enables 4B-32B models to significantly outperform base models across five instruction-following benchmarks, with the 32B model even surpassing some larger commercial models.
- KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks
-
KASER first estimates student mastery of knowledge components, then trains a code generator using GRPO with a hybrid reward of "code similarity + error matching + diversity" to simulate programming errors consistent with the student's knowledge state.
- KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
-
KnowRL integrates "atomic fact verification" as a process-level reward directly into the GRPO training loop, performing factual assessment on each step of the slow-thinking model's Chain-of-Thought (CoT). Simultaneously, it employs a "positive reward for refusal" strategy to teach the model to identify its own knowledge boundaries. This approach reduces the SimpleQA Incorrect Rate by 20.3% without compromising (and even slightly improving) performance on reasoning benchmarks such as GPQA and AIME, while demonstrating cross-lingual transfer from English knowledge to Chinese QA.
- LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
-
LANG bootstraps multilingual mathematical reasoning RL with same-language reasoning hints, then utilizes cosine decay and language-difficulty-based adaptive hint termination to improve non-English reasoning accuracy while maintaining language consistency.
- LearnAlign: Data Selection for LLM Reinforcement Learning with Improved Gradient Alignment
-
To address data selection for RLVR post-training, this paper proposes LearnAlign, which uses "gradient alignment" as a representativeness metric and a success-rate-based learnability weight \(V(\xi)=p(1-p)\), where \(p\) is the success rate, to eliminate response length bias. With only 1,000 samples (~6%), it achieves performance close to full-set training across 5 reasoning benchmarks (42.4% vs 44.9%); on GSM8K, training on 13.4% of the data reaches 77.5%, exceeding full-set performance (77.0%).
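The learnability weight \(V(\xi)=p(1-p)\) is easy to evaluate by hand; the small sketch below tabulates it for a few success rates. The function name is hypothetical, and how this weight is combined with the gradient-alignment score is not shown because the summary above does not specify it.

```python
def learnability(success_rate: float) -> float:
    """V = p(1 - p): variance of a Bernoulli outcome with success probability p."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    return success_rate * (1.0 - success_rate)

weights = {p: learnability(p) for p in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Prompts the model always solves or never solves get zero weight, and the weight peaks at 0.25 for prompts solved half the time.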
- LENS: Less Noise, More Voice — Reinforcement Learning for Reasoning via Instruction Purification
-
LENS discovers that many exploration failures in RLVR are not due to task difficulty but are caused by a small portion (<5%) of interference tokens in the prompt. By identifying and removing these tokens to improve rollout success rates and transferring learning signals from purified rollouts to policy optimization on the original noisy prompts, LENS achieves an average improvement of 3.88% and a 1.6x acceleration.
- LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations
-
LoVeC trains LLMs to append a numerical `<confidence>` tag (0–10) after each sentence during long-form generation. Using GRPO (online, requiring an oracle fact-checker) or DPO (offline preference pairs), the model aligns these tags with factuality determined by GPT-4o. This enables single-pass decoding to output calibratable, machine-parseable confidence scores, outperforming the previous SOTA, LUQ, across Brier/ECE/Spearman metrics and achieving a 20x inference speedup.
- NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
-
NaviMaster reformulates both GUI operations and embodied navigation into a unified MDP of "visual target localization + action execution." It trains a Qwen2.5-VL-7B policy using GRPO on mixed trajectories with distance-aware dense rewards, outperforming single-domain training and mainstream baselines in OOD GUI tasks, spatial affordance prediction, and ObjectNav.
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
-
RL-PLUS proposes a hybrid-policy optimization method that addresses external data distribution mismatch through Multiple Importance Sampling (MIS) and guides the model to learn low-probability but correct reasoning paths via the Exploration-based Advantage Function (EAF). It counters the capability boundary collapse caused by RLVR, achieving SOTA (average 53.4) across six mathematical reasoning benchmarks and consistent improvements of up to 69.2% across models.
- Savoir: Learning Social Savoir-Faire via Shapley-based Reward Attribution
-
This paper proposes Savoir, a social RL framework based on cooperative game theory. It combines Expected Utility (prospective evaluation of the strategic potential of utterances) and Shapley values (axiomatic fair credit assignment) to solve the credit assignment problem in multi-turn dialogues. It achieves SOTA performance on the SOTOPIA benchmark with a 7B model (Goal 7.18 in the Hard setting), matching or exceeding GPT-4o and Claude-3.5-Sonnet, while revealing that large reasoning models (o1, DeepSeek-R1) systematically underperform on social tasks.
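For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy three-utterance dialogue. It illustrates the standard Shapley formula only; the toy value function, the utterance names, and the brute-force enumeration are assumptions for illustration and say nothing about how Savoir estimates utilities or scales to real dialogues.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a payoff."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

# Toy dialogue: utterances u1 and u2 only help when used together; u3 adds 1 on its own.
def toy_value(coalition):
    score = 0.0
    if {"u1", "u2"} <= coalition:
        score += 4.0
    if "u3" in coalition:
        score += 1.0
    return score

phi = shapley_values(["u1", "u2", "u3"], toy_value)
```

The two utterances that only work jointly split their shared payoff equally (2 each), the independent utterance keeps its own contribution (1), and the credits sum to the total dialogue score of 5.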
- Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study
-
This paper presents the first systematic study of scaling behaviors in LLM reinforcement learning (RL) post-training. Conducted on the Qwen2.5 series (0.5B-72B), the study reveals that performance follows a power-law relationship with training resources, and learning efficiency tends toward saturation as model scale increases.
- Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS
-
Self-EmoQ models "what emotion the system should use to speak" as an utterance-level reinforcement learning decision problem. Before generating text, it utilizes value-based RL (DQN) to plan the emotion for the current turn. This emotion then simultaneously drives both text generation and streaming emotional speech synthesis (Emo-TTS), with rewards designed based on Plutchik's Wheel of Emotions theory to ensure more human-like emotion selection.
- Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning
-
This paper argues that the traditional token-level exploration-exploitation trade-off in RLVR is an artifact of measurement. It proposes decoupling exploration and exploitation in the latent semantic space using Effective Rank (ER) and its temporal derivatives (ERV/ERA). Based on this, the VERL method is designed to achieve simultaneous improvement in both, resulting in gains of up to 21.4% on benchmarks such as Gaokao Math.
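Effective rank is commonly defined as the exponential of the Shannon entropy of the normalized singular values; the sketch below uses that definition and takes finite differences over training steps as stand-ins for ERV and ERA. Whether the paper uses exactly this definition, and how it computes the temporal derivatives, are assumptions here; the singular values are made-up examples.

```python
import math

def effective_rank(singular_values):
    """exp(Shannon entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def finite_differences(series):
    """First-order differences of a sequence, used here as a discrete derivative."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical singular values of hidden states at three successive checkpoints.
er_track = [effective_rank(sv) for sv in ([1, 1, 1, 1], [2, 1, 1, 0], [4, 0, 0, 0])]
velocity = finite_differences(er_track)       # stand-in for ERV
acceleration = finite_differences(velocity)   # stand-in for ERA
```

A flat spectrum over four directions gives an effective rank of 4, while a spectrum concentrated on one direction gives 1, so a falling track indicates representations collapsing onto fewer semantic directions.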
- SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving
-
This paper proposes SpiralThinker, a framework for implicit reasoning that updates latent representations iteratively while interleaving them with text reasoning steps. By introducing a progressive alignment objective, the framework ensures that latent representations remain consistent with explicit reasoning during iterations, outperforming all latent reasoning baselines on math, logic, and commonsense reasoning tasks.
- Targeted Exploration via Unified Entropy Control for Reinforcement Learning
-
This paper proposes UEC-RL, a unified bidirectional entropy control framework. It addresses the common issues of entropy collapse and training instability in GRPO through targeted high-temperature exploration for difficult prompts (increasing entropy) and experience replay stabilizers to consolidate high-quality trajectories (decreasing entropy), achieving a 37.9% relative improvement on Geometry3K.
- The Stackelberg Speaker: Optimizing Persuasive Communication in Social Deduction Games
-
This paper models turn-based dialogues in Social Deduction Games (SDGs) as a Stackelberg game, where the current player acts as a leader optimizing the persuasiveness of an utterance by measuring the response distribution of the next player. A Refiner model trained using GRPO significantly outperforms baselines across four game benchmarks, including Werewolf and Avalon.
- Understanding Generalization in Role-Playing Models via Information Theory
-
This paper proposes the first information-theoretic framework, R-EMID, to quantify the performance degradation of Role-Playing Models (RPMs) under distribution shifts of users, roles, and dialogues. By introducing intermediate reasoning processes and Co-evolutionary Reinforcement Learning (CoRL) for accurate estimation, it identifies user shift as the primary generalization risk and finds that reinforcement learning is the only consistently effective method for improvement.
- UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning
-
This paper proposes the UniCreative framework, which unifies two creative writing modes—long-form (Plan \(\rightarrow\) Write) and short-form (Direct Generation)—using Adaptive Constrained Preference Optimization (ACPO) and Adaptive Criteria Generative Reward Model (AC-GenRM). Without SFT or reference solutions, the model develops an emergent metacognitive ability to autonomously distinguish task types.
- Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward
-
VIGOR employs the teacher-forced NLL gradient norm of each completion under current model parameters as an intrinsic reward, favoring outputs with low gradient norms. It stabilizes GRPO using \(\sqrt{T}\) length correction and intra-group rank shaping, thereby enhancing mathematical and code reasoning without requiring gold answers or external verifiers.
- Visually-Guided Policy Optimization for Multimodal Reasoning
-
VGPO utilizes hidden-state similarity to locate vision-related tokens during RLVR training. By applying late-stage visual compensation and dual-grained advantage re-weighting (intra- and inter-trajectory), it strengthens visual focus. Qwen2.5-VL-7B equipped with VGPO outperforms GRPO/DAPO and existing vision-enhanced RL methods in mathematical multimodal reasoning and vision-dependent tasks.
- SCRL: What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time
-
This paper proposes SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework. It mitigates label noise amplification by using selective positive pseudo-labels (filtering unreliable majorities with strict consensus criteria) and entropy-gated negative pseudo-labels (introducing negative supervision signals in TTRL for the first time to prune incorrect trajectories). SCRL achieves up to a 10.1 percentage point improvement over TTRL on AIME25.
- Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
-
Applying strictly controlled SFT/RL post-training comparisons and Sparse Crosscoder feature alignment, this paper finds that while SFT rapidly forms numerous specialized features, RL tends to retain base representations while gradually enhancing a small set of cross-task generalization features. Ablating these features significantly harms RL generalization, whereas amplifying them improves base model performance.