🎮 Reinforcement Learning

💬 ACL2026 · 38 paper notes

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions: The first systematic survey of reinforcement learning for LLMs under data scarcity, proposing a three-level taxonomy organized around data-centric, training-centric, and framework-centric perspectives, covering data pruning/synthesis/compression, trajectory generation/reward engineering/policy optimization, and self-evolution/co-evolution/multi-agent evolution paradigms.
Adaptive Instruction Composition for Automated LLM Red-Teaming: This paper proposes the Adaptive Instruction Composition (AIC) framework, which employs Neural Thompson Sampling to adaptively select attack instructions from the combinatorial space of crowdsourced harmful queries and jailbreak strategies, jointly optimizing attack success rate (ASR) and diversity. AIC achieves substantial improvements over existing methods on HarmBench.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation: This paper introduces AJ-Bench, the first benchmark systematically evaluating Agent-as-a-Judge capabilities, covering 155 tasks and 516 annotated trajectories across three domains—search, data systems, and GUI. Experiments demonstrate that Agent-as-a-Judge improves average F1 by approximately 13 percentage points over LLM-as-a-Judge.
AttnPO: Attention-Guided Process Supervision for Efficient Reasoning: This paper proposes AttnPO, a low-overhead process supervision RL framework that leverages intrinsic attention signals for step-level credit assignment. By identifying Key-Focus Heads (KFH) to distinguish redundant from critical reasoning steps, AttnPO substantially reduces reasoning length while significantly improving accuracy.
Bootstrapping Code Translation with Weighted Multilanguage Exploration: BootTrans proposes a bootstrapping multilingual code translation approach that leverages test cases from a single pivot language (Python) as cross-lingual verification oracles, employs a dual-pool architecture to expand training data through experience collection, and designs a language-aware weighting mechanism to dynamically prioritize difficult translation directions, achieving significant improvements over baselines on HumanEval-X and TransCoder-Test.
Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning: This paper proposes DYPO (Dynamic Policy Optimization), which dynamically routes samples to different optimization paths based on difficulty grading — Hard samples use multi-teacher distillation to reduce SFT bias, while Mid samples use Group Alignment Loss to reduce RL variance. DYPO achieves an average improvement of 4.8% on mathematical reasoning benchmarks and 13.3% on OOD tasks.
CAP: Controllable Alignment Prompting for Unlearning in LLMs: This paper proposes the CAP framework, which trains a lightweight SLM to generate controllable prompt prefixes that guide a frozen LLM to selectively forget target knowledge. Without modifying model parameters, CAP achieves reversible and transferable knowledge unlearning in LLMs.
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning: This paper proposes the CE-GPPO algorithm, which reintroduces gradient signals for low-probability tokens outside the PPO clipping range via stop-gradient operations, enabling fine-grained coordination of policy entropy and achieving a better balance between exploration and exploitation.
ChipSeek: Optimizing Verilog Generation via EDA-Integrated Reinforcement Learning: ChipSeek proposes a hierarchical reward RL framework that integrates the EDA toolchain directly into the training loop. Through Curriculum-guided Dynamic Policy Optimization (CDPO), it enables LLMs to generate RTL code that simultaneously satisfies functional correctness and PPA (Power-Performance-Area) optimization, achieving SOTA on standard benchmarks.
Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions: This paper proposes constructing a compact latent action space for multimodal conversational agents (MCAs) to replace the prohibitively large token action space in RL fine-tuning. A cross-modal projector and a cycle-consistency loss are employed to jointly leverage paired image-text data and text-only data for codebook construction, compressing the action space from 152K (vocabulary size) to 128 (codebook size). The proposed method consistently outperforms token-level RL baselines on two dialogue tasks.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training: This paper proposes the Data Mixing Agent, the first model-based end-to-end domain re-weighting framework. By training a small agent on a large collection of data mixing trajectories via CQL-based reinforcement learning, the framework learns generalizable data mixing heuristics that balance source- and target-domain performance during continual pre-training for mathematical reasoning. The learned heuristics generalize to unseen source domains, target models, and domain spaces.
Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with Constraints: This paper proposes Deliberative Searcher, a reasoning-first framework that integrates search operations into chain-of-thought (CoT) generation with explicit confidence calibration. It employs constrained RL with adaptive Lagrangian multipliers to jointly optimize correctness and reliability, reducing the average "false-certain" rate of a 7B model from a baseline of 54% to 2%.
Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning: This paper proposes EasyRL, a cognitively inspired framework that uses only 10% easy labeled data for warmup initialization via knowledge transfer, then progressively masters hard unlabeled data through divide-and-conquer pseudo-labeling and difficulty-progressive self-training, consistently outperforming supervised GRPO trained on the full dataset.
FaithLens: Detecting and Explaining Faithfulness Hallucination: This paper proposes FaithLens, an 8B-parameter faithfulness hallucination detection model trained via high-quality data synthesis with three-dimensional filtering (label correctness, explanation quality, and data diversity) for cold-start SFT, followed by rule-based reinforcement learning (prediction correctness reward + explanation quality reward) for further optimization. FaithLens surpasses GPT-5.2 and o3 across 12 tasks while providing high-quality explanatory outputs.
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments: This paper proposes FTRL, a framework that constructs stable and controllable tool-use training environments through a five-stage automated pipeline, and designs a verifiable reward mechanism balancing tool-call precision and task completion in an F1-inspired manner. Combined with preference-optimization RL algorithms, FTRL achieves an average performance improvement of over 10% on tool-use benchmarks for 7B–14B models, surpassing even the strongest closed-source models.
Frame of Reference: Addressing the Challenges of Common Ground Representation in Dialogue: This paper introduces the IndiRef benchmark for evaluating dialogue systems' ability to establish and exploit persistent common ground through "relational references" (e.g., "the café next to the park we visited yesterday"). Experiments show that existing LLMs achieve no more than 50% accuracy even under full-context conditions, and that combining synthetic data generation with GRPO-based reinforcement learning yields performance improvements of 15–20%.
From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models: This paper presents a systematic survey of the functional evolution of uncertainty quantification (UQ) in LLMs—from a "passive diagnostic metric" to an "active control signal"—covering three frontier domains: advanced reasoning (guiding computational allocation and self-correction), autonomous agents (meta-cognitive decision-making driving tool use and information acquisition), and reinforcement learning (mitigating reward hacking and enabling self-improvement via intrinsic rewards).
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR: This paper proposes GeoRA, a low-rank adaptation method specifically designed for Reinforcement Learning with Verifiable Rewards (RLVR). It constructs a geometry-constrained matrix that fuses spectral and Euclidean priors to extract the principal directions of the RL update subspace for SVD initialization, while freezing a residual matrix as a structural anchor. On Qwen/Llama models ranging from 1.5B to 32B parameters, GeoRA consistently outperforms baselines such as LoRA, PiSSA, and MiLoRA across mathematical, medical, and code RLVR tasks, with stronger out-of-domain generalization and reduced capability forgetting.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment: This work proposes the HEAL framework, which addresses severe entropy collapse in few-shot RLVR by combining general-domain data with an Entropy Dynamics Alignment (EDA) reward mechanism. Using only 32 target-domain samples, HEAL matches or surpasses the performance of full-shot RLVR trained on 1K samples.
ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following: ImpRIF formalizes the implicit reasoning structure in complex instructions as a verifiable Explicit Reasoning Graph (ERG), constructs large-scale single-turn/multi-turn training data accordingly, and trains models via SFT combined with process-verified RL. Models ranging from 4B to 32B parameters significantly outperform their base counterparts across five instruction-following benchmarks, with the 32B model surpassing several large commercial models.
Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation: This paper proposes the LcRL framework, which addresses knowledge bias and knowledge conflict in multilingual RAG through language-coupled GRPO policy optimization and anti-alignment penalty rewards, achieving significant improvements on multilingual question answering tasks.
LENS: Less Noise, More Voice — Reinforcement Learning for Reasoning via Instruction Purification: LENS identifies that many exploration failures in RLVR stem not from problem difficulty but from a small fraction (<5%) of distractor tokens in the prompt. By detecting and removing these tokens to improve rollout success rates, and transferring the learning signal from purified rollouts back to policy optimization on the original noisy prompts, LENS achieves an average improvement of 3.88% and a 1.6× training speedup.
Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization: This paper proposes PURPLE, a framework that models user profile construction in retrieval-augmented LLM personalization as a contextual bandit problem. It employs the Plackett-Luce ranking model to capture inter-record dependencies, uses the LLM's log-likelihood over reference responses as a reward signal, and directly optimizes retrieval to align with generation quality.
Quality Over Clicks: Intrinsic Quality-Driven Iterative RL for Cold-Start E-Commerce Query Suggestion: This paper proposes Cold-EQS, a query suggestion framework for cold-start e-commerce scenarios. It leverages answerability, factual accuracy, and information gain as intrinsic quality rewards, and employs iterative reinforcement learning to continuously optimize query suggestion quality, achieving a 6.81% online chatUV improvement.
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning: This paper proposes ReRec, a reinforcement fine-tuning (RFT)-based recommendation assistant framework that addresses the limitations of coarse reward signals and unsupervised reasoning processes through three components: dual-graph enhanced reward shaping for fine-grained reward signals, reasoning-aware advantage estimation for step-level differentiated supervision, and an online curriculum scheduler for dynamic training difficulty adjustment. ReRec enables LLMs to handle complex multi-step reasoning recommendation queries and significantly outperforms existing methods on the RecBench+ benchmark.
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF: This paper proposes Reverse Constitutional AI (R-CAI), which inverts the principles of Constitutional AI into a "toxic constitution" and combines a critique-revision loop with a probability-clamped RLAIF mechanism to achieve automated, controllable, multi-dimensional adversarial toxic data synthesis. Probability clamping mitigates reward hacking-induced semantic degradation, improving semantic coherence by 15%.
Right at My Level: A Unified Multilingual Framework for Proficiency-Aware Text Simplification: This paper proposes Re-RIGHT, a framework that trains a 4B policy model via GRPO with a three-module reward (vocabulary coverage + semantic preservation + coherence) to accurately simplify text in English, Japanese, Korean, and Chinese according to learner proficiency levels (CEFR/JLPT/TOPIK/HSK), outperforming large models such as GPT-5.2 and Gemini 2.5.
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization: RL-PLUS proposes a hybrid-policy optimization approach that addresses external data distribution mismatch via Multiple Importance Sampling (MIS) and guides models to learn low-probability but correct reasoning paths via an Exploration-Based Advantage Function (EAF), successfully overcoming the capability boundary collapse induced by RLVR and achieving SOTA on six mathematical reasoning benchmarks (average 53.4), with consistent cross-model improvements of up to 69.2%.
Savoir: Learning Social Savoir-Faire via Shapley-based Reward Attribution: This paper proposes Savoir, a cooperative game-theoretic social RL framework that combines expected utility (prospective evaluation of the strategic potential of utterances) and Shapley values (axiomatic fair credit assignment) to address the credit assignment problem in multi-turn dialogue. Savoir achieves state-of-the-art performance on the SOTOPIA benchmark with a 7B model (Goal 7.18 in the Hard setting), matching or surpassing GPT-4o and Claude-3.5-Sonnet, while large reasoning models (o1, DeepSeek-R1) systematically underperform on social tasks.
Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study: This work presents the first systematic study of scaling behaviors in LLM reinforcement learning post-training, revealing power-law relationships between performance and training resources across the Qwen2.5 family (0.5B–72B), with learning efficiency saturating as model scale increases.
Semantic-Space Exploration and Exploitation in RLVR for LLM Reasoning: This paper argues that the conventional token-level exploration–exploitation trade-off in RLVR is an artifact of the measurement space. It proposes to measure exploration and exploitation in the hidden-state semantic space via Effective Rank (ER) and its temporal derivatives (ERV/ERA), and on this basis designs VERL, a method that simultaneously improves both objectives, achieving gains of up to 21.4% on benchmarks such as Gaokao mathematics.
SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving: This paper proposes SpiralThinker, a framework for implicit reasoning that performs iterative updates in the latent representation space interleaved with explicit text reasoning steps. A progressive alignment objective is introduced to ensure latent representations remain consistent with explicit reasoning throughout the iterative process. SpiralThinker surpasses all latent reasoning baselines on mathematical, logical, and commonsense reasoning tasks.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems: This paper proposes the STRIDE-ED framework, which achieves state-of-the-art performance in empathetic dialogue across multiple open-source LLMs by constructing a comprehensive empathy strategy system covering positive/neutral/negative emotions, designing task-aligned multi-stage cognitive CoT reasoning, and combining strategy-aware data refinement with a two-stage SFT+PPO training paradigm. The framework attains an emotion accuracy of 57.25% and BLEU-4 of 4.67.
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey: This paper presents a comprehensive survey of Table Question Answering (TQA) research in the era of large language models. It systematically categorizes task settings along five dimensions (table format, question complexity, answer format, modality, and domain), organizes modeling approaches around five core challenges (table understanding, complex queries, large input handling, data heterogeneity, and knowledge integration), covers 277 papers, and provides forward-looking discussions on emerging directions such as reinforcement learning and interpretability.
The Stackelberg Speaker: Optimizing Persuasive Communication in Social Deduction Games: This paper models turn-based dialogue in social deduction games as a Stackelberg game, where the current player acts as the leader and optimizes the persuasive impact of utterances by measuring the response distribution of the next player. A Refiner model trained with GRPO achieves significant improvements over baselines across four game benchmarks including Werewolf and Avalon.
Understanding Generalization in Role-Playing Models via Information Theory: This paper proposes R-EMID, the first information-theoretic metric for quantifying performance degradation in role-playing models (RPMs) under user, character, and dialogue distribution shifts. Incorporating reasoning processes and Co-evolutionary Reinforcement Learning (CoRL) enables accurate estimation of this metric. Key findings reveal that user shift poses the greatest generalization risk, and that reinforcement learning is the only consistently effective training strategy.
UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning: This paper proposes UniCreative, a framework that unifies long-form (plan→write) and short-form (direct generation) creative writing modes through Adaptive Constraint Preference Optimization (ACPO) and an Adaptive Criteria Generative Reward Model (AC-GenRM), requiring neither SFT nor reference answers. The trained model exhibits emergent metacognitive ability to autonomously distinguish between task types.
SCRL: What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time: This paper proposes SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning (TTRL) framework that mitigates label noise amplification through two mechanisms: selective positive pseudo-labels, which filter unreliable majorities via strict consensus criteria, and entropy-gated negative pseudo-labels, which introduce negative supervision signals into TTRL for the first time to prune erroneous trajectories. SCRL achieves an improvement of up to 10.1 percentage points over TTRL on AIME25.
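
As a rough illustration of the pseudo-labeling idea in the SCRL note above, the sketch below picks a positive pseudo-label only when the majority answer clears a strict vote-share threshold, and emits negative pseudo-labels for rarely sampled answers only when the vote distribution passes an entropy gate. This is not the paper's implementation: the function name `select_pseudo_labels`, the threshold values, and the direction of the entropy gate (negatives are emitted only when vote entropy is low) are all assumptions made for illustration.

```python
import math
from collections import Counter


def select_pseudo_labels(answers, pos_threshold=0.6, neg_threshold=0.1, max_entropy=1.0):
    """Hypothetical sketch of consensus-filtered pseudo-labeling.

    answers: final answers extracted from sampled rollouts for one prompt.
    Returns (positive_label_or_None, sorted_list_of_negative_labels).
    All thresholds are illustrative assumptions, not values from the paper.
    """
    if not answers:
        return None, []

    n = len(answers)
    shares = {answer: count / n for answer, count in Counter(answers).items()}

    # Selective positive label: accept the majority answer only under strict consensus.
    top_answer, top_share = max(shares.items(), key=lambda kv: kv[1])
    positive = top_answer if top_share >= pos_threshold else None

    # Entropy (in nats) of the empirical vote distribution.
    entropy = -sum(p * math.log(p) for p in shares.values())

    # Assumed gate: only trust "rare answer = wrong" when the votes are concentrated.
    negatives = []
    if entropy <= max_entropy:
        negatives = sorted(
            a for a, p in shares.items() if p <= neg_threshold and a != top_answer
        )
    return positive, negatives


if __name__ == "__main__":
    rollouts = ["42"] * 7 + ["41"] * 2 + ["7"]
    print(select_pseudo_labels(rollouts))  # ('42', ['7'])
```

With a flat vote such as `["a", "b", "c", "d"]`, no answer reaches the consensus threshold and the entropy gate stays closed, so the function returns `(None, [])` and the prompt contributes no pseudo-labels.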