🎮 Reinforcement Learning

💬 ACL2025 · 8 paper notes

📌 Same area in other venues: 📷 CVPR2026 (25) · 🔬 ICLR2026 (400) · 💬 ACL2026 (46) · 🧪 ICML2026 (110) · 🤖 AAAI2026 (58) · 🧠 NeurIPS2025 (143)

🔥 Top topics: Reinforcement Learning ×7 · LLM ×4

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback: This paper proposes Align-SLM, the first framework to apply preference optimization (DPO + RLAIF) to textless spoken language models, that is, models trained without text injection. An LLM automatically evaluates the quality of generated speech continuations to build preference datasets. Combined with curriculum learning, this iteratively improves the semantic understanding of SLMs and sets a new SOTA on benchmarks such as ZeroSpeech and StoryCloze.
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient: This paper proposes a policy gradient-based structural pruning method for LLMs. By learning Bernoulli pruning masks in the probability space, it directly optimizes the loss function of the pruned model without requiring any backpropagation through the LLM itself, relying solely on forward inference to complete pruning optimization.
An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals: This paper is the first to apply Evolutionary Reinforcement Learning (ERL) to task-oriented dialogue policy learning. It proposes EIERL, which combines the global exploration of Evolutionary Algorithms (EA) with the local optimization of Deep Reinforcement Learning (DRL). An Elite Individual Injection (EII) mechanism addresses the slow evolution of EA in the large search space of natural language, yielding a more efficient exploration-exploitation balance across four datasets.
Learning to Generate Structured Output with Schema Reinforcement Learning: This paper proposes SchemaBench, a benchmark of approximately 40,000 JSON schemas, and Schema Reinforcement Learning (SRL), a training framework. SRL uses a fine-grained schema validator to provide dense reward signals and pairs it with a Thoughts of Structure (ToS) reasoning mechanism, improving LLM accuracy on complex JSON generation by up to 16% without compromising general reasoning abilities.
LLM-Enhanced Self-Evolving Reinforcement Learning for Multi-Step E-Commerce Payment Fraud Risk Detection: This paper formulates e-commerce payment fraud detection as a multi-step MDP and uses LLMs (Mixtral/LLaMA/Gemma) to automatically generate and optimize RL reward functions through an evolutionary algorithm. On real eBay transaction data, the resulting rewards significantly improve dollar-wise precision over human-designed rewards and traditional supervised learning (SL) baselines.
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning: This paper proposes MAPoRL, a post-training paradigm based on multi-agent reinforcement learning. It co-trains multiple LLMs within a debate framework, using verifier scoring and collaborative incentive mechanisms, which significantly improves the effectiveness of multi-LLM collaboration and demonstrates cross-task generalization.
Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering: This paper proposes RL-Profiler, which trains a post relevance filter (SelNet) using reinforcement learning to select a small subset of posts relevant to personality traits from a user's large profile. These selected posts are then passed to an LLM for zero-shot personality prediction, thereby significantly reducing context length while maintaining prediction performance close to using all posts.
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search: TreeRL integrates Entropy-Guided Parallel Tree search (EPTree) directly into on-policy reinforcement learning training for LLMs. By branching at tokens with high uncertainty, it expands the diversity of reasoning paths, and it uses global and local advantages derived from the tree structure as process supervision signals. It surpasses traditional multi-chain sampling RL on mathematics and code reasoning tasks.
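
To make the idea in the "Bypass Back-propagation" entry more concrete, here is a minimal sketch of learning Bernoulli pruning-mask probabilities with a score-function (REINFORCE) gradient, using only forward evaluations of a loss. It is an illustration under stated assumptions, not the paper's implementation: `toy_loss`, the `importance` weights, the sparsity penalty, the mean-loss baseline, and all hyperparameters are hypothetical stand-ins for a real forward pass of a pruned LLM.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_loss(mask, importance, sparsity_weight):
    """Hypothetical black-box loss standing in for a forward pass of the pruned model.

    Dropping unit i costs importance[i]; keeping any unit costs sparsity_weight.
    """
    dropped_cost = sum(w for w, m in zip(importance, mask) if m == 0)
    kept_cost = sparsity_weight * sum(mask)
    return dropped_cost + kept_cost


def learn_mask_probs(importance, sparsity_weight=0.5, steps=300,
                     samples=16, lr=0.5, seed=0):
    """Optimize keep-probabilities of Bernoulli masks without differentiating the loss."""
    rng = random.Random(seed)
    logits = [0.0] * len(importance)  # keep-probability starts at 0.5 for every unit
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        # Sample candidate masks and score each one with a forward-only evaluation.
        masks = [[1 if rng.random() < p else 0 for p in probs]
                 for _ in range(samples)]
        losses = [toy_loss(m, importance, sparsity_weight) for m in masks]
        baseline = sum(losses) / samples  # simple variance-reduction baseline
        for i in range(len(logits)):
            # For p = sigmoid(theta), d log P(mask) / d theta_i = mask_i - p_i.
            grad = sum((loss - baseline) * (m[i] - probs[i])
                       for m, loss in zip(masks, losses)) / samples
            logits[i] -= lr * grad  # gradient descent on the expected loss
    return [sigmoid(t) for t in logits]
```

Calling `learn_mask_probs([2.0, 0.1, 1.5, 0.05])` should push the keep-probability up for units whose importance exceeds the sparsity penalty and down for the others. The point of the sketch is that the update only needs sampled masks and their scalar losses, so no gradient ever flows through the model being pruned.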