🎮 Reinforcement Learning

🤖 AAAI2026 · 71 paper notes

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs

This paper proposes a multi-dimensional objective-space framework for evaluating LLM steerability, decomposing steering error into miscalibration and side effects (orthogonality). Experiments on text rewriting reveal that even the strongest LLMs produce severe side effects; prompt engineering proves ineffective, best-of-N sampling is prohibitively costly, and RL fine-tuning yields improvements but does not fully resolve the problem.
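
As a hedged illustration of the decomposition (our notation; the paper's exact formalization may differ), one can project the steering error onto the requested direction in objective space:

```latex
% Hedged reconstruction, not necessarily the paper's notation:
% z_0 = source text attributes, z^* = requested target, z = model output,
% all represented in a shared multi-dimensional objective space.
\begin{align*}
e &= z - z^{*}, \qquad u = \frac{z^{*} - z_0}{\lVert z^{*} - z_0 \rVert} \\
\text{miscalibration} &= \langle e, u \rangle
  && \text{(error along the requested steering direction)} \\
\text{side effects} &= \lVert e - \langle e, u \rangle\, u \rVert
  && \text{(orthogonal movement on unrequested attributes)}
\end{align*}
```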

A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge

This paper proposes reMARL, a framework that leverages domain knowledge from image processing (active contour model) to design reward functions for multi-agent reinforcement learning, enabling cooperative collision avoidance in UAV swarms. Compared to traditional metaheuristic methods, reMARL reduces reaction time by 98.75% and energy consumption by 85.37%.

A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

This paper proposes MACO, a multi-agent conversational bandit framework that achieves online evaluation and user preference alignment for LLM responses through a local-agent phase elimination mechanism and an adaptive preference query strategy on a cloud server, attaining a near-optimal regret bound of \(\tilde{O}(\sqrt{dMT})\).

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

AC3 proposes an actor-critic framework that directly learns continuous action sequences (action chunks), stabilizing long-horizon robotic manipulation under sparse rewards via an asymmetric actor update rule—updating the actor only from successful trajectories—and self-supervised anchor-based intrinsic rewards. The method achieves superior success rates over existing approaches across 25 tasks on BiGym and RLBench.

Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

This paper proposes a test-time policy shaping method that interpolates and modifies the action probability distribution of pretrained RL agents at inference time using lightweight ethical attribute classifiers, enabling fine-grained behavioral steering across multiple ethical attributes without retraining.

BAMAS: Structuring Budget-Aware Multi-Agent Systems

This paper proposes the BAMAS framework, which employs Integer Linear Programming (ILP) to select the optimal LLM combination under budget constraints, and uses a reinforcement learning policy to choose the best collaboration topology (Linear/Star/Feedback/Planner-Driven). BAMAS achieves accuracy comparable to state-of-the-art multi-agent systems on GSM8K, MBPP, and MATH, while reducing costs by up to 86%.
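
The LLM-selection step can be pictured as a small budget-constrained ILP; the formulation below uses our own symbols (utility \(u_i\), per-call cost \(c_i\), budget \(B\)) and is illustrative rather than the paper's exact program:

```latex
% Illustrative budget-constrained selection ILP (our notation):
% x_i = 1 iff LLM i is included in the agent combination.
\begin{align*}
\max_{x} \quad & \sum_{i=1}^{n} u_i \, x_i
  && \text{maximize estimated task utility} \\
\text{s.t.} \quad & \sum_{i=1}^{n} c_i \, x_i \le B
  && \text{keep total cost within budget} \\
& x_i \in \{0, 1\}, \quad i = 1, \dots, n
\end{align*}
```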

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

This paper proposes Behaviour Policy Optimization (BPO), which optimizes a dedicated behaviour policy for off-policy data collection such that the variance of return estimates is provably lower than on-policy collection, thereby improving the sample efficiency and stability of REINFORCE and PPO.
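
The identity behind this is the standard importance-sampling estimator; the textbook sketch below shows the unbiasedness, the variance that depends on the behaviour policy, and the classical variance-minimizing choice (our notation, not BPO's exact objective):

```latex
% Textbook importance-sampling facts, with J = E_{tau~pi}[R(tau)]:
\begin{align*}
\hat{J} &= \frac{\pi(\tau)}{\mu(\tau)}\, R(\tau),
& \mathbb{E}_{\tau\sim\mu}\big[\hat{J}\big] &= \mathbb{E}_{\tau\sim\pi}\big[R(\tau)\big] = J, \\
\operatorname{Var}_{\mu}\big[\hat{J}\big]
  &= \mathbb{E}_{\tau\sim\mu}\!\left[\frac{\pi(\tau)^{2} R(\tau)^{2}}{\mu(\tau)^{2}}\right] - J^{2},
& \mu^{*}(\tau) &\propto \pi(\tau)\,\lvert R(\tau)\rvert
\end{align*}
```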

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

Through dynamical systems analysis, this paper proves that under approximate greedy exploration policies, all zero-loss solutions violating IGM consistency in non-monotonic value factorization Q-learning are unstable saddle points, while IGM-consistent solutions are stable attractors — enabling reliable convergence to optimal solutions without monotonicity constraints.
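
For reference, the IGM consistency condition at stake (standard definition, our notation): the greedy joint action under the factored \(Q_{tot}\) must coincide with each agent's individually greedy action.

```latex
% Individual-Global-Max (IGM) consistency:
\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\tau}, \mathbf{a})
  = \Big( \arg\max_{a_1} Q_1(\tau_1, a_1), \; \dots, \;
          \arg\max_{a_n} Q_n(\tau_n, a_n) \Big)
```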

Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits

Two elimination-based algorithms, LexElim-Out and LexElim-In, are proposed to address regret minimization (RM) and best arm identification (BAI) simultaneously in lexicographic multi-objective bandits for the first time. By sharing information across objectives, LexElim-In beats the known lower bound for single-objective problems.

Bi-Level Contextual Bandits for Individualized Resource Allocation under Delayed Feedback

This paper proposes MetaCUB — a bi-level contextual bandit framework for individualized resource allocation under delayed feedback, dynamic cohorts, cooldown constraints, and fairness requirements. The meta-level optimizes subgroup budget allocation to ensure fairness, while the base-level applies a UCB strategy to select the most promising individuals within each subgroup.
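
The base level can be sketched with a standard UCB1 index restricted to one subgroup's candidates; the function below is illustrative (names and the exploration constant are ours, not MetaCUB's exact rule):

```python
import math

def ucb_select(candidates, stats, t, c=2.0):
    """Pick the individual with the highest UCB index inside one subgroup.

    candidates: iterable of individual ids eligible this round
    stats:      dict id -> (pull_count, mean_reward) from past feedback
    t:          current round, used for the log-t exploration bonus
    c:          exploration constant (illustrative default)
    """
    def index(i):
        n, mean = stats.get(i, (0, 0.0))
        if n == 0:                      # force one pull of each new arrival
            return float("inf")
        return mean + math.sqrt(c * math.log(t) / n)
    return max(candidates, key=index)
```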

ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing

This paper introduces the ChartEditVista benchmark (7,964 samples, 31 chart types) and the ChartEditor model. By combining a GRPO reinforcement learning framework with a novel rendering reward, ChartEditor surpasses GPT-4o and several 72B-scale models on chart editing tasks using only 3B parameters.

CHDP: Cooperative Hybrid Diffusion Policies for RL in Parametric Environments

This paper models the hybrid action space problem as a fully cooperative two-agent game, employing discrete and continuous diffusion policies respectively to generate actions. Sequential updates and a Q-guided codebook are introduced to resolve policy conflicts and high-dimensional scalability issues, achieving up to a 19.3% improvement in success rate.

Constrained and Robust Policy Synthesis with Satisfiability-Modulo-Probabilistic-Model-Checking

This paper proposes the first framework capable of efficiently computing robust policies under arbitrary structural constraints. By tightly integrating a SAT solver with probabilistic model checking algorithms, the framework enables constrained and robust policy synthesis for finite Markov Decision Processes (MDPs), with feasibility and competitiveness validated across hundreds of benchmarks.

Deep (Predictive) Discounted Counterfactual Regret Minimization

This paper proposes two model-free neural CFR algorithms, VR-DeepDCFR+ and VR-DeepPDCFR+, which integrate advanced tabular CFR variants (DCFR+/PDCFR+) into neural network approximation frameworks for the first time. Through bootstrapped cumulative advantage estimation, discounted clipping mechanisms, and baseline variance reduction, the proposed methods achieve faster convergence in standard imperfect information games.

DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs

This paper proposes DeepProofLog (DPrL), a neurosymbolic system grounded in stochastic logic programs that introduces neural network parameterization at each proof step and establishes a formal mapping between SLD resolution and MDPs. This enables dynamic programming and reinforcement learning techniques to be applied for efficient inference and learning, substantially improving the scalability of neurosymbolic systems.

DiffOP: Reinforcement Learning of Optimization-Based Control Policies via Implicit Policy Gradients

This paper proposes DiffOP, a framework that treats optimization-based control policies (e.g., MPC) as differentiable modules, derives analytic policy gradients via implicit differentiation to enable end-to-end reinforcement learning training, and provides the first non-asymptotic convergence guarantee for this setting.

Discounted Cuts: A Stackelberg Approach to Network Disruption

This paper introduces the Discounted Cuts mathematical framework, modeling the classical Most Vital Links problem as a Stackelberg game. It systematically establishes a computational complexity classification for eight variants of discounted cuts and proves that all variants are solvable in polynomial time on bounded-genus graphs.

Distilling Deep Reinforcement Learning into Interpretable Fuzzy Rules: An Explainable AI Framework

This paper proposes a hierarchical Takagi-Sugeno-Kang (TSK) fuzzy classifier system that distills deep RL neural network policies into human-readable IF-THEN fuzzy rules. Three quantitative interpretability metrics are introduced (FRAD, FSC, ASG). On the Lunar Lander continuous control task, the proposed system achieves 81.48% fidelity, surpassing decision trees by 21 percentage points.

Distributionally Robust Online Markov Game with Linear Function Approximation

This paper studies online distributionally robust Markov games with linear function approximation. It is the first to identify the hardness of learning in this setting, and proposes the DR-CCE-LSI algorithm, which achieves minimax-optimal sample complexity with respect to the feature dimension \(d\) under a specific feature mapping condition.

Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning

This paper proposes a novel reward specification framework based on Linear Temporal Logic over finite traces modulo theories (LTLfMT), replacing manually coded labeling functions with first-order logic formulas. Combined with CRM and HER to address the inherent sparse reward problem in logic-based specifications, the framework achieves significant improvements on continuous control tasks.

Does Self-Evaluation Enable Wireheading in Language Models?

This paper theoretically proves and empirically validates that when a language model's self-evaluation is coupled with its reward signal, the model systematically inflates its self-assigned grades (wireheading), while decoupling self-grades from rewards mitigates this behavior. Experiments on Llama-3.1-8B and Mistral-7B across three tasks show that grade inflation in ambiguous tasks such as summarization reaches as high as 0.92.

DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

This paper is the first to reformulate Android malware detection as a one-step Markov Decision Process (MD-MDP) and trains a PPO-based deep reinforcement learning agent, DRMD, that unifies sample classification, rejection, and active learning within a single policy. The approach achieves average AUT improvements of 8.66 (classification only) and 10.90 (with rejection) in multi-year temporal evaluations, significantly outperforming conventional supervised learning classifiers under concept drift.

Efficient Multiagent Planning via Shared Action Suggestions

This paper proposes the MCAS algorithm, which infers other agents' belief states by sharing only "suggested actions" within a decentralized POMDP framework, achieving coordination performance close to centralized methods while substantially reducing communication overhead and computational complexity.

Enhancing Robustness of Offline RL Under Data Corruption via SAM

This paper is the first to apply Sharpness-Aware Minimization (SAM) as a plug-and-play optimizer for offline RL. It hypothesizes that data corruption induces sharp minima in the loss landscape, leading to poor generalization, and demonstrates that SAM improves robustness by seeking flat minima. On the D4RL benchmark, IQL+SAM improves average score from 34.47 to 44.40.
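
SAM's appeal as a plug-and-play optimizer is that its two-step update wraps any existing loss; below is a generic PyTorch sketch of one SAM step, not the paper's exact training loop:

```python
import torch

def sam_step(model, loss_fn, base_opt, rho=0.05):
    """One Sharpness-Aware Minimization update (generic sketch).

    1) ascend to the worst-case nearby weights w + eps, ||eps|| <= rho
    2) step the base optimizer with the gradient evaluated there
    """
    loss = loss_fn(model)
    loss.backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12

    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / norm   # ascent direction scaled to the rho-ball
            p.add_(e)
            eps.append(e)
    model.zero_grad()
    loss_fn(model).backward()         # gradient at the perturbed point
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)             # restore the original weights
    base_opt.step()                   # descend with the SAM gradient
    base_opt.zero_grad()
    return loss.item()
```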

Explaining Decentralized Multi-Agent Reinforcement Learning Policies

This paper proposes the first explainability method for decentralized multi-agent reinforcement learning (MARL) policies, comprising Hasse diagram-based policy summarization and query-based natural language explanations (When / Why Not / What). The approach is demonstrated across four MARL domains, showing both generality and computational efficiency. A user study confirms that it significantly improves human understanding of policies and question-answering performance.

First-Order Representation Languages for Goal-Conditioned RL

This paper investigates the application of first-order relational languages to goal-conditioned RL and generalized planning. It proposes representing goals as subsets or lifted versions of sets of atoms, and combines this with HER to automatically construct easy-to-hard goal curricula, enabling the learning of generalizable policies on large-scale sparse-reward planning problems.

Formal Verification of Diffusion Auctions

This paper presents the first formal logical framework for diffusion auctions, introducing the \(n\)-seller diffusion incentive logic \(\mathcal{L}^n\) and its strategic extension \(\mathcal{SL}^n\). The framework supports model-checking verification of auction properties such as Nash equilibria and the existence of seller strategies, with complexity results of P and PSPACE-complete respectively.

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation

This paper proposes G-UBS (Group-aware User Behavior Simulation), a paradigm that employs a User Group Manager (UGM) based on a "Summarize–Cluster–Reflect" LLM workflow to generate group profiles, combined with group-aware reinforcement learning in a User Feedback Modeler (UFM), achieving robust user behavior understanding under implicit feedback noise. The paper also introduces IF-VR, the first multimodal implicit feedback benchmark for video recommendation.

Good-for-MDP State Reduction for Stochastic LTL Planning

This paper proposes a novel Good-for-MDP (GFM) automaton state reduction technique that significantly reduces automaton state counts via a GFM→DBA→DCA→GFG minimization→0/1-PA transformation pipeline. Additionally, for formulas of the form \(\textsf{GF}\varphi\) where \(\varphi\) is a co-safety formula, a direct singly-exponential construction is provided, achieving an exponential reduction in state count compared to the general doubly-exponential construction.

HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning

This paper proposes HCPO, an algorithm that enhances the expressiveness and exploration efficiency of multi-agent joint policies by introducing a conductor mechanism, constructing a Gaussian mixture model-like joint policy framework, and providing monotonic improvement guarantees for two-level policy updates.

In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

This paper proposes the InTRO framework, which aligns the model's generation policy with its answer-conditioned posterior via KL divergence minimization. By enabling token-level exploration and self-generated feedback within a single forward pass, InTRO improves both accuracy and conciseness of LLM reasoning without relying on any external supervision.

InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

To address the exploration bottleneck in semantic alignment for GUI grounding, this paper proposes the Adaptive Exploration Policy Optimization (AEPO) framework. AEPO enforces broad exploration via a multi-answer generation strategy, dynamically guides learning through an adaptive exploration reward function, and ensures exploration quality via a collinearity penalty mechanism, significantly improving multimodal large language model performance on complex GUI grounding tasks.

Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

This paper proposes INSIGHT, a two-stage unified framework for egocentric long-term action anticipation (LTA). Stage one enhances action representations via hand-object interaction (HOI) region feature extraction and verb-noun co-occurrence matrices; stage two introduces a GRPO-based reinforcement learning cognitive reasoning module that simulates a structured "perceive → reason → answer" cognitive process for intention inference and action prediction.

Know your Trajectory -- Trustworthy Reinforcement Learning Deployment through Importance-Based Trajectory Analysis

This paper proposes a trajectory-level explanation framework based on state importance metrics. By combining Q-value differences with a goal-affinity measure (radical term), trajectories are ranked by importance. Counterfactual rollouts are then used to verify the robust superiority of the selected optimal trajectory, providing trustworthy explanations for RL policies in the form of "why this path rather than that one?"
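
A common importance proxy is the Q-value spread at each state; ranking trajectories by accumulated importance could look like the sketch below (our construction; the paper additionally folds in the goal-affinity term):

```python
import numpy as np

def state_importance(q_values):
    """Importance of a state = spread between best and worst action values."""
    return float(np.max(q_values) - np.min(q_values))

def rank_trajectories(trajectories, q_fn):
    """Sort trajectories by total state importance, most important first.

    trajectories: list of trajectories, each a list of states
    q_fn:         state -> array of Q(s, a) over actions
    """
    scores = [sum(state_importance(q_fn(s)) for s in traj)
              for traj in trajectories]
    order = np.argsort(scores)[::-1]
    return [trajectories[i] for i in order], [scores[i] for i in order]
```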

Language Model Distillation: A Temporal Difference Imitation Learning Perspective

This paper revisits language model distillation from an imitation learning / inverse reinforcement learning perspective. It exploits the sparsity of teacher output distributions (top-p tokens concentrate over 96% of probability mass) to construct a top-p MDP for temporal difference (TD) learning, proves that the optimal policy in the reduced action space admits a bounded suboptimality guarantee, and demonstrates that the resulting Bellman Distill method — built on the IQL algorithm — outperforms existing distillation methods across multiple model families.
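
The mechanical core is carving the reduced action set out of the teacher's distribution at each step; a minimal PyTorch sketch (names ours; the 0.96 default echoes the summary's figure):

```python
import torch

def top_p_actions(teacher_logits, p=0.96):
    """Return the smallest token set whose teacher probability mass >= p.

    teacher_logits: 1-D tensor over the vocabulary for one decoding step.
    The returned ids define the reduced action space of the top-p MDP.
    """
    probs = torch.softmax(teacher_logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    k = int((cum < p).sum().item()) + 1   # first k tokens cover mass >= p
    return sorted_ids[:k]
```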

Learning to Generate and Extract: A Multi-Agent Collaboration Framework for Zero-shot Document-level Event Arguments Extraction

This paper proposes a "Propose-Evaluate-Revise" multi-agent collaboration framework (comprising a generator agent and an evaluator agent) to address zero-shot document-level event argument extraction (ZS-DEAE). The generator agent synthesizes training data for unseen event types, while the evaluator agent provides log-likelihood-based quality scores to guide reinforcement learning for iterative optimization, simultaneously improving synthetic data quality and extraction performance.

ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

This paper proposes ManiLong-Shot, a framework comprising three modules—interaction-aware task decomposition, invariant region prediction, and region matching—that generalizes to 20 unseen long-horizon manipulation tasks after training on only 10 short-horizon tasks, achieving a one-shot imitation success rate of 30.2%, a relative improvement of 22.8% over the prior state of the art.

MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management

This paper proposes the MARS framework, which achieves risk-aware portfolio management under dynamic market conditions through a two-level architecture comprising a Heterogeneous Agent Ensemble (HAE)—where each agent has a distinct risk preference and Safety-Critic—and a Meta-Adaptive Controller (MAC). The framework significantly reduces maximum drawdown and volatility.

MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization

This paper proposes MARS, a five-agent framework for automated prompt optimization (APO): a Planner generates task-specific optimization trajectories; a Teacher-Critic-Student triad conducts Socratic dialogue for iterative prompt refinement (simulating pseudo-gradient descent in text space); and a Target agent executes the prompt and provides feedback. The entire process is modeled as a POMDP. MARS outperforms the previous SOTA (PE2) by 6.04% on general tasks and 6.42% on domain-specific tasks across 17 datasets, requiring only 1-shot training data.

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

This paper proposes MathSmith, a framework that generates hard mathematical problems by randomly sampling concept pairs from PlanetMath, applying 9 predefined difficulty strategies, and jointly optimizing structural validity, reasoning complexity, and answer consistency via GRPO-based reinforcement learning. The resulting high-difficulty synthetic problems significantly improve LLM mathematical reasoning on AIME and OlympiadBench.

MMhops-R1: Multimodal Multi-hop Reasoning

This paper proposes the MMhops benchmark (31K samples, 3–4 reasoning hops) and the MMhops-R1 framework, which trains MLLMs via reinforcement learning to autonomously plan reasoning paths and dynamically invoke image/text retrievers for multimodal multi-hop reasoning. A 7B model surpasses 72B baselines and existing mRAG methods.

Object-Centric Latent Action Learning

This paper proposes an object-centric latent action learning framework that leverages self-supervised object decomposition (VideoSAUR) to disentangle task-relevant entities from visual distractions (e.g., dynamic backgrounds), reducing the performance degradation of LAPO on distracted videos by approximately 50%. A linear action probe is used to automatically select control-relevant slots.

Object-Centric World Models for Causality-Aware Reinforcement Learning

This paper proposes STICA, a framework that implements the world model, policy network, and value network through a unified object-centric Transformer architecture. The world model decomposes observations into independent per-object latent states for token-level dynamics prediction, while the policy and value networks estimate token-level causal relationships via a causal attention mechanism to enable causality-aware decision-making. STICA significantly outperforms DreamerV3 and other state-of-the-art methods on the Safety Gym and OCVRL benchmarks.

One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow

This paper reformulates MeanFlow from visual generation into a generative policy for offline RL. It proposes a residual-form direct noise-to-action mapping that achieves expressive one-step sampling and enables stable joint optimization with a Q-function in a single training stage, achieving strong performance across 73 tasks on OGBench and D4RL.

PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

This paper proposes PA-FAS, a framework that addresses two critical bottlenecks of the SFT+RL paradigm in multimodal FAS — insufficient reasoning-path diversity and reasoning shortcuts — via a Reasoning Path Augmentation strategy and an answer-shuffling mechanism, yielding the first solution to unify multimodal fusion, domain generalization, and interpretability.

Partial Action Replacement: Tackling Distribution Shift in Offline MARL

This paper proposes the Partial Action Replacement (PAR) principle, theoretically proving that under a factorized behavior policy, distribution shift grows linearly with the number of deviating agents (rather than exponentially in the joint action space). Building on this, the SPaCQL algorithm is developed to dynamically weight different PAR operators via Q-ensemble uncertainty, achieving substantial improvements over all baselines on Random and Medium-Replay datasets.

Perturbing Best Responses in Zero-Sum Games

This paper investigates the introduction of stochastic perturbations into best-response oracles (BROs) for zero-sum games. It proves that Stochastic Fictitious Play (SFP) achieves an expected iteration count of \(O(\frac{\log n}{\varepsilon^2})\) with respect to the number of pure strategies \(n\), and proposes the Stochastic Double Oracle (SDO) algorithm, which achieves logarithmic convergence under specific game structures.

Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization

This paper presents the first theoretical study of preference-aware customization in multi-objective multi-armed bandits (MO-MAB) with explicit user preferences. It proposes the PAMO-MAB framework and designs PRUCB-UP and PRUCB-HP algorithms for the "unknown preference" and "hidden preference" settings, respectively. Through a two-component architecture combining preference estimation and preference-aware optimization, both algorithms achieve near-optimal regret bounds. The paper also proves that preference-free algorithms inevitably incur \(\Omega(T)\) linear regret when the Pareto front contains conflicting arms.

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

This paper proposes MTMC (Macro Thinking Micro Coding), a hierarchical framework that decouples GPU kernel generation into two stages: a lightweight RL-trained LLM generates high-level optimization actions (Macro Thinking), while a general-purpose LLM incrementally implements each action (Micro Coding). This design separates correctness from performance concerns, achieving near-100% accuracy and a 2.2× speedup over expert-optimized PyTorch Eager kernels on KernelBench.

Realistic Curriculum Reinforcement Learning for Autonomous and Sustainable Marine Vessel Navigation

This paper proposes a Curriculum Reinforcement Learning (CRL) framework for autonomous and sustainable marine vessel navigation. The framework integrates a high-fidelity simulation environment built on real AIS data, a diffusion model-enhanced dynamic maritime traffic simulator, and a machine learning-based fuel consumption prediction module. A multi-objective reward function simultaneously optimizes navigation safety, emission reduction, timeliness, and goal completion.

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

This paper conducts a systematic data leakage audit revealing severe data contamination of the Qwen2.5 series on standard math benchmarks such as MATH-500. It demonstrates that recent findings claiming "spurious rewards can improve mathematical reasoning" are artifacts of contamination, and constructs a fully uncontaminated benchmark, RandomCalculation, to verify that only correct reward signals yield genuine reasoning improvements.

Reasoning with Exploration: An Entropy Perspective

This paper analyzes the positive correlation between exploratory reasoning behaviors in LLMs (pivotal tokens, self-reflection, rare behaviors) and high-entropy regions from an entropy perspective. It proposes a minimalist entropy-based advantage shaping method—requiring only a single line of code modification—that significantly enhances the Pass@K reasoning capability ceiling of LLMs.
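
The advertised one-line change plausibly amounts to adding a clipped, detached per-token entropy bonus to the advantage; the coefficient and clipping below are our guesses, not the paper's exact rule:

```python
import torch

def shape_advantages(advantages, logits, alpha=0.1, clip=1.0):
    """Entropy-based advantage shaping (illustrative form).

    advantages: (batch, seq) policy-gradient advantages
    logits:     (batch, seq, vocab) current-policy logits
    """
    logp = torch.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)   # per-token policy entropy
    # the one-line change: reward exploratory (high-entropy) tokens, clipped
    return advantages + alpha * entropy.detach().clamp(max=clip)
```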

ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India

This paper presents the first application of PPO-based reinforcement learning (RLAIF) to Indian legal judgment prediction and summarization tasks. Although performance does not surpass SFT or commercial models, this position paper reveals fundamental challenges and future directions for RL in legal NLP.

Revealing POMDPs: Qualitative and Quantitative Analysis for Parity Objectives

This paper proves that limit-sure analysis for revealing POMDPs under parity objectives is equivalent to almost-sure analysis (EXPTIME-complete), and that quantitative analysis can also be completed within EXPTIME, thereby resolving two important open problems for this subclass.

Risk-Sensitive Exponential Actor Critic

To address the high variance and numerical instability of policy gradients under the entropic risk measure, this paper derives a complete set of on/off-policy risk-sensitive policy gradient theorems and proposes the rsEAC algorithm, which achieves stable risk-sensitive continuous control via log-domain critic parameterization and gradient normalization-clipping mechanisms.
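
For context, the entropic risk measure under which these gradients are derived (standard definition); parameterizing the critic in the log domain works with \(\log \mathbb{E}[e^{\beta G}]\) directly and sidesteps overflow of the exponential:

```latex
% Entropic risk measure of the return G with risk parameter beta:
% beta > 0 is risk-seeking, beta < 0 risk-averse, beta -> 0 recovers E[G].
J_{\beta}(\pi) = \frac{1}{\beta} \log \mathbb{E}_{\pi}\!\left[ e^{\beta G} \right]
```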

RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms

This paper proposes RLSLM, a hybrid framework that embeds a psychology-experiment-driven rule-based Social Locomotion Model (SLM) into the reward function of reinforcement learning, enabling agents to efficiently learn navigation policies aligned with human social norms in crowd environments. VR experiments demonstrate that RLSLM achieves significantly higher comfort ratings than existing rule-based baselines.

SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories

This paper proposes SafeMIL, which formulates cost function learning as a Multiple Instance Learning (MIL) problem to learn a safe imitation policy from a limited set of non-preferred trajectories and a large collection of unlabeled trajectories—without requiring step-level reward or cost annotations—achieving constraint satisfaction performance 3.7× better than the strongest baseline.

Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation

This paper proposes PolicyGradEx, which efficiently estimates policy adaptation performance on arbitrary task subsets via first-order gradient approximation and surrogate models, constructs a task affinity matrix, and performs task grouping through convex optimization. PolicyGradEx outperforms state-of-the-art baselines by an average of 16% on multi-objective RL and meta-RL benchmarks, with a speedup of up to 26×.

Speculative Sampling with Reinforcement Learning

This paper proposes Re-SpS, the first framework to formulate the draft tree hyperparameter optimization of Speculative Sampling (SpS) as an MDP and solve it via reinforcement learning. Through two key designs—feature reuse and action caching—Re-SpS achieves up to 1.12× additional speedup over EAGLE-3 without any loss in output fidelity.

Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding

This work identifies that CoT reasoning can be counterproductive in visual grounding, and proposes CuRPO (Curriculum-based Relative Policy Optimization), which leverages CoT length and gIoU reward as data complexity proxies for curriculum-based RL training, achieving up to +12.52 mAP improvement over Visual-RFT on RefCOCO.

STELAR-Vision: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

This paper proposes STELAR-Vision, a topology-aware training framework for visual language reasoning. Via the TopoAug data generation pipeline, it introduces diverse reasoning topologies—Chain, Tree, and Graph—and combines SFT with RL (SimPO) post-training. The framework achieves +9.7% accuracy on in-distribution data and up to +28.4% on out-of-distribution benchmarks, while reducing output length by 18.1% through Frugal Learning.

TAdaRAG: Task Adaptive Retrieval-Augmented Generation via On-the-Fly Knowledge Graph Construction

This paper proposes TAdaRAG, a task-adaptive RAG framework that performs on-the-fly knowledge graph construction via intent-driven template routing, supervised fine-tuning, and REINFORCE-based reinforcement learning. It addresses three core limitations of conventional RAG—chunking-induced hallucination, broken reasoning chains, and irrelevant information interference—achieving state-of-the-art performance on 6 public benchmarks and 1 commercial scenario benchmark.

Test-driven Reinforcement Learning in Continuous Control

This paper proposes the Test-driven Reinforcement Learning (TdRL) framework, which replaces a single reward function with multiple test functions — pass-fail tests defining optimality criteria and indicative tests guiding learning — to represent task objectives. A return function is learned via lexicographic-heuristic trajectory comparison, matching or surpassing hand-crafted reward methods on the DeepMind Control Suite while naturally supporting multi-objective optimization.

TextShield-R1: Reinforced Reasoning for Tampered Text Detection

This paper proposes TextShield-R1, the first reinforcement learning-based multimodal large language model (MLLM) method for tampered text detection. The approach integrates forensic continual pre-training (a curriculum from natural images to text images), GRPO reinforcement learning (five carefully designed reward functions to reduce annotation dependency), and OCR rectification (leveraging the MLLM's text recognition capability to improve localization accuracy). Together with the newly introduced TFR benchmark (45K+ images, 16 languages, 10 tampering techniques), this work substantially advances the state of the art in interpretable tampered text detection.

Think, Speak, Decide: Language-Augmented Multi-Agent Reinforcement Learning for Economic Decision-Making

This paper proposes the LAMP framework, which integrates LLM-driven language reasoning with MARL policy optimization through a Think–Speak–Decide three-stage pipeline. The framework enables economic decision-making agents to understand and leverage natural language information (e.g., news, dialogues), achieving cumulative returns exceeding pure MARL baselines by 63.5% and LLM-only baselines by 34.0% in economic simulation environments.

Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction

This paper proposes the Thinker framework, which achieves structured deep search reasoning through hierarchical thinking (breadth decomposition + depth solving) and dual representation (natural language + logical functions). Combined with knowledge boundary determination to reduce unnecessary retrieval, the model is trained via SFT and significantly outperforms RL-based deep search methods across multiple QA benchmarks.

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

This paper introduces TowerMind, a lightweight multimodal environment based on tower defense games for evaluating LLMs' long-term planning and decision-making capabilities. It reveals a significant performance gap between current LLMs and human experts (the best model achieves only 42% of human expert scores) and identifies behavioral deficiencies including insufficient plan verification, lack of multi-goal thinking, and underutilization of the action space.

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

This paper proposes Geo-R, a retrieval-free, reasoning-driven image geolocalization framework. By introducing the Chain-of-Region (CoR) hierarchical reasoning paradigm and a reinforcement learning strategy based on Haversine distance coordinate-alignment rewards, Geo-R achieves 18.10% street-level (1 km) accuracy on IM2GPS3K, surpassing all retrieval-free methods and approaching retrieval-based ones.
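
The Haversine distance itself is standard; mapping it to a bounded reward via exponential decay is one plausible shaping, though the paper's exact mapping may differ:

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def coord_reward(pred, target, scale_km=25.0):
    """Distance-aligned reward in (0, 1]; the exponential shaping is our assumption."""
    d = haversine_km(*pred, *target)
    return math.exp(-d / scale_km)
```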

Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning

This paper identifies the Beginning Lock-in Effect (BLE) in LLM reasoning — the initial reasoning steps significantly determine subsequent trajectories and final outcomes. Based on this finding, the paper proposes PPPO, a method that optimizes only prefix tokens (approximately 26% of all tokens), achieving accuracy improvements of up to 18.02% while reducing output token counts by up to 18.35%.
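
Restricting optimization to the prefix reduces to masking the token-level policy-gradient loss; a minimal sketch follows (the ~26% ratio comes from the summary above, the scaffolding is ours):

```python
import torch

def prefix_masked_pg_loss(logprobs, advantages, prefix_ratio=0.26):
    """Policy-gradient loss restricted to the leading prefix tokens.

    logprobs:   (batch, seq) log pi(token) for sampled responses
    advantages: (batch, seq) per-token advantages
    Only the first ~26% of each response receives gradient signal.
    """
    batch, seq = logprobs.shape
    k = max(1, int(seq * prefix_ratio))
    mask = torch.zeros(batch, seq, device=logprobs.device)
    mask[:, :k] = 1.0
    return -(mask * logprobs * advantages).sum() / mask.sum()
```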

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

This paper proposes the STV framework, which identifies attention head positions sensitive to in-context information via activation deltas, and leverages reinforcement learning to select optimal task vectors from a pre-clustered activation bank for insertion—enabling efficient many-shot multimodal in-context learning without increasing input length.

Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position

This paper presents the first systematic safety analysis of diffusion large language models (dLLMs), revealing that—unlike autoregressive LLMs—middle tokens are more critical to safety in dLLMs, and that the model's inherent sequential generation tendency fundamentally constrains attackers from manipulating these positions. Exploiting this asymmetry, the paper proposes MOSA (Middle-tOken Safety Alignment) as a defense method.