🎮 Reinforcement Learning
📹 ICCV2025 · 7 paper notes
- Embodied Navigation with Auxiliary Task of Action Description Prediction
DescRL introduces action description generation as an auxiliary task for reinforcement-learning-based navigation. It distills knowledge from pretrained vision-language models into an ADPredictor, so the navigation agent produces interpretable action descriptions while improving navigation performance, achieving state-of-the-art results on Semantic Audio-Visual Navigation (SAVNav) and several other tasks.
- mDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
This paper proposes mDP3, a training-free and model-agnostic video frame selection method that estimates frame similarity in a reproducing kernel Hilbert space (RKHS) via a conditional Gaussian kernel, leverages a determinantal point process (DPP) to capture query relevance and list-wise diversity, and models temporal structure as a Markov decision process (MDP). Using only 8 input frames, mDP3 significantly outperforms uniform sampling and existing frame selection methods on multiple long-video benchmarks.
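The quality-times-diversity structure described above can be sketched as greedy MAP inference for a relevance-weighted DPP kernel. This is a minimal illustration only: the function names, the plain RBF kernel, and the naive log-det recomputation are assumptions, and the paper's conditional kernel and MDP-based temporal modeling are omitted.

```python
import numpy as np

def gaussian_kernel(feats, gamma=1.0):
    # Pairwise squared distances -> RBF (Gaussian) similarity matrix.
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * feats @ feats.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def greedy_dpp_select(frame_feats, relevance, k=8, gamma=1.0):
    """Greedy MAP inference for a DPP whose kernel couples
    per-frame query relevance (quality) with frame diversity."""
    S = gaussian_kernel(frame_feats, gamma)           # diversity term
    L = relevance[:, None] * S * relevance[None, :]   # quality-weighted kernel
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_logdet = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            # Naive O(k^3) recompute per candidate; fine for a sketch.
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best < 0:
            break  # no candidate keeps the kernel positive definite
        selected.append(best)
    return sorted(selected)  # restore temporal order
```

Maximizing the log-determinant at each step trades off high-relevance frames against near-duplicate ones, which is the list-wise behavior uniform sampling lacks.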
- NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
This paper proposes NavQ, a foresighted VLN agent whose Q-model predicts, in a single forward pass, long-horizon future semantic aggregation features (Q-features) for each candidate action. Combined with an A*-style search strategy, NavQ achieves significant improvements on goal-oriented vision-and-language navigation benchmarks.
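The A*-style search can be sketched generically: in a NavQ-style agent, the heuristic for a candidate node would be derived from the learned Q-features (predicted long-horizon value) rather than a hand-crafted estimate. All names below are illustrative, not the paper's API.

```python
import heapq
import itertools

def a_star(start, is_goal, neighbors, heuristic):
    """Generic A* search.
    `neighbors(node)` yields (next_node, step_cost) pairs;
    `heuristic(node)` estimates remaining cost to the goal
    (in NavQ this role would be played by the learned Q-model)."""
    tie = itertools.count()  # tiebreaker so the heap never compares nodes
    frontier = [(heuristic(start), 0.0, next(tie), start, [start])]
    best_g = {}
    while frontier:
        f, g, _, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path, g
        if node in best_g and best_g[node] <= g:
            continue  # already expanded via a cheaper route
        best_g[node] = g
        for nxt, cost in neighbors(node):
            g2 = g + cost
            heapq.heappush(
                frontier,
                (g2 + heuristic(nxt), g2, next(tie), nxt, path + [nxt]),
            )
    return None, float("inf")
```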
- Progressor: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
This paper proposes Progressor, a framework that learns task-agnostic reward functions from unannotated videos via self-supervision. It provides dense reward signals by predicting task progress distributions and addresses distribution shift during online RL training through an adversarial push-back strategy.
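Since Progressor predicts task-progress distributions, one natural dense reward is the change in expected progress between consecutive steps. A minimal sketch under that assumption (the bin discretization and function names are illustrative, not the paper's code):

```python
import numpy as np

BINS = np.linspace(0.0, 1.0, 11)  # assumed discretization of progress levels

def expected_progress(dist):
    """Mean of a categorical distribution over progress bins."""
    dist = np.asarray(dist, dtype=float)
    return float(dist @ BINS / dist.sum())

def dense_reward(dist_prev, dist_curr):
    """Dense reward = change in expected predicted task progress
    between two consecutive observations."""
    return expected_progress(dist_curr) - expected_progress(dist_prev)
```

Because every step yields a reward, the agent gets a learning signal even when the task's sparse success reward is far away; the adversarial push-back then keeps these estimates honest as the online policy drifts from the training videos.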
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
This paper proposes R1-Onevision, a framework that converts images into formalized textual representations via a cross-modal reasoning pipeline. Combined with a two-stage post-training strategy of SFT followed by rule-based reinforcement learning (GRPO), it significantly enhances multimodal reasoning in vision-language models, surpassing GPT-4o on multiple mathematical reasoning benchmarks.
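The core idea of GRPO is to replace a learned value baseline with group-relative reward normalization: rewards for a group of responses to the same prompt are standardized within the group. A minimal sketch of that advantage computation (the 0/1 answer-match reward and function name are illustrative):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: standardize the
    rule-based rewards of a group of sampled responses to the
    same prompt, with no value network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rule-based reward: 1.0 if the final answer matches, else 0.0
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```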
- RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
This paper proposes RL-Selector, which introduces the ε-sample cover concept to quantify sample redundancy and formulates data selection as a reinforcement learning problem. A lightweight A2C policy network adaptively optimizes the selection strategy, achieving generalization performance comparable to or surpassing full-data training with significantly fewer samples across multiple benchmark datasets.
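One illustrative reading of ε-sample-cover redundancy: a sample is redundant when many other samples sit within an ε-ball of it in feature space. A minimal sketch under that assumption (the function name and exact redundancy measure are not the paper's formulation):

```python
import numpy as np

def epsilon_redundancy(feats, eps):
    """For each sample, count how many *other* samples fall within
    an eps-ball of it in feature space; high counts flag samples
    that an eps-cover of the dataset could drop."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * feats @ feats.T
    dist = np.sqrt(np.maximum(d2, 0.0))
    return (dist <= eps).sum(axis=1) - 1  # exclude self
```

A selection policy (A2C in the paper) can then learn to keep low-redundancy samples while pruning dense clusters.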
- RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
This paper introduces the concept of compositional constraints to formalize safety and efficiency requirements in multi-agent embodied collaboration, constructs the first multi-agent manipulation benchmark RoboFactory based on this formalization, and systematically investigates architectures and training strategies for multi-agent imitation learning.