🎮 Reinforcement Learning

📷 CVPR2026 · 22 paper notes

AceTone: Bridging Words and Colors for Conditional Image Grading: AceTone is proposed as the first unified framework for multimodal-conditioned color grading supporting both text and reference image inputs. By compressing 3D-LUTs into 64 discrete tokens via VQ-VAE, a VLM is trained to predict LUT token sequences, followed by GRPO reinforcement learning to align color similarity and aesthetic preference, achieving a 50% improvement in LPIPS on both style transfer and instruction-based grading tasks.
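
The GRPO step mentioned in the AceTone entry relies on group-relative advantages: several outputs are sampled for the same input, and each reward is normalized against the others in that group. The sketch below shows only that normalization. The reward values and the function name `group_relative_advantages` are illustrative assumptions, and nothing here reflects how AceTone combines its color-similarity and aesthetic rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts sampled for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get a positive advantage and the rest a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scalar rewards for four sampled outputs of one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.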
Anticipatory Planning for Multimodal AI Agents: This paper proposes TraceR1, a two-stage RL framework in which the first stage employs trajectory-level reward optimization to train agents to perform multi-step look-ahead planning, while the second stage applies grounded fine-tuning via tool execution feedback to improve single-step precision. The approach achieves open-source state-of-the-art results across 7 GUI and tool-use benchmarks.
AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization: AnyDoc is a general-purpose document generation framework built on a unified HTML/CSS representation. It constructs DocHTML, a 265K-document dataset, via an automated data synthesis pipeline, and fine-tunes a multimodal large language model through SFT and Height-Aware Reinforcement Learning (HARL). The framework surpasses baselines including GPT-4o on three tasks: intent-to-document, document de-rendering, and element-to-document generation.
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment: This paper proposes BRIDGE, a system that distills noisy multimodal queries into retrieval-optimized pure-text queries via FORGE (an RL-trained query alignment model), paired with LENS, a reasoning-enhanced retriever. BRIDGE achieves 29.7 nDCG@10 on MM-BRIGHT, and as a plug-in further improves Nomic-Vision to 33.3, surpassing the best text-only retriever.
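
For readers unfamiliar with the nDCG@10 metric quoted in the BRIDGE entry, here is a minimal sketch of one common formulation (linear gain, log2 discount). It assumes the input list holds graded relevance labels for the ranked results and that this list contains every judged document, so the ideal ranking is just the same labels sorted. The benchmark's official scorer may differ in these details, and reported scores such as 29.7 are presumably this value scaled by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels in ranked order: the only relevant
# document was retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1])
```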
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning: This paper proposes CCCaption, a dual-reward reinforcement learning framework that jointly optimizes completeness (via a visual query set generated by multiple MLLMs) and correctness (via hallucination detection on sub-queries decomposed from the caption) for image captioning. A 2B model trained under this framework surpasses a 32B baseline.
Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning: This paper proposes Cross-modal Identity Mapping (CIM), which quantifies information loss in image captioning by analyzing the representational consistency (GRC) of images retrieved via captions and their relevance to the source image (QIR). These metrics serve as RL reward signals to train LVLMs to generate fine-grained and accurate captions without requiring additional annotations.
GeoWorld: Geometric World Models: GeoWorld maps the latent representations of predictive world models from Euclidean space onto a hyperbolic manifold, preserving geometric structure and hierarchical relationships via Hyperbolic JEPA, and proposes Geometric Reinforcement Learning to optimize multi-step planning. The method achieves approximately 3% SR (T=3) and 2% SR (T=4) gains on CrossTask and COIN.
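
To make "mapping Euclidean latents onto a hyperbolic manifold" concrete, the sketch below uses the Poincaré ball model and its exponential map at the origin. This is an assumption chosen for illustration: the summary above does not say which hyperbolic model or curvature GeoWorld uses, and the function names are hypothetical.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector onto the Poincare ball of curvature -c.

    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    x_sq = sum(a * a for a in x)
    y_sq = sum(b * b for b in y)
    arg = 1.0 + 2.0 * c * diff_sq / ((1.0 - c * x_sq) * (1.0 - c * y_sq))
    return math.acosh(arg) / math.sqrt(c)


# A Euclidean latent of norm 0.5 lands strictly inside the unit ball.
point = expmap0([0.3, 0.4])
dist_from_origin = poincare_distance([0.0, 0.0], point)
```

With `c = 1`, the distance from the origin to `expmap0(v)` is twice the Euclidean norm of `v`. Distances grow rapidly toward the boundary of the ball, which is why hyperbolic spaces are often used to embed hierarchies.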
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion: This paper proposes GraspLDP, which injects grasp pose priors from a pretrained grasp detector and graspness map visual cues into a latent diffusion policy framework. By leveraging VAE-encoded action latent spaces for guidance and a self-supervised reconstruction objective, GraspLDP substantially improves grasping accuracy and generalization.
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment: This paper proposes a lifelong imitation learning framework that stores and replays compact representations in the feature space of frozen encoders via Multimodal Latent Replay (MLR), and introduces an Incremental Feature Adjustment (IFA) mechanism that employs angular distance constraints to maintain inter-task separability. The method achieves AUC improvements of 10–17 points and reduces forgetting by up to 65% on the LIBERO benchmark.
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment: This paper proposes a lifelong imitation learning framework that combines Multimodal Latent Replay (storing and replaying compact multimodal features in the latent space of a frozen encoder) with Incremental Feature Adjustment (an adaptive margin constraint based on angular distance to prevent inter-task representation drift), achieving 10–17 point AUC gains and 65% reduction in forgetting on the LIBERO benchmark.
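
The angular-distance constraint described in the two entries above can be pictured as a hinge penalty that fires when two task representations come closer than a margin. The sketch below uses a fixed, hypothetical margin and made-up two-dimensional vectors. The adaptive margin and the exact loss used in the paper are not given in the summaries and are not reproduced here.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cosine)


def angular_margin_penalty(u, v, margin):
    """Hinge penalty: positive only when the angle falls below the margin."""
    return max(0.0, margin - angular_distance(u, v))


# Hypothetical prototypes for two tasks: orthogonal features incur no penalty,
# while identical features are penalized by the full margin.
separated = angular_margin_penalty([1.0, 0.0], [0.0, 1.0], margin=math.pi / 4)
collapsed = angular_margin_penalty([1.0, 0.0], [1.0, 0.0], margin=0.5)
```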
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning: This paper proposes Multi-Stage Reinforcement Learning (MSRL), which first learns reward reasoning capabilities on large-scale text preference data and then progressively transfers them to multimodal tasks, addressing the bottleneck of scarce annotated data in multimodal reward model training. MSRL improves accuracy on VL-RewardBench from 66.6% to 75.9%.
MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning: This paper proposes MSRL (Multi-Stage Reinforcement Learning), which scales generative multimodal reward modeling through a multi-stage RL curriculum: first learning general reward reasoning on large-scale text preference data (400K) via RL, then transferring to the multimodal domain via caption-based RL and cross-modal knowledge distillation, and finally fine-tuning with a small amount of multimodal preference data. Without additional multimodal annotations, MSRL improves performance from 66.6% to 75.9% on VL-RewardBench and from 70.2% to 75.7% on GenAI-Bench.
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset: This paper proposes RADAR, a fully autonomous closed-loop robotic data collection framework. Through the synergistic operation of four modules—VLM semantic planning, GNN policy execution, VQA success evaluation, and LIFO causal environment reset—the system requires only 2–5 human demonstrations to continuously generate high-quality manipulation data without human intervention, achieving a 90% success rate on long-horizon simulation tasks.
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset: This paper presents RADAR — a fully autonomous closed-loop robotic manipulation data generation engine comprising four modules: VLM-based semantic planning, GNN policy execution, VQA-based success evaluation, and FSM-orchestrated LIFO causal reverse environment reset. Requiring only 2–5 human demonstrations, the system continuously generates high-fidelity manipulation data, achieving 90% success rate on complex long-horizon tasks in simulation.
ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering: This paper proposes ReAG, a reasoning-augmented multimodal RAG framework that combines coarse- and fine-grained retrieval with a Critic filtering model to reduce noise, and trains a generator via GRPO reinforcement learning to perform explicit reasoning, achieving new state-of-the-art performance on knowledge-intensive VQA.
Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision: This paper proposes two modules, ReAL and CGRO, which extract anomaly-relevant tokens from the autoregressive reasoning process of an MLLM and aggregate their visual attention maps to generate pixel-level anomaly maps. A consistency-guided reinforcement learning scheme then aligns reasoning tokens with visual evidence, enabling end-to-end anomaly detection, localization, and interpretable reasoning under image-level supervision only.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning: This paper proposes RLER, a dual-paradigm framework in which the training stage employs GRPO with three novel rewards (Frame-sensitive, Think-transparency, Anti-repetition) to teach the model to generate structured evidence, while the inference stage uses a training-free orchestrator to perform evidence-consistency-based weighted election and self-checking across multiple candidates. RLER comprehensively outperforms open-source and RL-based LMMs on 8 video benchmarks with an average gain of 6.3%, requiring only approximately 3.1 candidates on average.
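
The inference-time "weighted election" in the RLER entry can be illustrated with a simple weighted vote: each candidate answer carries a consistency score, and the answer with the largest total score wins. The scores, the `(answer, weight)` format, and the function name are assumptions for illustration. RLER's actual evidence-consistency scoring and self-checking are not described in the summary and are not modeled here.

```python
from collections import defaultdict


def weighted_election(candidates):
    """Return the answer whose candidates have the largest summed weight.

    `candidates` is a list of (answer, weight) pairs. Ties go to the answer
    that appeared first.
    """
    totals = defaultdict(float)
    for answer, weight in candidates:
        totals[answer] += weight
    return max(totals.items(), key=lambda item: item[1])[0]


# Hypothetical candidates: "A" has more votes, but "B" has stronger evidence.
candidates = [("B", 0.9), ("A", 0.4), ("A", 0.3), ("C", 0.2)]
winner = weighted_election(candidates)
```

In this example a plain majority vote would pick "A" (two votes), while the weighted election picks "B" because its single candidate outweighs both "A" candidates combined.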
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation: This paper presents the first systematic empirical study on the properties of wrist-mounted fisheye cameras in imitation learning for robotic manipulation. Centered on three core research questions—spatial localization, scene generalization, and hardware generalization—it reveals both the advantages and limitations of wide field-of-view (FoV) imaging, and proposes Random Scale Augmentation (RSA) to address scale overfitting in cross-camera transfer.
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning: This paper proposes RoboAgent, a capability-driven embodied task planning framework that employs a single VLM to simultaneously serve as a scheduler and five basic capabilities (exploration guidance, object grounding, scene description, action decoding, experience summarization). Through three-stage training (SFT + DAgger + expert-guided RL), RoboAgent achieves state-of-the-art performance on EB-ALFRED and ALFWorld.
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs: This paper proposes Evidence-Constrained Reweighting Decoding (ECRD), a framework that maintains a dynamic textual evidence pool during LVLM decoding, reweights candidate tokens via distribution negotiation, and automatically invokes a lightweight visual decider to extract micro-evidence under uncertainty—achieving significant reductions in visual hallucination and improvements in reasoning accuracy across multiple LVLMs without any training.
Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement: VFLM is a layout generation framework that leverages visual feedback for iterative refinement. By combining a visually grounded reward model based on OCR accuracy with reinforcement learning, the framework enables multimodal large language models to "see" rendered outputs and repeatedly self-correct, achieving substantial improvements in text layout quality over code-only generation approaches.
Specificity-aware Reinforcement Learning for Fine-grained Open-world Classification: This paper proposes SpeciaRL—a specificity-aware reinforcement learning framework that guides reasoning-capable large multimodal models to simultaneously improve prediction specificity and correctness in open-world fine-grained image classification, via a dynamic reward signal derived from the best prediction among online rollouts.