🎮 Reinforcement Learning

📷 CVPR2026 · 23 paper notes

🔥 Top topics: Reinforcement Learning ×16 · Agents ×3 · Multimodal/VLM ×2 · Reasoning ×2

AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization

AnyDoc proposes a general document generation framework based on a unified HTML/CSS representation. Through an automated data synthesis pipeline, it constructs the DocHTML dataset containing 265K documents. By combining SFT and Height-Aware Reinforcement Learning (HARL) to fine-tune MLLMs, it outperforms baselines such as GPT-4o on intention-to-document, document derendering, and element-to-document tasks.

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning

CCCaption is a dual-reward reinforcement learning framework that jointly optimizes caption completeness (measured against visual query sets generated by multiple MLLMs) and correctness (measured by hallucination detection on decomposed sub-queries). The resulting 2B model outperforms the 32B baseline.

CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

CME-CAD targets the industrial task of generating executable, editable CAD code directly from 2D three-view engineering drawings. Multiple heterogeneous pre-trained large models act as "experts" with distinct styles. Each expert is first trained with Multi-Expert Fine-Tuning (MEFT) in its own reasoning style, followed by a Multi-Expert Reinforcement Learning (MERL) stage in which strong experts transfer superior strategies to weak experts via KL distillation and a Hard Sample Buffer repeatedly revisits the most difficult samples. On the self-built industrial-grade benchmark CADExpert, IoU improves from 71.84% to 80.71% and the code execution rate reaches 98.25%.

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

The authors propose Cross-modal Identity Mapping (CIM), which quantifies information loss in image descriptions by analyzing the representation consistency (GRC) of images retrieved via the caption and their relevance to the source image (QIR). This serves as an RL reward signal to train LVLMs to generate fine-grained and accurate descriptions without requiring additional human annotations.

DreamSAC: Learning Hamiltonian World Models via Symmetry Exploration

DreamSAC replaces the black-box dynamics of pixel-based world models (DreamerV3) with an SE(3)-invariant Hamiltonian dynamics prior and employs a "symmetry-breaking work" intrinsic curiosity to collect physically informative data. This allows the model to learn conservation laws rather than just pixel statistical correlations, achieving 22%–163% higher extrapolative generalization on unseen physical parameters such as mass, gravity, and friction compared to SOTA.
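
To illustrate why a Hamiltonian prior helps with unseen physical parameters, here is a minimal sketch of Hamiltonian dynamics for an ideal pendulum, integrated with a leapfrog step. It is not DreamSAC's learned model: the pendulum system, the function names, and the default parameters are all assumptions chosen for illustration. The point is only that energy stays nearly constant along the rollout, including when mass or gravity are changed.

```python
import math


def hamiltonian(q, p, mass=1.0, g=9.81, length=1.0):
    """Total energy (kinetic + potential) of an ideal pendulum.

    q is the angle in radians, p the angular momentum. Illustrative system only.
    """
    kinetic = p * p / (2.0 * mass * length ** 2)
    potential = mass * g * length * (1.0 - math.cos(q))
    return kinetic + potential


def leapfrog_step(q, p, dt, mass=1.0, g=9.81, length=1.0):
    """One leapfrog step of dq/dt = dH/dp, dp/dt = -dH/dq."""
    p_half = p - 0.5 * dt * mass * g * length * math.sin(q)
    q_new = q + dt * p_half / (mass * length ** 2)
    p_new = p_half - 0.5 * dt * mass * g * length * math.sin(q_new)
    return q_new, p_new


def rollout_energies(q0, p0, dt=0.01, steps=1000, **params):
    """Integrate the system and record the energy at every step."""
    q, p = q0, p0
    energies = [hamiltonian(q, p, **params)]
    for _ in range(steps):
        q, p = leapfrog_step(q, p, dt, **params)
        energies.append(hamiltonian(q, p, **params))
    return energies
```

Calling `rollout_energies(1.0, 0.0, mass=2.0)` changes the physical parameter without changing the structure of the dynamics, which is the property a Hamiltonian world model is meant to exploit.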

EVA: Efficient Reinforcement Learning for End-to-End Video Agent

EVA models long video understanding as a "planning-before-perception" Markov Decision Process (MDP), enabling the MLLM agent to decide "which segment to watch, how many frames to sample, and at what resolution" based solely on the text question. Through a three-stage training pipeline (SFT Cold Start \(\rightarrow\) KTO Offline Correction \(\rightarrow\) Data-Enhanced GRPO), the model evolves from a format imitator to an active video explorer. It achieves a 6–12% accuracy improvement over general MLLMs and a 1–3% gain over existing adaptive agents using approximately 1/10 of the visual tokens across six video benchmarks.
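
GRPO appears in several entries on this page, so a short sketch of its core idea may help: each sampled response is scored relative to the other responses drawn for the same prompt. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from EVA.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one rollout group to zero mean and unit scale.

    rewards: scalar rewards for several responses to the same prompt.
    eps: small constant (assumed) to avoid division by zero when all rewards tie.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason data selection and curricula matter for GRPO-style training.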

GeoWorld: Geometric World Models

GeoWorld maps the latent representations of predictive world models from Euclidean space onto hyperbolic manifolds. By maintaining geometric structures and hierarchical relationships through Hyperbolic JEPA and employing Geometric Reinforcement Learning to optimize multi-step planning, it achieves improvements of approximately 3% SR (3 steps) and 2% SR (4 steps) on CrossTask and COIN.
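
As a rough illustration of why hyperbolic space suits hierarchical structure, the sketch below computes the geodesic distance in the Poincaré ball with curvature -1. Whether GeoWorld uses this particular model of hyperbolic space is an assumption; the function name is hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    u, v: sequences of floats with Euclidean norm strictly below 1.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)
```

Distances grow rapidly near the boundary of the ball, so tree-like hierarchies can be embedded with little distortion: general concepts sit near the origin and fine-grained ones near the edge.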

Incentivizing Generative Zero-Shot Learning via Outcome-Reward Reinforcement Learning with Visual Cues

RLVC treats the feature generator in generative zero-shot learning as an RL policy. It utilizes outcome rewards based on "correct classification" from a frozen classifier to drive generator self-evolution, combined with class-level visual cues for prototype distillation to stabilize training. It achieves new SOTA on CUB, SUN, and AWA2 benchmarks (e.g., 90.1% CZSL accuracy and 81.2% GZSL harmonic mean on CUB).
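
The GZSL harmonic mean quoted above combines accuracy on seen and unseen classes so that a model cannot score well by favoring only one group. A minimal sketch of the standard formula follows; the function name is hypothetical.

```python
def gzsl_harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H = 2 * S * U / (S + U) of seen and unseen accuracy."""
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total
```

Because the harmonic mean is dominated by the smaller value, a model with high seen-class accuracy but poor unseen-class accuracy still receives a low score.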

JoPPO: Hierarchical Photography Assessment via Contrastive Joint Conditional Probabilistic Reinforcement Learning

JoPPO upgrades "using VLMs to score image aesthetics" from regressing a single global score to modeling the joint Gaussian distribution of attribute scores and total scores across a batch. By deriving attribute-conditional pairwise win rates and utilizing them as rewards in GRPO to train the judge, the model provides interpretable multi-attribute sub-scores while significantly exceeding GPT-4o in ranking consistency.
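
The idea of an attribute-conditional pairwise win rate can be sketched with a bivariate Gaussian over one attribute score and the total score. This is only one plausible reading, not JoPPO's actual formulation: the two-variable setup, the independence of the two images' conditional scores, and all names below are assumptions.

```python
import math


def conditional_total(attr, mu_attr, mu_total, sd_attr, sd_total, rho):
    """Mean and variance of the total score given an observed attribute score.

    Uses the standard conditional of a bivariate Gaussian; assumes |rho| < 1.
    """
    mean = mu_total + rho * sd_total / sd_attr * (attr - mu_attr)
    var = sd_total ** 2 * (1.0 - rho ** 2)
    return mean, var


def win_rate(mean_i, var_i, mean_j, var_j):
    """P(score_i > score_j) for two independent Gaussian scores."""
    z = (mean_i - mean_j) / math.sqrt(var_i + var_j)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A win rate of this kind lies in (0, 1) and is symmetric around 0.5, which makes it a convenient bounded reward for comparing rollouts within a group.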

Local Motion Matters: A Deconstruct-Recompose Paradigm for Reinforcement Learning Pre-training from Videos

This work deconstructs complex "global motion" in videos into morphology-agnostic "atomic actions" (local optical flow patches). A dual-attention encoder learns transferable local motion representations, which are recomposed into a world model via a learnable aggregation token. This paradigm significantly enhances RL sample efficiency and final performance on downstream robotic control tasks such as DMControl Remastered and Meta-World.

MangoBench: A Benchmark for Multi-Agent Goal-Conditioned Offline Reinforcement Learning

This paper extends single-agent "Offline Goal-Conditioned RL (OGCRL)" to multi-agent collaborative scenarios for the first time. It proposes a goal-conditioned offline MARL framework based on goal relabeling and robot structural decomposition, alongside MangoBench—the first fully collaborative multi-goal benchmark for this setting (3 environments, 4 agent types, 47 tasks, 6 baselines). Experiments demonstrate that hierarchical IHIQL generalizes best under sparse rewards, though no single method dominates all tasks.
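
Goal relabeling, which the framework builds on, can be sketched in a single-agent form: each transition is assigned a goal that was actually reached later in the same trajectory, turning sparse-reward data into useful supervision. The tuple layout, the exact-match reward, and the function name are assumptions for illustration and do not reflect MangoBench's multi-agent implementation.

```python
import random


def relabel_with_future_goals(trajectory, rng):
    """Hindsight-style relabeling of one offline trajectory.

    trajectory: list of (state, action, next_state) tuples.
    rng: a random.Random instance used to pick a future step.
    Returns (state, action, next_state, goal, reward) tuples, where the goal
    is a state reached at or after the current step.
    """
    relabeled = []
    for t, (state, action, next_state) in enumerate(trajectory):
        future = rng.randrange(t, len(trajectory))
        goal = trajectory[future][2]
        reward = 1.0 if next_state == goal else 0.0  # sparse goal-reaching reward
        relabeled.append((state, action, next_state, goal, reward))
    return relabeled
```

For example, `relabel_with_future_goals([(0, "a", 1), (1, "a", 2)], random.Random(0))` returns two transitions whose goals are drawn from the states the trajectory later visits.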

Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning

MARVAL uses Guided Score Implicit Matching (GSIM) with CFG guidance to compress the multi-step diffusion denoising chain within Masked Auto-Regressive (MAR) models into single-step generation. It achieves FID=2.00 on ImageNet 256×256 while running over 30 times faster than MAR. This acceleration enables the first practical reinforcement learning post-training for MAR-like models using verifiable rewards, significantly improving human-preference-related scores such as CLIP score and ImageReward.

MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

This paper proposes Multi-Stage Reinforcement Learning (MSRL), which first learns reward reasoning capabilities on large-scale text preference data and then progressively transfers them to multimodal tasks. This addresses the bottleneck of scarce annotated data in multimodal reward model training, improving accuracy on VL-RewardBench from 66.6% to 75.9%.

PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning

To address the near-collapse of VLM 3D spatial reasoning on 360° Equirectangular Projection (ERP) panoramas, this work constructs the PanoEnv-QA benchmark with 14.8K questions across five geometrically aligned categories. By employing GRPO post-training with "task-routed ground-truth rewards" and a "two-stage curriculum," the total accuracy of a 7B model is improved from 49.34% to 52.93%, and open-ended question accuracy rises from 6.39% to 14.83%, surpassing 32B models.

PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning

PlannerRFT performs reinforcement fine-tuning for diffusion-based autonomous driving planners: it uses "policy-guided denoising" to transform modal-collapsed diffusion sampling into diverse and scene-adaptive trajectory groups, then applies a dual-branch closed-loop optimization with GRPO + PPO, supported by the self-developed 10× accelerated simulator nuMax, achieving SOTA closed-loop planning performance on nuPlan.

ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

ReAG is a reasoning-augmented multimodal RAG method that combines coarse- and fine-grained retrieval with a Critic filtering model to reduce noise. It trains the generator with GRPO to reason explicitly, achieving a new SOTA on knowledge-intensive VQA tasks.

Resolving the Stability-Plasticity Dilemma in Reinforcement Learning via Complementary Continual Critics

To address the "Stability-Plasticity Dilemma"—the need for both fast adaptation and no forgetting—in visual RL, this paper proposes CD-CCA: it equips one "plastic critic" with Continual Backpropagation (CBP) and one "stable critic" with Elastic Weight Consolidation (EWC), then adaptively fuses their Q-values based on observations via a cross-attention mechanism. It simultaneously improves sample efficiency and convergence stability on DMControl and CARLA.
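
The EWC term used by the "stable critic" can be sketched as a quadratic penalty that anchors important parameters to their earlier values. This shows only the standard EWC regularizer, not CD-CCA's critic or its cross-attention fusion; the flat-list parameter layout and the names are assumptions.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """Elastic Weight Consolidation penalty.

    Computes (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2, where
    fisher_diag holds the diagonal Fisher information estimated on past data.
    """
    total = 0.0
    for theta, theta_star, fisher in zip(params, anchor_params, fisher_diag):
        total += fisher * (theta - theta_star) ** 2
    return 0.5 * lam * total
```

Parameters with large Fisher values are expensive to move, which protects previously learned behavior, while parameters with small Fisher values remain free to adapt.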

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

This paper proposes the Evidence-Constrained Reweighting Decoding (ECRD) framework: it maintains a dynamic textual evidence pool during LVLM decoding, reweights candidate tokens via distribution negotiation, and automatically invokes a lightweight visual decider to extract micro-evidence when uncertain. It significantly reduces visual hallucinations and improves reasoning accuracy across multiple LVLMs without training.

Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

VFLM proposes a layout generation framework that utilizes visual feedback for iterative optimization. By combining a visual reward model based on OCR accuracy with reinforcement learning, the framework enables Multimodal Large Language Models (MLLMs) to "see" rendering results and repeatedly correct them, significantly outperforming code-only generation methods in text layout quality.

Specificity-aware Reinforcement Learning for Fine-grained Open-world Classification

SpeciaRL is a specificity-aware reinforcement learning framework for open-world fine-grained image classification. It guides reasoning-based Large Multimodal Models (LMMs) with dynamic reward signals derived from the best prediction among online rollouts, improving both the specificity and the correctness of predictions.

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Talk2Move models "translating/rotating/scaling an object in a scene based on text instructions" as an RL problem. It utilizes Flow-GRPO for exploration on diffusion trajectories with spatial rewards, eliminating the need for paired supervision data. By employing early-exit sampling, it accelerates training by \(2\times\). It significantly outperforms existing editing models like GPT-Image-1, Flux-Kontext, and QwenImageEdit in terms of spatial accuracy and scene consistency.

TaskForce: Cooperative Multi-agent Reinforcement Learning for Multi-task Optimization

TaskForce models the process of weighting task-specific gradients into a unified update direction as a cooperative multi-agent reinforcement learning (MARL) problem. Each task is assigned an agent that observes a compressed gradient summary via a Gram matrix and outputs a weight for its own gradient. The learning is driven by a hybrid reward encoding both "gradient alignment" and "loss descent." The method consistently outperforms existing SOTA multi-task optimization (MTO) methods on NYU-v2, Cityscapes, and QM9.
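
The two building blocks described here, a Gram matrix summarizing task gradients and a weighted combination into one update direction, can be sketched as follows. The agents that produce the weights are not modeled; the weights are simply passed in, and the function names are hypothetical.

```python
def gradient_gram_matrix(task_grads):
    """Pairwise inner products between flattened task gradients.

    Entry [i][j] is the dot product of the gradients of task i and task j;
    its sign indicates whether the two tasks agree or conflict.
    """
    return [
        [sum(a * b for a, b in zip(g_i, g_j)) for g_j in task_grads]
        for g_i in task_grads
    ]


def combine_gradients(task_grads, weights):
    """Weighted sum of task gradients, giving a single update direction."""
    dim = len(task_grads[0])
    combined = [0.0] * dim
    for grad, weight in zip(task_grads, weights):
        for k in range(dim):
            combined[k] += weight * grad[k]
    return combined
```

The Gram matrix has size (number of tasks) squared regardless of model size, which is why it works as a compressed observation of the gradients.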

TSTM: Temporal Segmentation for Task-relevant Mask in Visual Reinforcement Learning Generalization

TSTM utilizes an "encoder-temporal-decoder" segmentation network with ConvLSTM to extract task-relevant regions (masks) from continuous multi-frame observations. Combined with VICReg-style invariant representation learning and policy consistency constraints for SAC training, it achieves SOTA generalization performance on the DMC-GB video easy/hard benchmarks.