⚖️ Alignment & RLHF

📷 CVPR2026 · 12 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (102) · 💬 ACL2026 (38) · 🧪 ICML2026 (37) · 🤖 AAAI2026 (17) · 🧠 NeurIPS2025 (36) · 📹 ICCV2025 (2)

🔥 Top topics: Multimodal/VLM ×4 · Alignment/RLHF ×3 · Adversarial Robustness ×2

Anchoring the Mind of Multimodal Reasoners: Cognitive Bias as a Vector for Jailbreak Attacks

This paper discovers an "anchoring effect" in the safety judgments of Multimodal Large Reasoning Models (MLRMs)—where the model is significantly biased by the first information it encounters. Based on this, RA-Attack is proposed: it first anchors the model's reasoning chain to a "safe tone" using a "seemingly safe" structured mind map and educational context text, then smoothly packages harmful intent as a natural extension of this reasoning chain. It achieves SOTA Attack Success Rates (ASR) of 92% (Gemini-2.5-Pro) and 82% (GPT-4o) across 7 mainstream MLRMs.

Bridging Human Evaluation to Infrared and Visible Image Fusion

Infrared and Visible Image Fusion (IVIF) methods have long optimized only handcrafted metrics, leaving them disconnected from human aesthetic judgment. To address this, the paper constructs the first large-scale IVIF human feedback dataset. It trains a "fusion-oriented reward model" to quantify perceptual quality and uses SAM-assisted GRPO to align the fusion network with human preferences, achieving SOTA performance on mainstream benchmarks with more visually pleasing fusion results.

DRM: Diffusion-based Reward Model With Step-wise Guidance

This paper uses a pre-trained diffusion model itself as the reward model backbone (DRM). By leveraging the backbone's ability to score noisy latents at any denoising step, the authors design Step-GRPO for training with dense step-wise rewards and Step-wise Sampling for "explore-and-select" during inference. This approach significantly improves the generation quality of SD3.5-Medium without adding parameters and converges 2.5–3.5 times faster.

EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment

EcoAlign reframes the inference-time alignment of Large Vision-Language Models (LVLMs) as an "optimal path search problem under a limited compute budget." It utilizes a Net Present Value (NPV)-like look-ahead function to score candidate actions on a dynamically constructed Graph-of-Thought, balancing safety, utility, and cost while defining path safety via the "weakest link" principle to achieve superior safety and utility at lower compute costs.
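The two scoring ideas named in this summary can be illustrated with a small sketch. The summary does not give EcoAlign's actual look-ahead function, so the code below uses the textbook net-present-value form (discounted gain minus cost) as a stand-in. The `Candidate` fields, the `discount` factor, and the `min_safety` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical reasoning path with per-step estimates."""
    name: str
    step_safety: list  # per-step safety scores in [0, 1]
    step_gain: list    # per-step expected utility
    step_cost: list    # per-step compute cost


def path_safety(step_safety):
    # "Weakest link": a path is only as safe as its least safe step.
    return min(step_safety)


def npv_like_score(step_gain, step_cost, discount=0.9):
    # Textbook NPV: later steps count less, and cost is subtracted from gain.
    return sum(
        (gain - cost) * discount ** t
        for t, (gain, cost) in enumerate(zip(step_gain, step_cost))
    )


def select_path(candidates, budget, min_safety=0.5, discount=0.9):
    # Keep paths that fit the compute budget and pass the safety floor,
    # then pick the one with the best discounted net value.
    feasible = [
        c for c in candidates
        if sum(c.step_cost) <= budget and path_safety(c.step_safety) >= min_safety
    ]
    if not feasible:
        return None
    return max(
        feasible,
        key=lambda c: npv_like_score(c.step_gain, c.step_cost, discount),
    )
```

Using `min` rather than a mean means that a single unsafe step disqualifies a path, however safe the remaining steps are.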

From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward

Addressing the fundamental misalignment in handwritten mathematical expression recognition where "LaTeX text similarity \(\neq\) rendered image similarity," this paper proposes Image Matching Score (IMS)—a lightweight image-level reward based on column projection encoding and Levenshtein distance. This reward drives IMPO, a GRPO reinforcement learning framework without a value network. Across CROHME, HME100K, and M2E benchmarks, it increases ExpRate by an average of approximately 1.1% (up to 1.37%), achieving a new SOTA.
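To make the image-level reward concrete, here is a minimal sketch of comparing two rendered expressions by column projection and Levenshtein distance. The summary does not specify how IMS encodes or normalizes projections, so the choices below are assumptions: binary images given as lists of rows, the raw ink count per column as the symbol, and normalization by the longer sequence.

```python
def column_projection(image):
    """Encode a binary image (list of rows of 0/1) as ink counts per column."""
    return [sum(column) for column in zip(*image)]


def levenshtein(a, b):
    """Classic edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def image_matching_score(image_a, image_b):
    """Similarity in [0, 1]; 1.0 means identical column profiles (assumed form)."""
    seq_a = column_projection(image_a)
    seq_b = column_projection(image_b)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(seq_a, seq_b) / longest
```

A score of this kind rewards two different LaTeX strings equally when they render to the same image, which string-level similarity cannot do.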

MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration

MorphSeek redefines deformable medical image registration as "policy optimization in the encoder's latent space"—attaching a Gaussian policy head to the top layer of a U-Net encoder to treat latent features as samplable actions. It first uses unsupervised warm-up to stabilize the latent space, then employs GRPO for multi-trajectory multi-step weakly supervised fine-tuning. Combined with LDVN to stabilize policy gradients in the tens-of-thousands-dimensional latent space, it improves Dice by 2–4% and reduces the folding rate (NJD) by 30–60% on three 3D registration benchmarks using minimal labels.

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

This paper proposes NullSteer, an activation steering defense framework based on null-space projection. By restricting steering operations to the null space of benign activations, it effectively defends against visual jailbreak attacks without compromising the model's general capabilities.
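The geometric idea behind null-space steering can be sketched in a few lines. This is an illustration under stated assumptions, not NullSteer's actual procedure: the benign subspace is taken to be the span of a few example activations (orthonormalized with Gram-Schmidt), and the steering map `steer_matrix` is a hypothetical linear map applied only to the component of an activation that lies outside that subspace.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(h, benign_basis):
    """Remove every component of h that lies in the benign subspace."""
    w = list(h)
    for b in benign_basis:
        coeff = dot(w, b)
        w = [wi - coeff * bi for wi, bi in zip(w, b)]
    return w


def steer(h, benign_basis, steer_matrix):
    """Apply a linear steering map only to the null-space component of h."""
    residual = project_to_null_space(h, benign_basis)
    delta = [dot(row, residual) for row in steer_matrix]
    return [hi + di for hi, di in zip(h, delta)]
```

Under these assumptions, an activation lying entirely in the benign subspace has a zero residual and passes through `steer` unchanged, while an activation with an off-subspace component is shifted.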

SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization

SafeGRPO integrates "verifiable rule-governed rewards" into GRPO, allowing Multimodal Large Language Models (MLLMs) to learn self-rewarded safety through a "step-guided reasoning process" (analyzing visual, text, and combined risks) without manual preference annotations. This approach enhances jailbreak defense, safety awareness, and stability across multiple safety benchmarks while minimizing degradation of general capabilities and avoiding excessive refusal.
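Several entries on this page build on GRPO, which replaces a learned value network with advantages normalized within a group of sampled responses. The sketch below shows that normalization together with a toy rule-based reward. The rule set and the response fields (`has_reasoning_steps`, `predicted_risk`, `true_risk`) are hypothetical stand-ins, since the summary does not list SafeGRPO's actual rules.

```python
from statistics import mean, pstdev


def rule_reward(response):
    """Hypothetical verifiable reward: a format check plus a risk-label check."""
    score = 0.0
    if response["has_reasoning_steps"]:
        score += 0.5  # followed the step-guided format
    if response["predicted_risk"] == response["true_risk"]:
        score += 1.0  # risk judgment matches the reference label
    return score


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses scored by the rules.
group = [
    {"has_reasoning_steps": True, "predicted_risk": "unsafe", "true_risk": "unsafe"},
    {"has_reasoning_steps": True, "predicted_risk": "safe", "true_risk": "unsafe"},
    {"has_reasoning_steps": True, "predicted_risk": "unsafe", "true_risk": "unsafe"},
    {"has_reasoning_steps": True, "predicted_risk": "safe", "true_risk": "unsafe"},
]
rewards = [rule_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, a group in which every response earns the same reward produces all-zero advantages and contributes no gradient signal.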

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

REACT is a frame-level reward model targeting "structural distortion" in generated videos. It establishes a taxonomy of eight distortion categories and annotates 15,000 frame preference pairs. Using grounding reconstruction combined with Gemini-2.5-Pro, it synthesizes 6K CoT samples at low cost. Qwen2.5-VL-7B is trained in two stages via "Masked SFT + GRPO pairwise reward." During inference, a dynamic sampling mechanism focuses on the frames most likely to be distorted. REACT significantly outperforms existing video/image evaluators in both preference alignment and distortion identification.

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

UE-DPO shifts the optimization focus for hallucination suppression in Multimodal Large Language Models (MLLMs) from "visually sensitive tokens that the model already understands" to "critical cognitive blind-spot tokens that the model fails to comprehend." By quantifying these blind spots with token-level epistemic uncertainty, UE-DPO asymmetrically adjusts DPO gradient intensities for preferred and dispreferred branches. It outperforms similar methods like TPO and V-DPO on multiple hallucination benchmarks using significantly less data.

Unlocking Token Rewards via Training-Free Reward Attribution

P2T uses a first-order Taylor approximation to decompose the "segment-level" rewards of existing Process Reward Models (PRMs) into per-token rewards without any additional training. A single forward and backward pass yields token-level rewards for the entire sequence. Integrating these rewards into GRPO accelerates RL convergence on math and multimodal reasoning by approximately 4× and achieves a +11.5% improvement on AIME24 compared to outcome rewards.
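A first-order Taylor attribution can be illustrated with a toy reward model. The summary does not give P2T's exact formula, so this sketch assumes the common gradient-times-input form with a zero baseline: each token's share is the dot product of the reward gradient with that token's embedding. The reward function `toy_segment_reward` and the weight vector `w` are hypothetical, and the backward pass is written out by hand rather than computed by an autodiff library.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def toy_segment_reward(embeddings, w):
    """Hypothetical segment-level reward: tanh of a summed linear score."""
    s = sum(dot(w, e) for e in embeddings)
    return math.tanh(s)


def token_attributions(embeddings, w):
    """Gradient-times-input attribution from one forward and one backward pass."""
    r = toy_segment_reward(embeddings, w)  # forward pass
    grad_scale = 1.0 - r * r               # d tanh(s) / ds
    # Backward pass: for this toy model the gradient w.r.t. every token
    # embedding is the same vector, grad_scale * w.
    grad = [grad_scale * wi for wi in w]
    return [dot(grad, e) for e in embeddings]
```

In this toy setting, tokens whose embeddings align with `w` receive positive credit and tokens that oppose it receive negative credit, all from a single gradient evaluation.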

Video-CoE: Reinforcing Video Event Prediction via Chain of Events

To address the tendency of Multimodal Large Language Models (MLLMs) to lack logical reasoning and ignore visual content in Video Event Prediction (VEP), this paper proposes the Chain of Events (CoE) paradigm. It requires the model to segment videos into timestamped historical event chains and perform causal reasoning based on them. Through a two-stage training process (CoE-SFT for reasoning injection + CoE-GRPO for reinforcing event chain construction via dense rewards), Qwen2.5-VL-7B improves from 52.9% to 75.0% on FutureBench, setting a new VEP SOTA.