⚖️ Alignment & RLHF

📷 CVPR2026 · 12 paper notes

Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: This paper proposes a method that bypasses Clebsch-Gordan (CG) coefficient computation and directly constructs explicit steerable kernel bases from group representation matrix elements. Through a three-step strategy of "stabilizer constraint + Schur's lemma + steering," it uniformly covers SO(2), O(2), SO(3), O(3), and the non-compact Lorentz group, substantially simplifying the kernel design pipeline for equivariant CNNs.
Bases of Steerable Kernels for Equivariant CNNs: From 2D Rotations to the Lorentz Group: This paper solves the steerable kernel constraint in equivariant CNNs without Clebsch-Gordan coefficients: it first solves simple invariance conditions on stabilizer subgroups and then "steers" the solution to arbitrary points, yielding explicit kernel bases for symmetry groups ranging from SO(2) to the Lorentz group.
Bias at the End of the Score: Demographic Biases in Reward Models for T2I: This paper conducts a large-scale demographic bias audit of widely used reward models (PickScore, ImageReward, HPS, etc.) in text-to-image generation. It finds that reward-guided optimization disproportionately sexualizes images of women and shifts generated demographics toward white subjects, and that reward scores correlate with real-world population frequency priors.
GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering: GlyphPrinter constructs a region-level glyph preference dataset (GlyphCorrector) and proposes Region-Grouped DPO (R-GDPO) to significantly improve glyph accuracy in visual text rendering without relying on explicit reward models, while introducing inference-time Regional Reward Guidance (RRG) for controllable generation.
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models: This paper proposes two complementary methods for multi-preference optimization, MapReduce LoRA and RaTE: the former progressively advances the Pareto front through a "Map (parallel preference expert training) + Reduce (iterative merging)" strategy; the latter learns reward-aware token embeddings for composable preference control at inference time.
Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation: This paper proposes Mesh-Pro, the first asynchronous online reinforcement learning framework for 3D quadrilateral mesh generation. Its core algorithm, ARPO (Advantage-guided Ranking Preference Optimization), combines the Plackett-Luce ranking model with advantage-function weighting to achieve simultaneous improvements in efficiency (3.75× faster than offline DPO) and generalization, attaining state-of-the-art generation quality for both artist-style and dense meshes.
LocalDPO: Direct Localized Detail Preference Optimization for Video Diffusion Models: LocalDPO performs localized, detail-level preference alignment. It constructs negative samples by applying random spatiotemporal Bézier-masked local corruption to real high-quality videos (single inference pass, no external ranking) and trains on the resulting pairs with a region-aware DPO loss. The method consistently outperforms vanilla DPO and SFT in video quality on Wan2.1 and CogVideoX.
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization: This paper proposes MoD-DPO (Modality-Decoupled DPO), which decouples the contribution of each modality in multimodal LLMs via three mechanisms—invariance regularization, sensitivity regularization, and language-prior debiasing—to effectively mitigate cross-modal hallucinations (e.g., answering visual questions using auditory information). A closed-form optimal policy is also derived.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization: This work introduces DPO into the post-training stage of diffusion-based motion generation models. A physics simulation controller automatically constructs preference data pairs, so that generated human motions satisfy both text/spatial control instructions and physical constraints. The approach transfers zero-shot to a real Unitree G1 robot.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization: PhysMoDPO integrates a pretrained whole-body controller (WBC/DeepMimic) into the post-training pipeline of a diffusion-based motion generator. Preference pairs are constructed automatically via physical simulation and used for DPO fine-tuning, so that generated motions, after WBC execution, are both physically plausible and faithful to the text/spatial conditions, enabling zero-shot transfer to a real Unitree G1 robot.
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models: This paper proposes NullSteer, an activation steering defense framework based on null-space projection, which effectively resists visual jailbreak attacks without degrading general model capability by constraining steering operations to the null space of benign activations.
\(\varphi\)-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models: This paper proposes \(\varphi\)-DPO, which adopts DPO as a continual learning paradigm (using the previous-step model as the reference policy) and introduces a fairness modulation factor \((1-p)^\gamma\) inspired by focal loss to balance gradient contributions across data groups. The authors theoretically prove that the gradient bias approaches zero as \(\gamma \to \infty\), and achieve state-of-the-art performance on the CoIN and MLLM-CL benchmarks.
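
As a rough illustration of the focal-style modulation factor \((1-p)^\gamma\) mentioned in the \(\varphi\)-DPO note above, the sketch below applies it to a standard DPO preference probability. The note does not define \(p\); the sketch assumes it is the DPO preference probability \(\sigma(\beta[(\log\pi_w - \log\pi^{\text{ref}}_w) - (\log\pi_l - \log\pi^{\text{ref}}_l)])\). The function names, the default \(\beta\), and the example values are all hypothetical and are not taken from the paper.

```python
import math


def dpo_preference_prob(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO preference probability for a (chosen, rejected) pair.

    Assumption: this is the `p` that the modulation factor acts on.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return 1.0 / (1.0 + math.exp(-margin))


def focal_modulated_loss(p, gamma=2.0):
    """Focal-style loss: -(1 - p)^gamma * log(p).

    gamma = 0 recovers the plain DPO loss -log(p); larger gamma shrinks the
    contribution of pairs the model already ranks correctly (p close to 1).
    """
    return -((1.0 - p) ** gamma) * math.log(p)


if __name__ == "__main__":
    # Hypothetical pairs: one already well separated, one still hard.
    for p in (0.9, 0.6):
        plain = focal_modulated_loss(p, gamma=0.0)
        focal = focal_modulated_loss(p, gamma=2.0)
        print(f"p={p}: plain={plain:.4f}  focal={focal:.4f}  weight={(1 - p) ** 2:.2f}")
```

With \(\gamma = 2\), the well-separated pair (\(p = 0.9\)) keeps about 1% of its plain loss while the harder pair (\(p = 0.6\)) keeps 16%, so under this assumed form the harder pair dominates the gradient.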