⚖️ Alignment & RLHF

🧪 ICML2026 · 7 paper notes

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking

BLOCK-EM uses a sparse autoencoder (SAE) to identify a small set of internal latents that causally control emergent misalignment, then adds a one-sided regularizer during narrow-domain SFT that keeps the model from amplifying these latents in the misalignment direction. This reduces emergent misalignment by an average of 93% across 6 fine-tuning domains, with almost no loss in in-domain task performance.
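
A minimal sketch of what a one-sided penalty on selected latents could look like. This is not the paper's implementation: the squared-hinge form, the per-latent `signs`, the comparison against base-model activations, and the weight `lam` are all assumptions made for illustration.

```python
def one_sided_latent_penalty(finetuned_acts, base_acts, blocked_ids, signs):
    """Penalize blocked latents only when they move in the flagged direction.

    finetuned_acts, base_acts: SAE latent activations for one token (lists of floats).
    blocked_ids: indices of the latents selected as causally relevant (hypothetical).
    signs: +1 or -1 per blocked latent, the direction assumed to raise misalignment.
    """
    penalty = 0.0
    for idx, sign in zip(blocked_ids, signs):
        shift = sign * (finetuned_acts[idx] - base_acts[idx])
        # Hinge: movement away from the flagged direction costs nothing.
        penalty += max(0.0, shift) ** 2
    return penalty


def total_loss(sft_loss, penalty, lam=1.0):
    # lam is a hypothetical trade-off weight between the task loss and the penalty.
    return sft_loss + lam * penalty
```

Because the penalty is zero whenever a blocked latent stays put or moves the other way, the fine-tuning loss is left untouched in those cases, which is one plausible reading of why in-domain performance is largely preserved.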

\(f\)-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

This paper establishes, for the first time, \(O(\log T)\) regret and \(O(1/T)\) suboptimality gap upper bounds for online RLHF under general \(f\)-divergence regularization. Two sampling strategies are proposed: (1) an optimism-in-the-face-of-uncertainty approach with a bonus term; (2) a novel "derivative-as-uncertainty" perspective that treats \(f'\) as an uncertainty signal, enabling derivative-based sampling without explicit confidence bound estimation in each round.
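
For readers unfamiliar with the setting, the sketch below evaluates a generic \(f\)-divergence regularized objective for a discrete policy. It uses the standard definition \(D_f(\pi \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{\pi_{\mathrm{ref}}}[f(\pi/\pi_{\mathrm{ref}})]\); the function names, the weight `beta`, and the exact form of the objective are assumptions and may differ from the paper's notation. It does not implement either sampling strategy.

```python
import math


def f_kl(t):
    # Generator of the KL divergence: f(t) = t * log(t), with f(0) taken as 0.
    return t * math.log(t) if t > 0 else 0.0


def f_chi2(t):
    # Generator of the chi-squared divergence: f(t) = (t - 1)^2.
    return (t - 1.0) ** 2


def f_divergence(p, q, f):
    # D_f(p || q) = sum_y q(y) * f(p(y) / q(y)), skipping zero-mass reference entries.
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)


def regularized_objective(policy, reference, rewards, f, beta=1.0):
    # Expected reward minus beta times the divergence from the reference policy.
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * f_divergence(policy, reference, f)
```

Swapping `f_kl` for `f_chi2` changes only the regularizer, which is the sense in which an analysis over general \(f\) covers several familiar objectives at once.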

Pareto-Guided Optimal Transport for Multi-Reward Alignment

PG-OT reframes multi-reward text-to-image alignment: instead of a weighted global sum of rewards, it constructs a Pareto frontier for each prompt and uses Sinkhorn optimal transport to move dominated samples toward the frontier. It also introduces two new metrics, Joint Domination Rate (JDR) and Joint Collapse Rate, to expose reward hacking masked by averaging. On Parti-Prompts, JDR₂ reaches 47.98%, an 11% improvement over strong baselines, with a human evaluation win rate close to 80%.
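
The two ingredients named above, a per-prompt Pareto split and an entropic (Sinkhorn) transport plan, can be sketched in a few lines. This is an illustration only: the squared-distance cost in reward space, the uniform marginals, and the values of `eps` and `iters` are assumptions, and the paper's actual training objective is not reproduced here.

```python
import math


def dominates(a, b):
    # a dominates b if it is at least as good on every reward and better on one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def split_pareto(rewards):
    # Return (front, dominated) index lists for one prompt's samples.
    front, dominated = [], []
    for i, r in enumerate(rewards):
        if any(dominates(o, r) for j, o in enumerate(rewards) if j != i):
            dominated.append(i)
        else:
            front.append(i)
    return front, dominated


def sinkhorn(cost, eps=0.1, iters=200):
    # Entropic optimal transport between uniform marginals over rows and columns.
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical two-reward scores for five samples of a single prompt.
rewards = [(0.9, 0.2), (0.2, 0.9), (0.5, 0.5), (0.4, 0.4), (0.1, 0.1)]
front, dominated = split_pareto(rewards)
cost = [[squared_distance(rewards[d], rewards[f]) for f in front] for d in dominated]
plan = sinkhorn(cost)  # plan[i][j]: mass moved from dominated[i] to front[j]
```

In this toy example the first three samples are mutually non-dominated, so a weighted sum would rank them by the chosen weights, whereas the Pareto view treats all three as valid targets.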

Reward Modeling from Natural Language Human Feedback

This paper identifies a severe "outcome-process inconsistency" (20–30%, up to 44%) in generative reward models (GRMs) trained on binary preference rewards: the model guesses the correct preference but provides an incorrect critique. The authors propose RM-NLHF, which uses the similarity between model and human critiques on core arguments as an additional process reward, and employs MetaRM to automatically predict process rewards and update them online as the policy changes. This approach consistently outperforms outcome-only GRPO-trained SOTA GRMs across multiple benchmarks.

The Realignment Problem: When Right becomes Wrong in LLMs

This paper formalizes the "what if the policy changes after model deployment" scenario as the Realignment problem, and proposes the TRACE framework: using a stronger proxy model to triage existing preference pairs into three categories (Invert / Punish / Retain), then performing surgical realignment with a hybrid IPO+NPO+KL objective, enabling adaptation to policy drift without a new round of human annotation.
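
A rough sketch of how a triage label might route each preference pair to a different loss term. The IPO and NPO formulas are the standard published forms; how TRACE actually combines them is not stated here, so the routing in `pair_loss`, the `kl_estimate` input, and all hyperparameter values are assumptions.

```python
import math


def ipo_loss(logratio_chosen, logratio_rejected, tau=0.1):
    # IPO: squared distance of the log-ratio margin from the target 1 / (2 * tau).
    margin = logratio_chosen - logratio_rejected
    return (margin - 1.0 / (2.0 * tau)) ** 2


def npo_loss(logratio, beta=0.1):
    # NPO: (2 / beta) * log(1 + (pi / pi_ref)^beta), written with the log-ratio.
    return (2.0 / beta) * math.log1p(math.exp(beta * logratio))


def pair_loss(label, lr_old_chosen, lr_old_rejected, kl_estimate,
              tau=0.1, beta=0.1, kl_weight=1.0):
    """Hypothetical routing; each lr_* is log(pi(y|x) / pi_ref(y|x))."""
    if label == "invert":
        # Assumed: the preference flips, so the old rejected response is now chosen.
        return ipo_loss(lr_old_rejected, lr_old_chosen, tau)
    if label == "punish":
        # Assumed: both responses violate the new policy and are pushed down.
        return npo_loss(lr_old_chosen, beta) + npo_loss(lr_old_rejected, beta)
    if label == "retain":
        # Assumed: behavior should stay close to the reference model.
        return kl_weight * kl_estimate
    raise ValueError(f"unknown triage label: {label}")
```

The point of the routing is that only pairs affected by the policy change receive a corrective gradient, while retained pairs act as an anchor.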

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

This paper proposes SVGT, which shifts value alignment from "embedding into backbone parameters/activations" to "attaching an independent value module." The module continuously assesses the safety direction of the current hidden state in an isolated value space, then uses a set of learnable Bridge Tokens as explicit attention anchors to guide generation. Across four backbones, harmfulness scores are reduced by over 70% with almost no loss in fluency.

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

TUR-DPO augments DPO's preference logits with a "semantic + topological structure" shaping reward difference and an instance weight dynamically down-weighted by per-pair uncertainty. This allows the model to explicitly reward structural soundness of reasoning and suppress the impact of fragile preference pairs, while retaining the simplicity of RL-free training. As a result, TUR-DPO systematically outperforms DPO and IPO on reasoning tasks such as GSM8K / MATH / BBH / QA, and matches PPO on most tasks.
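
As a hedged illustration of the idea, the sketch below adds a shaping-reward difference to the DPO logit and scales the loss by an uncertainty-dependent weight. The DPO term is the standard one; the shaping scores, the weight `1 / (1 + uncertainty)`, and the parameters `beta` and `lam` are assumptions, not the paper's definitions.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def shaped_weighted_dpo_loss(lr_chosen, lr_rejected,
                             shape_chosen, shape_rejected,
                             uncertainty, beta=0.1, lam=1.0):
    """lr_* are log(pi(y|x) / pi_ref(y|x)); shape_* are hypothetical structure scores."""
    # Standard DPO logit plus an assumed additive shaping-reward difference.
    logit = beta * (lr_chosen - lr_rejected) + lam * (shape_chosen - shape_rejected)
    # Assumed weighting: more uncertain pairs contribute less to the loss.
    weight = 1.0 / (1.0 + uncertainty)
    return -weight * log_sigmoid(logit)
```

With zero shaping difference and zero uncertainty this reduces to the ordinary DPO loss, so the additions can be read as a strict extension rather than a replacement.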