⚖️ Alignment & RLHF

🔬 ICLR2026 · 42 paper notes

A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models: This paper proposes A2D, a token-level safety alignment method for diffusion language models (dLLMs) that trains the model to output [EOS] tokens at masked positions containing harmful content, enabling robust defense across any decoding order and any decoding step. It reduces DIJA template attack success rates from over 80% to near zero (1.3%/0.0%) while supporting early rejection for a 19.3× speedup.
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment: This paper proposes Multi-Lingual Consistency (MLC), an auxiliary loss that manipulates the singular values of a multilingual representation matrix via SVD to drive it toward rank-1 (i.e., collinear multilingual representations). Using only multilingual prompt translations—without requiring target-language responses—MLC consistently transfers safety alignment from one language to all others.
Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization: This paper proposes MetaAPO, a framework that employs a lightweight meta-learner (a two-layer MLP) to dynamically estimate the alignment gap between offline and online data. The meta-learner simultaneously guides where to perform online sampling (addressing distribution mismatch) and adaptively reweights offline/online data during training (improving learning efficiency). MetaAPO outperforms DPO, Online DPO, and other baselines on AlpacaEval 2, Arena-Hard, and MT-Bench, while reducing online annotation costs by 42%.
AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint: This paper proposes AlphaSteer, which learns a null-space-constrained transformation matrix to dynamically construct steering vectors that produce near-zero vectors for benign inputs (preserving utility) while reconstructing the refusal direction vector for malicious inputs (enhancing safety), providing theoretical guarantees for the decoupling of safety and utility.
Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence: This paper proposes Antibody, a two-stage defense framework that (1) during alignment, applies flatness regularization to place the model in a flat region of the harmful loss landscape (small gradients → harder to attack), and (2) during fine-tuning, suppresses learning from harmful samples via a likelihood-ratio-based sample weighting scheme (contrasting the likelihood of task completion vs. refusal). The average Harmful Score is reduced from 15.29% to 7.04%.
AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization: To address spurious associations and hallucinations in multimodal large language models (MLLMs) for emotion reasoning, this work proposes the EmoReAlM evaluation benchmark and the AVEm-DPO preference optimization method. By constructing targeted preference pairs and incorporating text-prior regularization, the approach achieves 6–19% relative zero-shot performance gains on DFEW, RAVDESS, and EMER.
Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling: This paper proposes RCPO, a framework that extends LLM alignment from pairwise preference to ranked choice modeling. By unifying a utility model (MNL) and a ranking model (Mallows-RMJ) under MLE, RCPO outperforms DPO and its variants under both single-best and top-k feedback formats.
Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework: This paper proposes a preference learning framework grounded in social choice theory axioms. It infers the feasible set of evaluator population distributions from pairwise comparison data and constructs policies satisfying two axioms: Population-Proportional Alignment (PPA) and Population-Bounded Manipulability (PBM).
CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation: This paper proposes the CAGE framework, which decouples the adversarial structure of red-teaming prompts from their cultural content via a construct termed the Semantic Mold. CAGE systematically adapts English red-teaming benchmarks to diverse cultural contexts, yielding culturally grounded prompts that achieve substantially higher attack success rates (ASR) than direct translation.
Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training: This paper theoretically establishes that reward over-optimization stems primarily from misspecification in the high-reward tail region, and proposes a rubric-based reward modeling approach: leveraging off-policy data (high-quality responses from stronger models) to construct scoring rubrics, which are progressively refined by distinguishing "good vs. better" responses, effectively mitigating reward over-optimization.
Displacement-Resistant Extensions of DPO with Nonconvex \(f\)-Divergences: This paper establishes that the solvability of f-DPO does not require convexity of \(f\) — only \(\lim_{t\to 0^+} f'(t) = -\infty\) is needed — and further proves that \(\arg\min f(t) \geq 1\) is a necessary condition for displacement resistance. Based on these findings, the paper proposes SquaredPO (\(f(t) = \frac{1}{2}(\log t)^2\), nonconvex), which significantly alleviates the winner probability degradation problem while maintaining competitive performance.
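As a quick check of the two conditions quoted above against the stated SquaredPO generator (a worked derivation added here for illustration, not reproduced from the paper):

\[
f(t) = \tfrac{1}{2}(\log t)^2, \qquad f'(t) = \frac{\log t}{t}, \qquad f''(t) = \frac{1 - \log t}{t^2}.
\]

As \(t \to 0^+\), \(\log t \to -\infty\) and \(1/t \to +\infty\), so \(f'(t) \to -\infty\) and the solvability condition holds. Since \(f(t) \geq 0\) with equality only at \(t = 1\), \(\arg\min f(t) = 1\), which meets the necessary condition \(\arg\min f(t) \geq 1\). Finally, \(f''(t) < 0\) for \(t > e\), which confirms that \(f\) is nonconvex.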
Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation: This paper proposes the Dual-IPO framework, which performs multi-round bidirectional iterative optimization between a reward model and a video generation model. Without large-scale human annotation, the approach continuously improves text-to-video generation quality and human preference alignment, enabling a 2B model to surpass a 5B model.
From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization: This paper proposes ALPO (Adaptive Local Preference Optimization) for training expressive subtitle translation LLMs. Three empirical findings motivate the design: (1) subtitle translation exhibits the lowest back-translation consistency, indicating the highest degree of paraphrase; (2) reasoning-type LLMs (R1/GPT-5 Thinking) produce more expressive paraphrases than chat-type LLMs (GPT-4o/Qwen-Max); (3) a 14B model used as a translation evaluator achieves Spearman correlation \(\geq 0.82\) with human judgments, qualifying it as a low-cost reward model. Building on these findings, the paper proposes a fine-grained, process-supervised preference alignment method operating at the sentence-segment level (with adaptive weighting, dynamic beta, and prefix mixing). A 14B model trained with ALPO surpasses GPT-4o and DeepSeek-R1 in vividness across multiple subtitle translation directions.
General Exploratory Bonus for Optimistic Exploration in RLHF: This paper theoretically demonstrates that existing RLHF exploratory bonuses under KL and α-divergence regularization actually drive the policy toward high-probability regions of the reference model—contrary to the principle of optimism. It proposes the General Exploratory Bonus (GEB) framework, which introduces reference-model-dependent reward modulation to counteract the conservative bias induced by divergence regularization, and provably satisfies the optimism principle.
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends: By constructing a KL-regularized surrogate objective and deriving a pairwise consistency condition from first principles, this paper proves that group-relative REINFORCE (GRPO) is inherently an off-policy algorithm. Component isolation experiments further reveal that clipping is the sole driver of training stability while importance sampling can be entirely removed. Within this unified framework, the paper reinterprets several seemingly independent algorithms—including Kimi OPMD and Meta AsymRE—under a common theoretical lens.
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models: This paper proposes GuardAlign, a training-free test-time safety defense framework for multimodal large language models. It leverages optimal transport (OT) to precisely detect and mask unsafe regions in images, and employs cross-modal attention calibration to sustain the influence of safety prefixes across layers. Evaluated on six LVLMs, GuardAlign reduces unsafe response rates by up to 39% while preserving or improving general capability.
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks: This paper identifies a "historical context inconsistency" problem in stepwise group-based RL methods (e.g., GRPO/GiGPO): steps within the same group may have different historical contexts, leading to biased advantage estimation. It proposes HGPO, which achieves low-bias, balanced-variance advantage estimation through hierarchical grouping and adaptive weighting, yielding significant improvements on ALFWorld and WebShop with negligible additional overhead (<0.001%).
Is On-Policy Data always the Best Choice for Direct Preference Optimization-based LM Alignment?: This paper challenges the prevailing assumption that on-policy data is always superior, revealing that LLM alignment comprises two distinct stages — preference injection (requiring high-diversity off-policy data) and preference fine-tuning (requiring high-quality on-policy data) — with the optimal data type varying across models and stages. A boundary detection algorithm incurring only 3.2% additional computational overhead is proposed and validated across 5 models × 55 configurations.
JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks: This paper introduces JailNewsBench, the first multilingual and multi-regional benchmark for evaluating LLM robustness against fake news generation under jailbreak attacks. Covering 34 regions, 22 languages, and approximately 300,000 instances, the benchmark reveals attack success rates as high as 86.3% and exposes a systematic safety imbalance in which English- and U.S.-topic defenses are significantly weaker than those for other regions.
Learning More with Less: A Dynamic Dual-Level Down-Sampling Framework for Efficient Policy Optimization: This paper proposes D3S (Dynamic Dual-Level Down-Sampling), a framework that maximizes advantage variance at the sample level and prioritizes high-entropy, high-advantage tokens at the token level, combined with a dynamic scheduling strategy. D3S achieves faster convergence and superior performance using fewer than 20% of tokens.
Learning Ordinal Probabilistic Reward from Preferences (OPRM): This paper proposes the Ordinal Probabilistic Reward Model (OPRM), which discretizes response quality into ordinal grades from 1 to 9 and learns the full probability distribution over these grades. Combined with Region Flooding Tuning (RgFT), it enables data-efficient training. OPRM achieves 89.3% on RewardBench, outperforming existing reward models by 2.9%–7.4%, while also providing uncertainty estimation and annotation disagreement detection.
Mitigating Mismatch within Reference-based Preference Optimization: This paper identifies the premature satisfaction problem in DPO — when the reference policy assigns lower probability to chosen than to rejected responses (~45% of pairs), DPO's gradient is unnecessarily attenuated by the pessimistic reference signal, even when the policy is still incorrect (i.e., \(\Delta_\theta < 0\)). The paper proposes HyPO (a one-line code change: clipping the reference margin via \(\max(0, \Delta_{ref})\)), achieving a 41.2% relative improvement over DPO on AlpacaEval 2.0.
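The one-line change described above can be sketched as follows. This is a minimal illustration, assuming the standard DPO loss \(-\log\sigma(\beta(\Delta_\theta - \Delta_{ref}))\) on scalar log-probability margins for a single pair. The function names `dpo_loss` and `hypo_loss` and the default `beta` are placeholders introduced here, not the paper's code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    delta_theta: log pi_theta(chosen) - log pi_theta(rejected)
    delta_ref:   log pi_ref(chosen)   - log pi_ref(rejected)
    """
    return -log_sigmoid(beta * (delta_theta - delta_ref))


def hypo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """Same loss with the reference margin clipped at zero (assumed form).

    When the reference prefers the rejected response (delta_ref < 0),
    the margin is replaced by 0 so it cannot shrink the gradient.
    """
    return -log_sigmoid(beta * (delta_theta - max(0.0, delta_ref)))
```

For a pair with `delta_theta = -1.0` and `delta_ref = -5.0`, the unclipped sigmoid argument is positive even though the policy still prefers the rejected response, whereas the clipped version keeps it negative and the loss stays larger. For `delta_ref >= 0` the two functions coincide.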
Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization: This paper proposes NSPO, which projects safety alignment policy gradients onto the null space of general-task representations, geometrically ensuring that safety optimization does not degrade general capabilities. Using only 40% of the safety training data, NSPO achieves state-of-the-art results across 7 safety benchmarks while incurring virtually no performance loss on mathematics, code generation, and instruction following.
No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping: This paper identifies that a large proportion of "zero-variance prompts" (where all sampled responses are either entirely correct or entirely incorrect) are silently discarded during GRPO training. The proposed RL-ZVP algorithm extracts learning signals from these prompts via entropy-guided advantage shaping, achieving improvements of up to 8.61 accuracy points and 7.77 pass-rate points over GRPO across six mathematical reasoning benchmarks.
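To see why such prompts contribute nothing under plain GRPO, recall that group-relative advantages are rewards standardized within each prompt's group of samples. The sketch below assumes binary correctness rewards and the common `(r - mean) / (std + eps)` normalization; `group_advantages` and `is_zero_variance` are illustrative names, and the entropy-guided shaping that RL-ZVP applies to these prompts is not reproduced here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_zero_variance(rewards):
    """True when every sampled response received the same reward."""
    return len(set(rewards)) == 1


mixed = [1.0, 0.0, 0.0, 1.0]        # some responses correct, some not
all_correct = [1.0, 1.0, 1.0, 1.0]  # zero-variance prompt

print(is_zero_variance(mixed), group_advantages(mixed))
print(is_zero_variance(all_correct), group_advantages(all_correct))
```

In the all-correct (or all-incorrect) group every advantage is exactly zero, so the prompt yields no gradient unless some other signal is attached to it.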
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search: This paper proposes CC-BOS, a framework that exploits the semantic compression and inherent ambiguity of Classical Chinese, combined with a Fruit Fly Optimization Algorithm to search an eight-dimensional strategy space for optimal jailbreak prompts, achieving nearly 100% attack success rate across six mainstream LLMs.
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks: This paper systematically investigates how sparsity in MoE language models affects memorization and reasoning tasks differently: memorization tasks favor higher sparsity (more parameters), while reasoning performance peaks near a tokens-per-parameter (TPP) ratio of about 20. This trend remains consistent after GRPO post-training and with increased test-time compute.
Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check: This paper proposes an Answer-Then-Check strategy: the model first generates an intended answer summary in its chain-of-thought, then conducts safety analysis against a safety policy, and finally decides whether to output or refuse. After training on the constructed 80K ReSA dataset, the method achieves a 99.3% defense rate against 7 jailbreak attacks (RL variant), with only 500 samples needed to match full-dataset performance.
PURGE: Reinforcement Unlearning via Group Relative Policy Optimization: PURGE reformulates LLM unlearning as a verifiable RL task, employing the GRPO framework with intrinsic reward signals (penalizing mentions of forbidden concepts) to achieve safe and consistent knowledge removal. It consumes 46× fewer tokens than the prior state of the art while improving fluency by 5.48% and adversarial robustness by 12.02%.
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety: This work revisits the safety-constrained RLHF objective, proves the existence of a closed-form optimal policy, and derives an equivalent tractable objective, SafeDPO. The method requires only a safety-aware data transformation and a safety margin term (one additional hyperparameter) on top of standard DPO, without reward or cost models. It achieves a 96.87% harmlessness rate on PKU-SafeRLHF-30K while maintaining competitive helpfulness, and trains 25× faster than SafeRLHF.
Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study: Through four systematic experiments (parallel projection, orthogonal projection, subspace overlap, and activation space analysis) conducted across five open-source LLMs, this paper establishes a key finding: safety alignment behavior is highly entangled with general learning in both weight space and activation space, and no linearly separable independent safety subspace exists. Consequently, defense strategies based on subspace projection/filtering face fundamental limitations.
SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks: This paper proposes SEMA, a two-stage training framework consisting of prefilling self-tuning and RL with an intent-drift-aware reward. Without relying on any existing attack strategies or external data, SEMA trains an attacker capable of automatically generating multi-turn jailbreak attacks, achieving an average ASR@1 of 80.1% across three victim models on AdvBench — surpassing the prior state of the art by 33.9%.
Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment: This paper identifies a fundamental limitation of standard KL divergence regularization in RLHF: it compares token probabilities only at identical index positions, completely ignoring semantic similarity. The authors propose Wasserstein Policy Regularization (WPR), a semantic-aware policy regularization based on entropy-regularized Wasserstein distance. Through a dual formulation, WPR converts the regularization into token-level penalty terms compatible with standard RL algorithms such as PPO, and consistently outperforms KL divergence and various f-divergence baselines on dialogue generation and summarization tasks.
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy: This paper proposes a two-stage preference data curation pipeline based on Human-AI synergy. Stage 1 accumulates approximately 1M preference pairs over 8 iterative rounds via human verification, error-driven adaptive retrieval, and preference-guided LLM annotation. Stage 2 scales the dataset to 26M pairs using dual-RM consistency filtering. The resulting Skywork-Reward-V2 8B model achieves 97.8% on RewardBench and an average of 88.6% across 7 mainstream benchmarks, comprehensively surpassing all open-source 70B reward models.
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning: This paper proposes SFPO (Slow-Fast Policy Optimization), which decomposes each training step into a three-stage structure of "fast trajectory — reposition — slow correction." Without modifying the objective function or rollout procedure, SFPO serves as a plug-and-play enhancement to GRPO, achieving up to 2.80-point average improvement on mathematical reasoning benchmarks and up to 4.93× reduction in rollouts.
Superficial Safety Alignment Hypothesis: This paper proposes the Superficial Safety Alignment Hypothesis (SSAH): safety alignment is essentially teaching a model to perform an implicit binary classification task (execute vs. refuse), requiring only ~1.3% of neurons to establish safety guardrails. Freezing these safety-critical units during fine-tuning preserves safety, and leveraging redundant units as an "alignment budget" eliminates the alignment tax.
Swap-guided Preference Learning for Personalized RLHF (SPL): This paper addresses posterior collapse in Variational Preference Learning (VPL) by proposing SPL, which introduces swap-guided base regularization (forcing latent variables to encode user preferences rather than being ignored), a Preferential-IAF decomposition of swap-reversible and swap-invariant signals, and adaptive latent variable modulation. On Llama-3.1-8B, SPL achieves 63.71% accuracy and 97.10% active units, whereas VPL collapses to 57.14% accuracy and 0% active units.
Token-Importance Guided Direct Preference Optimization (TI-DPO): This paper proposes TI-DPO, which quantifies each token's contribution to preference via a hybrid weighting mechanism combining gradient attribution and a Gaussian prior, and incorporates a triplet loss to guide optimization in a continuous semantic space. The method achieves state-of-the-art performance with an average score of 62.3 across 6 benchmarks, while providing interpretable token-level control.
Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models (UltraBreak): This paper proposes UltraBreak, which combines a semantic adversarial objective (replacing cross-entropy with cosine similarity to produce a smooth loss landscape) and input-space constraints (random transformations + TV regularization to yield transformation-invariant features) to optimize a single universal adversarial image capable of jailbreaking 6+ VLM architectures and commercial models. The average black-box ASR reaches 71% on SafeBench, substantially outperforming prior methods.
Towards Understanding Valuable Preference Data for Large Language Model Alignment: This work studies preference data quality from a model-dependent perspective. It proposes Truncated Influence Functions (TIF), revealing that data with medium IF values—rather than high IF values as conventionally assumed—is most valuable. Two lightweight proxy metrics, LossDiff and IRM, are designed to approximate TIF. The combined LossDiff-IRM selector achieves an average WinRate improvement of 13.58% using only 50–64% of the data, with consistent effectiveness across multiple LLM families and alignment benchmarks.
Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs: This paper proposes Uni-DPO, which unifies dynamic reweighting of preference pairs via three components: quality-aware weighting (prioritizing pairs with large score margins), performance-aware weighting (a focal loss focusing on underfitted samples), and a calibrated NLL loss. It consistently outperforms DPO/SimPO on text understanding and mathematical reasoning benchmarks, with Gemma-2-9B achieving 67.1% on Arena-Hard, surpassing Claude 3 Opus (60.4%).
Unifying Stable Optimization and Reference Regularization in RLHF (DAR): This paper proposes DAR (Dual-regularized Advantage Regression), identifying that reference-model regularization (for preventing reward hacking) and policy stability constraints (for preventing collapse) in standard RLHF progressively conflict, excessively restricting the optimization space. DAR addresses this via a dual-KL objective that interpolates reference policies in log-space and applies a regression transformation to eliminate policy-ratio instability, achieving an average win rate of 92.42% in direct AI alignment and standard RLHF settings, surpassing GRPO by 7.27%.
Why DPO is a Misspecified Estimator and How to Fix It: This paper proves from an information-geometric perspective that DPO is fundamentally a misspecified statistical estimator under parameterized (non-tabular) policy classes—DPO projects the true reward function onto the implicit reward manifold via KL projection, leading to preference reversal and reward degradation when the reward is unrealizable—and proposes AuxDPO, which introduces null-space auxiliary variables to remedy this misspecification.