⚖️ Alignment & RLHF

🧠 NeurIPS2025 · 53 paper notes

A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs: This paper proposes an Adaptive Alpha aggregation strategy that dynamically adjusts reward weights based on each user group's historical alignment performance within a federated RLHF framework, simultaneously achieving high fairness and strong alignment performance for pluralistic preference alignment.
Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency: This paper proposes JAIL-CON, a jailbreak attack framework based on task concurrency. By interleaving harmful and benign tasks at the word level, it exploits LLMs' ability to handle concurrent tasks to bypass safety mechanisms, and the resulting concurrent outputs are also harder for guardrails to flag.
Alignment of Large Language Models with Constrained Learning: This paper proposes CAID (Constrained Alignment via Iterative Dualization), an iterative dualization method that alternately updates the LLM policy and dual variables. It theoretically establishes that the dual approach can identify the optimal constrained LLM policy (up to a parametrization gap), and empirically demonstrates significant improvements in constraint satisfaction and the helpfulness–safety trade-off on the PKU-SafeRLHF dataset.
Ask a Strong LLM Judge when Your Reward Model is Uncertain: This paper proposes an uncertainty-based routing framework that applies SNGP to a pairwise reward model for uncertainty quantification, routing high-epistemic-uncertainty samples to a strong LLM judge (DeepSeek-R1). At a judge invocation cost of only 9.2%–42.5%, the approach significantly outperforms random routing in accuracy and demonstrably improves downstream online RLHF alignment.
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs: This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with identical refusal answers, driving the model to overfit into a sharp region of the loss landscape; Stage 2 fine-tunes on the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.
Can DPO Learn Diverse Human Values? A Theoretical Scaling Law: This paper establishes a theoretical generalization framework for DPO under diverse human value settings. By analyzing the dynamic trajectory of reward margins after a finite number of gradient steps, it proves that the number of samples required per value must grow logarithmically with the number of value categories \(K\) (i.e., \(Q = \Theta(\log K)\)) to maintain generalization performance, thereby revealing the statistical cost of aligning with diverse societal values.
Capturing Individual Human Preferences with Reward Features: This paper proposes the Reward Feature Model (RFM), which learns shared reward features \(\phi_\theta(x,y)\) such that each user obtains a personalized reward \(r_h = \langle \phi_\theta, \mathbf{w}_h \rangle\) via a linear weight vector \(\mathbf{w}_h\). The work provides the first PAC generalization bound for multi-annotator preference learning, proving that increasing the number of annotators \(m\) is more effective than increasing per-annotator sample count \(n\), and that as few as 30 samples suffice for fast adaptation to new users.
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO: This paper proposes DeepVideo-R1, which reformulates GRPO as Reg-GRPO that directly regresses advantage values (eliminating clipping/min safeguards), and mitigates the vanishing advantage problem via difficulty-aware data augmentation, achieving up to 10.1 percentage points improvement over standard GRPO on video reasoning tasks.
DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models: This paper identifies and addresses the motion bias problem in video DPO through three components: it constructs structurally aligned video pairs by noising and denoising ground-truth videos, which fixes the motion dimension; it annotates dense preferences at the temporal-segment level for more precise learning signals; and it leverages off-the-shelf VLMs for automatic annotation to reduce cost. Using only 1/3 of the annotation data, the method substantially improves motion generation quality while matching visual quality and text alignment.
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization: This paper proposes the Latent Reward Model (LRM) and Latent Preference Optimization (LPO), which repurpose the pretrained diffusion model itself as a noise-aware latent-space reward model to perform step-level preference optimization directly in the noisy latent space. Compared to Diffusion-DPO, LPO achieves a 10–28× training speedup; compared to SPO, it achieves a 2.5–3.5× speedup.
DP²O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution: This paper proposes DP²O-SR, a framework that exploits the inherent stochasticity of diffusion models to generate diverse super-resolution outputs, constructs preference pairs via a hybrid perceptual reward, and introduces a Hierarchical Preference Optimization (HPO) strategy to adaptively weight training pairs — significantly improving perceptual quality in real-world image super-resolution without any human annotations.
EvoRefuse: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions: This paper proposes EvoRefuse, a framework that employs evolutionary search (mutation/recombination + ELBO fitness function + simulated annealing) to automatically generate semantically benign yet reliably refusal-triggering "pseudo-malicious" instructions. The resulting EvoRefuse-Test benchmark achieves an 85.34% higher refusal trigger rate and 34.86% greater lexical diversity than the strongest baseline, while the EvoRefuse-Align dataset reduces over-refusal by 29.85%–45.96% via SFT/DPO fine-tuning without compromising safety.
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring: This paper proposes the Streaming Content Monitor (SCM)—the first harmful content monitor natively designed for partial detection. Built upon the FineHarm dataset (29K samples with token-level annotations) and hierarchical consistency-aware learning, SCM achieves a macro F1 of 0.95+ after observing on average only 18% of response tokens, enabling real-time early stopping of harmful LLM outputs.
g-DPO: Scalable Preference Optimization for Protein Language Models: To address the quadratic growth of preference pairs with respect to sample size when applying DPO to protein language models (PLMs)—which renders training intractable—this paper proposes g-DPO: (1) redundant preference pairs are pruned via union-mask-based clustering in sequence space, retaining more informative comparisons within local neighborhoods; (2) grouped likelihood amortization via shared union masks enables computation of log-likelihoods for all sequences within a group in a single forward pass. Across three protein engineering tasks, g-DPO achieves statistically indistinguishable in silico and in vitro performance compared to standard DPO, while delivering 1.7–5.4× training speedups.
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs: This paper proposes GASP, a framework that trains a dedicated SuffixLLM to generate human-readable adversarial suffixes. It employs Latent Bayesian Optimization (LBO) to efficiently search the continuous embedding space and iteratively fine-tunes the generator via ORPO, achieving high attack success rates in a fully black-box setting while maintaining suffix readability.
Generalizing while Preserving Monotonicity in Comparison-based Preference Learning Models: This paper proposes Linear GBT with Diffusion Prior, a class of preference learning models that simultaneously guarantee monotonicity (the score of the preferred item never paradoxically decreases after a comparison) and generalization to uncompared items, thereby affirmatively answering the central question of whether generalization and monotonicity can coexist.
Greedy Sampling Is Provably Efficient for RLHF: This paper proves that, under KL-regularized RLHF, directly applying greedy sampling based on empirical estimates—without constructing optimistic or pessimistic confidence sets—achieves \(O(\log T)\) regret in the online setting and \(O(\varepsilon^{-1})\) sample complexity in the offline setting. These are the first results of such order under general preference models.
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training: GVPO is a more stable LLM post-training method than GRPO, derived by embedding the analytical solution of KL-constrained reward maximization into gradient weights (zero-sum weights eliminate the partition function). It achieves 20.72% on AIME (vs. GRPO's 14.79%) and is proven to possess a unique global optimum.
Human-assisted Robotic Policy Refinement via Action Preference Optimization: This paper proposes Action Preference Optimization (APO), a human-robot collaboration framework that collects interactive trajectories and applies preference alignment to VLA models using binary desirability signals grounded in prospect theory and an adaptive reweighting scheme, enabling the model to learn from failures and improve iteratively.
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay: Two complementary techniques are proposed to improve the data efficiency of LLM reinforcement fine-tuning (GRPO): (1) DOTS — an attention-based mechanism for predicting adaptive difficulty that prioritizes moderate-difficulty questions to maximize gradient signal; and (2) Rollout Replay — reusing recent rollouts to reduce per-step computational overhead. Together, these techniques reduce training time by an average of 40.7% across 6 model–dataset combinations.
Inference-time Alignment in Continuous Space: This paper proposes Simple Energy Adaptation (SEA), which shifts the inference-time alignment paradigm from discrete-space search to continuous-space optimization. By performing gradient-based Langevin sampling over the continuous logit space, SEA approximates the optimal RLHF policy, achieving a 77.51% relative improvement over the strongest baseline on AdvBench and a 16.36% improvement on MATH.
Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models: This paper proposes a policy-based (rather than example-based) evaluation framework for LLM red teaming, along with the Jailbreak-Zero method. By employing a simple large-scale parallel sampling strategy—requiring no manually crafted jailbreak tactics—the method achieves attack success rates of 99.5% on GPT-4o and 96.0% on Claude 3.5 on HarmBench, while attaining Pareto optimality across three objectives—coverage, diversity, and fidelity—through fine-tuning.
KL Penalty Control via Perturbation for Direct Preference Optimization: This paper proposes ε-DPO, which achieves instance-level adaptive KL penalty control by monitoring the monotonicity of logits—used as preference model outputs—under small perturbations of \(\beta\) during training. The method incurs no additional computational overhead and significantly outperforms DPO and most direct alignment algorithms, achieving a 46.4% LC win rate on AlpacaEval 2 (vs. 40.3% for DPO).
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits: This work frames the selection of multiple reward models (RMs) as a contextual multi-armed bandit (LinUCB) problem, adaptively choosing the most suitable RM for each training batch during iterative LLM training. LASeR comprehensively outperforms RM ensemble and single-RM baselines on reasoning, instruction-following, and long-context tasks, while achieving a 2–3× efficiency advantage.
Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis: This paper proposes LENS, a framework that synthesizes preference data pairs in the latent space of LLM embeddings via a VAE, bypassing costly text generation and achieving substantial improvements in reward model performance at dramatically reduced computational cost (16,000× smaller model, 18× faster generation).
LLM Safety Alignment is Divergence Estimation in Disguise: This paper establishes a unified theoretical framework demonstrating that alignment methods such as RLHF, DPO, KTO, and BCO are essentially estimating the divergence between a safe distribution \(\mathcal{D}^+\) and an unsafe distribution \(\mathcal{D}^-\). This perspective explains the latent-space separation phenomenon observed after alignment. Building on this insight, the paper proposes KLDO, a KL divergence-based alignment method that achieves state-of-the-art robustness across 5 models.
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization: LongVPO proposes a two-stage DPO framework. Stage 1 constructs pseudo-long-video preference data by anchoring short clips and introduces an anchor-only reference model approximation to address context-length mismatch. Stage 2 performs self-training on real long videos via recursive captioning and multi-clip reasoning tasks. Using only 16K synthetic samples, the method surpasses long-video models trained with large-scale supervised data.
Mechanism Design for LLM Fine-tuning with Multiple Reward Models: This paper formulates multi-party preference aggregation in RLHF fine-tuning as a mechanism design problem. It proves that under social-welfare-maximizing training rules, participants have incentives to misreport their preferences, and achieves dominant-strategy incentive compatibility (DSIC) via an extended VCG payment mechanism that ensures truthful reporting.
MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation: This paper proposes MetaDefense, a two-stage (pre-generation + mid-generation) defense framework that trains the LLM itself to predict the harmfulness of queries and partial responses, defending against finetuning-based jailbreak attacks without external classifiers, achieving 2× memory efficiency.
Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization: This paper proposes SymMPO (Symmetric Multimodal Preference Optimization), which addresses two key limitations of existing vision-augmented DPO methods—namely, theoretically unsound objective functions and indirect preference supervision—through symmetric paired preference learning over contrastive images and preference margin consistency regularization. Consistent performance gains are achieved across five hallucination benchmarks.
Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability: This paper systematically studies Multi-Environment POMDPs (ME-POMDPs)—a class of POMDP ensembles sharing state, action, and observation spaces but with arbitrarily different transition, observation, and reward functions—with the goal of finding a robust policy that maximizes reward under the worst-case environment. By introducing the Adversarial Belief POMDP (AB-POMDP) as a unified model and establishing its equivalence to one-sided partially observable stochastic games (POSGs), the paper proposes both exact (value iteration + LP) and approximate (AB-HSVI) algorithms.
On Extending Direct Preference Optimization to Accommodate Ties: This paper replaces the Bradley-Terry preference model in DPO with the Rao-Kupper and Davidson extensions, enabling preference optimization to explicitly model "tie" data. This avoids discarding ambiguous preference pairs and yields improved regularization and performance on translation and mathematical reasoning tasks.
ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation: This paper proposes ORPO-Distill, which reformulates cross-architecture LLM knowledge distillation as a preference optimization problem. The teacher model generates positive reasoning chains while the student model generates negative ones; an ORPO contrastive loss is used for training, augmented by a mixed-policy update strategy for student negative samples. The method consistently outperforms black-box KD baselines across 5 QA benchmarks.
PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors: This paper proposes PolyJuice, the first black-box, image-agnostic red teaming method for synthetic image detectors (SIDs). By discovering and exploiting a "realism direction" in the latent space of text-to-image (T2I) models, PolyJuice universally steers generated images to fool detectors, achieving an attack success rate of up to 84%.
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma: This paper formalizes the recurring safety–fairness–efficiency tensions in RLHF as an "alignment trilemma": it proves that no RLHF system can simultaneously satisfy \(\varepsilon\)-representativeness (faithfully reflecting diverse values), polynomial tractability (computational feasibility), and \(\delta\)-robustness (resistance to adversarial attacks), thereby providing a unified complexity-theoretic explanation for pathological phenomena such as preference collapse and sycophancy observed in current RLHF systems.
Preference Learning with Lie Detectors can Induce Honesty or Evasion: This paper systematically investigates the effects of integrating lie detectors into the LLM preference learning annotation pipeline (the SOLiD framework), finding that whether a trained model becomes genuinely honest or learns to evade detection depends on three key factors: the degree of exploration (GRPO vs. DPO), detector accuracy (TPR), and KL regularization strength.
Preference Optimization by Estimating the Ratio of the Data Distribution: This paper reinterprets DPO as a likelihood ratio (ratio matching) estimation problem and proposes BPO (Bregman Preference Optimization) under a Bregman divergence framework. BPO defines a generalized family of loss functions that subsumes DPO as a special case, and introduces the SBA (Scaled Basu's Power Divergence) instantiation, achieving a state-of-the-art 55.9% AlpacaEval2 length-controlled win rate on Llama-3-8B.
Provably Efficient Online RLHF with One-Pass Reward Modeling: This paper proposes a one-pass reward modeling method based on online mirror descent (OMD) that eliminates the computational bottleneck in online RLHF — namely, storing all historical data and re-optimizing from scratch at each iteration — achieving \(\mathcal{O}(1)\) time and memory complexity per iteration while also improving upon MLE methods in statistical efficiency.
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models: RL fine-tuning of LLMs updates only 5%–30% of parameters in practice (sparse subnetworks), and these subnetworks exhibit high consistency across different random seeds, datasets, and algorithms. Fine-tuning only the identified subnetwork can reproduce both the performance and the parameter values of full fine-tuning.
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning: This paper proposes ResponseRank, a method that robustly learns utility differences by exploiting local relative differences in proxy signals of preference strength (e.g., response time and annotator agreement), significantly improving the sample efficiency of reward models.
Rethinking Direct Preference Optimization in Diffusion Models: To address two core issues in DPO for diffusion models — limited exploration and reward scale imbalance — this paper proposes a stable reference model update strategy and a timestep-aware training strategy, both of which can be integrated into various preference optimization algorithms.
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization: This paper proposes two robust DPO variants—WDPO (Wasserstein) and KLDPO (KL divergence)—under a distributionally robust optimization (DRO) framework to address alignment failures caused by shifts in user preference distributions. The approach provides \(O(n^{-1/4})\) convergence guarantees and achieves significant improvements over standard DPO on multi-dimensional alignment tasks and the OpenLLM leaderboard.
SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism: By analyzing the propagation mechanism of harmful tokens in multimodal LLMs, this work finds that fewer than 1% of tokens trigger jailbreak behavior in early-to-middle layers. Based on this finding, the training-free SafePTR framework is proposed, which prunes harmful tokens at vulnerable layers and restores benign features in subsequent layers, significantly improving safety without sacrificing task performance.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning: This work is the first to systematically apply the Constrained Markov Decision Process (CMDP) framework from Safe Reinforcement Learning (SafeRL) to safety alignment of Vision-Language-Action (VLA) models. Through a four-stage Integrated Safety Approach (ISA)—Model, Elicit, Constrain, and Assure—the method achieves an 83.58% reduction in safety violation costs on mobile manipulation tasks while maintaining task performance (+3.85%).
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization: This paper proposes RRPO (Refined Regularized Preference Optimization), which replaces DPO's response-level rewards with subsequence-level fine-grained rewards and token-wise KL regularization. Combined with a self-alignment data generation framework, RRPO reduces hallucinations and improves temporal reasoning on video understanding tasks.
Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: This paper theoretically proves and empirically validates that defending against suffix jailbreak attacks of length \(\Theta(M)\) requires adversarial training on suffixes of only length \(\Theta(\sqrt{M})\)—i.e., "short adversarial training defends against long jailbreaks." Across five mainstream LLMs, adversarial training with 20-token suffixes reduces the attack success rate (ASR) of 120-token jailbreak attacks by at least 30%.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning: This paper identifies that reference model bias in NPO (Negative Preference Optimization) causes two problems: optimization power is allocated unevenly across forget data, and gradient weight smoothing fails in the early stage of training. The proposed SimNPO eliminates the dependency on a reference model and adopts length-normalized rewards, improving forget quality (FQ) from 0.79 to 0.99 on TOFU and consistently outperforming NPO across all benchmarks.
Strategyproof Reinforcement Learning from Human Feedback: This paper is the first to study strategic manipulation by annotators in RLHF from a mechanism design perspective. It proves a fundamental tradeoff between strategyproofness and policy alignment, and proposes the Pessimistic Median of MLEs algorithm to achieve approximate strategyproofness.
T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning: This paper proposes T-SHIRT, a data selection framework that introduces Selective IFD (considering only informative tokens) and a hierarchical selection strategy (preferring samples with high neighborhood consistency). Fine-tuning on only 5% of data selected by T-SHIRT surpasses training on the full dataset, while the selection process requires only GPT-2 and 40 minutes on a single GPU.
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons: Through a mechanistic interpretability lens, this work identifies a sparse set of "safety neurons" comprising approximately 5% of all neurons in LLMs. Patching only these neurons' activations recovers over 90% of safety performance, and the neuron-overlap perspective offers a mechanistic explanation for the alignment tax phenomenon.
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training: This paper proposes TBA (Trajectory Balance with Asynchrony), which combines the GFlowNet Trajectory Balance (TB) objective with an asynchronous distributed RL architecture to decouple exploration and learning in LLM post-training, achieving 4–50× speedups without performance degradation across mathematical reasoning, preference fine-tuning, and automated red-teaming tasks.
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning: TBRM minimizes trajectory-level Bellman residuals by treating LLM output logits as implicit Q-values, requiring only a single forward rollout per prompt during training. This yields substantially lower complexity than PPO/GRPO while achieving comparable or superior performance on mathematical reasoning benchmarks.
What Makes a Reward Model a Good Teacher? An Optimization Perspective: From an optimization-theoretic perspective, this paper proves that reward model accuracy alone is insufficient to measure its quality as an RLHF "teacher." Even a perfectly accurate reward model can lead to a flat RLHF objective landscape and extremely slow policy gradient optimization if the induced reward variance is too low. Moreover, different language models require different reward models.
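The reward-variance point in the last entry can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes a four-output softmax policy, ignores KL regularization, and uses made-up reward values. All function and variable names are hypothetical. Both reward models rank the outputs exactly as the ground truth does (pairwise accuracy 1.0), but one of them assigns nearly identical rewards to the outputs the policy currently favors, so its reward variance under the policy, and with it the gradient of the expected reward with respect to the logits, is much smaller.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pairwise_accuracy(reward, true_reward):
    # Fraction of output pairs that the reward model orders like the ground truth.
    n = len(reward)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum(
        (reward[i] < reward[j]) == (true_reward[i] < true_reward[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def reward_variance(probs, reward):
    # Variance of the reward under the policy's output distribution.
    mean = sum(p * r for p, r in zip(probs, reward))
    return sum(p * (r - mean) ** 2 for p, r in zip(probs, reward))

def grad_norm(probs, reward):
    # For J(theta) = sum_i pi_i * r_i with pi = softmax(theta),
    # dJ/dtheta_i = pi_i * (r_i - J). Returns the Euclidean norm of that gradient.
    mean = sum(p * r for p, r in zip(probs, reward))
    grad = [p * (r - mean) for p, r in zip(probs, reward)]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical setup: the initial policy puts most of its mass on outputs 0 and 1.
policy = softmax([2.0, 2.0, -2.0, -2.0])
true_reward = [0.0, 1.0, 2.0, 3.0]

# Same ranking as the ground truth, different spacing (illustrative values).
rm_separating = [0.0, 1.0, 1.1, 1.2]    # separates the two likely outputs
rm_compressed = [0.0, 0.01, 0.02, 1.2]  # barely separates the two likely outputs

for name, rm in [("separating", rm_separating), ("compressed", rm_compressed)]:
    print(
        name,
        "accuracy:", pairwise_accuracy(rm, true_reward),
        "variance:", round(reward_variance(policy, rm), 4),
        "grad norm:", round(grad_norm(policy, rm), 4),
    )
```

Because the gradient depends on the policy's own output distribution, the same reward model can induce high variance for one policy and low variance for another, which is one way to read the entry's closing remark that different language models require different reward models.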