⚖️ Alignment & RLHF
🧠 NeurIPS2025 · 36 paper notes
🔥 Top topics: LLM ×12 · Alignment/RLHF ×7 · Adversarial Robustness ×5 · Reinforcement Learning ×2
- Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency
-
This paper proposes JAIL-CON, a jailbreak attack framework based on task concurrency. By interleaving harmful and benign tasks at the word level, it exploits LLMs' ability to handle concurrent tasks to bypass safety mechanisms, and the resulting concurrent outputs are harder for guardrails to detect.
- Alignment of Large Language Models with Constrained Learning
-
This paper proposes CAID (Constrained Alignment via Iterative Dualization), an iterative dualization method that alternately updates the LLM policy and dual variables. It theoretically establishes that the dual approach can identify the optimal constrained LLM policy (up to a parametrization gap), and empirically demonstrates significant improvements in constraint satisfaction and the helpfulness–safety trade-off on the PKU-SafeRLHF dataset.
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
-
This work introduces the Infinity-Chat dataset (26K open-ended real-world user queries with 31,250 human annotations) to expose the "Artificial Hivemind" phenomenon in language models: severe intra-model repetition and inter-model homogeneity in open-ended generation. It also demonstrates that reward models and LM judges are poorly calibrated on samples with high inter-annotator preference divergence.
- Ask a Strong LLM Judge when Your Reward Model is Uncertain
-
This paper proposes an uncertainty-based routing framework that applies SNGP to a pairwise reward model for uncertainty quantification, routing high-epistemic-uncertainty samples to a strong LLM judge (DeepSeek-R1). At a judge invocation cost of only 9.2%–42.5%, the approach significantly outperforms random routing in accuracy and demonstrably improves downstream online RLHF alignment.
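The routing rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the uncertainty scores are taken as given (the SNGP estimator is not reproduced), and `route_preferences`, `judge`, and `threshold` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


def route_preferences(
    rm_probs: Sequence[float],
    rm_uncertainty: Sequence[float],
    judge: Callable[[int], bool],
    threshold: float,
) -> Tuple[List[bool], float]:
    """Decide each pair with the reward model unless it is too uncertain.

    rm_probs[i] is the reward model's probability that the first response
    of pair i is preferred; rm_uncertainty[i] is its uncertainty score
    (assumed to be supplied by some estimator). Returns the decisions
    (True = first response preferred) and the fraction sent to the judge.
    """
    decisions: List[bool] = []
    judged = 0
    for i, (prob, unc) in enumerate(zip(rm_probs, rm_uncertainty)):
        if unc > threshold:
            # Uncertain pair: defer to the (expensive) strong judge.
            decisions.append(judge(i))
            judged += 1
        else:
            # Confident pair: trust the reward model's own prediction.
            decisions.append(prob >= 0.5)
    return decisions, judged / len(rm_probs)
```

Raising `threshold` lowers the share of pairs sent to the judge, which is the cost/accuracy knob the summary refers to.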
- Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
-
This paper proposes a two-stage fine-tuning attack: Stage 1 fine-tunes an LLM on 10 benign questions paired with identical refusal answers, driving the model to overfit into a sharp region of the loss landscape; Stage 2 fine-tunes on the same 10 questions with normal answers, triggering catastrophic forgetting of safety alignment. Using entirely benign data, the method achieves a 94.84% attack success rate (ASR), comparable to malicious fine-tuning (97.25%), while completely evading content moderation.
- Can DPO Learn Diverse Human Values? A Theoretical Scaling Law
-
This paper establishes a theoretical generalization framework for DPO under diverse human value settings. By analyzing the dynamic trajectory of reward margins after a finite number of gradient steps, it proves that the number of samples required per value must grow logarithmically with the number of value categories \(K\) (i.e., \(Q = \Theta(\log K)\)) to maintain generalization performance, thereby revealing the statistical cost of aligning with diverse societal values.
- Capturing Individual Human Preferences with Reward Features
-
This paper proposes the Reward Feature Model (RFM), which learns shared reward features \(\phi_\theta(x,y)\) such that each user obtains a personalized reward \(r_h = \langle \phi_\theta, \mathbf{w}_h \rangle\) via a linear weight vector \(\mathbf{w}_h\). The work provides the first PAC generalization bound for multi-annotator preference learning, proving that increasing the number of annotators \(m\) is more effective than increasing per-annotator sample count \(n\), and that as few as 30 samples suffice for fast adaptation to new users.
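As a rough illustration of the personalized reward \(r_h = \langle \phi_\theta, \mathbf{w}_h \rangle\), the sketch below adapts \(\mathbf{w}_h\) to a new user from a handful of comparisons while the shared features stay frozen. It assumes a Bradley-Terry likelihood and plain gradient ascent, neither of which the summary specifies; the feature vectors are hard-coded stand-ins for \(\phi_\theta(x,y)\), and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def personalized_reward(phi: Vector, w_h: Vector) -> float:
    """r_h = <phi, w_h> for one (prompt, response) feature vector."""
    return dot(phi, w_h)


def fit_user_weights(
    pairs: List[Tuple[Vector, Vector]], dim: int, lr: float = 0.5, steps: int = 200
) -> List[float]:
    """Fit w_h from (phi_chosen, phi_rejected) pairs.

    Assumption: preferences follow a Bradley-Terry model, so we ascend the
    log-likelihood of sigmoid(<w, phi_chosen - phi_rejected>).
    """
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi_c, phi_r in pairs:
            diff = [c - r for c, r in zip(phi_c, phi_r)]
            p = 1.0 / (1.0 + math.exp(-dot(w, diff)))
            for k in range(dim):
                grad[k] += (1.0 - p) * diff[k]
        w = [wk + lr * g / len(pairs) for wk, g in zip(w, grad)]
    return w
```

Only `dim` numbers are learned per user, which is consistent with the summary's point that a few dozen samples can suffice for adaptation.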
- DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
-
This paper proposes DeepVideo-R1, which reformulates GRPO as Reg-GRPO that directly regresses advantage values (eliminating clipping/min safeguards), and mitigates the vanishing advantage problem via difficulty-aware data augmentation, achieving an improvement of up to 10.1 percentage points over standard GRPO on video reasoning tasks.
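To see why advantages can vanish, it helps to look at the group-normalized advantage used in standard GRPO. The sketch below shows only that baseline computation, not Reg-GRPO or the paper's augmentation; using the population standard deviation and the `eps` constant are assumptions of this sketch.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A prompt that is too easy (or too hard) gives identical rewards, so every
# advantage is zero and the group contributes no learning signal.
too_easy = group_advantages([1.0, 1.0, 1.0, 1.0])

# A prompt of moderate difficulty gives mixed rewards and non-zero advantages.
moderate = group_advantages([1.0, 0.0, 1.0, 0.0])
```

This is the usual reading of the "vanishing advantage" issue: groups whose rollouts all succeed or all fail carry no gradient.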
- EvoRefuse: Evaluating and Mitigating LLM Over-Refusal via Evolutionary Prompt Optimization
-
This paper proposes EvoRefuse, a framework that employs evolutionary search to maximize the ELBO for automatically generating diverse pseudo-malicious instructions, yielding a more challenging over-refusal evaluation benchmark (EvoRefuse-Test) and an effective alignment mitigation dataset (EvoRefuse-Align).
- From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
-
This paper proposes the Streaming Content Monitor (SCM)—the first harmful content monitor natively designed for partial detection. Built upon the FineHarm dataset (29K samples with token-level annotations) and hierarchical consistency-aware learning, SCM achieves a macro F1 of 0.95+ after observing on average only 18% of response tokens, enabling real-time early stopping of harmful LLM outputs.
- GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
-
This paper proposes GASP, a framework that trains a dedicated SuffixLLM to generate human-readable adversarial suffixes. It employs Latent Bayesian Optimization (LBO) to efficiently search the continuous embedding space and iteratively fine-tunes the generator via ORPO, achieving high attack success rates in a fully black-box setting while maintaining suffix readability.
- Generalizing while Preserving Monotonicity in Comparison-based Preference Learning Models
-
This paper proposes Linear GBT with Diffusion Prior, a class of preference learning models that simultaneously guarantee monotonicity (the score of the preferred item never paradoxically decreases after a comparison) and generalization to uncompared items, thereby affirmatively answering the central question of whether generalization and monotonicity can coexist.
- Greedy Sampling Is Provably Efficient for RLHF
-
This paper proves that, under KL-regularized RLHF, directly applying greedy sampling based on empirical estimates—without constructing optimistic or pessimistic confidence sets—achieves \(O(\log T)\) regret in the online setting and \(O(\varepsilon^{-1})\) sample complexity in the offline setting. These are the first results of such order under general preference models.
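For intuition, in the reward-based case the KL-regularized objective has the well-known closed-form maximizer \(\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\exp(r(x,y)/\beta)\). One reading of "greedy sampling" is to plug the empirical reward estimate into this formula with no confidence bonus or penalty. The sketch below does that over a finite response set; it is an illustration under that assumption and does not cover the general preference model treated in the paper.

```python
import math
from typing import List, Sequence


def kl_regularized_policy(
    ref_probs: Sequence[float], est_rewards: Sequence[float], beta: float
) -> List[float]:
    """Return pi(y) proportional to ref_probs[y] * exp(est_rewards[y] / beta).

    est_rewards are point estimates used as-is (no optimism or pessimism).
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, est_rewards)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal estimated rewards the function returns the reference policy, and a smaller `beta` concentrates mass on the highest-reward response.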
- GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
-
GVPO is a more stable LLM post-training method than GRPO, derived by embedding the analytical solution of KL-constrained reward maximization into gradient weights (zero-sum weights eliminate the partition function). It achieves 20.72% on AIME (vs. GRPO's 14.79%) and is proven to possess a unique global optimum.
- Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
-
Two complementary techniques are proposed to improve the data efficiency of LLM reinforcement fine-tuning (GRPO): (1) DOTS — an attention-based mechanism for predicting adaptive difficulty that prioritizes moderate-difficulty questions to maximize gradient signal; and (2) Rollout Replay — reusing recent rollouts to reduce per-step computational overhead. Together, these techniques reduce training time by an average of 40.7% across 6 model–dataset combinations.
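A minimal sketch of difficulty-targeted selection follows. It assumes binary rewards, for which the reward variance \(p(1-p)\) peaks at a success rate of \(p = 0.5\), and it takes predicted success rates as given (the attention-based predictor is not reproduced). Picking the top `k` deterministically is a simplification; `select_moderate` is a hypothetical name.

```python
from typing import List, Sequence


def select_moderate(pred_success: Sequence[float], k: int) -> List[int]:
    """Return indices of the k questions whose predicted success rate is
    closest to 0.5, i.e. the ones expected to give the most gradient signal."""
    ranked = sorted(
        range(len(pred_success)), key=lambda i: abs(pred_success[i] - 0.5)
    )
    return ranked[:k]


# Questions the model always or never solves are deprioritized.
chosen = select_moderate([0.0, 0.45, 1.0, 0.7, 0.5], k=2)
```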
- Inference-time Alignment in Continuous Space
-
This paper proposes Simple Energy Adaptation (SEA), which shifts the inference-time alignment paradigm from discrete-space search to continuous-space optimization. By performing gradient-based Langevin sampling over the continuous logit space, SEA approximates the optimal RLHF policy, achieving a 77.51% relative improvement over the strongest baseline on AdvBench and a 16.36% improvement on MATH.
- Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
-
This paper proposes a policy-based (rather than example-based) evaluation framework for LLM red teaming, along with the Jailbreak-Zero method. By employing a simple large-scale parallel sampling strategy—requiring no manually crafted jailbreak tactics—the method achieves attack success rates of 99.5% on GPT-4o and 96.0% on Claude 3.5 on HarmBench, while attaining Pareto optimality across three objectives—coverage, diversity, and fidelity—through fine-tuning.
- LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
-
This work frames the selection of multiple reward models (RMs) as a contextual multi-armed bandit (LinUCB) problem, adaptively choosing the most suitable RM for each training batch during iterative LLM training. LASeR comprehensively outperforms RM ensemble and single-RM baselines on reasoning, instruction-following, and long-context tasks, while achieving a 2–3× efficiency advantage.
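The LinUCB rule named above can be sketched as follows, with one arm per reward model. This is the textbook disjoint LinUCB update (with a Sherman-Morrison update of the inverse matrix), not the paper's code; what goes into the context vector `x` and how the bandit reward is defined are left as assumptions.

```python
import math
from typing import List, Sequence


class LinUCBArm:
    """One arm (one candidate reward model) with a ridge-regression estimate."""

    def __init__(self, dim: int) -> None:
        # a_inv starts as the inverse of the identity matrix; b starts at zero.
        self.a_inv = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, v: Sequence[float]) -> List[float]:
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x: Sequence[float], alpha: float) -> float:
        theta = self._a_inv_times(self.b)
        mean = sum(t * xi for t, xi in zip(theta, x))
        a_inv_x = self._a_inv_times(x)
        bonus = alpha * math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + bonus

    def update(self, x: Sequence[float], reward: float) -> None:
        # Sherman-Morrison: (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_arm(arms: Sequence[LinUCBArm], x: Sequence[float], alpha: float = 1.0) -> int:
    """Pick the reward model with the highest upper confidence bound."""
    scores = [arm.ucb(x, alpha) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)
```

In a training loop one would call `select_arm` once per batch, train with the chosen reward model, then call `update` on that arm with the observed bandit reward.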
- Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
-
This paper proposes LENS, a framework that synthesizes preference data pairs in the latent space of LLM embeddings via a VAE, bypassing costly text generation and achieving substantial improvements in reward model performance at dramatically reduced computational cost (16,000× smaller model, 18× faster generation).
- LLM Safety Alignment is Divergence Estimation in Disguise
-
This paper establishes a unified theoretical framework demonstrating that alignment methods such as RLHF, DPO, KTO, and BCO are essentially estimating the divergence between a safe distribution \(\mathcal{D}^+\) and an unsafe distribution \(\mathcal{D}^-\). This perspective explains the latent-space separation phenomenon observed after alignment. Building on this insight, the paper proposes KLDO, a KL divergence-based alignment method that achieves state-of-the-art robustness across 5 models.
- Mechanism Design for LLM Fine-tuning with Multiple Reward Models
-
This paper formulates multi-party preference aggregation in RLHF fine-tuning as a mechanism design problem. It proves that under social-welfare-maximizing training rules, participants have incentives to misreport their preferences, and achieves dominant-strategy incentive compatibility (DSIC) via an extended VCG payment mechanism that ensures truthful reporting.
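For readers unfamiliar with VCG, the sketch below shows the classic mechanism over a finite set of outcomes: pick the welfare-maximizing outcome and charge each participant the externality they impose on the others. The paper's extended version for fine-tuning rules is not reproduced here; treating candidate fine-tuned models as discrete outcomes is an assumption of this illustration.

```python
from typing import List, Optional, Sequence, Tuple


def vcg(valuations: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """valuations[i][o] is participant i's reported value for outcome o.

    Returns the welfare-maximizing outcome and each participant's payment.
    """
    n_outcomes = len(valuations[0])

    def welfare(outcome: int, skip: Optional[int] = None) -> float:
        # Total reported value of an outcome, optionally ignoring one participant.
        return sum(v[outcome] for i, v in enumerate(valuations) if i != skip)

    chosen = max(range(n_outcomes), key=welfare)
    payments = []
    for i in range(len(valuations)):
        best_without_i = max(welfare(o, skip=i) for o in range(n_outcomes))
        # Payment = harm that i's presence causes to everyone else.
        payments.append(best_without_i - welfare(chosen, skip=i))
    return chosen, payments


# Participant 0 prefers outcome 0; participants 1 and 2 prefer outcome 1.
outcome, payments = vcg([[3, 0], [0, 2], [0, 2]])
```

Because each payment depends only on the others' reports, misreporting cannot lower a participant's own payment for a given outcome, which is the intuition behind truthful reporting being a dominant strategy.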
- MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation
-
This paper proposes MetaDefense, a two-stage (pre-generation + mid-generation) defense framework that trains the LLM itself to predict the harmfulness of queries and partial responses, defending against finetuning-based jailbreak attacks without external classifiers, achieving 2× memory efficiency.
- Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability
-
This paper systematically studies Multi-Environment POMDPs (ME-POMDPs)—a class of POMDP ensembles sharing state, action, and observation spaces but with arbitrarily different transition, observation, and reward functions—with the goal of finding a robust policy that maximizes reward under the worst-case environment. By introducing the Adversarial Belief POMDP (AB-POMDP) as a unified model and establishing its equivalence to one-sided partially observable stochastic games (POSGs), the paper proposes both exact (value iteration + LP) and approximate (AB-HSVI) algorithms.
- PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors
-
This paper proposes PolyJuice, the first black-box, image-agnostic red teaming method for synthetic image detectors (SIDs). By discovering and exploiting a "realism direction" in the latent space of text-to-image (T2I) models, PolyJuice universally steers generated images to fool detectors, achieving an attack success rate of up to 84%.
- Preference Learning with Lie Detectors can Induce Honesty or Evasion
-
This paper systematically investigates the effects of integrating lie detectors into the LLM preference learning annotation pipeline (the SOLiD framework), finding that whether a trained model becomes genuinely honest or learns to evade detection depends on three key factors: the degree of exploration (GRPO vs. DPO), detector accuracy (TPR), and KL regularization strength.
- Preference Optimization by Estimating the Ratio of the Data Distribution
-
This paper reinterprets DPO as a likelihood ratio (ratio matching) estimation problem and proposes BPO (Bregman Preference Optimization) under a Bregman divergence framework. BPO defines a generalized family of loss functions that subsumes DPO as a special case, and introduces the SBA (Scaled Basu's Power Divergence) instantiation, achieving a state-of-the-art 55.9% AlpacaEval2 length-controlled win rate on Llama-3-8B.
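As a reference point for the special case mentioned above, the standard per-pair DPO loss is \(-\log\sigma\big(\beta[(\log\pi_\theta(y_w) - \log\pi_{\text{ref}}(y_w)) - (\log\pi_\theta(y_l) - \log\pi_{\text{ref}}(y_l))]\big)\). The sketch below computes it from sequence log-probabilities; it shows only this baseline, not the BPO family or the SBA instantiation, and the default `beta` is an arbitrary choice.

```python
import math


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* are log-probabilities under the policy, ref_logp_* under the
    reference model; *_w is the chosen response, *_l the rejected one.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The argument of the sigmoid is a scaled difference of log density ratios, which is what makes a ratio-estimation view of DPO natural.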
- Provably Efficient Online RLHF with One-Pass Reward Modeling
-
This paper proposes a one-pass reward modeling method based on online mirror descent (OMD) that eliminates the computational bottleneck in online RLHF — namely, storing all historical data and re-optimizing from scratch at each iteration — achieving \(\mathcal{O}(1)\) time and memory complexity per iteration while also improving upon MLE methods in statistical efficiency.
- Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
-
RL fine-tuning of LLMs updates only 5%–30% of parameters in practice (sparse subnetworks), and these subnetworks exhibit high consistency across different random seeds, datasets, and algorithms. Fine-tuning only the identified subnetwork can reproduce both the performance and the parameter values of full fine-tuning.
- ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
-
This paper proposes ResponseRank, a method that robustly learns utility differences by exploiting local relative differences in proxy signals of preference strength (e.g., response time and annotator agreement), significantly improving the sample efficiency of reward models.
- SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism
-
By analyzing the propagation mechanism of harmful tokens in multimodal LLMs, this work finds that fewer than 1% of tokens trigger jailbreak behavior in early-to-middle layers. Based on this finding, the training-free SafePTR framework is proposed, which prunes harmful tokens at vulnerable layers and restores benign features in subsequent layers, significantly improving safety without sacrificing task performance.
- Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks
-
This paper theoretically proves and empirically validates that defending against suffix jailbreak attacks of length \(\Theta(M)\) requires adversarial training on suffixes of only length \(\Theta(\sqrt{M})\)—i.e., "short adversarial training defends against long jailbreaks." Across five mainstream LLMs, adversarial training with 20-token suffixes reduces the attack success rate (ASR) of 120-token jailbreak attacks by at least 30%.
- Strategyproof Reinforcement Learning from Human Feedback
-
This paper is the first to study strategic manipulation by annotators in RLHF from a mechanism design perspective. It proves a fundamental tradeoff between strategyproofness and policy alignment, and proposes the Pessimistic Median of MLEs algorithm to achieve approximate strategyproofness.
- T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning
-
This paper proposes T-SHIRT, a data selection framework that introduces Selective IFD (considering only informative tokens) and a hierarchical selection strategy (preferring samples with high neighborhood consistency). Fine-tuning on only 5% of data selected by T-SHIRT surpasses training on the full dataset, while the selection process requires only GPT-2 and 40 minutes on a single GPU.
- Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
-
Through a mechanistic interpretability lens, this work identifies a sparse set of "safety neurons" comprising approximately 5% of all neurons in LLMs. Patching only these neurons' activations recovers over 90% of safety performance, and the neuron-overlap perspective offers a mechanistic explanation for the alignment tax phenomenon.
- Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
-
This paper proposes TBA (Trajectory Balance with Asynchrony), which combines the GFlowNet Trajectory Balance (TB) objective with an asynchronous distributed RL architecture to decouple exploration and learning in LLM post-training, achieving 4–50× speedups without performance degradation across mathematical reasoning, preference fine-tuning, and automated red-teaming tasks.
- What Makes a Reward Model a Good Teacher? An Optimization Perspective
-
From an optimization-theoretic perspective, this paper proves that a reward model's accuracy alone is insufficient to measure its quality as an RLHF "teacher." Even a perfectly accurate reward model can lead to a flat RLHF objective landscape and extremely slow policy gradient optimization if the induced reward variance is too low. Moreover, because reward variance depends on the policy being trained, a reward model that works well for one language model may work poorly for another.
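The distinction between accuracy and reward variance can be made concrete with a toy example. The two hypothetical reward models below rank three responses identically, so they have the same pairwise accuracy, yet they induce very different reward variance under the same policy. All numbers are made up for illustration.

```python
from typing import List, Sequence


def reward_variance(policy_probs: Sequence[float], rewards: Sequence[float]) -> float:
    """Variance of the reward when responses are sampled from the policy."""
    mean = sum(p * r for p, r in zip(policy_probs, rewards))
    return sum(p * (r - mean) ** 2 for p, r in zip(policy_probs, rewards))


def ranking(rewards: Sequence[float]) -> List[int]:
    """Response indices ordered from lowest to highest reward."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i])


policy = [0.98, 0.01, 0.01]      # current policy over three responses
rm_wide = [0.0, 0.5, 1.0]        # well-separated rewards
rm_narrow = [0.0, 0.001, 0.002]  # same ordering, tiny separation

same_ranking = ranking(rm_wide) == ranking(rm_narrow)
var_wide = reward_variance(policy, rm_wide)
var_narrow = reward_variance(policy, rm_narrow)
```

Under the paper's result, the low-variance model would be expected to give a flatter objective and slower policy gradient progress despite equal accuracy.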