
⚖️ Alignment & RLHF

🤖 AAAI2026 · 20 paper notes

Align to Structure: Aligning Large Language Models with Structural Information

This paper proposes Structural Alignment, which integrates linguistic discourse structure frameworks (surface-level text structure scoring and an RST-based discourse motif classifier) into PPO training and introduces a dense reward mechanism based on discourse motifs. This enables LLMs to generate more coherent, human-like long-form text, outperforming standard RLHF models on academic essay writing and long-document summarization.
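
As a rough illustration of what a discourse-motif dense reward could look like (the names and the credit-assignment scheme below are assumptions for illustration, not the paper's implementation): per-segment scores from a motif classifier are spread over the tokens of each discourse unit, and a whole-text surface-structure score is added at the final token, as in standard terminal-reward RLHF.

```python
# Hedged sketch of a dense, discourse-motif-based reward for PPO.
# segment_token_spans / motif_scores / surface_score are placeholder inputs
# standing in for the paper's RST segmentation and classifiers.
import torch

def dense_structural_reward(segment_token_spans, motif_scores, surface_score):
    """segment_token_spans: list of (start, end) token-index pairs, one per discourse unit.
    motif_scores: per-segment scalar scores from a motif classifier (same length).
    surface_score: a single scalar for whole-text surface structure."""
    total_len = max(end for _, end in segment_token_spans)
    rewards = torch.zeros(total_len)
    for (start, end), score in zip(segment_token_spans, motif_scores):
        rewards[start:end] = score / (end - start)  # spread segment credit over its tokens
    rewards[-1] += surface_score                    # terminal bonus at the last token
    return rewards

print(dense_structural_reward([(0, 5), (5, 12)], [0.8, -0.2], surface_score=0.5))
```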

AlignTree: Efficient Defense Against LLM Jailbreak Attacks

AlignTree leverages internal LLM activation features — combining linear refusal directions with nonlinear SVM signals — to train a lightweight random forest classifier that efficiently detects jailbreak attacks with negligible computational overhead, achieving state-of-the-art reductions in attack success rate (ASR).
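
As a rough sketch of the detector design (not the authors' code), assuming per-prompt activation vectors have already been extracted: compute a linear score by projecting onto a refusal direction, a nonlinear score from an RBF-SVM, and train a small random forest over the two signals.

```python
# Minimal sketch of an AlignTree-style jailbreak detector over LLM activations.
# The activations and labels below are random placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))        # placeholder hidden-state activations
labels = rng.integers(0, 2, size=1000)    # placeholder jailbreak (1) vs. benign (0) labels
X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)

# Linear signal: projection onto a "refusal direction" (here, a difference of class means).
refusal_dir = X_tr[y_tr == 0].mean(0) - X_tr[y_tr == 1].mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

# Nonlinear signal: decision value of an RBF-SVM trained on the same activations.
svm = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

# Lightweight random forest over the two scalar signals.
feat_tr = np.column_stack([X_tr @ refusal_dir, svm.decision_function(X_tr)])
feat_te = np.column_stack([X_te @ refusal_dir, svm.decision_function(X_te)])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(feat_tr, y_tr)
print("held-out accuracy:", forest.score(feat_te, y_te))
```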

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment

This paper proposes AMaPO, an algorithm that dynamically modulates gradient magnitudes via instance-level adaptive margins (combining Z-normalization and exponential scaling) to address the core overfitting-underfitting dilemma in offline preference optimization methods such as DPO, thereby substantially improving ranking accuracy and downstream alignment performance.
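
The paper's exact margin formula is not reproduced here; the sketch below shows one way an instance-level adaptive margin could be attached to a DPO-style loss, with the reward gap Z-normalized over the batch and passed through an exponential scaling. Both the specific scaling and the stop-gradient choice are assumptions for illustration.

```python
# Hedged sketch of an adaptive-margin preference loss in the spirit of AMaPO.
import torch
import torch.nn.functional as F

def adaptive_margin_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                             beta: float = 0.1, tau: float = 1.0):
    """All inputs are per-example summed log-probabilities, shape (B,)."""
    # Implicit reward gap, as in DPO.
    gap = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Instance-level adaptive margin: Z-normalize the gap over the batch, then
    # exponentially scale so hard pairs (small or negative gap) receive a larger margin.
    z = (gap - gap.mean()) / (gap.std() + 1e-6)
    margin = torch.exp(-z / tau).detach()
    # Margin-attached logistic loss: a larger margin yields a larger gradient on that pair.
    return -F.logsigmoid(gap - margin).mean()

print(adaptive_margin_dpo_loss(*[torch.randn(8) for _ in range(4)]))
```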

BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models

This paper reveals that the ethical biases introduced by LLM safety alignment can themselves be exploited as jailbreak attack vectors (keywords associated with marginalized groups yield jailbreak success rates up to 20% higher than keywords associated with privileged groups) and proposes BiasDefense, a lightweight prompt-based defense method.

DeCoRL: Decoupling Reasoning Chains via Parallel Sub-Step Generation and Cascaded Reinforcement for Interpretable and Scalable RLHF

DeCoRL transforms CoT reasoning from monolithic sequential processing into an "orchestra-style" modular parallel collaboration — nine specialized sub-models (parsing / semantic / entity / fact-checking / style / quality / computation / verification / integration) generate reasoning sub-steps in parallel, coordinated via dual reward attribution (local quality + contribution) and cascaded DRPO optimization, achieving 80.8% on RM-Bench (surpassing all baselines), a 3.8× inference speedup, and a 22.7% improvement in interpretability.

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

This work deconstructs the internal representations of LLM safety alignment from the conventional "single refusal direction" into two functionally independent directions — a harm detection direction and a refusal execution direction — and proposes the DBDI framework, which applies adaptive projection elimination and direct steering to intervene on each direction separately, achieving a 97.88% attack success rate (ASR) on Llama-2.
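
A minimal sketch of the two intervention primitives, with placeholder direction vectors (the paper's procedure for estimating the directions and its adaptive coefficients are omitted): projection elimination removes the component along an estimated harm-detection direction, and direct steering shifts hidden states along an estimated refusal-execution direction.

```python
# Illustrative sketch of directional interventions on hidden states (assumed interface).
import torch

def remove_direction(h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Project hidden states h (..., dim) off the unit direction d (dim,)."""
    d = d / d.norm()
    return h - (h @ d).unsqueeze(-1) * d

def steer(h: torch.Tensor, d: torch.Tensor, alpha: float) -> torch.Tensor:
    """Shift hidden states along the unit direction d with strength alpha."""
    d = d / d.norm()
    return h + alpha * d

dim = 4096
h = torch.randn(2, 16, dim)      # (batch, seq, hidden)
harm_dir = torch.randn(dim)      # placeholder harm-detection direction
refuse_dir = torch.randn(dim)    # placeholder refusal-execution direction

h = remove_direction(h, harm_dir)        # suppress the harm-detection signal
h = steer(h, refuse_dir, alpha=-2.0)     # steer against refusal execution
print(h.shape)
```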

EASE: Practical and Efficient Safety Alignment for Small Language Models

This paper proposes EASE, a safety alignment framework for edge-deployed small language models (SLMs), which addresses the tension between "shallow refusal being insufficiently robust" and "deep reasoning being prohibitively expensive" via a two-stage design. Stage one distills safety reasoning capabilities from a large reasoning model into the SLM; stage two applies selective reasoning activation, enabling reasoning only for adversarial queries in vulnerable semantic regions while responding directly to benign queries. EASE reduces jailbreak attack success rate by 17% compared to shallow alignment, while cutting reasoning overhead by 90% compared to full-reasoning alignment.
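
One plausible reading of the selective reasoning activation stage, sketched with assumed names: route a query to the expensive safety-reasoning path only when its embedding lies close to known vulnerable semantic regions, and answer directly otherwise.

```python
# Hedged sketch of selective reasoning activation (assumed mechanism, not EASE's code).
import numpy as np

def needs_reasoning(query_emb, vulnerable_centroids, threshold=0.8):
    """Return True if the query is near any centroid of a vulnerable semantic region."""
    sims = vulnerable_centroids @ query_emb / (
        np.linalg.norm(vulnerable_centroids, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return float(sims.max()) > threshold

centroids = np.vstack([np.ones(8), -np.ones(8)])  # placeholder vulnerable-region centroids
print(needs_reasoning(np.ones(8), centroids))                            # True: reasoning path
print(needs_reasoning(np.array([1, -1] * 4, dtype=float), centroids))    # False: direct answer
```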

Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal States

This paper proposes EAGLE, a method that estimates uncertainty by aggregating logits from multiple intermediate hidden layers of an LLM and computing the expectation of the resulting confidence distribution. EAGLE requires no additional trainable parameters and reduces ECE from 12.6% to 3.2% while improving AUROC from 59.0% to 61.6% across multiple datasets and models.
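
A minimal sketch of the aggregation idea, assuming intermediate hidden states are decoded with the shared unembedding matrix (logit-lens style) and the per-layer confidences are simply averaged; EAGLE's exact layer selection and weighting may differ.

```python
# Hedged sketch: confidence from aggregated intermediate-layer logits.
import torch

def aggregated_confidence(hidden_states, unembed, token_id, layers):
    """hidden_states: list of (hidden_dim,) vectors, one per layer, for one position.
    unembed: (vocab, hidden_dim) output embedding matrix.
    Returns the mean probability assigned to token_id over the chosen layers."""
    confs = []
    for l in layers:
        probs = torch.softmax(unembed @ hidden_states[l], dim=-1)
        confs.append(probs[token_id])
    return torch.stack(confs).mean()   # expectation over the layer-wise confidences

# Toy example with random weights.
vocab, dim, n_layers = 100, 32, 12
unembed = torch.randn(vocab, dim)
hs = [torch.randn(dim) for _ in range(n_layers)]
print(aggregated_confidence(hs, unembed, token_id=7, layers=range(4, 12)))
```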

EPO: Diverse and Realistic Protein Ensemble Generation via Energy Preference Optimization

This paper proposes EPO (Energy Preference Optimization), which combines reverse SDE sampling with listwise energy-ranked preference optimization to align a pretrained protein generator with the target Boltzmann distribution using only energy signals. EPO achieves state-of-the-art performance across 9 metrics on three benchmarks (Tetrapeptides, ATLAS, and Fast-Folding), entirely eliminating the need for expensive molecular dynamics (MD) simulations.
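
The listwise, energy-ranked part of the objective can be illustrated with a Plackett-Luce style loss over a set of sampled conformations, treating lower energy as preferred; this is a sketch of the general idea, not the paper's exact objective, and the reverse-SDE sampling step is omitted.

```python
# Hedged sketch of a listwise preference loss ranked by energy (lower energy preferred).
import torch

def listwise_energy_preference_loss(logp: torch.Tensor, energy: torch.Tensor):
    """logp: model log-likelihoods of K sampled conformations, shape (K,).
    energy: their energies, shape (K,)."""
    order = torch.argsort(energy)          # rank samples from lowest to highest energy
    s = logp[order]
    # Plackett-Luce negative log-likelihood of the energy-induced ranking.
    loss = 0.0
    for i in range(len(s)):
        loss = loss - (s[i] - torch.logsumexp(s[i:], dim=0))
    return loss / len(s)

print(listwise_energy_preference_loss(torch.randn(6), torch.randn(6)))
```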

Exploring the Effects of Alignment on Numerical Bias in Large Language Models

This paper systematically demonstrates that the LLM alignment process (instruction tuning + preference tuning) is the root cause of numerical bias in LLM evaluators, and validates that score range adjustment is the most effective mitigation strategy.

GRAM-R²: Self-Training Generative Foundation Reward Models for Reward Reasoning

This paper proposes GRAM-R², a generative foundation reward model that elicits reward reasoning capabilities on unlabeled data via self-training. The model simultaneously produces preference labels and reasoning rationales, consistently outperforming both discriminative and generative baselines across multiple downstream tasks including response ranking, task adaptation, and RLHF.

Importance-Aware Data Selection for Efficient LLM Instruction Tuning

This paper proposes MIWV (Model Instruction Weakness Value), a metric that measures the importance of each instruction sample for improving model capability by comparing LLM loss with and without a one-shot ICL demonstration. Using only 1% (520 samples) of the Alpaca dataset, the method comprehensively outperforms fine-tuning on the full 52,002 samples.
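
Under one reading of the summary, the metric is a loss difference: how much a one-shot in-context demonstration reduces the loss on a candidate sample, with a large gap flagging a model weakness. The sketch below assumes a Hugging Face style causal-LM interface; the prompt formatting and the choice of demonstration are placeholders.

```python
# Hedged sketch of a MIWV-style score as a with/without-demonstration loss difference.
import torch

@torch.no_grad()
def lm_loss(model, tokenizer, prompt: str, answer: str) -> float:
    """Average negative log-likelihood of `answer` given `prompt` (prompt length is
    approximated by re-tokenizing the prompt alone)."""
    full = tokenizer(prompt + answer, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    labels = full.input_ids.clone()
    labels[:, :prompt_len] = -100          # score only the answer tokens
    return model(**full, labels=labels).loss.item()

def miwv(model, tokenizer, demo, sample) -> float:
    """Larger value = the demonstration helps more, i.e. the sample probes a weakness."""
    zero_shot = lm_loss(model, tokenizer, sample["instruction"], sample["output"])
    one_shot = lm_loss(
        model, tokenizer,
        demo["instruction"] + demo["output"] + "\n\n" + sample["instruction"],
        sample["output"],
    )
    return zero_shot - one_shot
```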

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

This paper proposes MaPO (Margin-aware Preference Optimization), a reference-free preference alignment method that aligns T2I diffusion models by directly optimizing the likelihood margin between preferred and dispreferred outputs under the Bradley-Terry model. MaPO outperforms DPO and task-specific methods across 5 domains, including style adaptation, safety generation, and general preference alignment.
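
A hedged sketch of a reference-free, margin-based Bradley-Terry loss, with the diffusion-specific likelihood terms abstracted into per-sample log-probability surrogates; the small regularizer on the preferred sample is an assumption, not necessarily MaPO's exact formulation.

```python
# Hedged sketch of a reference-free margin preference loss.
import torch
import torch.nn.functional as F

def margin_aware_loss(logp_win: torch.Tensor, logp_lose: torch.Tensor, beta: float = 0.1):
    """logp_win / logp_lose: log-likelihood surrogates of the preferred and dispreferred
    samples under the model being trained, shape (B,). No frozen reference is used."""
    margin = logp_win - logp_lose
    # Bradley-Terry objective on the likelihood margin, plus an assumed regularizer that
    # keeps the preferred sample's likelihood from collapsing.
    return (-F.logsigmoid(beta * margin) - 0.01 * logp_win).mean()

print(margin_aware_loss(torch.randn(4), torch.randn(4)))
```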

MetaGDPO: Alleviating Catastrophic Forgetting with Metacognitive Knowledge through Group Direct Preference Optimization

This paper proposes MetaGDPO, which addresses catastrophic forgetting in reasoning distillation for small models (<8B) from two complementary perspectives: (1) the data side, constructing a 5K dataset (MetaKL) based on metacognitive knowledge annotation; and (2) the training side, introducing GDPO—a DPO variant that replaces GRPO's online sampling with offline response groups generated by a large model.

On the Exponential Convergence for Offline RLHF with Pairwise Comparisons

Under the offline RLHF pairwise comparison setting, this paper proposes the RL-LOW algorithm achieving exponential convergence \(\exp(-\Omega(n/H))\) for simple regret, and derives the first instance-dependent lower bound proving this rate is optimal in the exponential sense.
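
Spelling out the claim with notation introduced here purely for illustration, where \(\hat{\pi}_n\) is the policy returned after \(n\) pairwise comparisons, \(\pi^*\) the optimal policy, and \(H\) the instance-dependent hardness quantity appearing in the paper's bound:

\[
\mathrm{SimpleRegret}(n) \;=\; \mathbb{E}\big[\,V(\pi^*) - V(\hat{\pi}_n)\,\big] \;\le\; \exp\big(-\Omega(n/H)\big),
\]

with the matching instance-dependent lower bound showing that the \(n/H\) dependence in the exponent cannot be improved.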

Reducing the Scope of Language Models

This paper systematically evaluates LLM "scoping" methods—restricting deployed LLMs to respond only to in-domain queries while rejecting all out-of-domain requests. Five approaches (prompting / SFT / DPO / probing / Circuit Breakers) are compared across 3 model families and multiple tasks. Key findings: SFT performs best under high data diversity, Circuit Breakers (CB) excel under low diversity, and a hierarchical combination (SFT→CB) preserves the strengths of both. A central finding is that the effectiveness of scoping is highly dependent on training data diversity.

Rethinking Direct Preference Optimization in Diffusion Models

Two orthogonal and plug-and-play improvement strategies are proposed to enhance preference optimization in diffusion models: stable reference model updating (relaxing the frozen constraint with a regularization anchor) and timestep-aware training (adaptive weighting to balance reward scales across timesteps). Both strategies can be embedded into various preference optimization algorithms such as DPO and IPO, achieving state-of-the-art performance on human preference evaluation benchmarks.
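
Both ideas are simple to sketch: an EMA update that lets the reference model slowly track the policy instead of staying frozen, and a timestep-dependent weight applied to the per-timestep preference loss. The decay value and the weighting function below are assumptions for illustration, not the paper's exact choices.

```python
# Hedged sketch of the two plug-in strategies for diffusion preference optimization.
import torch

@torch.no_grad()
def ema_update(ref_model, model, decay: float = 0.999):
    """Stable reference updating: the reference slowly tracks the policy (regularization anchor)."""
    for p_ref, p in zip(ref_model.parameters(), model.parameters()):
        p_ref.mul_(decay).add_(p, alpha=1.0 - decay)

def timestep_weight(t: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """Timestep-aware weighting (assumed form): down-weight noisier late timesteps so
    reward scales stay balanced; combine as (timestep_weight(t) * per_timestep_loss).mean()."""
    return 1.0 / (1.0 + t.float() / T)

print(timestep_weight(torch.tensor([0, 500, 999])))
```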

SafeNlidb: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces

This paper proposes SafeNlidb, a framework that jointly optimizes safety reasoning and SQL generation in LLM-driven Natural Language Interfaces to Databases (NLIDBs) through a safety-aware data synthesis pipeline and an alternating preference optimization strategy, effectively defending against privacy leakage under implicit inference attacks.

W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search

This paper proposes W2S-AlignTree, the first inference-time alignment framework that integrates Monte Carlo Tree Search (MCTS) with the weak-to-strong generalization (W2SG) paradigm. It leverages step-level proxy value functions derived from a weak model to guide the generation of a strong model at inference time, achieving significant improvements over baselines across sentiment control, summarization, and instruction-following tasks — with a 15.9% gain on the Llama3-8B summarization task.

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

To address the pervasive "preference flipping" problem in human preference annotation, this paper proposes FA-DPO (Flipping-Aware DPO), which models the annotation process as a two-stage procedure consisting of "true human intent + instance-dependent flipping probability." By correcting the BT model loss and iteratively optimizing a flipping estimation module, FA-DPO substantially improves alignment robustness under various noise conditions, achieving up to a 16.7% gain over DPO when instance-dependent flipping rates are high.
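
The core correction can be sketched as a mixture likelihood: the observed label follows the true Bradley-Terry preference with probability \(1-\epsilon\) and is flipped with instance-dependent probability \(\epsilon\). The \(\epsilon\) values below are placeholders; the paper estimates them with an iteratively optimized flipping-estimation module.

```python
# Hedged sketch of a flipping-aware preference loss.
import torch

def flipping_aware_loss(gap: torch.Tensor, eps: torch.Tensor):
    """gap: implicit reward difference (chosen minus rejected), shape (B,).
    eps: estimated per-instance flip probability in [0, 0.5), shape (B,)."""
    p_true = torch.sigmoid(gap)                       # BT probability of the annotated order
    p_obs = (1 - eps) * p_true + eps * (1 - p_true)   # mixture with the flipped order
    return -torch.log(p_obs + 1e-8).mean()

gap = torch.randn(8)
eps = torch.full((8,), 0.1)   # placeholder for the learned flip-probability estimates
print(flipping_aware_loss(gap, eps))
```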