
⚖️ Alignment & RLHF

💬 ACL2026 · 13 paper notes

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

This paper proposes Plan-RewardBench, a trajectory-level preference benchmark for complex tool-augmented scenarios that evaluates whether reward models can distinguish superior from inferior agent trajectories across multi-step planning, tool-use, and error-recovery settings.
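
The note does not spell out the benchmark's metric, but trajectory-level preference evaluation usually reduces to pairwise accuracy: the reward model should score the preferred trajectory higher. A minimal sketch of that protocol, where `score_trajectory` and the trajectory schema are hypothetical stand-ins:

```python
from typing import Callable, Dict, List

def pairwise_accuracy(
    pairs: List[Dict],
    score_trajectory: Callable[[List[Dict]], float],  # hypothetical reward model under test
) -> float:
    """Fraction of pairs where the preferred trajectory receives the higher score.

    Each pair is assumed to hold two multi-step agent trajectories
    (lists of {thought, tool_call, observation} steps) plus a preference label.
    """
    correct = 0
    for pair in pairs:
        chosen_score = score_trajectory(pair["chosen"])      # preferred trajectory
        rejected_score = score_trajectory(pair["rejected"])  # dispreferred trajectory
        correct += int(chosen_score > rejected_score)
    return correct / max(len(pairs), 1)
```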

Alignment Data Map for Efficient Preference Data Selection and Diagnosis

This paper proposes the Alignment Data Map, an analytical tool that visualizes, selects, and diagnoses preference data by jointly considering response quality and variability. Using only 33% of the data, it achieves alignment performance comparable to full-data training.
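
The summary does not define the map's axes precisely; one plausible reading is that each preference pair is placed by the mean of its reward margin (quality) and the spread of that margin across repeated measurements (variability), and a subset is then selected under a budget. A hedged numpy sketch of that kind of selection, where the 33% budget comes from the summary and everything else is an illustrative assumption:

```python
import numpy as np

def select_by_data_map(margins: np.ndarray, budget: float = 0.33) -> np.ndarray:
    """Select preference pairs from a (n_pairs, n_measurements) matrix of reward
    margins (chosen minus rejected), keeping a `budget` fraction of the data.

    Quality     = mean margin across measurements.
    Variability = std of the margin across measurements.
    Keeping the highest combined score is an illustrative heuristic; the
    paper's actual selection criterion may differ.
    """
    quality = margins.mean(axis=1)
    variability = margins.std(axis=1)
    score = quality + variability          # favor informative, non-trivial pairs
    k = int(len(margins) * budget)
    return np.argsort(-score)[:k]          # indices of selected pairs

# Example: 1000 pairs measured at 5 checkpoints.
rng = np.random.default_rng(0)
idx = select_by_data_map(rng.normal(size=(1000, 5)))
print(len(idx))  # -> 330
```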

Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs

This paper proposes a representativeness evaluation framework for demographic-aligned LLMs that looks beyond marginal distributions alone, jointly examining per-question response distributions and cross-question correlation structures. The findings reveal that while fine-tuning and persona prompting improve the approximation of marginal distributions, neither faithfully reproduces the multivariate correlation patterns observed in human values surveys.
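
To make the two evaluation levels concrete, here is a small sketch that compares model and human survey responses first question by question (marginals) and then via the cross-question correlation matrix. The specific distance measures are illustrative choices, not necessarily the paper's:

```python
import numpy as np

def marginal_distance(human: np.ndarray, model: np.ndarray, n_options: int) -> float:
    """Mean total-variation distance between per-question answer distributions.

    `human` and `model` are (n_respondents, n_questions) integer arrays of
    categorical answers in {0, ..., n_options - 1}.
    """
    dists = []
    for q in range(human.shape[1]):
        p = np.bincount(human[:, q], minlength=n_options) / len(human)
        r = np.bincount(model[:, q], minlength=n_options) / len(model)
        dists.append(0.5 * np.abs(p - r).sum())
    return float(np.mean(dists))

def correlation_distance(human: np.ndarray, model: np.ndarray) -> float:
    """Frobenius distance between cross-question Pearson correlation matrices."""
    c_h = np.corrcoef(human, rowvar=False)
    c_m = np.corrcoef(model, rowvar=False)
    return float(np.linalg.norm(c_h - c_m))
```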

ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

ConsistRM proposes a consistency-aware self-training framework for generative reward models (GRMs). It introduces two modules — temporal consistency pseudo-labels (integrating online-state and memory-driven preference consistency) and semantic consistency critique rewards (measuring semantic similarity across multiple generated critiques) — achieving an average improvement of 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.
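
The semantic-consistency reward, as described, measures agreement among multiple critiques sampled for the same response. A minimal sketch, assuming a hypothetical `embed` function that maps text to a vector; the pairwise-averaging scheme is an assumption on top of the summary:

```python
from itertools import combinations
from typing import Callable, List

import numpy as np

def semantic_consistency_reward(
    critiques: List[str],
    embed: Callable[[str], np.ndarray],  # hypothetical text-embedding function
) -> float:
    """Average pairwise cosine similarity across critiques of the same response.

    High agreement among independently sampled critiques is treated as a
    self-supervised reward signal; ConsistRM's exact formulation may differ.
    """
    vecs = [embed(c) for c in critiques]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    sims = [float(a @ b) for a, b in combinations(vecs, 2)]
    return float(np.mean(sims)) if sims else 0.0
```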

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

This paper demonstrates that domain-specific contexts (e.g., chemistry papers) selectively relax LLM safeguards on related harmful knowledge (vertical unlocking), while security research contexts trigger broad relaxation across all harmful categories (general unlocking). Based on these findings, the authors propose the Jargon attack framework, achieving over 93% attack success rate (ASR) on seven frontier models including GPT-5.2 and Claude-4.5.

Reward Modeling for Scientific Writing Evaluation

This paper proposes SciRM and SciRM-Ref, two open-source reward models tailored for scientific writing evaluation. Through two-stage reinforcement learning (GRPO) that separately optimizes evaluation preference and reasoning ability, these models achieve fine-grained multi-aspect evaluation across diverse scientific writing tasks and generalize to unseen evaluation tasks and criteria.

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

This paper proposes Fission-GRPO, which dynamically converts tool execution errors into on-policy corrective training instances within the RL training loop. A learned error simulator generates diagnostic feedback, and recovery trajectories are resampled from the augmented context. The approach improves the error recovery rate of Qwen3-8B by 5.7% and raises overall accuracy from 42.75% to 46.75%.
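
The core loop described above, turning an execution error into on-policy corrective instances, can be sketched as follows. `error_simulator`, `policy_sample`, and the trajectory schema are hypothetical stand-ins, since the note gives no interfaces:

```python
from typing import Callable, Dict, List

def fission_expand(
    trajectory: List[Dict],
    error_simulator: Callable[[Dict], str],             # hypothetical: failed step -> diagnostic text
    policy_sample: Callable[[List[Dict]], List[Dict]],  # hypothetical: context -> sampled continuation
    n_recovery_samples: int = 4,
) -> List[List[Dict]]:
    """Convert tool-execution errors in a trajectory into corrective training instances.

    For each failed tool call, append simulator feedback to the prefix and
    resample recovery continuations on-policy; the augmented trajectories can
    then join the GRPO group. The step schema here is illustrative.
    """
    new_instances = []
    for i, step in enumerate(trajectory):
        if step.get("tool_error"):
            diagnostic = error_simulator(step)
            prefix = trajectory[: i + 1] + [{"role": "feedback", "content": diagnostic}]
            for _ in range(n_recovery_samples):
                new_instances.append(prefix + policy_sample(prefix))
    return new_instances
```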

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

This paper proposes a Simple-to-Hard (S2H) DPO framework that constructs multi-image preference data across three progressively harder levels (anchored reasoning → cross-image comparison → global visual search), systematically improving VLM multi-image reasoning while preserving single-image performance.

SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging

This paper proposes SafeMERGE, a lightweight post-fine-tuning framework that detects layers deviating from safe behavior via cosine similarity, and selectively merges only those layers with their counterparts from a safety model. Across four LLMs, the method significantly reduces harmful outputs while maintaining or even improving task performance.
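
A minimal sketch of the selective layer-wise merge described above, operating on PyTorch state dicts. The similarity threshold, the per-tensor granularity, and the simple weighted average are assumptions on top of the summary:

```python
from typing import Dict

import torch

def safemerge(
    finetuned: Dict[str, torch.Tensor],  # fine-tuned model state_dict
    safety: Dict[str, torch.Tensor],     # safety-aligned reference state_dict
    threshold: float = 0.98,             # assumed cosine-similarity threshold
    alpha: float = 0.5,                  # assumed merge weight toward the safety model
) -> Dict[str, torch.Tensor]:
    """Merge only the parameters that drift from the safety reference.

    For each weight tensor, compute cosine similarity between the flattened
    fine-tuned and safety weights; if it falls below the threshold, the layer
    is treated as deviating from safe behavior and interpolated toward the
    safety model, otherwise the fine-tuned weights are kept untouched.
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_safe = safety[name]
        cos = torch.nn.functional.cosine_similarity(
            w_ft.flatten(), w_safe.flatten(), dim=0
        )
        if cos < threshold:
            merged[name] = (1 - alpha) * w_ft + alpha * w_safe
        else:
            merged[name] = w_ft.clone()
    return merged
```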

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

This paper proposes SFTMix, a Mixup-based instruction-tuning method. It partitions SFT data into high-confidence and low-confidence subsets via training dynamics, then applies Mixup regularization by linearly interpolating examples from the two subsets in hidden-representation space, as sketched below. The recipe consistently improves instruction-following ability across LLM families and dataset scales without relying on high-quality curated datasets.
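
A hedged sketch of the Mixup step: one example from the high-confidence subset and one from the low-confidence subset are interpolated in hidden-representation space, and their losses are mixed with the same coefficient. The Beta prior and the layer at which interpolation happens are assumptions:

```python
import torch

def sftmix_loss(
    hidden_conf: torch.Tensor,    # hidden states of a high-confidence example, shape (T, d)
    hidden_unconf: torch.Tensor,  # hidden states of a low-confidence example, shape (T, d)
    labels_conf: torch.Tensor,    # target token ids for the confident example, shape (T,)
    labels_unconf: torch.Tensor,  # target token ids for the unconfident example, shape (T,)
    lm_head: torch.nn.Linear,     # maps hidden states to vocabulary logits
    beta_alpha: float = 0.3,      # assumed Beta(alpha, alpha) Mixup prior
) -> torch.Tensor:
    """Mixup regularization between confidence-partitioned SFT examples.

    Hidden states are linearly interpolated; the cross-entropy losses on the
    two original label sequences are mixed with the same lambda.
    """
    lam = torch.distributions.Beta(beta_alpha, beta_alpha).sample()
    mixed = lam * hidden_conf + (1 - lam) * hidden_unconf
    logits = lm_head(mixed)
    ce = torch.nn.functional.cross_entropy
    return lam * ce(logits, labels_conf) + (1 - lam) * ce(logits, labels_unconf)
```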

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

This paper proposes STAR-Teaming, an automated red-teaming framework built on a Strategy-Response Multiplex Network, which frames attack strategy selection as probabilistic optimization of an inverse Ising problem. The framework achieves an average attack success rate (ASR) of 74.5% on HarmBench, outperforming the strongest baseline by 13.5%, while significantly reducing computational overhead.

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

This paper identifies the "reward-generation gap" in Direct Alignment Algorithms (DAAs), a mismatch between training objectives and autoregressive decoding dynamics. It proposes POET (Prefix-Oriented Equal-length Training), which truncates both responses in a preference pair to the length of the shorter one, implicitly constraining optimization across all timesteps of the token-level MDP, and achieves improvements of up to 11.8 percentage points on AlpacaEval 2.
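
The equal-length truncation itself is simple; a sketch of preprocessing a preference pair before the usual DAA loss, where the tensors hold response token ids (padding and masking details of a real pipeline are omitted):

```python
import torch

def poet_truncate(chosen_ids: torch.Tensor, rejected_ids: torch.Tensor):
    """Truncate both responses of a preference pair to the shorter length.

    POET's idea, as summarized above, is that training only on equal-length
    prefixes removes the length mismatch between chosen and rejected
    responses, so the objective constrains every shared timestep rather than
    being dominated by the longer response's tail.
    """
    min_len = min(chosen_ids.size(0), rejected_ids.size(0))
    return chosen_ids[:min_len], rejected_ids[:min_len]

# Example: a 12-token chosen response and a 7-token rejected response.
chosen, rejected = poet_truncate(torch.arange(12), torch.arange(7))
print(chosen.shape, rejected.shape)  # torch.Size([7]) torch.Size([7])
```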

TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense

This paper proposes TrajGuard, a training-free decoding-time jailbreak defense framework that quantifies risk in real time by aggregating hidden-state trajectories from key layers within a sliding window, and triggers a lightweight semantic judge only when risk persistently exceeds a threshold. TrajGuard achieves an average defense rate of 95% across 12 jailbreak attacks, with a detection latency of only 5.2 ms/token and a false positive rate below 1.5%.
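
A sketch of the decoding-time control flow described above: a per-token risk score derived from hidden states is aggregated over a sliding window, and the expensive semantic judge is called only when the windowed risk stays above a threshold. The risk scorer and the persistence rule are assumptions:

```python
from collections import deque
from typing import Callable, Sequence

def trajguard_stream(
    risk_score: Callable[[Sequence[float]], float],  # hypothetical: hidden state -> risk in [0, 1]
    hidden_states: Sequence[Sequence[float]],        # one hidden-state vector per decoded token
    window: int = 16,      # sliding-window length in tokens
    threshold: float = 0.7,
    patience: int = 4,     # consecutive windows that must exceed the threshold
) -> bool:
    """Return True if decoding should be escalated to the semantic judge.

    Risk is averaged over a sliding window of recent tokens; escalation fires
    only when the windowed risk persistently exceeds the threshold, which
    keeps false positives (and judge calls) rare.
    """
    recent = deque(maxlen=window)
    streak = 0
    for h in hidden_states:
        recent.append(risk_score(h))
        if len(recent) == window and sum(recent) / window > threshold:
            streak += 1
            if streak >= patience:
                return True
        else:
            streak = 0
    return False
```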