⚖️ Alignment & RLHF

🧪 ICML2025 · 16 paper notes

📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (102) · 💬 ACL2026 (38) · 🧪 ICML2026 (37) · 🤖 AAAI2026 (17) · 🧠 NeurIPS2025 (36)

🔥 Top topics: Alignment/RLHF ×10 · LLM ×4

AlphaPO: Reward Shape Matters for LLM Alignment: AlphaPO adds a parameter \(\alpha\) to the Direct Alignment Algorithms (DAA) framework that changes the shape of the reward function, replacing the standard log-based reward with a more general power transform. This gives fine-grained control over likelihood displacement and over-optimization, yielding a 7%-10% relative improvement over SimPO and a 15%-50% relative improvement over DPO in alignment performance on Mistral-7B and Llama3-8B.
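
The effect of a reward-shape parameter can be illustrated numerically. The sketch below assumes a power transform of the form \(\beta (1 - p^{-\alpha})/\alpha\), chosen here only because it reduces to the log reward \(\beta \log p\) as \(\alpha \to 0\); the exact reward used in AlphaPO (including any length normalization) may differ, and the names `power_reward`, `alpha`, and `beta` are illustrative.

```python
import math


def log_reward(p: float, beta: float = 1.0) -> float:
    """Standard log-based reward for a sequence probability p in (0, 1]."""
    return beta * math.log(p)


def power_reward(p: float, alpha: float, beta: float = 1.0) -> float:
    """Assumed power-transform reward: beta * (1 - p**(-alpha)) / alpha.

    Falls back to the log reward at alpha == 0, which is the limit of the
    expression as alpha -> 0.
    """
    if alpha == 0.0:
        return log_reward(p, beta)
    # expm1 keeps precision when alpha * log(p) is close to zero.
    return -beta * math.expm1(-alpha * math.log(p)) / alpha


if __name__ == "__main__":
    # Compare reward shapes for a few probabilities and alpha values.
    for p in (0.05, 0.3, 0.9):
        shapes = [round(power_reward(p, a), 3) for a in (-0.5, 0.0, 0.5)]
        print(f"p={p}: alpha=-0.5, 0, 0.5 -> {shapes}")
```

Under this assumed form, positive \(\alpha\) penalizes low-probability responses more steeply than the log reward, while negative \(\alpha\) flattens the penalty.
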
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection: AMPO combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. By selecting a small but highly informative subset from a large pool of candidate responses for preference optimization, it achieves state-of-the-art results on AlpacaEval.
AssistanceZero: Scalably Solving Assistance Games: AssistanceZero is proposed, scaling assistance games to complex environments (Minecraft building assistance with \(10^{400}\) possible goals) for the first time. By extending AlphaZero with a reward prediction head and a human action prediction head to perform planning under uncertainty via MCTS, the method significantly outperforms PPO and imitation learning baselines. Human experiments demonstrate that AssistanceZero effectively reduces user actions and exhibits emergent behaviors such as digging foundations, inferring roofs, and learning from corrections.
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective: This work identifies a structural property induced by KL regularization in RLHF: the coverage of the optimal policy \(\pi^*\) by a policy \(\pi\) is bounded by the sub-optimality of \(\pi\) (\(\text{Cov}^{\pi^*|\pi} \leq 1 + \kappa \cdot (J(\pi^*) - J(\pi))/\beta\)). Based on this, it proposes two transfer learning principles: (1) select a transfer policy with high policy value, and (2) use self-transfer, distilling a policy from the online data. The proposed TPO algorithm achieves a regret of \(O(W\sqrt{T})\) in the early stage and \(O(\sqrt{T})\) in the late stage. It can be modularly integrated with DPO/IPO/XPO, and its effectiveness is validated on a T5 summarization task.
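
To read the coverage bound numerically, the sketch below simply evaluates its right-hand side. The function name `coverage_upper_bound` is hypothetical, and \(\kappa\) is treated as a given constant because this summary does not define it.

```python
def coverage_upper_bound(kappa: float, value_gap: float, beta: float) -> float:
    """Evaluate 1 + kappa * (J(pi*) - J(pi)) / beta.

    value_gap is the sub-optimality J(pi*) - J(pi) of the candidate policy,
    and beta is the KL-regularization coefficient. kappa is an input here
    because the summary does not state how it is defined.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if value_gap < 0:
        raise ValueError("value_gap must be non-negative")
    return 1.0 + kappa * value_gap / beta


if __name__ == "__main__":
    # A policy closer to optimal (smaller gap) has a tighter coverage bound.
    for gap in (0.5, 0.25, 0.0):
        print(gap, coverage_upper_bound(kappa=2.0, value_gap=gap, beta=0.25))
```

The bound tightens toward 1 as the value gap shrinks, which is the intuition behind preferring a transfer policy with high policy value.
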
Challenges and Future Directions of Data-Centric AI Alignment: This position paper advocates shifting the research focus of AI alignment from algorithm design to data quality. Through a qualitative analysis of the Anthropic-HH dataset, it identifies six major sources of unreliability in human feedback and proposes future directions for improving data collection, cleaning, and verification.
Diverging Preferences: When do Annotators Disagree and do Models Know?: This paper systematically analyzes the reasons behind annotator disagreement in RLHF preference datasets by taxonomizing them into 10 categories. It reveals that over 75% of disagreements stem from personal preference rather than annotation noise. To address this, the paper proposes a Mean-Var Reward Model to effectively differentiate between diverging and high-consensus preferences, and uncovers systematic biases in LLM-as-Judge evaluation methodologies when facing disagreement.
DPO Meets PPO: Reinforced Token Optimization for RLHF: This paper proposes Reinforced Token Optimization (RTO), which models RLHF as a token-level MDP (rather than a sentence-level bandit). It leverages DPO to implicitly extract token-wise reward signals and then performs policy optimization using PPO. RTO outperforms PPO by 7.5 points on AlpacaEval 2 and by 4.1 points on Arena-Hard, achieving PPO-level performance with only 1/8 of the data.
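
The idea of extracting token-wise rewards from a DPO-trained policy can be sketched with the commonly used implicit-reward form \(\beta \log(\pi_{\text{dpo}}(a_t|s_t)/\pi_{\text{ref}}(a_t|s_t))\). This is an assumption about the shape of the signal, not RTO's full reward: any additional terms in the paper are not covered by this summary, and the helper name `token_rewards` is hypothetical.

```python
import math
from typing import List


def token_rewards(logp_dpo: List[float], logp_ref: List[float],
                  beta: float = 0.1) -> List[float]:
    """Per-token implicit reward: beta * (log pi_dpo - log pi_ref).

    Both inputs hold the log-probability each model assigns to the tokens
    of the same response, in order.
    """
    if len(logp_dpo) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (a - b) for a, b in zip(logp_dpo, logp_ref)]


if __name__ == "__main__":
    # Toy log-probabilities for a four-token response (made-up values).
    logp_dpo = [-0.2, -1.1, -0.4, -0.9]
    logp_ref = [-0.5, -1.0, -1.2, -0.9]
    rewards = token_rewards(logp_dpo, logp_ref)
    print("token rewards:", [round(r, 3) for r in rewards])
    # The token rewards sum to the usual sequence-level implicit reward.
    seq_reward = 0.1 * (sum(logp_dpo) - sum(logp_ref))
    print(math.isclose(sum(rewards), seq_reward))
```

A PPO-style optimizer could then consume these dense rewards at each step of the token-level MDP instead of a single sentence-level score.
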
Improving Model Alignment through Collective Intelligence of Open-Source LLMs: This paper proposes Mixture of Agents Alignment (MoAA), which leverages the collective intelligence of multiple open-source LLMs to generate high-quality alignment data (SFT data and preference data). This significantly improves the performance of the target model on Arena-Hard and AlpacaEval 2, demonstrating self-improvement capabilities without external strong supervision.
Instruction Tuning of Large Language Models for Tabular Data Generation—in One Day: This paper is the first to explore utilizing instruction tuning to enhance the tabular data generation capabilities of LLMs. By constructing a high-quality instruction dataset of only 10K instances and fine-tuning Llama3.1-8B-Instruct on a single A100 for less than 6 hours, the approach achieves tabular data generation performance comparable to GPT-4o.
Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models: This work identifies the Image enCoder Early-exiT (ICET) vulnerability in VLMs, where skipping certain layers of the image encoder significantly increases the probability of generating harmful outputs. It proposes Layer-wise PPO (L-PPO), which modifies the Clipped-PPO algorithm to perform multimodal RLHF across different layers, leading to up to a 48% reduction in attack success rate (ASR) and a 33.64% reduction in toxicity score.
M³HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality: The M³HF framework is proposed to integrate multi-phase, mixed-quality natural language human feedback during multi-agent reinforcement learning. It leverages LLMs to parse feedback, and updates the reward function through predefined templates and adaptive weights, significantly improving multi-agent coordination performance.
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence: Drawing inspiration from the Particle Swarm Optimization (PSO) algorithm, this work treats multiple LLM experts as "particles" collaboratively searching in the weight space. Guided by three signals—individual best, global best, and global worst—the experts iteratively update their positions. This achieves tuning-free model adaptation using only 200 samples, outperforming 12 baselines by an average of 13.3% across 9 tasks.
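
The three-signal update can be sketched as a PSO-style loop on toy vectors. Everything here is a stand-in: in the paper the particles are LLM expert weights and the utility would be a score on the small adaptation set, whereas below `utility` is a made-up quadratic and the coefficients (`inertia`, `c_self`, `c_best`, `c_worst`) are illustrative assumptions.

```python
import random


def utility(x):
    """Toy stand-in for a validation score; highest at x = (1, ..., 1)."""
    return -sum((xi - 1.0) ** 2 for xi in x)


def swarm_search(n_particles=8, dim=4, steps=50, seed=0,
                 inertia=0.5, c_self=0.5, c_best=0.5, c_worst=0.2):
    rng = random.Random(seed)
    xs = [[rng.uniform(-2, 2) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]               # each particle's individual best
    gbest = max(xs, key=utility)[:]          # global best seen so far
    gworst = min(xs, key=utility)[:]         # global worst seen so far
    history = [utility(gbest)]
    for _ in range(steps):
        for i in range(n_particles):
            x, v = xs[i], vs[i]
            for d in range(dim):
                r1, r2, r3 = rng.random(), rng.random(), rng.random()
                # Pull toward individual and global best, push from global worst.
                v[d] = (inertia * v[d]
                        + c_self * r1 * (pbest[i][d] - x[d])
                        + c_best * r2 * (gbest[d] - x[d])
                        - c_worst * r3 * (gworst[d] - x[d]))
                x[d] += v[d]
            score = utility(x)
            if score > utility(pbest[i]):
                pbest[i] = x[:]
            if score > utility(gbest):
                gbest = x[:]
            if score < utility(gworst):
                gworst = x[:]
        history.append(utility(gbest))
    return gbest, history


if __name__ == "__main__":
    best, history = swarm_search()
    print("initial best utility:", round(history[0], 4))
    print("final best utility:", round(history[-1], 4))
```

No gradients are computed anywhere in the loop, which is the sense in which such a search is tuning-free: only utility evaluations guide the movement in weight space.
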
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment: Proposes MPO (Mixing Preference Optimization), a lightweight post-processing framework that achieves multi-preference alignment by log-linearly combining existing single-objective policies, which bypasses the expensive reinforcement learning process in multi-objective RLHF.
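
A log-linear combination of policies can be sketched on a tiny discrete response set: the mixed policy is proportional to \(\prod_i \pi_i(y)^{w_i}\). The example distributions and weights below are made up, the helper name `log_linear_mix` is hypothetical, and how MPO chooses the weights is not described in this summary.

```python
import math
from typing import List, Sequence


def log_linear_mix(policies: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Return the distribution proportional to prod_i policies[i][y] ** weights[i].

    Each policy is a strictly positive probability vector over the same
    candidate responses.
    """
    if not policies or len(policies) != len(weights):
        raise ValueError("need one weight per policy")
    n = len(policies[0])
    if any(len(p) != n for p in policies):
        raise ValueError("all policies must cover the same responses")
    # Weighted sum of log-probabilities for each response.
    logits = [sum(w * math.log(p[y]) for p, w in zip(policies, weights))
              for y in range(n)]
    # Normalize with the max subtracted for numerical stability.
    m = max(logits)
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


if __name__ == "__main__":
    helpful = [0.7, 0.2, 0.1]    # toy single-objective policy A
    harmless = [0.1, 0.3, 0.6]   # toy single-objective policy B
    mixed = log_linear_mix([helpful, harmless], [0.5, 0.5])
    print([round(p, 3) for p in mixed])
```

Because the combination only needs log-probabilities from already trained policies, no further reinforcement learning step is required to obtain the mixed policy.
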
On the Robustness of Reward Models for Language Model Alignment: This paper proposes Batch-wise Sum-to-Zero Regularization (BSR), which suppresses the excessive dispersion of hidden-state norms by constraining the reward scores within each batch to sum to zero, targeting the root cause of reward model over-optimization. With BSR, an 8B-scale RM outperforms the state of the art (SOTA) by more than 5% on complex preference prediction tasks; in downstream RLHF training, it reduces generation length by 40% and improves win rate by 7%.
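
One way to picture a batch-wise sum-to-zero constraint is as a penalty added to the usual Bradley-Terry pairwise loss. The sketch below assumes a squared-sum penalty; the exact regularizer, its normalization, and its weight in the paper may differ, and the names `bt_loss` and `bsr_penalty` are hypothetical.

```python
import math
from typing import Sequence


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bt_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need equally sized, non-empty reward lists")
    return sum(_softplus(-(c - r)) for c, r in zip(chosen, rejected)) / len(chosen)


def bsr_penalty(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Assumed penalty: squared sum of all reward scores in the batch."""
    total = sum(chosen) + sum(rejected)
    return total ** 2


if __name__ == "__main__":
    lam = 0.01  # illustrative regularization weight
    centered = ([1.0, 2.0], [-1.0, -2.0])   # rewards sum to zero
    shifted = ([6.0, 7.0], [4.0, 3.0])      # same margins, shifted by +5
    for name, (c, r) in (("centered", centered), ("shifted", shifted)):
        print(name, round(bt_loss(c, r) + lam * bsr_penalty(c, r), 4))
```

The pairwise loss depends only on reward margins, so it cannot distinguish the two batches; the penalty is what discourages the reward scale from drifting.
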
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning: Proposes PoisonBench, the first benchmark to systematically evaluate LLM vulnerability to data poisoning attacks during the preference learning phase. It covers two attack types (content injection and alignment deterioration) and reveals a log-linear relationship between the poisoning ratio and attack effectiveness across 22 models, along with preliminary evidence of deceptive alignment.
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization: Deconstructs sequence-level PPO into a series of token-level proximal policy optimization problems and introduces a token-level reward guidance function \(f(\hat{r}(s_t, a_t))\) to replace the fixed constant \(\beta\) in DPO. This allows different tokens to deviate from the reference policy to varying degrees based on their respective reward values, improving the win rate on MT-Bench/AlpacaEval 2/Arena-Hard by up to 7.5/6.2/4.3 percentage points respectively.