⚖️ Alignment & RLHF

💬 ACL2026 · 38 paper notes

📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (102) · 🧪 ICML2026 (37) · 🤖 AAAI2026 (17) · 🧠 NeurIPS2025 (36) · 📹 ICCV2025 (2)

🔥 Top topics: LLM ×11 · Alignment/RLHF ×9 · Personalized Generation ×2 · Reinforcement Learning ×2

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

Reward models that compress the entire sequence into a scalar via fixed pooling (e.g., last-token) suffer from two structural defects: a fixed spatial inductive bias and a misalignment between generative backbone representations and discriminative tasks. AdaJudge addresses both. A gated refinement block first reshapes representations into a discriminative space; "domain-aware gated multi-perspective pooling" then dynamically fuses evidence from last-token, mean, and attention poolings, conditioned on the prompt. With this design, 4B/8B models outperform strong 27B off-the-shelf reward models on RM-Bench and JudgeBench.
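The fusion of last-token, mean, and attention poolings can be illustrated with a small pure-Python sketch. This is not AdaJudge's implementation: the function names, the dot-product attention query, and the `gate_logits` argument (standing in for the output of a prompt-conditioned gating network) are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def last_token_pool(hidden):
    # hidden: list of token vectors; keep only the final token.
    return list(hidden[-1])

def mean_pool(hidden):
    # Average each dimension over all tokens.
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]

def attention_pool(hidden, query):
    # Weight tokens by softmax of their dot product with a (hypothetical) query.
    scores = [sum(h * q for h, q in zip(tok, query)) for tok in hidden]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, hidden))
            for d in range(len(query))]

def gated_multi_view_pool(hidden, query, gate_logits):
    # gate_logits is assumed to come from a small network that reads the prompt;
    # here it is passed in directly.
    views = [last_token_pool(hidden), mean_pool(hidden),
             attention_pool(hidden, query)]
    gates = softmax(gate_logits)
    dim = len(views[0])
    return [sum(g * v[d] for g, v in zip(gates, views)) for d in range(dim)]
```

With equal gate logits the three views are averaged; pushing one logit high recovers the corresponding single pooling, which is the sense in which fixed last-token pooling is a special case of the gated form.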

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

The reward model is reshaped from a "single-turn scoring" mechanism into a multi-turn deliberation process featuring "forward + backward dual agents + tool calls." Through SFT+GRPO, these multi-agent capabilities are distilled into a single 4B model, which outperforms 70B-scale ORMs by 25.2% in Best-of-N (BoN) selection.

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

Plan-RewardBench is proposed as a trajectory-level preference benchmark for complex tool-augmented scenarios, designed to evaluate the capability of reward models in distinguishing superior from inferior agent trajectories across multi-step planning, tool usage, and error recovery.

Alignment Data Map for Efficient Preference Data Selection and Diagnosis

This paper proposes the Alignment Data Map, an analytical tool that visualizes, selects, and diagnoses preference data by jointly considering response quality and variability. It achieves the alignment performance of full-set training using only 33% of the data.

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

ARES detects "systemic weaknesses" (cases where the Core LLM and the Reward Model fail simultaneously) using a Safety Mentor that dynamically combines four components: "Topic / Persona / Goal / Tactic." It then applies a two-stage closed-loop process, repairing the RM before the policy, which raises the RedTeam safety rate from 0.28 to 0.96 with negligible loss in general capabilities.

BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models

This paper proposes an abstraction-grounding framework that decomposes the conceptual understanding of LLMs into three layers: "abstract-abstract, abstract-concrete, and concrete-concrete." Using concept probing and activation steering across 6 open-source LLMs and 10 value dimensions, the authors demonstrate that structured value representations exist within LLMs, migrate across abstraction layers, and causally drive concrete decisions.

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

This paper decomposes literary translation quality into two dimensions: "expression fluency" and "literary effect." By using specialized LLMs to iteratively generate high-quality reference translations and preference pairs, the authors employ SFT + explicit Reward Model + GRPO to train LitMT. This allows 8B/14B small models to approach or even surpass some large models in English-to-Chinese literary translation.

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Building on the token-level stabilization method DFT, CADFT introduces a "sample-level compatibility" signal calculated from the model's own likelihood to re-weight supervised gradients. It further employs a delayed, low-frequency "compatibility-guided rewriting" to transform stubborn, difficult samples into learnable targets. This suppresses high-variance gradients without reward models or RL, enhancing SFT stability, generalization, and the quality of cold-start RL initialization.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

This paper systematically demonstrates that "expert-written fine-grained scoring rubrics" serve as both more reliable evaluation tools for frontier LLMs and data-efficient RLVR reward signals. It proposes five design principles for constructing high-quality rubrics and introduces the ComplexConstraints dataset, where each prompt contains 10–40 atomic criteria. Empirical results show that performing RLVR with only ~1,000 expert samples improves the instruction-following capability of a 4B model by +15.5 pp and a 235B model by +12.2 pp. Furthermore, single-epoch agentic training successfully transfers to out-of-distribution (OOD) benchmarks that the model never encountered during training (BFCL +4.5 / τ²-Bench +7.4 / Toolathlon +6.8 pp).
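One simple way to turn a rubric of atomic criteria into a scalar reward is the weighted fraction of criteria satisfied. The sketch below assumes that aggregation; the paper's actual scoring rule is not given in this summary, and `rubric_reward` and its `weights` parameter are hypothetical names.

```python
def rubric_reward(criteria_results, weights=None):
    """Return the weighted fraction of atomic criteria that were satisfied.

    criteria_results: list of booleans, one per atomic rubric criterion.
    weights: optional list of positive floats (assumed; defaults to uniform).
    """
    if not criteria_results:
        raise ValueError("a rubric needs at least one criterion")
    if weights is None:
        weights = [1.0] * len(criteria_results)
    if len(weights) != len(criteria_results):
        raise ValueError("weights and criteria_results must align")
    satisfied = sum(w for w, ok in zip(weights, criteria_results) if ok)
    return satisfied / sum(weights)
```

A response meeting 3 of 4 equally weighted criteria would score 0.75 under this rule, giving the policy a denser signal than a single pass/fail label.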

ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training

ConsistRM proposes a consistency-aware self-training framework. By utilizing two modules—temporal consistency pseudo-labels (preference consistency merging online states and historical memory) and semantic consistency critique rewards (measuring semantic similarity of multiple generated critiques)—it improves the average performance of generative reward models by 1.5% across five benchmarks without human annotation, while significantly mitigating position bias.

CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

CuMA argues that dense models suffer from "Mean Collapse" when fitting conflicting cultural values, resulting in representations that represent no specific group well. By utilizing a "Demographic + Semantic" joint routing in a Mixture of LoRA Adapters, the method decouples conflicting gradients into dedicated expert subspaces, improving accuracy while preserving cultural diversity across multiple benchmarks.

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

The authors view the Bradley-Terry reward model as a causal graph for estimating total effect and identify bias-specific neurons (accounting for < 2% of total neurons) highly correlated with activations of five stylistic biases (length / paragraph / word overlap / exclamation mark / bold). During inference, these neuron activations are replaced with validation set medians (estimating the controlled direct effect). This approach eliminates bias without performance degradation on RewardBench / RM-Bench. When used downstream with DPO, it allows an 8B model's alignment score to match that of a 70B SOTA reward model.
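The inference-time step, replacing the activations of identified bias neurons with validation-set medians, might look like the following sketch. How the bias neurons are identified is outside its scope; `fit_medians`, `intervene`, and the flat list-of-floats activation format are assumptions for illustration.

```python
from statistics import median

def fit_medians(validation_activations, bias_neurons):
    # validation_activations: list of activation vectors from a validation set.
    # bias_neurons: indices of neurons previously flagged as bias-specific.
    return {i: median(row[i] for row in validation_activations)
            for i in bias_neurons}

def intervene(activations, medians):
    # Overwrite only the flagged neurons; all other activations pass through.
    patched = list(activations)
    for index, value in medians.items():
        patched[index] = value
    return patched
```

Because only the flagged indices are overwritten, the rest of the reward model's computation is untouched, which is consistent with the reported lack of degradation on standard benchmarks.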

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human–LLM Collaborative Writing

The paper identifies "draft-based co-authoring" as a neglected jailbreak surface where malicious users provide incomplete, harmful drafts for LLMs to "polish and complete." The model's "completion instinct" often overrides safety guardrails, leaking executable dangerous details. The authors construct the HarDBench benchmark to quantify this vulnerability (all eight models achieved ASR >80% under CoJP attacks) and propose SUBA preference optimization alignment. By learning to refuse harmful drafts while cooperating with benign ones, SUBA reduces ASR to single digits with minimal utility loss.

How Value Induction Reshapes LLM Behaviour

This paper performs DPO fine-tuning on 8 open-source LLMs (Llama 3 series) across 15 values using value-annotated preference data subsets. It reveals systematic crosstalk between values: inducing one value simultaneously strengthens or suppresses other related/opposing values. While positive values enhance safety, all value inductions increase "anthropomorphism," making outputs more likely to be perceived as sycophantic.

Large Language Models Are Overconfident in Their Own Responses

This paper finds that instruction-tuned LLMs exhibit a significant ownership bias when evaluating answers they generated themselves. It proposes a simple inference-time strategy: rewrite the answer as a user input before asking for confidence, which reduces overconfidence without retraining.

MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network reads the model's last-layer hidden states and adaptively selects reward weights for each prompt-response pair, consistently outperforming static and single-reward baselines across seven open-domain benchmarks.

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

MDP-GRPO addresses the instability of GRPO under discrete low-variance rewards in multi-constraint instruction following. By combining multi-temperature sampling, dual-anchor advantage, prospect-theoretic shaping, and asymmetric KL, it achieves more stable soft/hard constraint satisfaction rates for small models on IFEval, FollowBench, and a custom multi-constraint test set.

Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

The authors identify that standard GRPO treats different option orderings of the same question as independent prompts, leading to "permutation-blindness" where model choices change when order varies. They propose PA-GRPO: organizing multiple permutations of the same semantic instance into a permutation group and employing a cross-permutation advantage baseline with a consistency reward. This explicitly optimizes for "order invariance," substantially reducing selection bias across 7 MCQ/Judge benchmarks while maintaining or improving accuracy.
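The idea of a shared baseline across permutations plus a consistency reward can be sketched as follows. This is an illustrative simplification, not PA-GRPO's actual objective: the majority-vote consistency bonus, the `consistency_weight` parameter, and the rollout tuple format are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def pa_advantages(rollouts, consistency_weight=0.5, eps=1e-6):
    """Compute advantages over one permutation group.

    rollouts: list of (permutation_id, chosen_content, is_correct) tuples, where
    chosen_content is the text of the selected option (not its letter), so that
    answers are comparable across different option orderings.
    """
    # Consistency bonus (assumed form): agree with the group's majority choice.
    majority, _ = Counter(content for _, content, _ in rollouts).most_common(1)[0]
    rewards = [float(ok) + consistency_weight * float(content == majority)
               for _, content, ok in rollouts]
    # One baseline shared by all permutations of the same semantic instance.
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]
```

Pooling the baseline over the whole permutation group means a rollout that flips its answer under reordering is penalized relative to its order-stable siblings, rather than being judged as if it came from an unrelated prompt.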

ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

ModeX models Best-of-N selection for open-ended text generation as a problem of "finding modal clusters on a generated text similarity graph." By using n-gram Jaccard graph construction, recursive spectral clustering with Fiedler vectors, and centrality-based centroid selection, it generalizes self-consistency to tasks without standard answers (e.g., summarization, code, math) without requiring any reward models or LLM-judges.
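A reduced version of the pipeline, covering only the n-gram Jaccard graph and centrality-based selection, is sketched below. The recursive spectral clustering with Fiedler vectors is deliberately omitted, so this amounts to picking the most central candidate in a single cluster; the function names and the choice of bigrams are assumptions.

```python
def ngrams(text, n=2):
    # Set of word n-grams for a whitespace-tokenized, lowercased text.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two sets; defined as 0.0 when both are empty.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_central(candidates, n=2):
    # Edge weights are pairwise Jaccard similarities; a node's centrality is
    # the sum of its edge weights (weighted degree).
    grams = [ngrams(c, n) for c in candidates]
    scores = [sum(jaccard(g, h) for j, h in enumerate(grams) if j != i)
              for i, g in enumerate(grams)]
    return max(range(len(candidates)), key=scores.__getitem__)
```

The returned index identifies the candidate that overlaps most with the others, which mirrors how self-consistency picks the majority answer but works on free-form text.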

On the Rejection Criterion for Proxy-Based Test-Time Alignment

This paper unifies proxy-based test-time alignment methods, such as implicit rewards, Nudging, and KAD, into a "sample-then-decide" probabilistic graphical model. It proposes a conservative confidence bet that uses the best confidence of a small alignment model as a reference, improving hybrid decoding accuracy across multiple mathematical and commonsense reasoning datasets.

P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist

P-Check transforms personalized reward modeling from "cramming user history into the judge" to "first generating a weighted dynamic evaluation checklist for the current user and current query, then using it to guide reward scoring." It significantly outperforms persona, memory retrieval, and fine-tuned reward model baselines on personalized preference prediction and downstream generation tasks in PRISM, Arena, and BESPOKE.

PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

PERSA uses "professor demonstrations + professor preference rewards + PPO updating only high-level LoRA" to tune a general LLM toward a specific professor's style of programming feedback. It significantly improves style consistency across APPS, PyFiXV, and CodeReviewQA while maintaining nearly 100% diagnostic accuracy.

Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

Pref-CTRL introduces margin loss and regularizer loss oriented toward paired preference data to train a lightweight value function within the RE-Control framework—a test-time alignment framework that does not update LLM parameters. This makes representation editing more consistent with human preferences and achieves stable performance gains over RE-Control on SHP, HH-RLHF, and cross-domain data.

RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation

RbtAct treats author rebuttals as implicit supervision for "which review comments actually prompt modifications," constructs a dataset of 75,000 review-rebuttal segment-level mappings, and employs SFT+DPO to train an 8B model to generate more specific and actionable paper review feedback.

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

This paper proposes the Simple-to-Hard (S2H) DPO framework, which systematically enhances the multi-image reasoning capabilities of VLMs while maintaining single-image performance. It does so by constructing multi-image preference data across three progressive difficulty levels (fixed-point reasoning → cross-image comparison → global visual search).

SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

This paper proposes SFTMix, a Mixup-based instruction tuning method. By partitioning SFT datasets into high-confidence and low-confidence subsets through training dynamics, it performs linear interpolation in the hidden representation space and applies Mixup regularization. SFTMix consistently improves instruction-following capabilities across different LLM families and dataset scales without relying on high-quality dataset curation.
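The interpolation step follows the standard Mixup recipe: a convex combination of two examples and of their targets. The sketch below applies it to one high-confidence and one low-confidence hidden vector; the pairing scheme, the `alpha` value, and the function names are assumptions, not details taken from SFTMix.

```python
import random

def sample_lambda(alpha=0.4, rng=random):
    # Standard Mixup draws the mixing coefficient from Beta(alpha, alpha).
    return rng.betavariate(alpha, alpha)

def mixup(h_confident, h_unconfident, y_confident, y_unconfident, lam):
    # Linear interpolation in hidden space, with matching target interpolation.
    h_mix = [lam * a + (1.0 - lam) * b
             for a, b in zip(h_confident, h_unconfident)]
    y_mix = [lam * a + (1.0 - lam) * b
             for a, b in zip(y_confident, y_unconfident)]
    return h_mix, y_mix
```

In a training loop, the loss on `h_mix` against `y_mix` would be added to the usual next-token loss as a regularizer.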

Student Guides Teacher: Weak-to-Strong Inference via Spectral Orthogonal Exploration

This paper interprets the phenomenon of LLMs repeatedly sampling along the same incorrect logic on difficult problems as low-rank collapse of hidden states. It proposes Spectral Orthogonal Exploration (SOE): using a weak student model to provide short probes orthogonal to the teacher's current dominant subspace, forcing the teacher to leap out of the original bias manifold. This improves Pass@16 on difficult subsets of AIME/MATH/Olympiad from 26.7% to 45.9% on average.

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

This paper attributes the entropy instability in GRPO training to the "log probability-advantage" covariance contribution of a few extreme tokens. It utilizes a Gaussian kernel without additional hyperparameters to softly suppress the advantage of these tokens, resulting in stable performance improvements across 1.5B and 7B mathematical reasoning models.

Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Targeting hotel price-reduction negotiations on online travel agency (OTA) platforms, this paper proposes REPO. It trains Qwen3-32B with three types of rewards combined: preference reward models, LLM reviewers, and rule functions. The method simultaneously improves persuasiveness, SOP compliance, and badcase repair quality across expert evaluations and 9,653 real A/B dialogues.

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

TPAW transforms LLM self-training into an alignment process of "teaming up current and historical models for competition." It stabilizes preference optimization through two adaptive mechanisms—target response weighting and main player weighting—improving performance on the Open LLM Leaderboard and GSM8K without additional human preference annotations.

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

Using large models as judges (LLM-as-a-judge) for soft constraints in RLVR instruction following yields low reward accuracy and slow training. TinyJudge first identifies that "only style/structure/semantic categories possess high generalizability among soft constraints." It then distills the judgment expertise of frontier models into several 0.6B specialist models that together form an ensemble reward. This increases reward accuracy by approximately 12%, accelerates judging by 6x, and reduces total training time by 3x, while improving the downstream instruction satisfaction rate by approximately 10% on average.

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

Inference-time alignment, where an aligned model guides an unaligned base model token by token, suffers from "quality blindness": it cannot distinguish good advice from bad, which leads to an "intervention paradox" where more intervention results in worse performance. BlendIn addresses this with quality-aware probabilistic distribution blending. At positions where the base model is uncertain, it adaptively fuses the distributions of both models based on their respective confidence levels before greedily selecting a token. This preserves beneficial guidance while suppressing unreliable suggestions, achieving consistent improvements of up to 50% on the most challenging high-intervention model pairs.
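A single decoding step of confidence-weighted blending could be sketched as below. The specifics are assumptions rather than BlendIn's published rule: confidence is taken to be each model's maximum token probability, uncertainty is a fixed threshold on the base model's confidence, and the blend weight is the aligned model's share of total confidence.

```python
def blend_step(p_base, p_aligned, uncertainty_threshold=0.5):
    """Pick the next token id from two next-token distributions.

    p_base, p_aligned: lists of probabilities over the same vocabulary.
    """
    conf_base = max(p_base)
    conf_aligned = max(p_aligned)
    if conf_base >= uncertainty_threshold:
        # Base model is confident: do not intervene.
        mixed = p_base
    else:
        # Weight the aligned model by its share of total confidence (assumed).
        w = conf_aligned / (conf_base + conf_aligned)
        mixed = [(1.0 - w) * b + w * a for b, a in zip(p_base, p_aligned)]
    # Greedy selection over the blended distribution.
    return max(range(len(mixed)), key=mixed.__getitem__)
```

Under this rule, a hesitant aligned model (a flat distribution) shifts the blend only slightly, so an uncertain base model can still keep its own top token instead of following weak advice.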

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

This paper points out that strong reasoning models stop learning during GRPO on training sets that are "too easy and nearly all correct" because intra-group reward variance disappears. It proposes Mixed-CUTS, which mixes standard rollouts with constrained Top-K uniform sampling to recreate meaningful exploration differences. On Qwen3-4B, this method improves AIME25 Pass@1 by 15.1% compared to standard GRPO.
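The vanishing signal is easy to see in the commonly used group-normalized advantage, where each reward is centered and scaled by its group's statistics. The sketch below shows that form alongside a plain uniform draw from the top-K tokens; the exact constraints used by Mixed-CUTS are not described in this summary, so `top_k_uniform_sample` is only an assumed approximation.

```python
import random
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    # Center and scale rewards within one rollout group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def top_k_uniform_sample(probs, k, rng=random):
    # Ignore the model's probabilities among the top-k tokens and draw one
    # uniformly, which forces rollouts to diverge even when the model is sure.
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choice(top_k)
```

When every rollout in a group is correct, `group_advantages` returns all zeros and the policy gradient for that prompt disappears; a single differing rollout restores a nonzero signal.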

Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

This paper views LLM alignment as a "semantic trajectory" shaping problem in the hidden space. It extracts prompt-answer topological bridges using 0D persistent homology to incorporate TTL during the SFT stage, and utilizes topic-specific preference directions for TPO during the DPO stage. This approach consistently outperforms non-topological baselines in reward, win rate, and harmlessness metrics on UltraChat and HH-RLHF.

Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms

This paper identifies the "reward-generation gap" in Direct Alignment Algorithms (DAAs)—a mismatch between training objectives and autoregressive decoding dynamics. The authors propose POET (Prefix-Oriented Equal-length Training), which implicitly constrains the token-level MDP to converge at all timesteps by truncating preference pairs to the length of the shorter response, achieving up to an 11.8 percentage point improvement on AlpacaEval 2.
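The truncation step itself is small enough to show directly. The sketch assumes responses are already tokenized into lists of token ids; `truncate_pair` is a hypothetical helper name.

```python
def truncate_pair(chosen_ids, rejected_ids):
    # Cut both responses to the length of the shorter one, so the preference
    # loss is computed over equal-length prefixes.
    n = min(len(chosen_ids), len(rejected_ids))
    return chosen_ids[:n], rejected_ids[:n]
```

The truncated pair can then be fed to an unchanged DPO-style loss, which is why the method adds no extra hyperparameters at this step.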

What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

This paper proposes weighted In-Context Influence (wICI), which evaluates the value of instruction data by measuring whether a candidate sample, used as a one-shot demonstration, can reduce the instruction-following difficulty of related hard probes. Under a 10% data budget, it outperforms or matches selection methods such as IFD, DEITA, NUGGETS, and SelectIT.

Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

This paper provides the first systematic study of the "Incomplete Learning Phenomenon" (ILP) in SFT, in which models fail to correctly reproduce part of the training data despite convergence. It identifies five recurring causes (Knowledge Absence, Knowledge Conflict, Internal Data Contradiction, Left-side Forgetting, Insufficient Optimization) and proposes a diagnostic framework along with targeted mitigation strategies.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

WildFeedback automatically identifies satisfied/dissatisfied feedback from real multi-turn ChatGPT conversations. It transforms naturally occurring user preferences into preference training samples and instance-specific checklist evaluation standards. This enables small open-source instruction models to align more closely with real user needs than those trained on UltraFeedback, both on general benchmarks and in real-world user preference tests.