
⚖️ Alignment & RLHF

💬 ACL2025 · 82 paper notes

📌 Same area in other venues: 📷 CVPR2026 (12) · 🔬 ICLR2026 (102) · 💬 ACL2026 (38) · 🧪 ICML2026 (37) · 🤖 AAAI2026 (17) · 🧠 NeurIPS2025 (36)

🔥 Top topics: Alignment/RLHF ×39 · LLM ×11 · Adversarial Robustness ×11 · Agents ×5 · Multimodal/VLM ×4

A Dual-Mind Framework for Strategic and Expressive Negotiation Agent: Inspired by the dual-process theory of human cognition, this paper proposes a Dual-Mind Negotiation Agent (DMNA) framework. It combines an intuitive module (fast strategic planning, trained based on MCTS+DPO) and a deliberative module (slow expression optimization, based on a multifaceted reflection mechanism) to achieve state-of-the-art performance on negotiation tasks.
AceCoder: Acing Coder RL via Automated Test-Case Synthesis: The study constructs AceCode-87K (87K coding problems + 1.38M automatically synthesized test cases) to train a code-specific Reward Model (the 7B model outperforms the 340B Nemotron). Best-of-N sampling improves Llama-3.1-8B by 8.9 points on average. Direct R1-style RL from a base model for only 80 steps improves HumanEval+ by 22.5%.
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models: This paper proposes AGD (Adversarial Game Defense), an LLM jailbreak defense method based on adversarial games. By dynamically adjusting the internal representations of the model to balance helpfulness and harmlessness, AGD significantly improves LLM safety through three stages: IQR anomaly detection, bi-level optimization game, and expert model sampling.
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic LLMs: This paper proposes the AgentAlign framework, which leverages abstract behavior chains as an intermediary to synthesize high-quality agent safety alignment data (both harmful and benign) in simulated environments. Through Supervised Fine-Tuning (SFT), AgentAlign improves the agent safety of three open-source model families by 35.8%–79.5% while maintaining or even enhancing their task capabilities.
AgentRM: Enhancing Agent Generalization with Reward Modeling: This paper proposes AgentRM, a generalizable reward model constructed via explicit, implicit, and LLM-as-judge approaches. It guides policy models using test-time search (Best-of-N / Beam Search), achieving an average improvement of 8.8 points across 9 agent tasks and outperforming the best generalist agent by 4.0 points.
Aligning to What? Limits to RLHF Based Alignment: Through systematic experiments, this paper finds that RLHF (including DPO, ORPO, RLOO, etc.) is fundamentally ineffective at reducing covert racial bias in LLMs. Furthermore, executing SFT prior to RLHF "solidifies" model biases, revealing the deep limitations of current alignment techniques when dealing with ambiguous goals such as bias elimination.
AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models: This paper proposes the AMoPO framework, which achieves dimension-aware adaptive weight allocation by modeling the generation space as a Gaussian distribution. It completes multi-objective preference alignment without relying on reward models or reference models, outperforming the state-of-the-art (SOTA) by 28.5% on the HelpSteer2 dataset, and validating scalability on 7B, 14B, and 32B models.
ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning: Refines the granularity of preference optimization in DPO from the response level to the sentence level. By dynamically computing adaptive reward weights for each sentence based on image-text similarity and textual perplexity, it achieves average improvements of 2.57/2.87/1.98 points on LLaVA-1.5-7B/13B and InstructBLIP-13B, respectively, while significantly reducing hallucination rates.
Atyaephyra at SemEval-2025 Task 4: Low-Rank Negative Preference Optimization: In the SemEval 2025 LLM Unlearning Shared Task, this paper combines Negative Preference Optimization (NPO) with Low-Rank Adaptation (LoRA). By leveraging the structural properties of LoRA, the authors acquire the original model distribution with zero additional overhead to compute KL divergence regularization, significantly stabilizing the unlearning process and outperforming the task baselines.
AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs: AutoMixAlign proposes a theory-driven data mixing method for multi-task preference optimization: it first trains specialist models for each task to establish optimal loss baselines, and then adaptively adjusts data mixing proportions via minimax optimization, prioritizing tasks with the largest excess loss (gap from the specialist). It achieves an average improvement of 9.42% in helpfulness/harmlessness/reasoning multi-task DPO.
Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning: This paper systematically investigates how to optimally allocate resources between Supervised Finetuning (SFT) and Preference Finetuning (PFT/DPO) under a fixed data annotation budget. It reveals that pure SFT is optimal at low data regimes, while a combined approach performs best at high budgets. Furthermore, allocating less than 10% of the budget to SFT can resolve the cold start problem of DPO, bringing a 15-20% performance gain in mathematical reasoning.
Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection: This paper proposes G2IS (Gradient-based Graph Instruction Selection), which models the joint distribution and mutual dependencies among instruction data by constructing a gradient-based instruction graph. Combined with a gradient walk algorithm for data selection, it outperforms full-dataset instruction tuning using only 1% of the data.
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs: This paper proposes EDDF, a jailbreak defense framework based on "attack essence" rather than surface-level patterns. It offline extracts essential strategies of known attacks to store in a vector database, and online performs essence abstraction, retrieval, and fine-grained judgment on new queries. This reduces the attack success rate by at least 20% with a false positive rate of only 2.18%.
Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models: This paper systematically evaluates the safety of 13 SOTA small language models (<4B parameters) under 5 jailbreak attacks, finding that while SLMs can resist direct attacks, they are significantly more vulnerable under jailbreak attacks than large language models. It further analyzes the impact of SLM practices such as structural compression, quantization, and knowledge distillation on safety.
Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data: The proposed ReVD framework comprehensively enhances LLM vulnerability detection accuracy by 12-23% and achieves SOTA on PrimeVul and SVEN. It features bidirectional vulnerability reasoning data synthesis, triplet SFT (simultaneously learning reasoning across vulnerable code, patched code, and code differences), and Curriculum Online Preference Optimization (COPO).
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space: Based on the Elaboration Likelihood Model (ELM), this work decomposes jailbreak strategies into four categories of independently evolvable components (Role/Content Support/Context/Communication Skills). It proposes the CL-GSO genetic algorithm to execute crossover and mutation at the component level, expanding the strategy space from 40 in prior work to 839. This achieves a 96% Jailbreak Success Rate (JSR) on Claude-3.5 (where prior methods reached at most 4%). Meanwhile, a strategy evaluation mechanism based on intent consistency is proposed, reaching an accuracy of 96.5% and outperforming specialized safety models.
Call for Rigor in Reporting Quality of Instruction Tuning Data: Through systematic experiments across 16 hyperparameter combinations, this study reveals a severe issue in evaluating the quality of instruction tuning data: researchers' arbitrary choice of training hyperparameters can lead to entirely opposite conclusions (e.g., "Data A is superior to Data B"). It calls for the mandatory use of validated hyperparameter configurations when reporting data quality.
Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step: This paper proposes the Chain-of-Jailbreak (CoJ) attack, which decomposes a malicious query that cannot directly bypass safety guardrails into a sequence of multi-step editing sub-queries (delete-then-insert, insert-then-delete, change-then-change-back). CoJ achieves a 60%+ jailbreak success rate on GPT-4V/4o/Gemini. To counter this, the paper introduces the Think-Twice Prompting defense, which successfully intercepts over 95% of CoJ attacks.
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch: To fill the gap in Chinese Reward Model resources, this paper constructs CheemsBench (the first large-scale Chinese RM evaluation benchmark) and CheemsPreference (the first large-scale Chinese preference dataset). Trained via human-machine collaborative annotation and a distant supervision filtering strategy, CheemsRM significantly outperforms all existing open-source RMs in Chinese scenarios.
Constitutional Classifiers: Defending Against Universal Jailbreaks Across Thousands of Hours of Red Teaming: Anthropic proposes "Constitutional Classifiers", which train input/output safety classifiers by generating synthetic training data from natural language safety principles (constitutions). In over 3000 hours of professional red teaming, no universal jailbreak attacks were discovered, while only incurring a 0.38% increase in over-refusal rate and a 23.7% inference overhead.
Curiosity-Driven Reinforcement Learning from Human Feedback: CD-RLHF introduces curiosity-driven exploration into RLHF. By using the prediction error of a forward dynamics model as an intrinsic reward, combined with top-k gating filtering and reward whitening, it significantly enhances LLM output diversity without compromising alignment quality (a 40.26% increase in Diversity and an 8.92% increase in EAD on Llama-3.2-1B).
Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement: This paper proposes the D&R framework, which allows a small model (student) to engage in multi-round debates with multiple large models (teachers) to collect self-reflections and teacher feedback. The debate logs are organized into preference trees for distillation using Tree-structured DPO (T-DPO). This achieves an average improvement of 14.18 points on MMLU Pro and MATH, with better inference efficiency than baselines.
DiffPO: Diffusion Alignment with Direct Preference Optimization: DiffPO is proposed to reformulate LLM alignment as a sentence-level diffusion denoising process. Through parallel decoding, it achieves efficient inference-time alignment and serves as a plug-and-play module that enhances the alignment quality of any base model.
Don't Say No: Jailbreaking LLM by Suppressing Refusal: This paper proposes the DSN (Don't Say No) attack method. By analyzing the deficiencies of the target loss function in existing jailbreak attacks, it introduces two improvement strategies: cosine decay scheduling and refusal suppression. DSN achieves an Attack Success Rate (ASR) that outperforms existing methods across multiple LLMs and demonstrates strong transferability to unseen datasets and black-box models.
Dynamic Scaling of Unit Tests for Code Reward Modeling: This paper discovers that scaling up the number of LLM-generated unit tests consistently improves the quality of code reward signals (especially for complex problems). Based on this insight, a lightweight unit test generation model, CodeRM-8B, is trained alongside the implementation of a dynamic scaling strategy, achieving significant improvements across multiple code generation benchmarks.
Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent: This paper proposes ECPO (Expectation Confirmation Preference Optimization), the first multi-turn preference optimization method for LLM-driven conversational recommendation Agents. Based on the psychological Expectation Confirmation Theory (ECT), it explicitly models the evolution of user satisfaction across multiple turns. It locates the root causes of dissatisfaction through forward expectation confirmation and reconstructs turn-level preference pairs by rewriting responses via backward expectation derivation. Combined with the AILO user simulator, ECPO significantly outperforms existing MTPO methods across three datasets.
Model Extrapolation Expedites Alignment: Based on the observation that preference alignment only induces minor parameter changes, the ExPO method is proposed. By extrapolating the direction of parameter updates from SFT to DPO (\(\theta_2 = \theta_1 + \alpha\Delta\theta\)), alignment performance is enhanced at zero additional training cost, allowing a DPO model trained on only 20% of steps to outperform its fully trained counterpart.
Federated Data-Efficient Instruction Tuning for Large Language Models: This paper proposes FedHDS (Federated Hierarchical Data Selection), which eliminates intra- and inter-client data redundancy in federated learning via a two-level hierarchical data selection mechanism and uses multi-layer Transformer feature fusion to improve coreset quality. Using less than 1.5% of the data, it achieves an average Rouge-L improvement of 10.72% over the state-of-the-art full-data federated baseline while speeding up training by up to 48.8×.
Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization: This paper finds that the conventional DPO preference data construction strategy (max-min) degrades in performance as the sample size increases. A systematic exploration of the reward distribution shows that the rejected response should be selected at \(\mu-2\sigma\) rather than at the minimum. Based on this finding, the paper proposes a preference data construction method that scales consistently with the sample size.
Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization: This paper proposes Segment Supervised Preference Optimization (SSPO), which models the duration alignment problem between translated text and source speech in video dubbing as segment-level preference optimization. By using sentence-by-sentence sampling and fine-grained DPO loss, it achieves duration consistency for each dialogue line while maintaining translation quality and output format.
FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings: This paper proposes FocalPO, a DPO variant that introduces a focal loss-inspired modulation factor to down-weight incorrectly ranked pairs and prioritize reinforcing the model's understanding of preference pairs it already ranks correctly. It outperforms DPO and its variants on benchmarks such as AlpacaEval 2.0.
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points: Observing that errors in code generation models are highly concentrated on specific "error-prone points" where prefixes/suffixes remain almost identical while the middle segment determines correctness, this study proposes Focused-DPO. It ranks and locates key middle segments via PageRank on a code-test bipartite graph and magnifies their weights in the DPO loss (\(w_{focused}=2\)). With only 5,000 samples, it improves HumanEval+ by 4.41% and relatively boosts LiveCodeBench-Hard by 42.86%.
From Lists to Emojis: How Format Bias Affects Model Alignment: This paper systematically investigates how preference models (including human annotators, GPT-4, and open-source models) in RLHF exhibit formatting biases towards features such as bold text, lists, and emojis. It demonstrates that injecting less than 1% biased data can significantly introduce bias and proposes a debiasing method utilizing a dual-head reward model.
HAF-RM: A Hybrid Alignment Framework for Reward Model Training: This paper proposes HAF-RM, a Hybrid Alignment Framework that retains the policy layer during reward model training. By simultaneously optimizing sequence-level reward loss and token-level policy loss to jointly supervise the shared internal preference model, HAF-RM consistently outperforms standard baselines and DPO methods across 5 datasets.
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States: Proposes HiddenDetect, a tuning-free safety detection framework based on internal activation states: it detects jailbreak attacks by monitoring refusal semantic signals in hidden states during LVLM inference, outperforming existing methods in AUROC by a wide margin across multiple models and multimodal benchmarks.
Understanding Impact of Human Feedback via Influence Functions: This work applies influence functions to auditing feedback data for RLHF reward models for the first time, using OPORP vector compression to achieve a 2.5x speedup. It outperforms GPT-4o in bias detection (AUC 0.8 vs. 0.747) and identifies 47% of mislabeled samples in the Anthropic-HH dataset.
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation: This work proposes the ConVA (Controlled Value Vector Activation) framework, which accurately identifies value vectors in the LLM's latent space using context-controlled datasets, and activates target values at inference time using a gated minimal perturbation mechanism. It achieves an average 29.6% increase in control success rate across the 10 Schwartz basic values while maintaining 97%+ of text fluency and general capabilities.
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process: By unifying the analysis of SFT and Preference Optimization (PO) through an MDP framework, SFT is identified as a special case of PO where preference estimation and transition optimization are insufficient. IFT (Intuitive Fine-Tuning) is proposed to utilize temporal residual connections, allowing the model to achieve alignment performance comparable to or even surpassing the SFT+PO pipeline without requiring preference data.
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization: This work proposes IOPO (Input-Output Preference Optimization), which adds input preference modeling to traditional DPO, which optimizes only output preferences. It trains the model to learn "which instruction x better matches a given response y," thereby enhancing the model's fine-grained perception of complex, multi-constraint instructions. Additionally, the authors construct the Trace benchmark, consisting of 119K training samples and 1K evaluation samples, covering 26 constraint dimensions across 5 major categories.
Jailbreaking? One Step Is Enough!: This paper proposes the Reverse Embedded Defense Attack (REDA) method, which disguises the attack intent as a task for "defending" harmful content. By integrating a reversed attack perspective, example-guided in-context learning (ICL) prompts, and query intent mitigation, REDA achieves a highly successful, single-step, and cross-model transferable jailbreak attack.
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs: The first unified comprehensive assessment framework covering both automatic and non-automatic jailbreak attacks: compiling 17 representative jailbreak attacks, establishing a taxonomy of six attack categories, and performing large-scale systematic evaluation across 9 aligned LLMs and 8 defense strategies, revealing the key insight that heuristic-based attacks exhibit "high ASR but low utility."
JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning: This paper proposes JsonTuning—a method that replaces natural language text in instruction tuning inputs and outputs with structured JSON formats. By explicitly representing task elements, relationships, and output constraints (via JSON Schema), it consistently outperforms traditional TextTuning across 7 pre-trained models and 6 task categories, improving average performance from 26.78 to 30.88 while significantly enhancing robustness and controllability.
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges: A benchmark named MalwareBench (320 hand-crafted malware code requirements × 11 black-box jailbreak methods = 3520 prompts) is constructed to systematically evaluate the safety of 29 LLMs in malware code generation scenarios. It is found that jailbreak attacks reduce the average refusal rate from 60.93% to 39.92%, and there is no proportional relationship between model parameters and defense capability.
LoGU: Long-form Generation with Uncertainty Expressions: This work defines the "Long-form Generation with Uncertainty Expressions" (LoGU) task, identifies two sub-challenges (uncertainty suppression and uncertainty misalignment), and proposes a decomposition-based data construction framework and an SFT+DPO two-stage training pipeline. This enables LLMs to explicitly express uncertainty for uncertain facts in long-form generation, improving Llama3-8B's factual accuracy from 51.9% to 71.6% and reducing the number of incorrect claims from 20.4 to 5.81 across three datasets.
LPOI: Listwise Preference Optimization for Vision Language Models: This paper proposes LPOI, the first object-aware listwise preference optimization method for VLMs. By identifying and occluding critical objects in images, LPOI interpolates between positive and negative samples to generate a sequence of progressive occlusions. This trains the model to rank according to object visibility, effectively reducing hallucinations without requiring extra annotation, outperforming existing preference optimization methods on MMHalBench, AMBER, and Object HalBench.
LSSF: Safety Alignment via Low-Rank Safety Subspace Fusion: LSSF proposes the hypothesis that the safety information of LLMs resides in a low-rank subspace. It extracts the principal components of the safety-aligned model via SVD, adaptively determines the kept rank for each layer using safety singular value entropy, and finally linearly fuses the extracted safety principal components into the fine-tuned model. This restores the safety alignment degraded by fine-tuning without any additional training, while maintaining downstream task performance.
M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs: This paper proposes the M2S framework, which compresses multi-turn human jailbreak conversations into single-turn prompts using three simple format conversion methods (Hyphenize/Numberize/Pythonize). This approach not only maintains but even exceeds the original multi-turn attack effectiveness (achieving an ASR up to 95.9%, improving by up to 17.5% over multi-turn attacks) while reducing token usage by more than half.
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric: Systematically analyzes the limitations of 11 existing diversity measurement methods and proposes NovelSum—a data diversity metric that simultaneously considers sample uniqueness and information density, achieving a 0.97 correlation with instruction tuning performance.
MPO: Multilingual Safety Alignment via Reward Gap Optimization: MPO discovers that the implicit Reward Gap of LLMs between the dominant language (English) and target languages is strongly correlated with safety performance. The authors propose directly minimizing the discrepancy in Reward Gap between the two to transfer the safety alignment capabilities of the dominant language to multiple languages. This method significantly reduces the attack success rate in low-resource languages across three models without compromising general capabilities.
MTSA: Multi-Turn Safety Alignment for LLMs through Multi-Round Red-Teaming: The proposed MTSA framework simultaneously enhances the attack capabilities of the red-team model and the safety defense performance of the target model through thought-guided multi-round red-teaming and reinforcement learning optimization with future rewards within an adversarial iterative process. It achieves state-of-the-art (SOTA) performance across multiple safety benchmarks without degrading general model capabilities.
Mutual-Taught for Co-adapting Policy and Reward Models: Mutual-Taught proposes a self-training framework based on the EM algorithm to iteratively update both the policy model (PM) and the reward model (RM) during the preference optimization process: the E-step optimizes the PM using the current RM, while the M-step updates the RM using pseudo-preference pairs constructed from the output differences of the PM before and after the update. This resolves the reward hacking issue caused by distribution shift, achieving a 54.1% LC win rate on AlpacaEval-2 with an 8B model.
Optimal Transport-Based Token Weighting for Enhanced Preference Optimization: OTPO utilizes Unbalanced Optimal Transport (UOT) to calculate semantic alignment weights between token representations of chosen and rejected responses, focusing preference optimization on critical distinguishing tokens instead of treating all tokens equally. It improves the LC WR of DPO from 48.14% to 55.84% on AlpacaEval 2.0, while unifying DPO, SimPO, SamPO, and LDDPO as special cases of token weighting.
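
The ExPO entry above (Model Extrapolation Expedites Alignment) gives the update \(\theta_2 = \theta_1 + \alpha\Delta\theta\). A minimal sketch of that extrapolation follows, assuming \(\theta_1\) is the DPO checkpoint and \(\Delta\theta\) is its difference from the SFT checkpoint (the summary does not spell this out). The dict-of-lists parameter store, the `extrapolate` helper, and the toy values are hypothetical stand-ins for real model weights.

```python
from typing import Dict, List

# Hypothetical flat parameter store: parameter name -> list of values.
Params = Dict[str, List[float]]


def extrapolate(theta_sft: Params, theta_dpo: Params, alpha: float) -> Params:
    """Return theta_dpo + alpha * (theta_dpo - theta_sft), parameter by parameter.

    Assumption: delta_theta is the SFT -> DPO update direction.
    """
    if theta_sft.keys() != theta_dpo.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    result: Params = {}
    for name, dpo_values in theta_dpo.items():
        sft_values = theta_sft[name]
        if len(sft_values) != len(dpo_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Step further along the direction that alignment training already moved.
        result[name] = [d + alpha * (d - s) for s, d in zip(sft_values, dpo_values)]
    return result


# Toy checkpoints (illustrative values only).
theta_sft = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta_dpo = {"layer.weight": [1.5, 2.0], "layer.bias": [-0.5]}
theta_expo = extrapolate(theta_sft, theta_dpo, alpha=0.5)
```

No gradient step is involved, which is consistent with the "zero additional training cost" claim; in practice \(\alpha\) would presumably be chosen on a held-out set.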

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference: This work releases PKU-SafeRLHF, a large-scale safety preference dataset containing 44.6k refined prompts, 265k QA pairs with safety meta-labels, and 166.8k preference data items. It introduces 19 harm categories and 3 severity levels of annotation for the first time, and trains a severity-sensitive moderation model (93% accuracy) along with a SafeRLHF alignment pipeline based on this dataset.
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models: This paper proposes PRMBench, a benchmark consisting of 6,216 carefully designed problems and 83,456 step-level labels, to systematically evaluate the fine-grained error detection capabilities of process-level reward models (PRMs) across three dimensions: Simplicity, Soundness, and Sensitivity. Experiments reveal significant deficiencies in 15 existing PRMs.
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning: PCPO introduces a token-level probability consistency metric into the preference pair selection stage. By selecting pairs where the answer is correct and the reasoning process is most "similar" to that of the incorrect response for DPO training, the model is forced to focus on key reasoning differences. This approach consistently outperforms IRPO/ScPO across multiple mathematical reasoning benchmarks.
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language: QueryAttack is proposed to decompose harmful natural language queries into three semantic components (content, modifiers, category) and insert them into programming language templates (9 languages including SQL/URL/Python/Java/C++). Combined with in-context learning (ICL), it guides the target LLM to reply directly with harmful content in natural language without any decryption steps, achieving a 96.35% ASR on GPT-4o in the Ensemble configuration. Additionally, the proposed cross-lingual CoT defense can reduce the ASR by up to 64%.
Red Queen: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking: This paper proposes Red Queen Attack, the first jailbreak attack method based on the Theory of Mind (ToM) that constructs multi-turn dialogue scenarios to conceal malicious intent, generating 56K multi-turn concealed attack data points and achieving an 87.6% ASR on GPT-4o. Concurrently, the Red Queen Guard defense strategy is introduced to reduce the ASR to <1% through multi-turn DPO data training, without compromising general benchmark performance.
Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization: This paper revisits reward model evaluation through the lens of reward overoptimization, finding that existing benchmarks correlate weakly with downstream policy performance. To address this, it proposes three key design principles for building reliable benchmarks: minimizing non-correctness confounding differences in positive/negative samples, using multiple comparisons to cover a wide response spectrum, and sampling responses from diverse models.
Rethinking Table Instruction Tuning: This work systematically ablates overlooked hyperparameter choices (learning rate, data scale, and epoch count) in table instruction tuning. It reveals that existing table LLMs suffer from severe degradation of general capabilities (MMLU drops by 14 points, AI2ARC by 21 points) due to excessively large learning rates (\(2 \times 10^{-5}\)). To address this, the authors propose TAMA, which is constructed by tuning LLaMA 3.1 8B Instruct on only 2,600 samples (200 samples from each of 13 datasets) with a learning rate of \(1 \times 10^{-6}\) and for 2 epochs. TAMA matches or outperforms GPT-3.5 and GPT-4 across 13 table tasks while fully preserving its general capabilities.
Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation: This paper proposes a two-stage training strategy for the automatic generation of industrial visual programming languages (specifically Ladder Diagrams): first, leveraging subroutine reuse characteristics via Retrieval-Augmented Fine-Tuning, and second, further improving accuracy through DPO training with preference pairs constructed via graph edit operations, achieving an over 10% improvement in program-level accuracy on real-world LD data.
Reverse Preference Optimization for Complex Instruction Following: This paper proposes Reverse Preference Optimization (RPO), which converts arbitrary responses into "perfect" chosen samples by dynamically reversing unsatisfied constraints in the instruction. This eliminates noise in multi-constraint preference pairs and significantly outperforms DPO baselines on multi-turn complex instruction-following tasks.
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective: This paper unifies various reward biases in RLHF (such as length bias, category bias, and social bias) under the concept of "reward unfairness". Drawing on resource allocation theory, the authors propose two bias-agnostic methods, Fairness Regularization and Fairness Coefficient, applied to reward model training and policy model training respectively. These methods simultaneously mitigate multiple biases and improve alignment quality without being tailored to any specific bias.
Reward Generalization in RLHF: A Topological Perspective: Systematically characterizes the flow of reward information in RLHF from the perspective of information topology. At the macro level, RLHF is modeled as an autoencoding process. At the micro level, the Induced Bayesian Network (IBN) is proposed to analyze how preference data topology affects reward generalization, leading to a tree-structured preference data method. This method outperforms the chain-based baseline with an average win rate of 65% across three tasks: HH-RLHF, GSM-8K, and DialogSum.
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction: R2J (Rewrite to Jailbreak) is proposed as a learnable and transferable black-box jailbreak method. By iteratively training an attacker LLM to rewrite harmful instructions (modifying only the phrasing without altering the intent), it achieves an Attack Success Rate (ASR) improvement of over 20% compared to methods like GCG and AutoDAN. R2J generates jailbreak prompts without additional prefixes or suffixes, making them more stealthy and highly transferable across models.
RISE: Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing: RISE finds that around 75% of LLM mathematical errors are subtle inner-step errors (digit substitutions, operand swaps, step omissions). It has the LLM self-edit correct solutions by injecting predefined subtle errors, which yields high-quality hard negative samples. Combined with error-aware DPO training, this method improves performance on GSM8K by 3.0% and on MATH by 7.9% using only 4.5K samples, while generalizing to logical reasoning and code generation.
Robust Preference Optimization via Dynamic Target Margins: This paper proposes \(\gamma\)-PO, a plug-and-play method that enhances the robustness of DPO by dynamically adjusting target reward margins at the preference pair level, achieving an average improvement of 4.4% on AlpacaEval2 and Arena-Hard.
RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation: This paper proposes Retrieval Preference Optimization (RPO), a lightweight preference alignment method designed specifically for RAG. By implicitly integrating retrieval quality evaluation into the generation process, it enables LLMs to adaptively choose between parametric and retrieved knowledge, mitigating hallucinations caused by knowledge conflicts without requiring additional components.
Safety Alignment via Constrained Knowledge Unlearning: This paper proposes Constrained Knowledge Unlearning (CKU), which removes harmful knowledge by locating useful knowledge neurons in MLP layers and protecting their gradients during the unlearning process, significantly enhancing the safety of LLMs without compromising their general capabilities.
SDPO: Segment-Level Direct Preference Optimization for Social Agents: SDPO optimizes preferences in multi-turn social dialogues at the granularity of "segments." It dynamically locates error turns, resamples positive instances from the history before the error point, and selects equal-length key segment pairs for training. This reduces the training noise of session-level DPO, and the equal-length constraint strictly eliminates the partition function \(Z\). SDPO outperforms GPT-4o and all DPO variants on the SOTOPIA benchmark.
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings: This work proposes the SEA framework, which generates synthetic modality embeddings (without requiring real images/videos/audio) via gradient optimization. It achieves safety alignment for multimodal LLMs using only textual safety data. A high-quality embedding can be synthesized in just 24 seconds on a single RTX 3090. Additionally, the video and audio safety benchmark VA-SafetyBench is released.
SQL Injection Jailbreak: A Structural Disaster of Large Language Models: This paper proposes SQL Injection Jailbreak (SIJ), a novel jailbreak method that exploits structural vulnerabilities in LLM prompt construction. SIJ achieves a nearly 100% attack success rate on open-source models and over 85% on average on closed-source models. The paper also proposes a defense mechanism, Self-Reminder-Key.
Synergistic Weak-Strong Collaboration by Aligning Preferences: This paper proposes the CoWest framework, in which a specialized weak model (such as LLaMA3-8B) generates initial drafts that a general strong model (such as GPT-4) then refines. Collaborative feedback is used to fine-tune the weak model via DPO so that it aligns with the strong model's preferences. The framework significantly outperforms individual models and existing collaborative methods across counterfactual reasoning, medicine, and ethics.
SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs: This paper introduces SynthesizeMe, which reasons over a limited number of pairwise preference interactions to synthesize user personas automatically. The personas are used to construct interpretable and transferable personalized prompts, significantly improving personalized preference prediction accuracy on PersonalRewardBench.
T-REG: Preference Optimization with Token-Level Reward Regularization: T-REG proposes a token-level reward regularization method that leverages contrastive prompting of LLMs to self-generate token-level reward signals. These signals are used as weak supervision to guide the token-level reward assignment implicitly learned by DPO, outperforming DPO by up to 3.8% on AlpacaEval 2 and 4.4% on Arena-Hard.
TableDreamer: Progressive and Weakness-Guided Data Synthesis from Scratch for Table Instruction Tuning: This paper proposes TableDreamer, a two-stage data synthesis framework. Stage 1 synthesizes highly diverse tables and seed instruction data from scratch. Stage 2 explores the input space through weakness-guided iterative data evolution: data is evolved along three orthogonal directions, and an LLM-as-judge selects the samples on which the model performs poorly as seeds for the next round. Using only 27K GPT-4o-synthesized samples, it improves the average accuracy of Llama 3.1-8B by 11.62%, outperforming all baseline methods that use 80K–100K samples.
Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences: This paper proposes DiSCo, a secure code preference dataset distilled from frontier LLMs (10K instances covering 431 CWEs), and LPO (Localized Preference Optimization), an algorithm that propagates loss only on security-related tokens. Together they reduce security vulnerabilities by 19-40% across four secure coding benchmarks while improving code quality by 3-10%.
Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search: This paper proposes Tempest (referred to as Siege in its early version), a multi-turn adversarial framework based on breadth-first tree search. By tracking the partial compliance information of the target LLM and re-injecting it into subsequent queries, Tempest achieves a 100% attack success rate against GPT-3.5-turbo and 97% against GPT-4 on JailbreakBench, requiring significantly fewer queries than baselines like Crescendo/GOAT.
Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling: This paper models attributed text generation (text generation with citations) as a multi-step reasoning problem, proposing Self-Guided Monte Carlo Tree Search (SG-MCTS) combined with Progress Reward Modeling (PRM). Through multi-path search, intermediate state reflection, and a dual-dimensional progress reward (generation and attribution), the proposed method significantly outperforms all baselines across three datasets on the ALCE benchmark.
A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns: This paper proposes TMCHT, a large-scale multi-agent, multi-topology text attack evaluation framework, and the ARCJ (Adversarial Replicative Contagious Jailbreak) method. ARCJ optimizes a retrieval suffix to raise the retrieval probability of toxic samples and a replicative suffix to give toxic information a self-replicating, contagious capability. This addresses the "toxicity dissipation" problem that single-agent attack methods face in multi-agent systems.
Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging: This paper proposes UpIT (Upcycling Instruction Tuning), which leverages intermediate checkpoints from the instruction tuning process of a dense model as specialized experts, and achieves data-efficient and flexible dense-to-MoE upcycling through genetic algorithm-based expert expansion and router pre-optimization.
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning: A Dual Preference Optimization (\(D^2PO\)) framework is proposed. By jointly optimizing preference learning for the dual objectives of state prediction (world modeling) and action selection, vision-language models simultaneously learn to "understand world dynamics" and "make better decisions" during embodied task planning. This allows a 7B model to significantly outperform GPT-4o.
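Many of the entries above (RPO, RISE, \(\gamma\)-PO, SDPO, T-REG, LPO, \(D^2PO\)) build on the DPO objective. As a shared reference point, the following is a minimal sketch of the standard per-pair DPO loss in plain Python. It is not the implementation of any listed paper: the function names, the default `beta`, and the optional fixed `margin` argument are illustrative assumptions, and the margin here is a constant rather than the dynamic per-pair rule that \(\gamma\)-PO describes.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,    # assumed default; scales the implicit reward
    margin: float = 0.0,  # hypothetical fixed target margin (0 = plain DPO)
) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    The implicit reward of a response is beta * (policy logp - reference logp).
    The loss is -log sigmoid(chosen reward - rejected reward - margin).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -log_sigmoid(chosen_reward - rejected_reward - margin)


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen response more than the reference does: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

With a positive `margin`, the same pair incurs a higher loss until the reward gap exceeds the margin, which is the quantity that margin-based variants adjust.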