Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards¶
Conference: ICML 2026
arXiv: 2510.01167
Code: https://github.com/pearls-lab/multiobj-align
Area: Alignment RLHF
Keywords: multi-objective alignment, Multi-Action-Head DPO, PRM-guided decoding, process reward model, verifiable/non-verifiable rewards
TL;DR¶
MAHALO integrates "standardized PRM training + Multi-Action-Head DPO + PRM-guided decoding with KV-cache persistence" into a unified framework. It enables a single LLM to be aligned simultaneously across three categories—mathematics (verifiable), human values (non-verifiable), and multi-turn tutoring (interactive)—while allowing smooth preference switching during inference through head weights and PRM selection.
Background & Motivation¶
Background: Mainstream alignment routes (RLHF / DPO) typically compress multi-dimensional preferences into a scalar reward, either by fixing a set of weights during training (e.g., MODPO linearization, parameter soups) or by using a single RM to guide generation at test time.
Limitations of Prior Work: (1) Scalarization during training erases trade-offs between dimensions, and changing weights requires retraining. (2) Parameter merging methods like DPO Soup / Personalized Soup incur high costs because single-objective experts must be retrained when adding new objectives. (3) Most reward-guided decoding at test time relies on outcome RMs, facing "training-inference granularity inconsistency" when scoring partial sequences. (4) Current PRM methods almost exclusively cover verifiable domains like mathematics, lacking a general step-level training paradigm for non-verifiable domains (helpfulness/honesty).
Key Challenge: Preserving the multi-dimensional preference structure during training and offering fine-grained control at test time currently pull against each other. Existing methods either discard the structure (scalarization) or pay for it in compute (retraining, per-objective experts, per-step re-encoding).
Goal: To solve three sub-problems within a single framework: (a) how to unify PRM training across verifiable and non-verifiable domains; (b) how to train \(H\) separable objective heads on a shared backbone so that they can be mixed on demand during inference; (c) how to apply PRMs to step-level decoding without the overhead of re-encoding the prompt and prefix.
Key Insight: The authors observe that "reward verifiability" should determine whether optimization effort is spent on training or inference. Verifiable objectives (math correctness) naturally provide precise step-level signals, where PRM search yields the highest returns at test time. Non-verifiable objectives (helpfulness/engagement) have noisier signals and are better suited for shaping shared representations through multi-head training. Based on this dichotomy, the authors treat the training side (MAH-DPO) and the testing side (PRM-guided decoding) as complementary components.
Core Idea: Use a shared backbone + \(H\) DPO heads for "vectorized multi-objective alignment," and a cross-domain PRM for step-level guided decoding under KV-cache persistence. Training and inference can be used independently or superimposed, achieving "train once, deploy on-demand."
Method¶
Overall Architecture¶
The input consists of a set of multi-objective preference data \(\{\mathcal{D}_i\}_{i=1}^H\) (e.g., Acc / Eng for Math, Help / Honest / Truth for UltraFeedback, Acc / Eng for Socratic Mind), along with corresponding step-level annotations (or labels automatically generated by a PRM). The pipeline follows two branches:
- PRM Training Pipeline (Section 4): Training is categorized into four cases (Case A/B/C and Verifiable) based on "verifiability + rollout ease + structural clarity." Label construction methods are provided for each, resulting in a unified \(r_t\) used to train the PRM. The verifiable domain additionally introduces hindsight relabeling, training the PRM as both a step-quality and future-correctness predictor.
- MAHALO Main Framework (Section 5): During training, Multi-Action-Head DPO (shared backbone \(\theta_b\) + \(H\) linear heads \(W_i\)) is used to let each head focus on one objective. During inference, heads can be used individually, fused via weighted logits, or combined with PRM-guided decoding (a loop of candidate sampling, PRM scoring, and commitment at step boundaries), utilizing "running past KV cache" to avoid re-encoding.
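The hindsight relabeling named in the PRM pipeline bullet can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the discounted form \(\tilde r_t = r_t + \gamma^{n-t} z\) given under Key Designs, it assumes \(V_t^{\text{target}}\) is the plain average of relabeled step rewards over \(M\) rollouts, and the function names (`hindsight_relabel`, `value_target`, `prm_mse`) are hypothetical rather than taken from the released code.

```python
from statistics import mean

def hindsight_relabel(step_rewards, z, gamma=0.9):
    """Return r~_t = r_t + gamma**(n - t) * z for t = 1..n.

    step_rewards: per-step rewards r_1..r_n of one rollout.
    z: final correctness of that rollout (e.g. 1.0 if the answer is right).
    """
    n = len(step_rewards)
    return [r + gamma ** (n - t) * z
            for t, r in enumerate(step_rewards, start=1)]

def value_target(rollouts, t, gamma=0.9):
    """Assumed target: mean relabeled reward at step t over M rollouts.

    rollouts: list of (step_rewards, z) pairs, each with at least t steps.
    """
    return mean(hindsight_relabel(rewards, z, gamma)[t - 1]
                for rewards, z in rollouts)

def prm_mse(predictions, targets):
    """Mean squared error between PRM outputs p_t and value targets."""
    return mean((p - v) ** 2 for p, v in zip(predictions, targets))

# Two toy rollouts: one ends correct (z = 1), one ends wrong (z = 0).
rollouts = [([0.5, 0.5, 1.0], 1.0), ([0.5, 0.0, 0.0], 0.0)]
target_step2 = value_target(rollouts, t=2, gamma=0.5)
```

Later steps receive a larger share of the final outcome because the exponent \(n - t\) shrinks as \(t\) approaches \(n\), which is what lets the PRM act as a predictor of future correctness as well as of step quality.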
Key Designs¶
- Standardized PRM Training (Unifying Step-level Supervision for Verifiable / Non-verifiable Domains):
    - Function: Provides a unified step-level training signal \(r_t\) and PRM loss across very different domains: mathematics, human values, and dialogue tutoring.
    - Mechanism: In the verifiable domain, step-level rewards are combined with hindsight relabeling, which propagates the final correctness \(z\) back to each of the \(n\) steps with discount \(\gamma\): \(\tilde r_t = r_t + \gamma^{n-t} z\). Averaging over \(M\) rollouts yields the target \(V_t^{\text{target}}\), which the PRM prediction \(p_t\) fits via MSE: \(\mathcal{L}_{\text{PRM}} = \mathbb{E}[(p_t - V_t^{\text{target}})^2]\). Non-verifiable domains are handled in three cases: Case A (clear steps, cheap rollouts) uses a calibrated LLM-as-Judge with majority voting over multiple rollouts; Case B (expensive rollouts, e.g., multi-turn dialogue) uses the judge to score prefixes directly; Case C (no clear process structure) falls back to Bradley-Terry-style scoring of partial sequences.
    - Design Motivation: Prior PRM work was almost exclusively limited to domains with automatic verifiers. This paper recasts "process reward" from a correctness judgment into a mapping "prefix \(\to\) expected success probability" and handles it case by case, so that a single PRM training paradigm covers the entire alignment spectrum.
- Multi-Action-Head DPO (Vectorized Multi-objective Alignment + Inference-time Tunability):
    - Function: Enables a single LLM backbone to carry \(H\) objectives simultaneously. Each objective undergoes independent DPO on its own head during training, while inference allows single-head or weighted-fusion usage.
    - Mechanism: A shared backbone provides hidden states \(h_{\theta_b}(x, y_{1:t}) \in \mathbb{R}^d\). Each objective \(i\) is assigned an independent projection head \(W_i \in \mathbb{R}^{d \times |V|}\), yielding objective-specific logits \(z_i = W_i^\top h_{\theta_b}\). Heads are initialized from the SFT head with small perturbations, while the reference model \(\pi_\text{ref}\) uses a frozen SFT head. The loss for objective \(i\) is \(\mathcal{L}_i = -\mathbb{E}_{\mathcal{D}_i}[\log \sigma(\beta \Delta_i)]\), where \(\Delta_i\) is taken here to be the standard DPO margin computed with head \(i\): \(\Delta_i = \log \frac{\pi_i(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \log \frac{\pi_i(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\) for a preferred response \(y_w\) and a rejected response \(y_l\). The total loss is \(\mathcal{L}_{\text{MAH-DPO}} = \sum_i \alpha_i \cdot \frac{1}{|\mathcal{B}_i|}\sum_{\mathcal{B}_i} \mathcal{L}_i\), where \(\mathcal{B}_i\) is the part of the batch routed to head \(i\). Inference follows \(\pi_\text{MAH}(y_t \mid \cdot) = \text{Softmax}(\sum_i w_i z_i)\).
    - Design Motivation: Unlike MODPO, which fixes weights during training, or DPO Soup, which requires separate models, MAH-DPO places "objective separation" in the lightweight final layer and "knowledge sharing" in the heavy backbone. This avoids \(H\)-fold training overhead and allows sliding along the Pareto front during inference without retraining.
- PRM-guided Decoding with Continuing Hidden State (Step-level Inference-time Control):
    - Function: Samples \(K\) candidates at each "natural boundary" (newlines in math, sentences/paragraphs in values, turns in dialogue) and commits the one with the highest PRM score.
    - Mechanism: A "running past KV cache" \(\text{kv}_t\) is maintained. At each step, \(K\) local cache copies are cloned from \(\text{kv}_t\). Each copy samples independently until a boundary in \(\mathcal{Q}\) is hit, yielding candidate \(y_{t+1}^k\) and cache \(\text{kv}_{t+1}^k\). The PRM \(P\) scores each candidate as \(r_k = P(x, y_{1:t}, y_{t+1}^k)\), and the cache \(\text{kv}_{t+1}^{k^\star}\) of the winner \(k^\star = \arg\max_k r_k\) is promoted to the next running cache.
    - Design Motivation: Traditional reward-guided decoding re-concatenates and re-encodes the prefix at every step, which is costly and accumulates distribution shift and errors. Continuing the KV cache maintains "hidden state continuity," achieving a \(4.2\times\) speedup over standard PRM-guided methods.
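The decoding loop in the third design can be sketched as follows. This is a toy, not the authors' implementation: a plain Python list stands in for the KV cache, and `sample_token`, `prm_score`, `is_boundary`, and the `toy_*` helpers are hypothetical callables. In a real model, `sample_token` would run a forward pass on the newest token only and extend the cached keys and values.

```python
import copy
import random

def guided_decode(prompt_cache, sample_token, prm_score, is_boundary,
                  num_candidates=4, num_steps=3, max_step_tokens=16, seed=0):
    """Step-level best-of-K decoding that carries one running cache forward.

    Assumes num_candidates >= 1. Returns (committed_steps, running_cache).
    """
    rng = random.Random(seed)
    running_cache = copy.deepcopy(prompt_cache)   # kv_t: prompt encoded once
    committed = []                                # committed steps y_1..y_t
    for _ in range(num_steps):
        best = None
        for _ in range(num_candidates):
            cache = copy.deepcopy(running_cache)  # clone kv_t for candidate k
            step = []
            for _ in range(max_step_tokens):
                token = sample_token(cache, rng)
                cache.append(token)               # extend the local copy only
                step.append(token)
                if is_boundary(token):
                    break
            score = prm_score(committed, step)    # r_k for this candidate
            if best is None or score > best[0]:
                best = (score, step, cache)
        _, best_step, running_cache = best        # promote the winner's cache
        committed.append(best_step)
    return committed, running_cache

# Toy stand-ins (all hypothetical) so the loop can be run end to end.
VOCAB = ["good", "bad", "\n"]

def toy_sample(cache, rng):
    return rng.choice(VOCAB)

def toy_prm(committed, step):
    return step.count("good") - step.count("bad")

def toy_boundary(token):
    return token == "\n"

steps, cache = guided_decode(["<prompt>"], toy_sample, toy_prm, toy_boundary)
```

The point of the design is visible in the return value: the running cache always equals the prompt plus the committed steps, so no prefix is ever re-encoded. The price is \(K\) cache copies per step, which a real implementation would likely batch rather than deep-copy.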
Loss & Training¶
PRMs use MSE to fit hindsight value targets. MAH-DPO routes samples within a batch to their respective heads to calculate DPO losses before weighted aggregation (\(\beta\) controls preference strength, \(\alpha_i\) controls objective importance). Equal \(\alpha_i\) and balanced sampling were used for fair comparison. All results are averaged over 3 independent runs.
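The batch routing and weighted aggregation described above can be illustrated with a minimal sketch. It assumes the standard DPO margin for \(\Delta_i\) and works on precomputed sequence log-probabilities; the function names and the `(head_index, margin)` batch layout are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Standard DPO margin: policy-vs-reference log-ratio of the preferred
    response minus that of the rejected response."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

def dpo_loss(margin, beta):
    """-log(sigmoid(beta * margin)), written in a numerically stable form."""
    x = beta * margin
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def mah_dpo_loss(batch, alphas, beta=0.1):
    """Route each sample to its head, average per head, then weight by alpha.

    batch: list of (head_index, margin) pairs, where the margin was computed
           with the logits of that sample's own head.
    alphas: per-objective importance weights alpha_i.
    """
    per_head = defaultdict(list)
    for head, margin in batch:
        per_head[head].append(dpo_loss(margin, beta))
    return sum(alphas[h] * sum(losses) / len(losses)
               for h, losses in per_head.items())

# Mixed batch: two samples for head 0, one for head 1, equal alphas.
batch = [(0, dpo_margin(-1.0, -2.0, -1.5, -1.5)), (0, 0.0), (1, 0.5)]
loss = mah_dpo_loss(batch, alphas=[0.5, 0.5], beta=0.1)
```

Averaging within each head before applying \(\alpha_i\) keeps an objective with more samples in the batch from dominating the total, which is consistent with the balanced sampling mentioned above.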
Key Experimental Results¶
Main Results: Alignment During Training (MAH-DPO vs. Baselines)¶
| Dataset | Metric | Base | SFT | Single-Head DPO | MODPO | DPO Soup | MAH-DPO Ensemble |
|---|---|---|---|---|---|---|---|
| Math | Acc | 0.711 | 0.730 | 0.725 | 0.728 | 0.726 | 0.725 |
| Math | Eng | 0.501 | 0.592 | 0.716 | 0.737 | 0.735 | 0.873 |
| Human Values | Help | 0.580 | 0.555 | 0.604 | 0.618 | 0.613 | 0.639 |
| Human Values | Honest | 0.304 | 0.300 | 0.306 | 0.348 | 0.322 | 0.369 |
| Human Values | Truth | 0.189 | 0.199 | 0.201 | 0.233 | 0.215 | 0.248 |
| Socratic Mind | Acc | 0.656 | 0.679 | 0.704 | 0.705 | – | 0.689 |
| Socratic Mind | Eng | 0.322 | 0.347 | 0.446 | 0.360 | – | 0.451 |
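MAH-DPO Ensemble is the strongest on all three Human Values dimensions. On Math, it leads clearly in Eng (0.873 vs. 0.737 for MODPO) while trailing slightly in Acc (0.725 vs. 0.730 for SFT). On Socratic Mind, it has the best Eng (0.451) but trails Single-Head DPO and MODPO in Acc (0.689 vs. 0.704 and 0.705).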
Main Results: PRM-guided Decoding Gains during Testing¶
| Dataset | Configuration | Scores (gain over Base in points) |
|---|---|---|
| Math | Base | Acc 0.685, Eng 0.513 |
| Math | Accuracy Value-guided | Acc 0.799 (+11.4), Eng 0.455 |
| Math | Engaging PRM-guided | Acc 0.701, Eng 0.719 (+20.6) |
| Human Values | Helpful PRM-guided | Help 0.671, Honest 0.405, Truth 0.279 |
| Human Values | Honesty PRM-guided | Help 0.645, Honest 0.469, Truth 0.338 |
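Guided decoding is the main lever for the verifiable objective: Math Acc rises by 11.4 points here, whereas training-time alignment left it nearly flat (0.711 to 0.725 for MAH-DPO in the previous table). Eng also improves under its own PRM (+20.6 points), but by less than it gains from MAH-DPO training (0.501 to 0.873). This pattern supports the claim that "reward verifiability \(\to\) higher returns for search."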
Training + Testing Synergy (Excerpts from Table 5)¶
| Dataset | Configuration | Key Metrics |
|---|---|---|
| Math | MAH-DPO + Accuracy Value | Acc 0.800 / Eng 0.855 |
| Math | MAH-DPO + Engaging PRM | Acc 0.721 / Eng 0.906 |
| Human Values | MAH-DPO + Honest PRM | Honest 0.520 / Truth 0.411 |
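Combining training and inference pushes the Pareto front outward: on Math, MAH-DPO + Accuracy Value matches the decoding-only accuracy (0.800 vs. 0.799) while lifting Eng from 0.455 to 0.855. The results also show positive transfer between related objectives (e.g., the Honest PRM raises Truth as well as Honest).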
Key Findings¶
- Verifiability Dictates Optimization Focus: Highly verifiable rewards like Math Acc rely primarily on test-time PRM search (+11.4 points), whereas subjective rewards like Help/Honest/Eng rely on multi-head training for representation shaping.
- Head Weights Smoothly Control Pareto Front: Adjusting Acc/Eng head weights on Math produces a smooth Pareto curve without "objective collapse," allowing on-demand adjustment.
- Unified PRM Enables Cross-domain Transfer: A PRM trained on mixed data covering all 7 dimensions outperforms the base model across all domains, suggesting that process rewards share structure across domains.
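The head-weight control behind the second finding follows the fusion rule \(\text{Softmax}(\sum_i w_i z_i)\). The sketch below sweeps the weight between two heads on a three-token toy vocabulary. The logit values and the names `fuse_heads`, `acc_logits`, and `eng_logits` are made up for illustration and say nothing about the paper's actual Pareto curve.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_heads(head_logits, weights):
    """Next-token distribution Softmax(sum_i w_i * z_i).

    head_logits: H logit vectors over the same vocabulary.
    weights: H mixing weights w_i chosen at inference time.
    """
    vocab_size = len(head_logits[0])
    fused = [sum(w * z[v] for w, z in zip(weights, head_logits))
             for v in range(vocab_size)]
    return softmax(fused)

# Toy logits: the accuracy head prefers token 0, the engagement head token 1.
acc_logits = [2.0, 0.0, 0.0]
eng_logits = [0.0, 2.0, 0.0]

# Sweep the accuracy weight from 0 to 1 with no retraining.
sweep = [fuse_heads([acc_logits, eng_logits], [w, 1.0 - w])
         for w in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Because only the mixing weights change, every point of the sweep reuses the same backbone pass; setting one weight to 1 and the rest to 0 recovers single-head decoding.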
Highlights & Insights¶
- The decision matrix (verifiability vs. rollout cost vs. structure) for PRM training is a major engineering contribution, extending process supervision from math to the full alignment stack.
- Multi-Action-Head DPO acts as a minimalist MoE variant for alignment, avoiding gradient interference by routing objective-specific data to dedicated heads. The overall latency and VRAM overheads are only +13% and +7%, respectively.
- The "continuing hidden state" trick achieves the reported \(4.2\times\) speedup, making step-level guidance practical for deployment and potentially applicable to speculative decoding or agentic workflows.
Limitations & Future Work¶
- PRM labels for non-verifiable domains rely on LLM-as-Judge; judge bias may propagate to the PRM and downstream policies.
- Scale/Capacity: Future work should verify whether backbone capacity becomes a bottleneck when scaling to 70B+ models or \(>10\) objectives.
- Static Weights: MAH-DPO head weights are currently set manually; adaptive online adjustment based on user feedback is a promising direction.
- Boundary Detection: Step-level split strategies for structured outputs (Code, JSON) or extremely long sequences require refinement.
Related Work & Insights¶
- vs. MODPO: MODPO fixes weights during training. MAH-DPO decouples them to independent heads, outperforming MODPO on Human Values while remaining adjustable for Pareto sliding.
- vs. DPO Soup: MAH-DPO avoids the need to train multiple full models for separate objectives, adding only a single linear head per objective.
- vs. Reward-Guided Decoding: Addresses the granularity mismatch of outcome RMs and the inefficiency of per-step re-encoding through PRMs and KV-cache persistence.
- vs. Math-Shepherd: While Math-Shepherd is math-specific, this work abstracts the process reward paradigm to cover the entire alignment spectrum.
Rating¶
- Novelty: ⭐⭐⭐⭐ While individual components exist, the unified recipe based on reward verifiability is a clear new contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across 3 domains, 7 dimensions, and various training/inference configurations.
- Writing Quality: ⭐⭐⭐⭐ Findings are concisely summarized; technical descriptions are dense but accurate.
- Value: ⭐⭐⭐⭐⭐ Provides actionable components (MAH-DPO, fast PRM decoding) and a solid empirical rule for multi-objective optimization.