# Robust Multi-Objective Controlled Decoding of Large Language Models

**Conference:** ICLR 2026 · **arXiv:** 2503.08796 · **Code:** GitHub · **Area:** Reinforcement Learning · **Keywords:** multi-objective alignment, inference-time alignment, controlled decoding, robust optimization, minimax game
## TL;DR
This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically computes worst-case objective weights by solving for the Nash equilibrium of a minimax game, achieving robust multi-objective alignment of LLMs without requiring any prior knowledge of objective weights.
## Background & Motivation
LLMs must simultaneously align with multiple objectives (e.g., helpfulness, harmlessness, safety, instruction-following). Multi-objective alignment naturally raises the question: how should multiple potentially conflicting objectives be balanced at inference time?
Existing methods typically require manually specified objective weights, yet weight selection faces several difficulties:

- Shi et al. (2024) select weights via hyperparameter search on a validation set, which is susceptible to distribution shift.
- Approaches based on user profiles or interaction history require additional information that is often unavailable in practice.
- When safety is one of the objectives, it cannot be ignored, yet excessive conservatism is also undesirable.
Core motivation: achieve robust alignment without any prior weight information by maximizing the worst-case objective—ensuring that the weakest objective receives the greatest attention.
## Method

### Overall Architecture

At each decoding step, RMOD leverages value functions \(V_g\) trained for each objective and determines the weights and policy by solving for the Nash equilibrium of:

$$\max_\pi \min_{w \in \Delta^{G-1}} \; \lambda \sum_g w_g V_g(x, y^t; \pi) - D_{\mathrm{KL}}(\pi \,\|\, \pi_{\text{ref}})$$
### Key Designs

- **Minimax Game Formulation**
    - Function: Models robust multi-objective alignment as a two-player zero-sum game between the policy \(\pi\) and the weights \(w\).
    - Mechanism: The objective is linear in \(w\) and concave in \(\pi\); thus a Nash equilibrium exists, and the minimax theorem permits swapping the max-min order, reducing the problem to first solving for the optimal policy and then optimizing the weights.
    - Design Motivation: The minimax formulation ensures that no single objective is severely neglected.
- **Closed-Form Optimal Policy**
    - Function: Derives the optimal sampling policy for a given weight vector \(w\).
    - Mechanism (Proposition 1): \(\pi(z \mid [x, y^t]; w) = \frac{\pi_{\text{ref}}(z \mid [x, y^t]) \exp(\lambda \sum_g w_g V_g(x, y^t; z))}{Z(x, y^t, w)}\)
    - Design Motivation: The closed-form solution avoids costly policy search and is consistent with the Boltzmann form of standard KL-regularized RLHF.
- **Convex Optimization for Worst-Case Weights**
    - Function: Reduces the weight search to a convex optimization problem in LogSumExp form.
    - Mechanism: \(w^* = \arg\min_{w \in \Delta^{G-1}} \log \mathbb{E}_{z \sim \pi_{\text{ref}}}[\exp(\lambda \sum_g w_g V_g)]\), with weights updated via exponentiated gradient descent, \(w_{g,i+1} = w_{g,i} \cdot \exp(-\eta \, \partial L / \partial w_g)\) where \(L\) is the LogSumExp objective, followed by renormalization onto the simplex.
    - Design Motivation: Convexity guarantees a global optimum; the search space has dimension only \(G\) (the number of objectives), ensuring computational efficiency.
- **Block-wise Decoding**
    - Function: Divides generation into blocks of length \(B\), selecting the best of \(K\) candidates per block.
    - Mechanism: \(K\) block candidates are sampled from \(\pi_{\text{ref}}\), their value functions are evaluated, the weights are updated for \(I\) iterations, and the block with the highest weighted value is selected.
    - Design Motivation: Substantially reduces the number of value-function evaluations compared to token-level decoding.
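Putting the last three designs together, one decoding step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation; the candidate values, \(\lambda\), \(\eta\), and the iteration count are all made up, and \(\pi_{\text{ref}}\) is taken as uniform over the \(K\) sampled candidates:

```python
import math

def rmod_block_step(values, lam=1.0, eta=0.5, iters=50):
    """One RMOD-style decoding step over K candidate blocks (sketch).

    values[g][k] is the estimated value of candidate block k under
    objective g.  Returns (w, p): worst-case weights found by
    exponentiated gradient descent, and the induced sampling
    distribution p(k) proportional to exp(lam * sum_g w_g * V_g(k)).
    """
    G, K = len(values), len(values[0])
    w = [1.0 / G] * G  # initialize on the simplex

    def policy(weights):
        # Closed-form policy of Proposition 1, restricted to the K samples.
        scores = [lam * sum(weights[g] * values[g][k] for g in range(G))
                  for k in range(K)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(iters):
        p = policy(w)
        # Gradient of the LogSumExp objective w.r.t. w_g is
        # lam * E_{k ~ p}[V_g(k)].
        grad = [lam * sum(p[k] * values[g][k] for k in range(K))
                for g in range(G)]
        # Exponentiated gradient descent step, then renormalize.
        w = [w[g] * math.exp(-eta * grad[g]) for g in range(G)]
        total = sum(w)
        w = [x / total for x in w]
    return w, policy(w)

# Two objectives, three candidate blocks: candidates 0 and 1 each favor
# one objective, candidate 2 is balanced.  (Values are hypothetical.)
vals = [[0.9, 0.1, 0.6],   # e.g. helpfulness values per candidate
        [0.1, 0.9, 0.6]]   # e.g. harmlessness values per candidate
w, p = rmod_block_step(vals, lam=2.0)
best = max(range(len(p)), key=lambda k: p[k])  # block-wise: keep the best block
```

With these symmetric values the weights stay near uniform, and the balanced candidate receives the highest probability, i.e. neither objective is sacrificed.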
### Loss & Training

Value functions are trained with an MSE loss, \(\mathbb{E}\big[\sum_t (V_g(x, y^t; \theta) - r_g(x, y))^2\big]\), using responses generated by the reference policy and their corresponding rewards: every prefix \(y^t\) regresses to the final reward of the complete response. RMOD itself is an inference-time algorithm and requires no training of a policy network.
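For a single trajectory and a single objective \(g\), this objective reduces to a few lines (illustrative sketch only; `pred_values[t]` stands for \(V_g(x, y^t; \theta)\) and the numbers are made up):

```python
def value_mse_loss(pred_values, final_reward):
    """Sum over prefixes t of (V_g(x, y^t) - r_g(x, y))^2: every prefix
    value regresses to the final reward of the complete response."""
    return sum((v - final_reward) ** 2 for v in pred_values)

# A trajectory with three prefixes whose final reward is 1.0.
loss = value_mse_loss([0.2, 0.6, 1.0], 1.0)
```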
## Key Experimental Results

### Main Results (HH Dataset, Worst-Case Reward)
| Method | Worst-Case Reward | Worst-Case Win Rate (WCWR) |
|---|---|---|
| CD-Helpful | High helpful, low harmless | Low |
| CD-Harmless | High harmless, low helpful | Low |
| CD-Uniform | Moderate balance | 57.6% |
| MO-GRPO | Moderate | 54.6% |
| RS/MOD | Below Uniform | — |
| Distill-RMOD | — | 57.9% |
| RMOD | Highest | 59.1% |
### Ablation Study

| Setting | Observed Effect | Notes |
|---|---|---|
| \(\lambda=0.1\) (low) | Close to Uniform | Weight distribution is approximately uniform |
| \(\lambda=0.5\) | Moderate robustness | Balanced trade-off |
| \(\lambda=10\) (high) | Most concentrated on worst objective | Weights are highly sparse |
| \(B=16\) (small block) | Highest win rate | Finer-grained control |
| \(B=256\) (large block) | Win rate decreases | Approaches reference policy |
| Number of objectives \(G = 2\)–\(10\) | RMOD consistently outperforms Uniform | Performance degrades as the objective count grows |
### Key Findings
- RMOD outperforms all baselines by up to 20% in worst-case win rate.
- Latency increases by only 4.5% over standard CD, demonstrating high computational efficiency.
- Distill-RMOD (SFT on RMOD-generated data) performs competitively without using controlled decoding at test time.
- LLM-as-Judge evaluation (GPT-4o) further confirms the superiority of RMOD.
## Highlights & Insights
- Theoretical Elegance: The problem is formulated as a convex-concave game with a closed-form solution and convex optimization, providing strong theoretical guarantees.
- Practical Flexibility: As an inference-time algorithm, alignment objectives can be switched on the fly with minimal latency overhead.
- In-depth Weight Behavior Analysis: KKT conditions are used to prove that optimal weights equalize the expected rewards across objectives.
- Distill-RMOD offers a practical pathway for distilling inference-time methods into standard policy models.
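As a sketch of that equalization argument (using the LogSumExp objective and the Proposition 1 policy from the Method section; \(\mu\) denotes the KKT multiplier for the simplex constraint):

$$
\frac{\partial}{\partial w_g} \log \mathbb{E}_{z \sim \pi_{\text{ref}}}\!\left[ e^{\lambda \sum_{g'} w_{g'} V_{g'}} \right]
= \lambda \, \mathbb{E}_{z \sim \pi(\cdot\,; w^*)}[V_g] = \mu
\quad \text{for every } g \text{ with } w_g^* > 0,
$$

so every objective that receives nonzero weight attains the same expected value under the optimal policy, which is precisely the worst-case-balancing behavior the minimax formulation is meant to induce.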
## Limitations & Future Work
- Performance degrades as the number of objectives increases beyond 10; large-scale multi-objective settings require further investigation.
- Training a separate value function for each objective incurs non-trivial preparation costs.
- The choice of \(\lambda\) governs the degree of robustness (sparsity of weights) and currently requires manual specification.
- Experiments are conducted only on gemma-2-2b-it; effectiveness on larger models remains unverified.
## Related Work & Insights
- Mudgal et al. (2023)'s Controlled Decoding serves as the direct foundation; RMOD extends it to a robust multi-objective setting.
- Shi et al. (2024)'s MOD requires pre-specified weights, whereas RMOD discovers them automatically.
- Yoon et al. (2024) and Ramesh et al. (2024) consider robust alignment but not in an inference-time framework.
- Insight: Combining inference-time algorithms with robust optimization yields a flexible and principled solution for multi-objective LLM alignment.
## Rating
- Novelty: ⭐⭐⭐⭐ The minimax inference-time alignment combination is novel, though individual components are relatively mature.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across multiple datasets, ablations, LLM-as-Judge evaluation, and latency analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear and problem motivation is intuitive.
- Value: ⭐⭐⭐⭐ Provides a principled inference-time solution for multi-objective LLM alignment.