
Robust Multi-Objective Controlled Decoding of Large Language Models

Conference: ICLR 2026 | arXiv: 2503.08796 | Code: GitHub | Area: Reinforcement Learning | Keywords: multi-objective alignment, inference-time alignment, controlled decoding, robust optimization, minimax game

TL;DR

This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically computes worst-case objective weights by solving for the Nash equilibrium of a minimax game, achieving robust multi-objective alignment of LLMs without requiring any prior knowledge of objective weights.

Background & Motivation

LLMs must simultaneously align with multiple objectives (e.g., helpfulness, harmlessness, safety, instruction-following). Multi-objective alignment naturally raises the question: how should multiple potentially conflicting objectives be balanced at inference time?

Existing methods typically require manually specified objective weights, yet weight selection faces several difficulties:

  • Shi et al. (2024) select weights via hyperparameter search on a validation set, which is susceptible to distribution shift.
  • Approaches based on user profiles or interaction history require additional information that is often unavailable in practice.
  • When safety is one of the objectives, it cannot be ignored, yet excessive conservatism is also undesirable.

Core motivation: achieve robust alignment without any prior weight information by maximizing the worst-case objective—ensuring that the weakest objective receives the greatest attention.

Method

Overall Architecture

At each decoding step, RMOD leverages value functions \(V_g\) trained for each objective and determines the weights and the sampling policy by solving for the Nash equilibrium of \(\max_\pi \min_{w \in \Delta^{G-1}} \big[\lambda \sum_g w_g V_g(x,y^t;\pi) - D_{KL}(\pi \,\|\, \pi_{\text{ref}})\big]\).
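
Because the inner maximization over \(\pi\) has a closed-form Gibbs solution (Key Design 2 below), substituting it back reduces the game to a convex minimization over the weights alone. A sketch of this standard reduction, in the notation above:

\(\min_{w \in \Delta^{G-1}} \max_\pi \big[\lambda \sum_g w_g V_g(x,y^t;\pi) - D_{KL}(\pi \,\|\, \pi_{\text{ref}})\big] = \min_{w \in \Delta^{G-1}} \log \mathbb{E}_{z \sim \pi_{\text{ref}}}\big[\exp\big(\lambda \sum_g w_g V_g(x,y^t;z)\big)\big]\),

which is exactly the convex weight problem of Key Design 3.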

Key Designs

  1. Minimax Game Formulation:

    • Function: Models robust multi-objective alignment as a two-player zero-sum game between policy \(\pi\) and weights \(w\).
    • Mechanism: The objective is linear in \(w\) and concave in \(\pi\); thus a Nash equilibrium exists and the minimax theorem permits swapping the max-min order, reducing the problem to first solving for the optimal policy and then optimizing the weights.
    • Design Motivation: The minimax formulation ensures that no single objective is severely neglected.
  2. Closed-Form Optimal Policy:

    • Function: Derives the optimal sampling policy given weights \(w\).
    • Mechanism (Proposition 1): \(\pi(z|[x,y^t];w) = \frac{\pi_{\text{ref}}(z|[x,y^t]) \exp(\lambda \sum_g w_g V_g(x,y^t;z))}{Z(x,y^t,w)}\)
    • Design Motivation: The closed-form solution avoids costly policy search and is consistent with the Boltzmann form of standard KL-regularized RLHF.
  3. Convex Optimization for Worst-Case Weights:

    • Function: Reduces the weight search to a convex optimization in LogSumExp form.
    • Mechanism: \(w^* = \arg\min_{w \in \Delta^{G-1}} \log \mathbb{E}_{z\sim\pi_\text{ref}}[\exp(\lambda \sum_g w_g V_g)]\), with weights updated via exponentiated gradient descent, \(w_{g,i+1} \propto w_{g,i} \cdot \exp(-\eta \, \partial f/\partial w_g)\), where \(f\) denotes the LogSumExp objective and the weights are renormalized onto the simplex after each update.
    • Design Motivation: Convexity guarantees a global optimum; the search space has dimension only \(G\) (number of objectives), ensuring computational efficiency.
  4. Block-wise Decoding:

    • Function: Divides the generation into blocks of length \(B\), selecting the best among \(K\) candidates per block.
    • Mechanism: \(K\) block candidates are sampled from \(\pi_{\text{ref}}\), value functions are evaluated, weights are iteratively updated \(I\) times, and the block with the highest weighted value is selected.
    • Design Motivation: Substantially reduces the number of value-function evaluations compared with token-level decoding; see the sketch of one block step after this list.
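
To make one decoding step concrete, below is a minimal NumPy sketch of Key Designs 2–4: sample \(K\) candidate blocks, run \(I\) exponentiated-gradient updates on the weights, and select the block with the highest worst-case-weighted value. The interfaces `ref_model.sample_blocks` and the `value_fns` callables are hypothetical stand-ins for the reference policy and the trained value functions, not the authors' API.

```python
import numpy as np

def rmod_block_step(ctx, ref_model, value_fns, K=8, B=16, lam=0.5, eta=0.1, I=10):
    """One block-wise RMOD step (illustrative sketch, not the authors' implementation)."""
    # 1. Sample K candidate blocks of length B from the reference policy (assumed interface).
    blocks = ref_model.sample_blocks(ctx, num_candidates=K, block_len=B)

    # 2. Evaluate every objective's value function on every candidate: V has shape (G, K).
    V = np.array([[vf(ctx, b) for b in blocks] for vf in value_fns])
    G = V.shape[0]

    # 3. Worst-case weights via exponentiated gradient descent on the convex
    #    LogSumExp objective  f(w) = log E_{z ~ pi_ref}[ exp(lam * sum_g w_g V_g) ].
    w = np.full(G, 1.0 / G)
    for _ in range(I):
        logits = lam * (w @ V)                  # shape (K,)
        pi = np.exp(logits - logits.max())      # closed-form tilted policy over candidates
        pi /= pi.sum()
        grad = lam * (V @ pi)                   # df/dw_g = lam * E_{z~pi}[V_g(z)]
        w = w * np.exp(-eta * grad)             # multiplicative (mirror-descent) update
        w /= w.sum()                            # stay on the probability simplex

    # 4. Select the block with the highest worst-case-weighted value.
    best = int(np.argmax(w @ V))
    return blocks[best], w
```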

Loss & Training

Value functions are trained with an MSE loss: \(\mathbb{E}[\sum_t(V_g(x,y^t;\theta) - r_g(x,y))^2]\), using responses generated by the reference policy and their corresponding rewards. RMOD itself is an inference-time algorithm and requires no training of a policy network.
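
As a rough illustration of that objective (not the paper's training code), the per-response loss regresses the value prediction at every prefix onto the final scalar reward; `value_head` and `prefix_hidden` below are assumed interfaces:

```python
import torch
import torch.nn.functional as F

def value_loss(value_head, prefix_hidden, final_reward):
    """MSE loss for one objective g on one (prompt, response) pair (sketch).

    prefix_hidden: (T, d) representations of the prefixes y^1..y^T (assumed to come
    from a frozen backbone); final_reward: scalar r_g(x, y) from objective g's reward model.
    """
    v_pred = value_head(prefix_hidden).squeeze(-1)          # V_g(x, y^t; theta) for t = 1..T
    target = torch.full_like(v_pred, float(final_reward))   # every prefix regresses to r_g(x, y)
    return F.mse_loss(v_pred, target, reduction="sum")      # sum over t, matching the loss above
```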

Key Experimental Results

Main Results (HH Dataset, Worst-Case Reward)

| Method | Worst-Case Reward | Worst-Case Win Rate (WCWR) |
|---|---|---|
| CD-Helpful | High helpful, low harmless | Low |
| CD-Harmless | High harmless, low helpful | Low |
| CD-Uniform | Moderate balance | 57.6% |
| MO-GRPO | Moderate | 54.6% |
| RS/MOD | Below Uniform | — |
| Distill-RMOD | — | 57.9% |
| RMOD | Highest | 59.1% |

Ablation Study

| Parameter | Key Metric | Notes |
|---|---|---|
| \(\lambda=0.1\) (low) | Close to Uniform | Weight distribution is approximately uniform |
| \(\lambda=0.5\) | Moderate robustness | Balanced trade-off |
| \(\lambda=10\) (high) | Most concentrated on worst objective | Weights are highly sparse |
| \(B=16\) (small block) | Highest win rate | Finer-grained control |
| \(B=256\) (large block) | Win rate decreases | Approaches reference policy |
| Objectives \(G = 2\) to \(10\) | RMOD consistently outperforms Uniform | Performance degrades as objective count grows |

Key Findings

  • RMOD outperforms all baselines by up to 20% in worst-case win rate.
  • Latency increases by only 4.5% over standard CD, demonstrating high computational efficiency.
  • Distill-RMOD (SFT on RMOD-generated data) performs competitively without using controlled decoding at test time.
  • LLM-as-Judge evaluation (GPT-4o) further confirms the superiority of RMOD.

Highlights & Insights

  • Theoretical Elegance: The problem is formulated as a convex-concave game with a closed-form solution and convex optimization, providing strong theoretical guarantees.
  • Practical Flexibility: As an inference-time algorithm, alignment objectives can be switched on the fly with minimal latency overhead.
  • In-depth Weight Behavior Analysis: KKT conditions are used to prove that the optimal weights equalize the expected rewards across objectives (sketched after this list).
  • Distill-RMOD offers a practical pathway for distilling inference-time methods into standard policy models.
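
A short sketch of the equalization argument referenced above: for the convex weight problem \(\min_{w \in \Delta^{G-1}} f(w)\) with \(f(w) = \log \mathbb{E}_{z\sim\pi_{\text{ref}}}[\exp(\lambda \sum_g w_g V_g)]\), the gradient is \(\partial f/\partial w_g = \lambda\, \mathbb{E}_{z\sim\pi^*}[V_g(x,y^t;z)]\), where \(\pi^*\) is the tilted policy of Proposition 1. KKT stationarity on the simplex then forces \(\mathbb{E}_{\pi^*}[V_g]\) to be equal for every objective with \(w_g^* > 0\), while objectives with strictly larger expected value receive zero weight.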

Limitations & Future Work

  • Performance degrades as the number of objectives increases beyond 10; large-scale multi-objective settings require further investigation.
  • Training a separate value function for each objective incurs non-trivial preparation costs.
  • The choice of \(\lambda\) governs the degree of robustness (sparsity of weights) and currently requires manual specification.
  • Experiments are conducted only on gemma-2-2b-it; effectiveness on larger models remains unverified.

Related Work

  • Mudgal et al. (2023)'s Controlled Decoding serves as the direct foundation; RMOD extends it to a robust multi-objective setting.
  • Shi et al. (2024)'s MOD requires pre-specified weights, whereas RMOD discovers them automatically.
  • Yoon et al. (2024) and Ramesh et al. (2024) consider robust alignment but not in an inference-time framework.
  • Insight: Combining inference-time algorithms with robust optimization yields a flexible and principled solution for multi-objective LLM alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ Combining minimax robust optimization with inference-time controlled decoding is novel, though the individual components are relatively mature.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across multiple datasets, ablations, LLM-as-Judge evaluation, and latency analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear and problem motivation is intuitive.
  • Value: ⭐⭐⭐⭐ Provides a principled inference-time solution for multi-objective LLM alignment.