
Robust Multi-Objective Controlled Decoding of Large Language Models

Conference: ICLR 2026 | arXiv: 2503.08796 | Code: GitHub | Area: Reinforcement Learning | Keywords: multi-objective alignment, inference-time alignment, controlled decoding, robust optimization, minimax game

TL;DR

This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically computes worst-case objective weights by solving for the Nash equilibrium of a minimax game, achieving robust multi-objective alignment of LLMs without requiring any prior knowledge of objective weights.

Background & Motivation

LLMs must simultaneously align with multiple objectives (e.g., helpfulness, harmlessness, safety, instruction-following). Multi-objective alignment naturally raises the question: how should multiple potentially conflicting objectives be balanced at inference time?

Existing methods typically require manually specified objective weights, yet weight selection faces several difficulties:

  • Shi et al. (2024) select weights via hyperparameter search on a validation set, which is susceptible to distribution shift.
  • Approaches based on user profiles or interaction history require additional information that is often unavailable in practice.
  • When safety is one of the objectives, it cannot be ignored, yet excessive conservatism is also undesirable.

Core motivation: achieve robust alignment without any prior weight information by maximizing the worst-case objective—ensuring that the weakest objective receives the greatest attention.

Method

Overall Architecture

At each decoding step, RMOD leverages value functions \(V_g\) trained for each objective and determines the weights and the sampling policy by solving for the Nash equilibrium of \(\max_\pi \min_{w \in \Delta^{G-1}} \big[\lambda \sum_g w_g V_g(x,y^t;\pi) - D_{KL}(\pi \,\|\, \pi_{\text{ref}})\big]\).
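
Because the inner maximization over \(\pi\) has a closed-form Gibbs solution (Key Design 2 below), substituting it back reduces the game to a convex minimization over the weights alone. A sketch of this standard reduction, in the notation above:

\(\min_{w \in \Delta^{G-1}} \max_\pi \big[\lambda \sum_g w_g V_g(x,y^t;\pi) - D_{KL}(\pi \,\|\, \pi_{\text{ref}})\big] = \min_{w \in \Delta^{G-1}} \log \mathbb{E}_{z \sim \pi_{\text{ref}}}\big[\exp\big(\lambda \sum_g w_g V_g(x,y^t;z)\big)\big]\),

which is exactly the convex weight problem of Key Design 3.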

Key Designs

  1. Minimax Game Formulation:

    • Function: Models robust multi-objective alignment as a two-player zero-sum game between policy \(\pi\) and weights \(w\).
    • Mechanism: The objective is linear in \(w\) and concave in \(\pi\); thus a Nash equilibrium exists and the minimax theorem permits swapping the max-min order, reducing the problem to first solving for the optimal policy and then optimizing the weights.
    • Design Motivation: The minimax formulation ensures that no single objective is severely neglected.
  2. Closed-Form Optimal Policy:

    • Function: Derives the optimal sampling policy given weights \(w\).
    • Mechanism (Proposition 1): \(\pi(z|[x,y^t];w) = \frac{\pi_{\text{ref}}(z|[x,y^t]) \exp(\lambda \sum_g w_g V_g(x,y^t;z))}{Z(x,y^t,w)}\)
    • Design Motivation: The closed-form solution avoids costly policy search and is consistent with the Boltzmann form of standard KL-regularized RLHF.
  3. Convex Optimization for Worst-Case Weights:

    • Function: Reduces the weight search to a convex optimization in LogSumExp form.
    • Mechanism: \(w^* = \arg\min_{w \in \Delta^{G-1}} \log \mathbb{E}_{z\sim\pi_\text{ref}}[\exp(\lambda \sum_g w_g V_g)]\), with weights updated via exponentiated gradient descent, \(w_{g,i+1} \propto w_{g,i} \cdot \exp(-\eta \, \partial f/\partial w_g)\), where \(f\) denotes the LogSumExp objective and the weights are renormalized onto the simplex after each update.
    • Design Motivation: Convexity guarantees a global optimum; the search space has dimension only \(G\) (number of objectives), ensuring computational efficiency.
  4. Block-wise Decoding:

    • Function: Divides the generation into blocks of length \(B\), selecting the best among \(K\) candidates per block.
    • Mechanism: \(K\) block candidates are sampled from \(\pi_{\text{ref}}\), value functions are evaluated, weights are iteratively updated \(I\) times, and the block with the highest weighted value is selected.
    • Design Motivation: Substantially reduces the number of value-function evaluations compared with token-level decoding; see the sketch of one block step after this list.
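
To make one decoding step concrete, below is a minimal NumPy sketch of Key Designs 2–4: sample \(K\) candidate blocks, run \(I\) exponentiated-gradient updates on the weights, and select the block with the highest worst-case-weighted value. The interfaces `ref_model.sample_blocks` and the `value_fns` callables are hypothetical stand-ins for the reference policy and the trained value functions, not the authors' API.

```python
import numpy as np

def rmod_block_step(ctx, ref_model, value_fns, K=8, B=16, lam=0.5, eta=0.1, I=10):
    """One block-wise RMOD step (illustrative sketch, not the authors' implementation)."""
    # 1. Sample K candidate blocks of length B from the reference policy (assumed interface).
    blocks = ref_model.sample_blocks(ctx, num_candidates=K, block_len=B)

    # 2. Evaluate every objective's value function on every candidate: V has shape (G, K).
    V = np.array([[vf(ctx, b) for b in blocks] for vf in value_fns])
    G = V.shape[0]

    # 3. Worst-case weights via exponentiated gradient descent on the convex
    #    LogSumExp objective  f(w) = log E_{z ~ pi_ref}[ exp(lam * sum_g w_g V_g) ].
    w = np.full(G, 1.0 / G)
    for _ in range(I):
        logits = lam * (w @ V)                  # shape (K,)
        pi = np.exp(logits - logits.max())      # closed-form tilted policy over candidates
        pi /= pi.sum()
        grad = lam * (V @ pi)                   # df/dw_g = lam * E_{z~pi}[V_g(z)]
        w = w * np.exp(-eta * grad)             # multiplicative (mirror-descent) update
        w /= w.sum()                            # stay on the probability simplex

    # 4. Select the block with the highest worst-case-weighted value.
    best = int(np.argmax(w @ V))
    return blocks[best], w
```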

Loss & Training

Value functions are trained with an MSE loss: \(\mathbb{E}[\sum_t(V_g(x,y^t;\theta) - r_g(x,y))^2]\), using responses generated by the reference policy and their corresponding rewards. RMOD itself is an inference-time algorithm and requires no training of a policy network.
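
As a rough illustration of that objective (not the paper's training code), the per-response loss regresses the value prediction at every prefix onto the final scalar reward; `value_head` and `prefix_hidden` below are assumed interfaces:

```python
import torch
import torch.nn.functional as F

def value_loss(value_head, prefix_hidden, final_reward):
    """MSE loss for one objective g on one (prompt, response) pair (sketch).

    prefix_hidden: (T, d) representations of the prefixes y^1..y^T (assumed to come
    from a frozen backbone); final_reward: scalar r_g(x, y) from objective g's reward model.
    """
    v_pred = value_head(prefix_hidden).squeeze(-1)          # V_g(x, y^t; theta) for t = 1..T
    target = torch.full_like(v_pred, float(final_reward))   # every prefix regresses to r_g(x, y)
    return F.mse_loss(v_pred, target, reduction="sum")      # sum over t, matching the loss above
```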

Key Experimental Results

Main Results (HH Dataset, Worst-Case Reward)

| Method | Worst-Case Reward | Worst-Case Win Rate (WCWR) |
|---|---|---|
| CD-Helpful | High helpful, low harmless | Low |
| CD-Harmless | High harmless, low helpful | Low |
| CD-Uniform | Moderate balance | 57.6% |
| MO-GRPO | Moderate | 54.6% |
| RS/MOD | Below Uniform | — |
| Distill-RMOD | — | 57.9% |
| RMOD | Highest | 59.1% |

Ablation Study

| Parameter | Key Metric | Notes |
|---|---|---|
| \(\lambda=0.1\) (low) | Close to Uniform | Weight distribution is approximately uniform |
| \(\lambda=0.5\) | Moderate robustness | Balanced trade-off |
| \(\lambda=10\) (high) | Most concentrated on worst objective | Weights are highly sparse |
| \(B=16\) (small block) | Highest win rate | Finer-grained control |
| \(B=256\) (large block) | Win rate decreases | Approaches reference policy |
| Objectives \(G = 2\) to \(10\) | RMOD consistently outperforms Uniform | Performance degrades as objective count grows |

Key Findings

  • RMOD outperforms all baselines by up to 20% in worst-case win rate.
  • Latency increases by only 4.5% over standard CD, demonstrating high computational efficiency.
  • Distill-RMOD (SFT on RMOD-generated data) performs competitively without using controlled decoding at test time.
  • LLM-as-Judge evaluation (GPT-4o) further confirms the superiority of RMOD.

Highlights & Insights

  • Theoretical Elegance: The problem is formulated as a convex-concave game with a closed-form solution and convex optimization, providing strong theoretical guarantees.
  • Practical Flexibility: As an inference-time algorithm, alignment objectives can be switched on the fly with minimal latency overhead.
  • In-depth Weight Behavior Analysis: KKT conditions are used to prove that the optimal weights equalize the expected rewards across objectives (sketched after this list).
  • Distill-RMOD offers a practical pathway for distilling inference-time methods into standard policy models.
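
A short sketch of the equalization argument referenced above: for the convex weight problem \(\min_{w \in \Delta^{G-1}} f(w)\) with \(f(w) = \log \mathbb{E}_{z\sim\pi_{\text{ref}}}[\exp(\lambda \sum_g w_g V_g)]\), the gradient is \(\partial f/\partial w_g = \lambda\, \mathbb{E}_{z\sim\pi^*}[V_g(x,y^t;z)]\), where \(\pi^*\) is the tilted policy of Proposition 1. KKT stationarity on the simplex then forces \(\mathbb{E}_{\pi^*}[V_g]\) to be equal for every objective with \(w_g^* > 0\), while objectives with strictly larger expected value receive zero weight.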

Limitations & Future Work

  • Performance degrades as the number of objectives increases beyond 10; large-scale multi-objective settings require further investigation.
  • Training a separate value function for each objective incurs non-trivial preparation costs.
  • The choice of \(\lambda\) governs the degree of robustness (sparsity of weights) and currently requires manual specification.
  • Experiments are conducted only on gemma-2-2b-it; effectiveness on larger models remains unverified.

Related Work

  • Mudgal et al. (2023)'s Controlled Decoding serves as the direct foundation; RMOD extends it to a robust multi-objective setting.
  • Shi et al. (2024)'s MOD requires pre-specified weights, whereas RMOD discovers them automatically.
  • Yoon et al. (2024) and Ramesh et al. (2024) consider robust alignment but not in an inference-time framework.
  • Insight: Combining inference-time algorithms with robust optimization yields a flexible and principled solution for multi-objective LLM alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ Combining minimax robust optimization with inference-time controlled decoding is novel, though the individual components are relatively mature.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage across multiple datasets, ablations, LLM-as-Judge evaluation, and latency analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear and problem motivation is intuitive.
  • Value: ⭐⭐⭐⭐ Provides a principled inference-time solution for multi-objective LLM alignment.