MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Conference: ACL 2026
arXiv: 2601.07208
Code: https://github.com/zy125413/MAESTRO
Area: Model Compression/LLM Alignment
Keywords: Open-domain alignment, multi-objective optimization, reward orchestration, meta-learning, GRPO

TL;DR

This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network reads the policy model's final-layer hidden states and adaptively selects reward weights for each prompt-response pair. MAESTRO consistently outperforms static and single-reward baselines across seven open-domain benchmarks.

Background & Motivation

Background: GRPO has become a mainstream paradigm for LLM alignment, performing exceptionally well on tasks with verifiable ground truths like mathematics and code. However, extending GRPO to open-domain generation (e.g., creative writing, social intelligence) remains a critical challenge due to the lack of objective verification rules.

Limitations of Prior Work: Current open-domain alignment primarily relies on two routes: (1) LLM-as-a-Judge, which is computationally expensive and introduces stylistic biases (e.g., preferring longer responses); (2) methods based on heuristic proxy signals like perplexity and entropy, which correlate poorly with human utility and use static, context-independent scalarization weights. Neither approach captures the fine-grained multi-objective trade-offs in open-domain generation.

Key Challenge: Open-domain alignment is inherently a multi-objective optimization problem: creativity conflicts with factuality, and conciseness with richness. Yet existing methods collapse the high-dimensional Pareto front into a single point using fixed weights. Imposing the same reward preferences on mathematical reasoning and on creative writing ignores how different the two tasks are.

Goal: To design a framework capable of dynamically adjusting reward weights based on the semantic content of prompt-response pairs, enabling GRPO to adaptively switch reward preferences across different tasks and contexts.

Key Insight: The final-layer hidden states of a Transformer act as a semantic bottleneck that encodes high-level information about task intent and generation characteristics. With these latent representations as context, a lightweight meta-policy can be trained to select reward scalarization strategies.

Core Idea: Reward orchestration is modeled as a contextual bandit problem. Using the group-relative advantage of GRPO as a meta-reward signal, a Conductor network is co-evolved with the policy model within a bilevel optimization framework.

Method

Overall Architecture

MAESTRO adds a Conductor layer to the standard GRPO pipeline. Given a prompt \(q\), the policy model \(\pi_\theta\) samples a set of candidate outputs \(\{o_i\}\). The Conductor \(\pi_\phi\) processes the final-layer hidden state of each prompt-response pair and samples a reward-focus action \(a\), which induces a weight vector \(\mathbf{w}^{(a)}\). The raw reward vector \(\mathbf{r}\) and KL penalty are fused into a scalar reward \(R\) via a scalarization node, followed by group normalization to obtain the group-relative advantage \(\hat{A}\). In the bilevel optimization, the inner loop updates \(\pi_\theta\) using GRPO, while the outer loop updates \(\pi_\phi\) using the advantage as a meta-reward.
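
The scalarization and group-normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `beta` coefficient, the choice to subtract the KL term, and the `eps` stabilizer are all assumptions made for the example.

```python
import numpy as np

def scalarize(rewards, weights, kl, beta=0.0):
    """Fuse reward vectors into scalars: R_j = w_j . r_j - beta * KL_j (assumed form)."""
    rewards = np.asarray(rewards, dtype=float)   # shape (G, K): one reward vector per response
    weights = np.asarray(weights, dtype=float)   # shape (G, K): one weight vector per response
    kl = np.asarray(kl, dtype=float)             # shape (G,): KL penalty per response
    return (weights * rewards).sum(axis=1) - beta * kl

def group_relative_advantage(scalar_rewards, eps=1e-8):
    """Standardize scalar rewards within the group of responses to one prompt."""
    r = np.asarray(scalar_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of G=4 responses with K=2 reward components.
rewards = [[1, 0], [0, 1], [1, 1], [0, 0]]
weights = [[0.5, 0.5]] * 4
R = scalarize(rewards, weights, kl=[0, 0, 0, 0], beta=0.1)
A = group_relative_advantage(R)
```

Because `weights` has one row per response, the same helper also covers the case where each response in a group receives a different weight vector.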

Key Designs

  1. Conductor Network:

    • Function: Dynamically selects reward weight configurations based on prompt-response semantics.
    • Mechanism: The context is the final-layer hidden state \(h \in \mathbb{R}^{d_{\text{model}}}\) obtained after the policy model has processed the complete prompt-response sequence. The Conductor is a lightweight linear projection head: \(\pi_\phi(\cdot|h) = \text{softmax}((W_\phi h + b_\phi)/\tau)\), where \(\tau\) is a temperature. During training, a discrete action \(a\) is sampled from this categorical distribution, and each action induces a specific reward-focus pattern; at inference, the distribution itself is used directly as a deterministic weight vector.
    • Design Motivation: By exploiting the linear separability of final-layer latent representations, different task semantics (e.g., reasoning vs. creativity) can be distinguished using only a linear projection, incurring minimal overhead without needing complex networks.
  2. Advantage-driven Bilevel Meta-optimization:

    • Function: Stably trains the Conductor to learn meaningful reward trade-offs.
    • Mechanism: The meta-objective \(J(\phi) = \mathbb{E}[\hat{A}(x,y;w(h,a))]\) maximizes the expected GRPO advantage under the reward configurations selected by the Conductor. The key device is intra-group heterogeneous sampling: a reward action \(a_{i,j}\) is sampled independently for each response \(j\) to the same prompt \(i\). This breaks the symmetry of the group baseline, so the advantages carry information about which action was better. The gradient estimate is \(\nabla_\phi J(\phi) = \frac{1}{NG}\sum_{i,j}[\hat{A}_{i,j}\nabla_\phi\log\pi_\phi(a_{i,j}|h_{i,j}) + \lambda_{\text{ent}}\nabla_\phi\mathcal{H}(\pi_\phi(\cdot|h_{i,j}))]\), where \(N\) is the number of prompts in the batch and \(G\) is the group size.
    • Design Motivation: Under group-relative normalization, the advantages within a group average to zero. If all responses to a prompt shared one weight configuration, the meta-gradient would therefore vanish. Intra-group heterogeneous sampling introduces competition between actions within a group and exposes informative variance.
  3. Asynchronous Two-timescale Update:

    • Function: Decouples Conductor optimization from policy model training to prevent instability.
    • Mechanism: During GRPO training, triples \((h_{i,j}, a_{i,j}, \hat{A}_{i,j})\) are buffered, and \(\phi\) is updated periodically with a policy-gradient step. The policy model updates at high frequency at the token level (inner loop), while the Conductor updates at lower frequency at the episode level (outer loop), giving two distinct timescales.
    • Design Motivation: Decoupling meta-optimization from token-level policy training avoids coupling between meta-gradients and policy gradients that could lead to training instability or degradation.
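
The buffering behind the two-timescale update can be sketched with a small helper. This is an illustrative assumption rather than the paper's code: the class name `MetaBuffer`, the `period` parameter, and the rule of releasing a batch every fixed number of policy steps are hypothetical.

```python
class MetaBuffer:
    """Collects (h, a, advantage) triples and releases them every `period` policy steps."""

    def __init__(self, period):
        self.period = period   # number of inner-loop (policy) steps per outer-loop update
        self.steps = 0
        self.items = []

    def add_step(self, triples):
        """Store triples from one policy step; return a batch when an outer update is due."""
        self.items.extend(triples)
        self.steps += 1
        if self.steps % self.period == 0:
            batch, self.items = self.items, []
            return batch
        return None

# Hypothetical usage: the Conductor would be updated only when a batch is returned.
buf = MetaBuffer(period=3)
released = [buf.add_step([("h", 0, 0.5), ("h", 1, -0.5)]) for _ in range(3)]
```

The policy model is updated at every call, while the returned batch (here on the third call) is what an outer-loop Conductor update would consume.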

Loss & Training

The reward space consists of \(K=5\) components: perplexity reward \(r_{\text{ppl}}\) (a proxy for reasoning consistency), format validity reward \(r_{\text{fmt}}\), entropy reward \(r_{\text{ent}}\) (balancing exploration and redundancy), length penalty \(r_{\text{len}}\), and semantic preference reward \(r_{\text{pref}}\) (from the pre-trained Skywork-Reward model). The inner loop uses the standard GRPO loss to update the policy model, while the outer loop uses REINFORCE gradients (with entropy regularization) to update the Conductor.
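
The outer-loop REINFORCE update with entropy regularization can be sketched for a linear softmax Conductor as below. This is a minimal NumPy sketch under stated assumptions: the class and method names, the zero initialization, the learning rate, and plain gradient ascent are all hypothetical choices, and the triples are assumed to come from a buffer of \((h, a, \hat{A})\) values.

```python
import numpy as np

class Conductor:
    """Linear softmax head: pi_phi(.|h) = softmax((W h + b) / tau)."""

    def __init__(self, d_model, n_actions, tau=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = np.zeros((n_actions, d_model))  # assumed zero init (uniform policy)
        self.b = np.zeros(n_actions)
        self.tau = tau

    def probs(self, h):
        z = (self.W @ h + self.b) / self.tau
        p = np.exp(z - z.max())                  # subtract max for numerical stability
        return p / p.sum()

    def sample(self, h):
        p = self.probs(h)
        return int(self.rng.choice(len(p), p=p))

    def meta_gradient(self, batch, lam_ent=0.01):
        """Average of A * grad log pi(a|h) + lam_ent * grad H over (h, a, A) triples."""
        gW, gb = np.zeros_like(self.W), np.zeros_like(self.b)
        for h, a, adv in batch:
            p = self.probs(h)
            onehot = np.eye(len(p))[a]
            g_logp = (onehot - p) / self.tau     # d log pi(a|h) / d(W h + b)
            logp = np.log(p + 1e-12)
            ent = -(p * logp).sum()
            g_ent = -p * (logp + ent) / self.tau # d H(pi) / d(W h + b)
            g = adv * g_logp + lam_ent * g_ent
            gW += np.outer(g, h)
            gb += g
        return gW / len(batch), gb / len(batch)

    def update(self, batch, lr=0.1, lam_ent=0.01):
        gW, gb = self.meta_gradient(batch, lam_ent)
        self.W += lr * gW                        # gradient ascent on the meta-objective
        self.b += lr * gb

conductor = Conductor(d_model=4, n_actions=5)
h = np.array([0.2, -0.1, 0.4, 0.0])
a = conductor.sample(h)            # training: discrete reward-focus action
w_inference = conductor.probs(h)   # inference: distribution used directly as weights
```

Using five actions here mirrors the \(K=5\) reward components, but how actions map to weight patterns during training is not specified in this summary, so that mapping is left out of the sketch.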

Key Experimental Results

Main Results (Qwen3-8B)

| Dataset | Base | SFT | NOVER | EM-GRPO | MAESTRO | Gain vs. Strongest Baseline |
|---|---|---|---|---|---|---|
| Natural Reasoning | 39.6 | 26.0 | 46.9 | 52.0 | 53.2 | +1.2 |
| SS-GEN | 33.1 | 68.7 | 77.8 | 88.8 | 92.5 | +3.7 |
| WebInstruct | 7.8 | 34.6 | 42.7 | 43.4 | 43.5 | +0.1 |
| ToMBench | 5.7 | 46.9 | 56.2 | 63.8 | 71.9 | +8.1 |
| GeneralThoughts | 34.0 | 34.7 | 64.6 | 68.0 | 68.1 | +0.1 |
| OPUS-Books | 5.1 | 5.5 | 10.1 | 11.7 | 12.6 | +0.9 |
| EmoBench | 36.7 | 46.1 | 42.2 | 41.4 | 47.7 | +1.6 |

Ablation Study

| Configuration | Description | Effect |
|---|---|---|
| Equal-Weights (Eq) | Fixed uniform weights | Moderate gains but unstable; e.g., only 38.27% on ToMBench |
| Random-Weights (Rand) | Random weights | Sometimes decreased performance (35.7% on GeneralThoughts) |
| MAESTRO (Ours) | Conductor dynamic weights | Optimal in nearly all tasks |
| Training Time | SS-GEN, w/ Conductor vs. w/o | 20.1% speedup (reduced redundant generation) |
| Training Time | WebInstruct, w/ Conductor vs. w/o | Overhead only +4.0% |

Key Findings

  • Largest improvement on ToMBench (+8.1 points): Social intelligence tasks require flexible expression and emotional understanding, where the advantages of dynamic reward orchestration are most significant. EM-GRPO is the strongest baseline here (63.8), but MAESTRO still leads by a wide margin (71.9).
  • EM-GRPO nearly matches MAESTRO on reasoning tasks (WebInstruct 43.4 vs. 43.5; GeneralThoughts 68.0 vs. 68.1): Low-entropy decoding benefits deterministic reasoning but falls behind on open-domain tasks (SS-GEN 88.8 vs. 92.5; ToMBench 63.8 vs. 71.9), suggesting that a single inductive bias does not generalize across domains.
  • Dynamic weights reduce generation redundancy: On SS-GEN, the Conductor learns to suppress verbose outputs early, shortening average sequence length and increasing training throughput by 20.1%.
  • Learned weight patterns have clear semantics: Creative writing tasks focus on entropy rewards, while structured reasoning tasks focus on perplexity rewards. Patterns converge rapidly and stabilize early in training.

Highlights & Insights

  • Elegant Fusion of Contextual Bandit + GRPO: Modeling reward weight selection as a decision problem dependent on prompt-response semantics, implemented via a single linear head, is both elegant and efficient. This paradigm can be generalized to any RL alignment scenario requiring multi-reward trade-offs.
  • Intra-group Heterogeneous Sampling Solves Meta-signal Vanishing: Group-relative advantages have zero mean, so a reward configuration shared by the whole group yields no meta-signal. Letting different responses in the same group use different reward configurations restores that signal and addresses the meta-credit assignment problem in bilevel optimization.
  • Efficiency Gains Instead of Losses: Dynamic reward orchestration does not just maintain training efficiency; in long-text generation scenarios, it can significantly accelerate training by reducing redundant output, breaking the intuition that more complex methods are necessarily slower.
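
The vanishing meta-signal can be made explicit under a simplifying assumption that this summary does not spell out: all \(G\) responses to prompt \(i\) share one action \(a_i\) and one context \(h_i\). Group normalization gives \(\sum_{j} \hat{A}_{i,j} = 0\), so the advantage-weighted score term for that prompt is

\[
\sum_{j=1}^{G} \hat{A}_{i,j}\,\nabla_\phi \log \pi_\phi(a_i \mid h_i)
= \nabla_\phi \log \pi_\phi(a_i \mid h_i) \sum_{j=1}^{G} \hat{A}_{i,j}
= 0 .
\]

With independently sampled \(a_{i,j}\), the score terms differ across \(j\) and no longer factor out of the sum, so the estimate is generally nonzero.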

Limitations & Future Work

  • Evaluated only on 7-8B scale models; effects on larger models remain to be explored.
  • The Conductor uses a simple linear projection head; more complex architectures might capture finer-grained trade-offs.
  • Reward components are limited to 5 predefined signals; how to automatically discover and combine reward signals remains an open question.
  • Evaluation relies on external LLM Judges (Qwen3-235B, Gemini-2.5-Flash), which may introduce biases.

Comparison with Related Work

  • vs. NOVER (Liu et al., 2025b): NOVER uses conditional perplexity as the sole reward signal for GRPO training, which is strong for reasoning but degrades in open domains. MAESTRO outperforms it on all seven benchmarks through dynamic multi-reward orchestration.
  • vs. EM-GRPO: Entropy minimization approaches MAESTRO on reasoning tasks but falls behind on creative and social tasks (e.g., SS-GEN 88.8% vs. 92.5%), which points to the limits of a single inductive bias.
  • vs. DYNAOPT (Pérez-Rosas et al., 2024): DYNAOPT adjusts reward weights at the global training stage level, whereas MAESTRO performs semantically conditioned orchestration at the instance level, providing finer granularity and broader applicability.
  • vs. Pareto-based MORL: Multi-policy Pareto methods require training and maintaining multiple large models, which is prohibitively expensive. MAESTRO achieves dynamic Pareto front exploration with a single policy and a lightweight Conductor.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The combination of Contextual Bandit + GRPO bilevel optimization is proposed for the first time, with an elegant solution to the meta-credit assignment problem.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Seven benchmarks, two backbone models, and multiple baselines, though lacking validation on larger models.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, rigorous methodology, and in-depth analysis (especially the visualization of reward weight evolution).
  • Value: ⭐⭐⭐⭐⭐ Provides a practical and efficient new paradigm for open-domain LLM alignment; the Conductor design is plug-and-play.