MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Conference: ACL 2026 arXiv: 2601.07208 Code: https://github.com/zy125413/MAESTRO Area: Model Compression / LLM Alignment Keywords: Open-domain Alignment, Multi-objective Optimization, Reward Orchestration, Meta-learning, GRPO

TL;DR

This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network leverages the final-layer hidden states of the policy model to adaptively select reward weights for each prompt–response pair, consistently outperforming static-reward and single-reward baselines across seven open-domain benchmarks.

Background & Motivation

Background: GRPO has emerged as a mainstream paradigm for LLM alignment, demonstrating strong performance on tasks with verifiable ground truth such as mathematics and code generation. However, extending GRPO to open-domain generation—e.g., creative writing and social intelligence—remains a critical challenge, as these tasks lack objective verification rules.

Limitations of Prior Work: Current open-domain alignment relies primarily on two approaches: (1) LLM-as-a-Judge, which incurs high computational costs and introduces stylistic biases (e.g., preference for longer responses); and (2) heuristic proxy signals based on perplexity and entropy, which correlate poorly with human utility and employ static, context-agnostic scalarization weights. Neither approach captures the fine-grained multi-objective trade-offs inherent in open-domain generation.

Key Challenge: Open-domain alignment is inherently a multi-objective optimization problem—creativity conflicts with factuality, conciseness conflicts with richness—yet existing methods collapse the high-dimensional Pareto front to a single point using a fixed weight vector. Applying identical reward preferences to mathematical reasoning and creative writing is clearly suboptimal.

Goal: Design a framework that dynamically adjusts reward weights based on the semantic content of each prompt–response pair, enabling GRPO to adaptively shift reward preferences across different tasks and contexts.

Key Insight: The final-layer hidden states of a Transformer serve as a semantic bottleneck, encoding high-level information about task intent and generation characteristics. These latent representations are used as context to train a lightweight meta-policy for selecting reward scalarization strategies.

Core Idea: Reward orchestration is modeled as a contextual bandit problem. The group-relative advantage from GRPO serves as the meta-reward signal, allowing the Conductor network and the policy model to co-evolve within a bilevel optimization framework.

Method

Overall Architecture

MAESTRO augments the standard GRPO pipeline with a Conductor layer. Given a prompt \(q\), the policy model \(\pi_\theta\) samples a group of candidate outputs \(\{o_i\}\). The Conductor \(\pi_\phi\) processes the final-layer hidden state of each prompt–response pair, samples a reward-emphasis action \(a\), and induces a weight vector \(\mathbf{w}^{(a)}\). The raw reward vector \(\mathbf{r}\) and KL penalty are fused into a scalar reward \(R\) via a scalarization node, which is then group-normalized to obtain the group-relative advantage \(\hat{A}\). In the bilevel optimization, the inner loop updates \(\pi_\theta\) via GRPO, while the outer loop updates \(\pi_\phi\) using the advantage as a meta-reward.
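To make the pipeline concrete, here is a minimal PyTorch sketch of one MAESTRO rollout step. It is an illustration under stated assumptions, not the authors' released code: the helper interfaces (`policy.sample_group`, `reward_fn`, `kl_fn`), the group size, the KL coefficient, and the one-hot action-to-weight mapping are all hypothetical.

```python
# Minimal sketch of one MAESTRO rollout step (illustrative; helper interfaces
# and the one-hot action-to-weight mapping are assumptions).
import torch
import torch.nn.functional as F

K = 5           # number of reward components (ppl, fmt, ent, len, pref)
G = 8           # group size per prompt (assumed)
BETA_KL = 0.04  # KL penalty coefficient (assumed)

def maestro_step(policy, conductor, reward_fn, kl_fn, prompt):
    # Policy samples a group of candidate responses; `hidden` holds the
    # final-layer hidden state of each prompt-response pair: [G, d_model].
    responses, hidden = policy.sample_group(prompt, G)

    # Conductor: softmax((W h + b) / tau) over reward-emphasis actions.
    probs = F.softmax(conductor(hidden) / conductor.tau, dim=-1)  # [G, K]
    actions = torch.multinomial(probs, 1).squeeze(-1)             # heterogeneous per response
    weights = F.one_hot(actions, K).float()                       # w^(a); one-hot emphasis here
    # (At inference, `probs` itself would serve as deterministic weights.)

    # Scalarization node: fuse the raw reward vector and the KL penalty.
    r_vec = reward_fn(prompt, responses)           # [G, K]
    kl = kl_fn(prompt, responses)                  # [G]
    R = (weights * r_vec).sum(-1) - BETA_KL * kl   # scalar reward per response

    # Group-relative advantage, as in GRPO.
    adv = (R - R.mean()) / (R.std() + 1e-6)
    return responses, actions, hidden, adv
```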

Key Designs

  1. Conductor Network:

    • Function: Dynamically selects reward weight configurations based on prompt–response semantics.
    • Mechanism: The final-layer hidden state \(h \in \mathbb{R}^{d_{\text{model}}}\), obtained after the policy model processes the full sequence, serves as context. The Conductor is implemented as a lightweight linear projection head: \(\pi_\phi(\cdot|h) = \text{softmax}((W_\phi h + b_\phi)/\tau)\). During training, a discrete action \(a\) is sampled from the categorical distribution, with each action inducing a specific reward-emphasis pattern; at inference, the continuous distribution is used directly as deterministic weights.
    • Design Motivation: The linear separability of final-layer hidden representations allows different task semantics (e.g., reasoning vs. creativity) to be distinguished via a simple linear projection, requiring no complex architecture and incurring minimal overhead.
  2. Advantage-Driven Bilevel Meta-Optimization:

    • Function: Stably trains the Conductor to learn meaningful reward trade-offs.
    • Mechanism: The meta-objective \(J(\phi) = \mathbb{E}[\hat{A}(x,y;w(h,a))]\) maximizes the expected GRPO advantage under the reward configuration selected by the Conductor. A key innovation is intra-group heterogeneous sampling: reward actions \(a_{i,j}\) are independently sampled for each response within the same prompt group, breaking the symmetry of the group baseline so that advantage differences across actions carry an informative meta-gradient signal. The gradient update is \(\nabla_\phi J(\phi) = \frac{1}{NG}\sum_{i,j}\big[\hat{A}_{i,j}\nabla_\phi\log\pi_\phi(a_{i,j}|h_{i,j}) + \lambda_{\text{ent}}\nabla_\phi\mathcal{H}(\pi_\phi)\big]\) (see the code sketch after this list).
    • Design Motivation: Under group-relative normalization, naively assigning uniform weights at the prompt level causes meta-gradients to vanish (since advantage sums to zero). Intra-group heterogeneous sampling introduces meta-competition that exposes informative variance.
  3. Asynchronous Two-Timescale Updates:

    • Function: Decouples Conductor optimization from policy model training to prevent instability.
    • Mechanism: During GRPO training, triplets \((h_{i,j}, a_{i,j}, \hat{A}_{i,j})\) are buffered, and \(\phi\) is updated periodically via the Policy Gradient Theorem. The policy model is updated at a high frequency at the token level (inner loop), while the Conductor is updated at a lower frequency at the episode level (outer loop), forming two distinct timescales.
    • Design Motivation: Decoupling meta-optimization from token-level policy training prevents coupling between meta-gradients and policy gradients, which would otherwise cause training instability or degeneracy.
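The sketch below shows the Conductor head and its outer-loop REINFORCE update, following the formulas above. Class and function names, the buffer format of \((h, a, \hat{A})\) triplets, and the default hyperparameters are assumptions for illustration.

```python
# Sketch of the Conductor head and its outer-loop update (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conductor(nn.Module):
    """Lightweight linear head: pi_phi(a | h) = softmax((W_phi h + b_phi) / tau)."""
    def __init__(self, d_model: int, num_actions: int, tau: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(d_model, num_actions)
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)  # logits; divide by tau before the softmax

def conductor_update(conductor, optimizer, buffer, lam_ent: float = 0.01):
    """Outer-loop REINFORCE step on buffered (hidden, action, advantage) triplets.

    Intra-group heterogeneous sampling guarantees that advantages within a
    group are not all identical, so this gradient does not vanish under
    group-relative normalization.
    """
    h = torch.stack([t[0] for t in buffer])       # [B, d_model]
    a = torch.tensor([t[1] for t in buffer])      # [B] sampled actions
    adv = torch.tensor([t[2] for t in buffer])    # [B] group-relative advantages

    log_probs = F.log_softmax(conductor(h) / conductor.tau, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)           # H(pi_phi(.|h))
    logp_a = log_probs.gather(1, a.unsqueeze(-1)).squeeze(-1)

    # Maximize E[A-hat * log pi(a|h)] + lam_ent * H  =>  minimize the negative.
    loss = -(adv * logp_a + lam_ent * entropy).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.clear()  # Conductor is updated periodically, on a slower timescale than pi_theta
```

The entropy term mirrors the regularizer in the meta-gradient above and keeps the Conductor from collapsing prematurely onto a single reward configuration.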

Loss & Training

The reward space comprises \(K=5\) components: a perplexity reward \(r_{\text{ppl}}\) (proxy for reasoning consistency), a format validity reward \(r_{\text{fmt}}\), an entropy reward \(r_{\text{ent}}\) (balancing exploration and redundancy), a length penalty \(r_{\text{len}}\), and a semantic preference reward \(r_{\text{pref}}\) (from the pretrained reward model Skywork-Reward). The inner loop updates the policy model using the standard GRPO loss; the outer loop updates the Conductor using REINFORCE gradients with entropy regularization.
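Tying the pieces together, a schematic two-timescale training loop (reusing `maestro_step` and `conductor_update` from the sketches above) might look as follows. The update frequency and the `policy.grpo_update` helper are assumptions; the paper only specifies that the Conductor is updated periodically at the episode level while the policy is updated at a higher frequency.

```python
# Schematic two-timescale loop (assumed schedule and helper interfaces).
def train(policy, conductor, reward_fn, kl_fn, prompts,
          policy_opt, conductor_opt, meta_update_every: int = 16):
    buffer = []
    for step, prompt in enumerate(prompts):
        responses, actions, hidden, adv = maestro_step(
            policy, conductor, reward_fn, kl_fn, prompt)

        # Inner loop: high-frequency GRPO update of pi_theta using the
        # scalarized, group-normalized advantages (standard GRPO loss).
        policy.grpo_update(policy_opt, prompt, responses, adv)

        # Buffer (h, a, A-hat) triplets for the outer loop.
        buffer.extend(zip(hidden.detach(), actions.tolist(), adv.tolist()))

        # Outer loop: low-frequency REINFORCE update of the Conductor.
        if (step + 1) % meta_update_every == 0:
            conductor_update(conductor, conductor_opt, buffer)
```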

Key Experimental Results

Main Results (Qwen3-8B)

| Dataset | Base | SFT | NOVER | EM-GRPO | MAESTRO | Gain vs. Best Baseline |
|---|---|---|---|---|---|---|
| Natural Reasoning | 39.6 | 26.0 | 46.9 | 52.0 | 53.2 | +1.2 |
| SS-GEN | 33.1 | 68.7 | 77.8 | 88.8 | 92.5 | +1.9 |
| WebInstruct | 7.8 | 34.6 | 42.7 | 43.4 | 43.5 | +0.1 |
| ToMBench | 5.7 | 46.9 | 56.2 | 63.8 | 71.9 | +8.1 |
| GeneralThoughts | 34.0 | 34.7 | 64.6 | 68.0 | 68.1 | +0.1 |
| OPUS-Books | 5.1 | 5.5 | 10.1 | 11.7 | 12.6 | +0.9 |
| EmoBench | 36.7 | 46.1 | 42.2 | 41.4 | 47.7 | +1.6 |

Ablation Study

| Configuration | Description | Result |
|---|---|---|
| Equal-Weights (Eq) | Fixed uniform weights | Moderate gains but unstable; e.g., only 38.27% on ToMBench |
| Random-Weights (Rand) | Random weights | Occasionally degrades performance; e.g., 35.7% on GeneralThoughts |
| MAESTRO (Ours) | Conductor dynamic weights | Near-optimal across almost all tasks |
| Training time on SS-GEN | w/ Conductor vs. w/o | 20.1% speedup (reduced redundant generation) |
| Training time on WebInstruct | w/ Conductor vs. w/o | Only +4.0% overhead |

Key Findings

  • Largest gain on ToMBench (+8.1%): Social intelligence tasks require flexible expression and emotional understanding, where dynamic reward orchestration yields the most significant advantage. EM-GRPO also performs strongly on this task (63.8%), yet MAESTRO still leads by a substantial margin.
  • EM-GRPO approaches MAESTRO on reasoning tasks: Low-entropy decoding favors deterministic reasoning, but degrades substantially on open-domain tasks (SS-GEN, ToMBench), demonstrating that a single inductive bias cannot generalize across domains.
  • Dynamic weights reduce generation redundancy: On SS-GEN, the Conductor learns to suppress verbose outputs early, shortening average sequence length and improving training throughput by 20.1%.
  • Conductor-learned weight patterns carry clear semantic meaning: Creative writing tasks emphasize the entropy reward, while structured reasoning tasks emphasize the perplexity reward. These patterns converge rapidly and stabilize early in training.

Highlights & Insights

  • Elegant fusion of contextual bandits and GRPO: Modeling reward weight selection as a decision problem conditioned on prompt–response semantics, implemented with a single linear head, is both elegant and efficient. This paradigm generalizes to any RL alignment scenario requiring multi-reward trade-offs.
  • Intra-group heterogeneous sampling resolves meta-signal vanishing: By exploiting the zero-mean property of group-relative advantages, assigning different reward configurations to different responses within the same group introduces variance—an elegant solution to the meta-credit assignment problem in bilevel optimization.
  • Efficiency improves rather than degrades: Dynamic reward orchestration not only avoids additional training overhead but, in long-text generation settings, achieves meaningful speedups by reducing redundant outputs, defying the intuition that added complexity implies added cost.

Limitations & Future Work

  • Validation is limited to models at the 7–8B scale; effectiveness on larger models remains to be explored.
  • The Conductor employs a simple linear projection head; more expressive architectures may capture finer-grained trade-offs.
  • The reward components are fixed to five predefined signals; automatically discovering and composing reward signals remains an open problem.
  • Evaluation relies on external LLM judges (Qwen3-235B, Gemini-2.5-Flash), which may themselves introduce evaluation bias.

Comparison with Related Work

  • vs. NOVER (Liu et al., 2025b): NOVER uses conditional perplexity as the sole reward signal within GRPO, yielding strong performance on reasoning tasks but degrading on open-domain settings. MAESTRO comprehensively surpasses it through multi-reward dynamic orchestration.
  • vs. EM-GRPO: The entropy minimization approach is competitive with MAESTRO on reasoning tasks but degrades substantially on creative and social tasks (e.g., SS-GEN: 88.8% vs. 92.5%), confirming the limitations of a single inductive bias.
  • vs. DYNAOPT (Pérez-Rosas et al., 2024): DYNAOPT adjusts reward weights at the global training-phase level, whereas MAESTRO performs semantically conditioned orchestration at the instance level, offering finer granularity and broader applicability.
  • vs. Pareto-based MORL: Pareto multi-policy methods require training and maintaining multiple large models at prohibitive cost. MAESTRO achieves dynamic Pareto front exploration using a single policy model and a lightweight Conductor.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The paper is the first to combine contextual bandits with bilevel GRPO optimization; the solution to the meta-credit assignment problem is particularly elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Seven benchmarks, two backbone models, and multiple baselines; validation on larger models is absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem motivation is clearly articulated, methodological descriptions are rigorous, and analysis is thorough (reward weight evolution visualizations are especially illuminating).
  • Value: ⭐⭐⭐⭐⭐ Provides a practical and efficient new paradigm for open-domain LLM alignment; the Conductor design is plug-and-play.