MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization

Conference: ACL 2026 arXiv: 2601.07208 Code: https://github.com/zy125413/MAESTRO Area: Model Compression / LLM Alignment Keywords: Open-domain Alignment, Multi-objective Optimization, Reward Orchestration, Meta-learning, GRPO

TL;DR

This paper proposes MAESTRO, which reformulates reward scalarization in GRPO as a contextual bandit problem. A lightweight Conductor network leverages the final-layer hidden states of the policy model to adaptively select reward weights for each prompt–response pair, consistently outperforming static-reward and single-reward baselines across seven open-domain benchmarks.

Background & Motivation

Background: GRPO has emerged as a mainstream paradigm for LLM alignment, demonstrating strong performance on tasks with verifiable ground truth such as mathematics and code generation. However, extending GRPO to open-domain generation—e.g., creative writing and social intelligence—remains a critical challenge, as these tasks lack objective verification rules.

Limitations of Prior Work: Current open-domain alignment relies primarily on two approaches: (1) LLM-as-a-Judge, which incurs high computational costs and introduces stylistic biases (e.g., preference for longer responses); and (2) heuristic proxy signals based on perplexity and entropy, which correlate poorly with human utility and employ static, context-agnostic scalarization weights. Neither approach captures the fine-grained multi-objective trade-offs inherent in open-domain generation.

Key Challenge: Open-domain alignment is inherently a multi-objective optimization problem—creativity conflicts with factuality, conciseness conflicts with richness—yet existing methods collapse the high-dimensional Pareto front to a single point using a fixed weight vector. Applying identical reward preferences to mathematical reasoning and creative writing is clearly suboptimal.

Goal: Design a framework that dynamically adjusts reward weights based on the semantic content of each prompt–response pair, enabling GRPO to adaptively shift reward preferences across different tasks and contexts.

Key Insight: The final-layer hidden states of a Transformer serve as a semantic bottleneck, encoding high-level information about task intent and generation characteristics. These latent representations are used as context to train a lightweight meta-policy for selecting reward scalarization strategies.

Core Idea: Reward orchestration is modeled as a contextual bandit problem. The group-relative advantage from GRPO serves as the meta-reward signal, allowing the Conductor network and the policy model to co-evolve within a bilevel optimization framework.

Method

Overall Architecture

MAESTRO augments the standard GRPO pipeline with a Conductor layer. Given a prompt \(q\), the policy model \(\pi_\theta\) samples a group of candidate outputs \(\{o_i\}\). The Conductor \(\pi_\phi\) processes the final-layer hidden state of each prompt–response pair, samples a reward-emphasis action \(a\), and induces a weight vector \(\mathbf{w}^{(a)}\). The raw reward vector \(\mathbf{r}\) and KL penalty are fused into a scalar reward \(R\) via a scalarization node, which is then group-normalized to obtain the group-relative advantage \(\hat{A}\). In the bilevel optimization, the inner loop updates \(\pi_\theta\) via GRPO, while the outer loop updates \(\pi_\phi\) using the advantage as a meta-reward.
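To make the pipeline concrete, here is a minimal PyTorch sketch of one MAESTRO rollout step. It is an illustration under stated assumptions, not the authors' released code: the helper interfaces (`policy.sample_group`, `reward_fn`, `kl_fn`), the group size, the KL coefficient, and the one-hot action-to-weight mapping are all hypothetical.

```python
# Minimal sketch of one MAESTRO rollout step (illustrative; helper interfaces
# and the one-hot action-to-weight mapping are assumptions).
import torch
import torch.nn.functional as F

K = 5           # number of reward components (ppl, fmt, ent, len, pref)
G = 8           # group size per prompt (assumed)
BETA_KL = 0.04  # KL penalty coefficient (assumed)

def maestro_step(policy, conductor, reward_fn, kl_fn, prompt):
    # Policy samples a group of candidate responses; `hidden` holds the
    # final-layer hidden state of each prompt-response pair: [G, d_model].
    responses, hidden = policy.sample_group(prompt, G)

    # Conductor: softmax((W h + b) / tau) over reward-emphasis actions.
    probs = F.softmax(conductor(hidden) / conductor.tau, dim=-1)  # [G, K]
    actions = torch.multinomial(probs, 1).squeeze(-1)             # heterogeneous per response
    weights = F.one_hot(actions, K).float()                       # w^(a); one-hot emphasis here
    # (At inference, `probs` itself would serve as deterministic weights.)

    # Scalarization node: fuse the raw reward vector and the KL penalty.
    r_vec = reward_fn(prompt, responses)           # [G, K]
    kl = kl_fn(prompt, responses)                  # [G]
    R = (weights * r_vec).sum(-1) - BETA_KL * kl   # scalar reward per response

    # Group-relative advantage, as in GRPO.
    adv = (R - R.mean()) / (R.std() + 1e-6)
    return responses, actions, hidden, adv
```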

Key Designs

  1. Conductor Network:

    • Function: Dynamically selects reward weight configurations based on prompt–response semantics.
    • Mechanism: The final-layer hidden state \(h \in \mathbb{R}^{d_{\text{model}}}\), obtained after the policy model processes the full sequence, serves as context. The Conductor is implemented as a lightweight linear projection head: \(\pi_\phi(\cdot|h) = \text{softmax}((W_\phi h + b_\phi)/\tau)\). During training, a discrete action \(a\) is sampled from the categorical distribution, with each action inducing a specific reward-emphasis pattern; at inference, the continuous distribution is used directly as deterministic weights.
    • Design Motivation: The linear separability of final-layer hidden representations allows different task semantics (e.g., reasoning vs. creativity) to be distinguished via a simple linear projection, requiring no complex architecture and incurring minimal overhead.
  2. Advantage-Driven Bilevel Meta-Optimization:

    • Function: Stably trains the Conductor to learn meaningful reward trade-offs.
    • Mechanism: The meta-objective \(J(\phi) = \mathbb{E}[\hat{A}(x,y;w(h,a))]\) maximizes the expected GRPO advantage under the reward configuration selected by the Conductor. A key innovation is intra-group heterogeneous sampling: reward actions \(a_{i,j}\) are independently sampled for each response within the same prompt group, breaking the symmetry of the group baseline so that advantage differences across actions carry an informative meta-gradient signal. The gradient update is \(\nabla_\phi J(\phi) = \frac{1}{NG}\sum_{i,j}\big[\hat{A}_{i,j}\nabla_\phi\log\pi_\phi(a_{i,j}|h_{i,j}) + \lambda_{\text{ent}}\nabla_\phi\mathcal{H}(\pi_\phi)\big]\) (see the code sketch after this list).
    • Design Motivation: Under group-relative normalization, naively assigning uniform weights at the prompt level causes meta-gradients to vanish (since advantage sums to zero). Intra-group heterogeneous sampling introduces meta-competition that exposes informative variance.
  3. Asynchronous Two-Timescale Updates:

    • Function: Decouples Conductor optimization from policy model training to prevent instability.
    • Mechanism: During GRPO training, triplets \((h_{i,j}, a_{i,j}, \hat{A}_{i,j})\) are buffered, and \(\phi\) is updated periodically via the Policy Gradient Theorem. The policy model is updated at a high frequency at the token level (inner loop), while the Conductor is updated at a lower frequency at the episode level (outer loop), forming two distinct timescales.
    • Design Motivation: Decoupling meta-optimization from token-level policy training prevents coupling between meta-gradients and policy gradients, which would otherwise cause training instability or degeneracy.
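The sketch below shows the Conductor head and its outer-loop REINFORCE update, following the formulas above. Class and function names, the buffer format of \((h, a, \hat{A})\) triplets, and the default hyperparameters are assumptions for illustration.

```python
# Sketch of the Conductor head and its outer-loop update (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conductor(nn.Module):
    """Lightweight linear head: pi_phi(a | h) = softmax((W_phi h + b_phi) / tau)."""
    def __init__(self, d_model: int, num_actions: int, tau: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(d_model, num_actions)
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)  # logits; divide by tau before the softmax

def conductor_update(conductor, optimizer, buffer, lam_ent: float = 0.01):
    """Outer-loop REINFORCE step on buffered (hidden, action, advantage) triplets.

    Intra-group heterogeneous sampling guarantees that advantages within a
    group are not all identical, so this gradient does not vanish under
    group-relative normalization.
    """
    h = torch.stack([t[0] for t in buffer])       # [B, d_model]
    a = torch.tensor([t[1] for t in buffer])      # [B] sampled actions
    adv = torch.tensor([t[2] for t in buffer])    # [B] group-relative advantages

    log_probs = F.log_softmax(conductor(h) / conductor.tau, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)           # H(pi_phi(.|h))
    logp_a = log_probs.gather(1, a.unsqueeze(-1)).squeeze(-1)

    # Maximize E[A-hat * log pi(a|h)] + lam_ent * H  =>  minimize the negative.
    loss = -(adv * logp_a + lam_ent * entropy).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.clear()  # Conductor is updated periodically, on a slower timescale than pi_theta
```

The entropy term mirrors the regularizer in the meta-gradient above and keeps the Conductor from collapsing prematurely onto a single reward configuration.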

Loss & Training

The reward space comprises \(K=5\) components: a perplexity reward \(r_{\text{ppl}}\) (proxy for reasoning consistency), a format validity reward \(r_{\text{fmt}}\), an entropy reward \(r_{\text{ent}}\) (balancing exploration and redundancy), a length penalty \(r_{\text{len}}\), and a semantic preference reward \(r_{\text{pref}}\) (from the pretrained reward model Skywork-Reward). The inner loop updates the policy model using the standard GRPO loss; the outer loop updates the Conductor using REINFORCE gradients with entropy regularization.
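Tying the pieces together, a schematic two-timescale training loop (reusing `maestro_step` and `conductor_update` from the sketches above) might look as follows. The update frequency and the `policy.grpo_update` helper are assumptions; the paper only specifies that the Conductor is updated periodically at the episode level while the policy is updated at a higher frequency.

```python
# Schematic two-timescale loop (assumed schedule and helper interfaces).
def train(policy, conductor, reward_fn, kl_fn, prompts,
          policy_opt, conductor_opt, meta_update_every: int = 16):
    buffer = []
    for step, prompt in enumerate(prompts):
        responses, actions, hidden, adv = maestro_step(
            policy, conductor, reward_fn, kl_fn, prompt)

        # Inner loop: high-frequency GRPO update of pi_theta using the
        # scalarized, group-normalized advantages (standard GRPO loss).
        policy.grpo_update(policy_opt, prompt, responses, adv)

        # Buffer (h, a, A-hat) triplets for the outer loop.
        buffer.extend(zip(hidden.detach(), actions.tolist(), adv.tolist()))

        # Outer loop: low-frequency REINFORCE update of the Conductor.
        if (step + 1) % meta_update_every == 0:
            conductor_update(conductor, conductor_opt, buffer)
```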

Key Experimental Results

Main Results (Qwen3-8B)

| Dataset | Base | SFT | NOVER | EM-GRPO | MAESTRO | Gain vs. Best Baseline |
|---|---|---|---|---|---|---|
| Natural Reasoning | 39.6 | 26.0 | 46.9 | 52.0 | 53.2 | +1.2 |
| SS-GEN | 33.1 | 68.7 | 77.8 | 88.8 | 92.5 | +1.9 |
| WebInstruct | 7.8 | 34.6 | 42.7 | 43.4 | 43.5 | +0.1 |
| ToMBench | 5.7 | 46.9 | 56.2 | 63.8 | 71.9 | +8.1 |
| GeneralThoughts | 34.0 | 34.7 | 64.6 | 68.0 | 68.1 | +0.1 |
| OPUS-Books | 5.1 | 5.5 | 10.1 | 11.7 | 12.6 | +0.9 |
| EmoBench | 36.7 | 46.1 | 42.2 | 41.4 | 47.7 | +1.6 |

Ablation Study

| Configuration | Description | Result |
|---|---|---|
| Equal-Weights (Eq) | Fixed uniform weights | Moderate gains but unstable; e.g., only 38.27% on ToMBench |
| Random-Weights (Rand) | Random weights | Occasionally degrades performance; e.g., 35.7% on GeneralThoughts |
| MAESTRO (Ours) | Conductor dynamic weights | Near-optimal across almost all tasks |
| Training time on SS-GEN | w/ Conductor vs. w/o | 20.1% speedup (reduced redundant generation) |
| Training time on WebInstruct | w/ Conductor vs. w/o | Only +4.0% overhead |

Key Findings

  • Largest gain on ToMBench (+8.1%): Social intelligence tasks require flexible expression and emotional understanding, where dynamic reward orchestration yields the most significant advantage. EM-GRPO also performs strongly on this task (63.8%), yet MAESTRO still leads by a substantial margin.
  • EM-GRPO approaches MAESTRO on reasoning tasks: Low-entropy decoding favors deterministic reasoning, but degrades substantially on open-domain tasks (SS-GEN, ToMBench), demonstrating that a single inductive bias cannot generalize across domains.
  • Dynamic weights reduce generation redundancy: On SS-GEN, the Conductor learns to suppress verbose outputs early, shortening average sequence length and improving training throughput by 20.1%.
  • Conductor-learned weight patterns carry clear semantic meaning: Creative writing tasks emphasize the entropy reward, while structured reasoning tasks emphasize the perplexity reward. These patterns converge rapidly and stabilize early in training.

Highlights & Insights

  • Elegant fusion of contextual bandits and GRPO: Modeling reward weight selection as a decision problem conditioned on prompt–response semantics, implemented with a single linear head, is both elegant and efficient. This paradigm generalizes to any RL alignment scenario requiring multi-reward trade-offs.
  • Intra-group heterogeneous sampling resolves meta-signal vanishing: By exploiting the zero-mean property of group-relative advantages, assigning different reward configurations to different responses within the same group introduces variance—an elegant solution to the meta-credit assignment problem in bilevel optimization.
  • Efficiency improves rather than degrades: Dynamic reward orchestration not only avoids additional training overhead but, in long-text generation settings, achieves meaningful speedups by reducing redundant outputs, defying the intuition that added complexity implies added cost.

Limitations & Future Work

  • Validation is limited to models at the 7–8B scale; effectiveness on larger models remains to be explored.
  • The Conductor employs a simple linear projection head; more expressive architectures may capture finer-grained trade-offs.
  • The reward components are fixed to five predefined signals; automatically discovering and composing reward signals remains an open problem.
  • Evaluation relies on external LLM judges (Qwen3-235B, Gemini-2.5-Flash), which may themselves introduce evaluation bias.

Comparison with Related Work

  • vs. NOVER (Liu et al., 2025b): NOVER uses conditional perplexity as the sole reward signal within GRPO, yielding strong performance on reasoning tasks but degrading on open-domain settings. MAESTRO comprehensively surpasses it through multi-reward dynamic orchestration.
  • vs. EM-GRPO: The entropy minimization approach is competitive with MAESTRO on reasoning tasks but degrades substantially on creative and social tasks (e.g., SS-GEN: 88.8% vs. 92.5%), confirming the limitations of a single inductive bias.
  • vs. DYNAOPT (Pérez-Rosas et al., 2024): DYNAOPT adjusts reward weights at the global training-phase level, whereas MAESTRO performs semantically conditioned orchestration at the instance level, offering finer granularity and broader applicability.
  • vs. Pareto-based MORL: Pareto multi-policy methods require training and maintaining multiple large models at prohibitive cost. MAESTRO achieves dynamic Pareto front exploration using a single policy model and a lightweight Conductor.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The paper is the first to combine contextual bandits with bilevel GRPO optimization; the solution to the meta-credit assignment problem is particularly elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Seven benchmarks, two backbone models, and multiple baselines; validation on larger models is absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem motivation is clearly articulated, methodological descriptions are rigorous, and analysis is thorough (reward weight evolution visualizations are especially illuminating).
  • Value: ⭐⭐⭐⭐⭐ Provides a practical and efficient new paradigm for open-domain LLM alignment; the Conductor design is plug-and-play.