Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=FXm5U16vxD
Code: https://github.com/lukewys/realchords-pytorch
Area: AI Safety / RL Post-Training / Generative Music
Keywords: Reward Hacking, RL Post-Training, Generative Adversarial, Discriminator, Real-time Music Accompaniment, Diversity Collapse

TL;DR

Addressing the reward hacking issue in real-time melody-chord accompaniment—where RL post-training "collapses into repetitive chords to exploit consistency rewards"—this paper proposes GAPT. It utilizes a discriminator that co-evolves with the policy to provide an adversarial reward representing "authenticity relative to real data." Combined with a two-stage adaptive update schedule, it restores output diversity to near-dataset levels without sacrificing harmonic consistency. In a real-time jamming user study with 12 professional musicians, it significantly improved adaptation speed and the sense of agency.

Background & Motivation

  • Background: While LLM + RL post-training has become mainstream, most generative AI remains turn-based (input prompt \(\rightarrow\) wait seconds or minutes \(\rightarrow\) get result). Live jamming is a different mode of collaboration: it requires real-time coordination, anticipation, and immediate error correction without access to the partner's future actions, all while maintaining diversity to sustain creative flow.
  • Limitations of Prior Work: Accompaniment models trained via supervised maximum likelihood estimation (MLE) often fail during deployment because carefully curated datasets lack errors, corrections, or mutual adaptation behaviors; models never "practice" how to recover and adapt (exposure bias). RL post-training can compensate through on-policy interaction, but optimizing "harmonic consistency rewards \(R(x,y)\)" frequently triggers reward hacking: the policy discovers it can exploit scores by repeating a few simple, high-scoring chords, resulting in extremely repetitive and uninspired outputs.
  • Key Challenge: The more consistency rewards are optimized, the more severe the diversity collapse becomes. In dialogue, this appears as "tricking reward models into believing user satisfaction"; in music, it manifests as "harmonically accurate but boring-to-death" repetitive accompaniment—directly undermining the user's sense of agency in improvisation. The authors demonstrate that common KL constraints (limiting deviation from the pre-trained distribution) are insufficient to prevent this collapse in this setting.
  • Goal: Inhibit diversity collapse at its root while preserving the real-time adaptation capabilities of RL post-training, ensuring the model is both "melodically synchronized" and "musically interesting."
  • Core Idea: Replace or augment KL constraints with adversarial rewards. Drawing from GANs and Generative Adversarial Imitation Learning (GAIL), a discriminator is trained to distinguish "policy-generated chord trajectories" from "real data," providing an "authenticity" score as an additional reward to the policy. Repetitive, boring chords are easily identified by the discriminator \(\rightarrow\) low authenticity reward; chords chasing authenticity without following the melody \(\rightarrow\) low consistency reward. These two pressures complement each other, pushing the policy toward a region that is both "consistent and diverse."

Method

Overall Architecture

GAPT overlays an adversarial reward path onto standard RL post-training (PPO + consistency rewards + KL/entropy regularization). The policy \(\pi_\theta\) generates chord trajectories \(y\) online in response to the melody stream \(x\). The discriminator \(D_\psi\) maps \(y\) to an authenticity estimate of "how much it resembles real data" and co-evolves as the policy improves. Because sampling discrete chord tokens is non-differentiable, the discriminator's output cannot be backpropagated into the policy; it is instead treated as a reward and optimized with PPO. The final reward combines self-supervised consistency rewards, rule-based penalties, and the adversarial reward, and a two-stage adaptive discriminator update schedule stabilizes the adversarial training.

flowchart LR
    M[Melody flow x<br/>Frame-by-frame] --> P[Chord Policy πθ<br/>Online Generation y]
    P --> Y[Policy Trajectory y]
    Y --> R1[Consistency Reward R(x,y)<br/>Contrastive+Disc.+Rules]
    Y --> D[Discriminator Dψ<br/>Real vs. Policy]
    Data[Real Dataset D] --> D
    D --> R2[Adversarial Reward<br/>Radv = -log(1-Dψ(y))]
    R1 --> SUM[Total Reward]
    R2 --> SUM
    SUM --> PPO[PPO Updates Policy]
    PPO --> P
    PPO -. Two-stage Adaptive Gating .-> D

Key Designs

1. Adversarial Reward as Diversity Regularization: Incorporating "Realness" into Reward

The discriminator \(D_\psi\) is a Transformer encoder mapping a policy trajectory \(y\) to an authenticity score \(D_\psi(y)\in[0,1]\). During training, dataset sequences are labeled positive and current policy sequences negative for binary classification. Crucially, in the interactive setting, the discriminator is trained only on the model's output (without the full interaction trajectory) so that it learns an "input-agnostic prior" that transfers to unseen melodic inputs. Following GAIL, the adversarial reward is defined as \(R_{\text{adv}} = -\log\!\big(1 - D_\psi(y)\big)\). It forms a complementary tension with the task reward: trajectories that exploit consistency through repetition have low authenticity and therefore earn little \(R_{\text{adv}}\); trajectories that chase only authenticity without following the melody earn low consistency rewards. Together, the two signals push the policy toward "diverse, structured, and distribution-aligned" outputs. The authors emphasize that although the adversarial reward plays a role similar to a KL constraint (both discourage drift away from realistic outputs), the KL constraint alone is empirically insufficient, which makes adversarial training necessary.
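The reward definition above can be illustrated with a minimal sketch. The function names, the clipping constant `eps`, and the mixing coefficient `adv_weight` are assumptions for illustration and are not taken from the paper or its code release; only the formula \(R_{\text{adv}} = -\log(1 - D_\psi(y))\) comes from the text.

```python
import numpy as np

def adversarial_reward(d_prob, eps=1e-6):
    """GAIL-style reward R_adv = -log(1 - D(y)) for discriminator scores in [0, 1]."""
    # Clip so a score of exactly 1.0 does not produce an infinite reward.
    d = np.clip(np.asarray(d_prob, dtype=float), 0.0, 1.0 - eps)
    return -np.log1p(-d)

def total_reward(task_reward, rule_penalty, d_prob, adv_weight=1.0):
    """Combine task reward, rule penalty, and adversarial reward.

    adv_weight is a hypothetical mixing coefficient, not a value from the paper.
    """
    return task_reward - rule_penalty + adv_weight * adversarial_reward(d_prob)

# A trajectory the discriminator rejects (D close to 0) earns almost no
# adversarial reward; one it accepts as real (D close to 1) earns a large one.
scores = np.array([0.05, 0.5, 0.95])
print(adversarial_reward(scores))
```

The reward is monotonically increasing in \(D_\psi(y)\), so a collapsed policy that repeats a few chords can only recover adversarial reward by producing trajectories the discriminator finds harder to tell apart from data.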

2. Two-Stage Adaptive Discriminator Updates: Stabilizing Adversarial Rewards

Classic GAN training issues, where the discriminator improves too quickly and leads to vanishing gradients, or where training oscillates because the reward is non-stationary, are addressed via a two-stage schedule. Stage 1 (Warm-up) uses a fixed ratio to align learning speeds: for the first 200 steps, the discriminator is updated only once every 5 PPO policy updates. Stage 2 (Adaptive Gating) decides updates based on confidence: let \(\bar R_{\text{adv}}\) be the moving average of the adversarial reward over the last 3 PPO updates; the discriminator is allowed to update only if \(\bar R_{\text{adv}} > \tau\) (with \(\tau=1.0\)), and is frozen otherwise. Because \(R_{\text{adv}}=-\log(1-D_\psi(y))\) shrinks toward 0 as the discriminator confidently rejects policy samples, a low average indicates that the discriminator is already ahead of the policy. The intuition is to pause the discriminator when its signal is unstable or too strong, waiting for the policy to catch up and the reward signal to become informative again. Additionally, label smoothing (\(\alpha=0.1\)) is applied to the binary classification target to mitigate discriminator overfitting. This schedule balances learning speeds and suppresses oscillations.
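The schedule described above can be sketched as a small gate object. This is an illustrative sketch, not the authors' implementation: the class and method names are hypothetical, and it assumes that the "200 steps" are counted in PPO updates and that the moving average is a plain mean over the last 3 per-update mean adversarial rewards.

```python
import math
from collections import deque

class DiscriminatorGate:
    """Decides, per PPO update, whether the discriminator may also update."""

    def __init__(self, warmup_steps=200, warmup_every=5, window=3, tau=1.0):
        self.warmup_steps = warmup_steps
        self.warmup_every = warmup_every
        self.tau = tau
        self.recent = deque(maxlen=window)  # last `window` mean adversarial rewards
        self.step = 0

    def should_update(self, mean_adv_reward):
        self.step += 1
        self.recent.append(float(mean_adv_reward))
        if self.step <= self.warmup_steps:
            # Stage 1: fixed ratio, one discriminator update per `warmup_every` PPO updates.
            return self.step % self.warmup_every == 0
        # Stage 2: update only while the policy still earns enough adversarial reward.
        return sum(self.recent) / len(self.recent) > self.tau

def score_at_threshold(tau):
    """Discriminator score D at which -log(1 - D) equals tau."""
    return 1.0 - math.exp(-tau)

gate = DiscriminatorGate()
warmup_decisions = [gate.should_update(0.0) for _ in range(10)]
print(warmup_decisions)
print(round(score_at_threshold(1.0), 3))
```

For a single trajectory, \(\tau=1.0\) corresponds to a discriminator score of \(1-e^{-1}\approx 0.632\), so under these assumptions the discriminator is only trained further while policy samples are, on average, being scored as fairly realistic.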

3. Integrated Self-supervised Consistency Rewards + Rule Penalties: Defining "Harmonic Quality"

The task reward \(R(x,y)\) extends ReaLchords as an ensemble of self-supervised rewards, computed as a single per-episode score. The contrastive model uses InfoNCE to align melody and chord encodings, providing a global harmonic alignment signal via cosine similarity. The discriminative model takes full \((x,y)\) pairs and outputs the probability that a pair is real rather than a random mismatch, providing a complementary temporal consistency signal. To mitigate bias from transposition augmentation, a rhythm-only variant (retaining onset/hold/silence but removing pitch) is included. Rule penalties include: illegal format penalty, silence penalty (over 4% silence during active melody), early EOS penalty, and a repetition penalty (same chord repeated \(>4\) times). The PPO objective maximizes rewards with KL and entropy regularization:

\[\max_\theta\ \mathbb{E}_{x\sim D,\, y\sim\pi_\theta(\cdot|x)}\Big[R(x,y) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot|x)\,\|\,\phi_\omega(\cdot|x)\big) + \gamma\textstyle\sum_{t=1}^{T} H\big(\pi_\theta(\cdot|x_{<t},y_{<t})\big)\Big]\]

Here, the KL anchor \(\phi_\omega\) is an offline-trained model with full input access (rather than the MLE initialization, which the authors report to be ineffective for online training), and the entropy term, weighted by \(\gamma\), further encourages diversity. The online constraint is enforced by the factorization \(\pi_\theta(y|x)=\prod_t \pi_\theta(y_t|x_{<t},y_{<t})\), so the chord at frame \(t\) depends on neither the current-frame melody \(x_t\) nor any future tokens, which enables true real-time deployment.
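To make the three terms of the objective concrete, here is a minimal single-trajectory sketch. It assumes the per-frame policy and anchor logits are already available as arrays of shape `(T, vocab)`, and it estimates the KL term by summing per-frame KL divergences along one sampled trajectory, which is a common estimator but not necessarily the one used by the authors. The function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def regularized_return(task_reward, policy_logits, anchor_logits, beta, gamma):
    """R(x, y) - beta * KL(pi || phi) + gamma * sum_t H(pi_t) for one trajectory.

    policy_logits, anchor_logits: arrays of shape (T, vocab), one row per frame.
    The policy row for frame t is assumed to be conditioned only on x_{<t}, y_{<t}.
    """
    log_p = log_softmax(np.asarray(policy_logits, dtype=float))
    log_q = log_softmax(np.asarray(anchor_logits, dtype=float))
    p = np.exp(log_p)
    kl = (p * (log_p - log_q)).sum()   # per-frame KL, summed over frames
    entropy = -(p * log_p).sum()       # per-frame entropy, summed over frames
    return task_reward - beta * kl + gamma * entropy

uniform = np.zeros((3, 4))                      # 3 frames, 4 chord tokens
peaked = np.array([[5.0, 0.0, 0.0, 0.0]] * 3)   # policy concentrated on one chord
print(regularized_return(1.0, uniform, uniform, beta=0.5, gamma=0.1))
print(regularized_return(1.0, peaked, uniform, beta=0.5, gamma=0.1))
```

A policy that concentrates its mass on one chord pays a KL cost relative to a more spread-out anchor and also collects less entropy bonus, which is the intended pressure of the two regularizers.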

Key Experimental Results

Evaluation Settings

| Setting | Description | Function |
| --- | --- | --- |
| Fixed Melody Simulation | Policy responds online to held-out test melodies (including OOD Wikifonia) | Isolates online adaptation to real melodies |
| Model-Model Interaction | A melody improvisor agent and chord policy co-adapt | Approximates jamming with adaptive human partners |
| Real-time User Study | 12 professional musicians jam in a real-time interactive system | Authentic human-AI improvisation evaluation |

Baseline systems: Online MLE (Supervised only), ReaLchords (Consistency + penalties, no entropy), GAPT w/o Adv (Ablation without adversarial reward), GAPT (Full method). Metrics: Note-in-chord ratio for harmonic adaptation, Vendi Score for diversity (the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix). The authors note that neither metric is sufficient alone: optimizing only harmony leads to boredom, while optimizing only diversity leads to chaos. Ideal models must push the Pareto frontier.
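Both metrics can be sketched in a few lines. The data layout (per-frame MIDI pitches and per-frame chord pitch-class sets) and the choice of similarity kernel are assumptions for illustration; the source does not specify how chord sequences are embedded for the Vendi Score.

```python
import numpy as np

def note_in_chord_ratio(melody, chords):
    """Fraction of sounding melody frames whose pitch class belongs to the chord.

    melody: per-frame MIDI pitch, or None for silence.
    chords: per-frame set of pitch classes (0-11); an empty set means no chord.
    """
    hits, total = 0, 0
    for pitch, chord in zip(melody, chords):
        if pitch is None:
            continue
        total += 1
        hits += (pitch % 12) in chord
    return hits / total if total else 0.0

def vendi_score(similarity):
    """exp(Shannon entropy of the eigenvalues of K / n) for a PSD kernel with unit diagonal."""
    k = np.asarray(similarity, dtype=float)
    eigvals = np.linalg.eigvalsh(k / k.shape[0])
    eigvals = eigvals[eigvals > 1e-12]  # drop zero or numerically negative eigenvalues
    return float(np.exp(-(eigvals * np.log(eigvals)).sum()))

c_major = {0, 4, 7}
print(note_in_chord_ratio([60, 64, None, 61], [c_major] * 4))
print(vendi_score(np.eye(4)), vendi_score(np.ones((4, 4))))
```

The Vendi Score behaves like an effective sample count: it equals 1 when all samples are identical under the kernel and equals the number of samples when they are mutually dissimilar, which is why a collapsed policy scores low.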

| System | Harmony (note-in-chord) | Diversity (Vendi) | Overall Performance |
| --- | --- | --- | --- |
| Online MLE | Low | High | Diverse but dissonant; fails on deployment |
| ReaLchords | High | Low (collapsed) | Harmonically strong but repetitive/boring |
| GAPT w/o Adv | High | Low | Similar to ReaLchords; diversity crushed |
| GAPT (Ours) | High | High (near dataset) | Pushes Pareto frontier |

Ablation Study: Adversarial Reward is the Key to Diversity Recovery

| Configuration | Adversarial Reward | Diversity Result |
| --- | --- | --- |
| GAPT w/o Adv | No | Diversity collapse (comparable to ReaLchords) |
| KL Constraint Only | No | Empirically insufficient to inhibit reward hacking |
| GAPT | Yes | Diversity restored, Harmony maintained |

Key Findings

  • Comprehensive Advancement of the Pareto Frontier: In fixed melody simulations, Online MLE had high diversity but poor harmony; ReaLchords and the no-adversarial ablation had strong harmony but crushed diversity. GAPT achieved both high diversity and strong harmony, with t-SNE visualizations showing GAPT covers a broader accompaniment space.
  • Stronger Co-adaptation: When jamming with a learned melody agent, GAPT consistently outperformed ReaLchords and the ablation in both harmony and diversity, confirming the adversarial reward acts as an explicit diversity regulator.
  • Significant Positive User Study Results: Across three Likert scales (Adaptation Quality / Adaptation Speed / Agency), GAPT received the highest mean scores. It significantly outperformed ReaLchords in adaptation speed and agency (\(p<0.05\)). Qualitative feedback noted GAPT "grasps tonality and changes faster" (P10) while ReaLchords "is quite dumb, giving the same two chords repeatedly" (P7).
  • KL Insufficiency, Adversarial Necessity: Empirical results confirm KL constraints alone cannot suppress reward hacking in this setting; adversarial training is essential to preserve authenticity while learning to adapt.

Highlights & Insights

  • Reactivating "Old" GAN/GAIL Concepts: The authors explicitly note that generative adversarial objectives largely exited the mainstream after 2020 but demonstrate their unique value in mitigating reward hacking. The discriminator naturally acts as a regularizer against "trivial output collapse," a perspective applicable to any sequence model using learned rewards for RL.
  • Intuitive Diagnosis of Reward Hacking: Translating the abstract "tricking the reward model" into the musical "repeating simple chords" makes reward hacking audible and visible, providing a concrete explanation for why KL constraints fail.
  • Robust Engineering Loop: The work moves beyond simulation to deploy models in a real-time client-server system based on ReaLJam, using look-ahead buffers against latency and employing professional musicians for blind testing.
  • Two-Stage Gated Update is a Practical Trick: Using the moving average of adversarial rewards as a confidence gate for discriminator updates is a simple, effective remedy for the "overpowered discriminator" problem in GANs.

Limitations & Future Work

  • Restricted to Accompaniment Setting: The method assumes a fixed melody distribution \(p(x_t|x_{<t})\) and no shared context at startup, and it does not address full bidirectional co-evolution.
  • Narrow Task Scope: Validated only on monophonic melody \(\rightarrow\) chord, pop/folk styles, and frame-level discrete tokens. Generalization to polyphony, continuous audio, or other styles remains to be seen.
  • Small User Study Scale: 12 participants and 1-2 minute tasks limit statistical power and focus primarily on experienced musicians.
  • Discriminator Context: To obtain an input-agnostic prior, the discriminator does not see the full interaction trajectory, which may lose information in tasks requiring tighter context coordination.

Related Work

  • Mitigating Reward Hacking: Traditional approaches use KL penalties (Jaques 2017, Ouyang 2022); recent findings suggest KL is insufficient, leading to elastic resets or reward shaping. This work is orthogonal: it introduces a discriminator's adversarial reward as a regularizer during policy training.
  • Real-time Music Accompaniment: Evolves from score-following (Antescofo) and rule-based systems (OMax) to deep learning models (BachDuet, SongDriver, ReaLchords). This paper identifies "diversity collapse = reward hacking" in consistency-only RL.
  • Generative Adversarial Learning: GAN (Goodfellow 2014), GAIL (Ho & Ermon 2016). AMP (Peng 2021) used discriminators to improve motion naturalness in physics-based character control; this paper extends that lineage to real-time human-AI interaction.

Rating

  • Novelty: ⭐⭐⭐⭐ — Re-introduces GAN/GAIL adversarial rewards into the LLM-era RL post-training to mitigate reward hacking; the "Adversarial Necessity" argument is well-reasoned.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive three-tier evaluation; includes OOD data, t-SNE, and significance testing.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation, intuitive explanations of abstract concepts, and well-detailed methodologies.
  • Value: ⭐⭐⭐⭐ — Practical value for real-time generative collaboration and music AI; open-sourced weights and real-time infrastructure enhance reproducibility.