M³HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality¶
Conference: ICML 2025
arXiv: 2503.02077
Code: Not released
Area: LLM Alignment
Keywords: Multi-agent Reinforcement Learning, Human Feedback, Reward Design, LLM Parsing, Adaptive Weights, Mixed Quality Feedback
TL;DR¶
M³HF integrates multi-phase, mixed-quality natural-language human feedback into multi-agent reinforcement learning. An LLM parses each piece of feedback, and the reward function is updated through predefined templates and adaptive weights, which significantly improves multi-agent coordination performance.
Background & Motivation¶
Reward function design in multi-agent reinforcement learning (MARL) remains a core challenge. In complex cooperative environments, hand-crafted reward functions often lead to sub-optimal or misaligned behaviors. Existing approaches face the following issues:
Sparse Reward Dilemma: Final reward signals for cooperative tasks (e.g., cooking, football) are sparse, making it difficult for agents to learn effective coordination through trial and error.
Reward Engineering Challenges: Designing dense rewards by hand requires extensive domain knowledge, and improper designs can easily lead to reward hacking.
Limitations of Single-Phase Feedback: Traditional RLHF methods collect feedback only before or after training, so the feedback cannot adapt to the agents' actual behaviors as they evolve.
Inconsistent Feedback Quality: The expertise of different human evaluators varies significantly, requiring the system to handle feedback of mixed quality.
The core insight of M³HF is that by strategically pausing training multiple times, collecting human feedback, and adaptively integrating feedback of varying quality, the system can guide multi-agent collaborative policy learning more efficiently.
Method¶
Overall Architecture¶
M³HF extends the standard Markov Game into a Multi-phase Human Feedback Markov Game (MHF-MG), which consists of the following pipeline:
- Training Phase: Agents undergo standard training for a certain period.
- Evaluation Pause: Training is strategically paused to display videos of agent behaviors to human evaluators.
- Feedback Collection: Human evaluators provide natural language feedback (e.g., "put the ingredients on the central table").
- LLM Parsing: LLMs are utilized to parse natural language feedback into structured reward signals.
- Reward Update: Reward functions are generated from predefined templates and integrated using adaptive weights.
- Resumed Training: Training resumes with the updated reward functions.
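The pipeline above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `multiphase_loop`, `train`, `get_feedback`, `parse_feedback`, and the `RewardFn` signature are all hypothetical names introduced here, and the human and LLM steps are passed in as plain callables.

```python
from typing import Callable, List, Tuple

# Hypothetical signature for illustration: a reward function maps a state to a scalar.
RewardFn = Callable[[dict], float]


def multiphase_loop(
    num_phases: int,
    train: Callable[[List[RewardFn]], float],
    get_feedback: Callable[[int], str],
    parse_feedback: Callable[[str], RewardFn],
    base_reward: RewardFn,
) -> Tuple[List[RewardFn], List[float]]:
    """Alternate training phases with feedback pauses (assumed structure)."""
    reward_fns: List[RewardFn] = [base_reward]
    scores: List[float] = []
    for k in range(num_phases):
        # Training phase (also covers resumed training after a reward update).
        scores.append(train(reward_fns))
        if k == num_phases - 1:
            break  # no pause after the final phase
        # Evaluation pause and feedback collection.
        text = get_feedback(k)
        # LLM parsing and reward update: one new reward function per feedback.
        reward_fns.append(parse_feedback(text))
    return reward_fns, scores
```

In this sketch each feedback round adds exactly one reward function; how those functions are weighted against each other is left to the caller's `train` routine.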
Key Designs: LLM Feedback Parsing¶
Human feedback is parsed by an LLM into a structured format containing:
- Target Agent: Which agent(s) the feedback is directed toward.
- Reward Type: Matches a predefined reward template (e.g., distance-based, action-based).
- Parameter Settings: Specific parameter values within the templates.
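One way to picture the structured format is as a small record validated from the LLM's output. The sketch below assumes the LLM is prompted to return JSON; the field names (`target_agents`, `reward_type`, `params`), the type labels in `ALLOWED_TYPES`, and the function `parse_llm_output` are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical labels for the reward templates; the paper's own identifiers may differ.
ALLOWED_TYPES = {"distance", "action", "coordination"}


@dataclass
class ParsedFeedback:
    target_agents: List[str]          # which agent(s) the feedback addresses
    reward_type: str                  # which reward template to instantiate
    params: Dict[str, Any] = field(default_factory=dict)  # template parameters


def parse_llm_output(raw: str) -> ParsedFeedback:
    """Validate a JSON string (assumed LLM output) into a ParsedFeedback."""
    data = json.loads(raw)
    if data["reward_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown reward type: {data['reward_type']!r}")
    return ParsedFeedback(
        target_agents=list(data["target_agents"]),
        reward_type=data["reward_type"],
        params=dict(data.get("params", {})),
    )
```

Rejecting unknown reward types at this step keeps malformed LLM output from silently turning into a reward signal.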
Predefined reward templates include:
- Distance-based: Rewards agents for moving closer to/further from specific locations.
- Action-based: Rewards execution of specific action sequences.
- Coordination-based: Rewards cooperative behaviors among multiple agents.
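To make the template idea concrete, the sketch below writes each template as a factory that returns a reward function. Everything here is an illustrative assumption: the `State` layout (agent name to 2D position), the function names, the simplification of "action sequences" to a single action, and the use of proximity as a stand-in for coordination.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical state layout: agent name -> (x, y) position.
State = Dict[str, Tuple[float, float]]
Reward = Callable[[State, str], float]


def distance_reward(agent: str, target: Tuple[float, float], scale: float = 1.0) -> Reward:
    """Distance-based template: less negative as the agent approaches the target."""
    def reward(state: State, action: str) -> float:
        x, y = state[agent]
        return -scale * math.hypot(x - target[0], y - target[1])
    return reward


def action_reward(wanted_action: str, bonus: float = 1.0) -> Reward:
    """Action-based template, simplified here to a single action instead of a sequence."""
    def reward(state: State, action: str) -> float:
        return bonus if action == wanted_action else 0.0
    return reward


def coordination_reward(agent_a: str, agent_b: str, radius: float, bonus: float = 1.0) -> Reward:
    """Coordination-based template, approximated as two agents being within a radius."""
    def reward(state: State, action: str) -> float:
        (xa, ya), (xb, yb) = state[agent_a], state[agent_b]
        return bonus if math.hypot(xa - xb, ya - yb) <= radius else 0.0
    return reward
```

With templates in this form, the LLM only has to choose a template and fill in its parameters; it never writes reward code directly.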
Key Designs: Adaptive Weight Adjustment¶
At the \(k\)-th phase, the combined reward function is formulated as a weighted sum:
\[\hat{R}_k = \sum_{m=0}^{k} w_m R_m\]
where \(R_0\) is the original environment reward, \(R_m\) (\(m > 0\)) is the reward generated from human feedback, and \(w_m\) is the weight assigned to \(R_m\). The weight update mechanism includes:
- Weight Decay: Gradually reduces the influence of historical feedback over time, steering the system to focus on the latest feedback.
- Performance-based Adjustment: If performance degrades after introducing new feedback (\(\Delta r_k < 0\)), the weight of that feedback is automatically clipped to zero.
- Incremental Validation: Evaluates the actual utility of new reward functions through rollouts, keeping only beneficial feedback.
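The weighting scheme can be illustrated with a short sketch. It is a simplified reading of the mechanisms listed above rather than the paper's exact update rule: the decay factor, the choice to leave the environment weight undecayed, the final normalization, and the names `update_weights` and `combined_reward` are all assumptions made for illustration.

```python
from typing import List, Sequence


def combined_reward(weights: Sequence[float], reward_values: Sequence[float]) -> float:
    """Weighted sum of the environment reward and the feedback-derived rewards."""
    return sum(w * r for w, r in zip(weights, reward_values))


def update_weights(weights: Sequence[float], delta_r: float, decay: float = 0.9) -> List[float]:
    """Assumed update: weights[0] is the environment reward, weights[-1] the newest feedback.

    delta_r is the change in performance measured after adding the newest feedback.
    """
    new = list(weights)
    # Weight decay: shrink older feedback weights (not the environment or newest one).
    for m in range(1, len(new) - 1):
        new[m] *= decay
    # Performance-based adjustment: drop the newest feedback if performance degraded.
    if len(new) > 1 and delta_r < 0:
        new[-1] = 0.0
    # Normalize so the weights sum to 1 (an assumption of this sketch).
    total = sum(new)
    return [w / total for w in new] if total > 0 else new
```

For example, starting from weights `[1.0, 1.0, 1.0]` with `decay=0.5`, a negative `delta_r` zeroes the newest weight and renormalizes the rest, while a positive `delta_r` keeps the newest feedback at full strength relative to the decayed older one.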
Theoretical Guarantee¶
Proposition 4.2 (Robustness to Flawed Feedback) gives a performance bound that holds after any \(K\) rounds of feedback. In that bound, \(\delta\) is a bounded positive constant and \(i_j\) is the index of the \(j\)-th valid feedback. The result indicates that the algorithm benefits from every piece of high-quality feedback, whereas any performance degradation is bounded by the effect of the latest erroneous feedback.
Key Experimental Results¶
Main Results: Overcooked Environment¶
| Method | Layout A (Simple) | Layout B (Complex) | Layout C (Most Complex) |
|---|---|---|---|
| IPPO | 19.2 ± 4.5 | 23.1 ± 2.7 | 27.4 ± 4.9 |
| MAPPO | Lower than IPPO | Lower than IPPO | Lower than IPPO |
| Mac-based | Moderate | Moderate | Moderate |
| M³HF | 164.8 ± 1.2 | Significantly outperforms baselines | Significantly outperforms baselines |
M³HF consistently and significantly outperforms the baselines across all layouts and recipe settings, yielding up to a 50% performance improvement in complex tasks.
Ablation Study¶
| Experiment | Finding |
|---|---|
| Mixed-quality feedback | Even with incorrect feedback, M³HF performs close to the oracle and does not degrade significantly. |
| Deliberately misleading feedback | The adaptive weight mechanism effectively suppresses the negative impacts of misleading feedback. |
| Removing LLM parsing | Performance drops, validating the necessity of the LLM parsing component. |
| Fixed weights vs. adaptive weights | Adaptive weights significantly outperform the fixed weight strategy. |
| Replacing humans with VLMs | Gemini-1.5-Pro can generate human-like feedback, but still falls short on specificity compared to humans. |
| Google Football 5v5 | M³HF continues to outperform standard MARL baselines in more complex environments. |
Training Efficiency¶
- Feedback Rounds: Only 2-5 rounds of human feedback are required.
- Training Acceleration: 3×-6× acceleration (15k episodes vs. 80k-100k episodes for traditional methods).
- Human Labor Cost: Approximately 3-5 minutes per feedback round, totaling under 25 minutes.
- Time Saved: Saves 10-15 hours of training time.
Highlights & Insights¶
- Human-AI Collaboration Paradigm: Unlike conventional offline preference collection in RLHF, M³HF provides online human guidance during training, tightly coupling feedback with policy behaviors.
- Handling Mixed Quality: The adaptive weight mechanism elegantly addresses the challenge of integrating feedback of varying quality, enabling non-experts to contribute effectively.
- LLM as a Bridge: Utilizing LLMs to translate natural language feedback into structured rewards lowers the barrier for humans to provide feedback.
- High Practicality: Only a few rounds of human feedback (2-5 rounds) are needed to drastically improve performance, resulting in very low practical deployment costs.
- Exploring VLM Alternatives: The feasibility of replacing humans with VLMs is preliminarily validated, a first step toward full automation.
Limitations & Future Work¶
- Dependency on Domain Knowledge for Reward Templates: Predefined templates must be manually crafted for new environments.
- Limited Environmental Evaluation: Primarily validated on Overcooked; although extended to Google Research Football, generalization to more complex environments remains to be studied.
- Strong Theoretical Assumptions: Assumptions such as Gaussian noise and ergodicity may not always hold in realistic deployment scenarios.
- Inadequate VLM Feedback Precision: Feedback currently generated by VLMs is often overly generic (e.g., "improve coordination") and lacks concrete guidance.
- Lack of Direct Comparison with Other Human-Feedback MARL Methods: Direct comparisons with preference-based reinforcement learning approaches like PbMARL are missing.
Related Work & Insights¶
- RLHF for LLM (Ouyang et al.): M³HF generalizes RLHF methodologies to multi-agent and multi-phase scenarios.
- LLM-Driven Reward Design (Motif, etc.): The innovation of M³HF lies in multi-phase and multi-agent feedback allocation.
- MARL (MAPPO/IPPO): M³HF serves as an orthogonal enhancement to these foundational algorithms.
- Insight: Multi-phase feedback collection strategies can be generalized to single-agent LLM alignment, suggesting that collecting feedback at training midpoints might be more efficient than doing so only pre- or post-training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Integrating multi-phase, multi-agent human feedback presents a novel problem formulation.
- Experimental Thoroughness: ⭐⭐⭐ — Sufficiently ablated but limited in environment diversity (primarily focused on Overcooked).
- Value: ⭐⭐⭐⭐ — Substantial performance improvements can be achieved with minimal feedback, leading to low deployment costs.
- Recommendation Index: ⭐⭐⭐⭐ — Recommended reading for researchers focusing on the intersection of MARL and RLHF.