Improving Model Alignment through Collective Intelligence of Open-Source LLMs
Conference: ICML 2025
arXiv: 2505.03059
Code: Coming soon
Area: LLM Alignment
Keywords: model alignment, mixture of agents, synthetic data, preference optimization, self-improvement
TL;DR
This paper proposes Mixture of Agents Alignment (MoAA), which uses the collective intelligence of multiple open-source LLMs to generate high-quality alignment data (SFT data and preference data). Training on this data raises LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, and the paper also demonstrates self-improvement without strong external supervision.
Background & Motivation
Background: LLM alignment—making model outputs helpful and harmless—relies on high-quality human-annotated data for supervised fine-tuning (SFT) and preference optimization (DPO/RLHF).
Limitations of Prior Work: Human-annotated data is expensive, difficult to produce at scale, and may suffer from insufficient diversity and annotator bias. Existing synthetic data methods (such as generating alignment data using GPT-4) rely on a single strong model, which limits the diversity of generated data and creates a dependency on closed-source models.
Key Challenge: How to scale up the size and diversity of alignment data while reducing reliance on a single strong model? The individual capability of open-source models may not match GPT-4, but can their collective intelligence bridge this gap?
Goal: To leverage the collaboration of multiple open-source LLMs to generate high-quality alignment data.
Key Insight: The Mixture of Agents (MoA) concept—multiple LLMs generate responses individually, and then an aggregator synthesizes the merits of different responses to produce a final response superior to that of any single model.
Core Idea: Apply the MoA framework to two stages of alignment data generation: (1) SFT data generation, where multiple models collaborate to produce high-quality instruction-response pairs; (2) preference data generation, where differences among the models' outputs naturally yield chosen/rejected pairs.
Method
Overall Architecture
Input: A set of open-source LLMs (e.g., LLaMA-3.1-70B, Qwen2-72B, Mixtral-8x22B) and a target model to align (e.g., LLaMA-3.1-8B-Instruct)
Output: Target model with significantly improved alignment performance
Pipeline:
1. MoA-SFT: Generate SFT training data through multi-model collaboration → Fine-tune the target model
2. MoA-DPO: Construct preference pairs using multi-model outputs → Preference optimization
Key Designs
- MoA Response Generation:
    - Function: For each instruction prompt, multiple LLMs generate individual responses, which are then synthesized by an aggregator model into a final response.
    - Mechanism: Layer 1: \(K\) LLMs generate individual responses \(\{r_1, \ldots, r_K\}\). Layer 2: the aggregator model receives all responses and the original prompt and generates the synthesized response \(r^*\).
    - Design Motivation: Different models possess different knowledge and "personalities" (e.g., some excel in reasoning while others excel in writing), and MoA can integrate these complementary advantages.
- MoA-SFT Data Construction:
    - Function: Use high-quality responses generated by MoA as SFT training targets.
    - Mechanism: Each prompt \(x\) is paired with its MoA response, and \((x, r^*)\) is used as the training pair. Because the MoA response is intended to be better than any single model's response, the fine-tuned model learns from targets stronger than any individual source model would provide.
    - Design Motivation: Replace GPT-4 annotation while providing higher diversity.
- MoA-DPO Preference Data Construction:
    - Function: Utilize quality differences in multi-model outputs to construct preference pairs.
    - Mechanism: The synthesized MoA response \(r^*\) is used as the "chosen" (positive) sample, while the worst-performing individual model response \(r_{\text{worst}}\) is used as the "rejected" (negative) sample.
    - Design Motivation: Eliminates the need for external evaluators (such as humans or GPT-4); preference signals are derived from internal comparisons within the model ensemble.
- Self-Improvement Pipeline:
    - Function: Use the model fine-tuned by MoAA as a participant in the next round of MoA.
    - Mechanism: Iteration \(t\): Engage the current model in MoA → Generate better training data → Fine-tune → Engage the new model in MoA again.
    - Design Motivation: This forms a positive feedback loop: model capability improvement → better data generation → further improvement.
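The four designs above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' implementation: a "model" is treated as any callable from prompt text to response text, the aggregation template is invented for the example, and `score` and `finetune` are hypothetical placeholders (the summary does not say how the worst response is identified or how training is run).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" is any callable from prompt text to response text.
Model = Callable[[str], str]


def moa_generate(prompt: str, proposers: List[Model], aggregator: Model) -> Tuple[str, List[str]]:
    """Two-layer MoA: K proposers answer, then an aggregator synthesizes r*."""
    proposals = [model(prompt) for model in proposers]  # Layer 1: r_1 .. r_K
    numbered = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(proposals))
    # Assumed aggregation template; the prompt used in the paper is not given here.
    agg_prompt = (
        "Synthesize the candidate responses into a single, better answer.\n"
        f"Instruction: {prompt}\nCandidates:\n{numbered}"
    )
    return aggregator(agg_prompt), proposals  # Layer 2: r*


def build_sft_pair(prompt: str, proposers: List[Model], aggregator: Model) -> Dict[str, str]:
    """MoA-SFT: the synthesized response becomes the training target."""
    r_star, _ = moa_generate(prompt, proposers, aggregator)
    return {"prompt": prompt, "response": r_star}


def build_dpo_pair(prompt: str, proposers: List[Model], aggregator: Model,
                   score: Callable[[str, str], float]) -> Dict[str, str]:
    """MoA-DPO: chosen = r*, rejected = lowest-scoring individual response.

    `score(prompt, response)` is a hypothetical quality estimate.
    """
    r_star, proposals = moa_generate(prompt, proposers, aggregator)
    r_worst = min(proposals, key=lambda r: score(prompt, r))
    return {"prompt": prompt, "chosen": r_star, "rejected": r_worst}


def self_improve(prompts: List[str], proposers: List[Model], aggregator: Model,
                 target: Model, finetune: Callable[[Model, List[Dict[str, str]]], Model],
                 rounds: int = 2) -> Model:
    """Each round, the current target joins the ensemble, then is fine-tuned on new data."""
    for _ in range(rounds):
        ensemble = proposers + [target]
        data = [build_sft_pair(p, ensemble, aggregator) for p in prompts]
        target = finetune(target, data)  # hypothetical training call
    return target


if __name__ == "__main__":
    # Toy stand-ins for real LLMs; response length is a placeholder score only.
    proposers = [lambda p: "short", lambda p: "a much longer answer"]
    aggregator = lambda p: "merged answer"
    print(build_dpo_pair("Explain DPO.", proposers, aggregator, score=lambda p, r: len(r)))
```

With real models, each callable would wrap an inference call, so one prompt costs \(K\) proposer calls plus one aggregator call.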
Loss & Training
- SFT phase: Standard next-token cross-entropy loss
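- DPO phase: for a prompt \(x\) with chosen response \(r_w\) (the MoA response \(r^*\)) and rejected response \(r_l\) (the response \(r_{\text{worst}}\)), \(\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(r_w|x)}{\pi_{\text{ref}}(r_w|x)} - \beta \log \frac{\pi_\theta(r_l|x)}{\pi_{\text{ref}}(r_l|x)}\right)\), where \(\pi_\theta\) is the model being trained, \(\pi_{\text{ref}}\) is the frozen reference model, \(\sigma\) is the logistic sigmoid, and \(\beta\) scales the implicit reward.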
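To make the DPO objective concrete, the per-pair loss can be evaluated directly from sequence log-probabilities. The sketch below assumes the four log-probabilities have already been computed by the policy and the reference model; the function name and the default `beta=0.1` are illustrative choices, not values taken from the paper.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    logp_w / logp_l: log pi_theta(r_w|x) and log pi_theta(r_l|x)
    ref_logp_w / ref_logp_l: the same quantities under pi_ref
    beta: assumed default, for illustration only
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ≈ 0.6931
# Raising the chosen log-probability relative to the reference lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss decreases as the policy assigns relatively more probability to the chosen response and relatively less to the rejected one, compared with the reference model.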
Key Experimental Results
Main Results
| Target Model → Benchmark | Metric | MoAA | GPT-4o Distillation | Self-Data | Baseline (before MoAA training) |
|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct → Arena-Hard | Win Rate (%) | 48.3 | 42.1 | 31.5 | 19.5 |
| LLaMA-3.1-8B-Instruct → AlpacaEval2 | Win Rate (%) | 57.23 | 49.8 | 35.4 | 22.33 |
| LLaMA-3.1-8B-Instruct → MT-Bench | Average Score | 8.12 | 7.85 | 7.21 | 6.58 |
Ablation Study
| Configuration | Arena-Hard WR | AlpacaEval2 WR | Description |
|---|---|---|---|
| MoAA (SFT + DPO) | 48.3 | 57.23 | Full Method |
| MoA-SFT Only | 39.7 | 45.6 | Adding DPO contributes +8.6 / +11.6 WR |
| MoA-DPO Only | 35.2 | 41.8 | SFT foundation is important |
| Single Model (GPT-4o) SFT | 42.1 | 49.8 | Full MoAA outperforms SFT on a single strong model |
| Single Model (LLaMA-70B) SFT | 33.8 | 38.2 | Single open-source model is insufficient |
| Self-Improvement (2 Rounds) | 51.2 | 60.1 | Positive feedback loop is effective |
Key Findings
- MoAA improves the Arena-Hard Win Rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 (+28.8).
- With the full pipeline, data generated collaboratively by multiple open-source models yields a stronger model than data generated by GPT-4o alone (48.3 vs. 42.1 on Arena-Hard).
- Self-improvement is viable: the second iteration round further improves performance by about 3 WR points (48.3 → 51.2 on Arena-Hard, 57.23 → 60.1 on AlpacaEval2).
- MoA-SFT and MoA-DPO are complementary: dropping DPO costs 8.6 Arena-Hard WR points and dropping SFT costs 13.1.
- The more diverse the models participating in MoA (from different model families), the better the performance.
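The gains quoted in these findings can be recomputed from the two tables. The short check below copies the reported win rates into dictionaries (the key names are labels chosen for this example) and prints the differences.

```python
# Win rates copied from the Main Results and Ablation Study tables.
arena_hard = {"baseline": 19.5, "moaa": 48.3, "sft_only": 39.7,
              "dpo_only": 35.2, "gpt4o_sft": 42.1, "round2": 51.2}
alpaca_eval2 = {"baseline": 22.33, "moaa": 57.23, "sft_only": 45.6,
                "dpo_only": 41.8, "gpt4o_sft": 49.8, "round2": 60.1}


def gain(scores: dict, better: str, worse: str) -> float:
    """Difference in win-rate points, rounded to two decimals."""
    return round(scores[better] - scores[worse], 2)


for name, scores in [("Arena-Hard", arena_hard), ("AlpacaEval2", alpaca_eval2)]:
    print(name)
    print("  MoAA vs. baseline:      ", gain(scores, "moaa", "baseline"))
    print("  DPO on top of SFT:      ", gain(scores, "moaa", "sft_only"))
    print("  SFT underneath DPO:     ", gain(scores, "moaa", "dpo_only"))
    print("  MoAA vs. GPT-4o SFT:    ", gain(scores, "moaa", "gpt4o_sft"))
    print("  Second round vs. first: ", gain(scores, "round2", "moaa"))
```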
Highlights & Insights
- High Practical Value: Entirely based on open-source models, eliminating reliance on GPT-4.
- Self-Improvement: Demonstrates the potential of the open-source LLM ecosystem to transcend individual capability upper bounds through collaboration.
- Simple Methodology: The pipeline of MoA + SFT + DPO is straightforward and easy to replicate.
Limitations & Future Work
- Whether self-improvement will encounter a "ceiling effect" (where the model ensemble cannot provide signals beyond itself) remains to be validated in the long run.
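- The computational cost of MoA scales linearly with the number of participating models, and data generation requires running 3-5 models of roughly 70B parameters.
- Harmlessness is not sufficiently evaluated; the reported benchmarks (Arena-Hard, AlpacaEval2, MT-Bench) mainly measure helpfulness and instruction following.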
- Quality control and filtering strategies for the data generated by MoAA are not discussed.
Related Work & Insights
- Mixture of Agents (Wang et al., 2024): The original work on MoA.
- Self-Play Fine-Tuning (Chen et al., 2024): Another self-improvement methodology.
- This work demonstrates the feasibility of "collective intelligence > individual capability" in LLM alignment.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of MoA and alignment is novel, though the individual components (MoA, SFT, DPO) are already known.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple top-tier benchmarks evaluated, comprehensive ablations, and validated self-improvement.
- Writing Quality: ⭐⭐⭐⭐ The methodology description is clear and the experimental results are convincing.
- Value: ⭐⭐⭐⭐⭐ Significantly drives the development of the open-source LLM ecosystem.