Improving Model Alignment through Collective Intelligence of Open-Source LLMs

Conference: ICML 2025
arXiv: 2505.03059
Code: Coming soon
Area: LLM Alignment
Keywords: model alignment, mixture of agents, synthetic data, preference optimization, self-improvement

TL;DR

This paper proposes Mixture of Agents Alignment (MoAA), which uses the collective intelligence of multiple open-source LLMs to generate high-quality alignment data (both SFT data and preference data). Training on this data raises LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, and the method supports self-improvement without supervision from a stronger external model.

Background & Motivation

Background: LLM alignment—making model outputs helpful and harmless—relies on high-quality human-annotated data for supervised fine-tuning (SFT) and preference optimization (DPO/RLHF).

Limitations of Prior Work: Human-annotated data is expensive, difficult to produce at scale, and may suffer from insufficient diversity and annotator bias. Existing synthetic data methods (such as generating alignment data using GPT-4) rely on a single strong model, which limits the diversity of generated data and creates a dependency on closed-source models.

Key Challenge: How to scale up the size and diversity of alignment data while reducing reliance on a single strong model? The individual capability of open-source models may not match GPT-4, but can their collective intelligence bridge this gap?

Goal: To leverage the collaboration of multiple open-source LLMs to generate high-quality alignment data.

Key Insight: The Mixture of Agents (MoA) concept—multiple LLMs generate responses individually, and then an aggregator synthesizes the merits of different responses to produce a final response superior to that of any single model.

Core Idea: Applying the MoA framework to two stages of alignment data generation: (1) SFT data generation—multi-model collaboration to produce high-quality instruction-response pairs; (2) Preference data generation—utilizing the output differences of multiple models to naturally construct positive and negative sample pairs.

Method

Overall Architecture

Input: A set of open-source LLMs (e.g., LLaMA-3.1-70B, Qwen2-72B, Mixtral-8x22B, etc.), target alignment model (e.g., LLaMA-3.1-8B-Instruct)
Output: Target model with significantly improved alignment performance

Pipeline:
  1. MoA-SFT: Generate SFT training data through multi-model collaboration → Fine-tune the target model
  2. MoA-DPO: Construct preference pairs using multi-model outputs → Preference optimization

Key Designs

  1. MoA Response Generation:

    • Function: For each instruction prompt, multiple LLMs generate individual responses, which are then synthesized by an aggregator model into a final response.
    • Mechanism: Layer 1—\(K\) LLMs generate individual responses \(\{r_1, \ldots, r_K\}\); Layer 2—the aggregator model receives all responses and the original prompt to generate the synthesized response \(r^*\).
    • Design Motivation: Different models possess different knowledge and "personalities" (e.g., some excel in reasoning while others in writing), and MoA can integrate these complementary advantages.
  2. MoA-SFT Data Construction:

    • Function: Use high-quality responses generated by MoA as SFT training targets.
    • Mechanism: \((prompt, r^*_{\text{MoA}})\) is used as the training pair. Since the MoA response quality exceeds that of any single model, the fine-tuned model can surpass its training data sources.
    • Design Motivation: Substitute GPT-4 annotation while providing higher diversity.
  3. MoA-DPO Preference Data Construction:

    • Function: Utilize quality differences in multi-model outputs to construct preference pairs.
    • Mechanism: The synthesized MoA response \(r^*\) is used as the "chosen" (positive) sample, while the worst-performing individual model response \(r_{\text{worst}}\) is used as the "rejected" (negative) sample.
    • Design Motivation: Eliminates the need for external evaluators (such as humans or GPT-4); preference signals are derived from internal comparisons within the model ensemble.
  4. Self-Improvement Pipeline:

    • Function: Use the model fine-tuned by MoAA as a participant in the next round of MoA.
    • Mechanism: Iteration \(t\): Engage the current model in MoA → Generate better training data → Fine-tune → Engage the new model in MoA again.
    • Design Motivation: This forms a positive feedback loop: model capability improvement → better data generation → further improvement.
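
The data-construction steps in designs 1 to 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: models are represented as plain callables from prompt text to response text, the aggregator prompt wording is invented for the example, and the `score` function used to pick \(r_{\text{worst}}\) is a hypothetical stand-in, since this summary does not say how the worst response is identified.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any callable mapping prompt text to response text.
Model = Callable[[str], str]
# Assumption: a hypothetical quality scorer (higher is better).
Scorer = Callable[[str, str], float]


def build_aggregator_prompt(prompt: str, responses: List[str]) -> str:
    """Pack the original prompt and all Layer-1 responses into one input."""
    numbered = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(responses))
    return (
        "Synthesize the candidate responses into one improved answer.\n"
        f"Instruction: {prompt}\n"
        f"Candidates:\n{numbered}"
    )


def moa_generate(
    prompt: str, proposers: List[Model], aggregator: Model
) -> Tuple[List[str], str]:
    """Layer 1: K proposers answer independently. Layer 2: aggregator merges."""
    responses = [model(prompt) for model in proposers]
    r_star = aggregator(build_aggregator_prompt(prompt, responses))
    return responses, r_star


def make_sft_record(prompt: str, r_star: str) -> Dict[str, str]:
    """MoA-SFT: the aggregated response is the training target."""
    return {"prompt": prompt, "response": r_star}


def make_dpo_record(
    prompt: str, responses: List[str], r_star: str, score: Scorer
) -> Dict[str, str]:
    """MoA-DPO: chosen = aggregated response, rejected = lowest-scored proposal."""
    r_worst = min(responses, key=lambda r: score(prompt, r))
    return {"prompt": prompt, "chosen": r_star, "rejected": r_worst}


if __name__ == "__main__":
    # Toy stand-ins for real LLM calls, for illustration only.
    proposers = [lambda p: "short", lambda p: "a somewhat longer answer"]
    aggregator = lambda text: "merged answer"
    toy_score = lambda prompt, response: float(len(response))

    responses, r_star = moa_generate("Explain DPO.", proposers, aggregator)
    print(make_sft_record("Explain DPO.", r_star))
    print(make_dpo_record("Explain DPO.", responses, r_star, toy_score))
```

In a self-improvement round (design 4), the fine-tuned target model would simply be appended to `proposers` before calling `moa_generate` again.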

Loss & Training

  • SFT phase: Standard next-token cross-entropy loss
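  • DPO phase: \(\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(r_w|x)}{\pi_{\text{ref}}(r_w|x)} - \beta \log \frac{\pi_\theta(r_l|x)}{\pi_{\text{ref}}(r_l|x)}\right)\), where \(x\) is the prompt, \(r_w\) and \(r_l\) are the chosen and rejected responses (here \(r^*\) and \(r_{\text{worst}}\)), \(\pi_\theta\) is the policy being trained, \(\pi_{\text{ref}}\) is the frozen reference policy, \(\sigma\) is the logistic sigmoid, and \(\beta\) scales the preference margin.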
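
To make the DPO objective concrete, the sketch below evaluates the per-example loss from four sequence log-probabilities. It is an illustrative calculation in plain Python; the function name `dpo_loss` and the default `beta=0.1` are assumptions for the example, not values reported in this summary.

```python
import math


def dpo_loss(
    logp_w: float,      # log pi_theta(r_w | x), chosen response under the policy
    logp_l: float,      # log pi_theta(r_l | x), rejected response under the policy
    ref_logp_w: float,  # log pi_ref(r_w | x)
    ref_logp_l: float,  # log pi_ref(r_l | x)
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifts probability toward the chosen response: loss goes down.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference, the margin is zero and the loss is \(\log 2\); the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.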

Key Experimental Results

Main Results

| Model → Benchmark | Metric | MoAA | GPT-4o Distillation | Self-Data | Baseline (No Alignment) |
|---|---|---|---|---|---|
| LLaMA-3.1-8B-Instruct → Arena-Hard | Win Rate | 48.3 | 42.1 | 31.5 | 19.5 |
| LLaMA-3.1-8B-Instruct → AlpacaEval2 | Win Rate | 57.23 | 49.8 | 35.4 | 22.33 |
| LLaMA-3.1-8B-Instruct → MT-Bench | Average Score | 8.12 | 7.85 | 7.21 | 6.58 |

Ablation Study

| Configuration | Arena-Hard WR | AlpacaEval2 WR | Description |
|---|---|---|---|
| MoAA (SFT + DPO) | 48.3 | 57.23 | Full Method |
| MoA-SFT Only | 39.7 | 45.6 | DPO contributes ~8-12 WR |
| MoA-DPO Only | 35.2 | 41.8 | SFT foundation is important |
| Single Model (GPT-4o) SFT | 42.1 | 49.8 | MoA outperforms single strong model |
| Single Model (LLaMA-70B) SFT | 33.8 | 38.2 | Single open-source model is insufficient |
| Self-Improvement (2 Rounds) | 51.2 | 60.1 | Positive feedback loop is effective |

Key Findings

  • MoAA improves the Arena-Hard Win Rate of LLaMA-3.1-8B from 19.5 to 48.3 (+28.8).
  • The full MoAA pipeline outperforms GPT-4o distillation (48.3 vs. 42.1 Arena-Hard WR, 57.23 vs. 49.8 AlpacaEval2 WR). MoA-SFT alone (39.7 / 45.6) does not, so the advantage over the single strong model depends on the DPO stage.
  • Self-improvement is viable: the second iteration round adds roughly 3 WR points on both benchmarks (48.3 → 51.2 on Arena-Hard, 57.23 → 60.1 on AlpacaEval2).
  • MoA-SFT and MoA-DPO are complementary; both are indispensable.
  • The more diverse the models participating in MoA (from different model families), the better the performance.
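
The gains quoted in these findings follow directly from the reported win rates. The short check below recomputes them; the dictionary keys and the `gain` helper are names made up for this example, while the numbers are copied from the tables above.

```python
from typing import Tuple

# Win rates from the results tables: (Arena-Hard, AlpacaEval2).
WIN_RATES = {
    "baseline": (19.5, 22.33),
    "moa_sft_only": (39.7, 45.6),
    "moaa_full": (48.3, 57.23),
    "self_improve_2_rounds": (51.2, 60.1),
}


def gain(after: str, before: str) -> Tuple[float, ...]:
    """Per-benchmark difference in win-rate points, rounded to 2 decimals."""
    pairs = zip(WIN_RATES[after], WIN_RATES[before])
    return tuple(round(a - b, 2) for a, b in pairs)


if __name__ == "__main__":
    comparisons = [
        ("moaa_full", "baseline"),               # effect of the full method
        ("moaa_full", "moa_sft_only"),           # contribution of the DPO stage
        ("self_improve_2_rounds", "moaa_full"),  # effect of a second round
    ]
    for after, before in comparisons:
        print(f"{before} -> {after}: {gain(after, before)}")
```

Run as written, this gives +28.8 / +34.9 for the full method over the baseline, +8.6 / +11.63 for adding DPO on top of MoA-SFT, and +2.9 / +2.87 for the second self-improvement round.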

Highlights & Insights

  • High Practical Value: Entirely based on open-source models, eliminating reliance on GPT-4.
  • Self-Improvement: Demonstrates the potential of the open-source LLM ecosystem to transcend individual capability upper bounds through collaboration.
  • Simple Methodology: The pipeline of MoA + SFT + DPO is straightforward and easy to replicate.

Limitations & Future Work

  • Whether self-improvement will encounter a "ceiling effect" (where the model ensemble cannot provide signals beyond itself) remains to be validated in the long run.
  • The computational cost of MoA scales linearly with the number of participating models—requiring 3-5 70B models to run simultaneously.
  • Harmlessness is not adequately evaluated; the reported benchmarks mainly measure helpfulness.
  • Quality control and filtering strategies for the data generated by MoAA are not discussed.

Related Work & Insights

  • Mixture of Agents (Wang et al., 2024): The original work on MoA.
  • Self-Play Fine-Tuning (Chen et al., 2024): Another self-improvement methodology.
  • This work demonstrates the feasibility of "collective intelligence > individual capability" in LLM alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of MoA and alignment is novel, though the individual components (MoA, SFT, DPO) are already known.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple top-tier benchmarks evaluated, comprehensive ablations, and validated self-improvement.
  • Writing Quality: ⭐⭐⭐⭐ The methodology description is clear and the experimental results are convincing.
  • Value: ⭐⭐⭐⭐⭐ Significantly drives the development of the open-source LLM ecosystem.