Improving Model Alignment through Collective Intelligence of Open-Source LLMs¶

Conference: ICML 2025
arXiv: 2505.03059
Code: Coming soon
Area: LLM Alignment
Keywords: model alignment, mixture of agents, synthetic data, preference optimization, self-improvement

TL;DR¶

This paper proposes Mixture of Agents Alignment (MoAA), which leverages the collective intelligence of multiple open-source LLMs to generate high-quality alignment data (SFT data and preference data). This significantly improves the performance of the target model on Arena-Hard and AlpacaEval2, demonstrating self-improvement capabilities without external strong supervision.

Background & Motivation¶

Background: LLM alignment—making model outputs helpful and harmless—relies on high-quality human-annotated data for supervised fine-tuning (SFT) and preference optimization (DPO/RLHF).

Limitations of Prior Work: Human-annotated data is expensive, difficult to produce at scale, and may suffer from insufficient diversity and annotator bias. Existing synthetic data methods (such as generating alignment data using GPT-4) rely on a single strong model, which limits the diversity of generated data and creates a dependency on closed-source models.

Key Challenge: How to scale up the size and diversity of alignment data while reducing reliance on a single strong model? The individual capability of open-source models may not match GPT-4, but can their collective intelligence bridge this gap?

Goal: To leverage the collaboration of multiple open-source LLMs to generate high-quality alignment data.

Key Insight: The Mixture of Agents (MoA) concept—multiple LLMs generate responses individually, and then an aggregator synthesizes the merits of different responses to produce a final response superior to that of any single model.

Core Idea: Applying the MoA framework to two stages of alignment data generation: (1) SFT data generation—multi-model collaboration to produce high-quality instruction-response pairs; (2) Preference data generation—utilizing the output differences of multiple models to naturally construct positive and negative sample pairs.

Method¶

Overall Architecture¶

Input: A set of open-source LLMs (e.g., LLaMA-3.1-70B, Qwen2-72B, Mixtral-8x22B, etc.), target alignment model (e.g., LLaMA-3.1-8B-Instruct)
Output: Target model with significantly improved alignment performance

Pipeline: 1. MoA-SFT: Generate SFT training data through multi-model collaboration → Fine-tune the target model 2. MoA-DPO: Construct preference pairs using multi-model outputs → Preference optimization

Key Designs¶

MoA Response Generation:
- Function: For each instruction prompt, multiple LLMs generate individual responses, which are then synthesized by an aggregator model into a final response.
- Mechanism: Layer 1—\(K\) LLMs generate individual responses \(\{r_1, \ldots, r_K\}\); Layer 2—the aggregator model receives all responses and the original prompt to generate the synthesized response \(r^*\).
- Design Motivation: Different models possess different knowledge and "personalities" (e.g., some excel in reasoning while others in writing), and MoA can integrate these complementary advantages.
MoA-SFT Data Construction:
- Function: Use high-quality responses generated by MoA as SFT training targets.
- Mechanism: \((prompt, r^*_{\text{MoA}})\) is used as the training pair. Since the MoA response quality exceeds that of any single model, the fine-tuned model can surpass its training data sources.
- Design Motivation: Substitute GPT-4 annotation while providing higher diversity.
MoA-DPO Preference Data Construction:
- Function: Utilize quality differences in multi-model outputs to construct preference pairs.
- Mechanism: The synthesized MoA response \(r^*\) is used as the "chosen" (positive sample), while the worst performing individual model response \(r_{\text{worst}}\) is used as the "rejected" (negative sample).
- Design Motivation: Eliminates the need for external evaluators (such as humans or GPT-4); preference signals are derived from internal comparisons within the model ensemble.
Self-Improvement Pipeline:
- Function: Use the model fine-tuned by MoAA as a participant in the next round of MoA.
- Mechanism: Iteration \(t\): Engage the current model in MoA → Generate better training data → Fine-tune → Engage the new model in MoA again.
- Design Motivation: This forms a positive feedback loop: model capability improvement → better data generation → further improvement.

Loss & Training¶

SFT phase: Standard next-token cross-entropy loss
DPO phase: \(\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(r_w|x)}{\pi_{\text{ref}}(r_w|x)} - \beta \log \frac{\pi_\theta(r_l|x)}{\pi_{\text{ref}}(r_l|x)}\right)\)

Key Experimental Results¶

Main Results¶

Model	Metric	MoAA	GPT-4o Distillation	Self-Data	Baseline (No Alignment)
LLaMA-3.1-8B-Instruct → Arena-Hard	Win Rate	48.3	42.1	31.5	19.5
LLaMA-3.1-8B-Instruct → AlpacaEval2	Win Rate	57.23	49.8	35.4	22.33
LLaMA-3.1-8B-Instruct → MT-Bench	Average Score	8.12	7.85	7.21	6.58

Ablation Study¶

Configuration	Arena-Hard WR	AlpacaEval2 WR	Description
MoAA (SFT + DPO)	48.3	57.23	Full Method
MoA-SFT Only	39.7	45.6	DPO contributes ~8-12 WR
MoA-DPO Only	35.2	41.8	SFT foundation is important
Single Model (GPT-4o) SFT	42.1	49.8	MoA outperforms single strong model
Single Model (LLaMA-70B) SFT	33.8	38.2	Single open-source model is insufficient
Self-Improvement (2 Rounds)	51.2	60.1	Positive feedback loop is effective

Key Findings¶

MoAA improves the Arena-Hard Win Rate of LLaMA-3.1-8B from 19.5 to 48.3 (+28.8).
The quality of data generated by the collaboration of multiple open-source models exceeds that of data generated individually by GPT-4o.
Self-improvement is viable—the second iteration round further improves performance by 3-4 WR.
MoA-SFT and MoA-DPO are complementary; both are indispensable.
The more diverse the models participating in MoA (from different model families), the better the performance.

Highlights & Insights¶

High Practical Value: Entirely based on open-source models, eliminating reliance on GPT-4.
Self-Improvement: Demonstrates the potential of the open-source LLM ecosystem to transcend individual capability upper bounds through collaboration.
Simple Methodology: The pipeline of MoA + SFT + DPO is straightforward and easy to replicate.

Limitations & Future Work¶

Whether self-improvement will encounter a "ceiling effect" (where the model ensemble cannot provide signals beyond itself) remains to be validated in the long run.
The computational cost of MoA scales linearly with the number of participating models—requiring 3-5 70B models to run simultaneously.
The evaluation regarding harmlessness is insufficient.
Quality control and filtering strategies for the data generated by MoAA are not discussed.

Mixture of Agents (Wang et al., 2024): The original work on MoA.
Self-Play Fine-Tuning (Chen et al., 2024): Another self-improvement methodology.
This work demonstrates the feasibility of "collective intelligence > individual capability" in LLM alignment.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of MoA and alignment is novel, though the individual components (MoA, SFT, DPO) are already known.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple top-tier benchmarks evaluated, comprehensive ablations, and validated self-improvement.
Writing Quality: ⭐⭐⭐⭐ The methodology description is clear and the experimental results are convincing.
Value: ⭐⭐⭐⭐⭐ Significantly drives the development of the open-source LLM ecosystem.