MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Conference: ACL 2025 (Findings)
arXiv: 2504.12563
Code: None
Area: Agent / Synthetic Data Generation
Keywords: Meta-prompting, synthetic data diversity, domain adaptation, multi-agent collaboration, continual pre-training

TL;DR

This paper proposes MetaSynth, a meta-prompting-driven multi-agent framework for generating highly diverse synthetic data. With only 25M tokens of synthetic data and no real data mixed in, it adapts Mistral-7B to the finance and biomedicine domains, raising the average domain score by \(4.08\%\) and \(13.75\%\) respectively (relative to the base model), while general capabilities degrade by at most about \(1\%\).

Background & Motivation

Background: Synthetic data is a crucial resource for current LLM training (e.g., Phi-3.5 and Phi-4 rely heavily on synthetic data), but it often lacks diversity: generated texts tend to repeat the same sentence structures, vocabulary choices, and semantic content.

Limitations of Prior Work: Traditional synthetic data generation methods rely on template prompting, which introduces variations through placeholders that are only superficial. For example, when generating financial texts, clichés like "In today's ever-changing financial landscape" and generic slogans repeatedly appear. Even when incorporating in-context examples from previously generated and real data into templates, the improvement in diversity remains limited.

Key Challenge: The core value of synthetic data lies in compensating for scarce real data, especially in domain-specific scenarios. Low diversity limits that benefit on downstream tasks and can even be harmful, since heavily duplicated data may push the model toward "model collapse."

Goal: To propose a systematic methodology that drastically improves the diversity of synthetic data generated by LLMs, enabling effective domain adaptation using only a small amount of highly diverse synthetic data.

Key Insight: Meta-prompting lets the LLM write its own prompts to solve a problem, which can elicit more diverse and creative outputs. MetaSynth extends this idea into a multi-agent system in which a Meta-LLM orchestrates multiple expert agents that collaborate on data generation.

Core Idea: Let the Meta-LLM act as an orchestrator that dynamically selects and guides different expert agents (e.g., keyword experts, domain experts, summarization experts, content analysis experts) to iteratively generate diverse synthetic documents, while ensuring that each new document is sufficiently distinct from existing ones through conditional instance generation.

Method

Overall Architecture

At the core of MetaSynth are two mechanisms: (1) Meta-Prompting — where the Meta-LLM decomposes the data generation task into subtasks, dynamically selects expert agents, and generates instructions; (2) Conditional Instance Generation — where the Meta-LLM retains a memory (summaries of all previously generated instances) to ensure that each new instance satisfies two constraints: "conforms to the current seed set" and "is sufficiently distinct from all existing instances."

Key Designs

Meta-LLM Orchestrated Multi-Agent System:
- Function: Generates diverse data through the collaboration of multiple specialized agents.
- Mechanism: The Meta-LLM (Claude 3 Sonnet), acting as a centralized orchestrator, maintains a complete message history (memory). At each generation round, the Meta-LLM first invokes a "seed keyword extraction expert" to obtain representative keywords. It then designates a "domain expert" (e.g., a financier or venture capitalist) to draft the document based on the keywords, a "summarization expert" to compress the document for subsequent comparisons, and a "content analysis expert" to evaluate the differences between the new document and existing ones, offering diversity improvement suggestions. Agents can only see a portion of the information selectively shared by the Meta-LLM (the "fresh eyes" strategy), which helps introduce novel perspectives.
- Design Motivation: Regardless of variations, a single prompt template remains constrained by a fixed pattern. Allowing the LLM to autonomously write prompts and dynamically select experts fundamentally breaks the limitations of templates.
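
The orchestration round described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `MetaOrchestrator` class, the `LLM` callable type, and the prompt wording are all hypothetical, and the models are passed in as plain functions so the sketch runs without any API access. The two points it tries to show are that the Meta-LLM writes each expert's instruction itself, and that experts receive only the shared snippet rather than the full history ("fresh eyes").

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


@dataclass
class MetaOrchestrator:
    meta_llm: LLM    # sees the full history and writes instructions
    expert_llm: LLM  # sees only what the orchestrator chooses to share
    history: List[str] = field(default_factory=list)

    def call_expert(self, role: str, goal: str, shared: str) -> str:
        # Meta-prompting step: the orchestrator writes the expert's instruction.
        context = "\n".join(self.history)
        instruction = self.meta_llm(
            f"History:\n{context}\n\nWrite an instruction for a {role} "
            f"whose goal is: {goal}"
        )
        # "Fresh eyes": the expert prompt holds the instruction and the shared
        # snippet only, never the full message history.
        reply = self.expert_llm(f"Role: {role}\n{instruction}\n---\n{shared}")
        self.history.append(f"[{role}] {reply}")
        return reply

    def generate_round(self, seed_text: str,
                       prior_summaries: List[str]) -> Dict[str, str]:
        keywords = self.call_expert("seed keyword extraction expert",
                                    "extract representative keywords", seed_text)
        draft = self.call_expert("domain expert",
                                 "draft a document from the keywords", keywords)
        summary = self.call_expert("summarization expert",
                                   "compress the document for later comparison",
                                   draft)
        comparison = f"NEW:\n{summary}\nPRIOR:\n" + "\n".join(prior_summaries)
        feedback = self.call_expert(
            "content analysis expert",
            "assess differences and suggest diversity improvements", comparison)
        return {"keywords": keywords, "draft": draft,
                "summary": summary, "feedback": feedback}


# Stub models stand in for real LLM calls.
def stub_meta(prompt: str) -> str:
    return "Follow your role carefully."


def stub_expert(prompt: str) -> str:
    return f"output for: {prompt.splitlines()[0]}"


result = MetaOrchestrator(stub_meta, stub_expert).generate_round(
    "bond yields rose", ["earlier summary"])
print(result["feedback"])
```

In the described system the expert roster is also chosen dynamically by the Meta-LLM; the fixed four-step order here is a simplification for readability.
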

Conditional Instance Generation and Dynamic Seed Expansion:
- Function: Ensures that each instance in the generated dataset is sufficiently unique.
- Mechanism: Maintains a continuously expanding instance taxonomy that tracks and classifies all generated instances. Each new instance must satisfy two conditions: (a) conforms to the current seed keyword set; (b) is sufficiently distinct from all prior instances. After generating every \(M\) documents, an adaptive \(k\)-NN is used to retrieve new seeds from the domain corpus, requiring the topic labels of the new seeds (assigned by a topic label expert) to differ from those of the last \(M\) documents. The document length is constrained to 400 words.
- Design Motivation: Unconstrained generation quickly degenerates into repetitive outputs. Conditional generation combined with dynamic seed expansion forms a diversity guarantee mechanism.
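
A rough sketch of the two acceptance conditions and the seed expansion step is given below. Several parts are assumptions made only for illustration: Jaccard overlap on lowercase word sets stands in for the Meta-LLM's own judgment of distinctness, a plain nearest-neighbour ranking stands in for the "adaptive \(k\)-NN" (whose details are not given here), the 400-word constraint is read as an upper bound, and the names `accept`, `expand_seeds`, and `max_sim` are hypothetical.

```python
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def accept(candidate: str, seed_keywords: Set[str], prior: List[str],
           max_sim: float = 0.5, max_words: int = 400) -> bool:
    """Check a new instance against the two conditions plus the length cap.

    seed_keywords are expected in lowercase.
    """
    on_seed = bool(tokens(candidate) & seed_keywords)                 # condition (a)
    distinct = all(jaccard(candidate, p) <= max_sim for p in prior)   # condition (b)
    short_enough = len(candidate.split()) <= max_words
    return on_seed and distinct and short_enough


def expand_seeds(query: str, corpus: List[Tuple[str, str]],
                 recent_topics: Set[str], k: int = 2) -> List[str]:
    """Return the k corpus texts nearest to the query whose topic label
    differs from the topics of the most recent documents.

    corpus holds (text, topic_label) pairs.
    """
    fresh = [(text, label) for text, label in corpus
             if label not in recent_topics]
    fresh.sort(key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return [text for text, _ in fresh[:k]]


prior_docs = ["central banks raise interest rates to fight inflation"]
print(accept("venture capital funds back early biotech startups",
             {"venture", "capital"}, prior_docs))
```

In a generation loop, `expand_seeds` would be called once every \(M\) accepted documents, with `recent_topics` holding the topic labels of those last \(M\) documents.
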

MetaSynth-Instruct: Synthetic Instruction Evolution:
- Function: Further derives complex instruction-response pairs from synthetic documents for instruction pre-training.
- Mechanism: Based on the documents generated by MetaSynth, another meta-prompting workflow is used to generate and iteratively evolve instructions. Pre-defined agents include "document transformation experts", "role suggestion experts", "complexity experts", and "topic editing experts". The Meta-LLM autonomously decides the evolution direction (e.g., formatting changes, complexity increments, or introducing new perspectives), achieving a full pipeline from synthetic documents to synthetic instructions without any real data involvement.
- Design Motivation: Instruction data is crucial for LLM adaptation, but human-written instructions are costly and limited in coverage. Automatically evolving instructions can systematically cover a wider range of scenarios.
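
The evolution loop can be pictured roughly as below. This is a sketch under stated assumptions, not the paper's workflow: the function `evolve_instruction`, the prompt strings, the fixed number of rounds, and the fallback to the complexity expert are all hypothetical. Only the four expert names come from the description above. The Meta-LLM is modelled as a function that names which expert should act next.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

EXPERTS = [
    "document transformation expert",
    "role suggestion expert",
    "complexity expert",
    "topic editing expert",
]


def evolve_instruction(document: str, meta_llm: LLM, expert_llm: LLM,
                       rounds: int = 3) -> Tuple[str, List[str]]:
    """Derive an instruction from a document, then evolve it step by step.

    Returns the final instruction and the sequence of experts that were used.
    """
    instruction = expert_llm(
        f"Role: {EXPERTS[0]}\n"
        f"Turn this document into an instruction-response task:\n{document}"
    )
    trace = [EXPERTS[0]]
    for _ in range(rounds):
        # The Meta-LLM decides the evolution direction by naming an expert.
        choice = meta_llm(
            "Pick one expert to evolve the instruction next.\n"
            f"Options: {', '.join(EXPERTS)}\nInstruction: {instruction}"
        ).strip()
        if choice not in EXPERTS:
            choice = "complexity expert"  # assumed fallback for unparseable replies
        instruction = expert_llm(
            f"Role: {choice}\nRewrite and improve:\n{instruction}")
        trace.append(choice)
    return instruction, trace


# Stub models so the sketch runs offline.
final, used = evolve_instruction(
    "A short synthetic finance document.",
    meta_llm=lambda prompt: "topic editing expert",
    expert_llm=lambda prompt: f"instruction derived via {prompt.splitlines()[0]}",
    rounds=2,
)
print(used)
```
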

Loss & Training

MetaSynth itself is a data generation method and does not involve model training. The generated data is used for Continual Pre-training (CPT) on Mistral-7B-v0.3, where the loss is computed over all tokens rather than just the response part.
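
The difference between computing the loss over all tokens and over the response only comes down to the loss mask. The toy sketch below uses made-up per-token negative log-likelihood values and hypothetical helper names (`loss_mask`, `mean_nll`); it is meant only to show how the two choices average over different positions.

```python
from typing import List


def loss_mask(n_prompt: int, n_response: int, all_tokens: bool) -> List[int]:
    """1 marks a position that contributes to the loss, 0 one that is ignored."""
    return [1 if all_tokens else 0] * n_prompt + [1] * n_response


def mean_nll(token_nll: List[float], mask: List[int]) -> float:
    """Average the per-token loss over the unmasked positions only."""
    total = sum(loss * m for loss, m in zip(token_nll, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Illustrative values: 3 instruction tokens followed by 2 response tokens.
nll = [2.0, 1.0, 3.0, 0.5, 0.5]
print(mean_nll(nll, loss_mask(3, 2, all_tokens=True)))   # 1.4
print(mean_nll(nll, loss_mask(3, 2, all_tokens=False)))  # 0.5
```

With the all-token setting used here, the instruction text itself also provides training signal, which is what makes the instruction data usable for pre-training rather than only for fine-tuning.
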

Key Experimental Results

Main Results (Mistral-7B, 25M tokens CPT)

| Setting | Finance Avg | Biomedicine Avg |
|---|---|---|
| Base (No CPT) | 63.40 | 52.94 |
| Real + Template Synth (1:1) | 63.23 | 50.48 |
| Real Only | 64.31 | 50.19 |
| Real + MetaSynth Docs (1:1) | 65.18 | 54.95 |
| MetaSynth Docs + Instruct | 65.99 | 60.22 |
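
The headline gains of \(4.08\%\) and \(13.75\%\) are relative improvements of the MetaSynth Docs + Instruct row over the base model, which can be checked directly from the table. The helper name `relative_gain` is just for illustration. Because the table averages are already rounded, the finance figure recomputed this way comes out at roughly \(4.085\%\), which agrees with the reported \(4.08\%\) only up to rounding.

```python
def relative_gain(base: float, adapted: float) -> float:
    """Relative improvement over the base score, in percent."""
    return (adapted - base) / base * 100.0


# (base, MetaSynth Docs + Instruct) averages taken from the table above.
scores = {
    "Finance": (63.40, 65.99),
    "Biomedicine": (52.94, 60.22),
}

for domain, (base, adapted) in scores.items():
    print(f"{domain}: {relative_gain(base, adapted):.3f}%")
```
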

Comparison of Diversity Metrics

| Data Source | Task2Vec Coefficient \(\uparrow\) | Compression Ratio \(\downarrow\) | 1-Gram Diversity \(\uparrow\) |
|---|---|---|---|
| Template Prompting | 0.1576 | 3.6674 | 0.0198 |
| MetaSynth (Keyword Seeds) | 0.1757 (\(+11.5\%\)) | 3.4443 | 0.0345 (\(+74\%\)) |
| MetaSynth (Doc Seeds) | 0.1788 (\(+13.5\%\)) | 3.1495 | 0.0390 (\(+97\%\)) |
| Common Crawl Real Data | 0.2120 (\(+34.5\%\)) | 2.7380 | 0.0621 (\(+214\%\)) |

Percentages in parentheses are relative to the Template Prompting row.
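
To give a feel for two of the metrics in the table, the sketch below computes a compression ratio and an n-gram diversity score on toy corpora. It assumes the commonly used definitions (raw size divided by gzip-compressed size, and unique n-grams divided by total n-grams) together with whitespace tokenization. The exact tokenizer and settings behind the reported numbers are not specified here, so absolute values will not match the table, and Task2Vec is not covered.

```python
import gzip
from typing import List


def compression_ratio(texts: List[str]) -> float:
    """Raw bytes over gzip-compressed bytes; higher means more redundancy."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


def ngram_diversity(texts: List[str], n: int = 1) -> float:
    """Unique n-grams over total n-grams; higher means more varied wording."""
    grams = []
    for text in texts:
        toks = text.lower().split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0


# A templated corpus repeats one sentence; a varied corpus changes its wording.
templated = ["In today's ever-changing financial landscape, investors must adapt."] * 20
varied = [f"topic{i} report{i} covers item{i}" for i in range(20)]

for name, corpus in [("templated", templated), ("varied", varied)]:
    print(name, compression_ratio(corpus), ngram_diversity(corpus, n=1))
```

The templated corpus should compress far better and show a much lower 1-gram diversity, which is the same direction of difference the table reports between template prompting and MetaSynth.
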

Key Findings

- Template-prompted synthetic data slightly hurts performance when mixed with real data: the Finance average falls to 63.23, below both the base model (63.40) and real data alone (64.31), indicating that low-diversity synthetic data can be detrimental.
- Purely synthetic MetaSynth data (the MetaSynth Docs + Instruct setting, with no real data mixed in) improves the model, with relative gains over the base model of \(+4.08\%\) in finance and \(+13.75\%\) in biomedicine. This supports the claim that "a small amount of highly diverse synthetic data outperforms a massive amount of low-quality data."
- The biomedicine domain yields larger gains, likely because it requires more specialized terminology and niche knowledge that base models cover poorly.
- General evaluations (e.g., MMLU) show that CPT with MetaSynth barely harms general capabilities, with the largest degradation being only about \(1\%\).
- MetaSynth outperforms template prompting on all 7 automated diversity metrics.

Highlights & Insights

- The core argument that "Diversity is Quality" is convincing: through systematic diversity measurement and downstream task validation, this work establishes a clear link between synthetic data diversity and downstream effectiveness, an important methodological contribution to the synthetic data community.
- The purely synthetic domain adaptation result challenges the common assumption that real data must be mixed in: in these experiments, 25M tokens of sufficiently diverse synthetic data were enough to adapt the model without any real data.
- Conditional instance generation combined with seed expansion acts as a general diversity assurance mechanism that should transfer to other data generation scenarios.

Limitations & Future Work

- Using Claude 3 Sonnet as the Meta-LLM incurs non-trivial generation costs, since each document requires multiple rounds of agent interaction and therefore a large number of API calls.
- The method has only been validated in the finance and biomedicine domains; broader areas (e.g., law, code) require further validation.
- Although the diversity metrics are comprehensive (7 types), their correlation with human judgments of "diversity" deserves more validation.
- The document length is constrained to 400 words, which might hinder the generation of long-form domain content that requires extensive elaboration.
- Future work could integrate conditional instance generation with curriculum learning to generate data of increasing difficulty.

vs AgentInstruct (Mitra et al., 2024): AgentInstruct also uses agents to generate data, but its instruction evolution path is fixed. In contrast, the evolution of MetaSynth is autonomously determined by the Meta-LLM, providing superior flexibility.
vs Self-Prompting (Li et al., 2024): Self-Prompting uses template variants with limited diversity, whereas MetaSynth fundamentally breaks template boundaries via multi-agent collaboration.
This paper provides a comprehensive metrics framework for synthetic data quality evaluation, and the combination of 7 diversity metrics serves as a valuable reference for future work.

Rating

Novelty: ⭐⭐⭐⭐ The meta-prompting + multi-agent framework for synthetic data is creative, but its core lies in combining existing technologies.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely systematic evaluation with 9 training configurations, 10 task datasets, and 7 diversity metrics.
Writing Quality: ⭐⭐⭐⭐ The 33-page paper is highly detailed, but the appendix is excessively long (17 figures).
Value: ⭐⭐⭐⭐ Substantial contribution to the methodology of synthetic data generation, and the thesis "diversity is quality" is highly inspiring.