AIDE: Attribute-Guided Multi-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning

Conference: ACL 2025
arXiv: 2412.06136
Code: GitHub
Area: Data Synthesis / Instruction Tuning
Keywords: Data Expansion, Multi-Hop Synthesis, Attribute Guidance, Persona, Residual Connection

TL;DR

This paper proposes the AIDE framework, which generates around 3K high-quality task-specific training data points from only 10 seed samples through a multi-hop data expansion mechanism of "attribute guidance + Persona enhancement + residual connections." Fine-tuning Mistral-7B on this data outperforms human-annotated data fine-tuning by an average of 6% and SOTA methods like Evol-Instruct by 30% under zero-shot settings.

Background & Motivation

Background: Task-specific LLM fine-tuning requires diverse, high-quality training data, but acquisition costs are high. Existing data synthesis methods either rely on a large volume of seed data (e.g., Prompt2Model, DataTune) or generate data lacking task relevance and diversity.

Limitations of Prior Work: (a) Open-domain methods like Evol-Instruct generate data that lacks task specificity; (b) Task-specific methods like Prompt2Model depend on extensive candidate datasets; (c) Simple paraphrasing methods struggle to balance diversity and relevance.

Key Challenge: With only a very small number of seed samples (e.g., 10), how can one generate a sufficient quantity of highly diverse and task-relevant training data?

Key Insight: The paper treats data expansion as a multi-hop traversal on a graph: starting from the seed data, knowledge-attribute triples guide the synthesis direction of each hop.

Method

Overall Architecture

Given seed data \(D_{seed} = \{(X_i, Y_i)\}_{i=1}^n\) (\(n \approx 10\)), AIDE generates large-scale training data in four steps: (1) extracting knowledge triples using an LLM Extractor; (2) recursively generating data along the triple paths via multi-hop synthesis; (3) enhancing diversity through a Persona Hub; and (4) preventing semantic drift with residual connections.

Key Designs

Attribute-Guided Multi-Hop Synthesis:
- Function: Extracts knowledge triples \(\langle t, r, a \rangle\) (topic, relation, attribute) from seed data, and recursively synthesizes new data along the triple paths.
- Mechanism: For a seed \(X_i^{(0)}\), the LLM extracts its topic and key attributes. Each triple defines a synthesis path. Combining task demonstrations \(\mathcal{D}_T\) and a predefined operation \(Op\) (adding constraints, reasoning, or concretization), a new sample is generated as \(X^{(K)} = \text{LLM}(X^{(K-1)}, \langle t,r,a \rangle^{(K-1)}, Op, \mathcal{D}_T)\). The total data size is \(m = n(m_1 + m_2 + ... + m_K)\).
- Design Motivation: Triples serve as "control nodes" on the synthesis path, ensuring that the generated data expands in a meaningful semantic direction rather than drifting randomly.
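
The recursion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `extract` and `synthesize` are hypothetical stand-ins for the LLM Extractor and the LLM synthesizer, the task demonstrations \(\mathcal{D}_T\) are assumed to be folded into `synthesize`, and the rule "one child per triple" is an assumption made to keep the tree small.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (topic, relation, attribute)


@dataclass
class Sample:
    text: str
    hop: int  # hop at which the sample was synthesized (1-based)


def multi_hop_expand(
    seed: str,
    extract: Callable[[str], List[Triple]],
    synthesize: Callable[[str, Triple, str], str],
    operations: List[str],
    max_hops: int,
) -> List[Sample]:
    """Expand one seed for `max_hops` hops; every triple of a parent yields one child."""
    frontier = [seed]
    generated: List[Sample] = []
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for parent in frontier:
            for i, triple in enumerate(extract(parent)):
                op = operations[i % len(operations)]  # rotate through the Op choices
                child = synthesize(parent, triple, op)
                generated.append(Sample(child, hop))
                next_frontier.append(child)
        frontier = next_frontier
    return generated


# Hypothetical, deterministic stand-ins for the two LLM calls.
def toy_extract(text: str) -> List[Triple]:
    return [("time", "involves", "time zones"), ("time", "involves", "durations")]


def toy_synthesize(parent: str, triple: Triple, op: str) -> str:
    return f"{parent} | {op}: {triple[2]}"


samples = multi_hop_expand(
    "What time is it 3 hours after 10am?",
    toy_extract,
    toy_synthesize,
    ["add constraint", "add reasoning", "concretize"],
    max_hops=2,
)
per_hop = [sum(s.hop == k for s in samples) for k in (1, 2)]
print(per_hop, len(samples))  # [2, 4] 6
```

With two triples per sample, one seed yields \(m_1 = 2\) samples at hop 1 and \(m_2 = 4\) at hop 2, so the total for \(n\) seeds is \(n(m_1 + m_2) = 6n\), matching the size formula above.
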
Persona-Guided Diversity Enhancement:
- Function: Uses the seed data's topic embeddings to retrieve top-\(P\) relevant persona descriptions from a Persona Hub, introducing diverse perspectives into the synthesis.
- Mechanism: \(X^{(K)} = \text{LLM}(X^{(K-1)}, t, p_i, Op, \mathcal{D}_T)\), where \(p_i\) is a persona description such as "an adventurous elderly person with experience living at high altitudes."
- Design Motivation: LLMs tend to generate similar content under identical prompts; personas inject different backgrounds and viewpoints to increase diversity.
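
The retrieval step can be illustrated as a top-\(P\) nearest-neighbor lookup. The sketch below assumes cosine similarity over embeddings, which the summary does not specify; the three-dimensional vectors and the two extra persona strings are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_p_personas(
    topic_vec: Sequence[float],
    persona_vecs: Dict[str, Sequence[float]],
    p: int,
) -> List[str]:
    """Return the p persona descriptions most similar to the topic embedding."""
    ranked = sorted(
        persona_vecs,
        key=lambda name: cosine(topic_vec, persona_vecs[name]),
        reverse=True,
    )
    return ranked[:p]


# Toy embeddings (assumed values, not from a real encoder).
personas = {
    "a mountaineering guide": [0.9, 0.1, 0.0],
    "a tax accountant": [0.0, 0.2, 0.9],
    "an adventurous elderly person with experience living at high altitudes": [0.8, 0.3, 0.1],
}
topic = [1.0, 0.2, 0.0]  # embedding of the seed's topic

chosen = top_p_personas(topic, personas, p=2)
for persona in chosen:
    print(persona)
```

Each retrieved description would then be passed as \(p_i\) in the synthesis prompt, so the same topic is rewritten from several viewpoints.
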
Residual Connection Mechanism:
- Function: Passes the original seed data \(X^{(0)}\) to the LLM as an additional input when synthesizing at depths \(d \leq L\).
- Mechanism: As the synthesis depth increases (e.g., to 10 hops), the generated content gradually drifts away from the task topic. The residual connection "anchors" the synthesis direction back to the original seed.
- Design Motivation: Experiments demonstrate that 10-hop synthesis without a residual connection introduces completely irrelevant content, whereas adding a residual connection maintains topical relevance.
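
One way to picture the residual connection is as a prompt-construction rule. The sketch below is an assumption about how the seed might be injected: the prompt wording, the function name `build_prompt`, and the parameter `residual_depth` (standing in for \(L\)) are hypothetical.

```python
def build_prompt(prev_sample: str, seed: str, depth: int, residual_depth: int) -> str:
    """Build a synthesis prompt; include the seed while depth <= residual_depth."""
    parts = [f"Previous sample: {prev_sample}"]
    if depth <= residual_depth:
        # Residual connection: the original seed X^(0) is passed alongside X^(K-1).
        parts.append(f"Original seed (stay on this task): {seed}")
    parts.append("Write a new sample for the same task.")
    return "\n".join(parts)


seed = "What time is it 3 hours after 10am?"
with_residual = build_prompt("A hop-9 sample", seed, depth=10, residual_depth=10)
without_residual = build_prompt("A hop-9 sample", seed, depth=10, residual_depth=0)
print(with_residual)
```

Setting `residual_depth=0` reproduces the no-residual ablation, where the model only ever sees the previous hop's output.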

Loss & Training

Claude Sonnet 3.5 is utilized as the LLM synthesizer during the synthesis stage.
Self-Reflection filtering: The LLM scores the synthesized data (1-10), keeping samples with scores above 5.
Fine-tuning employs LoRA (\(r=8\), \(\alpha=16\)) with a learning rate of 5e-5 for 10 epochs, selecting the checkpoint with the lowest validation loss.
The default setting is \(K=2\) (2-hop), which generates approximately 3K samples from 10 seeds.
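
The self-reflection filter reduces to a threshold on an LLM-assigned score. The sketch below assumes a strict comparison ("above 5" read as score > 5) and uses a lookup table as a hypothetical stand-in for the LLM scorer.

```python
from typing import Callable, List


def self_reflection_filter(
    samples: List[str],
    score_fn: Callable[[str], int],
    threshold: int = 5,
) -> List[str]:
    """Keep samples whose 1-10 score is strictly above the threshold."""
    return [s for s in samples if score_fn(s) > threshold]


# Hypothetical scores; in AIDE these would come from prompting the LLM.
toy_scores = {"sample A": 9, "sample B": 5, "sample C": 2, "sample D": 6}

kept = self_reflection_filter(list(toy_scores), toy_scores.__getitem__)
print(kept)  # ['sample A', 'sample D']
```

As a rough check on scale, the stated default of about 3K samples from 10 seeds at \(K=2\) corresponds to roughly 300 retained samples per seed across the two hops.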

Key Experimental Results

Main Results

AIDE vs. Human Annotation vs. SOTA Methods (Mistral-7B, zero-shot):

Benchmark	AIDE	Human Annotation	Evol-Instruct	DataTune	Prompt2Model
BIG-Bench Avg (5 tasks)	74.2%	-	54.2%	35.2%	36.1%
MMLU Bio	75.5%	73.2%	-	-	-
TruthfulQA	69.2%	49.9%	-	-	-
MedQA	44.0%	37.0%	-	-	-
ARC-Challenge	74.7%	79.4%	-	-	-

Average relative improvement of AIDE fine-tuning vs. human data fine-tuning: Mistral-7B +7.0%, Llama-3.1-8B +0.7%, Llama-3.2-3B +1.5%.

Ablation Study

Contribution of each component (BIG-Bench Time task, Mistral-7B):

Attribute	Persona	Residual Connection	Accuracy
✓	✗	✗	60.1%
✗	✓	✗	49.3%
✓	✓	✗	72.2%
✓	✗	✓	75.0%
✓	✓	✓	90.3%

Synthesized data diversity (Self-BLEU↓):

Task	AIDE	Human Data
Code	0.59	0.50
CS(MMLU)	0.66	0.24
TruthfulQA	0.67	0.20
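
For readers unfamiliar with the metric, Self-BLEU scores each text against all the others as references and averages the results, so lower values indicate a more diverse set. The sketch below is a simplified variant (whitespace tokenization, n-grams up to 2, no smoothing) and is not the exact configuration used in the paper.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp: List[str], refs: List[List[str]], max_n: int = 2) -> float:
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        if not hyp_counts:
            return 0.0
        max_ref: Counter = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_prec += math.log(clipped / sum(hyp_counts.values())) / max_n
    # Brevity penalty against the reference whose length is closest to the hypothesis.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(log_prec)


def self_bleu(texts: List[str], max_n: int = 2) -> float:
    """Average BLEU of each text against all other texts in the set."""
    toks = [t.lower().split() for t in texts]
    scores = [bleu(h, toks[:i] + toks[i + 1:], max_n) for i, h in enumerate(toks)]
    return sum(scores) / len(scores)


repetitive = ["the cat sat on the mat", "the cat sat on the rug", "the cat sat on the bed"]
varied = ["the cat sat on the mat", "stocks fell sharply on monday", "she tuned the old violin"]
print(self_bleu(repetitive) > self_bleu(varied))  # True
```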

Key Findings

The primary advantage of AIDE lies in its task specificity: 3K samples generated from 10 seeds outperform the 250K general-domain data generated by Evol-Instruct.
The residual connection is a critical component: adding it on top of attribute guidance and personas raises accuracy from 72.2% to 90.3% in the ablation.
Using GPT-3.5-Turbo as the synthesizer (more cost-effective) achieves comparable or even better results than Claude Sonnet 3.5.
By Self-BLEU (lower means more diverse), the synthesized data is less diverse than human-annotated data on all three tasks in the diversity table: the gap is small on Code (0.59 vs. 0.50) but large on CS(MMLU) and TruthfulQA.

Highlights & Insights

Minimal Seed Required: Generating high-quality task training data from just 10 seed samples greatly lowers the data acquisition threshold.
Multi-Hop Tree Expansion: Draws on the concept of knowledge graph traversal for data synthesis, making it structured and controllable.
Clever Adaptation of Residual Connections: Successfully grafts the concept of residual connections from deep learning onto text synthesis to solve semantic drift.
Quality Control via Self-Reflection: Integrates post-synthesis scoring and filtering to ensure final data quality.

Limitations & Future Work

The LLM synthesizer might introduce bias and harmful content.
Limited effectiveness on mathematical reasoning tasks (only ~21% in zero-shot), which could be addressed by incorporating Chain-of-Thought (CoT).
Reliance on external LLM (Claude/GPT) APIs incurs cost.
Diminishing returns are observed when the value of \(K\) (multi-hop depth) exceeds 4.
The Persona Hub is currently English-only; multilingual expansion remains unexplored.

Related Work & Insights

Evol-Instruct (WizardLM): Increases instruction complexity through specific operations but does not target specialized tasks.
DataTune: Retrieves and transforms from candidate datasets but depends on a substantial amount of candidate data.
Insight: AIDE's paradigm of "recursive expansion from small seeds" can be extended to other data augmentation scenarios, such as domain adaptation and continual learning.

Rating

Novelty: ⭐⭐⭐⭐ The combined design of multi-hop synthesis, attribute guidance, Persona, and residual connections is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, multiple baseline models, and thorough ablation studies are provided, though the main experiments primarily focus on Mistral-7B.
Writing Quality: ⭐⭐⭐⭐ The methodology is described in a formalized, clear manner, and the diagrams are intuitive.
Value: ⭐⭐⭐⭐ Holds significant practical value for LLM fine-tuning in low-data scenarios.