Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Conference: ACL 2025
arXiv: 2501.12273
Code: https://github.com/InternLM/Condor
Area: LLM/NLP
Keywords: Synthetic Data Generation, SFT Data Quality, Knowledge-Driven, Self-Reflection Refinement, LLM Alignment
TL;DR
Condor is a two-stage synthetic data generation framework: it builds diverse, tag-driven questions from a World Knowledge Tree and then improves response quality through iterative Self-Reflection Refinement. A base model fine-tuned on only 20K Condor samples outperforms similarly sized counterparts on dialogue alignment tasks, and the effectiveness of iterative self-refinement is validated on models up to 72B.
Background & Motivation
Background: SFT (Supervised Fine-Tuning) is a crucial step in enhancing the dialogue capabilities of LLMs, where high-quality SFT data directly determines a model's performance in human preference alignment. Currently, the industry mainly follows two paths: relying on high-quality human-annotated data (e.g., OpenAssistant, ShareGPT), or using synthetic data generated by strong models (e.g., Self-Instruct, Evol-Instruct).
Limitations of Prior Work: As LLM capabilities advance rapidly, high-quality human-annotated SFT data has become a severe bottleneck due to high annotation costs, poor scalability, and limited domain coverage. Meanwhile, existing synthetic data methods suffer from two core issues: (1) narrow topic coverage, where generated questions concentrate on a few popular domains, lacking systematic knowledge coverage; and (2) inconsistent response quality, since one-shot generation struggles to guarantee accuracy and depth.
Key Challenge: Synthetic data needs to simultaneously meet the seemingly contradictory goals of "diversity" and "quality"—broader coverage often degrades quality in long-tail domains, while strict quality control limits the scale and diversity of the data. Existing methods lack a unified framework to systematically organize the knowledge space and progressively improve response quality.
Goal: Design a scalable synthetic data generation framework that can (1) systematically cover a wide range of knowledge domains when generating high-quality questions; (2) continuously improve response quality through an automated refinement pipeline; and (3) match or exceed models trained on massive datasets using only a small number of synthetic samples (20K).
Key Insight: The authors observe that human knowledge is hierarchically organized—from broad disciplines to sub-topics to specific skill points. This tree-like structure is naturally suited for guiding data generation coverage. Additionally, human writing follows an iterative process of "drafting \(\rightarrow\) reviewing \(\rightarrow\) editing". Models can similarly leverage self-criticism to continuously improve response quality.
Core Idea: Build a World Knowledge Tree to provide a structured knowledge tag system to drive diverse question generation, and then leverage Self-Reflection Refinement to let the model evaluate and improve its own responses, forming a "generation \(\rightarrow\) critique \(\rightarrow\) refinement" closed loop to address both diversity and quality challenges simultaneously.
Method
Overall Architecture
Condor works in two stages. The first, Condor Void, synthesizes data: guided by the World Knowledge Tree, it generates questions that span a wide range of topics and difficulty levels, together with initial responses. The second, Condor Refine, takes these outputs and iteratively improves the responses through self-reflection. The resulting Condor-SFT dataset (approx. 20K samples) consists of high-quality bilingual (Chinese and English) QA pairs ready for SFT. The framework has been applied in the training pipeline of InternLM3.
Key Designs
- World Knowledge Tree:
  - Function: Provides a systematic knowledge tagging system that serves as a "map" for data generation, so that generated questions cover the areas of human knowledge evenly.
  - Mechanism: Builds a multi-level classification tree that refines top-level disciplines (e.g., science, technology, humanities, arts) into specific topics and skill points. Each leaf node is a concrete knowledge tag used as a seed for question generation. Seeding from the leaves avoids the Matthew effect seen in free generation, where the model keeps returning to popular topics, and spreads coverage to long-tail knowledge domains.
  - Design Motivation: Traditional Self-Instruct approaches let models generate questions "freely," which easily leads to redundancy and bias. Constraining generation with an external, structured knowledge tree is meant to secure both diversity and coverage; this is what "knowledge-driven" means here.
- Task & Difficulty Expansion:
  - Function: Expands task types and adds difficulty gradients under each knowledge tag to further increase the diversity and complexity of the questions.
  - Mechanism: For each tag, the framework generates not only simple Q&A but also other task types such as analysis, reasoning, creation, and coding. It also introduces difficulty levels that range from basic knowledge queries to complex questions requiring deep reasoning. This process yields dozens of question variants, differing in type and difficulty, under a single tag.
  - Design Motivation: Knowledge coverage alone is insufficient, because different task types and difficulty levels within a single domain train different capabilities. The two-dimensional expansion (task type \(\times\) difficulty level) maximizes capability coverage within a limited data volume.
- Self-Reflection Refinement:
  - Function: Performs multi-round self-criticism and refinement on the initial responses to raise response quality.
  - Mechanism: Given a question and an initial response, the model first writes a detailed critique that points out deficiencies (e.g., factual errors, logical flaws, unclear phrasing). It then regenerates an improved response based on this critique. This "response \(\rightarrow\) critique \(\rightarrow\) improvement" cycle can be repeated, with each round refining the previous round's output. The highest-quality version is selected for the final training data.
  - Design Motivation: The quality of a single-shot response has an upper bound, whereas iterative refinement can progressively approach higher quality. More importantly, the self-reflection mechanism does not rely on external annotations, which makes it cost-effective and scalable. The refined data can in turn strengthen the model's ability to self-improve at inference time.
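The three designs above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the miniature tree, the task and difficulty labels, and the function names `leaf_tags`, `expand`, and `refine` are all hypothetical, and `critique`, `revise`, and `score` stand in for LLM calls that the caller must supply. Choosing the final version by a `score` function is likewise an assumption about how the "highest-quality version" is picked.

```python
from itertools import product
from typing import Callable, Iterator

# Hypothetical miniature tree; a real World Knowledge Tree would be far larger.
KNOWLEDGE_TREE = {
    "Science": {"Physics": ["Optics", "Thermodynamics"]},
    "Humanities": {"History": ["Ancient Rome"]},
}
TASK_TYPES = ["qa", "analysis", "coding"]          # assumed labels
DIFFICULTIES = ["basic", "intermediate", "hard"]   # assumed labels


def leaf_tags(node, path: tuple = ()) -> Iterator[tuple]:
    """Yield the full path (discipline, topic, leaf) of every leaf tag."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from leaf_tags(child, path + (name,))
    else:  # a list of leaf names
        for leaf in node:
            yield path + (leaf,)


def expand(tree) -> list:
    """Cross every leaf tag with every (task type, difficulty) pair."""
    return [
        {"tag": " > ".join(tag), "task": task, "difficulty": level}
        for tag, task, level in product(leaf_tags(tree), TASK_TYPES, DIFFICULTIES)
    ]


def refine(question: str, draft: str, critique: Callable, revise: Callable,
           score: Callable, rounds: int = 2) -> str:
    """Run critique-then-revise passes and return the best-scoring version."""
    versions = [draft]
    for _ in range(rounds):
        feedback = critique(question, versions[-1])        # find weaknesses
        versions.append(revise(question, versions[-1], feedback))
    return max(versions, key=lambda answer: score(question, answer))
```

Each dictionary returned by `expand` would be turned into a question-generation prompt, so every leaf tag contributes the same number of seeds regardless of how popular its topic is. Keeping all intermediate versions in `refine` means a later round that makes the answer worse does not overwrite a better earlier one.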
Loss & Training
Models are fine-tuned on Condor-generated SFT data with the standard next-token prediction loss. The authors find that fine-tuning on only 20K Condor-refined samples yields strong performance, which supports the view that data quality matters more than quantity. The Condor Refine stage also supports iterative training: a model fine-tuned on refined data can itself serve as the refiner to produce higher-quality data, forming a self-improvement loop. This strategy has been validated across model sizes from 7B to 72B.
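For reference, the standard next-token prediction objective can be written as below. This is the generic SFT formulation rather than an equation taken from the paper: \(x\) denotes the prompt, \(y\) the response of length \(T\), and \(\theta\) the model parameters. Computing the loss over response tokens only and averaging per token are assumptions.

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
\]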
Key Experimental Results
Main Results
The paper compares models fine-tuned on different SFT datasets across several mainstream LLM alignment benchmarks. The core finding is that the model fine-tuned on only 20K Condor synthetic samples outperforms models trained on other synthetic datasets (e.g., Evol-Instruct, Self-Instruct) and on some human-annotated datasets.
| Comparison Method | Data Volume | Arena-Hard | AlpacaEval 2.0 | MT-Bench | Average |
|---|---|---|---|---|---|
| Base (No SFT) | 0 | - | - | - | Baseline |
| Self-Instruct | ~52K | Low | Medium | Medium | Medium |
| Evol-Instruct | ~70K | Medium | Medium | Medium | Medium |
| OpenHermes 2.5 | ~1M | Medium | Medium-High | Medium-High | Upper-Medium |
| Condor Void | 20K | High | High | High | High |
| Condor Refine | 20K | Highest | Highest | Highest | Best |
Note: The full PDF of the paper was not available, so the table gives qualitative rankings inferred from the abstract and GitHub description rather than reported scores. The core conclusion is that 20K Condor samples significantly outperform other synthetic data schemes of equivalent or larger scale.
Ablation Study
| Configuration | Alignment Performance | Description |
|---|---|---|
| Full Condor (Void + Refine) | Best | Complete two-stage pipeline |
| Condor Void only (No Refine) | Significant drop | Uses only initial synthetic data, showing the massive contribution of the Refine stage |
| w/o Knowledge Tree (Random tags) | Drop | Lack of structured knowledge coverage leads to insufficient diversity |
| w/o Task Expansion | Drop | Single task type limits capability coverage |
| w/o Difficulty Expansion | Slight drop | Lack of difficulty gradients hurts complex reasoning capabilities |
| 1-round Refine vs Multi-round Refine | Multi-round is better | Iterative refinement brings continuous improvements |
Key Findings
- Self-Reflection Refinement contributes the most: The performance jump from Condor Void to Condor Refine is substantial, illustrating that the "critique + improvement" loop is the most critical design in the framework. Performance degrades significantly without the refinement stage.
- Scaling effect of data volume: The paper finds that the scaling potential of synthetic data in post-training is far from exhausted. 20K samples already perform strongly, and larger-scale Condor data yields further improvements, countering the pessimistic view that synthetic data scaling has plateaued.
- Cross-scale effectiveness: Condor's self-improvement strategy works effectively across various model scales such as 7B, 20B, and 72B. The 72B model achieves significant gains through Condor Refine, demonstrating that this method does not hit a ceiling based on model size.
- Preservation of knowledge capability: While human-preference alignment scores improve significantly after fine-tuning on Condor data, the model's pre-existing knowledge capabilities (as measured by knowledge benchmarks such as MMLU) are not compromised.
Highlights & Insights
- World Knowledge Tree is an elegant mechanism to ensure diversity: Guiding data generation through an external knowledge structure avoids the "information cocoon" issue of synthetic data. This approach can be applied to any data-generation scenario requiring guaranteed coverage, such as RL reward model training data or multimodal instruction data.
- Self-Reflection Refinement achieves "free" data quality improvement: By utilizing the model's self-scrutiny instead of relying on external annotators or stronger teacher models, this paradigm becomes highly scalable. Additionally, this strategy subtly trains the model's "metacognition"—learning to think critically.
- The "20K is enough" finding is highly inspiring: For SFT data, quality >> quantity. This challenges the intuition of "the more, the better," bringing encouraging news to research teams with limited resources.
Limitations & Future Work
- Evaluation coverage: The paper primarily focuses on dialogue quality and human preference alignment, without sufficiently evaluating dimensions like safety and hallucination rates.
- Knowledge tree construction relies on human priors: The taxonomy of the World Knowledge Tree requires manual design; how to automate its construction and update it dynamically remains an open question.
- Refinement upper bound: Self-reflection is constrained by the model's intrinsic capability. The model cannot detect errors that lie beyond its cognitive horizon. Its ability to detect factual errors remains to be fully verified.
- Domain generalization: The framework was validated mainly in general dialogue scenarios; its efficacy in specialized domains (e.g., medicine, law) remains unknown.
- Insufficient comparison with RLHF/DPO: The paper emphasizes comparisons with other SFT data solutions, with insufficient comparative and complementary analysis against preference learning methods (RLHF, DPO).
Related Work & Insights
- vs Self-Instruct: Self-Instruct lets the model freely generate instructions and responses, failing to guarantee systematic knowledge coverage. Condor explicitly constrains the generated knowledge distribution using a Knowledge Tree, achieving much better diversity.
- vs Evol-Instruct (WizardLM): Evol-Instruct increases question complexity through evolutionary prompts, but the starting seeds remain limited. Condor's tag-driven and two-dimensional expansion strategy fundamentally addresses the seed diversity issue.
- vs Self-Reward/Self-Play: Self-rewarding methods require models to act as both generator and evaluator, typically requiring preference-pair data. Condor's Self-Reflection is more lightweight, directly generating critiques to construct improved outputs rather than requiring contrasting pairs.
- Insights: The "structured seed + iterative refinement" paradigm of Condor is highly generalizable and can be transferred to other subfields such as code generation, mathematical reasoning, and multimodal dialogue data.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of knowledge-tree-driven synthesis and self-reflection refinement is novel, though the two individual components are not entirely new concepts.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validations across multiple model sizes and data scaling analyses are solid, though direct comparisons with RLHF/DPO are lacking.
- Writing Quality: ⭐⭐⭐⭐ Technical report style, clear framework, though the depth of analysis could be enhanced for an ACL paper.
- Value: ⭐⭐⭐⭐⭐ Highly practical—already adopted by InternLM3, with the dataset and code open-sourced. The discovery that only 20K samples can yield exceptional performance is highly valuable for the community.