TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data

Conference: ACL 2025
arXiv: 2412.19544
Code: https://github.com/cdhx/TARGA
Area: Other
Keywords: Semantic parsing, Knowledge Base Question Answering (KBQA), Synthetic Data Generation, In-Context Learning, Structured Reasoning

TL;DR

TARGA proposes a targeted synthetic data generation framework that dynamically generates highly relevant synthetic examples for in-context learning in KBQA without requiring any human annotation. Using only a 7B model, it significantly outperforms all non-fine-tuned methods on GrailQA (+7.7 F1) and KBQA-Agent (+12.2 F1).

Background & Motivation

Background: Semantic parsing (translating natural language questions to logical forms) is the dominant approach for reasoning in structured environments. Existing methods either require fine-tuning on large-scale human-annotated datasets (e.g., Pangu) or retrieve similar examples from annotated datasets for in-context learning (e.g., KB-Binder, KB-Coder).
Limitations of Prior Work: There are two core challenges. (1) Annotation Reliance: Collecting human-annotated data in specific environments is time-consuming and labor-intensive, and pre-annotated data is typically unavailable in real-world scenarios. (2) Limited Generalization: Methods based on static training sets suffer drastic performance drops (20% or more) when encountering unseen KB elements or query structures (non-I.I.D. settings).
Key Challenge: In complex environments such as Freebase with over 3 billion triples, representing all potential query schemas using pre-collected static datasets is impossible. Expanding annotation coverage leads to exponential growth in training and retrieval costs.
Goal: How can highly relevant examples be dynamically generated for each test question without any human annotation, while achieving outstanding performance with small models?
Key Insight: Starting from KB entities and relations related to the test question, valid query graphs are constructed directly on the knowledge base through layer-wise expansion and cross-layer combination. These queries are then translated into natural language questions to serve as examples for in-context learning.
Core Idea: Instead of retrieving examples from static datasets, targeted query-question pairs are dynamically synthesized on the KB for each test question to serve as in-context learning demonstrations.

Method

Overall Architecture

Given a natural language query $nlq$, TARGA operates in four steps: (1) retrieving candidate KB entities and relations; (2) constructing synthetic query graphs via layer-wise expansion (multi-hop) and cross-layer combination (multi-constraint); (3) ranking the synthetic queries using a bge-reranker to select the most relevant examples; (4) performing in-context learning with the ranked synthetic (NLQ, Query) pairs as demonstrations to yield the target query.
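
As a rough illustration of step (4), the sketch below assembles an in-context prompt from already-ranked synthetic (NLQ, Query) pairs. The function name `build_icl_prompt`, the instruction wording, the `Question:`/`Query:` layout, and the toy `ns:` identifiers are all assumptions made for this example; they are not the prompt used by TARGA.

```python
def build_icl_prompt(question, demonstrations, max_demos=10):
    """Format ranked (nlq, query) pairs as demonstrations, then append the test question.

    The prompt layout is a hypothetical choice for illustration only.
    """
    lines = ["Translate each question into a SPARQL query.", ""]
    for nlq, query in demonstrations[:max_demos]:
        lines += [f"Question: {nlq}", f"Query: {query}", ""]
    # The test question goes last, leaving the query for the model to complete.
    lines += [f"Question: {question}", "Query:"]
    return "\n".join(lines)


# Toy demonstrations; the entity and relation identifiers are made up.
demos = [
    ("where was alice born", "SELECT ?x WHERE { ns:alice ns:born_in ?x }"),
    ("who is the spouse of alice", "SELECT ?x WHERE { ns:alice ns:spouse ?x }"),
]
prompt = build_icl_prompt("where was the spouse of alice born", demos)
print(prompt)
```

Because the demonstrations arrive already ranked, truncating to `max_demos` keeps the most relevant pairs; the default of 10 mirrors the number of demonstrations reported in the setup.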

Key Designs

Layer-wise Expansion:
- Function: Build multi-hop chained query structures (depth of the query graph).
- Mechanism: Starting from the simplest single-hop structure $\mathcal{L}_1 = \{(s,p,o) | s \in E_{nlq}, p \in R_{nlq}, \text{Exec}((s,p,o), \mathcal{G}) \neq \emptyset\}$, the query graph is expanded outward layer by layer: terminal variable nodes are connected to new variables via new relations to construct $\mathcal{L}_2$ and $\mathcal{L}_3$. The expansion stops when a complexity threshold (3 hops) is reached, as the entity-to-answer distance in reasonable questions typically does not exceed 3 hops. Key constraint: only queries with non-empty execution results are expanded, which prevents exponential growth of invalid combinations.
- Design Motivation: An exploration strategy moving from simple to complex guarantees coverage while controlling computational overhead; execution validation automatically achieves joint entity-relation disambiguation.
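
A minimal sketch of layer-wise expansion on a toy in-memory KB, assuming queries are simple forward relation chains from a linked entity. The helper names (`build_index`, `execute`, `layerwise_expand`), the toy triples, and the restriction to forward edges are assumptions for illustration; the actual system runs SPARQL against Freebase.

```python
def build_index(triples):
    """Index a toy KB as {subject: {relation: set(objects)}}."""
    index = {}
    for s, p, o in triples:
        index.setdefault(s, {}).setdefault(p, set()).add(o)
    return index


def execute(index, start, path):
    """Follow a chain of relations from `start` and return the nodes reached."""
    frontier = {start}
    for rel in path:
        frontier = {o for node in frontier for o in index.get(node, {}).get(rel, ())}
    return frontier


def layerwise_expand(index, entities, relations, max_hops=3):
    """Return [L1, L2, ...]; each layer holds (start, path) queries with non-empty results."""
    layer = [(s, (p,)) for s in entities for p in relations if execute(index, s, (p,))]
    layers = [layer]
    for _ in range(max_hops - 1):
        # Only queries that already executed successfully are extended by one hop.
        layer = [
            (s, path + (p,))
            for s, path in layer
            for p in relations
            if execute(index, s, path + (p,))
        ]
        layers.append(layer)
    return layers


# Made-up toy KB and candidate sets (stand-ins for E_nlq and R_nlq).
triples = [
    ("alice", "born_in", "springfield"),
    ("springfield", "located_in", "state_x"),
    ("state_x", "part_of", "country_y"),
    ("alice", "spouse", "bob"),
    ("bob", "born_in", "shelbyville"),
]
index = build_index(triples)
layers = layerwise_expand(index, ["alice"], ["born_in", "located_in", "spouse", "part_of"])
for depth, layer in enumerate(layers, start=1):
    print(f"L{depth}: {layer}")
```

In this toy run, 4 candidate relations yield only 2 valid one-hop queries, and each later layer grows from those survivors rather than from all relation combinations, which is the pruning effect described above.
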
Cross-layer Combination:
- Function: Build multi-constraint query structures (width of the query graph).
- Mechanism: Select two queries from different layers $q \in \mathcal{L}_x$ and $q' \in \mathcal{L}_y$, find their respective variables whose execution results overlap ($\mathcal{E}(o_i) \cap \mathcal{E}(o_j) \neq \emptyset$), and merge the two queries via this shared variable. The combination terminates at 5 edges, covering the vast majority of question structures in current datasets. Starting with simple combinations ($\mathcal{L}_{1\times 1}$), it gradually scales to more complex schemas ($\mathcal{L}_{2\times 3}$, $\mathcal{L}_{1\times(1\times 2)}$).
- Design Motivation: Multi-constraint queries are crucial in KBQA (e.g., "people who were born in X and attended Y"), which cannot be modeled solely by chained expansion.
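
The merge step can be sketched as follows, under the simplifying assumption that two chain queries are joined only on their answer variable (the description above allows any pair of variables with overlapping results). The function names, the toy triples, and the `(start, path)` query encoding are hypothetical.

```python
from itertools import combinations


def execute(triples, start, path):
    """Follow a chain of relations from `start` over a list of (s, p, o) triples."""
    frontier = {start}
    for rel in path:
        frontier = {o for s, p, o in triples if s in frontier and p == rel}
    return frontier


def cross_layer_combine(triples, queries, max_edges=5):
    """Pair up queries whose answer sets overlap, respecting the edge budget."""
    results = {q: execute(triples, *q) for q in queries}
    combined = []
    for q1, q2 in combinations(queries, 2):
        if len(q1[1]) + len(q2[1]) > max_edges:
            continue  # combined query would exceed the complexity threshold
        shared = results[q1] & results[q2]
        if shared:  # overlapping answers: the two queries can share one variable
            combined.append((q1, q2, shared))
    return combined


# Made-up toy KB: "people born in city_a who attended univ_b".
triples = [
    ("city_a", "birthplace_of", "alice"),
    ("city_a", "birthplace_of", "bob"),
    ("univ_b", "attended_by", "alice"),
    ("univ_b", "attended_by", "carol"),
    ("univ_c", "attended_by", "dave"),
]
queries = [
    ("city_a", ("birthplace_of",)),
    ("univ_b", ("attended_by",)),
    ("univ_c", ("attended_by",)),
]
combos = cross_layer_combine(triples, queries)
print(combos)
```

Only the `city_a`/`univ_b` pair survives here, because the other pairings have no common answer; the overlap check plays the same filtering role that non-empty execution plays in layer-wise expansion.
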
Hierarchical Ranking + Query Textification (Re-ranking):
- Function: Filter the most relevant demonstrations from a potentially large volume of synthetic queries.
- Mechanism: First, Query Textification translates SPARQL queries into quasi-natural-language text using heuristic rules, bridging the semantic gap between embedding models and queries. Then, bge-reranker-v2-m3 is applied to score the similarity between each textified query and the question. Hierarchical Ranking is then applied: only the top-n sub-queries derived from the same parent query are kept, which prevents the candidate pool from being dominated by the exponentially larger number of complex queries.
- Design Motivation: Many valid query structures in synthetic data are irrelevant to the current question and must be efficiently filtered; hierarchical ranking ensures queries across different complexity levels have a chance to be selected.
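
The two ideas can be illustrated with a small sketch. Note the substitutions: the textification rule below (keep the last segment of a dotted relation name and replace underscores) is an assumed heuristic, and a token-overlap (Jaccard) score stands in for bge-reranker-v2-m3 so the example runs without a model. All function names and the toy candidates are hypothetical.

```python
from collections import defaultdict


def textify(start, path):
    """Turn (entity, relation chain) into rough text via a simple assumed heuristic."""
    words = [start] + [rel.split(".")[-1].replace("_", " ") for rel in path]
    return " ".join(words)


def overlap_score(question, text):
    """Jaccard token overlap; a stand-in for a learned reranker score."""
    q, t = set(question.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def hierarchical_top_n(question, candidates, n=1):
    """Keep the n best-scoring children per parent query, then sort globally."""
    by_parent = defaultdict(list)
    for parent, start, path in candidates:
        text = textify(start, path)
        by_parent[parent].append((overlap_score(question, text), text))
    kept = []
    for scored in by_parent.values():
        scored.sort(key=lambda item: item[0], reverse=True)
        kept.extend(scored[:n])
    return sorted(kept, key=lambda item: item[0], reverse=True)


question = "where was the spouse of alice born"
# (parent id, start entity, relation chain); identifiers are illustrative.
candidates = [
    ("p1", "alice", ("people.person.spouse", "people.person.place_of_birth")),
    ("p1", "alice", ("people.person.spouse", "people.person.nationality")),
    ("p2", "alice", ("people.person.place_of_birth", "location.location.containedby")),
]
ranked = hierarchical_top_n(question, candidates, n=1)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

With `n=1`, the second child of `p1` is dropped even though it scores higher than the only child of `p2`, so each parent keeps a representative in the final pool.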

Loss & Training

TARGA is a zero-annotation, training-free framework in which all core components run dynamically during inference:
- Entity Linking: Directly reuse the results from Gu et al. (2023).
- Relation Retrieval: Use text-embedding-ada-002 to compute similarity between the question and Freebase relations, selecting the top-20.
- Number of examples: 10 demonstrations by default.
- Backbone model: Qwen-2.5-7B-Instruct by default.
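
The relation retrieval step amounts to a top-k similarity search. The sketch below assumes cosine similarity over precomputed embedding vectors and uses tiny hand-written 2-d vectors in place of real text-embedding-ada-002 outputs; the function names and relation vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_relations(question_vec, relation_vecs, k=20):
    """Return the k relation names most similar to the question embedding."""
    ranked = sorted(
        relation_vecs,
        key=lambda name: cosine(question_vec, relation_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-written stand-in embeddings (real embeddings have far more dimensions).
question_vec = [1.0, 0.0]
relation_vecs = {
    "people.person.place_of_birth": [0.9, 0.1],
    "film.film.directed_by": [0.1, 0.9],
    "people.person.spouse_s": [0.7, 0.3],
}
print(top_k_relations(question_vec, relation_vecs, k=2))
```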

Key Experimental Results

Main Results

| Dataset | Metric | TARGA (7B) | Prev. SOTA (non-fine-tuned) | Gain |
|---|---|---|---|---|
| GrailQA | F1 | 69.0 | 61.3 (KB-Coder-R, GPT-3.5) | +7.7 |
| GraphQ | F1 | 50.6 | 50.8 (QueryAgent, GPT-3.5) | -0.2 |
| KBQA-Agent | F1 | 46.5 | 34.3 (FUXI, GPT-3.5) | +12.2 |
| MetaQA | F1 | 85.7 | 99.5 (KB-Binder-R) | -13.8 |

Stronger Backbone Models:

| Model | GrailQA | GraphQ | KBQA-Agent |
|---|---|---|---|
| Qwen-2.5-7B | 69.0 | 50.6 | 46.5 |
| Qwen-2.5-72B | 70.6 | 54.1 | 57.3 |
| GPT-3.5-turbo | 68.9 | 51.0 | 52.7 |
| GPT-4-turbo | 69.8 | 52.5 | 51.4 |

Ablation Study

Generalization Capability (Three settings in GrailQA):

| Method | I.I.D. | Compositional | Zero-shot | Avg |
|---|---|---|---|---|
| KB-Binder-R | 80.6 | 53.6 | 50.7 | 58.5 |
| KB-Coder-R | 81.0 | 57.8 | 54.1 | 61.3 |
| TARGA | 68.4 | 62.2 | 71.7 | 69.0 |
| TARGA-R | 80.8 | 63.6 | 71.6 | 71.9 |
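
To make the generalization gap concrete, the snippet below recomputes the difference between I.I.D. and zero-shot F1 for each method, using only the numbers from the table above. The dictionary layout and variable names are choices made for this example.

```python
# F1 scores copied from the GrailQA generalization table.
scores = {
    "KB-Binder-R": {"iid": 80.6, "zero_shot": 50.7},
    "KB-Coder-R": {"iid": 81.0, "zero_shot": 54.1},
    "TARGA": {"iid": 68.4, "zero_shot": 71.7},
    "TARGA-R": {"iid": 80.8, "zero_shot": 71.6},
}

# Positive gap = the method loses F1 when moving from I.I.D. to zero-shot.
gaps = {method: round(s["iid"] - s["zero_shot"], 1) for method, s in scores.items()}
for method, gap in gaps.items():
    print(f"{method}: I.I.D. minus zero-shot = {gap:+.1f} F1")
```

The retrieval-based baselines lose roughly 27 to 30 F1 points between the two settings, whereas TARGA without any training data scores 3.3 points higher on zero-shot than on I.I.D.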

Efficiency Analysis (GrailQA):

| Method | TPQ (sec) | QPQ (queries) | CPQ (cost $) |
|---|---|---|---|
| KB-BINDER | 51.2 | 3,297.7 | 0.010 |
| QueryAgent | 16.6 | 5.2 | 0.019 |
| TARGA | 4.5 | 256.8 | 0.000 |
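
The relative costs implied by the efficiency table can be checked with a few lines of arithmetic. The figures are taken from the table; the variable names are illustrative.

```python
# Per-question time (seconds) and KB queries from the efficiency table.
tpq = {"KB-BINDER": 51.2, "QueryAgent": 16.6, "TARGA": 4.5}
qpq = {"KB-BINDER": 3297.7, "QueryAgent": 5.2, "TARGA": 256.8}

# How many times slower each baseline is than TARGA.
slowdown = {m: round(t / tpq["TARGA"], 1) for m, t in tpq.items() if m != "TARGA"}
# How many times more KB queries TARGA issues than QueryAgent.
query_ratio = round(qpq["TARGA"] / qpq["QueryAgent"], 1)

print(slowdown)
print(query_ratio)
```

The time advantage (about 11x over KB-BINDER and 3.7x over QueryAgent) comes alongside roughly 49x more KB queries per question than QueryAgent, which is the trade-off behind the QPQ limitation discussed later.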

Key Findings

Extremely High Sample Efficiency: Using just 1 synthetic demonstration yields stronger performance than the 20-shot random/similarity retrieval settings, indicating that the targeted synthetic examples are of much higher quality than examples retrieved from a static dataset.
Substantial Generalization: Under the Zero-shot setting (fully unseen KB elements/structures), TARGA achieves a 71.7 F1, outperforming KB-Coder-R (54.1) by 17.6 points. Pre-collected training sets offer almost no help for Zero-shot scenarios.
Strong Robustness: In adversarial settings (randomly replacing relations in demonstrations), TARGA's performance drops by only about 25%, compared to 75% for the random setting, suggesting that informational redundancy among synthetic examples provides robustness.
No Reliance on Powerful Closed-source Models: With a 7B open-source backbone, TARGA outperforms the GPT-3.5-based non-fine-tuned baselines on GrailQA and KBQA-Agent and is within 0.2 F1 of QueryAgent on GraphQ, owing to high-quality examples that simplify the task.
Fastest Response Speed: 4.5 seconds per question, which is 11× faster than KB-BINDER, with zero closed-source LLM call cost.

Highlights & Insights

"Dynamic Synthesis vs. Static Retrieval" Paradigm Shift: Instead of retrieving similar examples from static datasets, targeted examples are constructed dynamically on the KB in real-time based on the current question. This bypasses the root cause of generalization issues (where the training distribution cannot cover all testing scenarios).
Cleverly Leveraging KB Structural Constraints for Query Construction: Joint entity-relation disambiguation is automatically achieved via non-empty execution validation, constraining the combinatorial explosion to only dozens of valid candidate queries per question on average.
Query Textification Bridges Semantic Gap: Translating SPARQL queries to natural language via simple heuristic rules improves embedding ranking accuracy; it is a straightforward yet highly effective design.
From Simple to Complex Construction Strategy: Moving from $\mathcal{L}_1$ to $\mathcal{L}_3$, and from single-layer to cross-layer combinations, ensures that complex queries are invariably assembled from validated simple sub-queries.

Limitations & Future Work

Performance on MetaQA is lower than that of KB-Binder-R, which requires a full training set (85.7 vs. 99.5), indicating that dynamic synthesis might not offer advantages in simple environments with sufficient annotations.
It relies on existing entity linking methods; poor linking quality directly affects subsequent query construction.
The number of KB queries during query construction remains high (256.8 QPQ), which may pose a performance bottleneck on large-scale KBs.
It is currently limited to KBQA (Freebase); its efficacy on other structured reasoning tasks like Text2SQL has not been fully verified (though preliminary WikiSQL results are mentioned).
The top-n parameter in hierarchical ranking and the maximum complexity threshold require manual tuning.

vs. KB-Binder/KB-Coder: These retrieve examples from annotated training sets for ICL, and their performance drops by 20%+ in non-I.I.D. settings. TARGA dynamically generates examples and is entirely unaffected by training set distribution. In I.I.D. settings, incorporating training sets (TARGA-R) achieves competitive results.
vs. BYOKG (Agarwal et al.): BYOKG also requires no annotations but uses offline synthesis and static retrieval, which still suffers from generalization issues. TARGA synthesizes online, obtaining customized examples for every question.
vs. Agent-based (QueryAgent, AgentBench): Agent methods decompose questions progressively. Although they generalize well, they incur high computational costs (multiple LLM calls) and rely heavily on strong model capabilities. TARGA accomplishes the task with a single ICL call using a 7B model.

Rating

Novelty: ⭐⭐⭐⭐⭐ The dynamic, targeted synthesis of examples is highly ingenious, transforming KB structural constraints into data quality guarantees.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive analysis across 4 datasets, evaluating generalization, robustness, efficiency, model size, etc.
Writing Quality: ⭐⭐⭐⭐ Well-structured methodology descriptions with clear formulations.
Value: ⭐⭐⭐⭐⭐ Zero annotations and a small model outperforming closed-source LLM methods make it highly practical and directly applicable to real-world KBQA systems.