Tags: ACL 2025, Model Compression, Function Calling, Multi-turn Dialogue, Graph Translation, Data Synthesis, Context Distillation, DPO, Preference Optimization

Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation¶

Conference: ACL 2025
arXiv: 2503.07826
Code: Not publicly released
Authors: Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long T. Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, Tomas Pfister
Affiliations: Google, UCLA
Area: Model Distillation / Tool-use Agents
Keywords: Function Calling, Multi-turn Dialogue, Graph Translation, Data Synthesis, Context Distillation, DPO, Preference Optimization

TL;DR¶

Magnet constructs high-quality multi-turn function calling (FC) training trajectories by running random walks on a function dependency graph and applying node operations (Insert/Merge/Split) to the sampled paths. Prompt-based context distillation then generates positive and negative trajectories for SFT + mDPO training. The resulting 14B model, Magnet-14B-mDPO, scores 68.01 on BFCL-v3 (ranking 4th) and reaches 37.88 on the multi-turn subset, well above the teacher model Gemini-1.5-pro-002 (20.75).

Background & Motivation¶

Background: LLM Agents need to invoke external tools (APIs/functions) to complete complex tasks. Current models perform well in single-step FC, but still face challenges in multi-turn, multi-step interactions.

Three key challenges of multi-turn FC:
- Nested FCs: Certain turns require multiple or even nested function calls, which are not explicitly mentioned in the query.
- Long Dependency: Certain turns require utilizing information from distant parts of the conversation history to assemble the FC.
- Irrelevance: Certain turns have missing functionalities or parameters, requiring the model to ask clarifying questions.

Data Bottleneck: The multi-turn success rate of the best open-source models on BFCL-v3 is only ~10%. The low performance of existing models in multi-turn scenarios makes it extremely difficult to collect high-quality training trajectories.

Core Motivation: Design a principled data synthesis pipeline to construct reliable multi-turn FC training data from a graph perspective.

Method¶

Overall Architecture¶

Four-stage pipeline: Function Pool & Dependency Graph Construction \(\rightarrow\) Node Operation Augmentation \(\rightarrow\) Back-and-Forth Translation for Query-FC Pairs \(\rightarrow\) Context Distillation for Trajectory Generation

Phase 1: Function Dependency Graph Construction¶

Function Collection: 5,011 executable APIs across 49 categories were collected from StableToolBench and BFCL-v3.
Local Dependency Graph: Functions are treated as graph nodes. For each node, 30 neighbor candidates from the same category are sampled, and LLMs are used to determine if an input-output dependency exists. If the output of a source node is related to the input of a target node, a directed edge is established.
Initial FSP Sampling: Start from each node and perform random walks of \(S=7\) steps along the dependency edges to sample an initial Function Signature Path \(\tilde{\phi} = (\tilde{f}_1, \tilde{f}_2, \cdots, \tilde{f}_H)\).
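The sampling step can be illustrated with a minimal sketch. The toy graph, the function names in it, and the helper `sample_fsp` are hypothetical; in the paper the edges are judged by an LLM over the collected API pool, and the authors' actual implementation is not public.

```python
import random
from typing import Dict, List, Optional

# Hypothetical toy dependency graph: an edge u -> v means the output of u
# can feed an input of v. These names are illustrative, not from the paper.
DEPENDENCY_GRAPH: Dict[str, List[str]] = {
    "search_flights": ["book_flight", "get_weather"],
    "book_flight": ["send_confirmation"],
    "get_weather": [],
    "send_confirmation": [],
}


def sample_fsp(
    graph: Dict[str, List[str]],
    start: str,
    max_steps: int = 7,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Random walk of at most `max_steps` edges starting at `start`."""
    rng = rng or random.Random()
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbors = graph.get(current, [])
        if not neighbors:  # dead end: stop the walk early
            break
        current = rng.choice(neighbors)
        path.append(current)
    return path


if __name__ == "__main__":
    print(sample_fsp(DEPENDENCY_GRAPH, "search_flights", rng=random.Random(0)))
```

Stopping at dead ends is a simplification chosen for this sketch; the summary does not say how the original pipeline handles nodes without outgoing edges.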

Phase 2: Node Operation Augmentation¶

Three graph-level operations are designed to address the three major challenges of multi-turn FC:

Operation 1: Insert (resolving nested FCs and long dependencies)
- Traverse the last function signature \(\tilde{f}_{hk}\) of each turn in the FSP, and use an LLM to check if neighbor functions satisfy the conditions for nested calls.
- If such functions exist, append the nested function to the current turn or insert it at a subsequent random position (simulating long dependency).
- Example: Query "How many kilometers from San Francisco to San Mateo?" \(\rightarrow\) Requires calling get_distance() (returns miles) + convert_unit() (implicit nesting).

Operation 2: Merge (creating multi-FC within a single turn)
- Merge two adjacent turns into a single turn with probability \(p=0.3\). The model must understand the output of previous functions to assemble subsequent functions.
- Difference from Insert: Functions merged by Merge are related, but not necessarily nested.

Operation 3: Split (simulating missing information)
- Randomly select a turn and insert an empty node {} after it, marked as 'miss params' or 'miss func'.
- The model must identify the missing information and ask clarifying questions.

Execution Order: Apply Merge first, then Insert, to obtain the augmented FSP \(\phi\). Split is then applied to \(\phi\) to produce a second FSP \(\hat{\phi}\) that contains missing information.
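The three operations can be sketched on a simplified FSP representation. Everything here is an assumption made for illustration: an FSP is modeled as a list of turns, each turn a list of function names; the `nested` lookup table stands in for the LLM check; the missing-information node is represented as a turn holding only a marker string; and `merge` may chain several merges in a row.

```python
import random
from typing import Dict, List

FSP = List[List[str]]  # one inner list of function names per turn (assumed layout)


def merge(fsp: FSP, p: float, rng: random.Random) -> FSP:
    """Fold each turn into the previous one with probability p."""
    if not fsp:
        return []
    out: FSP = [list(fsp[0])]
    for turn in fsp[1:]:
        if rng.random() < p:
            out[-1].extend(turn)  # multiple FCs now share one turn
        else:
            out.append(list(turn))
    return out


def insert(fsp: FSP, nested: Dict[str, str], rng: random.Random,
           p_later: float = 0.5) -> FSP:
    """Add a nested follow-up after the last function of a turn.

    `nested` maps a function to its follow-up; it replaces the LLM judgment.
    Turns are assumed to be non-empty.
    """
    out = [list(t) for t in fsp]
    for i, turn in enumerate(fsp):
        follow_up = nested.get(turn[-1])
        if follow_up is None:
            continue
        later = list(range(i + 1, len(out)))
        if later and rng.random() < p_later:
            out[rng.choice(later)].append(follow_up)  # long dependency
        else:
            out[i].append(follow_up)  # nested call in the same turn
    return out


def split(fsp: FSP, rng: random.Random, marker: str = "miss params") -> FSP:
    """Insert a marker-only turn after a randomly chosen turn (fsp non-empty)."""
    out = [list(t) for t in fsp]
    pos = rng.randrange(len(out))
    out.insert(pos + 1, [marker])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    base = [["get_distance"], ["book_taxi"], ["send_receipt"]]
    phi = insert(merge(base, 0.3, rng), {"get_distance": "convert_unit"}, rng)
    phi_hat = split(phi, rng)
    print(phi, phi_hat, sep="\n")
```

The usage at the bottom follows the stated order: Merge, then Insert, then Split on the result.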

Phase 3: Back-and-Forth Translation¶

Iteratively translate the augmented FSP into query-FC pairs:
- Back translation \(\mathcal{M}_b(f_h) = q_h\): Translates function signatures into simulated user queries.
- Forward translation \(\mathcal{M}_f(q_h, f_h, t_{h-1}) = fc_h\): Translates queries into executable FCs (leveraging the previous turn's output \(t_{h-1}\)).
- Iterate turn-by-turn to ensure that the outputs of previous turns are ready before passing them to subsequent turns.
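The turn-by-turn loop can be sketched as follows. The three callables stand in for the back-translation model \(\mathcal{M}_b\), the forward-translation model \(\mathcal{M}_f\), and a function executor; their names, signatures, and the stub behavior are assumptions, and each turn is reduced to a single signature for brevity.

```python
from typing import Callable, List, Optional, Tuple


def back_and_forth(
    fsp: List[str],
    back_translate: Callable[[str], str],
    forward_translate: Callable[[str, str, Optional[str]], str],
    execute: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn a path of function signatures into (query, function call) pairs."""
    pairs: List[Tuple[str, str]] = []
    prev_output: Optional[str] = None  # t_{h-1}; nothing before the first turn
    for signature in fsp:
        query = back_translate(signature)                     # q_h = M_b(f_h)
        call = forward_translate(query, signature, prev_output)  # fc_h
        prev_output = execute(call)                           # t_h, used next turn
        pairs.append((query, call))
    return pairs


# Stub models so the sketch runs without any LLM; purely illustrative.
def stub_back(signature: str) -> str:
    return f"Please run {signature}"


def stub_forward(query: str, signature: str, prev: Optional[str]) -> str:
    return f"{signature}(prev={prev!r})"


def stub_execute(call: str) -> str:
    return f"out:{call}"


if __name__ == "__main__":
    for q, fc in back_and_forth(["get_distance", "convert_unit"],
                                stub_back, stub_forward, stub_execute):
        print(q, "->", fc)
```

Executing each call before moving on is what makes the previous turn's output available when the next call is assembled.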

Phase 4: Context Distillation for Trajectory Generation¶

Positive Trajectory Generation:
- Inspired by Context Distillation, the reference FCs are appended to the query as a [Hint] when the teacher model (Gemini-1.5-pro-002) generates trajectories.
- The hint steers the teacher toward generating actions that are as accurate as possible.

Negative Trajectory Generation:
- Collect 10 inference trajectories from the SFT model on each data point.
- Use LLMs to judge whether each turn contains erroneous FCs.
- Provide the erroneous FCs as misleading [Hint] inputs to the SFT model to regenerate negative trajectories.
- Form positive/negative trajectory pairs for mDPO training.
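A minimal sketch of how hinted prompts could yield a preference pair is given below. The hint layout, the `PreferencePair` container, and the `teacher` / `student` callables are hypothetical. Storing the query without the hint reflects the usual reading of context distillation (the student is trained without the extra context) and is an assumption about the pipeline, not a confirmed detail.

```python
from dataclasses import dataclass
from typing import Callable, List

HINT_TAG = "[Hint]"


def with_hint(query: str, fcs: List[str]) -> str:
    """Append function calls to the query as a hint (layout is assumed)."""
    return f"{query}\n{HINT_TAG} " + "; ".join(fcs)


@dataclass
class PreferencePair:
    query: str     # stored WITHOUT the hint (assumption, see lead-in)
    chosen: str    # teacher output produced with the reference FCs as hint
    rejected: str  # SFT-model output produced with erroneous FCs as hint


def build_pair(
    query: str,
    reference_fcs: List[str],
    erroneous_fcs: List[str],
    teacher: Callable[[str], str],
    student: Callable[[str], str],
) -> PreferencePair:
    chosen = teacher(with_hint(query, reference_fcs))
    rejected = student(with_hint(query, erroneous_fcs))
    return PreferencePair(query=query, chosen=chosen, rejected=rejected)


if __name__ == "__main__":
    # Stub model that simply echoes the hinted calls; stands in for an LLM.
    echo = lambda prompt: prompt.split(HINT_TAG, 1)[1].strip()
    pair = build_pair(
        "How far is it in km?",
        ["get_distance(a, b)", "convert_unit(x, 'km')"],
        ["get_distance(a, b)"],  # erroneous: unit conversion is missing
        teacher=echo,
        student=echo,
    )
    print(pair)
```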

Loss & Training¶

\[\mathcal{L}(x; \tau_w, \tau_l) = \mathcal{L}_{\text{SFT}}(x; \tau_w) + \lambda \cdot \mathcal{L}_{\text{mDPO}}(x; \tau_w, \tau_l)\]

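Here \(\tau_w\) and \(\tau_l\) are the positive and negative trajectories for input \(x\), and \(\lambda\) weights the preference term. mDPO (multi-turn DPO) compares the action of each turn individually against the reference policy, rather than comparing the entire sequence as a whole.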
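The combined objective can be sketched numerically. The per-turn term below is the standard DPO loss applied to each turn and averaged; the summary does not give the exact mDPO formula, so this form, the averaging, and the defaults for `beta` and `lam` are assumptions.

```python
import math
from typing import List, Tuple

# One tuple per turn: log-probabilities of the chosen (w) and rejected (l)
# action under the policy and under the frozen reference model.
TurnLogProbs = Tuple[float, float, float, float]  # (logp_w, ref_w, logp_l, ref_l)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mdpo_loss(turns: List[TurnLogProbs], beta: float = 0.1) -> float:
    """Average of per-turn DPO losses (assumed form of mDPO)."""
    losses = []
    for logp_w, ref_w, logp_l, ref_l in turns:
        margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)


def total_loss(sft_nll: float, turns: List[TurnLogProbs],
               lam: float = 1.0, beta: float = 0.1) -> float:
    """L = L_SFT + lambda * L_mDPO, matching the formula above."""
    return sft_nll + lam * mdpo_loss(turns, beta)


if __name__ == "__main__":
    turns = [(-1.0, -1.5, -2.0, -1.8), (-0.7, -0.9, -1.1, -1.0)]
    print(total_loss(sft_nll=1.2, turns=turns, lam=0.5))
```

When the policy equals the reference on every turn, each margin is zero and the preference term equals \(\log 2\), which is a quick sanity check for any implementation of this form.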

Data Statistics¶

Category	SFT Count	mDPO Count
Single-turn	20,000	1,556
Multi-turn	7,800	2,250
Irrelevant	6,200	750
Avg. Turns (Multi-turn)	4.71	5.22
Avg. FCs (Multi-turn)	15.13	14.98

The total training set consists of 38,556 samples (34,000 for SFT plus 4,556 for mDPO), which is about 64% of the size of APIGen (60K) and 57% of Hammer (67.5K).

Experiments¶

Main Results¶

Model	Overall	Single-turn	Multi-turn	Irrelevance
watt-tool-70B	74.31	—	58.75	76.32
GPT-4o (Prompt)	72.08	—	47.62	83.76
GPT-4o (FC)	69.58	—	41.00	83.15
o1	66.73	—	28.25	89.62
Gemini-1.5-pro-002	62.19	—	20.75	78.15
Magnet-14B-mDPO	68.01	—	37.88	84.78
Magnet-14B-SFT	66.83	—	33.38	82.59
Magnet-7B-mDPO	64.64	—	27.75	78.51
Qwen2.5-Coder-14B (base)	51.88	—	5.38	44.58
Qwen2.5-Coder-7B (base)	53.13	—	8.25	65.39

Key Findings:
- Magnet-14B-mDPO ranks 4th on BFCL-v3, outperforming o1 and the teacher model Gemini-1.5-pro-002.
- Multi-turn scenarios: 14B improves from the base model's 5.38 \(\rightarrow\) 37.88 (+32.5), and 7B improves from 8.25 \(\rightarrow\) 27.75 (+19.5).
- All Magnet models outperform the teacher model Gemini-1.5-pro-002 (20.75) in multi-turn scenarios.
- Compared with pure SFT, mDPO improves performance in multi-turn scenarios by about 2.5 to 4.5 percentage points.

ToolQuery Results¶

Model	Success Rate	Progress Rate
Magnet-14B-mDPO	73.3	78.7
Gemini-1.5-pro-002	68.3	74.6
GPT-4o	63.3	80.1
Magnet-7B-mDPO	67.7	73.4
Qwen-Coder-14B	51.7	68.7

Magnet-14B-mDPO achieves the highest success rate on ToolQuery (73.3), while GPT-4o retains the highest progress rate (80.1).

Ablation Study¶

Component	Overall	Multi-turn
Initial Graph (No Node Operations)	58.54	12.75
+Merge	60.83	20.63
+Merge+Insert	64.39	29.25
+All Operations (SFT)	66.83	33.38
-Positive Trajectory Context Distillation	60.26	18.88
SFT+mDPO	68.01	37.88
-Negative Trajectory Context Distillation	67.35	36.25

Key Findings:
1. Each node operation brings significant improvement: Merge (+7.9 on multi-turn) \(\rightarrow\) Insert (+8.6) \(\rightarrow\) Split (+4.1).
2. Positive trajectory context distillation contributes enormously: removing it drops the multi-turn performance from 33.38 to 18.88 (-14.5).
3. Negative trajectory distillation contributes to mDPO, but to a lesser extent (-1.63 on multi-turn).
4. Data synthesis is generalizable to different base models and self-training scenarios.
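The step-wise gains quoted in these findings follow directly from the multi-turn column of the ablation table. The short check below recomputes them; the helper name `stepwise_gains` is illustrative only.

```python
# Multi-turn scores copied from the ablation table above.
steps = [
    ("Initial graph", 12.75),
    ("+Merge", 20.63),
    ("+Merge+Insert", 29.25),
    ("+All operations (SFT)", 33.38),
]


def stepwise_gains(rows):
    """Gain of each row over the row before it, rounded to two decimals."""
    return {
        name: round(score - prev, 2)
        for (_, prev), (name, score) in zip(rows, rows[1:])
    }


gains = stepwise_gains(steps)

# Relative drop when positive-trajectory context distillation is removed.
relative_drop = (33.38 - 18.88) / 33.38

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.2f}")
    print(f"relative drop without positive hints: {relative_drop:.1%}")
```

The relative drop comes out at roughly 43%, which is the basis for describing the multi-turn score as nearly halved.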

Data Source Comparison¶

The 7B model trained on APIGen+ToolAce struggles with a multi-turn score of only 7.13, which is far lower than Magnet-7B's 26.50.
Including irrelevant category data improves the overall score (57.24) but slightly decreases multi-turn performance.

Highlights & Insights¶

Novel Graph-based Perspective: Abstracting multi-turn FC challenges as graph node operations (Insert/Merge/Split) provides an elegant, structured approach.
Achieving More with Less: Magnet reaches 4th rank with roughly 38.6K training samples, noticeably fewer than APIGen (60K) or Hammer (67.5K).
Student Outperforming the Teacher: All Magnet models outperform the teacher Gemini-1.5-pro-002 in multi-turn scenarios, indicating that the synthesis pipeline introduces beneficial extra supervision.
Crucial Contribution of Positive Prompt Distillation: Ablation reveals that removing positive trajectory hints almost halves multi-turn performance.
Multi-turn FC Remains an Active Challenge: Even Magnet-14B-mDPO only reaches a 37.88% success rate in multi-turn scenarios, and the strongest model (watt-tool-70B) only achieves 58.75%.

Limitations & Future Work¶

Reliance on the quality of the teacher model (Gemini-1.5-pro-002) — although the student outperforms the teacher, the initial data quality remains constrained by the teacher.
The function pool is mainly derived from StableToolBench and BFCL-v3; the API variety and complexity might not fully represent real-world scenarios.
Random walk sampling might miss certain complex function combination patterns.
Negative trajectory generation depends on mistakes made by the SFT model itself — if the SFT model is too weak or too strong, the quality of negative samples may suffer.
Experiments are only validated on the Qwen2.5-Coder series.

Related Work¶

FC Evaluation Benchmarks: BFCL-v3 (Comprehensive), ToolBench / StableToolBench (Multi-step), ToolQuery (Multi-turn multi-step)
FC Training Data Synthesis: Toolformer (Inserting APIs into texts) \(\rightarrow\) APIGen/xLAM (Unified format + automated query generation) \(\rightarrow\) Hammer (Function masking + irrelevant functions) \(\rightarrow\) Magnet (Graph structure + node operations + contrastive distillation)
Preference Optimization: DPO \(\rightarrow\) mDPO (Multi-turn extension)

Rating¶

⭐⭐⭐⭐ (4/5)

Novelty: ⭐⭐⭐⭐ - The combination of a graph perspective, node operations, and contrastive distillation is novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ - Covers two benchmarks, multiple model scales, detailed ablations, and data source comparisons.
Writing Quality: ⭐⭐⭐⭐ - Clear pipeline explanation with consistent notation.
Value: ⭐⭐⭐⭐⭐ - A practical data synthesis pipeline whose 14B student outperforms o1 and the teacher Gemini-1.5-pro-002 on BFCL-v3.
Limitations: The absolute success rate of multi-turn FC is still not high enough, leaving a gap for practical deployment.