Scaling Generalist Data-Analytic Agents¶
Conference: ICLR 2026 · arXiv: 2509.25084 · Code: GitHub · Area: LLM Reasoning · Keywords: Data-Analytic Agent, Agent Training, Multi-turn Code Execution, Data Synthesis, SFT+RL
TL;DR¶
This paper proposes DataMind, a complete training framework for data-analytic agents. It synthesizes diverse queries via a fine-grained task taxonomy with recursive difficulty composition, ensures data quality through knowledge-augmented trajectory sampling and self-consistency filtering, trains with a dynamically weighted mixed SFT+RL objective, and implements a memory-efficient asynchronous rollout framework. The resulting DataMind-14B achieves a 71.16% average score across multiple benchmarks, establishing a new state of the art and surpassing GPT-5 and DeepSeek-V3.1.
Background & Motivation¶
Background: Data-analytic agents discover actionable insights by generating code to process, model, and compute over data, serving as key catalysts for AI-driven scientific discovery and automated decision support. Existing data-analytic agents (DS-Agent, AutoKaggle, Data Interpreter, etc.) rely almost entirely on closed-source models through prompt engineering and multi-agent scaffolding.
Limitations of Prior Work:

- Insufficient training data: Publicly available data analysis benchmarks provide only limited test sets without step-by-step trajectory annotations, making them unsuitable for direct training use.
- Unclear training strategy: How to allocate steps and maintain stability under the conventional SFT-then-RL paradigm for long-horizon agent training remains an open question.
- Unstable multi-turn code execution: Data files and code interpreters involve complex memory management; parallel agent rollout combined with multi-turn code generation frequently crashes under limited memory resources.
- Capability gap in open-source models: The few existing open-source trained models (TableLLM, Table-R1) handle only simple table understanding tasks and fail when faced with large-scale data files in diverse formats and long-horizon multi-step reasoning.
Key Challenge: High-quality training demands large-scale diverse trajectory data, a stable training strategy, and reliable environment interaction — yet all three present unique challenges in data analysis settings, where task formats are heterogeneous (csv/xlsx/sqlite), reasoning chains are long, and code execution carries side effects.
Goal: The paper proposes DataMind, an end-to-end scalable data synthesis and agent training framework that systematically addresses the three challenges above.
Method¶
Overall Architecture¶
The DataMind pipeline consists of four key modules:

1. Data Synthesis: File collection → task taxonomy → query synthesis → trajectory sampling & filtering → DataMind-12K
2. Training Strategy: Dynamically weighted joint optimization of the SFT loss and the DAPO (RL) loss
3. Rollout Engineering: Asynchronous interaction + chunked code maintenance + secure sandbox
4. Reward Design: Format reward + answer reward (model-as-judge) + length penalty
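The reward design in module 4 can be sketched as a simple composite function. The component weights and the penalty form below are illustrative assumptions, not values from the paper; only the three components (format, judged answer, length) come from the source.

```python
def composite_reward(format_ok: bool, answer_correct: bool,
                     answer_len: int, max_len: int = 1024) -> float:
    """Illustrative composite reward: format + judged answer + length penalty.
    Weights (0.1 / 1.0) and the linear penalty are assumptions for the sketch."""
    r = 0.0
    if format_ok:        # format reward: output follows the required template
        r += 0.1
    if answer_correct:   # answer reward: verdict from a model-as-judge
        r += 1.0
    if answer_len > max_len:  # length penalty for overlong final answers
        r -= 0.1 * (answer_len - max_len) / max_len
    return r
```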
Agents follow the ReAct paradigm: Thought → Action (Python/SQL code) → Observation (execution feedback), with a maximum of \(\mathcal{T}=10\) turns.
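The ReAct loop above can be sketched as follows; `generate` and `execute` are hypothetical stand-ins for the policy model and the sandboxed interpreter, and the action schema is an assumption for illustration.

```python
MAX_TURNS = 10  # T = 10 in the paper

def react_loop(query, generate, execute):
    """ReAct: alternate Thought/Action (code) and Observation (execution)."""
    history = [("user", query)]
    for _ in range(MAX_TURNS):
        # model emits a reasoning step plus either code or a final answer
        thought, action = generate(history)
        if action["type"] == "answer":
            return action["content"]
        observation = execute(action["content"])  # run Python/SQL in the sandbox
        history.append(("assistant", thought))
        history.append(("env", observation))
    return None  # turn budget exhausted without a final answer
```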
Key Design 1: Fine-Grained Task Taxonomy and Recursive Difficulty Composition¶
- Data analysis tasks are categorized into 18 fine-grained classes (data cleaning, descriptive statistics, correlation analysis, time-series analysis, anomaly detection, etc.), each accompanied by 4–6 exemplar queries as few-shot demonstrations.
- Recursive easy-to-hard composition: The output of one task type is used as input to the next, iterated 2–5 times to progressively escalate difficulty and create multi-hop analytical challenges that far exceed the demands of any single task type.
- Data files are sourced from Kaggle (3,400 csv + 560 xlsx) and BIRD/OmniSQL (1,954 sqlite).
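The recursive easy-to-hard composition can be sketched as chaining task templates, where each hop consumes the previous query's result. The phrasing template below is a hypothetical illustration, not the paper's actual synthesis prompt, and only 5 of the 18 task categories are listed.

```python
import random

TASK_TYPES = ["data cleaning", "descriptive statistics",
              "correlation analysis", "time-series analysis",
              "anomaly detection"]  # 5 of the paper's 18 categories

def compose_query(base_query: str, depth: int) -> str:
    """Recursively chain task types: each hop takes the previous
    result as input, escalating difficulty into a multi-hop query."""
    query = base_query
    for _ in range(depth):
        next_task = random.choice(TASK_TYPES)
        query = f"Using the result of [{query}], perform {next_task}."
    return query

# the paper iterates 2-5 times; depth=3 here is just an example
hard_query = compose_query("load sales.csv and report column types", depth=3)
```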
Key Design 2: Knowledge-Augmented Trajectory Sampling and Self-Consistency Filtering¶
Trajectory sampling employs a two-level quality assurance scheme:
Level 1 — Knowledge-Augmented Sampling:

- High-level workflow knowledge \(k\) is manually designed for each task category to guide an expert model (DeepSeek-V3.1) in generating trajectories.
- \(\mathcal{N}=3\) independent trajectories are sampled per query.
Level 2 — Self-Consistency Filtering:

- A judge model (GPT-4o-mini) verifies whether the final answers across the \(\mathcal{N}\) trajectories are consistent.
- Among consistent trajectories, the most concise and accurate one is selected as the training instance.
- Inconsistent trajectories are returned to the agent along with the judge's chain-of-thought feedback for reflective revision, followed by re-filtering.
- Rule-based filtering additionally enforces format compliance, length constraints (answer < 1,024 tokens), and linguistic integrity.
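The Level-2 filter can be sketched as a majority vote over final answers. For illustration, agreement is exact string match and "most concise" is shortest step count; the paper instead uses a judge model for both checks.

```python
from collections import Counter

def self_consistency_filter(trajectories, min_agree: int = 2):
    """Keep a trajectory only if enough of the N=3 sampled answers agree.
    Exact-match agreement stands in for the paper's judge model."""
    answers = [t["answer"] for t in trajectories]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count < min_agree:
        return None  # inconsistent: send back for reflective revision
    consistent = [t for t in trajectories if t["answer"] == top_answer]
    # pick the most concise consistent trajectory as the training instance
    return min(consistent, key=lambda t: len(t["steps"]))
```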
The final dataset is DataMind-12K (11,707 high-quality trajectories).
Key Design 3: Dynamic Mixed SFT+RL Training¶
The conventional SFT-then-RL paradigm presents a dilemma: excessive SFT rigidifies reasoning patterns and suppresses RL exploration; premature RL leaves the model incapable of generating effective rollout groups.
This work instead jointly optimizes a dynamically weighted combination of the two losses:

- The SFT-loss weight \(\gamma\) is dynamically scheduled: it is initialized at a large value (0.9) to enable knowledge absorption from expert data, then gradually annealed to a small value (0.05) to encourage exploration.
- The SFT loss is computed only on tokens generated by the agent, masking environment feedback tokens.
- The DAPO algorithm is used for RL, employing decoupled clipping and dynamic sampling.
- Training begins with a cold start on DataMind-12K.
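The joint objective and the \(\gamma\) schedule can be sketched as below. The endpoints (0.9 → 0.05) come from the paper; the linear annealing shape and per-step granularity are assumptions.

```python
def gamma_schedule(step: int, total_steps: int,
                   gamma_start: float = 0.9, gamma_end: float = 0.05) -> float:
    """Anneal the SFT weight gamma from 0.9 toward 0.05.
    The linear shape is an assumption; the paper's endpoints are used."""
    frac = min(step / total_steps, 1.0)
    return gamma_start + frac * (gamma_end - gamma_start)

def joint_loss(loss_sft: float, loss_rl: float, gamma: float) -> float:
    """Dynamically weighted joint objective over the SFT and DAPO losses."""
    return gamma * loss_sft + (1.0 - gamma) * loss_rl
```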
Void Turns Filtering: Loss is masked for entire trajectories containing invalid turns (i.e., turns that produce neither effective code nor an answer), preventing trajectory collapse caused by distributional drift.
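Void-turn filtering might be implemented as a trajectory-level loss mask, roughly as follows; the turn representation and the void heuristic are assumptions for the sketch.

```python
def is_void_turn(turn: dict) -> bool:
    """A turn is void if it yields neither runnable code nor a final answer."""
    return not turn.get("code") and not turn.get("answer")

def loss_mask(trajectory):
    """Mask the WHOLE trajectory (all zeros) if any turn is void, so a
    rollout drifting off-distribution contributes no gradient at all."""
    if any(is_void_turn(t) for t in trajectory):
        return [0] * len(trajectory)
    return [1] * len(trajectory)
```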
Key Design 4: Asynchronous Multi-Turn Rollout Engineering¶
- Asynchronous interaction: Model generation and code execution are decoupled across different samples, preventing simultaneous GPU and CPU memory spikes.
- Chunked code maintenance: Following a notebook-style paradigm, each step generates only the current code snippet; prior snippets are concatenated at execution time, avoiding the memory overhead of maintaining a global variable pool.
- Safety controls: Each trajectory runs in an isolated environment with CPU time and peak memory limits, and unsafe function calls are filtered.
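The chunked (notebook-style) execution can be sketched as re-running the concatenated snippet history in a fresh namespace at each step, rather than keeping a live variable pool. The bare `exec` below is purely illustrative; the paper executes inside an isolated, resource-limited sandbox.

```python
def run_step(snippets, new_snippet):
    """Notebook-style chunked execution: concatenate all prior snippets with
    the new one and run them fresh, avoiding a persistent variable pool."""
    program = "\n".join(snippets + [new_snippet])
    namespace = {}
    try:
        exec(program, namespace)  # illustrative only; no sandboxing here
        # `_result` is a hypothetical convention for surfacing an observation
        observation = str(namespace.get("_result", "ok"))
        return snippets + [new_snippet], observation
    except Exception as e:
        return snippets, f"Error: {e}"  # a failing snippet is not kept
```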
Key Experimental Results¶
Main Results: Multi-Benchmark Performance Comparison¶
| Model Type | Method | DABench pass@1 | TableBench pass@1 | BIRD pass@1 | Avg pass@1 |
|---|---|---|---|---|---|
| Closed-source | GPT-4o | 76.39 | 64.97 | 50.20 | 63.85 |
| Closed-source | o4-mini | 79.12 | 71.03 | 57.04 | 69.06 |
| Closed-source | DeepSeek-R1 | 78.73 | 68.96 | 55.80 | 67.83 |
| Closed-source | DeepSeek-V3.1 | 81.32 | 72.52 | 57.89 | 70.58 |
| Closed-source | GPT-5 | 78.21 | 69.93 | 60.17 | 69.44 |
| Open-source-7B | ReAct (Qwen-Coder-7B) | 15.05 | 11.70 | 7.02 | 11.26 |
| Open-source-7B | TableLLM | 36.71 | 41.01 | 11.99 | 29.90 |
| Open-source-7B | Table-R1 | 42.54 | 56.36 | 10.69 | 36.53 |
| Open-source-7B | DataMind-7B | 77.30 | 67.60 | 59.41 | 68.10 |
| Open-source-14B | ReAct (Qwen-Coder-14B) | 71.21 | 56.96 | 41.76 | 56.64 |
| Open-source-14B | TableLLM | 38.26 | 46.44 | 20.99 | 35.23 |
| Open-source-14B | DataMind-14B | 80.29 | 70.95 | 62.23 | 71.16 |
Key findings:

- DataMind-14B achieves a 71.16% average score, surpassing all closed-source models (including GPT-5 at 69.44% and DeepSeek-V3.1 at 70.58%).
- DataMind-7B achieves 68.10%, the best among all open-source models.
- Specialized models (OmniSQL/SQL-R1) remain competitive on BIRD but exhibit sharp performance drops on other benchmarks.
- DataMind is trained on only 12K samples, far fewer than baselines (TableLLM: 20K; OmniSQL: 2.5M).
Ablation Study: Training Strategy Comparison¶
| Training Strategy | Avg pass@1 | Avg pass@3 |
|---|---|---|
| SFT only | 62.54 | 73.74 |
| zero-RL (no SFT) | 58.03 | 71.72 |
| SFT-then-RL | 63.42 | 75.46 |
| SFT-and-RL (dynamic \(\gamma\)) | 68.10 | 79.07 |
Key insights:

- SFT alone lifts the ReAct baseline from 11.26% to 62.54% — data quality accounts for the majority of the performance gain.
- zero-RL underperforms SFT alone — the 7B model's limited multi-step reasoning capacity prevents it from independently producing high-quality rollout trajectories.
- SFT-then-RL yields only a marginal improvement and is prone to training instability.
- The dynamic mixed strategy adds a further 5.56 points by balancing knowledge absorption with exploration.
Data and Filtering Analysis¶
| Filtering Strategy | Effect |
|---|---|
| Con-select (self-consistency + best selection) | Baseline setting |
| Non-select (retain all consistent trajectories) | Superior on DABench — trajectory diversity is more beneficial |
| Random-select (randomly choose a consistent trajectory) | Comparable to con-select — judge preference may reduce diversity |
| Non-con (no consistency filtering) | Significant degradation across all metrics — answer quality is the critical guarantee for trajectory quality |
Core finding: Self-consistency filtering is more critical than best-trajectory selection — answer correctness ensures the intrinsic quality of trajectories, and diverse reasoning paths are more beneficial for model learning than a single "best" path.
Highlights & Insights¶
Strengths¶
- End-to-end engineering completeness: The system encompasses data synthesis, training strategy, and rollout engineering, with independent innovations at each stage.
- Deep insights: Findings such as the SFT loss serving simultaneously as a stabilizer and a potential source of collapse during RL training, and self-consistency filtering being more important than best-trajectory selection, carry strong practical guidance.
- Compelling results: A 14B model trained on only 12K samples outperforms GPT-5 and specialized models trained on 2.5M samples.
- Training dynamics analysis: The "raising a child" analogy provides an intuitive explanation of the dynamic weight scheduling from SFT to RL.
Limitations & Future Work¶
- Evaluation relies on GPT-4o-mini as the judge; using the same judge for both training and evaluation introduces potential bias (though cross-validation reports a Pearson correlation of 0.96).
- The 18-category task taxonomy is manually designed; category boundaries and coverage may have omissions.
- Experiments are conducted only at the 7B and 14B scales; behavior at larger or smaller model sizes remains unknown.
- The chunked code maintenance strategy may become less efficient in scenarios with long dependency chains (e.g., cross-turn variable references).
Rating¶
⭐⭐⭐⭐⭐ — An exemplary systematic engineering contribution: the problem is clearly defined, the solution is comprehensive, experiments are rigorous, and insights are deep. This work offers strong reference value to the agent training community.