Scaling Generalist Data-Analytic Agents

Conference: ICLR 2026 | arXiv: 2509.25084 | Code: GitHub | Area: LLM Reasoning | Keywords: Data-Analytic Agent, Agent Training, Multi-turn Code Execution, Data Synthesis, SFT+RL

TL;DR

This paper proposes DataMind, a complete training framework for data-analytic agents. It synthesizes diverse queries via a fine-grained task taxonomy with recursive difficulty composition, ensures data quality through knowledge-augmented trajectory sampling and self-consistency filtering, employs a dynamically weighted SFT+RL mixed training strategy, and implements a memory-efficient asynchronous rollout framework. The resulting DataMind-14B achieves a 71.16% average score across multiple benchmarks, establishing a new state of the art and surpassing GPT-5 and DeepSeek-V3.1.

Background & Motivation

Background: Data-analytic agents discover actionable insights by generating code to process, model, and compute over data, serving as key catalysts for AI-driven scientific discovery and automated decision support. Existing data-analytic agents (DS-Agent, AutoKaggle, Data Interpreter, etc.) rely almost entirely on closed-source models through prompt engineering and multi-agent scaffolding.

Limitations of Prior Work:

  • Insufficient training data: Publicly available data analysis benchmarks provide only limited test sets without step-by-step trajectory annotations, making them unsuitable for direct training use.
  • Unclear training strategy: How to allocate steps and maintain stability under the conventional SFT-then-RL paradigm for long-horizon agent training remains an open question.
  • Unstable multi-turn code execution: Data files and code interpreters involve complex memory management; parallel agent rollout combined with multi-turn code generation frequently crashes under limited memory resources.
  • Capability gap in open-source models: The few existing open-source trained models (TableLLM, Table-R1) handle only simple table understanding tasks and fail when faced with large-scale data files in diverse formats and long-horizon multi-step reasoning.

Key Challenge: High-quality training demands large-scale diverse trajectory data, a stable training strategy, and reliable environment interaction — yet all three present unique challenges in data analysis settings, where task formats are heterogeneous (csv/xlsx/sqlite), reasoning chains are long, and code execution carries side effects.

Goal: The paper proposes DataMind, an end-to-end scalable data synthesis and agent training framework that systematically addresses the three challenges above.

Method

Overall Architecture

The DataMind pipeline consists of four key modules:

  1. Data Synthesis: File collection → task taxonomy → query synthesis → trajectory sampling & filtering → DataMind-12K
  2. Training Strategy: Dynamically weighted joint optimization of SFT loss and DAPO (RL) loss
  3. Rollout Engineering: Asynchronous interaction + chunked code maintenance + secure sandbox
  4. Reward Design: Format reward + answer reward (model-as-judge) + length penalty
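To make the reward design concrete, here is a minimal, heavily hedged sketch of how the three components could combine. The summary does not give the actual weighting, so the gating behavior, coefficients, and the `penalty_weight`/`max_tokens` parameters are all illustrative assumptions, not the paper's formula.

```python
def trajectory_reward(format_ok, judge_score, n_tokens,
                      max_tokens=4096, penalty_weight=0.1):
    """Illustrative combination of format reward, judge-based answer
    reward, and a length penalty. All coefficients are assumptions."""
    if not format_ok:
        return 0.0  # assume the format reward gates everything else
    # penalize only tokens beyond the budget, scaled by the budget
    length_penalty = penalty_weight * max(0, n_tokens - max_tokens) / max_tokens
    return max(0.0, judge_score - length_penalty)
```

A trajectory with a correct answer but a badly overlong response would thus still be rewarded, just less, while a malformed one gets nothing.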

Agents follow the ReAct paradigm: Thought → Action (Python/SQL code) → Observation (execution feedback), with a maximum of \(\mathcal{T}=10\) turns.
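The ReAct loop above can be sketched as follows. `generate_step` and `run_code` are hypothetical stand-ins for the model call and the sandboxed interpreter; only the control flow (Thought → Action → Observation, at most \(\mathcal{T}\) turns) mirrors the paper.

```python
MAX_TURNS = 10  # \mathcal{T} = 10 in the paper

def react_loop(query, generate_step, run_code, max_turns=MAX_TURNS):
    """Minimal ReAct-style agent loop (sketch, not the paper's code)."""
    history = [("user", query)]
    for _ in range(max_turns):
        # model emits a thought plus either a code action or a final answer
        thought, action = generate_step(history)
        history.append(("assistant", (thought, action)))
        if action["type"] == "answer":              # terminal action
            return action["content"], history
        observation = run_code(action["content"])   # execute Python/SQL
        history.append(("env", observation))        # execution feedback
    return None, history                            # turn budget exhausted
```

If the agent never produces an answer action, the loop exhausts its turn budget and returns no answer, matching the hard cap on interaction turns.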

Key Design 1: Fine-Grained Task Taxonomy and Recursive Difficulty Composition

  • Data analysis tasks are categorized into 18 fine-grained classes (data cleaning, descriptive statistics, correlation analysis, time-series analysis, anomaly detection, etc.), each accompanied by 4–6 exemplar queries as few-shot demonstrations.
  • Recursive easy-to-hard composition: The output of one task type is used as input to the next, iterated 2–5 times to progressively escalate difficulty and create multi-hop analytical challenges that far exceed the demands of any single task type.
  • Data files are sourced from Kaggle (3,400 csv + 560 xlsx) and BIRD/OmniSQL (1,954 sqlite).
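The recursive easy-to-hard composition can be sketched as below. In the real pipeline an LLM composes the next query; here `compose` is a placeholder string template, and the four task types are a subset of the 18 classes named for illustration.

```python
import random

# Subset of the 18 fine-grained task classes, for illustration only.
TASK_TYPES = ["data cleaning", "descriptive statistics",
              "correlation analysis", "anomaly detection"]

def compose(prev_query, task_type):
    # Placeholder for the LLM composition call in the real pipeline.
    return f"Using the result of [{prev_query}], perform {task_type}."

def synthesize_query(seed_query, depth, rng=random):
    """Wrap the previous query in one more task type per round
    (2-5 rounds in the paper) to escalate difficulty."""
    query = seed_query
    for _ in range(depth):
        query = compose(query, rng.choice(TASK_TYPES))
    return query
```

Each round makes the previous question a sub-problem of the next, which is how a single-hop query grows into a multi-hop analytical challenge.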

Key Design 2: Knowledge-Augmented Trajectory Sampling and Self-Consistency Filtering

Trajectory sampling employs a two-level quality assurance scheme:

Level 1 — Knowledge-Augmented Sampling:

  • High-level workflow knowledge \(k\) is manually designed for each task category to guide an expert model (DeepSeek-V3.1) in generating trajectories.
  • \(\mathcal{N}=3\) independent trajectories are sampled per query.

Level 2 — Self-Consistency Filtering:

  • A judge model (GPT-4o-mini) verifies whether the final answers across the \(\mathcal{N}\) trajectories are consistent.
  • Among consistent trajectories, the most concise and accurate one is selected as the training instance.
  • Inconsistent trajectories are returned to the agent along with the judge's chain-of-thought feedback for reflective revision, followed by re-filtering.
  • Rule-based filtering enforces format compliance, length constraints (answer < 1,024 tokens), and linguistic integrity.

The final dataset is DataMind-12K (11,707 high-quality trajectories).
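The filtering stage can be sketched under two simplifying assumptions: the judge is reduced to exact-match answer comparison (the paper uses GPT-4o-mini with chain-of-thought feedback), and "most concise and accurate" is reduced to the shortest consistent trajectory.

```python
from collections import Counter

def self_consistency_filter(trajectories, max_answer_tokens=1024):
    """trajectories: list of (steps, final_answer) pairs, N=3 per query.
    Returns one training instance, or None if the query needs revision."""
    answers = [ans for _, ans in trajectories]
    majority, count = Counter(answers).most_common(1)[0]
    if count < 2:  # no consistent majority -> send back for reflective revision
        return None
    consistent = [(steps, ans) for steps, ans in trajectories if ans == majority]
    # rule-based filtering: length constraint on the answer
    consistent = [t for t in consistent if len(t[1].split()) < max_answer_tokens]
    if not consistent:
        return None
    # keep the most concise consistent trajectory (proxy for "best")
    return min(consistent, key=lambda t: len(t[0]))
```

The ablation later in this summary suggests the consistency check itself matters more than which consistent trajectory is kept.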

Key Design 3: Dynamic Mixed SFT+RL Training

The conventional SFT-then-RL paradigm presents a dilemma: excessive SFT rigidifies reasoning patterns and suppresses RL exploration; premature RL leaves the model incapable of generating effective rollout groups.

This work adopts joint optimization:

\[\mathcal{L}_{\text{Final}}(\theta) = \gamma \cdot \mathcal{L}_{\text{SFT}}(\theta) + (1-\gamma) \cdot \mathcal{L}_{\text{DAPO}}(\theta)\]
  • \(\gamma\) is dynamically scheduled: initialized at a large value (0.9) to enable knowledge absorption from expert data, then gradually annealed to a small value (0.05) to encourage exploration.
  • The SFT loss is computed only on tokens generated by the agent, masking environment feedback tokens.
  • The DAPO algorithm is used for RL, employing decoupled clipping and dynamic sampling.
  • Training begins with a cold start on DataMind-12K.

Void Turns Filtering: Loss is masked for entire trajectories containing invalid turns (i.e., turns that produce neither effective code nor an answer), preventing trajectory collapse caused by distributional drift.
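The dynamic weighting and void-turn masking can be sketched as below. The 0.9 → 0.05 endpoints come from the text, but the cosine anneal is an illustrative choice, not the paper's published schedule, and the turn representation is assumed.

```python
import math

def gamma_schedule(step, total_steps, gamma_hi=0.9, gamma_lo=0.05):
    """Anneal gamma from gamma_hi to gamma_lo; cosine shape is illustrative."""
    frac = min(step / total_steps, 1.0)
    return gamma_lo + 0.5 * (gamma_hi - gamma_lo) * (1 + math.cos(math.pi * frac))

def mixed_loss(sft_loss, dapo_loss, step, total_steps):
    """L_final = gamma * L_SFT + (1 - gamma) * L_DAPO."""
    g = gamma_schedule(step, total_steps)
    return g * sft_loss + (1 - g) * dapo_loss

def trajectory_mask(turns):
    """Void-turn filtering: mask the whole trajectory if any turn
    yields neither effective code nor an answer."""
    void = any(not t.get("code") and not t.get("answer") for t in turns)
    return 0.0 if void else 1.0
```

Early in training the loss is dominated by imitation of expert trajectories; late in training it is dominated by the RL objective, matching the knowledge-absorption-then-exploration narrative.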

Key Design 4: Asynchronous Multi-Turn Rollout Engineering

  • Asynchronous interaction: Model generation and code execution are decoupled across different samples, preventing simultaneous GPU and CPU memory spikes.
  • Chunked code maintenance: Following a notebook-style paradigm, each step generates only the current code snippet; prior snippets are concatenated at execution time, avoiding the memory overhead of maintaining a global variable pool.
  • Safety controls: Each trajectory runs in an isolated environment with CPU time and peak memory limits, and unsafe function calls are filtered.
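The chunked code maintenance idea can be sketched as follows: each turn appends a snippet, and execution re-runs the concatenation of all snippets in a fresh subprocess, so no persistent interpreter state (global variable pool) is held between turns. The resource limits and unsafe-call filtering from the paper are omitted here.

```python
import subprocess
import sys
import tempfile

def run_chunks(snippets, timeout=10):
    """Notebook-style execution sketch: concatenate all prior code
    snippets and run them as one program in a fresh subprocess."""
    program = "\n".join(snippets)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout, result.stderr
```

The trade-off noted in the limitations section is visible here: re-running the full concatenation keeps memory flat but repeats earlier work, which can hurt when dependency chains across turns are long.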

Key Experimental Results

Main Results: Multi-Benchmark Performance Comparison

| Model Type | Method | DABench pass@1 | TableBench pass@1 | BIRD pass@1 | Avg pass@1 |
|---|---|---|---|---|---|
| Closed-source | GPT-4o | 76.39 | 64.97 | 50.20 | 63.85 |
| Closed-source | o4-mini | 79.12 | 71.03 | 57.04 | 69.06 |
| Closed-source | DeepSeek-R1 | 78.73 | 68.96 | 55.80 | 67.83 |
| Closed-source | DeepSeek-V3.1 | 81.32 | 72.52 | 57.89 | 70.58 |
| Closed-source | GPT-5 | 78.21 | 69.93 | 60.17 | 69.44 |
| Open-source-7B | ReAct (Qwen-Coder-7B) | 15.05 | 11.70 | 7.02 | 11.26 |
| Open-source-7B | TableLLM | 36.71 | 41.01 | 11.99 | 29.90 |
| Open-source-7B | Table-R1 | 42.54 | 56.36 | 10.69 | 36.53 |
| Open-source-7B | DataMind-7B | 77.30 | 67.60 | 59.41 | 68.10 |
| Open-source-14B | ReAct (Qwen-Coder-14B) | 71.21 | 56.96 | 41.76 | 56.64 |
| Open-source-14B | TableLLM | 38.26 | 46.44 | 20.99 | 35.23 |
| Open-source-14B | DataMind-14B | 80.29 | 70.95 | 62.23 | 71.16 |

Key findings:

  • DataMind-14B achieves a 71.16% average score, surpassing all closed-source models (including GPT-5 at 69.44% and DeepSeek-V3.1 at 70.58%).
  • DataMind-7B achieves 68.10%, the best among all open-source models.
  • Specialized models (OmniSQL/SQL-R1) remain competitive on BIRD but exhibit sharp performance drops on other benchmarks.
  • DataMind is trained on only 12K samples, far fewer than baselines (TableLLM: 20K; OmniSQL: 2.5M).

Ablation Study: Training Strategy Comparison

| Training Strategy | Avg pass@1 | Avg pass@3 |
|---|---|---|
| SFT only | 62.54 | 73.74 |
| zero-RL (no SFT) | 58.03 | 71.72 |
| SFT-then-RL | 63.42 | 75.46 |
| SFT-and-RL (dynamic \(\gamma\)) | 68.10 | 79.07 |

Key insights:

  • SFT alone improves the baseline from 11.26% to 62.54% — data quality accounts for the majority of the performance gain.
  • zero-RL underperforms SFT alone — the 7B model's limited multi-step reasoning capacity prevents it from independently producing high-quality rollout trajectories.
  • SFT-then-RL yields only marginal improvement and is prone to training instability.
  • The dynamic mixed strategy yields an additional 5.56-point improvement by balancing knowledge absorption with exploration.

Data and Filtering Analysis

| Filtering Strategy | Effect |
|---|---|
| Con-select (self-consistency + best selection) | Baseline setting |
| Non-select (retain all consistent trajectories) | Superior on DABench — trajectory diversity is more beneficial |
| Random-select (randomly choose a consistent trajectory) | Comparable to con-select — judge preference may reduce diversity |
| Non-con (no consistency filtering) | Significant degradation across all metrics — answer quality is the critical guarantee for trajectory quality |

Core finding: Self-consistency filtering is more critical than best-trajectory selection — answer correctness ensures the intrinsic quality of trajectories, and diverse reasoning paths are more beneficial for model learning than a single "best" path.

Highlights & Insights

Strengths

  • End-to-end engineering completeness: The system encompasses data synthesis, training strategy, and rollout engineering, with independent innovations at each stage.
  • Deep insights: Findings such as the SFT loss serving simultaneously as a stabilizer and a potential source of collapse during RL training, and self-consistency filtering being more important than best-trajectory selection, carry strong practical guidance.
  • Compelling results: A 14B model trained on only 12K samples outperforms GPT-5 and specialized models trained on 2.5M samples.
  • Training dynamics analysis: The "raising a child" analogy provides an intuitive explanation of the dynamic weight scheduling from SFT to RL.

Limitations & Future Work

  • Evaluation relies on GPT-4o-mini as the judge; using the same judge for both training and evaluation introduces potential bias (though cross-validation reports a Pearson correlation of 0.96).
  • The 18-category task taxonomy is manually designed; category boundaries and coverage may have omissions.
  • Experiments are conducted only at the 7B and 14B scales; behavior at larger or smaller model sizes remains unknown.
  • The chunked code maintenance strategy may become less efficient in scenarios with long dependency chains (e.g., cross-turn variable references).

Rating

⭐⭐⭐⭐⭐ — An exemplary systematic engineering contribution: the problem is clearly defined, the solution is comprehensive, experiments are rigorous, and insights are deep. This work offers strong reference value to the agent training community.