
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Conference: ICLR 2026 | arXiv: 2602.24288 | Code: https://github.com/Snowflake-Labs/dare-bench | Area: LLM Evaluation | Keywords: data science benchmark, instruction following, ML modeling, RLVR, LLM agent

TL;DR

DARE-bench is a large-scale verifiable benchmark for data science tasks, comprising 6,300 Kaggle-derived tasks that support evaluation across two dimensions—ML modeling and instruction following—along with training data for SFT and RL. SFT improves Qwen3-32B by 1.83×, while RL improves Qwen3-4B by more than 8×.

Background & Motivation

Background: LLMs are increasingly deployed as data science agents (data loading, transformation, and modeling), yet existing benchmarks (DS-1000, DSBench, MLE-bench, etc.) suffer from significant shortcomings: most evaluate only final answer accuracy while ignoring process fidelity; verifiable ground truth is absent; task scales are small (hundreds of instances); and no training data is provided.

Limitations of Prior Work: (a) Lack of standardized process-aware evaluation—whether the agent truly follows the specified DS pipeline; (b) scarcity of accurately annotated training data, limiting applicability of SFT and RL; (c) existing benchmarks are predominantly sourced from Kaggle competitions, resulting in narrow domain coverage (e.g., missing time-series forecasting).

Key Challenge: Evaluating process fidelity requires deterministic ground truth, yet DS tasks are inherently stochastic and environment-dependent. How can process evaluation be made verifiable?

Goal: (a) Construct a large-scale, verifiable, training-ready DS benchmark; (b) cover two complementary capabilities—instruction following and ML modeling; (c) support RLVR (reinforcement learning with verifiable rewards) training.

Key Insight: The high reproducibility of data science workflows is leveraged—by controlling random seeds and providing explicit instructions, faithful execution of a process yields deterministic outputs, enabling outcome-based verification of process fidelity.

Core Idea: Through engineered determinism (fixed seeds + sandboxed execution + reference solutions + verifiable ground truth), DS process evaluation is transformed into automatically verifiable, outcome-based assessment.
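The determinism argument can be sketched in a few lines. The pipeline below is a hypothetical stand-in for a reference solution (a seeded split plus least squares rather than a real model); it shows that once the seed is fixed, faithfully re-executing the same process reproduces the output exactly, so process fidelity can be checked by comparing outputs:

```python
import numpy as np

def reference_pipeline(X, y, seed=42):
    """Hypothetical reference solution: seeded train/test split + least-squares fit.
    With the seed fixed, faithful re-execution yields identical predictions."""
    rng = np.random.default_rng(seed)          # all randomness flows from this seed
    idx = rng.permutation(len(X))              # deterministic shuffle
    cut = int(0.8 * len(X))
    train, test = idx[:cut], idx[cut:]
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return X[test] @ w                         # deterministic given (X, y, seed)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

ref = reference_pipeline(X, y, seed=42)
rerun = reference_pipeline(X, y, seed=42)
assert np.array_equal(ref, rerun)  # exact match => the process was followed faithfully
```

A deviation from the prescribed process (here, a different seed) changes the output, which is exactly what makes the IF evaluation outcome-based.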

Method

Overall Architecture

DARE-bench encompasses three task families (classification, regression, and time-series forecasting), each with two variants. The evaluation pipeline proceeds as follows: given a natural-language problem and structured input files, an LLM generates and executes code within a sandbox to produce predictions, which are then automatically scored against the ground truth. Dataset construction follows a four-stage automated pipeline: data sourcing → LLM-assisted task design → post-processing → sandbox validation.
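A minimal sketch of the execute-and-score loop described above, assuming the agent's generated code is written to a file and expected to emit a predictions file (the filenames, timeout, and isolation level here are illustrative assumptions, not the paper's exact sandbox protocol):

```python
import csv
import pathlib
import subprocess
import sys

def run_in_sandbox(agent_code: str, workdir: str, timeout: int = 300):
    """Run agent-generated code as an isolated subprocess in its own working
    directory, then read the predictions file it was instructed to write.
    'solution.py' and 'predictions.csv' are assumed names for illustration."""
    wd = pathlib.Path(workdir)
    (wd / "solution.py").write_text(agent_code)
    proc = subprocess.run([sys.executable, "solution.py"],
                          cwd=wd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        return None, proc.stderr        # execution failure -> no score
    with open(wd / "predictions.csv") as f:
        preds = [row[0] for row in csv.reader(f)]
    return preds, None                  # predictions go to the automatic scorer
```

The returned predictions would then be scored against the task's ground truth; a real sandbox would additionally restrict network and filesystem access.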

Key Designs

  1. Dual-Variant Task Design (IF + MM / XF + CF):

    • Function: Each dataset yields two complementary evaluation tasks.
    • IF (Instruction Following): Requires the LLM to faithfully reproduce a reference workflow (specifying model type, hyperparameters, and preprocessing steps), evaluating process fidelity. Fixing random seeds ensures that faithful execution produces a uniquely determined output.
    • MM (ML Modeling): Imposes no methodological constraints and evaluates only final prediction performance, assessing ML modeling capability.
    • XF/CF (Time-Series): XF retains exogenous features; CF retains only timestamps and entity columns (the classical forecasting setting).
    • Design Motivation: IF simulates the scenario of "strictly executing a data scientist's prescribed pipeline," while MM simulates "the client cares only about final performance." The two variants are complementary and both reflect real-world needs.
  2. Automated Data Construction Pipeline:

    • Function: Automatically constructs standardized ML tasks from Kaggle datasets.
    • Core Procedure: (1) Dataset Sourcing—retrieves Kaggle datasets and metadata via API and web scraping; (2) LLM-Assisted Task Design—LLM determines whether the dataset supports a predictive task and identifies target columns and features; (3) Post-Processing—data splitting, noise injection for IF tasks, resampling and entity validation for time-series tasks; (4) Finalization—sandbox validation to confirm task solvability.
    • Design Motivation: Overcomes the bottleneck of manual annotation; LLMs handle only auxiliary content (descriptions, metadata, rule extraction) and do not generate the training signal itself.
  3. Verifiable Reward Design:

    • Function: Enables DARE-bench to support RLVR training.
    • Mechanism: Ground truth for IF tasks is derived from the execution output of reference solutions under fixed seeds; ground truth for MM tasks comes from original dataset labels. Both are deterministic numerical values amenable to automatic scoring.
    • Design Motivation: This is the critical enabler for RL training—no human judgment or LLM-as-judge is required; verification is purely outcome-based.
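As a sketch of what such outcome-based rewards might look like (the function names and the exact-match tolerance are assumptions, not the paper's implementation):

```python
import numpy as np

def if_reward(pred, ref_output, tol=1e-9):
    """IF tasks: ground truth is the reference solution's output under a fixed
    seed, so the reward checks for an (effectively exact) match."""
    pred = np.asarray(pred, dtype=float)
    ref_output = np.asarray(ref_output, dtype=float)
    return float(pred.shape == ref_output.shape
                 and np.allclose(pred, ref_output, atol=tol))

def mm_reward(pred, labels):
    """MM tasks: ground truth is the original dataset labels, so the reward is
    a task metric (accuracy here, for a classification example)."""
    return float(np.mean(np.asarray(pred) == np.asarray(labels)))
```

Both functions map model outputs to a scalar with no human or LLM judge in the loop, which is the property RLVR training needs.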

Loss & Training

  • SFT: Qwen3-32B and Qwen3-4B are fine-tuned on the DARE-bench training split.
  • RL: The GRPO algorithm is applied with DARE-bench verifiable rewards, without requiring preference data.
  • Evaluation Metrics: accuracy/F1 for classification, R²/RMSE for regression, SMAPE/MAE for time-series forecasting.
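Of these metrics, SMAPE is the least standardized, so a common convention is sketched below for reference (conventions differ on the factor of 2 and on zero handling, and the paper's exact formula is not restated here):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE, one common convention: mean of 2|e| / (|y| + |ŷ|), in percent.
    Pairs where both values are zero contribute a zero error term."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    safe = np.where(denom == 0, 1.0, denom)          # avoid division by zero
    ratio = np.where(denom == 0, 0.0,
                     2.0 * np.abs(y_pred - y_true) / safe)
    return 100.0 * ratio.mean()
```

Under this convention SMAPE is bounded in [0, 200], which makes it convenient as a bounded reward signal.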

Key Experimental Results

Main Results

| Model | Baseline Score | SFT Score | RL Score | Notes |
|---|---|---|---|---|
| gpt-o4-mini | ~45 | – | – | Strongest closed-source model still struggles with ML modeling |
| Qwen3-32B | 23.25 | ~42.5 (1.83×) | – | Substantial SFT improvement |
| Qwen3-4B | 4.39 | ~25 | 37.40 (8.5×) | Remarkable RL gain |

Ablation Study

| Task Type | Key Finding |
|---|---|
| Classification-IF | Failure to follow seed/hyperparameter instructions is the primary cause of failure |
| Classification-MM | Under open-ended modeling, LLMs frequently select suboptimal models |
| Time-series-CF | Most challenging subtask; even strong models perform poorly |
| RL vs. SFT | RL substantially outperforms SFT on small models; SFT suffices for large models |

Key Findings

  • Even gpt-o4-mini struggles on ML modeling tasks, indicating that the data science capabilities of current LLMs are far from mature.
  • Instruction following is the primary bottleneck: models frequently deviate from specified procedures (ignoring designated seeds, altering preprocessing steps, etc.), causing IF task failures.
  • RL yields exceptionally large gains on small models: Qwen3-4B improves from 4.39 to 37.40 (8.5×), demonstrating that verifiable rewards combined with RL are highly effective for enhancing DS agent capabilities.
  • Time-series forecasting is the greatest weakness: all models perform worst on the CF variant, suggesting that LLMs lack deep knowledge of time-series modeling.

Highlights & Insights

  • Transforming process fidelity into outcome-based evaluation: By exploiting the reproducibility of data science workflows, "whether the process is correct" is converted to "whether deterministic outputs match," elegantly resolving the objectivity problem in process evaluation.
  • Unified training and evaluation: Of the 6,300 tasks, 95% form the training split and 5% the test split, genuinely supporting a "train on the benchmark" paradigm without test contamination.
  • Dramatic RL gains on small models: An 8× improvement demonstrates that task-specific RL training can substantially unlock the potential of small models, with significant practical implications for DS agent deployment.
  • Scale far exceeding comparable DS benchmarks: 6,300 tasks is roughly an order of magnitude larger than existing DS benchmarks (e.g., DSBench's 540 and MLE-bench's 75); SWE-bench is larger (~21K tasks) but targets software engineering rather than data science.

Limitations & Future Work

  • Kaggle datasets may not represent the complexity of real-world industrial data science problems.
  • The "determinism" of IF tasks relies on complete control of random seeds; many real-world DS pipelines cannot be fully reproduced.
  • Multi-turn interaction and iterative modeling scenarios are not evaluated—actual DS workflows typically require multiple experimental iterations.
  • The choice of evaluation metrics for time-series tasks (e.g., SMAPE vs. MAE) may affect model rankings.
Comparison with Related Work

  • vs. DSBench/MLE-bench: Larger scale (6,300 vs. 540/75 tasks), includes training data, and covers time-series forecasting.
  • vs. SWE-bench: SWE-bench targets software engineering; DARE-bench targets data science—the two are strongly complementary.
  • vs. RLVR (DeepSeek-R1, etc.): DARE-bench provides a new RLVR domain beyond mathematics and code—verifiable rewards for DS tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ The IF+MM dual-variant evaluation and DS process fidelity verification constitute novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model evaluation with SFT and RL experiments across 6,300 tasks at considerable scale.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed pipeline descriptions.
  • Value: ⭐⭐⭐⭐ Fills an important gap in DS agent training and evaluation; RL training results are particularly compelling.