DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Conference: ICLR 2026 | arXiv: 2602.24288 | Code: https://github.com/Snowflake-Labs/dare-bench | Area: LLM Evaluation | Keywords: data science benchmark, instruction following, ML modeling, RLVR, LLM agent
TL;DR
DARE-bench is a large-scale verifiable benchmark for data science tasks, comprising 6,300 Kaggle-derived tasks that support evaluation across two dimensions—ML modeling and instruction following—along with training data for SFT and RL. SFT improves Qwen3-32B by 1.83×, while RL improves Qwen3-4B by more than 8×.
Background & Motivation
Background: LLMs are increasingly deployed as data science agents (data loading, transformation, and modeling), yet existing benchmarks (DS-1000, DSBench, MLE-bench, etc.) suffer from significant shortcomings: most evaluate only final answer accuracy while ignoring process fidelity; verifiable ground truth is absent; task scales are small (hundreds of instances); and no training data is provided.
Limitations of Prior Work: (a) Lack of standardized process-aware evaluation—whether the agent truly follows the specified DS pipeline; (b) scarcity of accurately annotated training data, limiting applicability of SFT and RL; (c) existing benchmarks are predominantly sourced from Kaggle competitions, resulting in narrow domain coverage (e.g., missing time-series forecasting).
Key Challenge: Evaluating process fidelity requires deterministic ground truth, yet DS tasks are inherently stochastic and environment-dependent. How can process evaluation be made verifiable?
Goal: (a) Construct a large-scale, verifiable, training-ready DS benchmark; (b) cover two complementary capabilities—instruction following and ML modeling; (c) support RLVR (reinforcement learning with verifiable rewards) training.
Key Insight: The high reproducibility of data science workflows is leveraged—by controlling random seeds and providing explicit instructions, faithful execution of a process yields deterministic outputs, enabling outcome-based verification of process fidelity.
Core Idea: Through engineered determinism (fixed seeds + sandboxed execution + reference solutions + verifiable ground truth), DS process evaluation is transformed into automatically verifiable, outcome-based assessment.
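The determinism argument can be made concrete with a toy sketch (not DARE-bench's actual harness): when every random choice in a workflow is driven by a single fixed seed, faithfully re-executing the process reproduces the output exactly, so comparing final outputs doubles as verifying the process.

```python
import numpy as np

def run_workflow(seed: int) -> np.ndarray:
    """Toy stand-in for a DS pipeline: every random choice flows from one seed."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(100, 3))   # "data loading" with synthetic data
    w = rng.normal(size=3)          # "model fitting" via a random projection
    return X @ w                    # "predictions"

# Faithful re-execution under the same seed reproduces the output exactly,
# so checking the final predictions doubles as checking the process.
assert np.array_equal(run_workflow(0), run_workflow(0))
```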
Method
Overall Architecture
DARE-bench encompasses three task families (classification, regression, and time-series forecasting), each with two variants. The evaluation pipeline proceeds as follows: given a natural-language problem and structured input files, an LLM generates and executes code within a sandbox to produce predictions, which are then automatically scored against the ground truth. Dataset construction follows a four-stage automated pipeline: data sourcing → LLM-assisted task design → post-processing → sandbox validation.
Key Designs
- Dual-Variant Task Design (IF + MM / XF + CF):
- Function: Each dataset yields two complementary evaluation tasks.
- IF (Instruction Following): Requires the LLM to faithfully reproduce a reference workflow (specifying model type, hyperparameters, and preprocessing steps), evaluating process fidelity. Fixing random seeds ensures that faithful execution produces a uniquely determined output.
- MM (ML Modeling): Imposes no methodological constraints and evaluates only final prediction performance, assessing ML modeling capability.
- XF/CF (Time-Series): XF retains exogenous features; CF retains only timestamps and entity columns (the classical forecasting setting).
- Design Motivation: IF simulates the scenario of "strictly executing a data scientist's prescribed pipeline," while MM simulates "the client cares only about final performance." The two variants are complementary and both reflect real-world needs.
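A hypothetical illustration of how one dataset could yield both variants (the field names and instruction text are assumptions for exposition, not DARE-bench's actual task schema):

```python
# Hypothetical: one classification dataset yields two complementary task variants.
# Field names and instruction wording are illustrative only.
dataset = "titanic-like-dataset"

if_task = {
    "variant": "IF",  # process fidelity: reproduce a prescribed pipeline
    "dataset": dataset,
    "instructions": (
        "Impute Age with the median, one-hot encode Sex, "
        "train LogisticRegression(C=1.0, random_state=0), predict on test.csv."
    ),
    "ground_truth": "output of the reference solution under the fixed seed",
}

mm_task = {
    "variant": "MM",  # open-ended modeling: only final performance counts
    "dataset": dataset,
    "instructions": "Predict the Survived column on test.csv; any method allowed.",
    "ground_truth": "held-out labels from the original dataset",
}
```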
- Automated Data Construction Pipeline:
- Function: Automatically constructs standardized ML tasks from Kaggle datasets.
- Core Procedure: (1) Dataset Sourcing—retrieves Kaggle datasets and metadata via API and web scraping; (2) LLM-Assisted Task Design—LLM determines whether the dataset supports a predictive task and identifies target columns and features; (3) Post-Processing—data splitting, noise injection for IF tasks, resampling and entity validation for time-series tasks; (4) Finalization—sandbox validation to confirm task solvability.
- Design Motivation: Overcomes the bottleneck of manual annotation; LLMs handle only auxiliary content (descriptions, metadata, rule extraction) and do not generate the training signal itself.
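The four stages can be sketched as a single orchestration function. This is a hypothetical skeleton under stated assumptions (metadata shape, stubbed LLM and sandbox stages), not the paper's implementation:

```python
def sandbox_validates(task: dict) -> bool:
    # Stub: the real stage would execute a reference solution in a sandbox
    # and confirm the task is solvable end-to-end.
    return True

def build_task(dataset_meta: dict):
    """Hypothetical sketch of the four-stage construction pipeline."""
    # (1) Dataset sourcing: metadata assumed to come from the Kaggle API.
    if not dataset_meta.get("files"):
        return None
    # (2) LLM-assisted task design (stubbed): pick a target column, if any.
    target = dataset_meta.get("candidate_target")
    if target is None:
        return None
    # (3) Post-processing: data split plus a fixed seed for the IF variant.
    task = {
        "dataset": dataset_meta["name"],
        "target": target,
        "seed": 42,
        "split": {"train": 0.8, "test": 0.2},
    }
    # (4) Finalization: keep the task only if sandbox validation passes.
    return task if sandbox_validates(task) else None
```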
- Verifiable Reward Design:
- Function: Enables DARE-bench to support RLVR training.
- Mechanism: Ground truth for IF tasks is derived from the execution output of reference solutions under fixed seeds; ground truth for MM tasks comes from original dataset labels. Both are deterministic numerical values amenable to automatic scoring.
- Design Motivation: This is the critical enabler for RL training—no human judgment or LLM-as-judge is required; verification is purely outcome-based.
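A minimal sketch of such an outcome-based reward, assuming IF rewards exact (tolerance) agreement with the reference solution's deterministic output and MM is scored by a metric against held-out labels (accuracy shown as the illustrative metric):

```python
import numpy as np

def verifiable_reward(task_type: str, agent_pred, reference) -> float:
    """Outcome-based reward: no human judge or LLM-as-judge needed.

    IF : `reference` is the deterministic output of the reference solution
         under a fixed seed -> reward = exact match (within tolerance).
    MM : `reference` is the held-out ground-truth labels -> reward = metric.
    """
    agent_pred, reference = np.asarray(agent_pred), np.asarray(reference)
    if task_type == "IF":
        return float(np.allclose(agent_pred, reference, atol=1e-8))
    if task_type == "MM":
        return float((agent_pred == reference).mean())  # accuracy, illustratively
    raise ValueError(f"unknown task type: {task_type}")
```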
Loss & Training
- SFT: Qwen3-32B/4B is fine-tuned on the DARE-bench training split.
- RL: The GRPO algorithm is applied with DARE-bench verifiable rewards, without requiring preference data.
- Evaluation Metrics: accuracy/F1 for classification, R²/RMSE for regression, SMAPE/MAE for time-series forecasting.
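For reference, the listed metrics have standard closed forms; a minimal sketch (using the common 0-200% SMAPE formulation, which the paper may define differently):

```python
import numpy as np

def accuracy(y, yhat) -> float:
    return float(np.mean(np.asarray(y) == np.asarray(yhat)))

def rmse(y, yhat) -> float:
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def smape(y, yhat) -> float:
    """Symmetric MAPE in percent (0-200 formulation); undefined if y == yhat == 0."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    denom = (np.abs(y) + np.abs(yhat)) / 2
    return float(100 * np.mean(np.abs(y - yhat) / denom))
```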
Key Experimental Results
Main Results
| Model | Baseline Score | SFT Score | RL Score | Gain / Notes |
|---|---|---|---|---|
| gpt-o4-mini | ~45 | - | - | Strongest closed-source model still struggles with ML modeling |
| Qwen3-32B | 23.25 | ~42.5 (1.83×) | - | Substantial SFT improvement |
| Qwen3-4B | 4.39 | ~25 | 37.40 (8.5×) | Remarkable RL gain |
Ablation Study
| Task Type | Key Finding |
|---|---|
| Classification-IF | Failure to follow seed/hyperparameter instructions is the primary cause of failure |
| Classification-MM | Under open-ended modeling, LLMs frequently select suboptimal models |
| Time-series-CF | Most challenging subtask; even strong models perform poorly |
| RL vs. SFT | RL substantially outperforms SFT on small models; SFT suffices for large models |
Key Findings
- Even gpt-o4-mini struggles on ML modeling tasks, indicating that the data science capabilities of current LLMs are far from mature.
- Instruction following is the primary bottleneck: models frequently deviate from specified procedures (ignoring designated seeds, altering preprocessing steps, etc.), causing IF task failures.
- RL yields exceptionally large gains on small models: Qwen3-4B improves from 4.39 to 37.40 (8.5×), demonstrating that verifiable rewards combined with RL are highly effective for enhancing DS agent capabilities.
- Time-series forecasting is the greatest weakness: all models perform worst on the CF variant, suggesting that LLMs lack deep knowledge of time-series modeling.
Highlights & Insights
- Transforming process fidelity into outcome-based evaluation: By exploiting the reproducibility of data science workflows, "whether the process is correct" is converted to "whether deterministic outputs match," elegantly resolving the objectivity problem in process evaluation.
- Unified training and evaluation: 95% of the 6,300 tasks are available for training and 5% for testing, genuinely supporting a "train on benchmark" paradigm.
- Dramatic RL gains on small models: An 8× improvement demonstrates that task-specific RL training can substantially unlock the potential of small models, with significant practical implications for DS agent deployment.
- Scale far exceeding comparable DS benchmarks: 6,300 tasks is an order of magnitude larger than prior data science benchmarks (e.g., DSBench's 540 and MLE-bench's 75); SWE-bench is larger (~21K tasks) but targets software engineering, not data science.
Limitations & Future Work
- Kaggle datasets may not represent the complexity of real-world industrial data science problems.
- The "determinism" of IF tasks relies on complete control of random seeds; many real-world DS pipelines cannot be fully reproduced.
- Multi-turn interaction and iterative modeling scenarios are not evaluated—actual DS workflows typically require multiple experimental iterations.
- The choice of evaluation metrics for time-series tasks (e.g., SMAPE vs. MAE) may affect model rankings.
Related Work & Insights
- vs. DSBench/MLE-bench: Larger scale (6,300 vs. 540/75 tasks), includes training data, and covers time-series forecasting.
- vs. SWE-bench: SWE-bench targets software engineering; DARE-bench targets data science—the two are strongly complementary.
- vs. RLVR (DeepSeek-R1, etc.): DARE-bench provides a new RLVR domain beyond mathematics and code—verifiable rewards for DS tasks.
Rating
- Novelty: ⭐⭐⭐⭐ The IF+MM dual-variant evaluation and DS process fidelity verification constitute novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model evaluation with SFT and RL experiments across 6,300 tasks at considerable scale.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with detailed pipeline descriptions.
- Value: ⭐⭐⭐⭐ Fills an important gap in DS agent training and evaluation; RL training results are particularly compelling.