
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development

Conference: ICML 2025 Spotlight
arXiv: 2407.11784
Code: modelscope/data-juicer
Area: Video Generation
Keywords: Data-Model Co-development, Multimodal Large Language Models, Data Processing Operators, Sandbox Experiments, Data Quality and Diversity

TL;DR

This work proposes Data-Juicer Sandbox, a feedback-driven sandbox suite that uses a "Probe-Analyze-Refine" workflow to systematically explore how data processing operators (OPs) interact with model performance in low-cost, small-scale experiments. The resulting data recipes transfer to large-scale scenarios, where they achieved first place on the VBench leaderboard.

Background & Motivation

The optimization of multimodal large models has long faced a fragmentation between "data-centric" and "model-centric" pathways:

Model-centric methods optimize architectures and algorithms under fixed data priors, ignoring the impact of data quality on the model.
Data-centric methods process datasets independently of the model training context, lacking model feedback signals.
The lack of synergy between the two results in heavy reliance on heuristic exploration and single-perspective expert experience.

In the era of large models, the cost of data processing and model training has grown exponentially. Researchers are often forced to choose between "pursuing results" and "deep exploration". The lack of a cost-controllable platform to accelerate data-model co-development makes it difficult for improvements in one domain to directly inform and enhance the other.

Core Problem: How to systematically explore the impact of data processing operations on model performance at a controllable cost, and transfer insights from small-scale experiments to large-scale production environments?

Method

Overall Architecture

Data-Juicer Sandbox is a feedback-driven sandbox suite employing a three-layer decoupled architecture:

Top Layer — Workflow Layer: Organizes co-development as an ordered list of jobs, divided into four stages: probing data/models, refining data recipes, executing data/model operations, and evaluation. Users can flexibly adjust task sequences or reuse built-in workflows.
Middle Layer — Action Layer: Encapsulates common actions as Hook functions, such as data analysis, triggering model training, and evaluation callbacks.
Bottom Layer — Capability Layer: Encapsulates underlying capabilities as Factory Classes, including over 100 multimodal data analysis, filtering, and synthesis operators (OPs) provided by Data-Juicer, as well as integrated SOTA open-source model training and evaluation frameworks (including Mini-Gemini, EasyAnimate, ModelScope, VBench, MMBench, TextVQA, MME, etc.).

All components are managed via configuration files, supporting custom orchestration and significantly reducing cognitive burden.
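
The layering above can be illustrated with a minimal sketch. Everything in it is hypothetical (the `Job` class, the `HOOKS` registry, the hook names, and the placeholder metric); it mirrors only the idea of an ordered job list dispatching to registered hooks and is not the actual Data-Juicer Sandbox API or configuration schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: an illustration of the three-layer idea,
# not the real Data-Juicer Sandbox classes or config format.

@dataclass
class Job:
    stage: str                                   # e.g. "probe", "refine", "execute", "evaluate"
    hook: str                                    # name of the action-layer hook to run
    params: Dict[str, object] = field(default_factory=dict)

# Capability layer stand-in: a registry mapping hook names to callables.
HOOKS: Dict[str, Callable[[dict, dict], dict]] = {}

def register(name: str):
    """Decorator that adds a function to the hook registry."""
    def wrap(fn):
        HOOKS[name] = fn
        return fn
    return wrap

@register("analyze_data")
def analyze_data(ctx: dict, params: dict) -> dict:
    ctx["stats"] = {"num_samples": len(ctx["data"])}
    return ctx

@register("filter_data")
def filter_data(ctx: dict, params: dict) -> dict:
    min_len = params.get("min_len", 0)
    ctx["data"] = [s for s in ctx["data"] if len(s) >= min_len]
    return ctx

@register("evaluate")
def evaluate(ctx: dict, params: dict) -> dict:
    ctx["score"] = len(ctx["data"])              # placeholder metric, not a real benchmark
    return ctx

def run_workflow(jobs: List[Job], ctx: dict) -> dict:
    """Workflow layer: run the jobs in the order given, threading a shared context."""
    for job in jobs:
        ctx = HOOKS[job.hook](ctx, job.params)
    return ctx

jobs = [
    Job("probe", "analyze_data"),
    Job("execute", "filter_data", {"min_len": 3}),
    Job("evaluate", "evaluate"),
]
result = run_workflow(jobs, {"data": ["a", "abc", "abcd"]})
```

Because the job list is plain data, reordering or reusing a workflow amounts to editing that list, which is the property the configuration-driven design is meant to provide.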

Key Designs: Probe-Analyze-Refine Workflow

This is the core methodology of the paper, divided into four progressive stages:

Stage 1: Single-Operator Data Pools

Given an initial dataset \(\mathcal{D}\), for each filtering operator of interest \(\mathcal{OP}_i\):

Process the dataset using the operator: \(\mathcal{P}_i = \mathcal{DJ}[\mathcal{OP}_i(\rho_i)](\mathcal{D})\)
Sort by the statistics generated by the operator and divide the results equally into three data pools: \(\mathcal{P}_{i,\text{low}}\), \(\mathcal{P}_{i,\text{mid}}\), and \(\mathcal{P}_{i,\text{high}}\)
Randomly sample \(\mathcal{D}\) to form a control group \(\mathcal{D}_{rand}\), sized so that all \(3N+1\) data pools (three per operator for \(N\) operators, plus the control) are of equal size
Independently train reference models on each data pool using consistent hyperparameters, data volume, and computational resources
Harness feedback from model evaluation metrics to mine insights and identify Top OPs
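
The pool construction in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: each sample is a dict that already carries the operator's statistic under a key (`stat_key`, a hypothetical name), and the function name and the way a remainder is handled are choices made here, not taken from the paper.

```python
import random
from typing import Dict, List, Sequence

def build_single_op_pools(
    samples: Sequence[dict],
    stat_key: str,
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Split samples into low/mid/high thirds of one OP statistic,
    plus a random control pool of the same size."""
    ranked = sorted(samples, key=lambda s: s[stat_key])
    pool_size = len(ranked) // 3
    pools = {
        "low": ranked[:pool_size],
        "mid": ranked[pool_size:2 * pool_size],
        # Take the top third from the end; if len(samples) is not a
        # multiple of 3, the leftover samples between mid and high are dropped.
        "high": ranked[len(ranked) - pool_size:],
    }
    rng = random.Random(seed)                    # seeded for reproducibility
    pools["rand"] = rng.sample(list(samples), pool_size)
    return pools

# Toy data: nine samples with distinct scores 0..9 (5 is absent by construction).
toy = [{"id": i, "score": (i * 7) % 10} for i in range(9)]
pools = build_single_op_pools(toy, "score")
```

Repeating this for each of the \(N\) operators gives the three pools per operator; a single shared `rand` pool then serves as the control for all of them.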

Stage 2: Multi-Operator Data Pools

Sequentially combine multiple OPs: \(\mathcal{P}_S = (\mathcal{DJ}[\mathcal{OP}_i] \circ \mathcal{DJ}[\mathcal{OP}_j] \circ \cdots)(\mathcal{D})\)

Since the number of combinations grows exponentially with the number of OPs, two practical combination strategies are proposed:

Strategy 1: Combine Top OPs in descending order based on their rankings in single-operator experiments.
Strategy 2: Cluster OPs based on Pearson correlation coefficients and combine Top OPs within each category.
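
Both strategies can be sketched in a few lines. The summary does not say which quantities are correlated, so this sketch assumes the Pearson correlation is computed between per-sample statistics of two OPs; the greedy grouping rule, the `threshold` value, and all function names are likewise assumptions for illustration.

```python
from math import sqrt
from typing import Dict, List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def top_k_ops(gain: Dict[str, float], k: int) -> List[str]:
    """Strategy 1: take OPs in descending order of single-operator gain."""
    return sorted(gain, key=gain.get, reverse=True)[:k]

def cluster_ops(stats: Dict[str, Sequence[float]], threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping (an assumed rule): an OP joins the first cluster that
    contains an OP with |r| >= threshold, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for op in stats:
        for cluster in clusters:
            if any(abs(pearson(stats[op], stats[o])) >= threshold for o in cluster):
                cluster.append(op)
                break
        else:
            clusters.append([op])
    return clusters

def top_per_cluster(clusters: List[List[str]], gain: Dict[str, float]) -> List[str]:
    """Strategy 2: keep the best-performing OP from each cluster."""
    return [max(cluster, key=lambda op: gain[op]) for cluster in clusters]

# Toy statistics: "a" and "b" are nearly collinear, "c" is not.
stats = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
gain = {"a": 1.0, "b": 2.0, "c": 0.5}
clusters = cluster_ops(stats)
```

Strategy 2 avoids stacking OPs that measure nearly the same property, which Strategy 1 cannot rule out when two correlated OPs both rank highly.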

Stage 3: Pyramid-shaped Data Pools

Addresses the trade-off between data quality and diversity: more OP combinations \(\rightarrow\) higher quality but smaller data volume.

Constructs a hierarchical pyramid structure where \(n_s\) Top OPs are combined into \(2^{n_s}-1\) data pools:

Top Layer: Combination of all OPs (e.g., \(\mathcal{OP}_{1,2,3}\)), with the smallest data volume but the highest quality.
Middle Layer: Pairwise combinations (e.g., \(\mathcal{OP}_{1,2}\)), with slightly larger data volume.
Bottom Layer: Single OPs, with the largest data volume but lower average quality.
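
The \(2^{n_s}-1\) count is simply the number of non-empty subsets of the Top OPs, which a short sketch makes concrete. Here each OP is modeled as a hypothetical keep/drop predicate on a sample; the names and the toy predicates are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

def build_pyramid(
    samples: Sequence[dict],
    ops: Dict[str, Callable[[dict], bool]],
) -> Dict[Tuple[str, ...], List[dict]]:
    """One pool per non-empty subset of OPs; a sample is kept only if
    every OP in the subset accepts it."""
    pools: Dict[Tuple[str, ...], List[dict]] = {}
    names = sorted(ops)
    for k in range(1, len(names) + 1):           # k = 1 is the bottom layer
        for combo in combinations(names, k):
            pools[combo] = [s for s in samples if all(ops[n](s) for n in combo)]
    return pools

toy = [{"a": i, "b": 10 - i} for i in range(11)]
ops = {
    "op1": lambda s: s["a"] >= 3,
    "op2": lambda s: s["b"] >= 3,
    "op3": lambda s: s["a"] % 2 == 0,
}
pyramid = build_pyramid(toy, ops)
```

With three OPs this yields seven pools, and each pool in an upper layer is a subset of the pools below it, so data volume can only shrink toward the top.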

Comparison of two training settings:

Repeated Training: Uses only the top-layer high-quality data pool, trained with different repetition rates.
Non-repeated Training: Progressively adds lower-layer data pools and deduplicates them to maintain the same computational cost.
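
One way to read the two settings is as two schedules that consume the same number of training samples. The sketch below follows that reading; the fixed sample budget, the top-to-bottom fill order, and the function names are assumptions made for illustration, not details confirmed by the paper.

```python
from typing import List, Sequence

def repeated_schedule(top_pool: Sequence[str], budget: int) -> List[str]:
    """Cycle through the top-layer pool until `budget` samples are scheduled."""
    return [top_pool[i % len(top_pool)] for i in range(budget)]

def non_repeated_schedule(layers: Sequence[Sequence[str]], budget: int) -> List[str]:
    """Walk the layers from top to bottom, skipping ids already scheduled,
    until the budget is met or the pyramid is exhausted."""
    seen = set()
    schedule: List[str] = []
    for layer in layers:
        for sample_id in layer:
            if len(schedule) >= budget:
                return schedule
            if sample_id not in seen:
                seen.add(sample_id)
                schedule.append(sample_id)
    return schedule

top = ["a", "b"]
layers = [["a", "b"], ["b", "c", "d"], ["d", "e", "f"]]
rep = repeated_schedule(top, 5)
non_rep = non_repeated_schedule(layers, 5)
```

Both schedules have length 5, so the compute is matched; the first sees two distinct samples several times, the second sees five distinct samples once each.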

Stage 4: Scaling and Transfer

All data pools are uniformly sampled and consistently derived from \(\mathcal{D}\), allowing insights from small-scale experiments to be extrapolated to larger-scale scenarios.

Loss & Training

This work does not modify the loss functions of model training; instead, it focuses on optimization on the data side. The core training strategies include:

Cost Control: All experiments maintain a consistent computational overhead. Each small-pool run takes \(T_{pool} = r \times T_{full}\), a fraction \(r\) of the full training time \(T_{full}\), so that the total cost satisfies \((1+mr) \times T_{full} \leq M \times T_{full}\).
Early Stopping: Early termination for unpromising experimental trials.
Transferability Guarantee: A theoretical analysis based on Hoeffding's inequality bounds the performance discrepancy \(\epsilon\) between small-pool and full-scale experiments, with the probability of a large discrepancy decaying exponentially as \(r\) increases.
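
The cost inequality and the Hoeffding-style argument can be made concrete with a small calculation. The summary does not define \(m\) and \(M\), so this sketch assumes \(m\) is the number of small-pool runs added to one full-scale run and \(M\) is the number of full-scale runs that budget would otherwise buy. The bound shown is the standard two-sided Hoeffding inequality for the mean of \(n\) independent values in \([0, 1]\), not the paper's specific theorem.

```python
import math

def total_cost(m: int, r: float, t_full: float = 1.0) -> float:
    """Cost of one full-scale run plus m small-pool runs of cost r * t_full each
    (assumed reading of the (1 + m*r) * T_full term)."""
    return (1 + m * r) * t_full

def max_small_runs(M: float, r: float) -> int:
    """Largest m satisfying (1 + m*r) <= M; the tiny offset guards against
    floating-point round-off at exact boundaries."""
    return math.floor((M - 1) / r + 1e-9)

def hoeffding_bound(n: int, eps: float) -> float:
    """Standard Hoeffding bound: P(|sample mean - expectation| >= eps)
    <= 2 * exp(-2 * n * eps**2) for n i.i.d. values in [0, 1]."""
    return min(1.0, 2 * math.exp(-2 * n * eps ** 2))

runs = max_small_runs(M=2, r=0.01)               # budget of two full-scale runs
bound_small = hoeffding_bound(n=1000, eps=0.05)
bound_large = hoeffding_bound(n=4000, eps=0.05)
```

Under these assumptions, a budget of two full-scale runs with \(r = 0.01\) pays for one full run plus 100 small-pool probes, while quadrupling the pool size shrinks the failure probability for a fixed \(\epsilon\) by several orders of magnitude.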

Key Experimental Results

The evaluation covers 5 models, 4 task categories, 100+ experiments, 70+ evaluation metrics, and 40+ data processing operators.

Main Results

Single-Operator Rankings — Average Performance Changes of Top-3 OPs (Relative to Baseline %):

Task	Operator	\(\mathcal{P}_{low}\)	\(\mathcal{P}_{mid}\)	\(\mathcal{P}_{high}\)
I2T (Image-to-Text)	Image NSFW Filter	+7.13	+18.44	+66.38
I2T	Text Action Number	+59.90	+0.29	-2.05
T2V (Text-to-Video)	Video Aesthetics Score	-0.98	+0.13	+0.96
T2V	Video NSFW Score	+0.82	-0.05	-0.57
ITP (Image-Text Pre-training)	CLIP Image-Text Sim.	-32.57	-6.39	+39.53
IC (Image Captioning)	Text Length	+0.76	-3.13	-11.36
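
Reading the table row by row, the best-performing range is not the same for every operator. The short sketch below copies the values from the table and picks, for each operator, the pool with the largest average change; the variable and function names are illustrative.

```python
# Values copied from the table above: (task, operator, low, mid, high),
# each the average % change relative to the baseline.
ROWS = [
    ("I2T", "Image NSFW Filter", 7.13, 18.44, 66.38),
    ("I2T", "Text Action Number", 59.90, 0.29, -2.05),
    ("T2V", "Video Aesthetics Score", -0.98, 0.13, 0.96),
    ("T2V", "Video NSFW Score", 0.82, -0.05, -0.57),
    ("ITP", "CLIP Image-Text Sim.", -32.57, -6.39, 39.53),
    ("IC", "Text Length", 0.76, -3.13, -11.36),
]

def best_range(low: float, mid: float, high: float) -> str:
    """Return the name of the pool with the largest average change."""
    changes = {"low": low, "mid": mid, "high": high}
    return max(changes, key=changes.get)

best = {op: best_range(lo, mid, hi) for _, op, lo, mid, hi in ROWS}
```

Three operators favor the high pool and three favor the low pool, and none favors the middle one, which is why the direction of each filter has to be probed rather than assumed.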

Ablation Study

Data Quality vs. Diversity Trade-off (Pyramid Experiments):

Configuration	Observation	Interpretation
High-Quality Data × Repeated Training	Clear early-stage advantage	Data repetition works well with fewer epochs
Lower-Quality but More Data	Catches up or even surpasses in later stages	Diversity is more important with sufficient training
Optimal Balance Point	Depends on computational budget	Quality is favored with limited budgets; diversity is favored with abundant budgets

Key Findings

Output modality determines key OPs: Top OPs for T2V tasks are entirely video-related, while Top OPs for I2T and ITP tasks are mostly text or image-text related. Insight: More resources should be allocated to data processing corresponding to the model's output modality.
Optimal statistical ranges differ across OPs: Some OPs perform best in the high range (e.g., Image NSFW Filter, CLIP Image-Text Similarity), while others perform best in the low range (e.g., Text Action Number, Text Length), demonstrating the need for empirical probing over intuition.
OP combination effects are non-linear: Top-2 combinations typically outperform single operators, but Top-3 combinations do not necessarily yield further improvements.
Small-scale insights are transferable: The optimal recipes discovered on small data pools maintain their performance gains when applied to large-scale scenarios.
First place on the VBench leaderboard: Recipes derived from small-pool experiments with EasyAnimate were transferred to the architecturally different T2V-Turbo model, and the resulting model outperformed competitors such as Gen-3 and VEnhancer.
Scaling-law-like behavior: Optimal recipes found with low-FLOPs models in CLIP pre-training maintain their advantages as FLOPs scale up; the advantage of the recipes remains stable when expanding InternVL from 1B to 26B.

Highlights & Insights

Methodological Innovation: Proposes the first systematic data-model co-development sandbox, transforming traditional trial-and-error data cleaning and model tuning into a feedback-driven scientific workflow.
Outstanding Cost-Effectiveness: Replaces costly large-scale heuristic iterations with numerous small-scale experiments (\(r\) much smaller than 0.01), significantly reducing overall development costs.
Elegant Pyramid Design: The quality-diversity trade-off framework provides a clear analytical tool for data mixing strategies.
Cross-Model and Cross-Architecture Transferability: Data recipes discovered on EasyAnimate can be directly applied to the architecturally different T2V-Turbo and achieve SOTA, indicating that the recipes generalize across architectures.
Infrastructure Value: As an open-source middleware, Data-Juicer Sandbox fills the infrastructure gap in multimodal data-model co-development.

Limitations & Future Work

Limited OP exploration space: Currently, only filtering OPs have been thoroughly explored, with limited exploration of synthesis-based (e.g., data augmentation, caption rewriting) and mapping-based OPs.
Limited model coverage: Chiefly validated on CLIP, LLaVA-like, and DiT-based models, leaving applicability to proprietary closed-source models unverified.
Non-trivial computational cost: The total cost of conducting 100+ experiments remains significant, which might still be prohibitive for resource-constrained teams.
Theoretical guarantees for recipe transfer: The analysis using Hoeffding's inequality is relatively simplified and does not fully account for practical issues like data distribution shift.
Room for automation improvement: OP combinations and hyperparameter search currently rely heavily on heuristics; future work could integrate AutoML or Bayesian optimization.

Related Work & Insights

Data-Juicer (Chen et al., 2024): The underlying system of this work, which offers more than 100 multimodal data operators.
DataComp (Gadre et al., 2023): A CLIP benchmark competition highlighting the importance of data filtering for pre-training.
Mini-Gemini / InternVL-2.0: LLaVA-style image-text generation models, which serve as experimental testbeds in this study.
EasyAnimate / T2V-Turbo: Video generation models (the DiT-based EasyAnimate and the architecturally different T2V-Turbo), used to validate the cross-architecture transferability of recipes.
VBench: A comprehensive evaluation benchmark for video generation, on which this work achieved the #1 spot.

Insights: The methodology of "systematic small-scale exploration followed by large-scale transfer" can be generalized to other resource-intensive AI development scenarios, such as RLHF data recipe optimization and searching for data mixing ratios in multi-task learning.

Rating

Dimension	Score	Description
Novelty	⭐⭐⭐⭐	First systematic data-model co-development sandbox
Technical Depth	⭐⭐⭐⭐	Well-designed workflow with insightful pyramid analysis
Experimental Thoroughness	⭐⭐⭐⭐⭐	100+ experiments, 4 task categories, 5 models, #1 on VBench
Writing Quality	⭐⭐⭐⭐	Clear structure, but heavy mathematical notation adds some reading burden
Practicality	⭐⭐⭐⭐⭐	Open-source infrastructure with directly reusable recipes
Overall	⭐⭐⭐⭐☆	Engineering-oriented systematic work with solid experimentation and high practical value