Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Conference: CVPR 2026
Paper: CVF Open Access
Code: Dataset publicly released (the paper states that it is public, but the main text provides no specific repository link)
Area: Diffusion Models / Image Editing / Datasets
Keywords: Instruction-based Image Editing, Synthetic Dataset, MLLM Evaluation, Preference Pairs, Multi-turn Editing

TL;DR

The authors used Nano-Banana (Gemini-2.5-Flash-Image) to batch-generate approximately 400,000 instruction-based image editing samples from real OpenImages photos. With Gemini-2.5-Pro as an automated quality judge, they constructed Pico-Banana-400K, an open-source dataset covering 35 editing types that supports single-turn SFT, preference learning, and multi-turn editing research.

Background & Motivation

Background: Multimodal large language models (MLLMs) such as GPT-4o and Nano-Banana, alongside diffusion-based editing models, now support a wide range of image manipulations driven by natural language instructions, from simple color adjustments to complex semantic and compositional transformations. Consequently, instruction-based image editing has become a prominent research direction.

Limitations of Prior Work: Open research is constrained by a lack of data. Existing editing datasets are either synthetic data generated by proprietary models or small human-annotated subsets. These typically suffer from three issues: domain shift (synthetic vs. real photos), imbalanced distribution of editing types, and inconsistent quality control. These shortcomings result in models that lack robustness and are difficult to benchmark fairly.

Key Challenge: Achieving "large scale" often compromises "quality and diversity" (blind volume accumulation without fine-grained quality checks), while "high quality" is difficult to scale due to the cost and slowness of human annotation. Large-scale data sourced from real photos with clear commercial licensing is particularly scarce.

Goal: To create a large-scale, high-quality, and fully public image editing dataset based on real photos that supports not only single-turn SFT but also preference alignment and multi-turn editing scenarios.

Key Insight: Instead of relying on humans, frontier closed-source models can be treated as a "data factory"—Gemini-2.5-Flash writes instructions, Nano-Banana executes edits, and Gemini-2.5-Pro acts as a judge. A fine-grained taxonomy ensures coverage, while MLLM-based multi-dimensional scoring ensures content preservation and instruction faithfulness.

Core Idea: By implementing a fully automated pipeline comprising "taxonomy-driven instruction generation + model-based editing + MLLM automated evaluation + failure sample recovery," real images are expanded into 400,000 quality-controlled editing triplets, subdivided into dedicated subsets for single-turn, preference, and multi-turn tasks.

Method

Overall Architecture

Pico-Banana-400K centers on an automated, scalable, quality-controlled data construction pipeline. Taking real photos from OpenImages as input, the pipeline outputs approximately 400,000 (image, instruction, edited-image) samples and derived subsets. The workflow consists of four steps: ① Sampling real images from OpenImages and assigning a primary editing type from a taxonomy of 35 types across 8 categories; ② Generating two instruction views per edit: a "long/detailed" instruction written by Gemini-2.5-Flash and a "short/colloquial" instruction rewritten by Qwen2.5-7B-Instruct; ③ Executing edits with Nano-Banana and scoring them with Gemini-2.5-Pro across four dimensions, keeping samples whose score exceeds a threshold ($\approx 0.7$) as successes and recycling failed attempts as negative examples; ④ Sampling single-turn data and appending 1–4 edit types to generate multi-turn chains (2–5 rounds). The final dataset comprises three subsets: 258K single-turn SFT, 56K preference pairs, and 72K multi-turn sequences. The entire process scales automatically without human annotation, at a production cost of approximately 100,000 USD.

Key Designs

1. 35-Type Editing Taxonomy: Ensuring Coverage and Balance

To address the uneven distribution and incomplete coverage in existing datasets, the authors defined a fine-grained taxonomy with 8 major categories and 35 operations: Pixel & Photometric (color, grain/vintage filters), Object-Level Semantic (add/remove/replace objects, attribute change, relocation, resizing/orientation), Scene Composition (background change, season/weather/lighting), Stylistic (artistic style transfer, cartoon/sketch conversion, anachronistic styles), Text & Symbol (replace/add/translate text, font changes), Human-Centric (accessories, clothing, facial expression/age/gender, style conversion like anime/Pixar/LEGO), Scale (upscaling), and Spatial/Layout (outpainting). Each (image, instruction) pair is assigned one primary type, with category-specific filtering for human or text operations. Categories where Nano-Banana performed inconsistently—such as brightness/contrast/saturation, sharpening/blurring (too subtle), heavy perspective/pose rewriting (structural artifacts), or dual-image composition—were proactively excluded to ensure stable supervision signals.
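The taxonomy and the primary-type assignment can be pictured with a small sketch. It is illustrative only: the dictionary lists just the operations named in this summary (fewer than the paper's 35), and the `has_person` / `has_text` flags, the function name, and the uniform sampling are assumptions rather than details from the paper.

```python
import random

# Partial view of the taxonomy: only operations named in this summary are listed,
# so the number of entries is lower than the 35 types reported in the paper.
TAXONOMY = {
    "Pixel & Photometric": ["color change", "film grain / vintage filter"],
    "Object-Level Semantic": ["add object", "remove object", "replace object",
                              "attribute change", "relocation", "resize / reorient"],
    "Scene Composition": ["background change", "season / weather / lighting"],
    "Stylistic": ["artistic style transfer", "cartoon / sketch conversion",
                  "anachronistic style"],
    "Text & Symbol": ["replace text", "add text", "translate text", "font change"],
    "Human-Centric": ["accessories", "clothing", "facial expression / age / gender",
                      "anime / Pixar / LEGO style"],
    "Scale": ["upscaling"],
    "Spatial/Layout": ["outpainting"],
}


def assign_primary_type(has_person, has_text, rng):
    """Pick one (category, operation) pair, skipping categories the image cannot support.

    Uniform sampling over the remaining operations is an assumption.
    """
    candidates = []
    for category, operations in TAXONOMY.items():
        if category == "Human-Centric" and not has_person:
            continue  # human edits only make sense when a person is visible
        if category == "Text & Symbol" and not has_text:
            continue  # text edits only make sense when text is visible
        candidates.extend((category, op) for op in operations)
    return rng.choice(candidates)


category, operation = assign_primary_type(has_person=False, has_text=True,
                                          rng=random.Random(0))
```

Filtering before sampling keeps human-centric and text operations off images that contain neither, which is one plausible reading of the "category-specific filtering" described above.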

2. Dual Instruction Generation: Detailed Prompts vs. Colloquial Instructions

To balance the need for unambiguous supervision in training with the reality of short, vague user instructions, two instruction views are generated for each edit. Type I (Long) instructions are generated by Gemini-2.5-Flash, which must perceive the visible content (objects, colors, positions) to produce information-dense, photorealistic prompts for strong training supervision. Type II (Short) instructions are rewritten by Qwen2.5-7B-Instruct, guided by human-written examples, to reflect colloquial user habits. This dual-instruction setup enables research on instruction granularity (the same edit described at two levels of detail) and supports both dense supervision and natural user prompt handling.
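One way to picture the two instruction views is as two fields on the same record. The sketch below is a hypothetical schema: the class name, field names, file paths, and example instruction strings are all invented for illustration and do not reflect the released data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditSample:
    """Hypothetical record layout; not the dataset's actual schema."""
    source_image: str       # path or URL of the original OpenImages photo
    edited_image: str       # path or URL of the edited result
    edit_type: str          # primary type from the taxonomy
    long_instruction: str   # Type I: detailed, content-aware prompt
    short_instruction: str  # Type II: short, colloquial rewrite

    def instruction(self, view="long"):
        """Return the requested instruction view for training or evaluation."""
        if view == "long":
            return self.long_instruction
        if view == "short":
            return self.short_instruction
        raise ValueError(f"unknown instruction view: {view!r}")


# Made-up example values, for illustration only.
sample = EditSample(
    source_image="images/0001.jpg",
    edited_image="edits/0001.jpg",
    edit_type="add object",
    long_instruction="Add a red wool hat to the man standing on the left, "
                     "matching the scene lighting.",
    short_instruction="give the guy a red hat",
)
```

Keeping both views on one record lets the same edited image be paired with either instruction, for example long prompts for supervised training and short prompts for evaluating robustness to terse requests.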

3. MLLM Four-Dimensional Evaluation & Failure Recovery: Automated Quality Gate and Preference Source

To bypass the cost of human evaluation, Gemini-2.5-Pro simulates professional human judgment using a structured system prompt, scoring each edit on four weighted dimensions: Instruction Compliance (40%), Seamlessness (25%), Preservation Balance (20%), and Technical Quality (15%). An edit whose aggregated score exceeds a strict threshold ($\approx 0.7$) is categorized as a success (~258K samples). Each (image, instruction) pair is attempted up to three times. If all attempts fail, the pair is discarded; if one or two attempts fail before one succeeds, the failed edits are kept and paired with the successful edit as (success, failure) preference data (~56K) for DPO or reward model training. Quality control thereby doubles as a source of preference alignment data.
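The weighted gate and the retry-based failure recovery can be sketched as follows. This is a minimal sketch under stated assumptions: per-dimension scores are taken to lie in [0, 1], the threshold is treated as exactly 0.7, and `edit_fn` / `judge_fn` are hypothetical callables standing in for Nano-Banana and Gemini-2.5-Pro (no real API is implied).

```python
# Dimension weights as reported in the summary above.
WEIGHTS = {
    "instruction_compliance": 0.40,
    "seamlessness": 0.25,
    "preservation_balance": 0.20,
    "technical_quality": 0.15,
}
THRESHOLD = 0.7    # the source gives this only approximately
MAX_ATTEMPTS = 3


def aggregate(scores):
    """Weighted sum of the four per-dimension scores (assumed to be in [0, 1])."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def run_with_retries(image, instruction, edit_fn, judge_fn):
    """Try an edit up to MAX_ATTEMPTS times.

    Returns (successful_edit, preference_pairs). Each failed attempt that
    preceded the success becomes the rejected side of one pair. If every
    attempt fails, returns (None, []) and the pair is discarded.
    """
    failures = []
    for _ in range(MAX_ATTEMPTS):
        edited = edit_fn(image, instruction)
        if aggregate(judge_fn(image, instruction, edited)) > THRESHOLD:
            pairs = [{"chosen": edited, "rejected": bad} for bad in failures]
            return edited, pairs
        failures.append(edited)
    return None, []
```

Because failures are only kept when a later attempt succeeds, every preference pair compares two edits of the same image under the same instruction, which is what makes them usable as (chosen, rejected) examples.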

4. Multi-turn Editing Chains: Sequential Editing with Coreference Continuity

For research into iterative editing, the authors sampled 100K items from the single-turn data and appended 1–4 editing types to each, forming sessions of 2–5 rounds. Gemini-2.5-Pro generates each instruction from the image plus the edit history and is encouraged to use coreference: for example, Round 1 adds a "hat" and Round 2 says "change its color." Execution and evaluation follow the single-turn protocol, yielding 72K sequences that test composability and pragmatic coreference.
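The session-length arithmetic (one existing edit plus 1–4 appended edits gives 2–5 rounds) can be made concrete with a short planning sketch. The function name is hypothetical, and sampling the extra types uniformly with replacement is an assumption; the summary does not say how the appended types are chosen.

```python
import random


def plan_session(first_edit_type, all_edit_types, rng):
    """Plan the edit types for one multi-turn session.

    The session starts from an existing single-turn edit and appends
    1-4 further edit types, giving 2-5 rounds in total.
    """
    extra_rounds = rng.randint(1, 4)  # inclusive on both ends
    extra_types = [rng.choice(all_edit_types) for _ in range(extra_rounds)]
    return [first_edit_type] + extra_types


edit_types = ["add object", "attribute change", "background change", "outpainting"]
session = plan_session("add object", edit_types, random.Random(0))
```

In the described pipeline, each planned round would then receive an instruction conditioned on the current image and the edit history, which is where coreference such as "change its color" comes from.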

Key Experimental Results

Dataset Composition and Scale

The three subsets sum to approximately 386K samples, which the dataset name and summary round to ~400K: Single-turn SFT 258K (66.8%), Multi-turn SFT 72K (18.7%), and Single-turn preference pairs 56K (14.5%). All images originate from real OpenImages photos.

Subset	Scale	Purpose
Single-turn SFT	~258K	SFT for instruction-based editing
Preference Pairs	~56K	DPO / Reward Models / Alignment
Multi-turn Sequences	~72K	Iterative editing, context awareness, planning
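As a quick consistency check, the reported percentages can be recomputed from the rounded subset sizes. The counts below are the approximate figures from this summary, not exact sample counts.

```python
# Approximate subset sizes as reported above (rounded to the nearest thousand).
subsets = {
    "single_turn_sft": 258_000,
    "multi_turn_sft": 72_000,
    "preference_pairs": 56_000,
}

total = sum(subsets.values())
shares = {name: round(100 * count / total, 1) for name, count in subsets.items()}

print(total)   # 386000
print(shares)  # {'single_turn_sft': 66.8, 'multi_turn_sft': 18.7, 'preference_pairs': 14.5}
```

The shares match the stated 66.8% / 18.7% / 14.5% split, which confirms that the percentages are computed against the ~386K total rather than a literal 400K.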

Success Rate Analysis by Editing Type

The pass rate determined by Gemini-2.5-Pro indicates a clear trend: Global appearance/style edits are easy, while edits requiring precise spatial control, layout extrapolation, or symbolic fidelity are difficult.

Difficulty	Representative Editing Type	Success Rate
Easy	Strong artistic style transfer	$0.9340$
Easy	Film grain / Vintage filters	$0.9068$
Easy	Anachronistic style swaps	$0.8875$
Hard	Font / Text color change	$0.5759$
Hard	Object relocation	$0.5923$
Hard	Caricature	$0.5884$
Hard	Size / Shape / Orientation	$0.6627$
Hard	Outpainting	$0.6634$
Hard	Pixar/Disney 3D styles	$0.6463$
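To summarize the gap between the two groups, the rates in the table can be averaged per difficulty level. This is a simple unweighted mean over the representative types listed here only; it is not a statistic reported in the paper.

```python
# (difficulty, edit type, success rate) rows copied from the table above.
rows = [
    ("Easy", "Strong artistic style transfer", 0.9340),
    ("Easy", "Film grain / Vintage filters", 0.9068),
    ("Easy", "Anachronistic style swaps", 0.8875),
    ("Hard", "Font / Text color change", 0.5759),
    ("Hard", "Object relocation", 0.5923),
    ("Hard", "Caricature", 0.5884),
    ("Hard", "Size / Shape / Orientation", 0.6627),
    ("Hard", "Outpainting", 0.6634),
    ("Hard", "Pixar/Disney 3D styles", 0.6463),
]

# Group the rates by difficulty level.
rates_by_level = {}
for level, _edit_type, rate in rows:
    rates_by_level.setdefault(level, []).append(rate)

# Unweighted mean per level, plus the single hardest type in the table.
mean_by_level = {level: sum(r) / len(r) for level, r in rates_by_level.items()}
hardest = min(rows, key=lambda row: row[2])
```

On these rows the easy types average about 0.91 and the hard types about 0.62, with "Font / Text color change" as the lowest entry.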

Comparison with Existing Datasets

Dataset	Scale	Image Source	Rounds
GIER	$10^4$	Real	Single
MagicBrush	$10^4$	Real	Single / Multi
HQ-Edit	$10^5$	Synthetic	Single
Echo-4o-Image	$10^5$	Synthetic	Single
UltraEdit	$10^6$	Real	Single
Ours (Pico-Banana-400K)	$10^5$	Real	Single / Multi

Key Findings

Clear Difficulty Boundaries: Global texture/tone/style edits (not requiring spatial reasoning) consistently achieve success rates $> 0.88$. Fine geometry, outpainting, and typography (alignment, letterform integrity) show the lowest success rates ($0.57$ to $0.66$), often suffering from perspective inconsistency or topological artifacts.
Nano-Banana Capability Profile: While proficient in global transformations, fine-grained spatial editing and layout extrapolation remain open challenges. This suggests future directions: stronger spatial conditioning (region-referencing prompts), geometric-aware training objectives, and explicit OCR/rendering supervision.
Failures as Assets: Recycling failed edits from quality checks into preference pairs provides ~56K alignment training samples at zero additional cost.
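To illustrate how such (success, failure) pairs could be consumed, the sketch below computes the standard DPO objective for a single pair. It assumes a model that exposes log-likelihoods for the chosen and rejected edits; the paper summarized here does not report any such training, the function and argument names are illustrative, and diffusion-based editors would need an adapted formulation.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage on the chosen sample minus that on the rejected
    sample, each measured relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) == log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))


# With no preference learned yet (policy equals reference), the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Raising the chosen edit's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -10.0, -10.0, -10.0)
```

Here the successful edit plays the role of the chosen sample and the failed attempt the rejected one, both conditioned on the same source image and instruction.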

Highlights & Insights

The "Frontier Model Pipeline" Paradigm: Assigning different roles to models (Flash for instructions, Nano-Banana for editing, Pro for judging) reduces human annotation costs to nearly zero while producing 400K quality-controlled samples.
Byproduct Preference Data: Successful/failed triplets naturally serve as (chosen, rejected) pairs for alignment research, merging quality filtering and preference data generation into a single efficient step.
Taxonomy-Driven Balance: Using a 35-type taxonomy solves the "editing type imbalance" at the source, while success rate statistics serve as a diagnostic report on current model capability boundaries.
Dual-Instruction Design: The inclusion of both long photorealistic prompts for supervision and short colloquial prompts for user simulation allows the dataset to serve both training and instruction-gap research.

Limitations & Future Work

Ceiling Restricted by Frontier Models: Since the data is distilled from Nano-Banana, edit types the model struggles with (text, fine geometry) are underrepresented or of lower quality. Furthermore, the "Gemini-2.5-Pro as judge" mechanism may carry model-specific biases and is not cross-validated against human ratings.
Empirical Thresholds and Weights: Success thresholds ($\approx 0.7$) and multi-dimensional weights are empirically set without extensive robustness verification.
Lack of Downstream Empirical Evidence: While the dataset is released and analyzed, the paper lacks downstream training or benchmarking results to prove its practical impact on model performance.
Future Improvements: Introducing human-in-the-loop calibration, enhancing data for difficult geometric/text tasks, and providing finer state annotations for multi-turn chains.

vs. MagicBrush / GIER: These are real-image datasets with human annotation but small scale ($10^4$). Pico-Banana-400K uses model automation to reach $10^5$ while keeping real-image sources and adding preference and multi-turn subsets.
vs. HQ-Edit / Echo-4o-Image: These use similar distillation routes but rely on synthetic images; ours uses real OpenImages photos and covers a more granular taxonomy.
vs. UltraEdit / GPT-Image-Edit-1.5M: While those datasets are larger ($10^6$), ours differentiates itself through quality control, preference subsets, multi-turn chains, and stylistic diversity in human-centric editing.

Rating

Novelty: ⭐⭐⭐⭐ The automated pipeline combined with failure recovery for preference pairs is clever and practical, though the "distillation from frontier models" paradigm is known.
Experimental Thoroughness: ⭐⭐⭐ Provides detailed success rate analysis and dataset comparisons but lacks empirical downstream training verification.
Writing Quality: ⭐⭐⭐⭐ Clearly explained pipeline, taxonomy, and subset divisions with sufficient documentation.
Value: ⭐⭐⭐⭐⭐ A large-scale, real-photo, commercially usable open-source dataset with preference and multi-turn subsets is a significant infrastructure contribution to the instruction editing community.
