Gen-n-Val: Agentic Image Data Generation and Validation¶
Conference: CVPR 2026 arXiv: 2506.04676 Code: GitHub Area: LLM Agent Keywords: Data Augmentation, Synthetic Data, Agentic Data Generation, Long-Tail Distribution, Instance Segmentation
TL;DR¶
This paper proposes Gen-n-Val, an agentic synthetic data generation and validation framework that leverages an LLM to optimize Layer Diffusion prompts for generating high-quality single-object transparent images, and employs a VLLM to filter low-quality samples. The framework reduces invalid synthetic data from 50% to 7%, achieving a 7.6% mAP improvement on LVIS rare-category instance segmentation.
Background & Motivation¶
- Background: Large-scale datasets (e.g., LVIS with 1,203 categories) exhibit severe long-tail distributions—rare categories appear in fewer than 10 images. Synthetic data is an important remedy for data scarcity. Existing approaches include Copy-Paste augmentation and diffusion-model-based generation (e.g., X-Paste, MosaicFusion).
- Limitations of Prior Work: MosaicFusion generates segmentation masks using cross-attention maps, but approximately 50% of data is filtered out, and among the remaining data, roughly 50% still suffers from: (1) a single mask covering multiple objects; (2) inaccurate segmentation masks; (3) incorrect category labels. Directly using Layer Diffusion with standard prompts yields approximately 44% invalid data, as monotonous and vague descriptions lead to low diversity and extraneous objects.
- Key Challenge: High-quality synthetic data requires "single object + precise mask + correct category + high diversity," yet standard prompts cannot simultaneously satisfy all these requirements, and manually designed filtering rules are inefficient and prone to omission.
- Goal: Design an automated agentic pipeline to generate high-quality synthetic data for balancing long-tail datasets.
- Key Insight: An LLM serves as the prompt agent to generate detailed and specific prompts (incorporating object category, style, color, lighting, etc.), while a VLLM serves as the validation agent to filter substandard images; the system prompts of both agents are optimized via TextGrad.
- Core Idea: Layer Diffusion natively outputs an alpha channel that provides precise masks (eliminating the need for additional segmentation models); LLM-optimized prompts ensure single-object content and high diversity; VLLM validation serves as a final safeguard to catch remaining invalid samples.
Method¶
Overall Architecture¶
A three-stage pipeline: (1) Open-Vocabulary Prompt Generation—the LLM agent, with its system prompt optimized via TextGrad, generates detailed LD prompts; (2) Foreground Image Generation—Layer Diffusion generates transparent single-object RGBA images guided by the optimized prompts; (3) Image Filtering—the VLLM validation agent inspects generated image quality and discards substandard samples. The validated foreground instances are then randomly pasted onto background images.
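The final compositing step (pasting validated transparent foregrounds onto backgrounds) can be sketched as plain alpha compositing, with the median filtering mentioned later applied to the alpha channel first. This is a minimal NumPy sketch, not the paper's implementation; function names and the 3×3 filter size are illustrative assumptions.

```python
import numpy as np

def median_filter3(alpha: np.ndarray) -> np.ndarray:
    """3x3 median filter: removes isolated noise pixels in the alpha mask.
    (Illustrative stand-in for the paper's alpha-channel smoothing.)"""
    padded = np.pad(alpha, 1, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1)).astype(alpha.dtype)

def paste_foreground(background: np.ndarray, rgba: np.ndarray,
                     top: int, left: int) -> np.ndarray:
    """Alpha-composite one RGBA foreground instance onto an RGB background
    at position (top, left); in the pipeline, positions are sampled randomly."""
    h, w = rgba.shape[:2]
    out = background.astype(np.float32).copy()
    fg = rgba[..., :3].astype(np.float32)
    alpha = median_filter3(rgba[..., 3]).astype(np.float32)[..., None] / 255.0
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * fg + (1.0 - alpha) * region
    return out.astype(np.uint8)
```

The binarized alpha channel doubles as the instance mask for the pasted object, which is what makes the generated annotations pixel-accurate by construction.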
Key Designs¶
- TextGrad-Optimized LD Prompt Agent:
- Function: Generate detailed prompts that guide Layer Diffusion to produce high-quality single-object images.
- Mechanism: Three LLMs collaborate: the LD prompt agent \(A_{p_{LD}}\) generates the LD prompt \(p_{LD}\) from the system prompt \(p_{\text{sys}}\); the prompt evaluator \(E_{\text{prompt}}\) assesses the quality of the generated prompt and outputs a textual loss \(L\); TextGrad's textual gradient descent then updates the system prompt, yielding \(p_{\text{sys}}^*\). The prompt verifier \(V_{\text{prompt}}\) compares prompt quality before and after the update to decide whether to accept it; the process iterates until the verifier accepts or the maximum iteration count is reached. The optimized prompt incorporates detailed attributes including object category, action, environment, style, color, texture, lighting, and viewpoint.
- Design Motivation: Standard prompts (e.g., "a photo of a single [category]") are monotonous and vague, yielding low diversity and extraneous objects; detailed, attribute-rich prompts address both problems.
- Layer Diffusion Foreground Generation:
- Function: Generate transparent foreground object images with precise alpha masks.
- Mechanism: Layer Diffusion encodes the alpha transparency channel into the latent distribution of Stable Diffusion, directly outputting RGBA images. The alpha channel natively provides precise segmentation masks without requiring additional models such as SAM. After generation, median filtering is applied to the alpha channel to remove isolated noise pixels and smooth mask edges.
- Design Motivation: Obtaining masks via cross-attention maps (MosaicFusion) or additional segmentation models (X-Paste) yields unstable quality and incurs additional runtime; the alpha channel provides a "free" precise mask.
- VLLM Data Validation Agent:
- Function: Automatically filter substandard synthetic images.
- Mechanism: A VLLM (Meta-LLaMA-3.2-11B-Vision-Instruct) serves as the validation agent, with its system prompt likewise optimized via TextGrad. Validation criteria are encoded into the system prompt: (1) Single object—the image contains only one instance of the target category; (2) Single viewpoint—the object is depicted from a single angle; (3) Completeness—the object is fully visible; (4) Clean background—the background is blank and free of distractors. Images failing validation are discarded.
- Design Motivation: Even after prompt optimization, approximately 7% of samples remain invalid; VLLM validation serves as the final line of defense to ensure data quality.
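The four validation criteria above amount to a conjunctive filter over per-criterion verdicts. A minimal sketch, assuming the VLLM's answer is parsed into a dict of booleans (the dict-based interface and function names are hypothetical; the paper drives Meta-LLaMA-3.2-11B-Vision-Instruct with a TextGrad-optimized system prompt):

```python
# The four criteria encoded in the validation agent's system prompt.
CRITERIA = ("single_object", "single_viewpoint", "complete", "clean_background")

def is_valid(verdict: dict) -> bool:
    """An image passes only if every validation criterion holds."""
    return all(verdict.get(c, False) for c in CRITERIA)

def filter_images(samples):
    """Keep (path, verdict) pairs whose images pass; discard the rest."""
    return [path for path, verdict in samples if is_valid(verdict)]
```

Requiring all four criteria simultaneously is what lets this stage act as the final safeguard after prompt optimization.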
Loss & Training¶
TextGrad optimization employs textual gradients (LLM-generated feedback text) in lieu of numerical gradients to optimize system prompts. The LLM used is Meta-LLaMA-3.1-8B-Instruct; the VLLM used is Meta-LLaMA-3.2-11B-Vision-Instruct.
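The generate/evaluate/revise/verify loop can be sketched as follows. This is an illustrative reconstruction, not TextGrad's actual API: `llm` stands in for calls to Meta-LLaMA-3.1-8B-Instruct, the evaluator's critique plays the role of the textual gradient, and all prompt strings are hypothetical.

```python
def optimize_system_prompt(p_sys: str, llm, max_iters: int = 5) -> str:
    """Iteratively refine a system prompt using textual feedback in place
    of numerical gradients (TextGrad-style loop, sketched)."""
    for _ in range(max_iters):
        # Prompt agent: generate an LD prompt from the current system prompt.
        p_ld = llm(f"Using this system prompt, write a Layer Diffusion prompt:\n{p_sys}")
        # Evaluator: textual loss L, a critique of the generated prompt.
        feedback = llm(f"Critique this LD prompt for single-object detail and diversity:\n{p_ld}")
        # Textual gradient descent step: revise the system prompt from the critique.
        candidate = llm(f"Revise the system prompt using this feedback:\n{feedback}\n---\n{p_sys}")
        # Verifier: compare before/after and accept or reject the update.
        verdict = llm(f"Is the revised system prompt better? Answer yes or no.\nOld: {p_sys}\nNew: {candidate}")
        if verdict.strip().lower().startswith("yes"):
            return candidate  # verifier accepts the optimized prompt
    return p_sys  # fall back after max_iters rejections
```

The same loop shape applies to the VLLM validation agent's system prompt; only the evaluation criteria differ.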
Key Experimental Results¶
Main Results¶
LVIS Instance Segmentation:
| Method | \(\text{mAP}_\text{mask}\) | \(\text{mAP}_\text{mask}^\text{rare}\) | Invalid Data Rate |
|---|---|---|---|
| Mask R-CNN (baseline) | 21.7 | 9.6 | — |
| MosaicFusion | 23.1 | 15.2 | ~50% |
| Gen2Det | 23.6 | 15.3 | — |
| Gen-n-Val | 25.6 | 17.2 (+7.6) | ~7% |
COCO Instance Segmentation (YOLO11m):
| Method | mAP | \(\text{mAP}_\text{rare}\) |
|---|---|---|
| YOLO11m (baseline) | 10.3 | 6.5 |
| Copy-Paste | 10.4 | 6.7 |
| Gen-n-Val | 14.5 | 10.1 (+3.6) |
Ablation Study¶
| Configuration | Invalid Data Rate | Notes |
|---|---|---|
| Standard prompt + LD | ~44% | No prompt optimization |
| TextGrad-optimized prompt + LD | ~7% | Prompt agent effective |
| + VLLM validation | <1% | Validation agent further reduces invalid rate |
| MosaicFusion | ~50% | Baseline method |
Key Findings¶
- Invalid data reduced from 50% to 7%: Prompt optimization is the dominant contributor, with VLLM validation providing an additional quality guarantee.
- Most significant gains on rare categories (+7.6 mAP): Validates the substantial value of synthetic data in balancing long-tail distributions.
- Scalability: Injecting more synthetic data (from 1,874 to 727,393 instances) yields continuous improvement (+0.9 → +7.6 \(\text{mAP}_\text{rare}\)).
Highlights & Insights¶
- The alpha channel of Layer Diffusion as a "free mask" is the most critical technical choice: it eliminates dependency on additional segmentation models and fundamentally guarantees perfect alignment between the mask and the object.
- TextGrad-optimized dual-agent prompts constitute an elegant automation strategy: both "what constitutes a good generation prompt" and "what constitutes a qualified image" are delegated to LLMs for automatic iterative optimization, removing the need for manually designed rules.
- The insight that data quality > data quantity merits attention: reducing the invalid data rate from 50% to 7% yields more substantial gains than simply increasing data volume.
Limitations & Future Work¶
- Dependent on the generation quality of Layer Diffusion—results for certain rare categories (e.g., specific foods or tools) may be suboptimal.
- TextGrad optimization requires multiple LLM calls, incurring non-trivial computational overhead.
- Validation is limited to object detection and instance segmentation; extension to semantic segmentation, keypoint detection, and other tasks has not been explored.
- The Copy-Paste compositing approach may produce unnatural scene layouts.
- Future work could explore more realistic object placement in 3D scenes.
Related Work & Insights¶
- vs. MosaicFusion: MosaicFusion generates masks via cross-attention, with a ~50% invalid rate. Gen-n-Val uses alpha channels and agentic validation, achieving only ~7% invalid rate with higher mask quality.
- vs. X-Paste: X-Paste uses an additional segmentation model to obtain masks, requiring 4.3× the GPU time of MosaicFusion. Gen-n-Val's alpha channel approach is more efficient.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of Layer Diffusion + dual agents + TextGrad is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ LVIS + COCO benchmarks, multiple detectors, and scalability analysis.
- Writing Quality: ⭐⭐⭐⭐ Pipeline is clearly presented; failure cases are illustrated intuitively.
- Value: ⭐⭐⭐⭐⭐ Provides a practical and efficient solution for data augmentation in long-tail datasets.