Referring Layer Decomposition¶

Conference: ICLR 2026
arXiv: 2602.19358
Code: https://yaojie-shen.github.io/project/RLD/
Area: Image Decomposition / Image Editing
Keywords: Layer Decomposition, RGBA Layers, Multimodal Referring Input, Data Engine, RefLayer

TL;DR¶

This paper introduces the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image given flexible user-provided prompts (spatial, textual, or hybrid). It also constructs the RefLade dataset comprising 1.1 million samples and proposes an automated evaluation protocol.

Background & Motivation¶

Modern generative models typically process images holistically and lack explicit representations of individual scene elements. This makes selective manipulation, consistency across successive edits, and semantic alignment difficult. Image layers (transparent visual units in RGBA format) provide a more intuitive framework, analogous to layer-based workflows in Photoshop.

Limitations of prior work:
- MuLAn: Limited data scale (44K images) with a success rate of only 36%.
- Text2Layer: Restricted to separating foreground and background into two layers only.
- LayerDecomp: Relies on synthetic supervision and requires target masks.

The core innovation of the RLD task lies in supporting diverse user prompts (points, boxes, masks, and text) to enable on-demand extraction of target RGBA layers.
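To make the task interface concrete, the prompt types listed above can be pictured as one container whose fields are all optional. This is only a sketch: the class name `ReferringPrompt`, its field names, and the coordinate conventions are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ReferringPrompt:
    """Hypothetical container for the prompt types named in the RLD task."""
    points: List[Tuple[int, int]] = field(default_factory=list)  # (x, y) clicks
    box: Optional[Tuple[int, int, int, int]] = None               # (x0, y0, x1, y1)
    mask: Optional[List[List[int]]] = None                        # binary H x W grid
    text: Optional[str] = None                                    # referring expression

    def kind(self) -> str:
        """Classify the prompt as 'spatial', 'textual', or 'hybrid'."""
        has_spatial = bool(self.points) or self.box is not None or self.mask is not None
        has_text = bool(self.text)
        if has_spatial and has_text:
            return "hybrid"
        if has_spatial:
            return "spatial"
        if has_text:
            return "textual"
        raise ValueError("prompt is empty")
```

For example, `ReferringPrompt(box=(10, 20, 80, 120), text="the red mug").kind()` returns `"hybrid"`, matching the spatial / textual / hybrid split described in the TL;DR.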

Method¶

Overall Architecture¶

RLD comprises three major components:
1. RefLade Dataset: 1.1 million image–layer–prompt triplets.
2. Automated Evaluation Protocol: Three-axis evaluation covering preservation, completion, and fidelity.
3. RefLayer Baseline Model: Conditional generation built upon Stable Diffusion 3.

Data Engine (6-Stage Pipeline)¶

1. Pre-filtering: Rule-based removal of low-quality images (86.1% retention rate).
2. Scene Understanding: Integration of closed-set detection, open-vocabulary detection, and MLLM-based localization.
3. Layer Completion: Reconstruction of occluded object regions.
4. Post-processing: Mask refinement and alpha matte prediction.
5. Prompt Generation: Generation of spatial, textual, and multimodal prompts.
6. Post-filtering: Assessment of RGBA layer fidelity, realism, and semantic consistency.

The data engine reaches an overall success rate of 70%, compared with 36% for MuLAn.
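The staged structure of the data engine can be sketched as a chain of functions in which any stage may drop a sample. The sketch only illustrates the ordering and the per-stage survivor counts. The stage bodies, the `width` and `quality` fields, and both thresholds are hypothetical placeholders, not the rules or models used in the paper.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Dict[str, object]
Stage = Callable[[Sample], Optional[Sample]]  # a stage returns None to drop the sample


def run_pipeline(samples: List[Sample], stages: List[Tuple[str, Stage]]):
    """Run each sample through the stages in order and count survivors per stage."""
    survivors = {name: 0 for name, _ in stages}
    outputs = []
    for sample in samples:
        current: Optional[Sample] = dict(sample)
        for name, stage in stages:
            current = stage(current)
            if current is None:
                break  # dropped: skip the remaining stages
            survivors[name] += 1
        else:
            outputs.append(current)  # reached only if no stage dropped the sample
    return outputs, survivors


def pre_filter(s: Sample) -> Optional[Sample]:
    # Hypothetical rule: drop very small images.
    return s if s["width"] >= 256 else None


def placeholder(tag: str) -> Stage:
    # Stands in for a real model call (detection, completion, matting, ...).
    def stage(s: Sample) -> Optional[Sample]:
        s[tag] = True
        return s
    return stage


def post_filter(s: Sample) -> Optional[Sample]:
    # Hypothetical rule: keep layers whose quality score passes a threshold.
    return s if s["quality"] >= 0.5 else None


STAGES = [
    ("pre_filter", pre_filter),
    ("scene_understanding", placeholder("detected")),
    ("layer_completion", placeholder("completed")),
    ("post_processing", placeholder("matted")),
    ("prompt_generation", placeholder("prompted")),
    ("post_filter", post_filter),
]
```

Dividing the number of samples that survive the final stage by the number of inputs gives a success rate in the sense of the figures quoted above, under the assumption that the rate is measured end to end.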

Evaluation Protocol (Three Dimensions)¶

Preservation score \(\mathcal{S}_{\text{vis}}\): LPIPS perceptual distance over the originally visible regions (lower is better).

\[\mathcal{S}_{\text{vis}} = \mathbb{E}_{(p,g)\sim\mathcal{D}}[\text{LPIPS}(g_{\text{rgb}} \odot g_v, p_{\text{rgb}} \odot g_v)]\]
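The masking step in this formula can be sketched as follows. It reads \(p\) as the predicted layer, \(g\) as the ground-truth layer, and \(g_v\) as the ground-truth visibility mask, which is an interpretation because these notes do not define the symbols. LPIPS needs a learned network, so the sketch takes the distance as a parameter and demonstrates with a plain mean absolute difference, which is a stand-in and not LPIPS.

```python
def apply_mask(rgb, mask):
    """Element-wise product: zero out every pixel where the mask is 0."""
    return [[[c * m for c in px] for px, m in zip(row, mrow)]
            for row, mrow in zip(rgb, mask)]


def mean_abs_diff(a, b):
    """Stand-in distance (NOT LPIPS): mean absolute difference over all channels."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for px_a, px_b in zip(row_a, row_b):
            for ca, cb in zip(px_a, px_b):
                total += abs(ca - cb)
                count += 1
    return total / count


def preservation_score(triples, distance=mean_abs_diff):
    """Average distance(g_rgb * g_v, p_rgb * g_v) over (p_rgb, g_rgb, g_v) triples."""
    values = [distance(apply_mask(g, v), apply_mask(p, v)) for p, g, v in triples]
    return sum(values) / len(values)
```

Because both images are multiplied by the same mask, any disagreement in regions that were occluded in the source image does not affect the score; those regions are what the completion score is meant to judge.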

Completion score \(\mathcal{S}_{\text{gen}}\): Directional similarity based on CLIP features.

\[\mathcal{S}_{\text{gen}} = \mathbb{E}[\cos(f(g_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v), f(p_{\text{rgb}}) - f(g_{\text{rgb}} \odot g_v))]\]
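The directional similarity can be computed directly once the three feature vectors are available. The sketch below assumes that \(f\) is a CLIP image encoder whose outputs are passed in as plain lists of floats. Returning 0.0 when either direction has zero length is a convention chosen here, not something specified in the paper.

```python
import math


def subtract(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def completion_score(f_gt, f_pred, f_visible):
    """cos(f(g_rgb) - f(g_rgb * g_v), f(p_rgb) - f(g_rgb * g_v)) for one sample."""
    return cosine(subtract(f_gt, f_visible), subtract(f_pred, f_visible))
```

Both directions start from the same point, the features of the visible-only ground truth. The score is therefore high when the prediction moves away from the visible content in the same feature direction as the ground truth does, regardless of how far it moves.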

Fidelity score \(\mathcal{S}_{\text{fid}}\): FID computed after alpha-compositing the predicted layer onto a background.
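The compositing step that precedes the FID computation can be sketched with the standard straight-alpha "over" operator, out = α · layer + (1 − α) · background. The sketch assumes channel values in [0, 1] and an opaque background; which backgrounds the protocol actually uses is not stated in these notes.

```python
def composite_over(layer_rgba, background_rgb):
    """Straight-alpha 'over' compositing of an RGBA layer onto an opaque RGB background."""
    out = []
    for layer_row, bg_row in zip(layer_rgba, background_rgb):
        out_row = []
        for (r, g, b, a), (bg_r, bg_g, bg_b) in zip(layer_row, bg_row):
            out_row.append((
                a * r + (1 - a) * bg_r,
                a * g + (1 - a) * bg_g,
                a * b + (1 - a) * bg_b,
            ))
        out.append(out_row)
    return out
```

A fully opaque pixel (α = 1) replaces the background, a fully transparent pixel (α = 0) leaves it untouched, and intermediate values blend the two, so errors in the predicted alpha matte show up in the composited image that FID evaluates.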

HPA composite score: A normalized weighted average calibrated via human-preference Elo rankings, which exhibits strong correlation with human judgments.
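A normalized weighted average of this kind might look like the sketch below. The min-max normalization across models, the direction flags, and the weights are all assumptions for illustration. These notes do not give the normalization or the calibrated weights, so none of the numbers below come from the paper.

```python
def min_max(values, higher_is_better):
    """Scale values to [0, 1] across models, flipping lower-is-better metrics."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # no spread: every model gets the same score
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(metrics, higher_is_better, weights):
    """metrics: {name: [value per model]}; returns one combined score per model."""
    total_weight = sum(weights.values())
    n_models = len(next(iter(metrics.values())))
    combined = [0.0] * n_models
    for name, values in metrics.items():
        normed = min_max(values, higher_is_better[name])
        for i, v in enumerate(normed):
            combined[i] += weights[name] / total_weight * v
    return combined


# Hypothetical usage with two models and equal weights.
example = composite_scores(
    metrics={"vis": [0.1, 0.3], "gen": [0.8, 0.2], "fid": [10.0, 30.0]},
    higher_is_better={"vis": False, "gen": True, "fid": False},
    weights={"vis": 1.0, "gen": 1.0, "fid": 1.0},
)
```

In this toy example the first model is better on every metric, so it receives the maximum combined score and the second model the minimum. Calibrating against Elo rankings would amount to choosing the weights so that the combined ordering matches the human ordering as closely as possible.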

RefLayer Model¶

Built upon Stable Diffusion 3.
A VAE encoder encodes the source image and spatial prompts.
A lightweight convolutional layer compresses the channel dimensions.
Dual decoders: a standard RGB decoder and a custom alpha decoder.

Prompt encoding strategy: All spatial prompts are unified into a colored RGB image format:
- Blue canvas → background
- Green region → bounding box
- Red region → mask
- Gaussian heatmap → point
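The color coding above can be sketched by rasterizing each prompt type onto a blue canvas. Several details are assumptions: the box is drawn as a filled region, the exact color values are pure blue, green, and red, and the point heatmap is rendered as a brightness bump with a hypothetical `sigma`. The notes do not say how the heatmap is mapped to color channels.

```python
import math

BLUE, GREEN, RED = (0, 0, 255), (0, 255, 0), (255, 0, 0)


def blank_canvas(height, width):
    """Blue canvas marking every pixel as background."""
    return [[list(BLUE) for _ in range(width)] for _ in range(height)]


def draw_box(canvas, x0, y0, x1, y1):
    """Fill an inclusive bounding box with green (assumes it lies inside the canvas)."""
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            canvas[y][x] = list(GREEN)


def draw_mask(canvas, mask):
    """Paint red wherever the binary mask is non-zero."""
    for y, row in enumerate(mask):
        for x, m in enumerate(row):
            if m:
                canvas[y][x] = list(RED)


def draw_point(canvas, cx, cy, sigma=2.0):
    """Brighten pixels around (cx, cy) with a Gaussian falloff (assumed rendering)."""
    for y, row in enumerate(canvas):
        for x, px in enumerate(row):
            weight = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            level = round(255 * weight)
            canvas[y][x] = [max(c, level) for c in px]
```

Encoding every spatial prompt as an ordinary RGB image means the same VAE encoder that handles the source image can also encode the prompt, which fits the model description above.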

Experiments¶

Dataset Statistics¶

Dataset	Task	#Images	#Categories	#Instances	Occlusion Rate
MuLAn	LD	44,860	759	101,269	7.7%
RefLade	RLD	430,488	12K	871,829	60.8%
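A few ratios derived from the table help put the scale difference in perspective. The numbers are taken directly from the two rows above; only the variable and function names are introduced here.

```python
stats = {
    "MuLAn": {"images": 44_860, "instances": 101_269},
    "RefLade": {"images": 430_488, "instances": 871_829},
}


def instances_per_image(name):
    """Average number of annotated instances per image for one dataset."""
    return stats[name]["instances"] / stats[name]["images"]


image_ratio = stats["RefLade"]["images"] / stats["MuLAn"]["images"]
instance_ratio = stats["RefLade"]["instances"] / stats["MuLAn"]["instances"]

print(f"MuLAn instances/image:   {instances_per_image('MuLAn'):.2f}")    # 2.26
print(f"RefLade instances/image: {instances_per_image('RefLade'):.2f}")  # 2.03
print(f"image ratio:    {image_ratio:.1f}x")                              # 9.6x
print(f"instance ratio: {instance_ratio:.1f}x")                           # 8.6x
```

RefLade therefore has about 9.6 times as many images and 8.6 times as many instances as MuLAn, with a similar number of instances per image. The larger difference is in the occlusion rate (60.8% vs. 7.7%).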

Evaluation Protocol Validation¶

The HPA score exhibits strong correlation with human Elo rankings, whereas individual metrics \(\mathcal{S}_{\text{vis}}\), \(\mathcal{S}_{\text{fid}}\), and \(\mathcal{S}_{\text{gen}}\) each fail to consistently reflect human preferences in isolation.
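The validation idea can be sketched in two steps: derive Elo ratings from pairwise human votes, then check how well a metric's ordering of the models agrees with the Elo ordering. The sketch uses the standard Elo update and Spearman rank correlation. The K-factor, the base rating, and the choice of Spearman are assumptions, since these notes do not state which settings or correlation measure the paper uses.

```python
def elo_ratings(matches, k=32.0, base=1000.0):
    """Standard Elo updates from (winner, loser) pairs of human preference votes."""
    ratings = {}
    for winner, loser in matches:
        r_win = ratings.setdefault(winner, base)
        r_lose = ratings.setdefault(loser, base)
        expected_win = 1.0 / (1.0 + 10 ** ((r_lose - r_win) / 400.0))
        delta = k * (1.0 - expected_win)  # rating points moved from loser to winner
        ratings[winner] = r_win + delta
        ratings[loser] = r_lose - delta
    return ratings


def ranks(values):
    """Rank positions (0 = smallest), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A metric that orders the models exactly as the human Elo ratings do gives a correlation of 1.0, and one that reverses the order gives -1.0. The claim above is that the composite score stays close to the former while each individual metric does not do so consistently.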

Quality Assessment¶

74.7% of foreground layers and 70.2% of background layers meet the quality threshold.
Human annotation was conducted over 43 days by 9 professional annotators.
Careful curation yields 59K high-quality images and 110K validated layers.

Key Findings¶

Coarse-grained prompts (e.g., a single point) may lead to coarse-grained outputs, whereas precise prompts yield accurate object-level layers.
RefLayer demonstrates strong zero-shot generalization capability.
The multi-granularity prompt system supports flexible control ranging from coarse to fine.

Highlights & Insights¶

This work is the first to formally define the layer decomposition task conditioned on multimodal referring inputs.
The data engine is systematically designed and raises the success rate from 36% to 70%.
The evaluation protocol aligns closely with human preferences, addressing a critical evaluation bottleneck.
The RefLade dataset is nearly ten times larger than MuLAn by image count (430K vs. 44K).

Limitations & Future Work¶

The data engine relies on multiple external models (detection, segmentation, inpainting), making cascading errors unavoidable.
Human annotation incurs high cost (43 days × 9 annotators).
Ground-truth layers used in the evaluation protocol may themselves be imperfect.

Related Work¶

Image Understanding and Editing: Detection, segmentation, inpainting, alpha matting, etc.
Compositional Image Representation: RGBA layer methods such as MuLAn and Text2Layer.
Referring Expression Segmentation: Promptable segmentation methods such as SAM produce only masks and do not reconstruct occluded content.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — The task definition is original and fills a clear research gap.
Data Contribution: ⭐⭐⭐⭐⭐ — Million-scale dataset + data engine + human annotation.
Evaluation: ⭐⭐⭐⭐ — Three-dimensional evaluation protocol aligned with human preferences.
Practicality: ⭐⭐⭐⭐ — Direct applicability to image editing and compositing.