
Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting

Info

  • Conference: ICCV 2025
  • arXiv: 2508.01098
  • Code: ykdai/trans-adapter
  • Area: Image Generation / Image Inpainting
  • Keywords: Transparent image inpainting, RGBA, diffusion models, adapter, alpha channel

TL;DR

This paper proposes Trans-Adapter, a plug-and-play adapter module that enables diffusion-based image inpainting models to directly process transparent (RGBA) images. It also introduces the LayerBench benchmark and the Alpha Edge Quality (AEQ) metric.

Background & Motivation

Core Problem

RGBA transparent images are widely used in film, animation, and game production, yet existing AI inpainting tools (e.g., generative fill) support only RGB images. This forces designers to repair transparent images by hand, which is labor-intensive and makes it hard to maintain visual consistency.

Two Limitations of Prior Work

Inconsistency from separate processing: RGB is first inpainted on a background, then alpha is extracted via matting — the two channels are generated independently, leading to misalignment.

Poor edge quality: Matting and segmentation methods introduce jagged edges at transparency boundaries, degrading compositing quality.

Gap in the Field

  • Methods for transparent image/video generation exist (LayerDiffuse, Zippo, etc.), but no dedicated transparent image inpainting method has been proposed.
  • Existing inpainting methods (SD-Inpainting, BrushNet, etc.) are all designed for RGB.
  • No benchmark or metric specifically evaluating transparent image inpainting exists.

Method

Overall Architecture

The paper decomposes RGBA images into RGB and alpha channels, treating them as a "two-frame video." A T2I model is inflated to process this "video," with a spatial alignment module and cross-domain self-attention introduced to ensure RGB–alpha consistency.

Key Design 1: Network Inflation

RGBA is decomposed into a padded RGB image and an alpha channel, forming a 5D input tensor \(f \in \mathbb{R}^{b \times c \times 2 \times h \times w}\). During inference through the pretrained diffusion model, this is reshaped to \(f_r \in \mathbb{R}^{2b \times c \times h \times w}\), stacking RGB and alpha latent representations along the batch dimension.
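A rough PyTorch-style sketch of this inflation (tensor names and dimensions are illustrative, not taken from the released code):

```python
import torch

b, c, h, w = 2, 4, 64, 64           # batch size, latent channels, spatial dims (illustrative)
f = torch.randn(b, c, 2, h, w)      # "two-frame video": frame 0 = RGB latent, frame 1 = alpha latent

# Fold the frame axis into the batch axis so the frozen 2D U-Net treats
# the RGB and alpha latents as independent samples.
f_r = f.permute(0, 2, 1, 3, 4).reshape(2 * b, c, h, w)

# Unfold again wherever a new module needs to see RGB and alpha jointly.
f_back = f_r.reshape(b, 2, c, h, w).permute(0, 2, 1, 3, 4)
assert torch.allclose(f, f_back)
```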

Original network parameters are frozen; only newly introduced modules are trained. All new modules use zero-initialized output projection layers to avoid disrupting pretrained capabilities at the start of training.

Key Design 2: Spatial Alignment Module

Introduced in the shallow layers of the U-Net, this module enforces spatial synchronization between alpha and RGB at corresponding positions via convolutional layers:

\[f = f_r + \mathcal{Z}_c(\mathbf{ConvBlock}(f_r))\]

Inside the module, the inflated features are reshaped from \(\mathbb{R}^{2b \times c \times h \times w}\) to \(\mathbb{R}^{b \times 2c \times h \times w}\) (RGB and alpha concatenated along the channel dimension), processed by a convolutional block, and added back through a zero-initialized convolution.
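A minimal sketch of how such a module could look (PyTorch-style; the ConvBlock layout, activation, and module name are my assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SpatialAlignment(nn.Module):
    """Residual conv block over channel-concatenated RGB/alpha features,
    closed by a zero-initialized projection (the Z_c in the equation above)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(2 * c, 2 * c, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        self.zero_conv = nn.Conv2d(2 * c, 2 * c, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)   # residual is zero at init, so the block starts as identity
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, f_r: torch.Tensor) -> torch.Tensor:
        # f_r: (2b, c, h, w) with RGB and alpha stacked along the batch axis
        two_b, c, h, w = f_r.shape
        b = two_b // 2
        x = f_r.reshape(b, 2 * c, h, w)           # concatenate RGB/alpha along channels
        x = x + self.zero_conv(self.conv_block(x))
        return x.reshape(2 * b, c, h, w)          # back to the inflated layout
```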

Key Design 3: Cross-Domain Self-Attention

Introduced in the bottleneck layer of the U-Net, this module performs self-attention jointly over the RGB and alpha token sequences, allowing masked regions to reference surrounding context during inpainting; this is particularly important for high-frequency details such as hair:

\[\textbf{self-attention}(f_r) = \text{softmax}\left(\frac{\mathbf{Q}_i\mathbf{K}_i^{\top}}{\sqrt{D}}\right)\mathbf{V}_i\]

where \(f_r \in \mathbb{R}^{b \times 2hw \times c}\). After adding 2D positional embeddings, self-attention is applied, followed by a zero-initialized MLP:

\[f = f_r + \mathcal{Z}_M(\textbf{AttentionBlock}(f_r))\]
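A rough PyTorch-style sketch of this block (the positional-embedding handling, head count, and MLP shape are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CrossDomainSelfAttention(nn.Module):
    """Joint self-attention over the RGB and alpha token sequences, with a
    zero-initialized MLP (the Z_M above) so the block starts as identity."""
    def __init__(self, c: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(c, c), nn.GELU(), nn.Linear(c, c))
        nn.init.zeros_(self.mlp[-1].weight)   # residual is zero at init
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, f_r: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
        # f_r: (b, 2*h*w, c) -- RGB and alpha tokens concatenated along the sequence axis
        # pos_emb: (1, 2*h*w, c) -- 2D positional embeddings shared by both domains
        x = f_r + pos_emb
        attn_out, _ = self.attn(x, x, x)      # every token can attend across both domains
        return f_r + self.mlp(attn_out)
```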

Key Design 4: Alpha Map LoRA

A LoRA module is introduced to enable the model to learn alpha channel inpainting (a generic sketch follows the list below):

  • LoRA weights are zero-initialized, learning only the residual required for alpha generation.
  • Both RGB and alpha are used as training targets simultaneously.
  • Different text prompts distinguish the two: alpha uses "alpha map of [prompt]," while RGB uses the original prompt.
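For intuition, a generic LoRA layer with a zero-initialized up-projection looks roughly like the following (a sketch of standard LoRA, not the paper's exact rank or placement):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank residual; the up-projection is
    zero-initialized, so training starts from the unmodified pretrained weights."""
    def __init__(self, base: nn.Linear, rank: int = 16, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                       # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                        # residual starts at zero
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```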

Two-Stage Training

  • Stage 1: Only the Alpha Map LoRA is trained, endowing the model with alpha reconstruction capability.
  • Stage 2: The spatial alignment module, cross-domain self-attention, and Alpha Map LoRA are jointly fine-tuned to achieve aligned RGBA inpainting.

The training loss follows the standard DDPM objective:

\[\mathcal{L} = \mathbb{E}_{\mathcal{E}(x_0),y,\epsilon\sim\mathcal{N}(0,I),t}\left[\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y), C)\|_2^2\right]\]
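In code, one training step under this objective looks roughly like the following, assuming a diffusers-style noise scheduler and treating `unet` as a callable that returns the predicted noise directly (a simplification of the real model interface):

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, latents, text_emb, inpaint_cond):
    """One epsilon-prediction step: sample a timestep, add noise, and regress
    the injected noise (standard DDPM / latent-diffusion objective)."""
    b = latents.shape[0]
    t = torch.randint(0, scheduler.config.num_train_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    z_t = scheduler.add_noise(latents, noise, t)      # forward diffusion to step t
    eps_pred = unet(z_t, t, text_emb, inpaint_cond)   # epsilon_theta(z_t, t, tau(y), C)
    return F.mse_loss(eps_pred, noise)
```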

Two Instantiations

  1. SD-Inpainting style: Extends the U-Net's input channels to encode the mask and masked image (see the sketch after this list).
  2. BrushNet style: Introduces a trainable inpainting branch.
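For the first instantiation, the standard trick is to widen the U-Net's first convolution and zero-initialize the weights for the new conditioning channels; a hedged sketch (the helper name and details are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

def inflate_input_conv(conv_in: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Widen a U-Net's first conv to accept extra conditioning channels
    (mask + masked image), zero-initializing the new weights so the model
    initially ignores the added inputs."""
    new_conv = nn.Conv2d(conv_in.in_channels + extra_channels, conv_in.out_channels,
                         kernel_size=conv_in.kernel_size, padding=conv_in.padding)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv_in.in_channels] = conv_in.weight   # copy pretrained weights
        new_conv.bias.copy_(conv_in.bias)
    return new_conv
```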

Training Data

  • 30K in-house high-quality transparent images (purchased from online PNG stock libraries and manually filtered).
  • 90% of the MAGICK dataset (150K SDXL-generated transparent images) is merged in.
  • Each image is paired with a text description generated by LLaVA.

LayerBench Benchmark

Dataset Composition

800 transparent images in total:

  • LayerBench-Natural (400 images): Online PNGs combined with matting datasets (DIM, etc.).
  • LayerBench-Generated (400 images): 200 high-aesthetic-score images from MAGICK + 200 images generated by LayerDiffusion.

A key characteristic: most inpainting masks overlap with alpha boundaries, specifically designed to test RGB–alpha alignment.

AEQ Metric

Alpha Edge Quality: a lightweight CNN binary classifier that takes an 8-channel input (white-background and black-background composites + alpha + alpha edge mask) and outputs a per-pixel probability of low-quality edges:

\[\text{AEQ} = 1 - \frac{1}{|\mathcal{M}_e|}\sum_{(x,y)\in\mathcal{M}_e}\mathcal{F}(I_w, I_b, \alpha, \mathcal{M}_e)_{x,y}\]

AEQ ranges in \([0, 1]\); higher is better.
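A sketch of how the metric could be evaluated, assuming a classifier `aeq_net` (name is illustrative) that maps the 8-channel input to a per-pixel logit map for low-quality edges:

```python
import torch

def alpha_edge_quality(aeq_net, comp_white, comp_black, alpha, edge_mask):
    """AEQ = 1 - mean low-quality probability over alpha-edge pixels.
    comp_white/comp_black: (3, H, W) composites; alpha, edge_mask: (1, H, W)."""
    x = torch.cat([comp_white, comp_black, alpha, edge_mask], dim=0).unsqueeze(0)  # (1, 8, H, W)
    prob_low_quality = torch.sigmoid(aeq_net(x))[0, 0]   # assumes the net outputs logits of shape (1, 1, H, W)
    edge = edge_mask[0].bool()                            # restrict the average to edge pixels
    return 1.0 - prob_low_quality[edge].mean()
```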

Key Experimental Results

Main Results (SD1.5-Inpainting + Blended Noise Strategy)

Method                   AS↑     LPIPS↓    CLIP Sim↑   AEQ↑
ZIM (matting)            6.044   0.0526    27.040      0.9874
U²-Net (segmentation)    6.007   0.0560    26.978      0.9537
BiRefNet (segmentation)  6.055   0.0515    27.049      0.9886
Ours                     6.097   0.0408    27.030      0.9878

Key findings:

  • LPIPS is substantially better than all two-stage methods (0.0408 vs. 0.0515–0.0560), indicating better preservation of the inpainted regions.
  • AEQ is competitive with the best matting method and far superior to segmentation methods (U²-Net: 0.9537).
  • Achieves the highest aesthetic score (6.097).

SDXL Results

Method          AS↑     LPIPS↓    CLIP Sim↑   AEQ↑
LayerDiffusion  6.016   0.0642    27.097      0.9781
ZIM             6.115   0.0461    27.111      0.9828
BiRefNet        6.129   0.0453    27.104      0.9859
Ours            6.140   0.0434    27.134      0.9872

On SDXL, the method likewise outperforms all competing approaches across every metric.

Ablation Study

Variant                  AS↑     LPIPS↓    CLIP Sim↑   AEQ↑
Full model               6.097   0.0408    27.030      0.9878
w/o MAGICK data          6.073   0.0435    27.037      0.9873
w/o in-house data        6.067   0.0457    27.071      0.9881
AnimateDiff substitute   6.067   0.0459    27.032      0.9872
w/o spatial alignment    5.542

Key findings:

  • The spatial alignment module is the most critical component; removing it causes a severe drop in AS.
  • The two data sources are complementary; mixing them yields the best results.
  • The Trans-Adapter design outperforms direct use of AnimateDiff.

Highlights & Insights

  1. Pioneering contribution: The first dedicated transparent image inpainting method, filling an important gap in the field.
  2. Plug-and-play design: Compatible with multiple architectures including SD1.5, SDXL, and BrushNet, with original parameters frozen to maintain compatibility.
  3. Elegant "video" formulation: Treating RGB + alpha as a two-frame video is a natural and effective modeling strategy.
  4. Zero-initialization strategy: Zero-initializing all new module outputs ensures training stability and progressive learning.
  5. AEQ metric: Addresses the lack of edge alignment evaluation standards for transparent images.

Limitations & Future Work

  • Currently supports only single-layer transparent images; multi-layer RGBA editing is not supported.
  • Performance is bounded by the quality of the pretrained diffusion model.
  • The AEQ classifier itself may contain inherent biases.
  • Resolution is limited to 512/1024; higher-resolution scenarios have not been validated.
  • The proposed framework is extensible to transparent video inpainting and multi-layer RGBA editing.

Related Work

  • LayerDiffuse: A latent transparency approach capable of RGBA generation but not focused on inpainting.
  • AnimateDiff: A video generation adapter that inspired the inflation design of Trans-Adapter.
  • BrushNet: A plug-and-play inpainting branch with which Trans-Adapter integrates seamlessly.
  • ControlNet: Source of inspiration for the conditional control paradigm and the zero-initialization strategy.

Rating

⭐⭐⭐⭐ — Strong pioneering contribution with an elegant and practical design. Delivers a complete package of task definition, benchmark, and method, with direct applicability to real-world production workflows (Photoshop/After Effects).