Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting¶
| Info | Content |
|---|---|
| Conference | ICCV2025 |
| arXiv | 2508.01098 |
| Code | ykdai/trans-adapter |
| Area | Image Generation / Image Inpainting |
| Keywords | Transparent image inpainting, RGBA, diffusion models, adapter, alpha channel |
TL;DR¶
This paper proposes Trans-Adapter, a plug-and-play adapter module that enables diffusion-based image inpainting models to directly process transparent (RGBA) images. It also introduces the LayerBench benchmark and the Alpha Edge Quality (AEQ) metric.
Background & Motivation¶
Core Problem¶
RGBA transparent images are widely used in film, animation, and game production, yet existing AI inpainting tools (e.g., generative fill) support only RGB images. This forces designers to repair transparent images manually, which is labor-intensive and makes visual consistency difficult to maintain.
Two Limitations of Prior Work¶
Inconsistency from separate processing: RGB is first inpainted on a background, then alpha is extracted via matting — the two channels are generated independently, leading to misalignment.
Poor edge quality: Matting and segmentation methods introduce jagged edges at transparency boundaries, degrading compositing quality.
Gap in the Field¶
- Methods for transparent image/video generation exist (LayerDiffuse, Zippo, etc.), but no dedicated transparent image inpainting method has been proposed.
- Existing inpainting methods (SD-Inpainting, BrushNet, etc.) are all designed for RGB.
- No benchmark or metric specifically evaluating transparent image inpainting exists.
Method¶
Overall Architecture¶
The paper decomposes RGBA images into RGB and alpha channels, treating them as a "two-frame video." A T2I model is inflated to process this "video," with a spatial alignment module and cross-domain self-attention introduced to ensure RGB–alpha consistency.
Key Design 1: Network Inflation¶
RGBA is decomposed into a padded RGB image and an alpha channel, forming a 5D input tensor \(f \in \mathbb{R}^{b \times c \times 2 \times h \times w}\). For the forward pass through the pretrained 2D diffusion model, this is reshaped to \(f_r \in \mathbb{R}^{2b \times c \times h \times w}\), stacking the RGB and alpha latent representations along the batch dimension.
Original network parameters are frozen; only newly introduced modules are trained. All new modules use zero-initialized output projection layers to avoid disrupting pretrained capabilities at the start of training.
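The inflation reshape can be sketched as follows. This is a minimal illustration with assumed latent shapes, not the authors' code; it only shows how the "two-frame video" tensor is folded into the batch axis for a frozen 2D U-Net:

```python
# Sketch of the network-inflation reshape (latent sizes are illustrative
# assumptions; frame 0 = RGB latent, frame 1 = alpha latent).
import torch

b, c, h, w = 2, 4, 8, 8                      # batch, latent channels, spatial dims
rgb_latent = torch.randn(b, c, h, w)
alpha_latent = torch.randn(b, c, h, w)

# 5D "video" tensor with a frame axis of length 2: (b, c, 2, h, w)
f = torch.stack([rgb_latent, alpha_latent], dim=2)

# Fold the frame axis into the batch axis for the frozen 2D U-Net: (2b, c, h, w)
f_r = f.permute(0, 2, 1, 3, 4).reshape(2 * b, c, h, w)

# RGB and alpha latents are now interleaved along the batch dimension
assert torch.equal(f_r[0], rgb_latent[0])    # even indices hold RGB
assert torch.equal(f_r[1], alpha_latent[0])  # odd indices hold alpha
```

Interleaving (rather than concatenating all RGB first) keeps each sample's RGB and alpha latents adjacent, which the alignment modules below rely on.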
Key Design 2: Spatial Alignment Module¶
Introduced in the shallow layers of the U-Net, this module enforces spatial synchronization between alpha and RGB at corresponding positions via convolutional layers:
Features are reshaped to \(f_r \in \mathbb{R}^{b \times 2c \times h \times w}\) (concatenated along the channel dimension), processed by a convolutional block, and added back via a zero-initialized convolution.
Key Design 3: Cross-Domain Self-Attention¶
Introduced in the bottleneck layer of the U-Net, this allows masked regions to reference surrounding context during inpainting, which is particularly important for high-frequency details such as hair. The RGB and alpha features are flattened into a joint token sequence \(f_r \in \mathbb{R}^{b \times 2hw \times c}\); after adding 2D positional embeddings, self-attention is applied over the combined sequence, followed by a zero-initialized MLP.
Key Design 4: Alpha Map LoRA¶
A LoRA module is introduced to enable the model to learn alpha channel inpainting:
- LoRA weights are zero-initialized, learning only the residual required for alpha generation.
- Both RGB and alpha are used as training targets simultaneously.
- Different text prompts distinguish the two: alpha uses "alpha map of [prompt]," while RGB uses the original prompt.
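A minimal LoRA sketch consistent with the zero-residual property described above (the rank and scaling are illustrative assumptions; in standard LoRA only the up projection is zero-initialized so that the product starts at zero):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch (rank and scale are assumptions).

    The base layer is frozen; the up projection is zero-initialized, so the
    low-rank residual is exactly zero at the start of training and the
    pretrained layer's behavior is preserved.
    """
    def __init__(self, base: nn.Linear, rank=4, scale=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # frozen pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # zero residual at init
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```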
Two-Stage Training¶
- Stage 1: Only the Alpha Map LoRA is trained, endowing the model with alpha reconstruction capability.
- Stage 2: The spatial alignment module, cross-domain self-attention, and Alpha Map LoRA are jointly fine-tuned to achieve aligned RGBA inpainting.
The training loss follows the standard DDPM noise-prediction objective: \(\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right]\).
Two Instantiations¶
- SD-Inpainting style: Extends UNet input channels to encode the mask and masked image.
- BrushNet style: Introduces a trainable inpainting branch.
Training Data¶
- 30K in-house high-quality transparent images (purchased from online PNG stock libraries and manually filtered).
- 90% of the MAGICK dataset (150K SDXL-generated transparent images), merged into the training set.
- Each image is paired with a text description generated by LLaVA.
LayerBench Benchmark¶
Dataset Composition¶
800 transparent images:
- LayerBench-Natural (400 images): online PNGs combined with matting datasets (DIM, etc.).
- LayerBench-Generated (400 images): 200 high-aesthetic-score images from MAGICK + 200 generated by LayerDiffusion.
A key characteristic: most inpainting masks overlap with alpha boundaries, specifically designed to test RGB–alpha alignment.
AEQ Metric¶
Alpha Edge Quality: a lightweight CNN binary classifier that takes an 8-channel input (white-background and black-background composites + alpha + alpha edge mask) and outputs the probability of low-quality edges. AEQ ranges in \([0, 1]\); higher is better.
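An AEQ-style classifier could be sketched as below. The layer sizes are assumptions, as is the mapping from the low-quality probability to the final score (taken here as \(1 - p\) so that higher is better, matching the description above); this is not the authors' trained model:

```python
import torch
import torch.nn as nn

class AEQClassifier(nn.Module):
    """Sketch of an AEQ-style edge classifier (layer sizes are assumptions).

    Input: 8 channels = white-background composite (3) + black-background
    composite (3) + alpha (1) + alpha edge mask (1). The network predicts
    the probability of a low-quality transparency edge; the score is taken
    as 1 - p so that higher is better, as described above.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, x):                 # x: (b, 8, h, w)
        p_low = torch.sigmoid(self.net(x))
        return 1.0 - p_low                # score in [0, 1], higher is better
```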
Key Experimental Results¶
Main Results (SD1.5-Inpainting + Blended Noise Strategy)¶
| Method | AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ |
|---|---|---|---|---|
| ZIM (matting) | 6.044 | 0.0526 | 27.040 | 0.9874 |
| U²-Net (segmentation) | 6.007 | 0.0560 | 26.978 | 0.9537 |
| BiRefNet (segmentation) | 6.055 | 0.0515 | 27.049 | 0.9886 |
| Ours | 6.097 | 0.0408 | 27.030 | 0.9878 |
Key findings:
- LPIPS is substantially better than all two-stage methods (0.0408 vs. 0.0515–0.0560), indicating better preservation of inpainted regions.
- AEQ is competitive with the best matting method and far superior to segmentation methods (U²-Net: 0.9537).
- Achieves the highest aesthetic score (6.097).
SDXL Results¶
| Method | AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ |
|---|---|---|---|---|
| LayerDiffusion | 6.016 | 0.0642 | 27.097 | 0.9781 |
| ZIM | 6.115 | 0.0461 | 27.111 | 0.9828 |
| BiRefNet | 6.129 | 0.0453 | 27.104 | 0.9859 |
| Ours | 6.140 | 0.0434 | 27.134 | 0.9872 |
The method outperforms all competing approaches across all four metrics on SDXL as well.
Ablation Study¶
| Variant | AS↑ | LPIPS↓ | CLIP Sim↑ | AEQ↑ |
|---|---|---|---|---|
| Full model | 6.097 | 0.0408 | 27.030 | 0.9878 |
| w/o MAGICK data | 6.073 | 0.0435 | 27.037 | 0.9873 |
| w/o in-house data | 6.067 | 0.0457 | 27.071 | 0.9881 |
| AnimateDiff substitute | 6.067 | 0.0459 | 27.032 | 0.9872 |
| w/o spatial alignment | 5.542 | — | — | — |
Key findings:
- The spatial alignment module is the most critical component; removing it causes a severe drop in AS.
- The two data sources are complementary; mixing them yields the best results.
- The Trans-Adapter design outperforms direct use of AnimateDiff.
Highlights & Insights¶
- Pioneering contribution: The first dedicated transparent image inpainting method, filling an important gap in the field.
- Plug-and-play design: Compatible with multiple architectures including SD1.5, SDXL, and BrushNet, with original parameters frozen to maintain compatibility.
- Elegant "video" formulation: Treating RGB + alpha as a two-frame video is a natural and effective modeling strategy.
- Zero-initialization strategy: Zero-initializing all new module outputs ensures training stability and progressive learning.
- AEQ metric: Addresses the lack of edge alignment evaluation standards for transparent images.
Limitations & Future Work¶
- Currently supports only single-layer transparent images; multi-layer RGBA editing is not supported.
- Performance is bounded by the quality of the pretrained diffusion model.
- The AEQ classifier itself may contain inherent biases.
- Resolution is limited to 512/1024; higher-resolution scenarios have not been validated.
Related Work & Insights¶
- LayerDiffuse: A latent transparency approach capable of RGBA generation but not focused on inpainting.
- AnimateDiff: A video generation adapter that inspired the inflation design of Trans-Adapter.
- BrushNet: A plug-and-play inpainting branch with which Trans-Adapter integrates seamlessly.
- ControlNet: Source of inspiration for conditional control paradigms and the zero-initialization strategy.
- The proposed framework is extensible to transparent video inpainting and multi-layer RGBA editing.
Rating¶
⭐⭐⭐⭐ — Strong pioneering contribution with an elegant and practical design. Delivers a complete package of task definition, benchmark, and method, with direct applicability to real-world production workflows (Photoshop/After Effects).