MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Conference: CVPR 2025
arXiv: 2506.23482
Code: None
Area: Diffusion Models / Image Inpainting
Keywords: Image Inpainting, Mask-Text Alignment, Style Consistency, Edge Prediction, Diffusion Models

TL;DR

MTADiffusion addresses three key challenges in object inpainting at once: semantic misalignment, structural distortion, and style inconsistency. It does so by constructing a mask-text aligned dataset of 5 million images, jointly training the inpainting and edge prediction tasks, and adding a style consistency loss based on VGG Gram matrices. It reports state-of-the-art results on BrushBench and EditBench.

Background & Motivation

Background: Diffusion-based image inpainting generates content in designated regions, conditioned on a prompt and a mask. Mainstream methods such as BrushNet use a dual-branch strategy that guides inpainting with pixel-level mask information.
Limitations of Prior Work: Existing methods suffer from three key issues: (a) Semantic misalignment: the generated content does not match the prompt, tending to fill the background rather than generating the new object; (b) Structural distortion: the structural details of the generated objects are chaotic, such as messy limbs; (c) Style inconsistency: the generated areas are incompatible with the original image in terms of hue, texture, and illumination.
Key Challenge: The root cause of semantic misalignment is that masks are not strictly aligned with text descriptions during training. SDI (Stable Diffusion Inpainting) trains on whole-image captions, so the misalignment is already present at training time; SmartBrush and PowerPaint use segmentation labels from OpenImages, which are oversimplified. Structural distortion occurs because the mask condition is not strong enough to constrain the internal structure of objects. Style inconsistency arises because there is no explicit style constraint mechanism.
Goal: (1) How to construct detailed, mask-aligned text descriptions at scale? (2) How to enhance the structural stability of the inpainting model? (3) How to guarantee style consistency between the generated region and the original image?
Key Insight: The authors observe that the issue stems from data (lack of mask-text alignment annotations), structure (lack of structural constraints), and style (lack of style losses), and propose one solution along each of these three dimensions.
Core Idea: A three-pronged approach to the inpainting quality bottleneck: automatically annotating a mask-text aligned dataset with the MTAPipeline, strengthening structure through multi-task learning with edge prediction, and maintaining style consistency with a VGG Gram matrix style loss.

Method

Overall Architecture

The inputs are the original image, a mask, and a text prompt. The model adopts a dual-branch architecture similar to BrushNet: the UNet backbone is responsible for denoising generation, and the brush branch (using self-attention instead of residual blocks) receives the noise latent, masked image latent, and downsampled mask as inputs, connecting to the UNet layer by layer via zero convolutions. During training, the weights of VAE and UNet are frozen, and only the newly introduced attention branch and the VGG latent space style extractor are trained. The final loss is a weighted sum of three components: noise prediction loss, edge prediction loss, and style consistency loss.
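The zero-convolution connection can be illustrated with a small NumPy sketch. It is not the authors' implementation: the names `build_branch_input`, `ZeroConv1x1`, and `inject` are hypothetical, channel-wise concatenation of the three branch inputs is assumed (following the BrushNet-style design the text refers to), and the 4-channel latents match the latent channel count mentioned under Limitations. The point is only that a zero-initialized projection leaves the frozen UNet's features unchanged at the start of training.

```python
import numpy as np


def build_branch_input(noisy_latent, masked_latent, mask_small):
    """Concatenate the three branch inputs along the channel axis (assumed layout)."""
    return np.concatenate([noisy_latent, masked_latent, mask_small], axis=1)


class ZeroConv1x1:
    """1x1 convolution whose weights and bias start at zero."""

    def __init__(self, in_ch, out_ch):
        self.weight = np.zeros((out_ch, in_ch))
        self.bias = np.zeros(out_ch)

    def __call__(self, x):
        # x has shape (B, in_ch, H, W); a 1x1 conv is a per-pixel linear map.
        out = np.einsum("oc,bchw->bohw", self.weight, x)
        return out + self.bias[None, :, None, None]


def inject(unet_feat, branch_feat, zero_conv):
    """Add the projected branch feature to the UNet feature of the same layer."""
    return unet_feat + zero_conv(branch_feat)


rng = np.random.default_rng(0)
branch_in = build_branch_input(
    rng.normal(size=(2, 4, 8, 8)),  # noisy latent
    rng.normal(size=(2, 4, 8, 8)),  # masked-image latent
    np.ones((2, 1, 8, 8)),          # downsampled mask
)
print(branch_in.shape)  # (2, 9, 8, 8)
```

Because the projection starts at zero, the branch only begins to influence the UNet as its weights are updated, which is the usual motivation for this kind of connection.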

Key Designs

MTAPipeline and MTADataset (Data Construction):
- Function: Automatically annotates detailed content and style descriptions for each mask, constructing a large-scale mask-text aligned training dataset.
- Mechanism: A two-stage pipeline. The first stage extracts object masks, labels, and bounding boxes using Grounded-SAM, keeping detections with confidence > 0.6. The second stage uses LLaVA as the VLM: it takes the masked region image and its label as inputs and generates a detailed content and style description via the prompt "Describe the {label} and its style in details". Built from a subset of LAION (aesthetic score > 5.8, resolution > 1024), MTADataset contains 5 million images and 25 million mask-text pairs.
- Design Motivation: Existing datasets either provide only simple labels (such as OpenImages' class labels) or rely on whole-image captions, which do not align with the masked region, so neither supports large-scale mask-text aligned training. Descriptions generated by LLaVA carry both content and style information, which simple labels lack.
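The two-stage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `Detection`, `MaskTextPair`, and `annotate_image` are hypothetical names, and the `detect` and `describe` callables stand in for Grounded-SAM and LLaVA, whose real interfaces are not shown here. Only the confidence threshold and the prompt template come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

PROMPT_TEMPLATE = "Describe the {label} and its style in details"  # prompt quoted in the text
CONF_THRESHOLD = 0.6  # Grounded-SAM confidence cutoff stated in the text


@dataclass
class Detection:
    label: str
    confidence: float
    bbox: tuple   # (x0, y0, x1, y1)
    mask: object  # binary mask; the concrete type is left open in this sketch


@dataclass
class MaskTextPair:
    label: str
    bbox: tuple
    mask: object
    description: str


def annotate_image(
    image: object,
    detect: Callable[[object], Iterable[Detection]],
    describe: Callable[[object, Detection, str], str],
) -> List[MaskTextPair]:
    """Stage 1 keeps confident detections; stage 2 captions each masked region."""
    pairs = []
    for det in detect(image):
        if det.confidence <= CONF_THRESHOLD:
            continue  # drop detections at or below the cutoff
        prompt = PROMPT_TEMPLATE.format(label=det.label)
        description = describe(image, det, prompt)
        pairs.append(MaskTextPair(det.label, det.bbox, det.mask, description))
    return pairs
```

Passing the detector and captioner in as callables keeps the sketch independent of any particular model interface; in practice each would wrap the corresponding model.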
Multi-task Training with Edge Prediction (Structural Constraints):
- Function: Enhances the structural stability of the generated objects through joint training of the inpainting task and the edge prediction task.
- Mechanism: The output of the last layer in the brush branch's attention block is expanded from \((b,c,h,w)\) to \((b,c+1,h,w)\), with the extra channel dedicated to predicting the edge map. Ground-truth edge maps are extracted with the Sobel operator and then downsampled. The structural loss is the batch-averaged squared Frobenius norm of the prediction error: \(L_{structure} = \frac{1}{B}\sum_{i=1}^{B}\|s_{pred}^i - \tilde{s}^i\|_F^2\).
- Design Motivation: Relying solely on mask conditions cannot constrain the internal structure of objects. The edge prediction task serves as a complementary mechanism to guide the model in reconstructing contents with stable structural features.
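A NumPy sketch of the edge target and the structural loss is given below. Several details are assumptions rather than facts from the paper: the gradient magnitude is used as the edge map, edges are computed on a grayscale image, downsampling is done by average pooling with a factor of 8, and all function names are illustrative.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def correlate3x3(img, kernel):
    """Same-size 3x3 cross-correlation with edge-replicated padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_edges(gray):
    """Sobel gradient magnitude of a 2D grayscale image."""
    gx = correlate3x3(gray, SOBEL_X)
    gy = correlate3x3(gray, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)


def downsample(edge, factor=8):
    """Average-pool by `factor` (assumed to match the latent resolution)."""
    h, w = edge.shape
    h2, w2 = h // factor, w // factor
    cropped = edge[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor).mean(axis=(1, 3))


def split_branch_output(out):
    """Split a (b, c+1, h, w) array into the (b, c, h, w) part and the edge channel."""
    return out[:, :-1], out[:, -1]


def structure_loss(s_pred, s_gt):
    """(1/B) * sum_i ||s_pred^i - s_gt^i||_F^2 for arrays of shape (B, h, w)."""
    diff = s_pred - s_gt
    return float(np.mean(np.sum(diff ** 2, axis=(1, 2))))
```

The last channel is split off only to make the \((b,c+1,h,w)\) layout concrete; which channel index carries the edge map is not specified in the text.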
VGG Gram Matrix Style Consistency Loss:
- Function: Ensures the style of the infilled region matches the original image in terms of hue, texture, and lighting.
- Mechanism: \(X_{t-1}\) is derived from the UNet-predicted noise \(z_t\) through a denoising function, and \(\tilde{X}_{t-1}\) is derived from the GT image via a noise addition function. These are fed to a pre-trained VGG network to extract multi-layer style features \((\alpha_1,...,\alpha_n)\) and \((\beta_1,...,\beta_n)\) respectively. The MSE loss of Gram matrices is computed as: \(L_{style} = \frac{1}{BN}\sum\|G(\alpha_i) - G(\beta_i)\|_F^2\). Meanwhile, the brush branch uses multi-resolution self-attention blocks (instead of BrushNet's residual blocks) to force the model to focus on global image information.
- Design Motivation: BrushNet's style heavily relies on the base model, leading to inconsistency. The self-attention mechanism enables the model to perceive the overall style, and computing style differences in the latent space through the VGG Gram matrix is more direct and effective.
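The Gram-matrix loss can be written out in a few lines of NumPy. This is a sketch under assumptions: the per-layer features are taken as given (the VGG extractor itself is not modeled), the Gram matrix is normalized by \(C \cdot H \cdot W\), which the text does not specify, and `gram_matrix` and `style_loss` are illustrative names.

```python
import numpy as np


def gram_matrix(feat):
    """Gram matrix of one feature map of shape (C, H, W).

    Normalizing by C*H*W is an assumption; the text does not state a normalization.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def style_loss(feats_pred, feats_gt):
    """L_style = 1/(B*N) * sum over samples and layers of ||G(alpha) - G(beta)||_F^2.

    feats_pred and feats_gt are lists of N arrays shaped (B, C_i, H_i, W_i),
    one array per feature layer.
    """
    n_layers = len(feats_pred)
    batch = feats_pred[0].shape[0]
    total = 0.0
    for alpha, beta in zip(feats_pred, feats_gt):
        for i in range(batch):
            diff = gram_matrix(alpha[i]) - gram_matrix(beta[i])
            total += float(np.sum(diff ** 2))
    return total / (batch * n_layers)
```

A Gram matrix sums channel co-activations over all spatial positions, so it is unchanged when those positions are shuffled. That is why it is commonly used to compare texture and color statistics rather than layout.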

Loss & Training

The total loss is a weighted sum of three parts: \(L = \gamma L_{noise} + \delta L_{style} + \eta L_{structure}\), where \(\gamma=1, \delta=100, \eta=0.1\). The learning rate is \(1\times10^{-5}\) for the attention branch and \(1\times10^{-7}\) for the VGG style extractor. Training is conducted on 8 V100 GPUs for 200k iterations. During training, random masks use whole-image captions while object masks use detailed descriptions, unified within a single model.
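As a small illustration, the weighting and the caption rule can be expressed directly. The weights and learning rates are the ones stated above; the function names, the dictionary keys, and the `"random"` / `"object"` mask labels are hypothetical.

```python
GAMMA, DELTA, ETA = 1.0, 100.0, 0.1  # loss weights reported in the text

# Learning rates reported in the text; the dictionary keys are illustrative names.
LEARNING_RATES = {"attention_branch": 1e-5, "vgg_style_extractor": 1e-7}


def total_loss(l_noise, l_style, l_structure, gamma=GAMMA, delta=DELTA, eta=ETA):
    """L = gamma * L_noise + delta * L_style + eta * L_structure."""
    return gamma * l_noise + delta * l_style + eta * l_structure


def select_prompt(mask_kind, image_caption, region_description):
    """Random masks get the whole-image caption; object masks get the detailed description."""
    if mask_kind == "random":
        return image_caption
    if mask_kind == "object":
        return region_description
    raise ValueError(f"unknown mask kind: {mask_kind!r}")
```

With made-up component values of 0.05, 0.0004, and 0.2, the three weighted terms are 0.05, 0.04, and 0.02, for a total of 0.11.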

Key Experimental Results

Main Results

Dataset	Metric	MTADiffusion	BrushNet	SDI	Gain vs BrushNet
BrushBench	IR×10↑	12.69	12.52	11.72	+0.17
BrushBench	CLIP Sim↑	26.52	26.32	26.17	+0.20
BrushBench	VQA Score×100↑	68.97	68.22	64.55	+0.75
EditBench	IR×10↑	4.82	4.46	1.86	+0.36
EditBench	CLIP Sim↑	29.12	28.87	28.00	+0.25

Ablation Study

Configuration	IR×10↑	CLIP Sim↑	Notes
Trained on BrushData	11.73	26.28	Original BrushNet data
Trained on MTADataset	12.08	26.41	Dataset from this work; significant improvement in just 10k iterations
Original Caption	11.67	26.31	Original whole-image LAION caption
Grounded-SAM Label	11.41	26.23	Simple label, worst semantic performance
LLaVA Caption	11.76	26.36	Detailed description + style information, optimal across all metrics

User Study (30 participants × 60 images × 3 questions): On the three criteria of semantic alignment, structural stability, and style consistency, MTADiffusion receives 66%, 60%, and 54% of the votes respectively, far exceeding BrushNet's 15%, 16%, and 13%.

Key Findings

The MTADataset is the primary contributor: with only 10k iterations, training BrushNet on MTADataset brings significant improvements in image quality and text consistency.
Detailed descriptions with style information from LLaVA yield the best results, whereas simple labels perform the worst in text alignment. This indicates that rich text descriptions are critical for semantic capability.
While SDI is weaker in semantic alignment, its training on 600 million images yields exceptional style consistency, demonstrating that data scale is crucial for style learning.

Highlights & Insights

The data annotation scheme of MTAPipeline is the most valuable contribution: Using Grounded-SAM + LLaVA for automated mask-text alignment annotation is general and scalable, and it can be transferred to any task that requires aligning local regions with text.
Computing the VGG style loss in latent space is a clever design: It is calculated on UNet outputs at different noise levels rather than in pixel space, avoiding the extra computational overhead of decoding to pixel space.
Edge prediction only requires extending a single output dimension to bring structural improvements, incurring almost zero extra computational cost.
The data construction pipeline scales well: As long as there is a segmentation model + VLM, MTAPipeline can run on any image data to generate mask-text pairs, which is methodologically not restricted to the inpainting task.

Limitations & Future Work

Lack of open-source code makes reproducibility questionable.
MTADataset relies on the upper limit of LLaVA and Grounded-SAM quality, which may generate erroneous annotations in complex scenes.
The individual contributions of the three loss functions are not detailed in the main text (placed in supplementary materials), making it hard to precisely judge the marginal effect of each loss.
The VGG input channels for style loss were changed from 3 to 4 to match the latent channels; whether this modification impacts the feature extraction ability of pre-trained VGG is worth discussing.
Utilizing a stronger VLM (such as GPT-4V or InternVL2) instead of LLaVA could be considered to improve annotation quality.
Applicability to DiT architectures (e.g., SD3/FLUX) has not been explored, and the effectiveness of the dual-branch strategy on DiT remains unknown.
The edge prediction target uses the Sobel operator to extract GT; such handcrafted edge detection might not be as effective as learned structural representations.

vs BrushNet: BrushNet uses dual-branch pixel-level information but does not align masks with text. This work builds on it by adding data alignment + structural constraints + style constraints. BrushNet's style is dominated by the base model; MTADiffusion explicitly optimizes it via the VGG Gram matrix.
vs SmartBrush/PowerPaint: They use simple labels from OpenImages (only category names). In contrast, this work uses a VLM to generate detailed descriptions covering both content and style. The data scale (25M mask-text pairs vs tens of thousands) and annotation quality are significantly enhanced.
vs CAT-Diffusion: CAT-Diffusion cascades a semantic inpainter, which extracts vision-text features, with the diffusion model, but it is limited by the information available in the masked image and by short labels. MTADiffusion's end-to-end training is cleaner.
vs SDI (Stable Diffusion Inpainting): SDI uses whole-image captions + random masks for training, causing semantic misalignment during the training phase. However, its style consistency achieved through training on 600M images is worth learning from.

Rating

Novelty: ⭐⭐⭐ Systematic solutions across three dimensions (data + structure + style), but the innovation of each single component is limited on its own.
Experimental Thoroughness: ⭐⭐⭐⭐ Adequate comparisons on BrushBench and EditBench, and a convincing user study, but the ablation of the three losses is left to the supplementary material.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition and well-structured methodology description.
Value: ⭐⭐⭐⭐ MTAPipeline and MTADataset provide sustained value to the community, and the three strategies can be generalized to other inpainting models.