Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Conference: ICLR 2026
arXiv: 2505.18612
Code: Project Page
Area: Diffusion Models / Personalized Generation
Keywords: Multi-concept Personalization, Tuning-Free, DiT Modulation Space, Mixture-of-Experts, VLM Pre-training

TL;DR

This paper proposes Mod-Adapter, a tuning-free multi-concept personalization method that predicts concept-specific modulation directions in the modulation space of DiT. This enables decoupled customization of both object concepts and abstract concepts (pose, lighting, material, etc.), and the method substantially outperforms existing methods on multi-concept personalization benchmarks.

Background & Motivation

Background: Personalized text-to-image generation aims to synthesize target concepts from user-provided reference images. Existing methods predominantly focus on object concepts (persons, animals, everyday items), and multi-concept personalization methods similarly handle combinations of multiple objects.

Limitations of Prior Work: (a) Existing tuning-free methods (e.g., IP-Adapter, MS-Diffusion) fail to disentangle object and abstract concepts — when given an image of a person in a specific pose, they replicate the entire person rather than extracting only the pose; (b) TokenVerse supports abstract concepts but requires test-time fine-tuning for each new image, which is time-consuming and prone to overfitting.

Key Challenge: Abstract concepts (pose, lighting, material) are not independent visual entities — they are tightly coupled with objects and difficult to extract in isolation. Moreover, there exists a substantial gap between extracted visual features and the modulation space of DiT.

Goal: (i) Generalize to new concepts without test-time fine-tuning; (ii) Support customization of both object and abstract concepts simultaneously; (iii) Achieve decoupled control across multiple concepts.

Key Insight: The AdaLN modulation space in DiT exhibits locality and semantic additivity: assigning different modulation vectors to different tokens enables localized concept control.

Core Idea: Train a Mod-Adapter module to predict concept-specific modulation directions, with VLM-guided pre-training to bridge the large gap between the image and modulation spaces.

Method

Overall Architecture

Built upon FLUX (DiT architecture), Mod-Adapter takes a concept image and its corresponding concept word (e.g., "surface") as input and predicts concept-specific modulation directions \(\{\Delta_i \mid i=1,\ldots,N\}\) (for \(N=57\) DiT blocks). These directions are added to the modulation vectors of the corresponding concept text tokens, producing localized effects on concept-relevant image regions via the joint attention layers. During multi-concept inference, modulation directions for different concepts are applied independently to their respective text tokens.
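
The token-wise application of modulation directions can be illustrated with a minimal NumPy sketch. It assumes a simplified AdaLN-style scale/shift modulation; the function `modulate_tokens`, the projection `W`/`b`, and all shapes are hypothetical stand-ins rather than the FLUX or Mod-Adapter implementation, and the propagation to image regions through joint attention is not modeled.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulate_tokens(tokens, y, deltas, W, b):
    """Apply AdaLN-style modulation with optional per-token directions.

    tokens: (T, D) text-token features
    y:      (D,) modulation vector shared by all tokens
    deltas: {token_index: (D,) direction} for concept tokens only
    W, b:   hypothetical projection from a modulation vector to (scale, shift)
    """
    per_token_y = np.tile(y, (len(tokens), 1))
    for idx, delta in deltas.items():
        per_token_y[idx] += delta          # only the concept token is shifted
    scale, shift = np.split(per_token_y @ W + b, 2, axis=-1)
    return (1.0 + scale) * layer_norm(tokens) + shift

rng = np.random.default_rng(0)
T, D = 5, 8
tokens = rng.normal(size=(T, D))
y = rng.normal(size=D)
W = 0.1 * rng.normal(size=(D, 2 * D))
b = np.zeros(2 * D)

# Two concepts, each tied to its own text token (indices are illustrative).
deltas = {1: rng.normal(size=D), 3: rng.normal(size=D)}

baseline = modulate_tokens(tokens, y, {}, W, b)
personalized = modulate_tokens(tokens, y, deltas, W, b)
changed = [i for i in range(T) if not np.allclose(baseline[i], personalized[i])]
print(changed)  # only the two concept tokens differ from the baseline
```

In this toy setting each direction touches only its own token, which is the property that lets several concepts be combined independently at inference time.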

Key Designs

Vision-Language Cross-Attention:
- Function: Extracts visual features of the target concept from the concept image.
- Mechanism: The concept word (e.g., "surface") is encoded via a CLIP text encoder and an MLP projection layer to obtain a neutral feature, which is projected into \(N\) queries (with sinusoidal positional encodings to distinguish different DiT blocks); the concept image is encoded via a CLIP image encoder to obtain keys and values; cross-attention \(\text{Attention}(Q_i, K, V)\) is used to extract concept visual features.
- Design Motivation: Leverages CLIP's vision-language alignment to precisely extract concept-relevant features from the image using the concept word as an anchor, rather than naively extracting global features.
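
A rough sketch of the query construction and cross-attention described above, under simplifying assumptions: the learned projections for queries, keys, and values are omitted, the neutral feature is simply broadcast to \(N\) queries, and random arrays stand in for CLIP features. The names `sinusoidal_encoding` and `concept_cross_attention` are illustrative only.

```python
import numpy as np

def sinusoidal_encoding(n_positions, dim):
    """Standard sinusoidal positional encoding (dim must be even)."""
    positions = np.arange(n_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((n_positions, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concept_cross_attention(neutral_feature, image_tokens, n_blocks):
    """One query per DiT block attends over the image tokens.

    neutral_feature: (D,) stand-in for the encoded concept word
    image_tokens:    (P, D) stand-in for CLIP image patch features,
                     used directly as keys and values in this sketch
    """
    dim = neutral_feature.shape[0]
    queries = neutral_feature[None, :] + sinusoidal_encoding(n_blocks, dim)
    scores = queries @ image_tokens.T / np.sqrt(dim)   # (N, P)
    weights = softmax(scores, axis=-1)
    return weights @ image_tokens, weights             # (N, D), (N, P)

rng = np.random.default_rng(0)
neutral = rng.normal(size=16)
image_tokens = rng.normal(size=(49, 16))
features, weights = concept_cross_attention(neutral, image_tokens, n_blocks=57)
print(features.shape, weights.shape)
```

Because each query carries a different positional encoding, each block can attend to the image differently even though all queries originate from the same concept word.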
Mixture-of-Experts (MoE) Projection:
- Function: Maps extracted concept visual features accurately into the DiT modulation space.
- Mechanism: Since different concept types (object vs. material vs. pose) exhibit substantially different mapping patterns, a single MLP is insufficient. Twelve expert MLPs are introduced, each handling concepts with similar mapping patterns. The routing mechanism adopts a parameter-free k-means clustering scheme — experts are assigned by clustering the neutral features of all concept words in the training set via k-means.
- Design Motivation: Learnable linear gating networks suffer from imbalanced expert utilization; k-means routing avoids this issue in a simple and effective manner.
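
The routing idea can be sketched as follows, assuming plain Lloyd's k-means over concept-word features and nearest-centroid assignment at inference. The toy data uses three clusters in 2-D rather than the twelve experts mentioned above, linear maps stand in for the expert MLPs, and all names (`fit_kmeans`, `route`, `project`) are hypothetical.

```python
import numpy as np

def fit_kmeans(features, init_centroids, n_iters=10):
    """Plain Lloyd's k-means; returns the final centroids."""
    centroids = init_centroids.astype(float).copy()
    for _ in range(n_iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = features[labels == j]
            if len(members) > 0:            # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def route(word_feature, centroids):
    """Parameter-free routing: pick the expert whose centroid is nearest."""
    return int(np.linalg.norm(centroids - word_feature, axis=-1).argmin())

rng = np.random.default_rng(0)
# Toy "neutral features": three well-separated groups of ten concept words each.
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
word_features = np.concatenate(
    [c + 0.5 * rng.normal(size=(10, 2)) for c in group_centers]
)
centroids = fit_kmeans(word_features, init_centroids=word_features[[0, 10, 20]])

# Linear maps as stand-ins for the expert MLPs (one per cluster).
experts = [rng.normal(size=(2, 4)) for _ in centroids]

def project(visual_feature, word_feature):
    """Send the visual feature through the expert chosen by the concept word."""
    expert_id = route(word_feature, centroids)
    return expert_id, visual_feature @ experts[expert_id]

assignments = [route(f, centroids) for f in word_features]
print(assignments)
```

Since the assignment depends only on fixed centroids, no gating parameters are trained, and the expert load is determined by how the training concept words cluster.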
VLM-Guided Pre-training:
- Function: Provides a good initialization for Mod-Adapter and bridges the large gap between the concept image space and the DiT modulation space.
- Mechanism: A VLM generates detailed descriptions \(p^+\) of concept images (e.g., "transparent cyan-green glass surface"), which are encoded as supervision signals in the modulation space. The pre-training loss is \(\mathcal{L}_{\text{pretrain}} = \frac{1}{N}\sum_{i=1}^N \|F_i^+ - \mathcal{M}(\text{CLIP}(p^+))\|_2^2\).
- Design Motivation: Pre-training does not require DiT forward passes, making it lightweight and efficient; the strong image understanding capability of VLMs provides high-quality semantic bridges.
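
The pre-training objective can be written out as a short sketch. It treats \(F_i^+\) as the adapter's output for block \(i\) and \(\mathcal{M}(\text{CLIP}(p^+))\) as a single target vector shared across blocks; both readings are assumptions based on the formula above, and `pretrain_loss` is an illustrative name.

```python
import numpy as np

def pretrain_loss(predicted, target):
    """Mean over blocks of the squared L2 distance to the target.

    predicted: (N, D) one predicted modulation feature per DiT block
    target:    (D,) modulation-space encoding of the VLM description
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    squared_norms = np.sum((predicted - target) ** 2, axis=-1)  # (N,)
    return float(squared_norms.mean())

n_blocks, dim = 57, 4
target = np.ones(dim)

loss_far = pretrain_loss(np.zeros((n_blocks, dim)), target)
loss_exact = pretrain_loss(np.tile(target, (n_blocks, 1)), target)
print(loss_far, loss_exact)  # 4.0 0.0
```

Nothing in this computation touches the diffusion backbone, which is consistent with the claim that the pre-training stage is lightweight.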

Loss & Training

The pre-training stage uses only \(\mathcal{L}_{\text{pretrain}}\) (MSE loss) without connecting to DiT; the main training stage uses the standard diffusion denoising loss of FLUX. Training data includes MVImgNet (objects), AFHQ (animal faces), and FLUX self-distillation synthetic data (abstract concepts), totaling 106K images.

Key Experimental Results

Main Results

| Method | Multi-concept CP | Multi-concept PF | Multi-concept CP·PF | Single-concept CP·PF |
|---|---|---|---|---|
| Emu2 | 0.53 | 0.48 | 0.25 | 0.42 |
| MIP-Adapter | 0.68 | 0.55 | 0.37 | 0.27 |
| MS-Diffusion | 0.62 | 0.51 | 0.32 | 0.23 |
| TokenVerse (tuning) | 0.56 | 0.56 | 0.31 | 0.38 |
| Mod-Adapter | 0.70 | 0.89 | 0.62 | 0.54 |

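Here CP denotes concept preservation, PF denotes prompt fidelity, and CP·PF is their product. The multi-concept composite score CP·PF reaches 0.62, a relative gain of 67.6% over the second-best method, MIP-Adapter (0.37).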
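
The arithmetic behind the composite score and the relative gain can be checked directly from the table. The snippet below only re-derives numbers already reported above; the dictionary layout and function name are illustrative.

```python
# (CP, PF, reported CP·PF) for the multi-concept setting, copied from the table.
multi_concept = {
    "Emu2": (0.53, 0.48, 0.25),
    "MIP-Adapter": (0.68, 0.55, 0.37),
    "MS-Diffusion": (0.62, 0.51, 0.32),
    "TokenVerse (tuning)": (0.56, 0.56, 0.31),
    "Mod-Adapter": (0.70, 0.89, 0.62),
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# The reported composite agrees with CP * PF up to two-decimal rounding.
products_match = all(
    abs(cp * pf - reported) < 0.006 for cp, pf, reported in multi_concept.values()
)

ranking = sorted(multi_concept, key=lambda m: multi_concept[m][2], reverse=True)
best, runner_up = ranking[0], ranking[1]
gain = relative_gain(multi_concept[best][2], multi_concept[runner_up][2])
print(best, runner_up, f"{gain:.1f}%")
```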

Ablation Study

| Configuration | Multi-concept CP·PF | Single-concept CP·PF |
|---|---|---|
| w/o k-means routing | 0.49 | 0.44 |
| w/o MoE | 0.35 | 0.42 |
| w/o VL-attn | 0.39 | 0.49 |
| w/o pre-training | 0.17 | 0.24 |
| Full model | 0.62 | 0.54 |

Key Findings

VLM pre-training is the most critical component — removing it causes CP·PF to drop sharply from 0.62 to 0.17, indicating that the gap from the image to the modulation space is substantial.
MoE outperforms a single MLP (0.62 vs. 0.35), and k-means routing outperforms learnable routing (0.62 vs. 0.49).
In a user study (32 participants, 4,000 votes), Mod-Adapter outperforms the baselines by a large margin on both CP and PF (multi-concept CP: 4.29/5, PF: 4.40/5).
Existing tuning-free methods generally fail on abstract concepts, tending to copy-paste the original object rather than extracting abstract attributes.
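
As a quick check on the ablation findings, the drop in multi-concept CP·PF for each removed component can be computed from the ablation table and ranked. The values are copied from the table; the variable names are illustrative.

```python
full_model = 0.62  # multi-concept CP·PF of the full model

# Multi-concept CP·PF for each ablated configuration, copied from the table.
ablations = {
    "w/o k-means routing": 0.49,
    "w/o MoE": 0.35,
    "w/o VL-attn": 0.39,
    "w/o pre-training": 0.17,
}

# Absolute drop relative to the full model, largest first.
drops = {name: full_model - score for name, score in ablations.items()}
ranked = sorted(drops, key=drops.get, reverse=True)

for name in ranked:
    print(f"{name}: -{drops[name]:.2f}")
```

Removing pre-training costs the most, followed by removing MoE, the vision-language attention, and k-means routing, which matches the ordering implied by the findings above.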

Highlights & Insights

First tuning-free method for abstract concept personalization: By exploiting the locality and semantic additivity of the DiT modulation space, Mod-Adapter achieves unified and decoupled customization of both object and abstract concepts — a capability previously unavailable in tuning-free frameworks.
VLM-guided pre-training: Using VLM image understanding as a semantic bridge to narrow the image-to-modulation gap is an elegant warm-up strategy. Since this stage is not connected to DiT at all (no forward or backward passes through it), the pre-training overhead is minimal.
K-means MoE routing: Replacing learnable gating with parameter-free clustering sidesteps the expert-imbalance problem of learned routers; the approach is simple yet effective.

Limitations & Future Work

Mod-Adapter has 1.67B parameters; although it is the only trainable component, it is considerably heavier than textual-inversion-based methods.
Training data for abstract concepts is synthesized via FLUX self-distillation, which may limit data quality and diversity.
Inference efficiency is not discussed — multi-concept inference requires a separate Mod-Adapter forward pass for each concept.
The method is built on the FLUX architecture; transferring it to non-DiT architectures (e.g., U-Net) would require redesign.

Related Work & Insights

vs. TokenVerse: Both exploit the DiT modulation space, but TokenVerse requires per-image fine-tuning of an MLP, whereas Mod-Adapter generalizes to new concepts without tuning.
vs. IP-Adapter/MIP-Adapter: These methods inject image features via cross-attention but lack localized control, rendering them unable to handle abstract concepts.
vs. MS-Diffusion: Employs a layout-guided scheme for multi-subject generation but is likewise limited to object concepts.
The direction manipulation strategy in the modulation space may inspire other controllable generation tasks, such as affective control and style transfer.

Rating

Novelty: ⭐⭐⭐⭐ First tuning-free framework to unify object and abstract concept customization; leveraging the modulation space is a novel perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative + qualitative + user study + comprehensive ablation; inference efficiency analysis is absent.
Writing Quality: ⭐⭐⭐⭐ Motivation is clear, method description is thorough, and figures are intuitive.
Value: ⭐⭐⭐⭐⭐ High practical value; tuning-free multi-concept personalization has broad application scenarios.