
Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter

Conference: ICLR 2026 · arXiv: 2505.18612 · Code: Project Page · Area: Diffusion Models / Personalized Generation · Keywords: Multi-concept Personalization, Tuning-Free, DiT Modulation Space, Mixture-of-Experts, VLM Pre-training

TL;DR

This paper proposes Mod-Adapter, a tuning-free multi-concept personalization method that predicts concept-specific modulation directions in the modulation space of DiT. It enables decoupled customization of both object and abstract concepts (pose, lighting, material, etc.) and substantially outperforms existing methods on multi-concept personalization benchmarks.

Background & Motivation

Background: Personalized text-to-image generation aims to synthesize target concepts from user-provided reference images. Existing methods predominantly focus on object concepts (persons, animals, everyday items), and multi-concept personalization methods similarly handle combinations of multiple objects.

Limitations of Prior Work: (a) Existing tuning-free methods (e.g., IP-Adapter, MS-Diffusion) fail to disentangle object and abstract concepts — when given an image of a person in a specific pose, they replicate the entire person rather than extracting only the pose; (b) TokenVerse supports abstract concepts but requires test-time fine-tuning for each new image, which is time-consuming and prone to overfitting.

Key Challenge: Abstract concepts (pose, lighting, material) are not independent visual entities — they are tightly coupled with objects and difficult to extract in isolation. Moreover, there exists a substantial gap between extracted visual features and the modulation space of DiT.

Goal: (i) Generalize to new concepts without test-time fine-tuning; (ii) Support customization of both object and abstract concepts simultaneously; (iii) Achieve decoupled control across multiple concepts.

Key Insight: The AdaLN modulation space of DiT exhibits locality and semantic additivity; assigning different modulation vectors to different tokens therefore enables localized concept control.

Core Idea: Train a Mod-Adapter module to predict concept-specific modulation directions, with VLM-guided pre-training to bridge the large gap between the image and modulation spaces.

Method

Overall Architecture

Built upon FLUX (DiT architecture), Mod-Adapter takes a concept image and its corresponding concept word (e.g., "surface") as input and predicts concept-specific modulation directions \(\{\Delta_i \mid i=1,\ldots,N\}\) (for \(N=57\) DiT blocks). These directions are added to the modulation vectors of the corresponding concept text tokens, producing localized effects on concept-relevant image regions via the joint attention layers. During multi-concept inference, modulation directions for different concepts are applied independently to their respective text tokens.
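
To make the mechanism concrete, below is a minimal PyTorch sketch, assuming a simplified block with a single (shift, scale, gate) triple, of how a predicted direction \(\Delta_i\) could shift the AdaLN modulation of the concept's text tokens. The function name `apply_concept_modulation` and the tensor layout are hypothetical, not the authors' implementation; FLUX blocks actually carry several modulation triples each.

```python
import torch
import torch.nn.functional as F

def apply_concept_modulation(txt_tokens, base_mod, delta_i, concept_token_ids):
    """Sketch: shift the AdaLN modulation of concept text tokens in one DiT block.

    txt_tokens:        (B, T, D)   text-token hidden states entering the block
    base_mod:          (B, 3 * D)  base (shift, scale, gate) from pooled text + timestep
    delta_i:           (B, 3 * D)  concept-specific modulation direction for block i
    concept_token_ids: list[int]   positions of the concept word's text tokens
    """
    B, T, D = txt_tokens.shape
    # Broadcast the single block-level modulation vector to every token.
    mod = base_mod.unsqueeze(1).expand(B, T, 3 * D).clone()
    # Locality: only the concept's own text tokens receive the direction; the
    # effect then propagates to concept-relevant image regions via joint attention.
    mod[:, concept_token_ids] += delta_i.unsqueeze(1)
    shift, scale, gate = mod.chunk(3, dim=-1)
    modulated = (1 + scale) * F.layer_norm(txt_tokens, (D,)) + shift
    return modulated, gate  # gate is applied to the branch output downstream
```

During multi-concept inference, this shift would simply be repeated per concept, each with its own \(\Delta_i\) and token positions, which is what makes the control decoupled.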

Key Designs

  1. Vision-Language Cross-Attention:

    • Function: Extracts visual features of the target concept from the concept image.
    • Mechanism: The concept word (e.g., "surface") is encoded via a CLIP text encoder and an MLP projection layer to obtain a neutral feature, which is projected into \(N\) queries (with sinusoidal positional encodings to distinguish different DiT blocks); the concept image is encoded via a CLIP image encoder to obtain keys and values; cross-attention \(\text{Attention}(Q_i, K, V)\) is used to extract concept visual features (see the first sketch after this list).
    • Design Motivation: Leverages CLIP's vision-language alignment to precisely extract concept-relevant features from the image using the concept word as an anchor, rather than naively extracting global features.
  2. Mixture-of-Experts (MoE) Projection:

    • Function: Maps extracted concept visual features accurately into the DiT modulation space.
    • Mechanism: Since different concept types (object vs. material vs. pose) exhibit substantially different mapping patterns, a single MLP is insufficient. Twelve expert MLPs are introduced, each handling concepts with similar mapping patterns. The routing mechanism adopts a parameter-free k-means clustering scheme — experts are assigned by clustering the neutral features of all concept words in the training set via k-means (see the second sketch after this list).
    • Design Motivation: Learnable linear gating networks suffer from imbalanced expert utilization; k-means routing avoids this issue in a simple and effective manner.
  3. VLM-Guided Pre-training:

    • Function: Provides a good initialization for Mod-Adapter and bridges the large gap between the concept image space and the DiT modulation space.
    • Mechanism: A VLM generates detailed descriptions \(p^+\) of concept images (e.g., "transparent cyan-green glass surface"), which are encoded as supervision signals in the modulation space. The pre-training loss is \(\mathcal{L}_{\text{pretrain}} = \frac{1}{N}\sum_{i=1}^N \|F_i^+ - \mathcal{M}(\text{CLIP}(p^+))\|_2^2\).
    • Design Motivation: Pre-training does not require DiT forward passes, making it lightweight and efficient; the strong image understanding capability of VLMs provides high-quality semantic bridges.
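
As referenced in item 1 above, here is a minimal sketch of the vision-language cross-attention, assuming illustrative dimensions and head counts; the class name `VLCrossAttention` and the exact sinusoidal scheme are our assumptions, not the release code.

```python
import math
import torch
import torch.nn as nn

class VLCrossAttention(nn.Module):
    """Sketch: the concept word's CLIP feature becomes N block-tagged queries;
    CLIP patch features of the concept image supply keys and values."""

    def __init__(self, n_blocks=57, clip_dim=768, dim=1024, n_heads=8):
        super().__init__()
        self.word_proj = nn.Sequential(
            nn.Linear(clip_dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.img_proj = nn.Linear(clip_dim, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Fixed sinusoidal encodings that tag which DiT block each query targets.
        pe = torch.zeros(n_blocks, dim)
        pos = torch.arange(n_blocks).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_enc", pe)

    def forward(self, word_feat, img_feats):
        # word_feat: (B, clip_dim) CLIP feature of the concept word ("neutral feature")
        # img_feats: (B, P, clip_dim) CLIP patch features of the concept image
        q = self.word_proj(word_feat).unsqueeze(1) + self.pos_enc.unsqueeze(0)  # (B, N, dim)
        kv = self.img_proj(img_feats)
        out, _ = self.attn(q, kv, kv)
        return out  # per-block concept visual features
```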
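And for item 2, a sketch of the k-means-routed MoE projection, again under assumed names and dimensions (`KMeansMoEProjector`, `mod_dim`); the centroids would be computed offline over the neutral features of all training-set concept words.

```python
import torch
import torch.nn as nn

class KMeansMoEProjector(nn.Module):
    """Sketch: parameter-free routing sends each concept to the expert whose
    precomputed k-means centroid is nearest to its neutral word feature."""

    def __init__(self, centroids, dim=1024, mod_dim=3072):
        super().__init__()
        self.register_buffer("centroids", centroids)  # (n_experts, clip_dim)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, mod_dim))
            for _ in range(centroids.shape[0])  # the paper uses 12 experts
        )

    def forward(self, concept_feats, word_feat):
        # concept_feats: (B, N, dim) per-block features from VL cross-attention
        # word_feat:     (B, clip_dim) neutral feature, used only for routing
        dists = torch.cdist(word_feat, self.centroids)  # (B, n_experts)
        route = dists.argmin(dim=-1)                    # hard, parameter-free routing
        deltas = torch.stack(
            [self.experts[r](f) for r, f in zip(route.tolist(), concept_feats)])
        return deltas  # (B, N, mod_dim): modulation directions {Δ_i}
```

Because the assignment depends only on fixed centroids, no expert can collapse or be starved by a learnable gate, which matches the paper's motivation for dropping linear gating.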

Loss & Training

The pre-training stage uses only \(\mathcal{L}_{\text{pretrain}}\) (MSE loss) without connecting to DiT; the main training stage uses the standard diffusion denoising loss of FLUX. Training data includes MVImgNet (objects), AFHQ (animal faces), and FLUX self-distillation synthetic data (abstract concepts), totaling 106K images.
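
A minimal sketch of the pre-training objective as written above, assuming hypothetical callables `mod_adapter`, `clip_text_encoder`, and `modulation_map` (standing in for the frozen mapping \(\mathcal{M}\)); note that no DiT forward pass is involved.

```python
import torch
import torch.nn.functional as F

def pretrain_loss(mod_adapter, clip_text_encoder, modulation_map,
                  img_feats, word_feat, p_plus):
    """Sketch of L_pretrain: match Mod-Adapter outputs F_i^+ to the
    modulation-space encoding of the detailed VLM caption p+."""
    f_pred = mod_adapter(img_feats, word_feat)              # (B, N, mod_dim)
    with torch.no_grad():
        target = modulation_map(clip_text_encoder(p_plus))  # (B, N, mod_dim)
    # mse_loss averages over all elements, which matches the paper's
    # (1/N) * sum_i ||F_i^+ - M(CLIP(p+))||_2^2 up to a constant factor.
    return F.mse_loss(f_pred, target)
```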

Key Experimental Results

Main Results

| Method | Multi-concept CP | Multi-concept PF | Multi-concept CP·PF | Single-concept CP·PF |
|---|---|---|---|---|
| Emu2 | 0.53 | 0.48 | 0.25 | 0.42 |
| MIP-Adapter | 0.68 | 0.55 | 0.37 | 0.27 |
| MS-Diffusion | 0.62 | 0.51 | 0.32 | 0.23 |
| TokenVerse (tuning) | 0.56 | 0.56 | 0.31 | 0.38 |
| Mod-Adapter | 0.70 | 0.89 | 0.62 | 0.54 |

(CP: concept preservation; PF: prompt fidelity; CP·PF is their product.)

The multi-concept composite score CP·PF reaches 0.62, surpassing the second-best method MIP-Adapter (0.37) by 67.6% in relative terms.

Ablation Study

| Configuration | Multi-concept CP·PF | Single-concept CP·PF |
|---|---|---|
| w/o k-means routing | 0.49 | 0.44 |
| w/o MoE | 0.35 | 0.42 |
| w/o VL-attn | 0.39 | 0.49 |
| w/o pre-training | 0.17 | 0.24 |
| Full model | 0.62 | 0.54 |

Key Findings

  • VLM pre-training is the most critical component — removing it causes CP·PF to drop sharply from 0.62 to 0.17, indicating that the gap from the image to the modulation space is substantial.
  • MoE outperforms a single MLP (0.62 vs. 0.35), and k-means routing outperforms learnable routing (0.62 vs. 0.49).
  • In a user study (32 participants, 4,000 votes), Mod-Adapter outperforms all baselines by a large margin on both CP and PF (multi-concept CP: 4.29/5, PF: 4.40/5).
  • Existing tuning-free methods generally fail on abstract concepts, tending to copy-paste the original object rather than extracting abstract attributes.

Highlights & Insights

  • First tuning-free method for abstract concept personalization: By exploiting the locality and semantic additivity of the DiT modulation space, Mod-Adapter achieves unified and decoupled customization of both object and abstract concepts — a capability previously unavailable in tuning-free frameworks.
  • VLM-guided pre-training: Using VLM image understanding as a semantic bridge to narrow the image-to-modulation gap is an elegant warm-up strategy. Since it does not require backpropagation through DiT, the pre-training overhead is minimal.
  • K-means MoE routing: Replacing learnable gating with parameter-free clustering fundamentally resolves the expert imbalance problem; the approach is simple yet effective.

Limitations & Future Work

  • The model has 1.67B parameters; while it is the only trainable component, this is considerably heavier than textual inversion-based methods.
  • Training data for abstract concepts is synthesized via FLUX self-distillation, which may limit data quality and diversity.
  • Inference efficiency is not discussed — multi-concept inference requires a separate Mod-Adapter forward pass for each concept.
  • The method is built on the FLUX architecture; transferring it to non-DiT architectures (e.g., U-Net) would require redesign.
  • The direction manipulation strategy in the modulation space may inspire other controllable generation tasks, such as affective control and style transfer.

Comparison with Related Work

  • vs. TokenVerse: Both exploit the DiT modulation space, but TokenVerse requires per-image fine-tuning of an MLP, whereas Mod-Adapter is a tuning-free generalization.
  • vs. IP-Adapter/MIP-Adapter: These methods inject image features via cross-attention but lack localized control, rendering them unable to handle abstract concepts.
  • vs. MS-Diffusion: Employs a layout-guided scheme for multi-subject generation but is likewise limited to object concepts.

Rating

  • Novelty: ⭐⭐⭐⭐ First tuning-free framework to unify object and abstract concept customization; leveraging the modulation space is a novel perspective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative + qualitative + user study + comprehensive ablation; inference efficiency analysis is absent.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, method description is thorough, and figures are intuitive.
  • Value: ⭐⭐⭐⭐⭐ High practical value; tuning-free multi-concept personalization has broad application scenarios.