Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation¶
Conference: CVPR 2026 arXiv: 2604.00849 Code: Unavailable (planned open-source) Area: Image Generation Keywords: Subject-Driven T2I, Diffusion Transformer, GRPO, reward model, Textual-Visual Decoupling
TL;DR¶
This paper proposes the DisCo framework, which resolves the similarity-controllability paradox in subject-driven image generation by first decoupling textual and visual information (replacing entity words with generic pronouns so the text no longer interferes with the subject's appearance) and then re-coupling them via GRPO with a dedicated reward model.
Background & Motivation¶
The core tension in subject-driven T2I generation is the similarity-controllability paradox: preserving high subject fidelity while faithfully following textual editing instructions. Existing methods such as IP-Adapter, OminiControl, and DreamO rely on encoder injection or unified sequence modeling, yet none resolves the conflict at its root.
The paper's core insight is that the root cause lies in the role overload of the text prompt. Conventional prompts simultaneously describe the subject and the editing instruction (e.g., "a duck toy in the jungle"), where "duck toy" activates the model's prior knowledge and conflicts with the actual details of the reference image. Experiments (Fig. 1) demonstrate that replacing "a duck toy" with "this item" significantly improves subject fidelity—indicating that the problem is not insufficient model capacity, but rather that entity descriptors in the prompt introduce contradictory signals.
Method¶
Overall Architecture¶
DisCo is a two-stage "disentangle-then-recouple" framework built on the FLUX DiT model:

1. Textual-Visual Decoupling (TVD) module: fully separates subject identity information from the textual control instruction.
2. GRPO re-coupling stage: uses reinforcement learning to reintegrate the decoupled visual subject with the textual context.
Key Designs¶
- Prompt Simplification Strategy: Qwen2.5-VL 72B analyzes the prompt, identifies the entity words that name the subject (e.g., "a duck toy"), and replaces them with a generic pronoun ("this item" / "it"). This forces the model to draw subject identity from the visual modality, removing interference from textual priors (see the TVD sketch after this list).
- Visual Grounding Localization: after prompt simplification, the model cannot tell which object in the reference image "this item" refers to. GroundingDINO is therefore run with the original entity words to precisely localize the subject in the reference image, bridging the generic pronoun to concrete visual features. Attention-map visualizations (Fig. 2) confirm that after decoupling, attention to entity words is suppressed, while attention to the reference subject focuses precisely on the corresponding region of the generated image.
- Dedicated Reward Model Training: existing rewards (ImageReward, CLIP-T, HPS) evaluate only overall quality or text alignment and cannot capture subject fidelity or compositional coherence. A VLM automatically generates editing instructions and synthesizes negative samples (by altering subject identity features or subject-context interactions), yielding preference pairs used to train a Qwen3-VL-30B-based reward model that jointly scores subject similarity and compositional naturalness (see the preference-loss sketch below).
- GRPO Reinforcement Learning: for each prompt-image pair, \(G=12\) images are sampled; the reward model performs pairwise preference selection, and aggregated log-probabilities serve as rewards. Advantages \(\hat{A}_t^i\) are computed via group-level normalization, and the policy is optimized with a clipped objective plus KL regularization (see the GRPO sketch below).
Loss & Training¶
- Base model: FLUX; Dataset: Subjects200K
- Optimizer: AdamW, learning rate 1e-5, 8×H20 GPUs
- GRPO settings: 16 sampling timesteps per rollout, 12 images sampled per prompt (\(G=12\)), noise level ε = 0.3
- Reward model: Qwen3-VL-30B, trained on 25k preference pairs
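For reference, the reported hyperparameters collected into one place. The class and field names below are ours; only the values come from the paper.

```python
from dataclasses import dataclass

@dataclass
class DisCoGRPOConfig:
    base_model: str = "FLUX"
    dataset: str = "Subjects200K"
    learning_rate: float = 1e-5      # AdamW
    group_size: int = 12             # G: images sampled per prompt
    sampling_steps: int = 16         # denoising timesteps per rollout
    noise_level: float = 0.3         # ε: exploration noise during sampling
    reward_model: str = "Qwen3-VL-30B"
    num_preference_pairs: int = 25_000
```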
Key Experimental Results¶
Main Results¶
Evaluated on DreamBench (30 subjects × 25 prompts = 750 cases):
| Metric | DisCo | FLUX Kontext | DreamO | UNO | Gain |
|---|---|---|---|---|---|
| CLIP-B-I↑ | 0.928 | 0.910 | 0.899 | 0.899 | +1.8% |
| CLIP-L-I↑ | 0.937 | 0.911 | 0.901 | 0.907 | +2.6% |
| DINO-I↑ | 0.903 | 0.839 | 0.813 | 0.827 | +7.6% |
| CLIP-B-T↑ | 0.329 | 0.321 | 0.322 | 0.311 | +2.5% |
| CLIP-L-T↑ | 0.273 | 0.268 | 0.267 | 0.255 | +1.9% |
| ImageReward↑ | 1.339 | 1.276 | 1.186 | 0.854 | +4.9% |
DisCo achieves state-of-the-art performance simultaneously on both subject similarity and text controllability, breaking the trade-off that plagued prior methods.
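For context on the table's metrics, here is how CLIP-I is conventionally computed under the DreamBench protocol: the cosine similarity between CLIP image embeddings of the generated and reference images (DINO-I is analogous with a DINO backbone). This is standard evaluation code, not code from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # CLIP-B
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(gen: Image.Image, ref: Image.Image) -> float:
    """CLIP-I: cosine similarity between CLIP image embeddings of the
    generated image and the reference image (higher = more faithful)."""
    inputs = processor(images=[gen, ref], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```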
Ablation Study¶
| Configuration | CLIP-I↑ | CLIP-T↑ | ImageReward↑ | Note |
|---|---|---|---|---|
| w/o TVD | 0.915 | 0.319 | 1.237 | No decoupling; subject fidelity degrades |
| w/o GRPO | 0.922 | 0.319 | 1.189 | No RL; compositional quality drops substantially |
| CLIP as reward | 0.898 | 0.319 | 1.163 | CLIP cannot assess fine-grained quality |
| ImageReward as reward | 0.914 | 0.326 | 1.404 | Improves quality but harms subject similarity |
| Pretrained VLM as reward | 0.918 | 0.321 | 1.189 | A general VLM fails to calibrate complex preferences |
| DisCo (Ours) | 0.928 | 0.329 | 1.339 | All components synergize |
Key Findings¶
- The TVD module addresses subject fidelity, but strict decoupling can cause unnatural compositions (e.g., a candle floating in an urban background).
- GRPO is critical for closing the compositional gap, raising ImageReward from 1.189 to 1.339.
- The dedicated reward model substantially outperforms CLIP, ImageReward, and general-purpose VLMs as reward signals.
- User study (100 cases): DisCo achieves win rates of 80% over UNO, 82% over DreamO, and 51% over FLUX Kontext.
Highlights & Insights¶
- Precise problem diagnosis: the motivating experiment in Fig. 1 directly exposes the conflict between textual priors and visual references, a simple but incisive observation.
- Disentangle-then-recouple design philosophy: Eliminating conflicts via information isolation before restoring interaction through RL proves more effective than optimizing directly in an entangled space.
- Synthetic negative samples for reward model training: Preference pairs are automatically constructed using VLM-generated editing instructions, eliminating manual annotation while targeting failure modes specific to subject-driven generation.
- Attention map visualization provides direct empirical evidence for the effectiveness of decoupling.
Limitations & Future Work¶
- Inference requires Qwen2.5-VL 72B for prompt analysis and GroundingDINO for localization, adding significant overhead at test time.
- The reward model is trained on only 25k preference pairs, which may limit generalization to edge cases.
- Evaluation is conducted solely on DreamBench, leaving broader benchmark coverage unexplored.
- Multi-subject scenarios are not discussed.
Related Work & Insights¶
- The trend of migrating GRPO from LLMs (DeepSeek-R1) to diffusion models (Flow-GRPO, DanceGRPO) warrants continued attention.
- The pipeline of synthetic negative samples → reward model training → RL is transferable to other generation tasks requiring multi-dimensional evaluation.
- The principle of "information source decoupling" holds general value for multi-condition generation tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The core insight (prompt role overload) is precise; the disentangle-then-recouple framework is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Quantitative, qualitative, user study, and ablation analyses are all comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ — The motivating example in Fig. 1 is highly convincing; the paper is clearly written throughout.
- Value: ⭐⭐⭐⭐ — Provides a systematic solution to subject-driven T2I generation.