
Minority-Focused Text-to-Image Generation via Prompt Optimization

Conference: CVPR 2025
arXiv: 2410.07838
Code: https://github.com/soobin-um/MinorityPrompt
Area: Diffusion Models / Image Generation
Keywords: Minority Sample Generation, Prompt Optimization, Text-to-Image, Low-Density Sampling, Diffusion Model Bias

TL;DR

MinorityPrompt is an online prompt optimization framework. At inference time it iteratively optimizes a learnable token embedding to maximize a reconstruction loss that serves as a proxy for low likelihood. This guides T2I diffusion models toward minority samples in low-density regions of the data distribution while maintaining semantic consistency and generation quality.

Background & Motivation

  1. Background: Text-to-image (T2I) diffusion models, coupled with classifier-free guidance (CFG), can generate high-quality images that are faithful to the prompt. However, guidance techniques like CFG inherently tend to sample from high-density regions of the data manifold, generating "typical" images.
  2. Limitations of Prior Work: This preference for high-density regions makes it difficult for models to generate minority samples—unique instances located in low-density regions of the conditional data distribution. Consequently, the data generated by T2I models lacks diversity, and can perpetuate or even amplify biases (such as age or ethnic stereotypes) in downstream applications like data augmentation.
  3. Key Challenge: Existing minority sampling methods either require external classifiers (which are difficult to obtain) or work only on simple image benchmarks, and show limited performance in T2I scenarios. Meanwhile, existing online prompt optimization methods modify the entire text embedding, which easily disrupts the semantics of the original prompt.
  4. Goal: How to guide T2I models to generate unique minority samples from low-density regions while maintaining the original text semantics?
  5. Key Insight: Instead of modifying the entire text embedding, a learnable token is appended to the end of the prompt. Only the embedding of this token is optimized to maximize a reconstruction loss (acting as a proxy for likelihood), thereby encouraging the generation of unique characteristics while preserving semantics.
  6. Core Idea: Optimizing the appended learnable token embedding online, at inference time, to maximize an approximate negative ELBO biases generation toward low-likelihood, unique minority samples.

Method

Overall Architecture

Given a user prompt \(\mathcal{P}\) (e.g., "A portrait of a dog"), MinorityPrompt appends a placeholder string \(\mathcal{S}\) to its end, resulting in an augmented prompt \(\mathcal{P}_\mathcal{S}\). At each sampling timestep \(t\), the token embedding \(\boldsymbol{v}\) corresponding to \(\mathcal{S}\) is optimized so that the denoising output conditioned on it has a higher reconstruction loss (a proxy for low likelihood). The optimized embedding is used for sampling at the current step and is then carried over to the next step as the starting point for further optimization.
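The control flow described above can be sketched as a sampling loop with a warm-started token embedding. This is a minimal illustration, not the authors' implementation: the function names and the two toy stand-ins (`toy_optimize_token`, `toy_denoise_step`) are hypothetical and only make the loop runnable without a diffusion model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def sample_with_token_optimization(
    z_T: Vector,
    v_init: Vector,
    timesteps: Sequence[int],
    optimize_token: Callable[[Vector, Vector, int], Vector],
    denoise_step: Callable[[Vector, Vector, int], Vector],
) -> Tuple[Vector, List[Vector]]:
    """Run sampling while re-optimizing the token embedding at every timestep."""
    z, v = list(z_T), list(v_init)
    v_history: List[Vector] = []
    for t in timesteps:
        # Warm start: optimization at step t begins from the embedding
        # found at the previous step.
        v = optimize_token(z, v, t)
        # The freshly optimized embedding conditions the update at this step.
        z = denoise_step(z, v, t)
        v_history.append(list(v))
    return z, v_history


# Toy stand-ins (hypothetical, NOT a diffusion model).
def toy_optimize_token(z: Vector, v: Vector, t: int) -> Vector:
    return [vi + 1.0 for vi in v]


def toy_denoise_step(z: Vector, v: Vector, t: int) -> Vector:
    return [0.5 * zi + 0.1 * vi for zi, vi in zip(z, v)]


z_0, v_history = sample_with_token_optimization(
    z_T=[1.0],
    v_init=[0.0],
    timesteps=[3, 2, 1],
    optimize_token=toy_optimize_token,
    denoise_step=toy_denoise_step,
)
```

In the toy run, `v_history` grows by one unit per step, which shows that each optimization call starts from the previous step's result rather than from `v_init`.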

Key Designs

  1. Semantic-Preserving Prompt Optimization Framework:

    • Function: Introduce controllable additional semantic information without disrupting the original prompt's semantics.
    • Mechanism: Instead of optimizing the entire text embedding \(\mathcal{C}\), only the embedding vector \(\boldsymbol{v}\) of the appended learnable token is optimized. When the text encoder processes the augmented prompt \(\mathcal{P}_\mathcal{S}\), the token embeddings of each word in the original prompt remain unchanged, and only the embedding corresponding to \(\mathcal{S}\) is updated. The optimization objective is \(\boldsymbol{v}_t^* = \arg\max_{\boldsymbol{v}} \mathcal{J}_\mathcal{C}(\boldsymbol{z}_t, \mathcal{C}_{\boldsymbol{v}})\).
    • Design Motivation: Directly modifying the entire \(\mathcal{C}\) changes all token embeddings, which harms semantics. Optimizing only the appended token is a safer approach, and it allows the embedding to adaptively change over timesteps (unlike methods like Textual Inversion that require pre-training to fix the embedding).
  2. Likelihood-Based Minority Objective Function:

    • Function: Drive the generation towards low-density regions.
    • Mechanism: Define the objective function \(\mathcal{J}_\mathcal{C}(\boldsymbol{z}_t, \mathcal{C}_{\boldsymbol{v}}) = \mathbb{E}_\epsilon[\|\hat{\boldsymbol{z}}_0(\boldsymbol{z}_t, \mathcal{C}_{\boldsymbol{v}}) - \hat{\boldsymbol{z}}_0(\boldsymbol{z}_{s|t,0}, \mathcal{C})\|^2_2]\). The first term is the clean estimate obtained by denoising \(\boldsymbol{z}_t\) conditioned on the optimized token. The second term takes a noisy version of that same clean estimate at timestep \(s\), denoted \(\boldsymbol{z}_{s|t,0}\), and denoises it again under the original condition \(\mathcal{C}\). The paper proves that this objective is equivalent to a negative ELBO, i.e., a variational upper bound on \(-\log p_\theta(\hat{\boldsymbol{z}}_0 | \mathcal{C})\), so maximizing it pushes the generated result away from high-density regions.
    • Design Motivation: Compared to naive CFG-based objective functions, this design avoids three issues: (i) it does not rely on CFG's denoising estimation, (ii) it allows gradients to flow through the second term, and (iii) the second term uses the original condition \(\mathcal{C}\) instead of \(\mathcal{C}_{\boldsymbol{v}}\).
  3. Stabilization Techniques (Stop-Gradient Trick + Annealed Timestep):

    • Function: Stabilize the optimization process and improve generation quality.
    • Mechanism: Split the objective function into \(\tilde{\mathcal{J}}_\mathcal{C} = \mathcal{J}^1_\mathcal{C} + \lambda \mathcal{J}^2_\mathcal{C}\), where \(\mathcal{J}^1\) applies stop-gradient to the second term, \(\mathcal{J}^2\) applies stop-gradient to the first term, and \(\lambda=1\) achieves the best performance. Additionally, use an annealed timestep \(s = T - t\) instead of a fixed value. Optimization is performed every \(N\) steps, while non-optimized steps use the original prompt \(\mathcal{C}\).
    • Design Motivation: The bidirectional stop-gradient lets both terms handle different optimization directions. The annealed timestep fits the optimal reconstruction scale across different noise levels. Interval-based optimization reduces computational overhead and stabilizes output quality.
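The bookkeeping behind the three designs above (an appended token, an annealed timestep, and interval-based optimization) can be sketched in a few helpers. This is a simplified illustration under stated assumptions: embeddings are plain lists of floats, all function names are hypothetical, and the choice that optimization happens on step indices 0, N, 2N, ... is an assumption, since the summary only says "every N steps".

```python
from typing import List

Vector = List[float]


def augment_with_token(prompt_tokens: List[Vector], v: Vector) -> List[Vector]:
    """Append the learnable token embedding v after the frozen prompt tokens."""
    return [list(e) for e in prompt_tokens] + [list(v)]


def annealed_timestep(t: int, T: int) -> int:
    """Annealed reconstruction timestep s = T - t, as stated in the text."""
    return T - t


def is_optimization_step(step_index: int, N: int) -> bool:
    """Assumed schedule: optimize on step indices 0, N, 2N, ..."""
    return step_index % N == 0


def condition_for_step(
    step_index: int, N: int, prompt_tokens: List[Vector], v: Vector
) -> List[Vector]:
    """Optimized steps use the augmented prompt; other steps use the original."""
    if is_optimization_step(step_index, N):
        return augment_with_token(prompt_tokens, v)
    return [list(e) for e in prompt_tokens]


prompt = [[1.0, 2.0], [3.0, 4.0]]   # two frozen prompt-token embeddings (toy values)
v = [0.5, -0.5]                     # learnable token embedding (toy values)

cond_opt = condition_for_step(0, 3, prompt, v)    # optimization step
cond_plain = condition_for_step(1, 3, prompt, v)  # non-optimized step
```

Only the last row of `cond_opt` depends on `v`; the rows that come from the original prompt are copied unchanged, which mirrors the semantic-preservation argument.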

Loss & Training

MinorityPrompt is an inference-time method and requires no additional training. At each optimization step, \(\boldsymbol{v}\) is updated with the Adam optimizer for \(K\) iterations. Experiments use 50-step DDIM sampling with CFG scale \(w=7.5\) on SDv1.5 and SDv2.0, and 4-step sampling with \(w=1.0\) on SDXL-Lightning.
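To make the update rule concrete, here is a minimal pure-Python sketch of Adam used for gradient ascent over \(K\) iterations. The learning rate, the value of `K`, and the toy objective are illustrative assumptions (the summary does not report them); `beta1`, `beta2`, and `eps` are the commonly used Adam defaults. A real implementation would obtain the gradient by backpropagating through the diffusion model rather than from a closed-form function.

```python
import math
from typing import Callable, List

Vector = List[float]


def adam_ascent(
    v: Vector,
    grad_fn: Callable[[Vector], Vector],
    K: int,
    lr: float,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> Vector:
    """Run K Adam iterations that INCREASE the objective whose gradient is grad_fn."""
    v = list(v)
    m = [0.0] * len(v)  # first-moment estimate
    s = [0.0] * len(v)  # second-moment estimate
    for k in range(1, K + 1):
        g = grad_fn(v)
        for i in range(len(v)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            s[i] = beta2 * s[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** k)  # bias correction
            s_hat = s[i] / (1.0 - beta2 ** k)
            # Plus sign: ascent, because the objective is maximized.
            v[i] += lr * m_hat / (math.sqrt(s_hat) + eps)
    return v


# Toy objective (hypothetical): J(v) = ||v||^2, with gradient 2v.
def toy_grad(v: Vector) -> Vector:
    return [2.0 * x for x in v]


v_start = [1.0, 0.0]
v_end = adam_ascent(v_start, toy_grad, K=5, lr=0.1)
```

The toy objective grows as `v` moves away from the origin, so the ascent steps increase it; coordinates with zero gradient stay where they are.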

Key Experimental Results

Main Results

Results are evaluated on 10K random captions from the MS-COCO validation set:

| Model | Method | CLIPScore↑ | PickScore↑ | ImageReward↑ | Likelihood↓ |
| --- | --- | --- | --- | --- | --- |
| SDv1.5 | DDIM | 31.48 | 21.48 | 0.211 | 1.037 |
| SDv1.5 | SGMS | 31.17 | 21.21 | 0.123 | 0.954 |
| SDv1.5 | MinorityPrompt | 31.54 | 21.31 | 0.235 | 0.897 |
| SDv2.0 | DDIM | 31.85 | 21.68 | 0.382 | 1.110 |
| SDv2.0 | MinorityPrompt | 31.96 | 21.60 | 0.425 | 0.914 |
| SDXL-LT | DDIM | 31.52 | 22.67 | 0.733 | 0.608 |
| SDXL-LT | SGMS | 31.30 | 22.58 | 0.680 | 0.546 |
| SDXL-LT | MinorityPrompt | 31.34 | 22.61 | 0.710 | 0.546 |

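MinorityPrompt lowers the likelihood metric relative to DDIM on all three models (1.037 → 0.897 on SDv1.5, 1.110 → 0.914 on SDv2.0, 0.608 → 0.546 on SDXL-LT) while keeping the text-alignment and quality metrics close to the DDIM baseline.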
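The size of the likelihood reduction is easier to compare across models as a relative change. The short script below computes it from the Likelihood column of the table; the dictionary layout and the helper name `relative_reduction` are illustrative choices, and only the numbers come from the table.

```python
# Likelihood values copied from the results table (lower is better).
likelihood = {
    "SDv1.5": {"DDIM": 1.037, "MinorityPrompt": 0.897},
    "SDv2.0": {"DDIM": 1.110, "MinorityPrompt": 0.914},
    "SDXL-LT": {"DDIM": 0.608, "MinorityPrompt": 0.546},
}


def relative_reduction(baseline: float, method: float) -> float:
    """Fraction by which `method` lowers the metric relative to `baseline`."""
    return (baseline - method) / baseline


reductions = {
    model: relative_reduction(row["DDIM"], row["MinorityPrompt"])
    for model, row in likelihood.items()
}

for model, value in reductions.items():
    print(f"{model}: {value:.1%}")
```

This gives roughly 13.5% for SDv1.5, 17.7% for SDv2.0, and 10.2% for SDXL-LT, so the smallest relative gain is on the 4-step distilled model.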

Ablation Study

| Configuration | CLIPScore↑ | Likelihood↓ | Description |
| --- | --- | --- | --- |
| Full embedding optimization | 30.8 | 0.91 | Severe semantic shift |
| Token optimization (Ours) | 31.5 | 0.90 | Better semantic preservation |
| No annealed timestep | 31.3 | 0.93 | Fixed s yields poor results |
| No stop-gradient trick | 31.4 | 0.92 | Unstable optimization |

Key Findings

  • Token Optimization vs. Full Embedding Optimization: Token optimization scores 0.7 higher on CLIPScore (31.5 vs. 30.8) at a comparable likelihood, indicating better semantic preservation.
  • Effective Likelihood Reduction: MinorityPrompt lowers likelihood on every model while keeping CLIPScore within about 0.2 of DDIM, and above it on SDv1.5 and SDv2.0. SGMS also reduces likelihood, but it scores lower than MinorityPrompt on CLIPScore, PickScore, and ImageReward wherever both are reported; on SDXL-LT the two methods tie on likelihood (0.546).
  • Controllable Semantic Enhancement: By choosing a meaningful initial token embedding (such as "old" or "Asian"), the direction of minority features can be guided, which is impossible with pure latent-space methods.
  • 84% of users in the user study preferred the minority samples generated by MinorityPrompt.

Highlights & Insights

  • The strategy of only optimizing the appended token instead of the entire embedding is highly elegant: it preserves the original semantics while introducing learnable degrees of freedom. This idea can be transferred to any generative task requiring inference-time condition fine-tuning (e.g., style guidance, enhancing specific attributes).
  • The theoretical derivation connecting reconstruction loss to the negative ELBO gives a principled justification for the optimization objective, supporting the seemingly heuristic idea of "maximizing reconstruction error."
  • The method's implicit bias mitigation is notable: generating minority samples counters stereotypical biases in T2I models (such as associating "man" with "young"), which gives the work potential social impact.
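The link between reconstruction error and the ELBO mentioned in the highlights can be illustrated with a short identity. This is a sketch under standard DDPM assumptions that the summary does not state explicitly: an \(\epsilon\)-prediction network \(\boldsymbol{\epsilon}_\theta\) and a cumulative noise schedule \(\bar{\alpha}_s\). Both symbols are introduced here for illustration, and \(\hat{\boldsymbol{z}}_0\) abbreviates \(\hat{\boldsymbol{z}}_0(\boldsymbol{z}_t, \mathcal{C}_{\boldsymbol{v}})\).

```latex
\[
\begin{aligned}
\boldsymbol{z}_{s|t,0} &= \sqrt{\bar{\alpha}_s}\,\hat{\boldsymbol{z}}_0
  + \sqrt{1-\bar{\alpha}_s}\,\boldsymbol{\epsilon}, \\
\hat{\boldsymbol{z}}_0(\boldsymbol{z}_{s|t,0}, \mathcal{C}) &=
  \frac{\boldsymbol{z}_{s|t,0}
  - \sqrt{1-\bar{\alpha}_s}\,\boldsymbol{\epsilon}_\theta(\boldsymbol{z}_{s|t,0}, \mathcal{C})}
  {\sqrt{\bar{\alpha}_s}}, \\
\big\|\hat{\boldsymbol{z}}_0 - \hat{\boldsymbol{z}}_0(\boldsymbol{z}_{s|t,0}, \mathcal{C})\big\|_2^2 &=
  \frac{1-\bar{\alpha}_s}{\bar{\alpha}_s}\,
  \big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\boldsymbol{z}_{s|t,0}, \mathcal{C})\big\|_2^2 .
\end{aligned}
\]
```

Under these assumptions the reconstruction error is a rescaled noise-prediction error, which is the per-timestep term of the usual variational bound on the negative log-likelihood. The paper's own derivation may differ in weighting and detail.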

Limitations & Future Work

  • Each optimization step requires extra forward and backward passes, making inference several times slower than standard DDIM sampling.
  • The definition of "minority" relies entirely on the distribution learned by the model. If the model has insufficient samples in certain areas, it may fail to reach those regions.
  • The effect on distilled models (SDXL-Lightning) is less significant than on full-step models, potentially due to the limited optimization space of 4-step sampling.
  • The generated minority samples are not evaluated for authenticity; low density does not necessarily translate to meaningful diversity.

Comparison with Related Work

  • vs. SGMS: SGMS is a prior SOTA minority sampling method, but it was designed for non-T2I settings (LSUN/ImageNet) and shows limited performance in T2I. MinorityPrompt adapts to T2I architectures via prompt optimization; in the reported results it matches or beats SGMS on likelihood and outperforms it on the image-quality metrics.
  • vs. CADS: CADS focuses on diversity enhancement rather than minority generation by adding noise to conditional embeddings. MinorityPrompt has a clear likelihood objective, making it more targeted.
  • vs. Textual Inversion: Both use learnable tokens, but Textual Inversion requires offline training to learn visual concepts, whereas MinorityPrompt is an online, target-driven optimization.

Rating

  • Novelty: ⭐⭐⭐⭐ Translating the minority sampling problem into online prompt optimization is a novel formulation
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple SD versions, multiple metrics, ablations, and a user study
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivation, consistent notation, and smooth writing
  • Value: ⭐⭐⭐⭐ Holds practical significance for T2I diversity and bias mitigation