ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis¶

Conference: ACL2026
arXiv: 2404.10141
Code: Unconfirmed in summary cache
Area: Image Generation / Text-to-Image
Keywords: Text-to-Image, Subject Conditioning, News Image-Text, Abstractive Captions, SAFE

TL;DR¶

This paper proposes the ANCHOR dataset, which contains 70K+ abstractive captions from 5 news outlets and exposes T2I model failures in multi-subject understanding, contextual reasoning, and fine-grained grounding. It also introduces SAFE, which uses LLMs to extract key subjects and reinforces their representations at the embedding level to improve image-text consistency.

Background & Motivation¶

Background: Current text-to-image models are capable of generating high-quality images, and common evaluations measure image-text alignment, visual quality, and human preference under simple prompts. Many benchmark prompts consist of short, self-contained sentences with simple subject relationships.

Limitations of Prior Work: Real captions, especially in news, often contain multiple interacting subjects, contextual references, abstract expressions, and event backgrounds. The abstract explicitly points out that image-text encoders like CLIP consistently fail in multi-subject understanding, contextual reasoning, and nuanced grounding under these conditions. In other words, a model performing well on simple prompts does not necessarily understand subject relationships in real-world text.

Key Challenge: T2I models need to clearly encode "who the key subjects are, how they interact, and which semantics must be grounded in the image"; however, conventional text encoding tends to compress complex captions into a holistic embedding, where key subjects are easily diluted or confused.

Goal: The goal is twofold. First, construct ANCHOR, an evaluation/training resource that better reflects the complexity of real captions. Second, propose Subject-Aware Fine-tuning (SAFE), which uses LLMs to extract subjects and strengthens their representations, thereby improving the consistency between generated images and captions.

Key Insight: Rather than modifying the diffusion model architecture, the authors focus on how subjects are represented in the text condition. The reasoning is that LLMs are already proficient at identifying entities, roles, and relationships in complex sentences, so linking this semantic parsing capability to T2I conditional encoding may be more effective than simply increasing the volume of caption data.

Core Idea: Use LLMs to extract key subjects from complex captions and then reinforce these subjects at the embedding level, so that the T2I model focuses on the objects and relationships that most need to be visualized in real captions.

Method¶

Cache Status Note: The local cache is the arXiv abstract page and does not include the full PDF body, method details, experimental tables, or author limitations. Therefore, only method-level notes supported by "abstract evidence" are provided below; specific training hyperparameters, loss forms, and experimental values are not fabricated.

Overall Architecture¶

The overall workflow identifiable from the abstract has three steps. First, construct the ANCHOR dataset by collecting 70K+ abstractive captions from 5 major news organizations. Second, use this data to analyze the deficiencies of existing T2I models and image-text encoders under complex captions. Third, propose SAFE (Subject-Aware Fine-tuning), which extracts key subjects from captions via LLMs and enhances their representations at the embedding layer to improve image-caption consistency and human preference alignment.
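
The LLM-based subject extraction step in this workflow could be wired up roughly as follows. This is a minimal sketch, not the authors' implementation: the prompt wording, the JSON reply format, the per-subject `weight` field, and the function names `build_subject_prompt` and `parse_subject_response` are all assumptions made for illustration, and the LLM call itself is replaced by a hand-written reply.

```python
import json

# Hypothetical prompt; the abstract does not disclose the actual wording.
PROMPT_TEMPLATE = (
    "List the key subjects that must be visible in an image for this news "
    "caption. Reply with a JSON list of objects that have 'subject' and "
    "'weight' (0 to 1) fields.\nCaption: {caption}"
)


def build_subject_prompt(caption: str) -> str:
    """Fill the (assumed) extraction prompt with one caption."""
    return PROMPT_TEMPLATE.format(caption=caption.strip())


def parse_subject_response(response_text: str) -> list[tuple[str, float]]:
    """Parse the assumed JSON reply, skipping malformed entries.

    Weights are clamped to [0, 1]; a missing or invalid weight defaults to 1.0.
    """
    try:
        items = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    subjects = []
    for item in items:
        if not isinstance(item, dict):
            continue
        name = str(item.get("subject", "")).strip()
        if not name:
            continue
        try:
            weight = float(item.get("weight", 1.0))
        except (TypeError, ValueError):
            weight = 1.0
        subjects.append((name, min(max(weight, 0.0), 1.0)))
    return subjects


caption = "Firefighters rescue residents after floods hit the city centre"
prompt = build_subject_prompt(caption)

# Hand-written stand-in for an LLM reply; the empty subject is dropped.
fake_reply = (
    '[{"subject": "firefighters", "weight": 0.9},'
    ' {"subject": "residents", "weight": 0.7},'
    ' {"subject": "", "weight": 0.5}]'
)
subjects = parse_subject_response(fake_reply)
print(subjects)  # [('firefighters', 0.9), ('residents', 0.7)]
```

Parsing defensively matters here because any downstream reinforcement step would inherit whatever the extractor returns, including malformed or empty entries.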

Key Designs¶

1. ANCHOR Complex Caption Dataset: Forcing models to reveal weaknesses hidden by simple prompts

Most common T2I benchmark prompts are short, self-contained, and have simple subject relationships, so a model that appears "good" on them does not necessarily understand complex text. ANCHOR collects 70K+ abstractive captions from 5 major news organizations. These captions naturally involve multi-subject interactions, contextual references, and abstract phrasing, because news text often packs events, human relationships, and context dependencies into a single sentence. Used as an evaluation/training corpus, ANCHOR acts as a stress test and brings evaluation closer to real-world descriptions.

2. Defect Analysis of Complex Captions: Localizing "T2I failures" to specific failure modes

Without targeted diagnosis, it is impossible to distinguish whether a failure stems from the generator, the text encoder, or the complex structure of the prompt itself. The authors systematically examine existing image-text encoders and T2I models using ANCHOR, categorizing weaknesses into three types: multi-subject understanding, contextual reasoning, and nuanced grounding. In practice, a model might render certain keywords from a caption yet fail to handle the relationships between multiple subjects and fine-grained semantics. Encoders like CLIP, which provide a global similarity score, are particularly prone to averaging out subject relationships.
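
The "averaging out" intuition can be illustrated with a toy calculation. The sketch below is not how CLIP pools text (it assumes plain mean pooling over orthogonal one-hot token vectors, which is a deliberate simplification); it only shows that, under that assumption, the share of a single subject token in a pooled caption vector shrinks as the caption grows.

```python
import math


def one_hot(index, dim):
    """Toy token embedding: a unit vector along one axis."""
    return [1.0 if i == index else 0.0 for i in range(dim)]


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def subject_share(num_tokens):
    """Cosine between the pooled caption vector and the subject token (index 0)."""
    tokens = [one_hot(i, num_tokens) for i in range(num_tokens)]
    return cosine(mean_pool(tokens), tokens[0])


# With n orthogonal tokens the similarity is 1 / sqrt(n).
for n in (4, 16, 64):
    print(n, round(subject_share(n), 3))
# 4 0.5
# 16 0.25
# 64 0.125
```

Real encoders are far from this idealized case, so the numbers carry no empirical weight; the point is only the direction of the effect that motivates reinforcing subjects explicitly.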

3. SAFE (Subject-Aware Fine-tuning): Identifying "what must be in the image" via LLMs and amplifying it in condition space

The real problem with complex captions is not a lack of words, but that key subjects and relationships are drowned out by the holistic embedding of the entire sentence. SAFE (Subject-Aware Fine-tuning) first uses LLMs to extract key subjects from the caption and then reinforces the representations of these subjects at the embedding level, directing the generation process to focus on objects that must be visualized. This interface is practical: it does not require the T2I model to learn all language parsing itself, but instead transforms the structural capabilities (entities/roles/relationships) that LLMs already excel at into conditional control signals for the generator.

⚠️ The abstract does not provide specific formulas; it is impossible to determine whether embedding-level enhancement involves extra tokens, embedding reweighting, adapters, or other implementations. This description follows identifiable information from the abstract.
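
Of the candidate implementations listed above, embedding reweighting is the simplest to illustrate. The sketch below is purely hypothetical and is not confirmed by the abstract: the function name `reinforce_subjects`, the `strength` parameter, the scaling rule `1 + strength * weight`, and the exact-match lookup of tokens against subject names are all assumptions chosen to make the idea concrete.

```python
def reinforce_subjects(token_embeddings, tokens, subject_weights, strength=0.5):
    """Scale each subject token's embedding by (1 + strength * weight).

    Tokens that are not in `subject_weights` keep their original embedding.
    This is one hypothetical form of embedding-level reinforcement.
    """
    reinforced = []
    for token, vector in zip(tokens, token_embeddings):
        weight = subject_weights.get(token.lower(), 0.0)
        scale = 1.0 + strength * weight
        reinforced.append([scale * x for x in vector])
    return reinforced


# Toy 2-dimensional embeddings; real text encoders use far larger vectors.
tokens = ["firefighters", "rescue", "residents"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = {"firefighters": 1.0, "residents": 0.5}

out = reinforce_subjects(embeddings, tokens, weights, strength=0.5)
print(out)  # [[1.5, 0.0], [0.0, 1.0], [1.25, 1.25]]
```

A real pipeline would also need to map multi-word subjects onto subword tokens and decide whether the reweighted embeddings are used only at inference or also during fine-tuning; the abstract leaves both open.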

Loss & Training¶

The abstract only specifies that SAFE is a fine-tuning method using LLMs to extract key subjects and enhance embedding-level subject representations; it does not disclose specific loss functions, training data partitions, model backbones, learning rates, or evaluation protocols. Therefore, no unverified loss formulas or hyperparameters are included.

Key Experimental Results¶

Main Results¶

The primary experimental facts confirmed by the abstract cache are as follows:

Item	Confirmed Information in Cache	Remarks
Dataset Size	70K+ abstractive captions	From 5 major news organizations
Task Target	Text-to-Image synthesis	Focuses on complex real captions rather than simple prompts
Primary Failure Modes	Multi-subject understanding, contextual reasoning, nuanced grounding	Explicitly listed in the abstract
Method	SAFE: Subject-Aware Fine-tuning	Uses LLMs to extract key subjects and reinforces embedding-level representations
Experimental Conclusion	Significantly improves image-caption consistency and human preference alignment	Specific numerical values not provided in the abstract

Ablation Study¶

The cache does not include the main text, so there are no verifiable ablation tables. To avoid fabricating numbers, only "missing items" and confirmed statuses are recorded here.

Configuration / Information Item	Cache Status	Writeable Conclusion
SAFE vs. Original T2I backbone	Specific values not disclosed	Abstract claims significant improvement in consistency and human preference
Ablation of LLM Subject Extraction	Not disclosed	Cannot determine the magnitude of contribution from the extraction module
Ablation of Embedding-level Enhancement	Not disclosed	Cannot determine the impact of enhancement position and intensity
Grouping by Models/Data Sources	Not disclosed	Cannot compare which caption types are the most difficult

Key Findings¶

Real-world news captions serve as a stress test for T2I models because they simultaneously contain subjects, relationships, context, and abstract semantics.
The paper localizes the problem to "subject conditioning," which is more specific than broadly improving text encoding capabilities and is easier to migrate to existing T2I pipelines.
Because the cache contains only the abstract, it is impossible to determine the magnitude of SAFE's improvement, its statistical significance, the adequacy of ablations, or the types of failure cases.

Highlights & Insights¶

Using news captions as a data source is compelling because news text is naturally closer to real user descriptions than artificial prompts: it features multiple subjects, strong context, and abstract expressions.
Utilizing LLMs for subject extraction is a practical interface: it avoids requiring the T2I model to learn complex language parsing itself, instead leveraging the LLM's semantic structural capabilities as conditional control signals.
This work's insights are applicable beyond image generation; they can be transferred to video generation, image-text editing, and cross-modal retrieval. In complex textual conditions, explicitly identifying key subjects and relationships before feeding them into a generation/retrieval model is often more stable than directly encoding the entire sentence.

Limitations & Future Work¶

The local cache contains only the abstract page and lacks method details and experimental tables; thus, the real magnitude of SAFE's improvement cannot be evaluated, and it remains unconfirmed whether code and data are fully open.
Based on the abstract, the method relies on LLM subject extraction; if the LLM misses key subjects, misunderstands references, or treats background entities as protagonists, the conditional enhancement could amplify errors.
The data primarily originates from news media, which is suitable for complex real captions but may be biased toward news events, people, and institutional scenes; generalization to artistic prompts, product images, scientific imagery, or long-tail user prompts requires confirmation from full-text experiments.
Follow-up work should include quality evaluations of subject extraction, results binned by caption complexity, and compatibility analysis of SAFE with different T2I backbones.
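
Two of the follow-up analyses suggested above, subject extraction quality and results binned by caption complexity, could be set up along these lines. This is a sketch under stated assumptions: gold subject annotations are assumed to exist, subject count is used as a stand-in for caption complexity, the bin boundaries are arbitrary, and every number in the example is made up for illustration rather than taken from the paper.

```python
from collections import defaultdict


def extraction_prf(predicted, gold):
    """Precision, recall, and F1 of predicted subjects against gold subjects.

    Matching is case-insensitive exact match, which is a simplifying assumption.
    """
    pred = {s.lower() for s in predicted}
    ref = {s.lower() for s in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_score_by_subject_count(records):
    """Average a per-caption score within bins of gold subject count.

    `records` is an iterable of (num_gold_subjects, score) pairs.
    """
    buckets = defaultdict(list)
    for num_subjects, score in records:
        if num_subjects <= 1:
            key = "1"
        elif num_subjects <= 3:
            key = "2-3"
        else:
            key = "4+"
        buckets[key].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Illustrative inputs only; these are not results from the paper.
p, r, f1 = extraction_prf(["Firefighters", "city"], ["firefighters", "residents"])
print(p, r, f1)  # 0.5 0.5 0.5

binned = mean_score_by_subject_count([(1, 0.8), (2, 0.6), (3, 0.4), (5, 0.2)])
print(binned)
```

Reporting extraction quality separately from generation quality would help attribute failures to the LLM stage or to the reinforcement stage.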

vs. Simple Prompt T2I Benchmarks: Traditional prompts are more self-contained and easier for automated evaluation but fail to expose context and multi-subject issues; ANCHOR's advantage is its proximity to real captions.
vs. CLIP-based Alignment: CLIP-style encoders provide global image-text similarity, but subject relationships may be averaged out under complex captions; SAFE attempts to explicitly reinforce subject representations.
vs. Prompt Engineering: Prompt engineering relies on users rewriting text, whereas SAFE integrates subject parsing and enhancement into the model side, making it more suitable for large-scale processing of real captions.
Insight: When performing text-to-image/video generation, one should ask not only whether the prompt is detailed but also whether the key subjects are represented with high fidelity in the condition space.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Combining abstractive news captions with LLM subject extraction for T2I conditional control targets a clearly defined problem.
Experimental Thoroughness: ⭐⭐⭐ Local cache only contains the abstract, so experimental details and ablations cannot be verified. Rating is conservative.
Writing Quality: ⭐⭐⭐⭐ The abstract is clearly expressed, but the lack of full text prevents evaluating the complete narrative.
Value: ⭐⭐⭐⭐⭐ If the full-text experiments are rigorous, this will be a useful resource and method for complex caption-driven T2I alignment.