CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration¶
Conference: CVPR 2026 arXiv: 2603.20741 Code: https://github.com/xiefan-guo/ctcal Area: Image Generation / Text-to-Image Diffusion Models Keywords: Text-to-Image Generation, Diffusion Models, Cross-Attention Alignment, Self-Calibration, Compositional Generation
TL;DR¶
This paper proposes CTCal (Cross-Timestep Self-Calibration), which leverages reliable text-image alignments (cross-attention maps) formed at small timesteps (low noise) to calibrate representation learning at large timesteps (high noise), providing explicit cross-timestep self-supervision for text-to-image generation. CTCal comprehensively outperforms existing methods on T2I-CompBench++ and GenEval.
Background & Motivation¶
Background: Diffusion models dominate text-to-image generation, yet precise text-image alignment—especially for compositional generation with complex prompts—remains an open challenge.
Limitations of Prior Work: (1) Conventional diffusion losses provide only implicit supervision, making it difficult to capture fine-grained text-image correspondences; (2) inference-time optimization methods (e.g., Attend-and-Excite) exhibit poor generalization and do not scale; (3) at large timesteps, severe noise degrades cross-attention map quality, preventing correct alignment—a critical bottleneck for generation quality.
Key Challenge: At small timesteps, text-image alignment is reliable but "too easy"; at large timesteps, alignment is poor but "critical" (determining generation quality in the early stages of inference).
Goal: How to provide explicit supervision for establishing correct text-image correspondences at large timesteps in diffusion models?
Key Insight: Key Observation—for the same image-text-noise triplet, cross-attention maps extracted at different timesteps during training differ dramatically in quality: maps at small timesteps closely match the true image structure and semantics, while those at large timesteps are entirely degraded.
Core Idea: Use reliable attention maps from small timesteps as a "teacher" to calibrate attention maps at large timesteps (the "student"), enabling the model to supervise itself.
Method¶
Overall Architecture¶
Given an image, text, and noise, two timesteps are sampled such that \(t_{\text{tea}} < t_{\text{stu}}\). Separate forward passes yield the cross-attention maps \(\mathbf{A}_{\text{tea}}\) and \(\mathbf{A}_{\text{stu}}\). The former serves as a fixed target for calibrating the latter, so gradients flow only through the branch at \(t_{\text{stu}}\); the teacher branch is not optimized.
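The two-pass structure can be sketched as follows. This is a minimal illustration, not the authors' implementation: `unet_with_attn` is a hypothetical wrapper that returns both the noise prediction and the cross-attention maps, and `scheduler.add_noise` stands in for whatever noising schedule the model uses.

```python
import torch

def ctcal_forward(unet_with_attn, x0, text_emb, noise, t_tea, t_stu, scheduler):
    # Teacher pass at a small timestep: attention maps are reliable here,
    # so they serve as fixed targets; gradients are blocked on this branch.
    x_tea = scheduler.add_noise(x0, noise, t_tea)
    with torch.no_grad():
        _, attn_tea = unet_with_attn(x_tea, t_tea, text_emb)

    # Student pass at a large timestep: these attention maps are degraded
    # and are the ones being calibrated; only this branch receives gradients.
    x_stu = scheduler.add_noise(x0, noise, t_stu)
    eps_pred, attn_stu = unet_with_attn(x_stu, t_stu, text_emb)
    return eps_pred, attn_stu, attn_tea
```

The `torch.no_grad()` context is what makes the teacher a target rather than a second student: the calibration loss pulls \(\mathbf{A}_{\text{stu}}\) toward \(\mathbf{A}_{\text{tea}}\), never the reverse.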
Key Designs¶
- Part-of-Speech-based Cross-Attention Map Selection:
- Function: Only attention maps of noun tokens are extracted for the CTCal loss; tokens without spatial semantics—such as articles ("the") and conjunctions ("and")—are ignored.
- Mechanism: POS tagging is performed using Stanza; \(\mathcal{Y}_{\text{noun}}\) denotes the set of nouns, and the loss is defined as \(\mathcal{L}_{\text{CTCal}} = \frac{1}{N_{\text{noun}}} \sum_{\mathbf{y}_i \in \mathcal{Y}_{\text{noun}}} \mathcal{D}(\mathbf{A}_{\text{stu},\mathbf{y}_i}, \mathbf{A}_{\text{tea},\mathbf{y}_i})\), where \(\mathcal{D}\) is a distance between attention maps.
- Design Motivation: Ablation studies show that applying constraints to all tokens actually degrades performance, since attention maps of non-noun tokens carry no meaningful spatial semantic information and introduce noise.
- Pixel-Semantic Space Joint Optimization:
- Function: Attention maps are aligned simultaneously in pixel space and semantic space. A lightweight autoencoder \((f_{\text{attn}}^{\text{enc}}, f_{\text{attn}}^{\text{dec}})\) is introduced, with a reconstruction auxiliary task to prevent mode collapse.
- Core Formula: \(\mathcal{L}_{\text{CTCal}} = \lambda_1 \underbrace{\mathcal{D}(\mathbf{A}_{\text{stu}}, \mathbf{A}_{\text{tea}})}_{\text{Pixel}} + \lambda_2 \underbrace{\mathcal{D}(f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{stu}}), f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{tea}}))}_{\text{Semantic}} + \lambda_3 \underbrace{\mathcal{D}(f_{\text{attn}}^{\text{dec}}(f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{tea}})), \mathbf{A}_{\text{tea}})}_{\text{Reconstruction}}\)
- Design Motivation: Pixel-level alignment captures spatial location information; semantic-level alignment captures high-level semantic consistency; the reconstruction task prevents encoder degeneration.
- Subject Response Alignment Regularization:
- Function: Aligns the attention responses of all subjects (nouns) to that of the highest-responding subject: \(\mathcal{R}_{\text{subject}} = \frac{1}{N_{\text{noun}}} \sum_{\mathbf{y}_i \in \mathcal{Y}_{\text{noun}}} \operatorname{ReLU}(\mathcal{S}_{\text{attn}} - \max(\mathbf{A}_{\text{stu},\mathbf{y}_i}) - \tau)\), where \(\mathcal{S}_{\text{attn}}\) is the highest peak response among all subject tokens and \(\tau\) is a tolerance margin.
- Design Motivation: Prevents high-response subjects from suppressing low-response ones, which would cause the latter to fail to render correctly in the generated image (e.g., generating only a cat for the prompt "cat and dog").
- Timestep-aware Adaptive Weighting:
- Function: \(\lambda_t = t_{\text{stu}} / T_{\text{train}}\); the CTCal weight is larger at large timesteps, while the diffusion loss dominates at small timesteps.
- Design Motivation: At small timesteps, the diffusion loss alone is sufficient to establish alignment; explicit calibration from CTCal is only needed at large timesteps.
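The joint loss and the subject regularizer above can be sketched as follows. This is a toy sketch under stated assumptions: the note does not specify the autoencoder architecture or the distance \(\mathcal{D}\), so a one-hidden-layer MLP and MSE are used here as stand-ins, and attention maps are assumed flattened to shape `(N_noun, n_pix)`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnAutoencoder(nn.Module):
    """Lightweight autoencoder over flattened attention maps (assumed design)."""
    def __init__(self, n_pix, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_pix, hidden), nn.ReLU())
        self.dec = nn.Linear(hidden, n_pix)

def ctcal_loss(attn_stu, attn_tea, ae, lam=(1.0, 1.0, 1.0)):
    # attn_*: (N_noun, n_pix) maps for the selected noun tokens, MSE as D.
    pixel = F.mse_loss(attn_stu, attn_tea)           # spatial-location alignment
    z_stu, z_tea = ae.enc(attn_stu), ae.enc(attn_tea)
    semantic = F.mse_loss(z_stu, z_tea)              # high-level latent alignment
    recon = F.mse_loss(ae.dec(z_tea), attn_tea)      # keeps the encoder informative
    return lam[0] * pixel + lam[1] * semantic + lam[2] * recon

def subject_regularizer(attn_stu, tau=0.1):
    # Pull every noun's peak response toward the strongest subject's peak,
    # up to a margin tau, so weak subjects are not suppressed entirely.
    peaks = attn_stu.amax(dim=1)                     # (N_noun,) peak per subject
    s_attn = peaks.max().detach()                    # highest peak = S_attn
    return F.relu(s_attn - peaks - tau).mean()
```

Detaching `s_attn` matters: the regularizer should raise the weak subjects' responses, not lower the strong one's.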
Loss & Training¶
- \(\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda_t \mathcal{L}_{\text{CTCal}}\)
- LoRA is used to fine-tune the self-attention layers of the text encoder and the attention layers of the denoising network.
- For SD 2.1, \(t_{\text{tea}}=0\) is set; for SD 3, \(t_{\text{tea}}\) is selected according to the logit-normal sampling distribution.
- Dataset: High-quality text-image pairs are selected from generated data using a reward-driven approach.
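The pieces above combine into a simple total objective. A minimal sketch, assuming precomputed Universal POS tags (standing in for Stanza output) and illustrative function names:

```python
def noun_indices(tokens, pos_tags):
    # Stand-in for Stanza POS tagging: pos_tags is assumed precomputed,
    # using Universal POS labels ("NOUN", "DET", "CCONJ", ...).
    return [i for i, tag in enumerate(pos_tags) if tag == "NOUN"]

def total_loss(l_diffusion, l_ctcal, t_stu, T_train=1000):
    # lambda_t = t_stu / T_train: CTCal dominates at large (noisy) timesteps,
    # while the plain diffusion loss dominates at small ones.
    lam_t = t_stu / T_train
    return l_diffusion + lam_t * l_ctcal
```

For the prompt "a cat and a dog", only the token positions of "cat" and "dog" would contribute to \(\mathcal{L}_{\text{CTCal}}\).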
Key Experimental Results¶
Main Results — T2I-CompBench++¶
| Method | Color↑ | Shape↑ | 2D-Spatial↑ | Numeracy↑ | Complex↑ |
|---|---|---|---|---|---|
| SD 2.1 | 0.507 | 0.422 | 0.134 | 0.458 | 0.339 |
| SD 2.1 + AE | 0.640 | 0.452 | 0.146 | 0.477 | 0.340 |
| SD 2.1 + GORS | 0.643 | 0.486 | 0.178 | 0.486 | 0.337 |
| SD 2.1 + CTCal | 0.723 | 0.515 | 0.214 | 0.508 | 0.340 |
| SD 3 (2B) | 0.813 | 0.589 | 0.320 | 0.617 | 0.377 |
| SD 3 + CTCal | 0.844 | 0.597 | 0.348 | 0.629 | 0.381 |
GenEval¶
| Method | Overall↑ | Two Object↑ | Counting↑ | Colors↑ | Position↑ |
|---|---|---|---|---|---|
| SD 3 (2B) | 0.62 | 0.74 | 0.63 | 0.67 | 0.34 |
| SD 3 + CTCal | 0.69 | 0.85 | 0.70 | 0.79 | 0.38 |
Ablation Study¶
| Configuration | Color↑ | 2D-Spatial↑ | Note |
|---|---|---|---|
| SD 2.1 + GORS baseline | 0.643 | 0.178 | — |
| + naive all-token constraint (a) | 0.629 (−2.2%) | 0.169 (−4.6%) | Performance drops! |
| + noun selection (b) | Significant gain | Significant gain | Noun selection is critical |
| + pixel + semantic (c) | Further gain | Further gain | Joint optimization is effective |
| + response alignment (d) | Further gain | Further gain | Subject balancing helps |
| + adaptive weighting (e) | Best | Best | Full CTCal |
Key Findings¶
- Noun selection is the most critical design choice—indiscriminately aligning all tokens actually harms performance.
- CTCal is effective for both the cross-attention-based SD 2.1 and the MM-DiT-based SD 3, demonstrating generality.
- In user studies, CTCal is preferred over the baselines by a clear majority (SD 2.1: 76.67%, SD 3: 54.17%).
Highlights & Insights¶
- Training-stage perspective: Unlike inference-time optimization methods, CTCal addresses text-image alignment at training time, yielding lasting improvements with no inference overhead.
- Self-supervised paradigm: The model uses its own small-timestep outputs to supervise large-timestep learning, requiring no additional labels or external teacher models.
- Architecture-agnostic: Applicable to both U-Net (SD 2.1) and Transformer (SD 3) architectures.
Limitations & Future Work¶
- Training data construction relies on reward-driven sampling, incurring non-trivial data preparation costs.
- Only LoRA fine-tuning is explored; the effect of full fine-tuning remains uninvestigated.
- Noun selection depends on POS tagging tools; applicability to non-English prompts is unknown.
Related Work & Insights¶
- Inference-time methods such as Attend-and-Excite motivated the focus on cross-attention, but CTCal addresses the problem more elegantly at the training stage.
- The reward-driven data selection strategy from GORS serves as the foundation for CTCal, which adds explicit alignment supervision on top of it.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The cross-timestep self-calibration idea is novel and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two benchmarks, SD 2.1 and SD 3, user studies, and complete ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from observation to design to validation is clear.
- Value: ⭐⭐⭐⭐⭐ Delivers substantial improvements in text-image alignment for text-to-image generation.