
CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration

Conference: CVPR 2026
arXiv: 2603.20741
Code: https://github.com/xiefan-guo/ctcal
Area: Image Generation / Text-to-Image Diffusion Models
Keywords: Text-to-Image Generation, Diffusion Models, Cross-Attention Alignment, Self-Calibration, Compositional Generation

TL;DR

This paper proposes CTCal (Cross-Timestep Self-Calibration), which leverages reliable text-image alignments (cross-attention maps) formed at small timesteps (low noise) to calibrate representation learning at large timesteps (high noise), providing explicit cross-timestep self-supervision for text-to-image generation. CTCal comprehensively outperforms existing methods on T2I-CompBench++ and GenEval.

Background & Motivation

Background: Diffusion models dominate text-to-image generation, yet precise text-image alignment—especially for compositional generation with complex prompts—remains an open challenge.

Limitations of Prior Work: (1) Conventional diffusion losses provide only implicit supervision, making it difficult to capture fine-grained text-image correspondences; (2) inference-time optimization methods (e.g., Attend-and-Excite) exhibit poor generalization and do not scale; (3) at large timesteps, severe noise degrades cross-attention map quality, preventing correct alignment—a critical bottleneck for generation quality.

Key Challenge: At small timesteps, text-image alignment is reliable but "too easy" to learn; at large timesteps, alignment is poor yet "critical", since large timesteps correspond to the early steps of inference, which largely determine generation quality.

Goal: How to provide explicit supervision for establishing correct text-image correspondences at large timesteps in diffusion models?

Key Insight: For the same image-text-noise triplet, cross-attention maps extracted at different timesteps during training differ dramatically in quality: maps at small timesteps closely match the true image structure and semantics, while those at large timesteps are heavily degraded.

Core Idea: Use reliable attention maps from small timesteps as a "teacher" to calibrate attention maps at large timesteps (the "student"), enabling the model to supervise itself.

Method

Overall Architecture

Given an image, text, and noise, two timesteps are sampled such that \(t_{\text{tea}} < t_{\text{stu}}\). Two forward passes are performed to obtain the cross-attention maps \(\mathbf{A}_{\text{tea}}\) and \(\mathbf{A}_{\text{stu}}\). The student map \(\mathbf{A}_{\text{stu}}\) is then calibrated toward the teacher map \(\mathbf{A}_{\text{tea}}\), which is treated as a fixed target; gradients are propagated only through the student pass at \(t_{\text{stu}}\).
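To make the two-pass procedure concrete, here is a minimal PyTorch-style sketch of one calibration step. It assumes a hypothetical `unet(x_t, t, text_emb)` interface that returns the noise prediction together with its cross-attention maps (in practice the maps are usually collected via forward hooks on the attention layers) and a `scheduler` exposing an `add_noise` method as in diffusers; these names are illustrative and do not come from the paper.

```python
import torch

def calibration_step(unet, scheduler, x0, text_emb, T_train=1000):
    """One cross-timestep self-calibration step (illustrative sketch)."""
    # Sample a teacher/student timestep pair with t_tea < t_stu
    # (the paper fixes t_tea = 0 for SD 2.1; see Loss & Training below).
    t_stu = torch.randint(low=1, high=T_train, size=(1,))
    t_tea = torch.randint(low=0, high=int(t_stu), size=(1,))

    # Same image, text, and noise for both passes.
    noise = torch.randn_like(x0)
    x_tea = scheduler.add_noise(x0, noise, t_tea)  # lightly noised input
    x_stu = scheduler.add_noise(x0, noise, t_stu)  # heavily noised input

    # Teacher pass (small timestep): reliable attention, used as a fixed target.
    with torch.no_grad():
        _, attn_tea = unet(x_tea, t_tea, text_emb)

    # Student pass (large timestep): the only branch that receives gradients.
    noise_pred, attn_stu = unet(x_stu, t_stu, text_emb)

    return noise_pred, noise, attn_stu, attn_tea.detach(), t_stu
```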

Key Designs

  1. Part-of-Speech-based Cross-Attention Map Selection:

    • Function: Only attention maps of noun tokens are extracted for the CTCal loss; tokens without spatial semantics—such as articles ("the") and conjunctions ("and")—are ignored.
    • Mechanism: POS tagging is performed using Stanza; \(\mathcal{Y}_{\text{noun}}\) denotes the set of nouns, and the loss is defined as \(\mathcal{L}_{\text{CTCal}} = \frac{1}{N_{\text{noun}}} \sum_{\mathbf{y}_i \in \mathcal{Y}_{\text{noun}}} \mathcal{D}(\mathbf{A}_{\text{stu},\mathbf{y}_i}, \mathbf{A}_{\text{tea},\mathbf{y}_i})\)
    • Design Motivation: Ablation studies show that applying constraints to all tokens actually degrades performance, since attention maps of non-noun tokens carry no meaningful spatial semantic information and introduce noise.
  2. Pixel-Semantic Space Joint Optimization:

    • Function: Attention maps are aligned simultaneously in pixel space and semantic space. A lightweight autoencoder \((f_{\text{attn}}^{\text{enc}}, f_{\text{attn}}^{\text{dec}})\) is introduced, with a reconstruction auxiliary task to prevent mode collapse.
    • Core Formula: \(\mathcal{L}_{\text{CTCal}} = \lambda_1 \underbrace{\mathcal{D}(\mathbf{A}_{\text{stu}}, \mathbf{A}_{\text{tea}})}_{\text{Pixel}} + \lambda_2 \underbrace{\mathcal{D}(f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{stu}}), f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{tea}}))}_{\text{Semantic}} + \lambda_3 \underbrace{\mathcal{D}(f_{\text{attn}}^{\text{dec}}(f_{\text{attn}}^{\text{enc}}(\mathbf{A}_{\text{tea}})), \mathbf{A}_{\text{tea}})}_{\text{Reconstruction}}\), with each term averaged over the selected noun tokens as in the formula above.
    • Design Motivation: Pixel-level alignment captures spatial location information; semantic-level alignment captures high-level semantic consistency; the reconstruction task prevents encoder degeneration.
  3. Subject Response Alignment Regularization:

    • Function: Aligns the attention responses of all subjects (nouns) to that of the highest-responding subject: \(\mathcal{R}_{\text{subject}} = \frac{1}{N_{\text{noun}}} \sum_{\mathbf{y}_i \in \mathcal{Y}_{\text{noun}}} \text{ReLU}(\mathcal{S}_{\text{attn}} - \max(\mathbf{A}_{\text{stu},\mathbf{y}_i}) - \tau)\), where \(\mathcal{S}_{\text{attn}} = \max_{\mathbf{y}_j \in \mathcal{Y}_{\text{noun}}} \max(\mathbf{A}_{\text{stu},\mathbf{y}_j})\) is the peak response of the highest-responding subject and \(\tau\) is a tolerance margin.
    • Design Motivation: Prevents high-response subjects from suppressing low-response ones, which would cause the latter to fail to render correctly in the generated image (e.g., generating only a cat for the prompt "cat and dog").
  4. Timestep-aware Adaptive Weighting:

    • Function: \(\lambda_t = t_{\text{stu}} / T_{\text{train}}\); the CTCal weight is larger at large timesteps, while the diffusion loss dominates at small timesteps.
    • Design Motivation: At small timesteps, the diffusion loss alone is sufficient to establish alignment; explicit calibration from CTCal is only needed at large timesteps (a sketch of designs 1–3 follows this list).
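Below is a minimal sketch of how designs 1–3 could be assembled into the CTCal loss, again in PyTorch-style pseudocode. The helper names (`noun_words`, `ctcal_loss`, `attn_enc`, `attn_dec`), the instantiation of \(\mathcal{D}\) as MSE, and the default values of \(\lambda_1, \lambda_2, \lambda_3, \tau\) are illustrative assumptions rather than the paper's exact implementation; the timestep-aware weight of design 4 is applied later, when this loss is combined with the diffusion objective (see Loss & Training).

```python
import torch
import torch.nn.functional as F
import stanza

# Design 1: POS-based selection of noun tokens via Stanza.
_pos_tagger = stanza.Pipeline("en", processors="tokenize,pos", verbose=False)

def noun_words(prompt):
    """Words tagged as (proper) nouns; mapping them to text-encoder token
    indices (`noun_idx` below) depends on the tokenizer and is omitted here."""
    doc = _pos_tagger(prompt)
    return [w.text for s in doc.sentences for w in s.words
            if w.upos in ("NOUN", "PROPN")]

def ctcal_loss(attn_stu, attn_tea, noun_idx, attn_enc, attn_dec,
               lambdas=(1.0, 1.0, 1.0), tau=0.1):
    """attn_stu / attn_tea: [num_tokens, H, W] cross-attention maps from the
    student (large-t) and teacher (small-t) passes; attn_enc / attn_dec form
    the lightweight attention autoencoder."""
    A_s = attn_stu[noun_idx]             # noun tokens only (design 1)
    A_t = attn_tea[noun_idx].detach()    # teacher maps are fixed targets

    # Design 2: pixel-space + semantic-space alignment, plus a reconstruction
    # term that keeps the attention autoencoder from degenerating.
    pixel = F.mse_loss(A_s, A_t)
    semantic = F.mse_loss(attn_enc(A_s), attn_enc(A_t))
    recon = F.mse_loss(attn_dec(attn_enc(A_t)), A_t)
    l1, l2, l3 = lambdas
    loss = l1 * pixel + l2 * semantic + l3 * recon

    # Design 3: subject response alignment -- penalize nouns whose peak
    # response falls more than tau below the highest-responding noun.
    peaks = A_s.flatten(1).amax(dim=1)   # per-noun peak response
    loss = loss + F.relu(peaks.max().detach() - peaks - tau).mean()
    return loss
```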

Loss & Training

  • \(\mathcal{L} = \mathcal{L}_{\text{diffusion}} + \lambda_t \mathcal{L}_{\text{CTCal}}\) (see the training-step sketch after this list)
  • LoRA is used to fine-tune the self-attention layers of the text encoder and the attention layers of the denoising network.
  • For SD 2.1, \(t_{\text{tea}}\) is fixed to 0; for SD 3, \(t_{\text{tea}}\) is sampled according to the logit-normal timestep distribution.
  • Dataset: High-quality text-image pairs are selected from generated data using a reward-driven approach.
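As a usage note, one training step might combine the pieces roughly as follows, reusing the hypothetical `calibration_step` and `ctcal_loss` helpers sketched above; the `optimizer` is assumed to be built over the LoRA parameters only, and the timestep-aware weight \(\lambda_t\) (design 4) is applied here.

```python
import torch.nn.functional as F

def training_step(unet, scheduler, optimizer, x0, text_emb, noun_idx,
                  attn_enc, attn_dec, T_train=1000):
    noise_pred, noise, attn_stu, attn_tea, t_stu = calibration_step(
        unet, scheduler, x0, text_emb, T_train)

    diffusion = F.mse_loss(noise_pred, noise)   # standard diffusion loss
    lam_t = float(t_stu) / T_train              # timestep-aware weight (design 4)
    loss = diffusion + lam_t * ctcal_loss(
        attn_stu, attn_tea, noun_idx, attn_enc, attn_dec)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # only the LoRA parameters are updated
    return float(loss)
```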

Key Experimental Results

Main Results — T2I-CompBench++

| Method | Color↑ | Shape↑ | 2D-Spatial↑ | Numeracy↑ | Complex↑ |
|---|---|---|---|---|---|
| SD 2.1 | 0.507 | 0.422 | 0.134 | 0.458 | 0.339 |
| SD 2.1 + AE | 0.640 | 0.452 | 0.146 | 0.477 | 0.340 |
| SD 2.1 + GORS | 0.643 | 0.486 | 0.178 | 0.486 | 0.337 |
| SD 2.1 + CTCal | 0.723 | 0.515 | 0.214 | 0.508 | 0.340 |
| SD 3 (2B) | 0.813 | 0.589 | 0.320 | 0.617 | 0.377 |
| SD 3 + CTCal | 0.844 | 0.597 | 0.348 | 0.629 | 0.381 |

GenEval

| Method | Overall↑ | Two Object↑ | Counting↑ | Colors↑ | Position↑ |
|---|---|---|---|---|---|
| SD 3 (2B) | 0.62 | 0.74 | 0.63 | 0.67 | 0.34 |
| SD 3 + CTCal | 0.69 | 0.85 | 0.70 | 0.79 | 0.38 |

Ablation Study

| Configuration | Color↑ | 2D-Spatial↑ | Note |
|---|---|---|---|
| SD 2.1 + GORS baseline | 0.643 | 0.178 | |
| + naive all-token constraint (a) | 0.629 (−2.2%) | 0.169 (−4.6%) | Performance drops! |
| + noun selection (b) | Significant gain | Significant gain | Noun selection is critical |
| + pixel + semantic (c) | Further gain | Further gain | Joint optimization is effective |
| + response alignment (d) | Further gain | Further gain | Subject balancing helps |
| + adaptive weighting (e) | Best | Best | Full CTCal |

Key Findings

  • Noun selection is the most critical design choice—indiscriminately aligning all tokens actually harms performance.
  • CTCal is effective for both the cross-attention-based SD 2.1 and the MM-DiT-based SD 3, demonstrating generality.
  • In user studies, CTCal is preferred over the corresponding baseline in 76.67% of comparisons for SD 2.1 and 54.17% for SD 3.

Highlights & Insights

  • Training-stage perspective: Unlike inference-time optimization methods, CTCal addresses text-image alignment at training time, yielding lasting improvements with no inference overhead.
  • Self-supervised paradigm: The model uses its own small-timestep outputs to supervise large-timestep learning, requiring no additional labels or external teacher models.
  • Architecture-agnostic: Applicable to both U-Net (SD 2.1) and Transformer (SD 3) architectures.

Limitations & Future Work

  • Training data construction relies on reward-driven sampling, incurring non-trivial data preparation costs.
  • Only LoRA fine-tuning is explored; the effect of full fine-tuning remains uninvestigated.
  • Noun selection depends on POS tagging tools; applicability to non-English prompts is unknown.

Relation to Prior Work

  • Inference-time methods such as Attend-and-Excite motivated the focus on cross-attention, but CTCal addresses the problem more elegantly at the training stage.
  • The reward-driven data selection strategy from GORS serves as the foundation for CTCal, which adds explicit alignment supervision on top of it.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The cross-timestep self-calibration idea is novel and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two benchmarks, SD 2.1 and SD 3, user studies, and complete ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from observation to design to validation is clear.
  • Value: ⭐⭐⭐⭐⭐ Delivers substantial improvements in text-image alignment for text-to-image generation.