DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers¶
Conference: ICCV 2025 arXiv: 2503.22796 Code: None Area: Image Generation / Diffusion Model Acceleration Keywords: Diffusion Transformer, Attention Compression, MMDiT, Sparse Attention, Inference Acceleration
TL;DR¶
DiTFastAttnV2 is a post-training attention compression method for multi-modality diffusion Transformers (MMDiT). Through Head-wise Arrow Attention and Head-wise Caching, it achieves fine-grained, head-level compression, reducing attention FLOPs by 68% and delivering a 1.5× end-to-end speedup on 2K image generation without visible quality degradation.
Background & Motivation¶
MMDiT architectures (e.g., SD3, FLUX) are the dominant paradigm for text-to-image generation, performing joint self-attention over concatenated visual and text tokens. However, attention computation remains the primary inference bottleneck. Existing acceleration methods (e.g., DiTFastAttn) exhibit three critical limitations:
Cross-modal attention pattern complexity: In MMDiT, visual tokens exhibit diagonal locality while text token interactions are highly semantics-dependent. A uniform sliding-window attention cannot capture this disparity, and forcing it onto text tokens truncates textual information.
Inter-head redundancy heterogeneity: Attention heads within the same layer exhibit drastically different behaviors—some approximate global attention while others are highly localized. Layer-level uniform caching or sparsity strategies discard critical information.
Prohibitive search cost: DiTFastAttn needs over 10 hours to search a compression scheme for 50-step 2K FLUX generation, and a naive extension to head-level granularity would exceed 200 hours.
Method¶
Overall Architecture¶
DiTFastAttnV2 is a post-training compression framework comprising three components: Head-wise Arrow Attention (addressing spatial redundancy heterogeneity), Head-wise Caching (addressing timestep redundancy heterogeneity), and an efficient fused kernel (enabling practical acceleration), along with an efficient compression scheme search algorithm.
Key Designs¶
- Head-wise Arrow Attention (a mask-level sketch follows this list):
- The joint attention map is partitioned into four regions: visual-visual, visual-text, text-visual, and text-text.
- Local attention is applied to the visual-visual region (retaining only attention scores near the diagonal), discarding long-range token interactions.
- Full attention is preserved for the three regions involving text tokens without any compression.
- This pattern resembles an arrow, hence the name "arrow attention."
- Each attention head can independently select full attention or arrow attention (mixed attention design).
- Design Motivation: The diagonal locality of visual tokens is consistent across prompts, whereas text interactions are highly semantics-dependent and incompressible.
- Head-wise Caching:
- Analysis reveals that inter-head similarity across adjacent timesteps varies significantly within the same layer.
- For heads with high inter-step similarity, attention computation at the current timestep is skipped and the cached output from the previous step is reused directly.
- Each head independently determines whether to use caching.
- Design Motivation: Timestep redundancy is exploited at head-level granularity to avoid discarding critical information from rapidly changing heads.
- Fused Kernel:
- Integrates arrow attention and caching; each head can independently select one of three modes: full attention, computation skip (cache reuse), or arrow attention (with a specified window size).
- Implemented on top of FlashAttention2 with a block-sparse pattern, so each computed block stays dense and irregular memory access overhead is minimized.
- Mixed blocks are converted to dense blocks to reduce memory access costs.
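The sketch below illustrates the semantics of the three per-head modes on a dense attention map. It is a minimal illustration, not the paper's fused kernel: the `[visual | text]` token layout, the function names, and the per-head loop are my assumptions, and the real implementation applies the mask at block granularity inside a FlashAttention2-style kernel instead of materializing full score matrices.

```python
import torch

def arrow_mask(n_img: int, n_txt: int, window: int, device="cpu") -> torch.Tensor:
    """Boolean attention mask for one 'arrow attention' head.

    Assumed token layout (illustrative): [visual tokens | text tokens].
    Visual-visual scores are kept only inside a diagonal band of half-width
    `window`; any position involving a text token keeps full attention,
    which gives the mask its arrow shape.
    """
    n = n_img + n_txt
    mask = torch.ones(n, n, dtype=torch.bool, device=device)  # start fully dense
    idx = torch.arange(n_img, device=device)
    # Restrict only the visual-visual block to a band around the diagonal;
    # all rows/columns touching text tokens stay dense.
    mask[:n_img, :n_img] = (idx[:, None] - idx[None, :]).abs() <= window
    return mask

def headwise_attention(q, k, v, modes, cache, n_img, n_txt, window=128):
    """Reference (dense) forward for per-head mode dispatch.

    q, k, v: [heads, tokens, head_dim]; modes[h] in {"full", "arrow", "cache"};
    cache[h] holds head h's output from the previous timestep (assumed warm).
    This loop only shows the per-head semantics; the fused kernel skips
    masked blocks instead of computing and discarding them.
    """
    a_mask = arrow_mask(n_img, n_txt, window, q.device)
    outs = []
    for h, mode in enumerate(modes):
        if mode == "cache":               # skip computation, reuse cached output
            outs.append(cache[h])
            continue
        scores = (q[h] @ k[h].transpose(-1, -2)) / q.shape[-1] ** 0.5
        if mode == "arrow":
            scores = scores.masked_fill(~a_mask, float("-inf"))
        out = torch.softmax(scores, dim=-1) @ v[h]
        cache[h] = out                    # refresh cache for later timesteps
        outs.append(out)
    return torch.stack(outs)
```

In the fused kernel the same dispatch happens per block: cached heads skip all their blocks, and arrow heads only launch blocks intersecting the diagonal band or a text region, with mixed blocks promoted to dense blocks as described above.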
Loss & Training¶
Efficient Compression Scheme Search:
- Per-layer RSE Metric: The single-layer relative squared error (RSE), rather than the final output MSE, is used to measure the impact of each compression method, reducing calibration cost from \(T \times L \times M \times H\) full inferences to \(T \times M\): \(\mathcal{I}(m) = \frac{\sum (y_m - y_o)^2}{\sum (y_o - \bar{y}_o)^2}\), where \(y_m\) is the layer output under compression method \(m\), \(y_o\) the original layer output, and \(\bar{y}_o\) its mean.
- Head-wise Compression Scheme Optimization: The compression configuration per layer per timestep is modeled as an integer optimization problem, minimizing latency subject to an RSE budget constraint \(\delta\).
- Head Constraint Coefficient \(c\): A constraint \(\mathcal{I}(h,m) \leq \frac{c}{n}\delta\) is introduced to prevent any single head from absorbing a disproportionate share of the compression budget; default \(c=1.5\).
- A progressive update search strategy is adopted, proceeding timestep-by-timestep and layer-by-layer (a simplified selection sketch follows this list).
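A simplified sketch of the calibration logic, under stated assumptions: `head_outputs`, `methods_by_savings`, and the greedy acceptance rule are illustrative stand-ins for the paper's latency-minimizing integer program (which optimizes all heads jointly); only the RSE metric and the per-head budget \(\frac{c}{n}\delta\) follow the definitions above.

```python
import torch

def rse(y_m: torch.Tensor, y_o: torch.Tensor) -> float:
    """Per-layer influence metric: relative squared error of a compressed
    output y_m against the original output y_o (see the formula above)."""
    return ((y_m - y_o) ** 2).sum().item() / ((y_o - y_o.mean()) ** 2).sum().item()

def select_head_methods(head_outputs, y_orig, methods_by_savings, delta, c=1.5):
    """Greedy per-head method selection for one layer at one timestep.

    head_outputs[m][h]: head h's calibration output under candidate method m;
    y_orig[h]: head h's original output. Each head may consume at most
    (c / n) * delta of the layer's RSE budget, and methods are tried from
    most to least aggressive. A simplified stand-in for the paper's
    integer optimization.
    """
    n = len(y_orig)
    head_budget = (c / n) * delta
    plan, spent = {}, 0.0
    for h in range(n):
        plan[h] = "full"                     # default: no compression
        for m in methods_by_savings:         # most aggressive (cheapest) first
            err = rse(head_outputs[m][h], y_orig[h])
            if err <= head_budget and spent + err <= delta:
                plan[h], spent = m, spent + err
                break
    return plan
```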
Key Experimental Results¶
Main Results — SD3 and FLUX Generation Quality (Table)¶
| Model | Resolution | Threshold δ | Attention Sparsity | LPIPS↓ | SSIM↑ | HPSv2↑ | CLIP↑ |
|---|---|---|---|---|---|---|---|
| SD3 | 1024 | Original | 0 | - | - | 0.2926 | 0.3254 |
| SD3 | 1024 | δ=0.2 | 0.41 | 0.182 | 0.716 | 0.2933 | 0.3251 |
| SD3 | 1024 | δ=0.6 | 0.63 | 0.266 | 0.616 | 0.2933 | 0.3246 |
| FLUX | 2048 | Original | 0 | - | - | 0.2862 | 0.3169 |
| FLUX | 2048 | δ=0.2 | 0.43 | 0.242 | 0.646 | 0.2883 | 0.3164 |
| FLUX | 2048 | δ=1.0 | 0.68 | 0.393 | 0.497 | 0.2852 | 0.3163 |
Ablation Study — Method Combinations and Constraint Coefficients (Table)¶
| Method Set | Attention Sparsity | LPIPS↓ | SSIM↑ | HPSv2↑ |
|---|---|---|---|---|
| Original | 0 | - | - | 0.2926 |
| AA only | 0.30 | 0.275 | 0.608 | 0.2943 |
| AA + OC | 0.55 | 0.238 | 0.644 | 0.2935 |
| + CFG Sharing | 0.54 | 0.249 | 0.649 | 0.2913 |
| + Residual Sharing | 0.56 | 0.196 | 0.704 | 0.2906 |

| Constraint Coefficient \(c\) | Sparsity | LPIPS↓ | SSIM↑ |
|---|---|---|---|
| No constraint | 0.50 | 0.249 | 0.640 |
| \(c=1\) | 0.55 | 0.240 | 0.641 |
| \(c=1.5\) | 0.55 | 0.238 | 0.644 |
| \(c=2\) | 0.55 | 0.253 | 0.627 |
Key Findings¶
- Significant speedup: 1.5× end-to-end acceleration on FLUX 2K image generation with up to 68% reduction in attention FLOPs.
- Near-lossless quality: At δ=0.2/0.6, HPSv2 and CLIP scores are comparable to or even exceed those of the original model.
- Substantial improvement over predecessor: At equivalent attention sparsity on SD3, DiTFastAttnV2 outperforms DiTFastAttn and its variants on all metrics.
- Search efficiency gain: Compression scheme search for 2K image generation drops from over 10 hours to 15 minutes.
- CFG Sharing ineffective for MMDiT: In the ablation, adding CFG sharing slightly lowers quality scores without raising sparsity, indicating that MMDiT's conditional and unconditional passes share too little redundancy for CFG sharing to exploit.
- Fused kernel performance: Achieves a 3.55× attention speedup at 75% sparsity, close to the theoretical upper bound.
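  - Back-of-envelope check (mine, not from the paper): at sparsity \(s\), the FLOPs-level speedup bound is \(1/(1-s)\), so \(s = 0.75\) gives \(1/(1 - 0.75) = 4\times\); the measured 3.55× reaches roughly 89% of that bound.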
Highlights & Insights¶
- Re-examining attention compression at head granularity: The paper reveals high heterogeneity among attention heads in MMDiT and designs fine-grained strategies accordingly.
- Elegant Arrow Attention design: Precisely captures the locality of visual tokens and the global dependency of text tokens.
- Order-of-magnitude search speedup: The per-layer RSE metric combined with head-level optimization brings the compression scheme search from hours (over 200 hours for a naive head-level search) down to minutes.
- High practical deployment value: The fused kernel achieves genuine 1.5× speedup rather than merely theoretical FLOPs reduction.
Limitations & Future Work¶
- Validation is limited to SD3 and FLUX; applicability to other MMDiT architectures (e.g., CogVideoX, HunyuanVideo, and other video models) remains unverified.
- The window size of Arrow Attention is determined via search and lacks an adaptive mechanism.
- The current implementation targets A100 GPUs; speedup ratios may differ on other hardware platforms.
- The combination of the proposed method with other compression approaches such as quantization and distillation is not explored.
- At high compression rates (δ=1.0), details and backgrounds may change, potentially unsuitable for scenarios requiring strict consistency.
Related Work & Insights¶
- DiTFastAttnV2 is a direct extension of DiTFastAttn, expanding from layer-level to head-level granularity and resolving MMDiT adaptation issues.
- Arrow Attention draws inspiration from Attention Sink and sparse attention research in LLMs.
- The per-layer metric calibration approach is generalizable to other scenarios requiring compression configuration search.
- The finding of head-level heterogeneity may also serve as a reference for ViT pruning and distillation research.
Rating¶
- Novelty: ⭐⭐⭐⭐ The head-wise combination of arrow attention and caching is novel, with significant search efficiency improvements.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-metric, multi-threshold, and thorough ablations, though evaluation on only two models is somewhat limited.
- Writing Quality: ⭐⭐⭐⭐ In-depth analysis, clear illustrations, and complete algorithmic pseudocode.
- Value: ⭐⭐⭐⭐⭐ Addresses practical pain points in MMDiT inference acceleration with high real-world applicability.