Diffusion-CAM: Faithful Visual Explanations for dMLLMs
Conference: ACL 2026
arXiv: 2604.11005
Code: GitHub
Area: Image Restoration
Keywords: Diffusion Multimodal Large Language Models, Class Activation Mapping, Visual Explanation, Explainable AI, Parallel Generation
TL;DR
Proposes Diffusion-CAM, the first interpretability method designed specifically for diffusion-based multimodal large language models (dMLLMs). By extracting structurally valid intermediate representations from the denoising trajectory and employing four post-processing modules (Adaptive Kernel Denoising, Distribution-Aware Confidence Gating, Contextual Background Attenuation, and Single-Instance Causal Debiasing), it significantly outperforms autoregressive CAM baselines on COCO Caption and GranDf.
Background & Motivation
Background: Multimodal LLMs are shifting from autoregressive architectures (LLaVA, Qwen-VL) to diffusion-based paradigms (LaViDa, LLaDA-V, MMaDA). Diffusion models generate entire sentences through parallel masked denoising, improving generation speed and global coherence.
Limitations of Prior Work: (1) Existing CAM methods (e.g., LLaVA-CAM, TAM) rely on the sequential, attention-rich nature of autoregressive models to track token generation, whereas dMLLMs lack explicit token-level attention weights and a left-to-right causal structure; (2) Directly applying traditional CAM to dMLLMs produces diffuse, non-specific heatmaps; (3) The parallel denoising process of dMLLMs yields smooth, distributed activation patterns, fundamentally different from the local, sequential dependencies of autoregressive models.
Key Challenge: The architectural advantages of dMLLMs (parallel generation, global planning) are exactly what break traditional interpretability tools, which assume that tokens are generated sequentially.
Goal: Design the first visual explanation method adapted for diffusion-based multimodal models.
Key Insight: Identify "structurally valid" intermediate steps in the denoising trajectory—where image-conditioned spatial information is still preserved and can be linked to the final prediction via gradients.
Core Idea: Compute Grad-CAM at structurally valid steps of the denoising process, then apply four diffusion-specific post-processing modules to address spatial noise, background diffusion, and redundant token correlation.
Method
Overall Architecture
Adapting CAM to dMLLM: (1) Register hooks in intermediate transformer blocks to extract features and gradients; (2) Dynamically locate the position boundaries of image tokens; (3) Backpropagate from final response scores to image region features at valid steps to generate the base CAM; (4) Refine with four post-processing modules.
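Step (3) is the standard Grad-CAM computation restricted to the image tokens. The notation below is assumed for illustration and is not taken from the paper: \(A \in \mathbb{R}^{H \times W \times C}\) denotes the image-token features at a valid denoising step, reshaped to a spatial grid, and \(s\) denotes the scalar score of the final response.

\[
\alpha_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \frac{\partial s}{\partial A_{ijc}},
\qquad
\mathrm{CAM}_{ij} = \mathrm{ReLU}\!\left( \sum_{c=1}^{C} \alpha_c \, A_{ijc} \right)
\]

The channel weights \(\alpha_c\) are spatially averaged gradients, and the ReLU keeps only the regions that contribute positively to the score.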
Key Designs
- Diffusion CAM Adaptation (Three-step Transformation):
    - Function: Makes CAM compatible with non-autoregressive denoising generation.
    - Mechanism: (1) Model-Aware Feature Extraction: selects denoising steps that satisfy the "feasibility condition", i.e. steps where the hooked hidden-state sequence still contains the full image token span. (2) Dynamic Image Span Localization: parses the image token boundaries from the info4cam metadata, extracts the image features, and reshapes them into spatial feature maps. (3) Grad-CAM Aggregation: averages the gradients spatially to obtain channel weights, then applies a weighted sum and ReLU to obtain the base CAM.
    - Design Motivation: Instead of assuming fixed image token positions or specific denoising steps, a general feasibility criterion is used for adaptive selection.
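The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the half-open span `[img_start, img_end)`, the row-major token-to-grid layout, and the final max-normalization are all hypothetical, and the hidden states and gradients are assumed to have already been captured by hooks.

```python
import numpy as np


def is_feasible(seq_len, img_start, img_end):
    """Assumed feasibility check: the hooked sequence must still cover
    the whole image token span [img_start, img_end)."""
    return 0 <= img_start < img_end <= seq_len


def grad_cam_from_tokens(hidden, grads, img_start, img_end, grid_hw):
    """Base CAM for one denoising step.

    hidden, grads: arrays of shape (seq_len, channels) holding the hooked
    hidden states and the gradients of the response score w.r.t. them.
    grid_hw: (H, W) layout of the image tokens (assumed row-major).
    """
    h, w = grid_hw
    if not is_feasible(hidden.shape[0], img_start, img_end):
        raise ValueError("step does not contain the full image token span")
    if img_end - img_start != h * w:
        raise ValueError("image span length does not match the spatial grid")

    # Dynamic span localization: slice the image tokens and reshape to a grid.
    feats = hidden[img_start:img_end].reshape(h, w, -1)  # (H, W, C)
    g = grads[img_start:img_end].reshape(h, w, -1)       # (H, W, C)

    # Grad-CAM aggregation: spatially averaged gradients act as channel weights.
    weights = g.mean(axis=(0, 1))                        # (C,)
    cam = np.maximum((feats * weights).sum(axis=-1), 0)  # weighted sum + ReLU

    # Max-normalization is an illustrative choice, not stated in the source.
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

Steps that fail `is_feasible` would simply be skipped when iterating over the denoising trajectory, so no fixed step index or fixed image-token offset has to be assumed.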
- Adaptive Kernel Denoising Module:
    - Function: Suppresses high-frequency architectural artifacts from Transformer self-attention.
    - Mechanism: Dynamically scales the filter kernel size \(k_{\text{adaptive}}\), considering three factors: the number of denoising steps (larger kernel for more steps), spatial variance (larger kernel for high noise), and resolution (to ensure scale invariance). Employs rank-weighted Gaussian filtering, which assigns weights according to activation value ranking rather than spatial distance.
    - Design Motivation: Fixed kernel sizes cannot adapt to the varying noise characteristics of different denoising steps and image contents.
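The summary does not give the exact formula for \(k_{\text{adaptive}}\) or for the rank weights, so the sketch below only illustrates the stated behaviour. Everything specific in it is an assumption: the multiplicative form of the kernel heuristic, the constants `base`, `ref_side`, and `max_size`, and the choice to centre the Gaussian on the median rank so that isolated extreme values receive little weight.

```python
import math

import numpy as np


def adaptive_kernel_size(num_steps, cam, base=3, ref_side=24, max_size=11):
    """Hypothetical heuristic: the kernel grows with the number of denoising
    steps, the spatial variance of the map, and the map resolution."""
    step_factor = 1.0 + math.log1p(num_steps) / 10.0
    var_factor = 1.0 + float(np.var(cam))
    res_factor = max(cam.shape) / ref_side
    k = int(round(base * step_factor * var_factor * res_factor))
    k = max(3, min(k, max_size))
    return k if k % 2 == 1 else k + 1  # keep the kernel size odd


def rank_weighted_filter(cam, k, sigma_rank=None):
    """Filter whose weights depend on the rank of each value inside the
    k x k window rather than on its spatial distance to the centre."""
    pad = k // 2
    padded = np.pad(cam, pad, mode="edge")
    n = k * k
    sigma = sigma_rank if sigma_rank is not None else n / 4.0
    ranks = np.arange(n)
    # Gaussian over rank, centred on the median rank (assumed design choice).
    w = np.exp(-0.5 * ((ranks - (n - 1) / 2.0) / sigma) ** 2)
    w /= w.sum()

    out = np.empty_like(cam, dtype=float)
    for i in range(cam.shape[0]):
        for j in range(cam.shape[1]):
            window = np.sort(padded[i:i + k, j:j + k], axis=None)
            out[i, j] = float(window @ w)
    return out
```

A call such as `rank_weighted_filter(cam, adaptive_kernel_size(num_steps, cam))` leaves flat regions unchanged (the weights sum to one) and damps single-pixel spikes, which is the kind of high-frequency artifact the module targets.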
- Distribution-Aware Confidence Gating + Contextual Background Attenuation + Single-Instance Causal Debiasing:
    - Function: Respectively address high-variance activation artifacts, residual background signals, and abnormally high activation of repeated tokens.
    - Mechanism: Confidence gating adaptively determines thresholds based on global statistics to differentiate high/low confidence regions; background attenuation defines foreground/background separation boundaries using multi-scale statistical integration; causal debiasing clears redundant responses through repeated token detection and abnormally high activation masking.
    - Design Motivation: The multi-step denoising of diffusion models accumulates various noise sources, requiring specific modules to address them one by one.
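The three modules can be illustrated with the simplified NumPy sketch below. The concrete rules are assumptions made for illustration only: a gate threshold of mean plus `delta_sigma` times the standard deviation, a foreground boundary taken as the average of several quantiles (standing in for the multi-scale statistics), and a cap at mean plus `z_max` standard deviations as a soft stand-in for the masking of abnormally high activations. None of the parameter names or default values come from the paper.

```python
from collections import Counter

import numpy as np


def confidence_gate(cam, delta_sigma=0.5, low_gain=0.2):
    """Down-weight activations below a threshold derived from global statistics."""
    tau = cam.mean() + delta_sigma * cam.std()
    return np.where(cam >= tau, cam, cam * low_gain)


def attenuate_background(cam, quantiles=(0.5, 0.7, 0.9), floor=0.1):
    """Average several quantiles into one foreground/background boundary
    and scale everything below it by `floor`."""
    boundary = float(np.mean([np.quantile(cam, q) for q in quantiles]))
    return np.where(cam >= boundary, cam, cam * floor)


def repeated_token_positions(token_ids):
    """Indices of tokens whose id occurs more than once in the response."""
    counts = Counter(token_ids)
    return [i for i, t in enumerate(token_ids) if counts[t] > 1]


def debias_repeated(cams, token_ids, z_max=3.0):
    """cams: (num_tokens, H, W). For repeated tokens only, cap activations
    that exceed mean + z_max * std of that token's map."""
    out = cams.astype(float).copy()
    for i in repeated_token_positions(token_ids):
        limit = out[i].mean() + z_max * out[i].std()
        out[i] = np.minimum(out[i], limit)
    return out
```

In this sketch the per-token maps would pass through `debias_repeated` first and each map would then be gated and attenuated; the actual ordering and thresholds used by Diffusion-CAM are not specified in this summary.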
Key Experimental Results
Main Results (COCO Caption + GranDf)
| Method | Localization Accuracy | Background Suppression | Visual Fidelity |
|---|---|---|---|
| LLaVA-CAM | Baseline | Weak | Weak |
| Grad-CAM (Direct Apply) | Poor | Poor | Poor |
| Diffusion-CAM | SOTA | SOTA | SOTA |
Ablation Study
| Module | Contribution |
|---|---|
| Adaptive Kernel Denoising | Suppresses high-frequency artifacts, improves heatmap smoothness |
| Confidence Gating | Distinguishes semantic regions from noise |
| Background Attenuation | Eliminates diffuse background responses |
| Causal Debiasing | Eliminates redundant activation caused by repeated tokens |
| Combined Four Modules | Optimal, modules are complementary |
Key Findings
- Directly applying autoregressive CAM to dMLLMs fails completely—producing diffuse, uninterpretable heatmaps.
- Each of the four post-processing modules addresses a specific problem; the ablation finds them complementary, with the full combination performing best.
- The choice of denoising steps is crucial: meaningful visual attribution can only be extracted at structurally valid steps.
- Diffusion-CAM significantly outperforms all baselines in localization accuracy and visual fidelity.
Highlights & Insights
- Reveals the fundamental challenge of dMLLM interpretability for the first time: the conflict between parallel generation and sequential dependency. This issue will grow in importance as diffusion architectures gain popularity.
- The concept of "structurally valid steps" provides a general principle—in non-autoregressive models, attribution should be extracted from intermediate states that preserve input-conditioned spatial information.
- The four-module design, while engineering-oriented, is grounded in clear theoretical motivations (noise analysis).
Limitations & Future Work
- Currently validated only on the LaViDa series; adaptability to other dMLLMs (e.g., LLaDA-V, MMaDA) remains to be confirmed.
- Hyperparameters for the four modules (e.g., \(\delta_\sigma\), \(\delta_\mu\)) require adjustment based on the model.
- Gradient backpropagation paths may not be unique in parallel denoising; the causal validity of attribution needs deeper analysis.
- Computational overhead is higher than autoregressive CAM (requires storage of intermediate denoising states).
- Text-token level attribution was not explored (currently focused only on visual region attribution).
Related Work & Insights
- vs LLaVA-CAM: Designed for autoregressive models; performs extremely poorly on dMLLMs. Diffusion-CAM is a necessary alternative.
- vs DAAM (Tang et al.): DAAM attributes regions of a generated image to prompt words in text-to-image diffusion models; both its goal and its method differ from explaining the text responses of a multimodal reasoning model.
- vs Attention Visualization: dMLLMs lack explicit autoregressive attention weights, making attention-based methods inapplicable.
Rating
- Novelty: ⭐⭐⭐⭐ First interpretability method for dMLLMs, though the core idea (Grad-CAM + post-processing) is established.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two benchmarks + ablation + comparison, though the dMLLM ecosystem is still small.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation and well-organized four-module design.
- Value: ⭐⭐⭐⭐ Importance will grow as dMLLMs become more prevalent.