Diffusion-CAM: Faithful Visual Explanations for dMLLMs¶
Conference: ACL 2026 arXiv: 2604.11005 Code: GitHub Area: Image Restoration Keywords: Diffusion Multimodal Model, Class Activation Mapping, Visual Explanation, Explainable AI, Parallel Generation
TL;DR¶
Diffusion-CAM is the first interpretability method for diffusion-based multimodal LLMs (dMLLMs). It extracts structurally valid intermediate representations from denoising trajectories and refines them with four post-processing modules (adaptive kernel denoising, distribution-aware confidence gating, contextual background decay, single-instance causal debiasing), significantly outperforming autoregressive CAM baselines on COCO Caption and GranDf.
Background & Motivation¶
Background: Multimodal LLMs are transitioning from autoregressive architectures (LLaVA, Qwen-VL) to diffusion-based architectures (LaViDa, LLaDA-V, MMaDA). Diffusion models generate entire sentences through parallel mask denoising, improving generation speed and global coherence.
Limitations of Prior Work: (1) Existing CAM methods (LLaVA-CAM, TAM) rely on autoregressive models' sequential, attention-rich properties — dMLLMs lack explicit token-level attention weights and left-to-right causal structure; (2) Directly applying traditional CAM to dMLLMs produces diffuse, non-specific heatmaps; (3) Parallel denoising produces smooth, distributed activation patterns fundamentally different from autoregressive local sequential dependencies.
Key Challenge: The architectural advantages of dMLLMs (parallel generation, global planning) are precisely the obstacles for traditional interpretability tools — the latter assume sequential dependency while the former operate in parallel.
Goal: Design the first visual explanation method adapted for diffusion-based multimodal models.
Key Insight: Identify "structurally valid" intermediate steps in the denoising trajectory — where image-conditioned spatial information is preserved and can be linked to final predictions via gradients.
Core Idea: Extract gradient CAM from structurally valid steps of the denoising process + four diffusion-specific post-processing modules to address spatial noise, background diffusion, and redundant token correlations.
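The summary does not spell out the feasibility criterion, but the "structurally valid step" idea can be sketched by gating denoising steps on how many response tokens are still masked; the `select_valid_steps` helper and the `max_masked_frac` threshold below are hypothetical illustrations, not the paper's exact rule.

```python
import numpy as np

def select_valid_steps(mask_schedule, max_masked_frac=0.5):
    """Pick denoising steps treated as 'structurally valid'.

    mask_schedule: one boolean array per step; True marks a response
    token that is still masked at that step. A step qualifies once most
    tokens are unmasked, i.e. image-conditioned spatial structure is
    assumed to be in place (illustrative criterion only).
    """
    valid = []
    for t, masked in enumerate(mask_schedule):
        if float(np.mean(masked)) <= max_masked_frac:
            valid.append(t)
    return valid

# Toy schedule: 8 response tokens progressively unmasked over 4 steps.
schedule = [
    np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=bool),  # step 0: all masked
    np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool),  # step 1: 5/8 masked
    np.array([0, 1, 0, 0, 0, 1, 0, 0], dtype=bool),  # step 2: 2/8 masked
    np.array([0, 0, 0, 0, 0, 0, 0, 0], dtype=bool),  # step 3: none masked
]
print(select_valid_steps(schedule))  # → [2, 3]
```

Only the later, mostly-unmasked steps pass the gate, matching the intuition that early steps carry too little structure to attribute.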
Method¶
Overall Architecture¶
Adapt CAM to dMLLMs: (1) register hooks at intermediate transformer blocks to extract features and gradients; (2) dynamically locate image token position boundaries; (3) backpropagate from final response scores to effective-step image region features to generate base CAM; (4) refine with four post-processing modules.
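Steps (2)–(3) can be sketched as below, assuming per-layer features and gradients have already been captured by forward/backward hooks; the sequence length, hidden size, fixed `span`, and 24×24 patch grid are illustrative assumptions (the paper locates the image-token span dynamically).

```python
import numpy as np

def grad_cam_over_image_span(feats, grads, span, grid=(24, 24)):
    """Grad-CAM over the image-token span of a transformer layer.

    feats, grads: (seq_len, hidden) activations and their gradients
    w.r.t. the final response score, assumed captured by hooks.
    span: (start, end) indices of the image tokens in the sequence.
    Returns a heat-map normalized to [0, 1] on the patch grid.
    """
    s, e = span
    A = feats[s:e]                    # (n_img_tokens, hidden)
    G = grads[s:e]
    w = G.mean(axis=0)                # channel weights, Grad-CAM style
    cam = np.maximum(A @ w, 0.0)      # ReLU of weighted activations
    cam = cam.reshape(grid)
    rng = cam.max() - cam.min()
    return (cam - cam.min()) / rng if rng > 0 else cam

rng = np.random.default_rng(0)
seq_len, hidden = 700, 64
feats = rng.standard_normal((seq_len, hidden))
grads = rng.standard_normal((seq_len, hidden))
heat = grad_cam_over_image_span(feats, grads, span=(10, 586))
print(heat.shape)  # → (24, 24)
```

With random inputs this only demonstrates shapes and normalization; in the real pipeline the gradients come from backpropagating the response score at a structurally valid denoising step.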
Key Designs¶
- Diffusion CAM Adaptation (three-step modification): selects denoising steps that satisfy feasibility conditions, dynamically locates image-token spans, and applies Grad-CAM aggregation. Because the design assumes no fixed image-token positions, it relies on universal feasibility criteria.
- Adaptive Kernel Denoising Module: suppresses high-frequency architectural artifacts from Transformer self-attention via dynamically scaled filter kernels that account for the denoising step, spatial variance, and resolution.
- Distribution-Aware Confidence Gating + Contextual Background Decay + Single-Instance Causal Debiasing: respectively address high-variance activation artifacts, residual background signals, and abnormally high activations from repeated tokens.
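A minimal sketch of the first two post-processing ideas, with illustrative scaling and thresholding rules (the actual kernel-scaling formula and gating distribution are not specified in this summary; `base_sigma` and the percentile are hypothetical parameters):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def adaptive_denoise(cam, step_frac, base_sigma=1.0):
    """Smooth a CAM with a kernel scaled by denoising progress and
    spatial variance (illustrative scaling, not the paper's exact rule)."""
    sigma = base_sigma * (1.0 + step_frac) * (1.0 + cam.std())
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel(sigma, radius)
    out = np.pad(cam, radius, mode="edge")
    # Separable convolution: smooth rows, then columns.
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out[radius:-radius, radius:-radius]

def confidence_gate(cam, pct=70):
    """Zero out activations below a percentile threshold — a simple
    stand-in for the distribution-aware confidence gating module."""
    thr = np.percentile(cam, pct)
    return np.where(cam >= thr, cam, 0.0)

rng = np.random.default_rng(1)
cam = rng.random((24, 24))
smooth = adaptive_denoise(cam, step_frac=0.5)
gated = confidence_gate(smooth)
print(smooth.shape, gated.shape)  # → (24, 24) (24, 24)
```

Background decay and causal debiasing would follow the same pattern: elementwise reweighting of the smoothed map, conditioned on context tokens and on repeated-token statistics respectively.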
Key Experimental Results¶
Main Results (COCO Caption + GranDf)¶
| Method | Localization Accuracy | Background Suppression | Visual Fidelity |
|---|---|---|---|
| LLaVA-CAM | Baseline | Weak | Weak |
| Grad-CAM (direct) | Poor | Poor | Poor |
| Diffusion-CAM | SOTA | SOTA | SOTA |
Key Findings¶
- Directly applying autoregressive CAM to dMLLMs completely fails
- All four post-processing modules are complementary and indispensable
- Denoising step selection is crucial: only structurally valid steps yield meaningful visual attribution
Highlights & Insights¶
- First reveals the fundamental challenge of dMLLM interpretability: parallel generation vs sequential dependency conflict
- The "structurally valid step" concept provides a general principle for attribution in non-autoregressive models
Limitations & Future Work¶
- Validated only on LaViDa series; adaptability to other dMLLMs remains to be confirmed
- The four modules' hyperparameters require model-specific tuning
- Computational overhead exceeds autoregressive CAM
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐