DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities¶
Conference: AAAI 2026 arXiv: 2511.05968 Code: N/A Area: Medical Imaging Keywords: Radiology Report Generation, Missing Modalities, Disentangled Representation, VAE, MoE
TL;DR¶
This paper proposes DiA-gnostic VLVAE, a vision-language mixture-of-experts VAE that learns a three-factor latent space (\(Z_v\) visual-specific / \(Z_l\) language-specific / \(Z_s\) shared), with dual constraints of orthogonality and contrastive alignment for disentanglement. The model generates reliable radiology reports even when clinical context is absent, achieving competitive BLEU@4 on IU X-Ray and MIMIC-CXR.
Background & Motivation¶
Background: Radiology Report Generation (RRG) is a critical task of automatically converting medical images into textual reports. The field has evolved from purely image-based models (R2Gen) to knowledge graph-enhanced approaches (MKSG) and context-aware models (PromptMRG), progressively incorporating richer clinical information.
Limitations of Prior Work:
- Missing modalities: In clinical practice, contextual information (medical history, symptoms, demographics) is frequently incomplete.
- Feature entanglement: The intermingling of modality-specific and shared information leads to suboptimal fusion and clinically inaccurate hallucinated findings.
- LLM-based methods incur heavy computational costs; knowledge graph-based methods exhibit poor adaptability.
- Retrieval-augmented methods degrade to deterministic rules when context is missing.
Key Challenge: How can cross-modal alignment remain stable under missing modality conditions?
Goal: Achieve robust radiology report generation against missing modalities through disentangled representation learning.
Key Insight: Decompose the latent space into three factors (\(Z_v\) visual-specific, \(Z_l\) language-specific, \(Z_s\) shared), and adopt an MoE strategy to infer the shared latent variable such that missing modalities are automatically down-weighted.
Core Idea: Three-factor latent space disentanglement (orthogonality constraints for separation + contrastive alignment for semantic association) combined with an MoE shared encoder that gracefully degrades under missing modalities.
Method¶
Overall Architecture¶
- Feature Extraction: EfficientNetB0 + GCA for visual features; Transformer encoder for language features.
- Modality Abstractor: Bidirectional cross-attention to fuse visual and language features.
- VL-MoE-VAE: Learns the three-factor latent space \((Z_v, Z_l, Z_s)\).
- LLaMA-X Decoder: Generates reports from disentangled representations.
Key Designs¶
- Three-Factor Latent Space Decomposition:
- Visual-specific \(Z_v\): inferred by VGG16 encoder, \(q_{\phi_v}(Z_v|V)\)
- Language-specific \(Z_l\): inferred by Transformer encoder, \(q_{\phi_l}(Z_l|L)\)
- Shared \(Z_s\): MoE strategy, \(q_{\phi_s}(Z_s|V,L) = \sum_{M} \pi_M q_{\phi_s}(Z_s|M)\)
- Design Motivation: Each latent variable is constrained to encode only its designated information — modality-specific latent variables must reconstruct their corresponding modality, while the shared latent variable encodes cross-modal semantics.
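The MoE posterior above can be sketched as follows — a minimal NumPy toy, not the paper's implementation. The function name `sample_moe_posterior` and the two-expert setup are assumptions; the idea is that \(Z_s\) is drawn by first picking an expert (modality) with probability \(\pi_M\) and then sampling that expert's Gaussian via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_moe_posterior(mus, logvars, pi):
    """Sample Z_s from the MoE posterior q(Z_s|V,L) = sum_M pi_M N(mu_M, sigma_M^2).

    mus, logvars: per-expert Gaussian parameters, shape (n_experts, latent_dim).
    pi: mixture weights over experts (e.g. [pi_V, pi_L]), summing to 1.
    """
    k = rng.choice(len(pi), p=pi)                    # pick which expert generates this draw
    eps = rng.standard_normal(mus.shape[1])          # reparameterization noise
    return mus[k] + np.exp(0.5 * logvars[k]) * eps   # mu_k + sigma_k * eps
```

Sampling the mixture (rather than multiplying expert densities, as PoE does) is what lets a zero-weighted expert drop out of the posterior entirely.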
- Disentangled Alignment Constraints:
- Orthogonality Constraint: \(\mathcal{L}_{orth} = \|\tilde{Z}_s^\top \tilde{Z}_v\|_F^2 + \|\tilde{Z}_s^\top \tilde{Z}_l\|_F^2 + \|\tilde{Z}_v^\top \tilde{Z}_l\|_F^2\), enforcing statistical independence among the three latent subspaces.
- Contrastive Alignment: InfoNCE loss maximizes \(I(Z_s; Z_v)\) and \(I(Z_s; Z_l)\), ensuring the shared space encodes semantics from both modalities.
- Design Motivation: The ELBO alone cannot guarantee meaningful latent factors. Orthogonality ensures separation while contrastive alignment ensures semantic relevance — the two constraints are complementary.
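The two constraints can be sketched in NumPy as below. This is a simplified illustration under assumed shapes — batches of latent vectors \((B, d)\), with the InfoNCE positives taken as matched rows in the batch; the paper's exact normalization and temperature schedule are not specified here:

```python
import numpy as np

def orthogonality_loss(Zs, Zv, Zl):
    """L_orth = ||Zs^T Zv||_F^2 + ||Zs^T Zl||_F^2 + ||Zv^T Zl||_F^2,
    computed on batch matrices of shape (batch, dim)."""
    fro2 = lambda A, B: float(np.sum((A.T @ B) ** 2))
    return fro2(Zs, Zv) + fro2(Zs, Zl) + fro2(Zv, Zl)

def info_nce(Zs, Zm, tau=0.1):
    """InfoNCE lower bound on I(Z_s; Z_m): the matched row in the batch is
    the positive pair, all other rows serve as negatives."""
    Zs = Zs / np.linalg.norm(Zs, axis=1, keepdims=True)   # cosine similarity
    Zm = Zm / np.linalg.norm(Zm, axis=1, keepdims=True)
    logits = Zs @ Zm.T / tau                              # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # cross-entropy on positives
```

Note the complementary pull: `orthogonality_loss` pushes the subspaces apart, while `info_nce` ties \(Z_s\) back to each modality's semantics, so neither constraint alone suffices.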
- Inference under Missing Modalities:
- When the language modality is absent, a "null" token is passed in; the MoE router automatically assigns \(\pi_L \approx 0\) and \(\pi_V \approx 1\).
- Theoretical guarantee: the paper proves that the degraded objective remains a valid ELBO, i.e., a lower bound on the marginal log-likelihood of the observed modality.
- Contrastive alignment training ensures \(Z_s\) retains cross-modal semantics even when inferred from a single modality.
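The graceful-degradation behavior of the router can be illustrated with a toy masked softmax — a sketch, not the paper's router, with hypothetical scores and a `masked_router` name of my own choosing:

```python
import numpy as np

def masked_router(scores, present):
    """Softmax over expert scores with absent modalities masked out.
    When the language input is the null token (present=0), its weight
    collapses to ~0 and q(Z_s|V,L) reduces to q(Z_s|V)."""
    s = np.where(np.asarray(present, dtype=bool),
                 np.asarray(scores, dtype=float), -1e9)   # mask absent experts
    e = np.exp(s - s.max())
    return e / e.sum()

pi_full = masked_router([0.3, 0.1], [1, 1])   # [visual, language], both present
pi_miss = masked_router([0.3, 0.1], [1, 0])   # language replaced by null token
```

With both modalities present, both experts receive nonzero weight; with the language modality masked, `pi_miss` puts essentially all mass on the visual expert, matching the \(\pi_L \approx 0\), \(\pi_V \approx 1\) behavior described above.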
Loss & Training¶
\(\mathcal{L}_{total} = \mathcal{L}_{CE} + \mathcal{L}_{ELBO} + \lambda_1 \mathcal{L}_{orth} + \lambda_2 \mathcal{L}_{align}\). The LLaMA-X decoder incorporates RoPE, grouped-query attention, and SwiGLU FFN for efficiency.
Key Experimental Results¶
Main Results¶
| Method | IU X-Ray B@4 | MIMIC-CXR B@4 |
|---|---|---|
| R2Gen | 0.165 | 0.103 |
| XProNet | 0.199 | 0.105 |
| EKAGen | 0.203 | - |
| SEI | 0.263 | 0.131 |
| DiA (Ours) | 0.266 | 0.134 |
Ablation Study¶
- Removing orthogonality constraint: disentanglement quality degrades and feature interference increases.
- Removing contrastive alignment: significant performance drop under missing modality conditions.
- Replacing MoE with PoE: catastrophic performance degradation when modalities are missing.
Key Findings¶
- MoE is more suitable than PoE for handling missing modalities — PoE produces overconfident posteriors when a modality is absent.
- The dual orthogonality + contrastive constraint significantly outperforms either constraint alone.
- The LLaMA-X decoder is more efficient than large LLMs and avoids the limitations of template-driven generation.
- Performance degrades gracefully rather than catastrophically under missing modality conditions.
Highlights & Insights¶
- The theoretical analysis of MoE vs. PoE offers important reference value for multimodal VAE research — the overconfidence problem of PoE under missing modalities is a critical but often overlooked issue.
- The "graceful degradation" design philosophy is well-suited for clinical deployment — rather than failing when input is incomplete, the model automatically adjusts based on available information.
- The dual orthogonality + contrastive constraint represents strong practice in disentangled representation learning — the former enforces separation while the latter preserves semantics.
Limitations & Future Work¶
- Validation is limited to chest X-ray report generation; other imaging modalities (CT, MRI) remain untested.
- The specific scale and training details of LLaMA-X are insufficiently described.
- Sensitivity of model performance to the temperature parameter \(\tau\) in contrastive alignment is not thoroughly analyzed.
- NLG metrics (BLEU) may not fully reflect clinical accuracy.
Related Work & Insights¶
- vs. R2Gen/CvT2Dis: Image-only methods lacking contextual information; DiA improves by incorporating clinical context.
- vs. SEI: SEI employs retrieval augmentation without disentanglement, leaving feature interference unresolved; DiA's three-factor decomposition is more principled.
- vs. PromptMRG: Prompt- and LLM-based approaches are computationally expensive and template-dependent; DiA is more efficient and flexible.
- vs. DrFuse: DrFuse also performs disentanglement but uses an adversarial objective; DiA's contrastive + orthogonality approach is more stable.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of three-factor VAE + MoE + dual constraints demonstrates theoretical depth.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two standard benchmarks with ablations and missing-modality experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations, clear architectural diagrams, and coherent logical flow.
- Value: ⭐⭐⭐⭐ Practically meaningful for multimodal medical report generation under missing modality conditions.