DiA-gnostic VLVAE: Disentangled Alignment-Constrained Vision Language Variational AutoEncoder for Robust Radiology Reporting with Missing Modalities¶
Conference: AAAI 2026 arXiv: 2511.05968 Code: N/A Area: Medical Imaging Keywords: Radiology Report Generation, Missing Modalities, Disentangled Representation, VAE, MoE
TL;DR¶
This paper proposes DiA-gnostic VLVAE, a vision-language mixture-of-experts VAE that learns a three-factor latent space (\(Z_v\) visual-specific / \(Z_l\) language-specific / \(Z_s\) shared), with dual constraints of orthogonality and contrastive alignment for disentanglement. The model generates reliable radiology reports even when clinical context is absent, achieving competitive BLEU@4 on IU X-Ray and MIMIC-CXR.
Background & Motivation¶
Background: Radiology Report Generation (RRG) is a critical task of automatically converting medical images into textual reports. The field has evolved from purely image-based models (R2Gen) to knowledge graph-enhanced approaches (MKSG) and context-aware models (PromptMRG), progressively incorporating richer clinical information.
Limitations of Prior Work:
- Missing modalities: In clinical practice, contextual information (medical history, symptoms, demographics) is frequently incomplete.
- Feature entanglement: The intermingling of modality-specific and shared information leads to suboptimal fusion and clinically inaccurate hallucinated findings.
- LLM-based methods incur heavy computational costs; knowledge graph-based methods exhibit poor adaptability.
- Retrieval-augmented methods degrade to deterministic rules when context is missing.
Key Challenge: How can cross-modal alignment remain stable under missing modality conditions?
Goal: Achieve robust radiology report generation against missing modalities through disentangled representation learning.
Key Insight: Decompose the latent space into three factors (\(Z_v\) visual-specific, \(Z_l\) language-specific, \(Z_s\) shared), and adopt an MoE strategy to infer the shared latent variable such that missing modalities are automatically down-weighted.
Core Idea: Three-factor latent space disentanglement (orthogonality constraints for separation + contrastive alignment for semantic association) combined with an MoE shared encoder that gracefully degrades under missing modalities.
Method¶
Overall Architecture¶
- Feature Extraction: EfficientNetB0 + GCA for visual features; Transformer encoder for language features.
- Modality Abstractor: Bidirectional cross-attention to fuse visual and language features.
- VL-MoE-VAE: Learns the three-factor latent space \((Z_v, Z_l, Z_s)\).
- LLaMA-X Decoder: Generates reports from disentangled representations.
Key Designs¶
- Three-Factor Latent Space Decomposition:
- Visual-specific \(Z_v\): inferred by VGG16 encoder, \(q_{\phi_v}(Z_v|V)\)
- Language-specific \(Z_l\): inferred by Transformer encoder, \(q_{\phi_l}(Z_l|L)\)
- Shared \(Z_s\): MoE strategy, \(q_{\phi_s}(Z_s|V,L) = \sum_{M} \pi_M q_{\phi_s}(Z_s|M)\)
- Design Motivation: Each latent variable is constrained to encode only its designated information — modality-specific latent variables must reconstruct their corresponding modality, while the shared latent variable encodes cross-modal semantics.
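The MoE posterior above can be sketched as follows — a minimal NumPy toy, not the paper's implementation. The function name `sample_moe_posterior` and the two-expert setup are assumptions; the idea is that \(Z_s\) is drawn by first picking an expert (modality) with probability \(\pi_M\) and then sampling that expert's Gaussian via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_moe_posterior(mus, logvars, pi):
    """Sample Z_s from the MoE posterior q(Z_s|V,L) = sum_M pi_M N(mu_M, sigma_M^2).

    mus, logvars: per-expert Gaussian parameters, shape (n_experts, latent_dim).
    pi: mixture weights over experts (e.g. [pi_V, pi_L]), summing to 1.
    """
    k = rng.choice(len(pi), p=pi)                    # pick which expert generates this draw
    eps = rng.standard_normal(mus.shape[1])          # reparameterization noise
    return mus[k] + np.exp(0.5 * logvars[k]) * eps   # mu_k + sigma_k * eps
```

Sampling the mixture (rather than multiplying expert densities, as PoE does) is what lets a zero-weighted expert drop out of the posterior entirely.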
- Disentangled Alignment Constraints:
- Orthogonality Constraint: \(\mathcal{L}_{orth} = \|\tilde{Z}_s^\top \tilde{Z}_v\|_F^2 + \|\tilde{Z}_s^\top \tilde{Z}_l\|_F^2 + \|\tilde{Z}_v^\top \tilde{Z}_l\|_F^2\), enforcing statistical independence among the three latent subspaces.
- Contrastive Alignment: InfoNCE loss maximizes \(I(Z_s; Z_v)\) and \(I(Z_s; Z_l)\), ensuring the shared space encodes semantics from both modalities.
- Design Motivation: The ELBO alone cannot guarantee meaningful latent factors. Orthogonality ensures separation while contrastive alignment ensures semantic relevance — the two constraints are complementary.
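The two constraints can be sketched in NumPy as below. This is a simplified illustration under assumed shapes — batches of latent vectors \((B, d)\), with the InfoNCE positives taken as matched rows in the batch; the paper's exact normalization and temperature schedule are not specified here:

```python
import numpy as np

def orthogonality_loss(Zs, Zv, Zl):
    """L_orth = ||Zs^T Zv||_F^2 + ||Zs^T Zl||_F^2 + ||Zv^T Zl||_F^2,
    computed on batch matrices of shape (batch, dim)."""
    fro2 = lambda A, B: float(np.sum((A.T @ B) ** 2))
    return fro2(Zs, Zv) + fro2(Zs, Zl) + fro2(Zv, Zl)

def info_nce(Zs, Zm, tau=0.1):
    """InfoNCE lower bound on I(Z_s; Z_m): the matched row in the batch is
    the positive pair, all other rows serve as negatives."""
    Zs = Zs / np.linalg.norm(Zs, axis=1, keepdims=True)   # cosine similarity
    Zm = Zm / np.linalg.norm(Zm, axis=1, keepdims=True)
    logits = Zs @ Zm.T / tau                              # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # cross-entropy on positives
```

Note the complementary pull: `orthogonality_loss` pushes the subspaces apart, while `info_nce` ties \(Z_s\) back to each modality's semantics, so neither constraint alone suffices.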
- Inference under Missing Modalities:
- When the language modality is absent, a "null" token is passed in; the MoE router automatically assigns \(\pi_L \approx 0\) and \(\pi_V \approx 1\).
- Theoretical guarantee: the paper proves that the degraded objective remains a valid ELBO, i.e., a lower bound on the marginal log-likelihood of the observed modality.
- Contrastive alignment training ensures \(Z_s\) retains cross-modal semantics even when inferred from a single modality.
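The graceful-degradation behavior of the router can be illustrated with a toy masked softmax — a sketch, not the paper's router, with hypothetical scores and a `masked_router` name of my own choosing:

```python
import numpy as np

def masked_router(scores, present):
    """Softmax over expert scores with absent modalities masked out.
    When the language input is the null token (present=0), its weight
    collapses to ~0 and q(Z_s|V,L) reduces to q(Z_s|V)."""
    s = np.where(np.asarray(present, dtype=bool),
                 np.asarray(scores, dtype=float), -1e9)   # mask absent experts
    e = np.exp(s - s.max())
    return e / e.sum()

pi_full = masked_router([0.3, 0.1], [1, 1])   # [visual, language], both present
pi_miss = masked_router([0.3, 0.1], [1, 0])   # language replaced by null token
```

With both modalities present, both experts receive nonzero weight; with the language modality masked, `pi_miss` puts essentially all mass on the visual expert, matching the \(\pi_L \approx 0\), \(\pi_V \approx 1\) behavior described above.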
Loss & Training¶
\(\mathcal{L}_{total} = \mathcal{L}_{CE} + \mathcal{L}_{ELBO} + \lambda_1 \mathcal{L}_{orth} + \lambda_2 \mathcal{L}_{align}\). The LLaMA-X decoder incorporates RoPE, grouped-query attention, and SwiGLU FFN for efficiency.
Key Experimental Results¶
Main Results¶
| Method | IU X-Ray B@4 | MIMIC-CXR B@4 |
|---|---|---|
| R2Gen | 0.165 | 0.103 |
| XProNet | 0.199 | 0.105 |
| EKAGen | 0.203 | - |
| SEI | 0.263 | 0.131 |
| DiA (Ours) | 0.266 | 0.134 |
Ablation Study¶
- Removing orthogonality constraint: disentanglement quality degrades and feature interference increases.
- Removing contrastive alignment: significant performance drop under missing modality conditions.
- Replacing MoE with PoE: catastrophic performance degradation when modalities are missing.
Key Findings¶
- MoE is more suitable than PoE for handling missing modalities — PoE produces overconfident posteriors when a modality is absent.
- The dual orthogonality + contrastive constraint significantly outperforms either constraint alone.
- The LLaMA-X decoder is more efficient than large LLMs and avoids the limitations of template-driven generation.
- Performance degrades gracefully rather than catastrophically under missing modality conditions.
Highlights & Insights¶
- The theoretical analysis of MoE vs. PoE offers important reference value for multimodal VAE research — the overconfidence problem of PoE under missing modalities is a critical but often overlooked issue.
- The "graceful degradation" design philosophy is well-suited for clinical deployment — rather than failing when input is incomplete, the model automatically adjusts based on available information.
- The dual orthogonality + contrastive constraint represents strong practice in disentangled representation learning — the former enforces separation while the latter preserves semantics.
Limitations & Future Work¶
- Validation is limited to chest X-ray report generation; other imaging modalities (CT, MRI) remain untested.
- The specific scale and training details of LLaMA-X are insufficiently described.
- Sensitivity of model performance to the temperature parameter \(\tau\) in contrastive alignment is not thoroughly analyzed.
- NLG metrics (BLEU) may not fully reflect clinical accuracy.
Related Work & Insights¶
- vs. R2Gen/CvT2Dis: Image-only methods lacking contextual information; DiA improves by incorporating clinical context.
- vs. SEI: SEI employs retrieval augmentation without disentanglement, leaving feature interference unresolved; DiA's three-factor decomposition is more principled.
- vs. PromptMRG: Prompt- and LLM-based approaches are computationally expensive and template-dependent; DiA is more efficient and flexible.
- vs. DrFuse: DrFuse also performs disentanglement but uses an adversarial objective; DiA's contrastive + orthogonality approach is more stable.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of three-factor VAE + MoE + dual constraints demonstrates theoretical depth.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two standard benchmarks with ablations and missing-modality experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations, clear architectural diagrams, and coherent logical flow.
- Value: ⭐⭐⭐⭐ Practically meaningful for multimodal medical report generation under missing modality conditions.