Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering¶
Conference: ICLR 2026 arXiv: 2509.23899 Code: None Area: Medical Imaging Keywords: Medical VQA, Frequency-Domain Fusion, Quantum Retrieval Augmentation, Multimodal Fusion, Contrastive Learning
TL;DR¶
This paper proposes the Q-FSRU framework, which transforms medical image and text features into the frequency domain via FFT for fusion, and introduces a quantum-inspired retrieval augmentation mechanism (Quantum RAG) to retrieve medical facts from an external knowledge base, achieving 90.0% accuracy on the VQA-RAD dataset.
Background & Motivation¶
- Medical Visual Question Answering (Med-VQA) requires simultaneous understanding of medical images and clinical questions; existing methods face challenges including data scarcity, complex clinical terminology, and diverse imaging modalities.
- Most approaches (e.g., LLaVA-Med, STLLaVA-Med) operate solely in the spatial domain, potentially overlooking pathological pattern information encoded in the frequency domain.
- Existing retrieval augmentation methods rely on classical cosine similarity, which may fail to fully capture the complex semantic relationships required for clinical reasoning.
- Core Motivation: Frequency-domain transformations can capture global contextual patterns missed by spatial processing; quantum-inspired similarity metrics may outperform classical retrieval methods.
Method¶
Overall Architecture¶
Q-FSRU consists of four core modules: (1) multimodal feature extraction, (2) FFT frequency-domain processing, (3) quantum-inspired knowledge retrieval, and (4) multimodal fusion with contrastive learning. The pipeline is: image and text features → FFT spectral transformation → cross-modal co-selection → Quantum RAG retrieval → MLP classification.
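Conceptually the whole pipeline is a single forward pass. A minimal sketch follows, assuming pooled per-sample feature vectors; `img_proj`, `fsru`, `quantum_rag`, and `mlp_head` are hypothetical callables standing in for the modules detailed below.

```python
import torch

def q_fsru_forward(v, t, img_proj, fsru, quantum_rag, mlp_head):
    """Sketch of the Q-FSRU forward pass; all module arguments are
    hypothetical stand-ins for the paper's components."""
    v_freq = torch.abs(torch.fft.fft(img_proj(v), dim=-1))  # |F(v_proj)|
    t_freq = torch.abs(torch.fft.fft(t, dim=-1))            # |F(t)|
    fused = fsru(t_freq, v_freq)       # cross-modal co-selection in the frequency domain
    k_agg = quantum_rag(fused)         # fidelity-weighted external knowledge vector
    return mlp_head(torch.cat([fused, k_agg], dim=-1))      # answer logits
```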
Key Designs¶
- Frequency-Spectral Representation and Fusion (FSRU):
- 1D FFT is applied separately to text features \(t\) and projected image features \(v_{\text{proj}}\), with magnitude spectra extracted: \(t_{\text{freq}} = |\mathcal{F}(t)|\), \(v_{\text{freq}} = |\mathcal{F}(v_{\text{proj}})|\)
- A learnable filter bank (\(K=4\)) compresses the frequency representations.
- Gated attention implements cross-modal co-selection: \(g_{\text{text}} = \sigma(W_{\text{gate1}} \cdot \text{AvgPool}(v_{\text{compressed}}))\), yielding enhanced text features \(t_{\text{enhanced}} = t_{\text{compressed}} \odot g_{\text{text}}\)
- Design Motivation: Frequency-domain transformation captures global pathological patterns in medical images; the gating mechanism enables mutual enhancement between modalities.
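A minimal PyTorch sketch of this step, assuming already-pooled `(batch, dim)` features projected to a shared size; the \(K=4\) filter bank is collapsed into one learnable spectral filter per modality, and the image-side gate is added by symmetry since only the text-side equation is spelled out above.

```python
import torch
import torch.nn as nn

class FSRUBlock(nn.Module):
    """Sketch of FFT fusion with gated cross-modal co-selection (dims illustrative)."""
    def __init__(self, dim=768):
        super().__init__()
        self.filt_t = nn.Parameter(torch.ones(dim))  # stands in for the K=4 filter bank
        self.filt_v = nn.Parameter(torch.ones(dim))
        self.w_gate_text = nn.Linear(dim, dim)       # W_gate1
        self.w_gate_img = nn.Linear(dim, dim)        # symmetric image-side gate

    def forward(self, t, v_proj):
        t_freq = torch.abs(torch.fft.fft(t, dim=-1))       # t_freq = |F(t)|
        v_freq = torch.abs(torch.fft.fft(v_proj, dim=-1))  # v_freq = |F(v_proj)|
        t_comp = t_freq * self.filt_t                      # compressed spectra
        v_comp = v_freq * self.filt_v                      # (AvgPool assumed already applied)
        g_text = torch.sigmoid(self.w_gate_text(v_comp))   # g_text = sigma(W_gate1 . v_comp)
        g_img = torch.sigmoid(self.w_gate_img(t_comp))
        return t_comp * g_text, v_comp * g_img             # t_enhanced, v_enhanced
```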
- Quantum-Inspired Retrieval Augmentation (Quantum RAG):
- Embedding vectors are represented as quantum states: \(|\psi(x)\rangle = x / \|x\|_2\)
- Density matrices \(\rho(x) = |\psi(x)\rangle\langle\psi(x)|\) provide statistical robustness.
- Uhlmann fidelity measures similarity between the query and knowledge base entries: \(\text{Fid}(\rho_q, \rho_{k_i})\)
- Top-3 retrieved entries are aggregated via softmax-weighted summation: \(k_{\text{agg}} = \sum_{j=1}^{3} \text{softmax}(\text{Sim}_j / \tau) \cdot k_j\), with temperature \(\tau = 0.1\)
- Design Motivation: Quantum fidelity may more effectively capture semantic relationships in high-dimensional spaces.
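Because the states defined here are pure, the Uhlmann fidelity collapses to the squared inner product \(|\langle\psi_q|\psi_{k_i}\rangle|^2\), so no density matrices need to be materialized. A minimal sketch (names illustrative):

```python
import torch
import torch.nn.functional as F

def quantum_rag(q, kb, top_k=3, tau=0.1):
    """Fidelity-based retrieval sketch. q: (d,) query embedding;
    kb: (n, d) knowledge-base embeddings."""
    psi_q = F.normalize(q, dim=-1)               # |psi(q)> = q / ||q||_2
    psi_k = F.normalize(kb, dim=-1)              # |psi(k_i)>
    fid = (psi_k @ psi_q).pow(2)                 # Fid(rho_q, rho_k_i) for pure states
    sims, idx = fid.topk(top_k)                  # top-3 entries
    w = torch.softmax(sims / tau, dim=-1)        # temperature-scaled weights (tau = 0.1)
    return (w.unsqueeze(-1) * kb[idx]).sum(0)    # k_agg = sum_j softmax(Sim_j/tau) * k_j
```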
- Dual Contrastive Learning Framework:
- Intra-modal contrastive loss: \(\mathcal{L}_{\text{intra}}\), temperature \(\tau = 0.07\)
- Cross-modal contrastive loss: \(\mathcal{L}_{\text{cross}}\), temperature \(\tau = 0.05\)
- Design Motivation: Pulls representations of same-class samples closer while pushing apart those of different classes.
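The summary does not specify how positive pairs are formed, so the stand-in below uses a generic InfoNCE form (matched rows as positives); only the temperatures are taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau):
    """Generic InfoNCE: row i of z1 and row i of z2 are a positive pair,
    every other row is a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                         # pairwise similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# l_intra_text = info_nce(t_view1, t_view2, tau=0.07)         # intra-modal, tau = 0.07
# l_cross      = info_nce(t_enhanced, v_enhanced, tau=0.05)   # cross-modal, tau = 0.05
```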
Loss & Training¶
- Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + (0.3 \cdot \frac{\mathcal{L}_{\text{intra-text}} + \mathcal{L}_{\text{intra-image}}}{2} + 0.7 \cdot \mathcal{L}_{\text{cross}})\)
- Optimizer: Adam, learning rate \(5 \times 10^{-5}\), L2 regularization weight \(10^{-5}\)
- 5-fold cross-validation, batch size 32, maximum 50 epochs, step-based decay (0.98 per 5 epochs), early stopping patience of 10
- Image encoder: ViT-B/16 (ImageNet pretrained); text encoder: 300-dimensional word embeddings with mean pooling
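A direct transcription of the objective, plus the reported optimizer and schedule settings (helper name hypothetical):

```python
import torch

def total_loss(l_ce, l_intra_text, l_intra_image, l_cross):
    """L_total = L_CE + 0.3 * (L_intra_text + L_intra_image)/2 + 0.7 * L_cross."""
    return l_ce + 0.3 * 0.5 * (l_intra_text + l_intra_image) + 0.7 * l_cross

# Optimizer and step decay per the reported settings (model: any nn.Module):
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-5)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.98)
```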
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Q-FSRU | Prev. SOTA (FSRU) | Gain |
|---|---|---|---|---|
| VQA-RAD | Accuracy | 90.0% | 87.1% | +2.9% |
| VQA-RAD | F1-Score | 85.2% | 82.3% | +2.9% |
| VQA-RAD | AUC | 0.954 | 0.921 | +0.033 |
| VQA-RAD→PathVQA | Accuracy | 81.7% | 78.4% | +3.3% |
| PathVQA→VQA-RAD | Accuracy | 80.3% | 76.9% | +3.4% |
Ablation Study¶
| Configuration | Accuracy | Δ Acc. | Note |
|---|---|---|---|
| Q-FSRU (Full) | 90.0% | — | Complete model |
| w/o Frequency Processing | 85.1% | -4.9% | Largest contributor |
| w/o Quantum Retrieval | 86.8% | -3.2% | Significant contribution |
| w/o Contrastive Learning | 87.3% | -2.7% | Notable contribution |
| Spatial-only Fusion | 84.2% | -5.8% | Worst configuration |
| Cosine Similarity (replacing quantum) | 88.1% | -1.9% | Quantum similarity outperforms cosine |
Key Findings¶
- Removing frequency-domain processing causes the largest single-component accuracy drop (−4.9 points), indicating that spectral representations capture clinically relevant patterns missed in the spatial domain.
- Quantum-inspired retrieval outperforms classical cosine similarity by 1.9 points, though the margin is modest.
- The model contains only 92.4M parameters, far fewer than LLaVA-Med/STLLaVA-Med (7B), yet achieves superior performance on VQA-RAD.
- Strong cross-dataset generalization (+3.3%/+3.4%) suggests that the learned representations transfer well across domains.
Highlights & Insights¶
- Introducing frequency-domain analysis into Med-VQA is a novel research direction; the global information captured by FFT may be particularly beneficial for medical image analysis.
- Quantum-inspired retrieval is an interesting yet relatively preliminary exploration, applying quantum state representations to knowledge retrieval.
- The compact model size (92.4M parameters) offers practical deployment value in resource-constrained environments.
Limitations & Future Work¶
- Validation is limited to two datasets (VQA-RAD and PathVQA), both of modest scale (VQA-RAD contains only 3,515 QA pairs).
- The theoretical advantages of quantum retrieval are extensively discussed, but the empirical gains remain relatively limited (only 1.9 points over cosine similarity).
- Comparisons with recent large language models (e.g., GPT-4V) are absent.
- The construction and maintenance of the knowledge base are not described in detail.
- Performance on more complex multi-choice or open-ended question answering remains unknown.
Related Work & Insights¶
- Frequency-domain approaches have demonstrated success in image analysis (FDTrans) and rumor detection (Lao et al. 2024); this work extends the paradigm to Med-VQA.
- Quantum-inspired information retrieval (Uprety et al. 2021) offers a novel perspective on similarity computation.
- Cross-modal contrastive learning has become a standard practice in multimodal fusion.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of frequency-domain processing and quantum retrieval is novel in the Med-VQA context.
- Experimental Thoroughness: ⭐⭐⭐ Datasets are small; comparisons with recent large vision-language models are lacking.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and mathematical derivations are complete.
- Value: ⭐⭐⭐ The lightweight design is valuable, but the practical advantages of quantum retrieval require more rigorous validation.