Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering¶
Conference: ICLR 2026 arXiv: 2509.23899 Code: None Area: Medical Imaging Keywords: Medical VQA, Frequency-Domain Fusion, Quantum Retrieval Augmentation, Multimodal Fusion, Contrastive Learning
TL;DR¶
This paper proposes the Q-FSRU framework, which transforms medical image and text features into the frequency domain via FFT for fusion, and introduces a quantum-inspired retrieval augmentation mechanism (Quantum RAG) to retrieve medical facts from an external knowledge base, achieving 90.0% accuracy on the VQA-RAD dataset.
Background & Motivation¶
- Medical Visual Question Answering (Med-VQA) requires simultaneous understanding of medical images and clinical questions; existing methods face challenges including data scarcity, complex clinical terminology, and diverse imaging modalities.
- Most approaches (e.g., LLaVA-Med, STLLaVA-Med) operate solely in the spatial domain, potentially overlooking pathological pattern information encoded in the frequency domain.
- Existing retrieval augmentation methods rely on classical cosine similarity, which may fail to fully capture the complex semantic relationships required for clinical reasoning.
- Core Motivation: Frequency-domain transformations can capture global contextual patterns missed by spatial processing; quantum-inspired similarity metrics may outperform classical retrieval methods.
Method¶
Overall Architecture¶
Q-FSRU consists of four core modules: (1) multimodal feature extraction, (2) FFT frequency-domain processing, (3) quantum-inspired knowledge retrieval, and (4) multimodal fusion with contrastive learning. The pipeline is: image and text features → FFT spectral transformation → cross-modal co-selection → Quantum RAG retrieval → MLP classification.
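Conceptually the whole pipeline is a single forward pass. A minimal sketch follows, assuming pooled per-sample feature vectors; `img_proj`, `fsru`, `quantum_rag`, and `mlp_head` are hypothetical callables standing in for the modules detailed below.

```python
import torch

def q_fsru_forward(v, t, img_proj, fsru, quantum_rag, mlp_head):
    """Sketch of the Q-FSRU forward pass; all module arguments are
    hypothetical stand-ins for the paper's components."""
    v_freq = torch.abs(torch.fft.fft(img_proj(v), dim=-1))  # |F(v_proj)|
    t_freq = torch.abs(torch.fft.fft(t, dim=-1))            # |F(t)|
    fused = fsru(t_freq, v_freq)       # cross-modal co-selection in the frequency domain
    k_agg = quantum_rag(fused)         # fidelity-weighted external knowledge vector
    return mlp_head(torch.cat([fused, k_agg], dim=-1))      # answer logits
```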
Key Designs¶
- Frequency-Spectral Representation and Fusion (FSRU):
- 1D FFT is applied separately to text features \(t\) and projected image features \(v_{\text{proj}}\), with magnitude spectra extracted: \(t_{\text{freq}} = |\mathcal{F}(t)|\), \(v_{\text{freq}} = |\mathcal{F}(v_{\text{proj}})|\)
- A learnable filter bank (\(K=4\)) compresses the frequency representations.
- Gated attention implements cross-modal co-selection: \(g_{\text{text}} = \sigma(W_{\text{gate1}} \cdot \text{AvgPool}(v_{\text{compressed}}))\), yielding enhanced text features \(t_{\text{enhanced}} = t_{\text{compressed}} \odot g_{\text{text}}\)
- Design Motivation: Frequency-domain transformation captures global pathological patterns in medical images; the gating mechanism enables mutual enhancement between modalities.
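A minimal PyTorch sketch of this step, assuming already-pooled `(batch, dim)` features projected to a shared size; the \(K=4\) filter bank is collapsed into one learnable spectral filter per modality, and the image-side gate is added by symmetry since only the text-side equation is spelled out above.

```python
import torch
import torch.nn as nn

class FSRUBlock(nn.Module):
    """Sketch of FFT fusion with gated cross-modal co-selection (dims illustrative)."""
    def __init__(self, dim=768):
        super().__init__()
        self.filt_t = nn.Parameter(torch.ones(dim))  # stands in for the K=4 filter bank
        self.filt_v = nn.Parameter(torch.ones(dim))
        self.w_gate_text = nn.Linear(dim, dim)       # W_gate1
        self.w_gate_img = nn.Linear(dim, dim)        # symmetric image-side gate

    def forward(self, t, v_proj):
        t_freq = torch.abs(torch.fft.fft(t, dim=-1))       # t_freq = |F(t)|
        v_freq = torch.abs(torch.fft.fft(v_proj, dim=-1))  # v_freq = |F(v_proj)|
        t_comp = t_freq * self.filt_t                      # compressed spectra
        v_comp = v_freq * self.filt_v                      # (AvgPool assumed already applied)
        g_text = torch.sigmoid(self.w_gate_text(v_comp))   # g_text = sigma(W_gate1 . v_comp)
        g_img = torch.sigmoid(self.w_gate_img(t_comp))
        return t_comp * g_text, v_comp * g_img             # t_enhanced, v_enhanced
```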
- Quantum-Inspired Retrieval Augmentation (Quantum RAG):
- Embedding vectors are represented as quantum states: \(|\psi(x)\rangle = x / \|x\|_2\)
- Density matrices \(\rho(x) = |\psi(x)\rangle\langle\psi(x)|\) provide statistical robustness.
- Uhlmann fidelity measures similarity between the query and knowledge base entries: \(\text{Fid}(\rho_q, \rho_{k_i})\)
- Top-3 retrieved entries are aggregated via softmax-weighted summation: \(k_{\text{agg}} = \sum_{j=1}^{3} \text{softmax}(\text{Sim}_j / \tau) \cdot k_j\), with temperature \(\tau = 0.1\)
- Design Motivation: Quantum fidelity may more effectively capture semantic relationships in high-dimensional spaces.
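Because the states defined here are pure, the Uhlmann fidelity collapses to the squared inner product \(|\langle\psi_q|\psi_{k_i}\rangle|^2\), so no density matrices need to be materialized. A minimal sketch (names illustrative):

```python
import torch
import torch.nn.functional as F

def quantum_rag(q, kb, top_k=3, tau=0.1):
    """Fidelity-based retrieval sketch. q: (d,) query embedding;
    kb: (n, d) knowledge-base embeddings."""
    psi_q = F.normalize(q, dim=-1)               # |psi(q)> = q / ||q||_2
    psi_k = F.normalize(kb, dim=-1)              # |psi(k_i)>
    fid = (psi_k @ psi_q).pow(2)                 # Fid(rho_q, rho_k_i) for pure states
    sims, idx = fid.topk(top_k)                  # top-3 entries
    w = torch.softmax(sims / tau, dim=-1)        # temperature-scaled weights (tau = 0.1)
    return (w.unsqueeze(-1) * kb[idx]).sum(0)    # k_agg = sum_j softmax(Sim_j/tau) * k_j
```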
- Dual Contrastive Learning Framework:
- Intra-modal contrastive loss: \(\mathcal{L}_{\text{intra}}\), temperature \(\tau = 0.07\)
- Cross-modal contrastive loss: \(\mathcal{L}_{\text{cross}}\), temperature \(\tau = 0.05\)
- Design Motivation: Pulls representations of same-class samples closer while pushing apart those of different classes.
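The summary does not specify how positive pairs are formed, so the stand-in below uses a generic InfoNCE form (matched rows as positives); only the temperatures are taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau):
    """Generic InfoNCE: row i of z1 and row i of z2 are a positive pair,
    every other row is a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                         # pairwise similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# l_intra_text = info_nce(t_view1, t_view2, tau=0.07)         # intra-modal, tau = 0.07
# l_cross      = info_nce(t_enhanced, v_enhanced, tau=0.05)   # cross-modal, tau = 0.05
```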
Loss & Training¶
- Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + (0.3 \cdot \frac{\mathcal{L}_{\text{intra-text}} + \mathcal{L}_{\text{intra-image}}}{2} + 0.7 \cdot \mathcal{L}_{\text{cross}})\)
- Optimizer: Adam, learning rate \(5 \times 10^{-5}\), L2 regularization weight \(10^{-5}\)
- 5-fold cross-validation, batch size 32, maximum 50 epochs, step-based decay (0.98 per 5 epochs), early stopping patience of 10
- Image encoder: ViT-B/16 (ImageNet pretrained); text encoder: 300-dimensional word embeddings with mean pooling
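A direct transcription of the objective, plus the reported optimizer and schedule settings (helper name hypothetical):

```python
import torch

def total_loss(l_ce, l_intra_text, l_intra_image, l_cross):
    """L_total = L_CE + 0.3 * (L_intra_text + L_intra_image)/2 + 0.7 * L_cross."""
    return l_ce + 0.3 * 0.5 * (l_intra_text + l_intra_image) + 0.7 * l_cross

# Optimizer and step decay per the reported settings (model: any nn.Module):
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-5)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.98)
```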
Key Experimental Results¶
Main Results¶
| Dataset | Metric | Q-FSRU | Prev. SOTA (FSRU) | Gain |
|---|---|---|---|---|
| VQA-RAD | Accuracy | 90.0% | 87.1% | +2.9% |
| VQA-RAD | F1-Score | 85.2% | 82.3% | +2.9% |
| VQA-RAD | AUC | 0.954 | 0.921 | +0.033 |
| VQA-RAD→PathVQA | Accuracy | 81.7% | 78.4% | +3.3% |
| PathVQA→VQA-RAD | Accuracy | 80.3% | 76.9% | +3.4% |
Ablation Study¶
| Configuration | Accuracy | Δ Acc. | Note |
|---|---|---|---|
| Q-FSRU (Full) | 90.0% | — | Complete model |
| w/o Frequency Processing | 85.1% | -4.9% | Largest contributor |
| w/o Quantum Retrieval | 86.8% | -3.2% | Significant contribution |
| w/o Contrastive Learning | 87.3% | -2.7% | Notable contribution |
| Spatial-only Fusion | 84.2% | -5.8% | Worst configuration |
| Cosine Similarity (replacing quantum) | 88.1% | -1.9% | Quantum similarity outperforms cosine |
Key Findings¶
- Removing frequency-domain processing causes the largest single-component accuracy drop (−4.9 points), indicating that spectral representations capture clinically relevant patterns missed in the spatial domain.
- Quantum-inspired retrieval outperforms classical cosine similarity by 1.9 points, though the margin is modest.
- The model contains only 92.4M parameters, far fewer than LLaVA-Med/STLLaVA-Med (7B), yet achieves superior performance on VQA-RAD.
- Strong cross-dataset generalization (+3.3%/+3.4%) suggests that the learned representations transfer well across domains.
Highlights & Insights¶
- Introducing frequency-domain analysis into Med-VQA is a novel research direction; the global information captured by FFT may be particularly beneficial for medical image analysis.
- Quantum-inspired retrieval is an interesting yet relatively preliminary exploration, applying quantum state representations to knowledge retrieval.
- The compact model size (92.4M parameters) offers practical deployment value in resource-constrained environments.
Limitations & Future Work¶
- Validation is limited to two datasets (VQA-RAD and PathVQA), both of modest scale (VQA-RAD contains only 3,515 QA pairs).
- The theoretical advantages of quantum retrieval are extensively discussed, but the empirical gains remain relatively limited (only 1.9 points over cosine similarity).
- Comparisons with recent large language models (e.g., GPT-4V) are absent.
- The construction and maintenance of the knowledge base are not described in detail.
- Performance on more complex multi-choice or open-ended question answering remains unknown.
Related Work & Insights¶
- Frequency-domain approaches have demonstrated success in image analysis (FDTrans) and rumor detection (Lao et al. 2024); this work extends the paradigm to Med-VQA.
- Quantum-inspired information retrieval (Uprety et al. 2021) offers a novel perspective on similarity computation.
- Cross-modal contrastive learning has become a standard practice in multimodal fusion.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of frequency-domain processing and quantum retrieval is novel in the Med-VQA context.
- Experimental Thoroughness: ⭐⭐⭐ Datasets are small; comparisons with recent large vision-language models are lacking.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and mathematical derivations are complete.
- Value: ⭐⭐⭐ The lightweight design is valuable, but the practical advantages of quantum retrieval require more rigorous validation.