
Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering

Conference: AAAI 2026 | arXiv: 2508.12036 | Code: N/A | Area: Medical Imaging
Keywords: Medical VQA, Frequency-Domain Fusion, Quantum Retrieval-Augmented Generation, Fast Fourier Transform, Cross-Modal Reasoning

TL;DR

This paper proposes Q-FSRU, a model that transforms medical image and text features into the frequency domain via FFT for multimodal fusion, and incorporates external medical knowledge through a quantum-inspired retrieval-augmented generation (Quantum RAG) mechanism, achieving 90% accuracy and a ROC-AUC of 0.9541 on the VQA-RAD dataset.

Background & Motivation

Medical Visual Question Answering (Medical VQA) is an interdisciplinary task that integrates computer vision, natural language processing, and clinical reasoning. In real-world clinical settings, radiologists frequently pose questions about medical images (e.g., "Are there pulmonary lesions?" or "Does the CT show effusion?"), answering which demands not only an understanding of visual content but also contextual knowledge and deep comprehension of natural language.

Existing models exhibit several core limitations:

Limitations of Spatial-Domain Features: Most VQA models operate in the spatial domain, relying on convolutional or attention-based mechanisms for feature extraction. Such approaches may overlook subtle frequency patterns in medical images—frequency-domain representations can capture global contextual cues that are often missed in the spatial domain.

Shallow Alignment in Classical Retrieval Methods: Although retrieval-augmented generation (RAG) methods can introduce external knowledge, they typically rely on classical similarity metrics such as cosine similarity, whose shallow matching struggles to align retrieved knowledge with the demands of medical reasoning.

Data Scarcity and High-Stakes Decision-Making: The distinctive characteristics of the medical domain—domain-specific language, complex imaging modalities, data scarcity, and high-stakes decisions—make it difficult to directly transfer general-purpose VQA models.

The central motivation of this paper is to combine the denoising and global pattern capture capabilities of frequency-domain representations with the deep knowledge alignment capability of quantum-inspired retrieval, thereby constructing a more robust and interpretable Medical VQA system.

Method

Overall Architecture

The Q-FSRU model consists of four core modules (see Figure 1):

  1. Unimodal Feature Encoding
  2. Frequency Spectrum Representation and Fusion (FSRU)
  3. Quantum-Augmented Knowledge Retrieval (Quantum RAG)
  4. Joint Reasoning and Answer Generation

The overall task is formulated as a classification problem: given a medical image \(x_i^{\text{image}} \in \mathbb{R}^{H \times W \times 3}\) and a natural language question \(q_i\), the model predicts an answer \(\hat{y}_i \in \{0, 1\}\).

Key Designs

  1. Unimodal Feature Encoding

  • Text Encoder: BioBERT is used to encode clinical questions into 768-dimensional text embeddings \(t = E_t(Q) \in \mathbb{R}^{d_t}\).

  • Visual Encoder: An ImageNet-pretrained ResNet-50 extracts 2048-dimensional visual embeddings \(v = E_v(I) \in \mathbb{R}^{d_v}\).

These two encoders capture semantic patterns from text and spatial features from images, respectively, providing foundational representations for subsequent frequency-domain fusion.
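
To make the encoding step concrete, here is a minimal sketch of the two encoders using Hugging Face Transformers and torchvision. The checkpoint names and CLS-token pooling are assumptions, since the paper releases no code.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from torchvision.models import resnet50, ResNet50_Weights

# Text encoder: BioBERT -> 768-d question embedding (CLS pooling assumed).
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
text_encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Visual encoder: ImageNet-pretrained ResNet-50 with the classification
# head stripped -> 2048-d image embedding.
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
visual_encoder = torch.nn.Sequential(*list(backbone.children())[:-1])

@torch.no_grad()
def encode(question: str, image: torch.Tensor):
    """image: a normalized (3, 224, 224) tensor."""
    tokens = tokenizer(question, return_tensors="pt", truncation=True)
    t = text_encoder(**tokens).last_hidden_state[:, 0, :]  # (1, 768)
    v = visual_encoder(image.unsqueeze(0)).flatten(1)      # (1, 2048)
    return t, v
```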

  2. Frequency Spectrum Representation and Fusion (FSRU)

This is one of the core innovations of the model. Rather than directly fusing spatial-domain features, embeddings from both modalities are first transformed into the frequency domain:

\(v_{\text{freq}} = \text{FFT}(v), \quad t_{\text{freq}} = \text{FFT}(t)\)

The advantages of frequency-domain transformation include:

  • Highlighting global patterns: Frequency components can capture global semantic features that are difficult to observe in the spatial domain.
  • Suppressing noise: Irrelevant spatial noise signals are filtered out.
  • Preserving cross-modal relationships: Frequency-domain fusion better retains global associations between modalities.

Fusion is achieved via vector concatenation:

\(f_{\text{freq}} = [v_{\text{freq}} \| t_{\text{freq}}] \in \mathbb{R}^{d_v + d_t}\)

Prior to concatenation, each modality is projected through learnable linear layers to ensure dimensional compatibility, and a gated attention mechanism is applied for alignment.
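
Below is a minimal PyTorch sketch of the FSRU block under some assumptions: projection to a shared width happens before the FFT, magnitude spectra keep features real-valued, and a sigmoid gate stands in for the "gated attention"; none of these specifics are pinned down in the summary above.

```python
import torch
import torch.nn as nn

class FSRU(nn.Module):
    """Project both modalities to a shared width, move them into the
    frequency domain with an FFT, then gate and concatenate."""

    def __init__(self, d_v: int = 2048, d_t: int = 768, d: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(d_v, d)      # dimensional compatibility
        self.proj_t = nn.Linear(d_t, d)
        self.gate = nn.Linear(2 * d, 2 * d)  # "gated attention" (assumed form)

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # FFT over the feature axis; magnitudes keep downstream layers
        # real-valued (one plausible reading of the paper).
        v_freq = torch.fft.fft(self.proj_v(v), dim=-1).abs()
        t_freq = torch.fft.fft(self.proj_t(t), dim=-1).abs()
        f = torch.cat([v_freq, t_freq], dim=-1)  # f_freq = [v_freq || t_freq]
        return torch.sigmoid(self.gate(f)) * f   # gated alignment
```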

  3. Quantum Retrieval-Augmented Generation (Quantum RAG)

This is another core innovation of the model. Following frequency-domain fusion, a quantum-inspired retrieval mechanism is employed to incorporate external medical knowledge:

  • Knowledge Embedding: A collection of medical knowledge passages/keys \(k_i\) is pre-encoded using BioBERT and stored as a vector database.
  • Quantum Similarity Computation: The fused feature \(f_{\text{freq}}\) and each knowledge key \(k_i\) are encoded as quantum states \(\psi_f\) and \(\psi_{k_i}\), and the quantum inner product amplitude is computed:

\(\text{Sim}_q(f_{\text{freq}}, k_i) = |\langle \psi_f | \psi_{k_i} \rangle|^2\)

  • Top-K Aggregation: The top-\(k\) ranked knowledge vectors are retrieved and averaged:

\(k_{\text{agg}} = \text{TopK-Avg}(K, \text{Sim}_q)\)

The advantage of quantum similarity lies in its ability to capture non-classical correlations, offering finer-grained matching than conventional dot products.
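
A sketch of the retrieval step under the simplest reading: "states" are L2-normalized vectors, and the similarity is the squared inner product. The function names and the choice of \(k\) are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def quantum_similarity(f_freq: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Sim_q(f, k_i) = |<psi_f | psi_{k_i}>|^2, with each "state" taken as an
    L2-normalized real vector (an assumption; the paper does not specify the
    amplitude encoding). Assumes f_freq was already projected to the width
    of the BioBERT knowledge keys."""
    psi_f = F.normalize(f_freq, dim=-1)  # (d,)
    psi_k = F.normalize(keys, dim=-1)    # (n, d)
    return (psi_k @ psi_f) ** 2          # (n,) squared inner-product amplitudes

def topk_avg(keys: torch.Tensor, sims: torch.Tensor, k: int = 5) -> torch.Tensor:
    """k_agg = mean of the k knowledge vectors with the highest Sim_q."""
    idx = sims.topk(min(k, keys.size(0))).indices
    return keys[idx].mean(dim=0)
```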

  4. Answer Generation

The final fused feature is passed through a fully connected layer followed by Softmax to produce predictions:

\(\hat{y} = \text{Softmax}(W \cdot f + b)\)

The class with the highest probability is selected as the model output.
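
The prediction head reduces to a single linear layer; concatenating \(f_{\text{freq}}\) with \(k_{\text{agg}}\) as the "final fused feature" is an assumption here, since the paper does not spell out how the retrieved knowledge enters the classifier.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Linear classifier over the final fused feature. Concatenating f_freq
    with k_agg is an assumption; the paper does not state how the retrieved
    knowledge is combined."""

    def __init__(self, d_in: int, n_classes: int = 2):
        super().__init__()
        self.fc = nn.Linear(d_in, n_classes)

    def forward(self, f_freq: torch.Tensor, k_agg: torch.Tensor) -> torch.Tensor:
        f = torch.cat([f_freq, k_agg], dim=-1)
        return self.fc(f).softmax(dim=-1)  # y_hat = Softmax(W·f + b)
```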

Loss & Training

  • Loss Function: Focal Loss with Label Smoothing to address class imbalance (see the sketch after this list)
  • Optimizer: Adam with learning rate \(1 \times 10^{-4}\)
  • Training Setup: 30 epochs, batch size 8, cosine annealing learning rate schedule
  • Evaluation Protocol: 5-fold stratified cross-validation
  • Hardware: CPU training (increases training time but does not affect experimental integrity)
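
A minimal sketch of the loss listed above; \(\gamma\) and the smoothing factor are illustrative values, as the paper does not report them.

```python
import torch
import torch.nn.functional as F

def focal_loss_label_smoothing(logits, targets, gamma=2.0, smoothing=0.1):
    """Focal loss computed on label-smoothed targets. gamma and smoothing
    are illustrative values; the paper does not report them."""
    n_classes = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    # Smoothed one-hot targets: 1 - smoothing on the true class,
    # smoothing / (C - 1) elsewhere.
    smooth = torch.full_like(log_p, smoothing / (n_classes - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - smoothing)
    # Focal modulation down-weights easy, confidently-predicted examples.
    focal = (1.0 - log_p.exp()) ** gamma
    return -(smooth * focal * log_p).sum(dim=-1).mean()

logits = torch.randn(8, 2)           # batch of 8, binary task
targets = torch.randint(0, 2, (8,))
loss = focal_loss_label_smoothing(logits, targets)
# Per the paper: Adam (lr = 1e-4), 30 epochs, batch size 8, e.g.
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
#   scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)
```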

Key Experimental Results

Main Results

| Metric                 | Q-FSRU | Notes                                                  |
|------------------------|--------|--------------------------------------------------------|
| Overall Accuracy       | 90.00% | Binary classification task                             |
| Precision              | 83.04% | Correctness among positive predictions                 |
| Recall                 | 78.15% | Coverage of positive instances                         |
| F1-Score               | 80.52% | Harmonic mean of precision and recall                  |
| ROC-AUC                | 0.9541 | Discriminability between positive and negative classes |
| Peak Training Accuracy | 92.00% | Model learning capacity                                |

Per-class results:

| Class                             | Precision | Recall | F1-Score | Support |
|-----------------------------------|-----------|--------|----------|---------|
| Class 0 (Negative/Non-diagnostic) | 92.31%    | 94.26% | 93.27%   | 331     |
| Class 1 (Positive/Diagnostic)     | 83.04%    | 78.15% | 80.52%   | 119     |

Cross-Validation Results

Detailed 5-fold cross-validation results:

| Fold | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|------|----------|-----------|--------|----------|---------|
| 1    | 0.913    | 0.905     | 0.922  | 0.913    | 0.945   |
| 2    | 0.908    | 0.897     | 0.934  | 0.915    | 0.948   |
| 3    | 0.921    | 0.912     | 0.939  | 0.925    | 0.951   |
| 4    | 0.917    | 0.905     | 0.943  | 0.923    | 0.950   |
| 5    | 0.920    | 0.914     | 0.936  | 0.925    | 0.953   |
| Mean | 0.916    | 0.906     | 0.935  | 0.920    | 0.949   |

Key Findings

  1. ROC-AUC reaches 0.9541: The model demonstrates strong discriminability between positive and negative classes across varying thresholds.
  2. Slightly lower recall for Class 1 (78.15%): Some true positives are missed, reflecting the challenge posed by the limited number of positive samples.
  3. Stable 5-fold validation results: Accuracy fluctuates between 0.908 and 0.921 with minimal standard deviation.
  4. Reasonable train-validation accuracy gap (92% vs. 90%), with no signs of severe overfitting.
  5. Confusion matrix: among 450 samples, 312 true negatives, 93 true positives, 19 false positives, and 26 false negatives.

Highlights & Insights

  1. Frequency-domain fusion is a compelling idea: Applying FFT to fuse text and image embeddings can theoretically capture global frequency patterns missed in the spatial domain. This idea has been validated by Lao et al. (2024) for multimodal rumor detection; the present paper extends it to Medical VQA.

  2. Quantum-inspired retrieval is conceptually novel: Using quantum state inner product amplitudes as a similarity measure is an interesting attempt that theoretically enables the capture of nonlinear, non-classical semantic associations.

  3. Clear problem formulation: The formalization of Medical VQA as a classification problem is mathematically complete and well-defined.

Limitations & Future Work

  1. Absence of direct comparison with existing methods: The paper acknowledges "a lack of directly comparable models" and relies solely on its own 5-fold cross-validation as a baseline. The absence of comparisons with established Medical VQA methods such as MEVF, BAN, and SAN undermines the persuasiveness of the results.

  2. Evaluation limited to a single dataset (VQA-RAD): VQA-RAD contains only approximately 3,500 QA pairs and 315 images, making it a relatively small-scale benchmark. Validation on other benchmarks such as PathVQA and SLAKE is absent.

  3. Oversimplified binary classification setting: Reducing Medical VQA to yes/no binary classification disregards open-ended questions (e.g., "What is this lesion?"), limiting its practical clinical value.

  4. Questionable substantiveness of the "quantum" component: The so-called "quantum states" are in practice normalized classical vectors combined with inner product operations, bearing no substantive connection to genuine quantum computing (e.g., quantum superposition, entanglement). More accurately, this constitutes a classical similarity computation inspired by quantum mechanics (see the identity after this list).

  5. Constraints of CPU training: Training on CPU limits scalability and prevents validation on larger-scale datasets or models.

  6. Frequency-domain fusion reduces to simple concatenation: After FFT, the frequency vectors of two modalities are directly concatenated, lacking more sophisticated frequency-domain interaction mechanisms such as frequency-domain attention or frequency band selection.
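
Concretely, when the "states" are nothing more than L2-normalized real vectors, the quantum similarity criticized in point 4 collapses to squared cosine similarity, which any classical retrieval system can compute:

\(\text{Sim}_q(f, k_i) = |\langle \psi_f | \psi_{k_i} \rangle|^2 = \left( \frac{f \cdot k_i}{\|f\| \, \|k_i\|} \right)^2 = \cos^2 \theta\)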

Related Work

  • FDTrans (Zhou et al. 2023): Frequency-domain Transformer for multimodal medical image analysis—shares the frequency-domain processing paradigm with this paper, though FDTrans does not address VQA.
  • FreqU-FNet (Singh & Patel 2024): Frequency-aware U-Net for medical segmentation—similarly employs FFT but targets pixel-level tasks.
  • RAG (Lewis et al. 2020): Retrieval-augmented generation—the foundational framework underlying Quantum RAG.
  • Lao et al. (2024): Multimodal spectral fusion for rumor detection—the direct source of inspiration for Q-FSRU.

While the combination of frequency-domain fusion and quantum-inspired retrieval is novel, the effectiveness of individual components still requires more rigorous ablation studies for validation.

Rating

  • Novelty: ⭐⭐⭐ — The combination of frequency-domain fusion and quantum-inspired retrieval is a novel attempt, though the depth of innovation in individual components is limited.
  • Technical Depth: ⭐⭐ — The method design is relatively straightforward; the "quantum" component is novel primarily in name, with the underlying computation being entirely classical.
  • Practicality: ⭐⭐ — Binary classification on a small-scale dataset, without direct comparison to SOTA methods, limits practical clinical value.
  • Clarity: ⭐⭐⭐ — Mathematical formulations are clear, but the experimental section lacks comparison baselines.