Retrieving to Recover: Towards Incomplete Audio-Visual Question Answering via Semantic-consistent Purification

Conference: ACL 2026
arXiv: 2604.10695
Code: None
Area: Audio & Speech / Multimodal Learning
Keywords: Audio-Visual Question Answering, Missing Modality, Retrieval-based Recovery, Semantic Purification, Mixture-of-Experts

TL;DR

This paper proposes the R2ScP framework, which shifts the missing modality handling paradigm in AVQA from traditional generative completion to retrieval-based recovery. By employing cross-modal retrieval and a context-aware adaptive purification mechanism to eliminate retrieval noise, it significantly improves question-answering performance in scenarios with incomplete modalities.

Background & Motivation

Background: Audio-Visual Question Answering (AVQA) requires models to perform reasoning across visual, audio, and textual modalities to understand dynamic scenes. Current methods typically assume that all modality data are fully available; however, performance degrades severely in practical scenarios such as equipment failure, sensor occlusion, or data transmission interruptions.

Limitations of Prior Work: Mainstream approaches for handling missing modalities rely on generative completion, that is, synthesizing pseudo-features of the missing modality from the available ones. However, generative models tend to produce only "common knowledge": generalized representations that lack fine-grained modality-specific information. For example, when inferring missing audio from a visual scene of a concert, a generative model might synthesize a generic "music" embedding but fail to capture the timbre of the specific instrument visible in the frame, thereby introducing semantic hallucinations and noise.

Key Challenge: Generative methods essentially "imagine" missing information from existing modalities, and their outputs are limited by cross-modality shared knowledge, failing to recover unique modality-specific information. This information loss directly impacts question-answering tasks that require precise reasoning.

Goal: To shift the paradigm of handling missing modalities from generation to retrieval—recalling real, high-quality feature segments from a semantic database instead of synthesizing imperfect hallucinations.

Key Insight: The authors observe that real-world feature libraries contain a vast amount of reusable modality-specific knowledge; the key lies in how to accurately retrieve and denoise this information.

Core Idea: Replace generative completion with cross-modal retrieval and filter retrieval noise through a context-aware purification mechanism to preserve fine-grained modality-specific knowledge.

Method

Overall Architecture

R2ScP takes AVQA inputs in which certain modalities (audio, visual, or text) may be missing and outputs the answer. Its core mechanism is to replace "generative completion" of the missing modality with "retrieval-based recovery." The pipeline consists of three steps. First, the Cross-Modal Retrieval (CMR) module uses the available modalities as queries within a unified semantic space to recall real candidate features from an external memory bank. Second, the Context-aware Adaptive Purification (CAP) mechanism uses both the context of the available modalities and the textual question as constraints to eliminate retrieval noise and inject high-quality semantics. Finally, a two-stage Mixture-of-Experts weights and fuses the original and recovered modalities according to their reliability before passing them to a classification head that produces the answer.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["AVQA Input<br/>Audio/Visual/Text, one modality missing"] --> B["Cross-modal Retrieval Module (CMR)<br/>Query memory bank with available modalities,<br/>recall top-n real candidates"]
    B --> C
    subgraph C["Context-aware Adaptive Purification (CAP)"]
        direction TB
        C1["Consistency Noise Identification<br/>Calculate inconsistency score via context anchors,<br/>select top-k noise tokens"]
        C2["Text-guided Semantic Acquisition<br/>Question filters high-quality semantic indices"]
        C3["Selective Feature Injection<br/>Replace features only at noise positions"]
        C1 --> C2 --> C3
    end
    C --> D["Two-stage Mixture-of-Experts<br/>Independent Expert Pre-training +<br/>Router Weighted Fusion by Reliability"]
    D --> E["Classification Head<br/>Output QA Answer"]

Key Designs

1. Cross-Modal Retrieval Module (CMR): Replacing Imagined Pseudo-features with Real Feature Segments

The fundamental flaw of generative completion is that it can only "imagine" common knowledge from existing modalities, losing fine-grained modality-specific information (e.g., the timbre of a specific instrument in a frame). CMR adopts a retrieval approach: an external memory bank \(\mathcal{B} = \{(\mathbf{k}_i, \mathbf{v}_i)\}_{i=1}^{M}\) is constructed by encoding a large volume of real features into unified semantic embeddings using pre-trained multimodal models like ImageBind. When a modality is missing, the available modality is used as a query \(\mathbf{Q}_{avl}\), and top-n candidates are recalled based on cosine similarity \(S_i = \frac{\mathbf{Q}_{avl} \cdot \mathbf{k}_i}{\|\mathbf{Q}_{avl}\| \|\mathbf{k}_i\| + \epsilon}\). Since both keys and values reside in a unified semantic space, visual queries can directly align with semantically relevant real audio features, thereby retrieving reusable modality-specific knowledge from the database rather than synthesizing a generalized "music" embedding.
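
The retrieval step can be illustrated with a minimal sketch, assuming the keys and the query already live in a shared embedding space. The toy bank, the labels used as stand-ins for stored audio features, and the function names `cosine` and `retrieve_top_n` are hypothetical and not taken from the paper; only the similarity formula follows the definition of \(S_i\) above.

```python
import math

def cosine(q, k, eps=1e-8):
    """S = (q . k) / (||q|| * ||k|| + eps), as in the similarity above."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / (norm_q * norm_k + eps)

def retrieve_top_n(query, bank, n=2):
    """Return the n (score, value) pairs whose keys best match the query.

    `bank` is a list of (key, value) pairs standing in for the memory bank B.
    """
    scored = [(cosine(query, key), value) for key, value in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

# Toy bank (hypothetical data): keys are shared-space embeddings, values are
# labels standing in for stored real audio features.
bank = [
    ([1.0, 0.0, 0.0], "violin_audio"),
    ([0.9, 0.1, 0.0], "cello_audio"),
    ([0.0, 1.0, 0.0], "applause_audio"),
]
visual_query = [1.0, 0.05, 0.0]  # embedding of the available visual modality
top = retrieve_top_n(visual_query, bank, n=2)
print([value for _, value in top])
```

In this toy case the second candidate is a near miss (a cello recalled for a violin query), which is the kind of retrieval noise the purification stage is meant to handle.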

2. Context-aware Adaptive Purification (CAP): Denoising via Context Constraints and Question Guidance

Retrieval inevitably introduces irrelevant information—for instance, a violin performance might recall cello or applause features. Thus, CAP purifies in three steps. First, Consistency Noise Identification: calculate the inconsistency score \(\delta_i = 1 - \text{sim}(H_{miss} \cdot \mathbf{W}_{proj}, \mathbf{g}_{avl})\) between retrieved features and the global context anchor of available modalities, identifying a set of top-k discordant tokens as the noise index set \(\Omega_{noise}\). Second, Text-guided Semantic Acquisition: use multi-head cross-attention and self-attention to allow the textual question to sift through common knowledge for high-quality semantic indices \(\Omega_{salient}\) most relevant to the current question. Finally, Selective Feature Injection: only replace features at noise positions while keeping others intact: \(H_{miss}^{pur} = (\mathbf{1} - \mathcal{M}_{noise}) \odot H_{miss} + \mathcal{M}_{noise} \odot \text{Gather}(H_{guided}, \Omega_{salient})\). Constraints from available modalities prevent semantic drift, while question guidance ensures that the information retained is what is truly needed to answer the question.
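
The first and third CAP steps can be sketched as follows. This is a simplified illustration under several assumptions that are not from the paper: the projection \(\mathbf{W}_{proj}\) is taken to be the identity, the salient indices \(\Omega_{salient}\) are passed in directly instead of being computed by attention over the question, and noise positions are paired one-to-one with salient indices. All function names are hypothetical.

```python
import math

def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)) + eps)

def find_noise_indices(tokens, anchor, k):
    """Indices of the k tokens least consistent with the context anchor.

    Inconsistency is delta_i = 1 - sim(token_i, anchor); W_proj is assumed
    to be the identity in this sketch.
    """
    deltas = [1.0 - cosine(t, anchor) for t in tokens]
    order = sorted(range(len(tokens)), key=lambda i: deltas[i], reverse=True)
    return sorted(order[:k])

def selective_inject(tokens, guided, noise_idx, salient_idx):
    """Overwrite only the noise positions with gathered salient features.

    Equivalent to (1 - M_noise) * H_miss + M_noise * Gather(H_guided, salient),
    with the i-th noise position receiving the i-th salient feature (assumed).
    """
    purified = [list(t) for t in tokens]
    for pos, src in zip(noise_idx, salient_idx):
        purified[pos] = list(guided[src])
    return purified

# Toy data (hypothetical): four retrieved tokens, one clearly off-context.
h_miss = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.1]]
g_avl = [1.0, 0.0]                       # global anchor of available modalities
h_guided = [[0.8, 0.2], [0.95, 0.05]]    # question-guided features
noise = find_noise_indices(h_miss, g_avl, k=1)
h_pur = selective_inject(h_miss, h_guided, noise, salient_idx=[1])
print(noise, h_pur)
```

Tokens outside the noise set are copied through unchanged, which mirrors the masking form of the injection equation.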

3. Two-stage Mixture-of-Experts Training: Explicitly Distinguishing Reliability of Original and Recovered Modalities

Recovered modalities are ultimately less reliable than original ones; forced fusion can be biased by uncertain information. The authors split training into two stages. Stage one involves independently pre-training the Visual expert \(\mathcal{E}_v\), Audio expert \(\mathcal{E}_a\), and Text expert \(\mathcal{E}_t\), forcing each to extract discriminative representations without relying on cross-modal shortcuts, thus avoiding feature collapse. Stage two freezes the experts and only trains the Gating network (Router), which dynamically calculates weights \(\alpha_{m'} = \frac{\exp(g_{m'})}{\sum_{m} \exp(g_m)}\) based on input context to obtain the joint representation \(\mathbf{Z}_{joint} = \alpha_a H_a + \alpha_t H_t + \alpha_v H_v\). This allows recovered modalities to be assigned lower weights to reflect their uncertainty rather than being treated equally with real modalities.
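
A minimal sketch of the stage-two fusion, assuming the router has already produced one scalar logit \(g_m\) per modality. The logits and feature vectors below are made-up values, and the names `softmax` and `fuse` are hypothetical; only the softmax weighting and the weighted sum follow the formulas above.

```python
import math

def softmax(logits):
    """alpha_m = exp(g_m) / sum_m' exp(g_m'), computed stably."""
    peak = max(logits.values())
    exps = {name: math.exp(g - peak) for name, g in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def fuse(features, logits):
    """Z_joint = sum over modalities of alpha_m * H_m."""
    weights = softmax(logits)
    dim = len(next(iter(features.values())))
    joint = [sum(weights[m] * features[m][d] for m in features)
             for d in range(dim)]
    return weights, joint

# Hypothetical expert outputs; audio is the recovered modality here, so the
# router is assumed to have given it a lower logit.
features = {"a": [0.2, 0.4], "v": [1.0, 0.0], "t": [0.5, 0.5]}
logits = {"a": -1.0, "v": 1.0, "t": 1.0}
weights, z_joint = fuse(features, logits)
print(weights, z_joint)
```

Because the weights sum to one, a low logit for the recovered modality shifts weight onto the original modalities and so reduces its contribution to the joint representation.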

Loss & Training

In addition to the standard cross-entropy loss \(\mathcal{L}_{task}\), a semantic ranking loss \(\mathcal{L}_{rank}\) is introduced to enforce an ordering on feature quality: features recovered from positive retrievals should rank above those recovered from negative retrievals but below real features. The total objective is \(\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda(\mathcal{L}_{rank}^+ + \mathcal{L}_{rank}^-)\), where \(\lambda\) weights the ranking terms. This ordering keeps purified retrieved features within a valid semantic manifold.
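
One common way to impose such an ordering is a pair of margin-based hinge terms. The sketch below assumes that form; the summary does not give the exact definitions of \(\mathcal{L}_{rank}^+\) and \(\mathcal{L}_{rank}^-\), so the scalar quality scores, the margin, the value of \(\lambda\), and all function names here are illustrative assumptions, not the authors' formulation.

```python
def hinge(better, worse, margin=0.1):
    """Zero when `better` exceeds `worse` by at least `margin`."""
    return max(0.0, margin - (better - worse))

def ranking_loss(s_real, s_pos, s_neg, margin=0.1):
    """Penalize violations of the order real > positive > negative.

    The scores are assumed to be scalar quality scores for the real feature
    and for features recovered from positive / negative retrievals.
    """
    return hinge(s_real, s_pos, margin) + hinge(s_pos, s_neg, margin)

def total_loss(task_loss, s_real, s_pos, s_neg, lam=0.5, margin=0.1):
    """L_total = L_task + lambda * (ranking terms); lam is a made-up value."""
    return task_loss + lam * ranking_loss(s_real, s_pos, s_neg, margin)

# Correctly ordered scores incur no ranking penalty.
print(ranking_loss(0.9, 0.7, 0.2))
# A recovered feature that outscores the real one is penalized.
print(ranking_loss(0.5, 0.7, 0.2))
```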

Key Experimental Results

Main Results

| Dataset | Modality Setting | Ours (R2ScP) | Prev. SOTA (IMOL) | Gain |
| --- | --- | --- | --- | --- |
| Music-AVQA | Missing Audio | 69.37 | 67.11 | +2.26 |
| Music-AVQA | Missing Visual | 72.06 | 69.21 | +2.85 |
| Music-AVQA | Complete | 73.19 | 71.86 | +1.33 |
| AVQA | Missing Audio | 63.25 | 61.32 | +1.93 |
| AVQA | Missing Visual | 75.12 | 72.38 | +2.74 |
| AVQA | Complete | 90.64 | 90.28 | +0.36 |

Ablation Study

| Configuration | Music-AVQA | AVQA | Description |
| --- | --- | --- | --- |
| w/o CMR w/o CAP | 62.43 | 57.43 | Baseline (Missing Audio) |
| +CMR only | 67.21 | 61.78 | Retrieval brings +4.78/+4.35 |
| +CAP only | 64.11 | 59.64 | Purification itself is effective |
| +CMR+CAP (Full) | 69.37 | 63.25 | Combination yields best results |
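
The per-component gains implied by the ablation numbers can be recomputed directly from the table. The snippet below is only an arithmetic check on the reported scores; the dictionary layout and the name `gains` are illustrative.

```python
# Reported scores (Missing Audio setting) as (Music-AVQA, AVQA).
scores = {
    "baseline": (62.43, 57.43),
    "+CMR only": (67.21, 61.78),
    "+CAP only": (64.11, 59.64),
    "+CMR+CAP": (69.37, 63.25),
}

base = scores["baseline"]
# Gain of each configuration over the baseline, rounded to two decimals.
gains = {
    name: tuple(round(s - b, 2) for s, b in zip(vals, base))
    for name, vals in scores.items()
    if name != "baseline"
}
for name, (music, avqa) in gains.items():
    print(f"{name}: +{music:.2f} (Music-AVQA), +{avqa:.2f} (AVQA)")
```

On these numbers, CAP alone adds +1.68 and +2.21, and the full model adds +6.94 and +5.82 over the baseline. The full-model gain exceeds the sum of the two individual gains on Music-AVQA (6.46) but not on AVQA (6.56).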

Key Findings

  • Retrieval-based recovery is more effective than generative completion, especially when the visual modality is missing (+2.85 vs IMOL).
  • CMR and CAP are independently effective, but their combination is optimal, indicating that retrieval and purification are complementary.
  • R2ScP still outperforms the compared methods in the complete-modality setting (+1.33 on Music-AVQA, +0.36 on AVQA), suggesting that the retrieval-recovery framework can act as a modality enhancement even when no data are missing.
  • The two-stage training strategy avoids cross-modal feature collapse by decoupling expert pre-training from gated fusion.

Highlights & Insights

  • Paradigm Innovation: The shift from "generative completion" to "retrieval-based recovery" is simple yet powerful, avoiding the hallucination issues of generative methods.
  • The three-stage purification design of CAP (noise identification → semantic acquisition → selective injection) is logically rigorous and fully utilizes guidance signals from available modalities and questions.
  • The semantic ranking loss cleverly establishes a quality gradient of "Real > Positive Retrieval > Negative Retrieval."
  • Performance gains even in complete modality scenarios indicate that retrieval mechanisms can serve as a general modality enhancement tool.

Limitations & Future Work

  • The construction and storage overhead of the external memory bank are significant, potentially posing a bottleneck for large-scale deployment.
  • It is currently assumed that the missing modality is completely unavailable; scenarios with partial loss or noise degradation are not addressed.
  • Retrieval quality is highly dependent on the alignment quality of the unified semantic space (ImageBind).
  • Validated only on the AVQA task; generalization to other multimodal reasoning tasks (e.g., VQA, dialogue) remains to be explored.

Comparison with Related Work

  • vs Missing-AVQA (ECCV 2024): Missing-AVQA uses relation-aware generators to synthesize missing features, which results in common knowledge; R2ScP preserves modality-specific info through retrieval.
  • vs IMOL (ACL 2025): While IMOL also uses retrieval, it is primarily for contrastive alignment rather than direct feature recovery; R2ScP directly replaces missing info with retrieved features.
  • vs MoMKE (MM 2024): MoMKE preserves modality-specific knowledge via mixture-of-experts but does not handle feature recovery; R2ScP combines retrieval and MoE architectures for a more complete solution.

Rating

  • Novelty: ⭐⭐⭐⭐ The paradigm shift from generation to retrieval is a clear innovation; the CAP purification mechanism is elegantly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, multiple missing modality settings, and detailed ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem motivation and detailed method description.
  • Value: ⭐⭐⭐⭐ Provides a new research direction for handling missing multimodal data.