Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering

Conference: ICLR 2026
arXiv: 2601.19821
Code: To be released upon publication
Area: Audio & Speech
Keywords: Audio-Visual QA, Frequency-Domain Interaction, Query Guidance, Spatial-Temporal Perception, Multimodal Reasoning

TL;DR

This paper proposes QSTar, a framework that embeds query guidance throughout the entire processing pipeline and introduces a three-dimensional Spatial-Temporal-Frequency Interaction module (leveraging spectral features to distinguish timbres), achieving significant performance gains on Music Audio-Visual Question Answering (Music AVQA).

Background & Motivation

Background: Audio-visual question answering requires joint understanding of auditory, visual, and textual information. This makes it substantially more challenging than purely visual QA, and in many scenarios the acoustic cues are more informative than the visual ones.

Limitations of Prior Work: Existing AVQA methods (e.g., PSTP, APL) primarily focus on visual information processing, treating audio merely as a supplement to video analysis, leaving its unique frequency-domain characteristics underexploited.

Key Challenge: Textual question information is typically fused only at the final reasoning stage via simple multiplication, causing audio-visual representations to lack semantic specificity.

Key Insight: Orchestral instruments (e.g., flute, clarinet) may produce highly subtle visual cues (minimal playing motion), yet exhibit markedly different spectral characteristics (harmonic distributions, overtone structures). Frequency-domain analysis is therefore essential for timbre discrimination.

Core Problem: In polyphonic scenarios where multiple instruments play simultaneously, temporal or spatial features alone are insufficient to effectively distinguish individual instrument contributions.

Method

Overall Architecture

QSTar comprises three core modules: (1) Query-Guided Multimodal Correlation (QGMC), which leverages question semantics in early stages to guide audio and visual feature refinement; (2) Spatial-Temporal-Frequency Interaction (STFI), which enhances feature interaction across spatial, temporal, and frequency dimensions and is described below in two parts, Spatial-Temporal Interaction (STI) and Temporal-Frequency Interaction (TFI); and (3) Query Context Reasoning (QCR), which injects task-relevant constraints via a prompt mechanism for final-stage reasoning.

Key Designs

1. Query-Guided Multimodal Correlation (QGMC)

Function: Introduces question semantics early in the pipeline to guide audio and visual feature refinement, rather than waiting until the final stage.
Mechanism: Operates in three steps. Self-enhancing: intra-modal self-attention strengthens the relationships within each modality. Capturing: word-level textual features serve as queries, and cross-attention retrieves shared semantics \(F_{qv}, F_{qa}\) from the visual and audio streams. Propagating: the aggregated query-guided semantic context \(F_{qg}\) is propagated back to the visual and audio streams via cross-attention.
Design Motivation: Questions typically focus on one or two instruments; injecting question information early enables the model to concentrate on semantically relevant audio-visual features and avoid redundant representations.
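
The three QGMC steps can be sketched with plain scaled dot-product attention. This is a minimal illustration, not the paper's implementation: the helper `attend`, the toy feature values, the absence of learned projections and residual connections, and the element-wise mean used to aggregate \(F_{qv}\) and \(F_{qa}\) into \(F_{qg}\) are all assumptions made for this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, memory):
    """Scaled dot-product attention without learned projections (hypothetical helper).

    Each query vector is replaced by a softmax-weighted average of the
    memory vectors, which serve as both keys and values.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append([sum(w * m[i] for w, m in zip(weights, memory)) for i in range(dim)])
    return outputs

# Toy features (made-up numbers): 2 question words, 3 video segments, 2 audio segments.
words = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [3.0, 0.0]]

# 1) Self-enhancing: intra-modal self-attention.
visual_se = attend(visual, visual)
audio_se = attend(audio, audio)

# 2) Capturing: question words act as queries over each modality.
f_qv = attend(words, visual_se)
f_qa = attend(words, audio_se)

# Aggregation into F_qg; the element-wise mean is an assumption of this sketch.
f_qg = [[(v + a) / 2 for v, a in zip(row_v, row_a)] for row_v, row_a in zip(f_qv, f_qa)]

# 3) Propagating: each modality queries the query-guided context.
visual_out = attend(visual_se, f_qg)
audio_out = attend(audio_se, f_qg)
```

The refined streams keep their original lengths (three visual segments, two audio segments), so later modules can consume them in place of the unguided features.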

2. Spatial-Temporal Interaction (STI)

Function: Performs spatial-temporal interaction using patch-level visual features and audio features.
Mechanism: Spatial interaction aligns patch-level visual features with query-guided audio features via cross-attention to localize sounding regions. Temporal interaction takes the dot product of the query-guided visual and audio features and applies a softmax to obtain temporal attention weights that capture global temporal dependencies.
Design Motivation: Video contains both a spatial dimension (where the sound originates) and a temporal dimension (when it occurs), each requiring dedicated modeling before fusion.
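
As a rough illustration of the two interactions, the sketch below scores visual patches against a single audio vector (spatial) and scores every visual time step against every audio time step (temporal). The function names, the toy numbers, and the choice of a row-wise softmax over visual steps are assumptions of this sketch; the summary does not specify the exact formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spatial_attention(patches, audio_vec):
    """Weight each visual patch by its similarity to one audio vector."""
    dim = len(audio_vec)
    weights = softmax([dot(p, audio_vec) / math.sqrt(dim) for p in patches])
    attended = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return weights, attended

def temporal_attention(visual_seq, audio_seq):
    """For each audio step, return a softmax distribution over all visual steps."""
    dim = len(audio_seq[0])
    return [softmax([dot(a, v) / math.sqrt(dim) for v in visual_seq]) for a in audio_seq]

# Spatial toy example: a 2x2 patch grid flattened to 4 patches (made-up values).
patches = [[0.1, 0.0], [0.0, 0.2], [1.5, 1.0], [0.2, 0.1]]
audio_vec = [1.0, 1.0]
patch_weights, attended = spatial_attention(patches, audio_vec)
sounding_patch = max(range(len(patches)), key=lambda i: patch_weights[i])

# Temporal toy example: 3 time steps per modality (made-up values).
visual_seq = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
audio_seq = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
temporal_weights = temporal_attention(visual_seq, audio_seq)
```

Here `sounding_patch` is the index of the patch most similar to the audio vector, which is the sense in which spatial attention localizes the sounding region.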

3. Temporal-Frequency Interaction (TFI)

Function: Extracts frequency-aware features using an Audio Spectrogram Transformer (AST) and enhances audio representations via a frequency attention mechanism.
Mechanism: (1) AST extracts a time-frequency representation \(F_{ast} \in \mathbb{R}^{T \times F \times D}\); (2) AST features are aggregated along the time dimension to obtain frequency representations; (3) frequency attention weights \(a_f\) are computed jointly with the question embedding to highlight question-relevant frequency bands; (4) weighted AST features are fused with query-guided audio features via convolution.
Design Motivation: Visually similar instruments (e.g., flute vs. clarinet) exhibit markedly different overtone and harmonic distributions in the frequency domain, providing discriminative cues unavailable from visual or temporal features alone.
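
Steps (2) and (3) of the TFI mechanism can be sketched as follows. This is a simplified illustration under stated assumptions: mean pooling is used for the time aggregation, the attention score is a scaled dot product with the question embedding, and the reweighted bands are summed per time step. The convolutional fusion of step (4) is omitted, and the function name and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frequency_attention(f_ast, question):
    """f_ast is a nested list shaped [T][F][D]; question is a list of length D."""
    num_steps, num_bands, dim = len(f_ast), len(f_ast[0]), len(question)

    # Step (2): aggregate along time (mean pooling, an assumption) -> [F][D].
    band_repr = [
        [sum(f_ast[t][f][d] for t in range(num_steps)) / num_steps for d in range(dim)]
        for f in range(num_bands)
    ]

    # Step (3): score each band against the question and normalize -> a_f of length F.
    scores = [sum(b * q for b, q in zip(band, question)) / math.sqrt(dim) for band in band_repr]
    a_f = softmax(scores)

    # Reweight the bands and sum over frequency -> [T][D] (summation is an assumption).
    weighted = [
        [sum(a_f[f] * f_ast[t][f][d] for f in range(num_bands)) for d in range(dim)]
        for t in range(num_steps)
    ]
    return a_f, weighted

# Toy input (made-up values): T=2 time steps, F=3 frequency bands, D=2 channels.
f_ast = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 3.0], [0.0, 0.0]],
]
question = [0.0, 1.0]
a_f, weighted = frequency_attention(f_ast, question)
top_band = a_f.index(max(a_f))
```

In this toy input only the middle band aligns with the question embedding, so it receives the largest weight and dominates the pooled per-step features.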

4. Query Context Reasoning (QCR)

Function: Injects task-relevant linguistic context via a prompt mechanism for final-stage reasoning.
Mechanism: Instrument-related attribute keywords (type, performance duration, location, temporal order, loudness) are encoded as prompt embeddings \(F_{prompt}\), concatenated with sentence-level question embeddings, and processed via self-attention to produce query context \(F_{qc}\); cross-attention then guides the final refinement of visual and audio features.
Design Motivation: Different question types focus on different aspects; prompts supply focused task constraints to enable precise reasoning.

Loss & Training

Standard cross-entropy classification loss
AdamW optimizer; initial learning rate \(1 \times 10^{-4}\), decayed by a factor of 0.1 every 10 epochs
Batch size 64; trained for 30 epochs
Visual features: CLIP-ViT-L/14; audio features: VGGish + AST; all features projected to 512 dimensions
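
The learning-rate schedule above can be written out explicitly. The sketch assumes zero-indexed epochs with the decay taking effect at the start of epochs 10 and 20, which matches how a step scheduler such as PyTorch's `StepLR(step_size=10, gamma=0.1)` behaves; the exact epoch at which the paper applies each decay is not stated in this summary.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=10):
    """Learning rate for a zero-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate for each of the 30 training epochs.
schedule = [step_decay_lr(epoch) for epoch in range(30)]
```

Under these assumptions the rate is \(1 \times 10^{-4}\) for epochs 0 to 9, \(1 \times 10^{-5}\) for epochs 10 to 19, and \(1 \times 10^{-6}\) for epochs 20 to 29.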

Key Experimental Results

Main Results

Accuracy (%) on the MUSIC-AVQA test set:

Method	Audio QA	Visual QA	Audio-Visual QA	Average
PSTP	70.91	77.26	72.57	73.52
APL	78.09	79.69	70.96	74.53
TSPM	76.91	83.61	73.51	76.79
QA-TIGER	78.58	85.14	73.74	77.62
QSTar	80.63	84.17	75.98	78.98

QSTar surpasses the previous SOTA QA-TIGER by 1.36 percentage points in average accuracy, 2.05 in Audio QA, and 2.24 in Audio-Visual QA, while trailing it by 0.97 in Visual QA.
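
The reported margins follow directly from the table. The short check below recomputes them as differences in percentage points; the dictionary layout and variable names are only for illustration.

```python
# Accuracy (%) copied from the MUSIC-AVQA results table above.
results = {
    "QA-TIGER": {"Audio QA": 78.58, "Visual QA": 85.14, "Audio-Visual QA": 73.74, "Average": 77.62},
    "QSTar": {"Audio QA": 80.63, "Visual QA": 84.17, "Audio-Visual QA": 75.98, "Average": 78.98},
}

# Difference QSTar minus QA-TIGER, in percentage points, rounded to two decimals.
gains = {
    metric: round(results["QSTar"][metric] - results["QA-TIGER"][metric], 2)
    for metric in results["QSTar"]
}

for metric, gain in gains.items():
    print(f"{metric}: {gain:+.2f}")
```

A negative value marks a category where QSTar trails QA-TIGER, which here is only Visual QA.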

Ablation Study

Configuration	Audio QA	Visual QA	A-V QA	Average
w/o all	73.87	79.15	70.33	73.29
w/o QGMC	79.08	83.44	72.92	76.80
w/o QCR	79.33	83.24	75.43	78.19
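w/o STI	—	−1.55 pts	—	−1.18 pts
w/o TFI	−2.42 pts	—	−1.59 pts	significant drop (not quantified)
Full QSTar	80.63	84.17	75.98	78.98

The w/o STI and w/o TFI rows give the change in percentage points relative to the full model rather than absolute accuracy; a dash marks a value not reported in this summary.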

Key Findings

TFI is critical for audio-type questions: Removing TFI lowers accuracy by 2.42 percentage points on Audio QA and 1.59 on Audio-Visual QA, consistent with spectral features playing a key role in timbre discrimination.
Query guidance throughout the pipeline matters: Removing early-stage guidance (\(M_b^-\)) causes a 1.05% drop; removing the final prompt (\(M_f^-\)) causes a 0.73% drop.
Comparative and temporal question types benefit most: Gains exceeding 5% on these sub-types highlight the advantage of three-dimensional spatial-temporal-frequency interaction.
No object detector required: QSTar achieves competitive results without a pretrained object detector, trailing QA-TIGER in Visual QA by only 0.97 percentage points (84.17 vs. 85.14), which suggests its visual understanding remains strong without explicit detection.

Highlights & Insights

Frequency-domain analysis fills a gap in AVQA: Prior methods largely overlook the spectral properties of audio signals; the paper presents itself as the first to systematically exploit spectral features (via AST) for music-scene question answering.
Pipeline-wide query guidance outperforms late fusion: Semantic information guides feature refinement from the early stages onward, reducing redundant representations.
The frequency attention mechanism uses the question text to weight the spectrogram features, enabling the model to focus on question-relevant frequency bands.
The flute case study is highly intuitive: virtually no visible motion change occurs, yet a clear attenuation in the high-frequency band of the spectrogram unambiguously marks the cessation of playing.

Limitations & Future Work

Reliance on pretrained feature extractors: CLIP, VGGish, and AST are all used as frozen pretrained models; end-to-end fine-tuning may yield further gains.
Evaluation limited to the music domain: MUSIC-AVQA covers only music scenarios; generalization to broader AVQA settings (dialogue, natural sounds) remains to be verified.
Slightly weaker Visual QA performance: The absence of an object detector limits spatial localization precision relative to QA-TIGER; incorporating a lightweight localization module is worth exploring.
Interpretability of frequency attention: Although spectral visualizations are provided, the semantic meaning of the frequency attention weights warrants deeper analysis.
Template-based QA constraints: MUSIC-AVQA generates question-answer pairs from predefined templates; the model's capacity for open-ended questions is unknown.

Related Work & Insights

TSPM: Introduces temporal and spatial perception modules but remains visually dominant, whereas QSTar elevates audio to an equally important modality.
QA-TIGER: The previous SOTA, which relies on complex visual processing; QSTar achieves higher overall accuracy with simpler visual processing by adding frequency-domain analysis.
Audio Spectrogram Transformer (AST): Used effectively here as a frequency feature extractor, which suggests broader use of audio frequency-domain representations in other multimodal tasks.
Implications for video understanding: The query-guided feature refinement paradigm is transferable to tasks such as video QA and video grounding that similarly require question-conditioned reasoning.

Rating

Novelty: ⭐⭐⭐⭐ Frequency-domain interaction is a novel contribution to AVQA, though the overall framework architecture (stacked cross-attention) is relatively conventional.
Experimental Thoroughness: ⭐⭐⭐⭐ Ablation studies cover all modules and each stage of query guidance, but comprehensive evaluation is limited to MUSIC-AVQA.
Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated (the flute case study is effective); the method description is systematic, though notation-heavy.
Value: ⭐⭐⭐⭐ Achieves a new SOTA on Music AVQA; the introduction of frequency-domain analysis offers meaningful inspiration for multimodal understanding.