Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System

Conference: ACL 2025
arXiv: 2506.00421
Code: None
Area: Dialogue Systems

TL;DR

This paper proposes an immersive multimodal conversation system that endows chatbots with "eyes and ears." It constructs the M3C dataset, a multi-session multi-party dialogue dataset integrating vision and audio, and designs a dialogue model consisting of a dialogue module and a multimodal memory retrieval module, enabling dynamic, long-term conversations where multiple speakers share audiovisual experiences.

Background & Motivation

Existing multimodal dialogues prioritize "eyes" over "ears": Current research mainly focuses on image-related dialogues (visual dialogue, image instruction tuning, etc.), giving chatbots "eyes," whereas the auditory aspect ("ears") is severely lacking, with no solution simultaneously integrating vision and hearing.
Static interactions limit dialogue naturalness: In the existing paradigm, the chatbot receives a shared image and then answers questions about it. The modality is something to be discussed rather than something naturally woven into the conversation, so this static mode fails to capture the dynamic, real-time nature of human communication.
Insufficient exploration of multi-party + multi-session scenarios: Multimodality has been explored in multi-party and multi-session dialogues, but that work is tied to specific tasks and is difficult to integrate seamlessly into dynamic, natural conversations.
Lack of datasets with shared spatio-temporal experiences: In existing datasets (e.g., PhotoChat, DialogCC), speakers do not experience the audiovisual inputs in the same space and time, which does not reflect real-world multi-person situations.
Importance of long-term memory mechanisms: In real conversations, people recall previous shared experiences. However, existing models lack effective multimodal memory storage and retrieval mechanisms to support coherent dialogues across multiple sessions.
Challenges in autonomous multi-party interaction: To achieve autonomous multi-agent dialogue without human intervention, the model needs to determine when it is its turn to speak, which remains an insufficiently solved key challenge in multi-party conversations.

Method

Dataset: M3C (Multimodal Multi-Session Multi-Party Conversation)

Data Scale and Structure:
- 54K dialogue episodes (34K train / 8K validation / 12K test), totaling 2.5M dialogue turns.
- Each episode contains 4 speakers across 3 consecutive sessions.
- In each session, the main speaker interacts with 2 different partners.
- Each session contains 2 multimodal inputs (visual or auditory).
- All speakers share the same audiovisual experience in a unified spatio-temporal environment.
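The episode layout described above can be sketched as a small set of data classes with a validity check. This is only an illustration: the class names, field names, and the `validate` helper are hypothetical and do not come from the released dataset format.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Tuple


@dataclass
class ModalInput:
    kind: Literal["image", "audio"]  # each input is either visual or auditory
    ref: str                         # hypothetical identifier of the image or clip


@dataclass
class Session:
    main_speaker: str
    partners: List[str]              # the 2 partners of the main speaker
    inputs: List[ModalInput]         # the 2 multimodal inputs shared in the session
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, utterance)


@dataclass
class Episode:
    speakers: List[str]
    sessions: List[Session]


def validate(episode: Episode) -> None:
    """Raise ValueError if the episode violates the layout stated in the summary."""
    if len(set(episode.speakers)) != 4:
        raise ValueError("expected 4 distinct speakers")
    if len(episode.sessions) != 3:
        raise ValueError("expected 3 sessions")
    for session in episode.sessions:
        participants = {session.main_speaker, *session.partners}
        if len(session.partners) != 2 or len(participants) != 3:
            raise ValueError("expected a main speaker plus 2 distinct partners")
        if not participants <= set(episode.speakers):
            raise ValueError("session participant is not an episode speaker")
        if len(session.inputs) != 2:
            raise ValueError("expected 2 multimodal inputs per session")
```

A check of this kind is useful when loading generated episodes, since a single malformed session would otherwise surface much later during training.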

Data Construction Pipeline:
1. Modality Structuring: Using COCO images (24K) and AudioCaps/Clotho audio (73K) as seeds, image tags are optimized via GPT-4o mini.
2. Scene Preparation: Generating speaker profiles, session partners, modal inputs, and time intervals; grouping similar modalities based on K-means clustering (\(K=30\)) by location tags.
3. Dialogue and Memory Generation: Generating dialogues session-by-session, creating memory summaries from the main speaker's perspective, and connecting related elements via memory links.
4. Quality Filtering: Excluding episodes that fail spatio-temporal consistency checks via machine-verified questions.

Model Architecture

Base Model: Qwen2-VL-2B-Instruct, with audio understanding capabilities extended via CLAP + a linear adapter.

Dialogue Module:
- Responsible for three tasks: dialogue generation, memory generation, and memory linking.
- Generates responses based on dialogue history and multimodal inputs during an active session.
- Constructs memory units integrated with multimodal perception after a session ends.
- Explicitly associates new memories with semantically or perceptually related past memories via structured memory links.
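To make the idea of memory units and memory links concrete, here is a minimal sketch of a memory bank in which links are attached when a memory is stored. The names `MemoryUnit`, `MemoryBank`, `add`, and `linked` are hypothetical. In the paper the dialogue module itself decides which earlier memories are related; in this sketch that decision is simply passed in as `related_ids`.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class MemoryUnit:
    memory_id: int
    session_index: int
    summary: str            # text summary from the main speaker's perspective
    modal_refs: List[str]   # identifiers of the images/audio perceived in the session
    links: List[int] = field(default_factory=list)  # ids of related earlier memories


class MemoryBank:
    def __init__(self) -> None:
        self.units: Dict[int, MemoryUnit] = {}

    def add(self, summary: str, modal_refs: Iterable[str], session_index: int,
            related_ids: Iterable[int] = ()) -> MemoryUnit:
        """Store a new memory and record its links at storage time."""
        related = list(related_ids)
        for rid in related:
            if rid not in self.units:
                raise KeyError(f"unknown memory id: {rid}")
        unit = MemoryUnit(len(self.units), session_index, summary,
                          list(modal_refs), related)
        self.units[unit.memory_id] = unit
        return unit

    def linked(self, memory_id: int) -> List[MemoryUnit]:
        """Return the earlier memories that the given memory links to."""
        return [self.units[i] for i in self.units[memory_id].links]
```

Because links are resolved when a memory is written, following them later is a dictionary lookup rather than a second search.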

Retrieval Module:
- Retrieves relevant memories from the multimodal memory bank based on the current dialogue context.
- Jointly embeds the entire session (dialogue + perceived modality) into a shared representation space.
- Measures memory relevance using cosine similarity: \(\operatorname{sim}(c, m_i) = \cos(E_c(c), E_m(m_i))\).
- Selects the Top-1 most relevant memory to augment the dialogue.
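The scoring and Top-1 selection step can be sketched as follows. The sketch assumes the context and memory embeddings \(E_c(c)\) and \(E_m(m_i)\) have already been computed and are given as plain lists of floats; the function names `cosine` and `retrieve_top1` are illustrative, not the authors' code.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top1(context_vec: Sequence[float],
                  memory_vecs: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, score) of the memory most similar to the context."""
    if not memory_vecs:
        raise ValueError("memory bank is empty")
    scores = [cosine(context_vec, m) for m in memory_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For example, `retrieve_top1([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])` selects index 1, since cosine similarity ignores vector length and only the second memory points in roughly the same direction as the context.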

Training Strategy: Fine-tuning proceeds in two stages. The model is first fine-tuned on vision-language tasks, with audio represented as text captions; the linear adapter is then integrated so that the model processes raw audio. The resulting model supports model-to-model dialogues and autonomously manages turn-taking.
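Autonomous turn-taking in a model-to-model dialogue can be pictured as a loop in which the next speaker is predicted before each utterance is generated. The sketch below is an assumption about the control flow only: `choose_next_speaker` and `generate_utterance` are hypothetical callables standing in for the model, and the summary does not describe how the paper implements this loop.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def run_session(
    speakers: Sequence[str],
    choose_next_speaker: Callable[[List[Turn], Sequence[str]], str],
    generate_utterance: Callable[[str, List[Turn]], str],
    max_turns: int,
) -> List[Turn]:
    """Alternate between next-speaker prediction and utterance generation."""
    history: List[Turn] = []
    for _ in range(max_turns):
        speaker = choose_next_speaker(history, speakers)
        if speaker not in speakers:
            raise ValueError(f"predicted speaker {speaker!r} is not in the session")
        history.append((speaker, generate_utterance(speaker, history)))
    return history
```

With stub callables (for instance a round-robin chooser), the loop can be exercised without any model, which makes it easy to test the surrounding plumbing separately from the predictor.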

Key Experimental Results

Human and Automatic Evaluation (Table 2)

Evaluation Dimension	Human Score	Automatic Score (o3-mini)
Dataset Quality
Coherence & Consistency	4.81	4.99
Memorability	4.63	4.99
Modality Alignment	4.21	4.26
Modality Engagement	4.36	4.57
Dataset Overall	4.50	4.70
Model Performance
Naturalness	4.34	4.68
Immersiveness	4.14	4.56
Memorability	4.35	4.46
Model Overall	4.28	4.57
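As a quick arithmetic check, the two "Overall" rows are consistent with unweighted means of the dimension scores above them. This is an observation from the table values, not something the summary states about how the paper computes them; the helper `mean2` is introduced here only for illustration.

```python
def mean2(values):
    """Unweighted mean rounded to two decimals."""
    return round(sum(values) / len(values), 2)


# Dimension scores copied from the table (human, automatic).
dataset_human = [4.81, 4.63, 4.21, 4.36]
dataset_auto = [4.99, 4.99, 4.26, 4.57]
model_human = [4.34, 4.14, 4.35]
model_auto = [4.68, 4.56, 4.46]

overall = {
    "dataset": (mean2(dataset_human), mean2(dataset_auto)),
    "model": (mean2(model_human), mean2(model_auto)),
}
print(overall)
```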

Retrieval Module Performance (Table 4)

Model	Image R@1	Image MRR	Audio R@1	Audio MRR
Qwen2-VL-2B	66.77	77.56	-	-
LLaMA-3.2-11B-Vision	72.41	78.90	-	-
Qwen2-Audio-7B	-	-	69.94	80.72
Ours	92.99	95.06	92.83	94.78
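For readers unfamiliar with the two metrics in this table, the sketch below computes Recall@k and Mean Reciprocal Rank under their standard definitions from the 1-based rank of the correct memory for each query. The function names are illustrative; the table reports both values as percentages, so the fractions returned here would be multiplied by 100.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int = 1) -> float:
    """Fraction of queries whose correct item is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

MRR is always at least as large as R@1, because a query whose correct memory is ranked second or lower still contributes a partial score; this matches every row of the table.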

Cross-Dataset Comparison (Table 3): When GPT-4o-mini and Claude-3.5-Sonnet evaluated the "immersive naturalness" of M3C versus other datasets, M3C achieved selection rates of 81% and 99%, respectively.

Multi-Party Dialogue Performance (Table 5): Next-speaker prediction accuracy is 85.2% for the proposed model versus 10.3% for the Qwen2-VL baseline.

Ablation Study

Audio Captions vs. Raw Audio: The model equipped with the audio adapter aligns directly with auditory experiences, whereas caption-based models rely excessively on prompt text content.
Multimodal Memory: With the retriever, the model can reference specific details from prior sessions (e.g., "the shells collected last time"), whereas without the retriever, it generates generic responses.
Beyond Three Sessions: Although trained only on three-session data, the model supports long-term dialogues across more sessions thanks to its independent memory mechanism.

Highlights & Insights

First Audiovisual Multi-Party Multi-Session Dialogue Dataset: M3C is the first open-domain dialogue dataset where all speakers synchronously experience images and audio within a shared space and time.
Multimodal Memory Retrieval: Coherent cross-session dialogue is achieved through structured memory links and cross-modal retrieval, where associations are established at storage time (rather than search time).
Autonomous Multi-Party Dialogue: The model autonomously determines turn-taking and when modal inputs should appear, enabling multi-agent conversation without human intervention.
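Substantial Lead in Retrieval Performance: The retrieval module reaches over 92% R@1 on both image and audio retrieval (92.99 and 92.83), well above the strongest baselines in Table 4 (72.41 image R@1 for LLaMA-3.2-11B-Vision and 69.94 audio R@1 for Qwen2-Audio-7B).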

Limitations & Future Work

The annotations for the audio dataset were not optimized to the same level as the image annotations, which may affect the immersive quality of the audio portion.
The base model is a 2B-parameter VLM, which lacks the native vision-audio-language joint understanding capabilities of larger models.
The dataset is generated by GPT-4o mini rather than human conversations, potentially introducing machine generation bias.
The model scale is relatively small (2B), and its performance and generalizability on larger models have not yet been verified.

Comparison Direction	Advantages of Ours
PhotoChat / DialogCC / Stark	These datasets support only the visual modality and are mostly single-party/single-session; M3C covers both images and audio, supports multi-party multi-session dialogues, and is judged more immersive and natural in cross-dataset comparison, with selection rates of 81% (GPT-4o-mini) and 99% (Claude-3.5-Sonnet).
Audio Dialogues / MELD	Audio Dialogues only supports audio QA tasks; although MELD includes audio and video, it consists of short dialogues oriented toward sentiment analysis. M3C offers open-domain, long-term, multi-turn dialogues with coordinated audiovisual design and shared experience configurations.
MiSC (Jang et al., 2024)	MiSC supports multi-session multi-party dialogues but lacks modal inputs; M3C extends this to a setting with multiple partners per session and audiovisual modalities, increasing interaction complexity and realism.

Rating

⭐⭐⭐⭐ Novelty: The setting of audiovisual multi-party multi-session dialogue represents a clear innovation, and the design of multimodal memory retrieval is reasonable.
⭐⭐⭐⭐ Experimental Thoroughness: A multi-dimensional evaluation combining human, automatic, and cross-dataset metrics is presented, including ablation and quantitative analyses.
⭐⭐⭐⭐ Value: Provides a dataset and modeling paradigm for building more natural multimodal dialogue systems.
⭐⭐⭐⭐ Writing Quality: The structure is clear, cases are rich, and the dataset comparison tables are highly informative.