BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation¶
Conference: ECCV2024
arXiv: 2408.05926
Code: hee-suk-yoon/BI-MDRG
Area: Dialogue Systems
Keywords: Multimodal Dialogue, Image Consistency, Vision-Language Models, Text-to-Image Generation, Dialogue Response Generation
TL;DR¶
This paper proposes the BI-MDRG framework, which bridges image history information to enhance the image-grounding capability of textual responses and the object consistency in sequential image responses in multimodal dialogues.
Background & Motivation¶
Multimodal Dialogue Response Generation (MDRG) requires models to generate textual, visual, or mixed-modality responses based on the dialogue context. Due to the scarcity of large-scale multimodal dialogue datasets, previous approaches (such as Divter) utilize text as an intermediary: they first convert images in the dialogue history into textual descriptions, then generate responses based on pure text, and finally convert the textual descriptions into images via a text-to-image model.
This "detour" strategy suffers from two core issues:
Textual responses lack image grounding: The model can only understand images through textual descriptions (e.g., "a dog eating a watermelon"), failing to answer questions that require actual visual information (e.g., "What breed is your dog?").
Inconsistent objects in image responses: The generated images across multi-turn dialogues fail to maintain visual consistency of the same object (e.g., the "dog" in successive turns looks completely different).
Method¶
Overall Architecture¶
BI-MDRG consists of three core components:
- Textual Dialogue Response Generator \(\mathcal{G}\): Based on a decoder-only language model, it introduces visual cross-attention layers to directly receive image features extracted by the Visual Encoder \(\mathcal{V}\).
- Citation Module \(\mathcal{C}\): A training-free module that leverages off-the-shelf components (POS tagger + object detection + segmentation + feature extraction) to track recurring identical objects across the dialogue, appending citation tokens (e.g., [cite]0[/cite]) to textual image descriptions.
- Customized Text-to-Image Model \(\mathcal{F}\): During inference, based on the citation tokens, it inputs the same object from historical images into a customized image generation model to maintain object consistency.
The overall framework is initialized from OpenFlamingo 4B (ViT-L + RedPajama 3B), and image descriptions are generated by BLIP2-flan-t5-xl.
Key Designs¶
1. Multimodal Causal Attention Mask Modulation
The traditional causal mask allows the textual response \(r_i^{\text{Text}}\) to access all preceding textual image descriptions \(u_{1:i-1}'\), which leads the model to rely on textual descriptions rather than the actual images. BI-MDRG modifies the attention mask to block the textual response from accessing preceding image descriptions, forcing the model to acquire visual information directly from image features through the cross-attention layer. Meanwhile, image descriptions are retained as input, as the model still needs to generate textual image descriptions for the text-to-image model.
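The mask modulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `modulated_causal_mask`, the `"text"` / `"desc"` token labels, and the list-of-lists mask layout are all assumptions made for the example.

```python
def modulated_causal_mask(token_types):
    """Return mask[q][k] = True when query position q may attend to key position k.

    token_types: one entry per token, either "text" (dialogue text / textual
    response) or "desc" (textual image description). Labels are hypothetical.
    """
    n = len(token_types)
    # Start from the standard causal mask: a token sees itself and the past.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        if token_types[q] != "text":
            continue  # description tokens keep the plain causal view
        for k in range(q):
            if token_types[k] == "desc":
                # Text tokens are blocked from earlier image descriptions,
                # so visual information must come through cross-attention.
                mask[q][k] = False
    return mask


types = ["text", "text", "desc", "desc", "text", "text"]
mask = modulated_causal_mask(types)
```

In this toy sequence, the text token at position 4 can still attend to the text at positions 0 and 1 but not to the description tokens at positions 2 and 3, while the description tokens themselves keep full causal access.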
2. Citation Module for Reference Tracking
The module is implemented as a pipeline of pre-trained, off-the-shelf components and requires no training on the target datasets:
- spaCy for POS tagging to locate main object words \(o_i\) in the image descriptions.
- GroundingDINO for open-vocabulary object detection to obtain object bounding boxes.
- SAM (Segment Anything Model) for image segmentation to generate object masks.
- DINOv2 to extract visual features \(f_i\) of the objects after removing backgrounds.
Finally, clustering is performed based on cosine similarity (threshold \(\tau=0.6\)). Different occurrences of the same object are assigned the same citation index. For example, if the features of "dog" in "a dog is in front of a fireplace" and "a dog running in the snow" are similar, both are labeled as dog[cite]0[/cite].
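The final clustering step can be illustrated with a greedy threshold rule. The summary only states that cosine similarity with \(\tau=0.6\) is used, so the greedy first-representative strategy, the function names, and the toy 3-dimensional features below are assumptions; real features would come from DINOv2.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_citations(features, tau=0.6):
    """Greedily assign a citation index to each object feature, in dialogue order.

    A feature joins the most similar existing cluster if the similarity to that
    cluster's first member is at least tau; otherwise it opens a new cluster.
    (Assumed strategy; the paper's exact clustering procedure may differ.)
    """
    representatives = []  # first feature seen for each citation index
    indices = []
    for feature in features:
        best_idx, best_sim = None, tau
        for idx, rep in enumerate(representatives):
            sim = cosine(feature, rep)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            representatives.append(feature)
            best_idx = len(representatives) - 1
        indices.append(best_idx)
    return indices


features = [
    [1.0, 0.0, 0.0],  # "dog" in front of a fireplace
    [0.0, 1.0, 0.0],  # an unrelated object
    [0.9, 0.1, 0.0],  # "dog" running in the snow
]
citations = assign_citations(features)
```

Here the first and third features are nearly parallel, so both occurrences receive citation index 0, while the unrelated object opens index 1.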
3. Consistent Image Maintenance during Inference
During inference, if the generator predicts an [IMG] token, it starts generating the image description \(u_t'\) with citation tokens. After the citation token \(c_t\) is extracted, the plain textual description \(u_t\) and all historical images sharing the same citation token are sent together to BLIP-Diffusion for customized generation.
If no historical citation match is found, standard Stable Diffusion 2.1 is used for generation.
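The routing logic at inference time can be sketched as follows. The `[cite]k[/cite]` token format comes from the text above; everything else (the function names, the `history` dictionary, the `"customized"` / `"standard"` labels standing in for BLIP-Diffusion and Stable Diffusion 2.1) is hypothetical and only meant to show the decision, not to call either model.

```python
import re

# Matches citation tokens such as "[cite]0[/cite]".
CITE_PATTERN = re.compile(r"\[cite\](\d+)\[/cite\]")


def parse_description(description_with_citation):
    """Split a generated description into plain text and its citation indices."""
    indices = [int(m) for m in CITE_PATTERN.findall(description_with_citation)]
    plain = CITE_PATTERN.sub("", description_with_citation)
    return plain, indices


def choose_generator(description_with_citation, history):
    """Pick the generation path for one image response.

    history: dict mapping citation index -> list of earlier images that
    contain the cited object (hypothetical structure).
    """
    plain, indices = parse_description(description_with_citation)
    references = [img for idx in indices for img in history.get(idx, [])]
    if references:
        # Same object seen before: use the customized (subject-driven) model.
        return "customized", plain, references
    # No matching history: fall back to the standard text-to-image model.
    return "standard", plain, []


history = {0: ["turn1.png"]}
with_match = choose_generator("a dog[cite]0[/cite] running in the snow", history)
without_match = choose_generator("a cat[cite]1[/cite] on a sofa", history)
```

The first call finds an earlier image for citation index 0 and takes the customized path; the second has no history for index 1 and falls back to standard generation.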
Loss & Training¶
A two-stage training strategy is adopted:
- Stage 1: Only the language model layers \(\theta_{\mathcal{G}_l}\) are trained, with batch size=256, maximum token length=256.
- Stage 2: The perceiver resampler \(\theta_{\mathcal{V}}\) of the Visual Encoder and the visual cross-attention layers \(\theta_{\mathcal{G}_v}\) are jointly trained, with batch size=128, maximum token length=512.
Both stages utilize standard next token prediction loss (negative log-likelihood) using the AdamW optimizer with a learning rate of 1e-4, trained on 16 × NVIDIA A100 80GB GPUs.
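For reference, the next-token prediction objective mentioned above has the usual form. The symbols \(x_t\) (the \(t\)-th token of the training sequence), \(T\) (sequence length), and \(I\) (the images available in the dialogue context) are notation assumed here, not taken from the paper:

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t},\, \mathcal{V}(I)\right)
\]

Under this reading, \(\theta = \theta_{\mathcal{G}_l}\) is updated in Stage 1, and \(\theta = \{\theta_{\mathcal{V}}, \theta_{\mathcal{G}_v}\}\) is updated in Stage 2, with all other parameters frozen in each stage.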
Key Experimental Results¶
Main Results¶
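Comprehensive evaluation on PhotoChat and MMDialog (IS: Inception Score of image responses; TID: textual image description; TR: textual response; B1/B2: BLEU-1/2; R-1/R-L: ROUGE-1/L):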
| Model | Intent F1 | IS | TID B1 | TID B2 | TID R-1 | TID R-L | TR B1 | TR B2 | TR R-1 | TR R-L |
|---|---|---|---|---|---|---|---|---|---|---|
| PhotoChat | | | | | | | | | | |
| Divter | 56.2 | 15.8 | 15.1 | 11.4 | - | 15.8 | 6.52 | 1.66 | - | 5.69 |
| Divter_LLM (3B) | 54.1 | 16.1 | 41.3 | 27.1 | 43.3 | 41.6 | 11.4 | 4.75 | 11.2 | 10.8 |
| BI-MDRG | 55.7 | 16.7 | 42.1 | 28.2 | 44.6 | 42.5 | 12.4 | 5.12 | 12.1 | 11.2 |
| MMDialog | | | | | | | | | | |
| Divter | 71.8 | 20.5 | - | - | - | - | 9.44 | 7.45 | - | 11.2 |
| MiniGPT5 (9B) | - | 20.2 | - | - | - | - | 29.1 | 19.5 | - | 12.1 |
| Divter_LLM (3B) | 67.3 | 21.0 | 44.2 | 35.7 | 45.5 | 43.6 | 21.3 | 16.2 | 20.4 | 19.4 |
| BI-MDRG (4B) | 70.5 | 22.4 | 52.2 | 44.7 | 53.2 | 51.6 | 27.6 | 23.5 | 25.7 | 24.8 |
Image grounding evaluation (ImageChat, zero-shot transfer):
| Model | B1 | R-1 | R-L |
|---|---|---|---|
| Divter_LLM | 8.6 | 10.3 | 9.6 |
| BI-MDRG w/o mask | 10.0 | 11.1 | 10.2 |
| BI-MDRG | 10.9 | 11.7 | 10.9 |
Ablation Study¶
Impact of the Citation framework on image consistency (MDIC dataset):
| Citation Method | VLM Size | Diffusion | DINOv2 ↑ |
|---|---|---|---|
| Citation Module | 4B | Custom+Diffusion | 0.53 |
| LLMCite | 4B | Custom+Diffusion | 0.34 |
| w/o Citation | 4B | Diffusion | 0.25 |
| LLMCite | 9B | Custom+Diffusion | 0.33 |
| w/o Citation | 9B | Diffusion | 0.26 |
Citation Token Prediction Accuracy:
| Model | Acc. | DINOv2 ↑ |
|---|---|---|
| Divter_LLM + LLMCite | 33.5 | 0.32 |
| BI-MDRG | 84.0 | 0.53 |
Quality of the Citation Module: Evaluation of pseudo-label quality on 300 dialogues in the MDIC dataset yields F1 = 0.72.
Key Findings¶
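- BI-MDRG (4B) outperforms MiniGPT5 (9B) on most reported MMDialog metrics (IS 22.4 vs 20.2, TR B2 23.5 vs 19.5, TR R-L 24.8 vs 12.1), although MiniGPT5 keeps a higher TR B1 (29.1 vs 27.6). This suggests that architectural design can matter more than scaling model size alone.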
- Attention mask modulation is effective: On ImageChat, BI-MDRG improves B1 from 10.0 to 10.9 and R-L from 10.2 to 10.9 compared to the version without masking.
- Scaling the model size fails to resolve consistency issues: Scaling from 4B to 9B improves textual response quality, but the DINOv2 consistency score remains nearly unchanged (0.25 vs 0.26 without citation, 0.34 vs 0.33 with LLMCite), so a dedicated citation framework is still needed.
- The Citation Module is far superior to LLM-instructed methods: It achieves a citation prediction accuracy of 84.0% vs 33.5%, indicating that visual feature clustering is more reliable than pure text reasoning.
Highlights & Insights¶
- Precise problem definition: Clear analysis of the two fundamental drawbacks of the "text-as-intermediary" paradigm (lack of grounding + lack of consistency), with targeted designs for dual bridging pathways.
- Ingenious Citation Module design: Built entirely by composing off-the-shelf pre-trained models (spaCy + GroundingDINO + SAM + DINOv2), requiring zero extra training cost and ensuring modularity and replaceability.
- Attention mask modulation: A simple and elegant approach that forces the model to obtain image information from visual features rather than textual descriptions, with extremely low implementation cost.
- Creation of the MDIC benchmark: Fills the gap in image consistency evaluation for multimodal dialogues, providing 300 manually annotated dialogues.
- Important insight revealed: Scaling model size alone cannot solve the image consistency issue, highlighting the necessity of a dedicated maintenance framework.
Limitations & Future Work¶
- Pipeline-based architecture: Relies on a chain of independent components (POS tagger -> detector -> segmentation -> feature extraction -> clustering), so an error in any early stage propagates to every later one.
- Single-object tracking limitation: The Citation Module only extracts one primary object word per turn, making it incapable of handling scenarios where multiple objects require consistency simultaneously.
- Small-scale MDIC dataset: With only 300 dialogues, the benchmark offers limited statistical power, so small differences in consistency scores should be interpreted with caution.
- Dependence on intermediate text representations: Despite incorporating visual bridging, image generation still relies on the text description -> text-to-image pipeline, representing an incremental improvement over the legacy paradigm.
- Limited customization capability of BLIP-Diffusion: Zero-shot subject-driven generation quality is constrained by the performance of the customized generation model itself.
Related Work & Insights¶
- Flamingo: The architecture of BI-MDRG directly borrows Flamingo's cross-attention design to inject visual features into the language model.
- Divter: The direct predecessor of BI-MDRG, which established the text-as-intermediary paradigm for the MDRG task.
- BLIP-Diffusion: Provides zero-shot subject-driven image generation capabilities, serving as the underlying technology for maintaining image consistency.
- DINOv2 + SAM + GroundingDINO: Demonstrates the capability of pipelined foundation models, successfully executing complex object tracking tasks without training.
- Insights: Under circumstances where end-to-end multimodal generative models are still immature, this work shows how to cleverly leverage modular components and attention mechanisms to mitigate the information loss in pipeline-based methods.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The designs of the Citation Module and attention mask modulation are novel, although the framework is still incremental over existing paradigms.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Highly comprehensive, featuring three datasets, the self-created MDIC evaluation benchmark, and multi-dimensional ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear problem definition, highly illustrative figures, and well-structured organization.
- Value: ⭐⭐⭐⭐ — Multimodal dialogue consistency is a highly practical and significant problem. The proposed method is practical and open-sourced.