BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation

Conference: ECCV2024
arXiv: 2408.05926
Code: hee-suk-yoon/BI-MDRG
Area: Dialogue Systems
Keywords: Multimodal Dialogue, Image Consistency, Vision-Language Models, Text-to-Image Generation, Dialogue Response Generation

TL;DR

This paper proposes BI-MDRG, a framework that bridges image history into multimodal dialogue response generation in two ways: it grounds textual responses in the actual image features, and it keeps recurring objects visually consistent across successive image responses.

Background & Motivation

Multimodal Dialogue Response Generation (MDRG) requires models to generate textual, visual, or mixed-modality responses based on the dialogue context. Due to the scarcity of large-scale multimodal dialogue datasets, previous approaches (such as Divter) utilize text as an intermediary: they first convert images in the dialogue history into textual descriptions, then generate responses based on pure text, and finally convert the textual descriptions into images via a text-to-image model.

This "detour" strategy suffers from two core issues:

Textual responses lack image grounding: The model can only understand images through textual descriptions (e.g., "a dog eating a watermelon"), failing to answer questions that require actual visual information (e.g., "What breed is your dog?").

Inconsistent objects in image responses: The generated images across multi-turn dialogues fail to maintain visual consistency of the same object (e.g., the "dog" in successive turns looks completely different).

Method

Overall Architecture

BI-MDRG consists of three core components:

Textual Dialogue Response Generator \(\mathcal{G}\): A decoder-only language model augmented with visual cross-attention layers that directly receive image features extracted by the Visual Encoder \(\mathcal{V}\).
Citation Module \(\mathcal{C}\): A training-free module that leverages off-the-shelf components (POS tagger + object detection + segmentation + feature extraction) to track recurring identical objects across the dialogue, appending citation tokens (e.g., [cite]0[/cite]) to textual image descriptions.
Customized Text-to-Image Model \(\mathcal{F}\): During inference, it uses the citation tokens to retrieve historical images of the same object and conditions a customized image generation model on them, which maintains object consistency.
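
The citation token format shown above (e.g., dog[cite]0[/cite]) lends itself to simple parsing. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the regular expression assumes the object word immediately precedes its citation tag, as in the examples in these notes.

```python
import re

# Assumed format: an object word immediately followed by a citation tag,
# e.g. "dog[cite]0[/cite]". This pattern is an illustration, not the paper's parser.
CITE_PATTERN = re.compile(r"(\w+)\[cite\](\d+)\[/cite\]")


def extract_citations(description: str) -> list[tuple[str, int]]:
    """Return (object word, citation index) pairs found in a description."""
    return [(word, int(idx)) for word, idx in CITE_PATTERN.findall(description)]


def strip_citations(description: str) -> str:
    """Remove citation tags, leaving a plain description for a text-to-image model."""
    return re.sub(r"\[cite\]\d+\[/cite\]", "", description)
```

For instance, `extract_citations` on "a dog[cite]0[/cite] running in the snow" yields the pair ("dog", 0), and `strip_citations` recovers the plain description.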

The overall framework is initialized from OpenFlamingo 4B (ViT-L + RedPajama 3B), and image descriptions are generated by BLIP2-flan-t5-xl.

Key Designs

1. Multimodal Causal Attention Mask Modulation

The traditional causal mask allows the textual response \(r_i^{\text{Text}}\) to access all preceding textual image descriptions \(u_{1:i-1}'\), which leads the model to rely on textual descriptions rather than the actual images. BI-MDRG modifies the attention mask to block the textual response from accessing preceding image descriptions, forcing the model to acquire visual information directly from image features through the cross-attention layer. Meanwhile, image descriptions are retained as input, as the model still needs to generate textual image descriptions for the text-to-image model.
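
The mask modification can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: every token is tagged as either a description token or ordinary dialogue text, all non-description tokens are treated as text that must not see earlier descriptions, and description tokens keep ordinary causal access. The function name and the boolean tagging scheme are hypothetical and not taken from the released code.

```python
import numpy as np


def modulated_causal_mask(is_description) -> np.ndarray:
    """Build a boolean attention mask where True means "may attend".

    is_description[j] is True when token j belongs to a textual image
    description and False when it belongs to ordinary dialogue text.
    """
    is_description = np.asarray(is_description, dtype=bool)
    n = is_description.shape[0]
    # Standard causal mask: query position i may attend to key positions j <= i.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Block every (query, key) pair where the query is a text token
    # and the key is an image-description token.
    blocked = (~is_description)[:, None] & is_description[None, :]
    return causal & ~blocked
```

With the sequence [text, description, description, text], the last text token can attend to the first text token and to itself, but not to the two description tokens, so any visual information has to come through cross-attention.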

2. Citation Module for Reference Tracking

The Citation Module is a pipeline of pre-trained, off-the-shelf components and requires no training on the target datasets:

spaCy for POS tagging to locate main object words \(o_i\) in the image descriptions.
GroundingDINO for open-vocabulary object detection to obtain object bounding boxes.
SAM (Segment Anything Model) for image segmentation to generate object masks.
DINOv2 to extract visual features \(f_i\) of the objects after removing backgrounds.

Finally, the object features are clustered by cosine similarity (threshold \(\tau=0.6\)), and different occurrences of the same object are assigned the same citation index. For example, if the features of "dog" in "a dog is in front of a fireplace" and "a dog running in the snow" are similar, both are labeled as dog[cite]0[/cite].
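
The clustering step can be sketched as a greedy assignment over object features. These notes do not specify the exact clustering procedure, so the greedy scheme below (comparing each feature with the first member of every existing cluster) is an assumption made for illustration, and the function name is hypothetical. Features are assumed to be non-zero vectors, such as DINOv2 embeddings of the segmented objects.

```python
import numpy as np


def assign_citation_indices(features, tau: float = 0.6) -> list[int]:
    """Assign a citation index to each object feature, in dialogue order.

    A feature joins the most similar existing cluster when its cosine
    similarity to that cluster's prototype is at least tau; otherwise it
    opens a new cluster and receives a new citation index.
    """
    feats = np.asarray(features, dtype=float)
    # L2-normalize so that a dot product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    prototypes: list[np.ndarray] = []  # first member of each cluster (assumed choice)
    indices: list[int] = []
    for f in feats:
        if prototypes:
            sims = np.array([float(p @ f) for p in prototypes])
            best = int(np.argmax(sims))
            if sims[best] >= tau:
                indices.append(best)
                continue
        prototypes.append(f)
        indices.append(len(prototypes) - 1)
    return indices
```

Two near-parallel feature vectors end up with the same index, while an orthogonal one starts a new cluster.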

3. Consistent Image Maintenance during Inference

During inference, if the generator predicts an [IMG] token, it starts generating the image description \(u_t'\) with citation tokens. After extracting the citation token \(c_t\), the textual description \(u_t\) and all historical images sharing the same citation token are sent together to BLIP-Diffusion for customized generation:

\[r_t^{\text{Image}} = \mathcal{F}(u_t \mid \{r_i^{\text{Image}} \mid c_i = c_t\}_{i=1}^{t-1})\]

If no historical citation match is found, standard Stable Diffusion 2.1 is used for generation.
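
The routing between customized and standard generation can be sketched as follows. The two generator callables stand in for BLIP-Diffusion and Stable Diffusion 2.1; their signatures, the history representation, and the function name are hypothetical and chosen only to make the control flow explicit.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Image = Any  # placeholder type for whatever image object the generators use


def generate_image_response(
    description: str,
    citation: Optional[int],
    history: Sequence[Tuple[Image, Optional[int]]],
    customized_model: Callable[[str, List[Image]], Image],
    standard_model: Callable[[str], Image],
) -> Image:
    """Pick the generator for the current turn based on the citation index.

    history holds (image, citation index) pairs from earlier turns; images
    that share the current citation index serve as subject references.
    """
    references = [
        img for img, c in history if citation is not None and c == citation
    ]
    if references:
        # Same object seen before: condition generation on its past images.
        return customized_model(description, references)
    # No earlier image of this object: fall back to plain text-to-image.
    return standard_model(description)
```

Keeping the generators as injected callables makes it easy to swap either model without touching the routing logic.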

Loss & Training

A two-stage training strategy is adopted:

Stage 1: Only the language model layers \(\theta_{\mathcal{G}_l}\) are trained, with batch size=256, maximum token length=256.
Stage 2: The perceiver resampler \(\theta_{\mathcal{V}}\) of the Visual Encoder and the visual cross-attention layers \(\theta_{\mathcal{G}_v}\) are jointly trained, with batch size=128, maximum token length=512.

Both stages minimize the standard next-token prediction loss (negative log-likelihood) with the AdamW optimizer at a learning rate of 1e-4, and training runs on 16 × NVIDIA A100 80GB GPUs.
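
The two-stage schedule can be written down as a small configuration sketch. The hyperparameters come from the description above; the parameter-group names are hypothetical labels, and the sketch assumes that any group not listed for a stage stays frozen, which these notes imply but do not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class StageConfig:
    name: str
    trainable_groups: Tuple[str, ...]  # hypothetical parameter-group labels
    batch_size: int
    max_tokens: int
    learning_rate: float = 1e-4  # AdamW learning rate reported for both stages


STAGES = (
    StageConfig("stage1", ("language_model",), batch_size=256, max_tokens=256),
    StageConfig(
        "stage2",
        ("perceiver_resampler", "visual_cross_attention"),
        batch_size=128,
        max_tokens=512,
    ),
)


def trainable_flags(param_groups: Iterable[str], stage: StageConfig) -> Dict[str, bool]:
    """Map each parameter group to whether it is updated in the given stage."""
    return {group: group in stage.trainable_groups for group in param_groups}
```

In a real training loop, these flags would typically be used to set `requires_grad` on the corresponding parameters before building the optimizer.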

Key Experimental Results

Main Results

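Comprehensive evaluation on PhotoChat and MMDialog (IS: Inception Score of image responses; TID: textual image description; TR: textual response; B1/B2: BLEU-1/2; R-1/R-L: ROUGE-1/L):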

Model	Intent F1	IS	TID B1	TID B2	TID R-1	TID R-L	TR B1	TR B2	TR R-1	TR R-L
PhotoChat
Divter	56.2	15.8	15.1	11.4	-	15.8	6.52	1.66	-	5.69
Divter_LLM (3B)	54.1	16.1	41.3	27.1	43.3	41.6	11.4	4.75	11.2	10.8
BI-MDRG	55.7	16.7	42.1	28.2	44.6	42.5	12.4	5.12	12.1	11.2
MMDialog
Divter	71.8	20.5	-	-	-	-	9.44	7.45	-	11.2
MiniGPT5 (9B)	-	20.2	-	-	-	-	29.1	19.5	-	12.1
Divter_LLM (3B)	67.3	21.0	44.2	35.7	45.5	43.6	21.3	16.2	20.4	19.4
BI-MDRG (4B)	70.5	22.4	52.2	44.7	53.2	51.6	27.6	23.5	25.7	24.8

Image grounding evaluation (ImageChat, zero-shot transfer):

Model	B1	R-1	R-L
Divter_LLM	8.6	10.3	9.6
BI-MDRG w/o mask	10.0	11.1	10.2
BI-MDRG	10.9	11.7	10.9

Ablation Study

Impact of the Citation framework on image consistency (MDIC dataset):

Citation Method	VLM Size	Diffusion	DINOv2 ↑
Citation Module	4B	Custom+Diffusion	0.53
LLMCite	4B	Custom+Diffusion	0.34
w/o Citation	4B	Diffusion	0.25
LLMCite	9B	Custom+Diffusion	0.33
w/o Citation	9B	Diffusion	0.26

Citation Token Prediction Accuracy:

Model	Acc.	DINOv2 ↑
Divter_LLM + LLMCite	33.5	0.32
BI-MDRG	84.0	0.53

Quality of the Citation Module: Evaluation of pseudo-label quality on 300 dialogues in the MDIC dataset yields F1 = 0.72.

Key Findings

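BI-MDRG (4B) outperforms MiniGPT5 (9B) on most reported MMDialog metrics (IS 22.4 vs 20.2, TR B2 23.5 vs 19.5, TR R-L 24.8 vs 12.1), trailing only on TR B1 (27.6 vs 29.1). This suggests that architectural design can matter more than scaling model size alone.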
Attention mask modulation is effective: On ImageChat, masking improves B1 from 10.0 to 10.9, R-1 from 11.1 to 11.7, and R-L from 10.2 to 10.9 over the variant without masking.
Scaling the model size fails to resolve consistency issues: Scaling from 4B to 9B improves textual response quality, but the DINOv2 consistency score remains nearly unchanged both without citation (0.25 vs 0.26) and with LLMCite (0.34 vs 0.33), so a dedicated citation framework is still needed.
The Citation Module is far superior to LLM-instructed methods: It achieves a citation prediction accuracy of 84.0% vs 33.5%, indicating that visual feature clustering is more reliable than pure text reasoning.

Highlights & Insights

Precise problem definition: Clear analysis of the two fundamental drawbacks of the "text-as-intermediary" paradigm (lack of grounding + lack of consistency), with targeted designs for dual bridging pathways.
Ingenious Citation Module design: Built entirely by composing off-the-shelf pre-trained models (spaCy + GroundingDINO + SAM + DINOv2), requiring zero extra training cost and ensuring modularity and replaceability.
Attention mask modulation: A simple and elegant approach that forces the model to obtain image information from visual features rather than textual descriptions, with extremely low implementation cost.
Creation of the MDIC benchmark: Fills the gap in image consistency evaluation for multimodal dialogues, providing 300 manually annotated dialogues.
Important insight revealed: Scaling model size alone cannot solve the image consistency issue, highlighting the necessity of a dedicated maintenance framework.

Limitations & Future Work

Pipeline-based architecture: Relies on a chain of independent components (POS tagger -> detector -> segmentation -> feature extraction -> clustering), so an error in an early stage can cascade through all later stages.
Single-object tracking limitation: The Citation Module only extracts one primary object word per turn, making it incapable of handling scenarios where multiple objects require consistency simultaneously.
Small-scale MDIC dataset: With only 300 dialogues, the benchmark offers limited statistical power for comparing methods.
Dependence on intermediate text representations: Despite incorporating visual bridging, image generation still relies on the text description -> text-to-image pipeline, representing an incremental improvement over the legacy paradigm.
Limited customization capability of BLIP-Diffusion: Zero-shot subject-driven generation quality is constrained by the performance of the customized generation model itself.

Related Work & Insights

Flamingo: The architecture of BI-MDRG directly borrows Flamingo's cross-attention design to inject visual features into the language model.
Divter: The direct predecessor of BI-MDRG, which established the text-as-intermediary paradigm for the MDRG task.
BLIP-Diffusion: Provides zero-shot subject-driven image generation capabilities, serving as the underlying technology for maintaining image consistency.
DINOv2 + SAM + GroundingDINO: Together they show that pipelined foundation models can track objects across dialogue turns without any additional training.
Insights: Under circumstances where end-to-end multimodal generative models are still immature, this work shows how to cleverly leverage modular components and attention mechanisms to mitigate the information loss in pipeline-based methods.

Rating

Novelty: ⭐⭐⭐⭐ — The designs of the Citation Module and attention mask modulation are novel, although the framework is still incremental over existing paradigms.
Experimental Thoroughness: ⭐⭐⭐⭐ — Highly comprehensive, featuring three datasets, the self-created MDIC evaluation benchmark, and multi-dimensional ablations.
Writing Quality: ⭐⭐⭐⭐ — Clear problem definition, highly illustrative figures, and well-structured organization.
Value: ⭐⭐⭐⭐ — Multimodal dialogue consistency is a highly practical and significant problem. The proposed method is practical and open-sourced.