Merlin: Empowering Multimodal LLMs with Foresight Minds
Conference: ECCV 2024
arXiv: 2312.00589
Code: GitHub
Area: Multimodal VLM
Keywords: Multimodal Large Language Models, Future Reasoning, Trajectory Prediction, Foresight Thinking, Visual Understanding
TL;DR
Merlin proposes a two-stage training paradigm consisting of Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT). By incorporating trajectory modeling, it gives Multimodal Large Language Models (MLLMs) "foresight thinking" capabilities, enabling them to predict future events and reason based on current observations.
Background & Motivation
Humans can predict future events based on current observations, a phenomenon referred to in neuroscience as "predictive processing." However, although existing Multimodal Large Language Models (MLLMs) such as GPT-4V and Bard excel in image understanding and logical reasoning, they lack the ability to predict future events from current image observations. Even when provided with multi-frame image sequences, existing MLLMs still struggle to analyze and infer the specific behaviors of targets (such as predicting object movement or interaction).
The authors decompose the human process of anticipating the future into two stages:
Observing Dynamic Cues: Observing the motion and state changes of the targets.
Analyzing Behavioral Patterns and Reasoning: Analyzing behavioral patterns based on observations to infer potential events.
The LLM component of existing MLLMs already possesses robust logical reasoning capabilities (the second stage). The key challenge lies in the first stage—how to enable the MLLM to correctly acquire spatiotemporal dynamic information from multi-image observations. Directly modeling the next frame (e.g., reconstructing the next image frame) suffers from visual information redundancy, making it difficult to directly extract dynamic cues. To address this, the authors propose using trajectories as the learning objective. As highly structured representations, trajectories can connect past and future temporal contexts.
Method
Overall Architecture
Merlin consists of three core components:
- Image Encoder: Uses a pre-trained CLIP ViT-L/14 with an input image resolution of 448×448, generating 1024 encoded tokens.
- Large Language Model Decoder: Uses the open-source Vicuna-7B v1.5.
- Modality Alignment Projector: Uses a 3×3 2D convolutional layer (stride=2, padding=1) to achieve dimension projection and token aggregation.
Reasons for choosing 2D convolution over 1D linear layers or cross-attention layers as the connector:
- 2D convolution can aggregate local visual tokens across spatial scales, efficiently converting spatial information to channel information.
- Compared to cross-attention, 2D convolution exhibits better convergence properties, laying a solid foundation for two-stage training.
- It effectively compresses the number of tokens, supporting high-resolution and multi-frame inputs.
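The token compression can be checked with a little arithmetic. The sketch below is not the authors' code: it applies the standard convolution output-size formula to the stated settings (ViT-L/14 at 448×448, 3×3 convolution with stride 2 and padding 1). The function names are made up for illustration, and the 8-frame clip length is taken from the limit mentioned under Limitations.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def visual_tokens(resolution: int, patch: int = 14, use_conv: bool = True) -> int:
    """Tokens per image after the ViT patch grid and the optional conv projector."""
    grid = resolution // patch      # patches per side produced by ViT-L/14
    if use_conv:
        grid = conv_out_size(grid)  # 3x3 conv, stride 2, padding 1 halves each side
    return grid * grid


encoder_tokens = visual_tokens(448, use_conv=False)   # 32 * 32 patch tokens
projected_tokens = visual_tokens(448, use_conv=True)  # 16 * 16 tokens after the projector
clip_tokens = 8 * projected_tokens                    # visual tokens for an 8-frame clip

print(encoder_tokens, projected_tokens, clip_tokens)  # 1024 256 2048
```

Under these assumptions, the projector cuts each image from 1024 to 256 tokens, which is what makes several high-resolution frames fit in one prompt.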
Key Designs
Foresight Pre-Training (FPT)
The core idea of FPT is causal modeling of raw temporal trajectories interleaved with multi-frame images, enabling the MLLM to perceive dynamic cues across frames.
Modeling formulation: Given multi-frame images in a video clip, the model takes the observation (description or location) of the target in the first frame as the query and predicts the complete trajectory of the target throughout the video:
\[ P(Y \mid X_1, X_2, \ldots, X_N, O_{first}) \]
where \(X_i\) is the \(i\)-th of \(N\) frames, \(O_{first}\) is the first-frame observation, and \(Y\) is the trajectory of the target.
Data construction follows three principles:
- Precisely defined task prompts and answer formats: Task prompts are used to inform the MLLM of the specific task (detection or tracking), and the answer format is specified in the question, allowing different types of tasks to be flexibly organized without compromising general language capabilities.
- Clear indicators for multimodal information: Special frame indicators (e.g., `frame1:<image>`, `frame2:<image>`) are added for each set of image tokens to help the MLLM better attend to the corresponding images.
- Interleaved organization of frames and observations: For the same target identity, the frames in which it appears and its location observations are interleaved and wrapped with ID tokens (`<Idi>` and `</Idi>`) to construct the trajectory.
Observation types are categorized into three: location description, appearance description, and action description. One is randomly selected as the query.
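The three data-construction principles can be made concrete with a small sketch. Only the frame indicators and the `<Idi>`/`</Idi>` wrapping come from the description above. The prompt wording, the separators, the box convention `(x1, y1, x2, y2)`, and all function names are assumptions made for illustration, not the authors' actual template.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); the coordinate convention is assumed


def frame_header(num_frames: int) -> str:
    """One indicator per frame so each group of image tokens is labelled."""
    return ",".join(f"frame{i}:<image>" for i in range(1, num_frames + 1))


def format_trajectory(target_id: int, boxes: Dict[int, Box]) -> str:
    """Interleave frame indices with boxes for one identity, wrapped in ID tokens."""
    steps = [f"frame{idx}:[{','.join(map(str, boxes[idx]))}]" for idx in sorted(boxes)]
    return f"<Id{target_id}>" + ";".join(steps) + f"</Id{target_id}>"


def build_sample(num_frames: int, query: str, target_id: int,
                 boxes: Dict[int, Box]) -> Tuple[str, str]:
    """Return a (prompt, answer) pair for one tracked target."""
    # The task sentence is a hypothetical stand-in for a precise task prompt.
    prompt = f"{frame_header(num_frames)} Track {query} and report its box in every frame."
    return prompt, format_trajectory(target_id, boxes)


prompt, answer = build_sample(
    num_frames=2,
    query="the player in red",  # an appearance description used as the query
    target_id=1,
    boxes={1: (10, 20, 110, 220), 2: (30, 22, 130, 222)},
)
print(prompt)
print(answer)  # <Id1>frame1:[10,20,110,220];frame2:[30,22,130,222]</Id1>
```

Here the query is an appearance description. A location or action description of the first-frame target could be substituted in the same slot.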
Training strategy: Unlike earlier pipelines that treat modality alignment and multi-task pre-training as separate stages, Merlin merges them into a single stage and unfreezes all modules for pre-training. It performs multi-task learning on a hybrid dataset of 10M image-text pairs and approximately 5M QA samples from various sources.
Foresight Instruction Tuning (FIT)
FPT equips the model with the ability to observe multi-frame dynamic cues, but this alone is insufficient to achieve true "foresight thinking." On top of FPT, FIT introduces Trajectory Chain-of-Thought (T-CoT), utilizing trajectory modeling as a bridge in the logical reasoning chain.
Core formulation:
\[ P(Z \mid X_1, X_2, \ldots, X_N, O_{first}, Y) \]
where \(Z\) is the future observation (which can be actions, events, trends, or possibilities), conditioned jointly on the \(N\) frames \(X_i\), the first-frame observation \(O_{first}\), and the trajectory \(Y\), enabling the MLLM to causally predict the future.
How T-CoT works: When a user queries the future of a target, Merlin first outputs the observed trajectory of this target, then outputs the trajectories of other relevant targets, and finally reasons about potential future events based on these trajectories. For example, in a soccer scene, Merlin first outputs the trajectory of a player in red, then outputs the trajectory of a player in white, and infers that the player in white might slide tackle, causing both players to fall.
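One way to read T-CoT, offered here as an interpretation rather than a formula quoted from the paper, is as the chain-rule factorization of the joint distribution over the trajectory and the future observation. Writing \(X\) for the full set of input frames (shorthand introduced here):

\[ P(Y, Z \mid X, O_{first}) = P(Y \mid X, O_{first}) \, P(Z \mid X, O_{first}, Y) \]

The first factor corresponds to the trajectory objective learned in FPT, and the second to the future-reasoning step added in FIT. Emitting the trajectory before the answer therefore matches the order in which an autoregressive decoder generates tokens.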
FIT Data construction: Uses GPT-4 to generate approximately 30K T-CoT dialogue pairs based on the trajectory and action information from three scene datasets: MultiSports, TITAN, and STAR.
Loss & Training
Two-stage training configuration:
| Configuration | Pre-training (FPT) | Instruction Tuning (FIT) |
|---|---|---|
| Vision Encoder | Unfrozen | Frozen |
| Projector | Unfrozen | Unfrozen |
| LLM | Unfrozen | Unfrozen |
| Learning Rate | 5e-5 | 5e-5 |
| Global Batch Size | 2048 | 256 |
| Training Steps | 7k | 3k |
| Optimizer | AdamW (\(\beta_2=0.95\)) | AdamW (\(\beta_2=0.95\)) |
| LR Scheduler | cosine decay | cosine decay |
| Precision | bfloat16 | bfloat16 |
Data composition:
- FPT phase: 10M image-text pairs (LAION) + 5M multi-task QA data
- FIT phase: 665K LLaVA instruction data + 30K T-CoT dialogues + 40K FPT sampled data
Training is conducted on 64 NVIDIA A800 GPUs, taking approximately 12 hours for pre-training and 3 hours for instruction tuning.
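The configuration can be sanity-checked against the data sizes. The sketch below treats "7k" and "3k" as exactly 7,000 and 3,000 steps and uses the rounded dataset sizes listed above. Both are assumptions, so the results are rough estimates rather than reported figures.

```python
def samples_seen(steps: int, global_batch: int) -> int:
    """Total training samples consumed over a run."""
    return steps * global_batch


# Foresight Pre-Training: 7k steps at a global batch size of 2048.
fpt_seen = samples_seen(7_000, 2_048)
fpt_data = 10_000_000 + 5_000_000        # image-text pairs + multi-task QA

# Foresight Instruction Tuning: 3k steps at a global batch size of 256.
fit_seen = samples_seen(3_000, 256)
fit_data = 665_000 + 30_000 + 40_000     # LLaVA + T-CoT + sampled FPT data

fpt_epochs = fpt_seen / fpt_data         # passes over the FPT data
fit_epochs = fit_seen / fit_data         # passes over the FIT data
gpu_hours = 64 * (12 + 3)                # 64 GPUs for about 12 h + 3 h

print(f"FPT: {fpt_seen:,} samples, about {fpt_epochs:.2f} epochs")
print(f"FIT: {fit_seen:,} samples, about {fit_epochs:.2f} epochs")
print(f"Total: about {gpu_hours} GPU-hours")
```

On these numbers, each stage amounts to roughly one pass over its data, which is consistent with the step counts in the table.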
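Key Experimental Results
Main Results
Table 1: Future Reasoning Capabilities (MMBench Sub-tasks). OL: object localization; PPR: physical property reasoning; FR: function reasoning; IR: identity reasoning; FP: future prediction.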
| Method | LLM | Dev Avg | OL | PPR | FR | IR | FP | Test Avg |
|---|---|---|---|---|---|---|---|---|
| mPLUG-Owl | 7B | 41.0 | 18.5 | 18.7 | 66.7 | 86.7 | 14.3 | 45.9 |
| Shikra | 7B | 51.5 | 32.1 | 30.7 | 63.0 | 88.9 | 42.9 | 60.0 |
| Kosmos-2 | 1.6B | 54.4 | 38.3 | 33.3 | 56.8 | 91.1 | 52.4 | 58.2 |
| LLaVA-1.5 | 7B | 59.6 | 43.2 | 52.0 | 71.6 | 93.3 | 38.1 | - |
| Merlin | 7B | 64.4 | 42.0 | 54.7 | 72.8 | 97.8 | 54.8 | 66.5 |
Merlin achieves the best result in every column shown except OL, where LLaVA-1.5 is slightly ahead (43.2 vs. 42.0), and leads on both overall averages (64.4 dev, 66.5 test).
Table 2: Object tracking evaluation
| Method | LaSOT Success | GOT10k AO | GOT10k SR₀.₅ | GOT10k SR₀.₇₅ |
|---|---|---|---|---|
| SiamFC | 33.6 | 34.8 | 35.3 | 9.8 |
| SiamRPN++ | 49.6 | 51.8 | 61.8 | 32.5 |
| LLaVA-1.5 (Tracking Fine-tuned) | 19.4 | 23.5 | 20.2 | 9.7 |
| Merlin | 39.8 | 51.4 | 55.9 | 42.8 |
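According to the authors, Merlin is the first MLLM capable of performing tracking tasks. Using only a small amount of tracking data, it is comparable to the expert model SiamRPN++ on GOT10k (AO 51.4 vs. 51.8; SR₀.₇₅ 42.8 vs. 32.5) but trails it on LaSOT (39.8 vs. 49.6).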
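For readers unfamiliar with the GOT10k columns, the sketch below shows how average overlap (AO) and success rate (SR) are computed from per-frame IoU. It is a simplified single-sequence version with made-up boxes. The official protocol averages over sequences, and the table reports the values as percentages.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: List[Box], gts: List[Box]) -> float:
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(overlaps) / len(overlaps)


def success_rate(preds: List[Box], gts: List[Box], threshold: float) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


# Toy sequence: the ground truth stays fixed while the predictions drift away.
gts = [(0, 0, 10, 10)] * 4
preds = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 4), (20, 20, 30, 30)]

ao = average_overlap(preds, gts)      # per-frame IoUs are 1.0, 0.8, 0.4, 0.0
sr50 = success_rate(preds, gts, 0.5)
sr75 = success_rate(preds, gts, 0.75)
print(round(ao, 2), sr50, sr75)       # 0.55 0.5 0.5
```

SR₀.₇₅ is the stricter of the two success rates, so Merlin's lead on that column points to tight boxes on the frames where it succeeds.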
Table 3: Hallucination evaluation (POPE)
| Method | Random Acc | Popular Acc | Adversarial Acc |
|---|---|---|---|
| LLaVA (7B) | 72.16 | 61.37 | 58.67 |
| LLaVA-1.5 (7B) | 83.29 | 81.88 | 78.96 |
| Qwen-VL (7B) | 84.73 | 84.13 | 82.26 |
| Merlin (7B) | 91.58 | 89.53 | 84.10 |
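Merlin achieves the highest accuracy in all three POPE settings, leading Qwen-VL by about 7 points on random (91.58 vs. 84.73) and about 2 points on adversarial (84.10 vs. 82.26). Its "yes" ratio is close to 50%, indicating little bias toward answering "yes" and strong visual perception.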
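POPE asks yes/no questions about object presence, with balanced positive and negative questions, which is why a "yes" ratio near 50% suggests an unbiased model. A minimal sketch of the two quantities discussed above follows. The function name and the toy answers are illustrative only.

```python
from typing import Dict, List


def pope_metrics(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy and "yes" ratio (both in percent) for yes/no answers."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equally long")
    total = len(labels)
    correct = sum(p == l for p, l in zip(predictions, labels))
    yes_count = sum(p == "yes" for p in predictions)
    return {
        "accuracy": 100.0 * correct / total,
        "yes_ratio": 100.0 * yes_count / total,
    }


# Toy example: balanced labels, with one object missed by the model.
labels = ["yes", "no", "yes", "no"]
predictions = ["yes", "no", "no", "no"]

metrics = pope_metrics(predictions, labels)
print(metrics)  # {'accuracy': 75.0, 'yes_ratio': 25.0}
```

A model that hallucinates objects would push the "yes" ratio well above 50%, while the toy model here errs in the opposite direction.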
Ablation Study
Table 4: Ablation on FPT and FIT strategies (ITP: image-text pair data; ITD: general instruction-tuning data)
| Pre-training (ITP+FPT) | Fine-tuning (ITD+FIT) | GOT10K AO | Inference Avg |
|---|---|---|---|
| ITP only | ITD only | - | 59.5 |
| ITP only | ITD+FIT | - | 60.7 |
| FPT only | ITD+FIT | 15.5 | 52.8 |
| ITP+FPT | ITD only | 51.4 | 61.2 |
| ITP+FPT | ITD+FIT | 51.4 | 64.4 |
Table 5: Ablation on model configurations
| Resolution | Projector | Encoder | Token Count | Inference Avg | GOT10K AO |
|---|---|---|---|---|---|
| 448x | Conv2d | Unfrozen | 256 | 64.4 | 51.4 |
| 336x | Conv2d | Unfrozen | 256 | 59.8 | 47.3 |
| 336x | MLP | Unfrozen | 576 | 58.1 | 23.5 |
| 448x | Conv2d | Frozen | 256 | 60.8 | 28.4 |
Key Findings
- Complementary nature of FPT and FIT: FPT provides dynamic cue perception, while FIT activates foresight reasoning through T-CoT. Removing either one lowers the inference average from 64.4 to 61.2 (no FIT) or 60.7 (no FPT).
- Image-text pair data is essential: Pre-training on FPT data without image-text pairs drops the inference average to 52.8 and GOT10K AO to 15.5, severely harming the model's general capability.
- High resolution facilitates precise localization: With the Conv2d projector, moving from 336x to 448x raises GOT10K AO from 47.3 to 51.4 and the inference average from 59.8 to 64.4.
- Conv2d projector outperforms MLP: At 336x, Conv2d reaches 47.3 GOT10K AO versus 23.5 for the MLP while using fewer tokens, which supports multi-image input without performance degradation.
- Foresight learning unexpectedly reduces hallucinations: By learning trajectory correspondences, the model gains more precise target attention capabilities, preventing misidentification.
- Precise task descriptions are critical: The absence of precise task descriptions causes tracking performance to plunge from 51.4% to 28.4%.
Highlights & Insights
- Insight of trajectories as a structured learning target: Compared to directly predicting the next image frame, trajectories provide highly abstract and structured spatiotemporal representations, effectively avoiding visual redundancy.
- Innovation of Trajectory Chain-of-Thought (T-CoT): Extends the CoT concept to the domain of visual trajectories, utilizing trajectories as a "bridge" in the reasoning chain that connects observations with future predictions.
- Unified multi-task conversation format design: Through precise task definitions and format specifications, multiple tasks such as detection, tracking, referring expression comprehension, and future reasoning are handled within a single model.
- Spillover effect of foresight learning: Training the model to predict trajectories not only improves future reasoning capabilities, but also unexpectedly enhances general visual understanding and reduces hallucinations, providing a new perspective for MLLM training.
Limitations & Future Work
- Constraints in handling long videos: Current support is limited to \(\le 8\) frames due to the reliance on image encoders rather than video encoders, making it unable to handle long-term video sequences.
- Incomplete evaluation benchmarks: Existing future reasoning evaluation benchmarks are not comprehensive; the benchmark constructed based on MMBench sub-tasks is only a preliminary attempt.
- Video encoding efficiency: More efficient long-term video tokenizers need to be developed.
- Limitations of trajectory representation: Current trajectories use only bounding boxes and do not capture finer-grained spatial information (such as poses or fine-grained actions).
- Dialogue data scale: The T-CoT dataset contains only about 30K dialogues; expanding it could further improve reasoning quality.
Related Work & Insights
- Relationship with LLaVA-1.5: Merlin is based on the LLaVA architecture, replacing the MLP projector with Conv2d and introducing multi-frame and trajectory modeling.
- Relationship with Shikra: Shikra pioneered introducing spatial coordinate dialogue capabilities in MLLMs; Merlin extends this to the temporal dimension.
- Takeaways:
- Trajectories can serve as a cross-modal "language" connecting visual and textual modalities.
- Foresight learning can be a general visual representation enhancement strategy.
- Modality alignment and multi-task pre-training can be combined into a single stage.
Rating
| Dimension | Score (1-5) | Explanation |
|---|---|---|
| Novelty | 4.5 | Systematically introduces "foresight thinking" to MLLMs for the first time, with Trajectory CoT being an innovative contribution |
| Technical Depth | 4.0 | The two-stage training paradigm is well-designed, and the data construction details are comprehensive |
| Experimental Thoroughness | 4.0 | Multi-task evaluation is comprehensive, though the future reasoning benchmark remains relatively simple |
| Writing Quality | 4.0 | The writing is fluent with rich illustrations and clear methodological explanations |
| Overall | 4.0 | A pioneering work in the direction of MLLM future reasoning, featuring an innovative idea and thorough verification |