Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains

Conference: ACL 2025
arXiv: 2504.20199
Code: GitHub - VISC
Area: Multimodal VLMs
Keywords: Multi-image Reasoning, Vision-Language Models, Data Synthesis, Chain-of-Thought, Multimodal Reasoning

TL;DR

This paper proposes Focus-Centric Visual Chain, a multi-image reasoning paradigm that decomposes a question into sub-questions and focuses on the relevant visual information at each step. It also constructs the VISC-150K dataset; fine-tuning on it improves two 7B base models by roughly 2 to 3 points on average across seven multi-image benchmarks.

Background & Motivation

Vision-Language Models (VLMs) perform strongly on single-image tasks but degrade notably in multi-image scenarios. Multi-image tasks pose two main challenges:

Cross-image correlations: Diverse relationships (temporal, spatial, semantic) exist across images, requiring a holistic understanding of their contextual connections.

Visual discontinuity: Relevant information is fragmented across images, making it difficult to capture cross-image relations accurately.

Limitations of existing solutions:
- Generating reasoning chains directly with multimodal models lacks reliability; even GPT-4o exhibits unstable performance on multi-image tasks.
- Knowledge distillation from stronger models is costly and difficult to scale.
- Existing multi-image reasoning datasets are extremely scarce.

Method

Overall Architecture

The proposed method consists of two parts: (1) The Focus-Centric Visual Chain reasoning paradigm, which performs multi-step reasoning through question decomposition and stepwise focusing; and (2) The Focus-Centric Data Synthesis (FCDS) framework, which synthesizes high-quality multi-image reasoning data in a bottom-up manner.
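The stepwise paradigm in part (1) can be read as a simple control loop: propose a sub-question with a focused image subset, answer it, decide whether to stop, and finally summarize. The following is a minimal sketch of that loop, not the authors' implementation. The four callables (`propose`, `answer`, `should_stop`, `summarize`), the `Step` record, and the `max_steps` cap are hypothetical names introduced here for illustration, and the toy stand-ins treat each "image" as a caption string.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Step:
    sub_question: str  # the sub-question asked at this step
    focus: List[int]   # indices of the images the step focuses on
    answer: str        # intermediate answer for the sub-question


def focus_centric_chain(
    question: str,
    images: Sequence[str],
    propose: Callable[[str, int, List[Step]], Tuple[str, List[int]]],
    answer: Callable[[str, List[str]], str],
    should_stop: Callable[[str, List[Step]], bool],
    summarize: Callable[[str, List[Step]], str],
    max_steps: int = 8,  # hypothetical safety cap, not from the paper
) -> Tuple[str, List[Step]]:
    """Build a reasoning chain step by step, then summarize it."""
    chain: List[Step] = []
    for _ in range(max_steps):
        # 1) Decompose: pick the next sub-question and its focused subset.
        sub_q, focus = propose(question, len(images), chain)
        # 2) Focus: answer using only the selected images.
        focused = [images[k] for k in focus]
        chain.append(Step(sub_q, list(focus), answer(sub_q, focused)))
        # 3) Stop signal: terminate once the chain is judged sufficient.
        if should_stop(question, chain):
            break
    # 4) Synthesize the final answer from all sub-question/answer pairs.
    return summarize(question, chain), chain


# Toy stand-ins (hypothetical): one image per step, answer = its caption.
images = ["sunrise", "noon", "sunset"]
final, chain = focus_centric_chain(
    "In what order were the photos taken?",
    images,
    propose=lambda q, n, hist: (f"What does image {len(hist)} show?", [len(hist)]),
    answer=lambda sub_q, focused: focused[0],
    should_stop=lambda q, hist: len(hist) == len(images),
    summarize=lambda q, hist: " -> ".join(s.answer for s in hist),
)
print(final)
```

In a real system all four callables would be the same VLM prompted differently; they are separated here only to make the control flow visible.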

Key Designs

Focus-Centric Visual Chain Reasoning Paradigm: Given an image collection \(\mathcal{G} = \{I_k\}\) and a question \(Q\), the model \(\mathcal{M}\) incrementally constructs a reasoning chain \(\mathcal{R}\). At step \(i\), the model generates a sub-question \(q_i\) and identifies the corresponding focus-centric visual subset \(G_i\) (a minimized subset of visual information), obtaining an intermediate answer \(a_i\) by jointly analyzing \(q_i\) and \(G_i\). The model dynamically decides whether to terminate the reasoning (via a stop signal \(z_i\)) and finally synthesizes the ultimate answer from all QA pairs. The core idea is to decompose complex multi-image tasks into a sequence of sub-tasks focused on localized visual inputs.
Feature Extraction Module: Constructs a detailed textual profile for each image, comprising four elements: global view, background description, object attributes, and object interactions. LLaVA-OneVision-7B serves as the base model for the Extractor, generating profiles through three components: the visual encoder \(f_e\), the vision-language connector \(f_c\), and the LLM \(f_\phi\).
Pair Connection Module: Determines associations between image nodes based on two criteria: (1) object-oriented context (images sharing the same objects) and (2) event-oriented context (images describing related events). Qwen2.5-7B-Instruct is utilized as the Connector to identify valid pairwise connections based on the profile collection.
Relevance Annotation Module: Classifies the relationships between image pairs into three categories:
- Temporal: Images depicting chronological sequences.
- Spatial: Visual elements presenting geometric and positional correlations.
- Semantic: Abstract connections involving themes, logic, and causal associations.
LLaVA-OneVision-7B acts as the Annotator, labeling the relationship for each pair of connected images.
Question Generation Module: Samples contiguous node chains along the reasoning path and generates a sub-question for each pair of connected images based on their relevance annotation and profiles, then synthesizes a comprehensive question from the sub-questions. Qwen2.5-7B-Instruct is employed as the Questioner.
This bottom-up design aims to preserve data quality while remaining computationally efficient, and it relies solely on open-source models throughout the pipeline.
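The four modules can be pictured as building a graph over images and then walking it. The sketch below is an illustration under stated assumptions rather than the FCDS code: `connect` and `annotate` are hypothetical stand-ins for the LLM-based Connector and Annotator, profiles are reduced to a set of object names, and the chain sampler is a simple random walk without revisits.

```python
import itertools
import random
from typing import Callable, Dict, List, Tuple

Profile = Dict[str, set]
Edge = Tuple[int, int]


def build_edges(
    profiles: List[Profile],
    connect: Callable[[Profile, Profile], bool],
    annotate: Callable[[Profile, Profile], str],
) -> Dict[Edge, str]:
    """Check every image pair once; keep connected pairs with a relation label."""
    edges: Dict[Edge, str] = {}
    # Pairwise checking visits n * (n - 1) / 2 pairs, i.e. quadratic in n.
    for i, j in itertools.combinations(range(len(profiles)), 2):
        if connect(profiles[i], profiles[j]):
            edges[(i, j)] = annotate(profiles[i], profiles[j])
    return edges


def sample_chain(edges: Dict[Edge, str], length: int, rng: random.Random) -> List[int]:
    """Random walk over connected images, never revisiting a node."""
    neighbors: Dict[int, set] = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    if not neighbors:
        return []
    chain = [rng.choice(sorted(neighbors))]
    while len(chain) < length:
        options = sorted(neighbors[chain[-1]] - set(chain))
        if not options:
            break  # dead end: return the shorter chain
        chain.append(rng.choice(options))
    return chain


# Toy profiles (hypothetical): only the "object attributes" element is modeled.
profiles = [
    {"objects": {"dog", "ball"}},
    {"objects": {"dog", "park"}},
    {"objects": {"park", "bench"}},
    {"objects": {"car"}},
]
edges = build_edges(
    profiles,
    connect=lambda a, b: bool(a["objects"] & b["objects"]),  # shared objects
    annotate=lambda a, b: "semantic",                         # fixed toy label
)
chain = sample_chain(edges, length=3, rng=random.Random(0))
# One sub-question prompt per consecutive pair in the sampled chain.
sub_questions = [
    f"[{edges[tuple(sorted(pair))]}] How do image {pair[0]} and image {pair[1]} relate?"
    for pair in zip(chain, chain[1:])
]
print(edges, chain, sub_questions)
```

The `itertools.combinations` loop also makes the cost of the Pair Connection and Relevance Annotation stages explicit: the number of candidate pairs grows quadratically with the size of the image set.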

Loss & Training

LoRA fine-tuning is conducted based on LLaVA-OneVision-7B and Qwen2-VL-7B-Instruct.
Trained for 1 epoch on VISC-150K, with a batch size of 8 and a learning rate of 1e-5.
Warmup ratio of 0.05 with a cosine scheduler.
Maximum context length of 32,768.
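To make the schedule concrete, the sketch below computes the learning rate per optimizer step from the stated hyperparameters. Several details are assumptions not given above: the dataset is taken to contain exactly 150,000 samples, the batch size of 8 is treated as the effective global batch size, warmup is linear from zero, and the cosine decays to zero. LoRA-specific settings such as rank are not stated in this note and are therefore left out.

```python
import math

NUM_SAMPLES = 150_000  # assumption: VISC-150K taken as exactly 150,000 samples
BATCH_SIZE = 8         # assumption: effective global batch size
PEAK_LR = 1e-5
WARMUP_RATIO = 0.05

TOTAL_STEPS = math.ceil(NUM_SAMPLES / BATCH_SIZE)  # one epoch
WARMUP_STEPS = int(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(TOTAL_STEPS, WARMUP_STEPS)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions, one epoch corresponds to 18,750 steps, of which the first 937 are warmup.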

Key Experimental Results

Main Results

Model	MMIU	MuirBench	MIRB	BLINK	NLVR2	Mantis-Eval	MVBench
LLaVA-OneVision-7B	40.32	41.77	51.18	48.20	89.40	64.20	56.70
+VISC-150K	46.52(↑6.20)	49.62(↑7.85)	53.02(↑1.84)	50.24(↑2.04)	89.88(↑0.48)	66.36(↑2.16)	58.23(↑1.53)
Qwen2-VL-7B	50.00	39.12	58.67	53.20	86.42	69.60	67.00
+VISC-150K	52.76(↑2.76)	44.50(↑5.38)	60.16(↑1.49)	55.34(↑2.14)	89.82(↑3.40)	69.12(↓0.48)	68.01(↑1.01)
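The per-benchmark deltas and the average gain per base model can be recomputed directly from the table. The snippet below is a small arithmetic check using only the scores listed above; the variable names are introduced here for illustration.

```python
BENCHMARKS = ["MMIU", "MuirBench", "MIRB", "BLINK", "NLVR2", "Mantis-Eval", "MVBench"]

# (baseline scores, scores after fine-tuning on VISC-150K), copied from the table.
SCORES = {
    "LLaVA-OneVision-7B": (
        [40.32, 41.77, 51.18, 48.20, 89.40, 64.20, 56.70],
        [46.52, 49.62, 53.02, 50.24, 89.88, 66.36, 58.23],
    ),
    "Qwen2-VL-7B": (
        [50.00, 39.12, 58.67, 53.20, 86.42, 69.60, 67.00],
        [52.76, 44.50, 60.16, 55.34, 89.82, 69.12, 68.01],
    ),
}


def gains(base, tuned):
    """Per-benchmark change in points, rounded to two decimals."""
    return [round(t - b, 2) for b, t in zip(base, tuned)]


MEAN_GAIN = {
    model: round(sum(gains(base, tuned)) / len(BENCHMARKS), 2)
    for model, (base, tuned) in SCORES.items()
}

# Benchmarks where fine-tuning lowered the score.
REGRESSIONS = [
    (model, name)
    for model, (base, tuned) in SCORES.items()
    for name, delta in zip(BENCHMARKS, gains(base, tuned))
    if delta < 0
]

print(MEAN_GAIN)
print(REGRESSIONS)
```

The averages come out to 3.16 points for LLaVA-OneVision-7B and 2.24 points for Qwen2-VL-7B, with Mantis-Eval on Qwen2-VL-7B as the only regression.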

Ablation Study

Experimental Question	Key Results	Explanation
Impact of Data Scale (RQ1)	Rapid improvement from 0 to 25K, gradually converging from 125K to 150K	25K data is sufficient to activate multi-image reasoning capabilities
Sub-task Performance (RQ2)	Significant improvement in 8 out of 12 MuirBench sub-tasks	Similarity analysis and comparative reasoning show the largest improvement
Number of Input Images (RQ3)	Most significant improvement with 3-8 images; slight degradation with 15+ images	Medium-scale image sets benefit the most
Impact on General Capabilities (RQ4)	Performance is maintained or slightly improved on 4 single-image benchmarks	Does not sacrifice general vision-language capabilities
Data Quality (RQ5)	97.5% overall accuracy (evaluated by 3 human reviewers)	Fleiss' \(\kappa = 0.637\), indicating substantial inter-annotator agreement
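For readers unfamiliar with the agreement statistic in the RQ5 row, Fleiss' kappa compares the observed agreement among raters, \(\bar{P}\), with the agreement expected by chance, \(\bar{P}_e\), as \(\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)\). The sketch below implements the standard formula on a made-up toy rating matrix; the toy data is hypothetical and is not the paper's annotation data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa. counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)

    if p_exp == 1.0:
        return 1.0  # convention chosen here: all ratings fall in one category
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical toy data: 4 items, 3 raters, categories = (correct, incorrect).
toy = [[3, 0], [3, 0], [2, 1], [0, 3]]
print(fleiss_kappa(toy))
```

Under the commonly used Landis and Koch bands, values between 0.61 and 0.80 are read as substantial agreement, which is where the reported 0.637 falls.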

Key Findings

LLaVA-OneVision improves by 6.20 points on MMIU and 7.85 points on MuirBench.
Establishes new SOTA results on 4 out of 7 benchmarks (MMIU, MIRB, BLINK, NLVR2).
Even the stronger Qwen2-VL baseline gains 2.24 points on average, despite a 0.48-point drop on Mantis-Eval.
The method also improves results on the video benchmark MVBench (+1.53 and +1.01 points), suggesting that the paradigm transfers from image sets to video frames.
The entire data synthesis process employs open-source models, with a 97.5% accuracy rate validating the reliability of the method.

Highlights & Insights

Complete Closed Loop from Reasoning Paradigm to Data Synthesis: The reasoning paradigm (top-down decomposition) and data synthesis (bottom-up construction) form a dual design, which is logically elegant.
Purely Open-Source Solution: Data synthesis entirely relies on 7B-level open-source models, keeping costs controllable and reproducible.
Cross-Architecture Consistency: Improvements are observed across two distinct architectures, LLaVA-OneVision and Qwen2-VL, demonstrating the general value of the data.
No Compromise on General Capabilities: Performance is maintained or even slightly improved on single-image benchmarks such as HallusionBench and MMStar.
25K Data Activation Effect: A small amount of data can unlock the model's multi-image reasoning potential, suggesting that this is a process of "capability activation" rather than learning brand-new capabilities.

Limitations & Future Work

The quadratic complexity of pairwise image relevance annotation limits the scalability of the image set size.
The dataset mainly covers real photos and comics, leaving its effectiveness on structured visual content, such as charts and code screenshots, unverified.
The number of reasoning steps is constrained by the intrinsic capabilities of the backbone language models.
Complex spatial dynamic understanding and domain-specific visual tasks remain weak points.
Performance slightly degrades when input images exceed 15, indicating room for improvement in long-sequence image processing.

Similar to the success of Chain-of-Thought in textual reasoning, the Visual Chain introduces step-by-step reasoning to multi-image visual understanding.
The bottom-up design of the data synthesis framework avoids the high costs and unreliability associated with relying on closed-source models.
Insight: The core of multi-image reasoning lies in dynamic selective attention, and the Focus-Centric mechanism essentially implements this capability.
Future Direction: Extending this paradigm to video understanding (which has been initially validated) and longer sequence image understanding.

Rating

Novelty: ⭐⭐⭐⭐ The Focus-Centric reasoning paradigm and bottom-up data synthesis represent a novel combined innovation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely thorough, encompassing 7 benchmarks, 2 architectures, 5 research questions, and human quality evaluation.
Writing Quality: ⭐⭐⭐⭐ Systematic description of the method and clear formulation of equations, though some notations are somewhat redundant.
Value: ⭐⭐⭐⭐ The high-quality 150K dataset and effective reasoning paradigm make a significant contribution to multi-image VLM research.