Interleaved-Modal Chain-of-Thought

Conference: CVPR 2025
arXiv: 2411.19488
Code: https://github.com/jungao1106/ICoT
Area: LLM Reasoning
Keywords: Chain-of-Thought, Multimodal Reasoning, VLM, Interleaved Modality, Attention Selection, Plug-and-Play

TL;DR

Proposes Interleaved-Modal Chain-of-Thought (ICoT), which interleaves image region crops as visual rationales within reasoning steps. A parameter-free Attention-driven Selection (ADS) strategy picks key regions of the input image and inserts them into the generated sequence, yielding up to a 14% improvement over existing multimodal CoT methods on Chameleon and Qwen2-VL.

Background & Motivation

Background: CoT prompting enables LLMs to generate intermediate reasoning steps before producing answers, and it has been extended to multimodal VLMs. Existing multimodal CoT methods (such as CCoT, which generates scene graphs; DDCoT, which decomposes subproblems; and SCAFFOLD, which overlays coordinate grids) produce text-only rationales.

Limitations of Prior Work: Text-only rationales struggle to accurately express fine-grained associations with the original image. For instance, text descriptions like "at the top of the image" are too coarse to precisely locate specific fruits in the picture, leading to reasoning errors.

Key Challenge: To achieve interleaved image-text reasoning steps, VLMs need to generate fine-grained multimodal content. However, VLMs with Perceiver architectures (such as Qwen2-VL) cannot generate images, while unified modeling VLMs (such as Chameleon) can generate images but have fixed resolutions and suffer from "multimodal generation laziness".

Key Insight: The required visual information is usually just a part of the input image, so there is no need to generate new images; it suffices to "select" relevant regions from the input image and insert them into the reasoning sequence.

Core Idea: Leverage the VLM's own attention maps to identify the image regions most attended to during the current reasoning step, and automatically insert the visual tokens of the corresponding patches into the generation sequence to form "visual + textual" interleaved reasoning steps.

Method

Overall Architecture

Based on standard multimodal CoT, ICoT extends each reasoning step from text-only to a paired "image region + textual rationale" format. During generation, whenever the VLM reaches a reasoning step boundary (detected via the newline character \n), ADS selects the most relevant patches from the input image to insert, and then autoregressive generation of the subsequent text continues.
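The control flow described above can be sketched as a toy decoding loop. This is only an illustration under stated assumptions: `next_token` and `select_patches` are hypothetical stand-ins for the VLM's decoding step and for ADS, and tokens are plain strings rather than model token ids.

```python
from typing import Callable, List

SIGNAL = "\n"  # step-boundary token, as described in the text


def interleaved_generate(
    next_token: Callable[[List[str]], str],       # hypothetical: one decoding step
    select_patches: Callable[[List[str]], List[str]],  # hypothetical: stand-in for ADS
    prompt: List[str],
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> List[str]:
    """Toy loop: after each signal token, splice selected visual tokens into the sequence."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(seq)
        if tok == eos:
            break
        seq.append(tok)
        if tok == SIGNAL:
            # A real implementation would read attention maps here.
            seq.extend(select_patches(seq))
    return seq


# Scripted stand-ins so the sketch runs without a model.
script = iter(["Step", "1", "\n", "Step", "2", "\n", "<eos>"])
result = interleaved_generate(
    next_token=lambda seq: next(script),
    select_patches=lambda seq: ["<img_7>", "<img_12>"],
    prompt=["<question>"],
)
print(result)
```

In this sketch the inserted visual tokens become part of the context, so later text tokens are generated conditioned on them.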

Key Designs

Attention-driven Selection (ADS):
- Function: Selects and inserts the most relevant patches from the input image into the generation sequence at the start of each reasoning step.
- Mechanism: Leverages the attention distribution of the VLM when generating signal tokens (newlines) to locate the input image patches that the current step focuses on the most. It takes the attention weights of the signal token over all visual tokens in the last layer, selects the top-k patches (defaulting to 64 for Chameleon and 16 for Qwen2-VL), and copies and inserts their visual tokens into the current position.
- Design Motivation: ADS is parameter-free: it uses only attention maps, requires no training of new modules, and is plug-and-play. It adds negligible latency and adapts to various VLM architectures.
ICoT Prompt Design:
- Function: Designs few-shot examples containing interleaved visual-text rationales to guide the VLM.
- Mechanism: Manually constructs 1-shot examples where each reasoning step includes manually selected fine-grained image regions and corresponding text explanations.
- Design Motivation: Ablation studies show that manually designed fine-grained examples outperform examples generated automatically by the model (by 0.8 to 1.6 points).
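The ADS selection step can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes the signal token's last-layer attention has already been reduced to one weight per visual token (for example by averaging over heads, which is an assumption here), and it assumes the selected patches are inserted in their original order. The names `ads_select` and `insert_patches` are hypothetical.

```python
from typing import List, Sequence


def ads_select(attn_to_visual: Sequence[float], k: int) -> List[int]:
    """Return indices of the k most-attended visual tokens, restored to original order."""
    if k <= 0:
        return []
    ranked = sorted(
        range(len(attn_to_visual)),
        key=lambda i: attn_to_visual[i],
        reverse=True,
    )
    # Assumption: keep the picked patches in their original (spatial) order.
    return sorted(ranked[:k])


def insert_patches(sequence: List[str], visual_tokens: List[str], picked: List[int]) -> List[str]:
    """Append copies of the picked visual tokens at the current position."""
    return list(sequence) + [visual_tokens[i] for i in picked]


# Toy attention of the signal token over six visual tokens.
attn = [0.01, 0.30, 0.05, 0.22, 0.02, 0.40]
visual = [f"<v{i}>" for i in range(len(attn))]

picked = ads_select(attn, k=3)
extended = insert_patches(["Step", "1", "\n"], visual, picked)
print(picked, extended)
```

Because the selection only sorts weights the model has already computed, it introduces no new parameters, which matches the plug-and-play motivation stated above.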

Loss & Training

Training-Free: ADS is a plug-and-play inference-time strategy that introduces no additional parameters.
The default signal token is the newline character \n.
Patch granularity: 64 patches of size 16×16 are selected for Chameleon, and 16 patches of size 28×28 are selected for Qwen2-VL.
Eager attention is used to retrieve attention maps.
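To make the patch granularity concrete, the following sketch maps a selected patch index to a pixel box and computes what fraction of the image the selection covers. The image resolutions are assumptions for illustration only (512×512 for Chameleon; a hypothetical 448×448 input for Qwen2-VL, whose resolution is dynamic), as is the row-major patch ordering.

```python
from typing import Tuple


def patch_box(index: int, grid_w: int, patch: int) -> Tuple[int, int, int, int]:
    """Pixel box (left, top, right, bottom) of a patch, assuming row-major order."""
    row, col = divmod(index, grid_w)
    return (col * patch, row * patch, (col + 1) * patch, (row + 1) * patch)


def coverage(k: int, image_w: int, image_h: int, patch: int) -> float:
    """Fraction of the image area covered by k distinct patches."""
    total = (image_w // patch) * (image_h // patch)
    return k / total


# Assumed 512x512 input with 16x16 patches: a 32x32 grid of 1024 patches.
chameleon_cov = coverage(k=64, image_w=512, image_h=512, patch=16)
# Hypothetical 448x448 input with 28x28 patches: a 16x16 grid of 256 patches.
qwen_cov = coverage(k=16, image_w=448, image_h=448, patch=28)

print(patch_box(33, grid_w=32, patch=16))
print(chameleon_cov, qwen_cov)
```

Under these assumed resolutions, both default settings select the same small fraction of the image, even though the patch counts differ.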

Key Experimental Results

Main Results

0-shot results on Chameleon-7B:

Method	M3CoT (0-shot)	ScienceQA (0-shot)	LLaVA-W (0-shot)
No-CoT	24.6	44.6	22.3
CoT	26.1	46.2	23.5
CCoT	25.8	48.1	24.0
DDCoT	27.3	49.3	23.9
SCAFFOLD	28.0	50.2	23.1
ICoT (ours)	29.8	51.0	25.2
Relative Gain (vs. best baseline)	+6.4%	+1.6%	+5.0%

Comparable gains are observed on Qwen2-VL-7B (M3CoT +4.6%, LLaVA-W +5.3%).
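The relative gains in the Chameleon-7B table appear to be measured against the strongest baseline in each column. A short check of that arithmetic, using only the numbers from the table above:

```python
# Scores copied from the Chameleon-7B 0-shot table above.
baselines = {
    "M3CoT": {"No-CoT": 24.6, "CoT": 26.1, "CCoT": 25.8, "DDCoT": 27.3, "SCAFFOLD": 28.0},
    "ScienceQA": {"No-CoT": 44.6, "CoT": 46.2, "CCoT": 48.1, "DDCoT": 49.3, "SCAFFOLD": 50.2},
    "LLaVA-W": {"No-CoT": 22.3, "CoT": 23.5, "CCoT": 24.0, "DDCoT": 23.9, "SCAFFOLD": 23.1},
}
icot = {"M3CoT": 29.8, "ScienceQA": 51.0, "LLaVA-W": 25.2}


def relative_gain(ours: float, best: float) -> float:
    """Relative improvement over the best baseline, in percent."""
    return 100.0 * (ours - best) / best


gains = {}
for bench, scores in baselines.items():
    best_name = max(scores, key=scores.get)
    gains[bench] = (best_name, round(relative_gain(icot[bench], scores[best_name]), 1))

print(gains)
```

Note that the best baseline differs by benchmark: SCAFFOLD on M3CoT and ScienceQA, CCoT on LLaVA-W.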

Ablation Study

Configuration	M3CoT	ScienceQA	LLaVA-W
ICoT (Full)	32.3	53.4	27.6
w/o ADS (Text-only)	29.2 (-3.1)	52.4 (-1.0)	24.5 (-3.1)
w/o FVI (Random patch)	30.6 (-1.7)	52.8 (-0.6)	25.9 (-1.7)
w/o ADS+FVI	29.1 (-3.2)	51.0 (-2.4)	23.0 (-4.6)

Patch Quantity Sensitivity (Chameleon-7B, M3CoT)

Top-k Patch Count	16	32	64	128
Accuracy	28.4	29.1	29.8	29.5

Selecting too few patches provides insufficient information, while selecting too many introduces noise; among the tested values, k=64 performs best on Chameleon.

Key Findings

ADS contributes the most: Removing ADS drops performance by 3.1 points (M3CoT), showing that the interleaved reasoning itself is more crucial than good examples.
Most significant improvement on LLaVA-W: This is because the reference answers of this benchmark contain abundant image details.
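KV Cache Copying vs. Token Insertion: Directly copying the cached key/value entries of the selected patches is slightly worse (by 0.5 to 0.8) than re-inserting their visual tokens, because the cached entries already have the positional information of their original positions fused in.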
Relatively smaller improvement on ScienceQA: This dataset is relatively simple and does not heavily rely on fine-grained visual information.
Controllable inference overhead: ADS only reads existing attention matrices and performs top-k sorting, with <5% extra latency.
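The KV Cache finding above can be illustrated with a toy rotary position embedding. This is an assumption made for illustration: the source only states that positional information is fused early into the cache, and a 2-D RoPE-style rotation is used here as one mechanism with that property. All positions and vectors are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def rope(vec: Vec, pos: int, theta: float = 0.1) -> Vec:
    """Rotate a 2-D vector by an angle proportional to its position (toy RoPE)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1]


key: Vec = (1.0, 0.0)    # un-rotated key of one image patch
query: Vec = (1.0, 0.0)  # un-rotated query of the next text token

orig_pos, insert_pos, query_pos = 3, 50, 51

cached_key = rope(key, orig_pos)   # what the KV cache stores for the patch
fresh_key = rope(key, insert_pos)  # what re-inserting the token would produce
q = rope(query, query_pos)

score_copied = dot(q, cached_key)  # behaves as if the patch were 48 positions away
score_fresh = dot(q, fresh_key)    # behaves as if the patch were 1 position away
print(score_copied, score_fresh)
```

In this toy setting the copied key still encodes its original position, so the query sees it as a distant token, whereas the re-inserted token is treated as adjacent to the current reasoning step.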

Highlights & Insights

Simple yet powerful core insight: The visual information required in multimodal reasoning is usually already present in the input image. There is no need to "generate" new images; it is enough to "select" regions of the existing one, which greatly simplifies the implementation.
Truly plug-and-play: It requires no training and no changes to the model architecture, using only the model's existing attention maps. It can be applied directly to VLMs whose attention weights are accessible at inference time.
Alignment with human thinking: When humans reason about visual tasks, they naturally alternate between "looking at a specific region + thinking + looking again + thinking further." ICoT is a direct simulation of this process.

Limitations & Future Work

Limited to sub-regions of the input image: If reasoning requires external visual information (e.g., imagination, knowledge retrieval), ICoT cannot assist.
Fixed patch granularity: The granularity of selection is fixed across all reasoning steps, whereas some steps might benefit from coarser or finer visual focus.
Requirement of eager attention: ADS needs the explicit attention weights, which fused kernels such as FlashAttention do not return, so it is incompatible with flash attention; this may hurt long-sequence inference efficiency.
Relatively limited evaluation benchmarks: It has not been validated on more complex mathematical reasoning benchmarks (e.g., MathVista).

Comparison with Related Methods

vs. CCoT: CCoT generates scene graphs (JSON descriptions), which are essentially text. ICoT directly inserts image patches, providing more precise visual grounding.
vs. SCAFFOLD: SCAFFOLD overlays coordinate grids to let the VLM describe positions using coordinates, which still relies on text. ICoT directly uses visual tokens, bypassing the textual medium.
vs. DDCoT: DDCoT focuses on the reasoning structure (decomposing subproblems), while ICoT focuses on the reasoning modality (interleaved image-text). The two can be combined.

Rating

Novelty: ⭐⭐⭐⭐ The first work to insert image patches within CoT reasoning steps, with precise observations.
Experimental Thoroughness: ⭐⭐⭐ Three benchmarks are relatively limited; lacks evaluations on more challenging reasoning tasks.
Writing Quality: ⭐⭐⭐⭐ Clear problem formulation, simple and easy-to-understand method.
Value: ⭐⭐⭐⭐ Strong practical value as a plug-and-play approach, with genuine significance for multimodal reasoning.