
AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

Conference: ICLR 2026 · arXiv: 2509.25699 · Code: Available (anonymous link) · Area: LLM Reasoning · Keywords: multimodal CoT, information gain, active visual probing, attention map, interleaved reasoning

TL;DR

AIMCoT reframes visual information selection in multimodal CoT from "passively attending to high-attention regions" to "actively seeking regions of maximal information gain." Three collaborative modules — CAG (Context-enhanced Attention-map Generation), AVP (Active Visual Probing), and DAT (Dynamic Attention-shifting Trigger) — constitute a training-free, plug-and-play framework that outperforms ICoT by 18.25% (relative) on LLaVA-W (0-shot).

Background & Motivation

Background: Interleaved-modal CoT methods (e.g., ICoT) enhance VLM reasoning by alternately inserting text and visual patches into the reasoning chain, representing a frontier direction in multimodal reasoning.

Limitations of Prior Work: Existing methods rely on a "passive" strategy — selecting the Top-K regions with the highest attention scores and inserting them at newline positions. Experiments reveal three issues: (1) high-attention regions are frequently redundant or introduce noise; (2) attention maps miss critical visual details, especially when text-vision granularity is mismatched; (3) inserting visual information at newline positions lacks theoretical justification.

Key Challenge: Attention maps reflect token relevance rather than "information useful for answering the question," yet existing methods conflate the two. Passive Top-K selection has no explicit objective and is fundamentally undirected.

Goal: The paper addresses three questions: (1) how to obtain reliable attention maps; (2) how to actively select the visual regions most helpful for answering; and (3) when to insert visual information into the reasoning chain.

Key Insight: Grounded in Information Foraging Theory, the paper reframes region selection as an information-gain maximization problem.

Core Idea: Replace attention-score-driven passive selection with information-gain-driven active probing, enabling VLMs to "actively seek the regions they most need to observe."

Method

Overall Architecture

AIMCoT is a training-free plug-and-play framework comprising three collaborative modules: CAG first generates context-enhanced attention maps → AVP greedily selects the K regions with the highest information gain from a candidate set → DAT monitors attention shifts to determine when to insert visual information into the CoT.
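A minimal sketch of this interleaved loop and the DAT-style trigger, assuming a hypothetical `generate_step` callable that returns the next token together with the current visual-context attention score, and a `probe_regions` callable standing in for AVP; the names and the step interface are illustrative, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def aimcot_decode(
    generate_step: Callable[[List[int]], Tuple[int, float]],  # token ids -> (next token, A_visual(t))
    probe_regions: Callable[[List[int]], List[int]],           # AVP stand-in: visual tokens to insert
    eos_id: int,
    delta: float = 0.05,      # DAT threshold on the attention shift
    max_steps: int = 512,
) -> List[int]:
    """Interleaved decoding: emit text tokens and, when attention shifts
    toward the visual context by more than `delta`, insert probed regions."""
    sequence: List[int] = []
    prev_visual_attn = 0.0
    for _ in range(max_steps):
        token, visual_attn = generate_step(sequence)
        sequence.append(token)
        if token == eos_id:
            break
        # DAT: trigger active visual probing on a sufficiently large attention shift.
        if visual_attn - prev_visual_attn > delta:
            sequence.extend(probe_regions(sequence))
        prev_visual_attn = visual_attn
    return sequence
```

Here the shift is taken against the previous decoding step; the paper averages the visual-context attention score over the last three layers, which would live inside `generate_step`.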

Key Designs

  1. Context-enhanced Attention-map Generation (CAG):

    • Function: Generates context-aware image descriptions to mitigate text-vision granularity mismatch.
    • Mechanism: The VLM first generates an explanatory description \(\mathcal{D}_{CAG}\) of the image conditioned on the question. This description is then appended to the question, and the enhanced context is used to recompute the attention map \(A'\), producing more reliable task-relevant localization.
    • Design Motivation: Raw attention maps are unreliable under large text-vision granularity gaps; experiments show that masking the Top-10 attention regions reduces performance by only 3.93%, indicating that the attended regions are often not the decisive evidence. Enriching the textual context bridges this gap.
  2. Active Visual Probing (AVP):

    • Function: Actively selects the most informative visual regions based on information gain.
    • Mechanism: A diverse candidate set \(C = C_{attn} \cup C_{exp}\) is constructed (Top-N attention regions + M randomly sampled grid regions). Information gain is defined as \(IG(\{R_i\}) = U_B - U_{C,i}\) (baseline uncertainty minus conditional uncertainty after introducing region \(R_i\), both measured by the entropy of the next-token distribution over the vocabulary). A greedy algorithm iteratively selects the K regions with the highest information gain (see the sketch after this list).
    • Design Motivation: High-attention regions may carry redundant information; information gain naturally eliminates redundancy — if a region's information is already covered by previously selected regions, its marginal gain decreases. Experiments show that information-gain-based selection precisely localizes critical details.
  3. Dynamic Attention-shifting Trigger (DAT):

    • Function: Intelligently determines when to insert visual information into the CoT.
    • Mechanism: At each token generation step, the model monitors the total visual-context attention score \(A_{visual}(t)\) (averaged over the last 3 layers) and computes the attention shift \(\Delta A_{visual}(t)\). When the shift exceeds threshold \(\delta\), AVP is triggered to insert visual regions.
    • Design Motivation: Empirical analysis reveals a strong correlation between high-quality outputs and visual information insertion timed to coincide with attention shifting from text toward vision; low-quality outputs lack this pattern.
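The sketch below fills in the AVP step referenced above: building the candidate set \(C = C_{attn} \cup C_{exp}\) and greedily selecting regions by entropy-based information gain. It assumes a hypothetical `next_token_probs(selected_regions)` callable that runs the VLM conditioned on the question, the CAG-enhanced context, and the selected region crops; it recomputes marginal gains after each pick, so the paper's exact forward-pass schedule may differ, and all names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a next-token distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def build_candidates(attn_top_n: List[int], all_regions: List[int], m: int) -> List[int]:
    """C = C_attn ∪ C_exp: Top-N attention regions plus M randomly sampled grid regions."""
    exploratory = random.sample(all_regions, k=min(m, len(all_regions)))
    return list(dict.fromkeys(attn_top_n + exploratory))  # de-duplicate, keep order

def greedy_select(
    candidates: List[int],
    next_token_probs: Callable[[List[int]], Sequence[float]],
    k: int,
) -> List[int]:
    """Pick up to K regions, each maximizing IG(R_i) = U_B - U_{C,i}: the drop in
    predictive entropy relative to the current baseline of selected regions."""
    selected: List[int] = []
    pool = list(candidates)
    for _ in range(k):
        if not pool:
            break
        base_u = entropy(next_token_probs(selected))
        gains = {r: base_u - entropy(next_token_probs(selected + [r])) for r in pool}
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0:
            break  # no remaining region reduces uncertainty
        selected.append(best)
        pool.remove(best)
    return selected
```

Every call to `next_token_probs` is one VLM forward pass, which is the source of the computational overhead discussed under Limitations.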

Loss & Training

No training is required. All modules are plug-and-play at inference time.

Key Experimental Results

Main Results

| Model | Method | M3CoT (0-shot) | ScienceQA (0-shot) | LLaVA-W (0-shot) |
|---|---|---|---|---|
| Chameleon-7B | No-CoT | 29.1 | 47.7 | 13.1 |
| Chameleon-7B | ICoT | 29.8 | 51.0 | 25.2 |
| Chameleon-7B | AIMCoT | 31.4 | 53.1 | 29.8 |
| Chameleon-7B | Gain vs. ICoT | +5.5% | +4.1% | +18.3% |
| Qwen2-VL-7B | No-CoT | 43.6 | 56.3 | 32.7 |
| Qwen2-VL-7B | ICoT | 44.1 | 56.8 | 34.2 |
| Qwen2-VL-7B | AIMCoT | 44.7 | 57.4 | 36.3 |
| Qwen2-VL-7B | Gain vs. ICoT | +1.4% | +1.1% | +6.2% |

Ablation Study

| Configuration | Effect | Note |
|---|---|---|
| Mask Top-10 attention regions | Only −3.93% | Confirms that high-attention regions are not universally critical |
| CAG only | Moderate improvement | More reliable attention maps |
| CAG + AVP | Significant improvement | Active selection substantially outperforms passive Top-K |
| CAG + AVP + DAT (full) | Best | Three-module synergy surpasses any subset |
| Inference time vs. ICoT | ≤1.36× | Acceptable overhead |

Key Findings

  • The largest advantage appears on the open-ended LLaVA-W benchmark (+18.3%), as open-ended settings demand more active information seeking.
  • The 0-shot advantage is more pronounced than the 1-shot advantage, indicating that AIMCoT better elicits the model's intrinsic reasoning capacity.
  • The exploratory candidate set \(C_{exp}\) (randomly sampled regions) provides many useful regions not covered by the attention map.
  • The information gain function empirically exhibits approximate submodularity, supporting the effectiveness of the greedy algorithm.
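For context on the last point, this is the standard statement rather than something taken from the paper: a set function \(F\) is submodular if it shows diminishing returns, and for monotone submodular objectives the classical greedy bound of Nemhauser, Wolsey, and Fisher (1978) applies, which is why approximate submodularity of the information gain supports greedy selection.

\[
F(S \cup \{R\}) - F(S) \;\ge\; F(T \cup \{R\}) - F(T) \qquad \text{for all } S \subseteq T,\ R \notin T,
\]
\[
F(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le K} F(S) \qquad \text{when } F \text{ is monotone submodular.}
\]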

Highlights & Insights

  • Paradigm shift from passive to active: Reframing visual region selection from "where is the model looking" to "what information is most helpful to the model" represents a significant conceptual advance in multimodal reasoning.
  • Information gain as a selection metric: Quantifying the utility of visual regions via changes in predictive entropy is theoretically well-grounded and naturally resolves redundancy.
  • Independent value of the dynamic trigger: The idea of monitoring cross-modal attention shifts to determine insertion timing generalizes to other multimodal information fusion scenarios.

Limitations & Future Work

  • Computing information gain for candidate regions requires multiple forward passes (\(|C| + 1\)), making computational overhead the primary bottleneck.
  • The attention-monitoring mechanism in DAT may behave inconsistently across different VLM architectures.
  • Validation is limited to 7B-scale models; performance on larger models remains unknown.
  • The quality of the random exploratory set depends on the image content distribution and may be inefficient in specific scenarios.
  • The application of Information Foraging Theory to NLP and multimodal settings warrants further exploration.

Comparison with Related Work

  • vs. ICoT: Both perform interleaved CoT, but AIMCoT upgrades passive Top-K selection to active information-gain-driven selection, addressing ICoT's reliance on potentially unreliable high-attention regions.
  • vs. DDCoT/CCoT: These methods generate only textual reasoning chains, whereas AIMCoT inserts visual evidence directly into the chain, providing stronger visual grounding.

Rating

  • Novelty: ⭐⭐⭐⭐ Information-gain-driven active selection is a meaningful innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, two backbones, and detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is carefully analyzed with a solid theoretical foundation.
  • Value: ⭐⭐⭐⭐ The training-free framework is highly practical, though gains diminish on stronger models.