
AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

Conference: ICLR 2026 · arXiv: 2509.25699 · Code: Available (anonymous link) · Area: LLM Reasoning · Keywords: multimodal CoT, information gain, active visual probing, attention map, interleaved reasoning

TL;DR

AIMCoT reframes visual information selection in multimodal CoT from "passively attending to high-attention regions" to "actively seeking regions of maximal information gain." Three collaborative modules — CAG (Context-enhanced Attention-map Generation), AVP (Active Visual Probing), and DAT (Dynamic Attention-shifting Trigger) — constitute a training-free, plug-and-play framework that outperforms ICoT by 18.25% (relative) on LLaVA-W (0-shot).

Background & Motivation

Background: Interleaved-modal CoT methods (e.g., ICoT) enhance VLM reasoning by alternately inserting text and visual patches into the reasoning chain, representing a frontier direction in multimodal reasoning.

Limitations of Prior Work: Existing methods rely on a "passive" strategy — selecting the Top-K regions with the highest attention scores and inserting them at newline positions. Experiments reveal three issues: (1) high-attention regions are frequently redundant or introduce noise; (2) attention maps miss critical visual details, especially when text-vision granularity is mismatched; (3) inserting visual information at newline positions lacks theoretical justification.

Key Challenge: Attention maps reflect token relevance rather than "information useful for answering the question," yet existing methods conflate the two. Passive Top-K selection has no explicit objective and is fundamentally undirected.

Goal: The paper addresses three questions: (1) how to obtain reliable attention maps; (2) how to actively select the visual regions most helpful for answering; and (3) when to insert visual information into the reasoning chain.

Key Insight: Grounded in Information Foraging Theory, the paper reframes region selection as an information-gain maximization problem.

Core Idea: Replace attention-score-driven passive selection with information-gain-driven active probing, enabling VLMs to "actively seek the regions they most need to observe."

Method

Overall Architecture

AIMCoT is a training-free plug-and-play framework comprising three collaborative modules: CAG first generates context-enhanced attention maps → AVP greedily selects the K regions with the highest information gain from a candidate set → DAT monitors attention shifts to determine when to insert visual information into the CoT.
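A minimal sketch of this interleaved loop and the DAT-style trigger, assuming a hypothetical `generate_step` callable that returns the next token together with the current visual-context attention score, and a `probe_regions` callable standing in for AVP; the names and the step interface are illustrative, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def aimcot_decode(
    generate_step: Callable[[List[int]], Tuple[int, float]],  # token ids -> (next token, A_visual(t))
    probe_regions: Callable[[List[int]], List[int]],           # AVP stand-in: visual tokens to insert
    eos_id: int,
    delta: float = 0.05,      # DAT threshold on the attention shift
    max_steps: int = 512,
) -> List[int]:
    """Interleaved decoding: emit text tokens and, when attention shifts
    toward the visual context by more than `delta`, insert probed regions."""
    sequence: List[int] = []
    prev_visual_attn = 0.0
    for _ in range(max_steps):
        token, visual_attn = generate_step(sequence)
        sequence.append(token)
        if token == eos_id:
            break
        # DAT: trigger active visual probing on a sufficiently large attention shift.
        if visual_attn - prev_visual_attn > delta:
            sequence.extend(probe_regions(sequence))
        prev_visual_attn = visual_attn
    return sequence
```

Here the shift is taken against the previous decoding step; the paper averages the visual-context attention score over the last three layers, which would live inside `generate_step`.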

Key Designs

  1. Context-enhanced Attention-map Generation (CAG):

    • Function: Generates context-aware image descriptions to mitigate text-vision granularity mismatch.
    • Mechanism: The VLM first generates an explanatory description \(\mathcal{D}_{CAG}\) of the image conditioned on the question. This description is then appended to the question, and the enhanced context is used to recompute the attention map \(A'\), producing more reliable task-relevant localization.
    • Design Motivation: Raw attention maps are unreliable under large text-vision granularity gaps; experiments show that masking the Top-10 attention regions reduces performance by only 3.93%, indicating that the attended regions are often not the decisive evidence. Enriching the textual context bridges this gap.
  2. Active Visual Probing (AVP):

    • Function: Actively selects the most informative visual regions based on information gain.
    • Mechanism: A diverse candidate set \(C = C_{attn} \cup C_{exp}\) is constructed (Top-N attention regions + M randomly sampled grid regions). Information gain is defined as \(IG(\{R_i\}) = U_B - U_{C,i}\) (baseline uncertainty minus conditional uncertainty after introducing region \(R_i\), both measured by the entropy of the next-token distribution over the vocabulary). A greedy algorithm iteratively selects the K regions with the highest information gain (see the sketch after this list).
    • Design Motivation: High-attention regions may carry redundant information; information gain naturally eliminates redundancy — if a region's information is already covered by previously selected regions, its marginal gain decreases. Experiments show that information-gain-based selection precisely localizes critical details.
  3. Dynamic Attention-shifting Trigger (DAT):

    • Function: Intelligently determines when to insert visual information into the CoT.
    • Mechanism: At each token generation step, the model monitors the total visual-context attention score \(A_{visual}(t)\) (averaged over the last 3 layers) and computes the attention shift \(\Delta A_{visual}(t)\). When the shift exceeds threshold \(\delta\), AVP is triggered to insert visual regions.
    • Design Motivation: Empirical analysis reveals a strong correlation between high-quality outputs and visual information insertion timed to coincide with attention shifting from text toward vision; low-quality outputs lack this pattern.
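The sketch below fills in the AVP step referenced above: building the candidate set \(C = C_{attn} \cup C_{exp}\) and greedily selecting regions by entropy-based information gain. It assumes a hypothetical `next_token_probs(selected_regions)` callable that runs the VLM conditioned on the question, the CAG-enhanced context, and the selected region crops; it recomputes marginal gains after each pick, so the paper's exact forward-pass schedule may differ, and all names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a next-token distribution over the vocabulary."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def build_candidates(attn_top_n: List[int], all_regions: List[int], m: int) -> List[int]:
    """C = C_attn ∪ C_exp: Top-N attention regions plus M randomly sampled grid regions."""
    exploratory = random.sample(all_regions, k=min(m, len(all_regions)))
    return list(dict.fromkeys(attn_top_n + exploratory))  # de-duplicate, keep order

def greedy_select(
    candidates: List[int],
    next_token_probs: Callable[[List[int]], Sequence[float]],
    k: int,
) -> List[int]:
    """Pick up to K regions, each maximizing IG(R_i) = U_B - U_{C,i}: the drop in
    predictive entropy relative to the current baseline of selected regions."""
    selected: List[int] = []
    pool = list(candidates)
    for _ in range(k):
        if not pool:
            break
        base_u = entropy(next_token_probs(selected))
        gains = {r: base_u - entropy(next_token_probs(selected + [r])) for r in pool}
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0:
            break  # no remaining region reduces uncertainty
        selected.append(best)
        pool.remove(best)
    return selected
```

Every call to `next_token_probs` is one VLM forward pass, which is the source of the computational overhead discussed under Limitations.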

Loss & Training

No training is required. All modules are plug-and-play at inference time.

Key Experimental Results

Main Results

| Model | Method | M3CoT (0-shot) | ScienceQA (0-shot) | LLaVA-W (0-shot) |
|---|---|---|---|---|
| Chameleon-7B | No-CoT | 29.1 | 47.7 | 13.1 |
| Chameleon-7B | ICoT | 29.8 | 51.0 | 25.2 |
| Chameleon-7B | AIMCoT | 31.4 | 53.1 | 29.8 |
| Chameleon-7B | Gain vs. ICoT | +5.5% | +4.1% | +18.3% |
| Qwen2-VL-7B | No-CoT | 43.6 | 56.3 | 32.7 |
| Qwen2-VL-7B | ICoT | 44.1 | 56.8 | 34.2 |
| Qwen2-VL-7B | AIMCoT | 44.7 | 57.4 | 36.3 |
| Qwen2-VL-7B | Gain vs. ICoT | +1.4% | +1.1% | +6.2% |

Ablation Study

| Configuration | Effect | Note |
|---|---|---|
| Mask Top-10 attention regions | Only −3.93% | Confirms that high-attention regions are not universally critical |
| CAG only | Moderate improvement | More reliable attention maps |
| CAG + AVP | Significant improvement | Active selection substantially outperforms passive Top-K |
| CAG + AVP + DAT (full) | Best | Three-module synergy surpasses any subset |
| Inference time vs. ICoT | ≤1.36× | Acceptable overhead |

Key Findings

  • The largest advantage appears on the open-ended LLaVA-W benchmark (+18.3%), as open-ended settings demand more active information seeking.
  • The 0-shot advantage is more pronounced than the 1-shot advantage, indicating that AIMCoT better elicits the model's intrinsic reasoning capacity.
  • The exploratory candidate set \(C_{exp}\) (randomly sampled regions) provides many useful regions not covered by the attention map.
  • The information gain function empirically exhibits approximate submodularity, supporting the effectiveness of the greedy algorithm.
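For context on the last point, this is the standard statement rather than something taken from the paper: a set function \(F\) is submodular if it shows diminishing returns, and for monotone submodular objectives the classical greedy bound of Nemhauser, Wolsey, and Fisher (1978) applies, which is why approximate submodularity of the information gain supports greedy selection.

\[
F(S \cup \{R\}) - F(S) \;\ge\; F(T \cup \{R\}) - F(T) \qquad \text{for all } S \subseteq T,\ R \notin T,
\]
\[
F(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr) \max_{|S| \le K} F(S) \qquad \text{when } F \text{ is monotone submodular.}
\]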

Highlights & Insights

  • Paradigm shift from passive to active: Reframing visual region selection from "where is the model looking" to "what information is most helpful to the model" represents a significant conceptual advance in multimodal reasoning.
  • Information gain as a selection metric: Quantifying the utility of visual regions via changes in predictive entropy is theoretically well-grounded and naturally resolves redundancy.
  • Independent value of the dynamic trigger: The idea of monitoring cross-modal attention shifts to determine insertion timing generalizes to other multimodal information fusion scenarios.

Limitations & Future Work

  • Computing information gain for candidate regions requires multiple forward passes (\(|C| + 1\)), making computational overhead the primary bottleneck.
  • The attention-monitoring mechanism in DAT may behave inconsistently across different VLM architectures.
  • Validation is limited to 7B-scale models; performance on larger models remains unknown.
  • The quality of the random exploratory set depends on the image content distribution and may be inefficient in specific scenarios.
  • The application of Information Foraging Theory to NLP and multimodal settings warrants further exploration.

Comparison with Related Work

  • vs. ICoT: Both perform interleaved CoT, but AIMCoT upgrades passive Top-K selection to active information-gain-driven selection, addressing ICoT's reliance on potentially unreliable high-attention regions.
  • vs. DDCoT/CCoT: These methods generate only textual reasoning chains, whereas AIMCoT inserts visual evidence directly into the chain, providing stronger visual grounding.

Rating

  • Novelty: ⭐⭐⭐⭐ Information-gain-driven active selection is a meaningful innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, two backbones, and detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is carefully analyzed with a solid theoretical foundation.
  • Value: ⭐⭐⭐⭐ The training-free framework is highly practical, though gains diminish on stronger models.