AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning¶
Conference: ICLR 2026 arXiv: 2509.25699 Code: Available (anonymous link) Area: LLM Reasoning Keywords: multimodal CoT, information gain, active visual probing, attention map, interleaved reasoning
TL;DR¶
AIMCoT reframes visual information selection in multimodal CoT from "passively attending to high-attention regions" to "actively seeking regions of maximal information gain." Three collaborative modules — CAG (Context-enhanced Attention-map Generation), AVP (Active Visual Probing), and DAT (Dynamic Attention-shifting Trigger) — constitute a training-free, plug-and-play framework that outperforms ICoT by 18.25% on LLaVA-W (0-shot).
Background & Motivation¶
Background: Interleaved-modal CoT methods (e.g., ICoT) enhance VLM reasoning by alternately inserting text and visual patches into the reasoning chain, representing a frontier direction in multimodal reasoning.
Limitations of Prior Work: Existing methods rely on a "passive" strategy — selecting the Top-K regions with the highest attention scores and inserting them at newline positions. Experiments reveal three issues: (1) high-attention regions are frequently redundant or introduce noise; (2) attention maps miss critical visual details, especially when text-vision granularity is mismatched; (3) inserting visual information at newline positions lacks theoretical justification.
Key Challenge: Attention maps reflect token relevance rather than "information useful for answering the question," yet existing methods conflate the two. Passive Top-K selection has no explicit objective and is fundamentally undirected.
Goal: Answer three questions: (1) how can reliable attention maps be obtained? (2) how can the visual regions most helpful for answering be actively selected? (3) when is the optimal moment to insert visual information?
Key Insight: Grounded in Information Foraging Theory, the paper reframes region selection as an information-gain maximization problem.
Core Idea: Replace attention-score-driven passive selection with information-gain-driven active probing, enabling VLMs to "actively seek the regions they most need to observe."
Method¶
Overall Architecture¶
AIMCoT is a training-free plug-and-play framework comprising three collaborative modules: CAG first generates context-enhanced attention maps → AVP greedily selects the K regions with the highest information gain from a candidate set → DAT monitors attention shifts to determine when to insert visual information into the CoT.
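The module interplay can be sketched as a toy loop. All interfaces here (`cag`, `avp`, `dat_shift`) are hypothetical stand-ins for the paper's modules, not real APIs:

```python
def aimcot_step(question, image, cag, avp, dat_shift, delta=0.1, max_steps=8):
    """Toy sketch of one AIMCoT inference pass (hypothetical interfaces).

    cag(question, image) -> context-enhanced attention map
    avp(attn_map)        -> K highest-information-gain regions
    dat_shift(t)         -> visual-attention shift at generation step t
    """
    attn_map = cag(question, image)        # CAG: reliable attention map first
    chain = []                             # interleaved reasoning chain
    for t in range(max_steps):             # stand-in for token-by-token decoding
        chain.append(f"token_{t}")
        if dat_shift(t) > delta:           # DAT: attention shifting toward vision
            chain.extend(avp(attn_map))    # AVP: insert probed visual regions
    return chain
```

The key design point this illustrates: AVP is not run on a fixed schedule (e.g. at newlines, as in ICoT) but only when DAT detects an attention shift.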
Key Designs¶
- Context-enhanced Attention-map Generation (CAG):
  - Function: Generates context-aware image descriptions to mitigate text-vision granularity mismatch.
  - Mechanism: The VLM first generates an explanatory description \(\mathcal{D}_{CAG}\) of the image conditioned on the question. This description is then appended to the question, and the enhanced context is used to recompute the attention map \(A'\), producing more reliable task-relevant localization.
  - Design Motivation: Raw attention maps are unreliable under large text-vision granularity gaps (experiments show that masking the Top-10 attention regions reduces performance by only 3.93%). Enriching the textual context bridges this gap.
- Active Visual Probing (AVP):
  - Function: Actively selects the most informative visual regions based on information gain.
  - Mechanism: A diverse candidate set \(C = C_{attn} \cup C_{exp}\) is constructed (Top-N attention regions + M randomly sampled grid regions). Information gain is defined as \(IG(\{R_i\}) = U_B - U_{C,i}\) (baseline uncertainty minus conditional uncertainty after introducing region \(R_i\), both measured by entropy over the vocabulary distribution). A greedy algorithm iteratively selects the K regions with the highest information gain.
  - Design Motivation: High-attention regions may carry redundant information; information gain naturally eliminates redundancy — if a region's information is already covered by previously selected regions, its marginal gain decreases. Experiments show that information-gain-based selection precisely localizes critical details.
- Dynamic Attention-shifting Trigger (DAT):
  - Function: Intelligently determines when to insert visual information into the CoT.
  - Mechanism: At each token generation step, the model monitors the total visual-context attention score \(A_{visual}(t)\) (averaged over the last 3 layers) and computes the attention shift \(\Delta A_{visual}(t)\). When the shift exceeds threshold \(\delta\), AVP is triggered to insert visual regions.
  - Design Motivation: Empirical analysis reveals a strong correlation between high-quality outputs and visual information insertion timed to coincide with attention shifting from text toward vision; low-quality outputs lack this pattern.
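The quantitative cores of AVP and DAT can be sketched in a few lines. `predict` below is a hypothetical stand-in for a VLM forward pass returning an answer-token distribution; this is an illustrative sketch, not the paper's implementation:

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution (uncertainty U)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def greedy_avp(candidates, predict, k):
    """AVP sketch: greedily pick up to K regions by marginal information gain.

    IG = U_B - U_{C,i}: baseline entropy minus the conditional entropy after
    inserting a region. Recomputing the baseline after each pick is what makes
    redundant regions drop out: their marginal gain shrinks toward zero.
    """
    selected = []
    u_base = entropy(predict(selected))                       # U_B
    for _ in range(k):
        best_region, best_gain = None, 0.0
        for r in candidates:
            if r in selected:
                continue
            gain = u_base - entropy(predict(selected + [r]))  # marginal IG
            if gain > best_gain:
                best_region, best_gain = r, gain
        if best_region is None:      # no remaining region adds information
            break
        selected.append(best_region)
        u_base = entropy(predict(selected))                   # update baseline
    return selected

def dat_triggers(visual_attn, delta):
    """DAT sketch: fire when the visual-attention shift exceeds delta.

    visual_attn[t] = total attention mass on visual context at step t (the
    paper averages the last 3 layers); returns the steps that trigger AVP.
    """
    return [t for t in range(1, len(visual_attn))
            if visual_attn[t] - visual_attn[t - 1] > delta]   # ΔA_visual(t) > δ
```

For example, with a toy `predict` in which region 0 sharpens the distribution and region 1 only helps once region 0 is present, `greedy_avp` picks region 0 first and region 1 second, while a purely uninformative region is never selected.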
Loss & Training¶
No training is required. All modules are plug-and-play at inference time.
Key Experimental Results¶
Main Results¶
| Model | Method | M3CoT (0-shot) | ScienceQA (0-shot) | LLaVA-W (0-shot) |
|---|---|---|---|---|
| Chameleon-7B | No-CoT | 29.1 | 47.7 | 13.1 |
| Chameleon-7B | ICoT | 29.8 | 51.0 | 25.2 |
| Chameleon-7B | AIMCoT | 31.4 | 53.1 | 29.8 |
| Chameleon-7B | Gain vs. ICoT | +5.5% | +4.1% | +18.3% |
| Qwen2-VL-7B | No-CoT | 43.6 | 56.3 | 32.7 |
| Qwen2-VL-7B | ICoT | 44.1 | 56.8 | 34.2 |
| Qwen2-VL-7B | AIMCoT | 44.7 | 57.4 | 36.3 |
| Qwen2-VL-7B | Gain vs. ICoT | +1.4% | +1.1% | +6.2% |
Ablation Study¶
| Configuration | Effect | Note |
|---|---|---|
| Mask Top-10 attention regions | Only −3.93% | Confirms that high-attention regions are not universally critical |
| CAG only | Moderate improvement | More reliable attention maps |
| CAG + AVP | Significant improvement | Active selection substantially outperforms passive Top-K |
| CAG + AVP + DAT (full) | Best | Three-module synergy surpasses any subset |
| Inference time vs. ICoT | ≤1.36× | Acceptable overhead |
Key Findings¶
- The largest advantage appears on the open-ended LLaVA-W benchmark (+18.3%), as open-ended settings demand more active information seeking.
- The 0-shot advantage is more pronounced than the 1-shot advantage, indicating that AIMCoT better elicits the model's intrinsic reasoning capacity.
- The exploratory candidate set \(C_{exp}\) (randomly sampled regions) provides many useful regions not covered by the attention map.
- The information gain function empirically exhibits approximate submodularity, supporting the effectiveness of the greedy algorithm.
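A note on the last point (a standard result, not a claim from the paper): submodularity is the usual justification for greedy selection. If the gain were exactly monotone submodular, i.e.

\(IG(S \cup \{R\}) - IG(S) \ge IG(T \cup \{R\}) - IG(T)\) for all \(S \subseteq T\) and \(R \notin T\),

then greedily selecting K regions would be guaranteed to reach at least a \((1 - 1/e) \approx 63\%\) fraction of the optimal achievable gain (Nemhauser et al., 1978). The observed approximate submodularity therefore makes greedy AVP a principled heuristic rather than an arbitrary shortcut.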
Highlights & Insights¶
- Paradigm shift from passive to active: Reframing visual region selection from "where is the model looking" to "what information is most helpful to the model" represents a significant conceptual advance in multimodal reasoning.
- Information gain as a selection metric: Quantifying the utility of visual regions via changes in predictive entropy is theoretically well-grounded and naturally resolves redundancy.
- Independent value of the dynamic trigger: The idea of monitoring cross-modal attention shifts to determine insertion timing generalizes to other multimodal information fusion scenarios.
Limitations & Future Work¶
- Computing information gain for candidate regions requires multiple forward passes (\(|C| + 1\)), making computational overhead the primary bottleneck.
- The attention-monitoring mechanism in DAT may behave inconsistently across different VLM architectures.
- Validation is limited to 7B-scale models; performance on larger models remains unknown.
- The quality of the random exploratory set depends on the image content distribution and may be inefficient in specific scenarios.
Related Work & Insights¶
- vs. ICoT: Both perform interleaved CoT, but AIMCoT upgrades passive Top-K selection to active information-gain-driven selection, resolving ICoT's unreliable high-attention-region problem.
- vs. DDCoT/CCoT: These methods generate only textual reasoning chains, whereas AIMCoT directly inserts visual evidence into the chain, providing stronger visual grounding.
- The application of Information Foraging Theory to NLP and multimodal settings warrants further exploration.
Rating¶
- Novelty: ⭐⭐⭐⭐ Information-gain-driven active selection is a meaningful innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, two backbones, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation is carefully analyzed with a solid theoretical foundation.
- Value: ⭐⭐⭐⭐ The training-free framework is highly practical, though gains diminish on stronger models.