Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

Conference: ICLR 2026
arXiv: 2510.09741
Code: Project Page
Area: Multimodal VLM
Keywords: MLLM, image warping, attention-guided, fine-grained perception, test-time intervention

TL;DR

This paper proposes AttWarp, a plug-and-play test-time image warping method that leverages the MLLM's own cross-modal attention maps to perform rectilinear grid resampling — expanding high-attention regions and compressing low-attention regions — achieving consistent accuracy improvements, enhanced compositional reasoning, and reduced hallucinations across 5 benchmarks and 4 MLLMs.

Background & Motivation

Background: Multimodal large language models (MLLMs) such as LLaVA and Qwen-VL have made notable progress in image-based dialogue and reasoning, yet still exhibit significant deficiencies in fine-grained perception — missing small objects, confusing visually similar entities, and misinterpreting spatial relationships.

Limitations of Prior Work: Existing improvement methods either require external detectors (bounding boxes/masks), rely on multi-step reasoning chains, or employ cropping/masking strategies that discard global context.

Key Challenge: Small objects lose spatial detail during feature extraction, and subsequent attention-level improvements cannot recover this information; yet naive cropping or magnification discards global layout.

Goal: To enhance the resolution of query-relevant regions through spatial transformations at the input level, without modifying model weights or architecture.

Key Insight: Analogous to human foveal vision — densely sampling attended regions while sparsely sampling the periphery, thereby preserving global information.

Core Idea: Use the model's own attention to guide a single rectilinear warping transformation, enabling the same model to "see more clearly."

Method

Overall Architecture

Input image + query → MLLM extracts cross-modal attention maps → aggregated into an attention score matrix → marginal attention distributions computed → CDF inverse transform produces warp mapping → bilinear resampling yields warped image → same MLLM processes warped image to generate answer.
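
As a rough illustration of the first half of this pipeline, the sketch below averages a cross-attention tensor into a score map and reduces it to the two marginals. The tensor layout `(layers, heads, output_tokens, image_tokens)`, the row-major token order, and the function names are assumptions made for illustration, not details confirmed by the paper; upsampling and smoothing are omitted.

```python
import numpy as np

def attention_score_map(attn, grid_hw):
    """Average cross-modal attention into a 2D score map.

    attn: array of shape (layers, heads, output_tokens, image_tokens) (assumed layout).
    grid_hw: (rows, cols) of the image-token grid, assumed row-major.
    """
    rows, cols = grid_hw
    scores = attn.mean(axis=(0, 1, 2))   # average over layers, heads, output tokens
    return scores.reshape(rows, cols)

def marginal_distributions(score_map, eps=1e-8):
    """Return normalized column-wise (m_x) and row-wise (m_y) marginals."""
    m_x = score_map.sum(axis=0) + eps    # one value per column j
    m_y = score_map.sum(axis=1) + eps    # one value per row i
    return m_x / m_x.sum(), m_y / m_y.sum()

# Toy usage: 2 layers, 4 heads, 3 output tokens, a 4x6 grid of image tokens.
rng = np.random.default_rng(0)
attn = rng.random((2, 4, 3, 24))
m_x, m_y = marginal_distributions(attention_score_map(attn, (4, 6)))
```

Both marginals sum to one, so each can be treated as a probability distribution along its axis.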

Key Designs

Rectilinear Warping: The 2D attention matrix is decomposed into horizontal and vertical marginal distributions \(m_x(j), m_y(i)\). Their CDFs \(M_x, M_y\) are computed, and the inverse CDFs serve as the warp mapping: \(f_X^{\text{Warp}}(j) = W \cdot M_x^{-1}(j/W), \quad f_Y^{\text{Warp}}(i) = H \cdot M_y^{-1}(i/H)\). Because each axis is warped independently, the regular grid structure is preserved, which keeps the warped image compatible with standard visual encoders. No region is cropped or masked; only pixel density is redistributed.
AttWarp-Chain (Iterative Warping): Warping improves attention, and better attention yields better warping, forming a positive feedback loop. Iteration stops once the KL divergence between the attention distributions at consecutive depths falls below a threshold: \(\mathcal{D}_{KL}(P^{(d)} \,\|\, P^{(d-1)}) < \epsilon_{KL}\)
AttWarp-Distill (Distilled Variant): A lightweight network (CLIP ViT-L/14 + FiLM conditioning + Conv1D) is trained to directly predict marginal distributions \((\hat{m}_x, \hat{m}_y)\) from image-text pairs, bypassing the attention extraction step. Trained with L1 loss; inference requires a single forward pass, achieving 3× speedup over ViCrop.
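Attention Score Matrix: Cross-modal attention is extracted from a specified set of decoder layers \(\mathcal{L}\) of the MLLM and averaged across output tokens, attention heads, and layers, then upsampled to image resolution and smoothed: \(\tilde{A}_{i,j} = \frac{1}{n_{\text{out}} \cdot n_{\text{heads}} \cdot |\mathcal{L}|} \sum_{\ell \in \mathcal{L}} \sum_m \sum_h a^{(\ell,h)}_{m,t}\), where \(m\) indexes the \(n_{\text{out}}\) output tokens, \(h\) the \(n_{\text{heads}}\) heads, and \(t\) is read here as the image token at grid position \((i,j)\).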
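
The inverse-CDF warp described in the list above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the marginals are assumed to be already at image resolution, the CDF is treated as piecewise linear between pixel edges, output pixels are sampled at their centers, and the names `inverse_cdf_coords` and `rectilinear_warp` are hypothetical.

```python
import numpy as np

def inverse_cdf_coords(marginal, eps=1e-8):
    """Source coordinate for each output pixel along one axis, via the inverse CDF."""
    p = np.asarray(marginal, dtype=np.float64) + eps  # floor keeps the CDF strictly increasing
    n = p.size
    cdf = np.concatenate(([0.0], np.cumsum(p)))
    cdf /= cdf[-1]                                    # CDF values at the n + 1 pixel edges
    u = (np.arange(n) + 0.5) / n                      # output pixel centers in (0, 1)
    src = np.interp(u, cdf, np.arange(n + 1.0))       # invert the piecewise-linear CDF
    return np.clip(src - 0.5, 0.0, n - 1.0)           # edge coordinates -> pixel-center coordinates

def rectilinear_warp(image, m_x, m_y):
    """Warp an (H, W, C) image; m_x must have length W and m_y length H."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape[:2]
    assert len(m_x) == w and len(m_y) == h
    xs, ys = inverse_cdf_coords(m_x), inverse_cdf_coords(m_y)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (xs - x0)[None, :, None]
    wy = (ys - y0)[:, None, None]
    rows = img[y0] * (1 - wy) + img[y1] * wy          # bilinear step 1: blend source rows
    return rows[:, x0] * (1 - wx) + rows[:, x1] * wx  # bilinear step 2: blend source columns

# Toy usage: attention concentrated on the middle column of a 5x5 image.
image = np.arange(75, dtype=np.float64).reshape(5, 5, 3)
warped = rectilinear_warp(image, m_x=[1, 1, 8, 1, 1], m_y=[1, 1, 1, 1, 1])
```

Because a high marginal value makes the CDF steep, its inverse is flat there, so several output pixels sample from the same narrow source band. In the toy call, three of the five output columns fall inside the middle source column, while uniform marginals leave the image unchanged.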

Loss & Training

AttWarp / AttWarp-Chain: No training required; purely test-time methods.
AttWarp-Distill: Trained on TextVQA/GQA/DocVQA training sets using teacher attention as supervision targets with L1 loss.
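
To make the distillation objective concrete, here is a small NumPy sketch of FiLM-style conditioning and an L1 loss against teacher marginals. The exact form of the modulation, the equal weighting of the two axes, the use of a mean rather than a sum, and all names are assumptions for illustration; the source only states that FiLM conditioning and an L1 loss are used.

```python
import numpy as np

def film(features, text_emb, w_gamma, w_beta):
    """Feature-wise linear modulation: scale and shift features using the text embedding."""
    gamma = text_emb @ w_gamma            # per-channel scale, shape (C,)
    beta = text_emb @ w_beta              # per-channel shift, shape (C,)
    return features * gamma + beta        # features has shape (N, C)

def l1_marginal_loss(pred_x, pred_y, teacher_x, teacher_y):
    """Mean absolute error between predicted and teacher marginals, summed over both axes."""
    loss_x = np.abs(np.asarray(pred_x) - np.asarray(teacher_x)).mean()
    loss_y = np.abs(np.asarray(pred_y) - np.asarray(teacher_y)).mean()
    return float(loss_x + loss_y)

# Toy usage: 4 image features with 3 channels, a 2-dimensional text embedding.
features = np.ones((4, 3))
text_emb = np.array([1.0, 0.0])
w_gamma = np.array([[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]])
w_beta = np.zeros((2, 3))
modulated = film(features, text_emb, w_gamma, w_beta)
loss = l1_marginal_loss([0.5, 0.5], [1.0], [0.25, 0.75], [1.0])
```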

Key Experimental Results

Main Results

Results on LLaVA-v1.5-7B (accuracy %):

Method	TextVQA	GQA	MMMU	POPE	DocVQA
Base MLLM	49.3	60.5	36.9	85.3	18.1
ViCrop	56.3	60.9	37.2	87.0	22.5
AttWarp	58.1	63.7	40.4	87.5	25.5
AttWarp-Chain	60.3	64.4	41.6	88.2	27.6
Δ vs. strongest baseline	+4.0	+3.5	+4.4	+1.2	+5.1
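
The Δ row can be reproduced from the table with a few lines of Python, taking ViCrop as the strongest baseline among the rows shown (an assumption based on this table alone). The helper name `gain` is hypothetical.

```python
benchmarks = ["TextVQA", "GQA", "MMMU", "POPE", "DocVQA"]
accuracy = {
    "Base MLLM":     [49.3, 60.5, 36.9, 85.3, 18.1],
    "ViCrop":        [56.3, 60.9, 37.2, 87.0, 22.5],
    "AttWarp":       [58.1, 63.7, 40.4, 87.5, 25.5],
    "AttWarp-Chain": [60.3, 64.4, 41.6, 88.2, 27.6],
}

def gain(method, reference):
    """Accuracy difference in percentage points, per benchmark."""
    pairs = zip(benchmarks, accuracy[method], accuracy[reference])
    return {name: round(m - r, 1) for name, m, r in pairs}

delta_vs_vicrop = gain("AttWarp-Chain", "ViCrop")
delta_vs_base = gain("AttWarp-Chain", "Base MLLM")
print(delta_vs_vicrop)
print(delta_vs_base)
```

Against the base model, the same calculation gives gains of 11.0, 3.9, 4.7, 2.9, and 9.5 points on the five benchmarks.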

Consistent improvements are also observed on Qwen2.5-VL (+2.1–3.6%).

Ablation Study

Attention distribution improvement validation (TextVQA):

Metric	w/o Warping	w/ AttWarp
Pointing Game Accuracy	37.4%	42.4% (+5.0 pp)
Proportion (attention within bbox)	0.117	0.155 (+0.038)

Distribution shift analysis: Measured by KID against the training distribution (lower means closer), AttWarp scores 31.5 versus 174.9 for a non-rectilinear warp, indicating that rectilinear warping induces a far smaller distribution shift.

Key Findings

Warping measurably concentrates attention on the correct regions, improving Pointing Game accuracy by 5 percentage points (37.4% to 42.4%).
The rectilinear design is critical — non-rectilinear warping causes severe distribution shift (KID increases from 31.5 to 174.9).
AttWarp-Distill requires only 8.7 TFLOPs, close to the Base MLLM's 8.5 TFLOPs and far more efficient than ViCrop's 24.2 TFLOPs.
Error analysis indicates that AttWarp primarily reduces errors in fine-grained detail perception and compositional reasoning.
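
As a quick arithmetic check on the compute figures quoted above, the ratios can be worked out directly. Note that a FLOPs ratio is not the same thing as a measured wall-clock speedup, so this is only a sanity check on the reported numbers.

```python
tflops = {"Base MLLM": 8.5, "AttWarp-Distill": 8.7, "ViCrop": 24.2}

# How many times more compute ViCrop needs than AttWarp-Distill.
ratio_vs_vicrop = tflops["ViCrop"] / tflops["AttWarp-Distill"]
# Relative compute overhead of AttWarp-Distill over the base model.
overhead_vs_base = tflops["AttWarp-Distill"] / tflops["Base MLLM"] - 1.0

print(f"ViCrop / AttWarp-Distill: {ratio_vs_vicrop:.2f}x")   # 2.78x
print(f"Overhead over base: {overhead_vs_base:.1%}")          # 2.4%
```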

Highlights & Insights

Philosophy of "Constructive Distortion": Inspired by human foveal vision, actively warping the input is a principled and effective strategy for improving perception.
Plug-and-Play: Requires no model modification; consistently effective across 4 diverse MLLM architectures (LLaVA, Qwen-VL, InternVL, InstructBLIP).
Information-Preserving: Unlike cropping or masking, warping retains all pixel information and only redistributes density.
CDF Inverse Transform Framework: The mathematical framework converting attention distributions into warp mappings is elegant and concise, requiring only a single CDF computation and inversion rather than iterative optimization.
Positive Feedback in AttWarp-Chain: Iterative reinforcement where warping improves attention and attention improves warping, with KL divergence providing automatic termination.
Distribution Preservation Analysis: Careful validation that rectilinear warping introduces little distribution shift (verified via KID/FID/Mahalanobis distance).
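
The AttWarp-Chain stopping rule mentioned above can be sketched as a short loop. The callables `get_attention` and `warp`, the default threshold, and the `max_depth` safety cap are hypothetical placeholders; the source specifies only that iteration ends when the KL divergence between consecutive attention distributions drops below \(\epsilon_{KL}\).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two non-negative arrays, normalized to sum to one."""
    p = np.asarray(p, dtype=np.float64).ravel() + eps
    q = np.asarray(q, dtype=np.float64).ravel() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def attwarp_chain(image, get_attention, warp, eps_kl=1e-3, max_depth=5):
    """Warp repeatedly until the attention distribution stops changing."""
    prev, n_warps = None, 0
    for _ in range(max_depth):
        attn = get_attention(image)               # attention on the current (possibly warped) image
        if prev is not None and kl_divergence(attn, prev) < eps_kl:
            break                                  # converged: attention barely moved
        image = warp(image, attn)
        n_warps += 1
        prev = attn
    return image, n_warps

# Toy usage with stand-in callables: attention never changes, so one warp suffices.
fixed_attention = np.array([0.1, 0.2, 0.7])
result, n_warps = attwarp_chain("image", lambda im: fixed_attention, lambda im, a: im)
```

A smaller threshold allows more iterations before stopping, which is why the iteration count depends on this hyperparameter.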

Limitations & Future Work

Requires two MLLM forward passes (one for attention extraction, one for inference), doubling latency.
Warping may suppress peripheral context beneficial for global reasoning, particularly for questions requiring full-scene understanding.
Absolute scale information is lost after warping, potentially affecting size-related questions.
The number of AttWarp-Chain iterations depends on the KL threshold hyperparameter.
Attention quality is a prerequisite — if the initial attention is severely misaligned, warping will be counterproductive.
There is no theoretical upper bound on warp magnitude; extreme warping may cause severe compression of non-target regions.
Application to video understanding models (temporally consistent warping) remains unexplored.

Related Work & Insights

Compared to test-time intervention methods such as FGVP/SoM/ViCrop: AttWarp is the only method that preserves complete image information.
Compared to APIPrompting: the latter overlays attention heatmaps, introducing non-original information; AttWarp maintains pure image input.
AttWarp can be seen as a modern revival of classical methods such as seam carving and saliency-aware warping: whereas those approaches rely on per-image optimization (minutes per image), AttWarp needs only a single CDF computation and inversion.
Insight: Input-level intervention (as opposed to intermediate representation manipulation) is an underexplored yet effective strategy for improving perceptual models.
Implication for Embodied AI / AR Devices: AttWarp-Distill's single-pass inference is well-suited for low-latency deployment scenarios.

Rating

Novelty: ⭐⭐⭐⭐ — Attention-guided warping is a novel idea; the CDF inverse transform framework is elegant, drawing inspiration from foveal vision.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 5 benchmarks, 4 models, distribution analysis, attention verification, and error analysis; highly comprehensive.
Writing Quality: ⭐⭐⭐⭐⭐ — Motivation → method → experiments flow logically; figures are clear and analysis is thorough.
Value: ⭐⭐⭐⭐ — High practical value as a plug-and-play approach, though it is fundamentally a test-time trick with limited theoretical depth.