Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding¶
Conference: CVPR 2026 · arXiv: 2512.10548 · Code: N/A
Area: Multimodal Large Language Models / Image Restoration (Visual Perception Enhancement)
Keywords: visual token resolution, dynamic attention, multimodal large language models, saliency guidance, token super-resolution
TL;DR¶
This paper proposes Blink, a framework that dynamically expands and discards visual tokens across different Transformer layers of an MLLM — simulating the human "rapid blinking" scanning process — to adaptively enhance visual perception within a single forward pass, improving LLaVA-1.5 performance across multiple multimodal benchmarks.
Background & Motivation¶
Background: Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks (e.g., LLaVA, Qwen-VL), yet their visual perception capability remains insufficient and prone to hallucinations.
Limitations of Prior Work: Existing MLLMs process visual inputs using conventional LLM architectures, lacking explicit exploitation of salient visual regions. Post-processing methods (e.g., identifying salient regions followed by cropping and re-inference) are computationally inefficient and can only focus on a single region at a time.
Key Challenge: Humans perceive visual scenes through a dynamic "scan–focus–shift" process, whereas MLLMs treat all visual tokens uniformly and cannot simulate cross-layer attention shifts.
Goal: How can an MLLM's visual perception capability be dynamically enhanced within a single forward pass?
Key Insight: A pilot study first uncovers two key insights — (a) different layers attend to different visual regions, and (b) allocating additional computation to high-attention tokens improves perception — which then motivate the design of the dynamic framework.
Core Idea: Leveraging the non-uniform distribution of attention maps, the framework dynamically decides at each layer whether to expand (via super-resolution enhancement) or discard visual tokens, thereby simulating the human "scan–focus–shift" cognitive process.
Method¶
Overall Architecture¶
Standard MLLM forward pass → At selected layers: compute saliency map → determine whether expansion/discard thresholds are exceeded → if so, apply the TokenSR module to expand tokens in salient regions → discard expanded tokens in subsequent layers if attention shifts → final output.
Key Designs¶
- Saliency-Guided Scanning (SGS):
  - At each participating layer \(L\), the attention of the last text token over all visual tokens is computed as \(S_v^{(L)} = q_{t_n}^{(L)} (k_v^{(L)})^\top\).
  - Visual tokens are reshaped into an \(H \times W\) grid and divided into \(p \times p\) patches; the aggregated saliency of each patch is then computed.
  - A saliency ratio is defined as \(\rho^{(L)} = \frac{\mathcal{S}_{r_{\max}}^{(L)}}{\sum_i \mathcal{S}_{r_i}^{(L)}}\), reflecting the degree of attention concentration.
  - Design Motivation: The pilot study reveals substantial variation in attention distributions across layers; high concentration indicates that the model is "confident" in attending to a specific region, making that region a suitable candidate for enhancement.
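The SGS scoring step can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function name, shapes, and the softmax normalization of the raw attention logits are assumptions made so that the saliency ratio lands in \((0, 1]\).

```python
import numpy as np

def saliency_ratio(q_last, K_v, H, W, p):
    """Sketch of Saliency-Guided Scanning (SGS) at one layer.

    q_last : (d,)    query of the last text token
    K_v    : (H*W, d) keys of the visual tokens
    Returns the index of the most salient p x p patch and the
    saliency ratio rho = S_max / sum_i S_i over patches.
    """
    s = K_v @ q_last                        # raw attention logits, (H*W,)
    s = np.exp(s - s.max()); s /= s.sum()   # softmax so scores are positive (assumption)
    grid = s.reshape(H, W)                  # back onto the H x W token grid
    # Aggregate saliency inside each p x p patch
    patches = grid.reshape(H // p, p, W // p, p).sum(axis=(1, 3))
    flat = patches.ravel()
    return int(flat.argmax()), float(flat.max() / flat.sum())

# Toy example: a 4x4 token grid with 2x2 patches, hidden size d = 8
rng = np.random.default_rng(0)
idx, rho = saliency_ratio(rng.normal(size=8), rng.normal(size=(16, 8)), H=4, W=4, p=2)
```

A high `rho` means one patch dominates the attention mass, which is the trigger condition for token expansion described next.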
- Dynamic Token Resolution (DTR):
  - Token Expansion: When \(\rho^{(L)} > \tau_{\text{exp}}\), the TokenSR module performs super-resolution enhancement on the salient patch, \(hs_{SR}^{(L)} = \text{TokenSR}^{(L)}(hs_{LR}^{(L)})\), and the enhanced tokens are inserted into the sequence: \([hs_s; hs_v; hs_{SR}; hs_t]\).
  - Token Discarding: When \(\rho^{(L)} < \tau_{\text{drop}}\), previously expanded tokens are removed and the original sequence is restored.
  - Design Motivation: Expansion increases computational investment in salient regions, while discarding prevents low-information tokens from interfering with subsequent reasoning.
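The per-layer expand/discard decision reduces to simple threshold logic. The sketch below is illustrative only: the threshold values, the dictionary-based sequence layout, and the function name are assumptions, not the paper's code.

```python
def dtr_step(sequence, sr_tokens, rho, tau_exp=0.30, tau_drop=0.10):
    """Sketch of one Dynamic Token Resolution (DTR) decision.

    sequence : dict with 'sys', 'vis', 'sr', 'txt' token lists,
               mirroring [hs_s; hs_v; hs_SR; hs_t]
    sr_tokens: enhanced tokens produced for the salient patch
    rho      : saliency ratio at this layer
    """
    if rho > tau_exp:                        # attention is concentrated: expand
        sequence['sr'] = list(sr_tokens)
    elif rho < tau_drop and sequence['sr']:  # attention has shifted: discard
        sequence['sr'] = []                  # restore the original sequence
    return sequence['sys'] + sequence['vis'] + sequence['sr'] + sequence['txt']

seq = {'sys': ['s0'], 'vis': ['v0', 'v1'], 'sr': [], 'txt': ['t0']}
expanded = dtr_step(seq, ['v0_sr', 'v1_sr'], rho=0.6)  # above tau_exp: expand
restored = dtr_step(seq, [], rho=0.05)                 # below tau_drop: discard
```

The middle band \(\tau_{\text{drop}} \le \rho \le \tau_{\text{exp}}\) leaves the sequence unchanged, so the model neither expands nor discards when attention is only moderately concentrated.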
- Token Super-Resolution Module (TokenSR):
  - A lightweight module consisting of three 2D convolutional layers with ReLU activations.
  - During training, salient-region tokens from the full image are upscaled, with tokens from the corresponding cropped image serving as reference; the training objective minimizes KL divergence between the two.
  - The MLLM backbone is frozen; only TokenSR parameters are trained.
  - Design Motivation: Inspired by image super-resolution, a lightweight network recovers fine-grained details from low-resolution tokens while preserving semantic consistency.
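A module of this shape could look like the PyTorch sketch below. The paper specifies only the depth (three 2D convolutions) and the ReLU activations; the channel widths, kernel sizes, and the token-grid reshaping are assumptions.

```python
import torch
import torch.nn as nn

class TokenSR(nn.Module):
    """Sketch of a lightweight TokenSR module: three 2D conv layers
    with ReLU, convolving over the salient patch's token grid.
    Hyperparameters here are illustrative assumptions."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, tokens, h, w):
        # tokens: (N, d) salient-patch tokens with N == h * w
        x = tokens.T.reshape(1, -1, h, w)   # to a (1, d, h, w) feature map
        y = self.net(x)                     # convolve over the token grid
        return y.squeeze(0).flatten(1).T    # back to (N, d) enhanced tokens

sr_tokens = TokenSR(32)(torch.randn(4, 32), h=2, w=2)
```

Keeping the input and output shapes identical lets the enhanced tokens slot directly into the sequence as \(hs_{SR}\) without touching the frozen backbone.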
Loss & Training¶
- TokenSR training: minimizes KL divergence between enhanced tokens and cropped reference tokens.
- Training data: LLaVA-1.5 training set (COCO + GQA + OCR-VQA + TextVQA + VisualGenome).
- All operations are performed prior to layer normalization, ensuring the Transformer correctly handles expanded or pruned sequences.
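The KL objective above can be sketched as follows. The choice of softmax over the hidden dimension and treating the cropped-image tokens as a detached (frozen) target are assumptions; the paper states only that the KL divergence between the two token sets is minimized.

```python
import torch
import torch.nn.functional as F

def tokensr_kl_loss(enhanced, reference):
    """KL divergence between TokenSR-enhanced full-image tokens and
    the cropped-image reference tokens (normalization axis assumed)."""
    log_p = F.log_softmax(enhanced, dim=-1)
    q = F.softmax(reference.detach(), dim=-1)  # reference acts as a fixed target
    return F.kl_div(log_p, q, reduction='batchmean')

loss = tokensr_kl_loss(torch.randn(4, 32), torch.randn(4, 32))
```

Since only TokenSR's parameters receive gradients, this distillation-style loss aligns the enhanced tokens with the higher-resolution crop without updating the backbone.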
Key Experimental Results¶
Main Results (LLaVA-1.5-7B)¶
| Benchmark | Vanilla | Blink-interp | Blink | Gain |
|---|---|---|---|---|
| MME Perception | 1505.72 | 1514.08 | 1519.74 | +14.02 |
| MME Cognition | 357.86 | 353.21 | 361.79 | +3.93 |
| GQA | 61.93 | 61.93 | 61.98 | +0.05 |
| MMBench | 64.60 | 64.69 | 64.69 | +0.09 |
| MMBench-CN | 58.08 | 58.51 | 58.59 | +0.51 |
| POPE | 85.17 | 85.17 | 85.23 | +0.06 |
| ScienceQA | 69.46 | 69.51 | 69.66 | +0.20 |
| MM-Vet | 32.20 | 31.70 | 33.40 | +1.20 |
Ablation Study¶
| Configuration | MME Total | Change | Notes |
|---|---|---|---|
| Blink (full) | 1881.53 | — | Best |
| w/o SGS (random selection) | 1879.38 | -2.15 | Saliency guidance is necessary |
| w/o DTR (fixed period) | 1840.46 | -41.07 | Dynamic resolution adjustment is critical |
| w/o Drop | 1884.03 | +2.50 | Omitting discard yields a marginal gain under Blink |
| High \(\tau_{\text{exp}}\) | 1865.54 | -15.99 | Excessively high threshold limits effective expansion |
Key Findings¶
- Removing the DTR module causes the largest performance drop (−41.07), confirming it as the core component of the framework.
- Blink-interp (training-free interpolation) also improves MME Perception by 8.36 points, demonstrating the intrinsic value of the dynamic inference pipeline.
- The full Blink model consistently matches or outperforms the baseline across all benchmarks.
- The selected layer range (layers 12–18) corresponds to the "correct-attention intermediate layer" interval identified in the pilot study.
Highlights & Insights¶
- The two findings from the pilot study (cross-layer attention shifts + efficacy of increased computation on salient tokens) provide a solid empirical foundation for the method design.
- The "dynamic scan–focus" mechanism elegantly simulates the human visual cognition process.
- The plug-and-play design requires training only the lightweight TokenSR module, with the backbone entirely frozen.
- The Blink-interp variant demonstrates that gains can be achieved through the inference pipeline alone, even without training.
Limitations & Future Work¶
- The absolute performance gains are modest (MME total score +17.95), though the direction is promising.
- Validation is limited to LLaVA-1.5-7B; evaluation on larger models and more recent architectures remains to be conducted.
- The thresholds \(\tau_{\text{exp}}\) and \(\tau_{\text{drop}}\) require manual tuning; adaptive learning of these values warrants exploration.
- Currently, only one salient patch is selected per layer; scenarios with multiple salient regions may require an extended design.
Related Work & Insights¶
- Post-processing zoom methods (e.g., LLaVA-HR) require multiple forward passes and are computationally inefficient.
- Visual token pruning approaches (e.g., FastV, LLaVA-PruMerge) represent a complementary direction — Blink focuses on "enhancing the important" rather than "removing the unimportant."
- Insight: The internal attention distributions of MLLMs contain rich visual perception signals that merit further investigation.
Rating¶
- Novelty: ⭐⭐⭐⭐ The idea of dynamic token resolution adjustment is novel, and the pilot study provides strong motivation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Seven benchmarks, detailed ablation studies, and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ The logical chain from empirical findings to method design is clearly articulated.
- Value: ⭐⭐⭐⭐ Introduces a new perspective for enhancing MLLM visual perception in a plug-and-play manner.