HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

Conference: ACL 2026 · arXiv: 2508.00553 · Code: GitHub
Area: Multimodal Efficiency / Visual Token Compression
Keywords: visual token pruning, hierarchical attention, training-free, model-agnostic, VLM acceleration

TL;DR

This paper identifies a hierarchical attention pattern in vision encoders—middle layers attend to foreground objects while deep layers capture global information—and proposes HiPrune, a training-free, model-agnostic visual token pruning method. By selecting three categories of tokens (Anchor/Buffer/Register) to preserve information at different semantic levels, HiPrune retains 99.3% of performance using only 1/3 of the tokens while reducing FLOPs by 58.7%.

Background & Motivation

Background: VLMs encode images into large numbers of tokens (576 in LLaVA-1.5, exceeding 10,000 in high-resolution settings), incurring substantial computational and memory overhead. Visual tokens exhibit high redundancy—randomly removing 50% of visual tokens causes far less performance degradation than removing 5% of text tokens.

Limitations of Prior Work: (1) Methods such as FastV prune tokens within the LLM decoder based on attention scores, without exploiting the intrinsic properties of the vision encoder itself. (2) CLS-token-based attention methods cannot be applied to encoders lacking a CLS token (e.g., SigLIP). (3) Most approaches are sensitive to specific model architectures and require targeted tuning.

Key Challenge: Existing pruning methods either rely on feedback from the LLM side (wasteful—all tokens are encoded before pruning) or employ single-dimensional metrics (e.g., only the final-layer attention), overlooking the fact that different layers of the vision encoder capture information at different semantic granularities.

Goal: Leverage the intrinsic hierarchical attention patterns of vision encoders to design a general token pruning strategy.

Key Insight: Systematic analysis of hierarchical attention patterns across multiple vision encoders—CLIP, SigLIP, DeiT, and VJEPA2—reveals a consistent layer-wise specialization.

Core Idea: Middle-layer high-attention tokens correspond to foreground objects (Anchor); their spatial neighbors (Buffer) preserve local semantics; deep-layer high-attention tokens are distributed uniformly across the image (Register) and capture global information. Allocating a token budget across these three categories yields a compact yet comprehensive image representation.

Method

Overall Architecture

HiPrune operates inside the vision encoder, before its output is handed to the LLM: (1) attention scores are extracted from an intermediate layer to select high-attention tokens as Anchors; (2) spatial neighbors of Anchors are selected as Buffers; (3) attention scores from the output layer are used to select the remaining budget of tokens as Registers, excluding already-selected tokens; (4) HiPrune++ optionally supplements these with tokens selected via text-visual cosine similarity. Only the output-layer tokens at the selected indices are retained.
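A minimal PyTorch sketch of this selection pipeline, assuming per-token saliency vectors for the object layer and the output layer have already been extracted (see the sketch after the Key Designs list). The function name `hiprune_select`, the `anchor_ratio` parameter, and the boundary handling are illustrative choices, not the authors' exact implementation.

```python
import torch

def hiprune_select(a_mid: torch.Tensor,  # [N] per-token saliency at the object (middle) layer
                   a_out: torch.Tensor,  # [N] per-token saliency at the output (last) layer
                   grid_w: int,          # tokens per row of the patch grid
                   budget: int,          # total number of visual tokens to keep
                   anchor_ratio: float = 0.2) -> torch.Tensor:
    """Return sorted indices of the visual tokens to retain."""
    n = a_mid.numel()
    keep = torch.zeros(n, dtype=torch.bool)

    # (1) Anchors: highest-saliency tokens at the middle "object" layer.
    anchors = a_mid.topk(max(1, int(budget * anchor_ratio))).indices
    keep[anchors] = True

    # (2) Buffers: the four cross-pattern neighbours of each anchor.
    for off in (-1, 1, -grid_w, grid_w):
        nbr = anchors + off
        valid = (nbr >= 0) & (nbr < n)
        if abs(off) == 1:  # do not wrap across row boundaries
            valid &= (nbr // grid_w) == (anchors // grid_w)
        keep[nbr[valid]] = True

    # (3) Registers: fill the remaining budget with top output-layer
    #     tokens, excluding anything already selected.
    remaining = budget - int(keep.sum())
    if remaining > 0:
        masked = a_out.masked_fill(keep, float("-inf"))
        keep[masked.topk(remaining).indices] = True

    return keep.nonzero(as_tuple=True)[0]
```

With `budget = 192` on a 24×24 grid (576 tokens), this reproduces the 1/3 retention ratio used in the main experiments; if anchors plus buffers already exhaust the budget, the register step is simply skipped.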

Key Designs

  1. Anchor Token (middle-layer object tokens):

    • Function: Preserve local detail of foreground objects.
    • Mechanism: Attention scores \(\mathbf{a}^{[l]}\) are extracted from a designated object layer \(l\) (a middle layer of the vision encoder), and the top-\(N_a\) high-attention tokens are selected (a sketch of this saliency extraction follows the list below). Quantitative validation shows that the top-10% high-attention tokens at middle layers achieve the highest IoU with COCO object segmentation masks.
    • Design Motivation: Middle-layer attention focuses on image foregrounds, a pattern that holds consistently across CLIP, SigLIP, DeiT, and VJEPA2.
  2. Buffer Token (spatial neighborhood tokens):

    • Function: Mitigate attention noise and preserve spatial relationships.
    • Mechanism: For each Anchor token, its four spatial neighbors (top, bottom, left, right, forming a cross pattern) are selected as Buffers: \(\mathcal{I}_B = \{\,i - 1,\; i + 1,\; i - c,\; i + c : i \in \mathcal{I}_A\,\}\), where \(c\) is the number of tokens per row of the patch grid.
    • Design Motivation: Attention maps are noisy—a small number of high-attention tokens may be scattered across the image rather than concentrated on objects. Buffers correct this via spatial continuity.
  3. Register Token (deep-layer global tokens):

    • Function: Preserve global context and holistic image understanding.
    • Mechanism: The remaining budget is allocated to top-attention tokens from the output layer (last layer), excluding already-selected Anchor and Buffer tokens. Deep-layer high-attention tokens are distributed uniformly across the image, encoding global information.
    • Design Motivation: Retaining only object tokens discards global context (e.g., scene type, spatial layout); Register tokens compensate for this deficit.
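How the saliency vector \(\mathbf{a}^{[l]}\) is obtained depends on the encoder. A hedged sketch under common conventions: for encoders with a CLS token, use the CLS query's attention over patches; for CLS-free encoders such as SigLIP, use the mean attention each token receives. The head- and query-averaging choices here are assumptions, not necessarily the paper's exact recipe.

```python
import torch

def token_saliency(attn: torch.Tensor, has_cls: bool) -> torch.Tensor:
    """attn: [heads, T, T] post-softmax attention map of one encoder layer.

    Returns one saliency score per patch token (T - 1 scores with CLS, else T).
    """
    attn = attn.mean(dim=0)       # average over heads -> [T, T]
    if has_cls:
        return attn[0, 1:]        # attention the CLS query pays to each patch
    return attn.mean(dim=0)       # mean attention each token receives as a key
```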

Loss & Training

HiPrune is entirely training-free and does not modify any model parameters. HiPrune++ additionally selects a small number of tokens based on cosine similarity between text and visual tokens to enhance instruction following.
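A sketch of the HiPrune++ supplement, assuming visual and text tokens have been projected into a shared embedding space (e.g., after the VLM's projector). Taking the max similarity over text tokens is an illustrative aggregation, not confirmed by the paper; `text_guided_extra` and its parameters are hypothetical names.

```python
import torch
import torch.nn.functional as F

def text_guided_extra(visual: torch.Tensor,  # [N, d] projected visual tokens
                      text: torch.Tensor,    # [M, d] instruction/text tokens
                      keep: torch.Tensor,    # [N] bool mask of already-kept tokens
                      n_extra: int) -> torch.Tensor:
    """Pick n_extra not-yet-selected visual tokens most relevant to the text."""
    sim = F.normalize(visual, dim=-1) @ F.normalize(text, dim=-1).T  # [N, M] cosine
    score = sim.max(dim=-1).values                  # best text match per visual token
    score = score.masked_fill(keep, float("-inf"))  # exclude already-selected tokens
    return score.topk(n_extra).indices
```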

Key Experimental Results

Main Results

LLaVA-1.5-7B (576→192 tokens, 33.3% retained)

| Method | GQA | MMB | MME | POPE | SQA | VQAv2 | Avg. |
|---|---|---|---|---|---|---|---|
| Original (576 tokens) | 61.9 | 64.7 | 1862 | 85.9 | 69.5 | 78.5 | 100% |
| ToMe | 54.3 | 60.5 | – | – | – | – | – |
| FastV | 58.2 | 62.1 | – | – | – | – | – |
| HiPrune | 61.4 | 64.2 | 1852 | 85.6 | 69.1 | 78.1 | 99.3% |

Ablation Study

| Configuration | Avg. Performance Retained |
|---|---|
| Anchor only (middle layer) | 94.2% |
| Register only (deep layer) | 92.8% |
| Anchor + Register | 97.5% |
| Anchor + Buffer + Register | 99.3% |

Key Findings

  • Retaining only 1/3 of tokens preserves 99.3% of performance with a 58.7% FLOPs reduction, confirming the high redundancy of visual tokens.
  • The hierarchical attention pattern is consistently observed across six diverse architectures (CLIP-L/B, SigLIP, SigLIP2, DeiT, VJEPA2), indicating it is an intrinsic property of vision encoders rather than an artifact of specific training.
  • HiPrune++ retains 96.1% performance under an extremely low token budget (1/9 of tokens) and significantly reduces hallucinations.
  • The contribution of Buffer tokens, while modest, is stable—improving performance from 97.5% to 99.3%—and is insensitive to the shape of the neighborhood pattern (cross vs. square).

Highlights & Insights

  • The finding that "middle layers attend to objects while deep layers capture global context" is concise and compelling, supported by both quantitative analysis (IoU) and qualitative visualization (attention maps).
  • The training-free and model-agnostic design makes HiPrune a truly plug-and-play tool.
  • The three token categories correspond to three levels of image understanding: local detail, spatial context, and global semantics.

Limitations & Future Work

  • The object layer \(l\) must be determined for each encoder type (typically a middle layer).
  • For tasks requiring precise pixel-level understanding (e.g., OCR), pruning may discard critical information.
  • Extension to video token pruning remains underexplored.
  • Integration with dynamic-resolution encoders requires further validation.

Comparison with Prior Methods

  • vs. FastV: FastV prunes within the LLM decoder after encoding all tokens; HiPrune prunes at the vision encoder stage, operating earlier and more efficiently.
  • vs. CLS-based methods: Reliance on the CLS token limits generality (SigLIP has no CLS token); HiPrune uses hierarchical attention scores and applies to any ViT.
  • vs. ToMe: ToMe merges tokens by similarity and requires training or additional computation; HiPrune performs pure index selection with zero additional overhead.

Rating

  • Novelty: ⭐⭐⭐⭐ The hierarchical attention analysis is a novel finding, though the pruning mechanism itself is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Consistent validation across 4 VLMs and 6 vision encoder architectures.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clear; the logical chain from observation to method is complete.
  • Value: ⭐⭐⭐⭐⭐ Plug-and-play deployment with 58.7% FLOPs reduction offers high practical utility.