HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model¶
Conference: ICCV 2025 · arXiv: 2503.13026 · Code: github.com/yayafengzi/LMM-HiMTok · Area: Image Segmentation / Large Multimodal Models · Keywords: Hierarchical Mask Tokenization, Large Multimodal Model, Segmentation, Vector Quantization, Visual Grounding
TL;DR¶
This paper proposes HiMTok (Hierarchical Mask Tokenizer), which represents segmentation masks as up to 32 coarse-to-fine discrete tokens, enabling LMMs to directly generate segmentation results in the same manner as text generation — without any additional image-conditioned mask decoder — achieving state-of-the-art performance across multiple segmentation benchmarks.
Background & Motivation¶
Existing LMM-driven segmentation approaches fall into three main paradigms, each with notable shortcomings:
- Boundary point sequences (e.g., PolyFormer, VistaLLM): masks are represented as sequences of polygon vertices, but a limited number of vertices cannot adequately capture complex shapes or multi-region objects.
- Hidden states + mask decoder (e.g., LISA, PixelLM, PSALM): the LMM outputs hidden states of special tokens, which an external SAM/Mask2Former module then decodes into masks. This introduces three limitations:
  - The LLM does not sufficiently learn precise spatial localization.
  - Mask representations are inconsistent between input and output (special tokens serve merely as identifiers, discarding the corresponding hidden-state information).
  - The architecture is complex, as the mask decoder requires re-access to the original image.
- Image-generation-style approaches (masks quantized into 2D token grids, as in VQ-GAN): these are overly redundant and fail to achieve competitive performance.
Core Problem: Can LMMs natively acquire segmentation capability — generating masks as they generate text — without relying on external segmentation models?
Method¶
Overall Architecture¶
The HiMTok system consists of three components (a minimal sketch follows this list):
- Mask Tokenizer (MT): encodes segmentation masks into a set of 1D latent tokens.
- Vector Quantization (VQ) layer: discretizes the latent tokens against a learned codebook.
- Mask Detokenizer (MD): reconstructs segmentation masks from the discrete tokens.
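To make the pipeline concrete, here is a minimal PyTorch sketch under stated assumptions: the module sizes, the nearest-neighbor VQ with a straight-through estimator, and the pooling detokenizer head are all illustrative guesses, not the paper's exact architecture (the causal attention over latent tokens is sketched separately below).

```python
# Minimal sketch of the MT -> VQ -> MD pipeline. Layer counts, dimensions, and
# the VQ/decoder details are illustrative assumptions; the paper's tokenizer is
# a TiTok-style 1D design with causal attention over the latent tokens.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Discretizes continuous latent tokens against a learned codebook."""

    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        flat = z.reshape(-1, z.size(-1))                  # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)    # (B*T, K)
        ids = dist.argmin(dim=-1).view(z.shape[:2])       # (B, T) discrete ids
        z_q = self.codebook(ids)                          # quantized latents
        z_q = z + (z_q - z).detach()                      # straight-through grad
        return z_q, ids


class MaskTokenizer(nn.Module):
    """MT: encodes a binary mask into T ordered 1D latent tokens."""

    def __init__(self, num_tokens: int = 32, dim: int = 256, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.latent_queries = nn.Parameter(torch.randn(num_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.num_tokens = num_tokens

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(mask).flatten(2).transpose(1, 2)  # (B, P, D)
        queries = self.latent_queries.expand(mask.size(0), -1, -1)   # (B, T, D)
        x = self.encoder(torch.cat([patches, queries], dim=1))
        return x[:, -self.num_tokens:]                    # keep the latent tokens


class MaskDetokenizer(nn.Module):
    """MD: reconstructs the mask from (any prefix of) the discrete tokens,
    without re-accessing the original image."""

    def __init__(self, dim: int = 256, out_hw: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mask = nn.Linear(dim, out_hw * out_hw)
        self.out_hw = out_hw

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        h = self.decoder(z_q).mean(dim=1)                 # pool over tokens
        return self.to_mask(h).view(-1, 1, self.out_hw, self.out_hw).sigmoid()


# Usage: thanks to the hierarchy, any token prefix decodes to a (coarser) mask.
mt, vq, md = MaskTokenizer(), VectorQuantizer(), MaskDetokenizer()
dummy = (torch.rand(2, 1, 256, 256) > 0.5).float()
z_q, ids = vq(mt(dummy))
for l in (4, 16, 32):
    print(l, md(z_q[:, :l]).shape)                        # coarse -> fine previews
```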
A three-stage training scheme progressively integrates segmentation capability into the LMM:
- Stage 1: train HiMTok alone (unimodal mask reconstruction).
- Stage 2: jointly train the LMM + HiMTok (low-resolution images; aligns vision–language and mask tokens).
- Stage 3: train the LMM only (high-resolution images; fine-tuning).
Key Designs¶
- Hierarchical Mask Tokenization: Inspired by TiTok, masks are compressed into 32 1D discrete tokens. The key innovation is a causal attention mechanism in which each latent token is conditioned on the input mask patches and all preceding tokens, enforcing a coarse-to-fine hierarchical structure:
Early tokens encode coarse spatial locations and prototypes, while later tokens capture local fine-grained details. This design is naturally aligned with the autoregressive principle of LLMs.
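The causal structure can be made concrete with the attention mask below; the token ordering (P mask-patch tokens first, then T latent tokens) and the choice to block patch→latent attention are assumptions of this sketch.

```python
# Attention mask implied by the causal design: each latent token attends to
# all mask-patch tokens and only to the latent tokens preceding it.
import torch

P, T = 256, 32                       # patch tokens, latent tokens (assumed sizes)
n = P + T
attn_mask = torch.zeros(n, n, dtype=torch.bool)   # False = attend, True = block
attn_mask[P:, P:] = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
attn_mask[:P, P:] = True             # (assumption) patches do not look at latents
# Pass as: nn.TransformerEncoder(...)(x, mask=attn_mask)
```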
- Hierarchical Mask Loss (HML): explicit supervision at multiple levels enforces the hierarchical property (see the sketch after this item). At level \(l\), the MD takes only the first \(l\) mask tokens and independently reconstructs \(\hat{M}^{(l)}\), which is supervised against a Gaussian-blurred mask label \(M^{(l)}\) whose kernel size shrinks as \(l\) grows: short prefixes are matched against coarse, heavily blurred targets, while longer prefixes must recover fine boundaries. Each level's loss combines a BCE loss and a Dice loss. For efficiency, only a subset of levels is sampled per training step, following an inverse power-law distribution.
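Putting the pieces together, the objective can be reconstructed from the description above as the following display equation; the level weights \(\lambda_l\), the sampled level set \(\mathcal{S}\), and the Gaussian notation are notational choices of this note, not necessarily the paper's symbols:

\[
\mathcal{L}_{\mathrm{HML}} \;=\; \sum_{l \in \mathcal{S}} \lambda_l \Big[ \mathcal{L}_{\mathrm{BCE}}\big(\hat{M}^{(l)}, M^{(l)}\big) + \mathcal{L}_{\mathrm{Dice}}\big(\hat{M}^{(l)}, M^{(l)}\big) \Big], \qquad M^{(l)} = G_{\sigma(l)} * M,
\]

where \(G_{\sigma(l)}\) is a Gaussian kernel whose width decreases as \(l\) grows. A hedged sketch of the corresponding training step follows; the kernel schedule, power-law exponent, and uniform level weights are illustrative guesses, not the paper's values:

```python
# Hedged sketch of HML: decode each sampled prefix length, supervise against a
# level-dependent Gaussian-blurred target.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF


def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    inter = (pred * target).sum(dim=(-2, -1))
    union = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (union + eps)).mean()


def blurred_target(mask: torch.Tensor, level: int, max_level: int = 32) -> torch.Tensor:
    k = 2 * (max_level - level) + 1        # odd kernel; shrinks as level grows
    return TF.gaussian_blur(mask, kernel_size=k) if k > 1 else mask


def sample_levels(n: int = 4, max_level: int = 32, alpha: float = 1.5) -> list[int]:
    # "inverse power-law" sampling; which end is favored is an assumption here
    probs = torch.arange(1, max_level + 1, dtype=torch.float).pow(-alpha)
    return (torch.multinomial(probs / probs.sum(), n) + 1).tolist()


def hml_loss(detokenizer, z_q: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    loss = z_q.new_zeros(())
    for l in sample_levels():
        pred = detokenizer(z_q[:, :l])     # reconstruct from the first l tokens
        tgt = blurred_target(mask, l)
        loss = loss + F.binary_cross_entropy(pred, tgt) + dice_loss(pred, tgt)
    return loss
```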
- Bidirectional Information Flow: Training data includes both box→mask and mask→box conversion directions, enabling the LMM to learn the intrinsic relationship between detection and segmentation. Bounding boxes are generated directly by the LMM rather than parsed from masks. A notable finding is that outputting mask tokens before bounding boxes (a form of visual chain-of-thought) improves visual grounding accuracy.
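To illustrate the two directions concretely, a hedged serialization sketch follows; the special-token names (`<mask_i>`, `<box>`) and normalized-coordinate formatting are placeholders invented for this note, not the paper's actual vocabulary.

```python
# Hypothetical serialization of the bidirectional training targets.
# mask -> box: mask tokens act as a visual chain-of-thought before the box.

def mask_then_box(mask_ids: list[int], box: tuple[float, float, float, float]) -> str:
    masks = "".join(f"<mask_{i}>" for i in mask_ids)
    coords = ",".join(f"{v:.3f}" for v in box)
    return f"{masks}<box>{coords}</box>"

def box_then_mask(box: tuple[float, float, float, float], mask_ids: list[int]) -> str:
    coords = ",".join(f"{v:.3f}" for v in box)
    return f"<box>{coords}</box>" + "".join(f"<mask_{i}>" for i in mask_ids)

print(mask_then_box([5, 812, 40], (0.12, 0.30, 0.55, 0.80)))
```

The paper's finding is that the mask-then-box ordering yields better grounding, consistent with the REC ablation reported below.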
Loss & Training¶
- Stage 1: Mask reconstruction task; HiMTok fully trained; 256×256 resolution; codebook size 1024; 32 latent tokens.
- Stage 2: Joint optimization with cross-entropy loss + HML; LMM (InternVL 2.5 backbone) + HiMTok partially trained; low-resolution 448×448 input; 7.1M data samples.
- Stage 3: Cross-entropy loss only; LMM training only; high-resolution input; 5.0M data samples (segmentation data ratio reduced to 0.24).
- Compute: 2,752 A800 GPU-hours in total (192 + 1,920 + 640 across Stages 1–3).
Key Experimental Results¶
Main Results¶
Referring Expression Segmentation (RefCOCO/+/g, cIoU):
| Method | w/ SFM | RefCOCO val | RefCOCO+ val | RefCOCOg val |
|---|---|---|---|---|
| LISA-7B(ft) | ✓ | 74.9 | 65.1 | 67.9 |
| PixelLM-7B | ✓ | 73.0 | 66.3 | 69.3 |
| PSALM | ✓ | 83.6 | 72.9 | 73.8 |
| u-LLaVA | ✓ | 83.0 | 77.1 | 77.1 |
| LMM_HiMTok-8B | ✗ | 81.1 | 77.1 | 75.8 |
| LMM_HiMTok-8B(ft) | ✗ | 85.0 | 79.7 | 80.0 |
| LMM_HiMTok-8B(ft)+SAM | ✓ | 85.9 | 80.5 | 80.1 |
Without relying on any segmentation foundation model (SFM), the proposed method achieves state-of-the-art performance, substantially outperforming both prior SFM-free and SFM-based approaches.
Open-Vocabulary Segmentation (mIoU):
| Method | ADE20K (A-150) | PASCAL Context | PASCAL VOC |
|---|---|---|---|
| PSALM | 18.2 | 48.5 | 81.3 |
| LaSagnA | 14.3 | 46.1 | 69.8 |
| LMM_HiMTok-8B | 25.0 | 43.9 | 82.0 |
Ablation Study¶
Effect of Hierarchical Mask Loss (HML) (RefCOCO val / RefCOCO+ val / RefCOCOg val):
| HML | RefCOCO | RefCOCO+ | RefCOCOg |
|---|---|---|---|
| ✗ | 79.2 | 64.7 | 63.9 |
| ✓ | 81.1 | 77.1 | 75.8 |
Without HML, RefCOCO+/g scores drop substantially (−12.4/−11.9), and the model requires the full 32-token sequence to function; with HML, flexible token lengths are supported.
Effect of Mask Token Length on REC (Visual Grounding):
| Token count → box | Acc@0.5 | Acc@0.9 |
|---|---|---|
| 0 (direct box prediction) | ~90.3 | ~57 |
| 16 → box | ~92 | ~73 |
| 32 → box | ~93 | ~78 |
Mask tokens serving as a visual chain-of-thought significantly improve high-precision localization (Acc@0.9).
Key Findings¶
- 16 mask tokens already achieve 82.8% cIoU; 32 tokens provide an additional 2.5% gain.
- In the bidirectional information flow, the mask→box direction is more beneficial: conditioning on previously generated mask tokens makes box prediction easier than predicting the bounding box directly.
- Nearly identical scores on ReasonSeg val and test (60.7 vs. 60.8 gIoU) indicate strong textual reasoning capability.
- General image understanding ability is largely preserved, with results on MME comparable to InternVL2.5-8B.
- Small object segmentation remains the primary challenge, with cIoU significantly lower than the overall performance.
Highlights & Insights¶
- Paradigm shift: This work is the first to achieve high-quality LMM segmentation without relying on external segmentation models; mask tokens are learned by the LLM as a new form of language.
- Input–output consistency: Mask tokenization and detokenization are consistent across LLM input and output — a property that prior hidden-state-based approaches cannot guarantee.
- Hierarchical design naturally aligns with LLM autoregression: The coarse-to-fine token hierarchy perfectly matches next-token prediction.
- Visual CoT effect of bidirectional information flow: The segment-then-localize paradigm is a novel contribution that offers a new perspective on visual reasoning in LMMs.
- Architectural simplicity: The detokenizer is a lightweight transformer that does not require the original image at inference time.
Limitations & Future Work¶
- The mask token length must be predefined and cannot adapt to the shape complexity of individual objects.
- The current model is passive in nature, requiring the user to specify a referring expression rather than proactively segmenting all objects of interest.
- The absence of multi-scale feature design limits performance on fine-grained regional segmentation.
- A notable performance gap exists between small object segmentation and overall results.
- Stage 2 training requires 1,920 A800 GPU-hours, incurring non-trivial computational cost.
Related Work & Insights¶
- TiTok demonstrated that natural images can be compressed into a small number of 1D tokens, inspiring the compact representation of masks in this work.
- The coarse-to-fine next-scale prediction paradigm from VAR is adapted for mask hierarchy.
- InternVL 2.5 serves as the backbone LMM, providing a strong vision–language foundation.
- The limitations of hidden-state-based approaches such as LISA and PSALM are clearly analyzed, motivating the proposed new paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ A fundamentally new mask representation paradigm with an elegant hierarchical design and insightful bidirectional information flow.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers RES/GRES/ReasonSeg/OVS/REC/general understanding with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ The three-paradigm comparison figure is clear, and the method is described comprehensively.
- Value: ⭐⭐⭐⭐⭐ Opens a new direction for LMM-based segmentation; open-sourced code; high practical value.