ZIM: Zero-Shot Image Matting for Anything¶
- Conference: ICCV 2025
- arXiv: 2411.00626
- Code: https://naver-ai.github.io/ZIM
- Area: Image Segmentation
- Keywords: Image Matting, Zero-Shot, SAM, Label Conversion, Hierarchical Decoder
TL;DR¶
This paper proposes ZIM, a zero-shot image matting model that constructs the SA1B-Matte dataset by converting SA1B segmentation labels into fine-grained matting labels via a label converter. A hierarchical pixel decoder and a prompt-aware masked attention mechanism are further introduced to achieve micro-level fine-grained matting while preserving zero-shot generalization capability.
Background & Motivation¶
SAM has achieved remarkable progress in zero-shot segmentation; however, the masks it produces lack fine edge detail (e.g., hair strands, tree branches). Existing methods that extend SAM to matting (Matte-Any, Matting-Any) rely on fine-tuning with public matting datasets, which contain only macro-level labels (e.g., full human portraits). As a result, fine-tuned models lose zero-shot capability at the micro level, yielding macro-level outputs even when given micro-level prompts (catastrophic forgetting).
Root causes:
- SAM: Strong zero-shot generalization, but coarse masks with checkerboard artifacts.
- Existing matting models: High-fidelity outputs, but poor zero-shot generalization.
- Data bottleneck: Large-scale micro-level matting annotation is prohibitively expensive.
The key insight of this paper is that a label converter can automatically transform the large-scale micro-level segmentation labels in SA1B into matting labels, enabling large-scale training data acquisition without manual annotation.
Method¶
Overall Architecture¶
ZIM is built upon the SAM architecture, comprising four components: (1) an image encoder (ViT-B, stride 16); (2) a prompt encoder; (3) a Transformer decoder (token-to-image and image-to-token cross-attention); and (4) an improved hierarchical pixel decoder. The two core contributions correspond to data construction and network architecture, respectively.
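As a rough orientation, here is a minimal PyTorch-style skeleton of the four-component data flow. Every submodule is a hypothetical placeholder (the class name and signatures are not the authors' code); only the wiring follows the description above.

```python
import torch.nn as nn

class ZIMSkeleton(nn.Module):
    """Data-flow sketch of the four components described above; every
    submodule is a hypothetical placeholder, not the authors' code."""

    def __init__(self, image_encoder, prompt_encoder, transformer_decoder, pixel_decoder):
        super().__init__()
        self.image_encoder = image_encoder              # (1) ViT-B, stride 16
        self.prompt_encoder = prompt_encoder            # (2) points/boxes -> tokens
        self.transformer_decoder = transformer_decoder  # (3) token<->image cross-attention
        self.pixel_decoder = pixel_decoder              # (4) hierarchical pixel decoder

    def forward(self, image, prompts):
        image_emb = self.image_encoder(image)            # (B, C, H/16, W/16)
        tokens = self.prompt_encoder(prompts)            # (B, N, C)
        tokens, image_emb = self.transformer_decoder(tokens, image_emb)
        return self.pixel_decoder(image_emb, image)      # stride-2 alpha matte
```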
Key Designs¶
- Label Converter: Built upon MGMatting with a Hiera-base-plus backbone. It takes an image and a segmentation label as input and outputs a fine-grained matting label. Training data are sourced from six public matting datasets (20,591 natural images + 118,749 synthetic images). Coarse segmentation labels are derived from matting labels via thresholding, downsampling, Gaussian blurring, and dilation/erosion operations (a degradation sketch follows this list). Two key strategies address training challenges:
  - Spatial Generalization Augmentation (SGA): Randomly crops identical regions from segmentation and matting labels, forcing the converter to handle incomplete or irregular input patterns, thereby improving generalization to micro-level segmentation labels.
  - Selective Transformation Learning (STL): Not all objects require fine-grained matting (e.g., cars, tables). Coarse-grained object masks from ADE20K (187,063 masks) are incorporated, where the ground-truth matte equals the original segmentation label (i.e., no conversion), teaching the model to selectively refine only objects that require fine-grained processing.
Training loss: \(L = L_{l1} + \lambda L_{grad}\), where \(L_{grad}\) denotes the gradient loss.
- Hierarchical Pixel Decoder: The original SAM decoder contains only two transposed convolution layers (stride 4), which are prone to checkerboard artifacts. The proposed decoder adopts a multi-resolution feature pyramid design, extracting feature maps at strides 2/4/8 from the input image. The image embeddings are progressively upsampled and concatenated with features at the corresponding resolution, yielding a high-resolution feature map at stride 2 (a decoder sketch follows this list). Only 10 ms of additional inference latency is incurred.
- Prompt-Aware Masked Attention (mask construction is sketched after this list):
  - Box prompt: Generates a binary attention mask \(\mathcal{M}^b \in \{0, -\infty\}\) that constrains attention to the box region.
  - Point prompt: Generates a soft attention mask \(\mathcal{M}^p \in [0,1]\) based on a 2D Gaussian distribution (standard deviation \(\sigma=21\)).
  - Applied exclusively to token-to-image cross-attention layers (experiments show that applying it to image-to-token attention disrupts global feature capture).
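To make the data pipeline concrete, below is a minimal sketch of the matting-to-segmentation degradation used to build the converter's training pairs. The function name and all parameter values (threshold, downsample factor, kernel sizes, iteration counts) are illustrative assumptions; only the four listed operations come from the text.

```python
import cv2
import numpy as np

def degrade_matte_to_seg(alpha, thresh=0.5, scale=0.25, blur_ksize=21, morph_iters=2):
    """Synthesize a coarse segmentation label from a fine matting label.

    Steps follow the list above: threshold, downsample/upsample, Gaussian
    blur, then random dilation or erosion. Parameter values are guesses.
    """
    h, w = alpha.shape
    seg = (alpha > thresh).astype(np.uint8)                          # 1) binarize the matte
    small = cv2.resize(seg, (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_NEAREST)              # 2) downsample ...
    seg = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)  # ... and restore size
    seg = cv2.GaussianBlur(seg.astype(np.float32),
                           (blur_ksize, blur_ksize), 0)              # 3) blur the boundary
    seg = (seg > 0.5).astype(np.uint8)
    kernel = np.ones((5, 5), np.uint8)
    if np.random.rand() < 0.5:                                       # 4) random dilation/erosion
        seg = cv2.dilate(seg, kernel, iterations=morph_iters)
    else:
        seg = cv2.erode(seg, kernel, iterations=morph_iters)
    return seg
```

The hierarchical pixel decoder can be sketched as follows. The channel widths, the conv stems used to extract the stride-8/4/2 pyramid, and the concat-then-conv fusion are assumptions; the paper specifies only the multi-resolution pyramid, progressive upsampling with concatenation, and the stride-2 output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPixelDecoder(nn.Module):
    """Sketch: fuse the stride-16 image embedding with stride-8/4/2 image
    features to produce a stride-2 matte. Widths are assumptions."""

    def __init__(self, emb_dim=256, pyr_dims=(64, 32, 16)):
        super().__init__()
        # light conv stems extracting image features at strides 8, 4, 2
        self.stems = nn.ModuleList([
            nn.Conv2d(3, c, kernel_size=3, stride=s, padding=1)
            for c, s in zip(pyr_dims, (8, 4, 2))
        ])
        dims = (emb_dim,) + tuple(pyr_dims)
        self.fuse = nn.ModuleList([
            nn.Conv2d(dims[i] + pyr_dims[i], dims[i + 1], 3, padding=1)
            for i in range(3)
        ])
        self.head = nn.Conv2d(pyr_dims[-1], 1, 1)  # stride-2 matte logits

    def forward(self, image_emb, image):
        x = image_emb                                   # (B, 256, H/16, W/16)
        for stem, fuse in zip(self.stems, self.fuse):
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)      # double the resolution
            x = F.relu(fuse(torch.cat([x, stem(image)], dim=1)))  # fuse same-stride feature
        return self.head(x)                             # (B, 1, H/2, W/2)
```

Finally, the two prompt-aware attention masks. The box mask is additive in \(\{0, -\infty\}\) over the token-to-image attention logits; the point mask is a Gaussian in \([0,1]\) with \(\sigma=21\). The pixel-to-grid coordinate mapping and how the soft mask combines with the logits are assumptions.

```python
import torch

def box_attention_mask(box, feat_hw, img_hw):
    """Binary mask M^b in {0, -inf}: attention is allowed only inside the box."""
    (Hf, Wf), (H, W) = feat_hw, img_hw
    x1, y1, x2, y2 = box                                   # box corners in image pixels
    mask = torch.full((Hf, Wf), float('-inf'))
    fx1, fy1 = int(x1 / W * Wf), int(y1 / H * Hf)          # map to feature-grid coords
    fx2, fy2 = int(x2 / W * Wf) + 1, int(y2 / H * Hf) + 1
    mask[fy1:fy2, fx1:fx2] = 0.0                           # 0 inside the box, -inf outside
    return mask.flatten()                                  # add to token-to-image logits

def point_attention_mask(point, feat_hw, img_hw, sigma=21.0):
    """Soft mask M^p in [0, 1]: 2D Gaussian centered on the click, sigma = 21 px."""
    (Hf, Wf), (H, W) = feat_hw, img_hw
    px, py = point
    ys = torch.arange(Hf, dtype=torch.float32) * (H / Hf)  # cell centers in pixels
    xs = torch.arange(Wf, dtype=torch.float32) * (W / Wf)
    yy, xx = torch.meshgrid(ys, xs, indexing='ij')
    d2 = (xx - px) ** 2 + (yy - py) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2)).flatten()
```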
Loss & Training¶
- SA1B-Matte dataset: ~2.2M labels from SA1B (1% subset) are converted for training.
- Training loss follows the label converter: \(L = L_{l1} + \lambda L_{grad}\) with \(\lambda=10\) (a sketch follows this list).
- AdamW optimizer, lr=1e-5, cosine decay, 500K iterations.
- Fine-tuned from SAM pre-trained weights.
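A compact sketch of the training objective and the listed optimization recipe. The finite-difference form of \(L_{grad}\) is an assumption (the paper may use a different gradient operator), and the stand-in model is purely illustrative.

```python
import torch
import torch.nn.functional as F

def matting_loss(pred, gt, lam=10.0):
    """L = L_l1 + lambda * L_grad with lambda = 10 (as listed above).

    L_grad is approximated as the L1 distance between finite-difference
    gradients of the predicted and ground-truth mattes; this exact
    operator is an assumption.
    """
    l1 = F.l1_loss(pred, gt)
    dx = F.l1_loss(pred[..., :, 1:] - pred[..., :, :-1],
                   gt[..., :, 1:] - gt[..., :, :-1])      # horizontal gradients
    dy = F.l1_loss(pred[..., 1:, :] - pred[..., :-1, :],
                   gt[..., 1:, :] - gt[..., :-1, :])      # vertical gradients
    return l1 + lam * (dx + dy)

# Optimization recipe as listed: AdamW, lr = 1e-5, cosine decay over 500K iterations.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # hypothetical stand-in for ZIM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500_000)
```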
Key Experimental Results¶
Main Results (MicroMat-3K Benchmark, Box Prompt)¶
| Method | Prompt | Fine-grained SAD↓ | Fine-grained MSE↓ | Fine-grained Grad↓ | Coarse SAD↓ | Coarse MSE↓ |
|---|---|---|---|---|---|---|
| SAM | box | 36.086 | 11.057 | 14.867 | 3.516 | 1.044 |
| HQ-SAM | box | 124.262 | 42.457 | 13.673 | 8.458 | 2.733 |
| Matte-Any | box | 34.661 | 9.746 | 7.021 | 6.950 | 1.983 |
| Matting-Any | box | 246.214 | 68.372 | 19.185 | 109.639 | 23.780 |
| ZIM (ours) | box | 9.961 | 1.893 | 4.813 | 1.860 | 0.448 |
ZIM outperforms all baselines by a substantial margin across all metrics, reducing fine-grained MSE by roughly 83% relative to SAM and 81% relative to Matte-Any.
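These percentages follow directly from the table: \((11.057 - 1.893)/11.057 \approx 82.9\%\) relative to SAM, and \((9.746 - 1.893)/9.746 \approx 80.6\%\) relative to Matte-Any.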
Ablation Study (Component Analysis, Box Prompt)¶
| Masked Attention | Hierarchical Decoder | Fine SAD↓ | Fine MSE↓ | Fine Grad↓ | Coarse SAD↓ | Coarse MSE↓ |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 13.623 | 2.718 | 6.516 | 2.071 | 0.474 |
| ✓ | ✗ | 13.198 | 2.504 | 6.445 | 2.049 | 0.471 |
| ✗ | ✓ | 11.074 | 2.094 | 5.401 | 2.069 | 0.487 |
| ✓ | ✓ | 9.961 | 1.893 | 4.813 | 1.860 | 0.448 |
The two components are complementary: the hierarchical decoder primarily reduces Grad error (suppressing artifacts), while the attention mechanism further improves overall accuracy.
Key Findings¶
- Downstream transferability: Replacing SAM with ZIM in multiple downstream frameworks—including Matte-Any, HQ-SAM, Inpainting Anything, medical image segmentation, and 3D segmentation—consistently yields significant improvements.
- Training data impact: Training ZIM on public matting data causes MSE on MicroMat-3K to surge from 1.893 to 38.332, confirming the necessity of micro-level training data.
- Domain shift: ZIM performs poorly on traditional matting benchmarks (full-body portraits) under box prompts due to SAM's inherent prompt ambiguity; however, it surpasses all existing methods when dense multi-point prompts are used.
Highlights & Insights¶
- Data engineering innovation: The label converter approach of converting segmentation labels into matting labels is both elegant and effective; the SGA and STL strategies are carefully designed.
- Lightweight improvement with large gains: The hierarchical decoder adds only 10 ms of latency while resolving the long-standing checkerboard artifact issue in SAM.
- MicroMat-3K benchmark: 3,000 high-quality micro-level matting annotations that fill a critical gap in zero-shot matting evaluation.
- Strong practicality: Supports multiple prompt modalities, including points, boxes, text, and strokes.
Limitations & Future Work¶
- Inherits SAM's ambiguity under vague prompts (whole-object vs. part-level ambiguity).
- Cannot handle transparency estimation for transparent objects (e.g., glass, fire).
- The quality ceiling of the label converter is bounded by the coverage of source matting datasets.
- The potential of larger backbones (ViT-H) remains unexplored.
Related Work & Insights¶
- The label conversion paradigm can be generalized to other dense prediction tasks (e.g., upgrading coarse annotations to fine-grained labels).
- The Selective Transformation Learning (STL) idea is applicable to any scenario requiring discrimination between objects that do and do not need refinement.
- The hierarchical decoder design serves as a plug-and-play improvement for the SAM family of models.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Novel label conversion framework and dataset construction pipeline)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Zero-shot evaluation on 23 datasets, multiple downstream tasks, and extensive ablations)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures and tables)
- Value: ⭐⭐⭐⭐⭐ (High practical utility, filling the gap in zero-shot matting)