
ZIM: Zero-Shot Image Matting for Anything

Conference: ICCV 2025
arXiv: 2411.00626
Code: https://naver-ai.github.io/ZIM
Area: Image Segmentation
Keywords: Image Matting, Zero-Shot, SAM, Label Conversion, Hierarchical Decoder

TL;DR

This paper proposes ZIM, a zero-shot image matting model that constructs the SA1B-Matte dataset by converting SA1B segmentation labels into fine-grained matting labels via a label converter. A hierarchical pixel decoder and a prompt-aware masked attention mechanism are further introduced to achieve micro-level fine-grained matting while preserving zero-shot generalization capability.

Background & Motivation

SAM has achieved remarkable progress in zero-shot segmentation; however, its generated masks lack fine edge quality (e.g., hair strands, branches). Existing methods that extend SAM to matting (Matte-Any, Matting-Any) rely on fine-tuning with public matting datasets, which contain only macro-level labels (e.g., full human portraits). This causes fine-tuned models to lose zero-shot capability at the micro level—yielding macro-level outputs even when given micro-level prompts (catastrophic forgetting).

Root causes:

  • SAM: Strong zero-shot generalization, but coarse masks with checkerboard artifacts.
  • Existing matting models: High-fidelity outputs, but poor zero-shot generalization.
  • Data bottleneck: Large-scale micro-level matting annotation is prohibitively expensive.

The key insight of this paper is that a label converter can automatically transform the large-scale micro-level segmentation labels in SA1B into matting labels, enabling large-scale training data acquisition without manual annotation.

Method

Overall Architecture

ZIM is built upon the SAM architecture, comprising four components: (1) an image encoder (ViT-B, stride 16); (2) a prompt encoder; (3) a Transformer decoder (token-to-image and image-to-token cross-attention); and (4) an improved hierarchical pixel decoder. The two core contributions correspond to data construction and network architecture, respectively.
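For intuition, here is a hypothetical, shape-level PyTorch sketch of such a forward pass. The module sizes, token layout, and the final token-feature dot product are illustrative assumptions rather than the authors' implementation, and the hierarchical decoder's pyramid skip connections are omitted here (they are described under Key Designs below).

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 256, 1024, 1024
image_embed = torch.randn(B, C, H // 16, W // 16)   # (1) ViT-B image encoder output, stride 16
prompt_tokens = torch.randn(B, 2, C)                # (2) prompt encoder output (e.g., box corners)
matte_token = torch.randn(B, 1, C)                  # learnable output token for the matte

# (3) Transformer decoder: token-to-image and image-to-token cross-attention
t2i_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
i2t_attn = nn.MultiheadAttention(C, num_heads=8, batch_first=True)
tokens = torch.cat([matte_token, prompt_tokens], dim=1)    # (B, 3, C)
img_seq = image_embed.flatten(2).transpose(1, 2)           # (B, 64*64, C)
tokens, _ = t2i_attn(tokens, img_seq, img_seq)             # tokens attend to image
img_seq, _ = i2t_attn(img_seq, tokens, tokens)             # image attends to tokens

# (4) Pixel decoder: upsample stride-16 features back to stride 2
# (the real hierarchical decoder also fuses stride-8/4/2 pyramid features; omitted here)
dec_in = img_seq.transpose(1, 2).reshape(B, C, H // 16, W // 16)
pixel_decoder = nn.Sequential(
    nn.ConvTranspose2d(C, 64, 2, stride=2), nn.GELU(),   # stride 16 -> 8
    nn.ConvTranspose2d(64, 32, 2, stride=2), nn.GELU(),  # stride 8 -> 4
    nn.ConvTranspose2d(32, 16, 2, stride=2),             # stride 4 -> 2
)
hi_res = pixel_decoder(dec_in)                            # (B, 16, 512, 512)

# Predict the matte as a dot product between the projected matte token and per-pixel features
proj = nn.Linear(C, 16)
matte_logits = torch.einsum("bd,bdhw->bhw", proj(tokens[:, 0]), hi_res)
print(matte_logits.shape)  # torch.Size([1, 512, 512]), i.e. stride 2
```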

Key Designs

  1. Label Converter: Built upon MGMatting with a Hiera-base-plus backbone. It takes an image and a segmentation label as input and outputs a fine-grained matting label. Training data are sourced from six public matting datasets (20,591 natural images + 118,749 synthetic images). Coarse segmentation labels are derived from matting labels via thresholding, downsampling, Gaussian blurring, and dilation/erosion operations (a sketch of this degradation follows the list below).

Two key strategies address training challenges:

  • Spatial Generalization Augmentation (SGA): Randomly crops identical regions from the segmentation and matting labels, forcing the converter to handle incomplete or irregular input patterns and thereby improving generalization to micro-level segmentation labels.
  • Selective Transformation Learning (STL): Not all objects require fine-grained matting (e.g., cars, tables). Coarse-grained object masks from ADE20K (187,063 masks) are incorporated, with the ground-truth matte set equal to the original segmentation label (i.e., no conversion), teaching the model to refine only objects that actually require fine-grained processing.

Training loss: \(L = L_{l1} + \lambda L_{grad}\), where \(L_{grad}\) denotes the gradient loss.

  2. Hierarchical Pixel Decoder: The original SAM decoder uses only two transposed convolution layers (output at stride 4), which are prone to checkerboard artifacts. The proposed decoder adopts a multi-resolution feature pyramid design, generating feature maps at strides 2/4/8 from the input image; the image embeddings are progressively upsampled and concatenated with the features at each corresponding resolution, yielding a high-resolution feature map at stride 2. This adds only about 10 ms of inference latency.

  3. Prompt-Aware Masked Attention (see the sketch after this list):

     • Box prompt: Generates a binary attention mask \(\mathcal{M}^b \in \{0, -\infty\}\) that constrains the model's attention to the box region.
     • Point prompt: Generates a soft attention mask \(\mathcal{M}^p \in [0,1]\) from a 2D Gaussian centered on the point (standard deviation \(\sigma = 21\)).
     • The masks are applied only to the token-to-image cross-attention layers; experiments show that applying them to image-to-token attention disrupts global feature capture.
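Below is a minimal NumPy sketch of how these two attention masks could be constructed, assuming a 64×64 embedding grid for a 1024×1024 input; the coordinate mapping, function names, and the way the soft point mask is folded into the attention weights are assumptions for illustration, not the authors' code.

```python
import numpy as np

def box_attention_mask(box_xyxy, feat_hw=(64, 64), img_hw=(1024, 1024)):
    """Binary additive mask M^b: 0 inside the box, -inf outside (added to attention logits)."""
    fh, fw = feat_hw
    sy, sx = img_hw[0] / fh, img_hw[1] / fw
    x1, y1, x2, y2 = box_xyxy
    mask = np.full((fh, fw), -np.inf, dtype=np.float32)
    mask[int(y1 / sy):int(np.ceil(y2 / sy)), int(x1 / sx):int(np.ceil(x2 / sx))] = 0.0
    return mask

def point_attention_mask(points_xy, feat_hw=(64, 64), img_hw=(1024, 1024), sigma=21.0):
    """Soft mask M^p in [0, 1]: a 2D Gaussian (std sigma, in image pixels) around each point."""
    fh, fw = feat_hw
    ys, xs = np.mgrid[0:fh, 0:fw]
    cy = (ys + 0.5) * img_hw[0] / fh   # feature-cell centers in image coordinates
    cx = (xs + 0.5) * img_hw[1] / fw
    mask = np.zeros((fh, fw), dtype=np.float32)
    for px, py in points_xy:
        g = np.exp(-((cx - px) ** 2 + (cy - py) ** 2) / (2 * sigma ** 2))
        mask = np.maximum(mask, g)
    return mask

# Usage: the masks are broadcast over heads/queries and combined with the attention
# logits of the token-to-image cross-attention layers only.
box_mask = box_attention_mask((300, 200, 700, 900))
pt_mask = point_attention_mask([(512, 384)])
```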
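Returning to the Label Converter (item 1), here is a minimal sketch of how a coarse segmentation label might be degraded from a ground-truth alpha matte via thresholding, downsampling, Gaussian blurring, and dilation/erosion; the OpenCV calls and parameter ranges are illustrative assumptions, not the paper's exact recipe.

```python
import cv2
import numpy as np

def matte_to_coarse_seg(alpha, rng=None):
    """Degrade a [0, 1] alpha matte into a coarse binary segmentation label."""
    rng = rng or np.random.default_rng()
    h, w = alpha.shape
    seg = (alpha > 0.5).astype(np.uint8)                  # 1) threshold to a hard mask

    scale = rng.uniform(0.125, 0.5)                       # 2) down/upsample to lose boundary detail
    small = cv2.resize(seg, (max(1, int(w * scale)), max(1, int(h * scale))),
                       interpolation=cv2.INTER_NEAREST)
    seg = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

    k = int(2 * rng.integers(3, 15) + 1)                  # 3) Gaussian blur + re-threshold
    seg = (cv2.GaussianBlur(seg.astype(np.float32), (k, k), 0) > 0.5).astype(np.uint8)

    ksz = int(rng.integers(3, 10))                        # 4) random dilation or erosion
    kernel = np.ones((ksz, ksz), np.uint8)
    seg = cv2.dilate(seg, kernel) if rng.random() < 0.5 else cv2.erode(seg, kernel)
    return seg

# Usage: (image, coarse seg) -> matte pairs train the converter; the trained converter is
# then run on SA1B segmentation labels to produce the SA1B-Matte training set.
coarse = matte_to_coarse_seg(np.random.rand(512, 512).astype(np.float32))
```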

Loss & Training

  • SA1B-Matte dataset: ~2.2M labels from SA1B (1% subset) are converted for training.
  • Training loss follows the label converter: \(L = L_{l1} + \lambda L_{grad}\) with \(\lambda = 10\) (a minimal sketch follows this list).
  • AdamW optimizer, lr=1e-5, cosine decay, 500K iterations.
  • Fine-tuned from SAM pre-trained weights.
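A minimal sketch of this loss, assuming the gradient term compares finite-difference spatial gradients of the predicted and ground-truth mattes (the paper may use a different gradient operator such as Sobel); shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def gradient_loss(pred, target):
    """L1 difference between finite-difference spatial gradients of prediction and target."""
    dpx = pred[..., :, 1:] - pred[..., :, :-1]
    dpy = pred[..., 1:, :] - pred[..., :-1, :]
    dtx = target[..., :, 1:] - target[..., :, :-1]
    dty = target[..., 1:, :] - target[..., :-1, :]
    return (dpx - dtx).abs().mean() + (dpy - dty).abs().mean()

def matting_loss(pred_alpha, gt_alpha, lam=10.0):
    """L = L_l1 + lambda * L_grad, with lambda = 10 as reported."""
    return F.l1_loss(pred_alpha, gt_alpha) + lam * gradient_loss(pred_alpha, gt_alpha)

# Toy usage
pred = torch.rand(1, 1, 64, 64, requires_grad=True)
gt = torch.rand(1, 1, 64, 64)
loss = matting_loss(pred, gt)
loss.backward()
```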

Key Experimental Results

Main Results (MicroMat-3K Benchmark, Box Prompt)

| Method | Prompt | Fine-grained SAD↓ | Fine-grained MSE↓ | Fine-grained Grad↓ | Coarse SAD↓ | Coarse MSE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| SAM | box | 36.086 | 11.057 | 14.867 | 3.516 | 1.044 |
| HQ-SAM | box | 124.262 | 42.457 | 13.673 | 8.458 | 2.733 |
| Matte-Any | box | 34.661 | 9.746 | 7.021 | 6.950 | 1.983 |
| Matting-Any | box | 246.214 | 68.372 | 19.185 | 109.639 | 23.780 |
| ZIM (ours) | box | 9.961 | 1.893 | 4.813 | 1.860 | 0.448 |

ZIM outperforms all baselines by a substantial margin across all metrics, reducing MSE by 83% relative to SAM and 81% relative to Matte-Any.

Ablation Study (Component Analysis, Box Prompt)

| Masked Attention | Hierarchical Decoder | Fine-grained SAD↓ | Fine-grained MSE↓ | Fine-grained Grad↓ | Coarse SAD↓ | Coarse MSE↓ |
| --- | --- | --- | --- | --- | --- | --- |
|  |  | 13.623 | 2.718 | 6.516 | 2.071 | 0.474 |
| ✓ |  | 13.198 | 2.504 | 6.445 | 2.049 | 0.471 |
|  | ✓ | 11.074 | 2.094 | 5.401 | 2.069 | 0.487 |
| ✓ | ✓ | 9.961 | 1.893 | 4.813 | 1.860 | 0.448 |

The two components are complementary: the hierarchical decoder primarily reduces Grad error (suppressing artifacts), while the attention mechanism further improves overall accuracy.

Key Findings

  • Downstream transferability: Replacing SAM with ZIM in multiple downstream frameworks—including Matte-Any, HQ-SAM, Inpainting Anything, medical image segmentation, and 3D segmentation—consistently yields significant improvements.
  • Training data impact: Training ZIM on public matting data causes MSE on MicroMat-3K to surge from 1.893 to 38.332, confirming the necessity of micro-level training data.
  • Domain shift: ZIM performs poorly on traditional matting benchmarks (full-body portraits) under box prompts, due to the prompt ambiguity it inherits from SAM; with dense multi-point prompts, however, it surpasses all existing methods.

Highlights & Insights

  • Data engineering innovation: The label converter approach of converting segmentation labels into matting labels is both elegant and effective; the SGA and STL strategies are carefully designed.
  • Lightweight improvement with large gains: The hierarchical decoder adds only 10 ms of latency while resolving the long-standing checkerboard artifact issue in SAM.
  • MicroMat-3K benchmark: 3,000 high-quality micro-level matting annotations that fill a critical gap in zero-shot matting evaluation.
  • Strong practicality: Supports multiple prompt modalities, including points, boxes, text, and strokes.

Limitations & Future Work

  • Inherits SAM's ambiguity under vague prompts (whole-object vs. part-level ambiguity).
  • Cannot handle transparency estimation for transparent objects (e.g., glass, fire).
  • The quality ceiling of the label converter is bounded by the coverage of source matting datasets.
  • The potential of larger backbones (ViT-H) remains unexplored.
  • The label conversion paradigm can be generalized to other dense prediction tasks (e.g., upgrading coarse annotations to fine-grained labels).
  • The Selective Transformation Learning (STL) idea is applicable to any scenario requiring discrimination between objects that do and do not need refinement.
  • The hierarchical decoder design serves as a plug-and-play improvement for the SAM family of models.

Rating

  • Novelty: ⭐⭐⭐⭐ (Novel label conversion framework and dataset construction pipeline)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Zero-shot evaluation on 23 datasets, multiple downstream tasks, and extensive ablations)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure, rich figures and tables)
  • Value: ⭐⭐⭐⭐⭐ (High practical utility, filling the gap in zero-shot matting)