
LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

Conference: ICCV 2025 | arXiv: 2508.01152 | Code: https://github.com/XinyuYanTJU/LawDIS | Area: Segmentation
Keywords: Dichotomous Image Segmentation, Diffusion Models, Language Control, Window Refinement, High-Resolution Segmentation

TL;DR

This paper proposes LawDIS, a language-window dual-control dichotomous image segmentation framework built upon Stable Diffusion. In macro mode, language prompts guide target segmentation; in micro mode, variable-size windows refine local details. LawDIS comprehensively outperforms 11 state-of-the-art methods on DIS5K.

Background & Motivation

Dichotomous Image Segmentation (DIS) aims to precisely segment foreground objects from high-resolution images, requiring the capture of fine structures and internal details. It serves as a foundation for applications such as 3D reconstruction, image editing, and augmented reality.

Limitations of existing DIS methods:

  1. Semantic ambiguity: When multiple foreground entities are present, discriminative (per-pixel classification) methods give users no way to specify which object to segment.
  2. Resolution constraints: Most methods downsample inputs to around 1024 px; scaling the input resolution up arbitrarily is computationally infeasible.
  3. Lack of local refinement: Some methods (e.g., MVANet) split images into fixed-size patches but cannot adapt at inference to patch sizes that differ from training.
  4. No interactivity: Existing methods produce fixed outputs with no mechanism for user-driven adjustment.

Mechanism: DIS is reformulated as an image-conditioned mask generation task based on a Latent Diffusion Model (LDM), which naturally accommodates diverse user control signals.

Method

Overall Architecture

LawDIS is built on Stable Diffusion v2 and supports two control modes via a Mode Switcher:

  • Macro Mode: Language-controlled segmentation — users provide language prompts to specify the target object.
  • Micro Mode: Window-controlled refinement — users specify a local window region for fine-grained refinement.

The two modes can operate independently or jointly.

Key Designs

1. Generative Formulation

DIS is modeled as conditional denoising diffusion: forward noising and reverse denoising are applied to the latent representation of the segmentation mask (see the sketch below).

  • The input image \(x\) is encoded via the VAE into a latent representation that serves as the condition.
  • The encoded mask is noised, and the UNet predicts the noise.
  • A TCD (Trajectory Consistency Distillation) scheduler enables single-step denoising, substantially improving efficiency.
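
A minimal PyTorch sketch of one training step under this formulation, assuming diffusers-style `vae`, `unet`, and `scheduler` objects; the function name and conditioning interface are illustrative, not the released code:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(vae, unet, scheduler, image, mask, text_emb):
    # Encode image (condition) and mask (target) into the latent space;
    # the single-channel mask is tiled to 3 channels for the RGB VAE.
    with torch.no_grad():
        z_img = vae.encode(image).latent_dist.sample()
        z_mask = vae.encode(mask.repeat(1, 3, 1, 1)).latent_dist.sample()
    noise = torch.randn_like(z_mask)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z_mask.shape[0],), device=z_mask.device)
    z_noisy = scheduler.add_noise(z_mask, noise, t)
    # Condition by concatenating the image latent with the noisy mask latent
    # along channels; the prompt embedding enters via cross-attention.
    pred = unet(torch.cat([z_img, z_noisy], dim=1), t,
                encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)
```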

2. Mode Switcher

  • A one-dimensional vector is added to the diffusion model's time embedding after positional encoding.
  • Different values activate different modes (macro or micro).
  • Both modes jointly optimize the same UNet during training (see the sketch after this list).
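
A sketch of how such a switch could be realized; `ModeSwitcher` and its exact placement are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ModeSwitcher(nn.Module):
    """One learned embedding per mode, added onto the time embedding
    after positional encoding (0 = macro, 1 = micro)."""
    def __init__(self, embed_dim: int, num_modes: int = 2):
        super().__init__()
        self.mode_emb = nn.Embedding(num_modes, embed_dim)

    def forward(self, t_emb: torch.Tensor, mode: torch.Tensor) -> torch.Tensor:
        # t_emb: (B, D) time embedding; mode: (B,) integer mode ids
        return t_emb + self.mode_emb(mode)
```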

3. Language-Controlled Segmentation (Macro Mode)

  • Language prompts are generated by a VLM or provided by the user.
  • Prompts are encoded via CLIP into control embeddings and injected into the UNet via cross-attention (sketched after this list).
  • The loss function follows the standard diffusion denoising objective.
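
A sketch of the prompt-encoding step with Hugging Face's CLIP components; the checkpoint id and prompt are illustrative (the paper builds on SD v2):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

repo = "stabilityai/stable-diffusion-2-1"   # illustrative SD v2 checkpoint
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

prompt = "the white flower in the foreground"   # hypothetical user prompt
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 1024)
# `text_emb` is passed to the UNet as `encoder_hidden_states`, where the
# pretrained cross-attention layers consume it.
```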

4. Window-Controlled Refinement (Micro Mode)

  • Unsatisfactory regions from the initial segmentation are cropped as refinement windows.
  • Core innovation: The local mask from the initial segmentation (rather than Gaussian noise) is used as the diffusion starting point, implicitly conveying global context.
  • Empty prompts are used; cropped local patches are upsampled to the model input size.
  • Refined results are pasted back into the corresponding positions of the initial segmentation map (see the sketch below).
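
A sketch of the crop-refine-paste loop, assuming a callable `refiner` that runs micro-mode denoising from the supplied mask initialization (all names hypothetical):

```python
import torch
import torch.nn.functional as F

def refine_window(refiner, image, init_mask, box, size=1024):
    """Crop a window, refine it starting from the initial mask rather than
    noise, and paste the result back. `box` is (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    img_patch = F.interpolate(image[..., y0:y1, x0:x1], (size, size),
                              mode="bilinear", align_corners=False)
    mask_patch = F.interpolate(init_mask[..., y0:y1, x0:x1], (size, size),
                               mode="bilinear", align_corners=False)
    refined = refiner(img_patch, init=mask_patch, prompt="")  # empty prompt in micro mode
    refined = F.interpolate(refined, (y1 - y0, x1 - x0),
                            mode="bilinear", align_corners=False)
    out = init_mask.clone()
    out[..., y0:y1, x0:x1] = refined
    return out
```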

5. VAE Decoder Fine-tuning

  • The VAE Encoder and UNet are frozen; only the Decoder is fine-tuned.
  • Encoder-Decoder skip connections are added.
  • The output channel count is changed from 3 to 1 (for mask output), with weights initialized by channel averaging (sketched below).
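
A sketch of the channel-averaging initialization for the 3-to-1 output change; the helper name is hypothetical:

```python
import torch
import torch.nn as nn

def to_single_channel(conv_out: nn.Conv2d) -> nn.Conv2d:
    """Swap the decoder's final 3-channel conv for a 1-channel one whose
    weights (and bias) are the channel-wise mean of the pretrained RGB ones."""
    new_conv = nn.Conv2d(conv_out.in_channels, 1,
                         kernel_size=conv_out.kernel_size,
                         stride=conv_out.stride,
                         padding=conv_out.padding)
    with torch.no_grad():
        new_conv.weight.copy_(conv_out.weight.mean(dim=0, keepdim=True))
        new_conv.bias.copy_(conv_out.bias.mean(dim=0, keepdim=True))
    return new_conv
```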

Loss & Training

  • UNet training: \(L_u = L_{\text{macro}} + L_{\text{micro}}\) (joint training)
  • VAE Decoder training: \(L_d = L_{\text{wbce}} + L_{\text{wiou}}\) (wBCE + wIoU; sketched after this list)
  • UNet trained for 30K iterations; VAE Decoder trained for 6K iterations.
  • Batch size = 32, Adam optimizer, \(lr = 3 \times 10^{-5}\).
  • All inputs are uniformly resized to \(1024 \times 1024\).
  • DDPM with 1000 steps is used for UNet training; TCD single-step scheduling is used for Decoder training.
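
A sketch of the wBCE + wIoU decoder objective in the style popularized by F3Net's structure loss; the exact weighting in LawDIS may differ:

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU: pixels that deviate from their local
    neighborhood mean (edges, thin structures) receive larger weights.
    `pred` holds logits, `mask` binary ground truth, both (B, 1, H, W)."""
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```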

Key Experimental Results

Main Results

Evaluated on the DIS5K test set (DIS-TE, 2000 images), compared against 11 methods:

| Method | \(F_\beta^w\) ↑ | \(F_\beta^{mx}\) ↑ | MAE ↓ | \(S_\alpha\) ↑ | \(E_\phi^{mn}\) ↑ |
|---|---|---|---|---|---|
| BiRefNet'24 | 0.858 | 0.896 | 0.035 | 0.901 | 0.934 |
| MVANet'24 | 0.862 | 0.907 | 0.034 | 0.909 | 0.938 |
| Ours-S (Language only) | 0.898 | 0.929 | 0.027 | 0.925 | 0.955 |
| Ours-R (Language + Window) | 0.908 | 0.932 | 0.024 | 0.926 | 0.959 |

On DIS-TE1, Ours-S improves \(F_\beta^w\) over MVANet by 6.6%; Ours-R achieves a 7.0% improvement.

Ablation Study

Ablation on the DIS-TE4 subset:

| Configuration | \(F_\beta^{mx}\) ↑ | MAE ↓ | \(S_\alpha\) ↑ | \(E_\phi^{mn}\) ↑ |
|---|---|---|---|---|
| Baseline (vanilla SD) | 0.904 | 0.047 | 0.904 | 0.916 |
| w/o micro training | 0.912 | 0.037 | 0.909 | 0.943 |
| w/o VAE Decoder fine-tuning | 0.919 | 0.040 | 0.915 | 0.933 |
| Full Ours-S | 0.926 | 0.032 | 0.920 | 0.955 |

Micro mode effectiveness (DIS-TE4):

| Configuration | \(F_\beta^w\) ↑ | MAE ↓ | \(\text{BIoU}_m\) ↑ | HCE ↓ |
|---|---|---|---|---|
| Ours-S (baseline) | 0.890 | 0.032 | 0.795 | 2481 |
| Initialize from noise | -4.7% | +1.9% | -7.1% | -863 |
| Automatic window selection | +1.7% | -0.5% | +2.9% | -767 |
| Semi-automatic window selection | +2.0% | -0.6% | +3.2% | -871 |

(Non-baseline rows report changes relative to Ours-S.)

Key Findings

  1. Initializing micro-mode denoising from the initial segmentation's latent significantly outperforms initializing from noise (a 4.7% gap in \(F_\beta^w\)), confirming the necessity of propagating global context from the initial segmentation.
  2. Joint training of both modes outperforms training the macro mode alone, as the two modes mutually enhance geometric representation.
  3. VAE Decoder fine-tuning is essential — omitting it increases MAE from 0.032 to 0.040.
  4. Even fully automatic window selection effectively improves segmentation quality, indicating that user interaction is not strictly required.
  5. Using language prompts at both training and test time yields the best performance.

Highlights & Insights

  • Reformulating DIS from discriminative to generative opens an entirely new methodological direction.
  • The Mode Switcher design is remarkably elegant: a single one-dimensional vector controls the same UNet to switch between two modes.
  • Initializing the micro-mode diffusion process from the initial segmentation result rather than noise is the key mechanism for transferring cross-scale context.
  • Language control endows DIS with interactivity and personalization for the first time.
  • TCD single-step denoising makes diffusion-based approaches practically efficient.

Limitations & Future Work

  1. The framework relies on a VLM to generate language prompts; prompt quality directly affects segmentation performance.
  2. Micro mode requires users or an automatic algorithm to select window positions, increasing interaction complexity.
  3. The architecture based on Stable Diffusion v2 involves a large parameter count and slower inference compared to discriminative methods.
  4. Uniform resizing to \(1024 \times 1024\) may lose fine details in ultra-high-resolution images.
  5. The DIS5K benchmark covers only 225 semantic categories; generalization to broader scenarios remains to be validated.

Comparison with Related Work

  • GenPercept also applies diffusion models to dense prediction but adopts a deterministic single-step paradigm; LawDIS preserves the diffusion process and introduces a dual-control mechanism.
  • MVANet handles high resolution via patch splitting but lacks adaptability to varying patch sizes; LawDIS's window refinement naturally supports arbitrary sizes.
  • The Mode Switcher design can be generalized to other visual generation tasks requiring multi-granularity control.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Reformulates DIS as conditional diffusion generation; the dual-control mode design is highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive comparison against 11 methods on DIS5K; detailed ablations and clear qualitative comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Method description is clear; figures are intuitive.
  • Value: ⭐⭐⭐⭐⭐ — Introduces interactivity and controllability to DIS, opens a new research direction, and sets new SOTA across all metrics.