# LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
Conference: ICCV 2025 · arXiv: 2508.01152 · Code: GitHub · Area: Image Segmentation · Keywords: dichotomous image segmentation, latent diffusion model, language control, window refinement, high-precision segmentation
## TL;DR
This paper proposes LawDIS, a controllable dichotomous image segmentation framework built upon a latent diffusion model. It couples macro-level language-controlled segmentation (LS) with micro-level window-controlled refinement (WR) to generate high-quality foreground masks, outperforming 11 state-of-the-art methods across all metrics on the DIS5K benchmark.
## Background & Motivation
Dichotomous image segmentation (DIS) aims to accurately segment foreground objects from high-resolution images, requiring pixel-level precise boundary delineation. With the proliferation of high-quality imaging devices, segmentation tasks have evolved from coarse localization to fine-grained boundary description. DIS finds broad applications in 3D reconstruction, image editing, augmented reality, and medical image segmentation.
Existing DIS methods face two core challenges:
- Semantic ambiguity: When an image contains multiple foreground entities, discriminative per-pixel classification paradigms offer no way to specify which object to segment, lacking user interaction capability.
- Geometric detail bottleneck: To capture the geometric details of high-resolution targets, existing methods typically introduce additional high-resolution data streams or split images into fixed-size patches, yet they cannot accommodate variable patch sizes; MVANet, for instance, degrades significantly on local patches at non-training resolutions.
The core motivation of this paper is to reframe DIS as an image-conditioned mask generation task, leveraging the generative capability of latent diffusion models to seamlessly integrate user control and address both challenges.
## Method
### Overall Architecture
LawDIS builds upon pre-trained Stable Diffusion v2 and reformulates DIS as a conditional denoising diffusion process. The framework consists of three core components:
- Mode Switcher: A one-dimensional vector added to the diffusion model's time embedding via positional encoding, switching the shared U-Net between macro and micro modes (see the sketch after this list).
- Macro Mode: The language-controlled segmentation (LS) strategy, which generates an initial mask from a user language prompt.
- Micro Mode: The window-controlled refinement (WR) strategy, which refines the mask within user-specified, variable-size windows.
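As a concrete illustration, here is a minimal sketch of how a one-dimensional mode flag can be fused with the timestep embedding; the sinusoidal encoding and the 1280-dimensional width follow common Stable Diffusion conventions and are assumptions, not the paper's released code:

```python
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding, as used for diffusion timesteps (dim assumed even)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def switched_time_embedding(t: torch.Tensor, mode: torch.Tensor, dim: int = 1280) -> torch.Tensor:
    """Fuse the mode switcher with the timestep embedding.

    t:    (B,) diffusion timesteps
    mode: (B,) 0 = macro mode (psi_a), 1 = micro mode (psi_b)
    """
    # The switcher is encoded like a timestep and added to the time embedding,
    # so both modes share one U-Net and differ only in this conditioning signal.
    return sinusoidal_embedding(t, dim) + sinusoidal_embedding(mode, dim)
```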
### Key Design 1: Generative DIS Paradigm
DIS is modeled as a conditional probability distribution \(D(s|x)\), where \(s\) is the segmentation mask and \(x\) is the RGB image. A VAE encoder \(\phi\) maps both the segmentation mask and image to a low-dimensional latent space, where the diffusion process is performed:
- Forward process: Starting from \(\mathbf{z}_0^{(s)}\), Gaussian noise is incrementally added to construct a discrete Markov chain, with the standard closed-form marginal \(\mathbf{z}_t^{(s)} = \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0^{(s)} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\).
- Reverse process: A U-Net \(f_\theta\) predicts the noise at each timestep, progressively denoising conditioned on image features \(\mathbf{z}^{(x)}\).
- Architectural modification: The U-Net input layer is duplicated so it accepts the concatenated image and noisy-mask latents; the pretrained weights are copied along the input-channel axis and halved so that first-layer activation magnitudes are preserved (see the sketch below).
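The input-layer duplication can be sketched as follows (PyTorch; the `conv_in` naming follows diffusers conventions and is an assumption):

```python
import torch
import torch.nn as nn

def expand_unet_input(conv_in: nn.Conv2d) -> nn.Conv2d:
    """Duplicate the U-Net input conv to accept [noisy mask latent; image latent].

    Weights are copied along the input-channel axis and halved so the
    magnitude of the first-layer activations matches the pretrained model.
    """
    new_conv = nn.Conv2d(
        conv_in.in_channels * 2, conv_in.out_channels,
        kernel_size=conv_in.kernel_size, stride=conv_in.stride,
        padding=conv_in.padding, bias=conv_in.bias is not None,
    )
    with torch.no_grad():
        # Copy pretrained weights twice along dim=1 (input channels), then halve.
        new_conv.weight.copy_(torch.cat([conv_in.weight, conv_in.weight], dim=1) * 0.5)
        if conv_in.bias is not None:
            new_conv.bias.copy_(conv_in.bias)
    return new_conv
```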
### Key Design 2: Dual-Mode Joint Training
Macro mode training: Mode \(\psi_a\) is activated; the model receives the full image \(x\), the segmentation mask \(s\), and a VLM-generated language prompt \(\mathcal{T}\). The prompt is encoded by CLIP into a control embedding \(c_\mathcal{T}\) and injected into the U-Net via cross-attention:
\[\mathcal{L}_{macro} = \|\boldsymbol{\epsilon} - f_\theta(\mathbf{z}_t^{(s)}, \mathbf{z}^{(x)}, c_\mathcal{T}, t, \psi_a)\|_2^2\]
Micro mode training: Mode \(\psi_b\) is activated; the minimum bounding rectangle of the foreground object serves as the local window, which is cropped to obtain a local patch \(x_p\) and local mask \(s_p\). A null prompt \(c_\varnothing\) is used to avoid semantic mismatch:
\[\mathcal{L}_{micro} = \|\boldsymbol{\epsilon}_p - f_\theta(\mathbf{z}_t^{(s_p)}, \mathbf{z}^{(x_p)}, c_\varnothing, t, \psi_b)\|_2^2\]
Joint training: \(\mathcal{L}_u = \mathcal{L}_{macro} + \mathcal{L}_{micro}\). Both modes share the same U-Net, enabling mutual enhancement.
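A minimal sketch of one joint training step, assuming generic latent-diffusion interfaces (`vae_encode`, `add_noise`, the batch keys, and the `mode` keyword are all illustrative names, not the paper's code):

```python
import torch
import torch.nn.functional as F

def joint_training_step(unet, vae_encode, add_noise, batch, num_train_timesteps=1000):
    """One joint optimization step over both modes (illustrative interfaces).

    vae_encode: maps an image or mask to its latent (the frozen VAE encoder phi)
    add_noise:  q(z_t | z_0) forward process, e.g. a DDPM scheduler's add_noise
    """
    loss = 0.0
    for mode, img_key, mask_key, cond_key in [
        ("macro", "image", "mask", "prompt_emb"),      # psi_a: CLIP-encoded language prompt
        ("micro", "patch", "patch_mask", "null_emb"),  # psi_b: cropped window, null prompt
    ]:
        z_x = vae_encode(batch[img_key])               # image (or patch) latent z^(x)
        z_s = vae_encode(batch[mask_key])              # clean mask latent z_0^(s)
        t = torch.randint(0, num_train_timesteps, (z_s.shape[0],), device=z_s.device)
        noise = torch.randn_like(z_s)
        z_t = add_noise(z_s, noise, t)                 # forward (noising) process
        # The U-Net sees [noisy mask latent; image latent] plus prompt and mode switch.
        pred = unet(torch.cat([z_t, z_x], dim=1), t, batch[cond_key], mode=mode)
        loss = loss + F.mse_loss(pred, noise)          # L_macro + L_micro
    return loss
```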
### Key Design 3: VAE Decoder Fine-tuning
After training the U-Net, the encoder and U-Net are frozen, and only the VAE decoder \(\varphi\) is fine-tuned:
- Shortcut connections from the encoder to the decoder are added.
- The output channels are reduced from 3 to 1 (a single-channel mask), with the new weights initialized by averaging the pretrained RGB channels (see the sketch after this list).
- The TCD (Trajectory Consistency Distillation) scheduler is introduced to reduce sampling to a single step, saving memory and improving inference efficiency.
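The channel-averaging initialization can be sketched as follows (the output-layer naming is an assumption):

```python
import torch
import torch.nn as nn

def to_single_channel(conv_out: nn.Conv2d) -> nn.Conv2d:
    """Replace the VAE decoder's 3-channel RGB head with a 1-channel mask head.

    New weights are the channel-wise average of the pretrained RGB weights,
    matching the averaging initialization described in the paper.
    """
    new_head = nn.Conv2d(
        conv_out.in_channels, 1,
        kernel_size=conv_out.kernel_size, stride=conv_out.stride,
        padding=conv_out.padding, bias=conv_out.bias is not None,
    )
    with torch.no_grad():
        # (3, C_in, k, k) -> (1, C_in, k, k) by averaging over output channels.
        new_head.weight.copy_(conv_out.weight.mean(dim=0, keepdim=True))
        if conv_out.bias is not None:
            new_head.bias.copy_(conv_out.bias.mean(dim=0, keepdim=True))
    return new_head
```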
### Loss & Training
VAE decoder fine-tuning employs a structural loss combining weighted binary cross-entropy and weighted IoU terms:
\[\mathcal{L}_d = \mathcal{L}_{wbce}(\hat{s}, s) + \mathcal{L}_{wiou}(\hat{s}, s)\]
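A plausible implementation is the "structure loss" widely used in salient-object detection (e.g., F3Net), which up-weights pixels near boundaries; the exact weighting in LawDIS may differ:

```python
import torch
import torch.nn.functional as F

def structure_loss(pred_logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Weighted BCE + weighted IoU over (B, 1, H, W) tensors.

    A sketch of L_wbce + L_wiou following the common SOD formulation,
    not necessarily the paper's exact implementation.
    """
    # Emphasize boundary pixels: weight = 1 + 5 * |local mean of mask - mask|.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred_logits, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred_logits)
    inter = (pred * mask * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```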
### Inference Pipeline
- Language-controlled segmentation (macro mode): Full image + language prompt → single-step TCD denoising → decoded initial segmentation map.
- Window-controlled refinement (micro mode, optional): The user selects an unsatisfactory region → crop the local patch → use the initial segmentation result (rather than pure noise) as the diffusion starting point → single-step denoising → the refined mask replaces the original region. The process can be repeated as many times as needed until the result is satisfactory (a sketch follows below).
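A minimal sketch of one refinement pass under these assumptions (all interface names are illustrative; `tcd_step` stands in for a single step of the TCD scheduler):

```python
import torch

@torch.no_grad()
def refine_window(unet, vae_encode, vae_decode, add_noise, tcd_step,
                  patch, init_mask_patch, null_emb, t_start=999):
    """Window refinement (micro mode) sketch; all names are illustrative.

    Instead of pure Gaussian noise, denoising starts from the noised latent of
    the macro-mode result, which carries global context into the local window.
    """
    z_x = vae_encode(patch)                 # cropped image patch latent
    z_s0 = vae_encode(init_mask_patch)      # latent of the initial (macro) mask crop
    t = torch.full((z_s0.shape[0],), t_start, device=z_s0.device, dtype=torch.long)
    z_t = add_noise(z_s0, torch.randn_like(z_s0), t)   # partial forward noising
    eps = unet(torch.cat([z_t, z_x], dim=1), t, null_emb, mode="micro")
    z_0 = tcd_step(eps, t, z_t)             # single TCD denoising step back to z_0
    refined = vae_decode(z_0)               # fine-tuned decoder emits a 1-channel mask
    return refined                          # caller pastes this back into the full map
```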
## Key Experimental Results
### Main Results: DIS5K Benchmark (DIS-TE, 2,000 images)
| Method | \(F_\beta^\omega\) ↑ | \(F_\beta^{mx}\) ↑ | \(\mathcal{M}\) ↓ | \(\mathcal{S}_\alpha\) ↑ | \(E_\phi^{mn}\) ↑ |
|---|---|---|---|---|---|
| IS-Net (2022) | 0.726 | 0.799 | 0.070 | 0.819 | 0.858 |
| InSPyReNet (2022) | 0.838 | 0.891 | 0.039 | 0.900 | 0.923 |
| BiRefNet (2024) | 0.858 | 0.896 | 0.035 | 0.901 | 0.934 |
| GenPercept (2024) | 0.816 | 0.868 | 0.043 | 0.880 | 0.923 |
| MVANet (2024) | 0.862 | 0.907 | 0.034 | 0.909 | 0.938 |
| Ours-S (LS only) | 0.898 | 0.929 | 0.027 | 0.925 | 0.955 |
| Ours-R (LS+WR) | 0.908 | 0.932 | 0.024 | 0.926 | 0.959 |
- Ours-S surpasses MVANet by 6.6% in \(F_\beta^\omega\) on DIS-TE1; Ours-R adds a further 2.0% on DIS-TE4.
- With both controls enabled (Ours-R), the improvement over MVANet on DIS-TE1 reaches 7.0%.
### Ablation Study
| Setting | \(F_\beta^{mx}\) ↑ | \(\mathcal{M}\) ↓ | \(\mathcal{S}_\alpha\) ↑ | \(E_\phi^{mn}\) ↑ |
|---|---|---|---|---|
| Baseline (no mode switcher / prompt / VAE fine-tuning) | 0.904 | 0.047 | 0.904 | 0.916 |
| Without micro mode training | 0.912 | 0.037 | 0.909 | 0.943 |
| Without VAE decoder fine-tuning | 0.919 | 0.040 | 0.915 | 0.933 |
| Full Ours-S | 0.926 | 0.032 | 0.920 | 0.955 |
Macro control ablation:
| Setting | \(F_\beta^{mx}\) ↑ | \(\mathcal{M}\) ↓ | \(\mathcal{S}_\alpha\) ↑ |
|---|---|---|---|
| No prompt (train + test) | 0.912 | 0.036 | 0.908 |
| No prompt (test only) | 0.915 | 0.036 | 0.909 |
| With prompt (train + test) | 0.926 | 0.032 | 0.920 |
Micro control ablation (rows after the first report changes relative to Base Ours-S):
| Setting | \(F_\beta^\omega\) ↑ | \(\mathcal{M}\) ↓ | \(BIoU^m\) ↑ | \(HCE_\gamma\) ↓ |
|---|---|---|---|---|
| Base Ours-S | 0.890 | 0.032 | 0.795 | 2481 |
| Initialized from Gaussian noise | -4.7% | +1.9% | -7.1% | -863 |
| Automatic window selection | +1.7% | -0.5% | +2.9% | -767 |
| Semi-automatic window selection | +2.0% | -0.6% | +3.2% | -871 |
### Key Findings
- Initializing the micro mode diffusion process from the initial segmentation result (rather than Gaussian noise) is critical, as it implicitly conveys global context information.
- Even with fully automatic window selection without user intervention, the WR strategy effectively improves boundary accuracy.
- Dual-mode joint training provides scalable geometric representation capability, enabling the model to adapt to varying input sizes.
- VAE decoder fine-tuning is essential for high-resolution segmentation, complementing denoised mask features with fine-grained details.
## Highlights & Insights
- Paradigm shift: This work is the first to recast DIS from discriminative per-pixel classification into generative mask synthesis, leveraging the rich vision-language priors of pretrained diffusion models.
- Controllability design: Macro language control addresses the semantic ambiguity of "which object to segment," while micro window refinement addresses the geometric precision issue of "insufficiently fine boundaries." Both are unified within a single diffusion model via the mode switcher.
- Efficient inference: The TCD scheduler enables single-step denoising, making diffusion models practically viable for segmentation in terms of inference speed.
- Clever initialization strategy: The micro mode uses the macro mode's segmentation result (rather than pure noise) as the diffusion starting point, implicitly transferring contextual information between the two modes — a key factor in performance gains.
- Flexible user interaction: Window refinement can be repeated as many times as needed, supporting progressive refinement and making the framework suitable for high-precision, personalized applications.
## Limitations & Future Work
- Built upon Stable Diffusion v2, the model has a large parameter count, incurring higher deployment costs than traditional discriminative methods.
- Language prompts rely on VLM-based automatic generation or manual user input, requiring an additional prompt generation module in fully automated scenarios.
- Automatic window selection in micro mode still relies on edge-detection heuristics, yielding slightly weaker results than manual user selection.
- All inputs are uniformly resized to \(1024^2\), which may introduce distortion for images with extreme aspect ratios.
- Evaluation is conducted solely on the DIS5K benchmark; generalizability to other high-precision segmentation tasks (e.g., portrait matting, medical segmentation) remains unverified.
## Related Work & Insights
- Dichotomous image segmentation: IS-Net introduces intermediate supervision; FP-DIS exploits frequency priors; BiRefNet/MVANet enhance detail via multi-resolution patches. However, these discriminative methods lack flexible semantic control and adaptive local window capability.
- Diffusion models for segmentation: GenPercept converts generative models into a deterministic single-step paradigm; Wang et al. propose a diffusion refinement model to enhance mask quality. This paper is the first to extend a single Stable Diffusion model into a macro+micro dual-mode framework.
- High-resolution segmentation: InSPyReNet and BiRefNet enhance detail through additional resolution streams; MVANet employs fixed-size patch splitting, but none of these approaches adaptively handle variable patch sizes.
## Rating
- Novelty: ⭐⭐⭐⭐ — Reformulating DIS as a controllable generative task is a creative paradigm shift; the dual-mode switcher design is concise and effective.
- Technical Depth: ⭐⭐⭐⭐ — Adaptations of the diffusion model (input layer duplication, TCD single-step inference, VAE fine-tuning strategy) are well-considered; the micro mode initialization strategy is cleverly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Outperforms 11 methods across all metrics on DIS5K with ablations covering each component; however, evaluations on additional datasets and efficiency comparisons are lacking.
- Writing Quality: ⭐⭐⭐⭐ — Structure is clear; the macro-micro narrative logic flows naturally.
- Recommendation: ⭐⭐⭐⭐ — Introduces a novel generative paradigm for high-precision segmentation with convincing experimental results.