# LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
- Conference: ICCV 2025
- arXiv: 2508.01152
- Code: https://github.com/XinyuYanTJU/LawDIS
- Area: Segmentation
- Keywords: Dichotomous Image Segmentation, Diffusion Models, Language Control, Window Refinement, High-Resolution Segmentation
## TL;DR
This paper proposes LawDIS, a language-window dual-control dichotomous image segmentation framework built upon Stable Diffusion. In macro mode, language prompts guide target segmentation; in micro mode, variable-size windows refine local details. LawDIS comprehensively outperforms 11 state-of-the-art methods on DIS5K.
## Background & Motivation
Dichotomous Image Segmentation (DIS) aims to precisely segment foreground objects from high-resolution images, requiring the capture of fine structures and internal details. It serves as a foundation for applications such as 3D reconstruction, image editing, and augmented reality.
Limitations of existing DIS methods:
- Semantic ambiguity: When multiple foreground entities are present, discriminative methods (per-pixel classification) give users no way to specify which object to segment.
- Resolution constraints: Most methods downsample inputs to around 1024px, because processing arbitrarily high resolutions is computationally infeasible.
- Lack of local refinement: Some methods (e.g., MVANet) split images into fixed-size patches but fail at inference time when patch sizes differ from those used in training.
- No interactivity: Existing methods produce fixed outputs with no mechanism for user-driven adjustment.
Mechanism: DIS is reformulated as an image-conditioned mask generation task based on a Latent Diffusion Model (LDM), which naturally accommodates diverse user control signals.
## Method
### Overall Architecture
LawDIS is built on Stable Diffusion v2 and supports two control modes via a Mode Switcher:
- Macro Mode: Language-controlled segmentation — users provide language prompts to specify the target object.
- Micro Mode: Window-controlled refinement — users specify a local window region for fine-grained refinement.
The two modes can operate independently or jointly.
### Key Designs
1. Generative Formulation
DIS is modeled as conditional denoising diffusion: forward noising and reverse denoising are applied to the latent representation of the segmentation mask.

- The input image \(x\) is encoded by the VAE into a latent representation that serves as the condition.
- The mask latent is noised, and the UNet is trained to predict the added noise.
- A TCD (Trajectory Consistency Distillation) scheduler enables single-step denoising, substantially improving efficiency.
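To make the formulation concrete, here is a minimal PyTorch-style sketch of one UNet training step. `vae`, `unet`, and `scheduler` are placeholder handles for the Stable Diffusion v2 components, and conditioning by channel-wise concatenation of the image and noisy mask latents is an assumption (a pattern common in image-conditioned LDMs), not a confirmed detail of the paper.

```python
import torch
import torch.nn.functional as F

def unet_training_step(vae, unet, scheduler, image, mask, t, text_emb):
    """One denoising training step (sketch; `vae`, `unet`, `scheduler`
    are placeholder handles for the SD-v2 components)."""
    with torch.no_grad():
        z_img = vae.encode(image)                      # condition latent
        z_mask = vae.encode(mask.repeat(1, 3, 1, 1))   # mask replicated to RGB

    noise = torch.randn_like(z_mask)
    z_noisy = scheduler.add_noise(z_mask, noise, t)    # forward noising

    # Assumed conditioning: concatenate image and noisy mask latents along
    # channels; the language embedding enters via cross-attention.
    noise_pred = unet(torch.cat([z_img, z_noisy], dim=1), t,
                      encoder_hidden_states=text_emb)
    return F.mse_loss(noise_pred, noise)
```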
2. Mode Switcher
- A one-dimensional vector is added to the diffusion model's time embedding after positional encoding.
- Different values activate different modes (macro or micro).
- Both modes jointly optimize the same UNet during training.
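A sketch of how such a switcher could be implemented, assuming a learned per-mode embedding added to the timestep embedding (class and layer names are hypothetical):

```python
import torch
import torch.nn as nn

class ModeSwitcher(nn.Module):
    """Hypothetical mode switcher: a learned embedding per mode is added
    to the diffusion timestep embedding after positional encoding."""

    def __init__(self, embed_dim: int, num_modes: int = 2):
        super().__init__()
        self.mode_embed = nn.Embedding(num_modes, embed_dim)  # 0 = macro, 1 = micro

    def forward(self, t_emb: torch.Tensor, mode: torch.Tensor) -> torch.Tensor:
        # t_emb: (B, D) timestep embedding; mode: (B,) integer mode indices
        return t_emb + self.mode_embed(mode)

# Usage sketch: the same UNet sees macro- or micro-shifted time embeddings.
switcher = ModeSwitcher(embed_dim=1280)
t_emb = torch.randn(4, 1280)
macro_ids = torch.zeros(4, dtype=torch.long)
t_emb_macro = switcher(t_emb, macro_ids)
```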
3. Language-Controlled Segmentation (Macro Mode)
- Language prompts are generated by a VLM or provided by the user.
- Prompts are encoded via CLIP into control embeddings and injected into the UNet via cross-attention.
- The loss function follows the standard diffusion denoising objective.
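For reference, encoding a prompt with the standard SD-2 text encoder looks like the following; the checkpoint id and prompt are illustrative, and the resulting embeddings would be passed to the UNet as `encoder_hidden_states` for cross-attention.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Standard SD-2 text encoder and tokenizer (shown as an assumption about
# which checkpoint LawDIS builds on).
tokenizer = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="text_encoder")

prompt = "the glass teapot on the left"  # illustrative user prompt
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(tokens.input_ids)[0]  # (1, 77, 1024)
```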
4. Window-Controlled Refinement (Micro Mode)
- Unsatisfactory regions from the initial segmentation are cropped as refinement windows.
- Core innovation: The local mask from the initial segmentation (rather than Gaussian noise) is used as the diffusion starting point, implicitly conveying global context.
- Empty prompts are used; cropped local patches are upsampled to the model input size.
- Refined results are pasted back into the corresponding positions of the initial segmentation map.
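Putting these steps together, a hedged sketch of the micro-mode loop (crop, latent-initialized one-step denoise, paste back). `vae` and `unet_step` are hypothetical handles; `unet_step` abbreviates the empty-prompt, single-step TCD update.

```python
import torch
import torch.nn.functional as F

def refine_window(vae, unet_step, image, coarse_mask, box, size=1024):
    """Micro-mode sketch. `vae` and `unet_step` (one TCD denoising step,
    empty-prompt conditioned) are placeholders, not the paper's API."""
    x0, y0, x1, y1 = box

    # 1. Crop the selected window from image and coarse mask, then
    #    upsample both to the model's input resolution.
    img_win = F.interpolate(image[..., y0:y1, x0:x1], (size, size), mode="bilinear")
    mask_win = F.interpolate(coarse_mask[..., y0:y1, x0:x1], (size, size), mode="bilinear")

    # 2. Key idea: start denoising from the coarse mask's latent rather than
    #    Gaussian noise, so global context flows into the local refinement.
    z_img = vae.encode(img_win)
    z_init = vae.encode(mask_win.repeat(1, 3, 1, 1))
    z_ref = unet_step(z_img, z_init)  # single TCD step, empty prompt

    # 3. Decode, resize back to the window's original size, and paste in.
    refined = F.interpolate(vae.decode(z_ref), (y1 - y0, x1 - x0), mode="bilinear")
    out = coarse_mask.clone()
    out[..., y0:y1, x0:x1] = refined
    return out
```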
5. VAE Decoder Fine-tuning
- The VAE Encoder and UNet are frozen; only the Decoder is fine-tuned.
- Encoder-Decoder skip connections are added.
- The output channel is changed from 3 to 1 (for mask output), with weights initialized by channel averaging.
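The 3-to-1 output-channel conversion with channel-averaged initialization can be sketched as below; the decoder's final convolution is stood in by a dummy layer.

```python
import torch
import torch.nn as nn

def to_single_channel(conv_out: nn.Conv2d) -> nn.Conv2d:
    """Replace a 3-channel output conv with a 1-channel one, initialized
    with the channel-wise average of the original RGB weights."""
    new_conv = nn.Conv2d(conv_out.in_channels, 1,
                         kernel_size=conv_out.kernel_size,
                         stride=conv_out.stride,
                         padding=conv_out.padding)
    with torch.no_grad():
        new_conv.weight.copy_(conv_out.weight.mean(dim=0, keepdim=True))
        new_conv.bias.copy_(conv_out.bias.mean(dim=0, keepdim=True))
    return new_conv

# Usage sketch with a dummy stand-in for the VAE decoder's final conv.
old_conv = nn.Conv2d(128, 3, kernel_size=3, padding=1)
mask_head = to_single_channel(old_conv)
```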
### Loss & Training
- UNet training: \(L_u = L_{\text{macro}} + L_{\text{micro}}\) (joint training)
- VAE Decoder training: \(L_d = L_{\text{wbce}} + L_{\text{wiou}}\) (wBCE + wIoU)
- UNet trained for 30K iterations; VAE Decoder trained for 6K iterations.
- Batch size = 32, Adam optimizer, \(lr = 3 \times 10^{-5}\).
- All inputs are uniformly resized to \(1024 \times 1024\).
- DDPM with 1000 steps is used for UNet training; TCD single-step scheduling is used for Decoder training.
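The paper's exact pixel weighting for \(L_d\) is not spelled out here, so the sketch below uses the common F3Net-style structure loss, where boundary-adjacent pixels receive larger weights in both the wBCE and wIoU terms; treat the weighting scheme as an assumption.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sketch of L_d = wBCE + wIoU (F3Net-style weighting, assumed).
    pred: (B, 1, H, W) logits; mask: (B, 1, H, W) binary ground truth."""
    # Pixels whose local neighborhood disagrees with them (i.e., boundaries)
    # get up to 6x weight.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```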
## Key Experimental Results
### Main Results
Evaluated on the DIS5K test set (DIS-TE, 2000 images), compared against 11 methods:
| Method | \(F_\beta^w\) | \(F_\beta^{mx}\) | MAE | \(S_\alpha\) | \(E_\phi^{mn}\) |
|---|---|---|---|---|---|
| BiRefNet'24 | 0.858 | 0.896 | 0.035 | 0.901 | 0.934 |
| MVANet'24 | 0.862 | 0.907 | 0.034 | 0.909 | 0.938 |
| Ours-S (Language only) | 0.898 | 0.929 | 0.027 | 0.925 | 0.955 |
| Ours-R (Language + Window) | 0.908 | 0.932 | 0.024 | 0.926 | 0.959 |
On DIS-TE1, Ours-S improves \(F_\beta^w\) over MVANet by 6.6%; Ours-R achieves a 7.0% improvement.
### Ablation Study
Ablation on the DIS-TE4 subset:
| Configuration | \(F_\beta^{mx}\) | MAE | \(S_\alpha\) | \(E_\phi^{mn}\) |
|---|---|---|---|---|
| Baseline (vanilla SD) | 0.904 | 0.047 | 0.904 | 0.916 |
| w/o micro training | 0.912 | 0.037 | 0.909 | 0.943 |
| w/o VAE Decoder fine-tuning | 0.919 | 0.040 | 0.915 | 0.933 |
| Full Ours-S | 0.926 | 0.032 | 0.920 | 0.955 |
Micro mode effectiveness (DIS-TE4; rows report changes relative to the Ours-S baseline):
| Configuration | \(F_\beta^w\) | MAE | \(\text{BIoU}_m\) | HCE |
|---|---|---|---|---|
| Ours-S (baseline) | 0.890 | 0.032 | 0.795 | 2481 |
| Initialize from noise | -4.7% | +1.9% | -7.1% | -863 |
| Automatic window selection | +1.7% | -0.5% | +2.9% | -767 |
| Semi-automatic window selection | +2.0% | -0.6% | +3.2% | -871 |
### Key Findings
- Initializing the micro-mode diffusion from the initial segmentation's mask latent significantly outperforms initializing from Gaussian noise (a 4.7% drop in \(F_\beta^w\) with noise), confirming the necessity of propagating global context from the initial segmentation.
- Joint training of both modes outperforms training the macro mode alone, as the two modes mutually enhance geometric representation.
- VAE Decoder fine-tuning is essential — omitting it increases MAE from 0.032 to 0.040.
- Even fully automatic window selection effectively improves segmentation quality, indicating that user interaction is not strictly required.
- Using language prompts at both training and test time yields the best performance.
## Highlights & Insights
- Reformulating DIS from discriminative to generative opens an entirely new methodological direction.
- The Mode Switcher design is remarkably elegant: a single one-dimensional vector controls the same UNet to switch between two modes.
- Initializing the micro-mode diffusion process from the initial segmentation result rather than noise is the key mechanism for transferring cross-scale context.
- Language control endows DIS with interactivity and personalization for the first time.
- TCD single-step denoising makes diffusion-based approaches practically efficient.
## Limitations & Future Work
- The framework relies on a VLM to generate language prompts; prompt quality directly affects segmentation performance.
- Micro mode requires users or an automatic algorithm to select window positions, increasing interaction complexity.
- The architecture based on Stable Diffusion v2 involves a large parameter count and slower inference compared to discriminative methods.
- Uniform resizing to \(1024 \times 1024\) may lose fine details in ultra-high-resolution images.
- The DIS5K benchmark covers only 225 semantic categories; generalization to broader scenarios remains to be validated.
## Related Work & Insights
- GenPercept also applies diffusion models to dense prediction but adopts a deterministic single-step paradigm; LawDIS preserves the diffusion process and introduces a dual-control mechanism.
- MVANet handles high resolution via patch splitting but lacks adaptability to varying patch sizes; LawDIS's window refinement naturally supports arbitrary sizes.
- The Mode Switcher design can be generalized to other visual generation tasks requiring multi-granularity control.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Reformulates DIS as conditional diffusion generation; the dual-control mode design is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive comparison against 11 methods on DIS5K; detailed ablations and clear qualitative comparisons.
- Writing Quality: ⭐⭐⭐⭐ — Method description is clear; figures are intuitive.
- Value: ⭐⭐⭐⭐⭐ — Introduces interactivity and controllability to DIS, opens a new research direction, and sets new SOTA across all metrics.