Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

Conference: CVPR 2026 arXiv: 2604.07723 Code: GitHub Area: Image Segmentation Keywords: Open-vocabulary semantic segmentation, training-free, distributional discrepancy, optimal transport, Markov process

TL;DR

This paper proposes an open-vocabulary semantic segmentation method that bypasses the logits optimization process entirely. Based on the assumption that homogeneous regions exhibit consistent distributional discrepancies from their logits to a degenerate distribution, the method directly constructs segmentation maps via either the optimal transport path or the analytical solution of maximum transport velocity. The approach achieves state-of-the-art performance on 8 benchmarks without requiring training or model-specific modulation.

Background & Motivation

Open-vocabulary semantic segmentation (OVSS) requires pixel-level vision-language alignment. The dominant paradigm in existing methods can be characterized as logits optimization: computing cosine similarity (logits) between visual and linguistic features, minimizing the discrepancy between the logits distribution and the ground-truth (GT) distribution to obtain optimal logits, and then applying argmax to produce the segmentation map. This paradigm is realized in two ways:

Iterative training paradigm: Requires GT annotations and time-consuming training.

Attention modulation paradigm (training-free): Calibrates the self-attention computation to correct fine-grained alignment, but this denoising operation is data-agnostic yet model-specific (e.g., CLIP-specific attention substitution), which limits generalizability.

Both approaches first derive optimal logits and then construct the segmentation map. The authors' core insight is: can we entirely bypass logits optimization and directly obtain segmentation maps from distributional discrepancies themselves?

Key assumption: Homogeneous regions exhibit consistent distributional discrepancies, while heterogeneous regions exhibit distinct ones. If this holds, distributional discrepancies inherently encode semantic information, rendering explicit logits optimization unnecessary.

Method

Overall Architecture

  1. Compute cosine similarity between visual and linguistic features using CLIP to obtain logits.
  2. Apply non-maximum suppression (NMS) and normalization to the logits.
  3. Compute the distributional discrepancy from the normalized logits to a degenerate distribution (uniform distribution \(\frac{1}{N}\mathbf{1}_N\)).
  4. Restore the original resolution via joint bilateral upsampling (JBU).
  5. Apply argmax to produce the final segmentation map.
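
As an illustration only, here is a minimal PyTorch sketch of the five inference steps above. The function name is hypothetical; it assumes patch-level CLIP logits of shape \((N_c, H, W)\), and the NMS, the discrepancy \(\mathbf{D}\), and the upsampler are simplified stand-ins (one-hot suppression, pointwise KL-style terms against the uniform \(\frac{1}{N}\mathbf{1}_N\), and bilinear interpolation in place of JBU), so it mirrors the pipeline's structure rather than reproducing the paper's exact operators.

```python
import torch
import torch.nn.functional as F

def segment_from_discrepancy(logits: torch.Tensor, out_size=(512, 512)) -> torch.Tensor:
    """logits: (N_c, H, W) patch-level CLIP cosine similarities (step 1)."""
    n_c, h, w = logits.shape
    n = h * w

    # Step 2 (stand-in NMS): keep only the per-patch maximal class, then
    # normalize each class map into a distribution over the N patches.
    winners = logits.argmax(dim=0, keepdim=True)               # (1, H, W)
    mask = torch.zeros_like(logits).scatter_(0, winners, 1.0)  # one-hot suppression
    q = (logits * mask).clamp_min(1e-8).reshape(n_c, n)
    q = q / q.sum(dim=1, keepdim=True)                         # Q_c over the N patches

    # Step 3 (stand-in discrepancy): pointwise KL-style terms between Q_c
    # and the uniform degenerate distribution S = (1/N) 1_N; the paper's
    # exact form of D may differ.
    disc = (q * torch.log(q * n)).reshape(n_c, h, w)

    # Step 4: restore resolution (bilinear in place of joint bilateral upsampling).
    disc = F.interpolate(disc[None], size=out_size, mode="bilinear",
                         align_corners=False)[0]

    # Step 5: argmax over classes gives the segmentation map.
    return disc.argmax(dim=0)

# Usage with random stand-in logits (21 classes, 32x32 patch grid):
seg = segment_from_discrepancy(torch.rand(21, 32, 32))
print(seg.shape)  # torch.Size([512, 512])
```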

The optimization formulation \(\mathcal{Q}^* = \arg\min_\mathcal{Q} \mathbf{D}(\mathcal{P}\|\mathcal{Q})\) is replaced by the analytical form \(\mathbf{M} = \arg\max_{c \in \{1,\dots,N_c\}} \mathbf{D}(\mathcal{S}\|\mathcal{Q}^c)\), where \(\mathcal{S}\) is a degenerate distribution substituting the GT distribution \(\mathcal{P}\): the segmentation map \(\mathbf{M}\) is read off directly by taking, per patch, the class with the largest discrepancy, with no intermediate optimal logits.

Key Designs

  1. Degenerate Distribution as GT Substitute (§3.3):

    • At inference time, the GT distribution is unavailable and must be approximated. The authors propose using a degenerate (uniform) distribution as a substitute.
    • Experimental validation demonstrates that KL divergence from logits to GT (\(\mathbf{D}(\mathcal{P}\|\mathcal{Q})\)) and from logits to the degenerate distribution (\(\mathbf{D}(\mathcal{S}\|\mathcal{Q})\)) yield highly consistent performance across 5 datasets.
    • Visualizations show that \(\mathcal{S}\) and \(\mathcal{P}\) occupy antipodal positions in feature space — logits optimization moves toward the GT endpoint, while the proposed method computes discrepancies to the degenerate endpoint.
    • Design Motivation: The degenerate distribution is the only distribution determinable at inference time without additional information.
  2. Optimal Transport Path (§3.4):

    • Intuition: Homogeneous regions should share consistent degeneration paths; thus, the path itself quantifies discrepancy.
    • The problem is formulated as entropy-regularized (Sinkhorn) optimal transport: \(\boldsymbol{\pi}^* = \arg\min_{\boldsymbol{\pi}} \sum_{i,j} \mathbf{C}_{i,j}\boldsymbol{\pi}_{i,j} + \epsilon\sum_{i,j}\boldsymbol{\pi}_{i,j}(\ln\boldsymbol{\pi}_{i,j} - 1)\)
    • The cost matrix \(\mathbf{C}\) uses hierarchically averaged self-attention tensors from Stable Diffusion v2.
    • Via Lagrange multipliers, the analytical solution is: \(\boldsymbol{\pi}^* = \text{diag}(\boldsymbol{\mu})\mathbf{K}\text{diag}(\boldsymbol{\nu})\), where the Gibbs kernel \(\mathbf{K} = \exp(-\mathbf{C}/\epsilon)\).
    • \(\boldsymbol{\mu}\) and \(\boldsymbol{\nu}\) are updated via Sinkhorn iterations (50 iterations, \(\epsilon=0.1\)); a minimal code sketch follows this list.
  3. Maximum Transport Velocity (§3.5):

    • Intuition: Transport velocity also quantifies discrepancy — given the same path, slower velocity implies greater discrepancy.
    • The convergence of logits to a stationary distribution is modeled as a Markov process: \(\mathbf{f}^{c(l)} = \mathbf{f}^{c(0)} \cdot \mathbf{T}^l\)
    • The transition matrix \(\mathbf{T}\) is obtained by transforming the self-attention tensor into a doubly stochastic matrix via iterative proportional fitting (IPF, 15 iterations).
    • The maximum transport velocity for each patch is defined as the reciprocal of the number of steps to convergence: \(\mathbf{v}_i^c = \max\{1/l : |\mathbf{f}_i^{c(l)} - \mathbf{f}_i^{c(l-1)}| \leq \tau\}\)
    • \(\tau=0.3\) is the convergence threshold; a code sketch of this mode also follows this list.
  4. Source of Self-Attention Tensors:

    • Self-attention from Stable Diffusion v2 is used rather than from CLIP.
    • Noise-free latent features are directly encoded; self-attention is extracted via single-step unconditional denoising.
    • Combining tensors from the \(\text{up}_0\) and \(\text{up}_1\) blocks yields the best results.
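
To make the two discrepancy modes concrete, here is a small, self-contained sketch of the Sinkhorn iteration from §3.4 (an illustrative rendering, not the authors' code). Given a cost matrix \(\mathbf{C}\), which the paper builds from hierarchically averaged SD2 self-attention but which is a random stand-in here, it returns the entropic plan \(\boldsymbol{\pi}^* = \text{diag}(\boldsymbol{\mu})\mathbf{K}\,\text{diag}(\boldsymbol{\nu})\) using the paper's 50 iterations and \(\epsilon = 0.1\); the uniform marginals are an assumption.

```python
import torch

def sinkhorn_plan(C: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic OT plan for cost matrix C, with uniform marginals (assumed)."""
    n, m = C.shape
    a = torch.full((n,), 1.0 / n)          # source marginal
    b = torch.full((m,), 1.0 / m)          # target marginal
    K = torch.exp(-C / eps)                # Gibbs kernel K = exp(-C / eps)
    mu = torch.ones(n) / n
    for _ in range(iters):                 # alternating marginal projections
        nu = b / (K.t() @ mu)
        mu = a / (K @ nu)
    return mu[:, None] * K * nu[None, :]   # diag(mu) K diag(nu)

# Usage with a random stand-in cost matrix over 64 patches:
pi = sinkhorn_plan(torch.rand(64, 64))
print(pi.sum(dim=1)[:3])  # each row sums to ~1/64, matching the source marginal
```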
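
And here is a sketch of the maximum-transport-velocity mode from §3.5, under the same caveats: IPF turns a (random stand-in) attention matrix into a doubly stochastic transition matrix \(\mathbf{T}\), the class distribution is rolled forward as \(\mathbf{f}^{c(l)} = \mathbf{f}^{c(0)}\mathbf{T}^l\), and each patch records the first step \(l\) at which its update falls below \(\tau\); the velocity is \(1/l\). The step cap and the scaling of \(\mathbf{f}\) are assumptions.

```python
import torch

def ipf_doubly_stochastic(A: torch.Tensor, iters: int = 15) -> torch.Tensor:
    """Iterative proportional fitting: alternate row/column normalization."""
    T = A.clamp_min(1e-8)
    for _ in range(iters):
        T = T / T.sum(dim=1, keepdim=True)   # rows sum to 1
        T = T / T.sum(dim=0, keepdim=True)   # columns sum to 1
    return T

def max_transport_velocity(f0: torch.Tensor, T: torch.Tensor,
                           tau: float = 0.3, max_steps: int = 100) -> torch.Tensor:
    """f0: (N,) class-c logit distribution over patches; returns (N,) velocities."""
    step = torch.zeros_like(f0)              # first converged step per patch (0 = not yet)
    f = f0
    for l in range(1, max_steps + 1):
        f_next = f @ T                       # one Markov step: f^(l) = f^(l-1) T
        hit = (step == 0) & ((f_next - f).abs() <= tau)
        step[hit] = float(l)
        f = f_next
    step[step == 0] = float(max_steps)       # cap never-converged patches (assumption)
    return 1.0 / step                        # v_i = 1 / (steps to convergence)

# Usage with random stand-ins for the attention matrix and class logit map:
T = ipf_doubly_stochastic(torch.rand(64, 64))
v = max_transport_velocity(torch.rand(64), T)
print(v.max().item())
```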

Loss & Training

The method is entirely training-free. No training or fine-tuning is involved. Off-the-shelf CLIP (ViT-B/16 or ViT-L/14) and Stable Diffusion v2 weights are used. Inference is performed in 16-bit floating point precision on whole images without sliding windows.

Key Experimental Results

Main Results

CLIP ViT-B/16 Backbone (M.M. = attention modulation; O.P. = optimal transport path; M.V. = maximum transport velocity; Avg is over all 8 benchmarks, of which 5 are shown):

Method Paradigm VOC21 Context60 COCO-Stuff Cityscapes ADE20K Avg
SCLIP M.M. 59.1 30.4 22.4 32.2 16.1 38.2
NACLIP M.M. 58.9 32.2 23.3 35.5 17.4 39.4
CASS M.M. 65.8 36.7 26.7 39.4 20.4 44.4
Ours (O.P.) - 66.9 37.6 28.6 41.7 22.8 46.2
Ours (M.V.) - 67.8 38.3 28.9 43.3 23.0 46.9

CLIP ViT-L/14 Backbone (Avg over all 8 benchmarks, 5 shown):

Method VOC21 Context60 COCO-Stuff Cityscapes ADE20K Avg
SC-CLIP 65.0 36.9 26.9 41.3 21.7 45.2
Ours (M.V.) 68.9 38.7 29.2 43.9 23.4 47.8

Ablation Study

Configuration VOC21 COCO-Stuff Cityscapes ADE20K Avg
(I) Baseline (raw logits) 18.6 7.2 6.7 3.2 8.9
(II) +KL Divergence 44.2 12.1 8.6 6.4 17.8
(III) +NMS 45.9 13.0 9.6 7.7 19.1
(IV) +JBU 46.3 13.3 10.1 8.8 19.6
(V) +Optimal Transport Path 66.9 28.6 41.7 22.8 40.0
(VI) +Maximum Transport Velocity 67.8 28.9 43.3 23.0 40.8
(VII) Fusion of (V)+(VI) 64.9 26.8 41.4 20.5 38.4

Key Findings

  1. Distributional discrepancy can replace logits optimization: KL divergence alone lifts average mIoU from 8.9 to 17.8 (+8.9); modeling the discrepancy via the optimal transport path or the Markov process adds roughly a further +21 (from 19.6 to 40.8 in the ablation).
  2. Maximum velocity slightly outperforms optimal path: +0.7% average on B/16 and +0.6% on L/14.
  3. Fusing both modes degrades performance: The two discrepancy measures capture different aspects (high-frequency textures vs. inter-class boundaries); naive fusion introduces interference.
  4. SD2 self-attention outperforms that of ViT-based models: SD2 self-attention tensors are more effective for constructing transition matrices than those extracted from ViT-based models.
  5. Fewer denoising steps are preferable: Encoding without noise injection ensures deterministic feature extraction.
  6. \(\tau=0.3\) is the optimal threshold: Higher thresholds declare convergence prematurely, before the logits distribution has actually reached the degenerate state.

Highlights & Insights

  • Paradigm shift: Moving from "optimize logits then construct segmentation map" to "directly obtain segmentation map from distributional discrepancies," eliminating the need for training and model-specific modulation.
  • Theoretical elegance: Connecting the segmentation problem to optimal transport and Markov processes provides dual geometric and probabilistic interpretations.
  • Degenerate distribution as GT substitute: The antipodal relationship between GT and degenerate distributions in feature space is cleverly exploited, obviating the need for GT at inference time.
  • Triple freedom: No GT annotations, no time-consuming training, and no model-specific modulation are required.
  • Complementarity of optimal path vs. maximum velocity: The former is sensitive to high-frequency textures; the latter to inter-class boundaries.
  • Stable Diffusion as a feature extractor: SD2 self-attention tensors are better suited for constructing inter-patch transition probabilities than those from CLIP or DINO.

Limitations & Future Work

  1. Dependency on Stable Diffusion: Loading the SD2 model for self-attention extraction adds memory and computational overhead at inference time.
  2. Computational cost of Sinkhorn iterations: 50 iterations of optimal transport computation may be slow for high-resolution images.
  3. Manual tuning of \(\tau\) and \(\epsilon\): Although experiments suggest relative robustness to these hyperparameters, empirical selection is still required.
  4. Fusing the two modes fails to accumulate gains: While this is an interesting finding, it also means the complementary strengths of the two measures remain unexploited, leaving potential performance on the table.
  5. Validation limited to semantic segmentation: Applicability to more complex tasks such as panoptic and instance segmentation remains unexplored.
  6. Limited theoretical guarantees for the degenerate distribution substitute: Feasibility is empirically validated, but rigorous theoretical analysis is absent.
Comparison with Related Work

  • Contrasted with attention-substitution methods such as ClearCLIP, SCLIP, and NACLIP, which remain within the logits optimization paradigm.
  • VFM proxy methods such as ProxyCLIP and CASS incorporate DINO features; this work innovatively introduces SD2 self-attention.
  • The application of optimal transport (Sinkhorn algorithm) to segmentation provides a geometric perspective on distributional discrepancy measurement.
  • Using Markov process convergence speed as a semantic measure is a novel contribution.
  • The model-agnostic nature of the method (not bound to a specific CLIP architecture) gives it potential to generalize to future vision-language models.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The paradigm of bypassing logits optimization is original and convincing.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 8 benchmarks, two CLIP scales, detailed ablations and analyses.
  • Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are clear, though some notation is dense.
  • Value: ⭐⭐⭐⭐⭐ — New SOTA for training-free OVSS with a concise and generalizable methodology.