
VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Conference: NeurIPS 2025 arXiv: 2601.10124 Code: GitHub Area: Medical Imaging Keywords: Semi-supervised segmentation, vector quantization, feature perturbation, consistency learning, medical image segmentation

TL;DR

VQ-Seg is proposed as the first method to introduce vector quantization into semi-supervised medical image segmentation. A Quantization Perturbation Module (QPM) replaces conventional dropout to achieve more controllable feature perturbation, complemented by a dual-branch architecture and foundation-model-guided alignment to compensate for quantization information loss.

Background & Motivation

In semi-supervised medical image segmentation, consistency learning combined with feature perturbation is a widely adopted strategy. However, existing methods rely heavily on dropout for feature-level perturbation, which introduces fundamental problems:

Low dropout rates (e.g., 0.3, 0.5): Insufficient perturbation with negligible effect on segmentation performance, failing to provide meaningful regularization.

High dropout rates (e.g., ≥0.7): Severe performance degradation: Dice and Jaccard drop sharply while HD95 and ASD increase substantially; at a dropout rate of 0.9 the output becomes completely unusable.

Finding the optimal dropout rate is extremely difficult: It depends on the dataset, task, and network architecture, requiring extensive manual tuning.

From a theoretical perspective, the KL divergence between the posterior and prior under dropout can be approximated as:

\[D_{KL}(P||Q) \approx \frac{1}{2}\left(\frac{p}{1-p} + \log(1-p)\right)\]

As \(p\) increases, the KL divergence grows rapidly, leading to over-regularization and learning collapse. This motivates the idea of performing more controllable perturbations within a discrete vector-quantized space.
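The claimed growth is easy to check numerically from the approximation above (a quick sketch; `dropout_kl` is an illustrative helper name, not from the paper):

```python
import math

def dropout_kl(p: float) -> float:
    """Approximate KL divergence between posterior and prior under dropout rate p,
    following D_KL(P||Q) ~ (1/2) * (p/(1-p) + log(1-p))."""
    return 0.5 * (p / (1 - p) + math.log(1 - p))

# KL stays small at moderate rates but explodes as p -> 1
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p={p}: KL ~ {dropout_kl(p):.4f}")
```

At p=0.3 the divergence is only about 0.04, while at p=0.9 it exceeds 3, which matches the paper's observation that high dropout rates over-regularize.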

Method

Overall Architecture

VQ-Seg comprises four core components: (1) a VQ encoder that quantizes continuous features into a discrete codebook space; (2) a Quantization Perturbation Module (QPM) that performs controllable perturbation in the codebook index space; (3) a dual-branch architecture sharing the post-quantization space for joint optimization of reconstruction and segmentation; and (4) a foundation-model-guided Post-quantization Feature Adapter (PFA) to compensate for semantic loss. A teacher–student framework is employed for consistency learning.

Key Designs

  1. Quantization Perturbation Module (QPM): The encoder output \(z = f_{\text{enc}}(x)\) is VQ-quantized to the nearest codebook index \(i = \arg\min_j \|z - c_j\|\). QPM defines a perturbation policy — given the original codeword \(c_i\), it replaces it with another codeword \(c_j\) with probability \(\pi(j|i)\):
\[\pi(j|i) = \begin{cases} 1 - \epsilon, & \text{if } j = i \\ \frac{\epsilon \exp(-d(c_i, c_j))}{Z_i}, & \text{if } j \neq i \end{cases}\]

where \(\epsilon \in [0,1]\) controls perturbation intensity, \(d(c_i, c_j)\) is the distance between codewords, and \(Z_i = \sum_{j \neq i} \exp(-d(c_i, c_j))\) is the normalization factor ensuring the probabilities sum to one. The key advantage is that, unlike dropout, QPM's perturbation distribution remains bounded at all times and is guided by the learned codebook structure, so perturbations substitute semantically similar codewords, yielding more controllable and interpretable behavior. For instance, at perturbation intensity \(\epsilon=0.7\), the nearest codeword c₂ is selected with 49% probability.
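The QPM policy above can be sketched in a few lines of NumPy (a minimal illustration, assuming Euclidean codeword distances; the function and variable names are hypothetical, not from the released code):

```python
import numpy as np

def qpm_perturb(indices, codebook, eps=0.7, rng=None):
    """Perturb codebook indices following the QPM policy:
    keep index i with probability 1 - eps; otherwise replace it with
    j != i sampled with probability proportional to exp(-d(c_i, c_j))."""
    rng = np.random.default_rng() if rng is None else rng
    out = indices.copy()
    for n, i in enumerate(indices):
        if rng.random() < 1.0 - eps:
            continue  # keep the original codeword
        d = np.linalg.norm(codebook - codebook[i], axis=1)  # d(c_i, c_j)
        w = np.exp(-d)
        w[i] = 0.0  # exclude the original codeword from substitution
        out[n] = rng.choice(len(codebook), p=w / w.sum())
    return out
```

Because nearby codewords get exponentially larger substitution weight, the perturbed feature stays semantically close to the original, in contrast to dropout's indiscriminate zeroing.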

  2. Dual-Branch Shared Post-Quantization Space: VQ quantization may discard fine-grained visual information. To address this, a dual-branch architecture is designed in which post-quantization features are fed into both an image decoder \(D_i\) and a segmentation decoder \(D_s\):
\[\hat{x} = D_i(q(\mathbf{z})), \quad \hat{y} = D_s(q(\mathbf{z}))\]

For labeled data: \(\mathcal{L}_l = \mathcal{L}_{rec}(x_l, \hat{x}_l^S) + \mathcal{L}_{seg}(y_l, \hat{y}_l^S)\)

For unlabeled data, pseudo-labels \(\tilde{y}_u\) are generated by the teacher network: \(\mathcal{L}_u = \mathcal{L}_{rec}(x_u, \hat{x}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_a^S)\)

The reconstruction branch serves as a self-supervised signal, encouraging the VQ encoder to learn better representations.
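A minimal sketch of how the labeled and unlabeled objectives combine, assuming the L1 reconstruction and cross-entropy segmentation losses stated later in the training section (all helper names are illustrative):

```python
import numpy as np

def l1_loss(x, x_hat):
    """L1 reconstruction loss."""
    return np.mean(np.abs(x - x_hat))

def ce_loss(y, probs, eps=1e-8):
    """Cross-entropy; y is an integer label map, probs has shape (..., C)."""
    picked = np.take_along_axis(probs, y[..., None], axis=-1)
    return -np.mean(np.log(picked + eps))

def labeled_loss(x, x_hat, y, y_probs):
    """L_l = L_rec(x_l, x_hat_l) + L_seg(y_l, y_hat_l)."""
    return l1_loss(x, x_hat) + ce_loss(y, y_probs)

def unlabeled_loss(x, x_hat, y_pseudo, y_probs_student, y_probs_aux):
    """L_u: teacher pseudo-labels supervise both student predictions,
    plus reconstruction on the unlabeled image."""
    return (l1_loss(x, x_hat)
            + ce_loss(y_pseudo, y_probs_student)
            + ce_loss(y_pseudo, y_probs_aux))
```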

  3. Foundation-Model-Guided Post-quantization Feature Adapter (PFA): A frozen DINOv2 is used as an external semantic prior. PFA aligns quantized features with foundation model features via resize and 1×1 convolution for dimension matching, followed by patch-wise contrastive learning:
\[\mathcal{L}_{\text{align}} = -\frac{1}{HW} \sum_{i=1}^{HW} \log \frac{\exp(\text{sim}(f_i^{\text{pfa}}, f_i^{\text{fm}})/\tau)}{\sum_{j=1}^{HW} \exp(\text{sim}(f_i^{\text{pfa}}, f_j^{\text{fm}})/\tau)}\]

This localized semantic supervision compensates for detail loss and semantic drift introduced by the quantization process.
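The alignment objective above is a standard patch-wise InfoNCE loss with positives on the diagonal; a minimal NumPy sketch (assuming L2-normalized patch features of shape (HW, D); names are illustrative):

```python
import numpy as np

def patch_infonce(f_pfa, f_fm, tau=0.07):
    """Patch-wise contrastive alignment: each adapted patch i is pulled
    toward the matching foundation-model patch i and pushed away from all
    other patches j. Inputs are (HW, D) with L2-normalized rows."""
    sim = f_pfa @ f_fm.T / tau                  # (HW, HW) similarities / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # positives sit on the diagonal
```

When the adapted features match the foundation-model features patch for patch, the loss approaches zero; misaligned features are penalized, which is what compensates for quantization drift.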

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{db} + \lambda_a \mathcal{L}_{\text{align}}\), where \(\mathcal{L}_{db} = \mathcal{L}_l + \lambda_u \mathcal{L}_u\). \(\mathcal{L}_{rec}\) is the L1 loss and \(\mathcal{L}_{seg}\) is the cross-entropy loss. The teacher network is updated via EMA (\(\alpha=0.996\)). Codebook size is set to \(K=16384\); training runs for 100K iterations with the AdamW optimizer on 4 RTX 4090 GPUs.
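The teacher-side EMA update can be sketched in one line per parameter (illustrative, assuming parameters are held as a flat list of arrays or scalars):

```python
def ema_update(teacher_params, student_params, alpha=0.996):
    """Exponential moving average: teacher <- alpha * teacher + (1 - alpha) * student.
    With alpha = 0.996 the teacher tracks a slowly moving average of the student."""
    return [alpha * t + (1 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```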

Key Experimental Results

Main Results (LC Lung Cancer Dataset)

| Method | 5% Dice↑ | 5% Jaccard↑ | 5% HD95↓ | 10% Dice↑ | 10% Jaccard↑ | 10% HD95↓ |
| --- | --- | --- | --- | --- | --- | --- |
| UNet-F (Full sup.) | 0.8345 | 0.7386 | 6.9634 | 0.8345 | 0.7386 | 6.9634 |
| UNet-S | 0.4343 | 0.3118 | 26.0498 | 0.6490 | 0.5175 | 21.4063 |
| UA-MT | 0.6029 | 0.4647 | 48.6681 | 0.7222 | 0.5989 | 11.6724 |
| Unimatch | 0.6493 | 0.5071 | 17.8700 | 0.7511 | 0.6333 | 17.0178 |
| VQ-Seg | 0.6643 | 0.5257 | 12.2525 | 0.7852 | 0.6731 | 11.6179 |

At 10% labeled data, Dice improves by 2.97% and Jaccard by 3.17% over the previous best.

Ablation Study

| Base | QPM | DB | PFA | Dice↑ | Jaccard↑ | HD95↓ | ASD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | 0.7443 | 0.6238 | 14.2153 | 5.2301 |
| ✓ | ✓ | | | 0.7701 | 0.6559 | 13.0246 | 4.9378 |
| ✓ | ✓ | ✓ | | 0.7784 | 0.6620 | 12.4728 | 4.6013 |
| ✓ | ✓ | ✓ | ✓ | 0.7852 | 0.6731 | 11.6179 | 4.2094 |

QPM contributes the largest individual gain (+2.58% Dice); all three modules together yield the best performance.

Key Findings

  • Perturbation intensity \(\epsilon=0.7\) is optimal; performance degrades noticeably at \(\epsilon=0.9\), but far less catastrophically than dropout.
  • DINOv2 outperforms other foundation model priors (CLIP, BiomedCLIP, MAE, Rad-DINO) as the semantic prior.
  • Codebook size \(K=16384\) is optimal: smaller sizes (1024) lack representational capacity, while larger sizes (65536) suffer from reduced utilization (92%).
  • The newly collected lung cancer dataset (828 CT cases) provides clinically valuable annotations for central-type lung cancer.

Highlights & Insights

  • First application of VQ to semi-supervised segmentation: The discrete codebook space serves as a structured perturbation carrier, offering more principled control than dropout.
  • Complete theoretical-to-empirical justification: The logical chain — from KL divergence analysis of dropout instability, to QPM design, to experimental validation — is coherent and well-grounded.
  • Dual-branch design elegantly addresses quantization information loss: The reconstruction branch not only preserves visual information but also provides a self-supervised signal for the VQ encoder.
  • New dataset contribution: The 828-case lung cancer CT dataset is a valuable clinical resource that fills a gap in central-type lung cancer segmentation benchmarks.

Limitations & Future Work

  • Experiments are conducted only on 2D slices; the behavior of VQ in 3D settings remains unvalidated.
  • Codebook learning stability and efficiency may become bottlenecks in more complex tasks.
  • The QPM perturbation strategy relies on codeword distances and is therefore sensitive to codebook quality.
  • Only DINOv2 is used as the foundation model prior; multi-model ensembles or domain-specific medical foundation models remain unexplored.

Compared to UA-MT (Monte Carlo Dropout), BCP, and Unimatch, the core distinction of VQ-Seg lies in replacing random dropout in continuous space with structured perturbation in discrete space. Relative to general VQ-VAE literature, VQ-Seg cleverly leverages the VQ space for both perturbation and reconstruction simultaneously. Key insight: when regularization strategies in continuous space exhibit instability, mapping to a discrete space may offer a more controllable alternative.

Rating

  • Novelty: ⭐⭐⭐⭐ First introduction of VQ into semi-supervised segmentation for perturbation, with solid theoretical grounding.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are detailed, though only two datasets (LC and ACDC) are used; greater diversity would strengthen the evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, figures are intuitive, and the overall structure is well-organized.
  • Value: ⭐⭐⭐⭐ Offers a novel perspective on perturbation strategies for semi-supervised segmentation; the new dataset provides an additional contribution.