
VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Conference: NeurIPS 2025 arXiv: 2601.10124 Code: GitHub Area: Medical Imaging Keywords: Semi-supervised segmentation, vector quantization, feature perturbation, consistency learning, medical image segmentation

TL;DR

VQ-Seg is proposed as the first method to introduce vector quantization into semi-supervised medical image segmentation. A Quantization Perturbation Module (QPM) replaces conventional dropout to achieve more controllable feature perturbation, complemented by a dual-branch architecture and foundation-model-guided alignment to compensate for quantization information loss.

Background & Motivation

In semi-supervised medical image segmentation, consistency learning combined with feature perturbation is a widely adopted strategy. However, existing methods rely heavily on dropout for feature-level perturbation, which introduces fundamental problems:

Low dropout rates (e.g., 0.3, 0.5): Insufficient perturbation with negligible effect on segmentation performance, failing to provide meaningful regularization.

High dropout rates (e.g., ≥0.7): Severe performance degradation: Dice and Jaccard drop sharply while HD95 and ASD increase substantially; at a dropout rate of 0.9 the output becomes completely unusable.

Finding the optimal dropout rate is extremely difficult: It depends on the dataset, task, and network architecture, requiring extensive manual tuning.

From a theoretical perspective, the KL divergence between the posterior and prior under dropout can be approximated as:

\[D_{KL}(P||Q) \approx \frac{1}{2}\left(\frac{p}{1-p} + \log(1-p)\right)\]

As \(p\) increases, the KL divergence grows rapidly, leading to over-regularization and learning collapse. This motivates the idea of performing more controllable perturbations within a discrete vector-quantized space.
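The claimed growth is easy to check numerically from the approximation above (a quick sketch; `dropout_kl` is an illustrative helper name, not from the paper):

```python
import math

def dropout_kl(p: float) -> float:
    """Approximate KL divergence between posterior and prior under dropout rate p,
    following D_KL(P||Q) ~ (1/2) * (p/(1-p) + log(1-p))."""
    return 0.5 * (p / (1 - p) + math.log(1 - p))

# KL stays small at moderate rates but explodes as p -> 1
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p={p}: KL ~ {dropout_kl(p):.4f}")
```

At p=0.3 the divergence is only about 0.04, while at p=0.9 it exceeds 3, which matches the paper's observation that high dropout rates over-regularize.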

Method

Overall Architecture

VQ-Seg comprises four core components: (1) a VQ encoder that quantizes continuous features into a discrete codebook space; (2) a Quantization Perturbation Module (QPM) that performs controllable perturbation in the codebook index space; (3) a dual-branch architecture sharing the post-quantization space for joint optimization of reconstruction and segmentation; and (4) a foundation-model-guided Post-quantization Feature Adapter (PFA) to compensate for semantic loss. A teacher–student framework is employed for consistency learning.

Key Designs

  1. Quantization Perturbation Module (QPM): The encoder output \(z = f_{\text{enc}}(x)\) is VQ-quantized to the nearest codebook index \(i = \arg\min_j \|z - c_j\|\). QPM defines a perturbation policy — given the original codeword \(c_i\), it replaces it with another codeword \(c_j\) with probability \(\pi(j|i)\):
\[\pi(j|i) = \begin{cases} 1 - \epsilon, & \text{if } j = i \\ \frac{\epsilon \exp(-d(c_i, c_j))}{Z_i}, & \text{if } j \neq i \end{cases}\]

where \(\epsilon \in [0,1]\) controls perturbation intensity, \(d(c_i, c_j)\) is the distance between codewords, and \(Z_i = \sum_{j \neq i} \exp(-d(c_i, c_j))\) is the normalization factor ensuring the probabilities sum to one. The key advantage is that, unlike dropout, QPM's perturbation distribution remains bounded at all times and is guided by the learned codebook structure, so perturbations substitute semantically similar codewords, yielding more controllable and interpretable behavior. For instance, at perturbation intensity \(\epsilon=0.7\), the nearest codeword c₂ is selected with 49% probability.
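The QPM policy above can be sketched in a few lines of NumPy (a minimal illustration, assuming Euclidean codeword distances; the function and variable names are hypothetical, not from the released code):

```python
import numpy as np

def qpm_perturb(indices, codebook, eps=0.7, rng=None):
    """Perturb codebook indices following the QPM policy:
    keep index i with probability 1 - eps; otherwise replace it with
    j != i sampled with probability proportional to exp(-d(c_i, c_j))."""
    rng = np.random.default_rng() if rng is None else rng
    out = indices.copy()
    for n, i in enumerate(indices):
        if rng.random() < 1.0 - eps:
            continue  # keep the original codeword
        d = np.linalg.norm(codebook - codebook[i], axis=1)  # d(c_i, c_j)
        w = np.exp(-d)
        w[i] = 0.0  # exclude the original codeword from substitution
        out[n] = rng.choice(len(codebook), p=w / w.sum())
    return out
```

Because nearby codewords get exponentially larger substitution weight, the perturbed feature stays semantically close to the original, in contrast to dropout's indiscriminate zeroing.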

  2. Dual-Branch Shared Post-Quantization Space: VQ quantization may discard fine-grained visual information. To address this, a dual-branch architecture is designed in which post-quantization features are fed into both an image decoder \(D_i\) and a segmentation decoder \(D_s\):
\[\hat{x} = D_i(q(\mathbf{z})), \quad \hat{y} = D_s(q(\mathbf{z}))\]

For labeled data: \(\mathcal{L}_l = \mathcal{L}_{rec}(x_l, \hat{x}_l^S) + \mathcal{L}_{seg}(y_l, \hat{y}_l^S)\)

For unlabeled data, pseudo-labels \(\tilde{y}_u\) are generated by the teacher network: \(\mathcal{L}_u = \mathcal{L}_{rec}(x_u, \hat{x}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_a^S)\)

The reconstruction branch serves as a self-supervised signal, encouraging the VQ encoder to learn better representations.
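A minimal sketch of how the labeled and unlabeled objectives combine, assuming the L1 reconstruction and cross-entropy segmentation losses stated later in the training section (all helper names are illustrative):

```python
import numpy as np

def l1_loss(x, x_hat):
    """L1 reconstruction loss."""
    return np.mean(np.abs(x - x_hat))

def ce_loss(y, probs, eps=1e-8):
    """Cross-entropy; y is an integer label map, probs has shape (..., C)."""
    picked = np.take_along_axis(probs, y[..., None], axis=-1)
    return -np.mean(np.log(picked + eps))

def labeled_loss(x, x_hat, y, y_probs):
    """L_l = L_rec(x_l, x_hat_l) + L_seg(y_l, y_hat_l)."""
    return l1_loss(x, x_hat) + ce_loss(y, y_probs)

def unlabeled_loss(x, x_hat, y_pseudo, y_probs_student, y_probs_aux):
    """L_u: teacher pseudo-labels supervise both student predictions,
    plus reconstruction on the unlabeled image."""
    return (l1_loss(x, x_hat)
            + ce_loss(y_pseudo, y_probs_student)
            + ce_loss(y_pseudo, y_probs_aux))
```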

  3. Foundation-Model-Guided Post-quantization Feature Adapter (PFA): A frozen DINOv2 is used as an external semantic prior. PFA aligns quantized features with foundation model features via resize and 1×1 convolution for dimension matching, followed by patch-wise contrastive learning:
\[\mathcal{L}_{\text{align}} = -\frac{1}{HW} \sum_{i=1}^{HW} \log \frac{\exp(\text{sim}(f_i^{\text{pfa}}, f_i^{\text{fm}})/\tau)}{\sum_{j=1}^{HW} \exp(\text{sim}(f_i^{\text{pfa}}, f_j^{\text{fm}})/\tau)}\]

This localized semantic supervision compensates for detail loss and semantic drift introduced by the quantization process.
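The alignment objective above is a standard patch-wise InfoNCE loss with positives on the diagonal; a minimal NumPy sketch (assuming L2-normalized patch features of shape (HW, D); names are illustrative):

```python
import numpy as np

def patch_infonce(f_pfa, f_fm, tau=0.07):
    """Patch-wise contrastive alignment: each adapted patch i is pulled
    toward the matching foundation-model patch i and pushed away from all
    other patches j. Inputs are (HW, D) with L2-normalized rows."""
    sim = f_pfa @ f_fm.T / tau                  # (HW, HW) similarities / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # positives sit on the diagonal
```

When the adapted features match the foundation-model features patch for patch, the loss approaches zero; misaligned features are penalized, which is what compensates for quantization drift.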

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{db} + \lambda_a \mathcal{L}_{\text{align}}\), where \(\mathcal{L}_{db} = \mathcal{L}_l + \lambda_u \mathcal{L}_u\). \(\mathcal{L}_{rec}\) is the L1 loss and \(\mathcal{L}_{seg}\) is the cross-entropy loss. The teacher network is updated via EMA (\(\alpha=0.996\)). Codebook size is set to \(K=16384\); training runs for 100K iterations with the AdamW optimizer on 4 RTX 4090 GPUs.
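The teacher-side EMA update can be sketched in one line per parameter (illustrative, assuming parameters are held as a flat list of arrays or scalars):

```python
def ema_update(teacher_params, student_params, alpha=0.996):
    """Exponential moving average: teacher <- alpha * teacher + (1 - alpha) * student.
    With alpha = 0.996 the teacher tracks a slowly moving average of the student."""
    return [alpha * t + (1 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```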

Key Experimental Results

Main Results (LC Lung Cancer Dataset)

| Method | 5% Dice↑ | 5% Jaccard↑ | 5% HD95↓ | 10% Dice↑ | 10% Jaccard↑ | 10% HD95↓ |
| --- | --- | --- | --- | --- | --- | --- |
| UNet-F (Full sup.) | 0.8345 | 0.7386 | 6.9634 | 0.8345 | 0.7386 | 6.9634 |
| UNet-S | 0.4343 | 0.3118 | 26.0498 | 0.6490 | 0.5175 | 21.4063 |
| UA-MT | 0.6029 | 0.4647 | 48.6681 | 0.7222 | 0.5989 | 11.6724 |
| Unimatch | 0.6493 | 0.5071 | 17.8700 | 0.7511 | 0.6333 | 17.0178 |
| VQ-Seg | 0.6643 | 0.5257 | 12.2525 | 0.7852 | 0.6731 | 11.6179 |

At 10% labeled data, Dice improves by 2.97% and Jaccard by 3.17% over the previous best.

Ablation Study

| Base | QPM | DB | PFA | Dice↑ | Jaccard↑ | HD95↓ | ASD↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | 0.7443 | 0.6238 | 14.2153 | 5.2301 |
| ✓ | ✓ | | | 0.7701 | 0.6559 | 13.0246 | 4.9378 |
| ✓ | ✓ | ✓ | | 0.7784 | 0.6620 | 12.4728 | 4.6013 |
| ✓ | ✓ | ✓ | ✓ | 0.7852 | 0.6731 | 11.6179 | 4.2094 |

QPM contributes the largest individual gain (+2.58% Dice); all three modules together yield the best performance.

Key Findings

  • Perturbation intensity \(\epsilon=0.7\) is optimal; performance degrades noticeably at \(\epsilon=0.9\), but far less catastrophically than dropout.
  • DINOv2 outperforms other foundation model priors (CLIP, BiomedCLIP, MAE, Rad-DINO) as the semantic prior.
  • Codebook size \(K=16384\) is optimal: smaller sizes (1024) lack representational capacity, while larger sizes (65536) suffer from reduced utilization (92%).
  • The newly collected lung cancer dataset (828 CT cases) provides clinically valuable annotations for central-type lung cancer.

Highlights & Insights

  • First application of VQ to semi-supervised segmentation: The discrete codebook space serves as a structured perturbation carrier, offering more principled control than dropout.
  • Complete theoretical-to-empirical justification: The logical chain — from KL divergence analysis of dropout instability, to QPM design, to experimental validation — is coherent and well-grounded.
  • Dual-branch design elegantly addresses quantization information loss: The reconstruction branch not only preserves visual information but also provides a self-supervised signal for the VQ encoder.
  • New dataset contribution: The 828-case lung cancer CT dataset is a valuable clinical resource that fills a gap in central-type lung cancer segmentation benchmarks.

Limitations & Future Work

  • Experiments are conducted only on 2D slices; the behavior of VQ in 3D settings remains unvalidated.
  • Codebook learning stability and efficiency may become bottlenecks in more complex tasks.
  • The QPM perturbation strategy relies on codeword distances and is therefore sensitive to codebook quality.
  • Only DINOv2 is used as the foundation model prior; multi-model ensembles or domain-specific medical foundation models remain unexplored.

Compared to UA-MT (Monte Carlo Dropout), BCP, and Unimatch, the core distinction of VQ-Seg lies in replacing random dropout in continuous space with structured perturbation in discrete space. Relative to general VQ-VAE literature, VQ-Seg cleverly leverages the VQ space for both perturbation and reconstruction simultaneously. Key insight: when regularization strategies in continuous space exhibit instability, mapping to a discrete space may offer a more controllable alternative.

Rating

  • Novelty: ⭐⭐⭐⭐ First introduction of VQ into semi-supervised segmentation for perturbation, with solid theoretical grounding.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are detailed, though only two datasets (LC and ACDC) are used; greater diversity would strengthen the evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, figures are intuitive, and the overall structure is well-organized.
  • Value: ⭐⭐⭐⭐ Offers a novel perspective on perturbation strategies for semi-supervised segmentation; the new dataset provides an additional contribution.