# VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation
Conference: NeurIPS 2025 arXiv: 2601.10124 Code: GitHub Area: Medical Imaging Keywords: Semi-supervised segmentation, vector quantization, feature perturbation, consistency learning, medical image segmentation
## TL;DR
VQ-Seg is proposed as the first method to introduce vector quantization into semi-supervised medical image segmentation. A Quantization Perturbation Module (QPM) replaces conventional dropout to achieve more controllable feature perturbation, complemented by a dual-branch architecture and foundation-model-guided alignment to compensate for quantization information loss.
## Background & Motivation
In semi-supervised medical image segmentation, consistency learning combined with feature perturbation is a widely adopted strategy. However, existing methods rely heavily on dropout for feature-level perturbation, which introduces fundamental problems:
- Low dropout rates (e.g., 0.3, 0.5): Insufficient perturbation with negligible effect on segmentation performance, failing to provide meaningful regularization.
- High dropout rates (e.g., ≥0.7): Severe performance degradation, with Dice and Jaccard dropping sharply while HD95 and ASD increase substantially; at a dropout rate of 0.9 the output becomes completely unusable.
- Finding the optimal dropout rate is extremely difficult: It depends on the dataset, task, and network architecture, requiring extensive manual tuning.
From a theoretical perspective, the KL divergence between the dropout posterior and the prior grows rapidly as the dropout rate \(p\) increases, leading to over-regularization and learning collapse. This motivates the idea of performing more controllable perturbations within a discrete vector-quantized space.
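The instability of high-rate dropout can be illustrated numerically. Under standard inverted dropout, the expected relative perturbation of a feature vector scales like \(\sqrt{p/(1-p)}\), which diverges as \(p \to 1\). A minimal numpy sketch (the feature vector and trial counts are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(4096)  # stand-in for a continuous encoder feature

def inverted_dropout(x, p, rng):
    # Zero each element with probability p, rescale survivors by 1/(1-p)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rel_pert = {}
for p in (0.3, 0.5, 0.7, 0.9):
    deltas = [np.linalg.norm(inverted_dropout(z, p, rng) - z) / np.linalg.norm(z)
              for _ in range(50)]
    rel_pert[p] = float(np.mean(deltas))
    print(f"p={p}: relative perturbation ~ {rel_pert[p]:.2f}")  # grows like sqrt(p/(1-p))
```

At \(p=0.5\) the perturbation is already as large as the feature itself, and at \(p=0.9\) it is roughly three times larger, consistent with the collapse described above.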
## Method
### Overall Architecture
VQ-Seg comprises four core components: (1) a VQ encoder that quantizes continuous features into a discrete codebook space; (2) a Quantization Perturbation Module (QPM) that performs controllable perturbation in the codebook index space; (3) a dual-branch architecture sharing the post-quantization space for joint optimization of reconstruction and segmentation; and (4) a foundation-model-guided Post-quantization Feature Adapter (PFA) to compensate for semantic loss. A teacher–student framework is employed for consistency learning.
### Key Designs
- Quantization Perturbation Module (QPM): The encoder output \(z = f_{\text{enc}}(x)\) is quantized to the nearest codebook index \(i = \arg\min_j \|z - c_j\|\). QPM defines a perturbation policy: given the original codeword \(c_i\), it replaces it with another codeword \(c_j\) with probability \(\pi(j|i)\), where \(\epsilon \in [0,1]\) controls the perturbation intensity, \(d(c_i, c_j)\) is the codeword distance, and \(Z_i\) is the normalization factor. The key advantage over dropout is that the perturbation distribution of QPM remains bounded at all times and is guided by the learned codebook structure to substitute semantically similar codewords, yielding more controllable and interpretable perturbations. For instance, at perturbation intensity \(\epsilon=0.7\), the nearest codeword \(c_2\) is selected with 49% probability.
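One plausible instantiation of such a policy (the exact expression for \(\pi(j|i)\) is not reproduced in this summary, so the keep probability \(1-\epsilon\), the distance-softmax form, and the temperature `tau` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def qpm_sample(i, codebook, eps, tau=1.0):
    # Keep the original codeword c_i with probability 1 - eps; otherwise swap
    # to a neighbour j != i with probability proportional to
    # exp(-d(c_i, c_j) / tau) / Z_i, preferring semantically similar codewords.
    if rng.random() >= eps:
        return i
    d = np.linalg.norm(codebook - codebook[i], axis=1)   # d(c_i, c_j)
    logits = -d / tau
    logits[i] = -np.inf                                  # never "swap" to itself
    probs = np.exp(logits - logits[np.isfinite(logits)].max())
    probs /= probs.sum()                                 # the Z_i normalisation
    return int(rng.choice(len(codebook), p=probs))

codebook = rng.standard_normal((16, 8))  # toy codebook; the paper uses K = 16384
swaps = [qpm_sample(0, codebook, eps=0.7) for _ in range(2000)]
print("kept original:", swaps.count(0) / len(swaps))    # ~= 1 - eps = 0.3
```

Unlike dropout, every perturbed feature is still a valid codeword, so the perturbation magnitude is bounded by the geometry of the learned codebook.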
- Dual-Branch Shared Post-Quantization Space: VQ quantization may discard fine-grained visual information. To address this, a dual-branch architecture is designed in which post-quantization features are fed into both an image decoder \(D_i\) and a segmentation decoder \(D_s\):
For labeled data: \(\mathcal{L}_l = \mathcal{L}_{rec}(x_l, \hat{x}_l^S) + \mathcal{L}_{seg}(y_l, \hat{y}_l^S)\)
For unlabeled data, pseudo-labels \(\tilde{y}_u\) are generated by the teacher network: \(\mathcal{L}_u = \mathcal{L}_{rec}(x_u, \hat{x}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_u^S) + \mathcal{L}_{seg}(\tilde{y}_u, \hat{y}_a^S)\)
The reconstruction branch serves as a self-supervised signal, encouraging the VQ encoder to learn better representations.
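The labeled and unlabeled objectives above can be sketched in a few lines of numpy (shapes, class count, and the perturbed-branch logits `logits_a` are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def l1_rec(x, x_hat):
    # Reconstruction branch: L1 loss between input and decoded image
    return float(np.abs(x - x_hat).mean())

def ce_seg(y, logits):
    # Segmentation branch: pixel-wise cross-entropy; y holds class indices,
    # logits has shape (C, H, W)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    log_p = np.log(e / e.sum(axis=0, keepdims=True))
    return float(-np.take_along_axis(log_p, y[None], axis=0).mean())

rng = np.random.default_rng(0)
H = W = 8; C = 2
x_l, x_hat_l = rng.random((H, W)), rng.random((H, W))
y_l = rng.integers(0, C, (H, W))
logits_s = rng.standard_normal((C, H, W))

# Labeled data: reconstruction + segmentation against ground truth
loss_l = l1_rec(x_l, x_hat_l) + ce_seg(y_l, logits_s)

# Unlabeled data: teacher pseudo-labels supervise both the clean student
# prediction and the QPM-perturbed branch (logits_a)
x_u, x_hat_u = rng.random((H, W)), rng.random((H, W))
y_tilde = rng.integers(0, C, (H, W))
logits_a = rng.standard_normal((C, H, W))
loss_u = l1_rec(x_u, x_hat_u) + ce_seg(y_tilde, logits_s) + ce_seg(y_tilde, logits_a)
print(loss_l, loss_u)
```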
- Foundation-Model-Guided Post-quantization Feature Adapter (PFA): A frozen DINOv2 serves as an external semantic prior. PFA aligns quantized features with the foundation-model features, using resizing and a 1×1 convolution for dimension matching followed by patch-wise contrastive learning. This localized semantic supervision compensates for the detail loss and semantic drift introduced by the quantization process.
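A minimal sketch of patch-wise contrastive alignment, assuming an InfoNCE-style loss where each quantized patch's positive is the foundation-model feature at the same spatial location (the projection matrix stands in for the 1×1 convolution; the paper's exact loss may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_infonce(q_feats, fm_feats, temperature=0.1):
    # Each quantized patch is pulled toward the foundation-model feature at
    # the same location and pushed away from all other locations; positives
    # therefore lie on the diagonal of the similarity matrix.
    q = q_feats / np.linalg.norm(q_feats, axis=1, keepdims=True)
    f = fm_feats / np.linalg.norm(fm_feats, axis=1, keepdims=True)
    sim = q @ f.T / temperature                       # (N, N) patch similarities
    m = sim.max(axis=1, keepdims=True)
    log_p = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -float(np.mean(np.diag(log_p)))

N, d_q, d_f = 16, 32, 48                    # 16 patches, mismatched feature dims
proj = rng.standard_normal((d_q, d_f)) * 0.1  # plays the role of the 1x1 conv
q_feats = rng.standard_normal((N, d_q)) @ proj  # dimension-matched quantized feats
fm_feats = rng.standard_normal((N, d_f))        # stand-in for frozen DINOv2 feats
print(patch_infonce(q_feats, fm_feats))         # high for unaligned features
print(patch_infonce(fm_feats, fm_feats))        # near 0 when already aligned
```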
### Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{db} + \lambda_a \mathcal{L}_{\text{align}}\), where \(\mathcal{L}_{db} = \mathcal{L}_l + \lambda_u \mathcal{L}_u\). \(\mathcal{L}_{rec}\) is the L1 loss and \(\mathcal{L}_{seg}\) is the cross-entropy loss. The teacher network is updated via EMA (\(\alpha=0.996\)). Codebook size is set to \(K=16384\); training runs for 100K iterations with the AdamW optimizer on 4 RTX 4090 GPUs.
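The EMA teacher update is standard mean-teacher bookkeeping; a minimal sketch with the paper's \(\alpha=0.996\) (the parameter dictionary is a toy stand-in for network weights):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.996):
    # Teacher parameters drift slowly toward the student: t <- a*t + (1-a)*s
    for k in teacher:
        teacher[k] = alpha * teacher[k] + (1.0 - alpha) * student[k]

teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(1000):
    ema_update(teacher, student)
# After n steps the teacher equals (1 - alpha**n) of the student's weights
print(teacher["w"])
```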
## Key Experimental Results
### Main Results (LC Lung Cancer Dataset)
| Method | 5% Dice↑ | 5% Jaccard↑ | 5% HD95↓ | 10% Dice↑ | 10% Jaccard↑ | 10% HD95↓ |
|---|---|---|---|---|---|---|
| UNet-F (Full sup.) | 0.8345 | 0.7386 | 6.9634 | 0.8345 | 0.7386 | 6.9634 |
| UNet-S | 0.4343 | 0.3118 | 26.0498 | 0.6490 | 0.5175 | 21.4063 |
| UA-MT | 0.6029 | 0.4647 | 48.6681 | 0.7222 | 0.5989 | 11.6724 |
| Unimatch | 0.6493 | 0.5071 | 17.8700 | 0.7511 | 0.6333 | 17.0178 |
| VQ-Seg | 0.6643 | 0.5257 | 12.2525 | 0.7852 | 0.6731 | 11.6179 |
At 10% labeled data, Dice improves by 2.97% and Jaccard by 3.17% over the previous best.
### Ablation Study
| Base | QPM | DB | PFA | Dice↑ | Jaccard↑ | HD95↓ | ASD↓ |
|---|---|---|---|---|---|---|---|
| ✓ | | | | 0.7443 | 0.6238 | 14.2153 | 5.2301 |
| ✓ | ✓ | | | 0.7701 | 0.6559 | 13.0246 | 4.9378 |
| ✓ | ✓ | ✓ | | 0.7784 | 0.6620 | 12.4728 | 4.6013 |
| ✓ | ✓ | ✓ | ✓ | 0.7852 | 0.6731 | 11.6179 | 4.2094 |
QPM contributes the largest individual gain (+2.58% Dice); all three modules together yield the best performance.
### Key Findings
- Perturbation intensity \(\epsilon=0.7\) is optimal; performance degrades noticeably at \(\epsilon=0.9\), but far less catastrophically than dropout.
- DINOv2 outperforms other foundation model priors (CLIP, BiomedCLIP, MAE, Rad-DINO) as the semantic prior.
- Codebook size \(K=16384\) is optimal: smaller codebooks (1024) lack representational capacity, while larger ones (65536) leave part of the codebook unused (utilization drops to 92%).
- The newly collected lung cancer dataset (828 CT cases) provides clinically valuable annotations for central-type lung cancer.
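The codebook-utilization metric behind the finding above can be measured as the fraction of codewords that are the nearest neighbour of at least one feature; the helper below is a hypothetical sketch with toy sizes (the paper sweeps 1024 to 65536):

```python
import numpy as np

rng = np.random.default_rng(0)

def codebook_utilization(features, codebook):
    # Fraction of codewords selected as nearest neighbour by >= 1 feature;
    # dead (never-selected) entries lower this number.
    # ||f - c||^2 differs from ||c||^2 - 2 f.c only by a per-feature constant,
    # so the argmin over codewords is unchanged.
    d2 = (codebook**2).sum(axis=1)[None, :] - 2.0 * features @ codebook.T
    used = np.unique(d2.argmin(axis=1))
    return len(used) / len(codebook)

feats = rng.standard_normal((500, 8))
small = rng.standard_normal((64, 8))
large = rng.standard_normal((1024, 8))
print(codebook_utilization(feats, small), codebook_utilization(feats, large))
```

With a fixed feature budget, oversized codebooks necessarily leave entries unused, which mirrors the reduced utilization reported at \(K=65536\).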
## Highlights & Insights
- First application of VQ to semi-supervised segmentation: The discrete codebook space serves as a structured perturbation carrier, offering more principled control than dropout.
- Complete theoretical-to-empirical justification: The logical chain — from KL divergence analysis of dropout instability, to QPM design, to experimental validation — is coherent and well-grounded.
- Dual-branch design elegantly addresses quantization information loss: The reconstruction branch not only preserves visual information but also provides a self-supervised signal for the VQ encoder.
- New dataset contribution: The 828-case lung cancer CT dataset is a valuable clinical resource that fills a gap in central-type lung cancer segmentation benchmarks.
## Limitations & Future Work
- Experiments are conducted only on 2D slices; the behavior of VQ in 3D settings remains unvalidated.
- Codebook learning stability and efficiency may become bottlenecks in more complex tasks.
- The QPM perturbation strategy relies on codeword distances and is therefore sensitive to codebook quality.
- Only DINOv2 is used as the foundation model prior; multi-model ensembles or domain-specific medical foundation models remain unexplored.
## Related Work & Insights
Compared to UA-MT (Monte Carlo Dropout), BCP, and Unimatch, the core distinction of VQ-Seg lies in replacing random dropout in continuous space with structured perturbation in discrete space. Relative to general VQ-VAE literature, VQ-Seg cleverly leverages the VQ space for both perturbation and reconstruction simultaneously. Key insight: when regularization strategies in continuous space exhibit instability, mapping to a discrete space may offer a more controllable alternative.
## Rating
- Novelty: ⭐⭐⭐⭐ First introduction of VQ into semi-supervised segmentation for perturbation, with solid theoretical grounding.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are detailed, though only two datasets (LC and ACDC) are used; greater diversity would strengthen the evaluation.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear, figures are intuitive, and the overall structure is well-organized.
- Value: ⭐⭐⭐⭐ Offers a novel perspective on perturbation strategies for semi-supervised segmentation; the new dataset provides an additional contribution.