TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Conference: CVPR 2026
arXiv: 2604.12012
Code: https://gdm-tipsv2.github.io/
Area: Multimodal VLM
Keywords: Vision-language pretraining, patch-text alignment, iBOT++, distillation, zero-shot segmentation

TL;DR

TIPSv2 builds on the observation that distillation substantially improves patch-text alignment and turns this insight into a new pretraining objective, iBOT++, in which visible tokens also contribute to the masked-image-modeling loss. Combined with head-only EMA and multi-granularity text augmentation, TIPSv2 reports state-of-the-art performance across 9 tasks and 20 datasets.

Background & Motivation

Background: Large-scale visual pretraining follows two major directions: contrastive/sigmoid image-text methods (CLIP, SigLIP, PE), which provide image-text alignment and zero-shot capabilities, and self-supervised methods (DINO, iBOT), which excel at spatial understanding for dense tasks.

Limitations of Prior Work: Achieving unified representations that excel at both global (image-level) and dense (patch-level) understanding simultaneously remains a significant challenge. Unified approaches such as TIPS and SigLIP2 still struggle to maintain precise patch-level text alignment. A surprising trend is that the largest flagship models exhibit worse patch-text alignment than their smaller counterparts.

Key Challenge: The final Transformer layers tend to function as global contrastive "decoders" rather than preserving local semantics, leading to degraded patch-level alignment.

Goal: To address patch-text alignment directly during the pretraining stage.

Key Insight: The paper finds that distillation substantially improves spatial alignment because it imposes effective supervision on all patch tokens: the distilled student's patch-text alignment far surpasses that of its teacher.

Core Idea: Translate the insight from distillation into the pretraining objective iBOT++, enabling visible tokens to directly participate in the MIM loss.

Method

Overall Architecture

TIPSv2 integrates three improvements over TIPS: (1) the iBOT++ objective, which applies the patch-level self-supervised loss to both visible and masked tokens; (2) a head-only EMA strategy, which keeps an EMA copy of the projection layers only rather than of the full model; and (3) multi-granularity text augmentation, which combines web alt-text with synthetic captions from PaliGemma and Gemini. The total loss is \(\mathcal{L} = \mathcal{L}_{CLIP} + \mathcal{L}_{DINO} + \mathcal{L}_{iBOT++}\).

Key Designs

iBOT++ (Enhanced Masked Image Modeling):
- Function: Directly enhances patch-text alignment during pretraining.
- Mechanism: Standard iBOT computes the loss only on masked tokens (\(\mathcal{L}_{iBOT} = -\sum_i m_i \, h_t(f_t(I)_i)^{\top} \log h_s(f_s(I_{mask})_i)\), where \(m_i = 1\) for masked patches and 0 otherwise). iBOT++ removes the masking condition \(m_i\), so visible tokens contribute to the loss as well. This effectively imposes a representation consistency constraint on all patch tokens.
- Design Motivation: The distillation experiments indicate that removing masking and supervising all tokens is the key factor for improving patch-text alignment. The distilled TIPS ViT-L student substantially outperforms its TIPS ViT-g teacher on zero-shot segmentation, with mIoU gaps of roughly 11 to 22 points across the four reported benchmarks (for example, 33.5 vs. 11.4 on PC59).
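
The difference between the two objectives can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `patch_loss`, the toy shapes, the averaging over contributing patches, and the omission of teacher centering and temperature are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def log_softmax(x):
    """Row-wise log-softmax."""
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def patch_loss(teacher_logits, student_logits, mask, all_tokens):
    """Cross-entropy between teacher and student patch distributions.

    teacher_logits, student_logits: (num_patches, num_prototypes)
    mask: (num_patches,), 1 for masked patches and 0 for visible ones
    all_tokens: False -> iBOT (masked patches only),
                True  -> iBOT++ style (every patch contributes)
    Averaging over contributing patches is an assumption of this sketch.
    """
    p_teacher = softmax(teacher_logits)
    log_p_student = log_softmax(student_logits)
    per_patch = -(p_teacher * log_p_student).sum(axis=-1)
    if all_tokens:
        weights = np.ones_like(per_patch)
    else:
        weights = mask.astype(per_patch.dtype)
    return float((weights * per_patch).sum() / max(weights.sum(), 1.0))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 8))    # stands in for h_t(f_t(I)_i)
student = rng.normal(size=(6, 8))    # stands in for h_s(f_s(I_mask)_i)
mask = np.array([1, 1, 0, 0, 0, 0])  # two masked patches, four visible

ibot = patch_loss(teacher, student, mask, all_tokens=False)
ibot_pp = patch_loss(teacher, student, mask, all_tokens=True)
print(f"iBOT: {ibot:.3f}  iBOT++ style: {ibot_pp:.3f}")
```

In this sketch, changing the student's predictions on visible patches leaves the iBOT value untouched but changes the iBOT++ style value, which is exactly the extra supervision the mechanism describes.
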

Head-only EMA:
- Function: Substantially reduces training memory and the number of parameters kept during training.
- Mechanism: Standard SSL requires a full-model EMA teacher to prevent representation collapse. In TIPSv2, the additional image-text contrastive loss provides a stabilizing signal, so it is sufficient to apply EMA updates only to the projection head while the student and teacher share a single visual encoder.
- Design Motivation: Removing the EMA copy of the encoder cuts the parameters held during training by nearly half, enabling more efficient large-scale training.
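
A toy sketch of the head-only EMA idea follows. The parameter dictionaries, the layer sizes, the momentum value of 0.99, and the random "optimizer step" are hypothetical stand-ins, not values from the paper.

```python
import numpy as np


def ema_update(teacher, student, momentum):
    """Return m * teacher + (1 - m) * student for every parameter."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}


def num_params(*param_dicts):
    """Total number of scalars across the given parameter dictionaries."""
    return sum(v.size for d in param_dicts for v in d.values())


rng = np.random.default_rng(0)
encoder = {"w": rng.normal(size=(64, 64))}       # single shared visual encoder
student_head = {"w": rng.normal(size=(64, 16))}  # trained by gradient descent
teacher_head = {k: v.copy() for k, v in student_head.items()}  # EMA copy

# Stand-in for one optimizer step on the student (encoder + head).
encoder["w"] -= 0.01 * rng.normal(size=encoder["w"].shape)
student_head["w"] -= 0.01 * rng.normal(size=student_head["w"].shape)

# Head-only EMA: only the projection head is tracked; the encoder is shared.
teacher_head = ema_update(teacher_head, student_head, momentum=0.99)

full_ema = 2 * num_params(encoder, student_head)             # two full models
head_only = num_params(encoder, student_head, teacher_head)  # one encoder, two heads
print(f"full-model EMA: {full_ema} params, head-only EMA: {head_only} params")
```

The saving approaches one half as the encoder grows relative to the head, which matches the "nearly half" figure only when the head is small compared with the backbone.
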

Multi-Granularity Caption Sampling:
- Function: Enhances the diversity and robustness of text supervision.
- Mechanism: Combines original web alt-text captions, spatially-aware captions from PaliGemma, and detailed descriptive captions from Gemini, randomly sampling captions of different granularities during training.
- Design Motivation: Captions at different granularities capture distinct visual information, improving the model's adaptability across diverse downstream tasks.
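
The sampling step can be sketched as follows. The source names the three caption sources but not the sampling probabilities, so the uniform default, the `sample_caption` helper, the dictionary keys, and the example captions are assumptions.

```python
import random

CAPTION_SOURCES = ("alt_text", "paligemma", "gemini")


def sample_caption(captions, rng, weights=None):
    """Pick one caption granularity at random for a training example.

    captions: dict mapping a source name to its caption (may be missing or empty)
    rng: a random.Random instance
    weights: optional dict of per-source sampling weights (uniform if None)
    """
    available = [s for s in CAPTION_SOURCES if captions.get(s)]
    if not available:
        raise ValueError("no caption available for this image")
    if weights is None:
        probs = [1.0] * len(available)
    else:
        probs = [weights[s] for s in available]
    source = rng.choices(available, weights=probs, k=1)[0]
    return source, captions[source]


# Hypothetical captions for one image, from coarse to detailed.
example = {
    "alt_text": "dog in park",
    "paligemma": "a brown dog on the left, a wooden bench on the right",
    "gemini": "A brown dog runs across a sunlit lawn toward a wooden bench.",
}

rng = random.Random(0)
for _ in range(3):
    print(sample_caption(example, rng))
```

Falling back to whichever sources exist keeps the sampler usable on images that lack a synthetic caption; whether TIPSv2 handles missing captions this way is not stated in the source.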

Loss & Training

The overall loss comprises three components: the CLIP contrastive loss (with dual CLS tokens for object-level and spatial captions respectively), the DINO global self-distillation loss, and the iBOT++ patch-level loss (applied to all tokens rather than only masked tokens). Under head-only EMA, only the projection layers have an EMA teacher copy; the visual encoder is shared.
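
For reference, the patch-level objectives can be written side by side in the notation of the iBOT loss given under Key Designs. The iBOT++ line is reconstructed from the description "removes the masking condition"; any normalization constant and the exact form used in the paper are assumptions here.

\[
\begin{aligned}
\mathcal{L}_{iBOT} &= -\sum_{i} m_i \, h_t(f_t(I)_i)^{\top} \log h_s(f_s(I_{mask})_i) \\
\mathcal{L}_{iBOT++} &= -\sum_{i} h_t(f_t(I)_i)^{\top} \log h_s(f_s(I_{mask})_i) \\
\mathcal{L} &= \mathcal{L}_{CLIP} + \mathcal{L}_{DINO} + \mathcal{L}_{iBOT++}
\end{aligned}
\]

With head-only EMA, \(f_t\) and \(f_s\) denote the same shared encoder, and only \(h_t\) is an EMA copy of \(h_s\).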

Key Experimental Results

Main Results

Model	PC59 mIoU	PC60 mIoU	VOC21 mIoU	ADE150 mIoU
TIPS ViT-g (Teacher)	11.4	10.8	19.7	2.6
TIPS ViT-L (Distilled Student)	33.5	30.4	30.5	20.8
TIPSv2 ViT-L	42.1	38.2	45.3	28.7
DINOv2 ViT-L	35.8	32.1	38.6	23.4
PE-core ViT-L	28.3	25.6	32.1	18.2
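
The gaps discussed in this summary can be recomputed directly from the table rows above. The snippet below is plain arithmetic on those reported numbers; the helper name `gap` is introduced only for illustration.

```python
benchmarks = ("PC59", "PC60", "VOC21", "ADE150")

# mIoU values copied from the main results table.
miou = {
    "TIPS ViT-g (Teacher)": (11.4, 10.8, 19.7, 2.6),
    "TIPS ViT-L (Distilled Student)": (33.5, 30.4, 30.5, 20.8),
    "TIPSv2 ViT-L": (42.1, 38.2, 45.3, 28.7),
}


def gap(model_a, model_b):
    """Per-benchmark mIoU difference (model_a minus model_b), one decimal."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(benchmarks, miou[model_a], miou[model_b])
    }


student_vs_teacher = gap("TIPS ViT-L (Distilled Student)", "TIPS ViT-g (Teacher)")
tipsv2_vs_student = gap("TIPSv2 ViT-L", "TIPS ViT-L (Distilled Student)")

print("student - teacher:", student_vs_teacher)
print("TIPSv2 - student: ", tipsv2_vs_student)
```

The student-teacher gap is largest on PC59 and smallest on VOC21, and TIPSv2 improves on the distilled student on every benchmark in the table.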

Ablation Study

Configuration	Masking Ratio	PC59 mIoU	VOC21 mIoU	ADE150 mIoU
Standard pretraining	0.75	6.9	6.7	0.3
Distillation	0.75	16.0	22.5	5.9
Distillation	0.5	15.5	24.0	7.0
Distillation (no mask = iBOT++)	0.0	31.4	30.8	20.0

Key Findings

- Removing masking is the critical factor: in the distillation ablation, lowering the masking ratio from 0.75 to 0.5 leaves PC59 mIoU essentially unchanged (16.0 vs. 15.5), whereas removing masking entirely (ratio 0.0) raises it to 31.4.
- Head-only EMA achieves performance comparable to full-model EMA when text supervision is present, while reducing memory by approximately half.
- The fact that the distilled student substantially surpasses its teacher suggests that patch-text alignment is an emergent capability that can be acquired post hoc.

Highlights & Insights

- The finding that the distilled student surpasses the teacher is highly instructive: it suggests that patch-level semantics in large models are "diluted" by the global contrastive loss, and that distillation restores local semantics by imposing dense supervision over all tokens.
- The modification introduced by iBOT++ is remarkably simple (effectively a one-line change: removing the masking condition), yet its effect is large. Work of this kind, where a minimal modification yields outsized gains, is particularly valuable.
- The head-only EMA result indicates that, for the visual encoder, text supervision can take over the role that full-model EMA plays in preventing representation collapse.

Limitations & Future Work

- Validation is conducted primarily on Google-internal data at scale, making community reproduction challenging.
- The theoretical explanation for the effectiveness of iBOT++ remains insufficiently developed.
- Performance on video understanding and 3D tasks has not been evaluated.
- The effect of iBOT++ on larger model scales warrants further exploration.

Comparison with Related Work

- vs. TIPS: TIPSv2's iBOT++ enhances patch-text alignment directly during pretraining, eliminating the need for a separate distillation step.
- vs. DINOv2: DINOv2 lacks text alignment; TIPSv2 simultaneously achieves spatial understanding and text alignment.
- vs. PE: PE optimizes global contrastive objectives at the expense of dense tasks; TIPSv2 strikes a balance between both via iBOT++.

Rating

Novelty: ⭐⭐⭐⭐⭐ The finding that the distilled student surpasses the teacher and the introduction of iBOT++ are both highly original.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 9 tasks and 20 datasets.
Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from discovery to method is exceptionally clear.
Value: ⭐⭐⭐⭐⭐ Provides important guidance for pretraining visual foundation models.