A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Conference: CVPR 2026 arXiv: 2602.24181 Code: None Area: Image Segmentation Keywords: DINOv2, cross-modal alignment, modality-agnostic encoder, InfoNCE, knowledge distillation

TL;DR

This paper identifies severe cross-modal feature misalignment in pretrained vision encoders such as DINOv2 (across RGB, depth, and segmentation modalities), and proposes the Omnivorous framework, which trains lightweight adapters on the final few layers of a frozen backbone using an alignment loss, an anchoring loss, and modality mixup augmentation. The resulting encoder constructs a unified, modality-agnostic feature space that substantially outperforms baselines on cross-modal retrieval while maintaining or improving downstream task performance.

Background & Motivation

Background: Pretrained visual foundation models such as DINOv2 achieve strong performance on single-modality tasks, and it is commonly assumed that their feature spaces exhibit a degree of shared structure across different visual modalities (RGB, depth, segmentation maps).

Limitations of Prior Work: Empirical analysis reveals that the cosine similarity between DINOv2 features of an RGB image and its corresponding depth map is nearly identical to that between two entirely unrelated images: \(\cos(f(x_r), f(x_d)) \approx \cos(f(x_{r,1}), f(x_{r,2}))\). This indicates that the feature space of DINOv2 is severely misaligned across visual modalities.
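
As a rough illustration, this misalignment can be probed with a few lines of PyTorch. The torch.hub model name, the 224×224 preprocessing, and the placeholder tensors below are assumptions for a minimal sketch, not the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

# Load a frozen DINOv2 ViT-B/14 backbone from torch.hub (assumed entry point).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def cls_feature(img: torch.Tensor) -> torch.Tensor:
    # img: (3, 224, 224), ImageNet-normalized; the backbone returns the CLS embedding.
    return model(img.unsqueeze(0)).squeeze(0)

def cos(a: torch.Tensor, b: torch.Tensor) -> float:
    return F.cosine_similarity(a, b, dim=0).item()

# Placeholders: rgb and depth_vis show the same scene (depth rendered as 3 channels),
# rgb_other is an unrelated image. Replace with real, preprocessed tensors.
rgb, depth_vis, rgb_other = (torch.randn(3, 224, 224) for _ in range(3))
print("RGB vs. its own depth map:", cos(cls_feature(rgb), cls_feature(depth_vis)))
print("RGB vs. unrelated RGB:    ", cos(cls_feature(rgb), cls_feature(rgb_other)))
```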

Key Challenge: Existing unified encoder approaches (e.g., Omnivore, ImageBind) require joint training of the entire backbone from scratch, incurring high computational cost and compromising the discriminative capacity of pretrained models. Naive cross-modal alignment, on the other hand, risks feature space collapse.

Goal: How can the powerful semantic representations of DINOv2 be preserved while achieving parameter-efficient alignment of diverse visual modalities within a shared feature space?

Key Insight: The paper draws an analogy to the evolution of multilingual models in NLP, from language-specific encoders to shared cross-lingual ones: vision models similarly require cross-modal alignment. Here this is achieved by fine-tuning only the final few transformer blocks, supplemented by an anchoring loss to prevent representation drift.

Core Idea: Fine-tune the terminal blocks of a frozen DINOv2 backbone as a modality-agnostic adapter, employing symmetric cross-modal InfoNCE alignment loss, anchoring distillation loss, and modality mixup augmentation to yield an "omnivorous" encoder.

Method

Overall Architecture

Images of arbitrary modality (RGB/depth/segmentation) are processed through the first \(L=8\) frozen DINOv2 layers to extract intermediate features, which are then passed through trainable adapter layers (the final 4 blocks) to produce representations in a unified feature space. Training employs a teacher–student architecture: the student shares the frozen backbone but fine-tunes the terminal blocks, while the fully frozen teacher serves as an anchoring reference.
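
A minimal PyTorch sketch of this split is given below. It assumes the backbone exposes its 12 transformer blocks as `backbone.blocks` (as in the DINOv2 reference implementation); all names are illustrative:

```python
import copy
import torch

def build_student_teacher(backbone: torch.nn.Module, n_frozen: int = 8):
    """Split a 12-block ViT into a frozen trunk plus trainable terminal blocks,
    and keep a fully frozen copy as the anchoring teacher."""
    # The teacher is a frozen snapshot of the original pretrained weights.
    teacher = copy.deepcopy(backbone)
    for p in teacher.parameters():
        p.requires_grad = False
    teacher.eval()

    # The student freezes everything (patch embed, positional embed, early blocks)
    # except the final blocks, which act as the modality-agnostic adapter.
    student = backbone
    for p in student.parameters():
        p.requires_grad = False
    for block in student.blocks[n_frozen:]:
        for p in block.parameters():
            p.requires_grad = True
    return student, teacher
```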

Key Designs

  1. Symmetric Cross-Modal Alignment:

    • Function: Maximize feature similarity between different modalities of the same scene in the student output space.
    • Mechanism: InfoNCE loss is computed over all modality pairs \((m_1, m_2)\) as \(\mathcal{L}_{\text{align}} = \frac{1}{3}\sum_{k_1}\sum_{k_2>k_1}\mathcal{L}_{\text{InfoNCE}}(m_{k_1}, m_{k_2})\), with a learnable temperature parameter \(\tau\). Crucially, alignment is performed only in the adapted space rather than the frozen feature space.
    • Design Motivation: Performing symmetric alignment in the student's adapted space, rather than the frozen space, avoids the conflicting optimization objective of pulling mismatched frozen features toward incorrect targets.
  2. Anchoring Loss:

    • Function: Constrain the student output \(h_m\) to remain close to the teacher output \(h^*_m\), preventing feature drift or collapse.
    • Mechanism: \(\mathcal{L}_{\text{anchor}} = \frac{1}{|M|}\sum_{m \in M}(1 - \text{sim}(h_m, h^*_m))\), using cosine distance.
    • Design Motivation: Pure alignment loss may lead to a trivial solution in which the feature space collapses to satisfy alignment constraints while discarding all semantic information. The anchoring loss pulls the student back toward the teacher's representation space, preserving discriminative capacity. The balance is controlled by \(\lambda_{\text{anchor}}=10\).
  3. Data Augmentation: Natural Color Palette + Modality Mixup:

    • Function: Colorize depth and segmentation maps using quantized colors from the corresponding RGB image, followed by alpha blending across modalities.
    • Mechanism: \(x_d^{\text{mixup}} := (1-\alpha_d)x_d + \alpha_d x_r^{\text{aug}}\), with \(\alpha_d \sim \mathcal{U}(0, 0.5)\), creating a continuous modality spectrum (see the sketch after this list).
    • Design Motivation: (1) Natural colorization prevents the model from exploiting low-level statistical shortcuts such as color histograms for alignment; (2) blending generates "hard positives" that make the contrastive task more challenging and semantically meaningful; (3) the continuous modality space enhances the encoder's modality invariance.
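
A simplified PyTorch sketch of both augmentation steps follows. The binning-based palette lookup is only a stand-in for the paper's quantized natural colorization, and the function names, tensor shapes, and \(\mathcal{U}(0, 0.5)\) blend weight are assumptions:

```python
import torch

def colorize_with_natural_palette(depth: torch.Tensor, rgb: torch.Tensor, k: int = 16):
    """Map a single-channel depth map to 3 channels using colors taken from `rgb`.

    Simplified stand-in for natural-palette colorization: depth values are binned
    into k levels, each level assigned one color sampled from the paired RGB image
    (sorted by brightness so adjacent depth bins receive similar colors).
    depth: (H, W) in [0, 1]; rgb: (3, H, W) in [0, 1].
    """
    h, w = depth.shape
    # Sample k candidate colors from the RGB image and order them by brightness.
    idx = torch.randint(0, h * w, (k,))
    palette = rgb.reshape(3, -1)[:, idx]                 # (3, k)
    palette = palette[:, palette.mean(dim=0).argsort()]  # (3, k), dark -> bright
    # Quantize depth into k bins and look up the palette color per pixel.
    bins = (depth.clamp(0, 1) * (k - 1)).long()          # (H, W)
    return palette[:, bins]                              # (3, H, W)

def modality_mixup(x_mod: torch.Tensor, x_rgb_aug: torch.Tensor, alpha_max: float = 0.5):
    """Alpha-blend a colorized depth/segmentation image with an augmented RGB view."""
    alpha = torch.rand(()) * alpha_max                   # alpha ~ U(0, alpha_max)
    return (1 - alpha) * x_mod + alpha * x_rgb_aug
```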

Loss & Training

The overall objective is: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{align}} + \lambda_{\text{anchor}} \mathcal{L}_{\text{anchor}}\). Losses are computed separately on CLS tokens and dense tokens, with 64 dense tokens subsampled. Training uses six datasets (MOVi, ScanNet, TartanAir, etc.) on ViT-B/14, with the first 8 layers frozen and the final 4 layers fine-tuned.
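
A minimal sketch of this objective is shown below, assuming per-modality (B, D) feature batches (CLS tokens or flattened subsampled dense tokens) from the student and the frozen teacher; the dictionary keys and the log-temperature parameterization are assumptions:

```python
import itertools
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, log_tau: torch.Tensor):
    """Symmetric InfoNCE between two batches of features (same scenes, two modalities)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / log_tau.exp()          # (B, B), learnable temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def omnivorous_loss(student_feats: dict, teacher_feats: dict, log_tau: torch.Tensor,
                    lambda_anchor: float = 10.0):
    """Total objective: pairwise cross-modal alignment + anchoring to the frozen teacher.

    `student_feats` / `teacher_feats` map modality names (e.g. "rgb", "depth", "seg")
    to (B, D) feature tensors of the same scenes.
    """
    modalities = sorted(student_feats)
    pairs = list(itertools.combinations(modalities, 2))
    # L_align: averaged InfoNCE over all modality pairs in the adapted (student) space.
    align = sum(info_nce(student_feats[m1], student_feats[m2], log_tau)
                for m1, m2 in pairs) / len(pairs)
    # L_anchor: mean cosine distance between student and frozen-teacher features.
    anchor = sum(1 - F.cosine_similarity(student_feats[m], teacher_feats[m], dim=-1).mean()
                 for m in modalities) / len(modalities)
    return align + lambda_anchor * anchor
```

In training, `log_tau` would be a learnable scalar (e.g. `torch.nn.Parameter(torch.zeros(()))`) optimized jointly with the adapter blocks, and the same function would be applied to CLS and dense tokens and the results summed.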

Key Experimental Results

Main Results

| Task / Dataset | Metric | Omnivorous | DINOv2 | Gain |
| --- | --- | --- | --- | --- |
| ScanNet cross-modal retrieval (GAP) | R@1 | 46.1% | 4.6% | +41.5% |
| MOVi cross-modal retrieval (GAP) | R@1 | 86.2% | 15.5% | +70.7% |
| ImageNet classification (Linear) | Top-1 | 83.8% | 80.4% | +3.4% |
| NYUv2 depth estimation (Linear) | δ₁ | 0.896 | 0.875 | +0.021 |
| ADE20k segmentation (Linear) | mIoU | 0.475 | 0.463 | +0.012 |

Ablation Study

| \(\lambda_{\text{anchor}}\) | Cross-modal alignment (cos sim) | Cross-scene discrimination (1−cos sim) | Notes |
| --- | --- | --- | --- |
| 0 | ~0.70 | ~0.36 | Over-alignment; discriminative capacity collapses |
| 1.0 | ~0.65 | ~0.70 | Good balance |
| 10.0 (default) | ~0.55 | ~0.78 | Conservative but robust |
| 100.0 | ~0.35 | ~0.80 | Approaches frozen baseline |

Key Findings

  • Cross-modal zero-shot transfer: A depth prediction head trained on RGB features is evaluated zero-shot on segmentation map inputs. DINOv2 achieves RMSE=1.536 (near random), whereas Omnivorous achieves RMSE=0.532; the approach also generalizes to the unseen NOCS modality.
  • ImageNet k-NN accuracy nearly matches the teacher (81.97% vs. 81.94%), demonstrating that the anchoring loss effectively prevents representation drift.
  • As the mixup maximum \(\alpha_{\max}\) increases from 0 to 1.0, classification and segmentation performance improve consistently; depth estimation degrades slightly, making 0.5 a favorable trade-off.

Highlights & Insights

  • Key finding on cross-modal misalignment: The paper quantitatively demonstrates that DINOv2 feature similarity between RGB and depth images is nearly indistinguishable from that between random image pairs—an observation that is independently informative.
  • Parameter-efficient post-hoc alignment: Fine-tuning only the final 4 transformer blocks substantially improves cross-modal alignment while preserving or enhancing downstream task performance, making the approach lightweight and practical.
  • Colorization and modality mixup augmentation: Applying an RGB color palette to depth and segmentation maps eliminates color histogram shortcuts and creates a continuous modality spectrum, representing an elegant augmentation design.

Limitations & Future Work

  • Validation is currently limited to DINOv2 ViT-B/14; larger scales and alternative backbones remain untested.
  • It is unclear whether the high-resolution fine-tuning step of DINOv2 remains necessary after Omnivorous training.
  • Minor performance degradation is observed on fine-grained datasets such as iNaturalist and GLDv2, possibly attributable to training data dominated by multi-object synthetic scenes.
  • Text modality alignment is not addressed; the framework is limited to visual modalities.

Comparison with Related Methods

  • vs. ImageBind: ImageBind binds six modalities using images as a bridge but requires joint training from scratch; Omnivorous achieves alignment via post-hoc fine-tuning of a small number of layers.
  • vs. Omnivore: Omnivore trains a single ViT from scratch to handle images, video, and 3D data; Omnivorous retains the majority of pretrained backbone parameters.
  • vs. CMC (Contrastive Multiview Coding): CMC requires large numbers of negative samples and is constrained to specific modality pairs; Omnivorous avoids dependence on large-scale negatives through the anchoring loss.

Rating

  • Novelty: ⭐⭐⭐⭐ The core finding and angle of attack are compelling, though the teacher–student plus contrastive-learning framework itself is not novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across retrieval, classification, depth estimation, and segmentation, with detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ Well-organized with a persuasive motivation.
  • Value: ⭐⭐⭐⭐ Offers a meaningful contribution to the direction of unified multimodal encoders.