VIRTUE: Visual-Interactive Text-Image Universal Embedder

Conference: ICLR 2026
arXiv: 2510.00523
Code: GitHub
Area: Image Segmentation (Multimodal Embedding / Visual Interaction)
Keywords: visual prompt, embedding model, SAM2, VLM, visual-interactive, retrieval

TL;DR

This paper proposes VIRTUE, a visual-interactive universal embedder that integrates the segmentation model SAM2 with a VLM. Users can specify regions of interest via points, boxes, or masks, and the model produces joint entity-level and global-level embeddings. The paper also introduces SCaR, a million-scale benchmark for evaluating visual-interactive retrieval. VIRTUE achieves SOTA on 36 MMEB tasks (+3.1%–8.5%) and 5 SCaR tasks (+15.2%–20.3%).

Background & Motivation

Interaction limitations of embedding models: Existing VLM-based embedding models (VLM2Vec/GME/LamRA) support only text-based interaction and lack visual prompting capabilities (e.g., points, boxes, masks).

Value of visual prompts: Prompt-based interaction is well established in segmentation and grounding models (SAM, GroundingDINO) and has been explored in generative models, but visual prompts remain unexplored in embedding models. They provide precise spatial localization for fine-grained understanding.

Limitations of cropping: Cropping the ROI is the intuitive workaround, but it discards global scene context. Cropping a "salad fork on a table" loses the information about the table, so retrieval tasks that require compositional reasoning fail.

Distinct entity needs within the same image: A dog and a cat appearing in the same image require different embeddings, yet a single holistic embedding cannot distinguish them.

Lack of evaluation benchmarks: No publicly available benchmark exists for evaluating visual-interactive embedding capabilities.

Method

Architecture: SAM2 + VLM (Qwen2-VL) + Segmentation-Language Connector

Three-Stream Embedding Fusion

Segmentation embedding \(H_s\): SAM2's prompt encoder processes visual prompts (points/boxes/masks) → the mask decoder generates a 64×64 feature map \(F_s\) → Conv2D compresses it into \(|S|\) tokens → an MLP projects them into the LLM dimension \(d\).
Visual embedding \(H_v\): The VLM's vision encoder extracts global context.
Text embedding \(H_t\): The LLM's text embedding layer processes instruction text.
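
To make the shapes in the segmentation stream concrete, the sketch below traces a 64×64 feature map through a strided convolution and counts the resulting \(|S|\) tokens. The kernel size, stride, and LLM dimension are hypothetical illustration values and are not taken from the paper; only the 64×64 input size and the Conv2D-then-MLP order come from the description above.

```python
def conv2d_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a 2D convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def segmentation_token_shape(feature_hw=64, kernel=8, stride=8, llm_dim=1536):
    """Return (num_tokens, llm_dim) for the segmentation embedding H_s.

    kernel, stride, and llm_dim are assumed values for illustration only.
    """
    side = conv2d_output_size(feature_hw, kernel, stride)
    num_tokens = side * side  # |S|: the flattened spatial grid after Conv2D
    return num_tokens, llm_dim  # the MLP maps each token to the LLM dimension d


num_tokens, dim = segmentation_token_shape()
print(num_tokens, dim)
```

Under these assumed settings the 64×64 map shrinks to an 8×8 grid, so the LLM sees 64 segmentation tokens alongside the visual and text tokens. A larger stride would trade spatial detail for a shorter sequence.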

Strategy When No Visual Prompt Is Provided

\(N\) points are uniformly sampled as surrogate inputs to SAM2, leveraging its automatic segmentation capability to extract multi-entity-level features, ensuring performance gains on conventional non-interactive tasks.
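
One possible reading of "uniformly sampled" is a regular grid of point prompts over the image. The sketch below follows that reading. The helper name `uniform_grid_points` and the grid layout are assumptions for illustration, and the paper may place its \(N\) points differently.

```python
def uniform_grid_points(n_per_side: int, width: int, height: int):
    """Return n_per_side**2 (x, y) pixel coordinates on a regular grid.

    Points sit at cell centers so none falls on the image border.
    """
    points = []
    for row in range(n_per_side):
        for col in range(n_per_side):
            x = (col + 0.5) * width / n_per_side
            y = (row + 0.5) * height / n_per_side
            points.append((x, y))
    return points


# A 3x3 grid gives N = 9 surrogate point prompts for a 1024x1024 image.
points = uniform_grid_points(3, width=1024, height=1024)
print(len(points))
```

Each returned coordinate would then be passed to the segmentation model as a point prompt, in place of a user-provided one.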

Training Scheme

Concatenate \([H_s, H_v, H_t]\) → LLM → take the hidden state of the last token → InfoNCE contrastive learning.
SAM2 and the vision encoder are frozen; only LoRA (rank=8) and the segmentation-language connector (trained from scratch) are updated.
Trained on 20 MMEB training sets with batch size 1024.
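
The contrastive objective can be sketched in a few lines of plain Python. This is a minimal illustration of InfoNCE with in-batch negatives, assuming cosine similarity and a temperature of 0.05; both choices are assumptions and not values confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss where targets[i] is the positive for queries[i].

    All other targets in the batch act as in-batch negatives.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denominator - logits[i]
    return total / len(queries)


# Correctly paired embeddings give a near-zero loss; swapped pairs are penalized.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In the setup described above, each query vector would be the last-token hidden state for one side of a pair, and a large batch supplies many negatives per query.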

SCaR Benchmark (Novel Million-Scale Benchmark)

Task: Given an image and an ROI bounding box, retrieve the caption describing the entity within its global scene context.
Sources: RefCOCO+, RefCOCOg, VisualGenome, COCO-Stuff, ADE20K.
Scale: 957K training samples + 47K evaluation samples.
Hard negatives: Generated by GPT-4V using three substitution strategies targeting objects, relations, and scenes, with 9 distractors per sample.
Multi-stage quality control: GPT-4V verification + WordNet synonym detection + human review.
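
The task format above implies a ten-way ranking per sample: one ground-truth caption against 9 distractors. The sketch below scores such samples with top-1 accuracy. The dictionary keys, the dot-product scoring, and the choice of top-1 accuracy as the metric are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def precision_at_1(samples):
    """Fraction of samples whose ground-truth caption is ranked first.

    Each sample is a dict with hypothetical keys:
      "query":       embedding of the image plus its ROI box
      "candidates":  embeddings of 1 ground-truth caption + 9 distractors
      "answer":      index of the ground-truth caption in "candidates"
    """
    hits = 0
    for sample in samples:
        scores = [dot(sample["query"], c) for c in sample["candidates"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == sample["answer"])
    return hits / len(samples)


# Toy 2-D embeddings: the first sample is ranked correctly, the second is not.
distractors = [[0.1, 0.1]] * 8
samples = [
    {"query": [1.0, 0.0], "candidates": [[0.9, 0.1], [0.2, 0.8]] + distractors, "answer": 0},
    {"query": [0.0, 1.0], "candidates": [[0.9, 0.1], [0.2, 0.8]] + distractors, "answer": 0},
]
print(precision_at_1(samples))  # 0.5
```

With ten candidates per sample, random guessing would score about 0.1 under this metric, which gives a floor for reading the benchmark numbers.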

Key Experimental Results

MMEB Overall (36 Tasks)

Model	Params	IND	OOD	Overall
VLM2Vec-2B	2B	60.7	57.3	59.7
VIRTUE-2B	2B	69.7	58.8	64.8
VLM2Vec-7B	7B	71.4	58.1	65.5
UniME-7B	7B	68.4	57.9	66.6
VIRTUE-7B	7B	74.4	61.4	68.6
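
As a quick check on the headline gains, the snippet below recomputes the absolute improvement of each VIRTUE model over the VLM2Vec model of the same size, using only the Overall column of the table above.

```python
# Overall MMEB scores copied from the table above.
overall = {
    "VLM2Vec-2B": 59.7,
    "VIRTUE-2B": 64.8,
    "VLM2Vec-7B": 65.5,
    "VIRTUE-7B": 68.6,
}


def gain(model: str, baseline: str) -> float:
    """Absolute difference in overall score, rounded to one decimal."""
    return round(overall[model] - overall[baseline], 1)


gain_2b = gain("VIRTUE-2B", "VLM2Vec-2B")
gain_7b = gain("VIRTUE-7B", "VLM2Vec-7B")
print(gain_2b, gain_7b)  # 5.1 3.1
```

The 7B difference of 3.1 points matches the lower end of the range quoted in the TL;DR.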

SCaR (5 Visual-Interactive Tasks)

Model	RefCOCOg	RefCOCO+	COCO-Stuff	VG	ADE20K
VLM2Vec-7B	56.2	52.1	45.3	42.8	38.1
VIRTUE-7B	75.1	70.8	62.5	59.4	55.9

Ablation Study

Configuration	MMEB Overall	SCaR Avg	Notes
w/o segmentation embedding	65.5	52.1	VLM2Vec baseline
+ cropped ROI	65.8	54.3	Cropping offers limited benefit
+ full SAM2 features	67.1	63.2	Entity-level information is effective
+ full VIRTUE	68.6	68.2	Best

Key Findings

Segmentation embeddings also provide entity-level information gains in non-interactive settings (via uniform point sampling).
Gains of 3.1%–8.5% are observed even on conventional MMEB tasks without visual prompts.
SAM2 captures entity semantics more precisely than cropping by serving as a structured prior, avoiding issues such as background inclusion and cross-entity contamination.

Highlights & Insights

Novel interaction paradigm: The first work to introduce visual prompts (points/boxes/masks) into embedding models, defining an entirely new problem space.
SCaR benchmark: Million-scale data, GPT-4V-generated hard negatives, and multi-stage filtering make it a reliable evaluation tool.
Generality preserved: When no visual prompt is given, the automatic point-sampling strategy keeps performance competitive on conventional tasks.
Practical efficiency: Freezing SAM2 and fine-tuning with LoRA keeps training costs manageable.

Limitations & Future Work

SAM2 introduces additional inference overhead due to an extra segmentation forward pass.
SCaR evaluates only image-to-text (I2T) retrieval and does not cover image-to-image visual-interactive scenarios.
Uniform point sampling may not be the optimal entity discovery strategy; automatic object detection-driven approaches could be explored.
The segmentation-language connector requires training from scratch, increasing training complexity.

Related Work

VLM2Vec/GME/LamRA: VLM-based embedding model baselines supporting text-only interaction.
CLIP/SigLIP/OpenCLIP: Dual-encoder embedding models performing global matching without region awareness.
SAM2: Incorporated as an entity-level feature extractor for embedding learning.

Rating

Novelty: ⭐⭐⭐⭐⭐ Visual-interactive embedding = entirely new problem definition + new benchmark
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 36+5 tasks + extensive ablations + two model scales
Writing Quality: ⭐⭐⭐⭐ Clear and systematic; benchmark construction process is transparently described
Value: ⭐⭐⭐⭐⭐ Opens a new direction for visual-interactive embedding + high-quality benchmark