
INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

  • Conference: NeurIPS 2025
  • arXiv: 2412.03565
  • Code: GitHub | HuggingFace
  • Area: Video Understanding / Multimodal Learning
  • Keywords: Instance-level Understanding, Visual Prompting, Instruction Tuning, Large Multimodal Models, Spatiotemporal Understanding

TL;DR

This work presents the complete Inst-IT framework: a GPT-4o-assisted automatic annotation pipeline for generating instance-level fine-grained data, an Inst-IT Bench evaluation benchmark, a 335K QA-pair instruction tuning dataset, and a continual fine-tuning paradigm that effectively enhances instance-level understanding in LMMs while also improving general image and video comprehension.

Background & Motivation

Large multimodal models (LMMs) have achieved remarkable progress in holistic image and video understanding, yet they remain significantly limited in instance-level understanding:

What is instance-level understanding: Recognizing the attributes, behaviors, relationships, and temporal changes of specific instances (e.g., a particular person or object) within images or videos.

Strong real-world demand: Users typically focus on specific targets in a scene rather than the overall context — e.g., "What is the person in the red shirt doing?"

Limitations of current models: Existing LMMs perform well at global description but frequently confuse or omit details when asked to focus on a particular instance.

A key observation motivates this work: state-of-the-art LMMs show substantially improved instance understanding when provided with explicit visual cues (e.g., bounding boxes, arrows, ID labels). This suggests that models already possess the latent capacity for instance-level understanding, but lack the appropriate training data to activate it.

Accordingly, the core idea behind Inst-IT is to construct large-scale instance-level visual prompt instruction tuning data, guiding models to learn instance-level understanding through explicit visual marking.

Method

Overall Architecture

Inst-IT consists of three components:

  1. Inst-IT Bench: An evaluation benchmark for diagnosing instance-level understanding in LMMs.
  2. Inst-IT Dataset: A large-scale instruction tuning dataset.
  3. Continual Instruction Tuning Paradigm: An effective training strategy.

Key Designs

1. Automatic Annotation Pipeline

GPT-4o is employed as the annotation engine to process videos frame by frame (a minimal sketch of one frame-level call is given after the list below):

  • Instance annotation: Explicit visual markers (e.g., [1], [2], [3]) are applied to label each instance in the image.
  • Frame-level description: Three-level descriptions are generated per frame — (a) independent description of each instance, (b) holistic scene description, and (c) temporal changes relative to the previous frame.
  • Video-level description: All frame-level annotations are aggregated into a chronologically organized overall video description.
  • QA generation: Instance-centric open-ended question-answer pairs are generated from the annotations.
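
To make the pipeline concrete, here is a minimal sketch of one frame-level annotation call using the OpenAI Python SDK, assuming each frame has already been overlaid with numeric ID markers. The prompt wording and helper names (`FRAME_PROMPT`, `annotate_frame`) are illustrative assumptions, not the paper's actual prompts.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt asking for the three-level frame annotation.
FRAME_PROMPT = (
    "Every instance in this frame is marked with a numeric ID label. "
    "1) Describe each marked instance individually, referring to it as [ID]. "
    "2) Describe the overall scene. "
    "3) Describe what changed compared with the previous frame."
)

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string for the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def annotate_frame(frame_path: str, prev_annotation: str | None = None) -> str:
    """Request the three-level description for one ID-marked frame."""
    content = [{"type": "text", "text": FRAME_PROMPT}]
    if prev_annotation:
        # Pass the previous frame's annotation so temporal changes can be described.
        content.append({"type": "text",
                        "text": f"Previous-frame annotation:\n{prev_annotation}"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encode_image(frame_path)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        temperature=0.2,
    )
    return resp.choices[0].message.content
```

Video-level descriptions and QA pairs would then come from further GPT-4o calls over the collected frame-level outputs.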

2. Inst-IT Bench (Evaluation Benchmark)

  • Scale: ~1,000 image QA pairs + ~1,000 video QA pairs.
  • Evaluation dimensions: Image branch (instance attributes, instance relations) + Video branch (temporal tracking, action understanding).
  • Format: Supports both open-ended and multiple-choice questions.
  • Distinctive design: Uses [ID] format to reference instances and <timestamp> to reference temporal moments, evaluating fine-grained spatiotemporal understanding; an illustrative item is sketched after this list.
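
For illustration, a hypothetical video-branch multiple-choice item using these conventions might look like the Python dict below; the field names are assumptions, and only the [ID] and <timestamp> reference formats come from the benchmark design.

```python
# Hypothetical Inst-IT Bench-style item (field names are illustrative).
bench_item = {
    "question": "At <3>, what is [2] holding, and has it changed by <7>?",
    "format": "multiple_choice",
    "options": [
        "A. A phone, and it is unchanged",
        "B. A phone, then a cup",
        "C. A cup, and it is unchanged",
        "D. Nothing in either frame",
    ],
    "answer": "B",  # open-ended items would store a free-form reference answer instead
}
```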

3. Inst-IT Dataset (Fine-tuning Dataset)

  • 21K videos + 51K images
  • 21K video-level descriptions
  • 207K frame-level descriptions (51K image frames + 156K video frames)
  • 335K open-ended QA pairs
  • Currently the largest instance-level visual prompt annotation dataset.

4. Continual Instruction Tuning Paradigm

  • Inst-IT Dataset is mixed with the original general-purpose instruction tuning data (see the mixing sketch after this list).
  • A continual training strategy is adopted rather than training from scratch.
  • Adding only a modest volume of instance-level data (~155K) yields substantial gains in instance-level understanding without degrading general capabilities.
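
A minimal sketch of the mixing step, assuming both sources are already lists of formatted SFT samples; the exact sampling ratio and schedule used in the paper may differ.

```python
import random

def build_mixture(general_sft: list, inst_it_sft: list, seed: int = 0) -> list:
    """Continual-tuning mixture: keep all original general-purpose samples
    (~765K) and add the instance-level samples (~155K), then shuffle so the
    two sources are interleaved throughout training."""
    mixed = list(general_sft) + list(inst_it_sft)
    random.Random(seed).shuffle(mixed)
    return mixed
```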

Loss & Training

  • Standard autoregressive language modeling loss (a masked cross-entropy sketch follows this list).
  • Built upon the LLaVA-Next framework with two-stage training.
  • Mixing ratio: original LLaVA-Next data (~765K) + Inst-IT data (~155K) = ~920K total.
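
Below is a minimal PyTorch sketch of the objective with response-only supervision; masking prompt and image-token positions with -100 follows common LLaVA-style practice and is an assumption here, not a detail confirmed by the paper.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value for positions excluded from the loss

def lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: position t predicts token t+1.
    `labels` mirrors the input ids, with prompt and visual-token positions
    set to IGNORE_INDEX so only response tokens are supervised."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```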

Key Experimental Results

Main Results: Inst-IT Bench Evaluation

| Model | Backbone | Image OE | Image MC | Video OE | Video MC |
|---|---|---|---|---|---|
| Random Guess | – | – | 25.0 | – | 25.0 |
| GPT-4o | – | 74.1 | 84.8 | 65.5 | 81.0 |
| Gemini-1.5-pro | – | 69.9 | 79.7 | 61.4 | 76.7 |
| LLaVA-1.5 | Vicuna-7B | 41.6 | 32.1 | – | – |
| LLaVA-Next | Vicuna-7B | 46.0 | 42.4 | – | – |
| LLaVA-OV | Qwen2-7B | 48.0 | 71.7 | 33.2 | 45.6 |
| InternVL2 | InternLM2.5-7B | 58.6 | 66.5 | 39.8 | 45.5 |
| Qwen2-VL | Qwen2-7B | 48.3 | 64.9 | 38.2 | 59.4 |
| LLaVA-Next-Inst-IT | Vicuna-7B | 68.6 | 63.0 | 49.3 | 42.1 |
| LLaVA-Next-Inst-IT | Qwen2-7B | 67.9 | 75.3 | 45.7 | 53.3 |

Key findings:

  • After Inst-IT fine-tuning, LLaVA-Next (Vicuna-7B) improves from 46.0 to 68.6 on Image OE (+22.6), approaching GPT-4o performance.
  • On Video OE, the score improves from 25.8 to 49.3 (+23.5), a substantial gain.

Performance on General Benchmarks

Inst-IT fine-tuning not only improves instance-level understanding but also enhances general image and video comprehension:

| Benchmark | LLaVA-Next (Original) | +Inst-IT | Gain |
|---|---|---|---|
| AI2D | 65.2 | 68.7 | +3.5 |
| TextVQA | 63.8 | 65.1 | +1.3 |
| EgoSchema | 42.1 | 48.5 | +6.4 |
| MVBench | 56.3 | 60.8 | +4.5 |

Ablation Study

Importance of data composition:

| Data Configuration | Inst-IT Bench (Image MC) | AI2D | EgoSchema |
|---|---|---|---|
| LLaVA-Next baseline | 42.4 | 65.2 | 42.1 |
| + Image instance data only | 56.8 | 67.1 | 43.5 |
| + Video instance data only | 48.2 | 65.8 | 47.2 |
| + Image + Video instance data | 63.0 | 68.7 | 48.5 |

The combination of image and video instance data yields the best overall performance; video data contributes more to temporal understanding tasks such as EgoSchema.

Effect of visual prompt type:

| Visual Prompt Type | Inst-IT Bench (Image MC) | Inst-IT Bench (Video MC) |
|---|---|---|
| No visual prompt | 42.4 | 24.8 |
| Bounding Box | 55.1 | 35.2 |
| ID-labeled markers ([ID]) | 63.0 | 42.1 |

Explicit [ID] label markers outperform simple bounding boxes, as the ID system naturally enables cross-frame tracking of the same instance.
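
As an illustration of the ID-marker style, here is a minimal PIL sketch that draws a box plus an [ID] tag per instance; colors, font, and label placement are assumptions rather than the authors' exact rendering.

```python
from PIL import Image, ImageDraw

def draw_id_markers(frame: Image.Image,
                    boxes: dict[int, tuple[int, int, int, int]]) -> Image.Image:
    """Overlay a bounding box and an [ID] tag for each instance.
    `boxes` maps instance id -> (x1, y1, x2, y2); reusing the same id for the
    same instance across frames is what enables cross-frame tracking."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    for inst_id, (x1, y1, x2, y2) in boxes.items():
        label = f"[{inst_id}]"
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.rectangle((x1, y1 - 18, x1 + 11 * len(label), y1), fill="red")
        draw.text((x1 + 2, y1 - 16), label, fill="white")
    return out
```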

Key Findings

  1. Instance understanding is a notable weakness of LMMs: Even GPT-4o scores only in the mid-60s to mid-80s across the Inst-IT Bench splits, far from saturation.
  2. Explicit visual cues are highly effective: Introducing ID labels leads to a dramatic leap in instance-level understanding.
  3. Instance data enhances general capabilities: Instance-level training does not conflict with general-purpose capabilities; rather, the two are mutually reinforcing.
  4. High efficiency with limited data: Only ~155K instance-level samples are sufficient to produce substantial improvements.

Highlights & Insights

  1. Complete ecosystem: Bench + Dataset + Training constitutes a closed loop from evaluation to data to training.
  2. Automated annotation: GPT-4o is leveraged to automatically generate high-quality instance-level annotations at scale.
  3. General capability enhancement: This finding is counterintuitive — fine-grained instance-level training improves holistic understanding, suggesting that instance-level understanding is a foundational capability.
  4. Simple yet powerful visual prompt design: The [ID] labeling system is straightforward but effective, with native support for cross-frame instance tracking.

Limitations & Future Work

  1. Annotation cost: The pipeline relies on GPT-4o for annotation, which incurs high cost and may introduce GPT-4o-specific biases.
  2. Limited model scale: Validation is conducted only on 7B-scale models; effectiveness on larger models remains to be confirmed.
  3. Dependency on instance detection: The annotation pipeline requires existing detection/segmentation models to provide instance locations; detection failures cascade downstream.
  4. Open-world generalization: Current data covers a limited range of scene types; generalization to rare scenarios is unclear.
  5. Training code not open-sourced: Only the evaluation toolkit and model weights are currently released.

Related Work & Connections

  • ViP-LLaVA / SoM-LLaVA: Early works on visual prompt LMMs, but lacking large-scale instance-level data.
  • LLaVA-OneVision: A representative multi-stage instruction tuning approach that inspired the continual fine-tuning strategy in Inst-IT.
  • VideoGLaMM: Pixel-level video grounding focusing on finer-grained localization.
  • GPT-4o: Serves both as the annotation engine and as a practical upper bound, illustrating what strong instance-level understanding can look like.
  • Insight: Explicit visual marking is a low-cost, high-efficiency means of "unlocking" latent model capabilities, a principle that generalizes to other fine-grained tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ (first large-scale instance-level visual prompt fine-tuning framework)
  • Technical Depth: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐ (data and models open-sourced)
  • Writing Quality: ⭐⭐⭐⭐⭐