INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning¶
Conference: NeurIPS 2025 arXiv: 2412.03565 Code: GitHub | HuggingFace Area: Video Understanding / Multimodal Learning Keywords: Instance-level Understanding, Visual Prompting, Instruction Tuning, Large Multimodal Models, Spatiotemporal Understanding
TL;DR¶
This work presents the complete Inst-IT framework: a GPT-4o-assisted automatic annotation pipeline for generating instance-level fine-grained data, an Inst-IT Bench evaluation benchmark, a 335K QA-pair instruction tuning dataset, and a continual fine-tuning paradigm that effectively enhances instance-level understanding in LMMs while also improving general image and video comprehension.
Background & Motivation¶
Large multimodal models (LMMs) have achieved remarkable progress in holistic image and video understanding, yet they remain significantly limited in instance-level understanding:
What is instance-level understanding: Recognizing the attributes, behaviors, relationships, and temporal changes of specific instances (e.g., a particular person or object) within images or videos.
Strong real-world demand: Users typically focus on specific targets in a scene rather than the overall context — e.g., "What is the person in the red shirt doing?"
Limitations of current models: Existing LMMs perform well at global description but frequently confuse or omit details when asked to focus on a particular instance.
A key observation motivates this work: state-of-the-art LMMs show substantially improved instance understanding when provided with explicit visual cues (e.g., bounding boxes, arrows, ID labels). This suggests that models already possess the latent capacity for instance-level understanding, but lack the appropriate training data to activate it.
Accordingly, the core idea behind Inst-IT is to construct large-scale instance-level visual prompt instruction tuning data, guiding models to learn instance-level understanding through explicit visual marking.
Method¶
Overall Architecture¶
Inst-IT consists of three components:
- Inst-IT Bench: An evaluation benchmark for diagnosing instance-level understanding in LMMs.
- Inst-IT Dataset: A large-scale instruction tuning dataset.
- Continual Instruction Tuning Paradigm: An effective training strategy.
Key Designs¶
1. Automatic Annotation Pipeline
GPT-4o is employed as the annotation engine to process videos frame by frame (a pseudocode sketch follows this list):
- Instance annotation: Explicit visual markers (e.g., [1], [2], [3]) are applied to label each instance in the image.
- Frame-level description: Three-level descriptions are generated per frame: (a) an independent description of each instance, (b) a holistic scene description, and (c) temporal changes relative to the previous frame.
- Video-level description: All frame-level annotations are aggregated into a chronologically organized overall video description.
- QA generation: Instance-centric open-ended question-answer pairs are generated from the annotations.
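A pseudocode-level sketch of the pipeline is below. The helper names (`detect_and_track`, `overlay_ids`, `gpt4o`) are hypothetical stand-ins injected as callables, not identifiers from the paper's released code; the structure follows the four steps described above.

```python
# Hypothetical sketch of the Inst-IT annotation pipeline (not the authors' code).
# Assumes injected helpers: a detector/tracker with persistent IDs, a marker
# renderer, and a GPT-4o chat-API wrapper.

def annotate_video(frames, detect_and_track, overlay_ids, gpt4o):
    """Generate frame-level and video-level instance annotations plus QA pairs."""
    frame_records, prev = [], None
    for t, frame in enumerate(frames):
        instances = detect_and_track(frame)     # {instance_id: bbox}, IDs stable over time
        marked = overlay_ids(frame, instances)  # explicit [1], [2], ... visual markers

        record = {
            "timestamp": t,
            # (a) independent description of each marked instance
            "instances": gpt4o(marked, "Describe each marked instance [id] individually."),
            # (b) holistic scene description
            "scene": gpt4o(marked, "Describe the overall scene."),
            # (c) temporal changes relative to the previous frame
            "changes": gpt4o([prev, marked], "What changed since the previous frame?")
                       if prev is not None else None,
        }
        frame_records.append(record)
        prev = marked

    # Aggregate all frame annotations into a chronological video-level description.
    video_desc = gpt4o(frame_records, "Summarize the video over time, keeping instance IDs.")

    # Generate instance-centric open-ended QA grounded in the annotations.
    qa_pairs = gpt4o(
        {"frames": frame_records, "video": video_desc},
        "Write open-ended QA pairs referencing instances as [id] and moments as <timestamp>.",
    )
    return frame_records, video_desc, qa_pairs
```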
2. Inst-IT Bench (Evaluation Benchmark)
- Scale: ~1,000 image QA pairs + ~1,000 video QA pairs.
- Evaluation dimensions: Image branch (instance attributes, instance relations) + Video branch (temporal tracking, action understanding).
- Format: Supports both open-ended and multiple-choice questions.
- Distinctive design: Uses the [ID] format to reference instances and <timestamp> to reference temporal moments, evaluating fine-grained spatiotemporal understanding (an illustrative item follows this list).
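To make the reference format concrete, a hypothetical video-branch item might look like the following. This is written as a Python dict purely for illustration; the released benchmark files may use a different schema and field names.

```python
# Illustrative Inst-IT Bench-style item (hypothetical schema, not the released format).
example_item = {
    "video": "videos/0001.mp4",
    "question": "What is [2] holding at <3>, and where is it relative to [1] by <7>?",
    "type": "open-ended",  # the benchmark also includes multiple-choice items
    "answer": "At <3>, [2] is holding a cup; by <7>, it has moved behind [1].",
}
```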
3. Inst-IT Dataset (Fine-tuning Dataset)
- 21K videos + 51K images
- 21K video-level descriptions
- 207K frame-level descriptions (51K image frames + 156K video frames)
- 335K open-ended QA pairs
- Currently the largest instance-level visual prompt annotation dataset.
4. Continual Instruction Tuning Paradigm
- Inst-IT Dataset is mixed with the original general-purpose instruction tuning data (see the mixing sketch after this list).
- A continual training strategy is adopted rather than training from scratch.
- Adding only a modest volume of instance-level data (~155K) yields substantial gains in instance-level understanding without degrading general capabilities.
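A minimal sketch of the mixing step, assuming both corpora are in-memory lists of conversation samples; the counts in the comment follow the paper's reported mix, but the function itself is an illustration, not the authors' training code.

```python
import random

def mix_for_continual_tuning(general_data, inst_it_data, seed=0):
    """Shuffle general instruction data together with Inst-IT instance-level data.

    Continual tuning resumes from an already instruction-tuned checkpoint and
    trains on this union (~765K general + ~155K Inst-IT, ~920K samples total),
    rather than retraining from scratch.
    """
    mixed = list(general_data) + list(inst_it_data)
    random.Random(seed).shuffle(mixed)
    return mixed
```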
Loss & Training¶
- Standard autoregressive language modeling loss (written out after this list).
- Built upon the LLaVA-Next framework with two-stage training.
- Mixing ratio: original LLaVA-Next data (~765K) + Inst-IT data (~155K) = ~920K total.
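For reference, the objective is the usual next-token prediction over response tokens, conditioned on the visual input (with the ID markers rendered into the pixels) and the text instruction; the notation below is ours, not the paper's:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, \mathbf{v},\, \mathbf{q}\right)
$$

where $\mathbf{v}$ denotes the visual tokens, $\mathbf{q}$ the instruction tokens, and $y_{1:T}$ the target response.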
Key Experimental Results¶
Main Results: Inst-IT Bench Evaluation¶
| Model | Backbone | Image OE | Image MC | Video OE | Video MC |
|---|---|---|---|---|---|
| Random Guess | — | — | 25.0 | — | 25.0 |
| GPT-4o | — | 74.1 | 84.8 | 65.5 | 81.0 |
| Gemini-1.5-pro | — | 69.9 | 79.7 | 61.4 | 76.7 |
| LLaVA-1.5 | Vicuna-7B | 41.6 | 32.1 | — | — |
| LLaVA-Next | Vicuna-7B | 46.0 | 42.4 | — | — |
| LLaVA-OV | Qwen2-7B | 48.0 | 71.7 | 33.2 | 45.6 |
| InternVL2 | InternLM2.5-7B | 58.6 | 66.5 | 39.8 | 45.5 |
| Qwen2-VL | Qwen2-7B | 48.3 | 64.9 | 38.2 | 59.4 |
| LLaVA-Next-Inst-IT | Vicuna-7B | 68.6 | 63.0 | 49.3 | 42.1 |
| LLaVA-Next-Inst-IT | Qwen2-7B | 67.9 | 75.3 | 45.7 | 53.3 |
Key findings:
- After Inst-IT fine-tuning, LLaVA-Next (Vicuna-7B) improves from 46.0 to 68.6 on Image OE (+22.6), approaching GPT-4o performance.
- On Video OE, the score improves from 25.8 to 49.3 (+23.5), a substantial gain.
Performance on General Benchmarks¶
Inst-IT fine-tuning not only improves instance-level understanding but also enhances general image and video comprehension:
| Benchmark | LLaVA-Next (Original) | +Inst-IT | Gain |
|---|---|---|---|
| AI2D | 65.2 | 68.7 | +3.5 |
| TextVQA | 63.8 | 65.1 | +1.3 |
| EgoSchema | 42.1 | 48.5 | +6.4 |
| MVBench | 56.3 | 60.8 | +4.5 |
Ablation Study¶
Importance of data composition:
| Data Configuration | Inst-IT Bench (MC) | AI2D | EgoSchema |
|---|---|---|---|
| LLaVA-Next baseline | 42.4 | 65.2 | 42.1 |
| + Image instance data only | 56.8 | 67.1 | 43.5 |
| + Video instance data only | 48.2 | 65.8 | 47.2 |
| + Image + Video instance data | 63.0 | 68.7 | 48.5 |
The combination of image and video instance data yields the best overall performance; video data contributes more to temporal understanding tasks such as EgoSchema.
Effect of visual prompt type:
| Visual Prompt Type | Inst-IT Bench (Image MC) | Inst-IT Bench (Video MC) |
|---|---|---|
| No visual prompt | 42.4 | 24.8 |
| Bounding Box | 55.1 | 35.2 |
| ID-labeled markers ([ID]) | 63.0 | 42.1 |
Explicit [ID] label markers outperform simple bounding boxes, as the ID system naturally enables cross-frame tracking of the same instance.
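A minimal sketch of the distinction, assuming Pillow: a plain box only localizes an instance, while a rendered [ID] label gives the model a stable symbol it can reuse across frames. This is an illustration of the marker idea, not the paper's rendering code.

```python
from PIL import Image, ImageDraw

def draw_id_marker(img: Image.Image, box, instance_id: int) -> Image.Image:
    """Overlay a bounding box plus a textual [id] label on one instance."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    x0, y0, x1, y1 = box
    draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
    # The rendered "[id]" text, not the box itself, is what gives the model a
    # stable symbol to reuse when the same instance reappears in later frames.
    draw.text((x0 + 4, y0 + 4), f"[{instance_id}]", fill="red")
    return out
```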
Key Findings¶
- Instance understanding is a notable weakness of LMMs: Even GPT-4o achieves only 74–85 on Inst-IT Bench, far from perfect.
- Explicit visual cues are highly effective: Introducing ID labels leads to a dramatic leap in instance-level understanding.
- Instance data enhances general capabilities: Instance-level training does not conflict with general-purpose capabilities; rather, the two are mutually reinforcing.
- High efficiency with limited data: Only ~155K instance-level samples are sufficient to produce substantial improvements.
Highlights & Insights¶
- Complete ecosystem: Bench + Dataset + Training constitutes a closed loop from evaluation to data to training.
- Automated annotation: GPT-4o is leveraged to automatically generate high-quality instance-level annotations at scale.
- General capability enhancement: This finding is counterintuitive — fine-grained instance-level training improves holistic understanding, suggesting that instance-level understanding is a foundational capability.
- Simple yet powerful visual prompt design: The [ID] labeling system is straightforward but effective, with native support for cross-frame instance tracking.
Limitations & Future Work¶
- Annotation cost: The pipeline relies on GPT-4o for annotation, which incurs high cost and may introduce GPT-4o-specific biases.
- Limited model scale: Validation is conducted only on 7B-scale models; effectiveness on larger models remains to be confirmed.
- Dependency on instance detection: The annotation pipeline requires existing detection/segmentation models to provide instance locations; detection failures cascade downstream.
- Open-world generalization: Current data covers a limited range of scene types; generalization to rare scenarios is unclear.
- Training code not open-sourced: Only the evaluation toolkit and model weights are currently released.
Related Work & Insights¶
- ViP-LLaVA / SoM-LLaVA: Early works on visual prompt LMMs, but lacking large-scale instance-level data.
- LLaVA-OneVision: A representative multi-stage instruction tuning approach that inspired the continual fine-tuning strategy in Inst-IT.
- VideoGLaMM: Pixel-level video grounding focusing on finer-grained localization.
- GPT-4o: Serves both as the annotation tool and as the practical upper bound, demonstrating the potential of well-trained instance understanding.
- Insight: Explicit visual marking is a low-cost, high-efficiency means of "unlocking" latent model capabilities, a principle that generalizes to other fine-grained tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐ (first large-scale instance-level visual prompt fine-tuning framework)
- Technical Depth: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐ (data and models open-sourced)
- Writing Quality: ⭐⭐⭐⭐⭐