
Advancing Visual Large Language Model for Multi-granular Versatile Perception

Conference: ICCV 2025 arXiv: 2507.16213 Code: None (the paper states "The code will be available here" but no link is provided) Area: Segmentation / Object Detection Keywords: VLLM, multi-granular perception, unified framework, CoT data curation, panoptic segmentation

TL;DR

This paper proposes MVP-LM, a multi-granular versatile perception framework built upon a visual large language model. Through a novel multi-granular decoder and a CoT-inspired data unification strategy, MVP-LM is the first single model to simultaneously support all four perception combinations—box and mask predictions under both word-level and sentence-level instructions—achieving competitive performance on panoptic segmentation, object detection, visual grounding, and referring expression segmentation.

Background & Motivation

Visual perception tasks can be categorized along two dimensions: prediction type (bounding box vs. segmentation mask) and instruction type (word-level vs. sentence-level), yielding four combinations. However, existing methods typically cover only a subset of these combinations, limiting model generality.

Specifically: traditional detectors (e.g., GLIP, Grounding DINO) excel at word-level + box prediction but do not produce masks; segmentation methods (e.g., X-Decoder, OpenSeeD) handle word-level + mask but lack sentence-level understanding; VLMs (e.g., QwenVL, InternVL) understand sentence-level instructions and output box coordinates but cannot perform pixel-level prediction; recent works such as LISA and PixelLM support sentence-level + mask but neglect word-level perception.

Key Challenge: While joint training of certain combinations has been studied, the synergistic training of all four combinations remains largely unexplored. Key open questions include: (1) How can a VLLM simultaneously output both boxes and masks? (2) How can word-level and sentence-level instructions be handled in a unified manner? (3) How can the decoding and generation capabilities of LLMs be leveraged to enhance perception?

Core Idea of MVP-LM: The framework exploits the language understanding and generation capabilities of VLLMs, designs a multi-granular decoder for joint box and mask output, and unifies heterogeneous datasets into a "think-before-perceive" format via CoT-inspired data curation.

Method

Overall Architecture

MVP-LM consists of four core components: (1) a Swin-B Transformer as the image encoder; (2) a connector module for visual-text feature alignment; (3) Phi-1.5 as the language model; and (4) a multi-granular decoder based on OpenSeeD that jointly produces boxes and masks. The VLLM first generates a caption describing the image, then outputs a summary token whose hidden state is projected into visual queries; these queries are fed into the decoder for detection and segmentation.
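
As a rough illustration of this pipeline, the following PyTorch sketch wires toy stand-ins for the four components together; all dimensions, module choices, and variable names are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Toy stand-ins for Swin-B, the connector, Phi-1.5, and the OpenSeeD-based decoder;
# everything here is illustrative, not the paper's real modules or sizes.
hidden_dim, num_queries = 256, 100

image_encoder = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)   # stand-in for Swin-B
connector = nn.Linear(hidden_dim, hidden_dim)                          # visual-to-LLM feature alignment
query_proj = nn.Linear(hidden_dim, num_queries * hidden_dim)           # <PER> hidden state -> N visual queries

image = torch.randn(1, 3, 224, 224)
feats = image_encoder(image)                                  # (1, C, 14, 14) visual features (single scale here)
visual_tokens = connector(feats.flatten(2).transpose(1, 2))   # (1, 196, C) tokens fed to the LLM

# The VLLM (Phi-1.5) consumes the visual tokens plus the instruction, first emits a caption,
# then a <PER> summary token; its hidden state is faked with a random tensor in this sketch.
per_hidden = torch.randn(1, hidden_dim)
queries = query_proj(per_hidden).view(1, num_queries, hidden_dim)

# The OpenSeeD-style decoder would now cross-attend these queries to the multi-scale
# features and emit boxes and masks (see the decoder sketch further below).
print(visual_tokens.shape, queries.shape)  # torch.Size([1, 196, 256]) torch.Size([1, 100, 256])
```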

Key Designs

  1. Dynamic Query Generation:

    • Function: Dynamically generates visual queries for detection/segmentation conditioned on the input instruction, rather than using fixed learnable queries.
    • Mechanism: Each query consists of two components—(a) context-aware base query: the VLLM generates a summary token <PER>, and its hidden state is projected via an MLP into \(N\) base query vectors; (b) language-guided residual: the similarity between multi-scale visual features and the input instruction text embeddings is computed, the Top-\(N\) most similar visual features are selected as residuals and added to the base queries to form the final queries (see the minimal sketch after this list).
    • Design Motivation: Dynamic queries allow the model to adaptively attend to image regions most relevant to the instruction. The summary token generated by the LLM encodes global contextual information from the input, while language-guided visual feature selection introduces spatial priors.
  2. Multi-granular Decoder:

    • Function: Based on the OpenSeeD architecture, jointly handles box prediction and mask prediction.
    • Mechanism: Content queries and reference points are generated from a query selection mechanism, and cross-attention with multi-scale visual features is performed through multiple layers of deformable attention. Each layer's output is processed by three shared heads—a cross-modal similarity head, a box regression head, and a mask prediction head (a sketch of these shared heads follows the list). For word-level perception, categories are matched via similarity between text embeddings and predicted regions; for sentence-level perception, targets are directly matched using BCE loss.
    • Design Motivation: Sharing representations between boxes and masks enables mutual reinforcement—mask annotations can be converted to box annotations, and joint training can leverage richer data sources.
  3. CoT-Inspired Data Unification Strategy:

    • Function: Unifies heterogeneous datasets from different tasks (panoptic segmentation, detection, grounding, referring segmentation) into a single SFT data format.
    • Mechanism: Each training sample consists of three parts—a task description (e.g., "Please identify all objects from the given phrase list" or "Please identify the target according to the following instruction"), the instruction (a word list or referring expression), and the answer ("[image caption]. The perception result is <PER>"). During training, image description captions are prepended to the answer to encourage the model to "think before perceiving." An illustrative sample format is shown after this list.
    • Design Motivation: Multiple open-source VLLMs (VILA-3B/13B, InternVL2-8B/26B) are used to automatically generate diverse captions, avoiding overfitting to a single description style. For word-level tasks, category order is randomly shuffled and negative categories are introduced to prevent learning spurious correlations.
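
To make item 1 above more concrete, here is a minimal PyTorch sketch of dynamic query generation, assuming single-scale visual features and that the instruction is summarized by mean-pooling its text embeddings; the MLP depth, pooling choice, and dimensions are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DynamicQueryGenerator(nn.Module):
    """Sketch of dynamic query generation: context-aware base queries + language-guided residuals."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        self.num_queries = num_queries
        # projects the <PER> summary-token hidden state into N base query vectors
        self.base_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_queries * dim))

    def forward(self, per_hidden, visual_feats, text_embeds):
        # per_hidden: (B, D) hidden state of <PER>; visual_feats: (B, L, D) flattened visual features;
        # text_embeds: (B, T, D) instruction text embeddings
        B, L, D = visual_feats.shape
        base = self.base_mlp(per_hidden).view(B, self.num_queries, D)       # (a) context-aware base queries

        # (b) language-guided residuals: score each visual feature against the (mean-pooled)
        # instruction embedding and keep the Top-N most similar features as spatial priors
        sim = torch.einsum("bld,bd->bl", visual_feats, text_embeds.mean(dim=1))
        topk = sim.topk(self.num_queries, dim=1).indices                     # (B, N)
        residual = torch.gather(visual_feats, 1, topk.unsqueeze(-1).expand(-1, -1, D))

        return base + residual                                               # final queries (B, N, D)

gen = DynamicQueryGenerator()
q = gen(torch.randn(2, 256), torch.randn(2, 1024, 256), torch.randn(2, 8, 256))
print(q.shape)  # torch.Size([2, 100, 256])
```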
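
Item 2's three shared heads can be sketched in the generic OpenSeeD/Mask2Former style shown below, with class logits obtained from a dot product against text embeddings, a small box-regression MLP, and masks from a dot product against per-pixel embeddings; this reflects the common pattern for such decoders, not the paper's verified code.

```python
import torch
import torch.nn as nn

class SharedHeads(nn.Module):
    """Per-layer prediction heads shared across decoder layers (illustrative dimensions)."""
    def __init__(self, dim=256):
        super().__init__()
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))  # cx, cy, w, h
        self.mask_embed = nn.Linear(dim, dim)   # maps queries into the pixel-embedding space

    def forward(self, queries, text_embeds, pixel_embeds):
        # queries: (B, N, D); text_embeds: (B, C, D) category/phrase embeddings; pixel_embeds: (B, D, H, W)
        sim_logits = torch.einsum("bnd,bcd->bnc", queries, text_embeds)       # cross-modal similarity head
        boxes = self.box_head(queries).sigmoid()                               # box regression head
        masks = torch.einsum("bnd,bdhw->bnhw", self.mask_embed(queries), pixel_embeds)  # mask prediction head
        return sim_logits, boxes, masks

heads = SharedHeads()
s, b, m = heads(torch.randn(2, 100, 256), torch.randn(2, 80, 256), torch.randn(2, 256, 64, 64))
print(s.shape, b.shape, m.shape)  # (2, 100, 80) (2, 100, 4) (2, 100, 64, 64)
```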
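
Finally, an illustrative rendering of what a unified SFT sample from item 3 might look like for a word-level versus a sentence-level task; the field names and the caption are hypothetical, while the task descriptions paraphrase those quoted above.

```python
# Illustrative unified SFT samples (field names and the caption are assumptions).
word_level_sample = {
    "task": "Please identify all objects from the given phrase list.",
    "instruction": ["person", "dog", "frisbee", "car"],   # categories shuffled, with negatives mixed in
    "answer": "A man plays frisbee with his dog in a park. The perception result is <PER>.",
}

sentence_level_sample = {
    "task": "Please identify the target according to the following instruction.",
    "instruction": "the dog jumping to catch the frisbee",
    "answer": "A man plays frisbee with his dog in a park. The perception result is <PER>.",
}
```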

Loss & Training

The overall loss is: \(\mathcal{L} = \mathcal{L}_{llm} + \lambda_{word}\mathcal{L}_{word} + \lambda_{sent}\mathcal{L}_{sent} + \mathcal{L}_{mask} + \mathcal{L}_{box}\), where \(\mathcal{L}_{mask} = 5 \cdot L_{BCE} + 5 \cdot L_{DICE}\) and \(\mathcal{L}_{box} = 5 \cdot L_{L1} + 2 \cdot L_{GIoU}\). Training proceeds in two stages: Stage 1 trains only the connector module (on CC3M data); Stage 2 jointly trains the full model (excluding the visual encoder) for 80K steps. Hungarian matching and a denoising strategy are employed to stabilize optimization.
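
Below is a minimal sketch of the weighted mask and box terms, assuming predictions have already been matched to targets (e.g., via Hungarian matching) and that boxes are given in normalized x1y1x2y2 format; the LLM text loss and the word-/sentence-level classification terms are omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def mask_box_loss(pred_masks, gt_masks, pred_boxes, gt_boxes):
    """Weighted mask/box terms: L_mask = 5*BCE + 5*DICE, L_box = 5*L1 + 2*GIoU.
    pred_masks: (N, H, W) logits; gt_masks: (N, H, W) in {0, 1};
    pred_boxes/gt_boxes: (N, 4) normalized x1y1x2y2 (format is an assumption)."""
    bce = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    prob, tgt = pred_masks.sigmoid().flatten(1), gt_masks.flatten(1)
    dice = (1 - (2 * (prob * tgt).sum(-1) + 1) / (prob.sum(-1) + tgt.sum(-1) + 1)).mean()
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return 5 * bce + 5 * dice + 5 * l1 + 2 * giou

# toy usage with valid (x1 <= x2, y1 <= y2) boxes
loss = mask_box_loss(
    torch.randn(2, 64, 64), (torch.rand(2, 64, 64) > 0.5).float(),
    torch.tensor([[0.10, 0.10, 0.50, 0.50], [0.20, 0.30, 0.60, 0.90]]),
    torch.tensor([[0.15, 0.10, 0.55, 0.50], [0.20, 0.25, 0.70, 0.95]]),
)
print(loss.item())
```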

Key Experimental Results

Main Results: Closed-Set Panoptic Segmentation and Open-Set Segmentation

| Method | Type | COCO PQ | COCO mIoU | ADE-OV PQ | ADE-OV mIoU | PC59 mIoU | PAS20 mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSALM | VLLM | 55.9 | 66.6 | 13.7 | 18.2 | 48.5 | 81.3 |
| OMG-LLaVA | VLLM | 53.8 | - | - | - | - | - |
| MVP-LM | VLLM | 56.1 | 66.8 | 19.4 | 20.5 | 44.1 | 85.7 |
| OpenSeeD | Specialist | 59.5 | 68.6 | 19.7 | 23.4 | - | - |

Ablation Study

| Training Data Configuration | RefCOCO val (cIoU) | COCO PQ | COCO mIoU | Notes |
| --- | --- | --- | --- | --- |
| C, R | 77.6 | 56.4 | 66.3 | Base configuration |
| C, R, O | 81.8 | 55.8 | 65.9 | +O365, large gain on sentence-level |
| C, R, O, G | 83.6 | 56.1 | 66.8 | +Grounding, best overall |

| Answer Setting | RefCOCO cIoU | COCO PQ | COCO mIoU |
| --- | --- | --- | --- |
| No description (summary only) | 75.6 | 55.3 | 65.7 |
| Generate existing object names | 75.6 | 54.9 | 65.6 |
| Generate image caption | 75.7 | 55.6 | 66.2 |

Key Findings

  • MVP-LM is the first VLLM method to cover all four perception combinations within a single model.
  • With only 1.3B parameters, MVP-LM outperforms most 7B/13B models on REC tasks (RefCOCO val 93.5).
  • Multi-dataset joint training improves sentence-level perception (RefCOCO) by 6.0 points while keeping word-level perception performance stable.
  • On open-set segmentation (ADE-OV), MVP-LM outperforms PSALM by 5.7 PQ / 2.3 mIoU, demonstrating the framework's generalization ability.
  • The "describe-then-perceive" strategy consistently outperforms direct perception, validating the effectiveness of the CoT paradigm for perception tasks.

Highlights & Insights

  • Four-in-one unified framework: MVP-LM is the first to demonstrate the feasibility and mutual benefit of jointly training all four perception combinations, particularly the positive effect of box-annotated data on segmentation.
  • Dynamic query generation: Elegantly combines LLM generative capacity with the query mechanism of traditional detectors—base queries originate from language understanding while residuals come from visual-language matching.
  • CoT paradigm transferred to perception: Applying the "think first" paradigm from reasoning to perception tasks is an inspiring design choice.

Limitations & Future Work

  • The model scale is relatively small (Phi-1.5 + Swin-B); scaling up may yield further improvements.
  • Performance on some open-set segmentation benchmarks remains below prior methods (e.g., 44.1 mIoU on PC59 vs. PSALM's 48.5).
  • Extension to video and 3D perception has not been explored.
  • The paper suggests that R1-like RL training for perception tasks is a promising direction worth pursuing.

Supplementary Results: Referring Expression Comprehension (REC)

| Method | Params | RefCOCO val | RefCOCO testA | RefCOCO testB |
| --- | --- | --- | --- | --- |
| Shikra | 13B | 87.8 | 91.1 | 81.8 |
| MiniGPTv2 | 7B | 88.7 | 91.7 | 85.3 |
| MVP-LM | 1.3B | 93.5 | 94.5 | 91.6 |
| DeepSeek-VL2 | 200B+ | 95.1 | 96.7 | 95.1 |

With only 1.3B parameters, MVP-LM comprehensively outperforms 7B/13B models on RefCOCO and approaches the 200B+ DeepSeek-VL2, demonstrating the efficiency of the unified framework design.

  • vs. PSALM: PSALM uses a VLLM as a universal decoder for word-level box + mask prediction but neglects sentence-level box prediction; MVP-LM covers all four combinations.
  • vs. LISA: LISA achieves sentence-level segmentation via an appended mask decoder but does not handle word-level tasks; MVP-LM's multi-granular decoder unifies both instruction types.
  • vs. OMG-LLaVA: OMG-LLaVA reports lower panoptic segmentation performance (53.8 PQ vs. MVP-LM's 56.1) and does not report results on open-set segmentation or referring segmentation.

Rating

  • Novelty: ⭐⭐⭐⭐ The four-combination unified framework concept is well-motivated; dynamic query generation and CoT data curation are cleverly designed.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark evaluation with detailed ablations; however, performance on some open-set benchmarks is not particularly outstanding.
  • Writing Quality: ⭐⭐⭐⭐ The taxonomic framework is clear, the method is presented in a well-organized manner, and the capability comparison table (Tab. 1) is intuitive.
  • Value: ⭐⭐⭐⭐ Provides a complete baseline and design paradigm for unified perception with VLLMs.