Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models¶

Conference: ECCV 2024
arXiv: 2404.13013
Code: https://groma-mllm.github.io/
Area: Multimodal VLM
Keywords: visual grounding, region tokenization, grounding, referring, multimodal large language models

TL;DR¶

Groma proposes a new paradigm that embeds localization directly into the visual tokenization process. A region proposer discovers regions of interest (ROIs), which are then encoded into region tokens, so the MLLM can perform accurate referring and grounding without relying on LLM-generated coordinates or external modules. The authors also use GPT-4V with visual prompting to construct Groma Instruct, which they describe as the first grounded chat dataset built with both visual and textual prompts.

Background & Motivation¶

Background: MLLMs perform exceptionally well in image-level understanding (such as captioning and VQA) but generally lack localization capabilities, making them unable to link their understanding to specific regions in the visual context.

Limitations of Prior Work:
- Approach A (LLM-based coordinate outputs, e.g., Kosmos-2, Shikra): These impose high computational overhead on the LLM and struggle with high-resolution inputs. Sequential output is also ill-suited for dense prediction tasks (such as segmentation).
- Approach B (External modules, e.g., LISA using SAM): Requires the image to be processed twice (once by the MLLM and once by the localization module), resulting in high inference latency.

Key Challenge: Localization itself primarily requires perceptual capabilities rather than high-level semantic reasoning. However, existing methods either burden the heavy LLM with localization tasks or introduce separate, external modules.

Goal: Equip MLLMs with precise localization capabilities without relying on external modules or increasing the burden on the LLM.

Key Insight: Borrowing from open-vocabulary detection, grounding is decoupled into localization and recognition. The localization task is then offloaded to the visual tokenizer.

Core Idea: Perceive-then-understand—utilizing the spatial understanding of the visual tokenizer for localization, while the LLM focuses solely on semantic understanding and reasoning.

Method¶

Overall Architecture¶

Pipeline: Input image (\(448 \times 448\)) \(\rightarrow\) DINOv2 Image Encoder \(\rightarrow\) split into two paths: (1) Global image tokens (4-to-1 downsampling) \(\rightarrow\) MLP \(\rightarrow\) LLM, (2) Region Proposer (DDETR) \(\rightarrow\) discover ROIs \(\rightarrow\) Region Encoder (Multi-scale ROIAlign) \(\rightarrow\) region tokens \(\rightarrow\) MLP \(\rightarrow\) LLM \(\rightarrow\) output text (optionally containing grounded references).

Key Designs¶

Image Encoder (DINOv2):
- Function: Encodes the input image into patch-level features, while providing feature pyramids for both the region proposer and region encoder.
- Mechanism: Chooses DINOv2 instead of CLIP as the visual encoder because DINOv2 excels at high-resolution input and fine-grained localization features. On the COCO detection task, DINOv2 at \(448 \times 448\) achieves 43.6 AP, whereas CLIP at \(336 \times 336\) only reaches 32.4 AP.
- Design Motivation: Localization tasks demand precise spatial features. Self-supervised pre-training enables DINOv2 to preserve better spatial structural information.
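- Token Merging: Every 4 spatially adjacent patch tokens are merged into 1 to shorten the LLM input. With DINOv2's 14-pixel patches, a \(448 \times 448\) image yields \(32 \times 32 = 1024\) patch tokens and therefore 256 image tokens after merging; together with up to 100 region tokens this gives at most 356 visual tokens.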
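
The token count above can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the 14-pixel patch size is DINOv2's standard ViT patch size, the cap of 100 region tokens comes from the region proposer described in this section, and merging by concatenating each 2×2 neighborhood is an assumption made here for illustration (the summary only says the tokens are "merged"). The names `visual_token_budget` and `merge_2x2` are hypothetical.

```python
def visual_token_budget(image_size=448, patch_size=14, merge_factor=4, max_regions=100):
    """Count the visual tokens passed to the LLM under the stated assumptions."""
    patches_per_side = image_size // patch_size   # 448 // 14 = 32
    num_patches = patches_per_side ** 2           # 32 * 32 = 1024
    image_tokens = num_patches // merge_factor    # 1024 // 4 = 256
    return {
        "patches": num_patches,
        "image_tokens": image_tokens,
        "max_visual_tokens": image_tokens + max_regions,
    }


def merge_2x2(grid):
    """Merge each 2x2 block of patch features into one token.

    `grid` is an H x W nested list of feature vectors (lists of floats), with
    H and W even. Concatenation is an assumption; any 4-to-1 reduction fits.
    """
    merged = []
    for i in range(0, len(grid), 2):
        row = []
        for j in range(0, len(grid[0]), 2):
            row.append(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1])
        merged.append(row)
    return merged


budget = visual_token_budget()
print(budget)  # {'patches': 1024, 'image_tokens': 256, 'max_visual_tokens': 356}

toy_grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
toy_merged = merge_2x2(toy_grid)  # 4x4 grid of 1-d features -> 2x2 grid of 4-d features
```

Concatenation keeps all four patch features and quadruples the channel width, so the downstream projector has to accept the wider input.
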
Region Proposer (Deformable DETR):
- Function: Acts as a class-agnostic detection head to discover potential regions of interest (ROIs) from the image.
- Mechanism: Constructs a feature pyramid from the last 4 layers of DINOv2 \(\rightarrow\) Deformable DETR transformer \(\rightarrow\) binary classifier (based on localization quality score rather than object class) \(\rightarrow\) generates 300 proposals \(\rightarrow\) NMS (threshold 0.6) + confidence filtering (\(>0.15\)) \(\rightarrow\) extracts the top 100.
- Design Motivation: Localization is a low-level perceptual task that does not require LLM intervention. Delegating localization to a specialized detector head allows pre-training on large-scale detection datasets (COCO, Objects365, OpenImages, V3Det, SA1B), which would be computationally prohibitive if directly processed by the LLM.
- Training Data: 5.7M detection annotations (including 2M filtered SA1B data used to expand to part- and stuff-level proposals).
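
The post-processing of proposals (NMS at 0.6, confidence above 0.15, top 100) can be sketched in plain Python. This is an illustrative greedy NMS under assumed conventions: boxes in `(x1, y1, x2, y2)` format, the NMS threshold read as an IoU threshold, and hypothetical function names. It is not taken from the Groma codebase.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_proposals(boxes, scores, iou_thr=0.6, score_thr=0.15, top_k=100):
    """Return indices of proposals kept after NMS, score filtering, and top-k."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if scores[i] <= score_thr:
            break  # scores are sorted, so every remaining proposal is too weak
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.5, 0.1]
kept = filter_proposals(boxes, scores)
print(kept)  # [0, 2]: box 1 overlaps box 0 (IoU 0.81), box 3 scores below 0.15
```

Dropping low-score boxes before the overlap test gives the same result as filtering afterwards, because a low-score box can never suppress a higher-scoring one in greedy NMS.
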
Region Encoder (Multi-scale ROIAlign):
- Function: Encodes region proposals (from either the proposer or user input) into region tokens.
- Mechanism: Extracts a feature pyramid from the last 3 layers of DINOv2 \(\rightarrow\) multi-scale ROIAlign crops and merges region features into unified region tokens \(\rightarrow\) MLP projects them into the LLM's feature space.
- Design Motivation: Compared to numeric coordinate representations (e.g., Shikra) or discrete positional tokens (e.g., Kosmos-2), region tokens directly carry the underlying regional semantic features, making them more intuitive and interpretable for the LLM.
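
To make the region-encoding idea concrete, here is a heavily simplified, dependency-free sketch of multi-scale ROIAlign. Several things are assumptions for illustration: one bilinear sample per bin, mean pooling within each level, and fusing levels by concatenation. The summary does not specify how Groma fuses the levels, and a real implementation would normally rely on a library operator rather than nested lists. All function names are hypothetical.

```python
def bilinear(plane, y, x):
    """Bilinearly interpolate a 2D feature plane at fractional (y, x)."""
    h, w = len(plane), len(plane[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = plane[y0][x0] * (1 - dx) + plane[y0][x1] * dx
    bottom = plane[y1][x0] * (1 - dx) + plane[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_mean(fmap, box, stride, bins=2):
    """Pool one image-space box on one pyramid level; one value per channel.

    `fmap` is a [C][H][W] nested list and `stride` maps image pixels to cells.
    """
    x1, y1, x2, y2 = (v / stride for v in box)
    bin_w, bin_h = (x2 - x1) / bins, (y2 - y1) / bins
    pooled = []
    for plane in fmap:
        samples = [
            # Sample each bin center; the -0.5 shifts to pixel-center coordinates.
            bilinear(plane, y1 + (i + 0.5) * bin_h - 0.5, x1 + (j + 0.5) * bin_w - 0.5)
            for i in range(bins)
            for j in range(bins)
        ]
        pooled.append(sum(samples) / len(samples))
    return pooled


def region_token(pyramid, box, bins=2):
    """Concatenate the pooled features of every (fmap, stride) pyramid level."""
    token = []
    for fmap, stride in pyramid:
        token.extend(roi_align_mean(fmap, box, stride, bins))
    return token


fine = [[[2.0] * 4 for _ in range(4)]]    # 1 channel, 4x4 cells, stride 4
coarse = [[[5.0] * 2 for _ in range(2)]]  # 1 channel, 2x2 cells, stride 8
token = region_token([(fine, 4), (coarse, 8)], (0, 0, 16, 16))
print(token)  # [2.0, 5.0]
```

The same function accepts a user-drawn box or a proposer output, which is what lets one encoder serve both referring and grounding.
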
Unified Referring & Grounding Format:
- Function: Registers region tokens into the LLM vocabulary using proxy tokens <r1>, <r2>, ..., <rn> to unify inputs (referring) and outputs (grounding).
- Mechanism: During grounding output, the LLM refers to proxy tokens to link text with image regions; during referring input, user-specified regions are similarly encoded into region tokens and inserted into the instruction. A special [grounding] token prompts the model to generate grounded responses.
- Design Motivation: A unified format avoids having to design separate encoding schemes for referring and grounding, simplifying the architecture.
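
A small sketch shows how proxy tokens could tie text to boxes in both directions. The `<rN>` tokens and the `[grounding]` marker come from the description above; the `<region>` placeholder, the position of `[grounding]` at the start of the prompt, and both function names are hypothetical choices for this example, not Groma's actual prompt template.

```python
import re

PROXY = re.compile(r"<r(\d+)>")


def build_referring_prompt(question, region_index, grounded=False):
    """Insert a proxy token for a user-specified region into an instruction."""
    prompt = question.replace("<region>", f"<r{region_index}>")
    return f"[grounding] {prompt}" if grounded else prompt


def resolve_grounding(text, proposals):
    """Map each proxy token in the LLM output to the box it names.

    `proposals` is the ordered list of region boxes; <r1> is the first one.
    """
    links = []
    for match in PROXY.finditer(text):
        idx = int(match.group(1))
        if 1 <= idx <= len(proposals):  # ignore tokens naming unknown regions
            links.append((idx, proposals[idx - 1]))
    return links


proposals = [(0, 0, 10, 10), (5, 5, 20, 20)]
prompt = build_referring_prompt("What is <region> doing?", 2, grounded=True)
links = resolve_grounding("A dog <r2> next to a ball <r1>.", proposals)
print(prompt)  # [grounding] What is <r2> doing?
print(links)   # [(2, (5, 5, 20, 20)), (1, (0, 0, 10, 10))]
```

Because the LLM only emits indices, box precision depends entirely on the proposals rather than on the model's ability to generate coordinates.
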
Groma Instruct Dataset (GPT-4V Assisted):
- Function: Constructs 30K visual grounded dialogue samples for instruction fine-tuning.
- Mechanism: Targets images from Visual Genome that contain dense region annotations \(\rightarrow\) labels numeric markers on the images using Set-of-Mark (SoM) technology \(\rightarrow\) feeds the marked images along with rich textual contexts (region descriptions, image descriptions, QA pairs) into GPT-4V \(\rightarrow\) uses in-context learning to generate grounded conversations.
- Design Motivation: Existing dialogue benchmarks lack grounded information, which impairs the ability of grounded MLLMs to maintain grounding capabilities during long-turn dialogues. Groma Instruct is the first grounded chat dataset constructed using both visual and textual prompting.
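
The data-construction recipe can be pictured as pairing numeric marks with textual context before querying GPT-4V. The sketch below only assembles that context. It makes no API call, and the record layout, prompt wording, and function name are all hypothetical rather than the authors' actual pipeline.

```python
def build_som_context(regions, image_caption, qa_pairs=()):
    """Assign numeric marks to regions and assemble the textual context.

    `regions` is a list of dicts with 'box' and 'description' keys (an assumed
    layout). Returns the mark-to-box mapping and the context string.
    """
    marks = {}
    lines = [f"Image description: {image_caption}", "Marked regions:"]
    for mark, region in enumerate(regions, start=1):
        marks[mark] = region["box"]  # the same number is drawn on the image
        lines.append(f"[{mark}] {region['description']}")
    for question, answer in qa_pairs:
        lines.append(f"Q: {question} A: {answer}")
    return marks, "\n".join(lines)


regions = [
    {"box": (10, 20, 110, 220), "description": "a brown dog"},
    {"box": (150, 180, 190, 220), "description": "a red ball"},
]
marks, context = build_som_context(
    regions, "A dog playing in a park.", [("What is the dog chasing?", "A ball")]
)
print(context)
```

Keeping the mark-to-box mapping lets any mark number in the generated conversation be converted back into a region reference afterwards.
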

Loss & Training¶

Three-stage training:
- Stage 1 - Detection Pre-training (12 epochs): Trains the region proposer only (image encoder is frozen) using 5.7M detection data, with \(lr=2e-4\) and \(batch\ size=64\).
- Stage 2 - Alignment Pre-training (2 epochs): Trains the region encoder and MLP projector (all other parameters are frozen) using 3.2M vision-language data, with \(lr=1e-4\) and \(batch\ size=128\).
- Stage 3 - Instruction Fine-tuning (1 epoch): Unfreezes the LLM, training with 857K high-quality dataset samples (including Groma Instruct), with \(lr=1e-5\) and \(batch\ size=128\).
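
The schedule can be written down as a small configuration, which also gives a rough sense of how many optimizer steps each stage takes. This is a sketch under assumptions: the dataset sizes are the rounded figures quoted above, the trainable-module list for Stage 3 assumes the region encoder and projector stay trainable once the LLM is unfrozen (the summary only states that the LLM is unfrozen), and all module names are hypothetical labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple  # hypothetical module labels, not real attribute names
    samples: int      # rounded dataset size from the summary
    epochs: int
    lr: float
    batch_size: int

    def steps(self):
        """Approximate optimizer steps: ceil(samples * epochs / batch_size)."""
        return -(-self.samples * self.epochs // self.batch_size)


STAGES = (
    Stage("detection_pretraining", ("region_proposer",), 5_700_000, 12, 2e-4, 64),
    Stage("alignment_pretraining", ("region_encoder", "mlp_projector"), 3_200_000, 2, 1e-4, 128),
    # Assumption: the modules trained in stage 2 remain trainable in stage 3.
    Stage("instruction_finetuning", ("region_encoder", "mlp_projector", "llm"), 857_000, 1, 1e-5, 128),
)

for stage in STAGES:
    print(f"{stage.name}: ~{stage.steps():,} steps at lr={stage.lr}")
```

Under these rounded figures, detection pre-training accounts for the large majority of steps, and it runs without the LLM.
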

Total training cost: roughly 8 days on 8 \(\times\) A100.

Key Experimental Results¶

Main Results¶

Dataset	Metric	Groma	Ferret	MiniGPT-v2	Shikra	Gain (vs. best baseline)
RefCOCO val	Acc@0.5	89.53	87.49	88.69	87.01	+0.84
RefCOCO testA	Acc@0.5	92.09	91.35	91.65	90.61	+0.44
RefCOCO testB	Acc@0.5	86.26	82.45	85.33	80.24	+0.93
RefCOCO+ val	Acc@0.5	83.90	80.78	79.97	81.60	+2.30
RefCOCOg val	Acc@0.5	86.37	83.93	84.44	82.27	+1.93
REC Average	Acc@0.5	86.52	83.91	84.29	82.93	+2.23
LVIS-Ground	AR	28.8	16.8	11.4	4.9	+12.0
LVIS-Ground	AR@0.5	30.3	16.3	11.2	2.0	+14.0

Groma outperforms the listed grounded MLLMs of comparable scale on every row and leads the second-best model on LVIS-Ground, Ferret, by 12.0 AR (28.8 vs 16.8).
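
For readers unfamiliar with the REC metric, referring expression comprehension on the RefCOCO family is conventionally scored as accuracy at an IoU threshold of 0.5: a prediction counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5. A minimal sketch, with illustrative function names and an assumed `(x1, y1, x2, y2)` box format:

```python
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


predictions = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 12), (8, 8, 20, 20)]
acc = accuracy_at_iou(predictions, targets)
print(acc)  # 0.5: the first pair has IoU 0.83, the second only 0.017
```
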

Ablation Study¶

Configuration	Key Metric	Description
CLIP \(336 \times 336\)	32.4 AP (COCO det)	Weak localization capabilities of CLIP
DINOv2 \(336 \times 336\)	38.9 AP (+6.5)	DINOv2 significantly outperforms CLIP
DINOv2 \(448 \times 448\)	43.6 AP (+4.7)	Higher resolution provides further improvement
Frozen LLM (finetune)	Grounding 84.02%, Referring CIDEr 148.0	Strong localization capabilities can be achieved even with a frozen LLM
Unfrozen LLM (finetune)	Grounding 86.52%, Referring CIDEr 158.4	Stronger comprehension capabilities when unfrozen
4\(\times\) Token Merge	REC Average 86.47%	4-to-1 downsampling is practically lossless
No Merging	REC Average 86.55%	Only +0.08% difference

Key Findings¶

Localization can be decoupled from comprehension: When the LLM is frozen, Groma still maintains grounding capabilities comparable to Ferret (84.02% vs 83.91%), demonstrating that localization does not strictly require LLM capacity.
Token merging is nearly lossless: 4\(\times\) downsampling changes the REC average by only 0.08 points (86.47% vs 86.55%), suggesting that the decoupled design gives up very little accuracy in exchange for a much shorter LLM input.
LVIS-Ground exposes limitations of existing methods: All methods perform poorly on small-object localization (\(AR@s\)), and most tend to predict only a single box per query, which reflects training datasets such as RefCOCO, where each query is annotated with only one target.
Performs comparably to or better than GLaMM (which requires separate referring and grounding designs) on region captioning.

Highlights & Insights¶

The "Perceive-then-Understand" design philosophy: Carrying out localization within the visual tokenizer instead of the LLM mimics human visual processing. This represents a highly generalizable paradigm.
Training efficiency from decoupled architecture: Localization can be pre-trained on million-scale detection datasets without involving the LLM, which would be computationally prohibitive if the LLM itself had to process that data.
LVIS-Ground benchmark: Addresses an evaluation gap by introducing the AS-MANY-Protocol evaluation scheme and a 1203-class grounding benchmark.
Groma Instruct construction: Combining SoM visual prompting, GPT-4V, and rich textual context is an effective recipe for constructing grounded conversation data.
Unified region token: Elegant design where the same token representation handles both referring inputs and grounding outputs.

Limitations & Future Work¶

Does not support free-form region inputs (such as clicks or scribbles), being restricted only to bounding boxes.
Does not support pixel-level grounding (such as segmentation masks); the authors suggest replacing the box proposer with Mask2Former.
DINOv2 features are not naturally aligned with text, resulting in slightly weaker performance on conversation and reasoning tasks compared to CLIP-based methods.
The recall of the region proposer sets the performance ceiling of the entire system—if a target region is not proposed, it cannot be understood.
Insufficient small-object annotations in the training data limit the capability to locate small objects.

vs Shikra/Kosmos-2 (LLM-based coordinate outputs): Groma avoids the computational bottleneck of the LLM processing high-resolution inputs and achieves superior localization accuracy (LVIS-Ground AR: 28.8 vs 4.9).
vs LISA/GLaMM (External localization modules): Groma does not require an external SAM module, thereby providing more efficient inference while unifying the representation of referring and grounding.
vs Ferret (Spatial-aware visual sampling): Groma actively discovers ROIs through a region proposer rather than passively receiving inputs, outperforming Ferret by 12.0 AR on LVIS-Ground (28.8 vs 16.8).
vs GPT4RoI (Simple pooling for region features): Groma extracts more refined, hierarchical features using multi-scale ROIAlign.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Embedding localization into visual tokenization is a completely fresh paradigm, and the "perceive-then-understand" design philosophy is highly inspiring.
Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers three tasks—grounding, referring, and VQA—proposes the new LVIS-Ground benchmark, and provides comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐⭐ The logic is exceptionally clear; the comparison of the three paradigms (Figure 2) is highly intuitive, and the motivation follows naturally from it.
Value: ⭐⭐⭐⭐☆ Introduces a highly influential new paradigm, though practical application scenarios are somewhat constrained by the lack of support for masks and free-form inputs.