Grounding Language Models for Visual Entity Recognition
Conference: ECCV 2024
arXiv: 2402.18695
Code: https://github.com/MrZilinXiao/AutoVER
Area: Information Retrieval
Keywords: Visual Entity Recognition, Multimodal Large Language Models, Retrieval-Augmented Generation, Constrained Decoding, Knowledge Grounding
TL;DR
AutoVER is the first method to apply Multimodal Large Language Models (MLLMs) to large-scale visual entity recognition. It builds retrieval capability directly into the MLLM and combines contrastive training with trie-constrained decoding, substantially outperforming prior methods such as PaLI-17B on the Oven-Wiki benchmark.
Background & Motivation
- Challenges of Visual Entity Recognition (VER):
- The answer space exceeds 6 million Wikipedia entities, rendering classifier-based approaches infeasible.
- Generative VQA models are prone to hallucination: the generated text may not map to any legitimate entity.
- Existing methods overlook the visual information of candidate entities.
- Generalization to out-of-domain unseen entities is difficult.
- Oven-Wiki Benchmark: Given an image + a question, find the exact entity answer from Wikipedia.
- Core Idea: Reframe entity recognition as a constrained sequence-to-sequence generation problem.
Method
Overall Architecture
AutoVER = MLLM (based on LLaVA/Vicuna) + Multimodal Entity Encoder + Retrieval-Augmented Constrained Decoding
Key Designs
1. Joint Contrastive-Generative Training
- MLLM side: A special token <ret> is added, and its last-layer hidden state serves as the query representation \(Q\).
- Entity encoder side: A 2-layer Transformer fuses the entity image and textual description, outputting the entity representation \(E\).
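- Contrastive loss (InfoNCE): \(\mathcal{L}_{query2ent} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(sim(Q_i, E_i)/\tau)}{\sum_j \exp(sim(Q_i, E_j)/\tau)}\), where \(N\) is the number of query-entity pairs in the batch, \(E_i\) is the positive entity for query \(Q_i\), and \(\tau\) is the temperature.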
- Language modeling loss (next token prediction): \(\mathcal{L}_{LM}\)
- Total loss: \(\mathcal{L} = \mathcal{L}_{LM} + \lambda_r \cdot \mathcal{L}_{query2ent}\)
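
The joint objective above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: cosine similarity for \(sim\), the default `tau=0.07`, and the function names are all assumptions made for illustration, and `lm_loss` is passed in as a plain number standing in for \(\mathcal{L}_{LM}\).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def query2ent_loss(queries, entities, tau=0.07):
    """InfoNCE over a batch: entity i is the positive for query i,
    and every other entity in the batch acts as a negative.
    tau=0.07 is an illustrative default, not a value from the paper."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, e) / tau for e in entities]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denominator)
    return total / len(queries)

def total_loss(lm_loss, queries, entities, lambda_r=1.0, tau=0.07):
    """L = L_LM + lambda_r * L_query2ent."""
    return lm_loss + lambda_r * query2ent_loss(queries, entities, tau)

# Toy batch: two queries with matching vs. swapped entity vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
aligned = query2ent_loss(Q, [[1.0, 0.0], [0.0, 1.0]])
swapped = query2ent_loss(Q, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss close to zero while the swapped pairing yields a large loss, which is the behavior the contrastive term is meant to reward.
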
2. Hard Negative Mining
- vision-hard: A pretrained ViT classifier is used to identify visually similar entities (sharing predicted categories).
- kb-hard: Utilizing the Wikidata category hierarchy, entities that share a parent node are treated as knowledge-similar entities.
- Duplicate entities in the same batch are avoided via rejection sampling.
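
A rough sketch of how the two hard-negative pools and the rejection step could fit together is given below. The lookup tables `vision_similar` and `kb_siblings`, the entity names, and parameters such as `n_each` and `max_tries` are hypothetical; the notes do not describe how the actual sampler is implemented.

```python
import random

def sample_hard_negatives(positive, vision_similar, kb_siblings, in_batch,
                          n_each=1, rng=random, max_tries=100):
    """Draw hard negatives for one positive entity from two pools:
    visually similar entities and knowledge-base siblings.
    Candidates that equal the positive or already appear in the batch
    are rejected and resampled."""
    chosen = []
    pools = (vision_similar.get(positive, []), kb_siblings.get(positive, []))
    for pool in pools:
        picked, tries = 0, 0
        while pool and picked < n_each and tries < max_tries:
            tries += 1
            candidate = rng.choice(pool)
            if candidate == positive or candidate in in_batch or candidate in chosen:
                continue  # rejection step: resample instead of duplicating
            chosen.append(candidate)
            picked += 1
    return chosen

# Hypothetical lookup tables keyed by placeholder entity names.
vision_similar = {"ent_a": ["ent_b", "ent_c"]}   # share a predicted ViT category
kb_siblings = {"ent_a": ["ent_c", "ent_d"]}      # share a Wikidata parent node
batch = {"ent_a", "ent_b"}                       # entities already in the batch
negatives = sample_hard_negatives("ent_a", vision_similar, kb_siblings, batch,
                                  rng=random.Random(0))
```

Here `ent_b` is rejected because it is already in the batch, and `ent_c` cannot be drawn twice, so the two pools contribute distinct negatives.
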
3. Retrieval-Augmented Constrained Decoding (Inference Phase)
- Pre-cache all entity vectors using the trained entity encoder \(\to\) Construct an entity vector database.
- During inference, perform top-\(k\) similarity search using the representation of the <ret> token to retrieve \(k=300\) candidates.
- Dynamically construct a prefix tree (trie) covering candidate entity identifiers.
- During autoregressive generation, the prefix tree constrains the valid tokens at each step, eliminating invalid decoding paths.
- Guarantees that the generated content always matches an entity in the knowledge base.
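
The inference steps above can be sketched end to end in a few lines of Python. This is a toy illustration under several assumptions: dot-product similarity for the search, whitespace-split entity names in place of real tokenizer ids, a hypothetical `<eos>` marker, and a hand-written `score_fn` standing in for the MLLM's next-token scores. The paper retrieves \(k=300\) candidates; the example uses \(k=2\).

```python
END = "<eos>"  # hypothetical end-of-sequence marker

def top_k(query_vec, entity_vecs, k):
    """Return the ids of the k entities with the highest dot-product score."""
    scored = sorted(
        entity_vecs.items(),
        key=lambda item: sum(q * e for q, e in zip(query_vec, item[1])),
        reverse=True,
    )
    return [entity_id for entity_id, _ in scored[:k]]

def build_trie(token_sequences):
    """Nested-dict prefix tree over the candidates' token sequences."""
    root = {}
    for tokens in token_sequences:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    """Tokens that keep `prefix` on a path to some candidate entity."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())

def constrained_greedy_decode(trie, score_fn):
    """Greedy decoding that only ever picks among trie-allowed tokens.
    `score_fn(prefix, token)` stands in for the model's next-token score."""
    prefix = []
    while True:
        options = allowed_next(trie, prefix)
        if not options:  # empty trie: nothing can be generated
            return prefix
        best = max(options, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)

# Toy entity database (vectors would come from the entity encoder).
entity_vecs = {
    "golden retriever": [0.9, 0.1],
    "golden gate bridge": [0.2, 0.8],
    "labrador": [0.8, 0.3],
}
candidates = top_k([1.0, 0.0], entity_vecs, k=2)
trie = build_trie(name.split() for name in candidates)

# The toy scorer strongly prefers "gate", which is not on any candidate path.
preferences = {"golden": 3, "gate": 5, "labrador": 2, "retriever": 1}
decoded = constrained_greedy_decode(trie, lambda prefix, tok: preferences.get(tok, 0))
```

Even though the toy scorer prefers "gate", that token is never offered after "golden" because "golden gate bridge" was not retrieved, so the decoded sequence is forced onto a retrieved candidate.
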
Loss & Training
- Initialization: LLaVA architecture, Vicuna-7B/13B, CLIP ViT-L/14-336px.
- Training data: approximately 5 million query-entity pairs from Oven-Wiki.
- 32× V100 GPUs, batch size 256.
- \(\lambda_r = 1\), entity descriptions truncated to 77 tokens.
Key Experimental Results
Main Results (Oven-Wiki Validation Accuracy, %)
| Method | Entity seen | Entity unseen | Entity HM | Query seen | Query unseen | Overall HM |
|---|---|---|---|---|---|---|
| CLIP Fusion | 32.7 | 4.3 | 7.7 | 33.4 | 2.2 | 5.4 |
| PaLI-17B | 30.6 | 12.4 | 17.6 | 44.2 | 22.4 | 22.1 |
| GPT-4V (zero-shot) | 29.8 | 19.3 | 23.4 | 56.5 | 52.7 | 32.9 |
| AutoVER-7B | 61.5 | 21.7 | 32.1 | 69.0 | 31.4 | 36.8 |
| AutoVER-13B | 63.6 | 24.5 | 35.6 | 68.6 | 32.3 | 39.2 |
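
The harmonic-mean (hm) columns can be sanity-checked with a short calculation. The sketch below assumes that each hm is the harmonic mean of the seen and unseen accuracies of a split, and that the overall hm is the harmonic mean of the entity-split hm and the query-split hm (the latter is not listed in the table). It uses the PaLI-17B row as input.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two accuracies; dominated by the smaller value."""
    return 2 * a * b / (a + b)

# PaLI-17B row from the table above
entity_hm = harmonic_mean(30.6, 12.4)   # entity seen, entity unseen
query_hm = harmonic_mean(44.2, 22.4)    # query seen, query unseen (hm not listed)
overall_hm = harmonic_mean(entity_hm, query_hm)

print(round(entity_hm, 1), round(overall_hm, 1))  # 17.6 22.1
```

Under these assumptions the computed values match the table's 17.6 and 22.1, and the calculation shows why a low unseen accuracy pulls the summary numbers down sharply.
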
Ablation Study
- Removing contrastive training \(\to\) entity seen accuracy drops by approximately 10%.
- Removing constrained decoding \(\to\) entity unseen accuracy drops significantly.
- Removing hard negative mining \(\to\) fine-grained recognition capability decreases.
Key Findings
- AutoVER-7B roughly doubles entity-seen accuracy over PaLI-17B (30.6% to 61.5%) with fewer parameters.
- Constrained decoding guarantees that every generated answer corresponds to a real entity in the knowledge base, which rules out hallucinated entity names.
- Strong zero-shot transfer performance on A-OKVQA-Ent suggests that the approach generalizes beyond Oven-Wiki.
- AutoVER-13B achieves 53.7% on the human evaluation set, still well below Human+Search at 77.7%.
Highlights & Insights
- Unified retrieval-and-generation framework: Retrieval capability is built directly into the MLLM, eliminating the need for external retrievers.
- Prefix-tree constrained decoding keeps every generated answer grounded in the knowledge base, so the model cannot output a non-existent entity (the selected entity can still be the wrong one).
- Joint contrastive-generative training balances retrieval precision and generation quality.
- Hard negative mining strategy (dual-path: visual + knowledge base) effectively enhances fine-grained discrimination capabilities.
Limitations & Future Work
- Accuracy on the entity unseen subset remains low (21.7% for AutoVER-7B, 24.5% for AutoVER-13B), indicating that out-of-domain generalization is still a primary bottleneck.
- Pre-caching the entity database requires substantial storage, and the top-\(k\) search adds inference overhead.
- The prefix tree may face efficiency issues when the candidate pool is extremely large (e.g., more than 10,000 candidates).
- Performance on the query split, where questions demand reasoning such as spatial relationships and common sense, still has room for improvement.
Related Work & Insights
- FROMAGe/GILL inspired the design of adding retrieval tokens in MLLMs.
- Generative Entity Linking (GENRE) offers insights for knowledge grounding in the textual domain.
- Takeaway: The constrained decoding framework could be extended to other tasks requiring structured output (e.g., knowledge graph completion, structured information extraction).
Rating
- Novelty: ⭐⭐⭐⭐⭐ (Unified RAG framework + constrained decoding eliminating hallucinations)
- Technical Depth: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple splits + zero-shot transfer + ablations)
- Writing Quality: ⭐⭐⭐⭐
- Overall Recommendation: ⭐⭐⭐⭐⭐