E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition
Conference: ACL 2026
arXiv: 2604.17319
Code: https://github.com/Finch-coder/E2E-GMNER
Area: Object Detection
Keywords: Multimodal Named Entity Recognition, End-to-End Generation, Visual Grounding, Gaussian Perturbation, CoT Reasoning
TL;DR
Proposes E2E-GMNER, the first end-to-end GMNER framework that unifies entity recognition, semantic classification, visual grounding, and implicit knowledge reasoning within a single multimodal large language model. It adaptively determines the utility of visual/knowledge cues through CoT reasoning and introduces Gaussian Risk-aware Box Perturbation (GRBP) to enhance the robustness of generative box prediction.
Background & Motivation
Background: Grounded Multimodal Named Entity Recognition (GMNER) requires jointly identifying entities in text, predicting semantic types, and localizing each entity to its corresponding visual region in the image. Existing methods like H-Index, TIGER, and RiVEG primarily adopt pipeline architectures.
Limitations of Prior Work: (1) Pipeline architectures decouple text entity recognition and visual grounding into independent modules (e.g., separate NER taggers, external object detectors), which propagates errors between stages and prevents joint optimization; (2) Existing methods resolve text-visual ambiguity through implicit cross-modal alignment but lack an explicit mechanism for judging when visual evidence or external knowledge is actually useful, so noisy visual cues can degrade performance; (3) In generative box prediction, supervision with a single hard target is sensitive to annotation noise and coordinate discretization errors.
Key Challenge: End-to-end unification vs. task-specific requirements: how can a single model simultaneously optimize entity recognition, semantic classification, and visual grounding, three fundamentally different tasks?
Goal: Design the first end-to-end GMNER framework to eliminate error accumulation in pipelines.
Key Insight: Model GMNER as an instruction-tuned conditional generation task, leveraging the unified generation capabilities of multimodal large language models (MLLMs).
Core Idea: End-to-end generation + CoT adaptive reasoning + Gaussian soft supervision, working in synergy to address the three core problems of GMNER.
Method
Overall Architecture
Given an image-text pair and task instructions, a LoRA-adapted multimodal LLM first performs CoT reasoning (visual cue analysis + background knowledge analysis), then autoregressively generates structured entity records (Entity Name|Semantic Type|Bounding Box Coordinates). During training, GRBP replaces hard-box supervision.
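The structured record format can be illustrated with a small round-trip sketch. It assumes the `Entity Name|Type|[x1,y1,x2,y2]` serialization described in this section, integer coordinates, and a newline separator between records; the separator, the `EntityRecord` type, and the example entity are hypothetical and not taken from the paper. How ungroundable entities are encoded is not stated here, so the parser simply skips any record that does not carry four coordinates.

```python
import re
from typing import List, NamedTuple, Tuple


class EntityRecord(NamedTuple):
    """One GMNER prediction: entity name, semantic type, bounding box."""
    name: str
    etype: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)


# "Entity Name|Type|[x1,y1,x2,y2]"; the name itself may contain '|'.
RECORD_RE = re.compile(r"^(?P<name>.+?)\|(?P<etype>[^|]+)\|\[(?P<box>[^\]]*)\]$")


def serialize(records: List[EntityRecord], sep: str = "\n") -> str:
    """Concatenate records into one target string (separator is an assumption)."""
    return sep.join(
        f"{r.name}|{r.etype}|[{','.join(str(v) for v in r.box)}]" for r in records
    )


def parse(text: str, sep: str = "\n") -> List[EntityRecord]:
    """Parse generated text back into records, skipping malformed lines."""
    out: List[EntityRecord] = []
    for line in text.split(sep):
        m = RECORD_RE.match(line.strip())
        if m is None:
            continue
        try:
            coords = tuple(int(v) for v in m.group("box").split(","))
        except ValueError:
            continue  # non-integer coordinate tokens
        if len(coords) != 4:
            continue  # a box needs exactly four coordinates
        out.append(EntityRecord(m.group("name"), m.group("etype"), coords))
    return out


if __name__ == "__main__":
    demo = [EntityRecord("Taylor Swift", "PER", (12, 30, 240, 400))]
    print(serialize(demo))
    print(parse(serialize(demo)))
```

Tolerating malformed lines matters for a generative model, since nothing forces the decoder to emit well-formed records.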
Key Designs
- End-to-End Generative GMNER:
  - Function: Eliminates error accumulation from pipeline architectures.
  - Mechanism: Models GMNER as conditional generation: Input = [Instruction; (Image, Text)], Output = [Reasoning Sequence \(R\); Entity Record Set \(\{(e_i, c_i, b_i)\}\)]. Each entity record is serialized in the format "Entity Name|Type|[x1,y1,x2,y2]", and all records are concatenated to form the final prediction. Training uses the standard autoregressive MLE loss.
  - Design Motivation: A single generation process lets information flow between entity recognition and visual grounding, so the two are optimized jointly.
- CoT Instruction Tuning for Adaptive Reasoning:
  - Function: Enables the model to autonomously determine the utility of visual evidence or background knowledge.
  - Mechanism: Before generating entity records, the model outputs a reasoning sequence \(R\), comprising visual cue analysis (whether visual evidence in the image corresponds to text entities) and background knowledge analysis (whether external knowledge is needed for disambiguation). During training, reasoning sequences are generated by a stronger external LLM via API for supervision; during inference, the model generates them autonomously without depending on external models.
  - Design Motivation: Avoids blind usage of noisy visual cues or irrelevant knowledge, allowing the model to "think before acting."
- Gaussian Risk-aware Box Perturbation (GRBP):
  - Function: Enhances the robustness of generative box predictions against annotation noise and discretization errors.
  - Mechanism: During training, ground-truth (GT) boxes are probabilistically perturbed: center positions receive Gaussian noise (\(\delta_x, \delta_y \sim \mathcal{N}(0, \beta^2)\)), and width/height are multiplied by Gaussian scaling factors. An IoU guard ensures the perturbed box maintains an IoU \(\geq \tau\) with the original box. This replaces hard-target supervision with Gaussian-weighted soft targets: larger perturbations are sampled with lower probability, so training still follows empirical risk minimization while tolerating small geometric deviations.
  - Design Motivation: Generative box prediction discretizes coordinates into token sequences, so minute deviations can cause disproportionately large training losses. GRBP alleviates this issue through soft supervision.
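The GRBP mechanism can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: center noise is scaled by the box width/height, `gamma` is taken to be the standard deviation of the width/height scaling factors, `p` is the probability of perturbing at all, and the IoU guard is realized by rejection sampling with a fallback to the GT box. The function names and default values are hypothetical.

```python
import random
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grbp_perturb(
    box: Box,
    beta: float = 0.05,   # std of center noise (assumed relative to box size)
    gamma: float = 0.05,  # std of width/height scale factor (assumed meaning)
    tau: float = 0.7,     # IoU guard threshold
    p: float = 0.5,       # probability of perturbing at all (assumption)
    max_tries: int = 10,  # rejection-sampling budget (assumption)
    rng: Optional[random.Random] = None,
) -> Box:
    """Return a perturbed training box whose IoU with `box` is at least tau."""
    if rng is None:
        rng = random.Random()
    if rng.random() >= p:
        return box  # keep the hard GT target for this sample
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2, y1 + h / 2
    for _ in range(max_tries):
        # Gaussian shift of the center.
        ncx = cx + rng.gauss(0.0, beta) * w
        ncy = cy + rng.gauss(0.0, beta) * h
        # Gaussian scaling of width and height, kept strictly positive.
        nw = w * max(1e-3, rng.gauss(1.0, gamma))
        nh = h * max(1e-3, rng.gauss(1.0, gamma))
        cand = (ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2)
        if iou(box, cand) >= tau:  # IoU guard
            return cand
    return box  # no candidate passed the guard; fall back to the GT box


if __name__ == "__main__":
    print(grbp_perturb((10.0, 20.0, 110.0, 220.0), rng=random.Random(0)))
```

Because small Gaussian draws are the most likely, boxes near the GT dominate the sampled targets, which matches the "larger perturbation, lower probability" behavior described above. A real pipeline would also clip to the image bounds and round to the coordinate vocabulary before serialization.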
Loss & Training
Training uses the standard autoregressive MLE loss \(\mathcal{L} = -\sum_t \log p_\theta(y_t \mid y_{<t}, \text{Instruction}, I, T)\), where \(I\) is the image and \(T\) is the text. After GRBP perturbation, the box coordinates in the target sequence act as soft targets.
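One way to read the soft-target claim formally is as an expectation over perturbed boxes. The following is an interpretation offered for intuition, not an objective stated in the paper. The notation \(q_{\beta,\tau}(\tilde{b} \mid b)\) for the perturbation distribution (Gaussian noise restricted by the IoU guard) and \(\tilde{y}\) for the target sequence with each GT box \(b\) replaced by its perturbed version \(\tilde{b}\) are assumptions introduced here:

\[
\mathcal{L}_{\text{GRBP}} = \mathbb{E}_{\tilde{b} \sim q_{\beta,\tau}(\cdot \mid b)} \left[ -\sum_t \log p_\theta(\tilde{y}_t \mid \tilde{y}_{<t}, \text{Instruction}, I, T) \right]
\]

Under this reading, drawing one perturbed box per training step gives a single-sample Monte Carlo estimate of the expectation, and setting the perturbation probability to zero recovers the hard-target loss \(\mathcal{L}\).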
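Key Experimental Results
Main Results
The paper evaluates on the Twitter-GMNER and Twitter-FMNERG benchmarks. The table below covers Twitter-GMNER only, and the E2E-GMNER row is a qualitative summary rather than a reported score: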
| Method | Twitter-GMNER (GMNER) | Twitter-GMNER (MNER) |
|---|---|---|
| GMDA (Pipeline) | 58.61 | - |
| GEM (Pipeline+MLLM) | 59.83 | 83.15 |
| E2E-GMNER | Highly Competitive | Highly Competitive |
Ablation Study
| Configuration | Effect | Description |
|---|---|---|
| w/o CoT Reasoning | Decrease | Adaptive visual/knowledge utilization is critical |
| w/o GRBP | Decrease | Box prediction robustness is compromised |
| Hard Box vs. GRBP Soft Supervision | GRBP Superior | Tolerates annotation noise |
| End-to-End vs. Pipeline | End-to-End Superior | Eliminates error accumulation |
Key Findings
- The end-to-end framework achieves highly competitive performance on the main GMNER task, validating the effectiveness of unified optimization.
- CoT reasoning allows the model to actively ignore noisy visual cues rather than being misled, which is crucial for improving entity localization accuracy.
- The IoU guard mechanism in GRBP ensures that perturbations are not excessive, balancing the flexibility of soft supervision with accuracy.
- The model does not rely on external models during inference, maintaining efficient end-to-end reasoning.
Highlights & Insights
- The significance of the first end-to-end GMNER framework lies not only in performance gains but also in proving that entity recognition and visual grounding can effectively synergize within a unified generation framework rather than being processed step-by-step.
- GRBP carries the idea of data augmentation over to the design of supervision targets: instead of augmenting input data, it "augments" labels, generating soft supervision signals by probabilistically perturbing GT boxes. This concept is transferable to other generative localization tasks.
- CoT reasoning acts as an "attention gating" mechanism: allowing the model to evaluate the reliability of visual/knowledge signals before usage is a more intelligent multimodal fusion strategy than simple cross-attention.
Limitations & Future Work
- E2E-GMNER may still underperform specialized pipeline methods on certain categories, especially pipelines that use powerful external detectors.
- Training for CoT reasoning depends on external LLMs (like GPT-4o) to generate reasoning sequences, introducing additional data preparation costs.
- Hyperparameters for GRBP (\(\beta, \gamma, \tau\)) require tuning; different datasets may need different settings.
- Currently validated only on Twitter image-text pair datasets; generalization to other domains (news, e-commerce) remains unknown.
Related Work & Insights
- vs. RiVEG (Li et al., 2024): RiVEG uses an MLLM as an assistant but remains a pipeline architecture; E2E-GMNER is fully end-to-end.
- vs. MAKAR (Lin et al., 2025): Uses an MLLM multi-agent system to resolve semantic ambiguity but still involves pipeline components; E2E-GMNER is more streamlined.
- vs. MQSPN (Tang et al., 2025): Uses set prediction to mitigate exposure bias but does not address noise sensitivity in box prediction; E2E-GMNER’s GRBP directly tackles this challenge.
Rating
- Novelty: ⭐⭐⭐⭐ First end-to-end GMNER + GRBP soft supervision innovation
- Experimental Thoroughness: ⭐⭐⭐⭐ Two benchmarks + comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition, detailed method description
- Value: ⭐⭐⭐⭐ Provides an effective demonstration of the end-to-end paradigm for multimodal NER