Mario: Multimodal Graph Reasoning with Large Language Models¶
Conference: CVPR 2026 arXiv: 2603.05181 Code: Coming soon Area: Graph Learning Keywords: Multimodal graphs, LLM reasoning, vision-language alignment, modality-adaptive routing, instruction tuning
TL;DR¶
Mario is a framework for LLM reasoning on multimodal graphs (MMGs). It achieves topology-aware cross-modal alignment via a Graph-conditioned Vision-Language Model (GVLM), and employs a Modality-Adaptive Prompt Router (MAPR) to select the optimal modality configuration for each node, attaining state-of-the-art performance on node classification and link prediction.
Background & Motivation¶
Existing multimodal LLMs process independent image-text pairs, overlooking the relational structure among multimodal data in real-world settings. In MMGs, each node carries both text and image attributes, while edges provide structural priors. Directly encoding with a VLM (e.g., CLIP) and feeding the result into a graph model introduces two challenges:
C1 Weak cross-modal consistency: A node's text and image are not necessarily semantically aligned; neighboring information can resolve ambiguity but is ignored. The cross-modal cosine similarity of frozen CLIP is low, yet improves by 68% upon incorporating graph topology.
C2 Heterogeneous modality preference: The informativeness of different modalities varies across nodes. Approximately 30% of nodes can only be correctly classified under a specific modality configuration. A one-size-fits-all prompt template wastes information.
Core Problem¶
Can a unified framework be designed that simultaneously addresses cross-modal inconsistency and heterogeneous modality preference on MMGs during LLM reasoning?
Method¶
Overall Architecture¶
- Stage 1 (GVLM): Dual-tower encoder + topology-aware multimodal mixer → graph-conditioned contrastive learning → structure-aware, cross-modally consistent representations.
- Stage 2 (Modality-Adaptive Instruction Tuning): Construct text-only, image-only, and multimodal prompt templates → MAPR router selects the optimal template → LLM reasoning.
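The heart of Stage 1 is a graph-attention step over node CLS embeddings. Below is a minimal numpy sketch of one such mixer step, assuming random projections in place of the learned Q/K/V weights and a simple distance-decay bias standing in for the learned per-distance bias; function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topology_aware_mixer(cls, spd, num_heads=4, rng=None):
    """One mixer step: every node's CLS attends to all sampled nodes,
    with an additive bias derived from shortest-path distances (spd).
    cls: (N, d) CLS embeddings collected at the current encoder layer
    spd: (N, N) shortest-path distance matrix
    Returns structure-aware CLS of shape (N, d), which would be
    re-injected into each node's token sequence."""
    rng = np.random.default_rng(0) if rng is None else rng
    N, d = cls.shape
    dh = d // num_heads
    # Random projections stand in for learned W_q, W_k, W_v.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Simple decay stands in for a learned per-distance scalar bias:
    # closer nodes receive a larger (less negative) bias.
    bias = -0.5 * spd.astype(float)
    q = (cls @ Wq).reshape(N, num_heads, dh)
    k = (cls @ Wk).reshape(N, num_heads, dh)
    v = (cls @ Wv).reshape(N, num_heads, dh)
    out = np.empty_like(q)
    for h in range(num_heads):
        scores = q[:, h] @ k[:, h].T / np.sqrt(dh) + bias
        out[:, h] = softmax(scores, axis=-1) @ v[:, h]
    return out.reshape(N, d)
```

In the full model this step alternates with ordinary token attention inside each Transformer layer, so structural and modal information mix repeatedly rather than once at the output.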
Key Designs¶
- Topology-Aware Multimodal Mixer: At each encoder layer, CLS representations are collected from all nodes across the graph, neighbor information is aggregated via multi-head attention with a graph-structure positional bias, and the resulting structure-aware CLS is re-injected into the token sequence in place of the original CLS. Iterating layer by layer enables deep fusion of structural and modal information.
- Graph-Conditioned Contrastive Learning: Bidirectional InfoNCE is applied to the structure-aware text/image CLS embeddings: \(\mathcal{L}_{\text{S1}} = -\frac{1}{|\mathcal{B}|}\sum_v \left[\log\frac{e^{s(v,v)/\tau}}{\sum_u e^{s(v,u)/\tau}} + \log\frac{e^{s(v,v)/\tau}}{\sum_u e^{s(u,v)/\tau}}\right]\)
- Modality-Adaptive Prompt Router (MAPR):
  - Three prompts are constructed per node: \(\mathcal{S}_v^{\text{txt}}\) (text tokens only), \(\mathcal{S}_v^{\text{vis}}\) (image tokens only), \(\mathcal{S}_v^{\text{mm}}\) (bimodal tokens)
  - Router input: \([\mathbf{h}_v^{\text{text}}; \mathbf{h}_v^{\text{image}}; \phi^{(1)}(v); \phi^{(2)}(v); \log d_v]\)
  - An MLP outputs three-class routing probabilities \(\mathbf{p}_v = \text{softmax}(\mathbf{s}_v)\)
  - Performance posteriors \(\mathbf{q}_v = \text{softmax}(-[\ell_v^{(\text{txt})}, \ell_v^{(\text{vis})}, \ell_v^{(\text{mm})}])\) serve as teacher signals
  - Loss: \(\mathcal{L}_{\text{S2}} = \frac{1}{|\mathcal{B}|}\sum_v \left[\sum_k q_v^{(k)} \ell_v^{(k)} + \lambda\, \text{KL}(\mathbf{q}_v \| \mathbf{p}_v)\right]\)
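The MAPR objective for a single node can be sketched directly from the formulas above. In this numpy toy version the router MLP is reduced to one linear layer with random weights, and the three LM losses are given as inputs; all names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def mapr_step(router_feat, W, lm_losses, lam=0.1):
    """One per-node MAPR computation (names hypothetical).
    router_feat: concatenated [h_text; h_image; phi1; phi2; log-degree]
    W: (feat_dim, 3) stand-in for the router MLP's output layer
    lm_losses: LM losses under the txt / vis / mm prompt templates
    Returns (p, q, loss): routing probs p_v, teacher posterior q_v,
    and the per-node term  sum_k q_k * l_k + lam * KL(q || p)."""
    logits = router_feat @ W                 # router scores s_v
    p = softmax(logits)                      # routing probabilities p_v
    q = softmax(-np.asarray(lm_losses))      # lower loss -> higher teacher mass
    kl = float(np.sum(q * (np.log(q) - np.log(p))))
    loss = float(np.dot(q, lm_losses) + lam * kl)
    return p, q, loss
```

At inference the soft distribution `p` is replaced by a hard `argmax`, so the router adds no extra LLM forward passes.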
Loss & Training¶
Stage 1 trains the encoder with contrastive loss. Stage 2 fine-tunes the LLM and router using performance-weighted LM loss with KL regularization. At inference, the router selects the optimal modality template.
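The Stage-1 contrastive loss \(\mathcal{L}_{\text{S1}}\) is standard bidirectional InfoNCE over the structure-aware text/image CLS pairs of a batch, with matching rows as positives. A minimal numpy sketch (names hypothetical):

```python
import numpy as np

def graph_infonce(zt, zi, tau=0.07):
    """Bidirectional InfoNCE over structure-aware text (zt) and image (zi)
    CLS embeddings of a batch; row v of zt and row v of zi are positives.
    Returns the text->image plus image->text cross-entropy terms,
    each averaged over the batch."""
    zt = zt / np.linalg.norm(zt, axis=1, keepdims=True)
    zi = zi / np.linalg.norm(zi, axis=1, keepdims=True)
    s = zt @ zi.T / tau                      # s(v, u) similarity matrix

    def xent(logits):
        # mean over rows of -log softmax probability of the diagonal entry
        logits = logits - logits.max(1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
        return -np.mean(np.diag(logp))

    return xent(s) + xent(s.T)
```

Because the embeddings already carry neighbor information from the mixer, this loss enforces cross-modal consistency conditioned on graph topology rather than on isolated image-text pairs.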
Key Experimental Results¶
Main Results (Node Classification Accuracy %)¶
| Method | Movies | CDs | Arts | |
|---|---|---|---|---|
| GCN(text) | 43.8 | 84.3 | 51.4 | 76.9 |
| GATv2(text) | 48.7 | 85.6 | 54.7 | 80.4 |
| Mario | **53.6** | **95.3** | **63.4** | **92.1** |
Ablation Study¶
| Configuration | Performance | Note |
|---|---|---|
| No graph-conditioned VLM (frozen CLIP) | Low consistency | Cross-modal misalignment |
| Node-level fine-tuning (no topology) | Partial improvement | Missing neighbor information |
| +GVLM (Stage 1) | Significant gain | Topology- and modality-aware |
| +MAPR (Stage 2) | Best | Modality-adaptive selection |
Mix-Training Setting (Node Classification Accuracy %)¶
| Method | Modality | Movies | CDs | Arts | |
|---|---|---|---|---|---|
| SAGE | Text | 46.85 | 89.96 | 53.24 | 87.46 |
| LLaGA | Text | 47.80 | 91.14 | 51.33 | 74.02 |
| LLaGA-A | Text+Image | 50.61 | 92.94 | 56.29 | 88.83 |
| Graph4MM | Text+Image | 51.07 | 92.89 | 55.53 | 89.32 |
| Mario-8B | Text+Image | 53.63 | 95.30 | 63.43 | 92.13 |
Key Findings¶
- Cross-modal consistency improves by 68% upon introducing graph topology (vs. frozen CLIP)
- ~30% of nodes exhibit a clear single-modality preference
- Zero-shot transfer yields up to 1.6× improvement
Highlights & Insights¶
- Precise identification of two challenges: Weak consistency and heterogeneous preference are genuine bottlenecks in MMG reasoning; the Venn diagram analysis is intuitive and compelling
- Elegant MAPR routing mechanism: LLM loss serves as a performance signal to drive routing learning—soft routing during training and hard routing at inference with zero overhead
- GVLM in Stage 1 as a new paradigm: A topology-aware vision-language model that alternately performs graph attention and token attention within Transformer layers
- Strong zero-shot transfer: Up to 1.6× gain on unseen MMGs, indicating that the learned modality routing strategy generalizes well
- Unified framework: The same architecture handles both node classification and link prediction, demonstrating strong generality
Limitations & Future Work¶
- Two-stage training increases complexity; Stage 2 requires three LLM forward passes per sample
- Mixer attention complexity is \(\mathcal{O}(|\mathcal{V}_s|^2 d)\), necessitating node sampling for large-scale graphs
- The current framework handles only text+image bimodal graphs and has not been extended to modalities such as audio or video
- The graph topology bias \(\mathbf{B}_h\) relies on precomputed shortest paths, which is unfriendly to dynamic graphs
- Positioning against related work: MLaGA fuses features via a Q-Former before passing them to the LLM, and Graph4MM handles missing modalities; Mario performs better in complete-modality settings but has not been evaluated under missing-modality conditions
Rating¶
⭐⭐⭐⭐⭐ (5/5)
The dual innovations of GVLM and MAPR, combined with comprehensive experiments spanning four datasets, two tasks, and three modality settings, as well as zero-shot transfer validation of generalization, establish Mario as a significant pioneering contribution to multimodal graph + LLM reasoning.