GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning¶
Conference: CVPR 2026 arXiv: 2603.13370 Code: https://github.com/oamyjin/GraphVLM Area: Multimodal VLM Keywords: Multimodal graph learning, VLM, graph neural network, benchmark, node classification
TL;DR¶
This work proposes the GraphVLM benchmark, which systematically evaluates VLMs across three roles in multimodal graph learning (Encoder / Aligner / Predictor). The VLM-as-Predictor paradigm consistently achieves the best performance, revealing the substantial potential of VLMs as backbones for multimodal graph reasoning.
Background & Motivation¶
VLMs excel at image–text alignment, yet existing work primarily focuses on pairwise modality alignment and overlooks the relational structure among entities in real-world data (e.g., social networks, recommender systems, knowledge graphs). Multimodal Graph Learning (MMGL) aims to integrate heterogeneous node attributes with relational structure, but two critical gaps remain:
Fragmented baselines and shallow fusion: A unified evaluation pipeline is lacking, preventing fair comparison across GNN / LLM / VLM methods; most GNN approaches rely on simple feature concatenation.
Underexplored potential of VLMs in structural reasoning: Existing evaluations are limited to zero-shot inference and do not investigate VLMs as trainable backbones or multimodal aligners.
Method¶
Overall Architecture¶
GraphVLM categorizes the role of VLMs in MMGL into three paradigms under a unified evaluation protocol:
- VLM-as-Encoder: Pre-trained VLMs encode multimodal node features as input to a GNN.
- VLM-as-Aligner: VLMs bridge modalities to assist LLMs in structured reasoning.
- VLM-as-Predictor: VLMs are directly fine-tuned as the predictive backbone for graph learning.
Key Designs¶
- VLM-as-Encoder: Three encoder variants are explored.
- Pre-trained PVLM: Directly concatenates text and image embeddings from CLIP.
- Fine-tuned PVLM (PVLM-F): Fine-tunes CLIP on the target MMG dataset via contrastive learning to enhance cross-modal alignment.
- Structure-aware PVLM (PVLM-F-S): Jointly optimized within a GNN framework using a structure-aware contrastive loss:
\(\mathcal{L}_v = -\log \frac{\exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_j}) / \tau)}{\sum_{v_k \in \mathcal{B}} \exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_k}) / \tau)}\)
where \(\mathcal{E}_{TI}^{v_i}\) denotes the text–image concatenated embedding of the anchor node, and \(v_j\) is its 1-hop neighbor. Design Motivation: Encourage the encoder to be topology-aware so that embeddings of neighboring nodes are pulled closer together.
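The structure-aware contrastive loss above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code: `emb` stands in for the text–image concatenated embeddings \(\mathcal{E}_{TI}\), the batch \(\mathcal{B}\) is taken to be all rows of `emb`, and the function name is my own.

```python
import numpy as np

def structure_aware_nce(emb, anchor, neighbor, tau=0.07):
    """Structure-aware contrastive loss for one anchor node (sketch).

    emb: (N, d) text-image concatenated node embeddings (E_TI).
    anchor, neighbor: indices of v_i and its 1-hop neighbor v_j.
    The batch B here is all rows of `emb`; real training would sample a minibatch.
    """
    # Cosine similarity via dot products of L2-normalized embeddings.
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = e @ e[anchor]          # sim(E_TI^{v_i}, E_TI^{v_k}) for every k in B
    logits = sims / tau
    logits -= logits.max()        # numerical stability before exponentiating
    # -log softmax evaluated at the 1-hop neighbor's position.
    return -(logits[neighbor] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss = structure_aware_nce(emb, anchor=0, neighbor=1)
```

Minimizing this loss pulls the neighbor's embedding toward the anchor relative to the rest of the batch, which is exactly the topology-awareness the design motivation describes.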
- VLM-as-Aligner: A two-level alignment strategy.
- Latent-Space Aligner: Replaces unimodal node representations in GraphLLM with CLIP multimodal embeddings while preserving the original architecture.
- Prompt-Level Aligner: Uses Qwen-VL to convert images into textual descriptions, constructing "visually augmented node prompts":
- Visually augmented: \(\mathcal{T}^I = \text{VLM}(\mathcal{P}_{\text{Gen}}, \mathcal{I}; \theta)\), further summarized by the VLM into a concise caption \(\mathcal{T}^S\).
- Structure-aware: Further incorporates visual descriptions of neighboring nodes \(\mathcal{T}^{SS}\).
- Design Motivation: Achieve cross-modal bridging at both the feature level and the prompt level, accommodating different GraphLLM architectures.
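A prompt-level aligner of this kind can be sketched as simple string assembly. The template below is hypothetical (the paper's exact prompt wording is not reproduced here); `caption` plays the role of \(\mathcal{T}^S\) and `neighbor_caps` the role of the structure-aware descriptions \(\mathcal{T}^{SS}\).

```python
def build_node_prompt(text_attr, caption, neighbor_caps, k=2):
    """Assemble a visually augmented, structure-aware node prompt (illustrative template)."""
    lines = [
        f"Node text: {text_attr}",
        f"Node image (captioned): {caption}",  # T^S: concise VLM-generated caption
    ]
    # T^{SS}: fold in captions of up to k neighboring nodes for structural context.
    for i, cap in enumerate(neighbor_caps[:k]):
        lines.append(f"Neighbor {i + 1} image (captioned): {cap}")
    lines.append("Task: predict the category of this node.")
    return "\n".join(lines)

prompt = build_node_prompt(
    "Wireless mouse, ergonomic design",
    "A black computer mouse on a desk",
    ["A mechanical keyboard", "A USB hub with four ports"],
)
```

The resulting text prompt can be fed to any text-only GraphLLM, which is what lets this aligner accommodate architectures that cannot consume image features directly.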
- VLM-as-Predictor: VLMs are directly fine-tuned as the graph learning backbone.
- Explicit Prompt-Level Fusion: Constructs prompts containing the anchor node and the attributes of its top-\(k\) most similar neighbors.
- Implicit Latent-Space Fusion: Aggregates neighbor representations into structure-aware tokens injected into the model's latent space.
- Visual: Average-pools patch embeddings of neighboring node images.
- Textual: Averages final-layer token embeddings of neighboring node texts as node-level representations.
- LoRA fine-tuning is applied to LLaVA-1.5-7B, Qwen-VL-7B, and Qwen2.5-VL-7B.
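The implicit latent-space fusion step can be illustrated as two-stage mean pooling. The shapes and function name below are assumptions for the sketch, not the paper's implementation:

```python
import numpy as np

def structure_tokens(patch_embs, token_embs):
    """Pool neighbor representations into structure-aware tokens (sketch).

    patch_embs: list of (P_i, d) visual patch embeddings, one array per neighbor image.
    token_embs: list of (T_i, d) final-layer text token embeddings, one array per neighbor.
    Returns one visual and one textual structure-aware token, each of shape (d,).
    """
    # Average-pool patches within each image, then average across neighbors.
    vis = np.mean([p.mean(axis=0) for p in patch_embs], axis=0)
    # Average tokens within each neighbor's text, then average across neighbors.
    txt = np.mean([t.mean(axis=0) for t in token_embs], axis=0)
    return vis, txt

rng = np.random.default_rng(0)
vis, txt = structure_tokens(
    [rng.normal(size=(49, 4)), rng.normal(size=(16, 4))],  # two neighbor images
    [rng.normal(size=(12, 4)), rng.normal(size=(7, 4))],   # two neighbor texts
)
```

The two pooled vectors would then be projected into the VLM's embedding space and injected as extra tokens, giving the model structural context without lengthening the text prompt.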
Loss & Training¶
- Encoder paradigm: Contrastive learning loss (CLIP-style + structure-aware contrastive loss).
- Aligner paradigm: Follows the original training pipeline of each GraphLLM.
- Predictor paradigm: LoRA SFT following official fine-tuning guidelines.
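For intuition, LoRA replaces a frozen weight \(W\) with \(W + \frac{\alpha}{r} BA\), training only the low-rank factors \(A\) and \(B\). The class below is a minimal numpy illustration of that update rule, not the PEFT library API used for the actual fine-tuning:

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA-style adapter on a frozen linear weight (illustrative).

    Effective weight: W + (alpha / r) * B @ A, with only A and B trainable.
    """
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                     # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        return x @ (self.W + self.scale * self.B @ self.A).T

W = np.eye(4)
layer = LoRALinear(W, r=2, alpha=4)
x = np.ones((1, 4))
y = layer(x)
```

Because `B` is zero-initialized, the adapted layer starts as an exact no-op on the pretrained weights, which is what makes LoRA SFT stable at the beginning of training.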
Key Experimental Results¶
Main Results¶
Node classification is conducted on Amazon co-purchase networks (Movies / Toys / Grocery / Arts / CDs) and the Reddit social network:
| Paradigm | Method | Movies | Toys | Grocery | Arts | CDs | Reddit |
|---|---|---|---|---|---|---|---|
| VLM-as-Encoder | GraphSAGE+CLIP | 44.08 | 77.77 | 86.05 | 85.35 | 54.75 | 76.48 |
| VLM-as-Encoder | MMGCN+CLIP | 45.90 | 75.36 | 84.63 | 88.92 | 51.33 | 80.99 |
| VLM-as-Encoder | UniGraph2 | — | — | — | — | — | — |
| VLM-as-Predictor | Qwen2.5-VL-7B (best) | Best | Best | Best | Best | Best | Best |
Ablation Study¶
| Configuration | Key Finding | Remarks |
|---|---|---|
| Text-only vs. Image-only vs. Multimodal | Multimodal > Unimodal | CLIP multimodal concatenation consistently outperforms single-modality input |
| Pre-trained vs. Fine-tuned vs. Structure-aware | Each has advantages | PVLM-F yields notable gains on small graphs; PVLM-F-S benefits denser graphs |
| Latent-space vs. Prompt-level Aligner | Latent-space is more stable | Feature-level fusion yields more consistent gains than prompt-level fusion |
| Zero-shot VLM vs. SFT VLM | SFT brings substantial improvement | Fine-tuned VLM-as-Predictor decisively surpasses zero-shot inference |
Key Findings¶
- VLM-as-Predictor is consistently the best paradigm: Fine-tuned VLMs serving as direct predictive backbones achieve the highest performance across all six datasets.
- Latent-space fusion outperforms prompt-level fusion: Integrating modality and structural signals at the feature level yields more stable gains.
- CLIP as an encoder is already highly competitive: Under the GNN framework, CLIP multimodal embeddings significantly outperform alternatives such as ImageBind.
Highlights & Insights¶
- This work presents the first systematic VLM benchmark for multimodal graph learning, unifying GNN / LLM / VLM paradigms under a single framework and filling an important gap in the field.
- The proposed three-role taxonomy (Encoder / Aligner / Predictor) provides a clear conceptual framework for future research.
- The study reveals the potential of VLMs on graph-structured data: they can serve not merely as feature extractors but as end-to-end graph reasoning backbones.
- The large experimental scale (6 datasets × multiple methods × multiple configurations) ensures reliable conclusions.
Limitations & Future Work¶
- Only node classification is considered; other graph learning tasks such as link prediction and graph classification are not covered.
- Datasets are limited to e-commerce co-purchase and social network domains; scientific knowledge graphs and similar settings are absent.
- The computational overhead of VLM-as-Predictor far exceeds that of GNN-based approaches, requiring practical trade-off analysis for deployment.
- Structural information injection remains relatively preliminary (top-\(k\) neighbors / simple aggregation); more sophisticated graph encoding strategies are left unexplored.
Related Work & Insights¶
- Compared to existing benchmarks such as MAGB and MM-Bench, GraphVLM is the first to cover all three VLM roles and to support fine-tuning-based evaluation.
- The multimodal extension paradigm from the GraphLLM line of work (GraphGPT, LLaGA, MLaGA) offers valuable design inspiration.
- The structure-aware contrastive learning approach (PVLM-F-S) can be generalized to other graph-plus-multimodal settings.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic benchmark; the three-role VLM taxonomy is clear and insightful.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 6 datasets, multiple paradigms, and multiple configurations; extremely rigorous.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with rich figures and tables, though some experimental tables are somewhat dense.
- Value: ⭐⭐⭐⭐ — Makes an important contribution to multimodal graph learning; the benchmark holds lasting long-term value.