
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Conference: CVPR 2026 | arXiv: 2603.13370 | Code: https://github.com/oamyjin/GraphVLM | Area: Multimodal VLM | Keywords: Multimodal graph learning, VLM, graph neural network, benchmark, node classification

TL;DR

This work proposes the GraphVLM benchmark, which systematically evaluates VLMs across three roles in multimodal graph learning (Encoder / Aligner / Predictor). The VLM-as-Predictor paradigm consistently achieves the best performance, revealing the substantial potential of VLMs as backbones for multimodal graph reasoning.

Background & Motivation

VLMs excel at image–text alignment, yet existing work primarily focuses on pairwise modality alignment and overlooks the relational structure among entities in real-world data (e.g., social networks, recommender systems, knowledge graphs). Multimodal Graph Learning (MMGL) aims to integrate heterogeneous node attributes with relational structure, but two critical gaps remain:

Fragmented baselines and shallow fusion: A unified evaluation pipeline is lacking, preventing fair comparison across GNN / LLM / VLM methods; most GNN approaches rely on simple feature concatenation.

Underexplored potential of VLMs in structural reasoning: Existing evaluations are limited to zero-shot inference and do not investigate VLMs as trainable backbones or multimodal aligners.

Method

Overall Architecture

GraphVLM categorizes the role of VLMs in MMGL into three paradigms under a unified evaluation protocol:

  • VLM-as-Encoder: Pre-trained VLMs encode multimodal node features as input to a GNN.
  • VLM-as-Aligner: VLMs bridge modalities to assist LLMs in structured reasoning.
  • VLM-as-Predictor: VLMs are directly fine-tuned as the predictive backbone for graph learning.

Key Designs

  1. VLM-as-Encoder: Three encoder variants are explored.

    • Pre-trained PVLM: Directly concatenates text and image embeddings from CLIP.
    • Fine-tuned PVLM (PVLM-F): Fine-tunes CLIP on the target MMG dataset via contrastive learning to enhance cross-modal alignment.
    • Structure-aware PVLM (PVLM-F-S): Jointly optimized within a GNN framework using a structure-aware contrastive loss:

    \(\mathcal{L}_v = -\log \frac{\exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_j}) / \tau)}{\sum_{v_k \in \mathcal{B}} \exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_k}) / \tau)}\)

where \(\mathcal{E}_{TI}^{v_i}\) denotes the text–image concatenated embedding of the anchor node \(v_i\), \(v_j\) is one of its 1-hop neighbors, \(\tau\) is a temperature, and \(\mathcal{B}\) is the mini-batch. Design Motivation: Encourage the encoder to be topology-aware so that embeddings of neighboring nodes are pulled closer together (a minimal code sketch of this loss follows the Key Designs list).

  2. VLM-as-Aligner: A two-level alignment strategy.

    • Latent-Space Aligner: Replaces unimodal node representations in GraphLLM with CLIP multimodal embeddings while preserving the original architecture.
    • Prompt-Level Aligner: Uses Qwen-VL to convert images into textual descriptions, constructing "visually augmented node prompts":
      • Visually augmented: \(\mathcal{T}^I = \text{VLM}(\mathcal{P}_{\text{Gen}}, \mathcal{I}; \theta)\), further summarized by the VLM into a concise caption \(\mathcal{T}^S\).
      • Structure-aware: further incorporates visual descriptions of neighboring nodes, \(\mathcal{T}^{SS}\).
    • Design Motivation: Achieve cross-modal bridging at both the feature level and the prompt level, accommodating different GraphLLM architectures.
  3. VLM-as-Predictor: VLMs are directly fine-tuned as the graph learning backbone.

    • Explicit Prompt-Level Fusion: Constructs prompts containing the anchor node and the attributes of its top-\(k\) most similar neighbors.
    • Implicit Latent-Space Fusion: Aggregates neighbor representations into structure-aware tokens injected into the model's latent space (a minimal sketch follows this list):
      • Visual: average-pools the patch embeddings of neighboring node images.
      • Textual: averages final-layer token embeddings into node-level representations.
    • LoRA fine-tuning is applied to LLaVA-1.5-7B, Qwen-VL-7B, and Qwen2.5-VL-7B.
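As a concrete illustration of the structure-aware contrastive loss \(\mathcal{L}_v\) above, here is a minimal PyTorch-style sketch (not the authors' code): the function and variable names are assumptions, cosine similarity is assumed for \(\text{sim}(\cdot,\cdot)\), and one sampled 1-hop neighbor serves as the positive while all batch nodes appear in the denominator.

```python
import torch
import torch.nn.functional as F

def structure_aware_contrastive_loss(anchor_emb, neighbor_emb, batch_emb, tau=0.07):
    """Sketch of L_v: pull each anchor toward a sampled 1-hop neighbor,
    normalized over all nodes in the mini-batch.

    anchor_emb:   (B, d) text-image concatenated embeddings E_TI of anchor nodes v_i
    neighbor_emb: (B, d) E_TI of one sampled 1-hop neighbor v_j per anchor
    batch_emb:    (N, d) E_TI of all nodes v_k in the mini-batch (denominator terms)
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(neighbor_emb, dim=-1)
    b = F.normalize(batch_emb, dim=-1)

    pos = (a * p).sum(dim=-1) / tau      # sim(v_i, v_j) / tau, shape (B,)
    neg = a @ b.t() / tau                # sim(v_i, v_k) / tau, shape (B, N)
    # -log( exp(pos) / sum_k exp(neg) ) = logsumexp(neg) - pos
    return (torch.logsumexp(neg, dim=1) - pos).mean()
```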
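For the Predictor paradigm's implicit latent-space fusion, the sketch below gives one plausible reading of the neighbor aggregation (average-pooled patch embeddings on the visual side, averaged final-layer token embeddings on the textual side). The function name, tensor shapes, and injection point are assumptions, not the paper's implementation.

```python
import torch

def build_structure_tokens(anchor_patch_emb, neighbor_patch_embs, neighbor_token_embs):
    """Collapse neighbor signals into structure-aware tokens appended to the
    anchor node's input sequence (hypothetical sketch).

    anchor_patch_emb:    (P, d)     patch embeddings of the anchor node's image
    neighbor_patch_embs: (K, P, d)  patch embeddings of K neighboring node images
    neighbor_token_embs: (K, T, d)  final-layer text token embeddings of K neighbors
    """
    # Visual: average-pool each neighbor's patches, then pool across neighbors.
    visual_token = neighbor_patch_embs.mean(dim=1).mean(dim=0, keepdim=True)   # (1, d)
    # Textual: average token embeddings into node-level vectors, then across neighbors.
    textual_token = neighbor_token_embs.mean(dim=1).mean(dim=0, keepdim=True)  # (1, d)
    # Inject the structure-aware tokens alongside the anchor's own sequence.
    return torch.cat([anchor_patch_emb, visual_token, textual_token], dim=0)   # (P+2, d)
```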

Loss & Training

  • Encoder paradigm: Contrastive learning loss (CLIP-style + structure-aware contrastive loss).
  • Aligner paradigm: Follows the original training pipeline of each GraphLLM.
  • Predictor paradigm: LoRA SFT following official fine-tuning guidelines.
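For the Predictor paradigm's LoRA SFT, a minimal setup with Hugging Face transformers and peft might look like the sketch below; the checkpoint id, target modules, and hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load one of the evaluated backbones (checkpoint id assumed for illustration).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

# Attach low-rank adapters to the attention projections; hyperparameters are guesses.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```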

Key Experimental Results

Main Results

Node classification is conducted on Amazon co-purchase networks (Movies / Toys / Grocery / Arts / CDs) and the Reddit social network:

| Paradigm | Method | Movies | Toys | Grocery | Arts | CDs | Reddit |
|---|---|---|---|---|---|---|---|
| VLM-as-Encoder | GraphSAGE+CLIP | 44.08 | 77.77 | 86.05 | 85.35 | 54.75 | 76.48 |
| VLM-as-Encoder | MMGCN+CLIP | 45.90 | 75.36 | 84.63 | 88.92 | 51.33 | 80.99 |
| VLM-as-Encoder | UniGraph2 | | | | | | |
| VLM-as-Predictor | Qwen2.5-VL-7B (best) | Best | Best | Best | Best | Best | Best |

Ablation Study

| Configuration | Key Finding | Remarks |
|---|---|---|
| Text-only vs. Image-only vs. Multimodal | Multimodal > unimodal | CLIP multimodal concatenation consistently outperforms single-modality input |
| Pre-trained vs. Fine-tuned vs. Structure-aware | Each has advantages | PVLM-F yields notable gains on small graphs; PVLM-F-S benefits denser graphs |
| Latent-space vs. Prompt-level Aligner | Latent-space is more stable | Feature-level fusion yields more consistent gains than prompt-level fusion |
| Zero-shot VLM vs. SFT VLM | SFT brings substantial improvement | Fine-tuned VLM-as-Predictor decisively surpasses zero-shot inference |

Key Findings

  1. VLM-as-Predictor is consistently the best paradigm: Fine-tuned VLMs serving as direct predictive backbones achieve the highest performance across all six datasets.
  2. Latent-space fusion outperforms prompt-level fusion: Integrating modality and structural signals at the feature level yields more stable gains.
  3. CLIP as an encoder is already highly competitive: Under the GNN framework, CLIP multimodal embeddings significantly outperform alternatives such as ImageBind.

Highlights & Insights

  • This work presents the first systematic VLM benchmark for multimodal graph learning, unifying GNN / LLM / VLM paradigms under a single framework and filling an important gap in the field.
  • The proposed three-role taxonomy (Encoder / Aligner / Predictor) provides a clear conceptual framework for future research.
  • The study reveals the potential of VLMs on graph-structured data: they can serve not merely as feature extractors but as end-to-end graph reasoning backbones.
  • The large experimental scale (6 datasets × multiple methods × multiple configurations) ensures reliable conclusions.

Limitations & Future Work

  • Only node classification is considered; other graph learning tasks such as link prediction and graph classification are not covered.
  • Datasets are limited to e-commerce co-purchase and social network domains; scientific knowledge graphs and similar settings are absent.
  • The computational overhead of VLM-as-Predictor far exceeds that of GNN-based approaches, requiring practical trade-off analysis for deployment.
  • Structural information injection remains relatively preliminary (top-\(k\) neighbors / simple aggregation); more sophisticated graph encoding strategies are left unexplored.
  • Compared to existing benchmarks such as MAGB and MM-Bench, GraphVLM is the first to cover all three VLM roles and to support fine-tuning-based evaluation.
  • The multimodal extension paradigm from the GraphLLM line of work (GraphGPT, LLaGA, MLaGA) offers valuable design inspiration.
  • The structure-aware contrastive learning approach (PVLM-F-S) can be generalized to other graph-plus-multimodal settings.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic benchmark; the three-role VLM taxonomy is clear and insightful.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 6 datasets, multiple paradigms, and multiple configurations; extremely rigorous.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with rich figures and tables, though some experimental tables are somewhat dense.
  • Value: ⭐⭐⭐⭐ — Makes an important contribution to multimodal graph learning; the benchmark holds lasting value.