GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning¶
- Conference: CVPR 2026
- arXiv: 2603.13370
- Code: https://github.com/oamyjin/GraphVLM (open source)
- Area: Multimodal VLM / Graph Learning
- Keywords: Multimodal graph learning, VLM role analysis, graph neural networks, benchmark, structure-aware reasoning
TL;DR¶
This paper proposes GraphVLM, a benchmark that systematically evaluates VLMs in three roles for multimodal graph learning (MMGL): VLM-as-Encoder (enhancing GNN features), VLM-as-Aligner (bridging modalities for LLM-based reasoning), and VLM-as-Predictor (serving directly as the graph learning backbone). Experiments across six datasets demonstrate that VLM-as-Predictor consistently achieves the best performance, revealing the substantial potential of VLMs as a new foundation for MMGL.
Background & Motivation¶
Background: VLMs have achieved great success in aligning paired modalities (image–text), yet their capacity for multimodal reasoning over structured data (entities connected via graphs) remains largely unexplored. Two paradigms exist in MMGL—GNN-based and LLM-based methods—but the third paradigm of using VLMs directly as graph learning backbones is almost entirely absent.
Limitations of Prior Work: (a) Existing MMGL methods lack a unified evaluation protocol, preventing fair comparison across GNN/LLM/VLM approaches; (b) most GNN-based methods perform multimodal fusion via naive feature concatenation; (c) the potential of VLMs in graph learning has been confined to zero-shot inference, leaving their role as trainable backbones unexplored.
Key Challenge: While VLMs inherently possess cross-modal alignment capabilities, how such capabilities should be combined with the relational structure of graphs, and how VLMs can be most effectively leveraged for MMGL, remain open questions.
Goal: To establish a systematic benchmark that uniformly evaluates different roles of VLMs in multimodal graph learning and identifies the most effective usage paradigm.
Key Insight: The paper decomposes the role of VLMs in MMGL into three complementary paradigms and explores each along distinct axes.
Core Idea: VLM-as-Predictor—directly fine-tuning a VLM as the graph learning backbone with structural signal injection—is the most effective paradigm for multimodal graph learning.
Method¶
Overall Architecture¶
The three VLM roles correspond to three paradigms, each evaluated uniformly across six multimodal graph datasets (four Amazon e-commerce categories, one Reddit dataset, and one CDs dataset). Each paradigm includes multiple variants (e.g., with/without structural information, different fusion strategies).
Key Designs¶
- VLM-as-Encoder:
- Function: Encodes multimodal node features using a pretrained VLM (e.g., CLIP) and feeds them into a GNN for graph learning.
- Mechanism: Three encoder variants: (a) direct feature extraction and concatenation using pretrained CLIP; (b) CLIP fine-tuned on graph data via contrastive learning (CLIP-F); (c) structure-aware CLIP (CLIP-F-S), jointly optimized within a GNN framework using a structure-aware contrastive loss that aligns multimodal features with graph topology (a minimal sketch of such a loss appears after this list). Downstream GNNs include GCN, GraphSAGE, MMGCN, MGAT, and UniGraph2.
- Design Motivation: To investigate how multimodal feature quality affects GNN performance, and whether structure-aware encoding outperforms naive concatenation.
- VLM-as-Aligner:
- Function: Uses VLMs to bridge modalities so that GraphLLMs can process multimodal graph data.
- Mechanism: Two alignment strategies: (a) Latent-space alignment replaces the original unimodal node representations with CLIP multimodal embeddings injected directly into the LLM input space; (b) Prompt-level alignment uses a VLM (Qwen-VL) to convert images into textual descriptions appended to node text attributes, optionally including visual descriptions of neighboring nodes for structure-aware augmentation (a prompt-construction sketch appears after this list).
- Design Motivation: To test whether the multimodal alignment capabilities of VLMs can enhance existing GraphLLMs (e.g., LLaGA, GraphGPT, MLaGA).
- VLM-as-Predictor:
- Function: Directly fine-tunes VLMs (LLaVA-1.5, Qwen-VL, Qwen2.5-VL) via LoRA as task-specific backbones for graph learning.
- Mechanism: Two structural signal injection strategies: (a) Explicit prompt-level fusion constructs instruction prompts containing the anchor node and the attributes of its top-3 most similar neighbors; (b) Implicit latent-space fusion average-pools the visual patch embeddings and text token embeddings of neighboring nodes and injects them into the VLM latent space (a latent-fusion sketch appears after this list). Supports text-only, vision-only, and multimodal neighbor configurations.
- Design Motivation: Since VLMs already possess strong multimodal reasoning capabilities, combining fine-tuning with structural signal injection allows VLMs to serve directly as graph learning models, avoiding the information loss introduced by intermediate GNN or LLM layers.
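Below is a minimal PyTorch sketch of the structure-aware contrastive idea behind the CLIP-F-S encoder variant. The fused node embeddings are random stand-ins for real CLIP image/text features, and the InfoNCE-over-edges loss is one plausible reading of "align multimodal features with graph topology", not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def structure_aware_contrastive_loss(node_emb, edge_index, temperature=0.07):
    """InfoNCE-style loss that pulls connected nodes together in embedding space.

    node_emb:   (N, d) fused image+text node embeddings (e.g., from CLIP)
    edge_index: (2, E) graph edges; each column (u, v) is treated as a positive pair
    """
    z = F.normalize(node_emb, dim=-1)          # work in cosine-similarity space
    logits = z @ z.t() / temperature           # (N, N) all-pairs similarities
    src, dst = edge_index                      # neighbors serve as positives
    # For each edge, the neighbor dst is the "correct class" for src among all N nodes.
    return F.cross_entropy(logits[src], dst)

# Toy usage with random features standing in for CLIP outputs.
N, d = 8, 32
img_feat = torch.randn(N, d, requires_grad=True)
txt_feat = torch.randn(N, d, requires_grad=True)
node_emb = torch.cat([img_feat, txt_feat], dim=-1)        # naive concat fusion
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])   # tiny ring graph
structure_aware_contrastive_loss(node_emb, edge_index).backward()
```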
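The prompt-level alignment path of VLM-as-Aligner can be pictured as a simple prompt-building step: a VLM captioner turns node (and optionally neighbor) images into text that is appended to the node's attributes before a text-only GraphLLM sees them. The sketch below is a hypothetical illustration; `caption_image` stands in for a Qwen-VL captioning call, and the prompt wording is invented.

```python
from typing import Callable, List

def build_aligned_prompt(
    node_text: str,
    node_image: str,
    neighbor_texts: List[str],
    neighbor_images: List[str],
    caption_image: Callable[[str], str],   # stand-in for a Qwen-VL captioner
    use_neighbor_visuals: bool = True,
) -> str:
    """Prompt-level alignment: images become text so a text-only GraphLLM can use them."""
    parts = [
        f"Node attributes: {node_text}",
        f"Node image description: {caption_image(node_image)}",
    ]
    for i, (text, image) in enumerate(zip(neighbor_texts, neighbor_images)):
        parts.append(f"Neighbor {i} attributes: {text}")
        if use_neighbor_visuals:  # structure-aware augmentation
            parts.append(f"Neighbor {i} image description: {caption_image(image)}")
    parts.append("Task: predict the category of the node described above.")
    return "\n".join(parts)

# Toy usage with a dummy captioner in place of Qwen-VL.
print(build_aligned_prompt(
    node_text="Wireless headphones, 40h battery",
    node_image="img/node_0.jpg",
    neighbor_texts=["USB-C charging cable", "Bluetooth speaker"],
    neighbor_images=["img/node_3.jpg", "img/node_7.jpg"],
    caption_image=lambda path: f"<caption of {path}>",
))
```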
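For VLM-as-Predictor, implicit latent-space fusion can be sketched as follows: average-pool the neighbors' visual patch embeddings and text token embeddings, project them into the VLM hidden size, and prepend the resulting "structure tokens" to the anchor node's input embeddings before the LoRA-tuned VLM processes them. The projection layers and token placement below are assumptions about one plausible realization, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentStructureFusion(nn.Module):
    """Injects aggregated neighbor features into the VLM's input embedding sequence."""

    def __init__(self, vis_dim: int, txt_dim: int, hidden_dim: int):
        super().__init__()
        # Map pooled neighbor features into the VLM hidden space.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)

    def forward(
        self,
        anchor_embeds: torch.Tensor,          # (L, H) anchor node's prompt/image embeddings
        neighbor_patch_embeds: torch.Tensor,  # (K, P, Dv) visual patches of K neighbors
        neighbor_token_embeds: torch.Tensor,  # (K, T, Dt) text tokens of K neighbors
    ) -> torch.Tensor:
        vis_tok = self.vis_proj(neighbor_patch_embeds.mean(dim=(0, 1)))  # avg-pool patches
        txt_tok = self.txt_proj(neighbor_token_embeds.mean(dim=(0, 1)))  # avg-pool tokens
        # Prepend two "structure tokens" so the VLM attends to neighborhood context.
        return torch.cat([vis_tok[None], txt_tok[None], anchor_embeds], dim=0)

# Toy shapes; real inputs would come from the VLM's vision tower and token embeddings.
fusion = LatentStructureFusion(vis_dim=1024, txt_dim=4096, hidden_dim=4096)
fused = fusion(
    anchor_embeds=torch.randn(32, 4096),
    neighbor_patch_embeds=torch.randn(3, 196, 1024),   # top-3 neighbors, 196 patches each
    neighbor_token_embeds=torch.randn(3, 64, 4096),
)
print(fused.shape)  # torch.Size([34, 4096])
```

This kind of feature-level injection is what the fusion-strategy comparison later in these notes contrasts with prompt-level fusion.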
Key Experimental Results¶
Main Results (Cross-Paradigm Comparison, Average over 6 Datasets)¶
| Paradigm | Representative Method | Avg. Best Performance | Notes |
|---|---|---|---|
| VLM-as-Encoder | CLIP + GraphSAGE | Moderate | High-quality CLIP features, but bottlenecked by GNN |
| VLM-as-Aligner | CLIP/Qwen-VL + MLaGA | Moderate–High | Latent-space alignment outperforms prompt-level alignment |
| VLM-as-Predictor | Qwen2.5-VL + SFT | Best | Consistently best, especially after structural augmentation |
Ablation Study (Encoder Comparison, VLM-as-Encoder)¶
| Encoder | GraphSAGE Performance | Notes |
|---|---|---|
| ImageBind | Lower | Basic multimodal encoding |
| CLIP | High | Strong cross-modal alignment |
| CLIP-F (fine-tuned) | Slightly higher / comparable | Domain fine-tuning yields limited gains in some cases |
| CLIP-F-S (structure-aware) | Marginal improvement | Structure-aware encoding effective on some datasets |
Fusion Strategy Comparison (VLM-as-Predictor)¶
| Fusion Strategy | Effect | Notes |
|---|---|---|
| No structural information | Baseline | VLM observes only individual nodes |
| Prompt-level structural fusion | Improvement | Neighbor text/visual information added to prompt |
| Latent-space structural fusion | Largest improvement | Aggregated neighbor features injected into latent space |
Key Findings¶
- VLM-as-Predictor consistently achieves the best results: It outperforms GNN and LLM methods across all six datasets, establishing VLM-as-backbone with structural augmentation as the most effective paradigm for MMGL.
- Latent-space fusion outperforms prompt-level fusion: Integrating modality and structural signals at the feature level is more consistently effective than describing them in natural language within prompts.
- Pretrained CLIP is already a strong encoder: The feature quality of pretrained CLIP is high, and further fine-tuning yields limited additional gains.
- Structural information is most valuable in VLM-as-Predictor: The same neighbor information yields larger improvements when injected into a VLM than when injected into a GNN or LLM.
- Qwen2.5-VL > Qwen-VL > LLaVA-1.5: Stronger and more recent VLM backbones lead to better graph learning performance.
Highlights & Insights¶
- The systematic three-role taxonomy (Encoder / Aligner / Predictor) provides a clear conceptual framework for understanding the role of VLMs in graph learning, and can be generalized to analyze VLMs in other structured data tasks.
- The finding that VLM-as-Predictor is superior carries clear methodological implications: VLMs should not be treated merely as feature extractors or modality bridges, but as first-class backbone models for graph learning.
- The systematic comparison of latent-space vs. prompt-level fusion offers concrete design guidance for multimodal graph learning: structural information should be injected at the feature level rather than the textual level.
- The comprehensive comparison spanning six datasets, three paradigms, and multiple sub-methods constitutes the most complete benchmark in this direction to date.
Limitations & Future Work¶
- The benchmark focuses solely on node classification, without addressing other graph learning tasks such as link prediction or graph classification.
- Datasets are limited to Amazon e-commerce and Reddit, providing narrow domain coverage (knowledge graphs, molecular graphs, etc. are absent).
- LoRA fine-tuning in VLM-as-Predictor requires labeled data and is not a truly zero-shot approach.
- Only image and text modalities are considered; richer multimodal graphs involving audio or video are not explored.
- The scalability of VLMs on large-scale graphs is not discussed (per-node VLM inference incurs substantial computational cost).
Related Work & Insights¶
- vs. MM-Bench [Zheng et al.]: MM-Bench covers only GNN-based methods. GraphVLM is the first to provide unified coverage of GNN, LLM, and VLM backbones.
- vs. MAGB [Wei et al.]: MAGB covers GNNs and zero-shot VLMs, but does not explore VLM fine-tuning. GraphVLM provides complete coverage of the VLM SFT setting.
- vs. MLaGA [Chen et al.]: MLaGA is among the most advanced GraphLLM methods, yet is outperformed by VLM-as-Predictor in GraphVLM's comparison, demonstrating that direct VLM fine-tuning is more effective than routing through an LLM intermediary.
Rating¶
- Novelty: ⭐⭐⭐⭐ The systematic three-role taxonomy of VLMs is a novel contribution, though individual sub-methods are largely combinations of existing approaches.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison across 6 datasets, 3 paradigms, and multiple GNN/LLM/VLM methods.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with an intuitive classification framework.
- Value: ⭐⭐⭐⭐ Provides a systematic benchmark and clear methodological guidance for multimodal graph learning.