
GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Conference: CVPR 2026 | arXiv: 2603.13370 | Code: https://github.com/oamyjin/GraphVLM | Area: Multimodal VLM | Keywords: Multimodal graph learning, VLM, graph neural network, benchmark, node classification

TL;DR

This work proposes the GraphVLM benchmark, which systematically evaluates VLMs across three roles in multimodal graph learning (Encoder / Aligner / Predictor). The VLM-as-Predictor paradigm consistently achieves the best performance, revealing the substantial potential of VLMs as backbones for multimodal graph reasoning.

Background & Motivation

VLMs excel at image–text alignment, yet existing work primarily focuses on pairwise modality alignment and overlooks the relational structure among entities in real-world data (e.g., social networks, recommender systems, knowledge graphs). Multimodal Graph Learning (MMGL) aims to integrate heterogeneous node attributes with relational structure, but two critical gaps remain:

Fragmented baselines and shallow fusion: A unified evaluation pipeline is lacking, preventing fair comparison across GNN / LLM / VLM methods; most GNN approaches rely on simple feature concatenation.

Underexplored potential of VLMs in structural reasoning: Existing evaluations are limited to zero-shot inference and do not investigate VLMs as trainable backbones or multimodal aligners.

Method

Overall Architecture

GraphVLM categorizes the role of VLMs in MMGL into three paradigms under a unified evaluation protocol:

  • VLM-as-Encoder: Pre-trained VLMs encode multimodal node features as input to a GNN.
  • VLM-as-Aligner: VLMs bridge modalities to assist LLMs in structured reasoning.
  • VLM-as-Predictor: VLMs are directly fine-tuned as the predictive backbone for graph learning.

Key Designs

  1. VLM-as-Encoder: Three encoder variants are explored.

    • Pre-trained PVLM: Directly concatenates text and image embeddings from CLIP.
    • Fine-tuned PVLM (PVLM-F): Fine-tunes CLIP on the target MMG dataset via contrastive learning to enhance cross-modal alignment.
    • Structure-aware PVLM (PVLM-F-S): Jointly optimized within a GNN framework using a structure-aware contrastive loss:

    \(\mathcal{L}_v = -\log \frac{\exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_j}) / \tau)}{\sum_{v_k \in \mathcal{B}} \exp(\text{sim}(\mathcal{E}_{TI}^{v_i}, \mathcal{E}_{TI}^{v_k}) / \tau)}\)

where \(\mathcal{E}_{TI}^{v_i}\) denotes the text–image concatenated embedding of the anchor node \(v_i\), \(v_j\) is one of its 1-hop neighbors, \(\tau\) is a temperature, and \(\mathcal{B}\) is the mini-batch. Design Motivation: Encourage the encoder to be topology-aware so that embeddings of neighboring nodes are pulled closer together (a minimal code sketch of this loss follows the Key Designs list).

  2. VLM-as-Aligner: A two-level alignment strategy.

    • Latent-Space Aligner: Replaces unimodal node representations in GraphLLM with CLIP multimodal embeddings while preserving the original architecture.
    • Prompt-Level Aligner: Uses Qwen-VL to convert images into textual descriptions, constructing "visually augmented node prompts":
      • Visually augmented: \(\mathcal{T}^I = \text{VLM}(\mathcal{P}_{\text{Gen}}, \mathcal{I}; \theta)\), further summarized by the VLM into a concise caption \(\mathcal{T}^S\).
      • Structure-aware: further incorporates visual descriptions of neighboring nodes, \(\mathcal{T}^{SS}\).
    • Design Motivation: Achieve cross-modal bridging at both the feature level and the prompt level, accommodating different GraphLLM architectures.
  3. VLM-as-Predictor: VLMs are directly fine-tuned as the graph learning backbone.

    • Explicit Prompt-Level Fusion: Constructs prompts containing the anchor node and the attributes of its top-\(k\) most similar neighbors.
    • Implicit Latent-Space Fusion: Aggregates neighbor representations into structure-aware tokens injected into the model's latent space (a minimal sketch follows this list):
      • Visual: average-pools the patch embeddings of neighboring node images.
      • Textual: averages final-layer token embeddings into node-level representations.
    • LoRA fine-tuning is applied to LLaVA-1.5-7B, Qwen-VL-7B, and Qwen2.5-VL-7B.
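As a concrete illustration of the structure-aware contrastive loss \(\mathcal{L}_v\) above, here is a minimal PyTorch-style sketch (not the authors' code): the function and variable names are assumptions, cosine similarity is assumed for \(\text{sim}(\cdot,\cdot)\), and one sampled 1-hop neighbor serves as the positive while all batch nodes appear in the denominator.

```python
import torch
import torch.nn.functional as F

def structure_aware_contrastive_loss(anchor_emb, neighbor_emb, batch_emb, tau=0.07):
    """Sketch of L_v: pull each anchor toward a sampled 1-hop neighbor,
    normalized over all nodes in the mini-batch.

    anchor_emb:   (B, d) text-image concatenated embeddings E_TI of anchor nodes v_i
    neighbor_emb: (B, d) E_TI of one sampled 1-hop neighbor v_j per anchor
    batch_emb:    (N, d) E_TI of all nodes v_k in the mini-batch (denominator terms)
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(neighbor_emb, dim=-1)
    b = F.normalize(batch_emb, dim=-1)

    pos = (a * p).sum(dim=-1) / tau      # sim(v_i, v_j) / tau, shape (B,)
    neg = a @ b.t() / tau                # sim(v_i, v_k) / tau, shape (B, N)
    # -log( exp(pos) / sum_k exp(neg) ) = logsumexp(neg) - pos
    return (torch.logsumexp(neg, dim=1) - pos).mean()
```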
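For the Predictor paradigm's implicit latent-space fusion, the sketch below gives one plausible reading of the neighbor aggregation (average-pooled patch embeddings on the visual side, averaged final-layer token embeddings on the textual side). The function name, tensor shapes, and injection point are assumptions, not the paper's implementation.

```python
import torch

def build_structure_tokens(anchor_patch_emb, neighbor_patch_embs, neighbor_token_embs):
    """Collapse neighbor signals into structure-aware tokens appended to the
    anchor node's input sequence (hypothetical sketch).

    anchor_patch_emb:    (P, d)     patch embeddings of the anchor node's image
    neighbor_patch_embs: (K, P, d)  patch embeddings of K neighboring node images
    neighbor_token_embs: (K, T, d)  final-layer text token embeddings of K neighbors
    """
    # Visual: average-pool each neighbor's patches, then pool across neighbors.
    visual_token = neighbor_patch_embs.mean(dim=1).mean(dim=0, keepdim=True)   # (1, d)
    # Textual: average token embeddings into node-level vectors, then across neighbors.
    textual_token = neighbor_token_embs.mean(dim=1).mean(dim=0, keepdim=True)  # (1, d)
    # Inject the structure-aware tokens alongside the anchor's own sequence.
    return torch.cat([anchor_patch_emb, visual_token, textual_token], dim=0)   # (P+2, d)
```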

Loss & Training

  • Encoder paradigm: Contrastive learning loss (CLIP-style + structure-aware contrastive loss).
  • Aligner paradigm: Follows the original training pipeline of each GraphLLM.
  • Predictor paradigm: LoRA SFT following official fine-tuning guidelines.
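For the Predictor paradigm's LoRA SFT, a minimal setup with Hugging Face transformers and peft might look like the sketch below; the checkpoint id, target modules, and hyperparameters are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load one of the evaluated backbones (checkpoint id assumed for illustration).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

# Attach low-rank adapters to the attention projections; hyperparameters are guesses.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```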

Key Experimental Results

Main Results

Node classification is conducted on Amazon co-purchase networks (Movies / Toys / Grocery / Arts / CDs) and the Reddit social network:

| Paradigm | Method | Movies | Toys | Grocery | Arts | CDs | Reddit |
|---|---|---|---|---|---|---|---|
| VLM-as-Encoder | GraphSAGE+CLIP | 44.08 | 77.77 | 86.05 | 85.35 | 54.75 | 76.48 |
| VLM-as-Encoder | MMGCN+CLIP | 45.90 | 75.36 | 84.63 | 88.92 | 51.33 | 80.99 |
| VLM-as-Encoder | UniGraph2 | | | | | | |
| VLM-as-Predictor | Qwen2.5-VL-7B (best) | Best | Best | Best | Best | Best | Best |

Ablation Study

| Configuration | Key Finding | Remarks |
|---|---|---|
| Text-only vs. Image-only vs. Multimodal | Multimodal > unimodal | CLIP multimodal concatenation consistently outperforms single-modality input |
| Pre-trained vs. Fine-tuned vs. Structure-aware | Each has advantages | PVLM-F yields notable gains on small graphs; PVLM-F-S benefits denser graphs |
| Latent-space vs. Prompt-level Aligner | Latent-space is more stable | Feature-level fusion yields more consistent gains than prompt-level fusion |
| Zero-shot VLM vs. SFT VLM | SFT brings substantial improvement | Fine-tuned VLM-as-Predictor decisively surpasses zero-shot inference |

Key Findings

  1. VLM-as-Predictor is consistently the best paradigm: Fine-tuned VLMs serving as direct predictive backbones achieve the highest performance across all six datasets.
  2. Latent-space fusion outperforms prompt-level fusion: Integrating modality and structural signals at the feature level yields more stable gains.
  3. CLIP as an encoder is already highly competitive: Under the GNN framework, CLIP multimodal embeddings significantly outperform alternatives such as ImageBind.

Highlights & Insights

  • This work presents the first systematic VLM benchmark for multimodal graph learning, unifying GNN / LLM / VLM paradigms under a single framework and filling an important gap in the field.
  • The proposed three-role taxonomy (Encoder / Aligner / Predictor) provides a clear conceptual framework for future research.
  • The study reveals the potential of VLMs on graph-structured data: they can serve not merely as feature extractors but as end-to-end graph reasoning backbones.
  • The large experimental scale (6 datasets × multiple methods × multiple configurations) ensures reliable conclusions.

Limitations & Future Work

  • Only node classification is considered; other graph learning tasks such as link prediction and graph classification are not covered.
  • Datasets are limited to e-commerce co-purchase and social network domains; scientific knowledge graphs and similar settings are absent.
  • The computational overhead of VLM-as-Predictor far exceeds that of GNN-based approaches, requiring practical trade-off analysis for deployment.
  • Structural information injection remains relatively preliminary (top-\(k\) neighbors / simple aggregation); more sophisticated graph encoding strategies are left unexplored.
  • Compared to existing benchmarks such as MAGB and MM-Bench, GraphVLM is the first to cover all three VLM roles and to support fine-tuning-based evaluation.
  • The multimodal extension paradigm from the GraphLLM line of work (GraphGPT, LLaGA, MLaGA) offers valuable design inspiration.
  • The structure-aware contrastive learning approach (PVLM-F-S) can be generalized to other graph-plus-multimodal settings.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic benchmark; the three-role VLM taxonomy is clear and insightful.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 6 datasets, multiple paradigms, and multiple configurations; extremely rigorous.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with rich figures and tables, though some experimental tables are somewhat dense.
  • Value: ⭐⭐⭐⭐ — Makes an important contribution to multimodal graph learning; the benchmark holds lasting value.