GraphGPT-o: Synergistic Multimodal Comprehension and Generation on Graphs
Conference: CVPR 2025
arXiv: 2502.11925
Code: To be open-sourced
Area: Image Generation
Keywords: multimodal attributed graph, MLLM, graph linearization, hierarchical aligner, Q-Former, DreamLLM, Stable Diffusion
TL;DR
This paper proposes GraphGPT-o, which injects structural information of multimodal attributed graphs (MMAGs, where nodes contain image+text and edges represent relations) into a Multimodal Large Language Model (MLLM). Through PPR sampling, a hierarchical Q-Former aligner, and flexible inference strategies, it achieves joint text-image generation conditioned on the graph context.
Background & Motivation
Background: MLLMs (such as DreamLLM) are already capable of understanding and generating text-image content. However, in reality, text and images are often interconnected in graph structures (such as co-purchase graphs of e-commerce products or art genre association graphs), forming Multimodal Attributed Graphs (MMAGs).
Limitations of Prior Work:
1. Graph scale explosion: Feeding the complete local subgraph into the model makes the input sequence length grow exponentially.
2. Non-Euclidean structure: The complex topology of a graph is difficult to feed directly into an MLLM that expects a serialized input.
3. Hierarchical modal dependencies: Node-level fusion (text and image complement each other) and subgraph-level fusion (node semantics plus structure) require different levels of integration.
4. Inference dependency: The order in which text and image are generated affects output quality.
Key Challenge: MLLMs process linear sequences, whereas the value of MMAGs lies precisely in their non-linear graph structural relationships.
Goal: Enable MLLMs to utilize structural and semantic information within MMAGs to generate matching images and texts for target nodes.
Key Insight: Compress subgraphs with PPR sampling + encode graph structure with a two-level Q-Former + inject graph tokens into the MLLM.
Core Idea: Use Personalized PageRank to sample key neighbors, employ a hierarchical Q-Former to encode node modalities and graph structures, and inject graph tokens into the MLLM to achieve graph-conditioned multimodal generation.
Method
Overall Architecture
- PPR Neighbor Sampling: Uses Personalized PageRank on the target node to select the top-K most relevant neighbors.
- Hierarchical Multimodal Aligner:
    - Node Feature Q-Former: Fuses text and image features of each neighbor.
    - Graph Structure Q-Former: Aggregates structural information among neighboring nodes.
- MLLM Inference: Feeds the graph token \(\mathbf{g}_{v_i}\) into DreamLLM along with text/image tokens.
- Stable Diffusion Decoding: The generated image tokens are decoded into images via SD.
Key Designs
1. Personalized PageRank Neighbor Sampling
To resolve the issue of graph scale explosion, a PPR matrix \(\mathbf{P}\) is defined by \(\mathbf{P} = \beta \hat{\mathbf{A}} \mathbf{P} + (1-\beta)\mathbf{I}\). Its entries give the relevance score of each node to the target node, and the top-K highest-scoring nodes are selected as neighbors.
Compared to fixed-hop neighbor sampling, PPR can select the most relevant nodes across multiple hops, avoiding the introduction of irrelevant noise.
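The sampling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the column normalization of \(\hat{\mathbf{A}}\), the value `beta=0.85`, the dense matrix inverse, and the function names `ppr_matrix` and `ppr_top_k` are all assumptions made for illustration. It relies only on the fact that the fixed-point equation above has the closed form \(\mathbf{P} = (1-\beta)(\mathbf{I} - \beta\hat{\mathbf{A}})^{-1}\).

```python
import numpy as np

def ppr_matrix(adj, beta=0.85):
    """Solve P = beta * A_hat @ P + (1 - beta) * I in closed form."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Assumed normalisation: divide each column by its degree (column-stochastic).
    deg = adj.sum(axis=0, keepdims=True)
    a_hat = adj / np.maximum(deg, 1.0)
    return (1.0 - beta) * np.linalg.inv(np.eye(n) - beta * a_hat)

def ppr_top_k(adj, target, k, beta=0.85):
    """Return the k nodes with the highest PPR score with respect to `target`."""
    # Column `target` holds the PPR vector whose restart mass sits on the target.
    scores = ppr_matrix(adj, beta)[:, target].copy()
    scores[target] = -np.inf  # never return the target itself
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()

# Toy path graph 0 - 1 - 2 - 3 - 4 (hypothetical data).
path = np.zeros((5, 5))
for u in range(4):
    path[u, u + 1] = path[u + 1, u] = 1.0

neighbors = ppr_top_k(path, target=0, k=2)
print(neighbors)
```

A dense inverse is only practical for small graphs; on larger graphs an iterative or approximate PPR solver would normally replace it.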
2. Hierarchical Q-Former Aligner
Node Feature Q-Former \(\phi(\cdot)\):
- Concatenates the neighbor node's text tokens \(\mathbf{w}_{v_j}\) and CLIP image tokens \(\mathbf{I}_{v_j}\).
- Exchanges information between the text and image modalities through an \(L_1\)-layer self-attention Transformer.
- Compresses the result into a fixed-length representation \(\mathbf{H}_{v_j}\) via cross-attention with a learnable soft prompt \(\mathbf{Q}_V\).

Graph Structure Q-Former \(\psi(\cdot)\):
- Concatenates the node representations \(\mathbf{H}_{v_j}\) of all neighbors as input.
- Fuses information across nodes through an \(L_2\)-layer self-attention Transformer.
- Aggregates the result into a graph token \(\mathbf{g}_{v_i}\) via cross-attention with a learnable soft prompt \(\mathbf{Q}_G\).
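The two-level compression can be traced with a shape-level NumPy sketch. It is a simplification under stated assumptions rather than the paper's architecture: attention is single-head with no learned projections, layer norms and feed-forward blocks are omitted, all weights and inputs are random, and the sizes (`D`, `K`, the token counts, the soft-prompt lengths) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden size (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def q_former(tokens, soft_prompt, num_layers):
    """Self-attention over the inputs, then cross-attention from the soft prompt."""
    for _ in range(num_layers):
        tokens = tokens + attention(tokens, tokens)  # residual self-attention
    # The output length equals the soft-prompt length, whatever the input length.
    return attention(soft_prompt, tokens)

K, n_text, n_img = 3, 5, 7         # neighbors, text tokens, image tokens
Q_V = rng.normal(size=(4, D))      # node-level soft prompt (stand-in for Q_V)
Q_G = rng.normal(size=(2, D))      # graph-level soft prompt (stand-in for Q_G)

# Level 1: fuse text and image tokens of each neighbor into H_{v_j}.
node_reprs = []
for _ in range(K):
    w = rng.normal(size=(n_text, D))   # stand-in for text tokens
    img = rng.normal(size=(n_img, D))  # stand-in for CLIP image tokens
    node_reprs.append(q_former(np.concatenate([w, img]), Q_V, num_layers=2))

# Level 2: aggregate all neighbor representations into the graph token g_{v_i}.
g = q_former(np.concatenate(node_reprs), Q_G, num_layers=2)
print(node_reprs[0].shape, g.shape)  # (4, 16) (2, 16)
```

The point of the sketch is the bottleneck: 12 raw tokens per neighbor shrink to 4, and the 12 tokens from all neighbors shrink to a 2-token graph representation.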
3. Exploration of Inference Strategies
- Sequential (Text-first): Generates text \(d_{v_i}\) first, then generates the image \(p_{v_i}\) conditioned on the text and graph tokens.
- Sequential (Image-first): Generates the image first, then generates text conditioned on the image and graph tokens.
- Parallel: Generates text and image independently and simultaneously to avoid error propagation.
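The three strategies differ only in what each generation call is conditioned on, which the following sketch makes explicit. The `generate` dispatcher and the two stub generators are hypothetical placeholders, not part of any real DreamLLM API; the stubs return strings that record their conditioning.

```python
from typing import Callable, Optional, Tuple

# Hypothetical generator signatures: (graph_token, other_modality_or_None) -> output
Generator = Callable[[str, Optional[str]], str]

def generate(graph_token: str, gen_text: Generator, gen_image: Generator,
             strategy: str) -> Tuple[str, str]:
    """Produce (text, image) for a target node under one inference strategy."""
    if strategy == "text_first":
        text = gen_text(graph_token, None)
        image = gen_image(graph_token, text)    # image sees the generated text
    elif strategy == "image_first":
        image = gen_image(graph_token, None)
        text = gen_text(graph_token, image)     # text sees the generated image
    elif strategy == "parallel":
        text = gen_text(graph_token, None)      # neither output sees the other,
        image = gen_image(graph_token, None)    # so errors cannot propagate
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return text, image

# Stub generators that only record what they were conditioned on.
def stub_text(g: str, image: Optional[str]) -> str:
    return f"text({g}|{image})"

def stub_image(g: str, text: Optional[str]) -> str:
    return f"image({g}|{text})"

for s in ("text_first", "image_first", "parallel"):
    print(s, generate("g", stub_text, stub_image, s))
```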
Loss & Training
- MLLM Loss: Autoregressive next-token prediction (interleaved sequences of text, image, and graph tokens).
- SD Loss: Standard diffusion denoising loss, conditioned on text, image, and graph tokens.
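For reference, the two objectives can be written in their standard textbook forms. The notation is assumed here and not taken from the paper: \(x_t\) is the \(t\)-th token of the interleaved sequence, \(\mathbf{z}_t\) the noised latent at diffusion step \(t\), \(\boldsymbol{\epsilon}_\theta\) the denoising network, and \(\mathbf{c}\) the conditioning derived from text, image, and graph tokens.

\[
\mathcal{L}_{\text{MLLM}} = -\sum_{t} \log p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{\text{SD}} = \mathbb{E}_{\mathbf{z}_0,\, \boldsymbol{\epsilon},\, t}\left[\left\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\mathbf{z}_t, t, \mathbf{c}\right) \right\rVert_2^2\right]
\]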
Key Experimental Results
Main Results
Three datasets: ART500K (artwork graph), Amazon-Baby, Amazon-Beauty (e-commerce product graphs).
Evaluation metrics: CLIP-I2 (generated image vs. GT image), Perplexity (text fluency), CLIP-IT (image-text alignment), KL-DV (consistency with neighbor distribution).
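Two of these metrics have simple generic definitions, sketched below in NumPy. The sketch assumes embeddings and token log-probabilities are already available; the function names are hypothetical, and the factor of 100 on the cosine similarity is an assumption inferred from the range of the reported scores. KL-DV is left out because its exact definition is not given here.

```python
import numpy as np

def clip_style_score(emb_a, emb_b):
    """100 x cosine similarity between two embedding vectors (assumed scaling)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return 100.0 * float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return float(np.exp(-np.mean(token_log_probs)))

# Hypothetical inputs: identical embeddings, and four tokens each with p = 0.25.
print(clip_style_score([1.0, 2.0], [1.0, 2.0]))
print(perplexity(np.log([0.25, 0.25, 0.25, 0.25])))
```

CLIP-I2 would apply the first function to two image embeddings (generated vs. ground truth), and CLIP-IT to an image embedding and a text embedding.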
Key findings from the graph linearization experiments:
- Using both modalities (text+image) generally outperforms using a single modality.
- The order of the modalities has an inconsistent impact on performance.
- Image-first inference improves image quality but may degrade text quality.
- Text-first inference achieves the lowest KL-DV (highest consistency with the neighbor distribution).
Hierarchical Aligner vs. Linearization (ART500K):
| Method | CLIP-I2 ↑ | Perp. ↓ | CLIP-IT ↑ | KL-DV ↓ |
|---|---|---|---|---|
| Best Linearization | 79.26 | 117.7 | 20.15 | 0.19 |
| Hierarchical Aligner | 82.15 | 98.3 | 23.41 | 0.15 |
The hierarchical Q-Former aligner outperforms the best linearization variant on all four metrics.
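To put the gaps in the table on a common scale, the relative change per metric can be computed directly from the reported numbers. This is a small arithmetic sketch; the helper `relative_gain` is hypothetical, and the direction flags follow the arrows in the table header.

```python
# (best linearization, hierarchical aligner, higher_is_better) from the table above
rows = {
    "CLIP-I2": (79.26, 82.15, True),
    "Perp.": (117.7, 98.3, False),
    "CLIP-IT": (20.15, 23.41, True),
    "KL-DV": (0.19, 0.15, False),
}

def relative_gain(baseline, ours, higher_is_better):
    """Improvement over the baseline, as a percentage of the baseline value."""
    delta = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * delta / baseline

gains = {name: round(relative_gain(*vals), 2) for name, vals in rows.items()}
print(gains)
```

On these numbers the relative improvement is about 3.65% for CLIP-I2, 16.48% for perplexity, 16.18% for CLIP-IT, and 21.05% for KL-DV, so the image-similarity gain is the smallest of the four in relative terms.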
Ablation Study
- Node Q-Former: Removing it leads to a significant drop in CLIP-I2 (node-level modal fusion is key).
- Graph Q-Former: Removing it increases KL-DV (graph structural information is important for distribution consistency).
- PPR vs. Random Sampling: PPR outperforms random sampling across all metrics.
Key Findings
- Graph structural information contributes significantly to generation quality; node attributes alone are not sufficient.
- Hierarchical alignment (node-level fusion followed by graph-level aggregation) outperforms one-step flat fusion.
- PPR sampling captures relevant neighbors across multiple hops better than BFS/DFS.
- The optimal choice of inference strategy depends on dataset characteristics (artwork vs. e-commerce).
- There is an optimal number of neighbors; too many will introduce noise.
Highlights & Insights
- Novel Problem Definition: Formulates the multimodal content generation task on MMAGs for the first time, generating text and images simultaneously.
- Reasonable Hierarchical Design: The two-level design of Node Q-Former and Graph Q-Former elegantly handles information fusion at different granularities.
- Systematic Experiments: Conducts comprehensive combinatorial experiments on modal selection, modal order, and inference strategies for linearization.
- Rich Application Scenarios: E-commerce recommendation (product generation), virtual art creation, and social network content recommendation.
- PPR Sampling: Introduces the mature PPR method from graph learning to MLLM subgraph selection, balancing efficiency and effectiveness.
Limitations & Future Work
- The model is built on DreamLLM + Stable Diffusion 1.x, so the quality of the generated images is capped by these foundation models.
- Validated only on three relatively small-scale datasets, without testing on large-scale graphs (million-node level).
- PPR sampling requires precomputing the PPR matrix for the entire graph, which is ill-suited to dynamic graphs or streaming scenarios.
- The selection of inference strategies requires tuning for different datasets, lacking an adaptive mechanism.
- Edge attributes (such as relationship types) are not considered; edges are only used for connectivity.
Related Work & Insights
- DreamLLM (Dong et al., 2024): The foundational MLLM of GraphGPT-o, supporting interleaved text-image understanding and generation.
- BLIP-2 / InstructBLIP: The architectural source of Q-Former.
- GraphGPT (Tang et al., 2023): A pioneer in injecting graph information into LLMs, but only processes textual graphs.
- PPR / APPNP (Klicpera et al., 2019): Application of Personalized PageRank in graph learning.
Insights: Graph structure is a ubiquitous relational form in the real world, so injecting graph tokens into various foundation models (LLMs, MLLMs, diffusion models) could be an important direction. The "two-stage compression" idea behind the hierarchical Q-Former can be generalized to other hierarchically structured data.
Rating
⭐⭐⭐⭐ (4/5)
- Novelty: ⭐⭐⭐⭐. The problem definition is novel, but the method is a combination of existing components (Q-Former, PPR, DreamLLM).
- Experimental Thoroughness: ⭐⭐⭐⭐. The systematic study of linearization variants is very detailed, but the datasets are relatively small.
- Writing Quality: ⭐⭐⭐⭐. The problem is clearly motivated and the methodology is described completely.
- Value: ⭐⭐⭐. Promising for e-commerce recommendation and art creation scenarios, but the actual demand for graph-based generation still needs verification.