
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs

Conference: CVPR 2026 · arXiv: 2602.21864 · Code: To be confirmed · Area: Multimodal VLM
Keywords: Graph QA, graph topology representation, VLM zero-shot reasoning, dynamic routing, accuracy-conciseness trade-off

TL;DR

This paper proposes DynamicGTR, a framework that routes each query at inference time to the optimal graph topology representation (GTR) from a pool of 8 variants spanning visual and textual modalities. Dynamic routing substantially improves VLM performance on zero-shot graph algorithm QA and transfers to real-world tasks such as link prediction and node classification.

Background & Motivation

Background: VLMs have demonstrated the ability to answer graph-related questions in zero-shot settings, yet understanding structured graph data remains challenging.

Limitations of Prior Work: Existing methods rely on a single fixed GTR (e.g., a uniform textual prompt or a fixed visualization style), ignoring model-specific and task-specific representational preferences.

Key Challenge: Experiments reveal that different tasks favor different GTRs — perception-intensive tasks (connectivity/cycle detection) prefer visual GTRs, while edge-weight tasks (shortest path/maximum flow) prefer textual GTRs.

Cost of Suboptimal GTR: A suboptimal representation can lead to incorrect answers or unnecessarily verbose responses.

Limitations of Existing Graph QA Methods: Tool-augmented systems are constrained to predefined question types, while graph-aware VLMs require additional training or architectural modifications, violating the zero-shot premise.

Core Problem: Can GTR preferences be exploited to make graph QA both accurate and efficient?

Method

Overall Architecture

DynamicGTR consists of: (1) construction of a zero-shot GTR pool \(\mathcal{R}_{ZS}\) (5 visual + 3 textual GTRs); (2) definition of the GRE metric; (3) construction of a GTR preference dataset; and (4) training a GTR router for dynamic selection at inference time.
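At inference time, these components reduce to a single route-render-query loop. The sketch below illustrates that flow; the `router`, `render`, and `vlm` callables are hypothetical stand-ins for the trained DeBERTa router, a GTR renderer, and a black-box VLM API, not the paper's actual interfaces:

```python
# Hypothetical sketch of DynamicGTR's inference-time flow.

GTR_POOL = [  # 5 visual + 3 textual representations
    "V_dot", "V_neato", "V_circo", "V_fdp", "V_sfdp",
    "T_set", "T_list", "T_mat",
]

def answer(query, graph, router, render, vlm):
    gtr = router(query)          # 1. route the query to a preferred GTR
    assert gtr in GTR_POOL
    prompt = render(graph, gtr)  # 2. serialize the graph in that GTR
    return vlm(prompt, query)    # 3. single zero-shot VLM call

# Toy stand-ins to make the flow concrete:
toy_router = lambda q: "V_dot" if "cycle" in q else "T_set"
toy_render = lambda g, r: f"[{r} rendering of {g}]"
toy_vlm = lambda p, q: f"VLM({p!r}, {q!r})"

print(answer("does G contain a cycle?", "G", toy_router, toy_render, toy_vlm))
```

The VLM itself is untouched: only the representation fed to it changes per query, which is what keeps the method zero-shot and compatible with closed-source models.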

Key Designs

Zero-Shot GTR Pool

Following three principles (model-agnosticism, diversity, and effectiveness), eight GTRs are constructed:

  • Visual GTRs (5 variants): \(V_{dot}\) (hierarchical tree), \(V_{neato}\) (spring model), \(V_{circo}\) (circular), \(V_{fdp}\) (fast force-directed), \(V_{sfdp}\) (scalable force-directed)
  • Textual GTRs (3 variants): \(T_{set}\) (edge set), \(T_{list}\) (adjacency list), \(T_{mat}\) (adjacency matrix)
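The three textual GTRs can be sketched for a toy graph as below; these serializations are illustrative, not the paper's exact prompt templates. The five visual GTRs correspond to Graphviz layout engines (dot, neato, circo, fdp, sfdp) and are omitted here since they produce images:

```python
# Illustrative versions of the three textual GTRs (T_set, T_list, T_mat).
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

def t_set(g):
    # Edge-set representation: "(u, v)" pairs.
    return "Edges: " + ", ".join(f"({u}, {v})" for u, v in g.edges())

def t_list(g):
    # Adjacency-list representation, one node per line.
    return "\n".join(f"{n}: {sorted(g.neighbors(n))}" for n in sorted(g.nodes()))

def t_mat(g):
    # Adjacency-matrix representation, space-separated 0/1 rows.
    m = nx.to_numpy_array(g, nodelist=sorted(g.nodes()), dtype=int)
    return "\n".join(" ".join(map(str, row)) for row in m)

print(t_set(G))
print(t_list(G))
print(t_mat(G))
```

Note how the same topology yields prompts of very different lengths and shapes, which is exactly why a VLM's accuracy and verbosity vary across GTRs.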

Graph Response Efficiency (GRE) Metric

\[GRE_r(q) = \text{Acc}_r(q) + \alpha \times \text{Eff}_r(q)\]

where \(\text{Acc}_r(q) = \log(1+100 \times \text{correctness})\) and \(\text{Eff}_r(q) = -\log(\text{tok}_r(q))\). The coefficient \(\alpha\) controls the trade-off between accuracy and conciseness.
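The metric can be written directly from these definitions. A minimal sketch, where `correct` is the 0/1 correctness indicator, `num_tokens` is \(\text{tok}_r(q)\), and the default \(\alpha = 0.5\) is an assumed placeholder rather than a value from the paper:

```python
import math

def gre(correct: bool, num_tokens: int, alpha: float = 0.5) -> float:
    # GRE_r(q) = Acc_r(q) + alpha * Eff_r(q), with
    # Acc = log(1 + 100 * correctness) and Eff = -log(tok_r(q)).
    # alpha = 0.5 is an assumed default, not stated in the paper.
    acc = math.log(1 + 100 * int(correct))
    eff = -math.log(num_tokens)
    return acc + alpha * eff
```

With this scoring, a correct 100-token answer outranks a correct 500-token one, and any correct answer outranks an incorrect one of the same length, which is the intended accuracy-first, conciseness-second ordering.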

GTR Router

A DeBERTaV3-base model is fine-tuned to map each query to its optimal GTR, requiring approximately 2.96 hours of training on a single A100. Because more than one GTR can be optimal for a given query, the router is trained as a multi-label classifier with a binary cross-entropy loss.

Loss & Training

Multi-label binary cross-entropy:

\[\mathcal{L} = -\mathbb{E}\left[\sum_r y_r \log p_\phi(y_r|q) + (1-y_r)\log(1-p_\phi(y_r|q))\right]\]
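The loss above is the standard multi-label binary cross-entropy, summed over the 8 GTRs and averaged over queries. A minimal NumPy sketch (the toy labels and probabilities are invented for illustration; the actual router optimizes this through DeBERTa's classification head):

```python
import numpy as np

def multilabel_bce(p, y):
    # Mean over queries of the summed per-GTR binary cross-entropy,
    # matching the loss above: p holds predicted probabilities
    # p_phi(y_r | q), y holds 0/1 GTR-preference labels.
    p = np.clip(p, 1e-7, 1 - 1e-7)  # numerical stability
    return -np.mean(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p), axis=1))

# Toy batch: 2 queries, 8 GTRs each; 1 marks a preferred GTR.
y = np.array([[1, 0, 0, 0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0, 1]], dtype=float)
p_good = np.where(y == 1, 0.9, 0.1)  # confident, correct predictions
p_bad = np.where(y == 1, 0.1, 0.9)   # confident, wrong predictions

assert multilabel_bce(p_good, y) < multilabel_bce(p_bad, y)
```

Treating each GTR as an independent binary decision (rather than a softmax over 8 classes) is what lets the router mark several representations as acceptable for one query.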

Key Experimental Results

Main Results: Graph Algorithm QA on GPT-4o

| Method | Connectivity Acc | Cycle Acc | Shortest-Path Acc | Avg. Tokens |
| --- | --- | --- | --- | --- |
| CoT | 92.5 | 52.7 | 54.6 | 273–566 |
| NLGraph | 92.9 | 60.2 | 59.0 | 202–534 |
| DynamicGTR | Best | Best | Best | Fewer |

Ablation Study: Task Preference Analysis

| Task Type | Preferred GTR | Representative Tasks |
| --- | --- | --- |
| Perception-intensive | Visual GTR | Connectivity, cycle detection, bipartite matching |
| Edge-weight computation | Textual GTR | Shortest path, maximum flow |
| Ordered decomposition | Textual GTR | Hamiltonian path, topological sort |

Key Findings

  • GTR preference patterns differ across VLMs (GPT-4o vs. Gemini-2.5 Pro).
  • Insights from DynamicGTR transfer zero-shot from synthetic graph algorithm tasks to link prediction and node classification.
  • The router exhibits good transferability across different VLMs.
  • The framework remains effective on large-scale graphs.

Highlights & Insights

  • This work is the first to systematically study representational preferences in VLM-based graph QA, revealing the importance of task–representation alignment.
  • The accuracy-conciseness trade-off embedded in the GRE metric is practically useful, with \(\alpha\) as a tunable knob for different user needs.
  • The combination of a lightweight router (DeBERTa) with black-box VLM inference is friendly to closed-source model deployments.
  • The GTR preference dataset itself holds independent research value.

Limitations & Future Work

  • The GTR pool is manually designed and may overlook superior representations.
  • The probing data is based on Erdős–Rényi random graphs; real-world graphs may exhibit different preferences.
  • The router relies solely on textual features and cannot exploit the graph structure itself.
  • Visual GTRs may become unreadable for extremely large graphs.
Comparison & Outlook

  • Compared to methods using uniform textual representations (e.g., NLGraph, GraphArena), DynamicGTR offers greater flexibility through dynamic selection.
  • Relative to the fixed visual representations in VisionGraph and GITA, this work complements the design space with textual GTR options.
  • The discovered task–representation preferences may generalize to VLM understanding of other structured data types (e.g., tables, flowcharts).
  • The findings empirically support a dual-system cognition analogy: fast, intuitive visual processing versus slow, analytical textual processing.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐