DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs¶

Conference: CVPR 2026 arXiv: 2602.21864 Code: To be confirmed Area: Multimodal VLM Keywords: Graph QA, graph topology representation, VLM zero-shot reasoning, dynamic routing, accuracy-conciseness trade-off

TL;DR¶

This paper proposes DynamicGTR, a framework that dynamically routes each query at inference time to the optimal graph topology representation (GTR, 8 variants spanning visual and textual modalities), substantially improving VLM performance on zero-shot graph algorithm QA, with transferability to real-world tasks such as link prediction and node classification.

Background & Motivation¶

Background: VLMs have demonstrated the ability to answer graph-related questions in zero-shot settings, yet understanding structured graph data remains challenging.

Limitations of Prior Work: Existing methods rely on a single fixed GTR (e.g., a uniform textual prompt or a fixed visualization style), ignoring model-specific and task-specific representational preferences.

Key Challenge: Experiments reveal that different tasks favor different GTRs — perception-intensive tasks (connectivity/cycle detection) prefer visual GTRs, while edge-weight tasks (shortest path/maximum flow) prefer textual GTRs.

Cost of Suboptimal GTR: A suboptimal representation can lead to incorrect answers or unnecessarily verbose responses.

Limitations of Existing Graph QA Methods: Tool-augmented systems are constrained to predefined question types, while graph-aware VLMs require additional training or architectural modifications, violating the zero-shot premise.

Core Problem: Can GTR preferences be exploited to make graph QA both accurate and efficient?

Method¶

Overall Architecture¶

DynamicGTR consists of: (1) construction of a zero-shot GTR pool \(\mathcal{R}_{ZS}\) (5 visual + 3 textual GTRs); (2) definition of the GRE metric; (3) construction of a GTR preference dataset; and (4) training a GTR router for dynamic selection at inference time.

Key Designs¶

Zero-Shot GTR Pool¶

Following three principles — model-agnosticism, diversity, and effectiveness — eight GTRs are constructed: - Visual GTRs (5 variants): \(V_{dot}\) (hierarchical tree), \(V_{neato}\) (spring model), \(V_{circo}\) (circular), \(V_{fdp}\) (fast force-directed), \(V_{sfdp}\) (scalable force-directed) - Textual GTRs (3 variants): \(T_{set}\) (edge set), \(T_{list}\) (adjacency list), \(T_{mat}\) (adjacency matrix)

Graph Response Efficiency (GRE) Metric¶

\[GRE_r(q) = \text{Acc}_r(q) + \alpha \times \text{Eff}_r(q)\]

where \(\text{Acc}_r(q) = \log(1+100 \times \text{correctness})\) and \(\text{Eff}_r(q) = -\log(\text{tok}_r(q))\). The coefficient \(\alpha\) controls the trade-off between accuracy and conciseness.

GTR Router¶

A DeBERTaV3-base model is fine-tuned to map queries to the optimal GTR, requiring approximately 2.96 hours of training on a single A100. Multi-label classification is supported via binary cross-entropy loss.

Loss & Training¶

Multi-label binary cross-entropy:

\[\mathcal{L} = -\mathbb{E}\left[\sum_r y_r \log p_\phi(y_r|q) + (1-y_r)\log(1-p_\phi(y_r|q))\right]\]

Key Experimental Results¶

Main Results: Graph Algorithm QA on GPT-4o¶

Method	Conn Acc	Cyc Acc	SP Acc	Avg. Tok
CoT	92.5	52.7	54.6	273–566
NLGraph	92.9	60.2	59.0	202–534
DynamicGTR	Best	Best	Best	Fewer

Ablation Study: Task Preference Analysis¶

Task Type	Preferred GTR	Representative Tasks
Perception-intensive	Visual GTR	Connectivity, cycle detection, bipartite matching
Edge-weight computation	Textual GTR	Shortest path, maximum flow
Ordered decomposition	Textual GTR	Hamiltonian path, topological sort

Key Findings¶

GTR preference patterns differ across VLMs (GPT-4o vs. Gemini-2.5 Pro).
Insights from DynamicGTR transfer zero-shot from synthetic graph algorithm tasks to link prediction and node classification.
The router exhibits good transferability across different VLMs.
The framework remains effective on large-scale graphs.

Highlights & Insights¶

This work is the first to systematically study representational preferences in VLM-based graph QA, revealing the importance of task–representation alignment.
The accuracy-conciseness trade-off embedded in the GRE metric is practically useful, with \(\alpha\) as a tunable knob for different user needs.
The combination of a lightweight router (DeBERTa) with black-box VLM inference is friendly to closed-source model deployments.
The GTR preference dataset itself holds independent research value.

Limitations & Future Work¶

The GTR pool is manually designed and may overlook superior representations.
The probing data is based on Erdős–Rényi random graphs; real-world graphs may exhibit different preferences.
The router relies solely on textual features and cannot exploit the graph structure itself.
Visual GTRs may become unreadable for extremely large graphs.

Compared to methods using uniform textual representations (e.g., NLGraph, GraphArena), DynamicGTR offers greater flexibility through dynamic selection.
Relative to fixed visual representations in VisionGraph and GITA, this work complements the design space with textual GTR options.
The discovered task–representation preferences may generalize to VLM understanding of other structured data types (e.g., tables, flowcharts).
The task–representation mapping patterns revealed by the GTR preference dataset carry independent research value.
The findings empirically validate a dual-system cognition analogy: fast-intuitive visual processing versus slow-analytical textual processing.

Rating¶

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐