
🕸️ Graph Learning

📷 CVPR2026 · 9 paper notes

Adaptive Learned Image Compression with Graph Neural Networks

GLIC replaces the fixed convolutions and window-based attention used in the nonlinear transforms of learned image compression (LIC) with content-adaptive graph neural network operations. A dual-scale graph determines where to connect, while a complexity-aware mechanism determines how much to connect, enabling more effective modeling of both local and long-range redundancies. GLIC consistently outperforms traditional codecs and recent LIC baselines across three standard benchmarks.
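The paper's dual-scale graph construction is not spelled out in this note; as a hedged illustration of the "where to connect" idea, the sketch below builds a plain k-nearest-neighbour graph over patch features, one common primitive for content-adaptive connectivity (the function name and shapes are assumptions, not GLIC's actual design).

```python
import numpy as np

def knn_graph(features, k=4):
    """Build a content-adaptive k-nearest-neighbour graph over patch features.

    features: (N, D) array, one row per spatial patch.
    Returns an (N, N) boolean adjacency matrix (no self-loops).
    """
    # Pairwise squared Euclidean distances between patch features.
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)          # exclude self-connections
    # Connect each patch to its k most similar patches.
    nbrs = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=bool)
    rows = np.repeat(np.arange(features.shape[0]), k)
    adj[rows, nbrs.ravel()] = True
    return adj

rng = np.random.default_rng(0)
A = knn_graph(rng.normal(size=(16, 8)), k=4)
```

Because neighbours are chosen by feature similarity rather than spatial position, such a graph can link distant but redundant regions, which is the kind of long-range modeling the summary describes.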

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

This paper proposes the G2F-RAG paradigm, which renders retrieved structured knowledge into a single "reasoning frame" appended to the end of the video, enabling large models to reason uniformly within the visual space. This approach avoids the attention dilution and cognitive overload caused by text appending, achieving consistent training-free improvements across 8 video benchmarks.

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

This paper proposes Graph2Eval, a knowledge-graph-driven framework for automatically generating agent evaluation tasks. It constructs structured knowledge graphs from documents and webpages, then combines subgraph sampling, task templates, meta-path strategies, LLM-conditioned generation, and multi-stage filtering to produce multimodal agent tasks with improved semantic consistency (+20%) and solvability (+17%). The resulting benchmark, Graph2Eval-Bench, comprises 1,319 tasks.

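The note mentions subgraph sampling as one stage of Graph2Eval's pipeline. As a hedged sketch of that primitive only (the dictionary graph, function name, and random-walk strategy are illustrative assumptions, not the paper's algorithm), a small subgraph can be carved out of a larger knowledge graph like this:

```python
import random

def sample_subgraph(edges, start, walk_len=10, seed=0):
    """Sample a connected subgraph by random walk -- a simple way to turn
    a large knowledge graph into small, self-contained task contexts.

    edges: dict mapping node -> list of neighbour nodes.
    Returns the set of visited nodes and the set of traversed edges.
    """
    rng = random.Random(seed)
    nodes, sub_edges = {start}, set()
    cur = start
    for _ in range(walk_len):
        nbrs = edges.get(cur, [])
        if not nbrs:
            break                      # dead end: stop the walk
        nxt = rng.choice(nbrs)
        nodes.add(nxt)
        sub_edges.add((cur, nxt))
        cur = nxt
    return nodes, sub_edges

# Toy knowledge graph built from a document's structure.
kg = {"doc": ["sec1", "sec2"], "sec1": ["fig1", "doc"],
      "sec2": ["tab1"], "fig1": [], "tab1": []}
nodes, sub = sample_subgraph(kg, "doc", walk_len=6)
```

Each sampled subgraph would then be handed to the template and LLM-conditioned generation stages to be turned into a concrete task.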

Hyperbolic Busemann Neural Networks

This paper intrinsically lifts multinomial logistic regression (MLR) and fully connected (FC) layers to hyperbolic space via Busemann functions, proposing two unified components—BMLR and BFC—applicable to both the Poincaré ball and the Lorentz model. The proposed components outperform existing hyperbolic layers across four task categories: image classification, genomic sequence classification, node classification, and link prediction.
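The Busemann function itself has a standard closed form on the Poincaré ball (this much is textbook hyperbolic geometry; how BMLR and BFC build layers on top of it is the paper's contribution and is not reproduced here). For an ideal point ω on the boundary sphere and a point x inside the open unit ball, B_ω(x) = log(‖ω − x‖² / (1 − ‖x‖²)):

```python
import numpy as np

def busemann_poincare(x, omega):
    """Busemann function on the Poincare ball toward ideal point omega.

    x:     point inside the open unit ball, shape (D,).
    omega: unit vector on the boundary sphere, shape (D,).
    Closed form: B_omega(x) = log(||omega - x||^2 / (1 - ||x||^2)).
    The value decreases as x moves toward omega along a geodesic.
    """
    num = np.sum((omega - x) ** 2)
    den = 1.0 - np.sum(x ** 2)
    return np.log(num / den)

omega = np.array([1.0, 0.0])
origin = np.array([0.0, 0.0])   # B is 0 at the origin by convention
toward = np.array([0.5, 0.0])   # closer to omega -> negative value
```

Because the same formula involves only norms, an analogous expression exists in the Lorentz model, which is what lets the paper state its components for both geometries.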

M3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

This paper proposes M3KG-RAG, which constructs a multi-hop multimodal knowledge graph (M3KG) via a lightweight multi-agent pipeline and introduces the GRASP mechanism for entity grounding and selective pruning. By retaining only query-relevant and answer-useful knowledge, the approach substantially improves audio-visual reasoning capabilities of MLLMs.
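The note says GRASP retains only query-relevant knowledge. As a deliberately toy stand-in for that pruning step (token overlap instead of the paper's actual grounding and scoring mechanism, which is not described here), retrieved triples can be filtered against the query like this:

```python
def prune_triples(triples, query, min_overlap=1):
    """Keep only triples that share enough tokens with the query --
    a toy proxy for query-relevance pruning of retrieved knowledge.

    triples: list of (head, relation, tail) string tuples.
    """
    q = set(query.lower().split())
    kept = []
    for (h, r, t) in triples:
        toks = set(f"{h} {r} {t}".lower().split())
        if len(q & toks) >= min_overlap:
            kept.append((h, r, t))
    return kept

triples = [("dog", "makes_sound", "bark"), ("car", "has_part", "engine")]
kept = prune_triples(triples, "what sound does a dog make")
```

A real system would score relevance with embeddings rather than token overlap, but the shape of the operation — retrieve broadly, then keep only what helps answer the query — is the same.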

Mario: Multimodal Graph Reasoning with Large Language Models

This paper proposes Mario for LLM reasoning on multimodal graphs (MMGs). It achieves topology-aware cross-modal alignment via a Graph-conditioned Vision-Language Model (GVLM), and employs a Modality-Adaptive Prompt Router (MAPR) to select the optimal modality configuration for each node, attaining state-of-the-art performance on node classification and link prediction.
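Per-node modality routing can be pictured as a learned gate scoring each candidate modality configuration; the minimal sketch below assumes a linear gate and an argmax choice, which is only one plausible realization and not necessarily how MAPR is implemented.

```python
import numpy as np

def route_modalities(node_feats, gate_w, modalities=("text", "image", "both")):
    """Toy modality router: score each candidate modality configuration
    per node with a linear gate and pick the highest-scoring one.

    node_feats: (N, D) node features.
    gate_w:     (D, M) gating weights, one column per modality option.
    """
    logits = node_feats @ gate_w                # (N, M) per-node scores
    choice = np.argmax(logits, axis=1)          # hard selection per node
    return [modalities[i] for i in choice]

rng = np.random.default_rng(1)
picks = route_modalities(rng.normal(size=(5, 4)), rng.normal(size=(4, 3)))
```

In a trained system the hard argmax would typically be replaced by a differentiable relaxation (e.g. softmax weighting) so the router can be learned end-to-end.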

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning

This work embeds a Procedural Knowledge Graph (PKG) into a planning model end-to-end via a differentiable Viterbi layer, enabling the neural network to learn only emission probabilities rather than memorizing complete procedural structures. With only 5–7M parameters—one to three orders of magnitude fewer than diffusion- or LLM-based methods—the approach achieves state-of-the-art success rates on CrossTask/COIN/NIV and establishes a unified evaluation benchmark.
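A differentiable Viterbi layer is usually obtained by replacing the hard max in the Viterbi recursion with a temperature-scaled log-sum-exp, so that gradients can flow back into the emission scores the network predicts. The sketch below shows that standard relaxation in NumPy; it is a generic soft-Viterbi, not necessarily the exact layer used in ViterbiPlanNet.

```python
import numpy as np

def soft_viterbi(emissions, transitions, tau=0.1):
    """Soft (differentiable) Viterbi score via temperature-scaled log-sum-exp.

    emissions:   (T, S) log-scores of observations per step/state
                 (what the neural network learns to predict).
    transitions: (S, S) log-scores, transitions[i, j] = score of i -> j
                 (fixed procedural structure, e.g. from a knowledge graph).
    As tau -> 0 this approaches the hard Viterbi max-path score; at
    tau = 1 it equals the forward-algorithm log-likelihood.
    """
    def smax(v):
        m = np.max(v)                 # stabilised log-sum-exp
        return m + tau * np.log(np.sum(np.exp((v - m) / tau)))

    alpha = emissions[0].astype(float)
    for t in range(1, emissions.shape[0]):
        alpha = np.array([
            smax(alpha + transitions[:, j]) + emissions[t, j]
            for j in range(emissions.shape[1])
        ])
    return smax(alpha)

em = np.log(np.array([[0.6, 0.4], [0.3, 0.7]]))
tr = np.log(np.array([[0.8, 0.2], [0.5, 0.5]]))
score = soft_viterbi(em, tr, tau=1e-3)   # ~ best-path log-probability
```

Keeping the transition structure fixed and learning only the emissions is what keeps the parameter count small: the procedural knowledge lives in `transitions`, not in the network's weights.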

WSGG: Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

This paper proposes the World Scene Graph Generation (WSGG) task, extending conventional frame-level scene graphs to track all objects—including occluded and invisible ones—within a unified world coordinate system. Accompanied by the ActionGenome4D dataset and three complementary methods (PWG, MWAE, and 4DST), the work enables persistent scene reasoning.
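The core coordinate change behind a "world" scene graph is simple: per-frame detections in camera coordinates are mapped into one shared world frame using the camera pose. The sketch below shows just that transform (the function name and the assumption of a known camera-to-world pose are illustrative; the paper's methods for recovering pose and handling occlusion are not reproduced here).

```python
import numpy as np

def to_world(points_cam, R, t):
    """Map 3D points from camera coordinates into a shared world frame.

    R: (3, 3) camera-to-world rotation; t: (3,) camera position in world.
    Keeping objects in one fixed world frame is what lets a tracker
    persist them across frames, even after they leave the current view.
    """
    return points_cam @ R.T + t

pts_cam = np.array([[0.0, 0.0, 2.0]])   # 2 m in front of the camera
R = np.eye(3)                           # camera aligned with world axes
t = np.array([1.0, 0.0, 0.0])           # camera sits at x = 1 in the world
pts_world = to_world(pts_cam, R, t)
```

Once every object lives in this fixed frame, "occluded" just means the object's world-frame state is maintained without a current observation, which is what makes persistent scene reasoning possible.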