DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition

Conference: CVPR 2025
arXiv: 2503.14867
Code: None (not provided in the paper)
Area: Graph Learning / Vision Backbone Networks
Keywords: Hypergraph Neural Networks, Multi-Scale Dilated Hypergraph, Vision Backbone, Dynamic Hypergraph Convolution, High-Order Correlation

TL;DR

This paper proposes DVHGNN, a vision backbone network that uses multi-scale dilated hypergraphs to capture high-order correlations among image patches. It extracts multi-scale hyperedges through clustering and Dilated Hypergraph Construction (DHGC), and exchanges features adaptively through dynamic hypergraph convolution. DVHGNN-S reaches 83.1% top-1 accuracy on ImageNet-1K with 30.2M parameters, outperforming ViG-S by 1.0 and ViHGNN-S by 0.6 percentage points.

Background & Motivation

Background: Vision GNN (ViG) pioneered representing images as graphs and processing them with GNNs, but it faces two major challenges: costly, non-learnable KNN graph construction and a restriction to pairwise relationships. ViHGNN uses hypergraphs to capture high-order relationships, but its gains remain limited.

Limitations of Prior Work: (1) ViG relies on KNN graph construction, which yields quadratic computational complexity and is non-learnable, potentially leading to the loss of key information; (2) Ordinary graphs can only model pairwise relationships and fail to capture high-order correlations among multiple nodes; (3) ViHGNN constructs hypergraphs using fuzzy C-means clustering, which lacks multi-scale information and cannot dynamically adapt during the learning process.

Key Challenge: There is a need to capture high-order relationships (via hypergraphs), but existing hypergraph construction methods either overlook multi-scale properties or suffer from excessive computational complexity.

Goal: Efficiently construct hypergraph-based visual representations that capture multi-scale high-order relationships.

Key Insight: A dual-path hypergraph representation combining clustering (to capture global semantic grouping) and dilated hypergraph construction (to capture multi-scale local spatial relationships).

Core Idea: Constructing a multi-scale hypergraph using clustering and local hyperedges with different dilation rates, and performing dynamic hypergraph convolution via cosine similarity and sparsity-aware weights.

Method

Overall Architecture

The image is split into \(N\) patches acting as vertices. Each block comprises: multi-scale hypergraph construction (clustering hyperedges + dilated hyperedges) \(\rightarrow\) two-stage dynamic hypergraph convolution (vertex convolution aggregating to hyperedges \(\rightarrow\) hyperedge convolution distributing back to vertices) \(\rightarrow\) ConvFFN for enhanced feature transformation. It adopts a hierarchical isotropic structure similar to ViG.
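
As a small illustration of "patches acting as vertices", the sketch below maps an image to a patch grid and converts between grid coordinates and flat vertex indices. The patch size of 16 and the function names are assumptions chosen for illustration; the summary does not state the patch size DVHGNN actually uses.

```python
def patch_grid(image_h, image_w, patch):
    """Return (rows, cols) of the patch grid for a non-overlapping split."""
    if image_h % patch or image_w % patch:
        raise ValueError("image size must be divisible by the patch size")
    return image_h // patch, image_w // patch


def vertex_index(row, col, grid_w):
    """Flat vertex id of the patch at (row, col), row-major order."""
    return row * grid_w + col


def vertex_coords(index, grid_w):
    """Inverse of vertex_index: flat vertex id -> (row, col)."""
    return divmod(index, grid_w)


# Hypothetical setting: a 224x224 image with 16x16 patches.
rows, cols = patch_grid(224, 224, 16)
num_vertices = rows * cols  # N in the text
```

Every later step (clustering, dilated neighbourhoods, window partitioning) operates on these vertex indices rather than on pixels.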

Key Designs

  1. Clustering + Dilated Hypergraph Construction (DHGC):

    • Function: Adaptively obtaining multi-scale hyperedge sets
    • Mechanism: Dual-path design. Clustering path: map patch features to a similarity space and assign them to \(C\) cluster centers via cosine similarity to form semantic-level hyperedges \(\mathcal{E}_c\). Dilation path: for the center vertex \(v_c\) in each \(w \times w\) window, construct local hyperedges with dilation rates \(r=1,2,3\) (corresponding to \(3 \times 3\), \(5 \times 5\), and \(7 \times 7\) receptive fields), where each dilated hyperedge is associated with a learnable sparsity-aware weight \(w_r\). A region partitioning mechanism (similar to Swin Transformer windows) is introduced to reduce the complexity from \(O(NCD)\) to \(O(NCD/m)\)
    • Design Motivation: Clustering hyperedges capture global semantic similarity but overlook spatial locality, whereas dilated hyperedges capture multi-scale spatial relationships but are limited in range. The two paths are complementary
  2. Dynamic Hypergraph Convolution (DHConv):

    • Function: Adaptive feature exchange and fusion
    • Mechanism: Two-stage message passing. Vertex convolution phase: for clustering hyperedges, aggregate vertex features into hyperedges using sigmoid-gated cosine similarity (\(h_e = \frac{1}{C}\left(h_c + \sum_i \text{sig}(\alpha s_i + \beta)\, x_i\right)\)); for dilated hyperedges, aggregate with the learnable sparsity-aware weights \(w_r\). Hyperedge convolution phase: distribute hyperedge features back to vertices using cosine similarity or the sparsity-aware weights, then update with a GIN-style formulation: \(x'_i = \text{FC}(\sigma(\text{Conv}((1+\varepsilon)x_i + z_i)))\)
    • Design Motivation: Different types of hyperedges require distinct aggregation strategies: semantic hyperedges suit similarity-based soft assignment, while spatial hyperedges suit weight-based fixed assignment
  3. ConvFFN + Multi-Head Mechanism:

    • Function: Enhancing feature transformation capability and relieving over-smoothing
    • Mechanism: A ConvFFN (a feed-forward network with convolution, similar to ViG) is appended after the hypergraph convolution to expand the local receptive field. A multi-head mechanism splits the features into groups, performs hypergraph convolution on each group independently, and concatenates the results, which increases representational diversity
    • Design Motivation: Pure hypergraph convolutions can cause node representations to converge (over-smoothing). ConvFFN mitigates this issue by introducing non-linearity and local inductive bias
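
The dilated-hyperedge construction and the clustering-side vertex convolution described above can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the helper names, the 3×3 sampling pattern, the border handling, and the normaliser `1 + sum(gates)` are all assumptions (the summary writes the normaliser as \(C\) without defining it).

```python
import math


def dilated_hyperedge(row, col, r, height, width):
    """Flat vertex ids of a 3x3 pattern sampled with dilation r around (row, col).

    Offsets that fall outside the grid are dropped (assumed border handling).
    """
    members = []
    for dr in (-r, 0, r):
        for dc in (-r, 0, r):
            rr, cc = row + dr, col + dc
            if 0 <= rr < height and 0 <= cc < width:
                members.append(rr * width + cc)
    return members


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def cluster_vertex_conv(center, members, alpha=1.0, beta=0.0):
    """h_e = (h_c + sum_i sig(alpha * s_i + beta) * x_i) / norm.

    s_i is the cosine similarity between the cluster centre and vertex i.
    norm = 1 + sum of gates is an assumption standing in for the 1/C factor.
    """
    gates = [sigmoid(alpha * cosine(center, x) + beta) for x in members]
    norm = 1.0 + sum(gates)
    return [
        (center[d] + sum(g * x[d] for g, x in zip(gates, members))) / norm
        for d in range(len(center))
    ]


# Dilation r covers a (2r + 1) x (2r + 1) field: r = 2 spans 5 rows.
edge = dilated_hyperedge(3, 3, 2, 7, 7)
edge_rows = [i // 7 for i in edge]
span = max(edge_rows) - min(edge_rows) + 1
```

With `r = 1, 2, 3` the same nine-vertex pattern spans 3, 5, and 7 grid rows, which matches the receptive fields listed for the dilation path. In `cluster_vertex_conv`, `alpha` and `beta` would be learnable scalars in a real model; here they are fixed defaults.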

Loss & Training

Standard ImageNet classification training (cross-entropy loss + label smoothing + conventional augmentations like Mixup).
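
For reference, cross-entropy with label smoothing can be written in a few lines of plain Python. The smoothing factor of 0.1 is a common default used here as an assumption; the summary does not give the value used for DVHGNN.

```python
import math


def label_smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a smoothed one-hot target.

    The target distribution puts (1 - smoothing) on the true class and
    spreads `smoothing` uniformly over all K classes.
    """
    k = len(logits)
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    loss = 0.0
    for c, logit in enumerate(logits):
        q = smoothing / k + ((1.0 - smoothing) if c == target else 0.0)
        loss -= q * (logit - log_z)
    return loss
```

Setting `smoothing=0.0` recovers ordinary cross-entropy. With smoothing enabled, a very confident correct prediction is penalised slightly, which discourages over-confident logits.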

Key Experimental Results

Main Results

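| Model | Type | Params (M) | FLOPs (G) | Top-1 Acc (%) |
|---|---|---|---|---|
| DeiT-S | ViT | 22.1 | 4.6 | 79.8 |
| Swin-S | ViT | 50.0 | 8.7 | 83.0 |
| ViG-S | GNN | 27.3 | 4.6 | 82.1 |
| ViHGNN-S | HGNN | 28.5 | 6.3 | 82.5 |
| DVHGNN-S | HGNN | 30.2 | 5.2 | 83.1 |
| DVHGNN-B | HGNN | 92.8 | 16.8 | 84.2 |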

Downstream tasks: Consistent improvements are also achieved in COCO object detection/segmentation and ADE20K semantic segmentation.

Ablation Study

| Configuration | Top-1 Acc |
|---|---|
| Clustering hyperedges only | 82.3% |
| Dilated hyperedges only | 82.5% |
| Clustering + Dilated (Full) | 83.1% |
| Without dynamic weights (fixed uniform aggregation) | 82.7% |
| Fixed window partition | 83.1% |
| No partition | OOM |

Key Findings

  • The dual-path hypergraph (clustering + dilated) outperforms either single-path variant (83.1% vs 82.3%/82.5%), indicating that semantic and spatial information are complementary.
  • Dynamic hypergraph convolution (based on cosine similarity) yields a 0.4% gain over fixed aggregation, demonstrating the effectiveness of adaptive weights.
  • DVHGNN-S has 2.9M more parameters than ViG-S (30.2M vs 27.3M) and a modest FLOP increase (5.2G vs 4.6G), while achieving 1.0% higher accuracy.
  • The region partitioning strategy is what makes training feasible: with window partitioning the model reaches 83.1%, whereas the variant without partitioning runs out of memory (OOM).
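
The gaps quoted in these findings can be recomputed directly from the main results table. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and the `delta` helper are illustrative.

```python
# (params in M, FLOPs in G, top-1 in %), copied from the main results table.
results = {
    "Swin-S": (50.0, 8.7, 83.0),
    "ViG-S": (27.3, 4.6, 82.1),
    "ViHGNN-S": (28.5, 6.3, 82.5),
    "DVHGNN-S": (30.2, 5.2, 83.1),
}


def delta(model, baseline):
    """Per-column difference model - baseline, rounded to one decimal."""
    return tuple(round(a - b, 1) for a, b in zip(results[model], results[baseline]))


vs_vig = delta("DVHGNN-S", "ViG-S")        # (2.9, 0.6, 1.0)
vs_vihgnn = delta("DVHGNN-S", "ViHGNN-S")  # (1.7, -1.1, 0.6)
vs_swin = delta("DVHGNN-S", "Swin-S")      # (-19.8, -3.5, 0.1)
```

Relative to ViG-S, the 1.0-point gain costs 2.9M parameters and 0.6 GFLOPs. Relative to ViHGNN-S, DVHGNN-S is more accurate while using 1.1 GFLOPs less.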

Highlights & Insights

  • Dilated Hypergraphs Analogous to Dilated Convolutions: Introducing the multi-scale concept of dilated convolutions to hypergraph construction is an intuitive and effective design, where different dilation rates correspond to spatial relationships at different scales.
  • Dual-Path Complementary Design: The combination of clustering hyperedges (global semantics) and dilated hyperedges (local space) is conceptually similar to the dual-stream approach in SlowFast networks, but is implemented more naturally within the hypergraph framework.
  • Feasibility of Hypergraph GNNs as General Vision Backbones: DVHGNN-S slightly exceeds Swin-S on ImageNet-1K (83.1% vs 83.0%) with fewer parameters (30.2M vs 50.0M) and fewer FLOPs (5.2G vs 8.7G), which suggests the hypergraph paradigm is viable as a general vision backbone.

Limitations & Future Work

  • The parameter count and FLOPs are still higher than those of ViG-S, leaving room for efficiency optimization.
  • The number of cluster centers \(C\) and the dilation rates \(R\) are manually designed hyperparameters.
  • Comparisons with modern linear-complexity architectures such as Mamba are missing.
  • Hypergraph construction is conducted independently at each layer, lacking cross-layer propagation of hypergraph structures.

Comparison with Related Work

  • vs ViG: ViG utilizes KNN to build ordinary graphs (with quadratic complexity and pairwise relationships), whereas DVHGNN uses clustering and dilation to construct hypergraphs (with linear complexity and high-order relationships).
  • vs ViHGNN: ViHGNN relies on fuzzy C-means clustering (global and non-adaptive), whereas DVHGNN applies cosine-similarity clustering and dilated hypergraphs (multi-scale and adaptive).
  • vs Swin Transformer: Swin performs self-attention within windows, while DVHGNN executes hypergraph convolutions within windows, enabling the latter to capture high-order relationships.

Rating

  • Novelty: ⭐⭐⭐⭐ The dilated hypergraph construction and dual-path design are highly innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation covering ImageNet, COCO, and ADE20K.
  • Writing Quality: ⭐⭐⭐ Rich in content but slightly verbose in structure.
  • Value: ⭐⭐⭐⭐ Advances the development of hypergraph-based vision architectures.