Skip to content

Parallel Sequence Modeling via Generalized Spatial Propagation Network

Conference: CVPR 2025
arXiv: 2501.12381
Code: Project Page
Area: Image Generation / Architecture Design
Keywords: Attention Mechanism, Spatial Propagation Network, Linear Complexity, 2D Structure, High-Resolution Generation

TL;DR

GSPN proposes the Generalized Spatial Propagation Network, which achieves a natively 2D spatially-aware sub-quadratic attention mechanism through 2D linear propagation of row/column line scans under a stability-context condition. This reduces the effective sequence length to \(\sqrt{N}\) and accelerates SD-XL by up to 84x in 16K image generation.

Background & Motivation

While Transformers have achieved great success in vision tasks, they face two core limitations: - Quadratic Computational Complexity: The computational cost is massive when processing high-resolution images, especially for ultra-high-resolution tasks such as 16K. - Neglect of Spatial Structure: Flattening a 2D image into a 1D sequence loses spatial coherence. - Although linear attention methods (\(Q(K^\top V)\)) reduce complexity, they similarly ignore spatial structures. - State Space Models (SSMs, e.g., Mamba) employ 1D raster scanning to handle 2D data, sacrificing inherent spatial structures. - The 2D linear propagation of the core challenge: successive multiplication of weight matrices—eigenvalues that are too large cause exponential growth (instability), while those that are too small lead to signal decay (vanishing information). - There is a critical need to simultaneously ensure stability and model long-range dependencies.

Method

Overall Architecture

GSPN achieves 2D linear propagation via row/column line scanning, where each pixel connects to 3 adjacent pixels in the previous row/column (forming a tridiagonal matrix). After propagating from 4 directions (left-to-right, top-to-bottom, and their reverses), they are fused. GSPN offers both global and local variants, allowing seamless replacement of existing attention modules in current architectures.

Key Designs

Key Design 1: Stability-Context Condition

Function: Ensures that 2D propagation is both stable over long distances and maintains effective contextual information.

Mechanism: The 2D propagation formula is \(h_i^c = w_i^c h_{i-1}^c + \lambda_i^c \odot x_i^c\), with the accumulated weight \(W_{ij} = \prod_{\tau=j+1}^i w_\tau\). To guarantee that \(h_i\) is a weighted average of all previous \(x'_j\), two conditions must be met: (1) \(W_{ij}\) is a dense matrix, and (2) \(\sum_{j=0}^{n-1} W_{ij} = 1\). Theorem 1: If all \(w_\tau\) are row-stochastic matrices (non-negative with a row sum of 1), then \(\sum W_{ij} = 1\) holds. Theorem 2: The row-stochastic constraint simultaneously guarantees propagation stability. Implementation: Apply sigmoid to non-zero elements of each row, followed by row normalization.

Design Motivation: To simultaneously achieve stable propagation and capture long-range dependencies without introducing decay factors—traditional methods must compromise between the two.

Key Design 2: Tridiagonal Matrix + 4-Direction Line Scan

Function: Establishes dense pairwise connections among all pixels in a parameter-efficient manner.

Mechanism: Each pixel only connects to 3 adjacent pixels in the previous row/column (top-left, top-middle, top-right), meaning \(w_\tau\) is a tridiagonal matrix. Key mathematical property: the product of multiple tridiagonal matrices is a dense matrix, so long-range connections are naturally established after propagating through multiple rows. Propagation is conducted from 4 directions, which are then aggregated using a learnable merger. Using a custom CUDA kernel for parallelization: inter-row propagation is sequential, while inter-column, inter-channel, and inter-batch calculations are parallelized, resulting in an effective sequence length of only \(\sqrt{N}\).

Design Motivation: Directly learning an \(n \times n\) fully-connected matrix requires too many parameters; the combination of tridiagonal connections and cumulative multiplication achieves equivalent full connectivity with \(O(3n)\) parameters.

Key Design 3: Global/Local Variants and Task Adaptation

Function: Flexibly selects global or local propagation scopes according to task requirements.

Mechanism: Local GSPN divides a spatial dimension into \(g\) non-overlapping groups, with propagation occurring independently within each group, reducing complexity by \(g\) times (in the extreme case of \(g=n\), complexity is \(O(1)\)). Classification tasks: local layers are used at lower levels, and global layers are used at higher levels (requiring semantic understanding). Generation tasks: local layers are primarily used (requiring spatial details and local consistency). T2I Generation: directly replaces the self-attention layers in SD-XL, initializing GSPN parameters using pretrained Q/K/V weights (exploiting the mathematical relationship between GSPN and linear attention).

Design Motivation: Different vision tasks have disparate demands for global vs. local information, and flexible switching maximizes efficiency. Eliminating position embeddings (as scanning inherently implies positional information) avoids common aliasing problems.

Loss & Training

Varies with tasks: cross-entropy for classification, diffusion loss for DiT, and standard SD loss for T2I.

Key Experimental Results

Main Results: ImageNet Classification

Model Type Params (M) MACs (G) Top-1 Acc
GSPN-T Line scan 30 5.3 83.0
VMamba-T Raster 22 5.6 82.2
Swin-T Transformer 29 4.5 81.3
ConvNeXT-T ConvNet 29 4.5 82.1
LocalVMamba-T Raster 26 5.7 82.7

Ablation Study: Inference Speed Comparison (SD-XL 16K Generation)

Attention Type 16K Inference Time Acceleration Ratio
Softmax Attention Extremely slow
GSPN (Local) Extremely fast 84×
GSPN (Global) Fast Medium

Key Findings

  • GSPN-T (83.0%) outperforms all Mamba/Transformer/ConvNet models of similar size on ImageNet classification.
  • In DiT class-conditional generation, GSPN outperforms SOTA Diffusion Transformers using only 65.6% of the parameters.
  • Replacing self-attention in SD-XL achieves an 84x speedup in 16K image generation while maintaining original performance.
  • The theoretical guarantees of the stability-context condition are validated experimentally—demonstrating effective modeling of long-range dependencies.

Highlights & Insights

  • Mathematical Elegance: Solves both stability and long-range dependency simultaneously through properties of row-stochastic matrices, supported by clear theoretical guarantees.
  • High Practical Appeal: The 84x speedup makes ultra-high-resolution generation feasible.
  • No Need for Position Embeddings: Scanning order implicitly encodes position information, avoiding extrapolation and aliasing issues.

Limitations & Future Work

  • The computational overhead introduced by the 4-direction scanning is a constant factor but non-negligible.
  • The sparsity of tridiagonal connections might be suboptimal for certain tasks that require exact global correspondences.
  • Currently validated mainly on 2D images; extending this to 3D (e.g., video) and multi-modal settings remains to be explored.
  • Future work can explore adaptive numbers of directions and connection schemes.
  • Connection to SPN: GSPN promotes a single-layer, module-level SPN to a stackable backbone architecture and solves the issue of long-range propagation.
  • Comparison with Mamba: GSPN natively preserves 2D structures instead of flattening them into 1D.
  • An effective sequence length of \(\sqrt{N}\) may inspire other efficient architecture designs aimed at processing 2D data.

Rating

⭐⭐⭐⭐ — A novel attention mechanism featuring theoretical elegance and thorough experimental evaluation. The mathematical derivation of the stability-context condition is convincing, and the 84x speedup provides immense practical value. It exhibits competitive performance across three distinct tasks: classification, conditional generation, and T2I generation.