Superpixel-Informed Implicit Neural Representation for Multi-Dimensional Data

Conference: ECCV 2024
arXiv: 2411.11356
Code: None
Area: Others
Keywords: Implicit Neural Representation, Superpixel, Multi-Dimensional Data Recovery, Attention Mechanism, Dictionary Learning

TL;DR

Proposes Superpixel-Informed Implicit Neural Representation (S-INR), which replaces pixels with generalized superpixels as the basic unit of INR. By utilizing two modules—an exclusive attention MLP and a shared dictionary matrix—this method fully mines semantic information within and across generalized superpixels, outperforming existing INR methods in tasks such as image reconstruction, completion, denoising, and point data recovery.

Background & Motivation

Implicit Neural Representation (INR) uses coordinate-based MLPs to map spatial coordinates to corresponding values (e.g., pixel intensities, occupancy values) and has achieved success in representing multi-dimensional data such as images, videos, and 3D shapes. However, existing INR methods suffer from a fundamental issue:

Ignoring the inherent semantic information of the data. Standard INRs treat individual pixels as the basic unit, mapping coordinates to values pixel-by-pixel independently. This neither utilizes local semantic correlations between adjacent pixels nor captures structural commonalities across different regions.

To address this limitation, the authors raise a natural question: Can we develop a new representation method within the INR framework that efficiently utilizes the data's inherent semantic information?

Core Idea: Replace pixels with generalized superpixels containing rich semantic data as the basic units of INR.

Method

Overall Architecture

S-INR consists of three key components:

1. Generalized Superpixel Segmentation Algorithm (GSSA): Segments data into semantically consistent regions.
2. Exclusive Attention MLP ($\Psi_{\theta_k}$): Uses an independent MLP for each superpixel to capture its internal semantics.
3. Shared Dictionary Matrix ($\mathbf{D}$): Shared across all superpixels to capture commonalities between them.

Mathematical expression: $$\hat{\mathbf{o}}^k = \mathbf{D}(\Psi_{\theta_k}(\mathbf{x}^k)), \quad k = 1, \ldots, K$$

Key Designs

1. Generalized Superpixel Definition

Unlike traditional superpixels, which only apply to images, generalized superpixels extend to arbitrary point data (e.g., 3D surfaces, meteorological data). They must satisfy two conditions:

- Disjointness: There is no overlap between different superpixels.
- Spatial Connectivity: Data points within each superpixel are spatially continuous.

2. Generalized Superpixel Segmentation Algorithm (GSSA)

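Based on a variant of k-means++, GSSA considers both feature similarity and spatial coordinate distance during clustering. Point $i$ is assigned to the cluster whose center is closest under the combined distance: $$m_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_{j} \|\mathbf{o}_i - \boldsymbol{\mu}_j\|^2 + \alpha \|\mathbf{x}_i - \mathbf{x}_{\boldsymbol{\mu}_j}\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Here $\mathbf{o}_i$ and $\mathbf{x}_i$ are the value and coordinate of point $i$, and $\boldsymbol{\mu}_j$ and $\mathbf{x}_{\boldsymbol{\mu}_j}$ are the value and coordinate of the $j$-th cluster center.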

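The weight $\alpha$ controls the strength of the spatial term. The hard assignment makes the superpixels disjoint, and a larger $\alpha$ pulls each cluster toward a spatially compact region, so that both conditions of the generalized superpixel are met.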
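
The assignment rule can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `gssa_sketch`, the fixed iteration count, and the plain random seeding (used here in place of k-means++ seeding) are all assumptions made for illustration, and no connectivity post-processing is applied.

```python
import numpy as np

def gssa_sketch(values, coords, K, alpha, iters=10, seed=0):
    """Cluster points by feature distance plus alpha times spatial distance.

    values: (N, C) point values, coords: (N, d) point coordinates.
    Returns an (N,) array of superpixel labels in [0, K).
    """
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: plain random seeding instead of k-means++ seeding.
    idx = rng.choice(len(values), size=K, replace=False)
    mu_o, mu_x = values[idx].copy(), coords[idx].copy()
    for _ in range(iters):
        # Squared distances from every point to every center, shape (N, K).
        d_feat = ((values[:, None, :] - mu_o[None, :, :]) ** 2).sum(-1)
        d_spat = ((coords[:, None, :] - mu_x[None, :, :]) ** 2).sum(-1)
        # Hard assignment: each point gets exactly one label (disjointness).
        labels = np.argmin(d_feat + alpha * d_spat, axis=1)
        for k in range(K):
            members = labels == k
            if members.any():  # keep the old center if a cluster is empty
                mu_o[k] = values[members].mean(axis=0)
                mu_x[k] = coords[members].mean(axis=0)
    return labels
```

Setting `alpha=0` in this sketch reduces it to ordinary k-means on the values alone, which makes the role of the spatial term easy to inspect.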

3. Exclusive Attention MLP

A channel attention module is inserted into the MLP of each superpixel to enhance expressiveness across feature dimensions: $$\psi_l^k(\mathbf{z}_{l+1}^k) = \eta(\mathbf{U}_l^k(\delta(\mathbf{V}_l^k(\tau(\mathbf{z}_{l+1}^k))))) \otimes \mathbf{z}_{l+1}^k$$

where $\tau$ represents channel average pooling, $\delta$ is ReLU, $\eta$ is sigmoid, and $\otimes$ is the element-wise channel product. This channel attention mechanism enables the model to adaptively emphasize feature dimensions that are most important to the current superpixel.
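
A rough NumPy sketch of this gating follows. It assumes that $\tau$ averages each channel over the points of one superpixel and that $\mathbf{V}$ and $\mathbf{U}$ are plain weight matrices without biases; the function name `channel_attention` and the shapes are illustrative, not taken from the paper.

```python
import numpy as np

def channel_attention(z, V, U):
    """Rescale each feature channel of one superpixel by a learned gate.

    z: (n_points, c) features, V: (c_mid, c), U: (c, c_mid).
    """
    pooled = z.mean(axis=0)                      # tau: average per channel, (c,)
    hidden = np.maximum(V @ pooled, 0.0)         # delta: ReLU, (c_mid,)
    gate = 1.0 / (1.0 + np.exp(-(U @ hidden)))   # eta: sigmoid, values in (0, 1)
    return z * gate                              # channel-wise product via broadcasting
```

Because the gate lies strictly between 0 and 1, the module can only attenuate channels relative to each other; it never changes the sign of a feature.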

4. Shared Dictionary Matrix

$\mathbf{D} \in \mathbb{R}^{s \times r}$ is shared across all superpixels. It plays a role similar to the dictionary in dictionary learning, but is defined as a learnable parameter. It maps the $r$-dimensional coefficients output by each superpixel MLP to an $s$-dimensional output. Because every superpixel uses the same $\mathbf{D}$, information is exchanged across superpixels and their commonalities are captured.
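
The combination of exclusive MLPs and a shared dictionary, $\hat{\mathbf{o}}^k = \mathbf{D}(\Psi_{\theta_k}(\mathbf{x}^k))$, might be sketched as below. This is a simplified illustration under several assumptions: the attention module is omitted, the initialization is arbitrary, $\omega_0$ is applied only in the first layer, and all function names are hypothetical.

```python
import numpy as np

def init_mlp(rng, d_in, hidden, n_layers, r):
    """Random weights for one exclusive MLP (hypothetical initialization)."""
    sizes = [d_in] + [hidden] * (n_layers - 1) + [r]
    return [(rng.normal(0.0, 1.0 / np.sqrt(a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x, omega0=30.0):
    """Map coordinates x (n, d_in) to r-dimensional coefficients (n, r)."""
    W, b = params[0]
    h = np.sin(omega0 * (x @ W + b))   # sinusoidal first layer
    for W, b in params[1:-1]:
        h = np.sin(h @ W + b)          # sinusoidal hidden layers
    W, b = params[-1]
    return h @ W + b                   # linear output: coefficients

def s_inr_forward(mlps, D, coords_per_superpixel):
    """Apply the shared dictionary D (s, r) to each superpixel's coefficients."""
    return [mlp_forward(p, x) @ D.T for p, x in zip(mlps, coords_per_superpixel)]

rng = np.random.default_rng(0)
K, r, s = 3, 8, 3                      # illustrative sizes only
mlps = [init_mlp(rng, d_in=2, hidden=35, n_layers=5, r=r) for _ in range(K)]
D = rng.normal(size=(s, r))            # one dictionary for all superpixels
coords = [rng.random((n, 2)) for n in (40, 25, 60)]
outputs = s_inr_forward(mlps, D, coords)
```

Each entry of `outputs` has one row per point in that superpixel and `s` columns. Only `D` is common to all superpixels, which is where cross-superpixel sharing enters in this sketch.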

Loss & Training

Diverse loss functions are designed for three tasks:

- Data Reconstruction: $\sum_k \sum_i \|\mathbf{o}_i^k - \mathbf{D}(\Psi_{\theta_k}(\mathbf{x}_i^k))\|^2$
- Data Completion: The same loss computed only on the observed locations $\Omega$, written $\|\cdot\|_\Omega^2$.
- Data Denoising: Same as reconstruction, leveraging the implicit regularization of INR to filter noise.
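
The difference between the reconstruction and completion losses can be sketched in a few lines. The function names and the binary `mask` encoding of the observed set $\Omega$ are assumptions made for illustration.

```python
import numpy as np

def reconstruction_loss(pred, target):
    """Sum of squared errors over all points."""
    return float(((pred - target) ** 2).sum())

def completion_loss(pred, target, mask):
    """Sum of squared errors over observed entries only (mask == 1)."""
    diff = (pred - target) * mask   # unobserved entries contribute zero
    return float((diff ** 2).sum())
```

With a mask of all ones, `completion_loss` equals `reconstruction_loss`, so denoising and reconstruction share the first function in this sketch.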

Training uses the Adam optimizer. The method is unsupervised: it fits only the observed data and needs no external training set.

Hyperparameters: hidden layer size 35, 5 layers, number of superpixels $K \in \{15, 25, 50\}$, spatial weight $\alpha \in \{1, 5, 20\}$, and sinusoidal frequency factor $\omega_0 \in \{30, 150, 300\}$.
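
To give a feel for how the parameter count grows with $K$, here is a back-of-the-envelope sketch. It assumes 2-D input coordinates, reads "5 layers" as five linear layers, picks a hypothetical coefficient size $r = 10$ and output size $s = 3$, and ignores the attention weights; none of these values is confirmed by the source.

```python
def mlp_param_count(d_in, hidden, n_layers, d_out):
    """Weights plus biases of a plain MLP with n_layers linear layers."""
    sizes = [d_in] + [hidden] * (n_layers - 1) + [d_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

def s_inr_param_count(K, d_in=2, hidden=35, n_layers=5, r=10, s=3):
    """K exclusive MLPs plus one shared s-by-r dictionary (attention excluded)."""
    per_superpixel = mlp_param_count(d_in, hidden, n_layers, r)
    dictionary = s * r
    return K * per_superpixel + dictionary

for K in (15, 25, 50):
    print(K, s_inr_param_count(K))
```

Under these assumptions each exclusive MLP has 4,245 parameters, so the total scales almost exactly linearly in $K$ while the shared dictionary adds only a constant.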

Key Experimental Results

Main Results

Image Reconstruction:

| Method | Kodim PSNR (dB)↑ | Kodim SSIM↑ | Pavia PSNR (dB)↑ | Pavia SSIM↑ |
|---|---|---|---|---|
| S-INR | 36.077 | 0.965 | 39.102 | 0.949 |
| WIRE | 33.199 | 0.918 | 38.455 | 0.941 |
| SIREN | 33.052 | 0.932 | 37.727 | 0.937 |
| Fourier | 32.101 | 0.899 | 37.982 | 0.935 |
| Gauss | 30.188 | 0.862 | 36.413 | 0.923 |
| DIP | 30.154 | 0.882 | 36.283 | 0.919 |

Point Data Recovery (Table)

3D Surface Completion and Meteorological Data Completion (NRMSE↓, R-Square↑):

| Method | Meteorological (63°N) NRMSE↓ | Meteorological (63°N) R²↑ | 3D Scene1 NRMSE↓ | 3D Scene1 R²↑ |
|---|---|---|---|---|
| S-INR | 0.058 | 0.900 | 0.074 | 0.944 |
| SIREN | 0.078 | 0.818 | 0.109 | 0.878 |
| KNR | 0.072 | 0.849 | 0.112 | 0.868 |
| RF | 0.076 | 0.829 | 0.107 | 0.880 |
| DT | 0.101 | 0.698 | 0.171 | 0.703 |
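
For readers unfamiliar with the metrics in these tables, the following sketch shows common definitions of PSNR, NRMSE, and R². The exact normalizations used in the paper are not stated here, so the choices below (peak value 1.0, NRMSE normalized by the norm of the target) are assumptions. SSIM is omitted for brevity.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def nrmse(pred, target):
    """Error norm divided by target norm (one common convention); lower is better."""
    return np.linalg.norm(pred - target) / np.linalg.norm(target)

def r_square(pred, target):
    """Coefficient of determination; 1.0 means a perfect fit."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Since PSNR is logarithmic in the mean squared error, a gain of 3 dB corresponds to roughly halving the MSE.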

Key Findings

- In image reconstruction, S-INR outperforms WIRE, the second-best method by PSNR, by about 2.9 dB on Kodim (36.077 vs. 33.199 dB) and about 0.6 dB on Pavia (39.102 vs. 38.455 dB).
- In the image completion task (with a 2.5% sampling rate), S-INR reaches 29.068 dB, which is about 1 dB higher than WIRE.
- For image denoising, S-INR achieves effective denoising without any explicit regularization, thanks to the implicit structural constraints of superpixels.
- In the 3D surface completion task, S-INR achieves an R² of 0.944 on Scene1, well above SIREN's 0.878 and RF's 0.880.
- Meteorological data completion verifies the effectiveness of generalized superpixels on non-image data.
- Traditional regression methods (KNR, RF) perform worse than S-INR on datasets with complex structures.

Highlights & Insights

- Paradigm shift from pixel to superpixel: This simple yet powerful idea introduces structural priors into INR.
- The definition of the generalized superpixel goes beyond traditional image superpixels, making it applicable to various data types such as meteorological data and 3D point clouds.
- The "individuality + commonality" architecture is essentially a balance of local and global information: the exclusive MLPs capture local features, while the shared dictionary captures global structure.
- The attention mechanism enhances interaction among feature dimensions, which is particularly effective for high-dimensional data such as multi-spectral images.

Limitations & Future Work

- Each superpixel requires an independent MLP, which significantly increases the parameter count when $K$ is large.
- GSSA is a preprocessing step, and the quality of superpixel segmentation on noisy or incomplete observed data is questionable.
- The number of superpixels $K$ is a critical hyperparameter that requires manual tuning.
- The method is only validated on medium-resolution data (256×256 / 512×768), and its scalability to high-resolution data remains unknown.
- Comparisons with recent hash-encoding-based INR methods (e.g., InstantNGP) are missing.
- The size $r$ of the shared dictionary matrix needs to be tuned separately for different data types.

Related Work & Insights

- The sinusoidal activation function of SIREN is retained by S-INR as a basic component.
- The concept is similar to works like Neural Dictionary, but S-INR introduces a superpixel-aware attention MLP before the dictionary.
- The concept of superpixels has been widely used in traditional image processing (e.g., SLIC); this paper introduces it to the field of neural representation.
- The design of the shared dictionary matrix draws inspiration from dictionary learning and sparse coding.

Rating

Novelty: ★★★★☆ — The integration of superpixels and INR is novel, and the definition of generalized superpixels is meaningful.
Value: ★★★☆☆ — Parameter efficiency needs improvement, and there are many hyperparameters.
Experimental Thoroughness: ★★★★★ — Comprehensive validation across images, point data, and multiple tasks.
Writing Quality: ★★★★☆ — Rigorous mathematical definitions and clear structure.