
Graph Your Own Prompt

Conference: NeurIPS 2025 · arXiv: 2509.23373 · Code: to be confirmed (the paper mentions a project website and code link)
Area: Model Compression · Keywords: graph regularization, feature alignment, semantic consistency, classification, parameter-free module

TL;DR

This paper proposes a Graph Consistency Regularization (GCR) framework that inserts parameter-free Graph Consistency Layers (GCL) at arbitrary network depths. GCL aligns the relational graph of intermediate features with a class-aware semantic graph derived from model predictions, promoting semantically consistent feature learning in a self-prompting manner—improving classification generalization without modifying the architecture or introducing additional parameters.

Background & Motivation

  1. Lack of semantic alignment in deep features: Although deep networks learn rich representations, intermediate features often capture noisy inter-class similarities that contradict the model's own predictions, causing samples from different classes to cluster closely in feature space.
  2. Explicit sampling requirements in contrastive learning: Existing contrastive learning and graph regularization methods require positive/negative sampling strategies or carefully designed data augmentation, and typically operate at a single network layer.
  3. Underutilization of prediction information: The network's own softmax predictions encode rich semantic relational structure (predictions of same-class samples should be similar), yet existing methods rarely feed this structure back into feature learning.
  4. Lack of multi-layer structural supervision: Existing auxiliary losses for intermediate layers typically supervise each layer independently, rarely exploiting structured model outputs to regularize feature learning.
  5. Lightweight deployment requirements: In practice, regularization methods are expected to introduce no extra parameters, require no architectural modifications, and leave the training pipeline unchanged.
  6. Naturalness of graph structure: Pairwise sample relationships are naturally modeled as graphs; aligning a feature similarity graph with a prediction similarity graph provides an elegant cross-space structural constraint.

Method

Graph Consistency Layer (GCL)

Feature relational graph construction: Given the feature matrix \(X^{(l)} \in \mathbb{R}^{n \times d}\) at layer \(l\) (where \(n\) is the batch size), a pairwise relational graph is constructed using ReLU-gated cosine similarity:

\[F_{ij}^{(l)} = \text{ReLU}(\cos(x_i^{(l)}, x_j^{(l)}))\]
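A minimal PyTorch sketch of this construction (the small clamp for numerical stability is our addition, not from the paper):

```python
import torch

def feature_graph(x: torch.Tensor) -> torch.Tensor:
    """ReLU-gated cosine-similarity graph for a batch of layer-l features.

    x: (n, d) feature matrix; returns an (n, n) graph with
    F_ij = ReLU(cos(x_i, x_j)).
    """
    x = x / x.norm(dim=1, keepdim=True).clamp_min(1e-8)  # row-normalize
    return torch.relu(x @ x.T)  # negative similarities are gated to zero
```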

Masked prediction relational graph construction: A prediction similarity matrix \(S\) is computed from the softmax of the final-layer logits \(Z \in \mathbb{R}^{n \times C}\), then filtered by a class-aware binary mask \(M\) (1 for same-class pairs, 0 otherwise):

\[P_{ij} = M_{ij} \odot S_{ij}\]

The mask serves two purposes: (1) filtering unreliable cross-class prediction similarities in early training; (2) focusing on intra-class semantic relationships to avoid interference from visually similar but semantically distinct classes.
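A matching sketch for the masked prediction graph; the write-up does not pin down the similarity function behind \(S\), so cosine similarity between softmax rows is an assumption here:

```python
import torch

def prediction_graph(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Class-aware masked prediction graph P = M ⊙ S from final-layer logits."""
    p = torch.softmax(logits, dim=1)                    # (n, C) prediction vectors
    p = p / p.norm(dim=1, keepdim=True).clamp_min(1e-8)
    s = p @ p.T                                         # prediction similarity S (assumed cosine)
    m = (labels[:, None] == labels[None, :]).float()    # class-aware mask M: 1 for same-class pairs
    return m * s
```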

Graph Consistency Regularization (GCR)

Per-layer alignment loss: The strict upper-triangular portions of the feature graph and prediction graph are extracted (eliminating self-connections and double-counting), and their squared Frobenius distance is computed:

\[\mathcal{L}_{\text{GCR}}^{(l)} = \|\text{triu}(F^{(l)}) - \text{triu}(P)\|_F^2\]
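In code, the strict upper triangle can be taken with `torch.triu_indices`; a minimal sketch:

```python
import torch

def gcr_layer_loss(f_l: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius distance between the strict upper triangles."""
    n = f_l.size(0)
    iu = torch.triu_indices(n, n, offset=1)    # indices strictly above the diagonal
    diff = f_l[iu[0], iu[1]] - p[iu[0], iu[1]]
    return (diff ** 2).sum()
```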

Multi-layer aggregation: Alignment losses are collected at \(K\) insertion points and aggregated via a weighted sum:

\[\mathcal{L}_{\text{GCR}} = \sum_{l=1}^{K} w_l \cdot \|\text{triu}(F^{(l)}) - \text{triu}(P)\|_F^2\]

Adaptive weighting: Per-layer weights are a softmax over the negated graph inconsistencies, so layers whose feature graphs already agree more closely with the prediction graph receive higher weight, while badly misaligned (typically noisier) layers are softly down-weighted:

\[w_l = \frac{\exp(-\|\text{triu}(F^{(l)}) - \text{triu}(P)\|_F^2)}{\sum_{j=1}^{K} \exp(-\|\text{triu}(F^{(j)}) - \text{triu}(P)\|_F^2)}\]
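Putting aggregation and adaptive weighting together (reusing `gcr_layer_loss` from the sketch above; detaching the weights from the gradient is an assumption, as the write-up does not say whether they receive gradients):

```python
import torch

def gcr_loss(feature_graphs: list, p: torch.Tensor) -> torch.Tensor:
    """Adaptively weighted multi-layer GCR loss over K insertion points."""
    losses = torch.stack([gcr_layer_loss(f_l, p) for f_l in feature_graphs])
    # Smaller mismatch -> larger weight, per the softmax formula above.
    weights = torch.softmax(-losses.detach(), dim=0)
    return (weights * losses).sum()
```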

Overall training objective: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda \cdot \mathcal{L}_{\text{GCR}}\), with \(\lambda = 1\) in all experiments.
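An end-to-end sketch of one training step under these definitions (`return_features=True` is a hypothetical model API for exposing intermediate features; whether gradients flow through the prediction graph is likewise not stated):

```python
import torch

feats, logits = model(images, return_features=True)     # hypothetical signature
p = prediction_graph(logits, labels)                    # left attached to the graph here
graphs = [feature_graph(f.flatten(1)) for f in feats]   # flatten spatial dims to (n, d)
loss = torch.nn.functional.cross_entropy(logits, labels) \
       + 1.0 * gcr_loss(graphs, p)                      # lambda = 1 as in the paper
loss.backward()
optimizer.step()
optimizer.zero_grad()
```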

GCL Insertion Strategy

The network is divided into three stages—Early (E), Mid (M), and Late (L)—and seven configurations are evaluated: E, M, L, E+M, M+L, E+L, and Full. Experiments show that Late GCL generally achieves the best results, as later-layer features are semantically richer and closer to the decision boundary.
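One plausible, parameter-free way to realize such insertion points without touching the architecture is a forward hook; this is an implementation guess, not necessarily the paper's mechanism:

```python
# Tapping late-stage features with a forward hook; "model.layer4" assumes a
# ResNet-style backbone and is illustrative only.
captured = []

def gcl_hook(_module, _inputs, output):
    captured.append(output.flatten(1))   # (n, d) features for graph construction

handle = model.layer4.register_forward_hook(gcl_hook)
# ...run forward/backward as usual; clear `captured` between batches and call
# handle.remove() when the hook is no longer needed.
```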

Theoretical Analysis

  1. Generalization bound: Using covering numbers and Dudley's entropy integral, GCR constraints are shown to reduce the effective hypothesis class complexity, thereby lowering generalization error.
  2. Spectral alignment: It is proven that as the Frobenius distance between the feature graph and prediction graph decreases, the spectra of their normalized Laplacians also converge, guaranteeing consistent clustering structure.
  3. PAC-Bayes perspective: The GCR constraint is equivalent to imposing a structural prior over the function space, with the KL divergence upper bound proportional to \(\sum_l \|F^{(l)} - P\|_F^2\).

Key Experimental Results

CIFAR-10 Classification Accuracy (%; six of the 11 evaluated architectures shown, Mean computed over all 11)

| Configuration | MAE | MobileNet | ShuffleNet | GoogLeNet | ResNet-50 | DenseNet-121 | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 88.95 | 90.23 | 91.21 | 94.10 | 95.03 | 95.01 | 93.32 |
| Late GCL | 89.70 | 91.40 | 92.36 | 94.88 | 95.66 | 95.72 | 94.07 |
| Gain | +0.75 | +1.17 | +1.15 | +0.78 | +0.63 | +0.71 | +0.75 |

CIFAR-100 Classification Accuracy (%; five of the 9 evaluated architectures shown, Mean computed over all 9)

| Configuration | MAE | MobileNet | ResNeXt-50 | ResNet-50 | DenseNet-121 | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 64.29 | 65.95 | 77.75 | 77.31 | 77.09 | 72.95 |
| Late GCL | 65.54 | 68.32 | 79.54 | 79.42 | 79.69 | 74.74 |
| Gain | +1.25 | +2.37 | +1.79 | +2.11 | +2.60 | +1.79 |

ImageNet-1K Classification Accuracy (%, Transformer architectures)

| Method | iFormer-S | iFormer-B | ViT-B/16 | ViG-B |
| --- | --- | --- | --- | --- |
| Baseline | 83.4 | 84.6 | 74.3 | 82.3 |
| Late GCL | 84.5 | 86.1 | 75.8 | 84.0 |
| Gain | +1.1 | +1.5 | +1.5 | +1.7 |

Highlights & Insights

  1. Zero parameter overhead: GCL is entirely parameter-free, requires no architectural modification, and leaves the training pipeline unchanged, adding only lightweight matrix operations.
  2. Strong generalizability: Effective across lightweight CNNs (MobileNet/ShuffleNet), deep CNNs (ResNet/DenseNet), and Transformers (ViT/Swin/iFormer).
  3. Elegant self-prompting design: The model's own prediction structure serves as a reference signal for feature learning, forming a self-prompting mechanism that requires no external supervision.
  4. Improved interpretability: GCL-enhanced feature maps focus more on class-discriminative regions (e.g., eyes and ears for cats, mouths and noses for dogs), with accuracy improving from 98.1% to 99.8%.
  5. Comprehensive theoretical grounding: Generalization guarantees are established from three perspectives: covering numbers, spectral graph theory, and PAC-Bayes analysis.

Limitations & Future Work

  1. Validated on classification only: The method is evaluated solely on image classification and has not been extended to tasks such as segmentation, retrieval, or object detection.
  2. Mask construction requires labels: The mask \(M\) relies on ground-truth labels, making it inapplicable to self-supervised or label-free settings.
  3. Intra-batch relational scope: Graph construction is limited to samples within the current batch, which may reduce effectiveness under class imbalance or insufficient class coverage per batch.
  4. Fixed \(\lambda = 1\): While this simplifies hyperparameter tuning, different tasks and architectures may require different regularization strengths.
  5. Late GCL is generally optimal but not universal: The best insertion position varies by architecture, and no automatic selection mechanism is provided.
  6. Relationship to knowledge distillation remains unclear: Using the prediction graph to guide feature learning resembles self-distillation, but no comparison with such methods is conducted.

Comparison with Related Methods

  • vs. Contrastive learning (SimCLR/SupCon): Contrastive learning requires positive/negative sampling and data augmentation and typically operates at a single layer; GCR requires no sampling and supports multi-layer alignment.
  • vs. Graph neural networks: GNNs require maintaining global graphs or additional message-passing modules; GCR constructs graphs dynamically within each batch with no architectural overhead.
  • vs. Center Loss/Triplet Loss: These methods enforce inter-class distances or intra-class compactness but operate in a single representation space; GCR performs structural alignment across both feature and prediction spaces.
  • vs. Knowledge distillation: Distillation requires a teacher model; GCR uses the model's own predictions, constituting a form of self-distillation that is considerably more lightweight.
  • vs. Attention mechanisms: The feature focusing effect of GCL resembles attention, but is entirely parameter-free and achieved through graph alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ The self-prompting idea of aligning feature graphs via prediction graphs is novel; the multi-layer adaptive weighting design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 datasets and 16+ architectures, with visualizations, ablations, weighting scheme comparisons, and theoretical analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, complete theoretical exposition, and rich intuitive visualizations.
  • Value: ⭐⭐⭐⭐ A zero-parameter plug-and-play regularization method with strong practical utility; broader task coverage is needed to fully establish generality.