Stratified Knowledge-Density Super-Network for Scalable Vision Transformers¶
Conference: AAAI 2026 arXiv: 2511.11683 Code: N/A Area: Model Compression Keywords: Vision Transformer, Super-Network, Stratified Knowledge Density, PCA, Progressive Dropout
TL;DR¶
This paper proposes transforming a pretrained ViT into a "Stratified Knowledge-Density Super-Network" (SKD Super-Network) via two steps—WPAC (Weighted PCA Attention Contraction) and PIAD (Progressive Importance-Aware Dropout)—that hierarchically organize knowledge within the pretrained weights. Subnetworks of arbitrary size can then be extracted at O(1) cost without additional fine-tuning, achieving performance on par with or surpassing state-of-the-art compression methods.
Background & Motivation¶
Deploying Vision Transformers in practice typically requires training and maintaining multiple model variants for different resource constraints, which is prohibitively costly. Existing scaling approaches suffer from the following limitations:
- Traditional pruning methods: each target size requires an independent pruning and fine-tuning cycle, so multi-scale models cannot be obtained in a single pass.
- Learngene paradigm (e.g., TLEG, WAVE): core weights are extracted from a pretrained model and expanded into descendant models of varying sizes, but this relies on manually designed expansion rules and requires additional knowledge distillation and prolonged fine-tuning.
- Low-rank compression: weight matrices are decomposed into two low-rank factors, but the resulting models suffer from limited information retention.
The authors' core insight is that rather than compressing separately for each target size, it is more effective to establish a stratified knowledge-density structure within the pretrained weights—concentrating the most important knowledge in the leading dimensions. Extracting a subnetwork of arbitrary size then reduces to simply truncating the first \(k\) dimensions.
Method¶
Overall Architecture¶
The method consists of two stages:
1. WPAC stage: an equivalent transformation of the pretrained model's weights via weighted PCA, concentrating knowledge into a small number of critical dimensions (initial knowledge stratification).
2. PIAD stage: further training via progressive importance-aware dropout to reinforce the hierarchical organization of knowledge, enabling smaller subnetworks to maintain strong performance.
The resulting super-network supports on-demand dimension truncation for subnetwork extraction without additional fine-tuning.
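A minimal sketch of what "extraction by truncation" means in practice, assuming PyTorch `nn.Linear` layers; the projection pairing and all names are illustrative, not the authors' released code:

```python
import torch.nn as nn

def truncate_pair(w_in: nn.Linear, w_out: nn.Linear, k: int):
    """Keep only the first k intermediate dimensions of an in/out projection pair.

    After WPAC/PIAD, knowledge is concentrated in the leading dimensions, so
    extracting a smaller subnetwork is just slicing -- no scoring, no fine-tuning.
    Applies to e.g. an MLP fc1/fc2 pair or the V/O projections of attention.
    """
    new_in = nn.Linear(w_in.in_features, k, bias=w_in.bias is not None)
    new_out = nn.Linear(k, w_out.out_features, bias=w_out.bias is not None)
    new_in.weight.data = w_in.weight.data[:k].clone()       # rows = output dims
    if w_in.bias is not None:
        new_in.bias.data = w_in.bias.data[:k].clone()
    new_out.weight.data = w_out.weight.data[:, :k].clone()  # cols = input dims
    if w_out.bias is not None:
        new_out.bias.data = w_out.bias.data.clone()
    return new_in, new_out
```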
Key Designs¶
- WPAC — Weighted PCA Attention Contraction
Core Idea: PCA is applied to the intermediate features of attention modules to obtain a principal component transformation matrix ranked by information content, which is then "absorbed" into the adjacent linear layers, realizing a function-preserving equivalent transformation.
V/O Projection Transformation: A weighted covariance matrix is computed over the Value projection output \(X_v \in \mathbb{R}^{n \times d}\) and eigendecomposed to obtain the transformation matrix \(W^{(vo)}_{Trans}\); the weights are then updated as:
\[
W_v \leftarrow W^{(vo)}_{Trans} W_v, \quad b_v \leftarrow W^{(vo)}_{Trans} b_v, \quad W_o \leftarrow W_o \bigl(W^{(vo)}_{Trans}\bigr)^{-1}
\]
Q/K Projection Transformation: The covariance matrices of the Q and K outputs are jointly computed (summed before eigendecomposition), and the orthogonality property \(W^T W = I\) ensures that attention scores remain invariant:
\[
\text{sim}_{(i,j)} \equiv (W_q x_i + b_q)^T \bigl(W^{(qk)}_{Trans}\bigr)^T W^{(qk)}_{Trans} (W_k x_j + b_k)
\]
MLP Transformation: Due to the presence of nonlinear activations, direct PCA is infeasible; instead, dimensions are ranked by Taylor importance and a permutation matrix \(W_{sort}\) is constructed to reorder the weights.
Weighting Strategy: Standard PCA treats all tokens equally, but different tokens contribute differently to the final prediction. WPAC computes token-level importance using a first-order Taylor estimate:
\[
\Theta_{TE}(h_i) \approx \left| \frac{\partial \mathcal{C}}{\partial h_i} \cdot h_i \right|
\]
The centered token features are then weighted by \(\sqrt{\Theta^{\text{token}}_{TE}}\) before the covariance is computed.
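A minimal numerical sketch of the V/O contraction described above, assuming a proxy batch of Value outputs and per-token Taylor scores are already available; function and variable names are hypothetical and the per-head structure of MHSA is ignored for brevity, so this is my reading of the description rather than the authors' implementation:

```python
import torch

def wpac_vo_transform(W_v, b_v, W_o, X_v, token_importance):
    """Weighted-PCA contraction of the V/O projection pair (function-preserving).

    W_v: (d, d_model) value projection weight, b_v: (d,), W_o: (d_model, d).
    X_v: (n, d) value outputs collected on a small proxy set.
    token_importance: (n,) first-order Taylor scores |dC/dh . h| per token.
    """
    w = token_importance.sqrt().unsqueeze(1)                 # sqrt-importance weights
    Xc = (X_v - X_v.mean(dim=0, keepdim=True)) * w           # centered, weighted features
    cov = Xc.T @ Xc / Xc.shape[0]                            # weighted covariance (normalization
                                                             # constant does not affect eigenvectors)
    eigvals, eigvecs = torch.linalg.eigh(cov)                # symmetric eigendecomposition
    order = eigvals.argsort(descending=True)                 # most informative components first
    W_trans = eigvecs[:, order].T                            # rows = principal directions

    # Absorb the orthogonal rotation into the adjacent linear layers (equivalent transform):
    W_v_new = W_trans @ W_v
    b_v_new = W_trans @ b_v
    W_o_new = W_o @ W_trans.T                                # inverse of an orthogonal matrix
    return W_v_new, b_v_new, W_o_new
```

Because `W_trans` is orthogonal, the composed mapping through V and O is unchanged, while the V output dimensions are now ordered by explained (token-weighted) variance, so truncating the trailing dimensions discards the least informative directions first.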
- PIAD — Progressive Importance-Aware Dropout
Objective: Building on WPAC, further reinforce knowledge stratification so that small subnetworks also perform well.
Droppable Units: The intermediate dimensions of MHSA are divided into 8 groups and those of MLP into 32 groups, with each group serving as one "droppable unit."
Importance Estimation proceeds in two steps:
- Module sensitivity \(\gamma_m\): the relative increase in the cost function when module \(m\) is bypassed.
- Dimension-level importance: \(I^{(m)}_i = \gamma_m \cdot \alpha^{(m)}_i\), where \(\alpha^{(m)}_i\) is the intra-module normalized Taylor importance.

The final importance of a droppable unit is normalized by its MACs: \(I_u = \frac{\sum_{i \in u} I^{(m)}_i}{\text{MACs}(u)}\)
Progressive Update: Given a target maximum compression ratio \(r\), the Dropout List is progressively constructed over \(P_e\) epochs. At the beginning of each epoch, the least important units are appended until the cumulative MACs of the list reaches the target value.
Subnetwork Sampling and Training: For each batch, a truncation index \(s\) is randomly sampled; all units ranked below \(s\) in the Dropout List are dropped, and the resulting subnetwork is trained with gradients back-propagated to the super-network.
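A rough sketch of the PIAD bookkeeping, under the simplifying assumptions that the per-epoch MACs budget ramps linearly up to the target and that importances are precomputed; the data structures, names, and the reading of "ranked below \(s\)" are illustrative, not the paper's code:

```python
import random

def unit_importance(gamma_m, taylor_scores, unit_indices, unit_macs):
    """MACs-normalized importance of one droppable unit inside module m.

    gamma_m: module sensitivity (relative loss increase when module m is bypassed).
    taylor_scores: 1-D array of per-dimension Taylor importances for module m.
    """
    alpha = taylor_scores / taylor_scores.sum()              # intra-module normalization
    return gamma_m * sum(alpha[i] for i in unit_indices) / unit_macs

def build_dropout_list(units, target_macs, progressive_epochs, epoch):
    """Grow the Dropout List at the start of each epoch, least important units first.

    Each unit is a dict like {"id": ..., "importance": I_u, "macs": ...}; the MACs
    budget is assumed to ramp linearly over progressive_epochs (P_e) epochs.
    """
    budget = target_macs * min(1.0, (epoch + 1) / progressive_epochs)
    ranked = sorted(units, key=lambda u: u["importance"])     # least important first
    dropout_list, macs = [], 0.0
    for u in ranked:
        if macs >= budget:
            break
        dropout_list.append(u)
        macs += u["macs"]
    return dropout_list

def sample_dropped_units(dropout_list):
    """Per batch: sample a truncation index s and disable one prefix of the list."""
    s = random.randint(0, len(dropout_list))
    return dropout_list[:s]                                   # units dropped for this batch
```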
Loss & Training¶
- The WPAC stage requires no training; PCA transformations are computed using a small proxy set of 1,024 samples.
- PIAD stage: DeiT-B is trained for 150 epochs; DeiT-S/Ti are trained for 300 epochs.
- The progressive Dropout List construction spans \(P_e = 50\) epochs.
- Training settings follow the standard DeiT configuration.
Key Experimental Results¶
Main Results¶
Comparison with network scaling methods (ImageNet-1k, subnetworks extracted without fine-tuning):
| Method | KD | DeiT-B 4:12 | DeiT-B 6:12 | DeiT-S 4:12 | DeiT-S 6:12 | DeiT-Ti 4:12 | DeiT-Ti 6:12 |
|---|---|---|---|---|---|---|---|
| Albert | No | 71.7 | 75.3 | 65.0 | 69.7 | 55.2 | 59.8 |
| WAVE | Yes | 74.5 | 77.5 | 68.9 | 72.7 | 58.6 | 63.2 |
| TLEG | Yes | 71.6 | 76.2 | 63.7 | 69.5 | — | 58.2 |
| SKD (Ours) | No | 77.0 | 80.4 | 70.6 | 76.2 | 61.4 | 65.8 |
Comparison with network compression methods (DeiT-S, ImageNet-1k):
| Method | MACs | Params | Epochs | Top-1 |
|---|---|---|---|---|
| DeiT-S (original) | 4.26G | 22.05M | — | 79.83 |
| SPViT | 3.30G | 15.90M | 300 | 78.30 |
| RePaViT | 3.20G | 16.70M | 300 | 78.90 |
| WDPruning | 3.10G | 15.00M | 100 | 78.55 |
| SKD (Ours) | 3.07G | 16.03M | 30 | 79.42 |
Ablation Study¶
| Configuration | DeiT-S 4:12 | DeiT-S 8:12 | DeiT-Ti 4:12 | DeiT-Ti 8:12 | Notes |
|---|---|---|---|---|---|
| Baseline (random truncation) | 1.2 | 39.1 | 1.4 | 25.8 | No stratified structure |
| B + Channel Dropout | 7.3 | 34.4 | 2.0 | 23.1 | Uniform dropout |
| B + Weighted CD | 34.2 | 64.7 | 29.4 | 49.1 | Weighted dropout |
| B + LayerDrop | 39.7 | 68.6 | 34.5 | 57.1 | Layer-level dropout |
| B + PIAD | 70.6 | 78.2 | 61.4 | 68.6 | Progressive importance-aware |
WPAC vs. other pruning criteria (direct evaluation without fine-tuning at varying retention ratios; full DeiT-B: 81.8):
| Criterion | 1/4 retained | 2/4 retained | 3/4 retained |
|---|---|---|---|
| Random | 0.9 | 24.4 | 74.0 |
| Magnitude | 1.7 | 29.2 | 71.9 |
| Taylor FO | 6.0 | 52.4 | 78.1 |
| Hessian | 5.1 | 52.0 | 78.2 |
| WPAC | 41.8 | 76.9 | 81.2 |
Key Findings¶
- WPAC substantially outperforms traditional pruning criteria: at a 1/4 retention rate, WPAC achieves 41.8% versus Taylor FO's 6.0%, a gap of 35.8 percentage points.
- Zero fine-tuning surpasses distillation-dependent scaling methods: SKD requires neither a teacher model nor knowledge distillation, yet directly extracted subnetworks outperform methods such as WAVE that rely on KD.
- Minimal training cost: in the DeiT-S compression comparison, only 30 fine-tuning epochs are needed to match or exceed compression methods requiring 100–300 epochs.
- A proxy set of only 1,024 samples suffices to obtain accurate PCA projections (verified in Figure 5).
- Using all tokens with importance weighting yields the best result in weighted PCA (Table 6), though using all tokens without weighting renders the covariance matrix ill-conditioned.
Highlights & Insights¶
- Clever exploitation of function-preserving transformations: WPAC does not alter network behavior but reorganizes the knowledge distribution within the weights, constituting a "free" knowledge concentration operation.
- Unified understanding from an information-theoretic perspective: PCA maximizes information retention → knowledge is concentrated in the leading \(k\) dimensions → truncation yields the optimal subnetwork (see the optimality statement after this list).
- Extremely low training cost: zero training in the WPAC stage plus minimal training in the PIAD stage results in a total cost far below that of repeated conventional compression.
- Unified framework: a single super-network covers all model sizes from 1/3 to full scale, truly achieving "build once, extract anywhere."
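To make the "truncation yields the optimal subnetwork" bullet concrete, the standard PCA truncation-optimality fact it appeals to can be stated as follows (a textbook result given here for orientation, applied to the weighted, centered features, not a derivation from the paper):
\[
\min_{\substack{U_k \in \mathbb{R}^{d \times k} \\ U_k^{\top} U_k = I_k}} \; \mathbb{E}\bigl\| h - U_k U_k^{\top} h \bigr\|_2^2 \;=\; \sum_{i > k} \lambda_i,
\]
where \(\lambda_1 \ge \dots \ge \lambda_d\) are the eigenvalues of the feature covariance and the minimum is attained by its top-\(k\) eigenvectors. After WPAC, keeping the first \(k\) dimensions is exactly this rank-\(k\) projection, so the information discarded by truncation is bounded by the tail eigenvalue mass.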
Limitations & Future Work¶
- Validation is limited to DeiT and Swin Transformer; extension to LLMs or multimodal models has not been explored.
- PIAD training still requires hundreds of epochs, which may be costly for larger models.
- Only homogeneous subnetwork extraction is supported (uniform dimension truncation); heterogeneous compression (different compression rates per layer) is not explored.
- Downstream transfer experiments cover only image classification; detection and segmentation tasks are not evaluated.
- MLP dimension ordering relies solely on Taylor importance rather than PCA, potentially incurring information loss.
Related Work & Insights¶
- Once-for-All / Slimmable Networks are seminal works on scalable networks; SKD achieves similar goals for ViTs with greater efficiency.
- The Learngene series (TLEG, WAVE, SWS) provides ideas for weight sharing and expansion, but this work demonstrates that directly establishing a stratified structure from a pretrained model is more effective.
- Low-rank compression methods (SVD decomposition) are conceptually related to WPAC, but performing PCA in the feature space rather than the weight space yields superior results.
- Taylor importance is widely used; combining it with PCA for weighting in this work constitutes an interesting contribution.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of function-preserving PCA transformation and progressive dropout is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extensive comparative experiments, detailed ablations, and cross-model validation.
- Writing Quality: ⭐⭐⭐⭐ — Mathematical derivations are rigorous and illustrations are clear.
- Value: ⭐⭐⭐⭐⭐ — Highly practical, significantly reducing the cost of multi-scale deployment.