Auto-Compressing Networks¶
- Conference: NeurIPS 2025
- arXiv: 2506.09714
- Code: Available (declared in paper)
- Area: Network Architecture / Model Compression
- Keywords: Auto-compression, residual connections, layer redundancy, forward connections, continual learning
TL;DR¶
Auto-Compressing Networks (ACNs) replace short residual connections with long-range forward connections that aggregate every layer's output directly into the final output. This makes the Direct Gradient (DG) component significantly stronger than the Forward Gradient (FG), implicitly compressing information into earlier layers. A 6-layer AC-ViT matches a standard 12-layer ViT, BERT retains its performance with roughly 25% of its layers, and further benefits include noise robustness (+6.4%) and reduced catastrophic forgetting in continual learning (−18%).
Background & Motivation¶
Background: Many layers in deep residual networks are redundant and can be removed without performance degradation. Short residual connections create an exponential number of implicit paths, yet many paths remain underutilized.
Limitations of Prior Work: (a) residual networks do not actively compress: deep layers are still allocated computation even when they reduce to identity mappings; (b) LayerDrop/LayerSkip require explicit pruning strategies; (c) no prior architecture automatically determines the number of layers a task actually needs.
Key Challenge: Although residual connections resolve the vanishing gradient problem, they also make deep layers "lazy"—the shortcut paths provide gradient escape routes, relieving deep layers of the need to learn meaningful transformations.
Goal: Design an architecture that automatically compresses information into the minimum necessary number of layers.
Key Insight: Connect all layer outputs directly to the final output (rather than accumulating layer by layer), making the Direct Gradient (DG) substantially stronger than the Forward Gradient (FG), thereby implicitly realizing layer-wise training—information is automatically pushed toward earlier layers.
Core Idea: Long-range forward connections → high DG/FG ratio → information compressed into early layers → deep layers automatically degrade to identity mappings → optimal depth determined automatically.
Method¶
Overall Architecture¶
Input \(x_0\) → \(L\) transformation layers → ACN aggregation: \(y_A = \sum_{i=0}^{L} x_i\) (the input and all layer outputs are summed directly to form the final output) → during training, information is compressed into early layers because DG dominates FG → redundant deep layers can be discarded at inference.
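A minimal PyTorch sketch of this aggregation, assuming generic MLP blocks and a hypothetical `truncate_at` argument for inference-time layer dropping (illustrative only, not the paper's released code):

```python
import torch
import torch.nn as nn

class ACNStack(nn.Module):
    """Toy auto-compressing stack: layers feed forward sequentially,
    but the output is the sum of the input and every layer's output."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # Placeholder blocks; the paper applies the idea to Mixer/ViT/BERT blocks.
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_layers)]
        )

    def forward(self, x, truncate_at=None):
        depth = truncate_at if truncate_at is not None else len(self.layers)
        y = x                              # y_A starts from x_0
        for layer in self.layers[:depth]:
            x = layer(x)                   # forward path: only to the next layer
            y = y + x                      # long-range connection: straight into the output sum
        return y                           # y_A = x_0 + x_1 + ... + x_depth

# After training, redundant deep layers can simply be skipped:
model = ACNStack(dim=64, num_layers=12)
out_full = model(torch.randn(8, 64))                    # all 12 layers
out_trunc = model(torch.randn(8, 64), truncate_at=6)    # keep only the first 6 layers
```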
Key Designs¶
- Long-Range Forward Connections vs. Short Residual Connections:
    - Function: Alter gradient flow so that information is automatically compressed.
    - Mechanism: In ACN, the output \(x_i\) of layer \(i\) reaches \(y\) through a single direct long-range connection plus the paths through later layers, for \(L - i + 1\) routes in total (linear growth); in ResNet there are \(2^{L-i}\) routes (exponential growth). The linear path structure lets the DG component dominate.
    - Design Motivation: The DG/FG ratio in ACN is substantially higher than in ResNet (2–3× for early layers), giving early layers a stronger direct learning signal while deep layers, lacking FG support, automatically degrade. A toy path-count calculation is sketched after this list.
- Automatic Depth Adaptation:
    - Function: Automatically utilizes a different number of layers depending on task difficulty.
    - Mechanism: Experiments show that AC-Mixer uses 8/10/12 layers on 2/5/10-class CIFAR-10 subsets, respectively, whereas ResNet always uses all layers.
    - Design Motivation: ACN naturally provides a search-free depth selection mechanism.
- Noise Robustness and Continual Learning as Incidental Benefits:
    - Function: ACN incidentally achieves improved noise robustness and resistance to catastrophic forgetting.
    - Mechanism: Information compressed into early layers → deep layers perform identity mappings → deep layers are insensitive to noise. In continual learning, compressed representations are less susceptible to being overwritten by new tasks.
    - Design Motivation: The structural properties of ACN naturally produce these favorable side effects.
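The path-count argument in the first design bullet can be checked with a toy calculation (my own illustration of the \(L - i + 1\) vs. \(2^{L-i}\) counts, not code from the paper):

```python
def acn_paths(i: int, L: int) -> int:
    """Paths from layer i's output to y in an ACN:
    one direct long-range connection plus the paths of every later layer."""
    return L - i + 1

def resnet_paths(i: int, L: int) -> int:
    """Paths from layer i's output to y in a residual stack:
    each later block can be traversed or skipped, so the count doubles per block."""
    return 2 ** (L - i)

L = 12
for i in (1, 6, 12):
    print(f"layer {i:2d}: ACN {acn_paths(i, L):3d} paths vs. ResNet {resnet_paths(i, L):5d} paths")
# Only the single direct ACN connection carries DG; with so few indirect routes,
# DG dominates FG and early layers receive the strongest learning signal.
```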
Loss & Training¶
- Standard cross-entropy loss, applied to the aggregated output \(y_A\) (a minimal training-step sketch follows this list)
- Training time is approximately 2× that of ResNet (requires more epochs to converge)
- Applicable to various architectures including MLP-Mixer, ViT, and BERT
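Training therefore reduces to an ordinary supervised loop on the aggregated output. A hedged sketch, reusing the hypothetical `ACNStack` from the earlier sketch together with a placeholder linear head (not the paper's code):

```python
import torch
import torch.nn.functional as F

model = ACNStack(dim=64, num_layers=12)      # hypothetical ACN backbone from the sketch above
head = torch.nn.Linear(64, 10)               # task head for a 10-class problem
optimizer = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=1e-3)

def train_step(x, targets):
    logits = head(model(x))                  # loss is applied only to the aggregated output y_A
    loss = F.cross_entropy(logits, targets)  # standard cross-entropy, no auxiliary losses
    optimizer.zero_grad()
    loss.backward()                          # DG flows from y_A directly to every layer's output
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, 64), torch.randint(0, 10, (8,)))
```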
Key Experimental Results¶
Main Results¶
| Task | ACN Layers | Residual Baseline Layers | Performance |
|---|---|---|---|
| ImageNet-1K (ViT) | 6 | 12 | Comparable |
| CIFAR-10 (MLP-Mixer) | 6–8 | 16 | Comparable |
| BERT (GLUE) | ~25% of layers | Full | Maintained |

| Robustness (accuracy) | ResNet ViT | ACN ViT |
|---|---|---|
| Gaussian noise σ=0.4 | 45.46% | 51.89% (+6.4%) |
| Salt-and-pepper noise p=0.1 | 10.34% | 19.98% (+9.6%) |

| Continual Learning | ResNet | ACN | Improvement |
|---|---|---|---|
| Split CIFAR-100 forgetting | baseline | −18% | Significant |
| Transfer CIFAR-100 → CIFAR-10 | 79–83% | 85.38% | ≥ +2.5% |
Ablation Study¶
| Configuration | Finding |
|---|---|
| DG component analysis | DG dominates in ACN; the DG/FG ratio is 2–3× that of ResNet |
| ACN with DG only | Auto-compression still occurs—DG is the core mechanism |
| ACN vs. DenseNet-Mixer | ACN > DenseNet ≈ DenseFormer |
Key Findings¶
- ACN compresses ViT from 12 layers to 6—a 50% layer reduction with no performance drop.
- Compression is automatic—no search or pruning strategy is required.
- Noise robustness and continual learning improvements are free incidental benefits determined by architectural properties.
- Simpler tasks use fewer layers—2-class tasks use 8 layers, 10-class tasks use 12 layers.
Highlights & Insights¶
- The simple change of "replacing short connections with long connections" has far-reaching effects—automatic compression, robustness, and reduced forgetting.
- The DG/FG ratio analysis provides a clear framework for understanding the compression mechanism.
- Adaptive depth is the most valuable property—search-free architectural efficiency is highly significant for real-world deployment.
Limitations & Future Work¶
- Training time is approximately 2× that of ResNet (700 vs. 300 epochs).
- Validation is limited to small- and medium-scale models—billion-parameter regimes have not been tested.
- No theoretical guarantee that compression is always optimal.
- The truncation depth at inference still requires manual determination (e.g., via a small validation sweep; see the sketch below).
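One practical workaround for the last point is to pick the truncation depth with a validation sweep. A minimal sketch, assuming the hypothetical `ACNStack`/head objects from the earlier sketches and a `val_loader` yielding (inputs, labels) batches:

```python
import torch

@torch.no_grad()
def choose_depth(model, head, val_loader, tolerance=0.005):
    """Return the smallest truncation depth whose validation accuracy
    stays within `tolerance` of the full-depth accuracy."""
    def accuracy(depth):
        correct = total = 0
        for x, y in val_loader:
            preds = head(model(x, truncate_at=depth)).argmax(dim=-1)
            correct += (preds == y).sum().item()
            total += y.numel()
        return correct / total

    full_acc = accuracy(len(model.layers))
    for depth in range(1, len(model.layers) + 1):
        if accuracy(depth) >= full_acc - tolerance:
            return depth
    return len(model.layers)
```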
Related Work & Insights¶
- vs. ResNet: A simple substitution of short residual connections with long forward connections, yet the effect is fundamentally different.
- vs. LayerDrop: LayerDrop requires an explicit pruning strategy; ACN compresses automatically.
- vs. DenseNet: DenseNet connects each layer to all subsequent layers (forward + lateral); ACN connects only to the output (simpler).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ An elegant work in which a simple architectural change produces far-reaching impact.
- Experimental Thoroughness: ⭐⭐⭐⭐ MLP-Mixer + ViT + BERT evaluated across robustness, continual learning, and transfer learning dimensions.
- Writing Quality: ⭐⭐⭐⭐ The DG/FG analysis is intuitive and accessible.
- Value: ⭐⭐⭐⭐⭐ Reframes the understanding of residual connections and establishes a new paradigm for automatic compression.