EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Conference: ICCV 2025 | arXiv: 2507.19360 | Code: https://github.com/zcxcf/EA-ViT | Area: Model Compression / Elastic Networks | Keywords: Vision Transformer, Elastic Architecture, Sub-model Selection, Curriculum Learning, Pareto Optimization

TL;DR

This paper proposes the first ViT framework that introduces elastic structure at the adaptation stage. Through a multi-dimensional elastic architecture, curriculum learning, and a lightweight router, a single adaptation run yields sub-models covering \(10^{26}\) configurations, consistently outperforming existing elastic methods across multiple downstream tasks.

Background & Motivation

After pre-training, Vision Transformers (ViTs) must be adapted to various downstream tasks, yet different deployment platforms (mobile phones, laptops, GPU clusters) impose heterogeneous model-size constraints. Existing solutions suffer from two core limitations:

Traditional pruning methods require separate training and fine-tuning for each target device, incurring high computational costs and complex version management.

Existing elastic methods (DynaBERT, MatFormer, HydraViT, Flextron) either support only a limited number of elastic dimensions or introduce elastic structure during pre-training, still requiring additional fine-tuning when transferred to downstream tasks.

The core question behind EA-ViT: can a pre-trained ViT be made elastic at the adaptation stage (rather than the pre-training stage), so that a single adaptation run produces sub-models suited to diverse resource constraints?

Method

Overall Architecture

EA-ViT proceeds in two stages:

  • Stage 1: Construct a multi-dimensional elastic architecture and apply curriculum elastic adaptation.
  • Stage 2: Search for Pareto-optimal sub-models to initialize the router, then jointly train the router and backbone.

Key Designs

  1. Multi-Dimensional Elastic Architecture

Elastic configurations are introduced along four dimensions:

  • MLP expansion ratio \(R^{(l)}\): controls the hidden dimension of the MLP in each layer.
  • Number of attention heads \(H^{(l)}\): controls the active attention heads per layer.
  • Embedding dimension \(E\): a single embedding dimension shared across all layers.
  • Network depth \(D^{(l)}\): binary indicators that skip a layer's MLP or MHA sub-block via an identity mapping.

The overall configuration space is defined as: \(\theta = (\{R^{(l)}\}_{l=1}^{L}, \{H^{(l)}\}_{l=1}^{L}, E, \{D_{\text{MLP}}^{(l)}, D_{\text{MHA}}^{(l)}\}_{l=1}^{L})\)

For ViT-Base, the search space reaches \(10^{26}\) sub-model configurations, far exceeding prior methods (at most \(10^{14}\)). A nested elastic structure is adopted, with embedding dimensions and attention heads sorted by importance so that critical components are shared across a larger number of sub-models.
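To make the nested design concrete, here is a minimal PyTorch sketch (not the authors' implementation) of an elastic linear layer whose output channels are importance-sorted so that any prefix forms a valid smaller sub-layer; the L1-norm importance criterion and the initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """Linear layer whose output channels are importance-sorted, so any
    prefix [:k] is a valid smaller sub-layer shared by larger sub-models."""

    def __init__(self, in_features: int, max_out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    @torch.no_grad()
    def sort_by_importance(self):
        # Rank output channels by L1 weight norm (an illustrative importance
        # proxy) and move the most important ones to the front.
        order = self.weight.abs().sum(dim=1).argsort(descending=True)
        self.weight.copy_(self.weight[order])
        self.bias.copy_(self.bias[order])

    def forward(self, x: torch.Tensor, active_out: int) -> torch.Tensor:
        # Activate only the first `active_out` output channels.
        return F.linear(x, self.weight[:active_out], self.bias[:active_out])

# An MLP up-projection with elastic ratio R activates round(R * E) hidden units:
layer = ElasticLinear(in_features=768, max_out_features=4 * 768)
layer.sort_by_importance()
x = torch.randn(2, 197, 768)                    # (batch, tokens, embed dim)
hidden = layer(x, active_out=round(2.0 * 768))  # R = 2.0 instead of the full 4.0
```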

  2. Curriculum Elastic Adaptation (CEA)

Simultaneously training a large number of sub-models causes gradient conflicts and parameter interference. CEA therefore employs a curriculum strategy that progressively expands the elastic range from simple to complex:

  • In the initial phase, only the largest sub-model is trained (\(R_{\min}=R_{\max}\), \(H_{\min}=H_{\max}\), etc.).
  • During training, the sampling lower bound is gradually lowered according to a preset schedule, introducing progressively smaller sub-models.
  • At each step, each dimension is sampled from a uniform distribution, e.g. \(R \sim \mathcal{U}(R_{\min}^{(t)}, R_{\max})\).

This preserves pre-trained knowledge and prevents interference with the performance of larger sub-models.
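A minimal sketch of this schedule, assuming a linear decay of the sampling lower bound (the paper only specifies a preset schedule; the concrete values r_max=4.0 and r_min_final=0.5 here are illustrative):

```python
import random

def lower_bound(step: int, total_steps: int, start: float, final: float) -> float:
    """Linearly anneal the sampling lower bound from `start` down to `final`."""
    t = min(step / total_steps, 1.0)
    return start + t * (final - start)

def sample_mlp_ratio(step: int, total_steps: int,
                     r_max: float = 4.0, r_min_final: float = 0.5) -> float:
    # Early on, r_min(t) == r_max, so only the largest sub-model is sampled;
    # as training proceeds, R ~ U(r_min(t), r_max) admits smaller sub-models.
    r_min_t = lower_bound(step, total_steps, start=r_max, final=r_min_final)
    return random.uniform(r_min_t, r_max)

# Step 0 always yields R = 4.0; by the final step, R ~ U(0.5, 4.0).
```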

  3. Pareto-Optimal Sub-model Search + Constraint-Aware Router

Pareto Search: A customized NSGA-II evolutionary algorithm searches the elastic space for candidate sub-models on the Pareto frontier. A partition-based selection strategy and an iterative crowding-distance mechanism are introduced to improve population diversity and coverage of the complexity space.
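For intuition, a toy sketch of the underlying Pareto-dominance filter over (MACs, accuracy) pairs; the paper's actual search is a customized NSGA-II with partition-based selection and iterative crowding distance, which this snippet does not reproduce:

```python
def pareto_frontier(candidates):
    """candidates: list of (macs, accuracy, config) tuples.
    Keep every candidate not dominated by another that has both
    lower-or-equal MACs and higher-or-equal accuracy (strictly better in one)."""
    frontier = []
    for macs, acc, cfg in candidates:
        dominated = any(
            m <= macs and a >= acc and (m < macs or a > acc)
            for m, a, _ in candidates
        )
        if not dominated:
            frontier.append((macs, acc, cfg))
    return sorted(frontier)  # ascending in MACs

# Example: cfg_c is dominated by cfg_a (same MACs, lower accuracy) and is dropped.
print(pareto_frontier([(8.0, 88.2, "cfg_a"), (4.6, 85.1, "cfg_b"), (8.0, 86.0, "cfg_c")]))
```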

Router: A lightweight two-layer MLP that takes the normalized MACs constraint \(M_t \in [0,1]\) as input and outputs the corresponding sub-model configuration \(\theta\). The Gumbel-Sigmoid trick enables differentiable optimization over the discrete decisions:

\[y_{\text{soft}} = \text{Sigmoid}\left(\frac{\text{logits} + G_1 - G_2}{\tau}\right)\]

A straight-through estimator maintains gradient flow through the hard (binarized) decisions.
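A minimal PyTorch sketch of this relaxation, directly following the formula above; the temperature default and the 0.5 binarization threshold are assumptions:

```python
import torch

def sample_gumbel(like: torch.Tensor) -> torch.Tensor:
    # G = -log(-log(U)), U ~ Uniform(0, 1), clamped for numerical safety.
    u = torch.rand_like(like).clamp(1e-6, 1 - 1e-6)
    return -torch.log(-torch.log(u))

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """y_soft = Sigmoid((logits + G1 - G2) / tau), binarized with a
    straight-through estimator so gradients flow through y_soft."""
    y_soft = torch.sigmoid((logits + sample_gumbel(logits) - sample_gumbel(logits)) / tau)
    y_hard = (y_soft > 0.5).float()           # hard 0/1 decision in the forward pass
    return y_hard + y_soft - y_soft.detach()  # identity gradient to y_soft in backward
```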

Loss & Training

The unified loss function for Stage 2:

\[\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_1 \left(\frac{\text{MACs}(\theta)}{M_0} - M_t\right)^2 + \lambda_2 \|\theta - \theta^*(M_t)\|_2^2\]
  • First term: cross-entropy classification loss.
  • Second term: computational constraint penalty (MACs deviation).
  • Third term: Pareto reference regularization; \(\lambda_2\) is annealed to zero during training to encourage the router to explore independently.

Differentiable computation of MACs:

\[\text{MACs}(\theta) = \sum_{l=1}^{L}\left[D_{\text{MLP}}^{(l)} \cdot 2NE^2R^{(l)} + D_{\text{MHA}}^{(l)} \cdot N d_{\text{head}} H^{(l)} (4E + 2N)\right]\]
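A sketch of this formula as code, together with the Stage-2 budget penalty; the ViT-Base shape defaults (N = 197 tokens at 224×224, d_head = 64, L = 12) are standard values assumed here:

```python
def macs(theta, N=197, d_head=64, L=12):
    """MACs of a configuration theta (per the formula above). With relaxed
    (tensor-valued) gates D and ratios R, the same expression is differentiable."""
    E = theta["E"]
    total = 0.0
    for l in range(L):
        mlp = theta["D_mlp"][l] * 2 * N * E**2 * theta["R"][l]
        mha = theta["D_mha"][l] * N * d_head * theta["H"][l] * (4 * E + 2 * N)
        total += mlp + mha
    return total

def budget_penalty(theta, M_t, M_0, lambda_1=1.0):
    """Second loss term: squared deviation of normalized MACs from target M_t."""
    return lambda_1 * (macs(theta) / M_0 - M_t) ** 2

# Full ViT-Base configuration (transformer blocks only):
full = {"E": 768, "R": [4.0] * 12, "H": [12] * 12,
        "D_mlp": [1] * 12, "D_mha": [1] * 12}
print(macs(full) / 1e9)  # ≈ 17.4 GMACs
```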

Key Experimental Results

Main Results

Comparison with DynaBERT, MatFormer, HydraViT, and Flextron on 9 image classification benchmarks (8 GMACs constraint, ViT-Base backbone):

| Method | CIFAR-10 | CIFAR-100 | SVHN | Flowers | Food101 | Aircraft | Cars | DTD | Pets |
|--------|---------:|----------:|-----:|--------:|--------:|---------:|-----:|----:|-----:|
| DynaBERT | 94.24 | 78.30 | 96.58 | 24.53 | 82.24 | 71.19 | 80.19 | 36.88 | 57.05 |
| MatFormer | 96.17 | 86.18 | 97.45 | 69.00 | 86.31 | 76.13 | 88.10 | 49.25 | 84.53 |
| HydraViT | 96.11 | 85.39 | 96.98 | 29.27 | 85.49 | 75.95 | 85.07 | 43.53 | 75.32 |
| Flextron | 97.11 | 85.95 | 97.61 | 73.80 | 87.79 | 79.36 | 88.62 | 61.21 | 86.35 |
| EA-ViT | 97.98 | 88.20 | 97.63 | 85.39 | 90.05 | 83.37 | 90.09 | 67.56 | 90.57 |

EA-ViT's largest gain is on Flowers-102, where it exceeds the strongest baseline by 11.59 points (85.39 vs. 73.80 for Flextron), demonstrating that its advantage is particularly pronounced under low-MACs conditions.

Ablation Study

Contribution of each elastic dimension (validated on CIFAR-100; ✓ marks the dimensions made elastic):

| MLP | MHA | Embed | Depth | MACs_min ↓ | Acc_max ↑ | AUC ↑ |
|:---:|:---:|:-----:|:-----:|-----------:|----------:|------:|
| ✓ | – | – | – | 7.80 | 93.01 | 0.505 |
| ✓ | ✓ | – | – | 6.04 | 93.12 | 0.646 |
| ✓ | ✓ | ✓ | – | 5.11 | 93.31 | 0.651 |
| ✓ | ✓ | ✓ | ✓ | 4.60 | 93.32 | 0.663 |

Incrementally adding elastic dimensions consistently improves MACs coverage, peak accuracy, and AUC.

Key Findings

  • Router effectiveness: Sub-models automatically selected by the router consistently outperform those selected by fixed, manually designed architectures.
  • Necessity of CEA: Removing CEA slightly improves small sub-models but noticeably degrades large sub-model performance; after Stage 2, models trained with CEA consistently outperform those without it across the full MACs range.
  • Pareto initialization: Initializing the router with Pareto-frontier configurations allows training to make more effective use of limited resources.
  • t-SNE analysis: Different datasets exhibit distinct sub-model structural preferences (e.g., DTD and SVHN favor configurations different from those preferred on general datasets), validating the router's adaptability.

Highlights & Insights

  1. First adaptation-stage elastic framework: Introducing elasticity at the adaptation rather than the pre-training stage substantially reduces training and storage costs.
  2. Vast search space: \(10^{26}\) sub-model configurations, exceeding the most flexible existing method by 12 orders of magnitude.
  3. Strong practicality: The same framework applies to diverse tasks including classification, segmentation, medical imaging, and remote sensing.
  4. Dataset sensitivity analysis: Simple datasets (SVHN) are insensitive to model compression, while complex datasets (DTD, Flowers) require larger model capacity for adequate representation.

Limitations & Future Work

  • MACs is currently the only deployment constraint considered; the authors note that extending the framework to other metrics such as latency and parameter count would be straightforward.
  • Elastic adaptation still incurs non-trivial training cost; further reducing Stage 1 overhead warrants investigation.
  • The router predicts static configurations based on device constraints and does not account for dynamic, input-sample-level adaptation.
  • The approach shares conceptual similarity with OFA (Once-for-All) but targets ViTs and operates at the adaptation stage.
  • The use of Gumbel-Sigmoid for differentiable optimization over discrete decisions is a broadly applicable technique worth adopting.
  • The two-stage strategy of Pareto search followed by joint router optimization is generalizable to other configuration search problems.

Rating

  • Novelty: ⭐⭐⭐⭐ First ViT framework to introduce elasticity at the adaptation stage, with comprehensive four-dimensional elastic design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 13 datasets spanning classification, segmentation, medical imaging, and remote sensing, with thorough ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and complete derivations.
  • Value: ⭐⭐⭐⭐⭐ Directly addresses multi-device deployment; code is publicly available.