
TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

Conference: ICCV 2025
arXiv: 2507.22872
Code: https://github.com/synbol/TR-PTS
Area: Model Compression / Parameter-Efficient Fine-Tuning
Keywords: PEFT, Vision Transformer, Token Selection, Fisher Information Matrix, Parameter Selection

TL;DR

This paper proposes TR-PTS, a framework that performs task-driven layer-wise parameter selection via the Fisher Information Matrix and dynamically filters/merges tokens using CLS attention scores. By tuning only 0.34%–0.60% of parameters, TR-PTS surpasses full fine-tuning by 3.40% on FGVC and 10.35% on VTAB.

Background & Motivation

Large-scale pretrained ViTs demonstrate strong performance on downstream vision tasks, but full fine-tuning is prohibitively expensive. Existing parameter-efficient fine-tuning (PEFT) methods suffer from three key limitations:

Inference overhead: Methods such as VPT introduce additional learnable modules, increasing computational cost at inference time.

Lack of task awareness: Most methods apply a uniform tuning strategy across tasks, ignoring the varying importance of different layers and parameters for specific tasks.

Decoupled parameter and token optimization: Existing work treats parameter selection and token processing independently, despite the fact that token informativeness is inherently task-dependent.

The authors observe that different tasks rely on different subsets of tokens for final prediction (confirmed by visualizations in Figure 2), motivating a unified framework that simultaneously performs task-relevant parameter selection and token refinement.

Method

Overall Architecture

TR-PTS consists of two synergistic modules: Task-Relevant Parameter Selection and Task-Relevant Token Selection, which are jointly optimized to mutually reinforce each other.

Key Designs

  1. Task-Relevant Parameter Selection (FIM-based Layer-wise Parameter Allocation):

    • The Fisher Information Matrix (FIM) is used to quantify the task sensitivity of each parameter, approximated via the squared gradient of the cross-entropy loss: \(\mathcal{F}(\theta) \approx \mathbb{E}_{(x,y)\sim D}\left[\left(\frac{\partial \mathcal{L}_{CE}}{\partial \theta}\right)^2\right]\)
    • After selecting the top-M% parameters by FIM score, the per-layer contribution weight \(w_l\) is computed and normalized to determine the number of trainable connections per neuron: \(C_l = \max(1, \frac{w_l}{\min(w)} \cdot C_{\min})\)
    • Within each layer, the top-\(C_l\) connections per neuron are selected by FIM score as trainable parameters.
    • Design Motivation: Compared to the raw gradient magnitudes used in GPS, FIM is less susceptible to optimization noise and more accurately reflects parameter importance; layer-wise allocation guarantees at least one active connection per layer, so no layer is deactivated entirely (a code sketch of both modules follows this list).
  2. Task-Relevant Token Selection (CLS Attention-based Dynamic Token Filtering and Merging):

    • The attention scores \(a_i\) from the CLS token to each image token are used to measure token importance.
    • The top-\(\lfloor\rho N\rfloor\) tokens with the highest attention scores are retained, where \(\rho\) is the selection rate and \(N\) is the number of image tokens.
    • Rather than discarding the unselected tokens, they are aggregated into a single merged token via attention-weighted averaging over the unselected set \(\mathcal{I}\): \(x_{\text{merged}} = \frac{\sum_{i\in\mathcal{I}} a_i x_i}{\sum_{i\in\mathcal{I}} a_i}\)
    • The refined sequence is then: \(X_{\text{refined}} = \{x_{\text{CLS}}, X_{\text{selected}}, x_{\text{merged}}\}\)
    • Design Motivation: This combines the benefits of token pruning (reduced computation) and token merging (preserved global information).
  3. Parameter–Token Co-selection Strategy:

    • Key finding: Layers in which fewer parameters are selected (parameter-sparse layers) tend to carry more redundant, less informative tokens.
    • Token reduction is therefore prioritized at parameter-sparse layers ("sparse insertion" strategy).
    • A binary mask \(M\) controls gradient updates: \(\Theta^{(t+1)} = \Theta^{(t)} - \eta(M \odot \nabla_\Theta \mathcal{L})\)
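
A minimal PyTorch sketch of the two modules as described above (not the authors' released code; see the repository linked in the header). The function names, the few-batch FIM estimate, and the reading of \(w_l\) as a layer's share of the globally selected parameters are assumptions made for illustration; hyperparameter values are placeholders.

```python
import torch
import torch.nn.functional as F

def fisher_scores(model, loader, num_batches=8, device="cpu"):
    """Diagonal FIM approximation: E[(dL_CE/dtheta)^2], accumulated over a few batches."""
    model.to(device).eval()
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    seen = 0
    for x, y in loader:
        if seen >= num_batches:
            break
        model.zero_grad()
        F.cross_entropy(model(x.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += p.grad.detach() ** 2
        seen += 1
    return {n: s / max(seen, 1) for n, s in scores.items()}

def build_masks(scores, top_ratio=0.005, c_min=1):
    """Layer-wise allocation. Here w_l is taken as each weight matrix's share of the
    globally top-`top_ratio` FIM scores (an assumption; the paper calls it the per-layer
    contribution weight). Each output neuron keeps its top-C_l connections by FIM score."""
    mats = {n: s for n, s in scores.items() if s.dim() == 2}
    flat = torch.cat([s.flatten() for s in mats.values()])
    thresh = torch.topk(flat, max(1, int(top_ratio * flat.numel()))).values.min()
    w = {n: (s >= thresh).float().mean().item() for n, s in mats.items()}
    w_min = max(min(w.values()), 1e-12)
    masks = {}
    for n, s in scores.items():
        if n not in mats:                              # biases / norms: left frozen in this sketch
            masks[n] = torch.zeros_like(s)
            continue
        c_l = min(max(1, round(w[n] / w_min * c_min)), s.size(1))   # C_l = max(1, w_l/min(w) * C_min)
        top = torch.topk(s, c_l, dim=1).indices        # per-neuron top-C_l connections
        masks[n] = torch.zeros_like(s).scatter(1, top, 1.0)
    return masks
```

Token selection can likewise be sketched as a post-attention step inside a ViT block, assuming tokens are ordered (CLS, image tokens) and the block's attention map is available:

```python
import torch

def refine_tokens(x, attn, rho=0.7):
    """x: (B, N+1, D) with CLS first; attn: (B, heads, N+1, N+1).
    Keeps the top-floor(rho*N) image tokens by CLS attention and merges the rest
    into one attention-weighted token, giving a (B, 1 + k + 1, D) sequence."""
    B, L, D = x.shape
    n = L - 1
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)                   # (B, N): CLS -> image tokens
    k = max(1, int(rho * n))
    keep = cls_attn.topk(k, dim=1).indices                     # retained token indices
    drop = torch.ones_like(cls_attn).scatter(1, keep, 0.0)     # 1 for unselected tokens
    tokens = x[:, 1:, :]
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    w = (cls_attn * drop).unsqueeze(-1)                        # attention weights of dropped tokens
    merged = (w * tokens).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)
    return torch.cat([x[:, :1, :], kept, merged], dim=1)
```

Keeping the merged token rather than discarding the unselected ones is what preserves global context while still shrinking the sequence to \(1 + \lfloor\rho N\rfloor + 1\) tokens.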

Loss & Training

  • Standard cross-entropy loss is used for training.
  • Adam optimizer with cosine learning rate decay, trained for 100 epochs.
  • Backbone: ViT-B/16 pretrained on ImageNet-21K.
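
A sketch of how this recipe might be wired together, reusing the hypothetical `model`, `train_loader`, and `masks` from the sketch above; the learning rate is a placeholder since it is not given in these notes.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)            # lr is an assumed placeholder
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)                          # standard CE loss
        loss.backward()
        for n, p in model.named_parameters():                       # apply binary mask M to gradients,
            if p.grad is not None and n in masks:                    # so only selected weights are updated
                p.grad.mul_(masks[n])
        optimizer.step()
    scheduler.step()
```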

Key Experimental Results

Main Results

VTAB-1k Benchmark (19 visual classification tasks):

| Method | Natural (mean) | Specialized (mean) | Structured (mean) | Overall (mean) | Params (%) |
|---|---|---|---|---|---|
| Full Fine-tuning | - | - | - | 65.57 | 100.00 |
| LoRA | - | - | - | 72.63 | 0.90 |
| GPS | - | - | - | 75.18 | 0.25 |
| TR-PTS | - | - | - | 75.92 | 0.34 |

FGVC Benchmark (5 fine-grained classification datasets):

| Method | CUB-200 | NABirds | Flowers | Dogs | Cars | Mean | Params (%) |
|---|---|---|---|---|---|---|---|
| Full | 87.3 | 82.7 | 98.8 | 89.4 | 84.5 | 88.54 | 100.00 |
| GPS | 89.9 | 86.7 | 99.7 | 92.2 | 90.4 | 91.78 | 0.77 |
| TR-PTS | 90.0 | 87.1 | 99.6 | 92.4 | 90.6 | 91.94 | 0.60 |

Ablation Study

Component contributions (VTAB-1k subset):

| Token Selection | Parameter Selection | dSprites/loc | Flower102 | Sun397 |
|---|---|---|---|---|
| ✗ | ✗ | 12.5 | 97.0 | 51.0 |
| ✓ | ✗ | 14.8 | 98.8 | 51.2 |
| ✗ | ✓ | 85.1 | 99.4 | 54.2 |
| ✓ | ✓ | 87.7 | 99.5 | 54.5 |

Token selection placement strategy comparison:

| Strategy | Selection Rate | Sun397 | Flower102 | Loc | Camelyon |
|---|---|---|---|---|---|
| Dense | 0.95 | 53.5 | 99.3 | 85.2 | 87.3 |
| Random | 0.95 | 54.0 | 99.3 | 85.9 | 87.9 |
| Sparse | 0.95 | 54.5 | 99.4 | 87.7 | 88.1 |

Key Findings

  • TR-PTS achieves the lowest FLOPs and inference memory consumption among all evaluated PEFT methods.
  • The distribution of FIM-critical parameters across layers varies significantly by task: Flower102 concentrates on Blocks 8/10, Sun397 on Block 0, while Patch Camelyon exhibits a roughly uniform distribution.
  • The overlap between parameter selection sets across tasks is low (e.g., only 0.17 between Sun397 and Patch Camelyon), validating the necessity of task-adaptive selection.
  • Token visualizations show that shallower layers retain more tokens, while deeper layers progressively focus on foreground objects.

Highlights & Insights

  • The joint optimization of parameters and tokens is a novel contribution; the discovery that "parameter-sparse layers correspond to high token redundancy" motivates a principled co-selection strategy.
  • Experiments span 24 datasets with comprehensive and consistently strong results.
  • No additional parameters or modules are introduced, so there is no extra architectural overhead during training or inference (aside from the one-time FIM scoring pass before training).
  • FIM provides a more stable measure of parameter importance compared to gradient magnitudes.

Limitations & Future Work

  • Validation is limited to classification tasks; extension to dense prediction tasks such as detection and segmentation remains unexplored.
  • The token selection rate \(\rho\) and minimum connection count \(C_{\min}\) are hyperparameters requiring manual tuning.
  • FIM computation requires both forward and backward passes, increasing initialization-stage cost.
  • Adaptive per-layer token selection rates are not explored; a fixed \(\rho\) is applied uniformly across layers.
  • The finding that "token compression should be applied at parameter-sparse layers" may inspire future work: in other model compression settings, identifying computationally sparse regions for more aggressive resource reduction is a promising direction.

Related Work

  • GPS is the closest predecessor, performing parameter selection via gradient magnitudes without token compression.
  • ToMe (Token Merging) proposes similarity-based token merging but does not account for task relevance.

Rating

  • Novelty: ⭐⭐⭐⭐ — The joint parameter–token selection framework and FIM-based layer-wise allocation strategy are creative contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 24 datasets, multi-dimensional ablations, and computational cost analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with rich figures and tables.
  • Value: ⭐⭐⭐⭐ — Practically useful and a valuable reference for the PEFT community.