FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection¶
Conference: CVPR 2026 arXiv: 2603.20403 Code: Available (mentioned in the paper) Area: Parameter-Efficient Fine-Tuning / Multi-Task Learning Keywords: LoRA, automatic rank selection, FFT, multi-task learning, PEFT
TL;DR¶
This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank for each task and layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.
Background & Motivation¶
Multi-task learning (MTL) aims to learn multiple tasks simultaneously, sharing representations to discover inter-task relationships and structure. As backbone model sizes continue to grow, traditional full fine-tuning becomes increasingly impractical. Parameter-efficient fine-tuning (PEFT), particularly methods based on Low-Rank Adaptation (LoRA), has become the dominant paradigm.
However, existing LoRA-based MTL methods suffer from two core limitations:
Fixed-rank problem: Existing methods apply a uniform rank across all layers and all tasks, which is counter-intuitive — different tasks may require different adaptation strengths, and different layers need different degrees of fine-tuning flexibility. Deeper layers require stronger adaptability to handle task-specific fine-grained information, whereas shallower layers may need only minor adjustments.
Lack of spatial inductive bias: Existing LoRA-based MTL strategies overlook the role of cross-task interaction in deeper layers. For dense visual tasks such as semantic segmentation, depth estimation, and surface normal estimation, strong spatial awareness and cross-task geometric consistency are essential — yet low-rank adaptation inherently lacks these capabilities.
FAAR addresses these issues along two lines:

- Resolving the fixed-rank problem via dynamic rank shrinking (PDRS), enabling each task/layer to automatically find its optimal rank.
- Introducing cheap yet effective spatial information and cross-task relationships through frequency analysis (TS-PD).
Method¶
Overall Architecture¶
FAAR is built upon a frozen Swin Transformer backbone, with DoRA adapters placed in the attention and MLP layers. Within each Transformer stage, the last block uses task-specific adapters while preceding blocks share adapters. A Task-Spectral Pyramidal Decoder (TS-PD) is appended after the backbone for frequency enhancement and cross-task alignment. The entire training process is governed by PDRS, which dynamically reduces adapter ranks.
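The placement rule (shared adapters in the early blocks of each stage, task-specific adapters in the last block) can be illustrated with a tiny sketch. This is my own illustration, not the released code; the function name and task list are hypothetical, and Swin-Tiny's stage depths (2, 2, 6, 2) are used only as an example.

```python
# Illustrative sketch (hypothetical names, not the authors' code) of the adapter placement rule:
# within each frozen backbone stage, earlier blocks share one adapter and the last block
# holds per-task adapters.
from typing import Dict, List

def plan_adapters(blocks_per_stage: List[int], tasks: List[str]) -> Dict[str, str]:
    plan = {}
    for s, n_blocks in enumerate(blocks_per_stage):
        for b in range(n_blocks):
            key = f"stage{s}.block{b}"
            if b == n_blocks - 1:
                plan[key] = "task-specific: " + ", ".join(tasks)  # one adapter per task
            else:
                plan[key] = "shared"                              # one adapter for all tasks
    return plan

# Example with Swin-Tiny's stage depths (2, 2, 6, 2):
print(plan_adapters([2, 2, 6, 2], ["semseg", "parts", "saliency", "normals"]))
```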
Key Designs¶
- Performance-Driven Rank Shrinking (PDRS):
  - Rank Masking: At each forward pass, a prefix size \(b \in \{1, \dots, r_{curr}\}\) is randomly sampled to construct a binary mask \(m\), allowing only the first \(b\) rank components to participate in computation:
    - \(A^{eff} = \text{diag}(m)\, A\), \(B^{eff} = B\, \text{diag}(m)\)
    - This compels the important rank-1 updates to concentrate in the lower dimensions (a minimal sketch follows this list).
  - Coverage Strategy:
    - At each backward pass, an importance score is computed for each active rank \(i\): \(s_i = \frac{1}{2}\left(\left|\left\langle A_{:,i}^{eff}, \frac{\partial \mathcal{L}}{\partial A_{:,i}^{eff}} \right\rangle\right| + \left|\left\langle B_{i,:}^{eff}, \frac{\partial \mathcal{L}}{\partial B_{i,:}^{eff}} \right\rangle\right|\right)\)
    - Scores are accumulated across batches via an EMA: \(\hat{s}_i^{(t)} \leftarrow \beta\, \hat{s}_i^{(t-1)} + (1-\beta)\, s_i^{(t)}\), where \(t\) indexes batches
    - At the end of each epoch, ranks are sorted by score in descending order, and the minimum number of ranks \(K\) whose cumulative score fraction \(c(k)\) reaches the coverage ratio \(\rho\) is kept: \(K = \min\{k : c(k) \geq \rho\}\)
    - Uncovered ranks are permanently removed from optimization.
  - Design Motivation: The directional derivative of the MTL loss reflects the actual contribution of each rank-1 component, so performance-driven shrinking ensures no critical updates are discarded.
- DoRA Adapters (instead of LoRA):
  - DoRA decouples low-rank adaptation into magnitude and direction: \(\text{Out}_i^{DoRA} = m_i \, \frac{W_i + \alpha B_i A_i}{\|W_i + \alpha B_i A_i\|_2}\, x + b_i\)
  - DoRA is more stable than LoRA at extremely low ranks and pairs better with PDRS rank shrinking (both are combined in the sketch after this list).
  - Experimental validation shows that at high ranks DoRA does not necessarily outperform LoRA, but at low ranks DoRA is significantly superior.
- Task-Spectral Pyramidal Decoder (TS-PD):
  - Channel-wise Spectral Filter (CW-SP):
    - FFT is applied to each task-specific feature, and a task/resolution-specific 2D frequency filter matrix \(W_t^{res}\) is learned.
    - Selective amplification/suppression of different frequencies is achieved via element-wise multiplication: \(Y = W_t^{res} \odot \text{FFT}(I)\).
    - After the inverse FFT back to feature space, learnable scale/shift parameters are applied for modulation (see the spectral-filter sketch after this list).
    - Design Motivation: Different tasks require different frequency information; edge detection relies on high frequencies, while depth estimation leverages both high and low frequencies.
  - Cross-Task Consensus Alignment (XT-Cons):
    - For the primary task, the average representation \(F_{avg}\) of the auxiliary task spectra is computed.
    - Low- and high-frequency masks \(M_{low}\), \(M_{high}\) are extracted from the primary task spectrum.
    - Alignment differences are computed: \(\Delta_{low/high} = M_{low/high} \odot (F_{avg} - \text{FFT}(X_i^{main}))\)
    - Contributions are scaled by learnable scalars \(\alpha_{low}\), \(\alpha_{high}\).
    - Design Motivation: The "consensus" of auxiliary tasks in the frequency domain is used to drive geometric consistency in the primary task representation, at lower cost than direct spatial-domain interaction.
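To make the PDRS mechanics concrete, below is a minimal PyTorch sketch of a DoRA-style linear adapter with prefix rank masking, gradient-based importance scores, and coverage-based shrinking. This is my own reading of the description above, not the authors' code; the class name, initialization, and defaults (e.g., \(\beta\), \(\rho\)) are illustrative.

```python
# Minimal sketch (not the authors' code) of PDRS on a DoRA-style linear adapter.
# Names (DoRAAdapterWithPDRS, update_scores, shrink) and defaults are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRAAdapterWithPDRS(nn.Module):
    def __init__(self, base: nn.Linear, r_init: int = 64, alpha: float = 1.0, beta: float = 0.9):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                               # pretrained weight stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r_init, d_in) * 0.01)   # low-rank factor A (r x d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r_init))         # low-rank factor B (d_out x r)
        self.m = nn.Parameter(base.weight.norm(dim=0))            # DoRA magnitude, one per column
        self.alpha, self.beta = alpha, beta
        self.active = list(range(r_init))                         # ranks still being optimized
        self.register_buffer("score", torch.zeros(r_init))        # EMA of per-rank importance

    def forward(self, x):
        # Rank masking: during training, only a random prefix of the active ranks contributes.
        b = torch.randint(1, len(self.active) + 1, (1,)).item() if self.training else len(self.active)
        idx = self.active[:b]
        delta = self.B[:, idx] @ self.A[idx]                      # masked low-rank update
        W = self.base.weight + self.alpha * delta
        W_dir = W / W.norm(dim=0, keepdim=True)                   # direction (column-normalized)
        return F.linear(x, self.m.unsqueeze(0) * W_dir, self.base.bias)

    @torch.no_grad()
    def update_scores(self):
        # Call after loss.backward(): EMA of |<param, grad>| per rank (zero for unsampled ranks).
        if self.A.grad is None or self.B.grad is None:
            return
        s = 0.5 * ((self.A * self.A.grad).sum(dim=1).abs()
                   + (self.B * self.B.grad).sum(dim=0).abs())
        self.score.mul_(self.beta).add_(s, alpha=1.0 - self.beta)

    @torch.no_grad()
    def shrink(self, rho: float = 0.95):
        # End of epoch: keep the smallest set of ranks whose scores cover a fraction rho of the total.
        sc = self.score[self.active]
        order = torch.argsort(sc, descending=True)
        cum = torch.cumsum(sc[order], dim=0) / sc.sum().clamp_min(1e-12)
        hits = (cum >= rho).nonzero()
        k = int(hits[0]) + 1 if len(hits) > 0 else len(self.active)
        self.active = [self.active[i] for i in order[:k].tolist()]  # uncovered ranks are dropped
```

The channel-wise spectral filter of TS-PD can be sketched just as compactly with `torch.fft`. Again, this only illustrates the idea described above (a learnable per-channel frequency response, one module per task and resolution, followed by scale/shift modulation); the shapes, the real-valued filter, and the use of `rfft2` are my assumptions rather than the paper's exact parameterization.

```python
# Rough sketch of a CW-SP-style channel-wise spectral filter; shapes and the real-valued
# frequency response are assumptions. One such module would exist per task and resolution.
import torch
import torch.nn as nn

class ChannelSpectralFilter(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Learnable frequency-response map per channel (rfft2 keeps width // 2 + 1 frequency bins).
        self.filt = nn.Parameter(torch.ones(channels, height, width // 2 + 1))
        self.scale = nn.Parameter(torch.ones(channels, 1, 1))    # post-iFFT modulation
        self.shift = nn.Parameter(torch.zeros(channels, 1, 1))

    def forward(self, x):                                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")                  # complex spectrum
        spec = spec * self.filt                                  # amplify/suppress frequencies
        out = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return out * self.scale + self.shift
```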
Loss & Training¶
- MTL loss: \(L_{MTL} = \sum_{i=1}^{T} w_i L_i\), where \(w_i\) is the weight of task \(i\) (a minimal sketch follows this list)
- Semantic segmentation, human part segmentation: pixel-wise cross-entropy
- Depth estimation, surface normal estimation: L1 loss
- Saliency detection: balanced cross-entropy
- Coverage ratio: \(\rho_{shared} = \rho_{task} = 0.95\)
- Backbone: Swin-Tiny (ImageNet-1k pretrained); Decoder: HRNet
- Initial rank \(r_{init} = 64\), dynamically shrunk to approximately \(r_{global} \approx 5\) during training
- Single NVIDIA A40; learning rate \(5 \times 10^{-4}\); batch size 32
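For reference, a minimal sketch of the weighted multi-task loss above, assuming per-task weights \(w_i\) and the per-task losses listed; the task names, weights, and ignore index are illustrative, and the balanced cross-entropy for saliency is replaced by plain BCE for brevity.

```python
# Minimal sketch of the weighted MTL loss; task names, weights, and ignore_index are illustrative.
import torch
import torch.nn.functional as F

def mtl_loss(preds: dict, targets: dict, weights: dict) -> torch.Tensor:
    losses = {
        "semseg":   F.cross_entropy(preds["semseg"], targets["semseg"], ignore_index=255),
        "normals":  F.l1_loss(preds["normals"], targets["normals"]),
        # The paper uses a *balanced* cross-entropy for saliency; plain BCE is shown for brevity.
        "saliency": F.binary_cross_entropy_with_logits(preds["saliency"], targets["saliency"]),
    }
    return sum(weights[t] * losses[t] for t in losses)
```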
Key Experimental Results¶
Main Results¶
PASCAL-Context dataset (4 tasks):
| Method | SemSeg (mIoU↑) | HumanParts (mIoU↑) | Saliency (mIoU↑) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|---|
| Single Task | 67.21 | 61.93 | 62.35 | 17.97 | 0 | 112.62 |
| MTL Full FT | 67.56 | 60.24 | 65.21 | 16.64 | +2.23 | 30.06 |
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 | 8.34 |
| TADFormer (r=64) | 70.82 | 60.45 | 65.88 | 16.48 | +4.24 | 7.38 |
| FAAR | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 | 3.38 |
NYUDv2 dataset (3 tasks):
| Method | SemSeg (mIoU↑) | Depth (rmse↓) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|
| Single Task | 42.65 | 0.60 | 22.83 | 0 | 84.00 |
| MTL Full FT | 38.85 | 0.66 | 24.33 | -8.49 | 28.10 |
| TADFormer (r=64) | 40.85 | 0.64 | 27.48 | -10.42 | 8.90 |
| FAAR | 41.27 | 0.63 | 26.35 | -7.88 | 2.85 |
Ablation Study¶
Component ablation on PASCAL-Context:
| Configuration | SemSeg | HumanParts | Saliency | Normals | Δm |
|---|---|---|---|---|---|
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 |
| + DoRA (high rank) | 67.55 | 60.00 | 64.70 | 17.20 | +1.36 |
| + PDRS w/ LoRA | 68.11 | 59.93 | 65.54 | 16.50 | +2.83 |
| + PDRS w/ DoRA (1) | 71.35 | 61.02 | 65.92 | 16.42 | +4.92 |
| + TS-PD (2) | 70.73 | 60.95 | 65.92 | 16.40 | +4.63 |
| FAAR (1+2) | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 |
Key Findings¶
- Rank shrinkage patterns are intuitive: Task-specific and deeper layers tend to retain larger ranks due to their role in processing fine-grained task-specific information, while shared and shallower layers are significantly pruned.
- DoRA substantially outperforms LoRA at low ranks: At high ranks, DoRA actually underperforms (+1.36 vs. +2.55), but after PDRS shrinks ranks to very low values, DoRA yields a significant advantage (+4.92).
- Initial rank has little impact on final performance: Results are nearly identical for \(r_{init} \in \{16, 32, 64\}\), indicating that PDRS provides a sufficiently broad search space.
- XT-Cons cross-task alignment is effective: It provides an additional +0.8% Δm improvement on top of TS-PD, validating the value of cross-task consistency in the frequency domain.
- 9× parameter savings: FAAR (3.38M) vs. MTL Full FT (30.06M), with superior performance.
Highlights & Insights¶
- Performance-driven rank shrinking: Unlike AdaLoRA (singular-value importance) or DyLoRA (robustness to low-rank training), PDRS directly uses directional derivatives of the MTL loss to guide rank pruning, aligning more directly with the optimization objective.
- Frequency domain as a cross-task bridge: The authors position this as the first work to leverage FFT in dense visual MTL. The frequency domain naturally separates edge/semantic information, providing a meaningful shared basis for different tasks.
- Synergy between DoRA and extremely low ranks: At high ranks, DoRA does not necessarily outperform LoRA; however, when ranks are dynamically compressed to very low values by PDRS, DoRA's magnitude-direction decoupling becomes critical.
- Simultaneous improvement across all tasks: FAAR outperforms baselines on all 4 PASCAL tasks, with no evidence of sacrificing one task at the expense of others.
Limitations & Future Work¶
- On NYUDv2, all MTL PEFT methods fail to surpass single-task training; FAAR does not fully resolve the difficulty of MTL on small datasets.
- The coverage ratio \(\rho\) still requires manual specification (although the paper selects 0.95 on a validation set, different datasets may require different values).
- Only the Swin-Tiny backbone is evaluated; the effectiveness on larger backbones (e.g., Swin-Base/Large) or ViT architectures remains unknown.
- The frequency filter matrices in TS-PD are learned separately for each resolution and task, leading to parameter growth as the number of tasks increases.
- Cross-task alignment is performed only in the frequency domain; spatial-domain interactions may provide additional complementary information.
Related Work & Insights¶
- MTLoRA / TADFormer: LoRA-based MTL baselines using fixed ranks.
- AdaLoRA / AutoLoRA / DyLoRA: Automatic rank selection methods for single-task settings; FAAR extends these to the multi-task regime.
- FADA / NightAdapter: Frequency adapters for domain generalization and nighttime segmentation, which inspired the design of TS-PD.
- DiTASK: An alternative approach adapting singular values via neural diffeomorphisms.
Rating¶
- Novelty: ⭐⭐⭐⭐ (PDRS and TS-PD are individually novel, though both represent improved combinations of existing ideas)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets, detailed ablations, and comprehensive parameter efficiency comparisons)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, though the density of formulas and abbreviations somewhat reduces readability)
- Value: ⭐⭐⭐⭐ (Provides a practical and efficient solution for MTL PEFT; the 9× parameter savings is highly attractive)