FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection¶
Conference: CVPR 2026 arXiv: 2603.20403 Code: Available (mentioned in the paper) Area: Parameter-Efficient Fine-Tuning / Multi-Task Learning Keywords: LoRA, automatic rank selection, FFT, multi-task learning, PEFT
TL;DR¶
This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank for each task and layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.
Background & Motivation¶
Multi-task learning (MTL) aims to learn multiple tasks simultaneously, sharing representations to discover inter-task relationships and structure. As backbone model sizes continue to grow, traditional full fine-tuning becomes increasingly impractical. Parameter-efficient fine-tuning (PEFT), particularly methods based on Low-Rank Adaptation (LoRA), has become the dominant paradigm.
However, existing LoRA-based MTL methods suffer from two core limitations:
Fixed-rank problem: Existing methods apply a uniform rank across all layers and all tasks, which is counter-intuitive — different tasks may require different adaptation strengths, and different layers need different degrees of fine-tuning flexibility. Deeper layers require stronger adaptability to handle task-specific fine-grained information, whereas shallower layers may need only minor adjustments.
Lack of spatial inductive bias: Existing LoRA-based MTL strategies overlook the role of cross-task interaction in deeper layers. For dense visual tasks such as semantic segmentation, depth estimation, and surface normal estimation, strong spatial awareness and cross-task geometric consistency are essential — yet low-rank adaptation inherently lacks these capabilities.
FAAR addresses these issues along two lines:

- Resolving the fixed-rank problem via dynamic rank shrinking (PDRS), enabling each task/layer to automatically find its optimal rank.
- Introducing cheap yet effective spatial information and cross-task relationships through frequency analysis (TS-PD).
Method¶
Overall Architecture¶
FAAR is built upon a frozen Swin Transformer backbone, with DoRA adapters placed in the attention and MLP layers. Within each Transformer stage, the last block uses task-specific adapters while preceding blocks share adapters. A Task-Spectral Pyramidal Decoder (TS-PD) is appended after the backbone for frequency enhancement and cross-task alignment. The entire training process is governed by PDRS, which dynamically reduces adapter ranks.
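The placement rule (shared adapters in the early blocks of each stage, task-specific adapters in the last block) can be illustrated with a tiny sketch. This is my own illustration, not the released code; the function name and task list are hypothetical, and Swin-Tiny's stage depths (2, 2, 6, 2) are used only as an example.

```python
# Illustrative sketch (hypothetical names, not the authors' code) of the adapter placement rule:
# within each frozen backbone stage, earlier blocks share one adapter and the last block
# holds per-task adapters.
from typing import Dict, List

def plan_adapters(blocks_per_stage: List[int], tasks: List[str]) -> Dict[str, str]:
    plan = {}
    for s, n_blocks in enumerate(blocks_per_stage):
        for b in range(n_blocks):
            key = f"stage{s}.block{b}"
            if b == n_blocks - 1:
                plan[key] = "task-specific: " + ", ".join(tasks)  # one adapter per task
            else:
                plan[key] = "shared"                              # one adapter for all tasks
    return plan

# Example with Swin-Tiny's stage depths (2, 2, 6, 2):
print(plan_adapters([2, 2, 6, 2], ["semseg", "parts", "saliency", "normals"]))
```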
Key Designs¶
- Performance-Driven Rank Shrinking (PDRS):
  - Rank Masking: At each forward pass, a prefix size \(b \in \{1, \dots, r_{curr}\}\) is randomly sampled to construct a binary mask \(m\), allowing only the first \(b\) rank components to participate in computation:
    - \(A^{eff} = \text{diag}(m)\, A\), \(B^{eff} = B\, \text{diag}(m)\)
    - This compels the important rank-1 updates to concentrate in the lower dimensions (a minimal sketch follows this list).
  - Coverage Strategy:
    - At each backward pass, an importance score is computed for each active rank \(i\): \(s_i = \frac{1}{2}\left(\left|\left\langle A_{:,i}^{eff}, \frac{\partial \mathcal{L}}{\partial A_{:,i}^{eff}} \right\rangle\right| + \left|\left\langle B_{i,:}^{eff}, \frac{\partial \mathcal{L}}{\partial B_{i,:}^{eff}} \right\rangle\right|\right)\)
    - Scores are accumulated across batches via an EMA: \(\hat{s}_i^{(t)} \leftarrow \beta\, \hat{s}_i^{(t-1)} + (1-\beta)\, s_i^{(t)}\), where \(t\) indexes batches
    - At the end of each epoch, ranks are sorted by score in descending order, and the minimum number of ranks \(K\) whose cumulative score fraction \(c(k)\) reaches the coverage ratio \(\rho\) is kept: \(K = \min\{k : c(k) \geq \rho\}\)
    - Uncovered ranks are permanently removed from optimization.
  - Design Motivation: The directional derivative of the MTL loss reflects the actual contribution of each rank-1 component, so performance-driven shrinking ensures no critical updates are discarded.
- DoRA Adapters (instead of LoRA):
  - DoRA decouples low-rank adaptation into magnitude and direction: \(\text{Out}_i^{DoRA} = m_i \, \frac{W_i + \alpha B_i A_i}{\|W_i + \alpha B_i A_i\|_2}\, x + b_i\)
  - DoRA is more stable than LoRA at extremely low ranks and pairs better with PDRS rank shrinking (both are combined in the sketch after this list).
  - Experimental validation shows that at high ranks DoRA does not necessarily outperform LoRA, but at low ranks DoRA is significantly superior.
- Task-Spectral Pyramidal Decoder (TS-PD):
  - Channel-wise Spectral Filter (CW-SP):
    - FFT is applied to each task-specific feature, and a task/resolution-specific 2D frequency filter matrix \(W_t^{res}\) is learned.
    - Selective amplification/suppression of different frequencies is achieved via element-wise multiplication: \(Y = W_t^{res} \odot \text{FFT}(I)\).
    - After the inverse FFT back to feature space, learnable scale/shift parameters are applied for modulation (see the spectral-filter sketch after this list).
    - Design Motivation: Different tasks require different frequency information; edge detection relies on high frequencies, while depth estimation leverages both high and low frequencies.
  - Cross-Task Consensus Alignment (XT-Cons):
    - For the primary task, the average representation \(F_{avg}\) of the auxiliary task spectra is computed.
    - Low- and high-frequency masks \(M_{low}\), \(M_{high}\) are extracted from the primary task spectrum.
    - Alignment differences are computed: \(\Delta_{low/high} = M_{low/high} \odot (F_{avg} - \text{FFT}(X_i^{main}))\)
    - Contributions are scaled by learnable scalars \(\alpha_{low}\), \(\alpha_{high}\).
    - Design Motivation: The "consensus" of auxiliary tasks in the frequency domain is used to drive geometric consistency in the primary task representation, at lower cost than direct spatial-domain interaction.
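To make the PDRS mechanics concrete, below is a minimal PyTorch sketch of a DoRA-style linear adapter with prefix rank masking, gradient-based importance scores, and coverage-based shrinking. This is my own reading of the description above, not the authors' code; the class name, initialization, and defaults (e.g., \(\beta\), \(\rho\)) are illustrative.

```python
# Minimal sketch (not the authors' code) of PDRS on a DoRA-style linear adapter.
# Names (DoRAAdapterWithPDRS, update_scores, shrink) and defaults are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRAAdapterWithPDRS(nn.Module):
    def __init__(self, base: nn.Linear, r_init: int = 64, alpha: float = 1.0, beta: float = 0.9):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                               # pretrained weight stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r_init, d_in) * 0.01)   # low-rank factor A (r x d_in)
        self.B = nn.Parameter(torch.zeros(d_out, r_init))         # low-rank factor B (d_out x r)
        self.m = nn.Parameter(base.weight.norm(dim=0))            # DoRA magnitude, one per column
        self.alpha, self.beta = alpha, beta
        self.active = list(range(r_init))                         # ranks still being optimized
        self.register_buffer("score", torch.zeros(r_init))        # EMA of per-rank importance

    def forward(self, x):
        # Rank masking: during training, only a random prefix of the active ranks contributes.
        b = torch.randint(1, len(self.active) + 1, (1,)).item() if self.training else len(self.active)
        idx = self.active[:b]
        delta = self.B[:, idx] @ self.A[idx]                      # masked low-rank update
        W = self.base.weight + self.alpha * delta
        W_dir = W / W.norm(dim=0, keepdim=True)                   # direction (column-normalized)
        return F.linear(x, self.m.unsqueeze(0) * W_dir, self.base.bias)

    @torch.no_grad()
    def update_scores(self):
        # Call after loss.backward(): EMA of |<param, grad>| per rank (zero for unsampled ranks).
        if self.A.grad is None or self.B.grad is None:
            return
        s = 0.5 * ((self.A * self.A.grad).sum(dim=1).abs()
                   + (self.B * self.B.grad).sum(dim=0).abs())
        self.score.mul_(self.beta).add_(s, alpha=1.0 - self.beta)

    @torch.no_grad()
    def shrink(self, rho: float = 0.95):
        # End of epoch: keep the smallest set of ranks whose scores cover a fraction rho of the total.
        sc = self.score[self.active]
        order = torch.argsort(sc, descending=True)
        cum = torch.cumsum(sc[order], dim=0) / sc.sum().clamp_min(1e-12)
        hits = (cum >= rho).nonzero()
        k = int(hits[0]) + 1 if len(hits) > 0 else len(self.active)
        self.active = [self.active[i] for i in order[:k].tolist()]  # uncovered ranks are dropped
```

The channel-wise spectral filter of TS-PD can be sketched just as compactly with `torch.fft`. Again, this only illustrates the idea described above (a learnable per-channel frequency response, one module per task and resolution, followed by scale/shift modulation); the shapes, the real-valued filter, and the use of `rfft2` are my assumptions rather than the paper's exact parameterization.

```python
# Rough sketch of a CW-SP-style channel-wise spectral filter; shapes and the real-valued
# frequency response are assumptions. One such module would exist per task and resolution.
import torch
import torch.nn as nn

class ChannelSpectralFilter(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Learnable frequency-response map per channel (rfft2 keeps width // 2 + 1 frequency bins).
        self.filt = nn.Parameter(torch.ones(channels, height, width // 2 + 1))
        self.scale = nn.Parameter(torch.ones(channels, 1, 1))    # post-iFFT modulation
        self.shift = nn.Parameter(torch.zeros(channels, 1, 1))

    def forward(self, x):                                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")                  # complex spectrum
        spec = spec * self.filt                                  # amplify/suppress frequencies
        out = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return out * self.scale + self.shift
```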
Loss & Training¶
- MTL loss: \(L_{MTL} = \sum_{i=1}^{T} w_i L_i\), where \(w_i\) is the weight of task \(i\) (a minimal sketch follows this list)
- Semantic segmentation, human part segmentation: pixel-wise cross-entropy
- Depth estimation, surface normal estimation: L1 loss
- Saliency detection: balanced cross-entropy
- Coverage ratio: \(\rho_{shared} = \rho_{task} = 0.95\)
- Backbone: Swin-Tiny (ImageNet-1k pretrained); Decoder: HRNet
- Initial rank \(r_{init} = 64\), dynamically shrunk to approximately \(r_{global} \approx 5\) during training
- Single NVIDIA A40; learning rate \(5 \times 10^{-4}\); batch size 32
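For reference, a minimal sketch of the weighted multi-task loss above, assuming per-task weights \(w_i\) and the per-task losses listed; the task names, weights, and ignore index are illustrative, and the balanced cross-entropy for saliency is replaced by plain BCE for brevity.

```python
# Minimal sketch of the weighted MTL loss; task names, weights, and ignore_index are illustrative.
import torch
import torch.nn.functional as F

def mtl_loss(preds: dict, targets: dict, weights: dict) -> torch.Tensor:
    losses = {
        "semseg":   F.cross_entropy(preds["semseg"], targets["semseg"], ignore_index=255),
        "normals":  F.l1_loss(preds["normals"], targets["normals"]),
        # The paper uses a *balanced* cross-entropy for saliency; plain BCE is shown for brevity.
        "saliency": F.binary_cross_entropy_with_logits(preds["saliency"], targets["saliency"]),
    }
    return sum(weights[t] * losses[t] for t in losses)
```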
Key Experimental Results¶
Main Results¶
PASCAL-Context dataset (4 tasks):
| Method | SemSeg (mIoU↑) | HumanParts (mIoU↑) | Saliency (mIoU↑) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|---|
| Single Task | 67.21 | 61.93 | 62.35 | 17.97 | 0 | 112.62 |
| MTL Full FT | 67.56 | 60.24 | 65.21 | 16.64 | +2.23 | 30.06 |
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 | 8.34 |
| TADFormer (r=64) | 70.82 | 60.45 | 65.88 | 16.48 | +4.24 | 7.38 |
| FAAR | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 | 3.38 |
NYUDv2 dataset (3 tasks):
| Method | SemSeg (mIoU↑) | Depth (rmse↓) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|
| Single Task | 42.65 | 0.60 | 22.83 | 0 | 84.00 |
| MTL Full FT | 38.85 | 0.66 | 24.33 | -8.49 | 28.10 |
| TADFormer (r=64) | 40.85 | 0.64 | 27.48 | -10.42 | 8.90 |
| FAAR | 41.27 | 0.63 | 26.35 | -7.88 | 2.85 |
Ablation Study¶
Component ablation on PASCAL-Context:
| Configuration | SemSeg | HumanParts | Saliency | Normals | Δm |
|---|---|---|---|---|---|
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 |
| + DoRA (high rank) | 67.55 | 60.00 | 64.70 | 17.20 | +1.36 |
| + PDRS w/ LoRA | 68.11 | 59.93 | 65.54 | 16.50 | +2.83 |
| + PDRS w/ DoRA (1) | 71.35 | 61.02 | 65.92 | 16.42 | +4.92 |
| + TS-PD (2) | 70.73 | 60.95 | 65.92 | 16.40 | +4.63 |
| FAAR (1+2) | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 |
Key Findings¶
- Rank shrinkage patterns are intuitive: Task-specific and deeper layers tend to retain larger ranks due to their role in processing fine-grained task-specific information, while shared and shallower layers are significantly pruned.
- DoRA substantially outperforms LoRA at low ranks: At high ranks, DoRA actually underperforms (+1.36 vs. +2.55), but after PDRS shrinks ranks to very low values, DoRA yields a significant advantage (+4.92).
- Initial rank has little impact on final performance: Results are nearly identical for \(r_{init} \in \{16, 32, 64\}\), indicating that PDRS provides a sufficiently broad search space.
- XT-Cons cross-task alignment is effective: It provides an additional +0.8% Δm improvement on top of TS-PD, validating the value of cross-task consistency in the frequency domain.
- 9× parameter savings: FAAR (3.38M) vs. MTL Full FT (30.06M), with superior performance.
Highlights & Insights¶
- Performance-driven rank shrinking: Unlike AdaLoRA (singular-value importance) or DyLoRA (robustness to low-rank training), PDRS directly uses directional derivatives of the MTL loss to guide rank pruning, aligning more directly with the optimization objective.
- Frequency domain as a cross-task bridge: The authors position this as the first work to leverage FFT in dense visual MTL. The frequency domain naturally separates edge/semantic information, providing a meaningful shared basis for different tasks.
- Synergy between DoRA and extremely low ranks: At high ranks, DoRA does not necessarily outperform LoRA; however, when ranks are dynamically compressed to very low values by PDRS, DoRA's magnitude-direction decoupling becomes critical.
- Simultaneous improvement across all tasks: FAAR outperforms baselines on all 4 PASCAL tasks, with no evidence of sacrificing one task at the expense of others.
Limitations & Future Work¶
- On NYUDv2, all MTL PEFT methods fail to surpass single-task training; FAAR does not fully resolve the difficulty of MTL on small datasets.
- The coverage ratio \(\rho\) still requires manual specification (although the paper selects 0.95 on a validation set, different datasets may require different values).
- Only the Swin-Tiny backbone is evaluated; the effectiveness on larger backbones (e.g., Swin-Base/Large) or ViT architectures remains unknown.
- The frequency filter matrices in TS-PD are learned separately for each resolution and task, leading to parameter growth as the number of tasks increases.
- Cross-task alignment is performed only in the frequency domain; spatial-domain interactions may provide additional complementary information.
Related Work & Insights¶
- MTLoRA / TADFormer: LoRA-based MTL baselines using fixed ranks.
- AdaLoRA / AutoLoRA / DyLoRA: Automatic rank selection methods for single-task settings; FAAR extends these to the multi-task regime.
- FADA / NightAdapter: Frequency adapters for domain generalization and nighttime segmentation, which inspired the design of TS-PD.
- DiTASK: An alternative approach adapting singular values via neural diffeomorphisms.
Rating¶
- Novelty: ⭐⭐⭐⭐ (PDRS and TS-PD are individually novel, though both represent improved combinations of existing ideas)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets, detailed ablations, and comprehensive parameter efficiency comparisons)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, though the density of formulas and abbreviations somewhat reduces readability)
- Value: ⭐⭐⭐⭐ (Provides a practical and efficient solution for MTL PEFT; the 9× parameter savings is highly attractive)