# FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

- Conference: CVPR 2026
- arXiv: 2603.20403
- Code: available (mentioned in paper)
- Area: Parameter-Efficient Fine-Tuning / Multi-Task Learning
- Keywords: LoRA, automatic rank selection, FFT, multi-task learning, PEFT
## TL;DR
This paper proposes FAAR, a frequency-aware parameter-efficient fine-tuning method for multi-task learning. It introduces Performance-Driven Rank Shrinking (PDRS) to dynamically select the optimal rank per task and per layer, and designs a Task-Spectral Pyramidal Decoder (TS-PD) that leverages FFT frequency information to enhance spatial awareness and cross-task consistency. FAAR achieves superior performance using only 1/9 the parameters of full fine-tuning.
## Background & Motivation
Multi-task learning (MTL) aims to learn multiple tasks simultaneously, leveraging shared representations to discover inter-task relationships and structure. As backbone model sizes continue to grow, traditional full fine-tuning becomes increasingly impractical. Parameter-efficient fine-tuning (PEFT), particularly methods based on Low-Rank Adaptation (LoRA), has become the dominant approach.
However, existing LoRA-based MTL methods suffer from two core limitations:
Fixed-rank problem: Existing methods apply a uniform rank across all layers and tasks, which is counterintuitive — different tasks may require different adaptation strengths, and different layers may need varying degrees of fine-tuning flexibility. Deeper layers require stronger adaptation capacity to handle task-specific fine-grained information, while shallower layers may require only minor adjustments.
Lack of spatial inductive bias: Existing LoRA-based MTL strategies overlook cross-task interactions in deeper layers. For dense visual tasks such as semantic segmentation, depth estimation, and surface normal estimation, strong spatial awareness and cross-task geometric consistency are critical, yet low-rank adaptation inherently lacks such capability.
FAAR addresses these issues by:

- Resolving the fixed-rank problem through dynamic rank shrinking (PDRS), enabling each task/layer to automatically identify its optimal rank.
- Introducing inexpensive yet effective spatial information and cross-task relationships through frequency analysis (TS-PD).
## Method

### Overall Architecture
FAAR is built upon a frozen Swin Transformer backbone, with DoRA adapters placed in the attention and MLP layers. Within each Transformer stage, the final block uses task-specific adapters while preceding blocks share adapters. The backbone is followed by a Task-Spectral Pyramidal Decoder (TS-PD) for frequency enhancement and cross-task alignment. The entire training process is governed by PDRS, which dynamically reduces adapter rank.
### Key Designs
- Performance-Driven Rank Shrinking (PDRS):
- Rank Masking: At each forward pass, a prefix size \(b \in \{1, ..., r_{curr}\}\) is randomly sampled to construct a binary mask \(m\), allowing only the first \(b\) rank components to participate in computation:
    - \(A^{eff} = \text{diag}(m) A\), \(B^{eff} = B \text{diag}(m)\)
    - This forces important rank-1 updates to concentrate toward lower dimensions.
- Coverage Strategy:
    - At each backward pass, an importance score is computed for each active rank \(i\): \(s_i = \frac{1}{2}(|\langle A_{:,i}^{eff}, \frac{\partial \mathcal{L}}{\partial A_{:,i}^{eff}} \rangle| + |\langle B_{i,:}^{eff}, \frac{\partial \mathcal{L}}{\partial B_{i,:}^{eff}} \rangle|)\)
    - Scores are accumulated across batches via EMA: \(\hat{s}_i \leftarrow \beta \hat{s}_i + (1-\beta) s_i\)
    - At the end of each epoch, scores are sorted in descending order and the minimum number of ranks \(K\) satisfying the coverage ratio \(\rho\) is kept: \(K = \min\{k : c(k) \geq \rho\}\), where \(c(k)\) is the fraction of the total accumulated score covered by the top-\(k\) ranks.
    - Ranks outside the top \(K\) are permanently removed from optimization.
- Design Motivation: The directional derivative of the MTL loss reflects the actual contribution of each rank-1 component, so performance-driven shrinking preserves the critical updates.
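The PDRS steps above can be sketched as follows. This is a minimal NumPy sketch, not the paper's code: the function names (`rank_mask`, `masked_factors`, `coverage_select`), the uniform prefix sampling, and the factor shapes (`A` as rank-by-input, `B` as output-by-rank) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_mask(r_curr):
    """Sample a prefix size b in {1, ..., r_curr}; the mask keeps only the first b ranks."""
    b = int(rng.integers(1, r_curr + 1))
    m = np.zeros(r_curr)
    m[:b] = 1.0
    return m

def masked_factors(A, B, m):
    """A: (r, d_in), B: (d_out, r). Apply diag(m) on the rank axis of both factors."""
    A_eff = m[:, None] * A   # zero out ranks beyond the sampled prefix
    B_eff = B * m[None, :]
    return A_eff, B_eff

def importance(A_eff, gA, B_eff, gB):
    """Per-rank score s_i = 0.5 * (|<A_i, dL/dA_i>| + |<B_i, dL/dB_i>|)."""
    sA = np.abs(np.sum(A_eff * gA, axis=1))  # inner product along the input dim
    sB = np.abs(np.sum(B_eff * gB, axis=0))  # inner product along the output dim
    return 0.5 * (sA + sB)

def coverage_select(s_hat, rho=0.95):
    """Sort EMA scores descending; keep the smallest K ranks whose cumulative share >= rho."""
    order = np.argsort(s_hat)[::-1]
    c = np.cumsum(s_hat[order]) / s_hat.sum()
    K = int(np.searchsorted(c, rho) + 1)
    return np.sort(order[:K])  # indices of the surviving ranks

# EMA accumulation across batches: s_hat = beta * s_hat + (1 - beta) * s
```

The selection step is the key asymmetry with AdaLoRA-style methods: ranks are dropped by gradient-alignment coverage rather than by singular-value heuristics.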
- DoRA Adapters (instead of LoRA):
- DoRA decouples low-rank adaptation into magnitude and direction: \(\text{Out}_i^{DoRA} = m_i \frac{W_i + \alpha B_i A_i}{\|W_i + \alpha B_i A_i\|_2} x + b_i\)
- DoRA is more stable than LoRA at extremely low ranks and works more effectively with PDRS rank shrinking.
- Empirical validation shows that at high ranks DoRA does not necessarily outperform LoRA, but at low ranks DoRA is substantially better.
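A minimal NumPy sketch of the DoRA forward pass in the formula above. The column-wise normalization and per-column magnitude `m` follow the original DoRA formulation; the shapes and the function name `dora_forward` are illustrative assumptions.

```python
import numpy as np

def dora_forward(x, W, A, B, m, bias, alpha=1.0):
    """DoRA output: magnitude m times the unit-norm direction of W + alpha * B @ A.

    Shapes (illustrative): W (d_out, d_in) frozen weight, A (r, d_in), B (d_out, r)
    low-rank factors, m (1, d_in) per-column magnitudes, x (d_in,), bias (d_out,)."""
    W_adapt = W + alpha * (B @ A)                              # adapted weight
    col_norm = np.linalg.norm(W_adapt, axis=0, keepdims=True)  # column-wise L2 norm
    W_dora = m * W_adapt / col_norm                            # decoupled magnitude/direction
    return W_dora @ x + bias
```

At initialization (B = 0, m set to the column norms of W) this reproduces the frozen layer exactly, which is why DoRA remains stable even when PDRS shrinks the rank aggressively.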
- Task-Spectral Pyramidal Decoder (TS-PD):
- Channel-wise Spectral Filter (CW-SP):
- FFT is applied to each task-specific feature map; a task- and resolution-specific 2D frequency filter matrix \(W_t^{res}\) is learned.
- Frequencies are selectively enhanced or suppressed via element-wise multiplication: \(Y = W \odot \text{FFT}(I)\).
- After inverse FFT back to feature space, the output is modulated with learnable scale and shift parameters.
- Design Motivation: Different tasks require different frequency information — edge detection relies on high frequencies, while depth estimation leverages both high and low frequencies.
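The CW-SP pipeline above can be sketched as a minimal NumPy example. The function name `spectral_filter`, the real-valued filter `W_freq` shared across channels, and the scalar scale/shift are illustrative assumptions about details the summary does not pin down.

```python
import numpy as np

def spectral_filter(feat, W_freq, scale, shift):
    """CW-SP sketch: per-channel 2D FFT, learned per-frequency reweighting, inverse FFT.

    feat: (C, H, W) task-specific features; W_freq: (H, W) learned filter for this
    task/resolution; scale, shift: learnable modulation applied after the inverse FFT."""
    F = np.fft.fft2(feat, axes=(-2, -1))       # to the frequency domain
    Y = W_freq * F                             # enhance or suppress selected frequencies
    y = np.fft.ifft2(Y, axes=(-2, -1)).real    # back to feature space
    return scale * y + shift
```

With `W_freq` initialized to all ones, `scale = 1`, and `shift = 0`, the filter is an identity, so the module can start from a no-op and learn task-specific frequency emphasis.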
- Cross-Task Consensus Alignment (XT-Cons):
- For the primary task, an average spectral representation \(F_{avg}\) is computed from auxiliary task spectra.
- High- and low-frequency masks \(M_{low}\), \(M_{high}\) are extracted from the primary task spectrum.
- Alignment differences are computed as: \(\Delta_{low,high} = M_{low,high} \odot (F_{avg} - \text{FFT}(X_i^{main}))\)
- Contributions are scaled by learnable scalars \(\alpha_{low,high}\).
- Design Motivation: The frequency-domain "consensus" of auxiliary tasks is used to drive geometric consistency in the primary task representation, at lower cost than direct spatial-domain interaction.
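The XT-Cons steps above can be sketched as follows. This is a hedged NumPy sketch: the residual way the scaled differences are folded back into the primary spectrum, and the name `xt_consensus`, are assumptions not specified in the summary.

```python
import numpy as np

def xt_consensus(F_main, aux_spectra, M_low, M_high, a_low, a_high):
    """XT-Cons sketch: pull the primary-task spectrum toward the auxiliary consensus.

    F_main: (H, W) complex spectrum of the primary task, i.e. FFT(X_main);
    aux_spectra: (T-1, H, W) auxiliary-task spectra; M_low / M_high: binary
    frequency masks from the primary spectrum; a_low / a_high: learnable scalars."""
    F_avg = aux_spectra.mean(axis=0)        # frequency-domain consensus of auxiliary tasks
    delta_low = M_low * (F_avg - F_main)    # low-frequency alignment difference
    delta_high = M_high * (F_avg - F_main)  # high-frequency alignment difference
    return F_main + a_low * delta_low + a_high * delta_high
```

When the auxiliary spectra already agree with the primary one, both differences vanish and the spectrum passes through unchanged, so the module only intervenes where tasks disagree.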
### Loss & Training
- MTL loss: \(\mathcal{L}_{MTL} = \sum_{i=1}^{T} w_i \mathcal{L}_i\), with per-task weights \(w_i\) and task losses \(\mathcal{L}_i\):
    - Semantic segmentation and human part segmentation: pixel-wise cross-entropy
    - Depth estimation and surface normal estimation: L1 loss
    - Saliency detection: balanced cross-entropy
- Coverage ratio: \(\rho_{shared} = \rho_{task} = 0.95\)
- Backbone: Swin-Tiny (ImageNet-1k pretrained); decoder: HRNet
- Initial rank \(r_{init} = 64\), dynamically shrunk to approximately \(r_{global} \approx 5\) during training
- Single NVIDIA A40; learning rate \(5 \times 10^{-4}\); batch size 32
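The loss combination above can be sketched in NumPy. The helper names and the simple (e.g. uniform) weighting are illustrative assumptions; the summary does not give the actual \(w_i\) values, and balanced cross-entropy for saliency is omitted here for brevity.

```python
import numpy as np

def l1_loss(pred, target):
    """L1 loss, as used for depth and surface-normal estimation."""
    return float(np.mean(np.abs(pred - target)))

def pixelwise_ce(logits, labels):
    """Pixel-wise cross-entropy for segmentation; logits (C, H, W), labels (H, W) ints."""
    z = logits - logits.max(axis=0, keepdims=True)           # stabilize the softmax
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True))  # log-probabilities per pixel
    h, w = labels.shape
    return float(-np.mean(logp[labels, np.arange(h)[:, None], np.arange(w)[None, :]]))

def mtl_loss(task_losses, weights):
    """L_MTL = sum_i w_i * L_i."""
    return float(np.dot(weights, task_losses))
```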
## Key Experimental Results

### Main Results
PASCAL-Context dataset (4 tasks):
| Method | SemSeg (mIoU↑) | HumanParts (mIoU↑) | Saliency (mIoU↑) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|---|
| Single Task | 67.21 | 61.93 | 62.35 | 17.97 | 0 | 112.62 |
| MTL Full FT | 67.56 | 60.24 | 65.21 | 16.64 | +2.23 | 30.06 |
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 | 8.34 |
| TADFormer (r=64) | 70.82 | 60.45 | 65.88 | 16.48 | +4.24 | 7.38 |
| FAAR | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 | 3.38 |
NYUDv2 dataset (3 tasks):
| Method | SemSeg (mIoU↑) | Depth (rmse↓) | Normals (rmse↓) | Δm (%) | Params (M) |
|---|---|---|---|---|---|
| Single Task | 42.65 | 0.60 | 22.83 | 0 | 84.00 |
| MTL Full FT | 38.85 | 0.66 | 24.33 | -8.49 | 28.10 |
| TADFormer (r=64) | 40.85 | 0.64 | 27.48 | -10.42 | 8.90 |
| FAAR | 41.27 | 0.63 | 26.35 | -7.88 | 2.85 |
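The Δm column in both tables can be reproduced assuming the standard multi-task performance metric (average relative gain over the single-task baselines, with the sign flipped for lower-is-better metrics such as rmse):

```python
def delta_m(mtl, single, lower_better):
    """Average per-task relative improvement over single-task baselines, in percent."""
    terms = [
        (-1.0 if lb else 1.0) * (m - s) / s
        for m, s, lb in zip(mtl, single, lower_better)
    ]
    return 100.0 * sum(terms) / len(terms)

# FAAR row of the PASCAL-Context table (Normals rmse is lower-is-better):
dm = delta_m([72.02, 61.25, 66.11, 16.35],
             [67.21, 61.93, 62.35, 17.97],
             [False, False, False, True])
# dm ≈ 5.28, matching the reported Δm of +5.28%
```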
### Ablation Study
Component ablation on PASCAL-Context:
| Configuration | SemSeg | HumanParts | Saliency | Normals | Δm |
|---|---|---|---|---|---|
| MTLoRA (r=64) | 67.90 | 59.84 | 65.40 | 16.60 | +2.55 |
| + DoRA (high rank) | 67.55 | 60.00 | 64.70 | 17.20 | +1.36 |
| + PDRS w/ LoRA | 68.11 | 59.93 | 65.54 | 16.50 | +2.83 |
| + PDRS w/ DoRA (1) | 71.35 | 61.02 | 65.92 | 16.42 | +4.92 |
| + TS-PD (2) | 70.73 | 60.95 | 65.92 | 16.40 | +4.63 |
| FAAR (1+2) | 72.02 | 61.25 | 66.11 | 16.35 | +5.28 |
### Key Findings
- Rank shrinkage patterns are intuitive: Task-specific and deeper layers tend to retain larger ranks, as they handle finer task-specific information; shared and shallower layers have their ranks substantially reduced.
- DoRA significantly outperforms LoRA at low ranks: At high ranks, DoRA performs worse (+1.36 vs. +2.55), but after PDRS shrinks the rank to low values, DoRA yields a substantial advantage (+4.92).
- Initial rank has minimal impact on final performance: Results are nearly identical across \(r_{init} \in \{16, 32, 64\}\), indicating that the PDRS search space is sufficient.
- Cross-task alignment via XT-Cons is effective: An additional +0.8% Δm gain over TS-PD alone validates the value of frequency-domain cross-task consistency.
- 9× parameter reduction: FAAR (3.38M) vs. MTL Full FT (30.06M), with superior performance.
## Highlights & Insights
- Performance-driven rank shrinking: Unlike AdaLoRA (singular value importance) or DyLoRA (robustness to low-rank training), PDRS directly uses the directional derivative of the MTL loss to guide rank pruning, aligning more directly with the optimization objective.
- Frequency domain as a cross-task bridge: This work is the first to leverage FFT for dense visual MTL. The frequency domain naturally separates edge and semantic information, providing a meaningful shared basis for different tasks.
- Synergy between DoRA and extremely low rank: At high ranks, DoRA does not necessarily outperform LoRA, but when ranks are dynamically compressed to very low values by PDRS, the magnitude-direction decoupling in DoRA becomes critical.
- Simultaneous improvement across all tasks: FAAR outperforms baselines on all 4 PASCAL tasks without sacrificing any task for another.
## Limitations & Future Work
- On NYUDv2, no MTL PEFT method surpasses single-task training; FAAR does not fully resolve the difficulty of MTL on small datasets.
- The coverage ratio \(\rho\) still requires manual specification (although the paper reports 0.95 selected on the validation set, different datasets may require different values).
- Only the Swin-Tiny backbone is evaluated; effectiveness on larger backbones (e.g., Swin-Base/Large) or ViT architectures remains unknown.
- The TS-PD frequency filter matrices are learned separately for each resolution and task, causing parameter count to grow with the number of tasks.
- Cross-task alignment is performed only in the frequency domain; spatial-domain interactions may provide additional complementary information.
## Related Work & Insights
- MTLoRA / TADFormer: LoRA-based baselines for MTL with fixed ranks.
- AdaLoRA / AutoLoRA / DyLoRA: Automatic rank selection methods for single-task settings; FAAR extends these to the multi-task setting.
- FADA / NightAdapter: Frequency adapters applied to domain generalization and nighttime segmentation, which inspired the design of TS-PD.
- DiTASK: An alternative approach that adapts singular values via neural diffeomorphisms.
## Rating
- Novelty: ⭐⭐⭐⭐ (Both PDRS and TS-PD are individually novel, though each represents an incremental combination of existing ideas)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets, detailed ablations, and comprehensive parameter efficiency comparisons)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, though the density of formulas and abbreviations somewhat hinders readability)
- Value: ⭐⭐⭐⭐ (Provides a practical and efficient solution for MTL PEFT; the 9× parameter reduction is compelling)