Bilevel Layer-Positioning LoRA for Real Image Dehazing¶
Conference: CVPR 2026 · arXiv: 2603.10872 · Code: GitHub · Area: Model Compression · Keywords: image dehazing, LoRA, bilevel optimization, CLIP, unsupervised adaptation, parameter-efficient fine-tuning
TL;DR¶
This paper proposes BiLaLoRA, which employs bilevel optimization to automatically identify the optimal network layers for LoRA insertion, coupled with H2C Loss — an unsupervised dehazing loss based on CLIP semantic directions — to efficiently adapt synthetic-data-pretrained dehazing models to real-world scenes. The approach reduces training time by 77.7% while matching full fine-tuning performance, and generalizes across models and domains.
Background & Motivation¶
Image dehazing is a classical problem in low-level vision. Mainstream methods rely on synthetic data (e.g., ITS/OTS from the RESIDE dataset) for supervised training, but suffer from a severe domain gap:
Synthetic-to-real domain gap: Synthetic hazy images are generated via the atmospheric scattering model \(I(x) = J(x)t(x) + A(1-t(x))\), which differs significantly from the complex degradations in real haze (non-uniform haze, color shifts, multi-layer fog, etc.).
Absence of paired real data: It is nearly impossible to obtain paired hazy/clear images of the same real scene, rendering conventional supervised fine-tuning infeasible.
High cost of full fine-tuning: For Transformer-based dehazing models, updating all parameters is time-consuming and prone to overfitting on limited adaptation data.
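As a toy illustration (not from the paper), the atmospheric scattering model above can be simulated in a few lines of NumPy; the `beta` and `A` values here are arbitrary choices:

```python
import numpy as np

def synthesize_haze(J, depth, beta=1.0, A=0.8):
    """Atmospheric scattering model I = J*t + A*(1 - t),
    with per-pixel transmission t(x) = exp(-beta * depth(x))."""
    t = np.exp(-beta * depth)[..., None]   # broadcast over RGB channels
    return J * t + A * (1.0 - t)

# 4x4 gray "clear" image with a linear depth ramp: nearby pixels keep
# their color, while distant pixels drift toward the airlight A
J = np.full((4, 4, 3), 0.5)
depth = np.linspace(0.0, 3.0, 16).reshape(4, 4)
I = synthesize_haze(J, depth)
```

This also makes the domain gap concrete: every synthetic pixel follows this one homogeneous model, whereas real haze violates it.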
Limitations of prior methods:
- Domain adaptation methods (DA-dehazing, USID-Net) rely on CycleGAN-style translation, which suffers from training instability and may introduce artifacts.
- LoRA fine-tuning reduces the parameter count, but the choice of which layers to insert LoRA into is critical — random or uniform placement is far from optimal.
Core Problem¶
How can a dehazing model pretrained on synthetic data be adapted to real haze scenes at minimal training cost, without any real paired supervision? Two sub-problems must be addressed:
1. Design of an unsupervised optimization objective in the absence of real ground truth.
2. Automated and optimal selection of LoRA insertion layers.
Method¶
H2C Loss: Haze-to-Clear Text-Guided Loss¶
Core idea: The CLIP-pretrained vision-language alignment space is leveraged to construct a semantic direction from "hazy" to "clear" as an unsupervised dehazing signal.
Defining positive and negative text prompts:
- \(T_{\text{pos}}\) (positive/clear): "a clear photo", "a bright image", "a high-quality photo"
- \(T_{\text{neg}}\) (negative/hazy): "a hazy photo", "a foggy image", "a blurry photo"
Semantic direction computation:
Image-domain direction: \(\Delta V_{\text{img}} = V_{\text{out}} - V_{\text{in}}\)
where \(V_{\text{out}} = \text{CLIP}_{\text{img}}(\hat{J})\) is the CLIP image feature of the dehazed output and \(V_{\text{in}} = \text{CLIP}_{\text{img}}(I)\) is the feature of the hazy input.
Text-domain direction: \(\Delta T_{\text{text}} = T_{\text{pos}} - T_{\text{neg}}\)
H2C Loss:

\[\mathcal{L}_{\text{H2C}} = 1 - \cos\left(\Delta V_{\text{img}}, \Delta T_{\text{text}}\right)\]

Minimizing this loss maximizes the cosine similarity between the image change direction and the "hazy→clear" text direction, requiring no real ground truth and relying entirely on CLIP's semantic priors.
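A minimal NumPy sketch of this directional loss, using random placeholder vectors where real CLIP image/text embeddings would go (the helper name `h2c_loss` is mine, not the paper's):

```python
import numpy as np

def h2c_loss(v_in, v_out, t_neg, t_pos, eps=1e-8):
    """Directional H2C loss: 1 - cos(dV_img, dT_text).
    v_in/v_out stand in for CLIP image features of the hazy input and
    the dehazed output; t_neg/t_pos for the prompt embeddings."""
    dv = v_out - v_in                      # image-domain direction
    dt = t_pos - t_neg                     # "hazy -> clear" text direction
    cos = dv @ dt / (np.linalg.norm(dv) * np.linalg.norm(dt) + eps)
    return 1.0 - cos

rng = np.random.default_rng(0)
t_neg, t_pos = rng.normal(size=512), rng.normal(size=512)
v_in = rng.normal(size=512)

# an output whose feature moved along the text direction scores ~0 ...
aligned = h2c_loss(v_in, v_in + 2.0 * (t_pos - t_neg), t_neg, t_pos)
# ... while one that moved the opposite way scores ~2
opposed = h2c_loss(v_in, v_in - (t_pos - t_neg), t_neg, t_pos)
```

Note that only the *direction* of the feature change is scored, so the output is free to move any distance along it — the degenerate-solution point discussed later in the Highlights.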
BiLaLoRA: Bilevel Optimization for Layer Positioning¶
Determining which layers to insert LoRA into and how to weight their importance is a combinatorial optimization problem. BiLaLoRA formulates this as bilevel optimization:
Upper-level optimization (layer selection weights \(\alpha\)):

\[\min_{\alpha} \; \mathcal{L}_{\text{val}}\left(\omega^{*}(\alpha), \alpha\right)\]

Lower-level optimization (LoRA weights \(\omega\)):

\[\omega^{*}(\alpha) = \arg\min_{\omega} \; \mathcal{L}_{\text{train}}(\omega, \alpha)\]
where \(\alpha = \{\alpha_1, \ldots, \alpha_L\}\) denotes per-layer selection weights and \(\omega\) denotes all LoRA parameters.
Continuous relaxation: Gumbel-Sigmoid is applied to relax the discrete layer selection:

\[g_l = \sigma\!\left(\frac{\alpha_l + G}{\tau}\right)\]

where \(G\) is Gumbel noise, \(\sigma\) is the sigmoid function, and \(\tau\) is a temperature parameter. During training, \(g_l\) serves as a continuous weight; after the search, the Top-K layers are selected and fixed.
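The relaxation can be sketched in NumPy as follows (the layer weights below are hypothetical values for illustration):

```python
import numpy as np

def gumbel_sigmoid(alpha, tau=1.0, rng=None):
    """Continuous gate g_l = sigmoid((alpha_l + G) / tau),
    where G is Gumbel(0, 1) noise."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=np.shape(alpha))
    G = -np.log(-np.log(u))                # Gumbel(0, 1) sample
    return 1.0 / (1.0 + np.exp(-(np.asarray(alpha) + G) / tau))

# per-layer selection weights: a larger alpha pushes the stochastic
# gate toward 1, but any single draw remains noisy
alpha = np.array([-4.0, 0.0, 4.0])
gates = gumbel_sigmoid(alpha, tau=0.5, rng=0)

# averaging over many noise draws recovers the expected monotone behavior
mean_gates = np.mean(
    [gumbel_sigmoid(alpha, tau=0.5, rng=s) for s in range(500)], axis=0)
```

The noise keeps the gates stochastic (aiding exploration during the search), while lowering \(\tau\) sharpens them toward a hard 0/1 selection.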
Efficient hypergradient computation: Standard bilevel optimization requires second-order Hessian computation. This work adopts a rank-one approximation, estimating the mixed second-order term by finite differences:

\[\nabla^{2}_{\alpha,\omega}\mathcal{L}_{\text{train}}(\omega,\alpha)\,\nabla_{\omega}\mathcal{L}_{\text{val}} \approx \frac{\nabla_{\alpha}\mathcal{L}_{\text{train}}(\omega^{+},\alpha) - \nabla_{\alpha}\mathcal{L}_{\text{train}}(\omega^{-},\alpha)}{2\epsilon}\]

where \(\omega^{\pm} = \omega \pm \epsilon \nabla_{\omega} \mathcal{L}_{\text{val}}\), requiring only first-order derivatives.
Two-Stage Training Pipeline¶
- Stage 1 — Bilevel Search: \(\alpha\) and \(\omega\) are jointly optimized using Gumbel-Sigmoid continuous relaxation. After the search, Top-K layers are selected based on \(\alpha\) values.
- Stage 2 — LoRA Fine-tuning: LoRA weights are trained only on the Top-K fixed layers; all other layers remain frozen.
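The hand-off between the two stages reduces to a Top-K selection over the searched weights; a one-line sketch with hypothetical \(\alpha\) values for a 6-layer model:

```python
import numpy as np

# hypothetical post-search selection weights for a 6-layer model
alpha = np.array([0.1, 2.3, -0.5, 1.8, 0.0, 3.1])
K = 3

# keep the K layers with the largest weights for Stage-2 LoRA
# fine-tuning; all remaining layers stay frozen
topk_layers = np.sort(np.argsort(alpha)[-K:])
```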
Total Training Loss¶
\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{H2C}} + \lambda\,\mathcal{L}_{\text{reg}}\]

where \(\mathcal{L}_{\text{reg}}\) is a regularization term preventing excessive deviation of the dehazed output from the input, and \(\lambda\) is a balancing coefficient.
Key Experimental Results¶
Cross-Model Adaptation Performance¶
BiLaLoRA is effective across four different dehazing backbones:
| Base Model | Method | RTTS (MUSIQ↑) | URHI (MUSIQ↑) | Parameters |
|---|---|---|---|---|
| MSBDN | Full Fine-tuning | Baseline | Baseline | 100% |
| MSBDN | BiLaLoRA | On par | On par | ~5% |
| DeHamer | Full Fine-tuning | Baseline | Baseline | 100% |
| DeHamer | BiLaLoRA | On par | On par | ~5% |
| ConvIR | Full Fine-tuning | Baseline | Baseline | 100% |
| ConvIR | BiLaLoRA | On par | On par | ~5% |
| DEA | Full Fine-tuning | Baseline | Baseline | 100% |
| DEA | BiLaLoRA | On par | On par | ~5% |
Comparison with Real Dehazing SOTA¶
| Method | RTTS | URHI | Fattal |
|---|---|---|---|
| DAD (CVPR 2020) | Low | Low | Low |
| USID-Net (TIP 2022) | Medium | Medium | Medium |
| BiLaLoRA (Ours) | SOTA | SOTA | SOTA |
State-of-the-art results are achieved on all three real dehazing benchmarks: RTTS, URHI, and Fattal.
Training Efficiency¶
| Method | Training Time | vs. Full Fine-tuning |
|---|---|---|
| Full Fine-tuning | 100% | Baseline |
| LoRA (uniform) | ~40% | −60% |
| BiLaLoRA | ~22.3% | −77.7% |
Training time is reduced by 77.7%, primarily because only a small number of layers are trained after the Stage-1 search.
Ablation Study¶
| Component | MUSIQ | Note |
|---|---|---|
| Full BiLaLoRA | Best | — |
| w/o H2C Loss (replaced with L1) | Significant drop | Unsupervised adaptation fails |
| Uniform LoRA (no bilevel search) | Drop | Demonstrates importance of layer selection |
| Random layer selection | Larger drop | Worse than uniform |
| Full-layer LoRA | Moderate | More parameters but inferior to targeted placement |
Cross-Domain Adaptation¶
| Training Data | Test Data | BiLaLoRA |
|---|---|---|
| ITS (indoor synthetic) | RTTS (real) | Effective |
| OTS (outdoor synthetic) | URHI (real) | Effective |
| Daytime haze | Nighttime haze | Effective |
| Synthetic domain A | Synthetic domain B | Effective |
Highlights & Insights¶
- Elegant H2C Loss design: Constructing an unsupervised loss via directional semantics in CLIP space is more stable than CycleGAN-style methods and avoids artifact risks. The key insight is using a directional difference rather than an absolute distance, preventing degenerate solutions where the output is forced to match specific text embeddings.
- Applying NAS ideas to LoRA layer selection: Bilevel optimization automates optimal layer identification, eliminating manual trial-and-error. Gumbel-Sigmoid relaxation makes the discrete search differentiable.
- Rank-one approximation reduces bilevel optimization cost: Simplifying second-order Hessian computation to first-order significantly improves practical feasibility.
- Strong cross-model generalizability: The method is effective on both CNN (MSBDN, ConvIR) and Transformer (DeHamer, DEA) architectures, demonstrating architecture-agnostic applicability.
- Decoupled two-stage pipeline: Separating search from training allows the search phase to complete rapidly, with training focused on only the most effective layers.
Limitations & Future Work¶
- CLIP dependency: The quality of H2C Loss depends on the quality of CLIP's semantic space. For degradation types not well covered by CLIP (e.g., extreme haze), the directional signal may be inaccurate.
- Hard Top-K truncation: Selecting Top-K layers after the search is a discrete approximation that may discard contributions from borderline layers. The choice of K requires cross-validation.
- Validation limited to dehazing: Although the framework is general, experiments are conducted only on dehazing; effectiveness on other low-level vision tasks (deraining, denoising, super-resolution) remains unverified.
- Limitations of perceptual quality metrics: Evaluation relies primarily on no-reference metrics such as MUSIQ; validation using reference metrics (PSNR/SSIM on paired data) is absent.
- Lack of comparison with recent vision foundation models: No comparison is made against generative dehazing methods, such as those built on Stable Diffusion.
Related Work & Insights¶
- vs. standard LoRA fine-tuning: Uniform LoRA insertion underperforms BiLaLoRA's adaptive selection, demonstrating that where LoRA is inserted matters more than how much.
- Analogy to DARTS (classic bilevel optimization in NAS): BiLaLoRA transfers DARTS's operation search paradigm to LoRA layer selection, representing a cross-disciplinary innovation between NAS and PEFT.
- Connection to CLIP-guided methods (CLIPasso, StyleCLIP): All leverage CLIP semantic directions to guide optimization, but BiLaLoRA introduces directional loss for low-level vision tasks, with a use case more grounded in physical degradation modeling.
- Insights: (1) The bilevel layer selection paradigm can be extended to position optimization for other PEFT methods (e.g., Adapter, Prefix Tuning); (2) The "directional" design of H2C Loss can be applied to other unsupervised image restoration tasks — requiring only the definition of a degraded→restored text direction.
Rating¶
- Novelty: ⭐⭐⭐⭐ — H2C Loss and bilevel layer positioning each contribute independently, forming a coherent and complete solution.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Cross-model, cross-domain, and ablation experiments are thorough, though reference-metric validation is absent.
- Practicality: ⭐⭐⭐⭐⭐ — A 77.7% reduction in training time with no performance degradation is highly deployment-friendly.
- Writing Quality: ⭐⭐⭐⭐ — Method description is clear; mathematical derivation of bilevel optimization is complete.
- Overall: ⭐⭐⭐⭐ (4.0/5)