FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation¶
- Conference: ICCV 2025
- arXiv: 2509.01107
- Code: None (not mentioned)
- Area: Image Generation / Layout-to-Image / Degraded Scene Synthesis
- Keywords: Layout-to-Image, Degraded Image Generation, Frequency Disentanglement, Low-light, Remote Sensing, Underwater
TL;DR¶
FICGen is proposed as the first method to address the "contextual illusion dilemma" in Layout-to-Image (L2I) generation for degraded scenes (low-light, underwater, remote sensing, adverse weather, etc.). It extracts high- and low-frequency prototypes of degraded scenes via a learnable dual-query mechanism, injects them into the latent diffusion space through visual-frequency enhanced attention, and achieves foreground-background disentanglement using instance consistency maps and spatial-frequency adaptive aggregation. FICGen comprehensively outperforms existing L2I methods across five degraded-scene datasets.
Background & Motivation¶
Problem Background¶
Visual perception tasks in degraded scenes (low-light, underwater, remote sensing, adverse weather, etc.) suffer from severe data scarcity. For example, the ExDARK low-light dataset contains only 7,363 images, roughly 1/20 the size of COCO. Layout-to-Image (L2I) generation is a promising approach for synthesizing training data from layout conditions.
Core Challenge: Contextual Illusion Dilemma¶
Existing L2I methods perform well on natural scenes but face serious issues when applied to degraded scenarios:
- Remote sensing objects (e.g., vehicles) are small and visually similar to surrounding structures (e.g., bridges).
- Underwater species (e.g., fish) frequently merge with nearby organisms (e.g., coral).
- These ambiguities lead to hallucinations in object count, position, and interaction during generation.
Frequency-Domain Analysis¶
In natural images, high-frequency (HF) and low-frequency (LF) components are relatively balanced, and foreground-background distinction is clear. In degraded images, high-frequency details of foreground instances are attenuated, while low-frequency background components dominate the overall frequency distribution. This explains why instances tend to be "submerged" in degraded scenes.
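To make this observation concrete, a minimal diagnostic (our construction, not code from the paper) computes the ratio of high- to low-frequency spectral energy; degraded images should yield markedly lower ratios than natural ones. The radius threshold `gamma` mirrors the paper's HF-region parameter \(\gamma\), but the split rule itself is an assumption:

```python
import numpy as np

def hf_lf_energy_ratio(img_gray: np.ndarray, gamma: float = 0.25) -> float:
    """Ratio of high- to low-frequency spectral energy of a grayscale image.

    Frequencies within gamma * min(H, W) / 2 of the spectrum centre are
    treated as low-frequency; everything else as high-frequency.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(img_gray))
    power = np.abs(spectrum) ** 2
    h, w = img_gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    lf_mask = radius <= gamma * min(h, w) / 2
    return float(power[~lf_mask].sum() / power[lf_mask].sum())
```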
Motivation¶
The paper motivates contextual disentanglement from a frequency perspective: extracting high-frequency (instance boundaries/textures) and low-frequency (background color/atmosphere) knowledge from degraded scenes, injecting them into the diffusion generation process, and achieving latent-space foreground-background disentanglement via instance-level masks.
Method¶
Overall Architecture¶
FICGen comprises three core modules:
1. Frequency Perceiver Resamplers — extract HF/LF frequency prototypes via a dual-query mechanism.
2. Visual-Frequency Enhanced Attention — injects frequency knowledge into the latent diffusion space.
3. Adaptive Spatial-Frequency Aggregation — blends spatial and frequency information to reconstruct degraded representations.
Frequency Prototype Extraction¶
Step 1: Constructing Frequency Prototypes. Degraded instances are sampled per category from the training set, and intermediate feature maps \(\mathbf{X} \in \mathbb{R}^{H \times W}\) are transformed to the frequency domain via the DFT.
A binary mask \(\mathbf{M}_{\mathcal{F}}\) separates the HF and LF regions, and learnable channel weights are applied before the inverse DFT maps both components back to the spatial domain.
Average pooling over HF/LF feature maps yields frequency prototypes: \(\textbf{p}^{\uparrow} = \{p_i^{\uparrow}\}_{i=1}^N\) (instance HF) and \(p^{\downarrow}\) (background LF).
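A minimal sketch of Step 1, assuming a radius-\(\gamma\) binary mask and per-channel learnable weights (the paper's exact mask construction and weight placement may differ):

```python
import torch
import torch.nn as nn

class FrequencyPrototypeExtractor(nn.Module):
    """DFT -> binary HF/LF mask -> learnable channel weights -> inverse DFT,
    followed by average pooling into frequency prototypes."""

    def __init__(self, channels: int, gamma: float = 0.25):
        super().__init__()
        self.gamma = gamma
        self.w_hf = nn.Parameter(torch.ones(channels))  # HF channel weights
        self.w_lf = nn.Parameter(torch.ones(channels))  # LF channel weights

    def forward(self, x: torch.Tensor):
        # x: (B, C, H, W) intermediate feature maps of degraded instances
        B, C, H, W = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        yy, xx = torch.meshgrid(
            torch.arange(H, dtype=torch.float32, device=x.device),
            torch.arange(W, dtype=torch.float32, device=x.device),
            indexing="ij")
        radius = torch.hypot(yy - H / 2, xx - W / 2)
        m_lf = (radius <= self.gamma * min(H, W) / 2).float()  # binary mask M_F
        hf = spec * (1 - m_lf) * self.w_hf.view(1, C, 1, 1)
        lf = spec * m_lf * self.w_lf.view(1, C, 1, 1)
        x_hf = torch.fft.ifft2(torch.fft.ifftshift(hf, dim=(-2, -1))).real
        x_lf = torch.fft.ifft2(torch.fft.ifftshift(lf, dim=(-2, -1))).real
        # average pooling -> instance HF prototypes and background LF prototype
        return x_hf.mean(dim=(-2, -1)), x_lf.mean(dim=(-2, -1))  # (B, C) each
```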
Step 2: Dual-Query Frequency Resamplers. Inspired by the Perceiver architecture, two independent sets of learnable queries interact with the frequency prototypes through Transformer blocks.
The dual-query mechanism simultaneously captures instance boundary textures (HF) and environmental atmosphere/color (LF).
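A Perceiver-style resampler sketch; FICGen would instantiate two of these, one for the HF query and one for the LF query (depth, width, and query count here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FrequencyResampler(nn.Module):
    """Learnable queries cross-attend to frequency prototypes."""

    def __init__(self, dim: int, num_queries: int = 8, depth: int = 2, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)])
        self.ffns = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                           nn.GELU(), nn.Linear(4 * dim, dim)) for _ in range(depth)])

    def forward(self, prototypes: torch.Tensor) -> torch.Tensor:
        # prototypes: (B, N, dim); returns frequency-aware tokens (B, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(prototypes.size(0), -1, -1)
        for attn, ffn in zip(self.attns, self.ffns):
            q = q + attn(q, prototypes, prototypes, need_weights=False)[0]
            q = q + ffn(q)
        return q

# hf_tokens = hf_resampler(p_hf)   # instance boundary/texture tokens
# lf_tokens = lf_resampler(p_lf)   # environmental atmosphere/colour tokens
```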
Contextual Frequency Knowledge Transfer¶
Visual-Frequency Enhanced Attention: frequency-aware tokens are fused with layout conditions and injected into the diffusion U-Net:
- Instance representation: \(\textbf{R}_i = [\textbf{q}_i^{\uparrow}; \textbf{E}_{clip}(l_i); \textbf{E}_{box}(\text{Fourier}(b_i))]\) (HF token + semantics + position; the box encoding is sketched below).
- Background representation: \(\textbf{G} = [\textbf{q}^{\downarrow}; \textbf{E}_{clip}(\mathcal{Y})]\) (LF token + global description).
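The box encoding \(\text{Fourier}(b_i)\) is presumably the GLIGEN-style Fourier feature; a sketch under that assumption (frequency count and dimensions are illustrative):

```python
import torch

def fourier_box_embedding(boxes: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Fourier features for normalised boxes (B, N, 4) -> (B, N, 8 * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=boxes.device)) * torch.pi
    angles = boxes.unsqueeze(-1) * freqs               # (B, N, 4, num_freqs)
    emb = torch.cat([angles.sin(), angles.cos()], -1)  # (B, N, 4, 2 * num_freqs)
    return emb.flatten(-2)

# instance token R_i: concatenate the HF query token, the CLIP label embedding,
# and the (MLP-projected) box embedding along the feature dimension
# r_i = torch.cat([q_hf_i, e_clip_i, box_mlp(fourier_box_embedding(boxes))], dim=-1)
```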
Instance Consistency Maps decouple foreground from background in the latent space.
Mask constraints ensure each layout condition influences only its corresponding local region, preventing attribute leakage.
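One plausible realisation of the mask constraint (our sketch; the paper may construct the maps differently): rasterise each instance box at latent resolution and use the resulting binary map to bias cross-attention logits so pixels outside the box cannot attend to that instance's token.

```python
import torch

def box_consistency_maps(boxes: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Rasterise normalised xyxy boxes (N, 4) into binary maps (N, h, w)."""
    maps = torch.zeros(boxes.size(0), h, w)
    for i, (x0, y0, x1, y1) in enumerate(boxes.tolist()):
        r0, r1 = int(y0 * h), max(int(y1 * h), int(y0 * h) + 1)
        c0, c1 = int(x0 * w), max(int(x1 * w), int(x0 * w) + 1)
        maps[i, r0:r1, c0:c1] = 1.0
    return maps

# bias: a latent pixel may only attend to instance i inside its map
# attn_bias = (maps.flatten(1).T - 1.0) * 1e4   # (h*w, N), ~-inf outside boxes
# logits = logits + attn_bias
```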
Adaptive Spatial-Frequency Aggregation¶
Rather than simple summation or purely spatial-domain fusion, FICGen aggregates degraded instances and background simultaneously in both the spatial and frequency domains through two parallel branches: a spatial attention module (SAM) that captures spatial relational dependencies via standard self-attention, and a frequency attention module (FAM) that emphasizes fine-grained cross-instance attributes (boundary sharpness, texture). The two streams are fused via a learnable depthwise convolution \(\zeta\) and softmax-weighted aggregation to produce the final degraded representation \(\delta^{final}\).
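A sketch of the aggregation under stated assumptions: SAM is plain self-attention over spatial tokens, while FAM is approximated here by a learnable per-channel spectral gate rather than the paper's full frequency attention; \(\zeta\) is modelled as a depthwise convolution feeding per-pixel softmax fusion weights.

```python
import torch
import torch.nn as nn

class SpatialFrequencyAggregation(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.sam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))  # FAM: per-channel spectral gate
        self.zeta = nn.Conv2d(2 * dim, 2 * dim, 3, padding=1, groups=2 * dim)
        self.to_logits = nn.Conv2d(2 * dim, 2, 1)   # per-pixel stream logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) blended instance + background representation
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)                    # (B, HW, C)
        s = self.sam(tokens, tokens, tokens, need_weights=False)[0]
        s = s.transpose(1, 2).view(B, C, H, W)                   # spatial stream
        spec = torch.fft.rfft2(x)
        spec = spec * torch.sigmoid(self.gate).view(1, C, 1, 1)  # spectral gating
        f = torch.fft.irfft2(spec, s=(H, W))                     # frequency stream
        w = torch.softmax(self.to_logits(self.zeta(torch.cat([s, f], 1))), dim=1)
        return w[:, :1] * s + w[:, 1:] * f                       # delta^final
```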
Loss & Training¶
Pre-trained LDM parameters are frozen; only the FICGen modules are trained.
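The objective is presumably the standard conditional LDM denoising loss, reconstructed here under that assumption (any auxiliary terms in the paper are omitted):

$$
\mathcal{L} = \mathbb{E}_{z,\;\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\;t}\Big[\,\big\|\epsilon - \epsilon_\theta\big(z_t,\; t,\; \{\mathbf{R}_i\}_{i=1}^{N},\; \mathbf{G}\big)\big\|_2^2\,\Big]
$$

where \(z_t\) is the noised latent at timestep \(t\), and \(\{\mathbf{R}_i\}\), \(\mathbf{G}\) are the instance and background representations defined above.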
Key Experimental Results¶
Experimental Setup¶
- Base Model: SDv1.5; FICGen is deployed only at the 8×8 and 16×16 resolution decoder layers.
- Training: AdamW, lr=1e-4, 300 epochs, 8×A100, batch size=320.
- Evaluation: Five degraded-scene datasets — ExDARK (low-light), RUOD (underwater), DIOR-H (remote sensing), DAWN (adverse weather), blurred VOC2012 (blur).
- Metrics: FID (fidelity), COCO-style AP (alignment), downstream detector mAP (trainability).
Main Results: Fidelity and Alignment¶
| Dataset | Method | FID↓ | mAP↑ | AP_50↑ | AP_75↑ |
|---|---|---|---|---|---|
| DIOR-H (Remote Sensing) | MIGC | 31.64 | 21.8 | 38.4 | 17.5 |
| DIOR-H | CC-Diff | 30.88 | 23.6 | 42.4 | 21.4 |
| DIOR-H | FICGen | 31.25 | 27.6 | 48.7 | 27.6 |
| RUOD (Underwater) | MIGC | 26.50 | 27.2 | 54.1 | 24.6 |
| RUOD | CC-Diff | 25.21 | 29.7 | 58.4 | 27.9 |
| RUOD | FICGen | 25.10 | 37.0 | 68.6 | 36.5 |
| ExDARK (Low-light) | MIGC | 45.76 | 32.4 | 63.5 | 29.5 |
| ExDARK | CC-Diff | 44.26 | 35.1 | 65.6 | 34.1 |
| ExDARK | FICGen | 42.40 | 42.5 | 73.0 | 45.1 |
FICGen substantially leads in alignment (mAP) across all degraded scenes. Notably, on ExDARK it even surpasses the real-data baseline (42.5 vs. 37.2 mAP), demonstrating that generated degraded instances precisely follow the layout.
DIOR-H Remote Sensing Comparison (Multiple Methods)¶
| Method | FID↓ | YOLO mAP↑ | AP_50↑ | AP_75↑ |
|---|---|---|---|---|
| LayoutDiffusion | 45.31 | 20.0 | 37.4 | 19.3 |
| GLIGEN | 41.31 | 25.8 | 44.4 | 27.8 |
| AeroGen | 38.57 | 29.8 | 54.2 | 31.6 |
| CC-Diff | 30.88 | 26.4 | 44.2 | 28.5 |
| FICGen | 31.25 | 31.2 | 49.9 | 34.6 |
FICGen achieves the best mAP under YOLO evaluation as well.
Downstream Trainability (Data Augmentation Effect)¶
When FICGen-synthesized data is used to augment downstream detector training:
- Consistent improvements of roughly 2.0 mAP across datasets.
- Significant per-category gains, e.g., the "airport" class in remote sensing improves by +6.0 AP (32.2→38.1).
- On ExDARK, Cascade R-CNN mAP improves from 37.2 to 42.5.
Deformable-DETR Validation¶
| Dataset | Method | mAP↑ | AP_50↑ | AP_75↑ |
|---|---|---|---|---|
| ExDARK | CC-Diff | 31.3 | 61.8 | 28.8 |
| ExDARK | FICGen | 38.5 | 68.5 | 39.5 |
| RUOD | CC-Diff | 29.7 | 57.8 | 28.0 |
| RUOD | FICGen | 37.1 | 67.1 | 36.7 |
FICGen's advantage is even more pronounced when evaluated with the stronger Deformable-DETR detector.
Highlights & Insights¶
- First systematic treatment of L2I generation for degraded scenes: introduces the concept of the "contextual illusion dilemma" and provides a frequency-domain solution.
- Elegant design of frequency prototypes: explicitly models the frequency characteristics of degraded scenes (HF instance attenuation + LF background dominance) as learnable prototypes.
- Clean dual-query architecture: the HF query focuses on instance details while the LF query captures environmental atmosphere, with a clear division of roles.
- Simple yet effective instance consistency maps: binary masks enable latent-space disentanglement, preventing attribute leakage and object merging.
- Broad validation across five degraded scenarios: from severe low-light to mild blur, demonstrating the generality of the approach.
- Practical data augmentation capability: synthesized data directly improves downstream detector performance.
Limitations & Future Work¶
- Frequency prototypes are constructed by sampling degraded instances from the training set, making the method dependent on the coverage of degradation patterns in the training data.
- The implementation is based on SDv1.5; adaptation to newer foundation models is not explored.
- The \(\gamma\) parameter (controlling HF region size) requires manual specification.
- Robustness to extreme degradation (e.g., near-total darkness in low-light) is not thoroughly validated.
- Downstream trainability is verified only on object detection; other tasks such as segmentation remain unexplored.
Related Work & Insights¶
- Text-driven image synthesis: diffusion/autoregressive models such as DALL-E and LDM.
- Layout-driven image synthesis: GLIGEN, LayoutDiffusion, MIGC (multi-instance control), CC-Diff (contextual consistency).
- Degraded scene generation: AeroGen (remote sensing) is a pioneer but is limited by semantic ambiguity and insufficient layout controllability.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The frequency-domain perspective on degraded-scene L2I generation is original.
- Technical Depth: ⭐⭐⭐⭐ — A complete design chain of dual-query + frequency prototypes + instance disentanglement.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Five datasets, multiple detectors, and downstream training validation.
- Practical Value: ⭐⭐⭐⭐ — Data augmentation for degraded scenes addresses a genuine engineering need.