DiffFNO: Diffusion Fourier Neural Operator

Conference: CVPR 2025
arXiv: 2411.09911
Code: None
Area: Image Restoration
Keywords: Super-Resolution, Fourier Neural Operator, Diffusion Models, Arbitrary-Scale, High-Frequency Reconstruction

TL;DR

This paper proposes DiffFNO, which integrates the Weighted Fourier Neural Operator (WFNO) with a diffusion framework for arbitrary-scale super-resolution. It preserves critical high-frequency components through Mode Rebalancing, fuses frequency-domain and spatial-domain features using a Gated Fusion Mechanism, and accelerates inference with an adaptive-step ODE solver, outperforming existing methods by 2-4 dB in PSNR across multiple benchmarks.

Background & Motivation

Background: Image super-resolution (SR) has evolved from SRCNN to attention-based deep architectures, and recently to diffusion models for high-fidelity generation. Arbitrary-scale SR, which must also work at scale factors not seen during training, has emerged as a new trend, where neural operators (e.g., SRNO, HiNOTE) use function-space mappings to achieve resolution-agnostic upsampling.

Limitations of Prior Work: Standard FNO improves efficiency via mode truncation, which discards crucial high-frequency components essential for SR. MLPs suffer from spectral bias (favoring low frequencies). Although diffusion models can generate high-fidelity details, they suffer from slow inference. Operator learning methods like SRNO, originating from physical simulations, suffer from insufficient high-frequency recovery when applied directly to real-world images.

Key Challenge: A conflict exists between the efficient global modeling of FNO and the heavy reliance of SR on high-frequency details—mode truncation enhances computational efficiency at the cost of high-frequency information, while diffusion models can recover high frequencies but are computationally expensive.

Goal: To design an arbitrary-scale SR method that preserves high-frequency information while maintaining efficient inference.

Key Insight: Instead of truncating Fourier modes, all modes are rebalanced using a learnable weight function to reinforce high-frequency components. Concurrently, a diffusion process is utilized for iterative refinement, accelerated by an adaptive-step ODE solver.

Core Idea: The WFNO retains all Fourier modes and enhances high frequencies using a frequency-dependent learnable weight \(\mathbf{w}(\boldsymbol{\xi}) = 1 + \gamma \cdot \|\boldsymbol{\xi}\|^\alpha\). These features are then fused with an attention operator and iteratively refined within a diffusion framework.
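The weight function above can be illustrated with a minimal NumPy sketch. It applies \(1 + \gamma \cdot \|\boldsymbol{\xi}\|^\alpha\) to the full 2D spectrum of a single-channel field. The function names, the value `gamma=0.5`, and the use of normalized `fftfreq` coordinates for \(\boldsymbol{\xi}\) are assumptions made for illustration. The learned spectral multiplier of the FNO layer is omitted, so this is not the paper's implementation.

```python
import numpy as np

def rebalance_weights(height, width, gamma=0.5, alpha=0.7):
    """w(xi) = 1 + gamma * ||xi||^alpha on the 2D FFT frequency grid.

    gamma is a hypothetical value; in the method it is learned per layer.
    """
    fy = np.fft.fftfreq(height)[:, None]   # cycles per sample, shape (H, 1)
    fx = np.fft.fftfreq(width)[None, :]    # shape (1, W)
    radius = np.sqrt(fx**2 + fy**2)        # ||xi||, zero at the DC term
    return 1.0 + gamma * radius**alpha

def apply_rebalancing(field, gamma=0.5, alpha=0.7):
    """Reweight every Fourier mode of a real 2D field; no mode is truncated."""
    spectrum = np.fft.fft2(field)
    weights = rebalance_weights(*field.shape, gamma=gamma, alpha=alpha)
    # The weights are symmetric in frequency, so the result stays real
    # up to rounding error.
    return np.fft.ifft2(weights * spectrum).real
```

With `gamma=0` the operation reduces to the identity. For any `gamma > 0` the DC term keeps weight 1 while higher frequencies are amplified, which is the behaviour the Core Idea describes.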

Method

Overall Architecture

The input LR image is processed through a CNN encoder (EDSR-baseline/RDN) to extract features. The WFNO captures global dependencies in the frequency domain and enhances high frequencies, while the AttnNO captures local details in the spatial domain using Galerkin attention. A Gated Fusion Mechanism (GFM) adaptively combines features from both pathways. After projecting these features into the RGB space, reverse diffusion is executed efficiently utilizing an Adaptive-Step (ATS) ODE solver to produce the HR output.
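As a rough illustration of how a gated fusion step could combine the two pathways, the sketch below concatenates two feature maps along the channel axis, applies a \(1 \times 1\) convolution (a per-pixel linear map) followed by a sigmoid, and blends the inputs with the resulting gate. The function name, the `conv_weight` and `conv_bias` parameters, and the channels-last layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def gated_fusion(v_wfno, v_attn, conv_weight, conv_bias=0.0):
    """Fuse two (B, H, W, C) feature maps with a per-pixel sigmoid gate.

    conv_weight has shape (2C, 1): a 1x1 convolution over the concatenated
    channels is the same as a matrix product at every spatial location.
    """
    stacked = np.concatenate([v_wfno, v_attn], axis=-1)   # (B, H, W, 2C)
    logits = stacked @ conv_weight + conv_bias            # (B, H, W, 1)
    gate = 1.0 / (1.0 + np.exp(-logits))                  # values in (0, 1)
    fused = gate * v_wfno + (1.0 - gate) * v_attn         # gate broadcasts over C
    return fused, gate
```

A gate near 1 passes the frequency-domain features through at that pixel, and a gate near 0 passes the spatial-domain features. With all-zero weights the gate is 0.5 everywhere and the fusion is a plain average.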

Key Designs

  1. Weighted Fourier Neural Operator (WFNO) and Mode Rebalancing:

    • Function: Adaptively enhances high-frequency components while retaining all Fourier modes.
    • Mechanism: Introduces a learnable weight function \(\mathbf{w}_l(\boldsymbol{\xi}) = 1 + \gamma_l \cdot \|\boldsymbol{\xi}\|^\alpha\) (\(\alpha = 0.7\)) into the frequency-domain convolution of standard FNO, formulated as the updated integral operator: \(\mathcal{K}_l \mathbf{v}_l = \mathcal{F}^{-1}(\mathbf{w}_l \cdot \mathbf{P}_l \cdot \mathcal{F}[\mathbf{v}_l])\). For \(\alpha > 0\), higher frequencies receive larger weights, and \(\gamma_l\) is learned independently per layer to control the intensity of the enhancement.
    • Design Motivation: Standard FNO truncates high-frequency modes to reduce computation, but the SR task inherently requires high frequencies. Mode rebalancing keeps all modes and allows the network to learn the high-frequency enhancement strategy, offering greater flexibility than manual truncation.
  2. Gated Fusion Mechanism (GFM):

    • Function: Adaptively fuses the global frequency-domain features of WFNO and the local spatial-domain features of AttnNO.
    • Mechanism: Channels of both WFNO and AttnNO feature maps are concatenated and passed through a \(1 \times 1\) convolution + sigmoid to generate a spatial gate map \(\mathbf{G} \in \mathbb{R}^{B \times H \times W \times 1}\). The fused output is computed as: \(\mathbf{v}_{fused} = \mathbf{G} \odot \mathbf{v}_{WFNO} + (1 - \mathbf{G}) \odot \mathbf{v}_{AttnNO}\). AttnNO runs in parallel with WFNO, sharing the same encoder and employing the Galerkin attention mechanism.
    • Design Motivation: WFNO excels at capturing global structure and long-range dependencies, while AttnNO excels at local textures and fine-grained details. Gated fusion dynamically balances their contributions at each spatial location.
  3. Adaptive-Step (ATS) ODE Solver:

    • Function: Accelerates the reverse sampling process of the diffusion model.
    • Mechanism: Reformulates the stochastic reverse diffusion as a deterministic ODE solved via a numerical solver with adaptive steps. The step size is dynamically adjusted based on the complexity of image regions: larger steps for simpler regions and smaller steps for complex ones, thereby reducing computational overhead while retaining quality.
    • Design Motivation: Fixed-step ODE solvers apply the same number of steps to all regions, leading to wasted computational resources. The adaptive-step approach significantly cuts down inference time without sacrificing generation quality.
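The general idea of adaptive step sizing can be sketched with a generic embedded Euler/Heun pair on a toy ODE. This is a stand-in technique: the step is controlled by a standard local error estimate over the whole state, whereas the summary above describes step sizes driven by image-region complexity, and the exact ATS criterion is not given here. The function name and all default values are assumptions made for illustration.

```python
import numpy as np

def adaptive_heun(f, x0, t0, t1, tol=1e-4, h0=0.1, h_min=1e-6):
    """Integrate dx/dt = f(t, x) from t0 to t1 with error-controlled steps.

    Returns the final state and the number of evaluations of f.
    """
    t, x, h = t0, np.asarray(x0, dtype=float), h0
    n_evals = 0
    while t < t1:
        h = min(h, t1 - t)                   # never step past t1
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        n_evals += 2
        x_euler = x + h * k1                 # first-order estimate
        x_heun = x + 0.5 * h * (k1 + k2)     # second-order estimate
        err = np.max(np.abs(x_heun - x_euler))
        if err <= tol or h <= h_min:
            t, x = t + h, x_heun             # accept the step
        # Grow the step when the estimated error is small, shrink it when large.
        scale = 0.9 * np.sqrt(tol / (err + 1e-12))
        h = max(h_min, h * min(2.0, max(0.2, scale)))
    return x, n_evals
```

On smooth stretches of the trajectory the error estimate is small and the solver takes large steps. Where the dynamics change quickly it shortens them, so the evaluation budget is spent where it matters.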

Loss & Training

The score matching loss is adopted: \(\mathcal{L}(\theta) = \mathbb{E}_{t, \mathbf{x}_0}[\|s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t|\mathbf{x}_0)\|_2^2]\). The forward diffusion process utilizes a modified variance-preserving SDE, where the drift term \(-\frac{1}{2}\beta(t)(\mathbf{x} - \mathbf{Dx})\) models the high-frequency degradation caused by downsampling. The noise schedule \(\beta(t)\) grows linearly from \(\beta_{min}=0.1\) to \(\beta_{max}=20\).
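To make the schedule and the regression target concrete, the sketch below uses the standard variance-preserving SDE with drift \(-\frac{1}{2}\beta(t)\mathbf{x}\), for which the perturbation kernel is Gaussian in closed form. It does not model the modified drift with \(\mathbf{D}\) described above. The assumption that \(t\) runs over \([0, 1]\) and all function names are illustrative.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t):
    """Linear noise schedule on t in [0, 1] (time range is an assumption)."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def integrated_beta(t):
    """Closed form of the integral of beta(s) for s from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2

def vp_marginal(x0, t, rng):
    """Sample x_t from the standard VP-SDE kernel and return the score target.

    p_t(x_t | x0) = N(mean_coef * x0, std^2 I), so the conditional score is
    -(x_t - mean_coef * x0) / std^2 = -noise / std.
    """
    mean_coef = np.exp(-0.5 * integrated_beta(t))
    std = np.sqrt(1.0 - np.exp(-integrated_beta(t)))
    noise = rng.standard_normal(x0.shape)
    x_t = mean_coef * x0 + std * noise
    score_target = -noise / std
    return x_t, score_target
```

A score network \(s_\theta(\mathbf{x}_t, t)\) would then be trained to regress `score_target` under a squared error, matching the form of the loss above.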

Key Experimental Results

Main Results

| Method | 4x SR PSNR | Inference Time | Arbitrary Scale |
|---|---|---|---|
| SRNO | Baseline | Slower | Supported |
| HiNOTE | Medium | Slower | Supported |
| DiffFNO | +2-4 dB | Faster | Supported |
| EDSR | Lower | Fast | Fixed Scale |

Ablation Study

| Configuration | Key Metrics | Description |
|---|---|---|
| FNO (Standard Truncation) | Lower PSNR | Loss of high frequencies |
| WFNO (Mode Rebalancing) | Significant Improvement | High-frequency preservation is critical |
| Without AttnNO | PSNR Decreases | Loss of local details |
| Naive Concat instead of GFM | PSNR Decreases | Adaptive fusion outperforms simple concatenation |
| Fixed-step ODE | Slower Inference | ATS is more efficient under equal quality |

Key Findings

  • Mode rebalancing is central to performance gains—a simple frequency-dependent weight function substantially improves high-frequency reconstruction.
  • The spatially adaptive gating of GFM is more effective than naive concatenation, as different spatial locations have varying demands for global and local features.
  • DiffFNO excels not only at scale factors within the training distribution but also remains robust on unseen scale factors.
  • The ATS ODE solver significantly reduces inference time while keeping output quality intact.

Highlights & Insights

  • Parsimonious Form of Frequency Rebalancing: The weight function \(1 + \gamma \cdot \|\xi\|^\alpha\) is extremely simple, adding only one learnable scalar \(\gamma_l\) per layer (with \(\alpha\) set as a hyperparameter), yet it successfully tackles FNO's high-frequency loss issue. This "minimum intervention, maximum yield" design is highly instructive.
  • Synergy of Diffusion Framework and Neural Operators: WFNO provides resolution-agnostic upsampling while the diffusion process delivers iterative refinement—the two are naturally complementary. The former ensures "correct structure" whereas the latter ensures "realistic details".
  • Dual-Pathway Design with Shared Encoder: WFNO and AttnNO share an encoder but run in parallel in different domains (frequency vs. spatial domain), utilizing computational resources in a highly efficient manner.

Limitations & Future Work

  • Even with ATS acceleration, the inference time of the diffusion process is still longer than single forward-pass methods (e.g., EDSR).
  • The hyperparameter \(\alpha\) in mode rebalancing may require tuning for different datasets or tasks.
  • Experiments are primarily validated on standard SR benchmarks; real-world degradation scenarios (blur + noise + compression) have not yet been explored.
  • Comparison with recent Transformer-based SR methods (e.g., SwinIR, HAT) is relatively incomplete.

Related Work & Insights

  • vs SRNO: SRNO employs standard FNO + Galerkin attention but lacks frequency rebalancing and diffusion refinement. Building on this design, DiffFNO introduces WFNO and a diffusion process, achieving a 2-4 dB PSNR improvement.
  • vs HiNOTE: HiNOTE possesses its own encoder, but its high-frequency reconstruction remains insufficient. DiffFNO directly enhances high frequencies from the frequency domain via mode rebalancing.
  • vs SRDiff/SR3: Pure diffusion SR methods suffer from extremely slow inference and may not support arbitrary scales. DiffFNO integrates the resolution-independence of operator learning with the high fidelity of diffusion.

Rating

  • Novelty: ⭐⭐⭐⭐ The mode rebalancing of WFNO and its integration with the diffusion framework are novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-scale evaluation, comprehensive ablation, and comparison with various baselines.
  • Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations, intuitive architecture diagrams, and detailed method descriptions.
  • Value: ⭐⭐⭐⭐ Introducing Fourier mode rebalancing to diffusion SR for the first time, establishing a new standard for arbitrary-scale SR.