DiffFNO: Diffusion Fourier Neural Operator

Conference: CVPR 2025
arXiv: 2411.09911
Code: None
Area: Image Restoration
Keywords: Super-Resolution, Fourier Neural Operator, Diffusion Models, Arbitrary-Scale, High-Frequency Reconstruction

TL;DR

This paper proposes DiffFNO, which integrates a Weighted Fourier Neural Operator (WFNO) with a diffusion framework for arbitrary-scale super-resolution. It preserves critical high-frequency components through Mode Rebalancing, fuses frequency-domain and spatial-domain features with a Gated Fusion Mechanism, and accelerates inference with an adaptive-step ODE solver. The reported PSNR gain over existing methods is 2-4 dB across multiple benchmarks.

Background & Motivation

Background: Image super-resolution (SR) has evolved from SRCNN to attention-based deep architectures, and recently to diffusion models for high-fidelity generation. Arbitrary-scale SR, which must also work at scale factors unseen during training, has emerged as a new trend. Here neural operators (e.g., SRNO, HiNOTE) learn mappings between function spaces to achieve resolution-agnostic upsampling.

Limitations of Prior Work: Standard FNO improves efficiency via mode truncation, which discards crucial high-frequency components essential for SR. MLPs suffer from spectral bias (favoring low frequencies). Although diffusion models can generate high-fidelity details, they suffer from slow inference. Operator learning methods like SRNO, originating from physical simulations, suffer from insufficient high-frequency recovery when applied directly to real-world images.

Key Challenge: FNO's efficient global modeling conflicts with SR's heavy reliance on high-frequency detail. Mode truncation buys computational efficiency at the cost of high-frequency information, while diffusion models can recover high frequencies but are computationally expensive.

Goal: To design an arbitrary-scale SR method that preserves high-frequency information while maintaining efficient inference.

Key Insight: Instead of truncating Fourier modes, all modes are rebalanced using a learnable weight function to reinforce high-frequency components. Concurrently, a diffusion process is utilized for iterative refinement, accelerated by an adaptive-step ODE solver.

Core Idea: The WFNO retains all Fourier modes and enhances high frequencies using a frequency-dependent learnable weight \(\mathbf{w}(\boldsymbol{\xi}) = 1 + \gamma \cdot \|\boldsymbol{\xi}\|^\alpha\). These features are then fused with an attention operator and iteratively refined within a diffusion framework.

Method

Overall Architecture

The input LR image is processed through a CNN encoder (EDSR-baseline/RDN) to extract features. The WFNO captures global dependencies in the frequency domain and enhances high frequencies, while the AttnNO captures local details in the spatial domain using Galerkin attention. A Gated Fusion Mechanism (GFM) adaptively combines features from both pathways. After projecting these features into the RGB space, reverse diffusion is executed efficiently utilizing an Adaptive-Step (ATS) ODE solver to produce the HR output.

Key Designs

Weighted Fourier Neural Operator (WFNO) and Mode Rebalancing:
- Function: Adaptively enhances high-frequency components while retaining all Fourier modes.
- Mechanism: Introduces a learnable weight function \(\mathbf{w}_l(\boldsymbol{\xi}) = 1 + \gamma_l \cdot \|\boldsymbol{\xi}\|^\alpha\) (\(\alpha = 0.7\)) into the frequency-domain convolution of standard FNO, formulated as the updated integral operator: \(\mathcal{K}_l \mathbf{v}_l = \mathcal{F}^{-1}(\mathbf{w}_l \cdot \mathbf{P}_l \cdot \mathcal{F}[\mathbf{v}_l])\). For \(\alpha > 0\), higher frequencies receive larger weights, and \(\gamma_l\) is learned independently per layer to control the intensity of the enhancement.
- Design Motivation: Standard FNO truncates high-frequency modes to reduce computation, but the SR task inherently requires high frequencies. Mode rebalancing keeps all modes and allows the network to learn the high-frequency enhancement strategy, offering greater flexibility than manual truncation.
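
The weighted spectral convolution described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of integer frequency indices for \(\|\boldsymbol{\xi}\|\), the real-FFT layout, and the shape of \(\mathbf{P}_l\) are all hypothetical choices, and \(\gamma_l\) is passed in as a plain scalar rather than learned.

```python
import numpy as np

def mode_weights(height, width, gamma, alpha=0.7):
    """w(xi) = 1 + gamma * ||xi||^alpha on the half-spectrum grid used by rfft2."""
    # Assumption: ||xi|| is measured in integer frequency indices.
    fy = np.fft.fftfreq(height) * height      # 0, 1, ..., -1
    fx = np.fft.rfftfreq(width) * width       # 0, 1, ..., width // 2
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return 1.0 + gamma * radius ** alpha

def weighted_spectral_conv(v, P, gamma, alpha=0.7):
    """K v = F^{-1}(w * P * F[v]) for a real feature map v of shape (C_in, H, W).

    P is a complex array of shape (C_out, C_in, H, W // 2 + 1): one
    channel-mixing matrix per retained Fourier mode (no truncation).
    """
    _, height, width = v.shape
    v_hat = np.fft.rfft2(v, axes=(-2, -1))               # (C_in, H, W//2+1)
    mixed = np.einsum("oihw,ihw->ohw", P, v_hat)         # (C_out, H, W//2+1)
    w = mode_weights(height, width, gamma, alpha)        # (H, W//2+1)
    return np.fft.irfft2(w * mixed, s=(height, width), axes=(-2, -1))
```

With `gamma = 0` and an identity `P`, the layer reduces to the identity map. For `gamma > 0` the zero-frequency mode keeps weight 1 while modes with larger \(\|\boldsymbol{\xi}\|\) are amplified, which is the rebalancing behaviour the text attributes to \(\alpha > 0\).
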
Gated Fusion Mechanism (GFM):
- Function: Adaptively fuses the global frequency-domain features of WFNO and the local spatial-domain features of AttnNO.
- Mechanism: The WFNO and AttnNO feature maps are concatenated along the channel dimension and passed through a \(1 \times 1\) convolution followed by a sigmoid to generate a spatial gate map \(\mathbf{G} \in \mathbb{R}^{B \times H \times W \times 1}\). The fused output is computed as: \(\mathbf{v}_{fused} = \mathbf{G} \odot \mathbf{v}_{WFNO} + (1 - \mathbf{G}) \odot \mathbf{v}_{AttnNO}\). AttnNO runs in parallel with WFNO, sharing the same encoder and employing the Galerkin attention mechanism.
- Design Motivation: WFNO excels at capturing global structure and long-range dependencies, while AttnNO excels at local textures and fine-grained details. Gated fusion dynamically balances their contributions at each spatial location.
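
The gating formula above can be sketched in a few lines of NumPy. The function and parameter names are hypothetical, and the \(1 \times 1\) convolution to a single channel is written as a per-pixel dot product with a weight vector of length \(2C\), which is an equivalent formulation assumed here for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v_wfno, v_attn, weight, bias=0.0):
    """Fuse two feature maps of shape (B, H, W, C) with a spatial gate.

    weight has shape (2C,): a 1x1 convolution mapping the concatenated
    channels to one gate logit per spatial location (illustrative only).
    """
    stacked = np.concatenate([v_wfno, v_attn], axis=-1)   # (B, H, W, 2C)
    logits = stacked @ weight + bias                      # (B, H, W)
    gate = sigmoid(logits)[..., None]                     # (B, H, W, 1)
    fused = gate * v_wfno + (1.0 - gate) * v_attn         # broadcast over C
    return fused, gate
```

Because the gate has a single channel, every channel at a given pixel uses the same mixing ratio. With all-zero weights the gate is 0.5 everywhere and the output is the plain average of the two pathways.
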
Adaptive-Step (ATS) ODE Solver:
- Function: Accelerates the reverse sampling process of the diffusion model.
- Mechanism: Reformulates the stochastic reverse diffusion as a deterministic ODE solved via a numerical solver with adaptive steps. The step size is dynamically adjusted based on the complexity of image regions: larger steps for simpler regions and smaller steps for complex ones, thereby reducing computational overhead while retaining quality.
- Design Motivation: Fixed-step ODE solvers apply the same number of steps to all regions, leading to wasted computational resources. The adaptive-step approach significantly cuts down inference time without sacrificing generation quality.
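
To make the idea of adaptive stepping concrete, here is a generic Heun-Euler integrator with error-based step-size control. It is a stand-in, not the paper's ATS solver: it adapts one global step size from a local error estimate rather than from per-region image complexity, and `f` is a hypothetical placeholder for the drift of the deterministic reverse-time ODE.

```python
import numpy as np

def adaptive_heun(f, x, t_start, t_end, rtol=1e-3, atol=1e-3,
                  h_init=0.1, max_steps=10_000):
    """Integrate dx/dt = f(x, t) from t_start to t_end (either direction).

    Returns the final state and the number of function evaluations.
    """
    t = t_start
    direction = np.sign(t_end - t_start)
    h = direction * abs(h_init)
    n_evals = 0
    for _ in range(max_steps):
        if direction * (t_end - t) <= 1e-12:
            break
        if direction * (t + h - t_end) > 0:      # do not step past t_end
            h = t_end - t
        k1 = f(x, t)
        x_euler = x + h * k1                     # first-order estimate
        k2 = f(x_euler, t + h)
        x_heun = x + 0.5 * h * (k1 + k2)         # second-order estimate
        n_evals += 2
        # Scaled RMS difference between the two estimates.
        scale = atol + rtol * np.maximum(np.abs(x), np.abs(x_heun))
        err = np.sqrt(np.mean(((x_heun - x_euler) / scale) ** 2))
        if err <= 1.0:                           # accept the step
            x, t = x_heun, t + h
        # Grow the step when the error is small, shrink it when large.
        factor = 0.9 * (1.0 / max(err, 1e-10)) ** 0.5
        h = h * min(5.0, max(0.2, factor))
    return x, n_evals
```

A fixed-step solver would spend the same number of evaluations regardless of how smooth the trajectory is. Here smooth stretches are crossed in a few large steps and stiff stretches in many small ones, which is the trade-off the design motivation describes.
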

Loss & Training

The score matching loss is adopted: \(\mathcal{L}(\theta) = \mathbb{E}_{t, \mathbf{x}_0}[\|s_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t|\mathbf{x}_0)\|_2^2]\). The forward diffusion process utilizes a modified variance-preserving SDE, where the drift term \(-\frac{1}{2}\beta(t)(\mathbf{x} - \mathbf{Dx})\) models the high-frequency degradation caused by downsampling. The noise schedule grows linearly with \(\beta_{min}=0.1\) and \(\beta_{max}=20\).
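
The linear schedule and the score-matching target can be illustrated with the standard variance-preserving SDE, whose perturbation kernel has a closed form. This sketch deliberately uses the unmodified drift \(-\frac{1}{2}\beta(t)\mathbf{x}\); the modified drift involving \(\mathbf{Dx}\) is not reproduced, and all function names are hypothetical.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t):
    """Linear noise schedule on t in [0, 1]."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def integrated_beta(t):
    """Closed form of the integral of beta(s) for s in [0, t]."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2

def vp_kernel(x0, t):
    """Mean and std of p_t(x_t | x_0) for the standard VP-SDE."""
    log_mean_coeff = -0.5 * integrated_beta(t)
    mean = np.exp(log_mean_coeff) * x0
    std = np.sqrt(1.0 - np.exp(2.0 * log_mean_coeff))
    return mean, std

def dsm_loss(score_fn, x0, t, rng):
    """Monte Carlo estimate of the score matching loss at a single time t."""
    mean, std = vp_kernel(x0, t)
    noise = rng.standard_normal(x0.shape)
    x_t = mean + std * noise
    # grad log p_t(x_t | x_0) = -(x_t - mean) / std^2 = -noise / std
    target = -noise / std
    return np.mean(np.sum((score_fn(x_t, t) - target) ** 2, axis=-1))
```

With \(\beta_{min}=0.1\) and \(\beta_{max}=20\), the kernel at \(t=1\) has a mean coefficient of \(e^{-5.025}\), so \(\mathbf{x}_1\) is close to pure Gaussian noise. In training, \(t\) would be sampled rather than fixed, matching the expectation over \(t\) in the loss.
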

Key Experimental Results

Main Results

Method	4x SR PSNR	Inference Time	Arbitrary Scale
SRNO	Baseline	Slower	Supported
HiNOTE	Medium	Slower	Supported
DiffFNO	+2-4 dB	Faster	Supported
EDSR	Lower	Fast	Fixed Scale

Ablation Study

Configuration	Key Metrics	Description
FNO (Standard Truncation)	Lower PSNR	Loss of high frequencies
WFNO (Mode Rebalancing)	Significant Improvement	High-frequency preservation is critical
Without AttnNO	PSNR Decreases	Loss of local details
Naive Concat instead of GFM	PSNR Decreases	Adaptive fusion outperforms simple concatenation
Fixed-step ODE	Slower Inference	ATS is more efficient under equal quality

Key Findings

Mode rebalancing is central to performance gains—a simple frequency-dependent weight function substantially improves high-frequency reconstruction.
The spatially adaptive gating of GFM is more effective than naive concatenation, as different spatial locations have varying demands for global and local features.
DiffFNO excels not only at scale factors within the training distribution but also remains robust on unseen scale factors.
The ATS ODE solver significantly reduces inference time while keeping output quality intact.

Highlights & Insights

Parsimonious Form of Frequency Rebalancing: The weight function \(1 + \gamma \cdot \|\boldsymbol{\xi}\|^\alpha\) is extremely simple. It adds only one learnable scalar \(\gamma_l\) per layer (with \(\alpha\) set as a hyperparameter), yet it addresses FNO's high-frequency loss. This "minimum intervention, maximum yield" design is highly instructive.
Synergy of Diffusion Framework and Neural Operators: WFNO provides resolution-agnostic upsampling while the diffusion process delivers iterative refinement—the two are naturally complementary. The former ensures "correct structure" whereas the latter ensures "realistic details".
Dual-Pathway Design with Shared Encoder: WFNO and AttnNO share one encoder but run in parallel in different domains (frequency vs. spatial), so the encoder features are computed once and reused by both pathways.

Limitations & Future Work

Even with ATS acceleration, the inference time of the diffusion process is still longer than single forward-pass methods (e.g., EDSR).
The hyperparameter \(\alpha\) in mode rebalancing may require tuning for different datasets or tasks.
Experiments are primarily validated on standard SR benchmarks; real-world degradation scenarios (blur + noise + compression) have not yet been explored.
Comparison with recent Transformer-based SR methods (e.g., SwinIR, HAT) is limited.

Comparison with Related Work

vs SRNO: SRNO employs standard FNO + Galerkin attention but lacks frequency rebalancing and diffusion refinement. Building on this design, DiffFNO adds WFNO and a diffusion process, for a reported 2-4 dB PSNR improvement.
vs HiNOTE: HiNOTE possesses its own encoder, but its high-frequency reconstruction remains insufficient. DiffFNO directly enhances high frequencies from the frequency domain via mode rebalancing.
vs SRDiff/SR3: Pure diffusion SR methods suffer from extremely slow inference and may not support arbitrary scales. DiffFNO integrates the resolution-independence of operator learning with the high fidelity of diffusion.

Rating

Novelty: ⭐⭐⭐⭐ The mode rebalancing of WFNO and its integration with the diffusion framework are novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-scale evaluation, comprehensive ablation, and comparison with various baselines.
Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations, intuitive architecture diagrams, and detailed method descriptions.
Value: ⭐⭐⭐⭐ Introduces Fourier mode rebalancing to diffusion SR for the first time and establishes a new standard for arbitrary-scale SR.