RDNet: Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network in Optical Remote Sensing Images¶
Conference: CVPR 2026 arXiv: 2603.12215 Code: None Area: Image Segmentation Keywords: Salient Object Detection, Remote Sensing Images, Dynamic Convolutional Kernel Selection, Wavelet Transform, Region Proportion Awareness
TL;DR¶
This paper proposes RDNet, which employs a region proportion-aware Proportion Guidance block to estimate the area ratio of salient objects and dynamically selects combinations of 3/4/5 convolutional kernels of varying sizes for detail extraction. Combined with wavelet-domain frequency-matched context enhancement (reducing computation to 1/4) and a cross-attention localization module, RDNet comprehensively outperforms 21 state-of-the-art methods on three optical remote sensing SOD benchmarks: EORSSD, ORSSD, and ORSI-4199.
Background & Motivation¶
Background: Optical Remote Sensing Image Salient Object Detection (ORSI-SOD) has increasingly relied on CNN/Transformer-based multi-level feature extraction and fusion pipelines, achieving continuous performance gains on standard benchmarks.
Limitations of Prior Work:
- Object scales in remote sensing images vary drastically (from aircraft spanning only a few pixels to stadiums occupying half the image). Existing methods apply fixed convolutional kernel combinations — large kernels introduce excessive background noise for small objects, while small kernels fail to capture the complete extent of large objects.
- Full-resolution self-attention for cross-layer feature interaction incurs high computational cost, and directly mixing high- and low-frequency information dilutes object features.
- CNN backbones lack the capacity for global context modeling and long-range dependency capture.
Key Challenge: Object scale is inherently uncertain, yet feature extraction strategies remain static — without knowing "how large the object is," it is impossible to select "how wide a perspective to adopt."
Goal: Adaptively select appropriate feature extraction strategies according to the proportion of the image area occupied by the salient object, while performing multi-level feature interaction efficiently.
Key Insight: Estimate the object region proportion from high-level features, then use this estimate to guide dynamic kernel selection in low-level feature extraction; perform mid-level feature interaction via frequency decomposition in the wavelet domain for dimensionality reduction.
Core Idea: Determine approximately how large the object is before deciding how to perceive it — region proportion-guided dynamic convolutional kernel selection.
Method¶
Overall Architecture¶
Given an input of \(4 \times 3 \times 384 \times 384\), a Swin Transformer backbone extracts five levels of features. High-level features \(F_4\) and \(F_5\) are fed into the RPL module for object localization and region proportion estimation; low-level feature \(F_1\) is processed by the DAD module, which dynamically selects convolutional kernels guided by the estimated proportion for detail extraction; mid-level features \(F_2\) and \(F_3\) are enhanced by the FCE module through wavelet-domain frequency-matched context enhancement. The outputs of the three modules are fused in a bottom-up manner to generate the final saliency map.
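The flow above can be summarized in a functional sketch. Every callable here (`backbone`, `rpl`, `pg`, `dad`, `fce`, `fuse`) is a placeholder standing in for the components the paper describes, not their implementations:

```python
def rdnet_forward(x, backbone, rpl, pg, dad, fce, fuse):
    """Structural sketch of RDNet's forward pass; the module handles are
    assumed placeholders for the components described in Key Designs."""
    f1, f2, f3, f4, f5 = backbone(x)           # five Swin Transformer stages
    loc = rpl(f4, f5)                          # RPL: high-level localization
    ratio = pg(f5)                             # PG: per-sample area proportion
    detail = dad(f1, ratio)                    # DAD: proportion-guided details
    context = fce(f2, f3)                      # FCE: wavelet-domain interaction
    return fuse(detail, context, loc), ratio   # bottom-up fusion -> saliency
```

Note that the proportion estimated from the deepest feature \(F_5\) feeds back into how the shallowest feature \(F_1\) is processed, which is the central coupling of the architecture.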
Key Designs¶
- RPL (Region Proportion-aware Localization Module)
  - Function: Localizes salient objects and estimates their area proportion from high-level semantic features.
  - Mechanism: Applies alternating channel attention (GAP → two 1×1 Conv layers → Sigmoid) and spatial attention (Max Pooling → Sigmoid) to \(F_4\) and \(F_5\), then concatenates the results and applies a 3×3 convolution to obtain localization features.
  - PG (Proportion Guidance) block: Applies GAP followed by two fully connected layers to \(F_5\), outputting a per-sample object region proportion supervised by the ground-truth proportion via an MSE loss.
  - Design Motivation: Knowing "how large the object is" in advance enables the subsequent DAD module to select appropriate convolutional kernels.
- DAD (Dynamic Adaptive Detail-perception Module)
  - Function: Dynamically selects the number and sizes of convolutional kernels for object detail extraction based on the region proportion output by PG.
  - Mechanism: The region proportion is discretized into three ranges: \(<25\%\) uses 3 kernel sizes (\(1\times1\), \(3\times3\), \(5\times5\)); \(25\%\)–\(50\%\) uses 4 (adding \(7\times7\)); \(>50\%\) uses 5 (adding \(9\times9\)). A dual-branch design is adopted: the lower branch extracts details by summing the multi-scale convolutions, while the upper branch applies spatial attention weighting to suppress noise.
  - Design Motivation: Small objects do not require large receptive fields (which introduce background noise), while large objects need them to capture complete regions; proportion-guided selection breaks the one-size-fits-all paradigm.
- FCE (Frequency-matched Context Enhancement Module)
  - Function: Performs efficient cross-layer interaction among mid-level features, avoiding both the high computational cost of full-resolution self-attention and the interference caused by mixing high- and low-frequency information.
  - Mechanism: DWT decomposes features into four frequency sub-bands (LL/LH/HL/HH) → attention-based interaction is performed between corresponding sub-bands → IDWT reconstructs the features → the result is concatenated with the original features → channel and spatial attention further suppress noise.
  - Design Motivation: Performing interaction in the frequency domain halves the spatial resolution in each dimension, reducing computation to 1/4 of full-resolution attention while preventing high- and low-frequency information from interfering with each other.
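The proportion-to-kernel mapping and the lower (detail) branch of DAD can be sketched as follows. The `conv_bank` dictionary and the exact boundary handling at the 25%/50% thresholds are assumptions here, and the upper spatial-attention branch is omitted:

```python
def select_kernel_sizes(ratio):
    """Map the PG-estimated object area proportion to a kernel set, using
    the paper's 25% / 50% thresholds (boundary handling is an assumption)."""
    if ratio < 0.25:
        return (1, 3, 5)        # small objects: small receptive fields only
    if ratio <= 0.50:
        return (1, 3, 5, 7)     # medium objects: add 7x7
    return (1, 3, 5, 7, 9)      # large objects: add 9x9

def dad_detail_branch(feature, ratio, conv_bank):
    """Lower branch of DAD as a sketch: sum the outputs of the selected
    multi-scale convolutions. `conv_bank` maps kernel size -> convolution
    callable and stands in for the module's learned convolutions."""
    sizes = select_kernel_sizes(ratio)
    out = conv_bank[sizes[0]](feature)
    for k in sizes[1:]:
        out = out + conv_bank[k](feature)
    return out
```

Because the branching happens per sample, objects of different scales in the same batch can be processed with different receptive-field combinations.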
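The DWT decomposition at the heart of FCE can be illustrated with a single-level 2-D Haar transform (a minimal numpy sketch; the summary does not specify the wavelet basis, so Haar is an assumption):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT of an (H, W) array into four half-resolution
    sub-bands: LL (approximation), LH, HL, HH (horizontal/vertical/diagonal
    details). Orthonormal scaling (divide by 2) is used."""
    a = x[0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]   # top-right
    c = x[1::2, 0::2]   # bottom-left
    d = x[1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh
```

Each sub-band is \(H/2 \times W/2\), so attention over one sub-band costs \((HW/4)^2\), i.e. 1/16 of full-resolution attention; interacting the four matched sub-band pairs therefore costs \(4/16 = 1/4\) in total, consistent with the paper's 1/4 figure.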
Loss & Training¶
- Total loss: \(\mathcal{L}_\text{total} = \mathcal{L}_\text{BCE} + \mathcal{L}_\text{IoU} + \mathcal{L}_{F\text{-measure}} + \mathcal{L}_\text{MSE}\), with equal weights.
- The first three terms supervise saliency map prediction (pixel-level BCE + region-level IoU + precision-recall-balanced F-measure).
- MSE supervises region proportion prediction.
- Optimizer: RMSprop; learning rate: \(1\text{e}{-5}\); batch size: 4; input resolution: \(384\times384\).
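The equal-weight objective can be sketched in numpy. The soft IoU and F-measure forms below, and \(\beta^2 = 0.3\), are common SOD formulations assumed here, since the summary does not spell them out:

```python
import numpy as np

def total_loss(pred, gt, ratio_pred, ratio_gt, eps=1e-7):
    """Equal-weight sum of the four loss terms: BCE + IoU + F-measure + MSE.
    pred/gt are saliency probabilities in [0, 1]; ratio_* are the predicted
    and ground-truth object area proportions."""
    p = np.clip(pred, eps, 1 - eps)
    bce = -np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p))
    inter = np.sum(pred * gt)
    iou = 1 - (inter + eps) / (np.sum(pred) + np.sum(gt) - inter + eps)
    # Soft F-measure loss; beta^2 = 0.3 is a common SOD choice (assumption).
    beta2 = 0.3
    prec = (inter + eps) / (np.sum(pred) + eps)
    rec = (inter + eps) / (np.sum(gt) + eps)
    fm = 1 - (1 + beta2) * prec * rec / (beta2 * prec + rec + eps)
    mse = np.mean((ratio_pred - ratio_gt) ** 2)   # PG supervision
    return bce + iou + fm + mse
```

All four terms vanish for a perfect prediction, so the equal weighting needs no per-term normalization.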
Key Experimental Results¶
Main Results¶
| Dataset | Metric | RDNet | GeleNet (Prev. SOTA) | ADSTNet | HFCNet | Gain vs. GeleNet |
|---|---|---|---|---|---|---|
| EORSSD | MAE↓ | 0.0049 | 0.0066 | 0.0065 | 0.0051 | −25.8% |
| EORSSD | Fβ↑ | 0.8563 | 0.8367 | 0.8321 | 0.7845 | +2.3% |
| EORSSD | Eξ↑ | 0.9718 | 0.9678 | 0.9633 | 0.9280 | +0.4% |
| ORSSD | MAE↓ | 0.0066 | 0.0083 | 0.0089 | 0.0073 | −20.5% |
| ORSSD | Fβ↑ | 0.9080 | 0.8879 | 0.8856 | 0.8581 | +2.3% |
| ORSI-4199 | MAE↓ | 0.0254 | 0.0266 | 0.0319 | 0.0270 | −4.5% |
| ORSI-4199 | Fβ↑ | 0.8781 | 0.8711 | 0.8615 | 0.8272 | +0.8% |
RDNet achieves the best performance on all metrics across comparisons with 21 methods. All pairwise t-test p-values are \(<1\text{e}{-10}\), confirming statistical significance.
Ablation Study¶
| Configuration | EORSSD MAE | EORSSD Fβ | Notes |
|---|---|---|---|
| Full RDNet | 0.0049 | 0.8563 | Baseline |
| w/o DAD | 0.0052 | 0.8550 | Dynamic kernel selection is effective |
| w/o FCE | 0.0061 | — | Largest impact; frequency-domain interaction is critical |
| w/o RPL | 0.0054 | — | Localization and proportion estimation are effective |
| No proportion guidance (fixed kernels) | ↓ | ↓ | Dynamic selection outperforms fixed strategy |
| Thresholds [25%, 50%] | Best | Best | Both wider and narrower thresholds degrade performance |
Backbone comparison: Swin Transformer achieves Fβ 0.8563, clearly ahead of ResNet-50 (0.7756) and ViT (0.5762). The model runs at 48.7 GFLOPs and 13 FPS on an RTX 3090.
Key Findings¶
- The FCE module contributes the most; frequency-domain cross-layer interaction is the core driver of performance improvement.
- Region proportion-guided dynamic kernel selection consistently outperforms fixed kernel strategies.
- Swin Transformer's global context modeling capability is critical for remote sensing SOD.
- Failure cases are concentrated on extremely small objects and scenes where object texture is highly similar to the background.
Highlights & Insights¶
- Region proportion → dynamic kernel selection is an intuitive and effective design — deciding "how to look" based on "how large the object is."
- Wavelet-domain frequency-matched interaction reduces computation to 1/4 of full-resolution self-attention while preventing high- and low-frequency information from interfering with each other.
- The PG block directly supervises proportion prediction with MSE loss, providing a well-defined learning objective for dynamic selection rather than relying on pure heuristics.
- MAE is reduced by 4.5%–25.8% over the previous SOTA across three datasets, representing substantial improvements.
Limitations & Future Work¶
- The inference speed of 13 FPS is insufficient for real-time remote sensing detection requirements.
- The three-tier proportion thresholds (25%/50%) are manually defined; end-to-end learnable soft thresholds warrant investigation.
- Failure cases indicate persistent difficulties with extremely small objects and objects with background-similar textures.
- Evaluation is limited to three remote sensing SOD benchmarks without extension to natural image SOD or general segmentation tasks.
Related Work & Insights¶
- vs. GeleNet: Also employs Transformers for remote sensing SOD but adopts a fixed feature extraction strategy. RDNet's core advantage lies in the dynamic kernel selection mechanism that adapts to objects of varying scales.
- vs. ADSTNet: An adaptive dual-stream Transformer; its Fβ of 0.8321 trails RDNet's 0.8563, a gap attributable to RDNet's targeted handling of multi-scale objects.
- vs. HFCNet: The closest competitor in MAE (0.0051 vs. 0.0049), yet the substantial Fβ gap (0.7845 vs. 0.8563) indicates HFCNet recovers object regions less completely.
- Insights: The region proportion-guided paradigm could be transferred to anchor-free detectors for dynamic receptive field adjustment; wavelet-domain feature interaction could be applied to multimodal feature fusion.
Rating¶
- Novelty: ⭐⭐⭐⭐ Region proportion-guided dynamic kernel selection is a genuinely novel contribution, though the overall framework remains within the encoder-decoder + attention paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comparisons with 21 methods, multiple ablation groups, and t-test statistical significance verification are comprehensive.
- Writing Quality: ⭐⭐⭐ Formulas and structure are clear, though some descriptions are redundant.
- Value: ⭐⭐⭐⭐ Practically valuable within the remote sensing SOD subfield; the dynamic kernel selection idea has reasonable generalizability.