PRUE: A Practical Recipe for Field Boundary Segmentation at Scale¶

Conference: CVPR 2026 arXiv: 2603.27101 Code: https://github.com/fieldsoftheworld/ftw-prue Area: Semantic Segmentation / Remote Sensing Keywords: Field boundary segmentation, geospatial foundation models, U-Net, deployment robustness, large-scale mapping

TL;DR¶

This paper systematically evaluates 18 segmentation and geospatial foundation models (GFMs), and proposes PRUE—a field boundary segmentation recipe combining a U-Net backbone, composite loss function, and targeted data augmentation. PRUE achieves 76% IoU and 47% object-F1 on the FTW benchmark, surpassing the baseline by 6% and 9% respectively, while introducing a novel set of metrics for evaluating deployment robustness.

Background & Motivation¶

Background: Large-scale field boundary maps are critical for agricultural monitoring. Deep learning methods—particularly U-Net-based semantic segmentation—have become the dominant approach for extracting field boundaries from satellite imagery.
Limitations of Prior Work: Existing methods are highly sensitive to illumination variation, spatial scale changes, and geographic domain shifts. Deploying top-performing models over large regions introduces tiling artifacts, boundary discontinuities, and other quality degradation issues.
Key Challenge: Conventional evaluation focuses solely on patch-level metrics such as IoU and F1, which fail to capture deployment-level failure modes encountered during large-scale mapping—including translation consistency, input order sensitivity, preprocessing normalization sensitivity, and spatial scale sensitivity.
Goal: To systematically identify the optimal combination of model architecture, loss function, and data augmentation, and to propose a deployment-oriented robustness evaluation framework that enables reliable national-scale field boundary mapping.
Key Insight: The problem is framed as a systematic "bake-off" benchmark, comparing 18 models spanning semantic segmentation, instance segmentation, and GFMs under a unified experimental protocol, with ablations over architecture, loss, and augmentation choices.
Core Idea: Through systematic exploration of the model design space—rather than architectural innovation—PRUE combines U-Net with EfficientNet-B7, log-cosh Dice loss, and channel shuffle with brightness/scale augmentation to jointly optimize accuracy and deployment robustness.

Method¶

Overall Architecture¶

The input consists of bi-temporal RGBN Sentinel-2 imagery (4 channels each for growing and harvest seasons, 8 channels total). The output is a three-class semantic segmentation map (background / field interior / boundary), from which individual field instance polygons are extracted via connected-component post-processing. The core pipeline is: encoder–decoder segmentation → pixel-level classification → connected-component instance extraction → polygonization.

Key Designs¶

U-Net + EfficientNet-B7 Encoder
Function: Serves as the feature extraction backbone, providing multi-scale semantic features.
Mechanism: After systematically comparing FCN, UPerNet, FCSiam, and multiple U-Net variants, the EfficientNet-B7 encoder achieves the best balance between accuracy and parameter efficiency. Compared to the B3 baseline, it increases model capacity while avoiding overfitting through careful selection of complementary components.
Design Motivation: A larger encoder captures richer spatial context, yielding better representation of complex field morphologies—especially irregular smallholder parcels. Its 67.1M parameters sit in the optimal region of the accuracy–throughput trade-off (306.94 km²/s).
Log-Cosh Dice Loss + Boundary Class Weight Adjustment
Function: Optimizes the segmentation objective and balances boundary versus interior classes.
Mechanism: After comparing CE, Dice, Focal, Tversky, and Jaccard losses, log-cosh Dice is found to provide a smoother optimization landscape. A boundary weight of \(\omega = 0.75\) (normalized class weights: [0.05, 0.20, 0.75]) substantially increases focus on narrow boundary pixels.
Design Motivation: Boundary pixels constitute a very small fraction of the image; standard losses tend to neglect them. The log-cosh transformation also alleviates gradient instability of Dice loss in early training.
Deployment-Oriented Data Augmentation (Channel Shuffle + Brightness + Resize)
Function: Improves model robustness to input variations encountered in real deployment scenarios.
Mechanism: Channel shuffle randomly swaps growing- and harvest-season channels, inducing input-order invariance. Brightness augmentation makes the model robust to different Sentinel-2 radiometric preprocessing pipelines. Resize augmentation simulates imagery at varying spatial resolutions.
Design Motivation: In practice, users may supply multi-temporal data in different orderings, with different preprocessing workflows, or at different resolutions—none of which should affect prediction quality.
Deployment Robustness Evaluation Metrics
Function: Quantifies model behavior under real-world large-scale mapping conditions.
Mechanism: Four new metrics are proposed: (a) translation consistency—prediction agreement in overlapping regions across four corner crops; (b) input order sensitivity—performance variance across channel permutations; (c) preprocessing invariance—performance variance across different radiometric normalization schemes; (d) spatial scale sensitivity—performance variance across different input resolutions.
Design Motivation: Conventional metrics only measure patch-level accuracy and cannot predict stitching quality during large-scale map production.

Loss & Training¶

The total loss is a class-weighted log-cosh Dice loss. Training uses the Adam optimizer, with the learning rate selected by sweep over \(\{10^{-4},\ 3\times10^{-4},\ 3\times10^{-3},\ 10^{-2},\ 3\times10^{-2}\}\). For presence-only samples (countries with only positive-sample annotations), unknown-label pixels are masked out during training.

Key Experimental Results¶

Main Results¶

Model	Category	IoU ↑	Object-F1 ↑	AP0.5 ↑	Params (M)	Throughput (km²/s)
PRUE (ours)	Semantic Seg.	0.76	0.47	0.40	67.1	306.94
FTW-Baseline	Semantic Seg.	0.70	0.38	0.39	13.2	623.28
Mask2Former	Instance/Panoptic	0.68	0.39	0.44	68.8	26.66
Clay (ViT-L)	GFM	0.67	0.36	0.41	363.8	10.98
Galileo (ViT-B)	GFM	0.66	0.32	0.37	119.0	*
SAM (fine-tuned)	Instance Seg.	0.45	0.37	0.19	642.7	0.17
Del-Any (zero-shot)	Instance Seg.	0.37	0.09	0.10	56.9	87.32

Ablation Study¶

Configuration	Object-F1 ↑	IoU ↑	Input Order Δ ↓	Brightness Δ ↓	Scale Δ ↓	Consistency ↑
FTW-Baseline	0.39	0.68	0.07 / 0.11	0.04 / 0.05	0.15 / 0.12	0.93
+Brightness+Resize	0.38	0.66	0.06 / 0.10	0.02 / 0.03	0.00 / 0.01	0.95
+Channel Shuffle	0.39	0.68	0.00 / 0.00	0.04 / 0.05	0.17 / 0.14	0.94
+ω=0.75	0.42	0.74	0.08 / 0.11	0.07 / 0.07	0.29 / 0.15	0.95
+Log-Cosh Dice	0.44	0.77	0.09 / 0.13	0.06 / 0.05	0.36 / 0.20	0.94
PRUE (full)	0.47	0.76	0.00 / 0.00	0.00 / 0.00	0.01 / 0.01	0.95

Key Findings¶

GFMs, despite having 3–10× more parameters, are consistently outperformed by the carefully optimized U-Net. The best-performing GFM, Clay (ViT-L, 363.8M parameters), still trails PRUE by 9% IoU—indicating that coarse patch-embedding resolution is insufficient for fine-grained boundary segmentation.
Systematic design optimization (loss + augmentation + class weighting) matters more than architecture selection: the same U-Net backbone gains 9% F1 through combined optimization.
The augmentation strategies are complementary: channel shuffle eliminates input-order dependence, brightness and resize augmentation eliminate radiometric and scale dependence, and their combination drives all robustness metrics to near-perfect levels.
Instance segmentation models (SAM, Delineate Anything) perform poorly in zero-shot settings, as field boundaries do not conform to the bounding-box assumptions underlying typical object detection frameworks.

Highlights & Insights¶

Deployment-Oriented Evaluation Framework: This work is the first to propose a systematic deployment robustness evaluation suite for geospatial segmentation, covering translation consistency, input order/preprocessing/scale sensitivity. This methodology is directly transferable to any remote sensing task requiring large-scale tiled inference.
"Recipe" Thinking vs. "Architecture" Thinking: The paper demonstrates that systematic engineering optimization (loss, augmentation, class weighting) on a mature segmentation backbone substantially outperforms introducing more complex architectures or larger foundation models—an insight of significant practical value for real-world deployment.
The channel shuffle trick for achieving input-order invariance is elegant, zero-cost, and directly transferable to all multi-temporal remote sensing tasks.

Limitations & Future Work¶

The method still relies on connected-component post-processing for instance extraction and cannot directly output instance-level segmentations; separation of adjacent parcels remains limited by boundary prediction quality.
The model uses only bi-temporal inputs and does not leverage full time-series information (e.g., temporal Sentinel-2 sequences as used in PASTIS).
Evaluation is conducted solely at Sentinel-2 10 m resolution; transferability to higher-resolution imagery (e.g., PlanetScope at 3 m) has not been fully validated.
National-scale maps are produced for only five countries; global generalization across more diverse geographies and agricultural systems remains to be validated.

vs. FTW Baseline: Both employ U-Net semantic segmentation; PRUE achieves IoU +6% and F1 +9% through systematic optimization of loss, augmentation, and encoder.
vs. GFMs (Clay, Galileo, etc.): GFMs offer stronger general-purpose representations but suffer from insufficient spatial resolution for fine-grained boundary segmentation, and their inference throughput is 1–2 orders of magnitude lower.
vs. Delineate Anything: This YOLOv11-based instance segmentation method designed specifically for parcel delineation yields modest zero-shot performance on FTW (IoU = 0.37), confirming that task-specific training remains necessary.

Rating¶

Novelty: ⭐⭐⭐ — No new architectural components are introduced; the core contribution is systematic engineering optimization, though the deployment robustness metrics represent an original methodological contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — The large-scale comparison of 18 models is comprehensive; ablations cover loss, augmentation, architecture, and class weighting; national-scale maps for five countries are publicly released.
Writing Quality: ⭐⭐⭐⭐ — The paper is clearly structured with well-organized experiments; the motivation for the deployment metrics is convincingly articulated.
Value: ⭐⭐⭐⭐ — High practical value for the remote sensing community, providing a reproducible best-practice recipe along with publicly released models and data.