PRUE: A Practical Recipe for Field Boundary Segmentation at Scale

Conference: CVPR 2026
arXiv: 2603.27101
Code: https://github.com/fieldsoftheworld/ftw-prue
Area: Semantic Segmentation / Remote Sensing
Keywords: Field boundary segmentation, geospatial foundation models, U-Net, deployment robustness, large-scale mapping

TL;DR

This paper systematically evaluates 18 segmentation and geospatial foundation models (GFMs) and proposes PRUE, a field boundary segmentation recipe that combines a U-Net backbone, a composite loss function, and targeted data augmentation. PRUE achieves 0.76 IoU and 0.47 object-F1 on the FTW benchmark, surpassing the baseline by 6 and 9 points respectively. The paper also introduces a new set of metrics for evaluating deployment robustness.

Background & Motivation

  1. Background: Large-scale field boundary maps are critical for agricultural monitoring. Deep learning methods—particularly U-Net-based semantic segmentation—have become the dominant approach for extracting field boundaries from satellite imagery.

  2. Limitations of Prior Work: Existing methods are highly sensitive to illumination variation, spatial scale changes, and geographic domain shifts. Deploying top-performing models over large regions introduces tiling artifacts, boundary discontinuities, and other quality degradation issues.

  3. Key Challenge: Conventional evaluation focuses solely on patch-level metrics such as IoU and F1, which fail to capture deployment-level failure modes encountered during large-scale mapping—including translation consistency, input order sensitivity, preprocessing normalization sensitivity, and spatial scale sensitivity.

  4. Goal: To systematically identify the optimal combination of model architecture, loss function, and data augmentation, and to propose a deployment-oriented robustness evaluation framework that enables reliable national-scale field boundary mapping.

  5. Key Insight: The problem is framed as a systematic "bake-off" benchmark, comparing 18 models spanning semantic segmentation, instance segmentation, and GFMs under a unified experimental protocol, with ablations over architecture, loss, and augmentation choices.

  6. Core Idea: Through systematic exploration of the model design space—rather than architectural innovation—PRUE combines U-Net with EfficientNet-B7, log-cosh Dice loss, and channel shuffle with brightness/scale augmentation to jointly optimize accuracy and deployment robustness.

Method

Overall Architecture

The input consists of bi-temporal RGBN Sentinel-2 imagery (4 channels each for growing and harvest seasons, 8 channels total). The output is a three-class semantic segmentation map (background / field interior / boundary), from which individual field instance polygons are extracted via connected-component post-processing. The core pipeline is: encoder–decoder segmentation → pixel-level classification → connected-component instance extraction → polygonization.
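The connected-component step of this pipeline is simple enough to sketch directly. The paper does not publish its exact post-processing code, so the snippet below is a minimal, pure-Python 4-connected flood fill over the predicted interior class (`extract_instances` and the class indices are illustrative); the final polygonization step (e.g., via rasterio's raster-to-vector utilities) is omitted.

```python
from collections import deque

# Class indices assumed from the paper's three-class scheme.
BACKGROUND, INTERIOR, BOUNDARY = 0, 1, 2

def extract_instances(seg):
    """Label 4-connected components of the field-interior class.

    seg: 2-D list of class indices. Returns a same-shaped grid where
    each field instance gets a positive integer id (0 elsewhere).
    """
    h, w = len(seg), len(seg[0])
    labels = [[0] * w for _ in range(h)]
    next_id = 0
    for y in range(h):
        for x in range(w):
            if seg[y][x] == INTERIOR and labels[y][x] == 0:
                next_id += 1
                labels[y][x] = next_id
                q = deque([(y, x)])
                while q:
                    cy, cx = q.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and seg[ny][nx] == INTERIOR
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = next_id
                            q.append((ny, nx))
    return labels

# Two fields separated by a predicted boundary column.
seg = [
    [1, 1, 2, 1],
    [1, 1, 2, 1],
    [0, 0, 2, 1],
]
inst = extract_instances(seg)
```

Note how a single misclassified boundary pixel would merge the two components, which is why the recipe weights the boundary class so heavily.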

Key Designs

  1. U-Net + EfficientNet-B7 Encoder

     • Function: Serves as the feature-extraction backbone, providing multi-scale semantic features.
     • Mechanism: After systematic comparison of FCN, UPerNet, FCSiam, and multiple U-Net variants, the EfficientNet-B7 encoder achieves the best balance between accuracy and parameter efficiency. Compared to the B3 baseline, it increases model capacity while avoiding overfitting through careful pairing with the complementary loss and augmentation choices below.
     • Design Motivation: A larger encoder captures richer spatial context, yielding better representation of complex field morphologies, especially irregular smallholder parcels. Its 67.1M parameters sit in the optimal region of the accuracy–throughput trade-off (306.94 km²/s).

  2. Log-Cosh Dice Loss + Boundary Class Weight Adjustment

     • Function: Optimizes the segmentation objective and balances boundary versus interior classes.
     • Mechanism: After comparing CE, Dice, Focal, Tversky, and Jaccard losses, log-cosh Dice is found to provide a smoother optimization landscape. A boundary weight of \(\omega = 0.75\) (normalized class weights: [0.05, 0.20, 0.75]) substantially increases focus on narrow boundary pixels.
     • Design Motivation: Boundary pixels constitute a very small fraction of the image, so standard losses tend to neglect them. The log-cosh transformation also alleviates the gradient instability of Dice loss in early training.

  3. Deployment-Oriented Data Augmentation (Channel Shuffle + Brightness + Resize)

     • Function: Improves model robustness to input variations encountered in real deployment scenarios.
     • Mechanism: Channel shuffle randomly swaps growing- and harvest-season channels, inducing input-order invariance. Brightness augmentation makes the model robust to different Sentinel-2 radiometric preprocessing pipelines. Resize augmentation simulates imagery at varying spatial resolutions.
     • Design Motivation: In practice, users may supply multi-temporal data in different orderings, with different preprocessing workflows, or at different resolutions; none of these should affect prediction quality.

  4. Deployment Robustness Evaluation Metrics

     • Function: Quantifies model behavior under real-world large-scale mapping conditions.
     • Mechanism: Four new metrics are proposed: (a) translation consistency, the prediction agreement in overlapping regions across four corner crops; (b) input order sensitivity, the performance variance across channel permutations; (c) preprocessing invariance, the performance variance across radiometric normalization schemes; (d) spatial scale sensitivity, the performance variance across input resolutions.
     • Design Motivation: Conventional metrics only measure patch-level accuracy and cannot predict stitching quality during large-scale map production.
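Of the four robustness metrics, translation consistency is the least standard, so a sketch may help. The paper evaluates agreement across four corner crops; the simplified two-crop version below conveys the idea (the exact formulation, the `predict` interface, and the crop/shift sizes are assumptions for illustration).

```python
import numpy as np

def translation_consistency(predict, image, crop=64, shift=16):
    """Fraction of pixels on which predictions from two overlapping
    crops agree (1.0 = perfectly shift-invariant model).

    predict: fn mapping a (C, crop, crop) array to a (crop, crop)
    integer label map.
    """
    a = predict(image[:, :crop, :crop])
    b = predict(image[:, shift:shift + crop, shift:shift + crop])
    # The shared image region, expressed in each crop's own coordinates.
    ov_a = a[shift:, shift:]
    ov_b = b[:crop - shift, :crop - shift]
    return float((ov_a == ov_b).mean())

# A translation-equivariant "model" (per-pixel threshold) scores 1.0;
# a real CNN with padding and tiling effects generally scores lower.
rng = np.random.default_rng(0)
img = rng.random((8, 96, 96))
score = translation_consistency(lambda x: (x[0] > 0.5).astype(int), img)
```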

Loss & Training

The total loss is a class-weighted log-cosh Dice loss. Training uses the Adam optimizer, with the learning rate selected by a sweep over \(\{10^{-4},\ 3\times10^{-4},\ 3\times10^{-3},\ 10^{-2},\ 3\times10^{-2}\}\). For presence-only samples (countries with only positive-sample annotations), unknown-label pixels are masked out during training.
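A numpy sketch of this objective follows (the actual training code is in PyTorch; the per-class formulation and the masking interface below are assumptions consistent with the class weights quoted above).

```python
import numpy as np

# Normalized class weights from the paper: background, interior, boundary.
WEIGHTS = np.array([0.05, 0.20, 0.75])

def logcosh_dice_loss(probs, target, valid=None, eps=1e-6):
    """Class-weighted log-cosh Dice loss.

    probs:  (C, N) per-class probabilities for N pixels.
    target: (N,) integer labels.
    valid:  optional (N,) bool mask; unknown-label pixels
            (presence-only countries) are excluded.
    """
    if valid is not None:
        probs, target = probs[:, valid], target[valid]
    loss = 0.0
    for c, w in enumerate(WEIGHTS):
        p, t = probs[c], (target == c).astype(float)
        dice = (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)
        # log(cosh(x)) is ~x**2/2 near 0 and ~|x| for large x,
        # smoothing the Dice loss surface.
        loss += w * np.log(np.cosh(1.0 - dice))
    return loss

target = np.array([0, 1, 2, 2])
perfect = np.eye(3)[:, target]           # one-hot, correct everywhere
wrong = np.eye(3)[:, np.array([2, 0, 0, 1])]
loss_perfect = logcosh_dice_loss(perfect, target)
loss_wrong = logcosh_dice_loss(wrong, target)
```

A perfect prediction gives Dice = 1 per class, so the loss vanishes; errors on the heavily weighted boundary class dominate the total.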

Key Experimental Results

Main Results

| Model | Category | IoU ↑ | Object-F1 ↑ | AP0.5 ↑ | Params (M) | Throughput (km²/s) |
|---|---|---|---|---|---|---|
| PRUE (ours) | Semantic Seg. | 0.76 | 0.47 | 0.40 | 67.1 | 306.94 |
| FTW-Baseline | Semantic Seg. | 0.70 | 0.38 | 0.39 | 13.2 | 623.28 |
| Mask2Former | Instance/Panoptic | 0.68 | 0.39 | 0.44 | 68.8 | 26.66 |
| Clay (ViT-L) | GFM | 0.67 | 0.36 | 0.41 | 363.8 | 10.98 |
| Galileo (ViT-B) | GFM | 0.66 | 0.32 | 0.37 | 119.0 | * |
| SAM (fine-tuned) | Instance Seg. | 0.45 | 0.37 | 0.19 | 642.7 | 0.17 |
| Del-Any (zero-shot) | Instance Seg. | 0.37 | 0.09 | 0.10 | 56.9 | 87.32 |

Ablation Study

| Configuration | Object-F1 ↑ | IoU ↑ | Input Order Δ ↓ | Brightness Δ ↓ | Scale Δ ↓ | Consistency ↑ |
|---|---|---|---|---|---|---|
| FTW-Baseline | 0.39 | 0.68 | 0.07 / 0.11 | 0.04 / 0.05 | 0.15 / 0.12 | 0.93 |
| +Brightness+Resize | 0.38 | 0.66 | 0.06 / 0.10 | 0.02 / 0.03 | 0.00 / 0.01 | 0.95 |
| +Channel Shuffle | 0.39 | 0.68 | 0.00 / 0.00 | 0.04 / 0.05 | 0.17 / 0.14 | 0.94 |
| +ω=0.75 | 0.42 | 0.74 | 0.08 / 0.11 | 0.07 / 0.07 | 0.29 / 0.15 | 0.95 |
| +Log-Cosh Dice | 0.44 | 0.77 | 0.09 / 0.13 | 0.06 / 0.05 | 0.36 / 0.20 | 0.94 |
| PRUE (full) | 0.47 | 0.76 | 0.00 / 0.00 | 0.00 / 0.00 | 0.01 / 0.01 | 0.95 |

Key Findings

  • GFMs, despite having 3–10× more parameters, are consistently outperformed by the carefully optimized U-Net. The best-performing GFM, Clay (ViT-L, 363.8M parameters), still trails PRUE by 9 IoU points, indicating that coarse patch-embedding resolution is insufficient for fine-grained boundary segmentation.
  • Systematic design optimization (loss + augmentation + class weighting) matters more than architecture selection: the same U-Net backbone gains 9 points of object-F1 through combined optimization.
  • The augmentation strategies are complementary: channel shuffle eliminates input-order dependence, brightness and resize augmentation eliminate radiometric and scale dependence, and their combination drives all robustness metrics to near-perfect levels.
  • Instance segmentation models (SAM, Delineate Anything) perform poorly in zero-shot settings, as field boundaries do not conform to the bounding-box assumptions underlying typical object detection frameworks.

Highlights & Insights

  • Deployment-Oriented Evaluation Framework: This work is the first to propose a systematic deployment robustness evaluation suite for geospatial segmentation, covering translation consistency and sensitivity to input order, preprocessing, and spatial scale. This methodology is directly transferable to any remote sensing task requiring large-scale tiled inference.
  • "Recipe" Thinking vs. "Architecture" Thinking: The paper demonstrates that systematic engineering optimization (loss, augmentation, class weighting) on a mature segmentation backbone substantially outperforms introducing more complex architectures or larger foundation models—an insight of significant practical value for real-world deployment.
  • The channel shuffle trick for achieving input-order invariance is elegant, essentially free, and transferable to other multi-temporal remote sensing tasks where the temporal ordering of inputs should not matter.
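As a concrete illustration, the trick amounts to a few lines at data-loading time. The numpy sketch below assumes the paper's 4+4 channel layout (RGBN for two seasonal windows); the function name and signature are illustrative.

```python
import numpy as np

def channel_shuffle(x, rng, p=0.5):
    """With probability p, swap the growing- and harvest-season blocks
    of a bi-temporal RGBN stack (8 channels: first 4 = window A,
    last 4 = window B), teaching the model input-order invariance.
    """
    assert x.shape[0] == 8, "expects a (8, H, W) bi-temporal stack"
    if rng.random() < p:
        return np.concatenate([x[4:], x[:4]], axis=0)
    return x

x = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)
swapped = channel_shuffle(x, np.random.default_rng(0), p=1.0)
```

Because the ground-truth mask is season-independent, no label change is needed, which is what makes the augmentation essentially free.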

Limitations & Future Work

  • The method still relies on connected-component post-processing for instance extraction and cannot directly output instance-level segmentations; separation of adjacent parcels remains limited by boundary prediction quality.
  • The model uses only bi-temporal inputs and does not leverage full time-series information (e.g., temporal Sentinel-2 sequences as used in PASTIS).
  • Evaluation is conducted solely at Sentinel-2 10 m resolution; transferability to higher-resolution imagery (e.g., PlanetScope at 3 m) has not been fully validated.
  • National-scale maps are produced for only five countries; global generalization across more diverse geographies and agricultural systems remains to be validated.

Comparison with Prior Methods

  • vs. FTW Baseline: Both employ U-Net semantic segmentation; PRUE gains 6 IoU points and 9 object-F1 points through systematic optimization of loss, augmentation, and encoder.
  • vs. GFMs (Clay, Galileo, etc.): GFMs offer stronger general-purpose representations but suffer from insufficient spatial resolution for fine-grained boundary segmentation, and their inference throughput is 1–2 orders of magnitude lower.
  • vs. Delineate Anything: This YOLOv11-based instance segmentation method, designed specifically for parcel delineation, yields modest zero-shot performance on FTW (IoU = 0.37), confirming that task-specific training remains necessary.

Rating

  • Novelty: ⭐⭐⭐ — No new architectural components are introduced; the core contribution is systematic engineering optimization, though the deployment robustness metrics represent an original methodological contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — The large-scale comparison of 18 models is comprehensive; ablations cover loss, augmentation, architecture, and class weighting; national-scale maps for five countries are publicly released.
  • Writing Quality: ⭐⭐⭐⭐ — The paper is clearly structured with well-organized experiments; the motivation for the deployment metrics is convincingly articulated.
  • Value: ⭐⭐⭐⭐ — High practical value for the remote sensing community, providing a reproducible best-practice recipe along with publicly released models and data.