RMAE-ProGRess: Advancing Semantic Segmentation in Unstructured Environments

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://gitlab.com/coeaiml/rmae-progress
Area: Semantic Segmentation
Keywords: Unstructured environments, off-road segmentation, lightweight decoder, multi-scale fusion, MAE encoder

TL;DR

For semantic segmentation in off-road/unstructured scenes, this paper pairs RMAE, a ViT-MAE encoder reduced from 12 layers to 8, 6, or 4 and tapped at non-adjacent layers for multi-level features, with a lightweight decoder, ProGRess. The decoder consists of three modules: Progressive Leapwise Fusion (PLF), Lightweight Channel Attention with Residuals (LCAR), and Bottleneck Feature Fusion (BFF). The framework reports state-of-the-art mIoU of 57.41% / 78.95% / 45.63% on the RELLIS-3D / RELLIS-3DC / RUGD datasets, with a decoder that has roughly 85% fewer parameters than UPerNet.

Background & Motivation

Background: Mainstream research in semantic segmentation focuses almost exclusively on structured urban scenes (Cityscapes, ADE20K), reaching high performance using powerful encoder-decoder architectures. However, off-road navigation, search and rescue, defense robotics, and planetary exploration involve unstructured environments characterized by irregular terrain, blurry boundaries, and a lack of geometric consistency.

Limitations of Prior Work: Unstructured segmentation suffers from two gaps. First, there is no standardized benchmark: existing off-road works on RELLIS-3D and RUGD use inconsistent evaluation protocols and loosely defined metrics, so later researchers cannot compare results directly. Second, there is no targeted architecture design: urban decoders (UPerNet, DeepLabV3+, OCRNet, etc.) have not been systematically verified on off-road data, and they carry redundant parameters. Off-road deployment often happens on edge hardware with limited compute, so a model must balance accuracy against a controllable computational budget.

Key Challenge: General-purpose segmentation models are either accurate but too heavy (ViT-Base backbones exceed 86M parameters, UPerNet decoders exceed 32M) or lightweight but markedly less accurate. Furthermore, visual cues in off-road scenes (fallen leaves, gravel, shadows, blurry boundaries) are highly heterogeneous, and simple multi-scale fusion is insufficient.

Goal: (1) Re-train and evaluate 16 mainstream CNN/Transformer segmentation models on off-road data to establish a reproducible standard benchmark; (2) Design a segmentation framework that balances accuracy, computational efficiency, and modularity.

Core Idea: On the encoder side, adjacent ViT layer features are highly redundant, so a third to two-thirds of the transformer layers are removed (12 reduced to 8, 6, or 4) to create the lightweight RMAE, and features are extracted only from regularly spaced, non-adjacent layers. On the decoder side, three lightweight modules (PLF/LCAR/BFF) built primarily from \(1 \times 1\) convolutions perform progressive multi-scale fusion, replacing heavy FPN-style decoders.

Method

Overall Architecture

RMAE-ProGRess follows a standard encoder-neck-decoder pipeline, with each stage redesigned for lightweight off-road segmentation. An input image \(I \in \mathbb{R}^{B \times C' \times H \times W}\) first passes through the RMAE encoder, a slimmed-down ViT-MAE-Base reduced from 12 to 8/6/4 layers. The encoder extracts four feature maps \(\{f_1, f_2, f_3, f_4\}\) from non-adjacent layers, all at resolution \(\frac{H}{16} \times \frac{W}{16}\), spanning early structures, mid-level patterns, and deep semantics. Next, the F2P (Feature-to-Pyramid) neck rescales these same-resolution features by factors \(r_i \in \{4, 2, 1, 0.5\}\) into a multi-scale pyramid \(F_1, \dots, F_4\) with resolutions \(\frac{H}{4}\), \(\frac{H}{8}\), \(\frac{H}{16}\), and \(\frac{H}{32}\), keeping the channel dimension \(C\) unchanged. Finally, the ProGRess decoder chains PLF→LCAR→BFF: PLF performs top-down progressive fusion of the pyramid features, LCAR applies per-pixel, per-channel gating, and BFF aligns all scales to \(\frac{H}{4}\) before compressing and aggregating them into the pixel-level class probabilities.

F2P follows standard practices (transposed convolution upsampling + max-pooling downsampling); the core innovation lies in the RMAE encoder and the three ProGRess decoder modules.

graph TD
    A["Input Image"] --> B["RMAE Encoder<br/>Reduced ViT-MAE, 4 non-adjacent layers"]
    B --> C["F2P Neck (Standard)<br/>Scale to multi-scale pyramid F1..F4"]
    C --> D["PLF: Progressive Leapwise Fusion<br/>Top-down recursive fusion"]
    D --> E["LCAR: Lightweight Channel Attention + Residuals<br/>Pixel-wise channel-wise gating"]
    E --> F["BFF: Bottleneck Feature Fusion<br/>Align to H/4 then 1x1 compression & aggregation"]
    F --> G["Segmentation Head → Pixel-wise Categories"]
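The resolution bookkeeping of this pipeline can be checked with a few lines of Python. This is only an illustrative sketch: the 512×512 input matches the benchmark setting reported later, while `channels=768` (the ViT-Base embedding width) and the helper name `pyramid_shapes` are assumptions introduced here, since the text only calls the width \(C\).

```python
from fractions import Fraction

PATCH = 16                             # patch size implied by the H/16 feature maps
SCALES = (4, 2, 1, Fraction(1, 2))     # F2P rescaling factors r_i from the text


def pyramid_shapes(height, width, channels):
    """Return the (C, H_i, W_i) shape of each F2P output F_1..F_4."""
    base_h, base_w = height // PATCH, width // PATCH   # encoder output grid
    return [(channels, int(base_h * r), int(base_w * r)) for r in SCALES]


# Assumed setting: 512x512 input, ViT-Base width of 768 channels.
shapes = pyramid_shapes(512, 512, channels=768)
for level, shape in enumerate(shapes, start=1):
    print(f"F{level}: {shape}")
```

For a 512×512 input the encoder grid is 32×32, so the pyramid spans 128×128 (\(\frac{H}{4}\)) down to 16×16 (\(\frac{H}{32}\)).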

Key Designs

1. RMAE Encoder: Reducing ViT Depth and Extracting Non-Adjacent Features

To address the redundancy in adjacent ViT-Base layers, the authors trim the 12 transformer layers of ViT-MAE-Base to 8/6/4 layers, resulting in RMAE-8L (58.5M), RMAE-6L (44.2M), and RMAE-4L (29.9M). Relative to the original 86M, this removes roughly 32%, 49%, and 65% of the encoder parameters, while representations are preserved by initializing from MAE self-supervised pre-training weights. Critically, since adjacent layers are highly correlated due to token mixing, the authors extract features only from regularly spaced non-adjacent layers (e.g., indices \(\{1,3,5,7\}\) for RMAE-8L), covering early, middle, and deep semantic levels.
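As a minimal sketch of the tapping rule, the helper below picks evenly spaced layer indices and reproduces the \(\{1,3,5,7\}\) example for RMAE-8L. The function names are hypothetical, and the even-spacing formula is an assumption: the summary gives indices only for the 8-layer variant and does not state which layers the 6- and 4-layer variants tap.

```python
def spaced_layer_indices(depth, num_taps=4):
    """Pick `num_taps` evenly spaced 0-based layer indices, ending at the last layer."""
    if depth % num_taps != 0:
        raise ValueError("depth must be a multiple of num_taps for even spacing")
    stride = depth // num_taps
    return [stride * (k + 1) - 1 for k in range(num_taps)]


def is_non_adjacent(indices):
    """True when no two selected layers are direct neighbours."""
    return all(b - a >= 2 for a, b in zip(indices, indices[1:]))


# Encoder sizes quoted in the text, in millions of parameters.
PARAMS_M = {"ViT-MAE-Base": 86.0, "RMAE-8L": 58.5, "RMAE-6L": 44.2, "RMAE-4L": 29.9}

taps_8l = spaced_layer_indices(8)   # matches the RMAE-8L example in the text
print("RMAE-8L taps:", taps_8l, "non-adjacent:", is_non_adjacent(taps_8l))
for name, params in PARAMS_M.items():
    saved = 100 * (1 - params / PARAMS_M["ViT-MAE-Base"])
    print(f"{name}: {params}M ({saved:.0f}% fewer than the 12-layer encoder)")
```

Under this spacing rule a 4-layer encoder would have to tap every layer, so the non-adjacency property as stated can only hold for the deeper variants unless a different indexing scheme is used.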

2. PLF (Progressive Leapwise Fusion): Top-Down Recursive Fusion of Non-Adjacent Features

Pyramid features derived from non-adjacent ("leapwise") layers differ in both resolution and semantic level. PLF fuses them with a top-down recursive cascade: it first enhances the deepest level through self-fusion, \(\tilde{F}_4 = \mathrm{Fuse}(F_4, F_4)\), then fuses each shallower level with the already-fused deeper one: \(\tilde{F}_3 = \mathrm{Fuse}(F_3, \tilde{F}_4)\), \(\tilde{F}_2 = \mathrm{Fuse}(F_2, \tilde{F}_3)\), and \(\tilde{F}_1 = \mathrm{Fuse}(F_1, \tilde{F}_2)\). Each fusion operator is \(\mathrm{Fuse}(F_i, F_j) = \phi(\mathrm{BN}(W_{ij} * [F_i \| U(F_j, \text{size of } F_i)]))\), where \(U\) is nearest-neighbor interpolation, \(\|\) is channel concatenation, \(W_{ij}\) is a learnable \(1\times1\) convolution, and \(\phi\) is the activation. The recursion ensures that each \(\tilde{F}_i\) aggregates information from its own level and every deeper one, \(\{F_i, \dots, F_4\}\).
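The cascade can be sketched in NumPy as follows. This is a toy illustration, not the authors' code: \(\phi\) is assumed to be ReLU, batch normalization is reduced to batch-statistics normalization without learned affine terms, and the weights are random placeholders.

```python
import numpy as np


def upsample_nearest(x, size):
    """Nearest-neighbour resize of a (B, C, H, W) array to spatial `size`."""
    h, w = x.shape[2:]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, :, rows[:, None], cols[None, :]]


def conv1x1(x, weight):
    """1x1 convolution; `weight` has shape (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)


def batch_norm(x, eps=1e-5):
    """Simplified batch norm: per-channel batch statistics, no learned scale/shift."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def fuse(f_i, f_j, weight):
    """Fuse(F_i, F_j) = phi(BN(W * [F_i || U(F_j)])), with phi assumed to be ReLU."""
    f_j_up = upsample_nearest(f_j, f_i.shape[2:])
    stacked = np.concatenate([f_i, f_j_up], axis=1)        # (B, 2C, H_i, W_i)
    return np.maximum(batch_norm(conv1x1(stacked, weight)), 0.0)


def plf(features, weights):
    """Top-down cascade over [F1, F2, F3, F4]; returns the fused levels in the same order."""
    fused = [None] * len(features)
    fused[-1] = fuse(features[-1], features[-1], weights[-1])   # self-fusion of F4
    for i in range(len(features) - 2, -1, -1):
        fused[i] = fuse(features[i], fused[i + 1], weights[i])  # F_i with fused F_{i+1}
    return fused


rng = np.random.default_rng(0)
C = 8                                   # toy channel width
sizes = [32, 16, 8, 4]                  # H/4 ... H/32 for a 128-pixel input
features = [rng.standard_normal((2, C, s, s)) for s in sizes]
weights = [rng.standard_normal((C, 2 * C)) * 0.1 for _ in sizes]
fused = plf(features, weights)
print([f.shape for f in fused])
```

Each fused level keeps the resolution and width of its input level, so the modules that follow can treat PLF as a drop-in refinement of the pyramid.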

3. LCAR (Lightweight Channel Attention with Residuals): Pooling-free Pixel-wise Gating + Selective Residuals

To emphasize specific channels without losing spatial detail in heterogeneous off-road scenes, LCAR avoids global average pooling. Instead, it uses a \(1\times1\) convolution to generate per-pixel channel attention maps: \(\mathrm{LCA}(X) = X \odot \sigma(W_c * X)\), where \(W_c \in \mathbb{R}^{C\times C\times 1\times 1}\) mixes channels and \(\sigma\) is the sigmoid. A residual connection with a binary switch is then added: \(\mathrm{LCAR}(X) = \mathrm{LCA}(X) + \alpha X\), where \(\alpha \in \{0,1\}\). Empirically, enabling the residual (\(\alpha=1\)) only for the deepest level \(\tilde{F}_4\), and leaving \(\alpha=0\) elsewhere, stabilizes gradient flow for the fragile deep representations.
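A minimal NumPy sketch of the two formulas above, with hypothetical function names and toy tensors. Because the gate is computed by a \(1\times1\) convolution rather than from a pooled vector, it has the full shape \((B, C, H, W)\): one weight per pixel and channel.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lca(x, w_c):
    """LCA(X) = X * sigmoid(W_c * X): one gate per pixel and channel, no pooling."""
    gate = sigmoid(np.einsum("oc,bchw->bohw", w_c, x))   # shape (B, C, H, W)
    return x * gate


def lcar(x, w_c, alpha):
    """LCAR(X) = LCA(X) + alpha * X, where alpha is a binary switch."""
    if alpha not in (0, 1):
        raise ValueError("alpha must be 0 or 1")
    return lca(x, w_c) + alpha * x


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))    # toy deepest-level feature map
w_zero = np.zeros((8, 8))                # all-zero weights give a constant gate of 0.5

gated = lcar(x, w_zero, alpha=0)         # equals 0.5 * x
gated_residual = lcar(x, w_zero, alpha=1)  # equals 1.5 * x
```

With zero weights the sigmoid outputs 0.5 everywhere, which makes the effect of the switch easy to see: \(\alpha=0\) halves the input, while \(\alpha=1\) adds the identity path back on top.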

4. BFF (Bottleneck Feature Fusion): Compression at Unified Resolution

BFF resizes all LCAR outputs \(\hat{F}_i\) to the target resolution \(\frac{H}{4} \times \frac{W}{4}\) to obtain \(\bar{F}_i\), concatenates them, and applies a \(1\times1\) convolution \(W_{bff} \in \mathbb{R}^{C\times 4C\times 1\times 1}\) that compresses the channels from \(4C\) back to \(C\): \(Z = \phi(\mathrm{BN}(W_{bff} * [\bar{F}_1\|\bar{F}_2\|\bar{F}_3\|\bar{F}_4]))\). The final prediction is \(Y = \mathrm{softmax}(W_{cls} * Z)\). The entire decoder relies solely on \(1\times1\) convolutions and interpolation, yielding only 4.86M parameters with a ViT-B16 backbone, about 85% fewer than UPerNet (32.27M).
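The alignment, compression, and classification steps can be sketched as below. The sketch assumes ReLU for \(\phi\), a simplified batch normalization without learned affine terms, random placeholder weights, and a hypothetical class count `NUM_CLASSES`; none of these values come from the paper.

```python
import numpy as np


def resize_nearest(x, size):
    """Nearest-neighbour resize of a (B, C, H, W) array to spatial `size`."""
    h, w = x.shape[2:]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, :, rows[:, None], cols[None, :]]


def conv1x1(x, weight):
    """1x1 convolution; `weight` has shape (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)


def batch_norm(x, eps=1e-5):
    """Simplified batch norm: per-channel batch statistics, no learned scale/shift."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bff(features, w_bff, w_cls):
    """Align all levels to the first (H/4) level, compress 4C -> C, then classify."""
    target = features[0].shape[2:]
    aligned = [resize_nearest(f, target) for f in features]
    stacked = np.concatenate(aligned, axis=1)               # (B, 4C, H/4, W/4)
    z = np.maximum(batch_norm(conv1x1(stacked, w_bff)), 0.0)
    return softmax(conv1x1(z, w_cls), axis=1)               # per-pixel class probabilities


rng = np.random.default_rng(0)
C, NUM_CLASSES = 8, 5                                       # toy width, hypothetical class count
features = [rng.standard_normal((2, C, s, s)) for s in (32, 16, 8, 4)]
w_bff = rng.standard_normal((C, 4 * C)) * 0.1               # 4C^2 weights
w_cls = rng.standard_normal((NUM_CLASSES, C)) * 0.1
probs = bff(features, w_bff, w_cls)
print(probs.shape)
```

The compression layer holds only \(4C^2\) weights regardless of spatial size, which is where most of the parameter saving over \(3\times3\)-convolution decoders would come from.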

Loss & Training

All models are trained for 160K iterations in the MMSegmentation framework. The encoder is initialized from MAE pre-trained weights and trained with a layer-wise learning rate (LR) configuration. Nearest-neighbor interpolation is the default resizing operator because it gave the best accuracy at the lowest computational cost.
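The summary does not spell out the layer-wise LR configuration. One common reading, used when fine-tuning MAE-style encoders, is layer-wise LR decay, where earlier layers receive geometrically smaller learning rates. The sketch below illustrates that scheme only; the base rate `1e-4`, the decay factor `0.65`, and the function name are hypothetical and not taken from the paper.

```python
def layerwise_lr(base_lr, decay, num_layers):
    """Learning rate per depth index.

    Index 0 is the patch embedding, 1..num_layers are the transformer blocks,
    and num_layers + 1 is everything after the encoder (full base rate).
    """
    return [base_lr * decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]


# Hypothetical values for an 8-layer encoder such as RMAE-8L.
rates = layerwise_lr(base_lr=1e-4, decay=0.65, num_layers=8)
for depth, lr in enumerate(rates):
    print(f"depth {depth}: lr = {lr:.2e}")
```

Under this scheme the pre-trained early layers move slowly and keep their MAE features, while the freshly initialized decoder trains at the full rate.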

Key Experimental Results

Main Results

Benchmarking 16 mainstream models on RELLIS-3D / RUGD (512×512 input):

| Dataset | Method | Backbone | Params (M) | mIoU (%) | mAcc (%) |
|---|---|---|---|---|---|
| RELLIS-3D | Swin-UPerNet (Strong Baseline) | Swin-B | 121.2 | 53.86 | 63.33 |
| RELLIS-3D | ProGRess | RMAE-4L | 46.5 | 53.23 | 63.04 |
| RELLIS-3D | ProGRess | RMAE-8L | 75.0 | 57.14 | 68.53 |
| RELLIS-3D | ProGRess | ViTMAE-Base | 103.6 | 57.41 | 69.21 |
| RUGD | Segformer (Strong Baseline) | MiT-B5 | 82.0 | 43.69 | 56.45 |
| RUGD | ProGRess | ViTMAE-Base | 103.6 | 45.63 | 57.80 |

The lightweight RMAE-4L variant (46.5M, 128 GFLOPs) achieves 53.23% mIoU, outperforming nearly all heavier baselines and coming within 0.63 points of Swin-UPerNet (53.86%) with under 40% of its parameters.
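Since inconsistent metric definitions are one of the motivations for the benchmark, it helps to pin down how mIoU and mAcc are typically computed from a confusion matrix. The sketch below is a generic illustration, not the paper's evaluation code; in particular, skipping classes that never occur in the ground truth is an assumed convention, and toolkits differ on this point.

```python
def confusion_matrix(targets, predictions, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(targets, predictions):
        matrix[t][p] += 1
    return matrix


def miou_and_macc(matrix):
    """Return (mIoU, mAcc) in percent, averaged over classes present in the ground truth."""
    n = len(matrix)
    ious, accs = [], []
    for c in range(n):
        tp = matrix[c][c]
        gt = sum(matrix[c])                          # pixels labelled as class c
        pred = sum(matrix[r][c] for r in range(n))   # pixels predicted as class c
        if gt == 0:
            continue                                 # assumed convention: skip absent classes
        ious.append(tp / (gt + pred - tp))           # intersection over union
        accs.append(tp / gt)                         # per-class pixel accuracy
    return 100 * sum(ious) / len(ious), 100 * sum(accs) / len(accs)


# Toy example with six pixels and three classes.
targets = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
miou, macc = miou_and_macc(confusion_matrix(targets, predictions, 3))
print(f"mIoU = {miou:.2f}%, mAcc = {macc:.2f}%")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, giving an mIoU of 50%, while the per-class accuracies of 1/2, 1, and 1/2 give an mAcc of about 66.67%. mAcc is always at least as high as mIoU because IoU additionally penalizes false positives.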

Ablation Study

Component Stacking (RMAE-8L, RELLIS-3D Test Set):

| BFF | PLF | Self-Fusion | LCAR | mIoU (Frozen) | mIoU (Fine-tuned) |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | ✗ | 49.02 | 52.52 |
| ✓ | ✓ | ✓ | ✓ | 53.18 | 56.90 |

Decoder Cross-Backbone Generalization (RELLIS-3D):

| Encoder | Decoder | Decoder Params (M) | mIoU (%) |
|---|---|---|---|
| ViT-B16 | UPerNet | 32.27 | 48.03 |
| ViT-B16 | ProGRess | 4.86 | 55.68 |
| RMAE-4L | ProGRess | 9.46 | 53.23 |
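The roughly 85% parameter reduction quoted for the decoder follows directly from the two ViT-B16 rows of this comparison. The short check below uses only the numbers reported here; the variable names are illustrative.

```python
# (encoder, decoder, decoder params in M, mIoU in %), copied from the comparison above.
UPERNET = ("ViT-B16", "UPerNet", 32.27, 48.03)
PROGRESS = ("ViT-B16", "ProGRess", 4.86, 55.68)

param_cut = 100 * (1 - PROGRESS[2] / UPERNET[2])   # percent fewer decoder parameters
miou_gain = PROGRESS[3] - UPERNET[3]               # absolute mIoU difference in points

print(f"Decoder parameters reduced by {param_cut:.1f}%")
print(f"mIoU improved by {miou_gain:.2f} points")
```

On the same ViT-B16 encoder, ProGRess therefore uses about 84.9% fewer decoder parameters while gaining 7.65 mIoU points over UPerNet.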

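Key Findings

PLF provides the largest gain: Adding PLF increases mIoU from 54.56 to 56.15 (+1.59); these values come from intermediate ablation rows that are not reproduced in the component-stacking table above.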
Backbone Agnostic: ProGRess improves performance across ResNet, Swin, MiT, and ViT encoders while using a fraction of the parameters compared to UPerNet.
Interpolation Insensitivity: Differences between bicubic, bilinear, and nearest interpolation are \(<0.42\) mIoU; nearest neighbor is selected for its efficiency.

Highlights & Insights

Shrinking Depth vs. Width: Unlike traditional lightweight ViTs that reduce embedding dimensions, RMAE preserves width but reduces depth (12 layers to 8/6/4), tapping non-adjacent layers to avoid redundant features.
1×1 Conv Decoder: The decoder avoids heavy operators (ASPP, heavy attention blocks), providing a template for lightweight segmentation in resource-constrained environments.
Recursive Information Preservation: PLF's recursive cascade carries global context from the deepest level into the high-resolution branches.

Limitations & Future Work

Absolute Accuracy: mIoU remains relatively low (about 57% on RELLIS-3D and 46% on RUGD), reflecting the extreme difficulty of off-road segmentation.
F2P Neck: The pyramid neck uses standard transposed convolutions rather than modules optimized specifically for off-road features.
Empirical Indexing: The selection of non-adjacent layer indices is currently manual rather than learned or searched.

vs. Lightweight ViTs: While others reduce width, this work reduces depth and reuses MAE weights, suggesting that depth redundancy is the more important factor to address on off-road data.
vs. FPN-style Decoders: ProGRess uses recursive cascades (PLF) rather than single-shot lateral connections, achieving higher accuracy with about 85% fewer decoder parameters than UPerNet.

Rating

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐