OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

Conference: ECCV 2024
arXiv: 2411.02858
Code: olafseg.github.io
Area: Segmentation
Keywords: multi-part segmentation, plug-and-play, input augmentation, low-level features, weight adaptation

TL;DR

OLAF is a plug-and-play framework for multi-object multi-part segmentation. It adds foreground and edge masks as extra input channels, introduces a Low-level Dense Feature extraction module (LDF), and applies a targeted weight adaptation strategy. Without changing the base architecture, it brings significant gains to a range of segmentation networks (CNN/U-Net/Transformer), surpassing the SOTA by 4.0 mIoU on the highly challenging Pascal-Parts-201 dataset.

Background & Motivation

Background: Multi-object multi-part scene segmentation requires simultaneously segmenting multiple objects and their constituent parts in an image, which is key to achieving fine-grained scene understanding. This task is crucial for downstream applications such as robotic interaction, visual question answering, and object modeling.

Limitations of Prior Work: Although recent methods (e.g., FLOAT, BSANet, GMNet) are specifically designed for this task, they suffer from three major limitations:

Foreground Segmentation Errors: Object regions (the union of foregrounds) are frequently missegmented, leading to subsequent errors in internal part segmentation. For instance, FLOAT completely fails to recognize the bezel and screen of a television.

Loss of Boundary Details: Boundary information between objects and parts cannot be accurately captured, such as the boundaries between a car body and tires, or a car body and windows.

Missing Small/Thin Parts: Parts with small areas or elongated shapes are rarely segmented correctly, such as the eyes/tails of animals or the lights of vehicles.

Key Challenge: Existing methods usually learn foreground/boundary information as auxiliary tasks, which introduces the gradient conflict issue in multi-task loss optimization; meanwhile, the downsampling operations of the encoder lead to the loss of small part information in the feature space.

Key Insight: Instead of introducing auxiliary tasks at the loss function level, it is more effective to directly inject structural priors at the input level by utilizing foreground masks and boundary edges as additional input channels. Concurrently, a dedicated low-level feature module is designed to preserve the spatial details of small parts.

Core Idea: Transform object boundary information from an "auxiliary learning target" into an "input-stage structural inductive bias", enabling the model to perceive foreground regions and boundary locations from the very beginning of training, while utilizing the LDF module to mitigate the damage of downsampling on small part information.

Method

Overall Architecture

OLAF is a plug-and-play enhancement framework consisting of three complementary components:

1. Input Channel Augmentation: RGB 3 channels \(\rightarrow\) 5 channels (+foreground mask +edge mask) to provide object-level structural priors for the segmentation network.
2. LDF Encoder Module: Extracts dense low-level information from the shallow features of the backbone, specifically designed for small/thin parts.
3. Weight Adaptation Strategy: Enables pre-trained 3-channel models to stably process 5-channel inputs.

These three components can be applied to any segmentation architecture (DeepLabV3, BSANet, GMNet, FLOAT, Segformer, etc.) without modifying the core design of the base networks.

Key Designs

1. Foreground and Edge Masks as Input Channels

  • Function: Concatenate the binary foreground mask and the foreground edge mask, both generated by pre-trained models, to the RGB image along the channel dimension, forming an \(H \times W \times 5\) input.
  • Mechanism:
    • Foreground mask \(fb(x,y)\): Use a pre-trained object segmentation network to obtain object predictions, and merge the predicted regions of all object classes into a binary foreground/background map: $$fb(x,y) = \begin{cases} 1, & \text{if } P(x,y) \in C \text{ and } P(x,y) \neq 0 \\ 0, & \text{otherwise} \end{cases}$$ where \(P(x,y)\) is the predicted label at pixel \((x,y)\), \(C\) is the set of object classes, and label 0 is taken to denote background.
    • Edge mask \(edge\): Use the HED edge detection network to obtain an initial edge map, and then filter out edges in the background regions using the foreground mask: $$edge = \mathbb{I}[edge_{initial} > 0] \odot fb$$ where \(\odot\) denotes element-wise multiplication, and \(\mathbb{I}\) is the indicator function. This yields a binary edge map that only retains edges within the foreground regions.
  • Design Motivation: Traditional methods treat foreground/edge learning as auxiliary tasks, which suffer from multi-task gradient conflicts and rely on ad-hoc loss scaling. Feeding these signals directly as input channels acts as a structural inductive bias for the task: it offers boundary guidance throughout the entire optimization process and avoids gradient interference from auxiliary losses.
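
The two mask definitions above can be illustrated with a minimal NumPy sketch. It assumes the object-segmentation output is already available as an integer label map in which 0 means background and every non-zero label is an object class, and that the edge detector output is a real-valued response map. The function name `build_five_channel_input` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def build_five_channel_input(rgb, object_pred, edge_initial):
    """Stack RGB with a binary foreground mask and a foreground-only edge mask.

    rgb          : (H, W, 3) float array
    object_pred  : (H, W) int label map; 0 = background (assumption),
                   any non-zero label is treated as an object class
    edge_initial : (H, W) float array, raw edge response
    """
    # fb(x, y) = 1 where an object class is predicted, 0 otherwise
    fb = (object_pred != 0).astype(rgb.dtype)
    # edge = I[edge_initial > 0] multiplied element-wise by fb
    edge = (edge_initial > 0).astype(rgb.dtype) * fb
    # Append both masks as channels 4 and 5
    return np.concatenate([rgb, fb[..., None], edge[..., None]], axis=-1)

# Toy example: a 2x2 object in a 4x4 image, with an edge line on row 1
rgb = np.random.rand(4, 4, 3).astype(np.float32)
object_pred = np.zeros((4, 4), dtype=np.int64)
object_pred[1:3, 1:3] = 7            # hypothetical object class id
edge_initial = np.zeros((4, 4), dtype=np.float32)
edge_initial[1, :] = 0.9             # crosses both foreground and background

x5 = build_five_channel_input(rgb, object_pred, edge_initial)
print(x5.shape)
```

In the toy example the edge line spans the whole row, but only the two pixels that fall inside the object survive in the edge channel, which is the background filtering described above.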

2. Low-level Dense Feature Extraction Module (LDF)

  • Function: Extract shallow features from the first two blocks of the backbone network and process them to provide the decoder with dense low-level feature guidance.
  • Mechanism:
    • Take features \(x_1\) and \(x_2\) from the first and second blocks of the backbone network.
    • Apply a \(3 \times 3\) convolution to enhance \(x_1\). Apply a \(3 \times 3\) convolution + upsampling to \(x_2\) to match the size of \(x_1\), and then concatenate them.
    • Pass the concatenated features into an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale contextual information.
    • Finally, reduce the dimension using a \(1 \times 1\) convolution. $$feat(x_1, x_2) = Conv_{3 \times 3}(x_1) \oplus UP(Conv_{3 \times 3}(x_2))$$ $$LDF(x_1, x_2) = Conv_{1 \times 1}(ASPP(feat(x_1, x_2)))$$ where \(\oplus\) represents the concatenation operation, and \(UP(\cdot)\) denotes upsampling + \(1 \times 1\) convolution + BN + ReLU.
  • Design Motivation: Segmentation encoders typically operate at 1/8 or 1/16 resolution, where heavy downsampling and pooling discard much of the information about small parts. Traditional skip connections can deliver shallow information, but the raw early features are often too coarse and lack effective semantic context for small parts. The key difference of LDF is that it captures multi-scale context via ASPP while operating at the original resolution of the shallow features, thereby retaining sufficient spatial details for small/thin parts.
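
To make the data flow of the two LDF equations concrete, here is a shape-level NumPy sketch with random weights. It is not the authors' implementation: BN is omitted, ASPP is reduced to three parallel dilated \(3 \times 3\) branches, and the channel widths, dilation rates, and feature sizes are all assumed values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, out_ch, k=1, dilation=1):
    """'Same'-padded k x k convolution with random weights on a (C, H, W) array."""
    c, h, w = x.shape
    weight = rng.standard_normal((out_ch, c, k, k)) * 0.1
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((out_ch, h, w))
    for i in range(k):
        for j in range(k):
            di, dj = i * dilation, j * dilation
            patch = xp[:, di:di + h, dj:dj + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], patch)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def upsample_nearest(x, size):
    """Nearest-neighbour resize of a (C, H, W) array to spatial `size`."""
    h, w = x.shape[1:]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[:, rows][:, :, cols]

def up(x, size, out_ch):
    # UP(.): upsampling + 1x1 conv + ReLU (BN omitted in this sketch)
    return relu(conv2d(upsample_nearest(x, size), out_ch, k=1))

def aspp(x, out_ch, rates=(1, 2, 4)):
    # Simplified ASPP: parallel dilated 3x3 branches concatenated on channels
    branches = [relu(conv2d(x, out_ch, k=3, dilation=r)) for r in rates]
    return np.concatenate(branches, axis=0)

def ldf(x1, x2, mid_ch=8, out_ch=16):
    # feat = Conv3x3(x1) concatenated with UP(Conv3x3(x2))
    a = conv2d(x1, mid_ch, k=3)
    b = up(conv2d(x2, mid_ch, k=3), x1.shape[1:], mid_ch)
    feat = np.concatenate([a, b], axis=0)
    # LDF = Conv1x1(ASPP(feat))
    return conv2d(aspp(feat, mid_ch), out_ch, k=1)

x1 = rng.standard_normal((4, 16, 16))  # block-1 features (assumed shape)
x2 = rng.standard_normal((8, 8, 8))    # block-2 features at half resolution
out = ldf(x1, x2)
print(out.shape)
```

The point of the sketch is that the output keeps the spatial size of \(x_1\): the deeper feature is brought up to the shallow resolution rather than the shallow feature being downsampled.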

3. Weight Adaptation

  • Function: Enable the RGB pre-trained model to stably accept 5-channel inputs.
  • Mechanism: For the convolution kernel of the input layer, calculate the mean of the weights from the 3 RGB channels along the channel dimension. Use this average value to initialize the weights associated with the 2 newly added channels (foreground + edge). Add a warm-up of \(n_{warm}=5\) epochs at the beginning of the optimization.
  • Design Motivation: Randomly initializing the weights of all input channels leads to training instability (the Random-5 scheme achieves only 35.2 mIoU), because error gradients from random weights damage the pre-trained weights of the subsequent layers in the backbone. Compared with other alternatives (such as Average-RGB-5 and Adapt-n-Freeze), OLAF's scheme adapts to the new channels most effectively while preserving pre-trained knowledge.
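
The mean-initialization step can be sketched in a few lines of NumPy, assuming the first-layer kernel is stored as an array of shape `(out_ch, 3, kH, kW)`. The helper name `expand_input_conv` and the kernel size are hypothetical, and the 5-epoch warm-up is not modelled here.

```python
import numpy as np

def expand_input_conv(weight_rgb, n_new=2):
    """Extend a (out_ch, 3, kH, kW) kernel to (out_ch, 3 + n_new, kH, kW).

    The pre-trained RGB slices are kept as-is; each new input channel is
    initialized with the mean of the three RGB slices.
    """
    mean_w = weight_rgb.mean(axis=1, keepdims=True)   # (out_ch, 1, kH, kW)
    new_w = np.repeat(mean_w, n_new, axis=1)          # (out_ch, n_new, kH, kW)
    return np.concatenate([weight_rgb, new_w], axis=1)

rng = np.random.default_rng(0)
w3 = rng.standard_normal((64, 3, 7, 7))   # stand-in for pre-trained weights
w5 = expand_input_conv(w3)
print(w5.shape)
```

Because the RGB slices are untouched, the first layer responds to the RGB part of the input exactly as before, and the new channels start from weights on the same scale as the pre-trained ones rather than from noise.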

Loss & Training

OLAF does not modify the loss function of the base models, directly employing the original training configurations (hyperparameters, data augmentation, pre-trained backbones) of each base method. It only requires an additional 5-epoch warm-up for weight adaptation. All experiments were conducted on NVIDIA A100 GPUs.

Key Experimental Results

Main Results

| Dataset | Metric | FLOAT (SOTA) | FLOAT + OLAF | Gain |
|---|---|---|---|---|
| Pascal-Parts-58 | mIoU | 61.0 | 62.7 | +3.3 (vs DeepLabV3) |
| Pascal-Parts-58 | sqIoU | 54.2 | 55.4 | +1.2 |
| Pascal-Parts-108 | mIoU | 48.0 | 50.3 | +3.5 (vs DeepLabV3) |
| Pascal-Parts-108 | sqIoU | 40.5 | 43.4 | +2.9 |
| Pascal-Parts-201 | mIoU | 46.6 | 49.6 | +4.0 (vs DeepLabV3) |
| Pascal-Parts-201 | sqIoU | 39.2 | 41.9 | +4.8 (vs DeepLabV3) |
| PartImageNet | mIoU | 61.44 (Compositor) | 65.46 (Segformer + OLAF) | +4.0 |

FLOAT† + OLAF using a ViT-H backbone achieves further improvements: 64.3 mIoU on PP-58, 51.5 on PP-108, and 50.7 on PP-201.

Cross-Architecture Validation: OLAF yields consistent gains across all tested baselines:

| Baseline Method | Architecture Type | PP-201 mIoU Gain |
|---|---|---|
| DeepLabV3 | CNN | +3.4 |
| GMNet | CNN+GCN | +4.5 |
| BSANet | CNN+Boundary-aware | +3.4 |
| FLOAT | CNN+Label-dependency | +3.0 |
| Segformer | Transformer | +4.9 (PartImageNet) |

Ablation Study

| Configuration | mIoU | sqIoU | mIoU_small | Description |
|---|---|---|---|---|
| Baseline FLOAT† | 37.7 | 30.8 | 24.0 | Without OLAF |
| +LDF | 38.8 | 31.8 | 25.7 | LDF is most effective for small parts |
| +Edge | 38.9 | 32.2 | 24.5 | Edge channel |
| +Fg/Bg | 39.1 | 32.0 | 24.6 | Foreground channel |
| +Edge+Fg/Bg | 39.2 | 32.2 | 24.8 | Combination of both channels |
| OLAF (All) | 40.9 | 34.3 | 26.9 | Three-component synergy is optimal |

Weight Adaptation Comparison:

| Scheme | mIoU | Description |
|---|---|---|
| Random-5 | 35.2 | Unstable training |
| Average-RGB-5 | 36.3 | Improved but still insufficient |
| Adapt-n-Freeze | 38.2 | Multi-stage training |
| Random-2 | 40.2 | Only new channels randomized |
| OLAF (Mean initialization + warmup) | 40.9 | Most stable and optimal |

Alternative Input Channel Schemes:

  • Replacing foreground mask with SAM: mIoU=40.5 (slightly lower than the default object segmentation network)
  • Replacing edge mask with EDTER/Canny: mIoU=39.5/39.0 (HED performs best)
  • Adding depth map channel (6 channels): mIoU=40.7 to 40.8 (almost no additional gain; depth information has limited help in distinguishing parts)

Key Findings

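  • The Components Are Complementary: LDF, Edge, and Fg each contribute 1.1 to 1.4 mIoU individually, and the full combination yields a 3.2 boost. This exceeds the 2.6 obtained by adding the LDF gain (+1.1) to the Edge+Fg gain (+1.5), although it is below the plain sum of the three individual gains (3.7).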
  • LDF Contributes the Most to Small Parts: mIoU_small improves from 24.0 to 25.7 (+1.7), representing the largest gain for small parts among all single components.
  • Weight Adaptation is a Necessary Condition: Inappropriate adaptation schemes (e.g., Random-5) degrade performance by 2.5 mIoU relative to the baseline (35.2 vs 37.7).
  • Modest Computational Cost: Parameter count increases by 1.5% to 20%, training time increases by 5% to 10%, and inference time increases by 0.26 s (on the FLOAT baseline).
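
The gains quoted in these findings can be re-derived from the ablation and weight-adaptation tables. The short script below does only that arithmetic; the dictionary keys are labels made up for this example, and the numbers are copied from the tables.

```python
# mIoU values from the ablation table (FLOAT baseline, PP-201 setting)
miou = {
    "baseline": 37.7,
    "ldf": 38.8,
    "edge": 38.9,
    "fg": 39.1,
    "edge_fg": 39.2,
    "olaf": 40.9,
}
random5_miou = 35.2                       # from the weight-adaptation table
miou_small = {"baseline": 24.0, "ldf": 25.7}

def gain(key):
    """mIoU gain of a configuration over the baseline, rounded to 0.1."""
    return round(miou[key] - miou["baseline"], 1)

single_gains = {k: gain(k) for k in ("ldf", "edge", "fg")}
combined_gain = gain("olaf")
# Adding the LDF gain to the gain of the two input channels used together
additive_estimate = round(gain("ldf") + gain("edge_fg"), 1)
random5_drop = round(miou["baseline"] - random5_miou, 1)
ldf_small_gain = round(miou_small["ldf"] - miou_small["baseline"], 1)

print(single_gains)
print(combined_gain, additive_estimate)
print(random5_drop, ldf_small_gain)
```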

Highlights & Insights

  • "Input as Prior" Design Philosophy: Transferring structural prior information from auxiliary task losses to input channels is a simple and effective approach that avoids gradient conflicts in multi-task learning, demonstrating strong generalizability.
  • Plug-and-Play: Truly architecture-agnostic; it requires no modification to the underlying network structure and can be directly applied to typical CNN, U-Net, and Transformer segmentation architectures.
  • Multi-scale Context Design of LDF: Applying ASPP to shallow features is a key innovation. Compared to simple skip connections, ASPP provides rich multi-scale semantic context, preventing the shallow features from being "too coarse".
  • Valuable Negative Finding from the Depth Map Experiment: Adding a depth channel yields almost no additional gain. A plausible explanation is that depth information helps distinguish object-level boundaries but offers limited assistance for part-level segmentation, since depth variations between parts of the same object are small.

Limitations & Future Work

  • Dependency on Input Channel Quality: The foreground/edge masks are generated by pre-trained models. If these models perform poorly on certain classes (e.g., SAM's poor segmentation of potted plants), the performance of OLAF will be affected.
  • Increased Inference Pre-processing Overhead: Running the auxiliary object segmentation net and edge detection net is required to generate the input masks.
  • Fixed Channel Expansion: The input is currently fixed to 5 channels (3+2), without exploring the possibility of adaptively choosing auxiliary channels.
  • Limited Dataset Coverage: Validation covers the Pascal-Part series and PartImageNet; experiments on more diverse scenarios (e.g., ADE20K part annotations) are still lacking.

Related Work & Comparison

  • vs FLOAT: FLOAT simplifies the task using label space decomposition to reduce the number of output heads, but does not address the lack of information on the input side. OLAF supplements structural priors from the input end, which is complementary to and stackable with FLOAT.
  • vs BSANet: BSANet learns boundary awareness via auxiliary tasks but suffers from multi-task gradient conflicts. OLAF directly injects boundary information into the input, which is more straightforward and effective.
  • vs Compositor: Compositor exhibits strong performance on PartImageNet, but OLAF-enhanced Segformer surpasses it in a plug-and-play manner.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea is simple yet effective, and the perspective of shifting structural priors from auxiliary tasks to input channels is highly inspiring.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Cross-architecture, cross-dataset, comprehensive ablations, and detailed comparisons of alternative schemes.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, well-explained motivation, and intuitive diagrams.
  • Value: ⭐⭐⭐⭐ The plug-and-play design offers strong applicability and low computational cost, making it directly deployable.