Occlusion-Aware Seamless Segmentation¶

Conference: ECCV 2024
arXiv: 2407.02182
Code: https://github.com/yihong-97/OASS
Area: Image Segmentation
Keywords: Panoramic Segmentation, Occlusion-Aware, Seamless Segmentation, Unsupervised Domain Adaptation, Amodal Segmentation

TL;DR¶

Proposes a new task, Occlusion-Aware Seamless Segmentation (OASS), and the UnmaskFormer framework to address three challenges at once: overcoming the narrow field of view of pinhole images by moving to 360° panoramas, amodal segmentation of the complete shape of occluded objects, and pinhole-to-panoramic domain adaptation. UnmaskFormer achieves SOTA performance on the newly constructed BlendPASS dataset.

Background & Motivation¶

Background: Panoramic scene understanding and occlusion-aware amodal segmentation have both made individual progress, but these two directions have developed independently for a long time. Panoramic segmentation methods (e.g., Trans4PASS, DATR) can handle 360° image distortion but cannot reason about occluded objects, whereas amodal methods (e.g., ORCNN, SLN) can predict complete occluded silhouettes but fail to generalize to panoramic images.

Limitations of Prior Work:
- Panoramic images exhibit severe distortions, leading to a drastic performance drop when traditional segmentation models are applied directly.
- Existing semantic/instance segmentation can only predict visible regions, failing to reason about the complete shape of occluded areas.
- Panoramic image annotation is extremely expensive (approx. 210 minutes per image), resulting in a scarcity of labels and necessitating unsupervised domain adaptation (UDA) to transfer knowledge from the label-rich pinhole domain.

Key Challenge: The limited FoV, object occlusion within the field of view, and the cross-domain gap are entangled. Existing methods address only one of these issues at a time and therefore cannot deliver a "seamless", comprehensive understanding of the scene.

Goal: Unify and solve three "masking" issues: (1) unmasking narrow FoV \(\rightarrow\) panoramic 360°; (2) unmasking object occlusion \(\rightarrow\) amodal complete segmentation; (3) unmasking domain gap \(\rightarrow\) UDA from pinhole to panoramic.

Key Insight: Define a brand new task named OASS, construct a dedicated dataset BlendPASS, and design a unified framework UnmaskFormer to solve distortion handling, occlusion reasoning, and domain adaptation in one go.

Core Idea: Process distortion and occlusion via Unmasking Attention, and enhance cross-domain adaptation and occlusion reconstruction through Amodal-oriented Mix, seamlessly performing five types of segmentation tasks within a single transformer framework.

Method¶

Overall Architecture¶

UnmaskFormer consists of three major components:
- UA-based Backbone: A four-stage Transformer-based feature extractor containing Deformable Patch Embedding (DPE) and Unmasking Attention (UA) to simultaneously handle panoramic distortions and object occlusions.
- Three-branch Decoder: A semantic branch (per-pixel semantic classification), an instance branch (top-down instance segmentation based on Mask R-CNN), and an amodal instance branch (predicting complete occluded regions).
- OAFusion Module: Fuses the outputs of the three branches to generate five types of segmentation results simultaneously: semantic, instance, amodal instance, panoptic, and amodal panoptic segmentation.

Key Designs¶

Unmasking Attention (UA):
- Function: Introduces an enhanced pooling layer after self-attention to generate occlusion-aware features.
- Mechanism: After the feature vector \(\boldsymbol{X'}\) passes through self-attention, global average pooling is used to obtain the pooled query \(\boldsymbol{q} = GAP(\boldsymbol{X'})\). Then, cross-attention generates \(\boldsymbol{q'} \in \mathbb{R}^{1 \times 1 \times C}\). A sigmoid function \(\phi(\boldsymbol{q'})\) is applied to produce an occlusion-aware mask, which is element-wise multiplied by the original features to obtain the occlusion-aware features \(\boldsymbol{X''} = \phi(\boldsymbol{q'}) \odot \boldsymbol{X'}\).
- Design Motivation: Global pooling captures global image context, while the sigmoid mask selectively enhances or suppresses feature channels, enabling the network to learn to focus on information in occluded areas.
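
The gating steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `unmasking_gate` is hypothetical, the learned query/key/value projections are replaced by identity maps, and the cross-attention is assumed to use the pooled query against the tokens of \(\boldsymbol{X'}\) as keys and values.

```python
import math


def unmasking_gate(x):
    """Apply a UA-style channel gate to self-attention output X'.

    x: list of N token vectors, each of length C.
    Returns (gated tokens X'', per-channel gate phi(q')).
    """
    n, c = len(x), len(x[0])

    # q = GAP(X'): average over the N tokens, one value per channel.
    q = [sum(tok[ch] for tok in x) / n for ch in range(c)]

    # Cross-attention with q as the single query.
    # Assumption: keys and values are the tokens of X' with identity projections.
    scores = [sum(a * b for a, b in zip(q, tok)) / math.sqrt(c) for tok in x]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    attn = [e / total for e in exps]
    q_prime = [sum(w * tok[ch] for w, tok in zip(attn, x)) for ch in range(c)]

    # phi(q'): sigmoid turns q' into a per-channel mask in (0, 1).
    gate = [1.0 / (1.0 + math.exp(-v)) for v in q_prime]

    # X'' = phi(q') * X', with the gate broadcast over all tokens.
    gated = [[g * v for g, v in zip(gate, tok)] for tok in x]
    return gated, gate


tokens = [[1.0, -2.0], [3.0, 0.0]]
gated_tokens, channel_gate = unmasking_gate(tokens)
```

Because the gate has one value per channel, it rescales whole feature channels rather than individual spatial positions, which matches the "selectively enhances or suppresses feature channels" description.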
Interleaved DPE:
- Function: Moves Deformable Patch Embedding from being used solely in the initial stage to being interleaved with standard patch embedding, placing it in Stage 2 and Stage 4.
- Mechanism: DPE captures local geometric variations through learnable adaptive offsets \(\boldsymbol{\Delta}^{DPE}(i,j)\). Using DPE in late stages is more effective than in early ones.
- Design Motivation: Panoramic images exhibit distortions at different feature levels; addressing them only in the initial stage is insufficient. Experiments show that employing DPE in deeper stages (Stage 2, 4) yields better results than in shallower stages (Stage 1, 3).
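
To make the role of the offsets concrete, the sketch below samples a feature map at positions shifted by per-location offsets, using bilinear interpolation. It is only an illustration of offset-based sampling: the names `bilinear` and `deformable_sample` are hypothetical, and the offsets are passed in by hand, whereas in DPE they would be predicted by a learned layer.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x), clamped to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def deformable_sample(feat, offsets):
    """Resample feat at (i, j) + offsets[i][j].

    offsets[i][j] = (dy, dx) is a hand-written stand-in for the learned
    offset Delta^DPE(i, j).
    """
    h, w = len(feat), len(feat[0])
    return [
        [bilinear(feat, i + offsets[i][j][0], j + offsets[i][j][1]) for j in range(w)]
        for i in range(h)
    ]


feature_map = [[0.0, 1.0], [2.0, 3.0]]
zero_offsets = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
unchanged = deformable_sample(feature_map, zero_offsets)
```

With all offsets at zero the output equals the input, so a regular patch grid is the special case of a deformable one.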
Amodal-oriented Mix (AoMix):
- Function: A cross-domain data augmentation strategy that utilizes amodal annotations to generate occluded training samples and blends source-target domain images.
- Mechanism:
  - Randomly sample amodal instance masks \(\{M_r^{(i)}\}_{i=1}^z\), and apply random scaling \(RS(\cdot)\) and random padding \(RP(\cdot)\) to generate a new mask \(M_r = H(\sum_i RP(RS(M_r^{(i)})))\).
  - Occlude the "Thing" regions of the source image using \(M_r\): \(\hat{x}_s = (1 - M_r \cap M_s) \odot x_s\).
  - Randomly sample half of the semantic classes from the occluded source image \(\hat{x}_s\) and paste them onto the target image \(x_t\) to generate the mixed image \(\hat{x}_m\).
- Design Motivation: Using real object shapes (rather than random patches) to create occlusions mimics real-world scenes more closely, while mixing the source and target domains alleviates the domain gap.
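
The three AoMix steps can be sketched on tiny nested-list images. This is a simplified illustration under stated assumptions, not the released code: the function name `aomix` is hypothetical, the amodal masks are assumed to be already rescaled and padded to image size (so \(RS\) and \(RP\) are omitted), \(H\) is taken to be a step function that binarises the summed masks, and images are single-channel.

```python
import random


def aomix(x_s, y_s, x_t, amodal_masks, thing_ids, rng):
    """Occlude source Thing pixels with amodal shapes, then class-mix onto the target.

    x_s, x_t: HxW source/target images; y_s: HxW source semantic labels.
    amodal_masks: list of HxW 0/1 masks (assumed already scaled and padded).
    Returns (occluded source, mixed image, set of pasted classes).
    """
    h, w = len(x_s), len(x_s[0])

    # M_r = H(sum_i mask_i): 1 wherever at least one sampled mask is active.
    m_r = [[1 if sum(m[i][j] for m in amodal_masks) > 0 else 0 for j in range(w)]
           for i in range(h)]

    # M_s: Thing pixels of the source image.
    m_s = [[1 if y_s[i][j] in thing_ids else 0 for j in range(w)] for i in range(h)]

    # x_hat_s = (1 - M_r AND M_s) * x_s: only Thing pixels get occluded.
    x_hat_s = [[(1 - m_r[i][j] * m_s[i][j]) * x_s[i][j] for j in range(w)]
               for i in range(h)]

    # Paste half of the source classes (rounded down) onto the target image.
    classes = sorted({c for row in y_s for c in row})
    chosen = set(rng.sample(classes, len(classes) // 2))
    x_m = [[x_hat_s[i][j] if y_s[i][j] in chosen else x_t[i][j] for j in range(w)]
           for i in range(h)]
    return x_hat_s, x_m, chosen


source = [[5, 6], [7, 8]]
labels = [[0, 1], [1, 2]]          # class 1 is the only Thing class here
target = [[9, 9], [9, 9]]
shapes = [[[1, 1], [0, 0]]]        # one amodal mask covering the top row
occluded, mixed, pasted = aomix(source, labels, target, shapes, {1}, random.Random(0))
```

In the toy input, the mask covers the whole top row, but only the top-right pixel is zeroed because the top-left pixel belongs to a Stuff class.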
OAFusion (Occlusion-Aware Fusion):
- Function: Fuses the outputs of the three branches to generate five types of segmentation results.
- Mechanism: Semantic segmentation is output directly. The categories of instances / amodal instances are determined by a majority vote from the semantic branch. The key improvement is that in amodal scenarios, the voting only considers regions that do not overlap with other objects.
- Design Motivation: Traditional fusion can cause misclassification when heavy occlusion occurs (e.g., a pedestrian heavily occluded by a car is misclassified as a car). OAFusion avoids this issue by ignoring overlapping regions during voting.
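
The voting rule is easy to illustrate with a toy example. The sketch below is an assumption-laden simplification rather than the authors' code: the helper name `vote_class` is hypothetical, and "other objects" is modelled as a list of binary masks of the other instances.

```python
from collections import Counter


def vote_class(sem, mask, others=()):
    """Majority vote of semantic labels inside an instance mask.

    sem: HxW semantic prediction; mask: HxW 0/1 (amodal) instance mask.
    others: masks of other instances; pixels they cover are skipped,
    which is the occlusion-aware part of the rule.
    Returns the winning class, or None if no pixel is eligible.
    """
    votes = Counter()
    for i, row in enumerate(mask):
        for j, inside in enumerate(row):
            if inside and not any(o[i][j] for o in others):
                votes[sem[i][j]] += 1
    return votes.most_common(1)[0][0] if votes else None


# A pedestrian whose amodal mask spans all six pixels, four of them behind a car.
semantic = [["car", "car", "person"],
            ["car", "car", "person"]]
pedestrian_amodal = [[1, 1, 1], [1, 1, 1]]
car_mask = [[1, 1, 0], [1, 1, 0]]

naive = vote_class(semantic, pedestrian_amodal)                  # "car"
aware = vote_class(semantic, pedestrian_amodal, [car_mask])      # "person"
```

The naive vote counts four "car" pixels against two "person" pixels, while the occlusion-aware vote only sees the two non-overlapping pixels and recovers the correct class.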

Loss & Training¶

Source domain supervised loss \(\mathcal{L}_S\): Cross-entropy loss for the semantic branch, and standard Mask R-CNN losses (bbox + mask) for the instance and amodal branches.
Target domain self-training loss \(\mathcal{L}_T = -\omega \sum_{h,w,c} p_t^{(h,w,c)} \log \hat{y}_t^{(h,w,c)}\), where the pseudo-labels are generated by an EMA teacher in the Mean Teacher style, and the weight \(\omega\) is estimated from a confidence threshold \(\tau = 0.968\).
Total loss: \(\mathcal{L}_{total} = \mathcal{L}_S + \mathcal{L}_T\).
Training configuration: AdamW optimizer, lr = \(6 \times 10^{-5}\), weight decay 0.01, batch size 4, crop size 376×376, trained for 40k iterations.
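
A small numeric sketch of the self-training loss may help. Several details here are assumptions, since the summary only says that \(\omega\) is estimated from the threshold \(\tau\): \(\omega\) is taken to be the fraction of pixels whose teacher confidence exceeds \(\tau\) (as in DAFormer-style self-training), the pseudo-label is the teacher's argmax class, and the cross-entropy is evaluated on the student's probabilities. The function name `self_training_loss` is hypothetical.

```python
import math


def self_training_loss(teacher_probs, student_probs, tau=0.968):
    """Confidence-weighted cross-entropy on pseudo-labels.

    teacher_probs, student_probs: one class-probability list per pixel.
    Returns (loss, omega).
    """
    # Hard pseudo-label per pixel: the teacher's most likely class.
    labels = [p.index(max(p)) for p in teacher_probs]

    # Assumption: omega = share of pixels with teacher confidence above tau.
    omega = sum(max(p) > tau for p in teacher_probs) / len(teacher_probs)

    # -sum over pixels of log(student probability of the pseudo-label);
    # with one-hot pseudo-labels the sum over classes keeps a single term.
    cross_entropy = -sum(math.log(s[l]) for s, l in zip(student_probs, labels))
    return omega * cross_entropy, omega


teacher = [[0.99, 0.01], [0.60, 0.40]]   # only the first pixel is confident
student = [[0.50, 0.50], [0.50, 0.50]]
loss, omega = self_training_loss(teacher, student)
```

With a threshold as high as 0.968, an unconfident teacher yields a small \(\omega\) and the target loss contributes little, so early noisy pseudo-labels are down-weighted.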

Key Experimental Results¶

Main Results (KITTI360-APS → BlendPASS)¶

Method	mIoU (SS)	mAPQ (APS)	mAP (Instance)	mAAP (Amodal)
DATR	34.91%	20.26%	8.66%	8.68%
Trans4PASS	40.66%	22.94%	10.01%	9.85%
UniDAPS	38.46%	—	3.43%	n.a.
EDAPS	40.17%	23.14%	10.28%	10.68%
Source-Only	38.65%	22.13%	10.54%	10.22%
UnmaskFormer	43.66%	26.58%	11.10%	10.50%

Panoramic Semantic Segmentation (Other Datasets)¶

Dataset	Metric	UnmaskFormer	Prev. SOTA	Gain
SynPASS	mIoU	45.34%	44.80% (Trans4PASS)	+0.54%
DensePASS	mIoU	48.08%	45.89% (Trans4PASS)	+2.19%

Ablation Study¶

Component Config	mIoU	mAPQ	Description
PE (baseline)	41.07%	22.00%	Original Patch Embedding
DPE (early stage)	40.70%	22.90%	DPE only in early stage
AvgPool	42.10%	23.90%	Standard Average Pooling
SimPool	43.04%	24.74%	SimPool replacement
UA (ours)	43.06%	25.04%	Unmasking Attention
UA + AoMix	42.39%	25.17%	Adding AoMix
UA + AoMix + OAFusion	43.66%	26.58%	Full model

Ablation of AoMix Strategies¶

Strategy	mIoU	mAPQ	Description
T for S	42.53%	23.98%	Occlusion only on source image
T for M	43.18%	24.78%	Occlusion only on mixed image
P for S&M	42.52%	21.87%	Occlusion with random patches
W for S&M	41.64%	24.12%	Occlude all regions
AoMix (ours)	42.39%	25.17%	Occlude only Thing categories

Key Findings¶

UA achieves a +3.04% improvement in mAPQ compared to the baseline PE, demonstrating the effectiveness of the occlusion-aware pooling attention.
AoMix using real object shapes for occlusion achieves significantly better results than random patch-based occlusion (mAPQ +3.3%), validating the importance of amodal-oriented augmentation.
OAFusion targets the misclassification that traditional fusion produces in heavily occluded scenarios; adding it lifts mAPQ from 25.17% to 26.58% and mIoU from 42.39% to 43.66%.
DPE performs better in the deep stages (Stages 2, 4) compared to the early stages (Stages 1, 3).
UnmaskFormer has only 13.96M parameters, which is comparable to Trans4PASS (13.93M) but delivers significantly better performance.
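
The gains quoted in this section follow directly from the tables. The snippet below simply recomputes them from the reported numbers; the dictionary names are illustrative, and the values are copied from the tables as absolute percentage-point differences.

```python
# mAPQ values (percent) copied from the ablation tables.
ablation_mapq = {
    "PE": 22.00,
    "UA": 25.04,
    "UA + AoMix": 25.17,
    "UA + AoMix + OAFusion": 26.58,
}
aomix_mapq = {"P for S&M": 21.87, "AoMix": 25.17}

# mIoU (percent) as (UnmaskFormer, previous SOTA).
other_datasets = {"SynPASS": (45.34, 44.80), "DensePASS": (48.08, 45.89)}


def gain(new, old):
    """Absolute difference in percentage points, rounded to two decimals."""
    return round(new - old, 2)


ua_gain = gain(ablation_mapq["UA"], ablation_mapq["PE"])
shape_vs_patch = gain(aomix_mapq["AoMix"], aomix_mapq["P for S&M"])
oafusion_gain = gain(ablation_mapq["UA + AoMix + OAFusion"], ablation_mapq["UA + AoMix"])
dataset_gains = {name: gain(ours, prev) for name, (ours, prev) in other_datasets.items()}
```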

Highlights & Insights¶

Novelty of Task Definition: OASS unifies panoramic segmentation, amodal segmentation, and UDA into a single task for the first time, filling a gap in the literature.
BlendPASS Dataset: Contains 2,000 panoramic images for training and 100 finely annotated ones for testing, including 2,960 Thing instances where 43% contain occlusions. The annotation quality is guaranteed by three-person cross-validation.
Simple Yet Effective UA Design: Simply adding a pooling attention layer after self-attention significantly boosts the occlusion-awareness capacity.
Practical Value of OAFusion: Solves the fundamental flaw of semantic voting in amodal scenarios, ensuring heavily occluded objects are no longer misclassified into the category of the occluding object.
A single model simultaneously produces five types of segmentation outputs (semantic/instance/amodal instance/panoptic/amodal panoptic) under a unified and efficient design.

Limitations & Future Work¶

The test set contains only 100 images, which is relatively small and might lead to performance evaluation fluctuations.
Panoramic image annotation is extremely costly (210 mins/image/person), making it difficult to scale up the dataset.
Performance on certain minor categories (e.g., van, traffic-light) remains quite low, showing that cross-domain adaptation struggles with long-tailed classes.
Only validated on driving scenarios; indoor panoramic scenes are not yet explored.
AoMix relies on source-domain amodal annotations, which limits the model's generalizability.
The mAAP of amodal instance segmentation is only 10.50%, leaving considerable room for improvement.

Related Work & Insights¶

Trans4PASS [CVPR 2022]: Proposes DPE to handle panoramic distortions, serving as the baseline for UnmaskFormer's backbone.
EDAPS [CVPR 2023]: SOTA for UDA panoramic segmentation, upon which UnmaskFormer builds to incorporate amodal capabilities.
DAFormer/DACS: Offers the framework for self-training and class-mix UDA strategies.
ORCNN: An amodal segmentation method that infers occluded masks through visible masks.
Insight: Integrating amodal information into data augmentation (AoMix) is a clever design, which is more suited for occlusion scenarios than traditional CutMix/ClassMix.

Rating¶

Novelty: ⭐⭐⭐⭐ Defines the OASS task for the first time, unifying three independent challenges through a pioneering task formulation.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on three datasets with detailed ablation and rich visualizations, though the test set scale is relatively small.
Writing Quality: ⭐⭐⭐⭐ Clear logic and well-designed figures, with the triple "unmasking" concept threaded consistently through the narrative.
Value: ⭐⭐⭐⭐ The dataset and benchmarks hold long-term value, though absolute performance remains relatively low, leaving a gap for practical deployment.