Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance

Conference: CVPR 2026 · arXiv: 2603.07570 · Code: N/A · Area: RGB-D Scene Understanding / Multi-task Learning / Panoptic Segmentation · Keywords: multi-task learning, RGB-D fusion, panoptic segmentation, adaptive loss, cross-dimensional guidance

TL;DR

This paper proposes an efficient RGB-D multi-task scene understanding network. A partial-channel convolution fusion encoder cuts fusion FLOPs to 1/16 of a standard convolution's. A Normalized Focus Channel Layer (NFCL) and a Context Feature Interaction Layer (CFIL) enable cross-dimensional feature guidance. A batch-level multi-task adaptive loss dynamically balances five tasks. The method achieves 49.82 mIoU on NYUv2 at 20.33 FPS, 24% faster than EMSAFormer.

Background & Motivation

Background: Robotic scene understanding requires simultaneous execution of multiple tasks including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. RGB-D data fusion has become the mainstream solution, yet efficiently fusing two modalities while jointly optimizing multiple tasks remains an open problem.

Limitations of Prior Work: (1) Dual-encoder architectures (e.g., EMSANet) incur high computational costs and underutilize inter-modal complementary information; (2) Transformer-based encoders (e.g., EMSAFormer with Swin v2) involve intensive matrix operations and frequent memory access, limiting inference to 16 FPS; (3) MLP decoder structures are simple and efficient but susceptible to noise from shallow features; (4) Fixed multi-task loss weights cannot adapt to the dynamic changes in task learning throughout training.

Key Challenge: The trade-off between multi-task performance and inference speed — how to substantially improve inference efficiency without sacrificing task accuracy.

Goal: Design an efficient RGB-D multi-task network that simultaneously addresses modality fusion efficiency, the shallow-feature misleading problem in MLP decoders, and dynamic balancing of multi-task loss weights.

Key Insight: Exploiting channel feature redundancy — applying convolution to only 1/4 of channels suffices to achieve full-channel performance, substantially reducing FLOPs and memory access.

Core Idea: Partial-channel convolution for efficient RGB-D fusion + cross-dimensional feature guidance for enhanced shallow representations + batch-level adaptive loss for dynamic multi-task balancing.
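
To make the 1/16 figure concrete, here is a minimal PyTorch sketch of a partial-channel convolution in the FasterNet style. The class name, the channel split, and the standalone 3×3 kernel are illustrative assumptions on my part, since the paper releases no code.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """3x3 conv on the first 1/4 of channels; the remaining 3/4 pass through.

    A full 3x3 conv costs ~9*C*C*H*W multiply-adds; convolving only
    C' = C/4 input and output channels costs (1/4)^2 = 1/16 of that.
    """
    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.conv_ch = channels // ratio  # C' = C/4 by default
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        i1, i2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat((self.conv(i1), i2), dim=1)  # F = Cat(Conv2d(I1), I2)

x = torch.randn(1, 64, 56, 56)
assert PartialConv(64)(x).shape == x.shape
```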

Method

Overall Architecture

The network accepts 4-channel RGBD input and extracts features through a single fusion encoder (based on FasterNet-M, 4 stages with 3/4/18/3 fusion blocks per stage). Encoder outputs are distributed to three branches: (1) a scene classification head (fully connected layers); (2) a semantic segmentation decoder (MLP + NFCL + CFIL, producing pixel-wise semantic labels); and (3) an instance segmentation decoder (three-layer non-bottleneck 1D modules outputting instance centers, offsets, and orientations). Semantic segmentation provides foreground masks to instance segmentation, and their combination forms panoptic segmentation. During training, a multi-task adaptive loss dynamically adjusts per-task learning weights.
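
As a shape-level illustration of this single-encoder, three-branch layout, here is a toy PyTorch skeleton. The encoder stand-in, the head output shapes (e.g., a 2-channel sin/cos orientation map), and all names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyRGBDMultiTaskNet(nn.Module):
    """Toy shape-level stand-in for the single-encoder, three-branch layout."""
    def __init__(self, num_classes: int = 40, num_scenes: int = 10, width: int = 64):
        super().__init__()
        # Stand-in for the FasterNet-M fusion encoder over 4-channel RGB-D input.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, width, 3, stride=4, padding=1), nn.BatchNorm2d(width), nn.ReLU(),
            nn.Conv2d(width, 2 * width, 3, stride=2, padding=1), nn.BatchNorm2d(2 * width), nn.ReLU(),
        )
        self.scene_head = nn.Linear(2 * width, num_scenes)         # scene classification
        self.semantic_head = nn.Conv2d(2 * width, num_classes, 1)  # pixel-wise labels
        self.center_head = nn.Conv2d(2 * width, 1, 1)              # instance center heatmap
        self.offset_head = nn.Conv2d(2 * width, 2, 1)              # center offsets (dx, dy)
        self.orient_head = nn.Conv2d(2 * width, 2, 1)              # orientation as (sin, cos)

    def forward(self, rgbd: torch.Tensor) -> dict:
        f = self.encoder(rgbd)
        return {
            "scene": self.scene_head(f.mean(dim=(2, 3))),  # global average pool -> FC
            "semantic": self.semantic_head(f),
            "center": self.center_head(f),
            "offset": self.offset_head(f),
            "orientation": self.orient_head(f),
        }

out = TinyRGBDMultiTaskNet()(torch.randn(2, 4, 480, 640))
print({k: tuple(v.shape) for k, v in out.items()})
```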

Key Designs

  1. Partial-Channel Fusion Encoder:

    • Function: Efficient fusion of RGB and depth features.
    • Mechanism: Exploiting the high similarity across channel features, each fusion block applies Conv2D to only 1/4 of the channels and concatenates the untouched 3/4 directly: \(F = \text{Cat}(\text{Conv2d}(I_1), I_2)\). Since convolution cost scales with \(C_{in} \times C_{out}\) and \(C' = C/4\), the partial convolution costs \((1/4)^2 = 1/16\) of the full convolution (cf. the sketch above). Two subsequent pointwise convolutions mix inter-channel information, and a residual connection closes the block. Depth weights are initialized as \(D=(R+G+B)/2\), reusing ImageNet-pretrained parameters.
    • Design Motivation: Frequent memory access is the bottleneck of conventional depthwise separable convolutions; partial-channel convolution reduces memory access while exploiting channel redundancy.
  2. Normalized Focus Channel Layer (NFCL) + Context Feature Interaction Layer (CFIL):

    • Function: NFCL filters shallow-feature noise; CFIL compensates for insufficient local-global fusion in MLP decoders.
    • Mechanism: NFCL reuses the learnable scaling factor \(\gamma\) from Batch Normalization as a channel importance measure. Channel weights are computed as \(W_i = |\gamma_i| / \sum_j |\gamma_j|\), and sigmoid gating suppresses shallow-feature noise. CFIL performs adaptive average pooling at two scales (1×1 and 5×5), compresses channels to \(C/2\), upsamples and concatenates with original features, then restores the channel count.
    • Design Motivation: MLP decoders depend heavily on encoder feature quality — NFCL eliminates shallow-feature misleading while CFIL supplements multi-scale context, so the two modules complement each other (a sketch of both layers follows this list).
  3. Multi-task Adaptive Loss:

    • Function: Real-time batch-level dynamic adjustment of per-task learning weights.
    • Mechanism: At each batch, the relative loss of each task is computed as \(RL_k = L_k / \sum_t L_t\); a historical mean \(\text{Avg}RL_k\) is maintained; weights are updated as \(W_k = \max(\bar{W}_k \times (\text{Avg}RL_k)^\alpha, W_{min})\), where \(\alpha=0.01\) controls sensitivity and \(W_{min}=0.1\) prevents any task from being ignored.
    • Design Motivation: Responds faster than epoch-level methods and adapts to intra-epoch data distribution shifts; more stable than stochastic weighting approaches (e.g., Lin et al.).
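
Below is a PyTorch sketch of the two decoder layers as described above. The exact wiring is an assumption on my part: in particular, whether the paper applies the 1×1 channel reduction before or after pooling, and where precisely the sigmoid gate sits, are not pinned down by the summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NFCL(nn.Module):
    """Normalized Focus Channel Layer: reuse BN's gamma as channel importance."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(x)
        gamma = self.bn.weight.abs()
        w = gamma / gamma.sum()                        # W_i = |g_i| / sum_j |g_j|
        return x * torch.sigmoid(w).view(1, -1, 1, 1)  # sigmoid gating of channels

class CFIL(nn.Module):
    """Context Feature Interaction Layer: two-scale pooled context, re-fused."""
    def __init__(self, channels: int):
        super().__init__()
        self.reduce1 = nn.Conv2d(channels, channels // 2, 1)  # compress to C/2
        self.reduce5 = nn.Conv2d(channels, channels // 2, 1)
        self.restore = nn.Conv2d(channels * 2, channels, 1)   # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        c1 = self.reduce1(F.adaptive_avg_pool2d(x, 1))  # global (1x1) context
        c5 = self.reduce5(F.adaptive_avg_pool2d(x, 5))  # coarse local (5x5) context
        c1 = F.interpolate(c1, size=(h, w), mode="bilinear", align_corners=False)
        c5 = F.interpolate(c5, size=(h, w), mode="bilinear", align_corners=False)
        return self.restore(torch.cat([x, c1, c5], dim=1))  # concat with originals

x = torch.randn(2, 64, 30, 40)
print(CFIL(64)(NFCL(64)(x)).shape)  # torch.Size([2, 64, 30, 40])
```

Note the zero-overhead aspect of NFCL in this reading: the gate is derived entirely from the BN scaling factors that already exist, so no new learnable parameters are introduced.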

Loss & Training

Each of the five tasks employs a dedicated loss: semantic segmentation (CE), instance center (MSE), instance offset (MAE), orientation estimation (von Mises: \(L_{or}=1-e^{\kappa(f \cdot t - 1)}\)), and scene classification (CE). These are combined via the adaptive weighting described above. Training uses SGD (lr 0.03, momentum 0.9, weight decay 1e-4) on an RTX 3090 Ti.
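
A minimal sketch of the batch-level weighting rule follows. The cumulative running-mean form, the initialization of AvgRL, and the base weights (the summary's W-bar term, set to 1.0 here) are assumptions, since the summary does not fully specify them.

```python
import torch

class AdaptiveTaskWeights:
    """Batch-level adaptive weighting: W_k = max(base_k * AvgRL_k**alpha, w_min).

    AvgRL_k is a historical mean of the relative loss RL_k = L_k / sum_t L_t.
    The base weights (all 1.0 here) and the cumulative-mean form are
    assumptions; the summary does not pin down the W-bar_k term.
    """
    def __init__(self, num_tasks: int, alpha: float = 0.01, w_min: float = 0.1):
        self.alpha, self.w_min = alpha, w_min
        self.base = torch.ones(num_tasks)
        self.avg_rl = torch.full((num_tasks,), 1.0 / num_tasks)  # assumed init
        self.step = 0

    def __call__(self, losses: list) -> torch.Tensor:
        ls = torch.stack([l.detach() for l in losses])  # weights get no gradient
        rl = ls / ls.sum()                              # RL_k, relative loss
        self.step += 1
        self.avg_rl += (rl - self.avg_rl) / self.step   # running historical mean
        w = torch.clamp(self.base * self.avg_rl ** self.alpha, min=self.w_min)
        return sum(wi * li for wi, li in zip(w, losses))  # weighted total loss

# Usage: total = AdaptiveTaskWeights(5)([l_sem, l_center, l_offset, l_orient, l_scene])
```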

Key Experimental Results

Main Results

| Dataset    | Method               | Semantic mIoU | PQ (Panoptic) | FPS   | Params |
|------------|----------------------|---------------|---------------|-------|--------|
| NYUv2      | EMSAFormer (Swin v2) | 49.76         | 43.08         | 16.32 | 72.08M |
| NYUv2      | Ours                 | 49.82         | 43.21         | 20.33 | 71.82M |
| NYUv2      | MPViT                | -             | -             | 9.94  | -      |
| SUN RGB-D  | CI-Net               | 44.30         | -             | -     | -      |
| SUN RGB-D  | Ours                 | 45.56         | -             | -     | -      |
| Cityscapes | PSPNet               | 63.10         | -             | -     | -      |
| Cityscapes | Ours                 | 65.11         | -             | -     | -      |

Ablation Study

| Configuration                    | Key Metric                           | Notes                                              |
|----------------------------------|--------------------------------------|----------------------------------------------------|
| Fusion encoder vs. Swin v2       | Instance PQ 58.59 vs. 58.49, faster  | Fewer parameters, higher speed, comparable accuracy |
| +CFIL (semantic decoder)         | Panoptic mIoU 50.16%                 | Multi-scale context fusion is effective            |
| +NFCL (layers 1/2/3)             | mIoU 49.82%                          | Stage-4 encoder features are already sufficient    |
| Non-bottleneck 1D vs. Bottleneck | PQ 59.25 vs. 57.97                   | Decomposed convolutions enhance nonlinearity       |
| Adaptive loss vs. fixed weights  | mIoU 47.72 vs. 46.83                 | Training variance is also reduced                  |
| \(\alpha\)=0.01 vs. 0.1/0.001    | 0.01 is optimal                      | Balances sensitivity and stability                 |

Key Findings

  • Partial-channel convolution is effective for dense prediction tasks — FLOPs are reduced by 16× with negligible accuracy loss.
  • NFCL reuses BN's \(\gamma\) parameters as a zero-overhead channel importance measure.
  • Batch-level adaptive loss is more stable than epoch-level approaches, with lower training variance.
  • NB1D reduces parameters by 30% yet improves PQ by 1.28; the nonlinear activations from decomposed convolutions benefit instance segmentation.

Highlights & Insights

  • The efficiency principle is consistently applied throughout the entire framework: partial channels in the encoder (1/16 FLOPs), zero-overhead BN reuse in NFCL, and 30% parameter reduction via NB1D.
  • NFCL is minimally designed — it directly repurposes existing BN \(\gamma\) parameters for channel weighting without introducing any additional learnable parameters.
  • An excellent speed–accuracy trade-off is achieved: 24% faster than the Swin v2-based EMSAFormer at slightly higher accuracy.
  • The multi-task adaptive loss operates at the batch level, responding faster than epoch-level alternatives.

Limitations & Future Work

  • The partial-channel ratio of 1/4 is fixed; dataset-adaptive selection could be explored.
  • Validation is limited to RGB-D; extension to thermal imaging, point clouds, and other modalities remains unexplored.
  • The method processes frames independently without exploiting temporal consistency in video streams.
  • \(\alpha\) and \(W_{min}\) are manually set hyperparameters; automated tuning could be considered.
  • Scalability to high-resolution scenes has not been verified.

Comparison with Related Work

  • vs. EMSAFormer: Both address RGB-D multi-task learning; this work replaces the Swin Transformer with a CNN encoder to achieve higher speed (20.33 vs. 16.32 FPS) at comparable accuracy.
  • vs. EMSANet: The dual-encoder fusion incurs high computational cost; this work directly processes RGBD 4-channel input with a single encoder.
  • vs. SegFormer: The limitations of MLP decoders are effectively mitigated by the combination of NFCL and CFIL.
  • The partial-channel convolution concept originates from FasterNet; this work demonstrates its effectiveness in dense multi-task prediction scenarios.

Rating

  • Novelty: ⭐⭐⭐ Individual components show some novelty, but the overall contribution is largely an integration and optimization of existing techniques.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, detailed ablations, and heatmap visualizations.
  • Writing Quality: ⭐⭐⭐ Well-structured overall, though some descriptions are slightly redundant.
  • Value: ⭐⭐⭐⭐ Practically valuable for robotic scene understanding, with an excellent speed–accuracy balance.