# MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
Conference: ICCV 2025 · arXiv: 2509.17462 · Code: To be released · Area: Autonomous Driving · Keywords: Multi-task Learning, 3D Perception, BEV Segmentation, 3D Object Detection, Occupancy Prediction
## TL;DR
This paper proposes the MAESTRO framework, which generates task-specific features and suppresses inter-task interference in multi-task 3D perception through three modules: Class-wise Prototype Generator (CPG), Task-Specific Feature Generator (TSFG), and Scene Prototype Aggregator (SPA). MAESTRO simultaneously surpasses single-task models on 3D object detection, BEV map segmentation, and 3D occupancy prediction.
## Background & Motivation
Autonomous driving perception systems must simultaneously execute multiple tasks: 3D object detection focuses on movable foreground objects, BEV map segmentation focuses on static background structures, and 3D occupancy prediction requires attention to both foreground and background. Multi-task learning (MTL) can improve computational efficiency via a shared backbone, but inconsistent gradient directions across tasks lead to task conflicts that degrade per-task performance.
Existing MTL methods partially alleviate these conflicts but still fail to generate truly task-specific feature representations, leaving performance below independently trained single-task models. The authors identify fundamental differences in the semantic cues and spatial regions attended to by different tasks, motivating a mechanism that enhances task-relevant information and suppresses irrelevant information from shared features.
## Method
### Overall Architecture
MAESTRO first extracts 2D features from multi-view images via a shared backbone, then lifts them to a 3D voxel representation \(F_s \in \mathbb{R}^{C \times X \times Y \times Z}\) using the Lift-Splat-Shoot (LSS) method. The features are subsequently processed by CPG, TSFG, and SPA in sequence to generate task-specific features, which are then forwarded to individual task heads for prediction.
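As a rough illustration of the lifting step, the sketch below mimics LSS with toy tensor shapes: per-pixel context features are spread along predicted depth bins via an outer product, and the resulting frustum points are scattered into a voxel grid \(F_s \in \mathbb{R}^{C \times X \times Y \times Z}\). All names and sizes are hypothetical, and a random cell assignment stands in for the real camera-geometry projection; this is not the paper's implementation.

```python
import numpy as np

# Toy shapes (hypothetical, far smaller than in practice).
C, D, H, W = 8, 4, 2, 3          # channels, depth bins, image height/width
X, Y, Z = 5, 5, 2                # voxel grid size

rng = np.random.default_rng(0)
context = rng.standard_normal((C, H, W))          # per-pixel context features
depth_logits = rng.standard_normal((D, H, W))
depth_prob = np.exp(depth_logits) / np.exp(depth_logits).sum(0, keepdims=True)

# Outer product: each pixel contributes its context feature to every depth
# bin, weighted by the predicted depth distribution -> a frustum of features.
frustum = depth_prob[None] * context[:, None]     # (C, D, H, W)

# Scatter frustum points into the voxel grid (random indices replace the
# true projection of each (depth, pixel) point into 3D space).
voxels = np.zeros((C, X, Y, Z))
cells = rng.integers(0, [X, Y, Z], size=(D * H * W, 3))
flat = frustum.reshape(C, -1)
for i, (x, y, z) in enumerate(cells):
    voxels[:, x, y, z] += flat[:, i]

print(voxels.shape)  # (8, 5, 5, 2)
```

The key property shown is that the depth distribution sums to one per pixel, so the lifted features redistribute, rather than duplicate, the 2D evidence across depth.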
### Key Designs
- Class-wise Prototype Generator (CPG): Semantic categories are partitioned into a foreground group (vehicles, pedestrians, etc.) and a background group (roads, sidewalks, etc.). A lightweight mask classifier computes per-voxel semantic confidence \(S_v\) from the shared voxel features, generates binary masks \(B_k\) based on the highest-confidence class, and applies average pooling over masked regions to obtain class prototypes \(P_k\). Foreground prototypes are assigned to the detection task, background prototypes to BEV segmentation, and both are jointly assigned to occupancy prediction. The core formula is \(P_k = \text{AvgPool}(F_s \otimes B_k)\). Design Motivation: Semantic grouping provides targeted prior information for each task.
- Task-Specific Feature Generator (TSFG): Comprises three sub-modules. (a) Task-dependent feature transformation: Transforms shared voxel features into BEV-domain features (detection/segmentation) or voxel-domain features (occupancy prediction). (b) Adaptive feature enhancement: Computes dot products between the prototype group and the transformed features to generate prototype-level features (activating spatial regions semantically aligned with prototypes), while channel attention generates prototype-aware features; these are concatenated and convolved to yield the enhanced feature \(\tilde{F}_t\). (c) Feature suppression: A lightweight CNN derives a suppression score map \(S^{supp}_t\) from the prototype-aware features, which is applied element-wise to the enhanced features via gating: \(F^{TS}_t = \tilde{F}_t \otimes S^{supp}_t\). Design Motivation: The enhance-then-suppress paradigm thoroughly eliminates task-irrelevant information.
- Scene Prototype Aggregator (SPA): Detection prototypes \(P_{Det}\) are generated from predicted bounding boxes via RoIAlign from the detection head, and map prototypes \(P_{Map}\) are generated from predicted masks via mask average pooling from the segmentation head. These task-guided prototypes are aggregated into the occupancy prediction prototype group through semantic alignment rules, forming scene prototypes as initial queries for the occupancy decoder. Design Motivation: Complementary semantic information from detection and segmentation is exploited to enhance occupancy prediction without degrading other tasks.
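To make the CPG pooling and TSFG enhance-then-suppress dataflow concrete, here is a minimal numpy sketch over flattened voxels. The shapes, class grouping, and the random projection standing in for the lightweight suppression CNN are all illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 8, 100        # channels, number of voxels (flattened X*Y*Z)
K = 4                # classes: first 2 foreground, last 2 background (toy)

F_s = rng.standard_normal((C, N))      # shared voxel features
S_v = rng.standard_normal((K, N))      # per-voxel class confidences

# CPG: one binary mask per class from the highest-confidence class, then
# masked average pooling: P_k = AvgPool(F_s * B_k).
B = S_v.argmax(0)[None] == np.arange(K)[:, None]          # (K, N) masks
P = np.stack([F_s[:, B[k]].mean(1) if B[k].any()
              else np.zeros(C) for k in range(K)])        # (K, C) prototypes

# TSFG enhancement (sketch): dot products between a task's prototype group
# and the features activate prototype-aligned spatial locations.
fg = P[:2]                                  # foreground group -> detection
act = fg @ F_s                              # (2, N) similarity activations
F_enh = F_s * act.max(0, keepdims=True)     # amplify aligned voxels

# TSFG suppression (sketch): a learned score map in (0, 1) gates out
# task-irrelevant locations; sigmoid of a random projection stands in
# for the lightweight CNN.
w = rng.standard_normal(C)
S_supp = 1.0 / (1.0 + np.exp(-(w @ F_s)))   # (N,) suppression scores
F_ts = F_enh * S_supp[None]                 # task-specific features

print(F_ts.shape)  # (8, 100)
```

Note how suppression is multiplicative gating on already-enhanced features, so a near-zero score can remove a region even after the prototypes amplified it.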
## Loss & Training
The total loss is the sum of six terms: \(L_{total} = L_{depth} + L_{CPG} + L_{Sup} + L_{det} + L_{map} + L_{occ}\)
- \(L_{CPG}\) supervises class mask classification using Dice Loss + Lovász Loss
- \(L_{Sup}\) supervises task-wise suppression score maps using Focal Loss
- Each task head employs the respective losses from CenterPoint (detection), BEVFusion (segmentation), and OccFormer (occupancy prediction)
- ResNet-50 backbone, AdamW optimizer (lr=1e-4), 24 epochs, without CBGS
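Since all six terms enter with unit weight, the total is a plain sum over the per-task losses; the numbers below are made-up placeholders purely to show the bookkeeping.

```python
# Hypothetical per-batch loss values (placeholders, not reported results):
# L_total = L_depth + L_CPG + L_Sup + L_det + L_map + L_occ, all unit-weighted.
losses = {
    "depth": 0.42,  # depth distribution supervision (LSS)
    "cpg":   0.18,  # class mask classification (Dice + Lovász)
    "sup":   0.07,  # suppression score maps (Focal)
    "det":   1.35,  # CenterPoint detection losses
    "map":   0.66,  # BEVFusion segmentation loss
    "occ":   0.93,  # OccFormer occupancy loss
}
total = sum(losses.values())
print(round(total, 2))  # 3.61
```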
## Key Experimental Results
### Main Results
| Method | mAP | NDS | mIoU (Map) | mIoU (Occ) | Latency (ms) |
|---|---|---|---|---|---|
| Baseline-STL | 33.8 | 41.7 | 47.5 | 36.5 | 405.9 |
| Baseline-MTL | 32.7 | 38.2 | 43.5 | 36.0 | 219.6 |
| BEVFusion MTL | 33.6 | 39.2 | 44.0 | - | - |
| DualBEV (STL) | 35.2 | 42.5 | - | - | 65.1 |
| DifFUSER (STL) | - | - | 48.3 | - | 92.2 |
| FB-Occ (STL) | - | - | - | 37.4 | 129.7 |
| MAESTRO-MTL | 36.4 | 43.2 | 51.3 | 38.6 | 250.3 |
Compared to Baseline-STL, MAESTRO achieves gains of +2.6 mAP (with +1.5 NDS), +3.8 mIoU (Map), and +2.1 mIoU (Occ) across the three tasks while reducing latency by 155.6 ms; gains over Baseline-MTL are even more pronounced.
### Ablation Study
| CPG | TSFG (Det) | TSFG (Map) | TSFG (Occ) | SPA | mAP | NDS | mIoU (Map) | mIoU (Occ) |
|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | ✗ | ✗ | 31.3 | 32.3 | 40.4 | 33.6 |
| ✓ | ✗ | ✗ | ✗ | ✗ | 31.3 | 32.4 | 40.3 | 34.6 |
| ✓ | ✓ | ✓ | ✓ | ✗ | 32.6 | 34.3 | 44.2 | 36.9 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 32.6 | 34.3 | 44.2 | 36.9 |
TSFG sub-module ablation: prototype-level features contribute +1.8% mIoU (Map), prototype-aware features +1.3%, and feature suppression +0.7%. In SPA, removing map prototypes reduces occupancy mIoU by 0.6%, and further removing detection prototypes reduces it by an additional 0.4%.
## Key Findings
- MAESTRO is the first MTL framework to surpass independently trained STL models on all three 3D perception tasks simultaneously.
- Foreground/background semantic grouping prototypes serve as effective priors for alleviating task conflicts.
- The feature suppression module contributes most significantly to BEV map segmentation, which requires suppressing foreground interference.
- Output information from detection and segmentation provides significant complementary benefits to occupancy prediction.
## Highlights & Insights
- The core insight is distinctive: different 3D perception tasks inherently attend to foreground and background differently, and leveraging this semantic divergence to design prototype groupings is a natural and effective approach.
- The two-stage "enhance-then-suppress" feature refinement mechanism is elegantly designed—first amplifying relevant information via prototype groups, then filtering irrelevant components via learnable suppression score maps.
- SPA's use of outputs from already-trained tasks as auxiliary information is a clever approach to modeling inter-task dependencies.
- The overall framework exhibits high modularity, with each component's contribution independently verifiable.
## Limitations & Future Work
- Ablation experiments are conducted on 1/4 of the training set, and the generalizability of the ablation conclusions remains to be validated.
- The manual foreground/background grouping rules may not generalize to a broader range of tasks (e.g., lane detection).
- Evaluation is conducted only on the nuScenes validation set; performance on other large-scale datasets (e.g., Waymo) has not been verified.
- The suppression score maps require additional GT supervision (RoI masks), increasing annotation dependency.
- The effects of temporal information fusion and larger backbone networks remain unexplored.
## Related Work & Insights
- Compared to existing MTL methods such as HENet and SOGDet, MAESTRO not only alleviates task conflicts but also achieves genuinely task-specific feature generation through the prototype mechanism.
- The MoE approach in TaskExpert shares a conceptual similarity with the prototype grouping idea in this work, as both explore how to assign distinct feature processing pathways to different tasks.
- The prototype learning concept originates from Prototypical Networks and is innovatively applied here to multi-task feature disentanglement.
## Rating
- Novelty: ⭐⭐⭐⭐ The foreground/background prototype grouping combined with enhancement-suppression design is original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive main experiment comparisons with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-executed illustrations.
- Value: ⭐⭐⭐⭐ Achieving MTL superiority over STL in multi-task 3D perception represents meaningful progress.