Mr. DETR++: Instructive Multi-Route Training for Detection Transformers with MoE
Conference: CVPR 2025
arXiv: 2412.10028
Code: Project Page
Area: Object Detection
Keywords: DETR, multi-route training, instructive self-attention, mixture-of-experts, one-to-many assignment, detection transformer
TL;DR
This work systematically investigates the roles of the DETR decoder components within a joint one-to-one/one-to-many multi-task framework and finds that making any single component independent is enough to coordinate the two objectives. Based on this, it proposes instructive multi-route training (Instructive Self-Attention + Independent FFN + Route-Aware MoE), whose auxiliary routes are discarded at inference and therefore add no inference cost.
Background & Motivation
Background: The DETR family achieves end-to-end detection without NMS through one-to-one matching, but the sparse supervision of one-to-one matching leads to slow convergence. Existing methods (H-DETR, DN-DETR, DINO, DAC-DETR, MS-DETR) introduce auxiliary one-to-many matching to accelerate training.
Limitations of Prior Work: (1) Simply sharing all components for both one-to-one and one-to-many predictions severely degrades one-to-one performance (-6.0 AP); (2) Existing methods only investigate the functions of decoder components under single-task settings, lacking a systematic analysis within a multi-task framework; (3) The designs of DAC-DETR and MS-DETR are based on heuristic observations rather than systematic experiments.
Key Challenge: A conflict arises when the two training objectives, one-to-one and one-to-many, pass through the same decoder—the same predicted box may be a positive sample in one-to-many but a negative sample in one-to-one.
Goal: Systematically understand the roles of individual decoder components (self-attention, cross-attention, FFN) under a multi-task framework, and identify the optimal component-sharing/independent strategy.
Key Insight: View the joint one-to-one and one-to-many auxiliary training as multi-task learning, and exhaustively validate the performance of different combinations of shared/independent components.
Method
Overall Architecture
Three-route decoder structure:
- Route-2 (Main Route): Identical to the baseline model, performs one-to-one prediction, and is preserved during inference.
- Route-1 (Auxiliary-FFN): Uses an independent FFN to perform one-to-many prediction, and is discarded during inference.
- Route-3 (Auxiliary-InstructSA): Uses Instructive Self-Attention to perform one-to-many prediction, and is discarded during inference.
All routes share the object queries and detection heads.
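The three-route layout can be sketched as a single decoder layer that reuses components across routes. This is a minimal illustration, not the authors' implementation: the function name `decoder_layer`, the output keys, and the idea of passing each component in as a plain callable are all hypothetical, and the sketch assumes Route-3 reuses the main FFN (the summary only states that its self-attention step differs).

```python
from typing import Callable, Dict, List

Vector = List[float]
Queries = List[Vector]
Layer = Callable[[Queries], Queries]


def decoder_layer(
    queries: Queries,
    self_attn: Layer,
    instruct_self_attn: Layer,
    cross_attn: Layer,
    ffn_main: Layer,
    ffn_aux: Layer,
    training: bool,
) -> Dict[str, Queries]:
    """Run one decoder layer over the routes; all routes read the same queries."""
    # Route-2 (main, one-to-one): the unchanged baseline path, kept at inference.
    outputs = {"route2_o2o": ffn_main(cross_attn(self_attn(queries)))}
    if training:
        # Route-1 (auxiliary, one-to-many): only the FFN is independent.
        outputs["route1_o2m"] = ffn_aux(cross_attn(self_attn(queries)))
        # Route-3 (auxiliary, one-to-many): only the self-attention step differs.
        # Assumption: it shares the main FFN.
        outputs["route3_o2m"] = ffn_main(cross_attn(instruct_self_attn(queries)))
    return outputs
```

With `training=False` only the Route-2 output is computed, which is why the auxiliary routes add no inference cost in this sketch.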
Key Designs
1. Empirical Foundation of the Multi-Route Training Mechanism
Exhaustive validation of 12 shared/independent component combinations (Table I):
- Sharing all components: -6.0 AP (severe task conflict).
- Independent Self-Attention: +2.1 AP; Independent Cross-Attention: +1.6 AP; Independent FFN: +2.0 AP.
- Making any single component independent is sufficient to coordinate the two objectives.
- Optimal combination: independent SA + independent FFN, used as two auxiliary routes alongside the main route (three routes in total), achieves +2.4 AP. Combinations with independent CA perform worse, which is attributed to the slow convergence of CA.
2. Instructive Self-Attention
To replace independent Self-Attention and reduce parameters, an instructive mechanism is designed:
- Construct \(m\) learnable instruction tokens \(\mathbf{Q}^{ins}\).
- Concatenate them with the \(n\) object queries and perform Self-Attention with shared parameters: \(\hat{Q}^{ins} = \{q_0^{ins}, \ldots, q_{m-1}^{ins}, q_0, \ldots, q_{n-1}\}\).
- Discard the outputs of the instruction tokens after Self-Attention.
- Adds no parameters beyond the instruction tokens themselves (the SA weights are shared); the instruction tokens alone steer the queries from one-to-one to one-to-many prediction mode.
- Uses 10 instruction tokens by default.
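The concatenate, attend, discard steps above can be sketched in plain Python. This is a toy illustration under stated assumptions: `self_attention` is a single-head attention with identity projections (a real layer uses learned Q/K/V projections and multiple heads), and both function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Toy single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Scaled dot-product logits of this query against every token.
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs


def instructive_self_attention(
    queries: List[Vector], instruction_tokens: List[Vector]
) -> List[Vector]:
    """Prepend instruction tokens, attend with the shared layer, drop their outputs."""
    m = len(instruction_tokens)
    attended = self_attention(instruction_tokens + queries)
    return attended[m:]  # keep only the n object-query outputs
```

The object queries come back with the same shape as before, but each one has also attended to the instruction tokens, so its output differs from plain self-attention even though the attention layer itself is unchanged.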
3. Route-Aware Mixture-of-Experts (Mr. DETR++)
Replaces the two independent FFNs with MoE to achieve knowledge sharing:
- \(t\) shared experts, with sparse top-\(k\) activation.
- Route-2 and Route-3 share the gating function \(G(\cdot)\).
- Route-1 uses an independent gating function \(G'(\cdot)\), preventing gradients from the one-to-many route from interfering with the one-to-one route.
- Scale-aware MoE in Encoder: Applies MoE to low-resolution features, while using only a shared FFN for high-resolution features, to balance computational overhead.
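A rough sketch of route-aware gating over shared experts follows. It assumes linear gates and a softmax over the selected top-\(k\) logits, which is a common MoE formulation but is not confirmed by this summary; the names `top_k_gate`, `route_aware_moe`, and the gate keys `"main"` and `"aux_ffn"` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def top_k_gate(x: Vector, gate_weights: Sequence[Vector], k: int) -> List[float]:
    """Softmax over the k largest gate logits; all other experts get weight 0."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exp = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(logits))]


def route_aware_moe(
    x: Vector,
    experts: Sequence[Expert],
    gates: Dict[str, Sequence[Vector]],
    route: int,
    k: int = 1,
) -> Vector:
    """Mix the shared experts using the gate assigned to `route`."""
    # Route-2 and Route-3 share one gate G; Route-1 has its own gate G'.
    gate_weights = gates["aux_ffn"] if route == 1 else gates["main"]
    weights = top_k_gate(x, gate_weights, k)
    out = [0.0] * len(x)
    for weight, expert in zip(weights, experts):
        if weight > 0.0:  # sparse activation: unselected experts are never evaluated
            out = [o + weight * y for o, y in zip(out, expert(x))]
    return out
```

The experts are the same objects for every route, so knowledge is shared, while the choice of which experts fire for Route-1 is decided by a separate gate.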
Loss & Training
- One-to-one Matching Route (Route-2): Standard Hungarian matching + classification loss + box loss.
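- One-to-many Matching Route (Route-1, Route-3): Top-\(K\) matching based on \(M_{ij} = \alpha \cdot s_i + (1-\alpha) \cdot \text{IoU}(b_i, \bar{b}_j)\), where \(s_i\) and \(b_i\) are the classification score and box of prediction \(i\), and \(\bar{b}_j\) is the \(j\)-th ground-truth box.
- Localization-aware Score Calibration: A VFL loss trains a class-aware IoU score, and at inference the calibrated score is \(s_{calib} = s_{cls}^{\phi} \cdot s_{iou}^{1-\phi}\).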
- Parameters: \(K=6\), \(\alpha=0.3\), \(\tau=0.4\), with \(\phi\) used to balance classification and localization confidence.
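The matching score and the calibrated inference score can be sketched as follows. This is an illustration only: boxes are assumed to be in `(x1, y1, x2, y2)` format, the function names are hypothetical, \(\tau\) is left out because its role is not defined in this summary, and no default is given for `phi` because its value is not stated here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def one_to_many_match(
    scores: Sequence[float],
    boxes: Sequence[Box],
    gt_box: Box,
    alpha: float = 0.3,
    k: int = 6,
) -> List[int]:
    """Indices of the top-k predictions for one ground-truth box, ranked by M_ij."""
    quality = [alpha * s + (1 - alpha) * iou(b, gt_box) for s, b in zip(scores, boxes)]
    ranked = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
    return ranked[:k]


def calibrated_score(s_cls: float, s_iou: float, phi: float) -> float:
    """s_calib = s_cls^phi * s_iou^(1 - phi), a weighted geometric mean."""
    return s_cls ** phi * s_iou ** (1 - phi)
```

With \(\alpha = 0.3\) the IoU term dominates, so a well-localized box with a modest class score can outrank a confident but poorly localized one. Setting `phi=1.0` in `calibrated_score` recovers the plain classification score.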
Key Experimental Results
Main Results (Table II, ResNet-50)
| Baseline | Queries | Epochs | Baseline AP | +Mr.DETR | +Mr.DETR++ |
|---|---|---|---|---|---|
| Deformable-DETR++ | 300 | 12 | 47.0 | 49.5 (+2.5) | 51.0 (+4.0) |
| Deformable-DETR++ | 900 | 12 | 47.6 | 50.7 (+3.1) | 51.8 (+4.2) |
| DINO | 900 | 12 | 49.0 | 50.9 (+1.9) | 52.2 (+3.2) |
| Align-DETR | 900 | 12 | 50.2 | 51.4 (+1.2) | 52.2 (+2.0) |
Swin-L Backbone (Table III, 12 epochs, 900 queries)
| Method | AP | AP50 | AP75 |
|---|---|---|---|
| DINO | 56.8 | 75.4 | 62.3 |
| Rank-DETR | 57.6 | 76.0 | 63.4 |
| Stable-DINO | 57.7 | 75.7 | 63.4 |
| Mr. DETR | 58.4 | 76.3 | 63.9 |
| Mr. DETR++ | 58.7 | 76.5 | 64.0 |
Large-Scale Dataset Scaling (ResNet-50, 900 queries, 12 epochs)
| Dataset | Baseline AP | +Mr.DETR | +Mr.DETR++ |
|---|---|---|---|
| Objects365 | 30.4 | 32.7 (+2.3) | 34.9 (+4.5) |
| NuImages | 48.5 | 51.2 (+2.7) | 52.3 (+3.8) |
Ablation Study (Table I)
| Combination | No. Routes | AP(o2o) | vs Baseline |
|---|---|---|---|
| Fully Shared | 1 | 41.6 | -6.0 |
| Independent SA | 2 | 49.7 | +2.1 |
| Independent FFN | 2 | 49.6 | +2.0 |
| Independent SA + Independent FFN | 3 | 50.0 | +2.4 |
| Independent CA + Independent FFN | 3 | 49.0 | +1.4 |
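As a quick consistency check, the "vs Baseline" column can be inverted to recover the baseline AP that each row implies. The table does not state this value directly, so the sketch below only confirms that all rows point to the same number; the variable names are illustrative.

```python
# (combination, AP on the one-to-one route, reported delta vs baseline)
rows = [
    ("Fully Shared", 41.6, -6.0),
    ("Independent SA", 49.7, 2.1),
    ("Independent FFN", 49.6, 2.0),
    ("Independent SA + Independent FFN", 50.0, 2.4),
    ("Independent CA + Independent FFN", 49.0, 1.4),
]

# Baseline implied by each row: AP minus the reported delta.
implied_baselines = {round(ap - delta, 1) for _, ap, delta in rows}
print(implied_baselines)
```

Every row implies the same baseline of 47.6 AP, which matches the Deformable-DETR++ entry with 900 queries in the main results table.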
Key Findings
- Any Single Independent Component Suffices: This is the most core empirical finding—breaking the preconceived notion that only SA can differentiate between o2o/o2m.
- Independent CA Underperforms: Pairing independent CA with independent FFN yields the smaller gain of the two three-route combinations shown (+1.4 vs +2.4 for SA + FFN), attributed to the slow convergence of independent CA.
- Zero Inference Overhead for Mr. DETR++: Auxiliary routes are completely discarded during inference, without affecting the architecture and speed.
- Consistent Improvements across Baselines: Gains of +1.2 to +4.2 AP across Deformable-DETR++, DINO, and Align-DETR (Table II).
- Consistent across Datasets: Significant improvements on Objects365 (+4.5) and NuImages (+3.8).
Highlights & Insights
- Systematic Empirical Analysis: Exhaustive experiments of 12 combinations in Table I clearly reveal the roles of decoder components under a multi-task framework, providing reliable design guidelines for future works.
- Ingenious Instruction Token Design: Simply prepending learnable tokens switches the self-attention behavior mode (\(o2o \to o2m\)) without adding any attention weights; conceptually simple yet highly effective.
- Mitigating Conflict via Route-Aware MoE: The design of shared experts + independent gating strikes an effective balance between knowledge sharing and task conflicts.
- Plug-and-Play: Does not alter the baseline model architecture, introduces no inference cost, and is adaptable to various DETR variants—exhibiting high practical value.
Limitations & Future Work
- Increases training computational overhead by around 20-30% (requiring gradient computations for three routes).
- MoE introduces additional expert parameters, which increase training memory, although they can be reduced at inference.
- The number of instruction tokens (10) is a hyperparameter, which, although shown to be insensitive in experiments, might have varying optimal values depending on the model.
- Validation is primarily conducted on COCO, Objects365, and NuImages; more downstream tasks (e.g., open-vocabulary detection) remain to be explored.
Related Work & Insights
- DAC-DETR: Achieves o2m by removing SA. This paper shows that discarding SA is unnecessary and that instruction tokens are a more elegant alternative.
- MS-DETR: Uses CA output for o2m and SA output for o2o. This paper shows that this is not the only effective combination.
- DINO: Uses multiple groups of denoising queries for auxiliary training—the multi-route strategy in this paper is orthogonal to denoising queries and can be combined.
- Inspiration: The methodology of component-wise analysis under a multi-task framework can be extended to other multi-head/multi-task architectures (e.g., multimodal fusion modules in VLMs).
Rating
⭐⭐⭐⭐ (8/10)
- Novelty: ⭐⭐⭐⭐ — Valuable systematic empirical analysis; the Instruction Token design is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Convincing with multiple baselines, datasets, tasks, and exhaustive ablations.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured, with a complete logical flow from experimental observations to method design.
- Value: ⭐⭐⭐⭐⭐ — High engineering value owing to zero inference overhead, plug-and-play capability, and generalizability across baselines.