DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic¶
Conference: ICCV 2025 | arXiv: 2506.21260 | Code: To be confirmed | Area: Autonomous Driving | Keywords: Incremental Object Detection, Class Incremental, Domain Incremental, Task Arithmetic, Model Merging, Catastrophic Forgetting, YOLO11, RT-DETR
TL;DR¶
This paper proposes DuET, a framework that, for the first time, addresses both class-incremental and domain-incremental object detection simultaneously (Dual Incremental Object Detection, DuIOD) via exemplar-free Task Arithmetic model merging. It introduces a Directional Consistency Loss to mitigate sign conflicts, achieving substantial improvements over existing methods on the Pascal Series and Diverse Weather Series benchmarks.
Background & Motivation¶
The Dual Challenge in Real-World Scenarios¶
Real-world object detection systems (e.g., autonomous driving, surveillance) must continuously learn new categories while adapting to environmental changes (illumination, weather, style, and other domain shifts). Existing methods address only one of these dimensions:
- Class-Incremental Object Detection (CIOD): Incrementally learns new categories but assumes a fixed domain, leading to severe performance degradation on unseen domains.
- Domain-Incremental Object Detection (DIOD): Adapts to new domains but assumes a fixed category set, making it unable to detect new classes.
Both families of methods fail when class shifts and domain shifts occur simultaneously—precisely the scenario most common in practice.
Problem Formulation of DuIOD¶
The authors propose Dual Incremental Object Detection (DuIOD) as a more realistic setting: the model must handle a sequence of incremental tasks, each introducing new categories \(\mathcal{C}_t\) and a new domain \(\mathcal{D}_t\), without retaining any historical training data (exemplar-free). This introduces three core challenges:
- Catastrophic Forgetting: Learning new knowledge erases old knowledge.
- Domain Generalization: Maintaining performance on old categories in new domains.
- Background Shift: Old categories are unannotated in new tasks and are treated as background during training.
Limitations of Prior Work¶
- CL-DETR (CIOD SOTA): Relies on the DETR architecture and exemplar replay; performs poorly under severe domain shift.
- LDB (DIOD SOTA): Freezes the base model and learns domain biases; cannot handle category changes.
- LwF / ERD: General continual learning methods that degrade significantly in multi-stage DuIOD.
Method¶
Overall Architecture¶
The DuET framework consists of two core components:
- DuET Module: Updates shared parameters (backbone + neck) by dynamically fusing old and new task vectors layer by layer, balancing knowledge retention and new knowledge absorption.
- Incremental Head: Concatenates detection head parameters across tasks to expand the model's categorical detection capacity.
The overall pipeline is as follows:
- Base Task \(\mathcal{T}_1\): Fine-tune a pretrained detector on the first task to obtain parameters \(\theta_1\).
- Parameter Decomposition: Decompose model parameters into shared parameters \(\theta_s\) (backbone + neck) and task-specific parameters \(\theta_\tau\) (detection heads).
- Incremental Task \(\mathcal{T}_t, t \geq 2\):
- Sequential fine-tuning: Initialize from \(\theta_{t-1}\) and train with the total loss to obtain \(\theta_t\).
- Compute old and current task vectors: \(\tau_{\text{old}} = \theta_{s_{t-1}} - \theta_{s_0}\), \(\tau_{\text{curr}} = \theta_{s_t} - \theta_{s_0}\).
- Merge shared parameters via the DuET Module.
- Concatenate detection head parameters via the Incremental Head.
- Inference: Perform detection using the merged incremental weights.
Key Design 1: DuET Module — Layer-wise Dynamic Task Vector Fusion¶
The DuET Module is the core of the framework. For each layer \(l\), it computes a retention factor \(\alpha_l\) and an adaptation factor \(\beta_l\) to fuse old and new task vectors.
A p-factor is first computed for each layer to measure the relative importance of the old versus the current update, based on the norms of the two task vectors. After a \(\tanh\) mapping and clamping, it yields the dynamic coefficients \(\alpha_l\) (retention) and \(\beta_l\) (adaptation). The merged shared parameters for each layer are then obtained by adding the \(\alpha_l\)-weighted old task vector and the \(\beta_l\)-weighted current task vector to the base shared parameters \(\theta_{s_0}\).
Design Intuition: When the old task vector norm dominates in a given layer, \(\alpha_l\) is larger, prioritizing the preservation of old knowledge (stability); otherwise, the model absorbs more new knowledge (plasticity). This avoids the high computational cost of second-order methods such as Fisher Merging.
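As a concrete illustration, the layer-wise fusion can be sketched as follows. The norm-based form of the p-factor, the clamping range, and the coupling \(\beta_l = 1 - \alpha_l\) are assumptions made for this sketch, not the paper's exact formulas:

```python
import numpy as np

def duet_merge_layer(theta_s0, theta_old, theta_curr, eps=1e-8):
    """Sketch of the DuET Module's layer-wise fusion.

    The p-factor form, the clamping range, and the alpha/beta coupling
    below are illustrative assumptions, not the paper's exact formulas.
    """
    tau_old = theta_old - theta_s0    # old task vector for this layer
    tau_curr = theta_curr - theta_s0  # current task vector for this layer
    # p-factor: relative importance of the old update (norm-based, matching
    # the intuition that a dominant old norm should favor retention)
    p = np.linalg.norm(tau_old) / (
        np.linalg.norm(tau_old) + np.linalg.norm(tau_curr) + eps
    )
    # tanh mapping and clamping give the dynamic coefficients
    alpha = np.clip(np.tanh(p), 0.0, 1.0)  # retention factor
    beta = 1.0 - alpha                     # adaptation factor (assumed coupling)
    # merged shared parameters for this layer
    return theta_s0 + alpha * tau_old + beta * tau_curr
```

When \(\tau_{\text{old}}\) dominates in a layer, \(\alpha_l\) grows and the merged weights stay close to the old task; otherwise the layer absorbs more of the current update.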
Key Design 2: Incremental Head — Task-Specific Parameter Concatenation¶
Detection head parameters are not merged but directly concatenated across current and historical task-specific parameters:
This allows the model to simultaneously output detection results for all learned categories and constitutes a simple yet effective incremental expansion strategy.
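A minimal sketch of this concatenation, with illustrative shapes (the 256-d feature width and per-task class counts are assumptions for the example):

```python
import numpy as np

# Sketch of the Incremental Head: per-task classification weights are kept
# as-is and concatenated along the class dimension. Shapes are illustrative.
rng = np.random.default_rng(0)
head_t1 = rng.standard_normal((20, 256))  # e.g., 20 classes from task 1
head_t2 = rng.standard_normal((6, 256))   # e.g., 6 new classes from task 2

# The expanded head predicts all 26 learned classes at once;
# no old head weights are overwritten or merged.
incremental_head = np.concatenate([head_t1, head_t2], axis=0)
print(incremental_head.shape)  # (26, 256)
```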
Key Design 3: Detector Agnosticism¶
The parameter decomposition strategy of DuET is architecture-agnostic:
- YOLO11: backbone + neck as \(\theta_s\), detection head as \(\theta_\tau\).
- RT-DETR: Analogously applicable.
- Deformable DETR: Analogously applicable.
This enables real-time detectors such as YOLO11 and RT-DETR to serve as incremental detectors for the first time.
Loss & Training¶
For the base task (\(t=1\)), only the standard detection loss \(\mathcal{L}_{\text{Detector}}\) is used.
For incremental tasks (\(t \geq 2\)), the total loss augments the standard detection loss with a modified distillation loss \(\mathcal{L}_{\text{Distill}}^*\) and a Directional Consistency Loss \(\mathcal{L}_{\text{DC}}\).
Directional Consistency Loss \(\mathcal{L}_{\text{DC}}\):
The dot product measures the directional consistency of consecutive incremental updates: if the current update direction is opposite to the previous step (sign conflict), the ReLU term produces a penalty. This effectively reduces sign conflicts by approximately 34% during model merging.
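A sketch of such a penalty on flattened update vectors; only the ReLU-of-dot-product structure comes from the description above, and the cosine normalization is an assumption of this sketch:

```python
import numpy as np

def directional_consistency_loss(update_prev, update_curr, eps=1e-8):
    """Sketch of a Directional Consistency penalty.

    Penalizes the current update when it points against the previous one
    (a sign conflict). The cosine normalization here is an assumption;
    the paper's description specifies only the ReLU-of-dot-product structure.
    """
    cos = np.dot(update_prev, update_curr) / (
        np.linalg.norm(update_prev) * np.linalg.norm(update_curr) + eps
    )
    return max(0.0, -cos)  # zero when directions agree, up to 1 when opposed
```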
Modified Distillation Loss \(\mathcal{L}_{\text{Distill}}^*\): A standard distillation strategy to preserve knowledge of previous tasks.
Evaluation Metric: Retention-Adaptability Index (RAI)¶
The authors propose a new evaluation metric that jointly measures retention and adaptability:
- Avg RI (Average Retention Index): Ratio of the final model's mAP on old categories in old domains to the mAP at the time of initial learning; measures the degree of forgetting.
- Avg GI (Average Generalization Index): Ratio of the model's mAP on unseen categories to that of a reference model; measures cross-domain generalization capability.
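The reported tables are consistent with RAI being the simple average of the two indices; this averaging form is inferred from the numbers rather than quoted from the paper, but it can be checked directly against the table rows:

```python
def rai(avg_ri, avg_gi):
    # RAI as the mean of Avg RI and Avg GI (inferred from the tables).
    return (avg_ri + avg_gi) / 2

# Table 3, YOLO11n: Avg RI 88.06, Avg GI 56.95 -> reported RAI 72.51
assert abs(rai(88.06, 56.95) - 72.51) < 0.01
# Table 3, Deformable DETR: 84.45 and 33.45 -> reported RAI 58.95
assert abs(rai(84.45, 33.45) - 58.95) < 0.01
```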
Key Experimental Results¶
Datasets¶
| Dataset Series | Domains | # Categories | Source |
|---|---|---|---|
| Pascal Series | VOC, Clipart, Watercolor, Comic | 3–20 | Pascal VOC, Cross-Domain Det |
| Diverse Weather Series | Daytime Sunny, Night Sunny, Daytime Foggy | 7 | BDD-100k, FoggyCityscapes, Adverse-Weather |
Main Results: Two-Stage and Multi-Stage Results (Table 2)¶
| Method | Base Detector | Pascal 2-Stage RAI | Pascal 4-Stage RAI | Weather 2-Stage RAI | Weather 3-Stage RAI |
|---|---|---|---|---|---|
| Sequential FT | YOLO11n | 6.81% | 5.53% | 22.94% | 15.26% |
| LwF | YOLO11n | 53.19% | 34.84% | 38.88% | 25.86% |
| ERD | YOLO11n | 56.17% | 47.95% | 59.92% | 42.00% |
| LDB | ViTDet | 42.83% | 52.83% | 11.76% | 27.96% |
| CL-DETR | Def. DETR | 54.51% | 54.18% | 57.09% | 53.86% |
| DuET | YOLO11n | 65.99% | 65.95% | 72.51% | 65.25% |
Key Findings:
- DuET substantially outperforms all baselines across all experiments: +13.12% RAI on Pascal multi-stage and +11.39% RAI on Weather multi-stage.
- DuET maintains Avg RI at 87–89%, indicating minimal forgetting.
- DuET has only 2.58M parameters, far fewer than CL-DETR (39.85M) and LDB (110.52M).
Cross-Detector Generalization (Table 3) — Weather Two-Stage¶
| Base Detector | # Params | GFLOPs | Avg RI | Avg GI | RAI |
|---|---|---|---|---|---|
| ViTDet | 110.52M | 1829.6 | 27.55% | 28.22% | 27.89% |
| Deformable DETR | 39.85M | 11.8 | 84.45% | 33.45% | 58.95% |
| RT-DETR-l | 32.00M | 103.4 | 47.73% | 21.00% | 34.37% |
| RT-DETR-x | 65.49M | 222.5 | 56.39% | 24.15% | 40.27% |
| YOLO11n | 2.58M | 6.3 | 88.06% | 56.95% | 72.51% |
| YOLO11x | 56.84M | 194.4 | 96.88% | 42.41% | 69.18% |
Key Finding: YOLO11n achieves the best RAI with the fewest parameters and lowest computational cost, demonstrating that DuET's task arithmetic strategy is highly compatible with lightweight detectors.
Ablation Study (Table 4) — Pascal Two-Stage, YOLO11n¶
| Configuration | Avg RI | Avg GI | RAI |
|---|---|---|---|
| No Incremental (Static) | 0.5% | 9.13% | 4.82% |
| + Sequential Fine-tuning | 0.75% | 12.86% | 6.81% |
| + Incremental Head | 24.75% | 33.36% | 29.06% |
| + DuET Module | 75.00% | 37.26% | 56.13% |
| + \(\mathcal{L}_{\text{Distill}}^*\) | 87.06% | 37.75% | 62.41% |
| + \(\mathcal{L}_{\text{DC}}\) (Full DuET) | 87.44% | 44.54% | 65.99% |
Key Findings:
- The Incremental Head contributes the largest single RAI jump (+22.25%).
- The DuET Module further substantially improves Avg RI (24.75% → 75.00%).
- \(\mathcal{L}_{\text{DC}}\) particularly boosts Avg GI (+6.79%), effectively improving generalization.
- Every component is indispensable.
Highlights & Insights¶
- Valuable Problem Formulation: DuIOD is more realistic than pure CIOD or DIOD and constitutes an important new research direction.
- Novel Application of Task Arithmetic to Detection: This work is the first to introduce Task Arithmetic into incremental object detection and validates its detector-agnostic applicability.
- Elegant Directional Consistency Loss: Constraining the consistency of consecutive update directions via dot products to mitigate sign conflicts is simple and effective, reducing sign conflicts by 34% on average.
- Lightweight and Efficient: YOLO11n with only 2.58M parameters and 6.3 GFLOPs can serve as a real-time incremental detector, offering strong practical utility.
- Well-Designed Evaluation Metric: RAI jointly captures retention and generalization, providing a more comprehensive assessment than existing forgetting-only metrics.
- Parameter Efficient: No exemplar buffer or generative replay is required; only task vectors and the shared parameter baseline need to be stored.
Limitations & Future Work¶
- Category–Domain Binding Assumption: Each incremental task pairs new categories with a new domain; more complex real-world combinations (new categories in an old domain, the same category across multiple domains, etc.) are not addressed.
- Requires Storing Baseline Weights \(\theta_{s_0}\): Task vector computation depends on the initial pretrained weights, and storage overhead grows linearly with the number of layers.
- Poor Performance on ViTDet: RAI of only 27.89% suggests that DuET's layer-wise fusion strategy may not generalize to all architectures.
- Low Avg GI Overall: Even for DuET, Avg GI peaks at only 56.95%, indicating substantial room for improvement in cross-domain generalization.
- Only 2–4 Stages Evaluated: Performance on longer incremental sequences (10+ tasks) remains unknown.
- \(\mathcal{L}_{\text{DC}}\) Requires at Least Three Tasks: Directional consistency necessitates comparing updates across three consecutive steps, limiting its contribution in two-stage experiments.
Related Work & Insights¶
- Task Arithmetic [Ilharco et al., 2023]: Modifies pretrained models via arithmetic operations on task vectors — DuET extends this paradigm to incremental detection.
- TIES-Merging [Yadav et al., 2023]: Resolves sign conflicts via orthogonal constraints — inspires DuET's \(\mathcal{L}_{\text{DC}}\).
- MagMax [Marczak et al.]: Mitigates forgetting through important parameter selection — DuET's layer-wise p-factor serves as a more dynamic alternative.
- CL-DETR [Liu et al., 2023]: CIOD SOTA using knowledge distillation + exemplar replay — DuET surpasses it without any exemplars.
- LDB [Chen et al., 2024]: DIOD SOTA that learns domain biases — incapable of handling category changes in DuIOD.
Insight: The "vector space" perspective of Task Arithmetic offers a new lens for continual learning — treating the weight differences between new and old task models as directional vectors and balancing stability and plasticity through simple linear combinations, thereby avoiding complex regularization or replay strategies.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | ⭐⭐⭐⭐ | First to define DuIOD + first to introduce Task Arithmetic into incremental detection |
| Technical Depth | ⭐⭐⭐⭐ | Layer-wise dynamic fusion + DC Loss + comprehensive evaluation framework |
| Experimental Thoroughness | ⭐⭐⭐⭐ | 7 experiments + 6 detectors + detailed ablation |
| Practical Value | ⭐⭐⭐⭐⭐ | Detector-agnostic + lightweight + exemplar-free; highly practical |
| Writing Quality | ⭐⭐⭐⭐ | Clear structure with informative figures |
| Overall | ⭐⭐⭐⭐ | Strong problem formulation, concise and effective method, solid experiments |