SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection

Conference: CVPR 2025
arXiv: 2412.20047
Code: None
Area: Object Detection
Keywords: Long-Tailed Object Detection, Semi-Supervised Learning, Pseudo-Labeling, Model Transfer, LVIS

TL;DR

SimLTD proposes a simple three-stage framework: pre-training on head classes, transferring to tail classes, and fine-tuning on hybrid-sampled data, optionally combined with semi-supervised learning on unlabeled images. On the LVIS v1 benchmark, it matches or outperforms existing methods that rely on ImageNet labels, with the largest gains on rare classes.

Background & Motivation

Background: Object detection has made tremendous progress on balanced datasets like PASCAL VOC and MS-COCO, but its performance drops significantly on large-vocabulary long-tailed datasets like LVIS. Existing long-tailed detection methods mainly follow two paradigms: multi-stage training (e.g., LST, which requires 7 stages + knowledge distillation) and leveraging external ImageNet labels (e.g., Detic, RichSem, requiring ~14M labeled images) to supplement tail-class samples.

Limitations of Prior Work: (1) Multi-stage methods like LST are overly complex, and multiple knowledge transfer stages easily lead to catastrophic forgetting; (2) Methods relying on ImageNet labels require large-scale labeled auxiliary databases, which is impractical in custom domains like industrial scenarios and aerial remote sensing; (3) For genuinely rare categories (e.g., scarce fish species), collecting more annotated data from the internet is simply impossible.

Key Challenge: The fundamental challenge of long-tailed detection lies in the scarcity of tail-class samples (down to just 1), whereas existing solutions are either overly complex or strictly dependent on large-scale supervised auxiliary data.

Goal: To design a simple and general framework that does not rely on additional image-level annotations, while optionally utilizing easily accessible unlabeled images to boost long-tailed detection performance.

Key Insight: Through empirical analysis, the authors find that a strong representation model pre-trained on COCO shows significantly improved performance when transferred to the LVIS tail classes—the stronger the pre-training, the better the transfer. This inspires a direct "head pre-training \(\rightarrow\) tail transfer" paradigm without the extra complexity of meta-learning or knowledge distillation.

Core Idea: Decomposing long-tailed detection into three simple steps—head pre-training for strong representations, tail transfer learning for adapting to rare classes, and hybrid fine-tuning to balance overall performance—with an option to incorporate pseudo-labeling semi-supervised learning for further enhancement.

Method

Overall Architecture

Given a long-tailed dataset \(\mathcal{D}_{\text{ltd}}\) (e.g., LVIS), it is divided into a head set \(\mathcal{D}_{\text{head}}\) (866 classes) and a tail set \(\mathcal{D}_{\text{tail}}\) (337 classes) using a threshold \(M=10\). The three-step workflow is as follows: Step 1 trains a detector on the head set (optionally including unlabeled data for semi-supervised learning) to obtain strong representations; Step 2 freezes the backbone and only re-initializes and fine-tunes the detection head on the tail set for transfer learning; Step 3 samples \(k\) instances from each category across all classes to construct \(\mathcal{D}_k\), fine-tuning the detection head to balance head and tail performance. It is compatible with various detectors such as Faster R-CNN, Deformable DETR, and DINO.
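The head/tail split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `annotations` structure and the function name are hypothetical, and the counting unit is an assumption (a class is treated as "tail" when it appears in at most \(M\) distinct training images, since the summary gives \(M=10\) without stating what is counted).

```python
def split_head_tail(annotations, threshold=10):
    """Split category ids into head and tail sets.

    `annotations` is a hypothetical list of (image_id, category_id) pairs.
    Assumption: a class is "tail" if it appears in at most `threshold`
    distinct images; everything else is "head".
    """
    images_per_class = {}
    for image_id, category_id in annotations:
        images_per_class.setdefault(category_id, set()).add(image_id)

    head = sorted(c for c, imgs in images_per_class.items() if len(imgs) > threshold)
    tail = sorted(c for c, imgs in images_per_class.items() if len(imgs) <= threshold)
    return head, tail


# Toy data: "person" appears in 25 images, "puffin" in only 3.
annotations = [(i, "person") for i in range(25)] + [(i, "puffin") for i in range(3)]
head, tail = split_head_tail(annotations, threshold=10)
```

In this toy example, `person` lands in the head set and `puffin` in the tail set; on LVIS the same rule would be applied to all 1,203 categories.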

Key Designs

  1. Head Pre-training + Semi-Supervised Enhancement:

    • Function: Learn powerful visual representations on data-rich head classes.
    • Mechanism: In addition to standard data augmentation, Simple Copy-Paste and Repeat Factor Sampling are used to address class imbalance within the head set. Optionally, unlabeled images are introduced for pseudo-label learning via a student-teacher framework: \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \alpha \mathcal{L}_{\text{pseudo}}\). Three semi-supervised methods are explored: SoftER Teacher, MixTeacher, and MixPL.
    • Design Motivation: The head classes have relatively sufficient data, making them an ideal scenario for pre-training representation learning; semi-supervised learning further exploits cheap unlabeled data.
  2. Tail Transfer Learning (Frozen Backbone + Re-initialized Detection Head):

    • Function: Efficiently transfer representations learned from head classes to tail classes.
    • Mechanism: The detection head (classifier and regressor) learned on the head classes is discarded and re-initialized with random weights, then trained exclusively on the tail set while the backbone stays frozen. Semi-supervised pseudo-labeling can optionally be used here as well. To help generate pseudo-labels for rare classes, tail-class instances are randomly copy-pasted onto the unlabeled images.
    • Design Motivation: Empirical analysis shows that stronger pre-training leads to better transfer, and freezing the backbone avoids overfitting due to scarce tail data.
  3. Hybrid Sampling Fine-Tuning (Exemplar Replay):

    • Function: Balance head and tail classes, preventing the forgetting of head-class knowledge after transfer.
    • Mechanism: A relatively balanced small dataset \(\mathcal{D}_k\) is constructed by sampling \(k\) instances (\(k \leq M\)) per category from the entire long-tailed dataset, which is used for the final fine-tuning of the detection head. If a tail class has fewer than \(k\) instances, all of them are used.
    • Design Motivation: Training exclusively on the tail classes in Step 2 causes forgetting of head classes. The hybrid sampling fine-tuning in Step 3 achieves a balance between "remembering the head and learning the tail".
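The hybrid sampling in Step 3 can be illustrated with a short sketch. It assumes a hypothetical `instances_by_class` mapping from category to a list of instance ids; the function name and the fixed random seed are illustrative choices, not details from the paper.

```python
import random


def build_balanced_subset(instances_by_class, k, seed=0):
    """Build D_k by sampling up to k instances per category.

    Classes with fewer than k instances contribute all of them,
    matching the rule described for rare tail classes.
    """
    rng = random.Random(seed)
    subset = {}
    for category, instances in instances_by_class.items():
        if len(instances) <= k:
            subset[category] = list(instances)  # keep every instance
        else:
            subset[category] = rng.sample(instances, k)  # sample without replacement
    return subset


# Toy data: a frequent class with 1000 instances and a rare class with 3.
instances_by_class = {"person": list(range(1000)), "puffin": [7, 8, 9]}
d_k = build_balanced_subset(instances_by_class, k=10)
```

Here the frequent class is cut down to 10 instances while the rare class keeps all 3, so the resulting set is far more balanced than the original distribution.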

Loss & Training

The same detection losses (classification + regression) are used in all three stages, with an additional pseudo-label loss during the semi-supervised stage. Data augmentations include resizing, flipping, photometric distortion, and Simple Copy-Paste with Repeat Factor Sampling (SCP+RFS), while semi-supervised learning additionally uses translation, shearing, rotation, and Cutout.
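As a rough sketch of how the semi-supervised objective \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \alpha \mathcal{L}_{\text{pseudo}}\) might be assembled, the snippet below filters teacher predictions by confidence and then combines the two loss terms. The score threshold, the prediction dictionary layout, and both function names are assumptions made for illustration; the summary does not specify how SoftER Teacher, MixTeacher, or MixPL select pseudo-labels.

```python
def filter_pseudo_labels(teacher_predictions, score_threshold=0.7):
    """Keep teacher predictions confident enough to act as pseudo-labels.

    Assumption: each prediction is a dict with a "score" key, and a fixed
    threshold is used; real methods may use softer or adaptive criteria.
    """
    return [p for p in teacher_predictions if p["score"] >= score_threshold]


def total_loss(loss_sup, loss_pseudo, alpha=1.0):
    """Combine supervised and pseudo-label losses: L = L_sup + alpha * L_pseudo."""
    return loss_sup + alpha * loss_pseudo


# Toy teacher output on one unlabeled image.
teacher_predictions = [
    {"label": "puffin", "score": 0.91},
    {"label": "person", "score": 0.35},
]
pseudo_labels = filter_pseudo_labels(teacher_predictions)
loss = total_loss(loss_sup=1.0, loss_pseudo=0.5, alpha=2.0)
```

With \(\alpha = 2\), a supervised loss of 1.0 and a pseudo-label loss of 0.5 give a total of 2.0; in the purely supervised setting the second term is simply absent.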

Key Experimental Results

Main Results

| Method | External Data | Backbone | mAP | AP_r↑ | AP_c | AP_f |
|---|---|---|---|---|---|---|
| Seesaw Loss | None | R101-FPN | 27.8 | 18.7 | 27.0 | 32.8 |
| RichSem | ImageNet-21K | R50 | 35.1 | 26.0 | 32.6 | 41.8 |
| SimLTD (Supervised) | None | R50 | 35.0 | 32.0 | 34.0 | 37.5 |
| RichSem | ImageNet-21K | Swin-B | 46.4 | 38.5 | 45.1 | 51.3 |
| SimLTD (Supervised) | None | Swin-B | 47.2 | 42.7 | 46.7 | 49.9 |
| Detic | ImageNet-21K | R50 | 32.5 | 26.2 | 31.3 | 36.6 |
| SimLTD (Semi-Supervised) | COCO-unlabeled | R50 | 39.4 | 32.6 | 38.5 | 43.6 |

Ablation Study

| Configuration | mAP | AP_r↑ | Description |
|---|---|---|---|
| Head Pre-training \(\rightarrow\) Direct Full-set Fine-tuning | 31.2 | 21.5 | No transfer learning |
| Head Pre-training \(\rightarrow\) Tail Transfer \(\rightarrow\) No Fine-tuning | 15.3 | 25.1 | Severe forgetting of head classes |
| Full SimLTD | 35.0 | 32.0 | Complete three-stage pipeline |
| SimLTD + SoftER Teacher | 30.3 | 23.3 | Faster R-CNN |
| SimLTD + MixTeacher | 31.8 | 23.4 | Faster R-CNN |
| SimLTD + MixPL | 39.4 | 32.6 | DINO, R50 |

Key Findings

  • Supervised SimLTD uses no external labeled data, yet its tail-class AP_r (32.0) already exceeds that of RichSem (26.0), which relies on ImageNet-21K labels, while reaching a comparable mAP on R50 (35.0 vs 35.1).
  • On Swin-B, SimLTD's AP_r (42.7) significantly outperforms RichSem (38.5), demonstrating the scalability of a stronger backbone paired with the three-stage strategy.
  • Using only COCO-unlabeled2017 (~120k unlabeled images), the semi-supervised SimLTD improves the mAP of DINO/R50 from 35.0 to 39.4.
  • The augmentation strategy of copy-pasting tail-class instances onto unlabeled images is particularly crucial for semi-supervised learning.
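The gaps quoted in these findings can be checked with a few lines of arithmetic. The numbers below are copied from the Main Results table; the dictionary keys and the `gain` helper are just illustrative names.

```python
# (mAP, AP_r) pairs copied from the Main Results table.
rows = {
    "RichSem R50": (35.1, 26.0),
    "SimLTD R50 (supervised)": (35.0, 32.0),
    "RichSem Swin-B": (46.4, 38.5),
    "SimLTD Swin-B (supervised)": (47.2, 42.7),
    "SimLTD R50 (semi-supervised)": (39.4, 32.6),
}
MAP, AP_R = 0, 1  # column indices into each tuple


def gain(method, baseline, metric):
    """Difference in the chosen metric, rounded to one decimal place."""
    return round(rows[method][metric] - rows[baseline][metric], 1)


ap_r_gain_r50 = gain("SimLTD R50 (supervised)", "RichSem R50", AP_R)
ap_r_gain_swinb = gain("SimLTD Swin-B (supervised)", "RichSem Swin-B", AP_R)
map_gain_semi = gain("SimLTD R50 (semi-supervised)", "SimLTD R50 (supervised)", MAP)
```

This gives an AP_r gain of 6.0 points over RichSem on R50, 4.2 points on Swin-B, and an mAP gain of 4.4 points from adding unlabeled data on R50.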

Highlights & Insights

  • Overcoming Complexity with Simplicity: With just three simple steps and without the need for meta-learning or knowledge distillation, it significantly outperforms complex methods that require 7-stage training or 14M extra labels. The simplicity of the method is highly appealing.
  • Replacing Labeled Auxiliary Data with Unlabeled Data: It is demonstrated that leveraging unlabeled images via pseudo-labeling semi-supervised learning can replace or even outperform schemes depending on large labeled databases like ImageNet-21K, greatly boosting practical feasibility.
  • Empirical Discovery of Transfer Learning: A stronger pre-trained model leads to a better tail transfer effect; this simple yet important finding provides a clear pathway for improving long-tailed learning.

Limitations & Future Work

  • The three stages require separate training, and the total computational cost remains substantial (~460 GPU hours for Swin-L).
  • Freezing the backbone during tail transfer may limit the backbone's ability to adapt to tail-class features.
  • The choice of \(k\) (the number of samples per class in hybrid sampling) may require tuning for different datasets.
  • Future work could explore unifying the three stages into an end-to-end curriculum learning scheme.

Comparison with Related Work

  • vs LST: LST requires 7 stages plus knowledge distillation, whereas SimLTD needs only 3 steps and no distillation, making it both simpler and more effective.
  • vs Detic/RichSem: These methods depend on ImageNet-21K labeled data, whereas SimLTD achieves comparable or better performance using only unlabeled data and pseudo-labels, which makes it more broadly applicable.
  • The generality of the framework (supporting both CNN and Transformer detectors) allows it to serve as a strong baseline for long-tailed detection.

Rating

  • Novelty: ⭐⭐⭐ The method itself is a simple combination of existing techniques, but the combination is highly effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Various backbones, multiple detectors, supervised/semi-supervised comparisons, and detailed ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic and well-defined motivation.
  • Value: ⭐⭐⭐⭐ Provides a simple and efficient baseline for long-tailed detection, offering valuable reference for practical applications.