SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection¶

Conference: CVPR 2025
arXiv: 2412.20047
Code: None
Area: Object Detection
Keywords: Long-Tailed Object Detection, Semi-Supervised Learning, Pseudo-Labeling, Model Transfer, LVIS

TL;DR¶

SimLTD is a simple three-stage framework: pre-train on head classes, transfer to tail classes, then fine-tune on hybrid-sampled data drawn from both. It can optionally add semi-supervised learning on unlabeled images. On the LVIS v1 benchmark, it matches or outperforms existing methods that rely on ImageNet labels.

Background & Motivation¶

Background: Object detection has made tremendous progress on balanced datasets like PASCAL VOC and MS-COCO, but its performance drops significantly on large-vocabulary long-tailed datasets like LVIS. Existing long-tailed detection methods mainly follow two paradigms: multi-stage training (e.g., LST, which requires 7 stages + knowledge distillation) and leveraging external ImageNet labels (e.g., Detic, RichSem, requiring ~14M labeled images) to supplement tail-class samples.

Limitations of Prior Work: (1) Multi-stage methods like LST are overly complex, and multiple knowledge transfer stages easily lead to catastrophic forgetting; (2) Methods relying on ImageNet labels require large-scale labeled auxiliary databases, which is impractical in custom domains like industrial scenarios and aerial remote sensing; (3) For genuinely rare categories (e.g., scarce fish species), collecting more annotated data from the internet is simply impossible.

Key Challenge: The fundamental challenge of long-tailed detection is the scarcity of tail-class samples (as few as one per class), whereas existing solutions are either overly complex or strictly dependent on large-scale supervised auxiliary data.

Goal: To design a simple and general framework that does not rely on additional image-level annotations, while optionally utilizing easily accessible unlabeled images to boost long-tailed detection performance.

Key Insight: Through empirical analysis, the authors find that a strong representation model pre-trained on COCO shows significantly improved performance when transferred to the LVIS tail classes—the stronger the pre-training, the better the transfer. This inspires a direct "head pre-training \(\rightarrow\) tail transfer" paradigm without the extra complexity of meta-learning or knowledge distillation.

Core Idea: Decomposing long-tailed detection into three simple steps—head pre-training for strong representations, tail transfer learning for adapting to rare classes, and hybrid fine-tuning to balance overall performance—with an option to incorporate pseudo-labeling semi-supervised learning for further enhancement.

Method¶

Overall Architecture¶

A long-tailed dataset \(\mathcal{D}_{\text{ltd}}\) (e.g., LVIS) is split by a threshold \(M=10\) into a head set \(\mathcal{D}_{\text{head}}\) (866 classes) and a tail set \(\mathcal{D}_{\text{tail}}\) (337 classes). Training then proceeds in three steps. Step 1 trains a detector on the head set (optionally adding unlabeled data for semi-supervised learning) to obtain strong representations. Step 2 freezes the backbone, re-initializes the detection head, and trains only that head on the tail set. Step 3 samples \(k\) instances per category across all classes to construct \(\mathcal{D}_k\) and fine-tunes the detection head on it to balance head and tail performance. The framework is compatible with various detectors such as Faster R-CNN, Deformable DETR, and DINO.
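The head/tail split can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `split_head_tail`, the `(image_id, category_id)` input format, and the rule "a class is tail when it has at most \(M\) samples" are assumptions made here, since the summary does not say exactly how the count is taken.

```python
from collections import Counter

def split_head_tail(annotations, threshold=10):
    """Partition category ids into head and tail sets by sample count.

    `annotations` is an iterable of (image_id, category_id) pairs.
    Assumed rule: a category is "tail" when it has at most `threshold` samples.
    """
    counts = Counter(cat for _, cat in annotations)
    tail = {c for c, n in counts.items() if n <= threshold}
    head = set(counts) - tail
    return head, tail

# Toy data: category 1 is frequent (50 samples), category 2 is rare (3 samples).
toy = [(i, 1) for i in range(50)] + [(i, 2) for i in range(3)]
head, tail = split_head_tail(toy, threshold=10)
print(sorted(head), sorted(tail))
```

On LVIS v1 a split of this kind would yield the 866 head and 337 tail classes quoted above.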

Key Designs¶

Head Pre-training + Semi-Supervised Enhancement:
- Function: Learn powerful visual representations on data-rich head classes.
- Mechanism: In addition to standard data augmentation, Simple Copy-Paste and Repeat Factor Sampling are used to address class imbalance within the head set. Optionally, unlabeled images are introduced for pseudo-label learning via a student-teacher framework: \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \alpha \mathcal{L}_{\text{pseudo}}\). Three semi-supervised methods are explored: SoftER Teacher, MixTeacher, and MixPL.
- Design Motivation: The head classes have relatively sufficient data, making them an ideal scenario for pre-training representation learning; semi-supervised learning further exploits cheap unlabeled data.
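As an illustration of Repeat Factor Sampling, the sketch below computes per-image repeat factors using the rule from the original LVIS paper: a category with image frequency \(f_c\) gets \(r_c = \max(1, \sqrt{t / f_c})\), and an image takes the largest \(r_c\) among the categories it contains. The function name, the input format, and the threshold value `t` are assumptions chosen for the toy example; the summary does not state the threshold used in SimLTD.

```python
import math
from collections import Counter

def repeat_factors(image_categories, t=0.001):
    """Compute an RFS repeat factor for every image.

    `image_categories` maps image_id -> set of category ids present in it.
    `t` is the frequency threshold (hypothetical default).
    """
    num_images = len(image_categories)
    # Number of images in which each category appears.
    images_per_cat = Counter(c for cats in image_categories.values() for c in cats)
    # Category-level repeat factor: r_c = max(1, sqrt(t / f_c)).
    cat_rf = {
        c: max(1.0, math.sqrt(t / (n / num_images)))
        for c, n in images_per_cat.items()
    }
    # Image-level repeat factor: the maximum over the categories it contains.
    return {
        img: max((cat_rf[c] for c in cats), default=1.0)
        for img, cats in image_categories.items()
    }

# Toy data: "a" appears in all 100 images, "b" only in image 0.
toy_images = {i: {"a"} for i in range(100)}
toy_images[0] = {"a", "b"}
rf = repeat_factors(toy_images, t=0.04)
print(rf[0], rf[1])
```

Images containing a rare category are oversampled (here roughly twice as often), while images with only frequent categories keep a factor of 1.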
Tail Transfer Learning (Frozen Backbone + Re-initialized Detection Head):
- Function: Efficiently transfer representations learned from head classes to tail classes.
- Mechanism: The detection head (classifier and regressor) learned on the head classes is discarded and re-initialized with random weights, then trained exclusively on the tail set while the backbone stays frozen. Semi-supervised pseudo-labeling can optionally be used here as well. To help generate pseudo-labels for rare classes, tail-class instances are randomly copy-pasted onto unlabeled images.
- Design Motivation: Empirical analysis shows that stronger pre-training leads to better transfer, and freezing the backbone avoids overfitting due to scarce tail data.
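The "freeze the backbone, re-initialize the head" step can be sketched without any deep-learning framework. The `Param` class, the `head.` name prefix, and the Gaussian initialization below are all hypothetical stand-ins used only to show the bookkeeping; a real detector would also need its classifier sized for the classes being trained, which this toy does not cover.

```python
import random
from dataclasses import dataclass

@dataclass
class Param:
    values: list            # flattened weights (toy representation)
    trainable: bool = True  # whether the optimizer may update these weights

def prepare_tail_transfer(params, head_prefix="head.", seed=0, std=0.01):
    """Freeze every non-head parameter and randomly re-initialize the head."""
    rng = random.Random(seed)
    for name, p in params.items():
        if name.startswith(head_prefix):
            # Discard head-class weights; start the detection head from scratch.
            p.values = [rng.gauss(0.0, std) for _ in p.values]
            p.trainable = True
        else:
            # Backbone weights are kept as learned in Step 1 and frozen.
            p.trainable = False
    return params

model = {
    "backbone.conv1": Param([1.0, 2.0]),
    "head.cls": Param([0.5, 0.5, 0.5]),
}
prepare_tail_transfer(model)
print({name: p.trainable for name, p in model.items()})
```

In a framework such as PyTorch the same idea is usually expressed by disabling gradients on the backbone parameters and constructing a fresh head module.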
Hybrid Sampling Fine-Tuning (Exemplar Replay):
- Function: Balance head and tail classes, preventing the forgetting of head-class knowledge after transfer.
- Mechanism: A relatively balanced small dataset \(\mathcal{D}_k\) is constructed by sampling \(k\) instances (\(k \leq M\)) per category from the entire long-tailed dataset, which is used for the final fine-tuning of the detection head. If a tail class has fewer than \(k\) instances, all of them are used.
- Design Motivation: Training exclusively on the tail classes in Step 2 causes forgetting of head classes. The hybrid sampling fine-tuning in Step 3 achieves a balance between "remembering the head and learning the tail".
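The construction of \(\mathcal{D}_k\) described above can be sketched as follows. The function name and the `(instance_id, category_id)` input format are assumptions for illustration; the sampling rule (up to \(k\) instances per category, all instances when fewer than \(k\) exist) follows the description in the text.

```python
import random
from collections import defaultdict

def build_balanced_subset(instances, k, seed=0):
    """Sample up to k instances per category.

    `instances` is a list of (instance_id, category_id) pairs.
    Categories with at most k instances contribute all of them.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for inst_id, cat in instances:
        by_cat[cat].append(inst_id)
    subset = {}
    for cat, ids in by_cat.items():
        subset[cat] = list(ids) if len(ids) <= k else rng.sample(ids, k)
    return subset

# Toy data: one head class with 100 instances, one tail class with 3.
toy_instances = [(f"h{i}", "head_cls") for i in range(100)]
toy_instances += [(f"t{i}", "tail_cls") for i in range(3)]
d_k = build_balanced_subset(toy_instances, k=10)
print({cat: len(ids) for cat, ids in d_k.items()})
```

Because \(k \leq M\), head classes are capped at a size comparable to the tail classes, which is what keeps the fine-tuning set roughly balanced.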

Loss & Training¶

The same detection losses (classification + regression) are used in all three stages, with an additional pseudo-label loss during the semi-supervised stage. Data augmentations include resizing, flipping, photometric distortion, and SCP+RFS, while semi-supervised learning additionally uses translation, shearing, rotation, and Cutout.
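A minimal sketch of how the semi-supervised objective \(\mathcal{L} = \mathcal{L}_{\text{sup}} + \alpha \mathcal{L}_{\text{pseudo}}\) combines the two terms is given below. The confidence-threshold filter on teacher predictions is a common practice in teacher-student detectors and is an assumption here, as are the function names, the threshold value, and the dictionary format of the predictions.

```python
def keep_confident(teacher_predictions, score_threshold=0.5):
    """Keep only teacher predictions confident enough to act as pseudo-labels.

    The threshold value is a hypothetical choice for illustration.
    """
    return [p for p in teacher_predictions if p["score"] >= score_threshold]

def combined_loss(loss_sup, loss_pseudo, alpha=1.0):
    """L = L_sup + alpha * L_pseudo, with alpha weighting the unlabeled term."""
    return loss_sup + alpha * loss_pseudo

# Toy teacher output on one unlabeled image.
teacher_out = [
    {"label": "fish", "score": 0.9},
    {"label": "cat", "score": 0.2},
]
pseudo_labels = keep_confident(teacher_out, score_threshold=0.5)
loss = combined_loss(loss_sup=1.0, loss_pseudo=0.5, alpha=2.0)
print(len(pseudo_labels), loss)
```

In the purely supervised setting the pseudo-label term is simply absent, which corresponds to `alpha=0` in this sketch.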

Key Experimental Results¶

Main Results¶

| Method | External Data | Backbone | mAP | AP_r↑ | AP_c | AP_f |
|---|---|---|---|---|---|---|
| Seesaw Loss | None | R101-FPN | 27.8 | 18.7 | 27.0 | 32.8 |
| RichSem | ImageNet-21K | R50 | 35.1 | 26.0 | 32.6 | 41.8 |
| SimLTD (Supervised) | None | R50 | 35.0 | 32.0 | 34.0 | 37.5 |
| RichSem | ImageNet-21K | Swin-B | 46.4 | 38.5 | 45.1 | 51.3 |
| SimLTD (Supervised) | None | Swin-B | 47.2 | 42.7 | 46.7 | 49.9 |
| Detic | ImageNet-21K | R50 | 32.5 | 26.2 | 31.3 | 36.6 |
| SimLTD (Semi-Supervised) | COCO-unlabeled | R50 | 39.4 | 32.6 | 38.5 | 43.6 |

Ablation Study¶

| Configuration | mAP | AP_r↑ | Description |
|---|---|---|---|
| Head Pre-training \(\rightarrow\) Direct Full-set Fine-tuning | 31.2 | 21.5 | No transfer learning |
| Head Pre-training \(\rightarrow\) Tail Transfer \(\rightarrow\) No Fine-tuning | 15.3 | 25.1 | Severe forgetting of head classes |
| Full SimLTD | 35.0 | 32.0 | Complete three-stage pipeline |
| SimLTD + SoftER Teacher | 30.3 | 23.3 | Faster R-CNN |
| SimLTD + MixTeacher | 31.8 | 23.4 | Faster R-CNN |
| SimLTD + MixPL | 39.4 | 32.6 | DINO, R50 |

Key Findings¶

- Supervised SimLTD uses no external labeled data, yet its tail-class AP_r (32.0) already outperforms RichSem (26.0), which relies on ImageNet-21K labels, while reaching a comparable mAP on R50 (35.0 vs 35.1).
- On Swin-B, SimLTD's AP_r (42.7) clearly outperforms RichSem (38.5), showing that the three-stage strategy scales with a stronger backbone.
- Using only COCO-unlabeled2017 (~120k unlabeled images), semi-supervised SimLTD improves the mAP of DINO/R50 from 35.0 to 39.4.
- Copy-pasting tail-class instances onto unlabeled images is a particularly important augmentation for semi-supervised learning.

Highlights & Insights¶

- Overcoming Complexity with Simplicity: Three simple steps, with no meta-learning or knowledge distillation, outperform more complex methods that require 7-stage training or 14M extra labels. This simplicity is the method's main appeal.
- Replacing Labeled Auxiliary Data with Unlabeled Data: Pseudo-label-based semi-supervised learning on unlabeled images can replace, and even outperform, schemes that depend on large labeled databases like ImageNet-21K, which makes the approach far more practical.
- Empirical Discovery of Transfer Learning: A stronger pre-trained model transfers better to the tail classes; this simple yet important finding provides a clear pathway for improving long-tailed learning.

Limitations & Future Work¶

- The three stages require separate training, and the total computational cost remains substantial (~460 GPU hours for Swin-L).
- Freezing the backbone during tail transfer may limit the backbone's ability to adapt to tail-class features.
- The choice of \(k\) (the number of samples per class in hybrid sampling) may require tuning for different datasets.
- Future work could explore unifying the three stages into an end-to-end curriculum learning scheme.

- vs. LST: LST requires 7 stages plus knowledge distillation, whereas SimLTD needs only 3 steps and no distillation, making it simpler and more effective.
- vs. Detic/RichSem: These methods strictly depend on ImageNet-21K labeled data, whereas SimLTD achieves comparable or even superior performance using unlabeled data and pseudo-labels, which gives it broader applicability.
- The generality of the framework (supporting both CNN and Transformer detectors) allows it to serve as a strong baseline for long-tailed detection.

Rating¶

Novelty: ⭐⭐⭐ The method itself is a simple combination of existing techniques, but the combination is highly effective.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Various backbones, multiple detectors, supervised/semi-supervised comparisons, and detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear logic and well-defined motivation.
Value: ⭐⭐⭐⭐ Provides a simple and efficient baseline for long-tailed detection, offering valuable reference for practical applications.