Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Conference: ECCV 2024
arXiv: 2407.15296
Code: None provided
Area: Object Detection / Vision-Language
Keywords: compositional understanding, language-based detection, synthetic data generation, contrastive learning, weak-to-strong
TL;DR
Proposes the WSCL framework: leveraging LLMs to generate diverse text descriptions, diffusion models to generate corresponding images, and a weak detector to decompose phrases and generate pseudo bounding boxes, constructing dense synthetic triplets (image, description, bbox). Together with compositional contrastive learning, it significantly improves language-guided object detection performance, achieving a +5.0 AP improvement for GLIP-T on OmniLabel.
Background & Motivation
Background: Vision-Language (VL) models like CLIP and GLIP have achieved remarkable progress in open-vocabulary detection. However, their compositional understanding of complex linguistic descriptions remains limited—models often behave like a "bag-of-words," failing to distinguish fine-grained semantics such as attributes and relations.
Limitations of Prior Work: Prior methods (e.g., DesCo) only perform augmentation in the text domain (generating negative examples through noun replacement), which has limited effectiveness. Manually annotating dense (image, description, bounding box) triplets is costly and hard to scale.
Key Challenge: Existing detectors perform well in detecting simple class names but often ignore modifiers and detect all objects corresponding to relevant nouns when facing complex descriptions such as "a golden retriever sitting beside a red fire hydrant."
Goal: Leverage the compositional understanding capability of generative models to enhance the compositional understanding of discriminative VL models, achieving a "weak-to-strong" capability transfer.
Key Insight: A reversed pipeline—first utilizing an LLM to generate diverse descriptions, then using a diffusion model to generate images, and finally using a weak detector to automatically generate high-quality pseudo bounding boxes through task decomposition.
Core Idea: Generative foundation models possess powerful compositional understanding capabilities, which can be distilled into discriminative detectors through synthetic dense triplets and compositional contrastive learning.
Method
Overall Architecture
The method consists of two steps: (1) Dense Synthetic Triplet Generation—using an LLM to generate diverse descriptions, a diffusion model to generate corresponding images, and a weak detector to generate pseudo bounding boxes; (2) Compositional Contrastive Learning—designing description-aware and structural-aware contrastive learning objectives to effectively learn compositional understanding capabilities from synthetic triplets.
Key Designs
Diverse Object Description Generation (LLM-based)
- Function: Generate diverse text descriptions for each visual entity category.
- Mechanism: Prompt the LLM (ChatGPT-3.5-Turbo): "Please list \(\{ND\}\) plausible visual object descriptions for \(\{class\}\) that are around \(\{NW\}\) words in length. Consider incorporating diverse visual attributes, actions, and spatial or semantic relations with other objects."
- Design Motivation: Achieve scalable dense description coverage by controlling the entity pool size, the number of descriptions per category \(ND\), and the description length \(NW\). By default, 365 categories from Objects365 are used, with 20 descriptions per category.
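The prompt construction above can be sketched as a small template function. This is only an illustration: the function name `build_prompts` and the default `nw=10` are assumptions (the source does not state a default description length), and the actual LLM call is left out.

```python
# Prompt template copied from the mechanism described above.
PROMPT_TEMPLATE = (
    "Please list {nd} plausible visual object descriptions for {cls} "
    "that are around {nw} words in length. Consider incorporating diverse "
    "visual attributes, actions, and spatial or semantic relations with "
    "other objects."
)


def build_prompts(classes, nd=20, nw=10):
    """Return one prompt per class name.

    nd mirrors ND (descriptions per category, 20 by default in the text).
    nw mirrors NW (target description length); 10 is an assumed value.
    """
    return {cls: PROMPT_TEMPLATE.format(nd=nd, cls=cls, nw=nw) for cls in classes}


prompts = build_prompts(["avocado", "fire hydrant"])
print(prompts["avocado"])
```

Scaling the description pool then amounts to changing the class list, `nd`, or `nw`.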
Diffusion Model Based Dense Paired Image Generation
- Function: Generate multiple corresponding images for each description.
- Mechanism: Perform conditional generation using the PixArt diffusion model, generating 8 image variations for each description using different random seeds.
- Design Motivation: Unlike previous approaches that use simple prompts (e.g., "a photo of [NAME]"), directly conditioning on complex descriptions introduces diversity. This yields 58,400 synthetic triplets in total.
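As a quick check on the 58,400 figure, the generation workload can be enumerated as one job per (description, seed) pair. The sketch below uses placeholder descriptions in place of real LLM output; `generation_jobs` is a hypothetical helper, and the diffusion call itself is omitted.

```python
from itertools import product


def generation_jobs(descriptions_by_class, seeds):
    """Yield one (class, description, seed) job per image to generate."""
    for cls, descriptions in descriptions_by_class.items():
        for description, seed in product(descriptions, seeds):
            yield cls, description, seed


# Placeholder pool: 365 categories with 20 descriptions each.
toy_pool = {
    f"class_{i}": [f"description {j} of class_{i}" for j in range(20)]
    for i in range(365)
}

jobs = list(generation_jobs(toy_pool, seeds=range(8)))
print(len(jobs))  # 365 * 20 * 8 = 58400
```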
Weak-to-Strong Pseudo Bounding Box Generation
- Function: Address the challenge where a weak detector fails to accurately locate objects in complex descriptions.
- Mechanism: Decompose the complex phrase grounding task into multiple simple detection tasks. Specifically: (1) Use an NLP parser to extract all noun phrases; (2) Use each noun phrase as an independent detection query; (3) Filter low-confidence predictions based on a threshold \(p\); (4) Re-associate the results with the original description.
- Design Motivation: Two key observations—the weak detector has higher detection precision on positive sample texts (AP-dP > AP-d), and it performs better on short descriptions than on complex ones (AP-dS > AP-dL). Sub-task decomposition fully exploits the capabilities of the weak detector on simple tasks.
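The four decomposition steps can be sketched as follows. The parser and the weak detector are passed in as callables because the source does not name specific implementations; `pseudo_boxes`, the toy stubs, and the threshold value `p=0.5` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[Box, float]  # (box, confidence score)


def pseudo_boxes(
    image,
    description: str,
    extract_noun_phrases: Callable[[str], List[str]],
    detect: Callable[[object, str], List[Detection]],
    p: float = 0.5,
) -> Dict[str, object]:
    """Decompose a description into noun-phrase queries and keep confident boxes."""
    boxes_by_phrase: Dict[str, List[Box]] = {}
    # Steps 1-2: each noun phrase becomes an independent detection query.
    for phrase in extract_noun_phrases(description):
        # Step 3: drop predictions whose confidence is below the threshold p.
        kept = [box for box, score in detect(image, phrase) if score >= p]
        if kept:
            boxes_by_phrase[phrase] = kept
    # Step 4: re-associate the surviving boxes with the original description.
    return {"description": description, "boxes": boxes_by_phrase}


# Toy stand-ins for the NLP parser and the weak detector.
def toy_parser(text: str) -> List[str]:
    return ["a golden retriever", "a red fire hydrant"]


def toy_detector(image, query: str) -> List[Detection]:
    table = {
        "a golden retriever": [((10, 20, 200, 220), 0.91), ((5, 5, 50, 50), 0.20)],
        "a red fire hydrant": [((230, 90, 300, 240), 0.34)],
    }
    return table.get(query, [])


result = pseudo_boxes(
    None,
    "a golden retriever sitting beside a red fire hydrant",
    toy_parser,
    toy_detector,
)
print(result["boxes"])
```

In this toy run the hydrant prediction falls below the threshold, so only the retriever box is kept.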
Description-Aware Contrastive Learning
- Function: Direct the detector to focus on the specific content of a given description, rather than merely detecting all mentioned entities.
- Mechanism: Select intra-class negatives from the description pool of the same category and concatenate them into the input query \(Q\). Train the model to yield positive detections only for matching descriptions while ignoring entities corresponding to negative descriptions.
- Design Motivation: Force the model to distinguish between similar yet different descriptions (e.g., different descriptions for "avocado"), thereby learning the fine-grained semantic differences in descriptions.
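A minimal sketch of how such a query might be assembled: sample intra-class negatives from the same category's description pool and concatenate them with the positive. The function name, the number of negatives, and the `". "` separator are assumptions, not details confirmed by the source.

```python
import random


def build_description_query(positive, pool, num_negatives=3, rng=None):
    """Return (query, entries); entries are (description, label) with label 1 for the positive."""
    if rng is None:
        rng = random.Random(0)
    # Intra-class negatives: other descriptions of the same category.
    candidates = [d for d in pool if d != positive]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    entries = [(positive, 1)] + [(d, 0) for d in negatives]
    # Shuffle so the positive is not always first in the query.
    rng.shuffle(entries)
    query = ". ".join(text for text, _ in entries)
    return query, entries


# Hypothetical description pool for a single category.
avocado_pool = [
    "a halved avocado with a large brown pit",
    "a whole avocado with dark bumpy skin",
    "sliced avocado fanned out on toast",
    "a mashed avocado in a stone bowl",
    "an unripe green avocado hanging from a branch",
]

query, entries = build_description_query(avocado_pool[0], avocado_pool)
print(query)
print(entries)
```

Only boxes for the entry labeled 1 would be treated as positives; the other descriptions act as negatives for the same image.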
Textual-Structural-Aware Contrastive Learning
- Function: Enable the detector to understand structural relationships between subject and non-subject entities in descriptions.
- Mechanism: (1) Identify subject and non-subject noun phrases in descriptions using a textual relation parser; (2) Structural Negative: Non-subject entities (e.g., the cutting board in "... lying on a cutting board") should not be detected as positive matches for the full description; (3) Structural Positive: Separately add the non-subject noun phrase (e.g., "A cutting board") as a positive query to the input to ensure the detector can still recognize the object; (4) Sentence-Level Positive: Positively associate the entire description sentence with the bounding box of the subject entity.
- Design Motivation: Prevent the model from treating all occurring noun phrases identically, encouraging it to distinguish identical phrases based on structural roles (acting as the subject vs. as a modifier).
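The three kinds of structural targets can be illustrated with a small helper that emits (query, box, label) records. This is a sketch under assumptions: `structural_targets`, the example sentence, and the box coordinates are hypothetical, and the subject/non-subject split is taken as given rather than parsed.

```python
def structural_targets(description, subject, non_subjects, boxes):
    """Return (query, box, label) records; label 1 means match, 0 means no match."""
    records = []
    # Sentence-level positive: the full description matches the subject's boxes.
    for box in boxes.get(subject, []):
        records.append((description, box, 1))
    for phrase in non_subjects:
        for box in boxes.get(phrase, []):
            # Structural negative: a non-subject box is not a match for the description.
            records.append((description, box, 0))
            # Structural positive: the same box matches the phrase queried on its own.
            records.append((phrase, box, 1))
    return records


description = "an avocado lying on a cutting board"
boxes = {
    "an avocado": [(40, 60, 120, 140)],
    "a cutting board": [(10, 50, 300, 200)],
}
records = structural_targets(description, "an avocado", ["a cutting board"], boxes)
for record in records:
    print(record)
```

The same cutting-board box appears twice with opposite labels, which is the point: its label depends on whether the phrase acts as a modifier inside the description or as a standalone query.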
Domain Shift Mitigation Strategy
- Function: Prevent the detector from overfitting to the synthetic image distribution.
- Mechanism: (1) Freeze the visual backbone and only train the cross-modal alignment layers; (2) Add real detection data (Objects365) as regularization.
- Design Motivation: Synthetic images inevitably contain artifacts. Freezing the visual representation prevents learning bias from the synthetic domain.
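The freezing step can be sketched as a partition of parameter names by prefix. The parameter names and the `visual_backbone.` prefix below are hypothetical; real GLIP or FIBER checkpoints use their own naming.

```python
def split_by_prefix(param_names, frozen_prefixes=("visual_backbone.",)):
    """Partition parameter names into (trainable, frozen) by name prefix."""
    prefixes = tuple(frozen_prefixes)
    frozen = [name for name in param_names if name.startswith(prefixes)]
    trainable = [name for name in param_names if not name.startswith(prefixes)]
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
names = [
    "visual_backbone.stage1.conv.weight",
    "visual_backbone.stage2.conv.weight",
    "cross_modal.layer0.attn.weight",
    "cross_modal.layer1.attn.weight",
]

trainable, frozen = split_by_prefix(names)
print(trainable)
print(frozen)
```

In a PyTorch training script, the frozen names would typically correspond to parameters whose `requires_grad` is set to `False`, so that only the cross-modal layers receive gradient updates.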
Loss & Training
- Region-word alignment loss \(\mathcal{L}(S_{\text{ground}}, T) + \mathcal{L}_{loc}\) based on GLIP/FIBER.
- Combined with the intra-class negative contrastive loss and the structural negative/positive contrastive losses.
- Mixed training: synthetic triplets + Objects365 detection data.
- Visual backbone is frozen.
Key Experimental Results
Main Results
| Model | Method | OmniLabel AP | OmniLabel AP-dL | D3 Full |
|---|---|---|---|---|
| GLIP-T | baseline | 19.3 | 8.2 | 19.1 |
| GLIP-T | +Ours | 24.3 (+5.0) | 16.4 (2×) | 26.0 (+6.9) |
| FIBER-B | baseline | 25.7 | 12.4 | 22.7 |
| FIBER-B | +Ours | 30.5 (+4.8) | 21.3 | 26.5 |
| DesCo-GLIP | baseline | 23.8 | 13.7 | 24.2 |
| DesCo-GLIP | +Ours | 26.5 (+2.7) | 18.7 | 29.3 |
Ablation Study
| Configuration | AP | AP-d | AP-dL | Description |
|---|---|---|---|---|
| FIBER-B baseline | 25.7 | 22.3 | 12.4 | Original model |
| + Gen-only (Naive fine-tuning) | 25.5 | 23.7 | 12.4 | Severe drop in AP-c |
| + Det data regularization | 26.3 | 23.3 | 11.5 | Alleviates domain shift |
| + Freeze visual backbone | 26.8 | 23.4 | 11.8 | Maintains localization capability |
| + Intra-neg contrastive | 29.0 | 27.4 | 14.9 | Significant improvement in description awareness |
| + Struct-neg | 29.0 | 27.3 | 16.2 | Structural negative fine-tuning |
| + Struct-pos (Full) | 30.5 | 29.5 | 21.3 | AP-dL gain of 5.1 over Struct-neg on long queries |
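To make the per-component contributions easier to read, the row-to-row differences in the ablation table can be computed directly. The values below are copied from the table; `deltas` is just a small helper written for this note.

```python
# (configuration, AP, AP-d, AP-dL), copied from the ablation table above.
rows = [
    ("FIBER-B baseline", 25.7, 22.3, 12.4),
    ("+ Gen-only", 25.5, 23.7, 12.4),
    ("+ Det data regularization", 26.3, 23.3, 11.5),
    ("+ Freeze visual backbone", 26.8, 23.4, 11.8),
    ("+ Intra-neg contrastive", 29.0, 27.4, 14.9),
    ("+ Struct-neg", 29.0, 27.3, 16.2),
    ("+ Struct-pos (Full)", 30.5, 29.5, 21.3),
]


def deltas(rows):
    """Return each row's change in (AP, AP-d, AP-dL) relative to the previous row."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        diff = tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
        out.append((cur[0],) + diff)
    return out


for name, d_ap, d_apd, d_apdl in deltas(rows):
    print(f"{name:28s} AP {d_ap:+.1f}  AP-d {d_apd:+.1f}  AP-dL {d_apdl:+.1f}")

# Total AP-dL gain of the full model over the baseline.
print(round(rows[-1][3] - rows[0][3], 1))  # 8.9
```

The last step (Struct-pos) accounts for +5.1 of the total +8.9 AP-dL gain over the FIBER-B baseline.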
Data Scale Factor Analysis
| Factor | Configuration | AP | AP-dL |
|---|---|---|---|
| Entity density | COCO 80 classes | 29.7 | 18.6 |
| Entity density | O365 365 classes | 30.5 | 21.3 |
| Description density | 5 per category | 29.1 | 17.4 |
| Description density | 20 per category | 30.5 | 21.3 |
| Image density | 2 per description | 29.8 | 19.3 |
| Image density | 8 per description | 30.5 | 21.3 |
Key Findings
- GLIP-T performance on long queries (AP-dL) doubled from 8.2 to 16.4, indicating that compositional contrastive learning is highly effective for complex descriptions.
- Only performing text augmentation (DesCo) is insufficient; joint dense triplet generation in both image and text domains is crucial.
- Structural Positive contributes the most, improving long query AP-dL by 5.1 (from 16.2 to 21.3).
- The weak-to-strong pseudo-bounding box generation performs 5.1 AP-dL higher than direct grounding-based annotation (21.3 vs. 16.2).
Highlights & Insights
- Philosophy of Weak-to-Strong: Decomposing complex phrase grounding into multiple simple detection tasks is a very ingenious idea. A weak detector remains strong on simple tasks, and through task decomposition, a weak model can generate strong annotations. This line of thinking has broad applicability in the field of data annotation.
- Necessity of Structural Positives: Introducing structural negatives alone yields minor improvements, but adding structural positives brings a significant leap. This is because the model needs to learn both "when not to detect non-subject entities" and "to still recognize non-subject entities when they appear independently"—both are indispensable.
Limitations & Future Work
- The generative quality of diffusion models degrades on long-tail categories (with LVIS's 1,203 categories, AP-c is lower than with Objects365's 365 categories), which limits entity coverage.
- Image quality drops when description length exceeds 10 words, limiting the upper bound of scene complexity.
- The scale of 58K synthetic triplets is relatively small. Although the framework is scalable, the performance upper bound of large-scale generation has not been verified.
- Only two architectures (GLIP and FIBER) were validated.
Related Work & Insights
- vs. DesCo [Li et al.]: DesCo only performs text-domain augmentation, whereas WSCL performs dense triplet generation across both image and text domains. The two are complementary and can be combined, as the DesCo-GLIP +Ours row in the main results shows.
- vs. MDETR [Kamath et al.]: MDETR relies on manually annotated grounding data, while WSCL automatically generates synthetic data, offering higher scalability.
- vs. Pic2Word/LinCIR: These methods focus on zero-shot retrieval, whereas WSCL targets compositional understanding in detection tasks. Although the directions differ, they all tackle the compositional understanding bottleneck of VL models.
Rating
- Novelty: ⭐⭐⭐⭐ The joint design of weak-to-strong pseudo bounding box generation and structural contrastive learning is both novel and reasonable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly detailed ablation and multi-dimensional analysis including data scale, pseudo-bounding box strategies, and description lengths.
- Writing Quality: ⭐⭐⭐⭐ Logically clear, with motivation and effectiveness of each component supported by corresponding experiments.
- Value: ⭐⭐⭐⭐ Model-agnostic framework that directly benefits language-guided detection.