BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Conference: CVPR 2025
arXiv: 2407.03314
Code: None
Area: Object Detection
Keywords: Image Captioning, Vision-Language Models, Structured Representation, Concept Graph, Open-Vocabulary Detection
TL;DR
This paper proposes BACON, a prompting method that deconstructs verbose image captions generated by VLMs into decoupled structured elements (in JSON dictionary format) such as objects, relationships, styles, and themes. This allows downstream models to efficiently utilize caption information without requiring strong text-encoding capabilities, achieving a 1.51x recall improvement for GroundingDINO in open-vocabulary object detection.
Background & Motivation
Large Vision-Language Models (VLMs) can generate accurate and detailed image captions, but these captions are typically verbose, and their tangled context makes them hard to parse, so key cues are easily overlooked. This is a major barrier for downstream models such as GroundingDINO and SDXL, which lack the text-encoding and syntactic-analysis capabilities needed to fully exploit dense captions. The key challenge is that the more detailed a VLM caption becomes, the harder it is for downstream models to use effectively. Traditional captions are natural-language paragraphs: dense in information but weakly structured, so a model needs substantial NLP capability to extract what it needs. The key insight of this work is to turn captions from "unstructured paragraphs" into "structured concept graphs", decoupling visual information along semantic dimensions via a Bag-of-Concept approach.
Method
The core of BACON is a prompting method that decomposes image captions into multiple independent semantic dimensions, outputting them in a JSON dictionary format.
Overall Architecture
The entire pipeline consists of three steps: (1) using GPT-4V to generate a BACON-style structured caption dataset for 100K images; (2) fine-tuning a LLaVA model on this dataset to automatically generate BACON-format captions, eliminating the reliance on GPT-4V; and (3) directly applying BACON captions to various downstream tasks, where downstream models can directly access specific targeted concepts via the JSON keys. The structured dimensions of the captions include: objects (objects and their attributes), relationships (relationships between objects), style (image style), themes (themes/scenes), etc.
Key Designs
- Bag-of-Concept Structured Representation:
  - Function: Decouples and structures VLM-generated captions into separate semantic elements.
  - Mechanism: Instead of the traditional approach of "describing the whole image in a single paragraph", this method breaks captions down into dimensional components (object lists with attributes, locations, and counts; relationship lists covering spatial and action relations; style descriptions; and theme summaries) and organizes them into a JSON dictionary. Each dimension is independent and directly indexable.
  - Design Motivation: Decoupled information is easier to consume for downstream models that lack strong text comprehension. Such a model can query the object list for detection, or the style information for generation, by key, without having to parse complex sentences.
- GPT-4V Data Annotation and LLaVA Distillation:
  - Function: Cost-effectively generates BACON-style captions.
  - Mechanism: First, GPT-4V is paired with carefully designed prompt templates to annotate 100K images, producing a BACON-format dataset of image-caption pairs. Then, a LLaVA model is fine-tuned on this dataset so that it learns to generate BACON-style captions on its own.
  - Design Motivation: Although GPT-4V is highly capable, it is costly and proprietary. Distilling its knowledge into an open-source model enables large-scale application.
- Direct Adaptation to Downstream Tasks:
  - Function: Boosts the performance of various downstream tasks in a training-free manner.
  - Mechanism: The JSON format of BACON captions lets downstream models retrieve the information they need with simple key lookups. For instance, an open-vocabulary detection model can read the `objects` list directly as detection queries, and an image generation model can consume style, theme, and object layout one dimension at a time.
  - Design Motivation: Traditional dense captions require the downstream model to comprehend long text and extract key information on its own. BACON moves this work upstream, into the caption generation stage.
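The key-lookup idea in the last design point can be sketched as follows. This is a minimal illustration rather than the paper's code: the nested field name `name` and the helper `build_detection_prompt` are hypothetical, and the period-separated phrase format is assumed to match the text-prompt convention commonly used with GroundingDINO.

```python
def build_detection_prompt(caption: dict) -> str:
    """Turn the `objects` list of a BACON-style caption into a detection query.

    Assumption: each object is a dict with a `name` field (hypothetical schema).
    Names are lower-cased and de-duplicated while preserving their order.
    """
    seen = set()
    names = []
    for obj in caption.get("objects", []):
        name = obj["name"].strip().lower()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    # Join phrases with " . " and close with a trailing period;
    # an empty object list yields an empty prompt.
    return (" . ".join(names) + " .") if names else ""


# Illustrative caption; the values are made up for this sketch.
example_caption = {
    "objects": [
        {"name": "Dog", "attributes": ["brown"]},
        {"name": "frisbee", "attributes": ["red"]},
        {"name": "dog", "attributes": ["small"]},
    ]
}
prompt = build_detection_prompt(example_caption)
```

The detector never sees the full paragraph caption here; it only receives the object phrases, which is the point of indexing the caption by key.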
Loss & Training
LLaVA fine-tuning employs the standard autoregressive language modeling loss. In the GPT-4V annotation stage, carefully designed prompts keep the output format consistent and the information complete.
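For reference, the standard autoregressive objective can be written as below. The notation is an assumption introduced here, not taken from the paper: $I$ is the input image, $x$ the instruction prompt, and $y_1, \dots, y_T$ the tokens of the target BACON caption. The sketch also assumes the loss is computed only over caption tokens, as is common in instruction tuning; the summary does not state this.

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, I,\, x\right)
$$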
Key Experimental Results
Main Results
| Task | Metric | BACON + Model | Previous Method | Gain |
|---|---|---|---|---|
| Open-Vocabulary Detection | Recall | BACON+GroundingDINO | Original Caption+GroundingDINO | 1.51x |
| Caption Quality Evaluation | Overall Quality | BACON-LLaVA | SOTA VLM Captioner | Consistently Outperforms |
| Caption Quality Evaluation | Precision | BACON-LLaVA | SOTA VLM Captioner | Higher Precision |
| Caption Quality Evaluation | Recall | BACON-LLaVA | SOTA VLM Captioner | Higher Recall |
Ablation Study
| Configuration | Caption Quality | Description |
|---|---|---|
| Traditional VLM Captions | Baseline | Verbose paragraphs, hard for models to parse |
| BACON Structured Captions | Significant Improvement | Decoupled elements, directly usable |
| BACON + GPT-4V | Optimal | But high cost |
| BACON + LLaVA Fine-tuning | Close to GPT-4V | Low cost and deployable locally |
Key Findings
- BACON-style captions consistently outperform captions from other SOTA VLMs on caption quality (overall quality, precision, recall) and in user studies.
- In open-vocabulary object detection, BACON captions improve the recall of GroundingDINO by 1.51x, the most striking downstream result reported here.
- Structured captions enable tasks that were previously out of reach for these downstream models, and they improve performance significantly without any training of the downstream models.
- After fine-tuning, the LLaVA-based BACON captioner comes close to GPT-4V in quality, which supports the effectiveness of the knowledge distillation strategy.
Highlights & Insights
- "Structuring is the right way": even if VLMs can already generate high-quality captions, the way the caption is organized remains crucial for downstream tasks.
- A JSON dictionary is a highly practical representation of visual information, letting downstream models use rich captions with minimal text-parsing capability.
- 100K annotated samples are sufficient to train a highly usable BACON captioner via distillation, which indicates good data efficiency.
- The method's training-free transfer is notable: it improves multiple tasks directly, without retraining the downstream models.
Limitations & Future Work
- Developing the training data relies heavily on high-quality annotations from GPT-4V, which incurs a non-trivial initial annotation cost.
- The level of structuring in the JSON format is fixed, which may not scale or adapt well to all downstream task requirements.
- The division of semantic dimensions (objects/relations/style/theme) in captions might lack fine-grained specificity for certain scenarios.
- Future work could explore allowing users to customize dimensions or adaptively adjusting the structure based on downstream tasks.
- The LLaVA-distilled model may not be as accurate as GPT-4V in highly complex scenarios.
Related Work & Insights
- Relationship with Dense Captioning: BACON does not aim to generate longer captions but rather focuses on making captions more structured.
- Relationship with Visual Grounding: Structured captions naturally support superior visual grounding.
- Insight: Between VLM outputs and downstream model inputs, the organizational schema of information may hold greater importance than the information content itself.
Supplementary Analysis
BACON Format Example
A typical BACON output is a JSON dictionary containing the following keys:
- objects: A list of objects, each including the name, attributes (color, size, state, etc.), and spatial location.
- relationships: A list of relationships between objects (spatial relationships like "above" or action relationships like "using").
- style: The visual style of the image (photography style, lighting conditions, color tone, etc.).
- themes: The theme of the scene (indoor/outdoor, type of activity, etc.).
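A caption with the four keys listed above might look like the sketch below. The top-level keys come from the list; every nested field name (`name`, `attributes`, `location`, `subject`, `predicate`, `object`) and every value is an illustrative assumption, since the exact schema is not given here. The helper `missing_keys` is likewise hypothetical and simply checks that a parsed caption has all four dimensions.

```python
import json

# Hypothetical BACON-style caption; nested field names and values are
# assumptions made for illustration only.
bacon_caption = {
    "objects": [
        {"name": "dog", "attributes": ["brown", "small"], "location": "lower left"},
        {"name": "frisbee", "attributes": ["red"], "location": "center"},
    ],
    "relationships": [
        {"subject": "dog", "predicate": "chasing", "object": "frisbee"},
    ],
    "style": "outdoor photograph, bright daylight",
    "themes": ["park", "play"],
}

REQUIRED_KEYS = ("objects", "relationships", "style", "themes")


def missing_keys(caption: dict) -> list:
    """Return the required top-level keys that are absent from the caption."""
    return [key for key in REQUIRED_KEYS if key not in caption]


# Round-trip through JSON text, since a captioner emits the caption as a string.
parsed = json.loads(json.dumps(bacon_caption))
object_names = [obj["name"] for obj in parsed["objects"]]
```

A check like `missing_keys` is one simple way to reject malformed captioner output before it reaches a downstream model.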
Broad Applicability of the Method
- Not limited to object detection; structured captions offer distinct advantages in image generation, visual question-answering, and other tasks.
- For diffusion models like SDXL, structured input prevents key information from being truncated or missed due to lengthy captions.
- The JSON format can seamlessly interface with programmatic pipelines, such as Agent systems and databases.
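To illustrate the truncation point for generation models, the sketch below assembles a short prompt from a BACON-style caption under a word budget. It is not the paper's procedure: `build_generation_prompt`, the nested field names, and the ordering (style first, then themes, then objects) are assumptions, and a word count is only a crude stand-in for the tokenizer-level limit of a real text encoder.

```python
def build_generation_prompt(caption: dict, max_words: int = 60) -> str:
    """Compose a compact prompt from style, themes, and objects.

    Parts are added in priority order and the loop stops at the first part
    that would exceed the word budget, so earlier parts are never cut off.
    """
    themes = caption.get("themes", [])
    if isinstance(themes, str):  # tolerate a single theme given as a string
        themes = [themes]

    parts = [caption.get("style", "")]
    parts.extend(themes)
    for obj in caption.get("objects", []):
        attrs = " ".join(obj.get("attributes", []))
        parts.append(f"{attrs} {obj['name']}".strip())

    kept, used = [], 0
    for part in parts:
        n_words = len(part.split())
        if n_words == 0:
            continue  # skip empty fields
        if used + n_words > max_words:
            break  # budget reached; drop the remaining lower-priority parts
        kept.append(part)
        used += n_words
    return ", ".join(kept)


# Illustrative input; the values are made up for this sketch.
scene = {
    "style": "watercolor painting",
    "themes": ["park"],
    "objects": [
        {"name": "dog", "attributes": ["brown"]},
        {"name": "frisbee", "attributes": ["red"]},
    ],
}
full_prompt = build_generation_prompt(scene)
short_prompt = build_generation_prompt(scene, max_words=3)
```

Because the caption is already split by dimension, the budget can be spent on whole concepts in a chosen order instead of cutting a paragraph at an arbitrary position.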
Essential Differences from Traditional Image Captioning Methods
- Traditional methods optimize the "accuracy of captions," whereas BACON optimizes the "usability of captions."
- Traditional captions are designed for human reading, whereas BACON is designed for machine consumption.
- This shift in perspective, from human readability to machine usability, could change how VLM outputs are fed into downstream applications.
Rating
- Novelty: ⭐⭐⭐⭐ The concept of structuring descriptions into concept graphs is novel, though it fundamentally relies on prompt engineering and data annotation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Wide coverage of caption quality evaluations and downstream task applications, bolstered by a user study.
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition and thorough elaboration of the motivation.
- Value: ⭐⭐⭐⭐ Possesses strong practical guidance for downstream applications of VLM captions, with the 1.51x detection recall improvement being a standout highlight.