DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding¶
Conference: NeurIPS 2025
arXiv: 2511.02495
Dataset: https://kaggle.com/datasets/detectiumfire
Area: Dataset / Multi-modal
Keywords: Fire Detection, Multi-modal Dataset, Synthetic Data, RLHF, Vision-Language Model
TL;DR¶
DetectiumFire constructs the largest multi-modal fire understanding dataset — 14.5K real images + 2.5K videos + 8K synthetic images + 12K RLHF preference pairs — with a low PHash duplication rate (0.03 vs. 0.15 for D-Fire), a 4-level severity classification scheme, and detailed scene descriptions. Fine-tuning YOLOv11m achieves mAP 43.74, and fine-tuning LLaMA-3.2-11B yields 83.84% accuracy on fire severity classification.
Background & Motivation¶
Background: Fire safety is a critical global concern, yet existing fire datasets are small (D-Fire contains only 5.8K images) and exhibit high duplication rates (CNN duplication rate 0.55). Multi-modal models (CLIP, VLMs) lack fire-domain training data.
Limitations of Prior Work: (a) Existing datasets have high duplication rates, causing models to overfit on repeated samples rather than learning generalizable features; (b) semantic annotations are absent (what is burning? what is the environment? how severe?) — only bounding boxes are provided; (c) synthetic data quality is poor (FLAME_SD mAP only 2.10).
Key Challenge: Fire scenes require contextual reasoning (e.g., a small candle vs. a spreading blaze), yet existing datasets do not support such understanding — detection alone is insufficient; scene semantics and severity assessment are needed.
Goal: To construct a large-scale, low-duplication, multi-modal fire understanding dataset supporting detection, description, and severity assessment.
Key Insight: Combining real image collection + SFT/RLHF fine-tuned Stable Diffusion synthesis + GPT-4o semantic annotation + a 4-level severity classification scheme.
Core Idea: Low-duplication real images + SFT/RLHF fine-tuned SD synthesis + GPT-4o semantic annotation + 4-level severity classification = a multi-modal fire understanding benchmark.
Method¶
Overall Architecture¶
- Data Collection: Multi-source collection of 14.5K images + 2.5K videos, followed by deduplication (PHash + CNN embeddings via imagededup).
- Annotation: Roboflow bounding boxes + GPT-4o-generated 75-word descriptions (burning object + environment + severity).
- Synthetic Data: SFT fine-tuning of SD v1.5/v2/XL-1.0 + RLHF (Diffusion-DPO).
- Quality Control: CLIP-embedding cosine-distance filtering + validation by fire safety experts.
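As a minimal sketch of the PHash-based deduplication step above: given 64-bit perceptual hashes (hex strings), two images count as near-duplicates when their Hamming distance falls under a bit threshold, and the duplication rate is the fraction of images with at least one near-duplicate. The hash values and the 10-bit threshold here are illustrative, not taken from the paper.

```python
from itertools import combinations

def hamming(h1: str, h2: str) -> int:
    """Bitwise Hamming distance between two 64-bit hex hash strings."""
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

def duplication_rate(hashes: dict, threshold: int = 10) -> float:
    """Fraction of images with at least one near-duplicate
    (Hamming distance <= threshold) elsewhere in the set."""
    dup = set()
    for (a, ha), (b, hb) in combinations(hashes.items(), 2):
        if hamming(ha, hb) <= threshold:
            dup.update((a, b))
    return len(dup) / len(hashes)

# Toy example: one exact duplicate, one near-duplicate, one distinct image.
hashes = {
    "img_a.jpg": "9f8e7d6c5b4a3f2e",
    "img_b.jpg": "9f8e7d6c5b4a3f2e",   # exact duplicate of img_a
    "img_c.jpg": "9f8e7d6c5b4a3f2f",   # 1 bit away from img_a
    "img_d.jpg": "0123456789abcdef",   # unrelated
}
print(duplication_rate(hashes))  # 0.75
```

By this metric, a rate of 0.03 (DetectiumFire) versus 0.15 (D-Fire) means far fewer repeated samples inflating training and evaluation.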
Key Designs¶
- 4-Level Severity Classification Scheme:
  - Function: Assign a severity level to each fire image.
  - Mechanism: No Risk → Low Risk (small, controllable fire) → Medium Risk (moderate spreading) → High Risk (large-scale, uncontrollable). Each level is associated with specific visual feature descriptions.
  - Design Motivation: A binary "fire / no fire" label is insufficient — fire suppression decisions require severity assessment.
- SFT + RLHF Synthetic Data:
  - Function: Generate high-quality fire images using fine-tuned Stable Diffusion.
  - Mechanism: SFT — LoRA fine-tuning of SD v1.5/v2/XL-1.0 for 4,000 steps. RLHF — Diffusion-DPO pipeline with 12K human preference pairs (pairwise comparisons with \(k=2\)–\(9\) generations per prompt).
  - Design Motivation: Real fire data is scarce and difficult to collect safely. SFT-generated images achieve significantly higher quality than baseline SD (as measured by Elo ratings).
- GPT-4o Semantic Annotation Pipeline:
  - Function: Generate structured descriptions for each image.
  - Mechanism: 75-word limit focusing on three elements: burning objects (e.g., building / forest / vehicle), environment (indoor / outdoor / time of day), and severity. Manual refinement and error correction are applied.
  - Design Motivation: High-quality captions are required for VLM fine-tuning — simple category labels are insufficient; detailed scene descriptions are necessary.
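The pairwise comparisons behind the Diffusion-DPO step can be sketched as follows: with \(k\) generations per prompt ranked by human preference, every ordered pair (preferred, rejected) becomes one training pair, giving \(k(k-1)/2\) pairs per prompt. The ranking and file names below are hypothetical; the actual pipeline uses human preference labels.

```python
from itertools import combinations

def preference_pairs(ranked_images):
    """Given generations ranked best-to-worst for one prompt, emit all
    (preferred, rejected) pairs, as consumed by DPO-style training."""
    return [(winner, loser) for winner, loser in combinations(ranked_images, 2)]

# k generations per prompt (k = 2..9 in the paper) yield k*(k-1)/2 pairs.
ranked = ["gen_1.png", "gen_2.png", "gen_3.png", "gen_4.png"]  # best -> worst
pairs = preference_pairs(ranked)
print(len(pairs))  # 6 pairs from k = 4
```

Summed over prompts with \(k\) between 2 and 9, this is how roughly 12K preference pairs accumulate from a much smaller number of annotated prompts.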
Loss & Training¶
- Detection: Standard YOLOv11m training.
- VLM: LLaMA-3.2-11B instruction fine-tuning.
- Combining synthetic and real images during training yields the best detection results (mAP 44.52 vs. 43.74 with real images only).
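A minimal sketch of mixing real and synthetic images into one detector training list. The `synth_ratio` knob and file names are illustrative assumptions; the paper does not specify a mixing ratio.

```python
import random

def build_training_list(real, synthetic, synth_ratio=0.35, seed=0):
    """Keep all real images and add a capped share of synthetic ones,
    so synthetic data makes up at most `synth_ratio` of the final list."""
    rng = random.Random(seed)
    # Solve n_synth / (len(real) + n_synth) <= synth_ratio for n_synth.
    n_synth = min(len(synthetic), int(len(real) * synth_ratio / (1 - synth_ratio)))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed

real = [f"real_{i}.jpg" for i in range(14500)]    # real-image pool size from the paper
synth = [f"sd_{i}.jpg" for i in range(8000)]      # synthetic-image pool size from the paper
train = build_training_list(real, synth)
print(len(train))  # 22307
```

Capping the synthetic share keeps the real distribution dominant, which matches the ablation finding that synthetic data helps as augmentation but not as a replacement.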
Key Experimental Results¶
Main Results¶
| Task | Method | Metric | Value |
|---|---|---|---|
| Detection | YOLOv11m (DetectiumFire) | mAP | 43.74±0.64 |
| Cross-domain Detection | Train DetectiumFire → Test D-Fire | mAP | 40.32 |
| Cross-domain Detection | Train D-Fire → Test DetectiumFire | mAP | 24.88 (poor) |
| Synthetic Augmentation | Real + Synthetic | mAP | 44.52 (+0.78) |
| VLM Severity | LLaMA-3.2-11B fine-tuned | Accuracy | 83.84% |
| VLM Environment | LLaMA-3.2-11B fine-tuned | Accuracy | 89.39% |
| VLM Burning Object | LLaMA-3.2-11B fine-tuned | Accuracy | 87.37% |
| VLM Severity (baseline) | LLaMA-3.2-11B (no fine-tuning) | Accuracy | 56.06% (fine-tuning adds +27.78 pts) |
Ablation Study¶
| Data Source | mAP |
|---|---|
| Real only | 43.74 |
| Synthetic only (FLAME_SD) | 2.10 |
| Synthetic only (SFT) | 33.50 |
| Real + SFT Synthetic | 44.52 |
Key Findings¶
- Dataset quality (low duplication) is critically important — models trained on D-Fire transfer to DetectiumFire with only 24.88 mAP (vs. 40.32 in the reverse direction), demonstrating that DetectiumFire is more challenging and diverse.
- SFT synthetic data outperforms other synthetic approaches by an order of magnitude (33.50 vs. 2.10 mAP) — LoRA fine-tuning is effective.
- RLHF synthetic data slightly underperforms SFT — possibly due to reduced diversity (preference pairs may be biased toward common scenes).
- Semantic descriptions substantially improve VLM fine-tuning (severity accuracy +27.78 percentage points) — contextual reasoning requires detailed annotations.
- Synthetic data provides effective augmentation (+0.78 mAP), though the gain is marginal.
Highlights & Insights¶
- The 4-level severity classification fills the gap between fire detection and fire assessment in fire safety AI — practically useful systems require severity judgment.
- The low duplication rate (0.03 vs. 0.15) demonstrates that data deduplication is essential for benchmark quality.
- The cross-domain asymmetry (DetectiumFire→D-Fire: 40.32 vs. D-Fire→DetectiumFire: 24.88) confirms the superior diversity and difficulty of DetectiumFire.
Limitations & Future Work¶
- Gains from synthetic data are marginal (+0.78 mAP); more advanced generation strategies (e.g., ControlNet-based conditional generation) may be needed.
- Linguistic bias (primarily English and Chinese search queries) may result in missing fire scenes annotated in other languages.
- Richer scene annotations are lacking, such as human presence and fire progression (temporal dynamics).
- The 4-level severity scheme remains coarse — 10-level or continuous assessment may be more suitable for professional firefighting applications.
- RLHF synthesis underperforms SFT — preference pairs may be biased toward common scenes, reducing diversity.
Related Work & Insights¶
- vs. D-Fire: small scale, high duplication rate, and no semantic annotations; DetectiumFire surpasses it on all three counts.
- vs. FLAME_SD: A purely synthetic dataset of extremely low quality (mAP 2.10); DetectiumFire's hybrid strategy is substantially more effective.
- Application Value: Scenarios such as firefighting robots, UAV inspection, and intelligent surveillance all require fire severity assessment capabilities.
Rating¶
- Novelty: ⭐⭐⭐⭐ First large-scale multi-modal fire understanding dataset with severity classification.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional evaluation covering detection, VLM, synthetic data, and cross-domain transfer.
- Writing Quality: ⭐⭐⭐⭐ Dataset construction pipeline is described in detail.
- Value: ⭐⭐⭐⭐ Provides urgently needed data infrastructure for fire safety AI.