DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Conference: NeurIPS 2025
arXiv: 2511.02495
Dataset: https://kaggle.com/datasets/detectiumfire
Area: Dataset / Multi-modal
Keywords: Fire Detection, Multi-modal Dataset, Synthetic Data, RLHF, Vision-Language Model

TL;DR

DetectiumFire constructs the largest multi-modal fire understanding dataset — 14.5K real images + 2.5K videos + 8K synthetic images + 12K RLHF preference pairs — with a low duplication rate (0.03 PHash vs. D-Fire 0.15), a 4-level severity classification scheme, and detailed scene descriptions. Fine-tuning YOLOv11m achieves mAP 43.74, and fine-tuning LLaMA-3.2-11B yields 83.84% accuracy on fire severity classification.

Background & Motivation

Background: Fire safety is a critical global concern, yet existing fire datasets are small (D-Fire contains only 5.8K images) and exhibit high duplication rates (CNN duplication rate 0.55). Multi-modal models (CLIP, VLMs) lack fire-domain training data.

Limitations of Prior Work: (a) Existing datasets have high duplication rates, causing models to overfit on repeated samples rather than learning generalizable features; (b) semantic annotations are absent (what is burning? what is the environment? how severe?) — only bounding boxes are provided; (c) synthetic data quality is poor (FLAME_SD mAP only 2.10).

Key Challenge: Fire scenes require contextual reasoning (e.g., a small candle vs. a spreading blaze), yet existing datasets do not support such understanding — detection alone is insufficient; scene semantics and severity assessment are needed.

Goal: To construct a large-scale, low-duplication, multi-modal fire understanding dataset supporting detection, description, and severity assessment.

Key Insight: Combining real image collection + SFT/RLHF fine-tuned Stable Diffusion synthesis + GPT-4o semantic annotation + a 4-level severity classification scheme.

Core Idea: Low-duplication real images + SFT/RLHF fine-tuned SD synthesis + GPT-4o semantic annotation + 4-level severity classification = a multi-modal fire understanding benchmark.

Method

Overall Architecture

Data Collection: multi-source collection of 14.5K images + 2.5K videos → Deduplication: PHash + CNN embeddings via imagededup → Annotation: Roboflow bounding boxes + GPT-4o-generated 75-word descriptions (burning object + environment + severity) → Synthetic Data: SFT fine-tuning of SD v1.5/v2/XL-1.0 + RLHF (Diffusion-DPO) → Quality Control: CLIP-embedding cosine-distance filtering + fire-safety expert validation.
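The deduplication step can be illustrated with a minimal sketch. The actual pipeline uses the imagededup library (PHash plus CNN embeddings); the stand-in below uses a basic average hash over tiny grayscale arrays purely to show the idea: hash every image, then drop any image whose hash lies within a small Hamming distance of one already kept. The filenames, pixel arrays, and the `max_hamming` threshold are illustrative assumptions.

```python
def average_hash(pixels):
    """Hash a 2D grayscale image (list of rows) to a bit string.

    Each bit records whether that pixel is above the image's mean intensity,
    a simplified cousin of the DCT-based PHash used in the real pipeline.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def hamming(a, b):
    """Number of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def deduplicate(images, max_hamming=2):
    """Keep only images whose hash is not near-duplicate of a kept one."""
    kept, hashes = [], []
    for name, pixels in images:
        h = average_hash(pixels)
        if all(hamming(h, prev) > max_hamming for prev in hashes):
            kept.append(name)
            hashes.append(h)
    return kept


images = [
    ("fire_001.jpg", [[10, 200], [30, 220]]),
    ("fire_001_copy.jpg", [[12, 198], [31, 219]]),  # near-duplicate, dropped
    ("fire_002.jpg", [[250, 5], [240, 10]]),
]
print(deduplicate(images))  # the near-duplicate of fire_001 is filtered out
```

In the real pipeline the same keep/drop logic runs over PHash codes and CNN embeddings of full-resolution images, which is what drives the reported 0.03 duplication rate.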

Key Designs

  1. 4-Level Severity Classification Scheme:

    • Function: Assign a severity level to each fire image.
    • Mechanism: No Risk → Low Risk (small, controllable fire) → Medium Risk (moderate spreading) → High Risk (large-scale, uncontrollable). Each level is associated with specific visual feature descriptions.
    • Design Motivation: A binary "fire / no fire" label is insufficient — fire suppression decisions require severity assessment.
  2. SFT + RLHF Synthetic Data:

    • Function: Generate high-quality fire images using fine-tuned Stable Diffusion.
    • Mechanism: SFT — LoRA fine-tuning of SD v1.5/v2/XL-1.0 for 4,000 steps. RLHF — Diffusion-DPO pipeline with 12K human preference pairs (pairwise comparisons over \(k = 2\)–\(9\) generations per prompt).
    • Design Motivation: Real fire data is scarce and difficult to collect safely. SFT-generated images achieve significantly higher quality than baseline SD (as measured by Elo ratings).
  3. GPT-4o Semantic Annotation Pipeline:

    • Function: Generate structured descriptions for each image.
    • Mechanism: 75-word limit focusing on three elements: burning objects (e.g., building / forest / vehicle), environment (indoor / outdoor / time of day), and severity. Manual refinement and error correction are applied.
    • Design Motivation: High-quality captions are required for VLM fine-tuning — simple category labels are insufficient; detailed scene descriptions are necessary.
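The 4-level severity scheme above can be sketched as a small enum. The paper assigns levels from visual feature descriptions, not from a single scalar, so the flame-area thresholds in `severity_from_flame_fraction` are hypothetical values chosen only to make the mapping concrete.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The paper's 4-level risk scheme, ordered by escalation."""
    NO_RISK = 0      # no visible fire
    LOW_RISK = 1     # small, controllable fire (e.g., a candle)
    MEDIUM_RISK = 2  # moderate spreading
    HIGH_RISK = 3    # large-scale, uncontrollable blaze


def severity_from_flame_fraction(frac):
    """Toy mapping from flame-area fraction of the frame to a risk level.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if frac == 0.0:
        return Severity.NO_RISK
    if frac < 0.05:
        return Severity.LOW_RISK
    if frac < 0.25:
        return Severity.MEDIUM_RISK
    return Severity.HIGH_RISK


print(severity_from_flame_fraction(0.02).name)  # LOW_RISK
```

Using an ordered `IntEnum` keeps severity comparable (`HIGH_RISK > LOW_RISK`), which is exactly the property a binary fire/no-fire label lacks.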
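The Elo ratings mentioned for comparing SFT-generated images against baseline SD follow the standard Elo update from pairwise outcomes. A minimal sketch, assuming conventional defaults (initial rating 1000, K-factor 32) rather than values from the paper:

```python
def expected(ra, rb):
    """Expected win probability of player A under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))


def elo_update(ra, rb, winner_a, k=32.0):
    """Update both ratings after one pairwise comparison."""
    ea = expected(ra, rb)
    ra_new = ra + k * ((1.0 if winner_a else 0.0) - ea)
    rb_new = rb + k * ((0.0 if winner_a else 1.0) - (1.0 - ea))
    return ra_new, rb_new


# One round in which an SFT-generated image is preferred over baseline SD:
sft, base = elo_update(1000.0, 1000.0, winner_a=True)
print(round(sft), round(base))  # 1016 984
```

Running this over all human preference pairs yields a per-generator ranking, which is how quality claims like "SFT beats baseline SD" can be quantified.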

Loss & Training

  • Detection: Standard YOLOv11m training.
  • VLM: LLaMA-3.2-11B instruction fine-tuning.
  • Combining synthetic and real images during training yields the best detection performance (mAP 44.52 vs. 43.74 with real images alone).
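The real-plus-synthetic training mix can be sketched as follows. The paper does not specify the mixing ratio, so `synth_ratio` and the sample names below are illustrative assumptions:

```python
import random


def build_training_pool(real, synthetic, synth_ratio=0.35, seed=0):
    """Return all real samples plus a random subset of synthetic ones.

    synth_ratio controls how many synthetic samples are added, expressed
    as a fraction of the real set's size (a hypothetical knob).
    """
    rng = random.Random(seed)  # fixed seed for reproducible pools
    n_synth = min(int(len(real) * synth_ratio), len(synthetic))
    pool = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(pool)
    return pool


real = [f"real_{i}" for i in range(100)]
synth = [f"synth_{i}" for i in range(80)]
pool = build_training_pool(real, synth)
print(len(pool))  # 135 samples: 100 real + 35 synthetic
```

Keeping all real samples and only topping up with synthetic ones matches the ablation's finding that synthetic data helps as augmentation but cannot replace real data.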

Key Experimental Results

Main Results

Task Method Metric Value
Detection YOLOv11m (DetectiumFire) mAP 43.74±0.64
Cross-domain Detection Train DetectiumFire → Test D-Fire mAP 40.32
Cross-domain Detection Train D-Fire → Test DetectiumFire mAP 24.88 (poor)
Synthetic Augmentation Real + Synthetic mAP 44.52 (+0.78)
VLM Severity LLaMA-3.2-11B fine-tuned Accuracy 83.84%
VLM Environment LLaMA-3.2-11B fine-tuned Accuracy 89.39%
VLM Burning Object LLaMA-3.2-11B fine-tuned Accuracy 87.37%
VLM Severity (baseline) LLaMA-3.2-11B (no fine-tuning) Accuracy 56.06% (fine-tuning gain: +27.78 pts)

Ablation Study

Data Source mAP
Real only 43.74
Synthetic only (FLAME_SD) 2.10
Synthetic only (SFT) 33.50
Real + SFT Synthetic 44.52

Key Findings

  • Dataset quality (low duplication) is critically important — models trained on D-Fire transfer to DetectiumFire with only 24.88 mAP (vs. 40.32 in the reverse direction), demonstrating that DetectiumFire is more challenging and diverse.
  • SFT synthetic data outperforms other synthetic approaches by an order of magnitude (33.50 vs. 2.10 mAP) — LoRA fine-tuning is effective.
  • RLHF synthetic data slightly underperforms SFT — possibly due to reduced diversity (preference pairs may be biased toward common scenes).
  • Semantic descriptions substantially improve VLM fine-tuning (severity +27.78%) — contextual reasoning requires detailed annotations.
  • Synthetic data provides effective augmentation (+0.78 mAP), though the gain is marginal.

Highlights & Insights

  • The 4-level severity classification fills the gap between fire detection and fire assessment in fire safety AI — practically useful systems require severity judgment.
  • The low duplication rate (0.03 vs. 0.15) demonstrates that data deduplication is essential for benchmark quality.
  • The cross-domain asymmetry (DetectiumFire→D-Fire: 40.32 vs. D-Fire→DetectiumFire: 24.88) confirms the superior diversity and difficulty of DetectiumFire.

Limitations & Future Work

  • Gains from synthetic data are marginal (+0.78 mAP); more advanced generation strategies (e.g., ControlNet-based conditional generation) may be needed.
  • Linguistic bias (primarily English and Chinese search queries) may result in missing fire scenes annotated in other languages.
  • Richer scene annotations are lacking, such as human presence and fire progression (temporal dynamics).
  • The 4-level severity scheme remains coarse — 10-level or continuous assessment may be more suitable for professional firefighting applications.
  • RLHF synthesis underperforms SFT — preference pairs may be biased toward common scenes, reducing diversity.

Comparison & Application Value

  • vs. D-Fire: small scale, high duplication rate, no semantic annotations; DetectiumFire surpasses it on all three fronts.
  • vs. FLAME_SD: a purely synthetic dataset of very low quality (mAP 2.10); DetectiumFire's hybrid real + synthetic strategy is substantially more effective.
  • Application Value: firefighting robots, UAV inspection, and intelligent surveillance all require fire severity assessment capabilities.

Rating

  • Novelty: ⭐⭐⭐⭐ First large-scale multi-modal fire understanding dataset with severity classification.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-dimensional evaluation covering detection, VLM, synthetic data, and cross-domain transfer.
  • Writing Quality: ⭐⭐⭐⭐ Dataset construction pipeline is described in detail.
  • Value: ⭐⭐⭐⭐ Provides urgently needed data infrastructure for fire safety AI.