OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving¶
Conference: ICCV 2025 | arXiv: 2603.05936 | Code: https://kotashimomura.github.io/odrase/ | Area: Autonomous Driving Safety | Keywords: Ontology-driven, Risk Assessment, Infrastructure Improvement, Large-scale Vision-Language Models, Diffusion Models
TL;DR¶
This paper proposes OD-RASE, a framework that constructs a road traffic expert knowledge ontology to filter infrastructure improvement proposals generated by LVLMs, enabling proactive identification of accident-prone road structures and generation of improvement recommendations.
Background & Motivation¶
- Current autonomous driving systems achieve high perception performance but remain limited in rare or complex scenarios
- Traditional road infrastructure improvement is reactive: experts typically analyze causes and propose solutions only after traffic accidents have occurred
- Autonomous driving systems require proactive identification of potentially hazardous road structures to enable pre-accident intervention
- Existing datasets focus primarily on accident prediction and high-risk object description, neglecting the underlying road structures that cause accidents
- The quality of annotation data automatically generated by LVLMs is difficult to guarantee, and validation costs are high
Method¶
Overall Architecture¶
The OD-RASE framework consists of two core components: (1) ontology-driven dataset construction based on expert knowledge; and (2) a multimodal OD-RASE model (image + text → infrastructure improvement prediction + diffusion-based generation of improved road images).
Key Designs¶
- Ontology Construction: Drawing on expert knowledge of road traffic systems, accident-related road structures from 390+ cases are analyzed. Domain experts consolidate the original 30 road structure types and 26 improvement strategies, removing time-dependent factors, into 11 accident-inducing road structure types and 10 improvement strategies. The ontology is represented as a mapping between these structures and improvement strategies.
- Graph-to-Chain-of-Thought Prompting (G2CoT Prompt): GPT-4o is employed to simulate expert reasoning through multi-stage chain-of-thought (CoT) prompting, generating infrastructure improvement proposals for road images. The textual output of each stage is converted into a graph-structured prompt and passed to the next stage. The generated outputs correspond to the 10 improvement strategies and 11 road structure types defined in the ontology.
- Ontology-Driven Data Filtering: The expert knowledge ontology is represented as a directed reference graph \(G_A=(V_A, E_A)\), and the improvement proposals generated by GPT-4o as a generated graph \(G_B=(V_B, E_B)\). Graph matching is applied for filtering:
  - Compute node and edge intersections: \(V'_B = V_B \cap V_A\), \(E'_B = E_B \cap E_A\)
  - Remove isolated nodes: \(V''_B = V'_B \setminus \text{Iso}(G'_B)\)
  - Retain edges whose endpoints both belong to \(V''_B\), yielding the final filtered graph \(G''_B\)
  - If all edges between any two modules are removed, the data sample is considered unreliable and discarded
- OD-RASE Multimodal Model: Comprises a visual encoder (ResNet-50 / ViT-B / CLIP / Long-CLIP), a text encoder (RoBERTa-Base / Flan-T5-xl / Long-CLIP), and a Grounding Block. The Grounding Block performs cross-attention with image embeddings as queries and text embeddings as keys/values; the output is passed through a fully connected layer to predict the 10 improvement-strategy classes.
- Diffusion Model Layout Control: Text prompts are generated from OD-RASE's predicted improvement strategies, and InstructPix2Pix uses them to edit the original road images, producing visualizations of improved road environments that serve as a decision-support tool.
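The ontology-driven filtering step above can be sketched with plain Python sets. This is a minimal illustration under the assumption that graphs are stored as node sets plus directed edge pairs; the function name and the simplification of the "edges between any two modules" check to "any edge survives" are hypothetical, not from the paper's released code.

```python
def filter_generated_graph(V_A, E_A, V_B, E_B):
    """Filter a GPT-4o-generated graph G_B against the expert reference G_A.

    Returns (V''_B, E''_B, keep); keep is False if the sample should be
    discarded as unreliable. (Hypothetical helper, not the authors' code.)
    """
    # Step 1: keep only nodes and edges that also appear in the reference graph.
    V1 = V_B & V_A
    E1 = {(u, v) for (u, v) in E_B
          if (u, v) in E_A and u in V1 and v in V1}

    # Step 2: remove isolated nodes (nodes touching no surviving edge).
    touched = {u for (u, _) in E1} | {v for (_, v) in E1}
    V2 = V1 & touched

    # Step 3: retain edges whose endpoints both survive.
    E2 = {(u, v) for (u, v) in E1 if u in V2 and v in V2}

    # Simplified reliability check: discard the sample if no edge survives.
    keep = len(E2) > 0
    return V2, E2, keep


# Toy example with made-up node labels (not from the actual ontology):
V_A = {"narrow lane", "blind curve", "widen lane", "add mirror"}
E_A = {("narrow lane", "widen lane"), ("blind curve", "add mirror")}
V_B = {"narrow lane", "widen lane", "potholes"}   # "potholes" not in ontology
E_B = {("narrow lane", "widen lane"), ("potholes", "widen lane")}
V2, E2, keep = filter_generated_graph(V_A, E_A, V_B, E_B)
# E2 == {("narrow lane", "widen lane")}; keep is True
```

The hallucinated node ("potholes") and its edge are pruned because they have no counterpart in the reference graph, which is exactly how the filter rejects unsupported LVLM output.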
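The Grounding Block's cross-attention can likewise be sketched in a few lines. This is a single-head, identity-projection toy version (the real block would use learned query/key/value projections and multiple heads, details not given in this summary):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(img_emb, txt_emb):
    """Scaled dot-product cross-attention: image tokens act as queries,
    text tokens as keys and values (projections omitted for brevity)."""
    d = len(img_emb[0])
    out = []
    for q in img_emb:
        # Similarity of this image token to every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in txt_emb]
        w = softmax(scores)
        # Attended output: weighted average of the text embeddings.
        out.append([sum(wi * v[j] for wi, v in zip(w, txt_emb))
                    for j in range(d)])
    return out

# An image token aligned with the first text token attends almost
# entirely to it:
attended = cross_attention([[1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]])
```

In the full model, the attended output would then feed the fully connected classification head over the 10 improvement strategies.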
Loss & Training¶
- The task is formulated as multi-label classification (a single road image may correspond to multiple improvement strategies)
- Binary cross-entropy loss: \(\mathcal{L} = -\sum_{c=1}^{C}[y_c \log p_c + (1-y_c)\log(1-p_c)]\)
- Visual and text encoder parameters are frozen during training
- Batch size 16, trained for 25 epochs
- Evaluated on Mapillary Vistas and BDD100K datasets
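The multi-label BCE objective above can be written out directly; a minimal sketch (function name and the numerical-stability clamp are my additions, not from the paper):

```python
import math

def multilabel_bce(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy summed over C improvement-strategy classes.

    y_true: list of 0/1 labels, one per class.
    p_pred: list of predicted probabilities in [0, 1].
    """
    loss = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss

# A maximally uncertain prediction over two classes costs 2*ln(2):
loss = multilabel_bce([1, 0], [0.5, 0.5])  # ≈ 1.3863
```

Treating each class as an independent binary decision is what lets a single road image map to several improvement strategies at once.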
Key Experimental Results¶
Main Results (Table)¶
Performance of different visual-text encoder combinations on infrastructure improvement prediction:
| Visual Encoder | Text Encoder | Mapillary F1 | Mapillary Acc | BDD100K F1 | BDD100K Acc |
|---|---|---|---|---|---|
| ResNet-50 | RoBERTa-Base | 64.98 | 37.12 | 77.18 | 45.94 |
| ViT-B | RoBERTa-Base | 67.79 | 40.71 | 78.14 | 47.98 |
| CLIP | RoBERTa-Base | 69.76 | 40.71 | 78.30 | 48.69 |
| Long-CLIP | RoBERTa-Base | 70.26 | 42.14 | 78.79 | 49.48 |
| Long-CLIP | Flan-T5-xl | 24.39 | 0.00 | 28.22 | 0.00 |
Ablation Study (Table)¶
Modality Ablation (Long-CLIP + RoBERTa-Base, Mapillary):
| Input Modality | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Image only | 57.42 | 72.37 | 64.03 | 34.50 |
| Text only | 60.02 | 79.76 | 68.50 | 40.63 |
| Image + Text | 64.54 | 77.09 | 70.26 | 42.14 |
Ontology Filtering Ablation:
| Data Filtering | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Without filtering | 33.59 | 64.85 | 44.26 | 0.00 |
| With filtering | 64.54 | 77.09 | 70.26 | 42.14 |
Key Findings¶
- Ontology filtering is highly effective: Accuracy rises from 0% without filtering to 42.14% with filtering enabled; F1 improves from 44.26 to 70.26
- Flan-T5-xl performs poorly as a text encoder: high recall but extremely low precision, likely due to excessive false positives caused by 8-bit quantization
- Zero-shot generalization: the model trained on BDD100K and evaluated on Mapillary remains robust (F1=68.32), while general-purpose LVLMs (e.g., GPT-4o/LLaVA) substantially underperform the specialized model on this task
- Diffusion model visualization: InstructPix2Pix generates improved road images with FID = 8.5; expert evaluation rates 54.23% as fully compliant
Highlights & Insights¶
- Pioneering integration of road infrastructure improvement with autonomous driving safety, establishing a proactive risk identification framework
- The ontology serves as a knowledge-based filter that effectively improves the reliability of LVLM-generated data; the "expert knowledge + AI generation + graph-matching validation" paradigm is broadly generalizable
- Generating improved road visualizations via diffusion models enhances interpretability and practical utility
Limitations & Future Work¶
- Only front-view images are used; video temporal information and multi-view inputs are not considered
- Time-dependent factors such as traffic volume are excluded
- The actual accident reduction rate of proposed improvements cannot be quantitatively evaluated without a traffic simulator
- Quantitative metrics for prioritizing and assessing the urgency of improvement strategies are lacking
- The category granularity is relatively coarse (11 road structure types + 10 improvement strategies); finer-grained classification may be required for real-world deployment
Related Work & Insights¶
- Aligns with established road safety practices such as lane reduction (50% accident reduction) and roundabout conversion (38% accident reduction)
- The framework can serve as an automated road risk assessment tool for urban planning and intelligent transportation systems
- The ontology-driven data filtering approach is transferable to other domains requiring expert validation of AI-generated data
Rating¶
- Novelty: ⭐⭐⭐⭐ First work to integrate road infrastructure improvement with autonomous driving safety; the ontology-driven data filtering approach is novel
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive experiments on two datasets, including zero-shot transfer, modality ablation, filtering ablation, and comparison with general-purpose LVLMs
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated, the framework is complete, and the figures are convincing
- Value: ⭐⭐⭐⭐ High practical value; offers a new perspective on autonomous driving safety