
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

  • Conference: ICCV 2025
  • arXiv: 2603.05936
  • Code: https://kotashimomura.github.io/odrase/
  • Area: Autonomous Driving Safety
  • Keywords: Ontology-driven, Risk Assessment, Infrastructure Improvement, Large-scale Vision-Language Models, Diffusion Models

TL;DR

This paper proposes OD-RASE, a framework that constructs a road traffic expert knowledge ontology to filter infrastructure improvement proposals generated by LVLMs, enabling proactive identification of accident-prone road structures and generation of improvement recommendations.

Background & Motivation

  • Current autonomous driving systems achieve high perception performance but remain limited in rare or complex scenarios
  • Traditional road infrastructure improvement is reactive: experts typically analyze causes and propose solutions only after traffic accidents have occurred
  • Autonomous driving systems require proactive identification of potentially hazardous road structures to enable pre-accident intervention
  • Existing datasets focus primarily on accident prediction and high-risk object description, neglecting the underlying road structures that cause accidents
  • The quality of annotation data automatically generated by LVLMs is difficult to guarantee, and validation costs are high

Method

Overall Architecture

The OD-RASE framework consists of two core components: (1) ontology-driven dataset construction based on expert knowledge; and (2) a multimodal OD-RASE model (image + text → infrastructure improvement prediction + diffusion-based generation of improved road images).

Key Designs

  1. Ontology Construction: Drawing on expert knowledge of road traffic systems, accident-related road structures from 390+ cases are analyzed. Domain experts consolidate an initial set of 30 road structure types and 26 improvement strategies, removing time-dependent factors, into 11 accident-inducing road structure types and 10 improvement strategies. The ontology is represented as a mapping between these structures and improvement strategies.

  2. Graph-to-Chain-of-Thought Prompting (G2CoT Prompt): GPT-4o is employed to simulate expert reasoning through multi-stage chain-of-thought (CoT) prompting, generating infrastructure improvement proposals for road images. The textual output of each stage is converted into a graph-structured prompt passed to the next stage. The generated outputs correspond to the 10 improvement strategies and 11 road structure types defined in the ontology.

  3. Ontology-Driven Data Filtering: The expert knowledge ontology is represented as a directed reference graph \(G_A=(V_A, E_A)\), and the improvement proposals generated by GPT-4o are represented as a generated graph \(G_B=(V_B, E_B)\). Graph matching is applied for filtering:

    • Compute node and edge intersections: \(V'_B = V_B \cap V_A\), \(E'_B = E_B \cap E_A\)
    • Remove isolated nodes: \(V''_B = V'_B \setminus \text{Iso}(G'_B)\)
    • Retain edges whose endpoints both belong to \(V''_B\), yielding the final filtered graph \(G''_B\)
    • If all edges between any two modules are removed, the data sample is considered unreliable and discarded
  4. OD-RASE Multimodal Model: Comprises a visual encoder (ResNet-50/ViT-B/CLIP/Long-CLIP), a text encoder (RoBERTa-Base/Flan-T5-xl/Long-CLIP), and a Grounding Block. The Grounding Block performs cross-attention using image embeddings as queries and text embeddings as keys/values; the output is passed through a fully connected layer to predict 10 improvement strategy classes.

  5. Diffusion Model Layout Control: Text prompts are generated from OD-RASE's predicted improvement strategies and fed to Instruct Pix2Pix, which edits the original road images to produce visualizations of the improved road environment as a decision-support tool.
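The ontology-driven filtering in step 3 can be sketched with plain Python set operations. This is a minimal illustration: the node names are hypothetical, and the final discard rule is simplified to "no edge survives" rather than the paper's per-module criterion.

```python
def ontology_filter(ref_nodes, ref_edges, gen_nodes, gen_edges):
    """Filter an LVLM-generated graph G_B against the reference ontology G_A.

    Edges are (src, dst) tuples. Returns the filtered (nodes, edges), or None
    if the sample should be discarded as unreliable.
    """
    # Keep only nodes and edges that also appear in the ontology: V'_B, E'_B.
    nodes = gen_nodes & ref_nodes
    edges = gen_edges & ref_edges
    # Remove isolated nodes, i.e. nodes no surviving edge touches: V''_B.
    touched = {n for e in edges for n in e}
    nodes &= touched
    # Retain edges whose endpoints both survive: G''_B.
    edges = {(u, v) for (u, v) in edges if u in nodes and v in nodes}
    # Simplified discard rule: no surviving edge means an unreliable sample.
    if not edges:
        return None
    return nodes, edges

# Hypothetical example: the ontology maps structures to strategies.
ref_nodes = {"blind_intersection", "no_signal", "install_signal", "add_mirror"}
ref_edges = {("blind_intersection", "add_mirror"), ("no_signal", "install_signal")}
gen_nodes = {"blind_intersection", "add_mirror", "widen_lane"}  # "widen_lane" not in ontology
gen_edges = {("blind_intersection", "add_mirror"), ("blind_intersection", "widen_lane")}

print(ontology_filter(ref_nodes, ref_edges, gen_nodes, gen_edges))
```

The off-ontology node "widen_lane" and its edge are dropped, and the sample survives because one valid edge remains.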
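The Grounding Block's cross-attention in step 4 can be sketched in NumPy. All dimensions, weight shapes, and the mean-pooling step here are illustrative assumptions, not the paper's exact architecture; the only detail taken from the text is that image embeddings act as queries, text embeddings as keys/values, and a final linear layer scores the 10 strategy classes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grounding_block(img_emb, txt_emb, W_q, W_k, W_v, W_out):
    """Single-head cross-attention: image tokens query text tokens, then a
    linear head produces logits for the 10 improvement-strategy classes."""
    Q = img_emb @ W_q                                  # (n_img, d)
    K = txt_emb @ W_k                                  # (n_txt, d)
    V = txt_emb @ W_v                                  # (n_txt, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # (n_img, n_txt)
    fused = attn @ V                                   # (n_img, d)
    pooled = fused.mean(axis=0)                        # pool image tokens (assumed)
    return pooled @ W_out                              # (10,) strategy logits

rng = np.random.default_rng(0)
d = 32
img_emb = rng.normal(size=(49, d))   # e.g. 7x7 grid of image patch embeddings
txt_emb = rng.normal(size=(16, d))   # text token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, 10))
logits = grounding_block(img_emb, txt_emb, W_q, W_k, W_v, W_out)
print(logits.shape)  # (10,)
```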

Loss & Training

  • The task is formulated as multi-label classification (a single road image may correspond to multiple improvement strategies)
  • Binary cross-entropy loss: \(\mathcal{L} = -\sum_{c=1}^{C}[y_c \log p_c + (1-y_c)\log(1-p_c)]\)
  • Visual and text encoder parameters are frozen during training
  • Batch size 16, trained for 25 epochs
  • Evaluated on Mapillary Vistas and BDD100K datasets
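The multi-label binary cross-entropy above can be written out directly. This is a minimal NumPy sketch; the per-class sigmoid and the epsilon guard are standard implementation details assumed here, not taken from the paper.

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Binary cross-entropy summed over C improvement-strategy classes.
    `targets` is a multi-hot vector: one road image may need several strategies.
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoid per class
    eps = 1e-12                          # numerical safety for log(0)
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# Hypothetical example with C = 10 strategies, two of them positive.
logits = np.zeros(10)                    # sigmoid(0) = 0.5 for every class
targets = np.zeros(10)
targets[[2, 7]] = 1.0
loss = multilabel_bce(logits, targets)
print(round(loss, 4))                    # 10 * ln(2) ≈ 6.9315
```

With all probabilities at 0.5, each of the 10 classes contributes ln 2 regardless of its label, so the loss is exactly 10 ln 2.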

Key Experimental Results

Main Results (Table)

Performance of different visual-text encoder combinations on infrastructure improvement prediction:

| Visual Encoder | Text Encoder | Mapillary F1 | Mapillary Acc | BDD100K F1 | BDD100K Acc |
|---|---|---|---|---|---|
| ResNet-50 | RoBERTa-Base | 64.98 | 37.12 | 77.18 | 45.94 |
| ViT-B | RoBERTa-Base | 67.79 | 40.71 | 78.14 | 47.98 |
| CLIP | RoBERTa-Base | 69.76 | 40.71 | 78.30 | 48.69 |
| Long-CLIP | RoBERTa-Base | 70.26 | 42.14 | 78.79 | 49.48 |
| Long-CLIP | Flan-T5-xl | 24.39 | 0.00 | 28.22 | 0.00 |

Ablation Study (Table)

Modality Ablation (Long-CLIP + RoBERTa-Base, Mapillary):

| Input Modality | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Image only | 57.42 | 72.37 | 64.03 | 34.50 |
| Text only | 60.02 | 79.76 | 68.50 | 40.63 |
| Image + Text | 64.54 | 77.09 | 70.26 | 42.14 |

Ontology Filtering Ablation:

| Data Filtering | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Without filtering | 33.59 | 64.85 | 44.26 | 0.00 |
| With filtering | 64.54 | 77.09 | 70.26 | 42.14 |

Key Findings

  • Ontology filtering is highly effective: Accuracy rises from 0% without filtering to 42.14% with filtering enabled; F1 improves from 44.26 to 70.26
  • Flan-T5-xl performs poorly as a text encoder: recall is high but precision is extremely low (many false positives), possibly an artifact of 8-bit quantization
  • Zero-shot generalization: the model trained on BDD100K and evaluated on Mapillary remains robust (F1=68.32), while general-purpose LVLMs (e.g., GPT-4o/LLaVA) substantially underperform the specialized model on this task
  • Diffusion model visualization: Instruct Pix2Pix generates improved road images with FID=8.5; expert evaluation rates 54.23% as fully compliant

Highlights & Insights

  • Pioneering integration of road infrastructure improvement with autonomous driving safety, establishing a proactive risk identification framework
  • The ontology serves as a knowledge-based filter that effectively improves the reliability of LVLM-generated data; the "expert knowledge + AI generation + graph-matching validation" paradigm is broadly generalizable
  • Generating improved road visualizations via diffusion models enhances interpretability and practical utility

Limitations & Future Work

  • Only front-view images are used; video temporal information and multi-view inputs are not considered
  • Time-dependent factors such as traffic volume are excluded
  • The actual accident reduction rate of proposed improvements cannot be quantitatively evaluated without a traffic simulator
  • Quantitative metrics for prioritizing and assessing the urgency of improvement strategies are lacking
  • The category granularity is relatively coarse (11 road structure types + 10 improvement strategies); finer-grained classification may be required for real-world deployment
  • The proposed improvements align with established road safety practices, such as lane reduction (50% accident reduction) and roundabout conversion (38% accident reduction)
  • The framework can serve as an automated road risk assessment tool for urban planning and intelligent transportation systems
  • The ontology-driven data filtering approach is transferable to other domains requiring expert validation of AI-generated data

Rating

  • Novelty: ⭐⭐⭐⭐ First work to integrate road infrastructure improvement with autonomous driving safety; the ontology-driven data filtering approach is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive experiments on two datasets, including zero-shot transfer, modality ablation, filtering ablation, and comparison with general-purpose LVLMs
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated, the framework is complete, and the figures are convincing
  • Value: ⭐⭐⭐⭐ High practical value; offers a new perspective on autonomous driving safety