
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

  • Conference: ICCV 2025
  • arXiv: 2603.05936
  • Code: https://kotashimomura.github.io/odrase/
  • Area: Autonomous Driving Safety
  • Keywords: Ontology-driven, Risk Assessment, Infrastructure Improvement, Large-scale Vision-Language Models, Diffusion Models

TL;DR

This paper proposes OD-RASE, a framework that constructs a road traffic expert knowledge ontology to filter infrastructure improvement proposals generated by LVLMs, enabling proactive identification of accident-prone road structures and generation of improvement recommendations.

Background & Motivation

  • Current autonomous driving systems achieve high perception performance but remain limited in rare or complex scenarios
  • Traditional road infrastructure improvement is reactive: experts typically analyze causes and propose solutions only after traffic accidents have occurred
  • Autonomous driving systems require proactive identification of potentially hazardous road structures to enable pre-accident intervention
  • Existing datasets focus primarily on accident prediction and high-risk object description, neglecting the underlying road structures that cause accidents
  • The quality of annotation data automatically generated by LVLMs is difficult to guarantee, and validation costs are high

Method

Overall Architecture

The OD-RASE framework consists of two core components: (1) ontology-driven dataset construction based on expert knowledge; and (2) a multimodal OD-RASE model (image + text → infrastructure improvement prediction + diffusion-based generation of improved road images).

Key Designs

  1. Ontology Construction: Drawing on expert knowledge of road traffic systems, accident-related road structures from 390+ cases are analyzed. Domain experts consolidate an initial set of 30 road structure types and 26 improvement strategies, removing time-dependent factors, into 11 accident-inducing road structure types and 10 improvement strategies. The ontology is represented as a mapping between these structures and improvement strategies.

  2. Graph-to-Chain-of-Thought Prompting (G2CoT Prompt): GPT-4o is employed to simulate expert reasoning through multi-stage chain-of-thought (CoT) prompting, generating infrastructure improvement proposals for road images. The textual output of each stage is converted into a graph-structured prompt passed to the next stage. The generated outputs correspond to the 10 improvement strategies and 11 road structure types defined in the ontology.

  3. Ontology-Driven Data Filtering: The expert knowledge ontology is represented as a directed reference graph \(G_A=(V_A, E_A)\), and the improvement proposals generated by GPT-4o are represented as a generated graph \(G_B=(V_B, E_B)\). Graph matching is applied for filtering:

    • Compute node and edge intersections: \(V'_B = V_B \cap V_A\), \(E'_B = E_B \cap E_A\)
    • Remove isolated nodes: \(V''_B = V'_B \setminus \text{Iso}(G'_B)\)
    • Retain edges whose endpoints both belong to \(V''_B\), yielding the final filtered graph \(G''_B\)
    • If all edges between any two modules are removed, the data sample is considered unreliable and discarded
  4. OD-RASE Multimodal Model: Comprises a visual encoder (ResNet-50/ViT-B/CLIP/Long-CLIP), a text encoder (RoBERTa-Base/Flan-T5-xl/Long-CLIP), and a Grounding Block. The Grounding Block performs cross-attention using image embeddings as queries and text embeddings as keys/values; the output is passed through a fully connected layer to predict 10 improvement strategy classes.

  5. Diffusion Model Layout Control: Text prompts are generated from OD-RASE's predicted improvement strategies and fed to Instruct Pix2Pix, which edits the original road images to produce visualizations of the improved road environment as a decision-support tool.
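The ontology-driven filtering in step 3 can be sketched with plain Python set operations. This is a minimal illustration: the node names are hypothetical, and the final discard rule is simplified to "no edge survives" rather than the paper's per-module criterion.

```python
def ontology_filter(ref_nodes, ref_edges, gen_nodes, gen_edges):
    """Filter an LVLM-generated graph G_B against the reference ontology G_A.

    Edges are (src, dst) tuples. Returns the filtered (nodes, edges), or None
    if the sample should be discarded as unreliable.
    """
    # Keep only nodes and edges that also appear in the ontology: V'_B, E'_B.
    nodes = gen_nodes & ref_nodes
    edges = gen_edges & ref_edges
    # Remove isolated nodes, i.e. nodes no surviving edge touches: V''_B.
    touched = {n for e in edges for n in e}
    nodes &= touched
    # Retain edges whose endpoints both survive: G''_B.
    edges = {(u, v) for (u, v) in edges if u in nodes and v in nodes}
    # Simplified discard rule: no surviving edge means an unreliable sample.
    if not edges:
        return None
    return nodes, edges

# Hypothetical example: the ontology maps structures to strategies.
ref_nodes = {"blind_intersection", "no_signal", "install_signal", "add_mirror"}
ref_edges = {("blind_intersection", "add_mirror"), ("no_signal", "install_signal")}
gen_nodes = {"blind_intersection", "add_mirror", "widen_lane"}  # "widen_lane" not in ontology
gen_edges = {("blind_intersection", "add_mirror"), ("blind_intersection", "widen_lane")}

print(ontology_filter(ref_nodes, ref_edges, gen_nodes, gen_edges))
```

The off-ontology node "widen_lane" and its edge are dropped, and the sample survives because one valid edge remains.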
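The Grounding Block's cross-attention in step 4 can be sketched in NumPy. All dimensions, weight shapes, and the mean-pooling step here are illustrative assumptions, not the paper's exact architecture; the only detail taken from the text is that image embeddings act as queries, text embeddings as keys/values, and a final linear layer scores the 10 strategy classes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grounding_block(img_emb, txt_emb, W_q, W_k, W_v, W_out):
    """Single-head cross-attention: image tokens query text tokens, then a
    linear head produces logits for the 10 improvement-strategy classes."""
    Q = img_emb @ W_q                                  # (n_img, d)
    K = txt_emb @ W_k                                  # (n_txt, d)
    V = txt_emb @ W_v                                  # (n_txt, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # (n_img, n_txt)
    fused = attn @ V                                   # (n_img, d)
    pooled = fused.mean(axis=0)                        # pool image tokens (assumed)
    return pooled @ W_out                              # (10,) strategy logits

rng = np.random.default_rng(0)
d = 32
img_emb = rng.normal(size=(49, d))   # e.g. 7x7 grid of image patch embeddings
txt_emb = rng.normal(size=(16, d))   # text token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, 10))
logits = grounding_block(img_emb, txt_emb, W_q, W_k, W_v, W_out)
print(logits.shape)  # (10,)
```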

Loss & Training

  • The task is formulated as multi-label classification (a single road image may correspond to multiple improvement strategies)
  • Binary cross-entropy loss: \(\mathcal{L} = -\sum_{c=1}^{C}[y_c \log p_c + (1-y_c)\log(1-p_c)]\)
  • Visual and text encoder parameters are frozen during training
  • Batch size 16, trained for 25 epochs
  • Evaluated on Mapillary Vistas and BDD100K datasets
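The multi-label binary cross-entropy above can be written out directly. This is a minimal NumPy sketch; the per-class sigmoid and the epsilon guard are standard implementation details assumed here, not taken from the paper.

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Binary cross-entropy summed over C improvement-strategy classes.
    `targets` is a multi-hot vector: one road image may need several strategies.
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # independent sigmoid per class
    eps = 1e-12                          # numerical safety for log(0)
    return -np.sum(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

# Hypothetical example with C = 10 strategies, two of them positive.
logits = np.zeros(10)                    # sigmoid(0) = 0.5 for every class
targets = np.zeros(10)
targets[[2, 7]] = 1.0
loss = multilabel_bce(logits, targets)
print(round(loss, 4))                    # 10 * ln(2) ≈ 6.9315
```

With all probabilities at 0.5, each of the 10 classes contributes ln 2 regardless of its label, so the loss is exactly 10 ln 2.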

Key Experimental Results

Main Results (Table)

Performance of different visual-text encoder combinations on infrastructure improvement prediction:

| Visual Encoder | Text Encoder | Mapillary F1 | Mapillary Acc | BDD100K F1 | BDD100K Acc |
|---|---|---|---|---|---|
| ResNet-50 | RoBERTa-Base | 64.98 | 37.12 | 77.18 | 45.94 |
| ViT-B | RoBERTa-Base | 67.79 | 40.71 | 78.14 | 47.98 |
| CLIP | RoBERTa-Base | 69.76 | 40.71 | 78.30 | 48.69 |
| Long-CLIP | RoBERTa-Base | 70.26 | 42.14 | 78.79 | 49.48 |
| Long-CLIP | Flan-T5-xl | 24.39 | 0.00 | 28.22 | 0.00 |

Ablation Study (Table)

Modality Ablation (Long-CLIP + RoBERTa-Base, Mapillary):

| Input Modality | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Image only | 57.42 | 72.37 | 64.03 | 34.50 |
| Text only | 60.02 | 79.76 | 68.50 | 40.63 |
| Image + Text | 64.54 | 77.09 | 70.26 | 42.14 |

Ontology Filtering Ablation:

| Data Filtering | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| Without filtering | 33.59 | 64.85 | 44.26 | 0.00 |
| With filtering | 64.54 | 77.09 | 70.26 | 42.14 |

Key Findings

  • Ontology filtering is highly effective: Accuracy rises from 0% without filtering to 42.14% with filtering enabled; F1 improves from 44.26 to 70.26
  • Flan-T5-xl performs poorly as a text encoder: recall is high but precision is extremely low (many false positives), possibly an artifact of 8-bit quantization
  • Zero-shot generalization: the model trained on BDD100K and evaluated on Mapillary remains robust (F1=68.32), while general-purpose LVLMs (e.g., GPT-4o/LLaVA) substantially underperform the specialized model on this task
  • Diffusion model visualization: Instruct Pix2Pix generates improved road images with FID=8.5; expert evaluation rates 54.23% as fully compliant

Highlights & Insights

  • Pioneering integration of road infrastructure improvement with autonomous driving safety, establishing a proactive risk identification framework
  • The ontology serves as a knowledge-based filter that effectively improves the reliability of LVLM-generated data; the "expert knowledge + AI generation + graph-matching validation" paradigm is broadly generalizable
  • Generating improved road visualizations via diffusion models enhances interpretability and practical utility

Limitations & Future Work

  • Only front-view images are used; video temporal information and multi-view inputs are not considered
  • Time-dependent factors such as traffic volume are excluded
  • The actual accident reduction rate of proposed improvements cannot be quantitatively evaluated without a traffic simulator
  • Quantitative metrics for prioritizing and assessing the urgency of improvement strategies are lacking
  • The category granularity is relatively coarse (11 road structure types + 10 improvement strategies); finer-grained classification may be required for real-world deployment
  • The proposed improvements align with established road safety practices, such as lane reduction (50% accident reduction) and roundabout conversion (38% accident reduction)
  • The framework can serve as an automated road risk assessment tool for urban planning and intelligent transportation systems
  • The ontology-driven data filtering approach is transferable to other domains requiring expert validation of AI-generated data

Rating

  • Novelty: ⭐⭐⭐⭐ First work to integrate road infrastructure improvement with autonomous driving safety; the ontology-driven data filtering approach is novel
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive experiments on two datasets, including zero-shot transfer, modality ablation, filtering ablation, and comparison with general-purpose LVLMs
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated, the framework is complete, and the figures are convincing
  • Value: ⭐⭐⭐⭐ High practical value; offers a new perspective on autonomous driving safety