SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing¶

Basic Information¶

Conference: ICLR 2026
arXiv: 2603.01630
Code: Project Page
Area: AI Safety / Autonomous System Evaluation
Keywords: Ethical Testing, Bayesian Experimental Design, Gaussian Process, LLM Evaluator, Autonomous Systems

TL;DR¶

This paper proposes SEED-SET, a framework that formulates ethical evaluation of autonomous systems as a hierarchical Bayesian experimental design problem, jointly integrating objective metrics and subjective value judgments to efficiently generate test cases with high ethical alignment under limited evaluation budgets.

Background & Motivation¶

State of the Field¶

Autonomous systems (e.g., drones, power distribution grids) are increasingly deployed in high-stakes domains, which makes evaluating their ethical alignment critically important. Ethical assessment faces three major challenges:

Measurement difficulty: Ethical behaviors (fairness, social acceptability) lack ground-truth labels;

Subjectivity dependence: Value alignment varies across stakeholders and evolves over time, requiring continuous revision of static benchmarks;

Evaluation cost: Real-system evaluation is budget-constrained, making large-scale human feedback collection infeasible.

Limitations of Prior Work¶

Rule-based ethical benchmarks rely on predefined criteria and lack specificity;
RL/RLHF-based methods assume abundant simulation or expert annotations, requiring large sample sizes;
Preference-based methods and large-scale human studies focus on only a single dimension.

Mechanism¶

The framework jointly models objective metrics (e.g., fire damage, grid cost) and subjective preferences (stakeholder ethical judgments), efficiently generating test scenarios via hierarchical Gaussian processes and Bayesian experimental design.

Method¶

Overall Architecture¶

SEED-SET (Scalable Evolving Experimental Design for System-level Ethical Testing) comprises three components:

Hierarchical Variational Gaussian Process (HVGP) as the surrogate model
Joint acquisition strategy for adaptive test case generation
LLM agent as a substitute for human preference evaluation

1. Problem Formulation¶

Given a black-box autonomous system $\mathcal{S}_\pi$ and scenario space $\mathcal{X}$, the ethical compliance function is decomposed into:

Objective layer: $f_{\text{obj}}: \mathcal{X} \to \mathcal{Y}$, mapping scenario parameters to measurable metrics (cost, resilience, etc.)
Subjective layer: $f_{\text{subj}}: \mathcal{Y} \to \mathbb{R}$, producing an ethical utility score from objective metrics
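To make the two-layer decomposition concrete, the sketch below composes a toy objective layer with a toy subjective layer. The function bodies, metric names (cost, unmet demand), and weights are hypothetical illustrations, not taken from the paper; only the structure $f = f_{\text{subj}}(f_{\text{obj}}(x))$ follows the formulation above.

```python
import numpy as np

def f_obj(x):
    """Hypothetical objective layer: scenario parameters -> [cost, unmet_demand]."""
    x = np.asarray(x, dtype=float)
    cost = np.sum(x ** 2)                         # toy operating cost
    unmet = np.sum(np.clip(1.0 - x, 0.0, None))   # toy unmet demand
    return np.array([cost, unmet])

def f_subj(y, weights=(0.3, 0.7)):
    """Hypothetical subjective layer: objective metrics -> scalar ethical utility."""
    # Assumed preference: lower cost and lower unmet demand are both better.
    return -float(np.dot(weights, y))

def ethical_score(x):
    """Composite compliance function: the subjective layer applied to the objective layer."""
    return f_subj(f_obj(x))

score_full = ethical_score([1.0, 1.0])   # all demand met, some cost
score_none = ethical_score([0.0, 0.0])   # no cost, all demand unmet
```

In the actual framework both layers are unknown and are replaced by learned surrogates; the fixed weights here only stand in for a stakeholder's preferences.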

2. Hierarchical Variational Gaussian Process (HVGP)¶

Ethical evaluation is modeled as a two-level VGP hierarchy:

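Objective GP: Learns a surrogate model $g: x \to y$ that predicts the objective metrics for a given scenario, with posterior $$ p(g \mid \mathcal{D}) = \mathcal{GP}\big(\mu(x), k(x, x')\big) $$ where $\mu$ and $k$ denote the posterior mean and covariance functions.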

Subjective GP: Learns a preference model $h: y \to z$ mapping objective metrics to subjective ethical scores

Since subjective evaluations lack ground-truth labels, pairwise preference elicitation is adopted: an oracle $\mathcal{T}: (y, y') \to \{1, 2\}$ compares the objective outcomes of two scenarios and returns the index of the ethically preferred one.

The hierarchical structure offers two key advantages:

Interpretability: Ethical preferences are anchored to observable system behaviors
Data efficiency: Exploiting the dependence of the subjective layer on the objective layer reduces the number of required evaluations
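A pairwise oracle only reveals which of two outcomes is preferred, so the subjective GP needs a likelihood that links latent utilities to comparisons. The sketch below uses the probit likelihood from the pairwise-comparison GP literature (Chu & Ghahramani, 2005). This is an assumption: the summary does not state which likelihood SEED-SET uses, and the function names and the `noise_std` parameter are hypothetical.

```python
import math

def probit_preference_prob(h_y, h_y_prime, noise_std=1.0):
    """P(y is preferred over y') given latent utilities h(y) and h(y')."""
    z = (h_y - h_y_prime) / (math.sqrt(2.0) * noise_std)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

def toy_oracle(h_y, h_y_prime):
    """Noiseless stand-in for T: returns 1 if the first outcome is preferred, else 2."""
    return 1 if h_y >= h_y_prime else 2

p_equal = probit_preference_prob(0.0, 0.0)    # equal utilities: no preference
p_better = probit_preference_prob(2.0, 0.0)   # first outcome has higher utility
```

Under this likelihood, the probability of preferring `y` grows smoothly with the utility gap, which is what lets a GP over `h` be fitted from comparisons alone.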

3. Joint Acquisition Strategy¶

The core innovation is an acquisition function that balances exploration of both the objective and subjective layers against exploitation of the learned preferences:

\[ V(x) = \underbrace{I(g_x; y|\mathcal{D})}_{\text{Objective information gain}} + \mathbb{E}_{q_\phi(y|x)}\left[\underbrace{I(h_y; z|\mathcal{D})}_{\text{Subjective information gain}} + \underbrace{\mathbb{E}_{q_\psi(h_y)}[h_y]}_{\text{Preference exploitation}}\right] \]

Roles of the three terms:

First term: Reduces uncertainty in the objective metric space (scenario exploration)
Second term: Improves estimation of the subjective utility function (preference learning)
Third term: Directs search toward regions of high ethical utility (preference exploitation)

In the results tables, the two information-gain terms are labeled MI1 and MI2, and the exploitation term is labeled Pref.
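The structure of $V(x)$ can be illustrated with a Monte Carlo estimate. This is a simplified sketch under loud assumptions, not the paper's implementation: the objective metric is scalar, both layers expose Gaussian predictive distributions through hypothetical callables `obj_predict` and `subj_predict`, and both information-gain terms use the closed form for a Gaussian latent observed with Gaussian noise, $\tfrac{1}{2}\log(1 + \sigma^2/\sigma_n^2)$. For pairwise preference data the subjective information gain has no such closed form, so that term is only a proxy here.

```python
import numpy as np

def gaussian_info_gain(var, noise_var):
    """Mutual information between a Gaussian latent and its noisy observation."""
    return 0.5 * np.log1p(var / noise_var)

def joint_acquisition(x, obj_predict, subj_predict,
                      noise_var_obj=0.1, noise_var_subj=0.1,
                      n_samples=64, rng=None):
    """Monte Carlo estimate of V(x) for a scalar objective metric.

    obj_predict(x)  -> (mean, var) of q_phi(y | x)   (hypothetical callable)
    subj_predict(y) -> (mean, var) of q_psi(h_y)     (hypothetical callable)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mu_y, var_y = obj_predict(x)
    term1 = gaussian_info_gain(var_y, noise_var_obj)        # objective information gain
    ys = rng.normal(mu_y, np.sqrt(var_y), size=n_samples)   # samples y ~ q_phi(y | x)
    mu_h, var_h = subj_predict(ys)                          # evaluated per sample
    term2 = gaussian_info_gain(var_h, noise_var_subj)       # subjective information gain (proxy)
    term3 = mu_h                                            # preference exploitation
    return float(term1 + np.mean(term2 + term3))

# Toy predictive models, purely illustrative.
obj_predict = lambda x: (np.sin(x), 0.05 + 0.5 * np.abs(x))
subj_predict = lambda y: (-(y - 0.5) ** 2, 0.1 * np.ones_like(y))

candidates = np.linspace(-2.0, 2.0, 9)
scores = [joint_acquisition(x, obj_predict, subj_predict) for x in candidates]
best = candidates[int(np.argmax(scores))]   # next scenario to evaluate
```

Dropping `term3` gives an exploration-only variant and keeping only `term3` gives an exploitation-only variant, which mirrors the ablation settings reported in the results.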

4. LLM Proxy Evaluator¶

GPT-4o is used as a stakeholder proxy for pairwise preference evaluation. The prompt includes:

Task description: Domain-specific context
Objective metrics: Measurable outcomes for both scenarios
Subjective criteria: Ethical preferences encoded in natural language
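The three prompt components can be assembled as in the sketch below. The wording of the prompt, the function names, and the reply parser are hypothetical; the paper's actual prompt text is not reproduced in this summary, and the call to the LLM itself is deliberately left out.

```python
def build_preference_prompt(task_description, metrics_a, metrics_b, criteria):
    """Assemble a pairwise-comparison prompt from the three listed components."""
    def fmt(metrics):
        return "\n".join(f"  - {name}: {value}" for name, value in metrics.items())
    return (
        f"Task: {task_description}\n\n"
        f"Scenario 1 outcomes:\n{fmt(metrics_a)}\n\n"
        f"Scenario 2 outcomes:\n{fmt(metrics_b)}\n\n"
        f"Stakeholder criteria: {criteria}\n\n"
        "Which scenario better satisfies the criteria? Answer with 1 or 2."
    )

def parse_preference(reply):
    """Map a free-text reply to the oracle output in {1, 2}; None if unclear."""
    reply = reply.strip()
    if reply.startswith("1"):
        return 1
    if reply.startswith("2"):
        return 2
    return None

prompt = build_preference_prompt(
    "Allocate generation resources on a small power grid.",   # illustrative task
    {"cost": 120.0, "unserved load": 0.0},                     # illustrative metrics
    {"cost": 90.0, "unserved load": 3.5},
    "Prioritize uninterrupted service to vulnerable households over cost.",
)
```

Swapping the `criteria` string is the only change needed to emulate a different stakeholder, which is how the summary describes adapting the evaluator to other ethical standards.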

Key Experimental Results¶

Case 1: Power Grid Resource Allocation (IEEE 5/30-Bus)¶

Method	5-Bus Preference Score (↑)	30-Bus Preference Score (↑)
Random	Low	Low
Single GP	Moderate	Fails
VS-AL-1	Fails	Fails
VS-AL-2	Fails	Fails
HVGP (SEED-SET)	Highest	Highest

Method	Preference Score (↑)	Coverage (↑)
Random	Low	Low
Single GP	Moderate	Moderate
HVGP (MI1+MI2 exploration only)	Moderate–High	Moderate–High
HVGP (Pref exploitation only)	High	Moderate
HVGP (Full acquisition)	Highest	Highest

Ablation Study: Acquisition Strategy Components¶

Acquisition Strategy	Optimal Test Case Rate (↑)	Spatial Coverage (↑)
Random sampling	1×	1×
MI1+MI2 only	1.4×	1.1×
Pref only	1.6×	0.9×
Full $V(x)$	2×	1.25×

Key Findings¶

SEED-SET generates twice as many optimal test cases as baselines, with a 1.25× improvement in search space coverage;
Significant advantage in high-dimensional scenarios: Single GP fails entirely on the 30-Bus case (40-dimensional design space), while HVGP remains effective;
Hierarchical modeling is essential: Decomposing $f$ into the composition $f_{\text{subj}} \circ f_{\text{obj}}$ yields more accurate estimates than directly modeling $f(x) \to z$;
All three acquisition terms are necessary: Removing any single term degrades performance;
LLM proxy is reliable: TrueSkill ratings confirm that GPT-4o preference judgments align with trends from handcrafted preference functions;
Adaptable to different stakeholders: Switching subjective criteria in the prompt enables rapid adaptation to different ethical standards.

Highlights & Insights¶

First ethical testing framework for autonomous systems that jointly considers objective metrics and subjective value judgments
The hierarchical HVGP design anchors subjective preferences to observable behaviors, enhancing interpretability
The joint acquisition strategy elegantly balances exploration and exploitation, with each of the three terms serving a distinct purpose
Using an LLM as a proxy evaluator reduces dependence on human experts
The framework is domain-agnostic, applicable to power grids, firefighting, transportation, and other scenarios

Limitations & Future Work¶

Assumes stakeholders report preferences truthfully (Assumption A2); strategic misreporting is not addressed
Assumes the set of objective metrics is fully known and fixed (Assumption A3); dynamic metric expansion is not considered
The LLM proxy may inherit GPT-4o's biases; preference consistency across different LLMs requires further validation
Scalability of VGP in extremely high-dimensional settings remains constrained by the number of inducing points
Handcrafted preference scoring functions rely on domain expertise

Related Work¶

AI ethics frameworks: NIST AI RMF 1.0 (2023), IEEE standards
Bayesian experimental design: Rainforth et al. (2024), Chaloner & Verdinelli (1995)
Preference learning: RLHF (Christiano et al., 2017), pairwise comparison GP (Chu & Ghahramani, 2005)
Active learning: Preference elicitation (Keswani et al., 2024)
LLM evaluators: Huang et al. (2025) use LLMs for preference evaluation

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First application of hierarchical Bayesian experimental design to ethical testing
Technical depth: ⭐⭐⭐⭐ — HVGP + joint acquisition + LLM evaluator as an integrated system
Experimental Thoroughness: ⭐⭐⭐⭐ — Three case studies + multi-dimensional ablation + stakeholder analysis
Value: ⭐⭐⭐⭐ — Domain-agnostic framework, though real-world deployment requires validation with actual stakeholders