SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing¶
Basic Information¶
- Conference: ICLR 2026
- arXiv: 2603.01630
- Code: Project Page
- Area: AI Safety / Autonomous System Evaluation
- Keywords: Ethical Testing, Bayesian Experimental Design, Gaussian Process, LLM Evaluator, Autonomous Systems
TL;DR¶
This paper proposes SEED-SET, a framework that formulates ethical evaluation of autonomous systems as a hierarchical Bayesian experimental design problem, jointly integrating objective metrics and subjective value judgments to efficiently generate test cases with high ethical alignment under limited evaluation budgets.
Background & Motivation¶
State of the Field¶
Autonomous systems (e.g., firefighting drones, power grid dispatch) are increasingly deployed in high-stakes domains, making evaluation of their ethical alignment critically important. Ethical assessment faces three major challenges:
- Measurement difficulty: ethical behaviors (fairness, social acceptability) lack ground-truth labels;
- Subjectivity: value alignment varies across stakeholders and evolves over time, so static benchmarks require continuous revision;
- Evaluation cost: real-system evaluation is budget-constrained, making large-scale human feedback collection infeasible.
Limitations of Prior Work¶
- Rule-based ethical benchmarks rely on predefined criteria and lack specificity;
- RL/RLHF-based methods assume abundant simulation or expert annotations, requiring large sample sizes;
- Preference-based methods and large-scale human studies focus on only a single dimension.
Mechanism¶
The framework jointly models objective metrics (e.g., fire damage, grid cost) and subjective preferences (stakeholder ethical judgments), efficiently generating test scenarios via hierarchical Gaussian processes and Bayesian experimental design.
Method¶
Overall Architecture¶
SEED-SET (Scalable Evolving Experimental Design for System-level Ethical Testing) comprises three components:
- Hierarchical Variational Gaussian Process (HVGP) as the surrogate model
- Joint acquisition strategy for adaptive test case generation
- LLM agent as a substitute for human preference evaluation
1. Problem Formulation¶
Given a black-box autonomous system \(\mathcal{S}_\pi\) and scenario space \(\mathcal{X}\), the ethical compliance function is decomposed into:
- Objective layer: \(f_{\text{obj}}: \mathcal{X} \to \mathcal{Y}\), mapping scenario parameters to measurable metrics (cost, resilience, etc.)
- Subjective layer: \(f_{\text{subj}}: \mathcal{Y} \to \mathbb{R}\), producing an ethical utility score from objective metrics
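The two-layer decomposition can be sketched as a function composition. This is a toy illustration, not the paper's code: `f_obj`, `f_subj`, and the specific metric formulas below are assumptions chosen only to make the structure concrete.

```python
# Toy sketch of the hierarchical decomposition f = f_subj o f_obj.
# All concrete formulas here are illustrative stand-ins.
import numpy as np

def f_obj(x: np.ndarray) -> np.ndarray:
    """Objective layer: scenario parameters x -> measurable metrics y."""
    cost = float(np.sum(x ** 2))             # e.g. an operating-cost proxy
    resilience = float(1.0 / (1.0 + cost))   # e.g. a resilience proxy
    return np.array([cost, resilience])

def f_subj(y: np.ndarray) -> float:
    """Subjective layer: metrics y -> scalar ethical utility z."""
    cost, resilience = y
    return resilience - 0.1 * cost           # toy stakeholder utility

def ethical_compliance(x: np.ndarray) -> float:
    """Composite ethical compliance f(x) = f_subj(f_obj(x))."""
    return f_subj(f_obj(x))

z = ethical_compliance(np.array([0.5, -0.2]))
```

The point of the decomposition is that only `f_obj` touches the black-box system; `f_subj` operates on interpretable metrics, which is what lets the subjective layer be queried via preferences.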
2. Hierarchical Variational Gaussian Process (HVGP)¶
Ethical evaluation is modeled as a two-level VGP hierarchy:
Objective GP: Learns a surrogate model \(g: x \to y\) predicting objective metrics for a given scenario, with posterior $$ p(g(x) \mid \mathcal{D}) = \mathcal{N}\big(\mu(x), \sigma^2(x)\big) $$
Subjective GP: Learns a preference model \(h: y \to z\) mapping objective metrics to subjective ethical scores
Since subjective evaluations lack ground-truth labels, pairwise preference elicitation is adopted: oracle \(\mathcal{T}: (y, y') \to \{1, 2\}\) compares the ethical quality of two scenarios.
The hierarchical structure offers two key advantages:
- Interpretability: ethical preferences are anchored to observable system behaviors
- Data efficiency: exploiting the dependence of the subjective layer on the objective layer reduces the number of required evaluations
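Pairwise preference elicitation can be sketched with a Bradley-Terry-style likelihood linking comparisons to a latent utility. This is a minimal illustration under assumptions of my own (a linear stand-in for the subjective GP's latent function and a logistic link); the paper's exact likelihood may differ.

```python
# Sketch of the oracle T: (y, y') -> {1, 2} with a logistic
# preference likelihood over a latent utility h(y).
import numpy as np

def latent_utility(y: np.ndarray, w: np.ndarray) -> float:
    """Stand-in for the subjective GP's latent function h(y)."""
    return float(w @ y)

def preference_prob(y1: np.ndarray, y2: np.ndarray, w: np.ndarray) -> float:
    """P(oracle prefers y1 over y2) under a logistic link."""
    diff = latent_utility(y1, w) - latent_utility(y2, w)
    return 1.0 / (1.0 + np.exp(-diff))

def oracle(y1, y2, w, rng) -> int:
    """Simulated oracle T: returns 1 if y1 is preferred, else 2."""
    return 1 if rng.random() < preference_prob(y1, y2, w) else 2

rng = np.random.default_rng(0)
w = np.array([-0.1, 1.0])  # dislikes cost, values resilience
label = oracle(np.array([2.0, 0.9]), np.array([5.0, 0.3]), w, rng)
```

In SEED-SET the oracle's role is played by either a human stakeholder or the LLM proxy described below; only the binary labels {1, 2} are observed, never the utility itself.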
3. Joint Acquisition Strategy¶
The core innovation is an acquisition function \(V(x)\) that simultaneously balances objective exploration and subjective exploitation.
Roles of the three terms:
- First term: reduces uncertainty in the objective metric space (scenario exploration)
- Second term: improves estimation of the subjective utility function (preference learning)
- Third term: directs the search toward regions of high predicted ethical utility (preference exploitation)
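A hedged sketch of a three-term acquisition in this spirit follows. The summary does not reproduce the paper's exact formula, so the term definitions (log-variance proxies for the two mutual-information terms) and the weights `alpha`, `beta`, `gamma` are illustrative assumptions.

```python
# Illustrative three-term acquisition V(x): two exploration terms
# plus a preference-exploitation term. Not the paper's exact form.
import numpy as np

def acquisition(sigma_obj: float, sigma_subj: float, mu_subj: float,
                alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """V(x) = alpha*MI1 + beta*MI2 + gamma*(preference exploitation).

    sigma_obj  : predictive std of the objective GP at x       (term 1)
    sigma_subj : predictive std of the subjective GP at g(x)   (term 2)
    mu_subj    : predicted ethical utility at g(x)             (term 3)
    """
    mi1 = np.log1p(sigma_obj ** 2)   # uncertainty over objective metrics
    mi2 = np.log1p(sigma_subj ** 2)  # uncertainty over the utility model
    return alpha * mi1 + beta * mi2 + gamma * mu_subj

# Select the candidate scenario with the highest joint score.
candidates = {"A": (0.8, 0.2, 0.1),   # uncertain but low predicted utility
              "B": (0.1, 0.1, 0.9)}   # well-modeled, high predicted utility
best = max(candidates, key=lambda k: acquisition(*candidates[k]))
```

The ablation in the results section corresponds to zeroing out subsets of these weights: MI1+MI2 only (`gamma=0`) over-explores, while the preference term alone (`alpha=beta=0`) sacrifices coverage.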
4. LLM Proxy Evaluator¶
GPT-4o is used as a stakeholder proxy for pairwise preference evaluation. The prompt includes:
- Task description: Domain-specific context
- Objective metrics: Measurable outcomes for both scenarios
- Subjective criteria: Ethical preferences encoded in natural language
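Assembling the three prompt parts can be sketched as below. The field names, wording, and example metrics are my own assumptions for illustration, not the paper's actual template.

```python
# Illustrative assembly of the pairwise-comparison prompt for the
# LLM proxy evaluator; wording and field names are assumed.
def build_preference_prompt(task: str, criteria: str,
                            metrics_a: dict, metrics_b: dict) -> str:
    fmt = lambda m: ", ".join(f"{k}={v}" for k, v in m.items())
    return (
        f"Task description: {task}\n"
        f"Subjective criteria: {criteria}\n"
        f"Scenario 1 objective metrics: {fmt(metrics_a)}\n"
        f"Scenario 2 objective metrics: {fmt(metrics_b)}\n"
        "Which scenario is more ethically preferable? Answer 1 or 2."
    )

prompt = build_preference_prompt(
    task="Allocate backup power on a 5-bus grid during an outage",
    criteria="Prioritize hospitals; minimize total unserved load",
    metrics_a={"cost": 120, "unserved_load": 0.05},
    metrics_b={"cost": 90, "unserved_load": 0.20},
)
```

Swapping the `criteria` string is what enables the stakeholder adaptability noted in the findings: different ethical standards become different prompts, with no retraining of the surrogate.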
Key Experimental Results¶
Case 1: Power Grid Resource Allocation (IEEE 5/30-Bus)¶
| Method | 5-Bus Preference Score (↑) | 30-Bus Preference Score (↑) |
|---|---|---|
| Random | Low | Low |
| Single GP | Moderate | Fails |
| VS-AL-1 | Fails | Fails |
| VS-AL-2 | Fails | Fails |
| HVGP (SEED-SET) | Highest | Highest |
Case 2: Firefighting Rescue (Drone Navigation)¶
| Method | Preference Score (↑) | Coverage (↑) |
|---|---|---|
| Random | Low | Low |
| Single GP | Moderate | Moderate |
| HVGP (MI1+MI2 exploration only) | Moderate–High | Moderate–High |
| HVGP (Pref exploitation only) | High | Moderate |
| HVGP (Full acquisition) | Highest | Highest |
Ablation Study: Acquisition Strategy Components¶
| Acquisition Strategy | Optimal Test Case Rate (↑) | Spatial Coverage (↑) |
|---|---|---|
| Random sampling | 1× | 1× |
| MI1+MI2 only | 1.4× | 1.1× |
| Pref only | 1.6× | 0.9× |
| Full \(V(x)\) | 2× | 1.25× |
Key Findings¶
- SEED-SET generates twice as many optimal test cases as baselines, with a 1.25× improvement in search space coverage;
- Significant advantage in high-dimensional scenarios: Single GP fails entirely on the 30-Bus case (40-dimensional design space), while HVGP remains effective;
- Hierarchical modeling is essential: Decomposing \(f\) into \(f_{\text{obj}} + f_{\text{subj}}\) yields more accurate estimates than directly modeling \(f(x) \to z\);
- All three acquisition terms are necessary: Removing any single term degrades performance;
- LLM proxy is reliable: TrueSkill ratings confirm that GPT-4o preference judgments align with trends from handcrafted preference functions;
- Adaptable to different stakeholders: Switching subjective criteria in the prompt enables rapid adaptation to different ethical standards.
Highlights & Insights¶
- First ethical testing framework for autonomous systems that jointly considers objective metrics and subjective value judgments
- The hierarchical HVGP design anchors subjective preferences to observable behaviors, enhancing interpretability
- The joint acquisition strategy elegantly balances exploration and exploitation, with each of the three terms serving a distinct purpose
- Using an LLM as a proxy evaluator reduces dependence on human experts
- The framework is domain-agnostic, applicable to power grids, firefighting, transportation, and other scenarios
Limitations & Future Work¶
- Assumes stakeholders report preferences truthfully (Assumption A2); strategic misreporting is not addressed
- Assumes the set of objective metrics is fully known and fixed (Assumption A3); dynamic metric expansion is not considered
- The LLM proxy may inherit GPT-4o's biases; preference consistency across different LLMs requires further validation
- Scalability of VGP in extremely high-dimensional settings remains constrained by the number of inducing points
- Handcrafted preference scoring functions rely on domain expertise
Related Work & Insights¶
- AI ethics frameworks: NIST AI RMF 1.0 (2023), IEEE standards
- Bayesian experimental design: Rainforth et al. (2024), Chaloner & Verdinelli (1995)
- Preference learning: RLHF (Christiano et al., 2017), pairwise comparison GP (Chu & Ghahramani, 2005)
- Active learning: Preference elicitation (Keswani et al., 2024)
- LLM evaluators: Huang et al. (2025) use LLMs for preference evaluation
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First application of hierarchical Bayesian experimental design to ethical testing
- Technical depth: ⭐⭐⭐⭐ — HVGP + joint acquisition + LLM evaluator as an integrated system
- Experimental thoroughness: ⭐⭐⭐⭐ — Three case studies + multi-dimensional ablation + stakeholder analysis
- Value: ⭐⭐⭐⭐ — Domain-agnostic framework, though real-world deployment requires validation with actual stakeholders