NeurIPS Should Lead Scientific Consensus on AI Policy
Conference: NeurIPS 2025
arXiv: 2510.00075
Code: None (position paper)
Area: AI Policy / AI Governance
Keywords: Scientific Consensus, AI Policy, IPCC Model, NeurIPS, Evidence-Based Policy
TL;DR
This position paper argues that NeurIPS should proactively assume the role of facilitating scientific consensus in AI policy, drawing on the successful experience of the IPCC (Intergovernmental Panel on Climate Change) in climate science to fill the current gap in AI policy consensus mechanisms.
Background & Motivation
Background: Governments worldwide are actively formulating AI policies (e.g., the EU's mandatory requirements vs. the US's voluntary frameworks), yet policy design for a transformative technology like AI requires rigorous scientific evidence and consensus as foundations. Evidence production in AI research is already abundant (NeurIPS 2025 received over 25,000 submissions), yet the critical information policymakers need remains unsettled.
Limitations of Prior Work: The authors decompose scientifically-driven AI policymaking into three sub-problems: (i) evidence generation, (ii) evidence synthesis, and (iii) scientific consensus. NeurIPS plays a central role in evidence generation and participates to some extent in evidence synthesis (e.g., publishing surveys and meta-analyses), but is entirely absent from scientific consensus formation.
Key Challenge: Despite the increasingly profound societal impact of AI and policymakers' urgent need for scientific consensus to guide legislation (e.g., the EU AI Act requires model evaluations against the "state of the art"), no formal mechanism currently exists for forming scientific consensus on AI policy. This stands in stark contrast to climate science, where the IPCC has operated for 37 years and produced six assessment reports.
Goal: To argue that NeurIPS is the optimal venue for advancing scientific consensus formation in AI policy, and to propose concrete pilot initiatives.
Key Insight: Drawing from multidisciplinary consensus literature spanning sociology, philosophy, political science, and economics, combined with lessons from the IPCC, the paper designs a consensus-formation framework applicable to the AI policy domain.
Core Idea: Leveraging its unparalleled convening power and reputation, NeurIPS should actively catalyze scientific consensus in AI policy in the same way the IPCC has led consensus in climate science.
Method
Overall Architecture
Rather than a conventional technical paper, this work is a well-structured position paper whose argumentative framework consists of three parts: (1) identifying the problem—the absence of a consensus-formation mechanism; (2) arguing for NeurIPS's unique advantages; and (3) proposing concrete pilot initiatives based on IPCC experience.
Key Designs
- Problem Diagnosis — Three-Level Analytical Framework:
- Function: Decomposes the scientific foundations of AI policy into three progressive levels—"evidence generation → evidence synthesis → scientific consensus"—and analyzes the involvement of NeurIPS and other institutions at each level.
- Mechanism: Evidence generation is covered by NeurIPS (paper publication) and other channels (company reports, investigative journalism); evidence synthesis has the International AI Safety Report led by Yoshua Bengio as a benchmark mechanism; only scientific consensus formation is entirely absent.
- Design Motivation: This layered analysis precisely locates the problem—not a lack of evidence, nor a lack of synthesis, but a lack of consensus mechanism—making the paper's call to action more focused and persuasive.
- Arguing for NeurIPS's Advantages — Dual-Role Analysis:
- Function: Argues for NeurIPS's unique suitability from two dimensions: its "internal role" (toward the AI scientific community) and its "external role" (toward policymakers and the public).
- Mechanism: Internally, NeurIPS possesses unparalleled legitimacy—expert participation (participants are scientists) and inclusivity (NeurIPS 2024 had over 16,000 in-person attendees from around the world); externally, NeurIPS possesses unparalleled credibility—ranked 7th globally by Google Scholar h5-index as a top scientific venue.
- Design Motivation: Drawing on multidisciplinary consensus literature from sociology and philosophy (Zollman 2012; Stegenga 2016; Miller 2019), the paper treats "legitimacy" and "credibility" as necessary conditions for consensus formation and argues that NeurIPS satisfies both.
- Analysis of Alternative Mechanisms' Weaknesses:
- Function: Systematically analyzes all plausible alternatives (the International AI Safety Report, AI Summit series, other AI conferences, advisory bodies, etc.) and identifies their respective limitations.
- Mechanism: The International AI Safety Report's core mission is evidence synthesis rather than consensus formation, and its production process is enormously time-consuming (100+ contributors over several months), making scope expansion impractical; AI Summit series (e.g., the UK AI Safety Summit, the Paris AI Action Summit) are overly entangled in high-level geopolitics—exemplified by the dramatic shift from VP Kamala Harris's focus on the "full spectrum of AI risks" in 2023 to J.D. Vance's pivot toward "AI opportunity" in 2025; other AI conferences (FAccT, ICML, CVPR, ACL) fall short of NeurIPS in either breadth or prestige.
- Design Motivation: The process of elimination reinforces the argument that NeurIPS is the uniquely optimal choice.
- Learning from IPCC — Three Dimensions of Legitimacy, Credibility, and Influence:
- Function: Conducts an in-depth analysis of the IPCC assessment report process to extract transferable lessons.
- Mechanism:
    - Legitimacy derives from broad expert nominations; author selection for diversity across expertise, geography, gender, and discipline; consensus reached through "reasonable agreement" rather than voting; and calibrated uncertainty language when consensus cannot be reached.
    - Credibility derives from deep government involvement while scientists maintain independence, multiple rounds of external review, and conflict-of-interest disclosure mechanisms.
    - Influence derives from self-positioning as "policy-relevant but not policy-prescriptive," the production of Summary for Policymakers (SPM) documents, and line-by-line government editing of the SPM with scientists retaining veto power.
- Design Motivation: The IPCC is the most successful scientific consensus mechanism over 37 years (awarded the 2007 Nobel Peace Prize), providing the most valuable reference blueprint for the AI domain.
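The "calibrated uncertainty language" mechanism is concrete enough to sketch. The IPCC's AR5 uncertainty guidance maps probability bands to fixed likelihood terms; a minimal illustration (band edges follow the AR5 guidance note, but the official bands overlap and also include "extremely (un)likely" qualifiers, so this sketch simply picks the narrowest core term):

```python
def calibrated_likelihood(p: float) -> str:
    """Map a probability estimate onto the IPCC's calibrated likelihood
    scale (AR5 uncertainty guidance, core terms only). The official
    bands overlap; this sketch returns the narrowest applicable term."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p > 0.99:
        return "virtually certain"
    if p > 0.90:
        return "very likely"
    if p > 0.66:
        return "likely"
    if p >= 0.33:
        return "about as likely as not"
    if p >= 0.10:
        return "unlikely"
    if p >= 0.01:
        return "very unlikely"
    return "exceptionally unlikely"

for p in (0.97, 0.5, 0.05):
    print(p, "->", calibrated_likelihood(p))
```

The point of such a scale is that an assessment body can still publish a precise, comparable statement when full consensus is out of reach.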
Pilot Initiative Design
The authors propose three concrete, low-cost pilot initiatives operating at different levels:
- Standing Working Group: A leadership body operating year-round, responsible for continuously advancing consensus formation and presenting conclusions at the NeurIPS conference. Consensus formation requires intermittent dialogue throughout the year, not activity confined to the December conference.
- Dedicated Track: A dedicated track in the NeurIPS call for papers to encourage research that promotes scientific consensus—including surveys of scientist opinions, meta-analyses of conflicting evidence, and novel methods such as if-then protocols.
- Debates and Surveys: Debate sessions in the main conference program (rather than only keynotes and panel discussions), along with community surveys administered before and after the conference to assess consensus levels, drawing on the precedent of the NLP community's 2022 meta-survey.
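The pre/post survey idea can be made concrete with a toy metric. A minimal sketch, assuming hypothetical responses and a simple "share of respondents holding the modal position" measure (the 2022 NLP meta-survey used agree/disagree items; the function name and data here are illustrative):

```python
from collections import Counter

def consensus_level(responses):
    """Fraction of respondents holding the modal position on a contested
    statement (1.0 = full consensus; ~1/k when split evenly across k options)."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

# Hypothetical pre- and post-conference responses to one survey statement.
pre  = ["agree"] * 48 + ["disagree"] * 42 + ["no opinion"] * 10
post = ["agree"] * 63 + ["disagree"] * 30 + ["no opinion"] * 7

print("before:", consensus_level(pre))   # 0.48
print("after: ", consensus_level(post))  # 0.63
```

Comparing the measure before and after the conference gives a crude but auditable signal of whether debates and working-group activity actually moved the community toward agreement.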
Key Experimental Results
This paper presents no traditional experimental data. However, two concrete consensus case studies are provided:
Case Study 1: Evaluation Selection
| Aspect | Current State | Value That Consensus Could Provide |
|---|---|---|
| Evaluation validity | Which evaluations have construct validity and reliability remains unclear | Scientific consensus can confirm which evaluations are trustworthy |
| Evaluation cost | Independent measurement of running costs and cost-reduction approaches is lacking | Consensus can provide a standardized framework for cost assessment |
| Evaluation gaps | E.g., no consensus-based tools exist for biosecurity evaluations | The consensus formation process can reveal research gaps |
| Policy needs | EU AI Act requires evaluations against the "state of the art" | Scientific consensus directly satisfies the policy definition requirement |
Case Study 2: Threshold Design
| Aspect | Current State | Value That Consensus Could Provide |
|---|---|---|
| Compute threshold debate | Proponents (Heim & Koessler 2024) vs. opponents (Hooker 2024) vs. conditional supporters (Bommasani 2023) | Consensus can move beyond binary debate to identify hybrid approaches |
| Candidate metric limitations | Training compute is reliably measured but a poor proxy; post-deployment usage statistics are risk-relevant but hard to measure | Consensus can establish trade-off principles (predictive validity vs. measurement cost) |
| Infrastructure gaps | Some quantities that could yield better thresholds cannot currently be measured | The consensus process can drive policy–research collaboration to build measurement tools |
| Policy precedent | Biden Executive Order used a \(10^{26}\) FLOP threshold | Consensus can provide a scientific basis for future threshold design |
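For the compute-threshold row, the standard back-of-the-envelope estimate for dense transformers is training FLOPs ≈ 6 × parameters × training tokens. A sketch of where the \(10^{26}\) threshold bites at hypothetical model scales (the 6ND approximation is a widely used rule of thumb, and the model sizes below are illustrative assumptions, not figures from the paper):

```python
THRESHOLD_FLOP = 1e26  # threshold cited from the Biden Executive Order

def training_flops(n_params: float, n_tokens: float) -> float:
    """Back-of-the-envelope estimate for dense transformers:
    ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

# Hypothetical model scales, to show where the threshold bites.
for params, tokens in [(70e9, 2e12), (400e9, 15e12), (1e12, 30e12)]:
    flops = training_flops(params, tokens)
    flag = "above" if flops > THRESHOLD_FLOP else "below"
    print(f"{params:.0e} params x {tokens:.0e} tokens -> {flops:.1e} FLOPs ({flag} 1e26)")
```

The sketch also illustrates the paper's point about proxies: training compute is easy to estimate this way, which is precisely why it is attractive as a regulatory trigger despite being a poor proxy for deployed risk.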
Key Findings
- NeurIPS ranks 7th among global scientific venues by Google Scholar h5-index, commanding unmatched academic prestige
- NeurIPS 2024 attracted over 16,000 in-person attendees from disciplines and industries worldwide
- NeurIPS 2025 received over 25,000 submissions, demonstrating abundant evidence production while critical information remains missing
- The IPCC has produced 6 assessment reports over 37 years, directly informing major international policy instruments such as the Kyoto Protocol and the Paris Agreement
Highlights & Insights
- Precise Problem Diagnosis: The clear stratification of "evidence generation–evidence synthesis–scientific consensus" identifies the problem as a gap in consensus mechanisms rather than insufficient evidence. This analytical framework itself carries methodological value.
- Depth of the IPCC Analogy: Rather than a superficial analogy, the paper conducts an in-depth analysis of IPCC's mechanisms of legitimacy, credibility, and influence, extracting actionable lessons—particularly the specific mechanisms of "calibrated uncertainty language" and the "Summary for Policymakers."
- Proactive Engagement with Anticipated Objections: A dedicated "Alternative Views" section directly addresses two anticipated objections—"this is not NeurIPS's place" and "the AI community is too divided to reach consensus"—with well-reasoned responses.
- Pragmatism of the Pilot Proposals: The three pilot initiatives are designed as low-cost interventions that allow rapid assessment of whether further investment is warranted, lowering the implementation barrier of the proposal.
Limitations & Future Work
- Concern About Industry Influence: The paper acknowledges that NeurIPS's entanglement with the AI industry (attendee composition, leadership, funding dependencies) may compromise the independence of consensus, but treats this concern somewhat optimistically, suggesting it can be managed by analogy with peer review and ethics review. In practice, industry penetration of consensus processes may be considerably more severe than its penetration of paper review.
- Blurry Boundaries of Consensus Scope: The paper deliberately selects topics that "depend on scientific expertise" (e.g., evaluation selection), but many critical AI policy questions (e.g., open-source vs. closed-source, regulatory intensity) inherently involve ideological and value disagreements. Where the boundaries of achievable consensus lie is not clearly articulated.
- Global Representativeness: NeurIPS attendees are predominantly from North America and Europe; AI researchers and policy needs from the Global South may be systematically marginalized.
- Uncertainty in Implementation Pathways: While pragmatic, the three pilot proposals lack concrete timelines, budgets, and governance structure designs; substantial organizational work remains between proposal and execution.
- Relationship with the Newly Established UN Independent International AI Scientific Panel: The paper was written prior to the panel's establishment, and its formation may alter the comparative advantage analysis of NeurIPS relative to other institutions.
Related Work & Insights
- The IPCC Model as the Core Reference: The IPCC assessment report process (broad nominations → diversity-based selection → reasonable agreement → calibrated uncertainty → Summary for Policymakers) provides the most complete template for AI policy consensus.
- Bengio et al.'s International AI Safety Report (2025): Currently the best AI policy evidence synthesis mechanism, authored by 96 international experts and supported by 30 countries.
- The Trend of Multi-Author AI Policy Papers: Large-scale collaborative papers such as Brundage et al. 2018, Bengio et al. 2024, Kapoor et al. 2024, and Longpre et al. 2024 represent early attempts at small-coalition consensus.
- NLP Community Meta-Survey (Michael et al. 2023): A 2022 survey of NLP researchers' views on contested questions, serving as a precedent for community-level consensus assessment.
- Insights: This paper raises an important and underappreciated question of institutional design—the technical community should not only produce knowledge but also actively organize the social consensus around that knowledge. This holds reference value for any rapidly evolving technology domain.
Rating
- Novelty: ⭐⭐⭐⭐ — First systematic introduction of scientific consensus formation mechanisms into AI policy discourse, positioning NeurIPS as a consensus hub. Problem identification is precise and the perspective is distinctive.
- Experimental Thoroughness: ⭐⭐⭐ — No traditional experiments, as expected for a position paper, but the case analyses (evaluation selection and threshold design) are persuasive, and the in-depth analysis of IPCC experience compensates for the lack of empirical evidence.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, rigorous argumentation, effective responses to anticipated objections, and wide-ranging citations spanning multiple disciplines; an exemplary position paper.
- Value: ⭐⭐⭐⭐ — Raises an institutional design challenge the AI research community urgently needs to confront, with actionable pilot proposals. Ultimately, however, the value depends on whether NeurIPS is willing to adopt and implement them.