Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing¶

Conference: ACL 2026
arXiv: 2606.00033
Code: No public repository; the paper proposes concepts for a platform and auditing framework
Area: Mechanistic Interpretability / AI Safety Auditing / Metascience
Keywords: Mechanistic Interpretability, Auditing Standards, Continuous Reviewing, Community Guidelines, Source Evidence Tracking

TL;DR¶

This is a position paper arguing that mechanistic interpretability research must incorporate a layer of "auditability." By establishing a continuous collaborative reviewing platform, community-refined guidelines, and source evidence tracking systems, it aims to transform fragmented replications, negative results, and methodological critiques into auditing protocols suitable for safety-critical scenarios.

Background & Motivation¶

Background: Mechanistic Interpretability (MI) has provided valuable explanations for the internal mechanisms of neural networks and has been applied to model steering, hallucination detection, and AI auditing. As high-risk sectors such as healthcare AI, autonomous driving, and financial regulation begin to focus on interpretability, MI conclusions are no longer just research insights but may serve as evidence for deployment and governance decisions.

Limitations of Prior Work: The paper illustrates the problem with a representative example: two MI studies gave conflicting explanations for the same behavioral mechanism, and both passed peer review. Only when a third paper analyzed them within a unified framework did it become clear that both were partially correct but not directly comparable, because their experimental methods were inconsistent. Such conflicts are acceptable in ordinary research discussion, but stakeholders in medical diagnosis, autonomous systems, or financial regulation need to know "which explanation is credible, why is it credible, and where is the chain of evidence?"

Key Challenge: MI experiments are highly sensitive to details such as metric selection, construction of corrupted samples, component granularity, and causal intervention settings. Tutorials, courses, forums, and blog discussions exist, but the community lacks a standardized, continuously updated, and traceable auditing process. Consequently, many useful replications, negative results, and methodological warnings are scattered across social media, forums, and short posts, which makes them difficult to incorporate into formal papers or for later researchers to use systematically.

Goal: Rather than directly providing a set of final auditing standards, the authors call for the community to establish mechanisms to generate such standards. Goals include: organizing meta-results outside papers, developing community-refined guidelines, explicitly tracking assumptions and evidence for claims, and exploring source evidence auditing assisted by agentic AI.

Key Insight: The paper views MI auditing as a "methodological infrastructure" problem, similar to software engineering standards, clinical GRADE, or MIAME in life sciences. Standards should not be dictated by a few authorities behind closed doors but should be continuously refined by an open community based on experiment repositories, replication results, controversial discussions, and expert validation.

Core Idea: Use a Collaborative Meta-Analysis Platform to host continuous reviews and experiment repositories, translating recurring good practices supported by experts and evidence into living guidelines. Simultaneously, develop source-based auditing to allow every interpretability claim to be traced back to specific assumptions, experiments, figures, code, and other claims.

Method¶

Overall Architecture¶

The paper proposes an MI auditing ecosystem rather than an algorithm. It is divided into three layers: the first is a Continuous Reviewing platform, allowing replications, negative results, critiques, supplementary experiments, and partial results to be recorded without waiting for a new paper; the second is Community-Refined Guidelines and Protocols, distilling recurring effective practices from the platform into community-accepted minimal standards; the third is Source-Based Automated Auditing, using explicit evidence chains and probabilistic logic to help humans and AI agents track the credibility of claims. The layers build on one another: the platform accumulates fragmented experience, the guidelines distill standards from that evidence, and source-based tracking lets humans and machines jointly verify the credibility of claims.

In this framework, peer review still exists but is no longer the sole node of quality control. "Cleanup work" outside of papers is given a clear place: researchers can upload experiment repositories, comment on claims, record replication failures, and add edge cases, while the platform organizes this meta-knowledge into a searchable, citable, and cumulative community memory.

Key Designs¶

1. Continuous Collaborative Reviewing Platform: A home for "cleanup work" beyond papers

Much of the most valuable practical MI knowledge (replications, negative results, post-hoc extensions, partial results, small counterexamples, and methodological critiques) is currently scattered across Twitter, Discord, forums, and private communications, where it is easily lost and hard for new researchers and LLMs to retrieve. The proposed platform consists of experiment repositories and forums: repositories store hypotheses, evidence, claims, code, and paper links for each study, while forums allow for ongoing debate around specific claims and guideline pages. It functions like a combination of OpenReview, LessWrong, and GitHub, but focuses on continuously revising the evidence status of existing research rather than publishing new papers.
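To make the repository idea concrete, the following is a minimal sketch of what one experiment-repository record could look like. The paper does not specify a schema; every class, field, and method name here (`ExperimentRepository`, `MetaResult`, `contested_claims`, and so on) is a hypothetical illustration, and the links are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class MetaResultKind(Enum):
    """Kinds of post-publication contributions named in the text above."""
    REPLICATION = "replication"
    NEGATIVE_RESULT = "negative_result"
    CRITIQUE = "critique"
    EXTENSION = "extension"


@dataclass
class MetaResult:
    """A contribution attached to one claim after publication (hypothetical schema)."""
    kind: MetaResultKind
    author: str
    summary: str
    supports_claim: bool


@dataclass
class Claim:
    claim_id: str
    statement: str
    hypotheses: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)  # e.g. figure or script references
    meta_results: list[MetaResult] = field(default_factory=list)


@dataclass
class ExperimentRepository:
    """Per-study record: paper link, code link, and the claims the study makes."""
    paper_link: str
    code_link: str
    claims: dict[str, Claim] = field(default_factory=dict)

    def add_claim(self, claim: Claim) -> None:
        self.claims[claim.claim_id] = claim

    def attach(self, claim_id: str, result: MetaResult) -> None:
        """Record a meta-result against an existing claim."""
        self.claims[claim_id].meta_results.append(result)

    def contested_claims(self) -> list[str]:
        """Claims with at least one supporting and one opposing meta-result."""
        contested = []
        for claim in self.claims.values():
            stances = {r.supports_claim for r in claim.meta_results}
            if stances == {True, False}:
                contested.append(claim.claim_id)
        return sorted(contested)


# Placeholder study with one claim, one replication, and one negative result.
repo = ExperimentRepository(paper_link="<paper link>", code_link="<code link>")
repo.add_claim(Claim("c1", "Component X implements behavior Y", evidence=["Figure 3"]))
repo.attach("c1", MetaResult(MetaResultKind.REPLICATION, "alice", "Reproduced on the same model", True))
repo.attach("c1", MetaResult(MetaResultKind.NEGATIVE_RESULT, "bob", "Effect absent under a different metric", False))
print(repo.contested_claims())
```

Keeping meta-results attached to individual claims rather than to whole papers is the design choice that lets a forum thread or an auditor query the evidence status of one specific statement.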

2. Community-Refined Minimal Guidelines: Solidifying good practices through evidence rather than authority

MI experiments are extremely sensitive to details like metric choice and causal intervention settings, yet the community lacks shared minimum standards. The authors propose allowing researchers to create "Proposed Guideline" pages (e.g., "A certain type of circuit validity must pass a specific sanity check"), where both supporters and opponents must provide evidence from papers, repositories, or meta-results. Professional auditing bodies can then select standards by viewing the evidence chain and controversy history. Crucially, the authors oppose turning guidelines into rigid dogma; they should be minimal, logically justified, and empirically supported requirements that help researchers avoid missing critical checks without stifling exploration.
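A "Proposed Guideline" page can be sketched as a small data structure that refuses positions without evidence. This is only an illustration under assumed names (`ProposedGuideline`, `Position`, `evidence_summary`); the paper describes the requirement that supporters and opponents cite evidence, not any particular implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Position:
    """One participant's stance on a proposed guideline (hypothetical schema)."""
    author: str
    supports: bool
    evidence_refs: tuple[str, ...]  # references to papers, repositories, or meta-results


@dataclass
class ProposedGuideline:
    title: str
    rationale: str
    positions: list[Position] = field(default_factory=list)

    def add_position(self, position: Position) -> None:
        # Both supporters and opponents must back their stance with evidence.
        if not position.evidence_refs:
            raise ValueError("a position must cite at least one piece of evidence")
        self.positions.append(position)

    def evidence_summary(self) -> dict[str, list[str]]:
        """Group cited evidence by stance, for an auditing body to inspect."""
        summary: dict[str, list[str]] = {"for": [], "against": []}
        for position in self.positions:
            key = "for" if position.supports else "against"
            summary[key].extend(position.evidence_refs)
        return summary


guideline = ProposedGuideline(
    title="Circuit claims should pass a stated sanity check",
    rationale="Placeholder rationale for illustration",
)
guideline.add_position(Position("alice", True, ("<replication record>",)))
guideline.add_position(Position("bob", False, ("<counterexample record>",)))
print(guideline.evidence_summary())
```

The summary deliberately reports evidence rather than a vote count, which matches the idea that standards are selected by inspecting the evidence chain and controversy history.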

3. Source-Based Automated Auditing Assistance: Tracing every claim back to specific figures, experiments, and code

Ordinary citations only indicate that a paper was consulted, but the credibility of an MI claim often depends on which specific plot, ablation, corrupted prompt, seed, or metric it relies upon. Source-based auditing requires explicitly tracking each claim's dependencies on specific pieces of evidence and updating the claim's credibility whenever one of those dependencies is weakened. Given the vast number of MI claims, the paper suggests using agentic AI to help track long dependency chains and run evaluation harnesses, and using frameworks like Probabilistic Soft Logic to weight the relationships between hypotheses and observations. Automated systems do not replace human judgment; they convert explanations into testable claims so that selective evidence, post-hoc hypotheses, and missing ablations are exposed.
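The dependency-tracking idea can be sketched as a small graph in which a claim's credibility is recomputed from the evidence it depends on. This is a simplified illustration, not the paper's system and not full Probabilistic Soft Logic inference: it only borrows the Łukasiewicz conjunction that PSL uses for soft truth values in [0, 1], and all names (`EvidenceGraph`, `affected_by`, the node identifiers) are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def lukasiewicz_and(values: list[float]) -> float:
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    if not values:
        return 1.0
    return max(0.0, sum(values) - (len(values) - 1))


@dataclass
class Node:
    node_id: str
    kind: str  # "evidence" (a figure, ablation, seed, ...) or "claim"
    credibility: float = 1.0  # only meaningful for evidence nodes
    depends_on: list[str] = field(default_factory=list)


class EvidenceGraph:
    """Claims depend on evidence or on other claims; assumed to be acyclic."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add_evidence(self, node_id: str, credibility: float) -> None:
        self.nodes[node_id] = Node(node_id, "evidence", credibility)

    def add_claim(self, node_id: str, depends_on: list[str]) -> None:
        self.nodes[node_id] = Node(node_id, "claim", depends_on=list(depends_on))

    def credibility(self, node_id: str) -> float:
        """A claim is only as credible as the conjunction of what it relies on."""
        node = self.nodes[node_id]
        if node.kind == "evidence":
            return node.credibility
        return lukasiewicz_and([self.credibility(dep) for dep in node.depends_on])

    def _transitive_deps(self, node_id: str) -> set[str]:
        seen: set[str] = set()
        stack = list(self.nodes[node_id].depends_on)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current].depends_on)
        return seen

    def affected_by(self, node_id: str) -> list[str]:
        """All nodes that directly or indirectly rely on the given node."""
        return sorted(n for n in self.nodes if node_id in self._transitive_deps(n))


graph = EvidenceGraph()
graph.add_evidence("figure_3", 1.0)
graph.add_evidence("ablation_a", 0.75)
graph.add_claim("claim_circuit", ["figure_3", "ablation_a"])
graph.add_claim("claim_steering", ["claim_circuit"])

before = graph.credibility("claim_steering")
graph.nodes["ablation_a"].credibility = 0.5  # e.g. a replication weakens the ablation
after = graph.credibility("claim_steering")
print(before, after, graph.affected_by("ablation_a"))
```

Because credibility is computed on demand from the current evidence values, weakening one ablation lowers every downstream claim without any manual bookkeeping, and `affected_by` tells a human reviewer which claims to re-examine.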

Loss & Training¶

The paper does not train a model and has no loss function. Its "training strategy" is institutional design: accumulating meta-results through an open platform, forming living guidelines through evidence-based community discussion, and finally reducing auditing costs with source evidence tracking and agentic AI tools. The authors emphasize that this mechanism requires experimental pilots (surveys, workshops) rather than the immediate announcement of mandatory standards.

Key Experimental Results¶

Main Results¶

The paper is a position and framework proposal and does not contain traditional model experiments or dataset metrics. It uses case studies, tables, and appendix examples to argue why MI auditing requires standardization. The following table lists the kinds of evidence the paper reports and the role each plays in the argument; no quantitative results are reported.

Evidence Type	Reported Content	Function
Conflict Case	Two MI papers gave conflicting explanations; a third found them partially correct but methods incomparable	Demonstrates peer review is insufficient for MI claim auditability
Table 1	General pitfalls: interpretability illusions, cherry-picking, missing sanity checks, no causal validation	Provides high-level risk classification across methods
Table 2	Method-specific pitfalls for probing, activation patching, sparse decomposition, steering	Shows different MI techniques require distinct audit items
Platform Design	Experiment repositories + forums + proposed guideline pages	Organizes fragmented meta-results into reviewing infrastructure
Automated Audit	Source-based reasoning, agentic AI, Probabilistic Soft Logic	Reduces the cost of tracking large-scale claim dependencies

Ablation Study¶

The paper has no ablation experiments. The three proposed components can be understood as complementary modules rather than quantified system variants:

Component	Problem Addressed	Key Basis in the Paper
Continuous Reviewing	Difficult to accumulate replications, critiques, and negative results	Authors note meta-knowledge is often buried in blogs, Twitter, or Discord
Community-Refined Guidelines	Lack of shared minimal experimental standards in MI	Authors cite MIAME, GRADE, and High Integrity C++ as precedents
Source-Based Auditing	Opacity of claim hypotheses, evidence, and dependency chains	Authors suggest tracking to specific figures/code using probabilistic logic

Key Findings¶

The core contribution is the problem framing: MI needs to upgrade from "are explanations interesting" to "are explanations auditable, comparable, and adoptable for high-risk scenarios."
The authors remain cautious about standardization: guidelines should be minimal requirements and auditing aids rather than veto-style checklists.

Highlights & Insights¶

Elevating MI credibility issues to community infrastructure: Many MI controversies arise not from a lack of rigor but from the absence of a shared place to record and compare experimental assumptions.
Valuing knowledge outside formal papers: Negative results and methodological critiques are often not "novel" enough for publication but are critical for auditing. Platformed continuous review provides visibility and incentives for this "cleanup work."
Source-based auditing is more granular than citation: Mapping dependencies at the level of specific plots or seeds matters especially because MI results are so sensitive to experimental details.
Awareness of standardization risks: By emphasizing "minimal guidelines" and "guides not doctrines," the authors avoid prematurely freezing a nascent field with absolute standards.

Limitations & Future Work¶

This is a proposal paper; a platform has not been built, nor has user participation data been collected. Feasibility requires validation through pilots.
Attracting researchers to contribute "cleanup work" poses incentive challenges. While the paper proposes "reviewer portfolios," it remains uncertain whether academic evaluation will recognize them.
Governance of community guidelines is unresolved: who has the right to merge/discard guidelines, how to prevent vote manipulation, and how professional bodies will adopt them.
Automated auditing systems introduce new risks like hallucinations or incorrect code execution; they must have human review and traceable logs.
For frontier models, the computational cost of exhaustive testing is high; even with platforms and guidelines, scalable verification remains difficult.

Related Work & Insights¶

vs. Traditional Peer Review: Peer review is a one-time gateway; this paper emphasizes continuous post-publication review, in which replications can shift a claim's credibility.
vs. OpenReview / arXiv / Papers with Code: While these support dissemination, they do not specifically organize claim dependencies or living guidelines for evidence governance.
vs. MI Tutorials: Resources like ARENA teach how to do MI; this paper focuses on how to audit completed research and turn best practices into standards.
Insights: The approach is applicable to other rapidly evolving fields, such as LLM safety evaluation or alignment steering, where experimental details determine credibility but negative results are hard to publish.

Rating¶

Novelty: ⭐⭐⭐⭐☆ (Not a new algorithm, but a novel framing of auditing infrastructure and source-based tracing.)
Experimental Thoroughness: ⭐⭐☆☆☆ (Relies on cases and design proposals rather than quantitative experiments.)
Writing Quality: ⭐⭐⭐⭐☆ (Clear structure linking the platform, guidelines, and automation.)
Value: ⭐⭐⭐⭐☆ (Highly insightful for MI safety governance and research infrastructure.)