Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem¶

Conference: ICML 2026
arXiv: 2506.08134
Code: To be confirmed
Area: NLP Understanding / Research Methodology
Keywords: Peer review, LLM assistance, review ecosystem, process data, position paper

TL;DR¶

This position paper argues that the machine learning community must urgently build an "AI-augmented" peer-review ecosystem that treats LLMs as collaborative assistants for authors, reviewers, and Area Chairs (ACs) rather than as replacements. It argues that the primary near-term bottleneck is not a lack of stronger models but the absence of structured process data recording "why scores changed" or "which specific rebuttal addressed which concern."

Background & Motivation¶

Background: Peer review is the cornerstone of scientific validation in ML, but submission volumes at top conferences are growing exponentially. The paper provides hard data: NeurIPS submissions rose from 1,678 in 2014 to 17,491 in 2024 (a 10.4x increase, ~26.4% CAGR); ICML submissions grew from 6,538 to 9,653 in a single year (+48%). The growth of the qualified reviewer pool cannot keep pace.
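
The growth figures quoted above can be re-derived from the raw submission counts. The snippet below is a minimal arithmetic check, assuming ten compounding periods between 2014 and 2024; the variable names are illustrative and not taken from the paper.

```python
# Submission counts quoted in the paragraph above.
neurips_2014, neurips_2024 = 1_678, 17_491
icml_prev, icml_next = 6_538, 9_653

years = 2024 - 2014  # ten compounding periods (assumption of this check)

# Overall growth factor and compound annual growth rate (CAGR).
growth_factor = neurips_2024 / neurips_2014
cagr = growth_factor ** (1 / years) - 1

# Single-year relative growth for ICML.
icml_growth = icml_next / icml_prev - 1

print(f"NeurIPS growth factor: {growth_factor:.1f}x")
print(f"NeurIPS CAGR: {cagr:.1%}")
print(f"ICML one-year growth: {icml_growth:.0%}")
```

The CAGR follows from \(\text{CAGR} = (N_{2024}/N_{2014})^{1/10} - 1\), which is consistent with the roughly 26.4% figure cited.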

Limitations of Prior Work: The authors categorize current "cracks" in the system into four symptoms: superficial reviews and reviewer fatigue; high variance in scoring for the same paper (ICLR 2019–2024 data shows \(\sigma\approx1\text{–}1.5\), increasing with submission volume); shallow rebuttal dialogues with limited influence on final decisions (most reviewers remain silent, respondents average <1 reply and <150 words); and long feedback cycles with process inefficiencies. The paper cites statistics suggesting up to 23% of acceptance decisions might flip based on different reviewer assignments.

Key Challenge: The root cause is that "human expert judgment scales linearly, yet must handle exponentially growing workloads"—a "tragedy of the commons" in peer review. Adding fuel to the fire, LLM writing tools both inflate submission volume and make reviews/rebuttals increasingly "AI-flavored," complicating quality control.

Goal: The aim is not to discuss "whether to use AI," but to argue that the community must proactively and systematically weave AI into the entire review lifecycle while building the necessary data infrastructure.

Key Insight: Peer review is a highly challenging AI testbed—it simultaneously requires domain expertise, fact-checking, multi-turn argumentation, and value judgments (novelty/importance/ethics). This is more complex than summarization, QA, or code generation. Addressing review as an explicit research problem both fixes a collapsing academic process and provides a real-world laboratory for "linguistic intelligence."

Core Idea: Construct a "human-in-the-loop" AI-augmented review ecosystem where LLMs serve as collaborators for three roles. The review process itself should be "instrumented" to capture causal trajectories for AI learning without significantly increasing human burden.

Method¶

As a position paper, the work proposes no single "model" but rather a clear argumentative framework: it defines a "visionary architecture" (an LLM-centric ecosystem revolving around three human roles), decomposes it into foundational tools and role-specific assistants, and identifies data as the "bottleneck" while providing actionable mechanisms for acquiring it.

Overall Architecture¶

The envisioned ecosystem is a cycle: three human stakeholders (authors, reviewers, ACs) occupy the outer ring, with an LLM collaborative assistant at the center. The assistant provides support during paper preparation, review writing, and decision-making, while humans remain in control. This ecosystem is supported by two layers: the bottom layer consists of "foundational AI capabilities" (retrieval-augmented verification, code/reproducibility analysis, review quality "report cards," content provenance, writing assistance, AC decision support), and the top layer distributes these capabilities to the three roles. The authors emphasize that to learn "why a decision was made," the ecosystem must rely on fine-grained, structured, and ethically sourced process data; otherwise, even strong models will only mimic the "surface form" of reviews without capturing the underlying reasoning.

Key Designs¶

1. Foundational AI Capability Layer: Upgrading Disparate Tools to Cognitive Assistants

The community already uses "narrow AI" for plagiarism, formatting/ethics checks, paper-reviewer matching, and diff tools. The authors propose advancing to "cognitive assistance" with capabilities such as: ① Retrieval Augmented Verification (RAV) — connecting LLMs to authoritative knowledge bases (e.g., Semantic Scholar, arXiv) to cross-check claims and flag conflicts; ② Code analysis and reproducibility assessment — parsing methods and source code to identify data leakage, implementation errors, or text-code inconsistencies; ③ AI-driven review quality feedback via "Review Report Cards," providing structured feedback based on coverage, specificity, evidence, and constructiveness; ④ Content provenance and authenticity (using statistical features like perplexity/burstiness or SynthID watermarking), though the authors admit these detectors are still immature and prone to high false-positive rates for non-native speakers.
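
To make the "Review Report Card" idea concrete, here is one possible shape for such a card. This is a sketch under stated assumptions, not the authors' design: the class name `ReviewReportCard`, the [0, 1] score range, and the 0.5 threshold are all hypothetical, and the scores are assumed to come from some upstream model that is not shown.

```python
from dataclasses import dataclass, fields


@dataclass
class ReviewReportCard:
    """Hypothetical report card; each score is assumed to lie in [0, 1]."""
    coverage: float          # does the review address the paper's main claims?
    specificity: float       # are comments tied to concrete sections or results?
    evidence: float          # are criticisms backed by citations or reasoning?
    constructiveness: float  # does the review suggest actionable improvements?

    def weak_dimensions(self, threshold: float = 0.5) -> list[str]:
        """Return the names of dimensions scoring below the threshold."""
        return [f.name for f in fields(self) if getattr(self, f.name) < threshold]


# Made-up scores, only to show how feedback could be targeted.
card = ReviewReportCard(coverage=0.8, specificity=0.3, evidence=0.4, constructiveness=0.7)
print(card.weak_dimensions())
```

Keeping the four dimensions as separate fields, rather than one aggregate score, lets feedback point a novice reviewer at the specific aspect to improve.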

2. Three Role Assistants: Reviewers / Authors / ACs

Capabilities are distributed to three groups. For reviewers, the "ideal reviewer" serves as a benchmark (comprehensive knowledge, rigor, insight). AI acts as a "copilot" using RAV for factual rigor and report cards to guide novices. For authors, AI provides "simulated reviews" before submission, mimicking different personas (e.g., theory-leaning vs. application-leaning) and helping organize rebuttals by distinguishing "misunderstandings" from "disagreements." For ACs, AI provides decision support by summarizing key points, flagging direct contradictions (e.g., R1 praises novelty while R2 calls it incremental), and drafting meta-review scaffolds that link score changes to specific justifications.
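
The contradiction-flagging step for ACs might look like the sketch below. It assumes that each review has already been reduced to per-aspect stance labels (+1 for praise, -1 for criticism) by some upstream extraction step; that representation and the function name `find_contradictions` are illustrative assumptions, not something the paper specifies.

```python
from itertools import combinations

# Hypothetical per-reviewer stance labels: +1 = praises, -1 = criticizes.
# In practice these would be extracted from review text by an upstream model.
stances = {
    "R1": {"novelty": +1, "clarity": +1},
    "R2": {"novelty": -1, "clarity": +1},
    "R3": {"experiments": -1},
}


def find_contradictions(stances):
    """Return (aspect, reviewer_a, reviewer_b) where stances have opposite sign."""
    conflicts = []
    for (a, sa), (b, sb) in combinations(sorted(stances.items()), 2):
        # Only aspects that both reviewers commented on can conflict.
        for aspect in sorted(sa.keys() & sb.keys()):
            if sa[aspect] * sb[aspect] < 0:
                conflicts.append((aspect, a, b))
    return conflicts


print(find_contradictions(stances))
```

With the made-up labels above, the only flagged pair is R1 versus R2 on novelty, mirroring the example in the text.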

3. Structured Process Data: The Real Bottleneck

This is the paper’s core claim. Current public datasets lack four elements: grounded reasoning behind judgments (why scores changed), deliberation dynamics (negotiation/clarification), fine-grained links between claims and manuscript content, and implicit domain knowledge. Consequently, models are pushed toward "outcome imitation" rather than auditable judgment. The authors propose collecting data across four dimensions: ① structured reasoning for score changes; ② semantically annotated author-reviewer-AC interaction trajectories; ③ anonymously aggregated AC deliberation tracks; ④ fine-grained links between review text, specific manuscript parts, and external knowledge. The summary: the bottleneck "is not better models, but better process data."
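
One way to picture such process data is as a record per score-change event. The schema below is a hypothetical illustration covering dimensions ① and ④ (the reason for a score change and its links to the rebuttal and manuscript); every field name and example value is an assumption, not a format proposed in the paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ScoreChangeRecord:
    """Hypothetical schema for one score-change event."""
    paper_id: str
    reviewer_id: str         # assumed to be pseudonymized
    old_score: int
    new_score: int
    concern: str             # the review concern that was at issue
    rebuttal_excerpt: str    # the rebuttal passage that addressed it
    manuscript_section: str  # where in the paper the concern is anchored
    justification: str       # the reviewer's stated reason for the change


# Made-up example values, for illustration only.
record = ScoreChangeRecord(
    paper_id="paper-001",
    reviewer_id="R2",
    old_score=5,
    new_score=7,
    concern="Missing comparison against a stronger baseline",
    rebuttal_excerpt="We added the requested baseline in the revised results table.",
    manuscript_section="Experiments",
    justification="The new baseline comparison resolves my main concern.",
)

print(json.dumps(asdict(record), indent=2))
```

A record like this pairs the outcome (5 to 7) with its stated cause, which is the kind of supervision that outcome-only datasets cannot provide.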

4. Active Elicitation Interfaces + Tiered Access: Low-Friction Data Collection

The authors suggest "active elicitation interfaces" that prompt for reasons at key decision points (e.g., "Which part of the rebuttal influenced your score change from 5 to 7?"). This transforms the review process into a structured data annotation task with minimal overhead. For privacy, they propose a "tiered access" model: ① public OpenReview data as the base; ② mandatory opt-in for private deliberation tracks at the start of the cycle; ③ data hosting in confidential computing enclaves to prevent leaks. They call on organizers, publishers, and platforms like OpenReview to build ethical frameworks and shared benchmark datasets.
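
The elicitation idea can be sketched as a hook that fires only when a score actually changes, so reviewers who leave their score untouched see no extra prompt. The function name `elicitation_prompt` is hypothetical; the question wording follows the example quoted above.

```python
from typing import Optional


def elicitation_prompt(old_score: int, new_score: int) -> Optional[str]:
    """Return a follow-up question only when the score actually changed."""
    if new_score == old_score:
        return None  # no change, so no extra burden on the reviewer
    return (
        f"Which part of the rebuttal influenced your score change "
        f"from {old_score} to {new_score}?"
    )


print(elicitation_prompt(5, 7))
print(elicitation_prompt(6, 6))
```

The answer to this prompt is what would populate the justification field of a structured process-data record.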

Loss & Training¶

This is a position paper and does not involve training objectives. Section 6 provides "illustrative experiments" to support the position rather than proposing a new model.

Key Experimental Results¶

Two sets of illustrative experiments were conducted on ICLR corpora to demonstrate that while LLMs are useful, In-Context Learning (ICL) has clear limitations with current data, necessitating fine-tuning and structured process data.

Main Results: Recall of Review Points (Table 1)¶

LLMs were used to extract strengths/weaknesses and identify key rebuttal points using few-shot prompting, measured by LLM-as-judge recall (Avg Hits / Avg Real Points).

Task	ICLR 2024 Recall	ICLR 2025 Recall	Interpretation
Strengths	0.724	\(0.927\pm0.060\)	Identifying strengths is easy
Weaknesses	0.387	\(0.632\pm0.000\)	Identifying weaknesses is significantly harder
Rebuttal points	—	\(0.911\pm0.040\)	High recall for rebuttal points

A key finding is that weakness recall (0.632) is significantly lower than strength recall (0.927), exposing a "critical thinking gap" in current models. This supports the argument that "humans must remain senior partners in the loop" to identify non-obvious flaws. The higher recall in 2025 compared to 2024 may reflect both model improvements and the increasing "AI-flavor" of reviews, making them easier for LLMs to parse.
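
The recall metric reported in Table 1 (average hits divided by average real points) can be computed as in the sketch below. It assumes the LLM judge has already produced, for each paper, a count of matched points; the counts here are made up and the function name is illustrative.

```python
def recall(hits_per_paper, real_points_per_paper):
    """Recall as average hits divided by average number of real points."""
    if len(hits_per_paper) != len(real_points_per_paper):
        raise ValueError("Both lists must cover the same papers.")
    avg_hits = sum(hits_per_paper) / len(hits_per_paper)
    avg_real = sum(real_points_per_paper) / len(real_points_per_paper)
    return avg_hits / avg_real


# Made-up counts for three papers: judged hits vs. points in the human reviews.
hits = [3, 2, 4]
real = [4, 4, 4]
print(recall(hits, real))
```

Because both averages are taken over the same set of papers, this ratio equals total hits over total real points, so papers with more review points weigh more heavily.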

Ablation Study: Score Prediction (Table 2)¶

The second experiment predicts initial scores, final scores, and score changes using few-shot ICL with \(n=0,1,2,3\) in-context examples.

Prediction Target	Setup	MAE	Note
Initial Score (Paper only)	\(n=2\)	\(2.2857\pm0.0095\)	Large error
Final Score (With review+rebuttal)	\(n=1\)	\(0.6709\pm0.0052\)	Much better with context
Score Change	—	Extremely difficult	Difficult to predict with ICL
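
For reference, the MAE values in Table 2 are mean absolute differences between predicted and actual scores. A minimal sketch with made-up scores (not data from the paper):

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual review scores."""
    if len(predicted) != len(actual):
        raise ValueError("Inputs must have the same length.")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up scores, only to show the computation.
predicted = [6, 5, 8, 3]
actual = [5, 5, 6, 6]
print(mean_absolute_error(predicted, actual))
```

Read this way, an MAE of about 2.29 for initial scores means predictions miss by more than two score points on average, whereas about 0.67 for final scores is under one point.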

Key Findings¶

Increasing the number of shots (\(n\)) yields diminishing returns, suggesting a ceiling for pure ICL on this kind of complex regression. The authors argue that significant improvements would require fine-tuning on larger datasets with structured justifications.
Inherent subjectivity in reviews may set a ceiling on accuracy, reinforcing the necessity of human judgment.
"Score change prediction," the most difficult task, corresponds to the exact data type currently missing (reasons for score changes).

Highlights & Insights¶

Relocating the "Bottleneck" from Models to Data: While most AI review discussions focus on model capabilities, this paper shifts the focus to the lack of structured data recording the reasoning process—a shift in research priority.
"Active Elicitation Interfaces" as a Practical Hook: Adding a simple prompt for justification when scores change in OpenReview is a low-cost way to convert implicit judgment into explicit supervision.
Peer Review as an AI Testbed: The paper argues that peer review tasks (domain judgment, fact-checking, argumentation) serve as real-world benchmarks for "collective reasoning + adversarial robustness + alignment with human norms."
Honest Negative Evidence: The "ugly" 0.632 recall for weaknesses is used as evidence for the indispensability of humans rather than being hidden.

Limitations & Future Work¶

Weak Illustrative Experiments: Based only on ICLR corpora and few-shot ICL without actual fine-tuning; the "need for more data" is an argumentative claim rather than a strictly verified conclusion.
Underestimating Practical Resistance to Data Collection: Mandatory opt-ins and confidential computing enclaves face significant engineering and governance hurdles.
Inherent Flaws in AI Detection: The authors admit content provenance is unreliable (high false positives for non-native speakers), meaning "authenticity" remains a challenge.
Incentives and Gaming: Once AI scores reviews and authors use AI for rebuttals, all parties might "optimize for metrics," a risk the paper does not fully address.

Comparison with Related Work¶

vs. Early Review Automation: Prior work focused on logistics (matching, plagiarism); this paper advances toward cognitive assistance.
vs. Automated Review/Review Generation (Lu et al., D'Arcy et al.): Those works let AI act as a reviewer, but struggle with deep flaw detection; this paper maintains the "AI as assistant, human as senior partner" stance.
vs. ICLR 2025 LLM Feedback Experiments (Thakkar et al.): That experiment showed 26.6% of reviewers revised reports after LLM suggestions; this paper uses that as evidence that AI collaboration can enhance human judgment and argues for an end-to-end ecosystem.

Rating¶

Novelty: ⭐⭐⭐⭐ Refocusing the bottleneck on structured process data and proposing collection mechanisms is a significant contribution.
Experimental Thoroughness: ⭐⭐⭐ Minimal ICLR corpora + few-shot ICL is sufficient for "illustration" but does not fully support all claims.
Writing Quality: ⭐⭐⭐⭐ Solid data, progressive argumentation, and honest handling of negative evidence.
Value: ⭐⭐⭐⭐ Provides a systemic agenda for a review system currently overwhelmed by submissions.