Stop Automating Peer Review Without Rigorous Evaluation

Conference: ICML 2026
arXiv: 2605.03202
Code: None
Area: LLM / NLP / AI Governance / Peer Review
Keywords: AI reviewing, Hivemind effect, Paper Laundering, Algorithmic monoculture, Review diversity

TL;DR

This position paper measures real ICLR 2026 reviews and simulated reviews of 60 papers, and identifies two major failures in current LLM reviewing: the hivemind effect (reviews converge on the same content) and paper laundering (zero-shot rewriting raises scores by 0.45 on average). The authors argue that "LLMs should not directly generate review reports without rigorous evaluation" and call for the establishment of a "science of review automation."

Background & Motivation

Background: Top conferences are experiencing explosive growth in submissions (NeurIPS / ICLR / ICML increase by thousands each year), and the pool of qualified reviewers cannot keep up. As a result, major conferences are introducing LLMs: AAAI 2025 generates LLM reviews in parallel with human reviews; ICLR 2026 allows reviewers to use LLMs; ICML 2026 introduces an "author opt-in" dual policy; ICLR 2025 deployed a review feedback agent. Policies are highly fragmented and lack consensus.

Limitations of Prior Work: Existing work on AI reviewing mostly measures "correlation with human scores" or "susceptibility to prompt injection attacks." Both are insufficient: high correlation does not imply harmlessness (errors may be consistent), and prompt injection is explicitly banned by conferences, making it an atypical threat model. The real challenge lies in legitimate but low-cost strategic writing behaviors and group-level "homogenization" in AI reviewing.

Key Challenge: Human review bias is distributed (different reviewers have different bias directions, partially canceling out when aggregated), while LLM review bias is centralized (similar base model training data → highly correlated errors). This is what Kleinberg & Raghavan call algorithmic monoculture: each individual decision seems reasonable, but aggregation worsens outcomes and is vulnerable to the same attack.

Goal: To replace the vague "human-aligned" standard with two measurable necessary conditions for review automation: (C1) review diversity (non-convergence) and (C2) non-gamability (resistance to strategic rewriting), and to test empirically whether current SOTA LLM reviewing meets them.

Key Insight: The authors analyze all 75,800 real ICLR 2026 reviews (with AI labels) as "in-the-wild" data, and conduct controlled experiments on 60 randomly selected papers using Bianchi et al.'s AI reviewer agent, constructing a full grid of 4 prompts × 2 rewriting models × 3 reviewer models = 24 conditions to quantify "hivemind" and "laundering."

Core Idea: Use two embedding similarity metrics, IntraSim and InterSim, to measure the "hivemind effect," and define "paper laundering" via zero-shot LaTeX rewriting, demonstrating that both necessary conditions for AI reviewing are unmet, thus supporting the position that "a science of review automation must precede deployment."

Method

No new models are proposed; the method is an empirical protocol of "measurement + attack," comprising two main pillars.

Overall Architecture

Inputs: (1) All 75,800 ICLR 2026 reviews with AI labels (from Emi 2025's EditLens classifier, 21% classified as AI-generated); (2) 60 randomly selected ICLR papers with synthetic reviews generated by an AI reviewer agent; (3) The same 60 papers after zero-shot LLM rewriting ("laundered" versions).

Outputs: Two empirical findings that support the position: a statistically significant hivemind effect, and an average score increase of +0.45 from laundering ($p<0.0001$).

Intermediate steps include: (a) Encoding each review into a vector using text-embedding-3-small, computing intra-paper reviewer similarity (IntraSim) and inter-paper reviewer similarity (InterSim); (b) Feeding each paper into 4 zero-shot prompts (including a launderer command to actively jailbreak the review model), rewriting the entire LaTeX with GPT-5.1 / GPT-5.4, recompiling to PDF, and submitting to the review agent; (c) Wilcoxon signed-rank test for paired score differences.
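
The experimental grid in step (b) can be enumerated as below. This is a minimal sketch: the prompt and reviewer labels are placeholders (the summary does not list the individual prompts or all three reviewer models), and only the 4 × 2 × 3 layout, the two launderer names, and the 60 papers come from the text.

```python
from itertools import product

# Placeholder labels (hypothetical): only the counts are taken from the text.
PROMPTS = ["prompt_1", "prompt_2", "prompt_3", "prompt_4"]
LAUNDERERS = ["GPT-5.1", "GPT-5.4"]
REVIEWERS = ["reviewer_A", "reviewer_B", "reviewer_C"]
N_PAPERS = 60


def build_grid():
    """Return every (prompt, launderer, reviewer) cell of the design."""
    return list(product(PROMPTS, LAUNDERERS, REVIEWERS))


grid = build_grid()
n_cells = len(grid)                     # 4 * 2 * 3 = 24 cells
n_laundered_reviews = n_cells * N_PAPERS  # one review per paper per cell
print(n_cells, n_laundered_reviews)     # prints: 24 1440
```

Each cell then yields 60 paired (original, laundered) scores, which is the unit on which the signed-rank test in step (c) operates.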

Key Designs

Hivemind Metrics: IntraSim / InterSim:
- Function: Converts the vague concept of "review diversity" into a comparable scalar, enabling human vs AI and real vs simulated group comparisons.
- Mechanism: For all review vectors $\mathcal{R}(p)$ of paper $p$, IntraSim is the average pairwise cosine similarity: $\mathrm{IntraSim}(p)=\frac{2}{m_p(m_p-1)}\sum_{i<j}\mathrm{sim}(r_i,r_j)$. InterSim is the double average of $\mathrm{sim}(r,r')$ across different papers $p\neq q$. InterSim is only compared within the same reviewer type across different papers to avoid confounding signals from human vs AI text style differences.
- Design Motivation: Score correlation alone conflates "AI gives high scores" with "AI says the same thing"; pairwise similarity in embedding space directly measures whether "the second review brings new information," which is the core value of the multi-reviewer system.
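
As an illustration of the two metrics, the sketch below computes IntraSim and InterSim on toy vectors in plain Python. The function names and the toy data are hypothetical; real inputs would be text-embedding-3-small vectors. Note the assumption in `inter_sim`: it pools all cross-paper review pairs, which equals a double average only when every paper has the same number of reviews, so the paper's exact normalization may differ.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_sim(reviews):
    """Mean pairwise cosine similarity among the reviews of one paper."""
    if len(reviews) < 2:
        raise ValueError("IntraSim needs at least two reviews")
    pairs = list(combinations(reviews, 2))  # m(m-1)/2 unordered pairs
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def inter_sim(reviews_by_paper):
    """Mean cosine similarity over review pairs taken from different papers."""
    sims = [
        cosine(a, b)
        for (_, rp), (_, rq) in combinations(reviews_by_paper.items(), 2)
        for a in rp
        for b in rq
    ]
    if not sims:
        raise ValueError("InterSim needs reviews from at least two papers")
    return sum(sims) / len(sims)


# Toy 2-d "embeddings" (hypothetical data, one reviewer type throughout).
toy = {
    "paper_a": [[1.0, 0.0], [0.0, 1.0]],  # two orthogonal reviews
    "paper_b": [[1.0, 0.0], [1.0, 0.0]],  # two identical reviews
}
intra_a = intra_sim(toy["paper_a"])  # 0.0: the reviews share nothing
intra_b = intra_sim(toy["paper_b"])  # 1.0: the second review adds nothing
inter = inter_sim(toy)               # 0.5: half of the cross-paper pairs match
```

A high InterSim is the more worrying signal of the two, since reviews of different papers have no reason to resemble each other unless the reviewer is producing generic text.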

Paper Laundering Protocol:
- Function: Constructs a fully compliant "gaming" attack on AI reviewing (not prompt injection, not hidden instructions) to test non-gamability C2.
- Mechanism: The entire LaTeX source of a paper is given to a launderer LLM, which rewrites it under a zero-shot prompt (cost ≈ $0.25/paper). The rewritten source is recompiled to PDF and submitted to the AI reviewer agent, and paired score differences are collected in each cell of the 4 prompts × 2 launderers × 3 reviewers grid (24 cells). The Wilcoxon test yields $p<0.001$ in almost all cells, with an average score increase of $+0.45$; consistent with self-preference bias, GPT reviewers raise their scores for GPT-rewritten papers even more.
- Design Motivation: Existing adversarial attack research focuses on prompt injection, but that is explicitly banned by conferences; laundering is "authors openly admitting to AI-assisted editing," so even if effective, it cannot be banned by policy—this is the real failure mode threatening review credibility.
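
The paired comparison inside one grid cell can be sketched as follows. This is a stdlib-only Wilcoxon signed-rank test using the normal approximation with a tie correction and no continuity correction; the summary does not say which variant the authors used, and the scores below are invented for illustration.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped; tied |differences| share an average rank.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1      # mean of ranks i+1 .. j+1
        t = j - i + 1                   # size of this tie group
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return w_plus, p_value


# Hypothetical reviewer scores for ten papers, before and after laundering.
original = [4, 5, 3, 6, 4, 5, 4, 3, 5, 4]
laundered = [5, 5, 4, 6, 5, 6, 4, 4, 5, 5]
w_plus, p_value = wilcoxon_signed_rank(original, laundered)
mean_gain = sum(a - b for b, a in zip(original, laundered)) / len(original)
```

With only ten toy papers the approximation is rough; at 60 papers per cell it is more reasonable, though an exact test or a library implementation would be the safer choice for reported results.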

Monoculture Cascade Measurement:
- Function: Extends "hivemind" from the review level to the paper level, arguing that if laundering becomes widespread, academic writing will converge toward AI reviewer preferences.
- Mechanism: For the 60 papers, compute pairwise cosine similarity of abstract+introduction embeddings (6,903 pairs, multiple samples), comparing original vs laundered versions—the latter's average similarity rises from $0.497$ to $0.529$, a relative increase of $+6.5\%$, Cohen's $d=1.02$ (large effect).
- Design Motivation: The harm of algorithmic monoculture is not only "review convergence," but also "writing shaped by reviewer preferences." A simple embedding aggregation experiment traces the causal chain from reviewer to author writing behavior, strengthening the position argument's logical completeness.
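
The two summary statistics in this measurement can be reproduced in a few lines. The sketch below assumes the pooled-standard-deviation form of Cohen's $d$; the summary does not state which variant the authors used (a paired variant is also plausible), and the toy samples are invented.

```python
import math


def relative_increase(before, after):
    """Relative change from `before` to `after` (0.05 means +5%)."""
    return (after - before) / before


def cohens_d(sample_a, sample_b):
    """Cohen's d (mean of b minus mean of a) with a pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_b - mean_a) / pooled


# Rounded means quoted in the text: original 0.497, laundered 0.529.
rel = relative_increase(0.497, 0.529)

# Toy similarity samples (hypothetical) that differ by one pooled SD.
d_toy = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

From the rounded means, `rel` comes out at about 6.4%, which agrees with the reported +6.5% up to rounding. Under the pooled-SD assumption, a gap of 0.032 with $d=1.02$ would imply a pooled standard deviation of roughly 0.031, i.e. the shift is small in absolute terms but large relative to the spread of pairwise similarities.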

Loss & Training

No new models are trained. All "training" costs are concentrated in: (i) calling OpenAI / Anthropic APIs for 60 × 24 = 1,440 reviews; (ii) per-paper laundering cost ≈ $0.25; (iii) embedding and similarity computation for all ICLR reviews is a one-time offline expense.

Key Experimental Results

Main Results

Hivemind effect on real ICLR 2026 reviews:

Data Slice	InterSim Mean	$p$ Value	Cohen's $d$
Pure AI reviews	0.486	$<10^{-4}$	0.29
Reviews with human contribution	0.467	—	—
Weaknesses+questions only	—	$<10^{-4}$	0.35

Convergence of AI vs human in simulated experiments (60 ICLR papers):

Setting	IntraSim	InterSim (vs Human 0.470)
Human ICLR reviews	0.811	0.470
GPT-5.1 on original papers	0.882 (+8.7%)	0.646 (+37.4%, $d=3.55$)
GPT-5.1 on laundered papers	0.891 (+9.8%)	0.657 (+39.8%, $d=3.79$)
Claude on original papers	—	0.553 (+17.6%, $d=1.41$)

Laundering score increase (1–10 scale, 24-cell average):

Metric	Value
Mean paired score increase	$+0.45$
Wilcoxon significance	$p<0.001$ (almost all cells)
Per-paper laundering cost	≈ $0.25
Paper similarity increase	+6.5% ($d=1.02$)

Ablation Study

Slice / Configuration	Key Findings	Notes
Restrict review to weakness+question	IntraSim $d$ rises from 1.47 to 1.93	Convergence is not just boilerplate; it also permeates the critical sections
GPT reviewer + GPT launderer	Largest score increase	Self-preference bias (Panickssery 2024) reproduced
Claude reviewer	Smaller score increase	Cross-family launderer attacks still effective, but weaker
AI vs human prediction of final acceptance	AUC 0.710 vs 0.822 (n=8,015)	Human scores better predict final decision
AI–AI score correlation vs AI–human	$r=0.49$ vs $r=0.15$	Consistent errors, not human alignment
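
The acceptance-prediction row compares AUC values. As a reminder of what that number measures, the sketch below computes AUC as the probability that a randomly chosen accepted paper received a higher score than a randomly chosen rejected one, with ties counted as one half. The function name and the toy data are hypothetical, and the quadratic loop is written for clarity rather than speed.

```python
def roc_auc(scores, accepted):
    """AUC of `scores` as a predictor of the binary `accepted` labels."""
    pos = [s for s, y in zip(scores, accepted) if y == 1]
    neg = [s for s, y in zip(scores, accepted) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): four papers, two accepted.
decisions = [0, 0, 1, 1]
auc_perfect = roc_auc([3, 5, 6, 8], decisions)   # every accepted paper outscores every rejected one
auc_partial = roc_auc([3, 6, 5, 8], decisions)   # one of four pairs is misordered
auc_constant = roc_auc([7, 7, 7, 7], decisions)  # an uninformative reviewer
```

The last case is the relevant one for inflated, low-variance AI scores: a reviewer that gives nearly every paper the same high score carries little ranking information, whatever its average.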

Key Findings

AI reviewers heavily reuse generic phrases: GPT's most common phrase, "if not, can you comment on," appears in reviews of 13.3% of papers and Claude's "how does the method handle" in 21.7%, while the most common phrase in human reviews appears in fewer than 1%. This is direct evidence that AI review "questions" are largely generic rather than specific to the paper.
After laundering, score increases mainly stem from more hedging words ("may", "typically", "suggests") and emphasis words ("strong", "robust", "consistent"), as well as hallucinated "extra interpretations" of results, not from substantive experimental additions.
AI reviewers give average scores of 7.3 (GPT) and 6.1 (Claude), while human reviewers average 4.3, indicating substantial systematic score inflation.
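
The phrase-reuse statistic can be approximated with a simple n-gram count. The sketch below reports, for each n-gram, the share of papers that have at least one review containing it. The tokenizer, the choice of $n=5$, and the toy reviews are assumptions made for illustration; the summary does not describe the authors' exact procedure.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Set of lowercase word n-grams in `text` (punctuation ignored)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_paper_share(reviews_by_paper, n=5):
    """Map each n-gram to the fraction of papers whose reviews contain it."""
    counts = Counter()
    for reviews in reviews_by_paper.values():
        seen = set()
        for review in reviews:
            seen |= ngrams(review, n)
        counts.update(seen)  # each paper contributes at most once per phrase
    total = len(reviews_by_paper)
    return {gram: c / total for gram, c in counts.items()}


# Hypothetical reviews for three papers.
toy_reviews = {
    "p1": ["How does the method handle noisy labels?"],
    "p2": ["Nice paper. How does the method handle long inputs?"],
    "p3": ["The ablation in Section 4 is missing a baseline."],
}
share = phrase_paper_share(toy_reviews)
top_phrase, top_share = max(share.items(), key=lambda kv: kv[1])
```

If the reported shares are taken over the 60 simulated papers, 13.3% and 21.7% would correspond to 8 and 13 papers respectively.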

Highlights & Insights

Redefines "AI review quality" as two testable conditions: Correlation-based thinking is endlessly debatable; diversity + non-gamability bring the discussion back to engineering baselines, enabling immediate empirical testing.
Laundering as an elegant example of a "compliant attack": Unlike prompt injection, it exposes a governance gap (conference policy can ban explicit rule violations such as prompt injection, but cannot ban zero-shot editing), which makes the argument highly transferable.
Algorithmic monoculture demonstrated at two levels: the reviewer level (review convergence) and the author level (writing converges to AI preferences). This "dual-layer monoculture" framework is reusable for recommender systems, automated grading, and any LLM-mediated multi-evaluation scenario.
Appendix H openly provides this paper's AI reviews—a rare "testing on oneself" for a position paper, serving as an ironic demo.

Limitations & Future Work

The authors acknowledge: necessary conditions are not sufficient; even if future LLMs achieve non-convergence and non-gamability, issues like accountability and democratic legitimacy remain.
Empirical limitations: the simulated sample of 60 papers is small; AI labels come from the EditLens classifier (risk of misclassification, checked in Appendix G.3); whether laundering truly leaves "quality unchanged" is mainly judged by word-level analysis, without a blind human review.
Directions for improvement: Make "non-gamability" a benchmark (laundering robust score) for future AI reviewer systems; study failure boundaries when launderers do more than style changes (e.g., adding fake experiments).

Related Work

vs Bianchi et al. 2025b (AI reviewer agent): This work directly uses their agent but shifts the research question from "can it review" to "should it be automated," providing counterexamples.
vs Goldberg et al. 2024 (NeurIPS LLM checklist): Also reveals "AI tools can be gamed," but this work lowers the bar to zero-shot, zero-optimization attacks, making the threat more universal.
vs Lin et al. 2025b (targeted adversarial attacks): They require character-level perturbations; this work only needs natural language rewriting, lowering the attacker threshold by an order of magnitude.
vs Kleinberg & Raghavan 2021 (algorithmic monoculture): Applies this theoretical framework empirically to academic peer review for the first time.

Rating

Novelty: ⭐⭐⭐⭐ First to decompose "review automation" into two measurable conditions and propose paper laundering as a new threat
Experimental Thoroughness: ⭐⭐⭐⭐ Real data + controlled experiments + 24-cell grid balance external/internal validity, but 60-paper scale is still small
Writing Quality: ⭐⭐⭐⭐⭐ Model position paper, clear logic, point-by-point rebuttal of opposing views (§5), appendix provides own AI review results
Value: ⭐⭐⭐⭐⭐ Highly timely topic, directly impacts conference policy and community consensus
