
Stop Automating Peer Review Without Rigorous Evaluation

Conference: ICML 2026
arXiv: 2605.03202
Code: None
Area: LLM / NLP / AI Governance / Peer Review
Keywords: AI Reviewing, Hivemind Effect, Paper Laundering, Algorithmic Monoculture, Review Diversity

TL;DR

This is a position paper. Using empirical measurements of real ICLR 2026 reviews and simulated reviews of 60 papers, the authors identify two major failures in current LLM-based reviewing: "hivemind" (reviews converge strongly toward one another) and "paper laundering" (a zero-shot rewrite raises scores by 0.45 points on average). They therefore argue that LLMs should not directly generate review comments without rigorous evaluation, and they call for the establishment of a "science of review automation."

Background & Motivation

Background: Top-tier conference submissions are exploding (NeurIPS / ICLR / ICML see increases of thousands of papers annually), and the pool of qualified reviewers cannot keep up. Consequently, conferences have begun introducing LLMs: AAAI 2025 generated LLM reviews in parallel with humans; ICLR 2026 allows reviewers to use LLMs; ICML 2026 introduced an "author opt-out" binary policy; and ICLR 2025 deployed a review feedback agent. These policies are highly fragmented and lack consensus.

Limitations of Prior Work: Existing work on AI reviewing almost exclusively measures "correlation with human ratings" or "susceptibility to prompt injection attacks." These two dimensions are insufficient: high correlation does not imply harmlessness (errors may be consistently aligned), and prompt injection is explicitly banned by conferences, making it an atypical threat model. The truly difficult issues are strategic writing behaviors that are legal and cheap, and the "homogenization" of AI reviews at the group level.

Key Challenge: Human review bias is distributed (different reviewers have different bias directions, which partially cancel out when aggregated), whereas LLM review bias is centralized (similar training data for the same base models → highly correlated errors). This is what Kleinberg & Raghavan call algorithmic monoculture: individual decisions seem reasonable, but the aggregate becomes worse, and the entire system falls to the same type of attack.

Goal: To redefine the "necessary conditions" for review automation from vague "human correlation" to two measurable conditions: (C1) review diversity / non-convergence; (C2) non-gamability / resistance to strategic rewriting—and to empirically test whether current SOTA LLM reviewers satisfy these.

Key Insight: The authors analyze "wild data" from all 75,800 real reviews of ICLR 2026 (already containing AI labels) while conducting controlled experiments on 60 random papers using the AI reviewer agent from Bianchi et al. They construct a complete grid of 4 prompts × 2 rewriting models × 3 reviewer models = 24 conditions to quantify "hivemind" and "laundering" respectively.

Core Idea: Use two embedding similarity metrics, IntraSim and InterSim, to measure the "hivemind effect," and define "paper laundering" through zero-shot LaTeX rewriting. By showing empirically that AI reviewing fails both necessary conditions, the authors support the position: "Establish a science of review automation before deployment."

Method

This paper does not propose a new model; its method is an empirical protocol of "measurement + attack," based on two pillars.

Overall Architecture

Input: (1) All 75,800 ICLR 2026 reviews + AI labels (from the EditLens classifier of Emi 2025, with 21% identified as AI-generated); (2) 60 random ICLR papers + synthetic reviews generated by an AI reviewer agent; (3) the same 60 papers in "laundered" versions via zero-shot LLM rewriting.

Output: Two empirical findings that support the position: the hivemind effect is statistically significant, and laundering raises scores by +0.45 on average (\(p<0.0001\)).

The pipeline includes: (a) encoding each review into a vector with text-embedding-3-small, then computing the similarity between reviews of the same paper (IntraSim) and between reviews of different papers (InterSim); (b) applying each of 4 zero-shot prompts (including one that instructs the launderer to actively jailbreak the review model) to rewrite a paper's entire LaTeX source with GPT-5.1 / GPT-5.4, then recompiling to PDF for the review agent; (c) running Wilcoxon signed-rank tests on the paired score differences.
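
Step (a) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the review embeddings have already been computed, uses toy 2-dimensional vectors in their place, and pools InterSim over all cross-paper review pairs, which is one plausible reading of the definition. The names `cosine`, `intra_sim`, `inter_sim`, and `toy` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_sim(reviews):
    """Mean pairwise cosine similarity among the reviews of one paper."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        raise ValueError("need at least two reviews for one paper")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def inter_sim(reviews_by_paper):
    """Mean cosine similarity over review pairs taken from two different papers.

    Assumption: all cross-paper pairs are pooled into a single average.
    """
    sims = []
    for p, q in combinations(reviews_by_paper, 2):
        for a in reviews_by_paper[p]:
            for b in reviews_by_paper[q]:
                sims.append(cosine(a, b))
    if not sims:
        raise ValueError("need reviews from at least two papers")
    return sum(sims) / len(sims)


# Toy stand-ins for review embeddings (real ones would come from an embedding model).
toy = {
    "paper_a": [[1.0, 0.0], [1.0, 0.0]],
    "paper_b": [[0.0, 1.0], [1.0, 1.0]],
}
print(intra_sim(toy["paper_a"]), intra_sim(toy["paper_b"]), inter_sim(toy))
```

A high IntraSim means reviewers of the same paper say similar things; a high InterSim means reviews look alike even across unrelated papers, which is the stronger sign of boilerplate.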

Key Designs

  1. Hivemind Metrics IntraSim / InterSim:

    • Function: Transforms the vague concept of "review diversity" into a comparable scalar for side-by-side comparisons (Human vs. AI, Real vs. Simulated).
    • Mechanism: For the set of review vectors \(\mathcal{R}(p)\) of paper \(p\), with \(m_p = |\mathcal{R}(p)|\), IntraSim is the average of the pairwise cosine similarities: \(\mathrm{IntraSim}(p)=\frac{2}{m_p(m_p-1)}\sum_{i<j}\mathrm{sim}(r_i,r_j)\). InterSim is the corresponding average of \(\mathrm{sim}(r,r')\) over pairs with \(r \in \mathcal{R}(p)\) and \(r' \in \mathcal{R}(q)\) drawn from different papers \(p \neq q\). InterSim is only compared within the same "type" of reviewer across papers, so that stylistic differences between humans and AI do not contaminate the signal.
    • Design Motivation: Looking only at rating correlation conflates "AI giving high scores" with "AI saying the same things." Pairwise similarity in embedding space directly measures "whether the second review provides new information," which is the core value of a multi-reviewer system.
  2. Paper Laundering Protocol:

    • Function: Constructs a fully compliant "gaming" attack against AI reviewing (not prompt injection, no hidden instructions) to test the non-gamability condition C2.
    • Mechanism: Taking the entire LaTeX source of a paper as input, a launderer LLM rewrites it under a zero-shot prompt (costing ~$0.25/paper). The recompiled PDF is given to the AI reviewer agent. Statistics are gathered across a 24-cell grid (4 prompts × 2 launderers × 3 reviewers). The Wilcoxon test shows \(p<0.001\) in almost all cells, with an average gain of \(+0.45\); a self-preference bias causes GPT-based reviewers to give even higher gains to GPT-rewritten papers.
    • Design Motivation: Existing adversarial research focuses on prompt injection, which is explicitly banned. Laundering is "authors openly admitting to using AI for polishing," which cannot be banned by policy even if effective—this is the failure mode that truly threatens the credibility of peer review.
  3. Monoculture Cascade Measurement:

    • Function: Extends "hivemind" from the review level to the paper level, arguing that if laundering is widely adopted, academic writing will converge toward AI review preferences.
    • Mechanism: Pairwise cosine similarity was calculated for the abstract+introduction embeddings of the 60 papers, totaling 6,903 pairs (sampled repeatedly), comparing original vs. laundered versions. The average similarity rose from \(0.497\) to \(0.529\), a relative increase of \(+6.5\%\) with Cohen's \(d=1.02\) (a large effect).
    • Design Motivation: The danger of algorithmic monoculture is not just "reviewer convergence," but "writing convergence shaped by reviewers." The authors link the causal chain from reviewer behavior to author behavior using a simple embedding aggregation experiment.
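
The effect sizes quoted for the monoculture cascade can be reproduced in form with a short sketch. The summary does not say which variant of Cohen's \(d\) the authors used, so the pooled-standard-deviation version below is an assumption, and `cohens_d` and `relative_increase` are hypothetical helper names. The only real numbers used are the two reported mean similarities.

```python
import math
from statistics import mean, variance


def cohens_d(before, after):
    """Cohen's d with a pooled standard deviation (one common convention)."""
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return (mean(after) - mean(before)) / math.sqrt(pooled_var)


def relative_increase(before_mean, after_mean):
    """Relative change of the mean, as a fraction of the original mean."""
    return (after_mean - before_mean) / before_mean


# Reported mean similarities: original papers 0.497, laundered papers 0.529.
print(round(100 * relative_increase(0.497, 0.529), 1))

# Toy similarity samples, invented purely to exercise cohens_d.
toy_before = [0.4, 0.5, 0.6]
toy_after = [0.5, 0.6, 0.7]
print(cohens_d(toy_before, toy_after))
```

From the rounded means the relative increase comes out at about 6.4%, in line with the reported +6.5%, which was presumably computed from unrounded values.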

Loss & Training

No new models were trained. All "training" costs were concentrated on: (i) calling OpenAI / Anthropic APIs for 60 × 24 = 1,440 reviews; (ii) laundering costs of ≈ $0.25 per paper; (iii) one-time offline costs for embedding and similarity calculations of the full ICLR review set.
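
The bookkeeping behind these numbers is simple enough to spell out. The sketch below enumerates the grid with placeholder labels, since the summary gives the grid dimensions but not the prompt or reviewer names. The total laundering cost is an inference, not a figure from the paper: it assumes each (paper, prompt, launderer) combination is rewritten once and reused for all three reviewers.

```python
from itertools import product

N_PAPERS = 60
COST_PER_LAUNDER_USD = 0.25  # approximate per-paper figure quoted above

# Placeholder labels: only the launderer model names appear in the summary.
prompts = [f"prompt_{i}" for i in range(1, 5)]
launderers = ["GPT-5.1", "GPT-5.4"]
reviewers = [f"reviewer_{i}" for i in range(1, 4)]

cells = list(product(prompts, launderers, reviewers))
n_reviews = N_PAPERS * len(cells)

# Assumption: one rewrite per (paper, prompt, launderer), shared across reviewers.
n_laundered = N_PAPERS * len(prompts) * len(launderers)
laundering_cost = n_laundered * COST_PER_LAUNDER_USD

print(len(cells), n_reviews, n_laundered, laundering_cost)
```

Under that assumption the whole laundering side of the experiment costs on the order of a hundred dollars, which is what makes the attack so accessible.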

Key Experimental Results

Main Results

Hivemind effect on real ICLR 2026 reviews:

| Data Slice | InterSim Mean | \(p\)-value | Cohen's \(d\) |
| --- | --- | --- | --- |
| Pure AI Reviews | 0.486 | \(<10^{-4}\) | 0.29 |
| Reviews with Human Contribution | 0.467 | | |
| Weaknesses+Questions sections only | | \(<10^{-4}\) | 0.35 |

Convergence of AI vs. Human on simulation experiments (60 ICLR papers):

| Setting | IntraSim | InterSim (vs. Human 0.470) |
| --- | --- | --- |
| Human ICLR Reviews | 0.811 | 0.470 |
| GPT-5.1 on Original Papers | 0.882 (+8.7%) | 0.646 (+37.4%, \(d=3.55\)) |
| GPT-5.1 on Laundered Versions | 0.891 (+9.8%) | 0.657 (+39.8%, \(d=3.79\)) |
| Claude on Original Papers | | 0.553 (+17.6%, \(d=1.41\)) |

Laundering Score Gain (1–10 scale, average of 24 cells):

| Metric | Value |
| --- | --- |
| Avg. Paired Score Gain | \(+0.45\) |
| Wilcoxon Significance | \(p<0.001\) (almost all cells) |
| Single Paper Laundering Cost | ≈ $0.25 |
| Paper Similarity Increase | +6.5% (\(d=1.02\)) |
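
The significance figures come from a paired test: each paper is scored before and after laundering, and the Wilcoxon signed-rank test asks whether the differences lean systematically in one direction. The sketch below is a plain-Python version that uses the normal approximation without a continuity correction; it is an illustration under those assumptions, not the authors' analysis code, and the toy scores are invented. In practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
import math


def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (w_plus, p_value). Zero differences are dropped, tied absolute
    differences share an average rank, and no continuity correction is used.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


# Toy paired scores on a 1-10 scale (invented for illustration).
toy_original = [4, 5, 3, 6, 5]
toy_laundered = [5, 7, 6, 10, 10]
w_plus, p_value = wilcoxon_signed_rank(toy_original, toy_laundered)
print(w_plus, p_value)
```

Because the test uses ranks of the differences rather than their raw size, it suits ordinal review scores better than a paired t-test would.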

Ablation Study

| Slice / Configuration | Key Findings | Description |
| --- | --- | --- |
| Restricting sections to weakness+question | IntraSim \(d\) from 1.47 → 1.93 | Convergence is not just boilerplate; it penetrates critical sections. |
| GPT reviewer + GPT launderer | Largest score gain | Replicates self-preference bias (Panickssery 2024). |
| Claude reviewer | Smaller score gain | Cross-family launderer attacks still work but are weaker. |
| AI vs. Human Predictive Power | AUC 0.710 vs. 0.822 (n=8,015) | Human scores better predict final decisions. |
| AI–AI vs. AI–Human Correlation | \(r=0.49\) vs. \(r=0.15\) | AI fails consistently rather than aligning with humans. |
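
The predictive-power row compares how well review scores rank accepted papers above rejected ones. A minimal sketch of that AUC computation follows; how the authors aggregated scores per paper is not stated in this summary, so treating each paper as a single score plus a binary decision is an assumption, and the toy data are invented.

```python
def auc(scores, labels):
    """Probability that a random accepted paper outscores a random rejected one.

    labels use 1 for accepted and 0 for rejected; tied scores count as 0.5.
    """
    accepted = [s for s, y in zip(scores, labels) if y == 1]
    rejected = [s for s, y in zip(scores, labels) if y == 0]
    if not accepted or not rejected:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = 0.0
    for a in accepted:
        for r in rejected:
            if a > r:
                wins += 1.0
            elif a == r:
                wins += 0.5
    return wins / (len(accepted) * len(rejected))


# Toy example: inflated, compressed scores separate the two classes less cleanly.
decisions = [0, 0, 1, 1]
toy_human_scores = [3, 4, 6, 8]
toy_ai_scores = [7, 6, 7, 8]
print(auc(toy_human_scores, decisions), auc(toy_ai_scores, decisions))
```

An AUC of 0.5 corresponds to chance, so the reported gap between 0.710 and 0.822 means human scores carry noticeably more information about the final decision.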

Key Findings

  • AI reviewers heavily reuse generic sentence structures: GPT's most frequent phrase "if not, can you comment on" appears in 13.3% of papers; Claude's "how does the method handle" appears in 21.7%. In contrast, the most frequent human phrases appear in <1% of reviews. This is direct evidence that AI review "questions" are often decoupled from the paper.
  • Score gains primarily result from increased use of hedging words ("may", "typically", "suggests") and emphasis words ("strong", "robust", "consistent"), as well as "extra interpretations" of hallucinated results, rather than substantive experimental additions.
  • GPT reviewers give an average score of 7.3 and Claude reviewers 6.1, compared with a human average of 4.3, indicating significant systematic inflation.
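
The phrase-reuse finding rests on counting how many reviews contain a given word sequence. A minimal sketch of such a count is shown below. The tokenization, the choice of 5-word n-grams, and the unit of counting (per review here, whereas the AI figures above are reported per paper) are all assumptions, and the sample reviews are invented.

```python
import re
from collections import Counter


def ngram_document_frequency(reviews, n=5):
    """Fraction of reviews that contain each word n-gram at least once."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z0-9']+", text.lower())
        # A set ensures each n-gram is counted once per review.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    total = len(reviews)
    return {gram: count / total for gram, count in counts.items()}


toy_reviews = [
    "How does the method handle noisy labels?",
    "How does the method handle longer sequences in practice?",
    "The proofs in Section 3 are hard to follow.",
]
freq = ngram_document_frequency(toy_reviews)
top_phrase = max(freq, key=freq.get)
print(top_phrase, freq[top_phrase])
```

A phrase that recurs across reviews of unrelated papers cannot depend on any one paper's content, which is why a high document frequency signals generic questioning.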

Highlights & Insights

  • Redefining "Is AI reviewing good?" into two testable conditions: The correlation-based approach is endlessly debatable. Diversity and game-resistance pull the discussion back to an engineering baseline that researchers can immediately test.
  • Laundering as an elegant "compliant attack": Unlike prompt injection, it exposes a governance gap. Conference policies can ban explicit rule-breaking such as hidden instructions, but they cannot effectively ban zero-shot polishing. The argument is highly transferable.
  • Algorithmic monoculture demonstrated at two levels: The reviewer level (review homogenization) + the author level (papers converging toward AI preferences). This "double-layer monoculture" framework can be applied to any LLM-mediated evaluation scenario, such as recommendation systems or automated grading.
  • Public Appendix H provides AI reviews for this paper: A rare "eat your own dog food" moment for a position paper, serving as an ironic demo of the phenomenon.

Limitations & Future Work

  • The authors acknowledge that necessary conditions are not sufficient; even if future LLM generations become non-convergent and game-resistant, questions of accountability and democratic legitimacy remain.
  • Empirical limitations: The 60-paper simulated sample is small; AI labels come from the EditLens classifier (risk of misclassification, though validated in Appendix G.3); whether "quality remained unchanged" during laundering relies on word-level analysis and lacks human blind-review validation.
  • Future directions: Developing "game-resistance" into a benchmark (laundering robust score) for standard AI reviewer system evaluations; studying the failure boundaries when launderers change more than style (e.g., adding fake experiments).

Related Work

  • vs. Bianchi et al. 2025b (AI reviewer agent): This paper reuses their agent but shifts the research question from "can it review" to "should it be automated," providing counterexamples.
  • vs. Goldberg et al. 2024 (NeurIPS LLM checklist): Similarly reveals that AI tools can be gamed, but this paper lowers the gaming threshold to zero-shot optimization, making the threat more pervasive.
  • vs. Lin et al. 2025b (targeted adversarial attacks): While they use character-level perturbations, this paper uses natural language rewriting, lowering the attacker's barrier by an order of magnitude.
  • vs. Kleinberg & Raghavan 2021 (algorithmic monoculture): First empirical application of this theoretical framework to the academic peer review scenario.

Rating

  • Novelty: ⭐⭐⭐⭐ First to decompose "review automation" into two testable conditions and propose paper laundering as a new threat.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Combines real-world data and controlled experiments with a 24-cell grid for internal/external validity, though the 60-paper scale remains modest.
  • Writing Quality: ⭐⭐⭐⭐⭐ A model position paper with a clear logical chain, addressing counterarguments pointwise (§5), and providing its own AI review results.
  • Value: ⭐⭐⭐⭐⭐ Extremely timely topic with direct influence on conference policies and community consensus.