
Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

Conference: ICLR 2026
arXiv: 2510.13201
Code: Project Page
Area: Scientometrics / Review Analysis
Keywords: peer review, score dynamics, decision entropy, conference statistics, dataset, LLM metadata extraction

TL;DR

Paper Copilot is a persistent digital archive and analysis platform for peer reviews spanning dozens of AI/ML venues. It adopts a tri-source hybrid data collection strategy—OpenReview API, web scraping, and community contributions—to archive real-time score snapshots that capture pre- and post-rebuttal dynamics. The platform reveals a structural anomaly at ICLR 2025: a counterintuitive decline in decision entropy, signaling a shift from probabilistic tiering to near-deterministic, score-driven decision-making. LLM-driven author–affiliation metadata extraction additionally supports talent-trajectory tracking.

Background & Motivation

Background: Submission volumes at top AI/ML venues now exceed 10,000 per year (ICLR 2025: 11,672), placing unprecedented pressure on peer review. Some venues (ICLR/NeurIPS) adopt open review via OpenReview, while most (CVPR/AAAI/ICCV) remain closed. Review dimensions have also expanded from a single score to multi-dimensional assessments covering soundness, correctness, novelty, and contribution.

Limitations of Prior Work: (1) Review data is fragmented across social platforms such as Twitter, Reddit, Zhihu, and Xiaohongshu; (2) OpenReview overwrites earlier review versions, so score histories from the rebuttal period are irrecoverably lost; (3) cross-venue and longitudinal comparisons of review standards lack a unified data source and toolset; (4) during the short rebuttal window (only 1–2 weeks), authors lack statistical reference points for judging their score levels and the potential value of a rebuttal.

Key Challenge: The review process is central to research transparency, yet existing infrastructure cannot support systematic tracking of review dynamics or longitudinal analysis.

Goal: To construct a unified platform for review data collection, archiving, and analysis that supports cross-venue longitudinal studies and real-time score dynamics tracking.

Key Insight: A tri-source hybrid data strategy maximizes coverage, while real-time snapshot archiving preserves otherwise irrecoverable historical data.

Core Idea: Consolidate scattered, ephemeral AI conference review information into a persistent, structured, and analyzable digital archive—establishing a "meta-scientific infrastructure" for the peer review process.

Method

Overall Architecture

Paper Copilot is a modular system comprising: a venue configuration layer → a multi-source data collection pipeline (multi-source assigners + worker pool + parallel bots) → cleaning and normalization → versioned datasets (JSON format, 30+ fields per paper) → backend storage/API (LAMP/MySQL) → frontend visualization and analysis (WordPress + custom JS). The system supports rapid onboarding of new venues with minimal configuration.
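
The venue configuration layer can be pictured with a minimal sketch. Everything here is an illustrative assumption—the field names (`source`, `api_invitation`, `snapshot_interval_hours`, `score_fields`) are invented for this example and are not the project's actual schema:

```python
# Hypothetical per-venue configuration entries; field names are
# assumptions for illustration, not Paper Copilot's real schema.
VENUE_CONFIG = {
    "iclr2025": {
        "source": "openreview",          # openreview | scrape | community
        "api_invitation": "ICLR.cc/2025/Conference/-/Submission",
        "snapshot_interval_hours": 24,   # daily review snapshots
        "score_fields": ["rating", "confidence", "soundness",
                         "contribution", "presentation"],
    },
    "cvpr2025": {
        "source": "scrape",              # closed review: metadata only
        "score_fields": [],
    },
}

def sources(config):
    """Group venue keys by their collection channel."""
    grouped = {}
    for venue, cfg in config.items():
        grouped.setdefault(cfg["source"], []).append(venue)
    return grouped
```

A config-driven layout like this is what makes "rapid onboarding of new venues with minimal configuration" plausible: adding a venue is one new dictionary entry, and each collection bot selects its venues by `source`.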

Key Designs

  1. Tri-Source Hybrid Data Collection Pipeline:

    • Function: Unified collection of review data from heterogeneous sources to maximize coverage.
    • Mechanism: (1) OpenReview API—scheduled scripts pull scores, confidence levels, and comments for open venues (ICLR/NeurIPS), storing timestamped snapshots to track pre- and post-rebuttal changes; (2) Web scraping—targeted extraction of accepted papers, authors, and metadata for venues without APIs (CVPR/AAAI); (3) Community opt-in contributions—authors from closed-review venues voluntarily submit their reviews (accumulating 6,584 valid records), with approximately 60% consenting to public release of anonymized scores.
    • Design Motivation: No single data source can cover all venues. For closed-review venues, community contribution is the only viable data channel.
  2. Review Dynamics Temporal Snapshot Archiving:

    • Function: Real-time archiving of the complete evolution of review scores throughout the discussion and rebuttal phases.
    • Mechanism: Daily crawling of review snapshots for ICLR 2024/2025, recording all dimensional scores (rating, confidence, soundness, contribution, presentation) from each reviewer at each time point. OpenReview's official platform retains only the final version; earlier versions are overwritten and irrecoverable. Paper Copilot is the only publicly available archive on the internet preserving complete review time-series data. Score footprint visualizations trace the scoring trajectory of individual papers across multiple dimensions and reviewers.
    • Design Motivation: The review process—how scores change and how consensus forms—is as important as the final outcome, yet it had not previously been systematically preserved.
  3. LLM-Driven Author–Affiliation Metadata Extraction:

    • Function: Large-scale automated extraction of structured author tuples \((a_i, \mathcal{A}_i, e_i)\) (name, affiliation set, email) from papers.
    • Mechanism: GLM-series models extract metadata from camera-ready PDFs. Structural consistency is checked with a mismatch indicator \(\mathbf{1}(x,y) = 1\) if \(|x| \neq |y|\) (e.g., the number of extracted affiliations differing from the number of authors). The evaluation metric is Success Rate \(= 1 - \frac{1}{|\mathcal{D}|} \sum_{i} (\delta_{\text{aff}}^i \lor \delta_{\text{email}}^i \lor \delta_{\text{parse}}^i)\). glm-4-plus achieves an 86.82% success rate on ~70K papers (\(\delta_{\text{aff}} = 5.01\%\), \(\delta_{\text{email}} = 4.94\%\), \(\delta_{\text{parse}} = 0.81\%\)).
    • Design Motivation: The vast majority of venues do not provide structured author–affiliation mappings, which are a prerequisite for institutional and country-level analyses.
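
The snapshot-archiving design (Key Design 2) can be sketched as below. The helpers `archive_snapshot` and `score_footprint` are hypothetical illustrations of timestamped storage and trajectory reconstruction, not the platform's actual code:

```python
import json
import time
from pathlib import Path

def archive_snapshot(paper_id, reviews, root="snapshots", ts=None):
    """Store one timestamped snapshot of a paper's current reviews, so a
    later overwrite on the review platform cannot erase the history.
    `reviews` is a list of per-reviewer score dicts (rating, confidence,
    soundness, contribution, presentation)."""
    if ts is None:
        ts = time.strftime("%Y-%m-%dT%H-%M-%S", time.gmtime())
    path = Path(root) / paper_id
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"{ts}.json"
    out.write_text(json.dumps(reviews, indent=2))
    return out

def score_footprint(snapshot_dir, field="rating"):
    """Reconstruct the time-ordered trajectory of one score dimension
    across all archived snapshots of a paper (a "score footprint")."""
    trajectory = []
    for f in sorted(Path(snapshot_dir).glob("*.json")):
        reviews = json.loads(f.read_text())
        trajectory.append([r[field] for r in reviews])
    return trajectory
```

Because each daily crawl writes a new file rather than updating a record, the pre-rebuttal state survives by construction—this is the property that makes the archive irreplaceable once OpenReview overwrites its own history.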
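
The success-rate metric from Key Design 3 translates directly into code. The record layout (per-paper boolean mismatch flags `d_aff`, `d_email`, `d_parse`) is an assumption for illustration:

```python
def success_rate(records):
    """Success Rate = 1 - (1/|D|) * sum_i (d_aff_i OR d_email_i OR d_parse_i).
    Each d_* flag marks a structural mismatch for paper i, e.g. the number
    of extracted affiliations differing from the number of authors."""
    failures = sum(
        1 for r in records
        if r["d_aff"] or r["d_email"] or r["d_parse"]
    )
    return 1 - failures / len(records)
```

Note that a paper counts as a failure if any of the three checks fires, which is why the per-dimension error rates (5.01%, 4.94%, 0.81% for glm-4-plus) do not simply sum to the complement of the 86.82% success rate—the flags can overlap on the same paper.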

Analysis Methods

Decision Entropy Analysis: Quantifies the certainty of area chair (AC) decisions. For year \(t\) and score bin \(b\), the entropy is defined as \(H_{t,b} = -\sum_{s \in \{\text{Reject, Poster, ...}\}} p_{t,b,s} \log p_{t,b,s}\), with the weighted average \(\bar{H}_t = \sum_b w_{t,b} H_{t,b}\). Entropy typically grows logarithmically with submission volume as \(\bar{H}_t \approx a \log X_t + b\); however, a strong negative residual in 2025 indicates anomalously high decision sensitivity \(\kappa_{2025}\)—ACs rely more heavily on average scores for deterministic tiering.
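
The entropy definitions above can be sketched as a small computation; the bin/outcome layout here is an illustrative assumption (counts of decisions per average-score bin), not the paper's data format:

```python
import math

def bin_entropy(outcome_counts):
    """Shannon entropy H_{t,b} of the decision distribution within one
    average-score bin (outcomes: Reject, Poster, Spotlight, Oral, ...)."""
    total = sum(outcome_counts.values())
    h = 0.0
    for n in outcome_counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h

def weighted_decision_entropy(bins):
    """Weighted average H_t = sum_b w_{t,b} H_{t,b}, where the weight
    w_{t,b} is the share of submissions falling into bin b."""
    total = sum(sum(counts.values()) for counts in bins.values())
    return sum(
        (sum(counts.values()) / total) * bin_entropy(counts)
        for counts in bins.values()
    )
```

A bin where every paper receives the same decision has zero entropy; the 2025 anomaly is precisely that more bins behave this way—the mapping from average score to outcome becomes nearly deterministic even as submission volume (and hence the fitted \(a \log X_t + b\) prediction) grows.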

Key Experimental Results

Analysis of ICLR Review Evolution (2017–2025)

| Metric | Finding | Quantitative Evidence |
| --- | --- | --- |
| Submission growth | 490 → 11,672 (24×) | AC count increased from 31 to 823 |
| Decision entropy trend | Typically grows logarithmically with submissions | \(\bar{H}_t \approx a\log X_t + b\) |
| 2025 structural shift | Decision entropy anomalously declines | \(\text{resid}_{2025}\) deviates strongly negative from the fitted line |
| Rebuttal score changes | 54.8% of papers change their overall rating | Soundness and other dimensions change in only ~10–13% of papers |
| Consensus evolution | Disagreement first rises, then converges once discussion begins | Oral converges fastest; Reject retains high divergence |
| Boundary asymmetry | High-score, low-variance papers benefit for acceptance | Low-score, high-variance papers paradoxically also benefit |

Community Transparency Survey (1,860 Responses Across 4 Venues)

| Venue | Responses | Agreed to Release Anonymized Reviews | Proportion |
| --- | --- | --- | --- |
| CVPR 2025 | 357 | 191 | 53.5% |
| ICML 2025 | 1,034 | 628 | 60.7% |
| ICCV 2025 | 254 | 151 | 59.4% |
| ACL 2025 | 215 | 145 | 67.4% |
| Total | 1,860 | 1,115 | 59.9% |

LLM Metadata Extraction Accuracy

| Model | \(\delta_{\text{aff}}\) | \(\delta_{\text{email}}\) | \(\delta_{\text{parse}}\) | Success Rate |
| --- | --- | --- | --- | --- |
| glm-4-plus | 5.01% | 4.94% | 0.81% | 86.82% |
| glm-4-air | 49.98% | 17.11% | 0.51% | 44.73% |
| glm-4-flash | 76.39% | 43.27% | 0.62% | 18.52% |
| glm-3-turbo | 76.07% | 32.34% | 1.34% | 20.90% |

Key Findings

  • Structural turning point in 2025: Despite the largest-ever submission volume, decision entropy declined—ACs relied more heavily on average scores for acceptance decisions, shifting from probabilistic tiering to near-deterministic mapping.
  • Dual role of rebuttal: Amplifies score changes for borderline papers while driving consensus formation for strong papers.
  • Spotlight converging toward Oral: Average scores across decision tiers are increasingly separated overall, while Spotlight's average score trends closer to Oral's year over year.
  • Dimensional divergence in score changes: Overall rating is the most frequently changed dimension during rebuttal; soundness and related dimensions change far less.

Highlights & Insights

  • The only review time-series archive: OpenReview overwrites earlier versions during discussion; Paper Copilot's real-time archiving preserves the only complete record of review dynamics available on the internet—an irreplaceable historical resource.
  • Decision entropy analysis framework: An ordered-logit model combined with decision entropy provides a quantitative characterization of the evolution of the review system, elevating scattered community intuitions to rigorous meta-scientific analysis.
  • Integrity of ethical design: The paper provides thorough discussion of data source compliance, privacy protection, re-identification risks, and dual-use safeguards, reflecting best practices in research ethics.

Limitations & Future Work

  • Closed-review venues rely on voluntary community submissions, introducing self-selection bias (authors with high or low scores may submit at different rates).
  • Non-zero error rates in LLM-based affiliation extraction may affect the reliability of institutional ranking analyses.
  • Author trajectory analysis could be repurposed for high-stakes assessments such as recruitment, posing dual-use risks.
  • As a continuously updated live platform, exact reproduction of the system's state at any given point in time is not feasible.

Comparison with Related Resources

  • vs. PeerRead (Kang et al., 2018): Covers 14.7K papers but is limited to specific venues and time snapshots; does not support longitudinal dynamic tracking.
  • vs. MOPRD (Lin et al., 2023): Multidisciplinary in scope but does not cover rebuttal dynamics or score time-series for AI venues.
  • vs. CSRankings: Focuses on institutional rankings but updates slowly, lacks data source transparency, and contains no review data whatsoever.

Rating

  • Novelty: ⭐⭐⭐⭐ Distinctive tri-source data strategy and review time-series archiving; the decision entropy analysis framework is original.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Large-scale longitudinal analysis spanning ICLR 2017–2025 with well-grounded quantitative findings.
  • Writing Quality: ⭐⭐⭐⭐ System description is clear; ethical discussion is comprehensive and detailed.
  • Value: ⭐⭐⭐⭐⭐ Represents an infrastructure-level contribution to the AI research community; the combined value of the dataset and platform far exceeds that of a standalone paper.