🔎 AIGC Detection

🔬 ICLR2026 · 6 paper notes

Calibrating Verbalized Confidence with Self-Generated Distractors: This paper proposes DiNCo, a method that exposes the "suggestibility bias" of LLMs by having them independently evaluate automatically generated distractor options (plausible but incorrect alternative answers). It normalizes confidence using the total confidence assigned to distractors, and integrates two complementary dimensions—generation consistency and verification consistency—to significantly improve confidence calibration on both short-form QA and long-form generation tasks.
CLARC: C/C++ Benchmark for Robust Code Search: This paper introduces CLARC, the first compilable C/C++ code retrieval benchmark comprising 6,717 query–code pairs. An automated pipeline extracts code from GitHub and employs LLMs combined with hypothesis testing to generate and validate queries. The benchmark covers four retrieval scenarios—standard, anonymized, assembly, and WebAssembly—and reveals that existing code embedding models over-rely on lexical features (NDCG@10 drops from 0.89 to 0.67 after anonymization) and perform poorly on binary-level retrieval.
Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity: Through close reading annotations of 8,618 expressions by 26 professional writers, this paper demonstrates that n-gram novelty is insufficient for measuring textual creativity — approximately 91% of expressions with high n-gram novelty are not perceived as creative, and in open-source LLMs, high n-gram novelty negatively correlates with pragmaticality.
DMAP: A Distribution Map for Text: This paper proposes DMAP (Distribution Map), a mathematical framework that maps text to i.i.d. samples on \([0,1]\) via next-token probability rankings from a language model. A formal theorem proves that purely sampled text yields a uniform distribution, enabling \(\chi^2\)-based verification of generation parameters, exposing the root cause of the complete failure of probability-curvature detectors under pure sampling, and visualizing statistical fingerprints left by post-training (SFT/RLHF) in downstream models.
Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review: This paper constructs the largest AI-generated peer review dataset to date (788,984 reviews), systematically evaluates 18 AI text detection methods in the peer review setting, and proposes the Anchor detection method that leverages the source paper as contextual grounding, substantially outperforming all baselines at low false positive rates.
PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives: This paper constructs the PoliCon benchmark from 2,225 high-quality European Parliament deliberation records spanning 13 years (2009–2022). By varying voting mechanisms (simple majority / two-thirds majority / veto power), power structures, and political objectives (utilitarianism / Rawlsianism), the benchmark systematically evaluates the ability of LLMs to draft political consensus resolutions, revealing the shortcomings of frontier models on complex consensus tasks and their inherent partisan biases.
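
To make the DMAP entry above more concrete, here is a minimal sketch of one plausible reading of the idea. It is not the paper's implementation. It assumes each token is mapped to a point inside the slice of \([0,1]\) that its probability rank occupies, with a uniform draw inside that slice, and that uniformity is then checked with a binned \(\chi^2\) statistic. The toy distribution `PROBS`, the function names, the bin count, and the temperature value are all hypothetical choices made for illustration.

```python
import random

# Hypothetical toy "language model": one fixed next-token distribution.
PROBS = [0.5, 0.3, 0.15, 0.05]


def dmap_sample(probs, token, rng):
    """Map one observed token to a point in [0, 1].

    Assumed construction: tokens are ranked by probability, each token owns
    an interval whose width equals its probability, and the point is drawn
    uniformly inside the interval of the observed token.
    """
    p_tok = probs[token]
    # Total mass of tokens ranked above the observed one (ties broken by index).
    mass_above = sum(
        p for i, p in enumerate(probs)
        if p > p_tok or (p == p_tok and i < token)
    )
    return mass_above + rng.random() * p_tok


def chi_square_uniform(samples, bins=10):
    """Pearson chi-square statistic against a uniform histogram on [0, 1]."""
    counts = [0] * bins
    for u in samples:
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(samples) / bins
    return sum((c - expected) ** 2 / expected for c in counts)


def generate(probs, n, rng, temperature=1.0):
    """Sample n token indices; temperature != 1 sharpens or flattens probs."""
    weights = [p ** (1.0 / temperature) for p in probs]
    return rng.choices(range(len(probs)), weights=weights, k=n)


rng = random.Random(0)
n = 5000

# Pure sampling: the mapped points should look uniform (small statistic).
pure_tokens = generate(PROBS, n, rng)
pure_stat = chi_square_uniform([dmap_sample(PROBS, t, rng) for t in pure_tokens])

# Low-temperature sampling: mass piles up near 0 (large statistic).
cold_tokens = generate(PROBS, n, rng, temperature=0.5)
cold_stat = chi_square_uniform([dmap_sample(PROBS, t, rng) for t in cold_tokens])

print(f"pure sampling chi2:   {pure_stat:.1f}")
print(f"temperature 0.5 chi2: {cold_stat:.1f}")
```

With 10 bins the statistic has 9 degrees of freedom, so values far above roughly 28 (the 0.001 critical value) suggest the text was not produced by pure sampling from the scoring distribution. In a real setting the per-position distributions would come from an actual language model rather than a fixed list.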