🔎 AIGC Detection

📷 CVPR2026 · 7 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (30) · 💬 ACL2026 (17) · 🧪 ICML2026 (11) · 🤖 AAAI2026 (2) · 🧠 NeurIPS2025 (9)

Enabling Supervised Learning of Generative Signatures for Generalized AI-Generated Images Detection: Generative traces in AI-generated images have no clean paired ground truth, so they cannot be extracted directly with supervised learning. To break this deadlock, the paper uses a randomly structured image reconstructor to artificially "create traces" on real images. The reconstruction residuals serve as pseudo-labels for training a generative signature (GenSign) extractor, and a GenSign + RGB dual-stream classifier then performs detection, achieving SOTA cross-model generalization across four benchmarks.
Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks: This work defines the new task of "Fine-grained Image Aesthetic Assessment" and constructs the FGAesthetics benchmark containing 32,217 images across 10,028 series. It proposes the FGAesQ model, which learns discriminative aesthetic scores from relative ranks through Difference-Preserving Tokenization (DiffToken), Contrastive Text-aligned Alignment (CTAlign), and Rank-Aware Regression (RankReg). The model achieves an accuracy of 0.779 in fine-grained scenarios while maintaining a coarse-grained SRCC of 0.770.
Inconsistency-aware Multimodal Schrodinger Bridge for Deepfake Localization: IaMSB reformulates temporal interval localization of audio-visual deepfakes as a Schrödinger Bridge (SB) generation problem. It reads cross-modal consistency scores directly from the bridge's transport cost and asymmetrically allocates computation steps to the more suspicious modality, yielding a 3-10% gain over existing methods under strict IoU thresholds.
Learning Forgery-Aware Lip Representations Without Forgery Priors: To address the vulnerability of speaker authentication systems to personalized Talking Face Generation (TFG) forgeries, this paper proposes a detector trained solely on real videos, without any forgery samples. By combining mixed-fake lip generation, asymmetric contrastive learning, and Gaussian regularization, it compresses real lip-motion features into a compact hypersphere. Anything outside the sphere (forgeries and impostors) is treated as an outlier. Evaluated against 8 modern forgery methods, the detector reduces the error rate by over 10% compared with 10 SOTA methods.
Learning Where to Look and How to Judge: Resolution-agnostic Image Quality Assessment with Quality-aware Saliency: This paper proposes ReLIQS to address four common issues in No-Reference Image Quality Assessment (NR-IQA): forced resizing to match pre-trained resolutions, poor generalization across resolutions, difficulty in joint training due to inconsistent MOS scales, and prohibitive compute cost for UHD images. ReLIQS samples fixed-size patches from the original-resolution image and its scaled variants and encodes them with CLIP. A lightweight "Perceptual Importance Estimator (PIE)" learns IQA-specific saliency to select a few key patches, while a "Latent Quality Axis Module (LQAM)" aggregates the multi-scale embeddings into a single score. ReLIQS outperforms CNN-, CLIP-, and MLLM-based baselines across real, synthetic, and AIGC distortions and across resolutions, at lower computational cost.
Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images: LTE enables Vision-Language Models to first perform a "global scan to locate suspicious regions" and then "zoom in and crop to re-examine for the final verdict." It upgrades one-time classification into a two-stage region-grounded reasoning process. Accompanied by the TRACE dataset containing box-level annotations and forensic explanations, it achieves simultaneous improvements in accuracy, robustness, and interpretability.
PPM-CLIP: Probabilistic Prompt Modeling for Generalizable AI-Generated Image Detection: PPM-CLIP replaces the "discriminative static boundary" paradigm with "generative probabilistic inference." It uses normalizing flows to generate a family of adaptive prompts (multiple hypotheses) for each image and makes its decision by averaging the cosine similarities over these prompts, which marginalizes out noise. Combined with frequency-guided patch-wise contrastive learning, which forces the CLIP encoder to capture high-frequency forgery traces, it outperforms prior SOTA in cross-generator generalization on Ojha, GenImage, and DRCT.
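The prompt-averaging step described for PPM-CLIP can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `mean_prompt_similarity`, `classify`), the split into separate "real" and "fake" prompt families, and the toy 2-D embeddings are all assumptions made for illustration, and the normalizing flow that would generate the prompts is omitted entirely (the prompt embeddings are simply passed in).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mean_prompt_similarity(image_emb, prompt_embs):
    """Average the image-to-prompt cosine similarity over a family of prompts."""
    sims = [cosine(image_emb, p) for p in prompt_embs]
    return sum(sims) / len(sims)


def classify(image_emb, real_prompts, fake_prompts):
    """Pick the class whose prompt family has the higher mean similarity.

    Assumption: one prompt family per class; the source only states that
    cosine similarities over multiple prompts are averaged.
    """
    s_real = mean_prompt_similarity(image_emb, real_prompts)
    s_fake = mean_prompt_similarity(image_emb, fake_prompts)
    label = "fake" if s_fake > s_real else "real"
    return label, s_real, s_fake


# Toy embeddings (hypothetical values, 2-D for readability).
image = [1.0, 0.0]
real_prompts = [[0.0, 1.0], [1.0, 1.0]]
fake_prompts = [[1.0, 0.0], [1.0, 1.0]]
label, s_real, s_fake = classify(image, real_prompts, fake_prompts)
print(label, round(s_real, 4), round(s_fake, 4))
```

Averaging over several prompt hypotheses means a single poorly matched prompt shifts the score less than it would with one fixed prompt, which is the intuition behind "marginalizing noise" in the summary above.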