🔎 AIGC Detection
🧠 NeurIPS 2025 · 8 paper notes
- ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
  - This paper introduces ASCIIBench, the first publicly available benchmark for ASCII art understanding and generation (5,315 images, 752 categories). Systematic evaluation reveals that the visual modality substantially outperforms the text modality, multimodal fusion yields no benefit, and CLIP exhibits a fundamental bottleneck in representing ASCII structure: only categories with high intra-class consistency can be effectively distinguished.
- Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
  - This paper proposes a dual-agent (quantitative + qualitative) evaluation framework that systematically assesses the faithfulness of GPT-4o, Ansari AI, and Fanar on Islamic content generation tasks across three dimensions (theological accuracy, citation integrity, and stylistic appropriateness), finding that even the best-performing model exhibits significant deficiencies in citation reliability.
- Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code
  - This paper proposes having LLMs generate Python code for domain-dependent heuristic functions rather than directly generating plans. Candidate heuristics are obtained via \(n\) samples, the best is selected on a training set, and the winner is injected into the Python planner Pyperplan for use with greedy best-first search (GBFS). Using pure Python, the approach surpasses all traditional heuristics of the C++ planner Fast Downward on 8 IPC 2023 benchmark domains, matches the SOTA learned planner \(h^{\mathrm{WLF}}_{\mathrm{GPR}}\), and guarantees 100% correctness of all plans found.
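The sample-then-select loop can be sketched with the standard library alone. Everything below is an illustrative stand-in: the toy domain, the three hand-written candidate heuristics (playing the role of LLM-sampled Python functions), and the expansion-count scoring are not the paper's actual domains or heuristics.

```python
import heapq

GOAL = 10  # toy domain: walk from an integer start to GOAL via +1 or +2 moves

def successors(s):
    return [s + 1, s + 2]

def gbfs(start, h):
    """Greedy best-first search; returns the number of node expansions."""
    frontier = [(h(start), start)]
    seen = {start}
    expansions = 0
    while frontier:
        _, s = heapq.heappop(frontier)
        expansions += 1
        if s == GOAL:
            return expansions
        for t in successors(s):
            if t <= GOAL and t not in seen:
                seen.add(t)
                heapq.heappush(frontier, (h(t), t))
    return expansions

# Three hypothetical sampled candidates (n = 3).
candidates = [
    lambda s: 0,               # blind
    lambda s: GOAL - s,        # goal distance
    lambda s: (GOAL - s) % 3,  # misleading
]

def score(h):
    """Total expansions over a small training set of start states."""
    return sum(gbfs(s, h) for s in [0, 1, 3])

# Select the candidate that searches most efficiently on the training set.
best = min(candidates, key=score)
```

The selected heuristic (here, goal distance) is the one a planner like Pyperplan would then use for the full benchmark; correctness of found plans is unaffected by heuristic quality, since the heuristic only orders the search.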
- CLAWS: Creativity Detection for LLM-Generated Solutions Using Attention Window of Sections
  - This paper proposes CLAWS, a method that analyzes the attention weight distribution of LLMs across different prompt sections during mathematical solution generation to classify outputs as "creative," "typical," or "hallucinated," without requiring human evaluation.
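The core idea of aggregating attention mass per prompt section can be illustrated in a few lines. The section boundaries, attention values, and decision thresholds below are hypothetical, not the paper's actual rule:

```python
# Hypothetical sketch: aggregate attention mass per prompt section and
# classify with illustrative thresholds (not the paper's actual values).

def section_mass(attn, sections):
    """attn: per-token attention weights; sections: {name: (start, end)}."""
    total = sum(attn)
    return {name: sum(attn[s:e]) / total for name, (s, e) in sections.items()}

def classify(mass, lo=0.2, hi=0.6):
    """Toy decision rule keyed on attention paid to the problem statement."""
    p = mass["problem"]
    if p >= hi:
        return "typical"
    if p >= lo:
        return "creative"
    return "hallucinated"

attn = [0.05, 0.05, 0.4, 0.3, 0.1, 0.1]          # toy attention over 6 tokens
sections = {"instructions": (0, 2), "problem": (2, 5)}
label = classify(section_mass(attn, sections))
```

The real method reads these weights out of the model's own attention maps during generation, so no reference solution or human grader is needed.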
- DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
-
DuoLens is proposed — an AI-generated content detection framework based on dual-encoder fusion of CodeBERT and CodeBERTa — achieving AUROC of 0.97–0.99 on multilingual text (8 languages) and source code (7 programming languages) at significantly reduced computational cost (8–12× lower latency, 3–5× lower VRAM), substantially outperforming large models such as GPT-4o.
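Fusion by concatenating the two encoders' embeddings can be sketched as follows. The hand-rolled feature "encoders" and the fixed linear decision rule are toy stand-ins for the fine-tuned CodeBERT/CodeBERTa backbones and a learned classification head:

```python
# Sketch of dual-encoder fusion with stand-in encoders (the real system
# fuses CodeBERT and CodeBERTa representations).

def text_encoder(s):
    """Stand-in 'text view': crude character statistics as an embedding."""
    return [len(s) / 100, s.count(" ") / max(len(s), 1)]

def code_encoder(s):
    """Stand-in 'code view': symbol density as an embedding."""
    return [s.count("(") / max(len(s), 1), s.count("=") / max(len(s), 1)]

def fuse(s):
    """Concatenate both views into one joint feature vector."""
    return text_encoder(s) + code_encoder(s)

def detect(s, w, b=0.0):
    """Toy linear head over the fused vector; True means machine-generated."""
    score = sum(wi * xi for wi, xi in zip(w, fuse(s))) + b
    return score > 0
```

The design point is that one input passes through both encoders, so the classifier sees text-oriented and code-oriented views jointly; this is what lets a single small model cover both natural language and source code.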
- "Jutters"
  - Through the metaphor of the Dutch tradition of jutters (beachcombers), this work constructs an immersive installation art piece that integrates real beach debris with AI-generated images and videos, guiding visitors to adopt a beachcomber's mindset in reflecting on how to engage with AI-generated content.
- Reasoning Compiler: LLM-Guided Optimizations for Efficient Model Serving
  - This paper proposes Reasoning Compiler, which models compiler optimization as a sequential decision-making process, employing an LLM as a context-aware proposal engine combined with MCTS to balance exploration and exploitation. The approach achieves an average 5.0× speedup across 5 representative benchmarks and 5 hardware platforms, with 10.8× better sampling efficiency than TVM's evolutionary search.
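The LLM-as-proposal-engine MCTS loop can be sketched like this. The pass names, the stub `propose` function (standing in for the LLM), and the additive cost model are all invented for illustration; a real system would measure speedup on hardware:

```python
import math
import random

PASSES = ["fuse", "tile", "vectorize", "unroll"]

def propose(state):
    """Stub for the LLM proposal engine: suggest next passes given history."""
    return [p for p in PASSES if p not in state]

def reward(state):
    """Stub cost model: pretend measured speedup for a pass sequence."""
    bonus = {"fuse": 1.0, "tile": 0.5, "vectorize": 0.8, "unroll": 0.2}
    return sum(bonus[p] for p in state)

def mcts(iters=200, c=1.4, seed=0):
    rng = random.Random(seed)
    stats = {}  # pass sequence (tuple) -> [visits, total reward]
    best, best_r = (), 0.0
    for _ in range(iters):
        state = ()
        while True:
            children = [state + (p,) for p in propose(state)]
            if not children:
                break  # full sequence reached
            unvisited = [ch for ch in children if ch not in stats]
            if unvisited:
                state = rng.choice(unvisited)  # expand a proposed child
                break
            n = sum(stats[ch][0] for ch in children)
            # UCB1: exploit mean reward, explore rarely-visited proposals.
            state = max(children, key=lambda ch:
                        stats[ch][1] / stats[ch][0]
                        + c * math.sqrt(math.log(n) / stats[ch][0]))
        r = reward(state)
        for i in range(1, len(state) + 1):  # backpropagate along the prefix
            node = stats.setdefault(state[:i], [0, 0.0])
            node[0] += 1
            node[1] += r
        if r > best_r:
            best, best_r = state, r
    return best, best_r
```

Swapping the stub `propose` for an LLM that reads the optimization history is what makes the proposals context-aware; MCTS then decides which proposals deserve more measurement budget.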
- Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency
  - This paper proposes Wedge, a framework that uses LLMs to synthesize performance-characterizing constraints to guide constraint-aware fuzzing, generating stress-test inputs that expose code performance bottlenecks. It further constructs the PerfForge benchmark, enabling LLM-based code optimizers (e.g., Effi-Learner) to achieve up to 24% additional reduction in CPU instructions.
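Constraint-guided fuzzing for performance can be sketched as follows. The insertion-sort target, the "mostly descending" constraint, and the repair step are illustrative, not Wedge's actual synthesized constraints; operation counts stand in for wall-clock or instruction measurements:

```python
import random

def insertion_sort_ops(xs):
    """Target under test: returns the swap count as a cost proxy."""
    xs = list(xs)
    ops = 0
    for i in range(1, len(xs)):
        j = i
        while j > 0 and xs[j - 1] > xs[j]:
            xs[j - 1], xs[j] = xs[j], xs[j - 1]
            ops += 1
            j -= 1
    return ops

def constraint(xs):
    """Hypothetical synthesized constraint: non-increasing input."""
    return all(a >= b for a, b in zip(xs, xs[1:]))

def fuzz(n=16, trials=50, seed=0):
    """Steer random inputs toward the constraint, keep the costliest one."""
    rng = random.Random(seed)
    best, best_cost = None, -1
    for _ in range(trials):
        xs = [rng.randrange(100) for _ in range(n)]
        if not constraint(xs):
            xs.sort(reverse=True)  # repair the input toward the constraint
        cost = insertion_sort_ops(xs)
        if cost > best_cost:
            best, best_cost = xs, cost
    return best, best_cost
```

Because every trial is repaired to satisfy the constraint, the fuzzer spends its whole budget near the quadratic worst case (up to n·(n-1)/2 = 120 swaps here) instead of wandering through average-case inputs, which is the intuition behind constraint-aware stress-test generation.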