
🎯 Object Detection

🧪 ICML2025 · 12 paper notes

📌 Same area in other venues: 📷 CVPR2026 (99) · 🔬 ICLR2026 (31) · 🧪 ICML2026 (6) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (27) · 📹 ICCV2025 (28)

🔥 Top topics: Anomaly Detection ×4 · Time-Series Forecasting ×2 · Alignment/RLHF ×2 · Multimodal/VLM ×2

BlueGlass: A Framework for Composite AI Safety: This work proposes BlueGlass, a composite AI safety framework that integrates three safety analysis tools—distributed evaluation, approximation probes, and sparse autoencoders—via a unified infrastructure to systematically analyze the capability boundaries, layer dynamics, and internal concept representations of Vision-Language Models (VLMs) in object detection tasks.
Causality-Aware Contrastive Learning for Robust Multivariate Time-Series Anomaly Detection: This paper proposes CAROTS—a multivariate time-series anomaly detection framework that integrates causal relationships into contrastive learning. It utilizes causality-preserving augmentation as positive samples (normal variations) and causality-violating augmentation as negative samples (simulated anomalies) to train encoders to distinguish normal from abnormal patterns based on causal structures.
CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering: By introducing the core concept of cost volume filtering from stereo matching and optical flow estimation into Unsupervised Anomaly Detection (UAD), this work constructs a matching cost volume between the input and templates. It utilizes a 3D U-Net with dual-stream attention guidance for denoising and filtering. Designed as a general plug-and-play post-processing module, it simultaneously boosts the performance of both reconstruction-based and embedding-based UAD methods, achieving state-of-the-art (SOTA) results on MVTec-AD and VisA.
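To make the matching cost volume idea concrete, here is a minimal sketch that assumes each image is already reduced to a list of patch feature vectors. It builds a cost matrix of cosine distances between input patches and template patches, then takes the minimum cost per input patch as a raw anomaly score. The function names (`cosine_cost`, `cost_volume`, `min_cost_scores`) and the toy features are hypothetical, and the learned filtering stage (the 3D U-Net with attention guidance) is not modeled here.

```python
import math

def cosine_cost(u, v):
    """Cosine distance between two non-zero feature vectors (0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def cost_volume(input_feats, template_feats):
    """cost[i][j] = cosine distance between input patch i and template patch j."""
    return [[cosine_cost(u, v) for v in template_feats] for u in input_feats]

def min_cost_scores(volume):
    """Raw anomaly score per input patch: the best (lowest) matching cost."""
    return [min(row) for row in volume]

# Toy data (assumed): two normal template patches, three input patches.
template_feats = [[1.0, 0.0], [0.0, 1.0]]
input_feats = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]

volume = cost_volume(input_feats, template_feats)
scores = min_cost_scores(volume)  # the third patch matches no template well
```

In this unfiltered form the scores are noisy on real features, which is the gap a learned filtering module is meant to close.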
Few-Shot Learner Generalizes Across AI-Generated Image Detection: This paper is the first to redefine AI-generated image detection as a few-shot classification task. It proposes FSD (Few-Shot Detector), which uses prototypical networks to learn a metric space. Using only 10 samples from unseen generative models, it achieves an average accuracy of 84.1% on the GenImage dataset, outperforming the previous SOTA (LARE2) by 11.6%.
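The classification rule behind a prototypical network can be sketched in a few lines, assuming embeddings have already been produced by some encoder. Each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype. The labels, the 2-D embeddings, and the function names below are illustrative assumptions, not FSD's actual pipeline.

```python
from math import dist
from statistics import fmean

def class_prototypes(support):
    """Map each label to the per-dimension mean of its support embeddings."""
    return {
        label: [fmean(dim) for dim in zip(*vectors)]
        for label, vectors in support.items()
    }

def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))

# Toy support set (assumed): three embeddings per class.
support = {
    "real": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]],
    "generated": [[-1.0, 0.2], [-0.8, -0.2], [-1.2, 0.0]],
}
prototypes = class_prototypes(support)
label = classify([0.7, 0.05], prototypes)
```

Because a new class only needs a handful of support embeddings to get a prototype, this rule extends to unseen generators without retraining the encoder.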
FG-CLIP: Fine-Grained Visual and Textual Alignment: FG-CLIP systematically addresses the three major bottlenecks of fine-grained understanding in CLIP: capturing global semantic details with 1.6B long-description-image pairs, achieving precise regional alignment with 12M images and 40M region annotations, and training models to distinguish subtle semantic differences with 10M hard negatives, achieving comprehensive leading performance in fine-grained understanding, open-vocabulary detection, and image-text retrieval.
KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks: KAN-AD reformulates time series anomaly detection as approximating sequences using smooth univariate functions. By replacing B-splines in KAN with truncated Fourier expansion to avoid local perturbation sensitivity, it improves detection accuracy by an average of 15% across four benchmarks with fewer than 1000 parameters.
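As a rough illustration of approximating a window with a truncated Fourier expansion, the sketch below projects an evenly sampled window onto a few harmonics and scores each point by its residual. This is a simplification under stated assumptions: the coefficients are computed in closed form by projection rather than learned by a network, and the names `fourier_smooth` and `residual_scores` are hypothetical.

```python
import math

def fourier_smooth(window, n_harmonics):
    """Project an evenly sampled window onto a mean term plus n_harmonics sinusoids."""
    n = len(window)
    if not 0 < n_harmonics < n / 2:
        raise ValueError("n_harmonics must be in (0, len(window) / 2)")
    mean = sum(window) / n
    coeffs = []
    for k in range(1, n_harmonics + 1):
        a = 2 / n * sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(window))
        b = 2 / n * sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(window))
        coeffs.append((k, a, b))
    return [
        mean + sum(
            a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
            for k, a, b in coeffs
        )
        for t in range(n)
    ]

def residual_scores(window, n_harmonics=3):
    """Anomaly score per point: absolute gap between the signal and its smooth fit."""
    smooth = fourier_smooth(window, n_harmonics)
    return [abs(x - s) for x, s in zip(window, smooth)]

# Toy series (assumed): two sine periods over 64 samples with one injected spike.
series = [math.sin(2 * math.pi * t / 32) for t in range(64)]
series[40] += 3.0
scores = residual_scores(series)
spike_index = scores.index(max(scores))
```

Each sinusoid spans the whole window, so a single spike only slightly shifts the fit and stays visible in the residual.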
Open-Det: An Efficient Learning Framework for Open-Ended Detection: Open-Det is an efficient open-ended detection (OED) framework built from four components: a reconstructed object detector that decouples one-to-many and one-to-one matching, a VL-prompts distillation module that bridges the vision-language semantic gap, a LoRA Head with Text Denoising that accelerates LLM training, and a Masked Alignment Loss that eliminates contradictory supervision. Compared to GenerateU, Open-Det achieves better detection performance (APr +1.0%) using only 1.5% of the training data and 20.8% of the training epochs.
Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models: This paper proposes Outlier Gradient Analysis (OGA), which reformulates the influence-function task of identifying detrimental training samples as an outlier detection problem in the gradient space. This sidesteps the high computational overhead of Hessian matrix inversion while outperforming traditional influence function methods on tasks such as noisy label correction, NLP data filtering, and LLM influence data identification.
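The core idea of treating detrimental samples as gradient-space outliers can be sketched on a toy linear model. The example below assumes an already-fitted model `y = w*x + b` with squared-error loss, computes each sample's gradient analytically, and flags samples whose gradient norm is unusually large according to a median/MAD rule. That detector is a simple stand-in chosen for illustration, and all names and data are hypothetical.

```python
import math
from statistics import median

def per_sample_gradients(xs, ys, w, b):
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b), one pair per sample."""
    grads = []
    for x, y in zip(xs, ys):
        err = w * x + b - y
        grads.append((err * x, err))
    return grads

def flag_gradient_outliers(grads, threshold=3.5):
    """Indices whose gradient norm lies far above the median (modified z-score)."""
    norms = [math.hypot(gw, gb) for gw, gb in grads]
    med = median(norms)
    mad = median([abs(n - med) for n in norms]) or 1e-12  # avoid division by zero
    return [i for i, n in enumerate(norms) if 0.6745 * (n - med) / mad > threshold]

# Toy data (assumed): y is close to 2x + 1, except sample 3, which is mislabeled.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 2.9, 5.1, -10.0, 9.0, 11.0]
grads = per_sample_gradients(xs, ys, w=2.0, b=1.0)
flagged = flag_gradient_outliers(grads)
```

No Hessian appears anywhere in this procedure, only first-order gradients, which is where the efficiency argument comes from.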
Self-Organizing Visual Prototypes for Non-Parametric Representation Learning: This paper proposes the Self-Organizing Prototypes (SOP) strategy, which replaces the single prototype in traditional self-supervised learning (SSL) with multiple semantically similar support embeddings to represent local regions of the feature space. It also introduces a non-parametric masked image modeling (MIM) task, achieving state-of-the-art performance on downstream tasks such as retrieval, detection, and segmentation.
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction: This paper proposes UI-Vision, the first comprehensive offline evaluation benchmark for desktop environments, covering 83 software applications. It provides dense annotations of bounding boxes, UI labels, and action trajectories. It defines three evaluation tasks ordered from fine-grained to coarse-grained (Element Grounding \(\rightarrow\) Layout Grounding \(\rightarrow\) Action Prediction) to systematically evaluate SOTA models and reveal their key shortcomings in professional software understanding, spatial reasoning, and complex actions.
Understanding the Emergence of Multimodal Representation Alignment: This work systematically investigates the emergence mechanism of multimodal representation alignment. It reveals that the occurrence of implicit alignment and its relationship with performance depend on the ratio of redundant to unique information in the data and modal heterogeneity, challenging the common assumption of "larger models \(\rightarrow\) better alignment \(\rightarrow\) better performance."
When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network: This paper proposes a multimodal asynchronous hybrid network that combines the high temporal resolution of event cameras (processed via an asynchronous GNN) with the rich spatial features of RGB cameras (processed via a CNN). It achieves an inference speed of 579 FPS and an average response time of 1.17s in traffic anomaly detection, and it is the first work to introduce event streams to autonomous driving anomaly detection.