🎯 Object Detection
🤖 AAAI2026 · 29 paper notes
📌 Same area in other venues: 📷 CVPR2026 (97) · 🔬 ICLR2026 (31) · 🧪 ICML2026 (6) · 🧠 NeurIPS2025 (27) · 📹 ICCV2025 (28)
🔥 Top topics: Anomaly Detection ×9 · Object Detection ×8 · Remote Sensing ×3 · Few-/Zero-Shot Learning ×3 · Object Tracking ×2
- AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios
This paper introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark dataset for UAV scenarios, and proposes HawkEyeTrack (HETrack), a method that achieves language-guided multi-object tracking in aerial UAV scenes via a co-evolutionary fusion encoder and a scale-adaptive contextual refinement module.
- An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice
This paper proposes a real-time overall mechanism for rice quality evaluation, integrating three modules: an improved YOLO-v5 (variety detection), an improved ConvNeXt-Tiny (intactness grading), and K-means (chalkiness region quantification). The system achieves 99.14% mAP and 97.89% detection accuracy on a self-constructed dataset of 20,000 images spanning six rice varieties.
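The chalkiness quantification step can be illustrated with a minimal sketch. It assumes grayscale grain pixels and exactly two clusters, with chalky regions being the brighter one; the function names `two_means_1d` and `chalkiness_ratio` are hypothetical, and the paper's actual features and number of clusters are not stated in this note.

```python
def two_means_1d(values, iterations=20):
    """Plain K-means with K=2 on scalar intensities; returns (dark, bright) centroids."""
    lo, hi = min(values), max(values)  # start the centroids at the extremes
    for _ in range(iterations):
        mid = (lo + hi) / 2.0  # nearest-centroid assignment is a midpoint threshold in 1D
        dark = [v for v in values if v <= mid]
        bright = [v for v in values if v > mid]
        if not dark or not bright:
            break  # degenerate case: all values fall into one cluster
        new_lo, new_hi = sum(dark) / len(dark), sum(bright) / len(bright)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return lo, hi


def chalkiness_ratio(pixels):
    """Fraction of grain pixels assigned to the brighter cluster.

    Assumption: chalky areas are brighter than the translucent part of the grain.
    """
    lo, hi = two_means_1d(pixels)
    if lo == hi:
        return 0.0  # uniform grain, nothing to separate
    threshold = (lo + hi) / 2.0
    return sum(1 for v in pixels if v > threshold) / len(pixels)


# Toy grain: seven translucent pixels and three bright (assumed chalky) pixels.
grain_pixels = [60, 62, 58, 61, 200, 205, 59, 63, 198, 60]
ratio = chalkiness_ratio(grain_pixels)
```

A real pipeline would first mask out the background so that only grain pixels enter the clustering.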
- AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer
This work formulates zero-shot anomaly generation as a text-guided localized style transfer problem. A lightweight U-Net trained with CLIP-based losses stylizes masked regions of normal images into semantically aligned anomalous images. With only 263M total parameters (0.61M trainable), AnoStyler surpasses diffusion-based baselines on MVTec-AD and VisA while significantly improving downstream anomaly detection performance.
- AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture
This paper proposes AquaSentinel, a physics-informed AI system that achieves network-wide pipeline leak detection using only 20–30% node coverage through sparse sensor deployment, physics-augmented virtual sensors, a MoE spatiotemporal GNN ensemble, a dual-threshold RTCA detection algorithm, causal flow localization, and LLM-based report generation. The system achieves 100% detection rate across 110 leak scenarios.
- Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection
This paper proposes a framework leveraging VFMs (DINOv2 + Grounding DINO) to enhance Source-Free Object Detection (SFOD) via three modules: Patch-weighted Global Feature Alignment (PGFA), Prototype-based Instance Feature Alignment (PIFA), and Dual-source Enhanced Pseudo-label Fusion (DEPF). The method achieves state-of-the-art results on 6 cross-domain detection benchmarks, e.g., 47.1% mAP on Cityscapes→Foggy Cityscapes (+3.5% over DRU) and 67.4% AP on Sim10k→Cityscapes (+8.7% over DRU).
- CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection
This work identifies point cloud curvature as a powerful cue for anomaly detection and proposes CASL, a curvature-augmented self-supervised learning framework. By guiding coordinate reconstruction with multi-scale curvature prompts, CASL learns generalizable 3D representations without any anomaly-detection-specific mechanisms, achieving a 5.6% O-AUROC improvement over the previous state of the art on Real3D-AD.
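To make the curvature cue concrete, here is a minimal sketch of one common curvature proxy for point clouds, the "surface variation" computed from the eigenvalues of a local covariance matrix. Whether CASL uses this exact definition is not stated in this note, so treat it as an assumption; the function names and the neighborhood sizes are hypothetical.

```python
import math


def covariance3(points):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def sym_eigenvalues3(a):
    """Eigenvalues of a symmetric 3x3 matrix, largest first (trigonometric closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]


def surface_variation(points, index, k):
    """Smallest eigenvalue over the eigenvalue sum for the k nearest neighbors.

    The neighborhood includes the query point itself. Values near 0 mean locally flat.
    """
    center = points[index]
    nearest = sorted(points, key=lambda p: math.dist(p, center))[:k]
    eig = sym_eigenvalues3(covariance3(nearest))
    total = sum(eig)
    return max(eig[2], 0.0) / total if total > 0 else 0.0


def multi_scale_curvature(points, index, scales=(9, 16, 25)):
    """Curvature proxy at several neighborhood sizes (the scales are illustrative)."""
    return [surface_variation(points, index, k) for k in scales]
```

A flat patch yields a value close to zero at every scale, while a dent or bump raises it, which is the intuition behind using curvature as an anomaly cue.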
- Commonality in Few: Few-Shot Multimodal Anomaly Detection via Hypergraph-Enhanced Memory
This paper proposes CIF, which leverages hypergraphs to extract intra-class structural commonalities from a small number of training samples, guiding memory bank construction and retrieval for few-shot multimodal industrial anomaly detection, achieving state-of-the-art performance.
- Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
This paper proposes GroundingAgent, a visual grounding framework that requires no task-specific fine-tuning. By composing pretrained open-vocabulary detectors (YOLO World), an MLLM (Llama-3.2-11B-Vision), and an LLM (DeepSeek-V3) into a structured iterative reasoning pipeline, the method achieves a zero-shot average accuracy of 65.1% on RefCOCO/+/g, substantially outperforming prior zero-shot approaches.
- Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time
This paper proposes TUNE, a plug-and-play test-time adaptation framework that addresses the "normality shift" problem in graph anomaly detection—caused by the emergence of new normal node categories—by transforming node features via a graph aligner. It leverages the degree of aggregation contamination as an unsupervised adaptation signal and significantly enhances the generalization of various pretrained GAD models across 10 real-world datasets.
- CountSteer: Steering Attention for Object Counting in Diffusion Models
This paper proposes CountSteer, a training-free inference-time method that injects adaptive steering vectors into the cross-attention hidden states of diffusion models, improving object counting accuracy by approximately 4% without degrading image quality.
- FDP: A Frequency-Decomposition Preprocessing Pipeline for Unsupervised Anomaly Detection in Brain MRI
This work presents the first systematic frequency-domain analysis of brain MRI anomalies, demonstrating that lesions are predominantly concentrated in low-frequency components. Based on this finding, the authors propose the Frequency Decomposition Preprocessing (FDP) framework, which reconstructs low-frequency signals via a learnable prior context bank to suppress lesions while preserving anatomical structures. As a plug-and-play module, FDP consistently improves detection performance across multiple UAD baselines (achieving a 17.63% DICE gain on LDM).
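The decomposition step alone can be sketched on a 1D signal with a hard low-pass mask. This is only an illustration of splitting a signal into low- and high-frequency parts: FDP itself operates on brain MRI and reconstructs the low-frequency part with a learnable context bank, which is not reproduced here. The function names and the `cutoff` parameter are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_frequencies(signal, cutoff):
    """Split a real signal into low- and high-frequency components.

    Bins whose frequency index (counted from either end) is <= cutoff go to the
    low band; the high band is the residual, so low + high equals the input.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spectrum = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    low = idft(low_spectrum)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# Toy signal: one slow oscillation plus a weaker fast one.
n = 32
signal = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 8 * t / n)
          for t in range(n)]
low, high = split_frequencies(signal, cutoff=2)
```

With this split, a method in the spirit of FDP would modify only the low band and leave the high band, which carries fine anatomical detail, untouched.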
- Harnessing Vision-Language Models for Time Series Anomaly Detection
This paper proposes a two-stage zero-shot framework for time series anomaly detection. In the first stage, ViT4TS employs a lightweight ViT to perform multi-scale cross-patch matching on line-chart renderings of time series, localizing candidate anomaly intervals. In the second stage, VLM4TS leverages GPT-4o with global temporal context to validate and refine the detection results. The framework's F1-max exceeds that of the best baseline by 24.6% across 11 benchmarks, while consuming only 1/36 of the tokens used by existing LLM-based methods.
- LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers
This paper proposes LampQ, a metric-based layer-wise mixed precision quantization method that measures the quantization sensitivity of each ViT layer via a type-aware Fisher information metric, combines integer linear programming to optimize bit-width allocation, and iteratively refines the allocation. LampQ achieves state-of-the-art performance across image classification, object detection, and zero-shot quantization tasks.
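The bit-width allocation step can be sketched as a small constrained search. The sketch below uses exhaustive enumeration in place of an integer linear programming solver, and the sensitivity numbers are made up for illustration; LampQ's type-aware Fisher metric and its iterative refinement are not reproduced. The function name `allocate_bits` and all parameters are hypothetical.

```python
from itertools import product


def allocate_bits(sensitivity, layer_sizes, budget_bits, choices=(4, 8)):
    """Pick one bit-width per layer, minimizing total sensitivity under a size budget.

    sensitivity[i][b] is an assumed score for the accuracy cost of quantizing
    layer i to b bits; layer_sizes[i] is that layer's parameter count.
    Exhaustive search stands in for an ILP solver and only scales to a few layers.
    """
    best_cost, best_alloc = None, None
    for alloc in product(choices, repeat=len(layer_sizes)):
        size = sum(n * b for n, b in zip(layer_sizes, alloc))
        if size > budget_bits:
            continue  # violates the model-size constraint
        cost = sum(sensitivity[i][b] for i, b in enumerate(alloc))
        if best_cost is None or cost < best_cost:
            best_cost, best_alloc = cost, alloc
    return best_alloc, best_cost


# Three equally sized layers; the first is the most sensitive to 4-bit quantization.
layer_sizes = [100, 100, 100]
sensitivity = [
    {4: 0.9, 8: 0.1},
    {4: 0.2, 8: 0.1},
    {4: 0.5, 8: 0.1},
]
allocation, cost = allocate_bits(sensitivity, layer_sizes, budget_bits=2000)
```

Under this budget only one layer can drop to 4 bits without exceeding the limit, and the search assigns the low precision to the least sensitive layer.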
- LoReTTA: A Low Resource Framework To Poison Continuous Time Dynamic Graphs
This paper proposes LoReTTA, a two-stage adversarial poisoning attack framework that requires no surrogate model. It first sparsifies high-influence edges via 16 temporal importance metrics, then replaces them with adversarial edges using a degree-preserving negative sampling algorithm. Across 4 datasets × 4 TGNN models, LoReTTA achieves an average performance degradation of 29.47%, while evading 4 anomaly detection systems and resisting 4 defense methods.
- MovSemCL: Movement-Semantics Contrastive Learning for Trajectory Similarity (Extension)
This paper proposes MovSemCL, a framework that transforms GPS trajectories into movement-semantic features (displacement vectors + heading angles + Node2Vec spatial graph embeddings), achieves hierarchical encoding via patch-level two-stage attention (reducing complexity from \(O(L^2)\) to near-linear), and designs Curvature-Guided Augmentation (CGA) to preserve behaviorally critical segments such as turns and intersections. The framework achieves a mean rank approaching the ideal value of 1 on trajectory retrieval tasks while reducing inference latency by 43.4%.
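The displacement and heading features can be sketched as follows. This assumes the trajectory has already been projected to planar (x, y) coordinates, and it uses the absolute heading change between steps as a rough stand-in for the turn signal that a curvature-guided augmentation would want to preserve. The function names are hypothetical, and the Node2Vec embeddings and patch-level attention are not covered.

```python
import math


def movement_features(points):
    """Per-step (dx, dy, heading_deg) for a list of planar (x, y) positions.

    Heading is measured counter-clockwise from the positive x-axis, in [0, 360).
    """
    features = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        heading = math.degrees(math.atan2(dy, dx)) % 360.0
        features.append((dx, dy, heading))
    return features


def turning_angles(features):
    """Absolute heading change between consecutive steps, in [0, 180] degrees."""
    angles = []
    for (_, _, h0), (_, _, h1) in zip(features, features[1:]):
        diff = abs(h1 - h0) % 360.0
        angles.append(min(diff, 360.0 - diff))
    return angles


# Two steps along the x-axis followed by a left turn.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
features = movement_features(trajectory)
turns = turning_angles(features)
```

Segments with large turning angles are the ones an augmentation scheme would avoid cropping or masking if it aims to keep turns and intersections intact.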
- CountVid: Open-World Object Counting in Videos
This paper proposes the CountVid model and the VideoCount benchmark, presenting the first systematic study of open-world video object counting: given a text or image description specifying the target objects, the system enumerates all unique instances in a video. By combining an image counting model with a promptable video segmentation and tracking model, CountVid addresses challenges such as occlusion and re-appearance, achieving substantial improvements over strong baselines across diverse scenarios including TAO, MOT20, penguin colonies, and X-ray metal crystallization.
- PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixing of Experts
PromptMoE shifts prompt learning from a monolithic paradigm to a compositional one. Through a visually-guided Mixture of Experts (MoE) mechanism, it dynamically assembles instance-adaptive normal/abnormal state prompts from a learnable semantic primitive bank, achieving state-of-the-art zero-shot anomaly detection (ZSAD) performance across 15 industrial and medical datasets.
- RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection
This paper proposes a Recursive Convolutional Autoencoder (RcAE) that progressively suppresses anomalies while preserving normal details through multi-step iterative reconstruction with shared parameters. Combined with a Cross-Recursive Detection module (CRD) that exploits multi-step reconstruction dynamics for robust anomaly localization, the method achieves performance comparable to state-of-the-art approaches using only 10% of the parameters required by diffusion models.
- Reimagining Anomalies: What if Anomalies Were Normal?
This paper proposes the first counterfactual explanation framework for unsupervised image anomaly detection. By training a generator to modify anomalous samples into multiple disentangled counterfactuals perceived as normal by the detector, the framework answers at the semantic level: "What would an anomaly look like if it were normal?" This provides a depth of interpretability far exceeding traditional heatmap-based approaches.
- REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion
This paper extends the 2D bounding box diffusion paradigm of DiffusionDet to 3D radar space, proposing the REXO framework. It enables explicit cross-view radar feature association guided by noisy 3D bounding box projections, and introduces a ground-level constraint to reduce the diffusion parameter space. REXO surpasses the state of the art by +4.22 AP and +11.02 AP on the HIBER and MMVR indoor radar datasets, respectively.
- SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
This paper proposes SimROD, an extremely lightweight (only 0.003M parameters) RAW image object detection method that surpasses complex state-of-the-art approaches on multiple RAW detection benchmarks through global Gamma enhancement (4 learnable parameters) and green channel-guided local enhancement.
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection
SM3Det introduces the M2Det task for remote sensing (multi-modal datasets + multi-task object detection). It employs a grid-level sparse MoE backbone and a Dynamic Sub-module Optimization (DSO) mechanism to handle SAR, optical, and infrared modalities with both horizontal and oriented bounding box detection in a single unified model, substantially outperforming the combination of three independently trained modality-specific models.
- T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection
This paper proposes T-Rex-Omni, the first framework to systematically incorporate negative visual prompts into open-set object detection. Through a training-free NNC module and an NNH loss, it substantially narrows the performance gap between visual-prompt and text-prompt detection methods, with particularly strong results in long-tail scenarios (LVIS-minival APr of 51.2).
- Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection
This paper proposes an object-aware temporal modeling framework that achieves cross-frame temporal consistency through selective propagation of high-confidence detection features. Combined with a pretrained vision-language encoder (OWL-ViT) and a few-shot detection head, the method achieves an average improvement of 3.7%–5.3% AP across four video few-shot detection benchmarks.
- Towards Multiple Missing Values-Resistant Unsupervised Graph Anomaly Detection
This paper proposes M2V-UGAD, the first framework to address unsupervised graph anomaly detection under simultaneous node attribute and graph topology missingness. Through three core mechanisms—dual-pathway independent imputation, hyperspherical latent space fusion, and pseudo-anomaly generation—the framework overcomes cross-view interference and imputation bias, consistently outperforming existing methods across 7 benchmark datasets.
- TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding
This paper proposes TubeRMC, a framework that generates text-conditioned candidate tubes and performs tube-conditioned reconstruction along temporal, spatial, and spatio-temporal dimensions, augmented by spatial-temporal mutual constraints to improve weakly-supervised spatio-temporal video grounding.
- VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection
This paper proposes VK-Det, a framework that leverages only the visual knowledge of VLMs, without any additional supervision signals, for open-vocabulary aerial object detection. It combines Adaptive Selection Knowledge Distillation (ASKD), Prototype-Aware Pseudo-Label generation (PAPL), and Synthetic Matching Inference (SMI) to achieve state-of-the-art performance, even surpassing methods that rely on extra supervision.
- When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking
This paper proposes MFT25, a large-scale underwater multiple fish tracking dataset (15 sequences, 408K annotations), and SU-T, a tracking framework combining UKF with FishIoU, achieving state-of-the-art performance of 34.1 HOTA and 44.6 IDF1. Statistical analyses further reveal fundamental differences between fish tracking and terrestrial object tracking.
- YOLO-IOD: Towards Real Time Incremental Object Detection
This work is the first to systematically integrate incremental object detection (IOD) into the YOLO real-time framework. It identifies three types of knowledge conflict, proposes a three-module solution (CPR + IKS + CAKD), and introduces the more realistic LoCo COCO benchmark for evaluation.