🎯 Object Detection

📷 CVPR2025 · 38 paper notes

📌 Same area in other venues: 📷 CVPR2026 (99) · 🔬 ICLR2026 (31) · 🧪 ICML2026 (6) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (27) · 📹 ICCV2025 (28)

🔥 Top topics: Anomaly Detection ×11 · Object Detection ×10 · Few-/Zero-Shot Learning ×5 · Diffusion Models ×3

AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP: Proposes AA-CLIP, which enhances anomaly discriminability while preserving the generalization ability of CLIP through a two-stage training strategy (first adapting the text encoder to establish anomaly-aware anchors, then aligning patch-level visual features). It achieves SOTA zero-shot anomaly detection performance across multiple industrial and medical datasets with minimal training samples.
ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection: Proposes ABRA (Aligned Basis Relocation for Adaptation), which "teleports" class-specific detection knowledge from a source domain to an unlabeled target domain by applying singular value decomposition (SVD) and orthogonal rotation alignment in weight space, achieving zero-shot cross-domain object detection.
AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios: Proposes AnomalyNCD, the first self-supervised multi-class anomaly classification method for industrial scenarios: MEBin extracts major anomaly regions \(\rightarrow\) mask-guided ViT focuses on weak-semantic anomalies \(\rightarrow\) region fusion strategy achieves flexible region/image-level classification, improving F1 by 10.8% and NMI by 8.8% on MVTec AD.
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs: This paper proposes BACON, a prompting method that deconstructs verbose image captions generated by VLMs into decoupled structured elements (in JSON dictionary format) such as objects, relationships, styles, and themes. This allows downstream models to efficiently utilize caption information without requiring strong text-encoding capabilities, achieving a 1.51x recall improvement for GroundingDINO in open-vocabulary object detection.
Boosting Domain Incremental Learning: Selecting the Optimal Parameters Is All You Need: Discovers that selecting the optimal subset of parameters is more effective than fine-tuning all parameters in domain incremental learning, and proposes a parameter selection strategy to resolve catastrophic forgetting in domain incremental object detection.
DEIM: DETR with Improved Matching for Fast Convergence: This paper accelerates DETR training convergence through two simple improvements: Dense O2O (increasing targets per image via data augmentation to achieve dense one-to-one matching) and MAL (replacing VFL to better optimize low-quality matches). It cuts the training epochs in half while boosting performance (COCO AP 56.5 with D-FINE-X).
Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection: The DPDL method is proposed to learn multi-Gaussian distribution prototypes and map normal samples to the prototype space via diffusion using the Schrödinger Bridge (while concurrently pushing away anomalous samples). Combined with dispersion feature learning on hyperspherical space to enhance generalization, this method achieves state-of-the-art (SOTA) performance on 9 public anomaly detection datasets (e.g., outperforming AHL by 5.0% on AITEX and 8.7% on ELPV).
Efficient Event-Based Object Detection: A Hybrid Neural Network with Spatial and Temporal Attention: This work proposes the first hybrid SNN-ANN object detection model targeting large-scale benchmarks. An Attention-Squeeze Bridging (ASAB) block is designed to convert sparse spike representations from the SNN into dense features for the ANN via spatio-temporal attention. With only 6.6M parameters, it significantly outperforms SNN methods and approaches the accuracy of ANN/RNN methods on the Gen1/Gen4 datasets, while the SNN component can be deployed on the Intel Loihi 2 neuromorphic chip for low-power inference.
Efficient Test-Time Adaptive Object Detection via Sensitivity-Guided Pruning: Proposes an efficient continual test-time adaptive object detection (CTTA-OD) method, identifying that certain feature channels in the source model are sensitive to domain shifts and impede cross-domain performance. Selective pruning is achieved by guiding weighted sparse regularization with channel sensitivity measured at both image and instance levels, complemented by a random channel reactivation mechanism to prevent erroneous pruning. This approach surpasses SOTA adaptation accuracy while reducing computational cost by 12%.
Generalized Diffusion Detector: Mining Robust Features from Diffusion Models for Domain-Generalized Detection: This paper introduces diffusion models to domain-generalized object detection for the first time. By extracting multi-timestep intermediate features from the diffusion process to build a domain-invariant detector, and designing a dual-level (feature-level and object-level) alignment knowledge transfer framework, the generalization capability is distilled into a lightweight common detector. It achieves an average improvement of 14.0% mAP across six DG benchmarks, even outperforming most domain adaptation methods.
Integration of deep generative Anomaly Detection algorithm in high-speed industrial line: Presents a semi-supervised anomaly detection framework based on a GAN and a Dual-stage Residual Autoencoder (DRAE), deployed for real-time online quality inspection on high-speed pharmaceutical BFS lines. Trained exclusively on normal samples, it achieves a single-patch inference time of 0.17 ms and improves reconstruction quality via Perlin noise augmentation and a Noise Loss.
Interpreting Object-level Foundation Models via Visual Precision Search: Addressing the interpretability of object-level foundation models like Grounding DINO and Florence-2, this paper proposes Visual Precision Search (VPS). By combining superpixel sparsification with greedy search guided by submodular functions, VPS accurately localizes critical decision subregions. It outperforms the state-of-the-art D-RISE method in fidelity metrics (Insertion) by 23.7%, 20.1%, and 31.6% on MS COCO, RefCOCO, and LVIS, respectively.
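The greedy, submodular-guided selection mentioned in the VPS entry can be illustrated with a toy sketch. The scoring function here (a set-coverage count) is only a stand-in chosen because it is submodular; it is not the paper's objective. The names `coverage` and `greedy_select` and the superpixel-to-evidence mapping are hypothetical.

```python
def coverage(selected, evidence):
    """Toy submodular score: number of distinct evidence items covered."""
    covered = set()
    for region in selected:
        covered |= evidence[region]
    return len(covered)


def greedy_select(evidence, budget):
    """Greedily add the region with the largest marginal gain in score."""
    selected = []
    remaining = set(evidence)
    while remaining and len(selected) < budget:
        base = coverage(selected, evidence)
        # sorted() makes tie-breaking deterministic (first maximum wins).
        best = max(
            sorted(remaining),
            key=lambda r: coverage(selected + [r], evidence) - base,
        )
        if coverage(selected + [best], evidence) == base:
            break  # no remaining region adds anything new
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical superpixels mapped to the evidence items they contain.
evidence = {
    "sp0": {1, 2, 3},
    "sp1": {3, 4},
    "sp2": {5},
    "sp3": {1, 2},
}
ranking = greedy_select(evidence, budget=3)
```

The order in which regions are added gives a ranking of subregions by marginal contribution, which is the kind of output an insertion-style fidelity metric would consume.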
Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection: DINO Teacher proposes replacing the traditional EMA teacher in the Mean Teacher framework with a frozen self-supervised DINOv2 foundation model. This acts as both a more accurate pseudo-label generator and a proxy target for feature alignment, achieving SOTA performance on multiple domain adaptive object detection benchmarks (+7.6% on BDD100k).
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism: MI-DETR proposes a parallel Multi-time Inquiries (MI) mechanism to replace the traditional cascaded decoder architecture of DETR. This allows object queries to learn multi-modal information from image features in parallel through multiple parameter-independent inquiry heads. Combined with U-like Feature Interaction (UFI), it achieves 52.7 AP on COCO with a ResNet-50 backbone, outperforming all existing DETR variants.
Mr. DETR++: Instructive Multi-Route Training for Detection Transformers with MoE: This work systematically investigates the roles of various components in the DETR decoder within a joint one-to-one/one-to-many multi-task framework, and reveals that transitioning any single component to be independent can effectively coordinate the two objectives. Based on this, instructive multi-route training is proposed (Instructive Self-Attention + Independent FFN + Route-Aware MoE), which discards auxiliary routes during inference, incurring zero extra cost.
MulSen-AD: Multi-Sensor Object Anomaly Detection: The first multi-sensor anomaly detection dataset MulSen-AD is proposed, integrating RGB camera, infrared lock-in thermography, and laser scanning modalities, alongside a baseline method MulSen-TripleAD, achieving 96.1% AUROC in object-level anomaly detection through decision-level fusion.
Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties: This work proposes MulSen-AD, the first industrial object anomaly detection dataset that integrates three sensors: RGB cameras, laser scanners, and infrared thermography (covering 15 product categories and 14 types of anomalies). Additionally, a decision-level fusion baseline method, MulSen-TripleAD, is designed, achieving a 96.1% AUROC and demonstrating that multi-sensor fusion significantly outperforms single-sensor approaches.
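Decision-level fusion, as named in the two MulSen-AD entries above, combines per-sensor anomaly scores after each sensor has made its own prediction. The sketch below assumes one simple rule (min-max normalization per sensor, then averaging); the actual rule used by MulSen-TripleAD is not stated here, and the function names and scores are hypothetical.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_decisions(per_sensor_scores):
    """Average normalized object-level anomaly scores across sensors."""
    normalized = [min_max(scores) for scores in per_sensor_scores.values()]
    n_objects = len(normalized[0])
    return [
        sum(sensor[i] for sensor in normalized) / len(normalized)
        for i in range(n_objects)
    ]


# Hypothetical raw anomaly scores for three objects from each sensor.
scores = {
    "rgb": [0.2, 0.4, 0.6],
    "infrared": [10.0, 30.0, 20.0],
    "laser": [1.0, 1.0, 3.0],
}
fused = fuse_decisions(scores)
most_anomalous = max(range(len(fused)), key=fused.__getitem__)
```

Normalizing before averaging keeps a sensor with a larger raw score range from dominating the fused decision.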
Multiple Object Tracking as ID Prediction: This paper proposes MOTIP, which reformulates the object association problem in multiple object tracking (MOT) as an in-context ID prediction task. Given historical trajectories with ID embeddings, a standard Transformer decoder directly predicts the ID labels of the current detections without relying on heuristic matching algorithms, achieving a HOTA of 69.6 on DanceTrack and significantly outperforming the previous SOTA CO-MOT (65.3).
Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset: This paper proposes MvHeat-DET, which models visual features as a 2D heat diffusion process and dynamically routes among three frequency domain transforms (DFT/DCT/Haar) using MoE. Combined with IoU-aware query selection, it performs object detection on event streams. Additionally, the paper releases a high-definition event camera detection dataset, EvDET200K (10,054 videos / 200K bboxes / 10 classes).
Odd-One-Out: Anomaly Detection by Comparing with Neighbors: OddOneOutAD formalizes the task of "finding anomalies among a group of similar products" in industrial quality inspection as scene-level anomaly detection. It constructs object representations in 3D voxel space using sparse 5-view images, obtains part-aware features through DINOv2 knowledge distillation and differentiable rendering, and compares similarities among instances using cross-instance sparse voxel attention to identify whether each instance is anomalous. Additionally, it contributes two new benchmarks: ToysAD-8K and PartsAD-15K.
One-for-More: Continual Diffusion Model for Anomaly Detection: The CDAD framework is proposed to achieve stable continual learning for diffusion models via gradient projection. Supported by iterative SVD (iSVD), the memory consumption is reduced from 157GB to 17GB. Additionally, an anomaly-masked network is designed to enhance the conditioning mechanism, achieving first place in 17 out of 18 settings across MVTec and VisA.
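Gradient projection, as used for continual learning in the CDAD entry above, can be sketched generically: the update for a new task is stripped of its components along directions that matter to earlier tasks. The sketch assumes an orthonormal basis for those directions is already available (the summary attributes this to iSVD); `project_out` and the vectors are hypothetical, not the paper's code.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, basis):
    """Remove from grad its components along each orthonormal basis vector."""
    projected = list(grad)
    for u in basis:
        coeff = dot(grad, u)
        projected = [p - coeff * ui for p, ui in zip(projected, u)]
    return projected


# Hypothetical 3-D example: earlier tasks "own" the x-axis direction.
old_task_basis = [[1.0, 0.0, 0.0]]
new_grad = [0.5, -2.0, 1.0]
safe_grad = project_out(new_grad, old_task_basis)
```

The projected gradient is orthogonal to every protected direction, so a step along it leaves the earlier tasks' subspace unchanged to first order.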
PO3AD: Predicting Point Offsets toward Better 3D Point Cloud Anomaly Detection: PO3AD proposes to learn normal point cloud representations by predicting the offset vectors of pseudo-anomaly points (rather than reconstructing the entire point cloud), thereby focusing the model's attention on anomalous regions. Combined with a normal-guided pseudo-anomaly generation method (Norm-AS), it improves the detection AUC-ROC by 9.0% and 1.4% on Anomaly-ShapeNet and Real3D-AD, respectively, compared to existing methods.
ProbPose: A Probabilistic Approach to 2D Human Pose Estimation: ProbPose proposes replacing traditional heatmaps with calibrated probability maps for 2D human keypoint localization. It introduces presence probability to explicitly model whether keypoints are within the activation window. Through crop data augmentation and expected risk minimization of the OKS loss, it significantly improves the localization capability of out-of-image keypoints and the quality of the model's probability calibration.
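To see why minimizing expected OKS risk can differ from taking the most likely location, consider a minimal sketch. It uses the standard per-keypoint OKS form, exp(-d² / (2 s² k²)), and picks the candidate with the highest expected OKS under a discrete probability map. The map, `scale`, and `k` values are made up for illustration, and this is not claimed to be ProbPose's exact procedure.

```python
import math


def oks(pred, true, scale, k):
    """Object Keypoint Similarity for one keypoint: exp(-d^2 / (2 s^2 k^2))."""
    d2 = (pred[0] - true[0]) ** 2 + (pred[1] - true[1]) ** 2
    return math.exp(-d2 / (2 * scale ** 2 * k ** 2))


def max_expected_oks(prob_map, scale, k):
    """Return the candidate location with the highest expected OKS."""
    def expected(candidate):
        return sum(p * oks(candidate, loc, scale, k) for loc, p in prob_map.items())
    return max(prob_map, key=expected)


# Hypothetical probability map: one isolated peak and two adjacent cells.
prob_map = {(0, 0): 0.4, (2, 0): 0.3, (3, 0): 0.3}
most_likely = max(prob_map, key=prob_map.get)
best = max_expected_oks(prob_map, scale=1.0, k=1.0)
```

Here the single most probable cell is `(0, 0)`, but the two neighboring cells jointly carry more mass, so the expected-OKS choice moves to `(2, 0)`.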
ROICtrl: Boosting Instance Control for Visual Generation: Inspired by ROI-Align in object detection, ROICtrl proposes a complementary operation, ROI-Unpool, to achieve efficient and precise ROI feature recovery. It constructs a diffusion model adapter compatible with community fine-tuned models and existing spatial/embedded plugins, achieving SOTA performance in multi-instance regional control generation while drastically reducing computational costs.
RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark: This paper revisits angle decoders in rotated object detection from a unified perspective of dimensional mapping, reveals the prediction bias caused by ignoring the unit circle constraint in existing methods, proposes the Unit Cycle Resolver (UCR), and leverages UCR to construct RSAR, currently the largest multi-class rotated SAR object detection dataset.
Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval: SearchDet proposes a completely training-free long-tail object detection framework. By retrieving positive and negative sample images from the web, generating attention-weighted queries, and performing joint localization with SAM region proposals and similarity heatmaps, SearchDet improves mAP by 48.7% on ODinW and 59.1% on LVIS compared to GroundingDINO, showcasing the immense potential of leveraging the Web as an external dynamic memory for inference-time augmentation.
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos: This paper proposes the "Show, Don't Tell" paradigm, which automatically creates training datasets from human manipulation demonstration videos and uses them to train specialized detectors for novel objects. It bypasses the reliance on language descriptions or prompt engineering in traditional methods, significantly improving the detection and recognition of manipulated objects on real-world robotic systems.
SimLTD: Simple Supervised and Semi-Supervised Long-Tailed Object Detection: SimLTD proposes a simple and intuitive three-stage framework—pre-training on head classes, transferring to tail classes, and fine-tuning on hybrid-sampled data—optionally integrated with semi-supervised learning on unlabeled images, which comprehensively outperforms existing methods that rely on ImageNet labels on the LVIS v1 benchmark.
Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images: This paper proposes ESM-YOLO+, a lightweight visible-infrared fusion network. By incorporating the MEAF module (pixel-level fusion with learnable spatial masks and spatial attention) and structural representation enhancement (SR, a super-resolution auxiliary supervision with zero inference overhead), it achieves 84.71% mAP on VEDAI with only 5.1M parameters (a 93.6% reduction).
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting: This paper proposes T2ICount, which leverages one-step denoising features from pre-trained text-to-image diffusion models for zero-shot object counting. It addresses the lack of text sensitivity in one-step denoising through a Hierarchical Semantic Correction Module (HSCM) and a Representational Regional Coherence loss (\(\mathcal{L}_{RRC}\)).
TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection: TailedCore addresses a realistic scenario in unsupervised anomaly detection where normal samples contain noisy defects and concurrently follow an unknown long-tail class distribution. It proposes TailSampler to predict class cardinality based on the symmetry assumption of embedding similarity, allowing for the independent sampling of tail-class specimens. This constructs a memory bank model that captures tail-class information while remaining robust to noise, outperforming SOTA in various settings.
Test-Time Backdoor Detection for Object Detection Models: TRACE (TRAnsformation Consistency Evaluation) proposes the first test-time backdoor sample detection method for object detection models. Based on two key observations—that poisoned samples yield more consistent detection results across different backgrounds, and clean samples are more consistent under different focus information—it detects poisoned samples by calculating the variance of object confidence after applying transformations to the foreground and background, achieving black-box general-purpose detection and improving AUROC by 30% compared to the SOTA.
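The two observations in the TRACE entry above suggest a simple variance-based test, sketched below. It assumes the detector's confidence for one object has already been collected under background swaps and under focus changes. Combining the two variances by subtraction, and the example confidences, are illustrative choices rather than the paper's scoring rule.

```python
from statistics import pvariance


def consistency_score(conf_background_swaps, conf_focus_changes):
    """Higher score = more backdoor-like under the two stated observations.

    Poisoned samples are said to stay stable when the background changes
    (low variance) and clean samples to stay stable when focus changes.
    Subtracting the two variances is an illustrative choice only.
    """
    return pvariance(conf_focus_changes) - pvariance(conf_background_swaps)


# Hypothetical detector confidences for one object under each transformation.
suspect = consistency_score([0.97, 0.96, 0.98], [0.95, 0.40, 0.10])
benign = consistency_score([0.90, 0.35, 0.60], [0.88, 0.86, 0.90])
```

Because the score only needs output confidences, a check of this kind works in a black-box setting, which matches the summary's description.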
TornadoNet: Real-Time Building Damage Detection with Ordinal Supervision: TornadoNet builds the first systematic benchmark for post-tornado street-view building damage assessment. By comparing the performance of the YOLO series (CNNs) and RT-DETR (Transformers) on a 5-level damage detection task and proposing an ordinal-aware supervision strategy, it improves the mAP@0.5 of RT-DETR by 4.8 percentage points. This demonstrates the effectiveness of incorporating the ordinal nature of damage severity into the loss function design.
Towards RAW Object Detection in Diverse Conditions: This paper proposes the AODRaw dataset (7,785 high-resolution real-world RAW images, 62 categories, 9 illumination/weather conditions) along with a RAW-domain pre-training and cross-domain distillation scheme, achieving superior RAW object detection performance under diverse adverse conditions without requiring an ISP module.
Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models: Presents Anomaly-OV, the first MLLM dedicated to zero-shot anomaly detection and reasoning. It generates anomaly saliency maps through a Look-Twice Feature Matching mechanism coupled with a visual token selector to focus on suspicious regions, achieving SOTA zero-shot anomaly detection with an average AUROC of 88.6% across 9 benchmarks.
UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection: This paper proposes UniVAD, a training-free unified few-shot visual anomaly detection method. Through the Contextual Component Clustering (C3) module, it achieves precise component segmentation. Combined with component-aware patch matching and graph-enhanced component modeling, it achieves state-of-the-art anomaly detection across industrial, logical, and medical domains using only a few normal samples.
Unseen Visual Anomaly Generation: This paper proposes the AnomalyAny framework, which leverages the generative capability of pre-trained Stable Diffusion. By utilizing attention-guided optimization and prompt-guided refinement, it generates diverse and realistic unseen anomaly samples under the condition of requiring only a single normal sample and no additional training.
VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos: VCBench repositions counting as a minimal probe to diagnose the "spatial-temporal state maintenance" capability of video models. It proposes 8 subcategories covering object counting (current state/identity tracking) and event counting (instantaneous events/periodic activities). By observing model prediction trajectories through streaming multi-point queries along the timeline, mainstream models are evaluated on 406 videos and 4,576 query points, revealing that current models still exhibit significant deficiencies in spatial-temporal state maintenance.