🎯 Object Detection
🔬 ICLR2026 · 31 paper notes
📌 Same area in other venues: 📷 CVPR2026 (97) · 🧪 ICML2026 (6) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (27) · 📹 ICCV2025 (28)
🔥 Top topics: Object Detection ×11 · Anomaly Detection ×7 · Time-Series Forecasting ×4 · Few-/Zero-Shot Learning ×3 · Diffusion Models ×2
- APT: Towards Universal Scene Graph Generation via Plug-in Adaptive Prompt Tuning
-
APT replaces the long-standing "frozen word vector semantic prior" in Scene Graph Generation (SGG) with a set of lightweight learnable prompts. It dynamically modulates static semantic features into representations dependent on visual context. As a plug-and-play module, it can be integrated into any one-stage, two-stage, or open-vocabulary SGG framework, achieving comprehensive performance gains with <0.5M parameters and shorter training times.
- Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
-
WS-COC is the first framework to utilize Multimodal Large Language Models (MLLM) for weakly-supervised class-agnostic object counting. Using only image-level total counts for supervision, it activates the counting capabilities of MLLMs through three simple strategies: "Binary Dialogue Tuning + Comparative Ranking Optimization + Global-Local Fusion." It approaches or even surpasses some fully-supervised methods using point-level supervision on four datasets, including FSC-147.
- CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection
-
This work introduces Object-Centric Learning (Slot Attention) to Source-Free Domain Adaptive Object Detection (SF-DAOD) for the first time. By extracting domain-invariant object-level structural priors through a Hierarchical Slot Awareness module and driving domain-invariant representations with class-guided contrastive learning, the method significantly outperforms existing approaches across multiple cross-domain benchmarks.
- CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
-
Linear probing experiments show that CLIP's Bag-of-Words (BoW) behavior originates from a cross-modal alignment failure rather than from a lack of binding information in the encoders. The paper proposes LABCLIP, which substantially restores attribute-object binding by training only a lightweight linear transformation.
- Complexity- and Statistics-Guided Anomaly Detection in Time Series Foundation Models
-
When Time Series Foundation Models (TFMs, such as MOMENT) are applied to reconstruction-based anomaly detection, they fail due to "overgeneralization" (reconstructing anomalies too well) and "over-stationarization" (Instance Normalization removing mean and variance). This paper introduces a complexity metric \(\alpha\) derived from the difference between reconstruction and imputation errors to adaptively ensemble TFMs with lightweight statistical models (CAE), and re-injects mean and variance into the decoding stage (MOMENT-Stat). It improves VUS-PR from the previous SOTA of 0.4233 to 0.4679 across 23 univariate and 17 multivariate benchmarks.
- Contextual and Seasonal LSTMs for Time Series Anomaly Detection
-
Aiming at "minor point anomalies" and "slowly rising anomalies" that are difficult for existing methods to detect in univariate time series, this paper proposes the CS-LSTMs dual-branch architecture. S-LSTM models periodic evolution in the frequency domain, while C-LSTM captures local trends in the time domain. Combined with a wavelet noise decomposition strategy, it outperforms SOTA on four benchmarks with a 40% increase in inference speed.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
-
DeCo-DETR removes "online text encoder invocation" from inference and decouples the "competition between localization and alignment" in open-vocabulary detection. It employs an LVLM to distill, offline, a reusable hierarchical semantic prototype pool that substitutes for the text encoder during inference, and it uses dual-stream gradient isolation to separate localization training from semantic alignment training. This approach gains 3.1–5.8 points on OV-COCO novel classes while reducing single-image inference latency to 135 ms.
- DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts
-
DETR-ViP attributes the performance gap between "visual prompts and text prompts" to a lack of global discriminability in visual prompts. By expanding negative samples through global prompt integration, reshaping the visual prompt space topology via text-based relationship distillation, and stabilizing inference with selective fusion, it achieves a new SOTA in visual prompt detection across COCO / LVIS / ODinW / Roboflow100 (surpassing T-Rex2-T by +4.4 AP on COCO).
- DiffuDETR: Rethinking Detection Transformers with Denoising Diffusion Process
-
DiffuDETR reformulates object detection as an "object query generation task conditioned on an image and a set of noisy reference points." By using denoising diffusion training, the DETR decoder learns to gradually denoise query reference points from Gaussian noise into precise object locations. It consistently outperforms baselines such as Deformable DETR and DINO on COCO, LVIS, and V3Det, while adding negligible computational overhead during inference as it only requires a few extra decoder passes.
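The forward noising step that this kind of training relies on can be sketched with the standard DDPM formulation, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is a minimal sketch and not DiffuDETR's implementation: the function name `noisy_reference_points`, the normalized 2D points, and the `alpha_bar_t` values are assumptions made for illustration.

```python
import math
import random


def noisy_reference_points(points, alpha_bar_t, rng):
    """Apply one standard DDPM forward-noising step to 2D reference points.

    points      : list of (x, y) ground-truth centers (assumed normalized to [0, 1])
    alpha_bar_t : cumulative signal level in [0, 1]; 1.0 = clean, 0.0 = pure noise
    rng         : random.Random instance, passed in for reproducibility
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [
        (signal * x + noise * rng.gauss(0.0, 1.0),
         signal * y + noise * rng.gauss(0.0, 1.0))
        for x, y in points
    ]


# Hypothetical ground-truth centers for two objects.
centers = [(0.25, 0.40), (0.70, 0.55)]
clean = noisy_reference_points(centers, alpha_bar_t=1.0, rng=random.Random(0))
pure_noise = noisy_reference_points(centers, alpha_bar_t=0.0, rng=random.Random(0))
```

At `alpha_bar_t=1.0` the points are returned unchanged, and at `alpha_bar_t=0.0` they are replaced by Gaussian noise; a denoising decoder would be trained to move from the second regime back toward the first.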
- Dual Distillation for Few-Shot Anomaly Detection
-
This paper proposes D24FAD, a dual-distillation framework that combines Teacher-Student Distillation (TSD) on query images with Student Self-Distillation (SSD) on support images, supplemented by a Learn-to-Weight (L2W) mechanism for adaptive support evaluation. It achieves 100% AUROC in the 2-shot setting on the APTOS fundus dataset.
- Enhancing Vision Transformers for Object Detection via Context-Aware Token Selection and Packing
-
The paper proposes Select and Pack Attention (SPA): it uses a lightweight gating layer supervised by dynamic multi-scale object labels to select informative tokens for each image, then packs varying numbers of tokens into fixed-length containers to restore batch parallelism. On object detection, this improves AP by 0.5 to 2.7 points and reduces computational cost by 10.9% to 24.9%.
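The select-then-pack idea can be illustrated with a small sketch. Everything here is an assumption for illustration: the threshold-based `select_tokens`, the sequential `pack_tokens` strategy, and the `(image_id, token_index)` tagging are hypothetical stand-ins, and the paper's actual gating and packing may differ.

```python
def select_tokens(gate_scores, threshold=0.5):
    """Keep the indices of tokens whose gate score exceeds the threshold."""
    return [i for i, score in enumerate(gate_scores) if score > threshold]


def pack_tokens(selected_per_image, container_len):
    """Pack variable-length token lists into fixed-length containers.

    Each slot holds (image_id, token_index) so that an attention mask could
    later keep tokens from different images apart; unused slots are None.
    """
    containers, current = [], []
    for image_id, token_ids in enumerate(selected_per_image):
        for token_index in token_ids:
            if len(current) == container_len:
                containers.append(current)
                current = []
            current.append((image_id, token_index))
    if current:
        # Pad the last container so every container has the same length.
        current += [None] * (container_len - len(current))
        containers.append(current)
    return containers


# Hypothetical gate scores for two images with 3 and 4 tokens.
scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.6, 0.95]]
selected = [select_tokens(s) for s in scores]
packed = pack_tokens(selected, container_len=4)
```

Because every container has the same length, the packed batch can be processed in parallel even though each image contributes a different number of tokens.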
- Fantastic Tractor-Dogs and How Not to Find Them With Open-Vocabulary Detectors
-
This paper reveals that early-fusion open-vocabulary detectors produce a large number of high-confidence false positives on background images that "do not contain the target object" (e.g., confidently framing a "tractor" in a photo of a Golden Retriever). The root cause is identified as the inability of cross-modal attention in the vision-language fusion layer to select "nothing." A training-free solution is proposed: appending several semantically neutral "attention sink" tokens to the prompt to absorb displaced attention, thereby nearly eliminating background false positives.
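Why extra tokens help can be seen from softmax normalization alone: attention weights must sum to one, so with a single prompt token all attention lands on it no matter how weak the match. The sketch below is a toy illustration with hypothetical logits, not the paper's fusion layer; `prompt_attention_mass` and the score values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_attention_mass(prompt_scores, sink_scores=()):
    """Fraction of one image patch's attention that lands on real prompt tokens."""
    weights = softmax(list(prompt_scores) + list(sink_scores))
    return sum(weights[:len(prompt_scores)])


# Hypothetical logit between a background patch and the single prompt token "tractor".
background_patch = [0.2]
no_sink = prompt_attention_mass(background_patch)
with_sink = prompt_attention_mass(background_patch, sink_scores=[1.0, 1.0, 1.0])
```

Without sinks, the background patch is forced to place all of its attention on the prompt token; with three neutral sink tokens in this toy setup, most of that mass moves to the sinks instead.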
- ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection
-
ForestPersons is the first large-scale benchmark dataset specifically designed for detecting missing persons under the forest canopy (96,482 images + 204,078 annotations). By simulating micro aerial vehicle (MAV) low-altitude flight perspectives at 1.5–2.0 meters, it covers realistic search and rescue (SAR) conditions across multiple seasons, weather conditions, poses, and occlusion levels, providing a solid foundation for the training and evaluation of under-canopy person detection models.
- FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
-
This paper proposes a training-free few-shot object detection framework that combines three foundation models—UPN, SAM2, and DINOv2—to generate proposals and matching features. A graph diffusion algorithm is introduced to refine confidence scores and suppress fragmented proposals, significantly outperforming SOTA on Pascal-5i and COCO-20i.
- InfoDet: A Dataset for Infographic Element Detection
-
A large-scale infographic element detection dataset (101,264 infographics, 14.2M annotations) is constructed, covering two major categories: charts and human-recognizable objects. A Grounded CoT method is proposed to leverage detection results to enhance VLM chart understanding capabilities.
- Interference-Isolated Elastic Weight Consolidation and Knowledge Calibration for Incremental Object Detection
-
Addressing the task knowledge conflict caused by "unlabeled past/future targets being treated as background" in Incremental Object Detection (IOD), this paper re-derives the Bayesian posterior of EWC to explicitly subtract interference knowledge (IKI-EWC) from parameter importance. It then retrains the classification head using learnable projection layers to compensate for prototype semantic drift (PKC), consistently outperforming SOTA on VOC/COCO.
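For reference, the standard EWC penalty that the paper starts from is (λ/2)·Σ_i F_i·(θ_i − θ*_i)², where F_i is the Fisher-based importance of parameter i. The sketch below shows only this standard form over flat parameter lists; it does not reproduce the IKI-EWC interference subtraction, and the names `ewc_penalty` and `lam` are chosen here for illustration.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Standard EWC regularizer: 0.5 * lam * sum_i F_i * (theta_i - theta_star_i)^2.

    params     : current parameter values
    old_params : parameter values after the previous task (theta_star)
    fisher     : per-parameter importance estimates (diagonal Fisher)
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


# Hypothetical two-parameter model: only the first parameter has moved.
penalty = ewc_penalty(params=[1.0, 2.0], old_params=[0.0, 2.0], fisher=[4.0, 10.0])
```

Parameters with high importance are penalized more for drifting, which is why an importance estimate contaminated by mislabeled background targets would constrain the wrong weights.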
- Long-Context Generalization with Sparse Attention
-
ASEntmax (Adaptive-Scalable Entmax) is proposed, replacing softmax attention with \(\alpha\)-entmax using learnable temperature. The work theoretically and experimentally proves that sparse attention achieves \(1000\times\) length extrapolation, resolving the attention dispersion problem of softmax under long contexts.
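To make the sparsity concrete, the sketch below implements sparsemax, which is the α = 2 special case of α-entmax and has a simple closed-form solution. It is not ASEntmax itself: the learnable temperature and general α are omitted, and the example logits are hypothetical.

```python
def sparsemax(logits):
    """Sparsemax: Euclidean projection of the logits onto the probability simplex.

    Unlike softmax, low-scoring entries receive exactly zero probability.
    """
    sorted_logits = sorted(logits, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(sorted_logits, start=1):
        cumulative += value
        # The support consists of the largest k entries satisfying this test.
        if 1.0 + k * value > cumulative:
            threshold = (cumulative - 1.0) / k
    return [max(v - threshold, 0.0) for v in logits]


# Hypothetical attention logits for three keys.
probs = sparsemax([1.0, 0.8, 0.1])
```

Here the third key gets exactly zero weight while the remaining weights still sum to one, whereas softmax would assign every key a strictly positive weight, and that residual mass is what spreads out as the context grows.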
- OD3: Optimization-Free Dataset Distillation for Object Detection
-
OD3 extends dataset distillation from image classification to object detection by proposing a completely optimization-free synthesis pipeline. Starting from a blank canvas, it iteratively pastes real objects (candidate selection) and uses a pre-trained observer model to filter out low-confidence objects (candidate screening). Combined with channel-level soft labels to train student detectors, OD3 achieves an mAP50 14.8% higher than DCOD, the only previous detection distillation method, at a 1% compression rate on COCO.
- OVID: Open-Vocabulary Intrusion Detection
-
This paper proposes the "Open-Vocabulary Intrusion Detection (OVID)" task for the first time, constructs the Cityintrusion-OpenV dataset with 8 intrusion categories, and designs an end-to-end multi-modal framework, OVIDNet. By leveraging text-image feature alignment to identify intrusion categories unseen during training and incorporating two plug-and-play strategies (multi-distribution noise mixing and dynamic memory gating) to enhance generalization, OVIDNet outperforms strong baselines like OpenSeeD in zero-shot and task transfer settings.
- OwlEye: Zero-Shot Learner for Cross-Domain Graph Data Anomaly Detection
-
This paper proposes the OwlEye framework, which uses cross-domain feature alignment based on pairwise distance statistics to map heterogeneous graph embeddings into a shared space. It extracts attribute-level and structure-level normal patterns from multiple graphs into an extensible dictionary and detects anomalous nodes in unseen graphs under strictly zero-shot conditions through a truncated attention reconstruction mechanism. It achieves an average AUPRC of 36.17% across 8 datasets, surpassing the strongest baseline, ARC, by approximately 5.4 percentage points.
- PAANO: Patch-Based Representation Learning for Time-Series Anomaly Detection
-
This paper proposes PaAno, a lightweight time-series anomaly detection method based on patch-level representation learning. It uses a 1D-CNN encoder trained with a triplet loss and a pretext loss to learn a patch embedding space. Anomaly scores are computed from the distance to normal patches stored in a memory bank. It achieves comprehensive SOTA results on the TSB-AD benchmark with only 0.3M parameters and inference times on the order of seconds.
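The memory-bank scoring step can be sketched as a nearest-neighbor distance in embedding space. This is a minimal sketch under stated assumptions: embeddings are given directly instead of coming from a trained 1D-CNN, and the function name `anomaly_score`, the Euclidean metric, and the `k` parameter are hypothetical choices, not details confirmed by the paper.

```python
import math


def anomaly_score(embedding, memory_bank, k=1):
    """Mean Euclidean distance from a patch embedding to its k nearest normal patches."""
    distances = sorted(math.dist(embedding, normal) for normal in memory_bank)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Hypothetical 2D embeddings of normal patches stored in the memory bank.
bank = [(0.0, 0.0), (1.0, 0.0)]
normal_score = anomaly_score((0.0, 0.0), bank)
anomalous_score = anomaly_score((0.0, 3.0), bank)
```

A patch that coincides with a stored normal patch scores zero, while one far from every stored patch receives a large score.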
- PGRF-Net: A Prototype-Guided Relational Fusion Network for Diagnostic Multivariate Time-Series Anomaly Detection
-
To be supplemented after in-depth paper reading.
- Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization
-
Aiming at the weakly supervised task of "training rotated object detectors with only a single point annotation," this paper proposes Point2RBox-v3. It utilizes Progressive Label Assignment (PLA) to feed scale information from pseudo-labels into multi-level FPN label assignment and adopts Prior-Guided Dynamic Mask Loss (PGDM-Loss) to use SAM for sparse scenes and Watershed for dense scenes. It achieves a new SOTA on six remote sensing benchmarks including DOTA-v1.0 (66.09% for the two-stage version).
- Retain and Adapt: Auto-Balanced Model Editing for Open-Vocabulary Object Detection under Domain Shifts
-
This work introduces "model editing" to Open-Vocabulary Object Detection (OVOD) for the first time. By fine-tuning only the FFN output projection layers and storing compact KV covariance statistics, the method utilizes a data-adaptive diagonal matrix \(\Gamma\) to replace the manually tuned hyperparameter \(\lambda\). This approach automatically balances "retaining pre-trained capabilities" and "adapting to new domains"—achieving an Adaptation Gain Ratio (AGR) of approximately 95–99% across 19 cross-domain few-shot tasks while retaining 94–98% of original COCO performance. Furthermore, tasks can be added or removed in any order without retraining.
- RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
-
RF-DETR utilizes DINOv2 internet-scale pre-training combined with end-to-end weight-sharing NAS to train a "supernet." This approach allows a single training session to export an entire accuracy-latency Pareto curve via grid search without retraining. It is the first real-time detector to exceed 60 AP on COCO and outperforms GroundingDINO on the real-world dataset RF100-VL with a 20x speedup.
- Self-Guided Low Light Object Detection Framework
-
This paper proposes SGLDet: during training, a detachable enhancement-denoising-Fourier fusion auxiliary branch is attached to a standard detector. It generates pixel-level supervision from the low-light images themselves to strengthen backbone representations. Since the auxiliary branch is removed during testing, it significantly improves performance on DARK FACE, ExDark, and nuImages night detection without increasing inference overhead.
- SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection
-
SPWOOD is proposed as the first unified framework for oriented object detection under "sparse annotation + weak annotation (HBox/Point)". It uses SOS-Student to handle three learning signals in parallel (unlabeled, missing angle, and missing scale) within a single student model, then incorporates Multi-level Pseudo-labels Filtering (MPF) for self-training from unlabeled data. It achieves performance close to full supervision on DOTA-v1.0/v1.5 and DIOR using mixed annotations (RBox:HBox:Point=1:1:1).
- Towards Anomaly-Aware Pre-Training and Fine-Tuning for Graph Anomaly Detection
-
The APF framework is proposed to address the dual challenges of label scarcity and homophily variation in graph anomaly detection through Rayleigh quotient-guided anomaly-aware pre-training and fine-grained adaptive fine-tuning.
- Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
-
Object detection is reformulated as a "Conditional Marked Poisson Point Process" (CMPPP), where object centers are points and dimensions/classes are marks. Trained end-to-end via maximum likelihood, the model provides well-calibrated probability estimates for "whether a specific region is truly free of obstacles (passable)" while maintaining detection accuracy comparable to standard detectors.
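The link between a Poisson point process and "empty space" is the void probability: for intensity λ, the probability that a region A contains no points is exp(−∫_A λ). The sketch below evaluates this on a discretized grid. The grid values, cell area, and function name `empty_probability` are assumptions for illustration; in the paper the intensity would be predicted by the model conditioned on the image.

```python
import math


def empty_probability(intensity_grid, cell_area, region):
    """Void probability of a Poisson point process over a set of grid cells.

    intensity_grid : 2D list of intensity values (expected objects per unit area)
    cell_area      : area covered by each grid cell
    region         : iterable of (row, col) cells making up the region of interest
    """
    expected_count = sum(intensity_grid[r][c] for r, c in region) * cell_area
    return math.exp(-expected_count)


# Hypothetical intensity map: objects are likely only in the right-hand column.
grid = [[0.0, 0.5],
        [0.0, 2.0]]
left_column_free = empty_probability(grid, cell_area=1.0, region=[(0, 0), (1, 0)])
top_right_free = empty_probability(grid, cell_area=1.0, region=[(0, 1)])
```

Regions with zero intensity are certainly empty, and the probability of emptiness decays exponentially with the expected object count inside the region.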
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Method
-
This paper proposes TreeBench (the first traceable visual reasoning benchmark, consisting of 405 highly challenging VQA tasks where OpenAI-o3 achieves only 54.87%) and TreeVGR (a training paradigm that jointly supervises grounding and reasoning via reinforcement learning with Dual IoU rewards). The TreeVGR-7B model achieves gains of +16.8 on V*Bench, +12.6 on MME-RealWorld, and +13.4 on TreeBench, demonstrating that traceability is crucial for advancing visual reasoning.
- Unbiased Object Detection Beyond Frequency with Visually Prompted Image Synthesis
-
To address category, size, and location biases in object detection training data, this paper proposes a "Diagnosis-Synthesis" debiasing framework. It identifies truly under-represented data groups using a Representation Score (RS) that goes beyond frequency. It recalibrates layouts based on RS and synthesizes high-fidelity samples using a Visual Blueprint (color rectangle pixel conditions) combined with Dual Generative Alignment. This approach improves rare classes by 3.6 mAP and large objects by 4.4 mAP, achieving a layout accuracy 15.9 mAP higher than the previous L2I SOTA.