
🎯 Object Detection

📹 ICCV2025 · 30 paper notes

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

This paper proposes 3D-MOOD, the first end-to-end monocular open-set 3D object detector, which lifts open-set 2D detections into 3D space via geometry-aware 3D query generation and a canonical image space design, achieving state-of-the-art performance on both the Omni3D closed-set benchmark and the Argoverse 2 / ScanNet open-set benchmarks.

Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

This paper proposes tgGBC (trim keys gradually Guided By Classification scores), a zero-shot runtime pruning method that computes key importance by element-wise multiplication of classification scores and attention maps, progressively pruning unimportant keys across layers. It achieves nearly 2× acceleration of the Transformer decoder on multiple 3D detectors with less than 1% performance degradation.
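The key-scoring idea can be sketched as follows. This is a minimal illustration assuming per-query max class confidences and a dense cross-attention map; the aggregation over queries and the paper's layer-wise progressive pruning schedule are simplifications here.

```python
import numpy as np

def key_importance(cls_scores, attn_map):
    """Score each decoder key by element-wise multiplying per-query
    classification scores with the cross-attention map, then summing
    over queries (hypothetical reduction; tgGBC's exact one may differ)."""
    # cls_scores: (num_queries,) max class confidence per query
    # attn_map:   (num_queries, num_keys) attention weights
    weighted = cls_scores[:, None] * attn_map   # (num_queries, num_keys)
    return weighted.sum(axis=0)                 # (num_keys,)

def prune_keys(keys, importance, keep_ratio):
    """Keep only the top keep_ratio fraction of keys by importance,
    preserving their original order."""
    k = max(1, int(len(keys) * keep_ratio))
    idx = np.argsort(importance)[::-1][:k]
    return keys[np.sort(idx)]

rng = np.random.default_rng(0)
cls_scores = rng.random(4)           # 4 queries
attn = rng.random((4, 10))           # attention over 10 keys
keys = rng.random((10, 8))           # 10 key features of dim 8
imp = key_importance(cls_scores, attn)
pruned = prune_keys(keys, imp, keep_ratio=0.5)
print(pruned.shape)  # (5, 8)
```

Because the scoring needs no gradients or retraining, it can run at inference time on any decoder layer, which is what makes the method zero-shot.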

Adversarial Attention Perturbations for Large Object Detection Transformers

This paper proposes AFOG (Attention-Focused Offensive Gradient), an architecture-agnostic adversarial attack method that leverages a learnable attention mechanism to concentrate perturbations on vulnerable image regions. With only 10 iterations and visually imperceptible perturbations, AFOG cuts the mAP of 12 detection Transformers by a factor of up to 37.8, while also outperforming existing methods on CNN-based detectors.

Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

This paper proposes the AMR framework, which leverages a Splice-and-Boost data augmentation strategy and a cold-start–distillation two-stage training pipeline to substantially improve boundary awareness and semantic discriminability in video moment retrieval—without relying on any external data or pretrained models—surpassing the previous SOTA by +5% on QVHighlights.

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

This paper proposes PCR (Prediction Consistency and Reliability), an automated evaluation method that estimates object detection model performance without human annotations. PCR analyzes the spatial consistency and confidence reliability of bounding boxes before and after NMS to estimate mAP, and constructs a corruption-based meta-dataset for more realistic and scalable evaluation.
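The before/after-NMS comparison at the heart of PCR can be illustrated with a toy score. Everything below is a hypothetical proxy, not the paper's actual estimator: it scores each surviving box by how tightly the raw detections cluster around it (spatial consistency) times its confidence (reliability).

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def pcr_score(pre_boxes, post_boxes, post_scores):
    """Toy consistency-and-reliability proxy: each post-NMS box is
    scored by the mean IoU with the pre-NMS boxes (spatial consistency)
    times its confidence (reliability). PCR's actual formulation is
    more involved; this only illustrates the comparison idea."""
    per_box = [np.mean([iou(b, p) for p in pre_boxes]) * s
               for b, s in zip(post_boxes, post_scores)]
    return float(np.mean(per_box))

pre = [[0, 0, 10, 10], [1, 1, 11, 11], [0, 0, 10, 10]]   # raw detections
post, post_s = [[0, 0, 10, 10]], [0.9]                   # NMS survivors
print(round(pcr_score(pre, post, post_s), 3))  # 0.804
```

The intuition is that a well-trained detector produces redundant pre-NMS boxes that agree tightly with the kept box, so higher consistency correlates with higher true mAP.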

Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction

This paper proposes SGCDet, a framework for efficient and accurate multi-view indoor 3D object detection that requires no ground-truth geometric supervision. It pairs a geometry- and context-aware aggregation module (3D deformable attention with multi-view attention fusion for adaptive feature lifting) with an occupancy-probability-based sparse voxel construction strategy (coarse-to-fine adaptive voxel selection), achieving state-of-the-art performance while substantially reducing computational overhead.

Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

This paper leverages the image guidance strength of diffusion models to generate a continuous synthetic-to-real spectrum of data, and proposes a Diffusion Curriculum Learning (DisCL) strategy that adaptively selects synthetic data at optimal guidance levels across different training stages, effectively addressing long-tail classification and low-quality data learning challenges.

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

DISTIL proposes a data-free trojan trigger inversion method that searches for trigger patterns in the latent space of a pretrained guided diffusion model—rather than in pixel space—and injects uniform noise regularization at each step to effectively distinguish genuine backdoor triggers from adversarial perturbations, achieving up to 7.1% accuracy improvement on BackdoorBench.

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

This work is the first to introduce Mixture of Experts into real-time open-vocabulary object detectors. Through MoE-Tuning, it extends Grounding DINO 1.5 Edge from a dense model into a dynamic inference framework, proposing fine-grained expert decomposition and a pretrained weight allocation strategy. Trained on only 1.56M open-source samples, the resulting model surpasses the original version trained on 20M private samples.
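The routing mechanism underlying any MoE layer can be sketched generically. This is purely illustrative of top-k expert routing; Dynamic-DINO's fine-grained expert decomposition and weight allocation are more elaborate, and all names below are hypothetical.

```python
import numpy as np

def moe_forward(x, experts, router_w, k=2):
    """Generic top-k mixture-of-experts layer: a linear router scores
    every expert, only the top-k run, and their outputs are mixed with
    softmax gates. Sparse activation is what keeps inference cheap."""
    logits = x @ router_w                      # (num_experts,)
    top = np.argsort(logits)[::-1][:k]         # indices of top-k experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                               # softmax gates over top-k
    return sum(w * experts[i](x) for w, i in zip(g, top))

experts = [lambda v: 2 * v, lambda v: 3 * v, lambda v: -v, lambda v: v]
router_w = np.eye(4)                           # toy router weights
x = np.array([1.0, 0.5, 0.0, 0.0])
out = moe_forward(x, experts, router_w)        # router picks experts 0 and 1
print(round(out[0], 4))  # 2.3775
```

Only k of the experts execute per input, so capacity grows with the expert count while per-token compute stays roughly constant.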

EA-KD: Entropy-based Adaptive Knowledge Distillation

This paper proposes EA-KD, a plug-and-play knowledge distillation method based on information entropy. It dynamically reweights distillation losses by combining the entropy values of teacher and student outputs, prioritizing learning from high-entropy (high-information) samples. EA-KD consistently improves multiple KD frameworks across image classification, object detection, and LLM distillation tasks with negligible computational overhead.
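The entropy-based reweighting can be sketched as below. This is a simplified reading of EA-KD's sample weighting, with an assumed normalization; the paper's exact combination of teacher and student entropies may differ.

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def ea_kd_loss(teacher_logits, student_logits, temp=4.0):
    """Per-sample KL distillation loss reweighted by the combined
    teacher + student output entropy, so high-entropy
    (high-information) samples contribute more to the gradient."""
    pt = softmax(teacher_logits, temp)
    ps = softmax(student_logits, temp)
    kl = (pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum(axis=-1)
    w = entropy(pt) + entropy(ps)
    w = w / w.mean()              # keep the average loss scale unchanged
    return float((w * kl).mean())

t = np.array([[2.0, 0.5, 0.1], [0.3, 0.2, 0.1]])
s = np.array([[1.5, 0.7, 0.2], [0.3, 0.2, 0.1]])
print(ea_kd_loss(t, s) >= 0.0)  # True
```

Because the weight depends only on the two output distributions, the scheme plugs into existing KD losses without architectural changes, matching the paper's plug-and-play claim.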

EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision

This paper proposes I2EvDet, a framework that adapts mainstream image detectors to event-based video detection by inserting lightweight RNN temporal modules into the frozen latent space of RT-DETR, achieving state-of-the-art results with gains of +2.3 and +1.4 mAP on the Gen1 and 1Mpx benchmarks, respectively, with minimal architectural modifications.

Intervening in Black Box: Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding

This paper proposes the CBM-HNMU framework, which approximates the reasoning process of a black-box model via a Concept Bottleneck Model (CBM), automatically identifies and corrects harmful concepts, and distills the corrected knowledge back into the black-box model, enabling systematic model intervention and accuracy improvement beyond the sample level.

Large-scale Pre-training for Grounded Video Caption Generation

This paper proposes the GROVE model along with a large-scale automatic annotation pipeline, constructing the HowToGround1M pre-training dataset (1M videos) and the manually annotated iGround dataset (3,513 videos). GROVE jointly performs video caption generation and multi-object spatio-temporal bounding box localization, achieving state-of-the-art results on iGround, VidSTG, ActivityNet-Entities, and other benchmarks.

LMM-Det: Make Large Multimodal Models Excel in Object Detection

This paper proposes LMM-Det, which through systematic analysis identifies low recall as the core bottleneck of large multimodal models (LMMs) in object detection. By applying data distribution adjustment (pseudo-label augmentation) and inference optimization (per-category detection), LMM-Det improves COCO AP from 0.2 to 47.5 without any additional specialized detection modules.

Measuring the Impact of Rotation Equivariance on Aerial Object Detection

This paper proposes MessDet, a rotation-equivariant aerial object detector that achieves strict rotation equivariance through a novel downsampling procedure, and introduces rotation-equivariant channel attention (RE-CA) and a multi-branch detection head, attaining state-of-the-art performance on DOTA and other benchmarks with significantly fewer parameters.

OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

OpenRSD is a general-purpose open-prompt object detection framework for remote sensing that supports both text and image multimodal prompts. It integrates an alignment head and a fusion head to balance speed and accuracy, employs a three-stage training pipeline, and is trained on the ORSD+ dataset comprising 470K images. OpenRSD achieves state-of-the-art average performance across seven public benchmarks while maintaining real-time inference at 20.8 FPS.

Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights

This paper systematically revisits 11 adversarial patch defense methods, establishes the first patch defense benchmark covering 13 attacks, 11 detectors, and 4 metrics, constructs a large-scale APDE dataset of 94,000 images, and reveals three key insights: the difficulty of defending against natural adversarial patches stems from data distribution rather than high-frequency components; patch detection accuracy is inconsistent with defense performance; and adaptive attacks can circumvent most existing defenses.

SFUOD: Source-Free Unknown Object Detection

This paper introduces a novel Source-Free Unknown Object Detection (SFUOD) setting and proposes the CollaPAUL framework, which simultaneously detects known and unknown objects without access to source data by combining collaborative tuning to fuse source- and target-domain knowledge with a principal-axis-based pseudo-label assignment strategy for unknown objects.

Sim-DETR: Unlock DETR for Temporal Sentence Grounding

This paper systematically analyzes the root causes of anomalous behavior in DETR-based temporal sentence grounding (TSG) — inter-query conflict and intra-query global-local contradiction — and proposes two simple decoder modifications (Query Grouping & Ranking + Global-Local Bridging) to form Sim-DETR, unlocking the full potential of DETR for TSG.

The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning

This paper is the first to identify spurious correlations between text queries and background frames as the fundamental bottleneck in moment retrieval performance. It proposes TD-DETR, a framework that mitigates this issue via two strategies: dynamic context video synthesis and text-dynamics interaction enhancement, achieving state-of-the-art results on QVHighlights and Charades-STA.

Uncertainty-Aware Gradient Stabilization for Small Object Detection

This paper identifies gradient instability caused by steep loss curvature in traditional localization methods when applied to small objects, and proposes UGS (Uncertainty-aware Gradient Stabilization), a framework comprising three components — classification-based localization, uncertainty minimization, and uncertainty-guided refinement — to stabilize gradients and significantly improve small object detection performance.

UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

This paper proposes the UPRE framework, which jointly optimizes Multi-view Domain Prompts (MDP) and Unified Representation Enhancement (URE) to simultaneously alleviate detection bias and domain bias in zero-shot domain adaptive object detection, achieving state-of-the-art performance across nine datasets spanning three scenario types: adverse weather, cross-city, and virtual-to-real.

VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

VisRL is the first framework to apply reinforcement learning to intention-driven visual perception. Through iterative DPO training, it enables large multimodal models (LMMs) to autonomously select focus regions (by predicting bounding boxes) according to query intent, achieving superior visual reasoning over SFT without requiring costly intermediate bounding box annotations.

Visual-RFT: Visual Reinforcement Fine-Tuning

Visual-RFT extends the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm from DeepSeek R1—originally applied to mathematics and code—to visual perception tasks. It introduces task-specific verifiable reward functions, including an IoU reward for object detection and a CLS reward for classification, achieving substantial improvements over SFT on fine-grained classification, few-shot detection, and grounded reasoning with only a fraction of the training data.
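For object detection, the verifiable reward can be as simple as the overlap between the predicted and ground-truth boxes. The sketch below shows a bare IoU reward; Visual-RFT's answer-format parsing and any thresholding are omitted.

```python
def iou_reward(pred_box, gt_box):
    """Verifiable reward for a detection rollout: the IoU between the
    predicted and ground-truth [x1, y1, x2, y2] boxes. No learned
    reward model is needed, which is the point of RLVR."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

print(round(iou_reward([0, 0, 2, 2], [1, 1, 3, 3]), 4))  # 0.1429
```

Since the reward is computed directly from the task output, it scales to new tasks by swapping in a different verifiable function (e.g. a CLS reward checking the predicted label), with no human preference data.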

Visual Modality Prompt for Adapting Vision-Language Object Detectors

This paper proposes ModPrompt, an encoder-decoder-based visual prompting strategy that adapts vision-language object detectors (e.g., YOLO-World, Grounding DINO) to new modalities such as infrared and depth, while preserving zero-shot detection capability.

VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under Real Occlusions

This paper presents VOccl3D, a large-scale synthetic video dataset (250K frames, 400 video sequences) rendered via 3DGS, targeting 3D human pose and shape (HPS) estimation under realistic occlusion scenarios. Models fine-tuned on VOccl3D demonstrate significant performance improvements in occluded settings.

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

This paper proposes YOLO-Count, a fully differentiable open-vocabulary object counting model built upon the YOLO architecture. Through an innovative cardinality map regression target and a hybrid strong-weak supervised training strategy, YOLO-Count achieves state-of-the-art performance on both general object counting and quantity-controlled text-to-image generation.
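Why a cardinality map makes counting differentiable: the predicted count is the sum of a dense map, so gradients flow to every cell. The target construction below is an assumption for illustration (each ground-truth box spreads a total mass of 1 uniformly over the grid cells it covers); YOLO-Count's actual regression target may be built differently.

```python
import numpy as np

def cardinality_target(boxes, grid=(8, 8), img=(64, 64)):
    """Toy cardinality-map target: each ground-truth [x1, y1, x2, y2]
    box contributes total mass 1, spread uniformly over the grid cells
    it overlaps. Summing the map recovers the object count, and the
    sum is differentiable w.r.t. a predicted map."""
    H, W = grid
    m = np.zeros(grid)
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1 / img[1] * W), int(y1 / img[0] * H)
        c2 = min(W - 1, int((x2 - 1e-6) / img[1] * W))
        r2 = min(H - 1, int((y2 - 1e-6) / img[0] * H))
        cells = (r2 - r1 + 1) * (c2 - c1 + 1)
        m[r1:r2 + 1, c1:c2 + 1] += 1.0 / cells
    return m

m = cardinality_target([[0, 0, 16, 16], [32, 32, 64, 64]])
print(round(float(m.sum()), 6))  # 2.0
```

A differentiable count is what lets the model serve as a guidance signal for quantity-controlled text-to-image generation, since the generation pipeline can backpropagate through it.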

YOLOE: Real-Time Seeing Anything

This paper proposes YOLOE, which unifies text prompt, visual prompt, and prompt-free open-scenario detection and segmentation within the YOLO architecture. Through three key designs — RepRTA (Re-parameterizable Region-Text Alignment), SAVPE (Semantic-Activated Visual Prompt Encoder), and LRPC (Lazy Region-Prompt Contrast) — YOLOE achieves high efficiency and strong performance, surpassing YOLO-World v2 on LVIS with 3× lower training cost.