🎯 Object Detection
📷 CVPR2026 · 45 paper notes
- A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps
-
This paper proposes a Hybrid Ensemble Decoder (HED) and a progressive fine-tuning strategy for cross-domain few-shot object detection (CD-FSOD). By parallelizing a subset of decoder layers and randomly initializing denoising queries to introduce prediction diversity, the method achieves state-of-the-art performance on three benchmarks — CD-FSOD, ODinW-13, and RF100-VL — without introducing any additional parameters.
- ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection
-
ABRA decouples domain knowledge from category knowledge by constructing class-agnostic domain experts via Objectification, extracting lightweight per-category residuals via SVFT, and aligning weight spaces through Orthogonal Procrustes rotation—enabling detection capability transfer to a target domain even when no data for certain categories exists therein.
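The closed-form alignment step can be sketched on its own. Assuming the rotation is the standard orthogonal Procrustes solution minimizing ||R·A - B||_F between two weight matrices (the Objectification and SVFT stages are not reproduced here), a minimal numpy version:

```python
import numpy as np

def procrustes_rotation(A, B):
    """Closed-form orthogonal Procrustes: the orthogonal R minimizing
    ||R @ A - B||_F is U @ Vt, where U, S, Vt = svd(B @ A.T)."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt

# Sanity check: recover a planted rotation between two weight matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 10))
theta = np.pi / 5
R_true = np.eye(4)
R_true[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]]
B = R_true @ A
R = procrustes_rotation(A, B)
print(np.allclose(R, R_true))  # the alignment recovers the planted rotation
```

In ABRA this rotation acts on weight-space representations rather than raw features, but the underlying optimization is the same closed-form SVD solve.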
- AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos
-
By exploiting the temporal invariance of background structure in fixed-view videos, the paper constructs an offline Anchor Bank and an online Anchor Map as persistent language–scene memory. Combined with an anchor-guided re-entry prior and a ReID-Gating identity verification mechanism, the system achieves robust re-capture of targets after occlusion or departure, yielding a 10.3% improvement in RCR and a 24.2% reduction in RCL.
- Beyond Caption-Based Queries for Video Moment Retrieval
-
This paper identifies a substantial gap between caption-based queries and real-world search queries in VMR, introduces three search-query benchmarks, and mitigates the collapse of active decoder queries in DETR via two architectural modifications (self-attention removal and query dropout), achieving gains of up to 21.83% mAPm on multi-moment search queries.
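Taken at face value, the query-dropout modification amounts to randomly removing whole decoder queries during training; a minimal sketch under that assumption (the paper's exact schedule and drop rate are not given in this note):

```python
import numpy as np

def query_dropout(queries, drop_prob, rng):
    """Randomly drop whole decoder queries for one training step, so
    supervision is spread across the query set instead of collapsing
    onto a few persistently active queries."""
    keep = rng.random(queries.shape[0]) >= drop_prob
    if not keep.any():                          # never drop every query
        keep[rng.integers(queries.shape[0])] = True
    return queries[keep]

rng = np.random.default_rng(1)
queries = rng.standard_normal((300, 256))       # 300 queries, 256-dim, DETR-style
kept = query_dropout(queries, drop_prob=0.5, rng=rng)
```

At inference all queries are kept; dropout is a training-time regularizer only.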
- Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection
-
This paper proposes the PDP framework, which addresses prompt degradation in incremental object detection caused by prompt coupling and prompt drift via decoupled dual-pool prompting (shared pool + private pool) and Prototypical Pseudo-Label Generation (PPG), achieving state-of-the-art performance on COCO and VOC.
- Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
-
This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a new task formulation, along with a large-scale benchmark OACIRR (160K+ quadruplets) and the AdaFocal framework. AdaFocal employs a context-aware attention modulator to adaptively enhance focus on anchored instance regions, substantially outperforming existing methods in instance-level retrieval fidelity.
- CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
-
This paper proposes the CD-Buffer framework, which drives complementary collaboration between a subtractive buffer (channel suppression) and an additive buffer (lightweight adapter compensation) via a unified domain discrepancy measure, enabling robust test-time object detection adaptation across adverse weather conditions of varying severity.
- CompAgent: An Agentic Framework for Visual Compliance Verification
-
This paper proposes CompAgent, the first agentic framework for visual compliance verification. A Planning Agent dynamically selects visual tools (object detection, face analysis, NSFW detection, etc.) based on compliance policies, while a Compliance Verification Agent integrates image content, tool outputs, and policy context for multimodal reasoning. Without any training, CompAgent surpasses the previous SOTA by 10% on UnsafeBench, achieving 76% F1.
- DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection
-
This paper proposes DA-Mamba, a CNN-SSM hybrid architecture that achieves image-level and instance-level global-local domain-invariant feature alignment with linear complexity via two modules—Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM)—attaining state-of-the-art performance on four domain adaptive detection benchmarks.
- Detecting Unknown Objects via Energy-Based Separation for Open World Object Detection
-
This paper proposes the DEUS framework, which introduces ETF-Subspace Unknown Separation (EUS) to effectively separate known, unknown, and background proposals via energy scores within geometrically orthogonal known/unknown subspaces, and designs an Energy-based Known Distinction (EKD) loss to reduce cross-task interference between old and new classes during incremental learning, achieving substantial improvements in unknown object recall on OWOD benchmarks.
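The simplex ETF geometry behind EUS has a standard closed-form construction from the neural-collapse literature; a sketch assuming that formulation (the energy scores and the EKD loss are not reproduced):

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """K unit-norm prototypes in R^dim whose pairwise cosine is exactly
    -1/(K-1): a simplex equiangular tight frame, the maximally
    separated configuration of K directions."""
    K = num_classes
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))   # dim x K, orthonormal columns
    M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
    return M                                             # columns are the prototypes

M = simplex_etf(num_classes=5, dim=16)
G = M.T @ M
print(np.allclose(np.diag(G), 1.0), np.allclose(G[0, 1], -0.25))
```

Building the known and unknown subspaces from two such frames in orthogonal complements gives the geometric orthogonality the summary describes.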
- Does YOLO Really Need to See Every Training Image in Every Epoch?
-
This paper proposes the Anti-Forgetting Sampling Strategy (AFSS), which dynamically determines which training images participate in each epoch based on per-image learning sufficiency measured by min(Precision, Recall). AFSS achieves over a 1.43× training speedup for YOLO-series detectors while maintaining or even improving detection accuracy.
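The selection rule can be sketched directly; the threshold below is illustrative, not the paper's:

```python
def afss_select(per_image_stats, sufficiency_threshold=0.85):
    """Select images for the next epoch: an image whose min(precision, recall)
    has reached the threshold is considered sufficiently learned and skipped.
    (The threshold value here is illustrative, not the paper's.)"""
    return [img for img, (p, r) in per_image_stats.items()
            if min(p, r) < sufficiency_threshold]

stats = {"a.jpg": (0.95, 0.92),   # well learned -> skipped this epoch
         "b.jpg": (0.60, 0.88),   # low precision -> kept
         "c.jpg": (0.90, 0.70)}   # low recall -> kept
print(afss_select(stats))  # ['b.jpg', 'c.jpg']
```

The anti-forgetting part of the strategy periodically re-includes skipped images so their statistics stay current; that schedule is not modeled in this sketch.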
- Evaluating Few-Shot Pill Recognition Under Visual Domain Shift
-
This paper systematically evaluates the generalization of pill recognition under cross-domain few-shot conditions from a deployment perspective. It reveals a decoupling phenomenon in which semantic classification saturates at 1-shot while localization and recall degrade sharply under occlusion and overlap, and demonstrates that the visual realism of training data is far more critical than data volume or shot count.
- EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
-
This paper proposes the Evolving World Object Detection (EWOD) paradigm and the EW-DETR framework, which jointly address class-incremental learning, domain shift adaptation, and unknown object detection under a strict no-replay constraint through three synergistic modules: incremental LoRA adapters, a query-norm objectness adapter, and entropy-aware unknown mixing. The proposed approach achieves a 57.24% improvement on the FOGS metric.
- Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
-
This paper proposes FI3Det, the first few-shot incremental 3D object detection framework. During the base training stage, a VLM-guided unknown object learning module enables early awareness of potential novel categories. During the incremental stage, a gated multimodal prototype imprinting module fuses 2D semantic and 3D geometric features for novel class detection. FI3Det achieves an average improvement of 17.37% in novel class mAP on ScanNet V2 and SUN RGB-D.
- Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
-
This paper proposes FALCON-SFOD, a framework that leverages class-agnostic binary masks generated by a foundation model (OV-SAM) to regularize the detector's feature space via Spatial Prior-Aware Regularization (SPAR), and introduces an Imbalance-aware Robust Pseudo Label loss (IRPL) to achieve object-focused representations in source-free object detection, attaining state-of-the-art results across multiple benchmarks.
- Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
-
By exploiting Fourier rotational equivariance to estimate the principal orientation of objects in the frequency domain and align features accordingly, this paper proposes two plug-and-play modules—FAAFusion and FAA Head—to address cross-scale directional incoherence in FPN and the classification–regression task conflict in detection heads, respectively, achieving new state-of-the-art results on DOTA-v1.0/v1.5 and HRSC2016.
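The core frequency-domain idea, that a dominant spatial orientation shows up as a peak direction in the FFT magnitude, can be illustrated on a synthetic image (the FAAFusion and FAA Head modules themselves operate on learned features and are not reproduced):

```python
import numpy as np

def principal_orientation(img):
    """Dominant orientation (radians, in [0, pi)) from the peak of the
    centered FFT magnitude; the spectral peak lies perpendicular to the
    image structure, hence the 90-degree shift."""
    F = np.fft.fftshift(np.abs(np.fft.fft2(img)))
    cy, cx = F.shape[0] // 2, F.shape[1] // 2
    F[cy, cx] = 0.0                                  # suppress the DC term
    y, x = np.unravel_index(np.argmax(F), F.shape)
    return (np.arctan2(y - cy, x - cx) + np.pi / 2) % np.pi

# Synthetic stripes oriented at 30 degrees
yy, xx = np.mgrid[0:128, 0:128]
theta = np.deg2rad(30.0)
normal = (np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2))
img = np.sin(2 * np.pi * (xx * normal[0] + yy * normal[1]) / 8.0)
est = np.rad2deg(principal_orientation(img))  # close to 30
```

Rotational equivariance of the Fourier magnitude is what makes this estimate stable: rotating the image rotates the spectrum by the same angle.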
- HeROD: Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection
-
HeROD proposes a lightweight, model-agnostic framework that injects heuristic-inspired spatial and semantic reasoning priors into three stages of a DETR-style detection pipeline (candidate ranking, prediction fusion, and Hungarian matching), significantly improving data efficiency and convergence for referring object detection (ROD) under annotation-scarce conditions.
- Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection
-
This paper proposes LMP, a dual-branch framework built upon GroundingDINO that introduces a visual prototype branch (comprising positive class prototypes and hard negative prototypes) jointly trained and integrated with the text branch at inference, achieving state-of-the-art performance on cross-domain few-shot object detection.
- Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
-
This paper proposes InCoM-Net, which extracts intra-instance, inter-instance, and global context features separately for each instance from VLM features, and achieves state-of-the-art HOI detection on HICO-DET and V-COCO (HICO-DET Full mAP 43.96, V-COCO Scenario-1 AP role 73.6) via progressive context aggregation and fusion with detector features.
- Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection
-
This paper proposes two complementary modules — Region-Aware Prompt Augmentation at Training time (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD) — to address training data memorization in diffusion models. RAPTA generates semantically grounded prompt variants via object detector proposals to mitigate memorization during training, while ADMCD fuses patch-level, CLIP, and texture features through a zero-training detection pipeline to classify copy behavior at inference time. On LAION-10k, the copy rate is reduced from 7.4 to 2.6.
- MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
-
This work is the first to formally define and address the problem of sparsely annotated monocular 3D object detection. It proposes two modules—Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF)—achieving substantial improvements over existing 2D SAOD methods under the KITTI 30% annotation setting (AP3D Easy: 21.28 vs. 17.14).
- MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
-
This paper proposes MRD, a training-free multi-resolution retrieval-detection fusion framework that mitigates object fragmentation via cross-resolution semantic fusion and suppresses background interference through an open-vocabulary detector, substantially improving MLLM understanding of high-resolution images.
- NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection
-
NoOVD proposes a framework that, during frozen-VLM-based OVD training, employs a parameter-free K-FPN to preserve CLIP knowledge for discovering potential novel-category objects, applies self-distillation to embed novel-category knowledge into the detector, and introduces R-RPN at inference to improve novel-category recall, achieving SOTA on OV-LVIS, OV-COCO, and Objects365.
- PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection
-
PaQ-DETR proposes pattern-based dynamic query generation (content-aware weighted combination of shared basis patterns) combined with quality-aware one-to-many assignment (adaptive positive sample selection based on localization–classification consistency), jointly addressing query representation imbalance and supervision sparsity in DETR. It achieves consistent gains of 1.5%–4.2% mAP across multiple backbones.
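The pattern-based query generation can be sketched under the assumption that mixing weights come from a softmax over content-pattern similarities (the paper's exact rule may differ):

```python
import numpy as np

def dynamic_queries(content, patterns):
    """Form each query as a content-conditioned convex combination of shared
    basis patterns (softmax-weighted mixing; an assumed rule, not
    necessarily the paper's exact formulation)."""
    logits = content @ patterns.T                     # (Q, P) similarity scores
    logits -= logits.max(axis=1, keepdims=True)       # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ patterns                               # (Q, dim) dynamic queries

rng = np.random.default_rng(2)
patterns = rng.standard_normal((8, 32))    # 8 shared basis patterns
content = rng.standard_normal((100, 32))   # per-image content embeddings
queries = dynamic_queries(content, patterns)
print(queries.shape)  # (100, 32)
```

Because every query reuses the same small pattern bank, query capacity is shared rather than duplicated, which is the imbalance the summary refers to.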
- Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection
-
HSA-DINO proposes a multi-scale prompt bank that learns hierarchical semantic prompts from the image feature pyramid to enrich text representations, and employs a semantics-aware router to dynamically determine at inference time whether domain-specific augmentation should be applied. This design achieves a superior balance between domain adaptation and open-vocabulary generalization, attaining the best harmonic mean (H) scores across three vertical-domain datasets.
- PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
-
PET-DINO builds a unified object detector supporting both text and visual prompts on top of Grounding DINO. It introduces an alignment-friendly visual prompt generation module (AFVPG) and two prompt-enriched training strategies (IBP and DMD), achieving competitive zero-shot detection performance with significantly less training data.
- PHAC: Promptable Human Amodal Completion
-
This paper introduces Promptable Human Amodal Completion (PHAC), a novel task that accepts point-based user prompts (pose/bounding box) via dedicated ControlNet modules to inject conditional signals, and designs an inpainting-based refinement module to preserve the appearance of visible regions, achieving high-quality and controllable completion of occluded human images.
- Prompt-Free Universal Region Proposal Network
-
PF-RPN replaces text/image prompts with learnable visual embeddings and introduces three modules—Sparse Image-Aware Adapter (SIA), Cascaded Self-Prompting (CSP), and Centrality-Guided Query Selection (CG-QS)—to achieve state-of-the-art zero-shot region proposals across 19 cross-domain datasets using only 5% of COCO training data.
- Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection
-
This paper is the first to identify an "astigmatism" phenomenon in cross-domain few-shot object detection (CD-FSOD), wherein model attention remains persistently diffuse in the target domain. Inspired by the human foveal visual system, the authors design three complementary modules — Positive Pattern Refinement (PPR), Negative Context Modulation (NCM), and Text Semantic Alignment (TSA) — to reshape attention, achieving state-of-the-art performance with significant margins across six cross-domain benchmarks.
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
-
This paper proposes Saliency-R1, which uses a logit-decomposition-based efficient saliency map technique and chain-of-thought bottleneck attention rollout to compute alignment between saliency maps and human-annotated bounding boxes as a GRPO reward, training VLMs to focus on task-relevant image regions during reasoning and thereby improving the interpretability and faithfulness of the reasoning process.
- SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification
-
SDF-Net is proposed to exploit the rigid-body geometric structure of ships as a cross-modal invariant anchor. It enforces structural consistency via gradient energy extracted from intermediate layers, and disentangles modality-shared/specific features at the terminal layer with additive residual fusion, achieving SOTA on HOSS-ReID (All mAP 60.9%, surpassing TransOSS by 3.5%).
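A minimal stand-in for the gradient-energy signal, using plain finite differences (the paper's exact operator and layer choice are not specified in this note):

```python
import numpy as np

def gradient_energy(feat):
    """Per-pixel gradient energy of a 2-D map: squared finite differences,
    a simple stand-in for the structural-consistency signal that SDF-Net
    extracts from intermediate layers."""
    gy, gx = np.gradient(feat.astype(float))
    return gx ** 2 + gy ** 2

flat = np.full((16, 16), 3.0)              # textureless region -> zero energy
ramp = np.tile(np.arange(16.0), (16, 1))   # unit horizontal slope -> energy 1
print(gradient_energy(flat).max(), gradient_energy(ramp).mean())
```

The appeal for optical-SAR matching is that edge structure (hull outline, deck lines) survives the modality gap even when intensities do not.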
- Show, Don't Tell: Detecting Novel Objects by Watching Human Videos
-
This paper proposes the "Show, Don't Tell" paradigm: a SODC pipeline (HOIST-Former for hand-object detection → SAMURAI for tracking → DBSCAN for spatiotemporal clustering) automatically creates annotated datasets from human demonstration videos to train a lightweight F-RCNN customized detector (MOD). Without any language prompts, MOD achieves instance-level detection of novel objects, surpassing VLM baselines such as GroundingDINO, RexOmni, and YoloWorld in mAP and precision on the Meccano and in-house datasets, and is integrated end-to-end into a real robotic sorting system.
- Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images
-
This paper proposes ESM-YOLO+, a lightweight visible-infrared fusion network for small target detection. It achieves pixel-level cross-modal adaptive fusion via a Mask-Enhanced Attention Fusion (MEAF) module, and introduces a training-time structural representation enhancement to improve spatial discriminability. The method achieves 84.71% mAP on VEDAI while reducing parameter count by 93.6%.
- SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
-
This paper proposes Spatial-Projection Alignment (SPAN), which improves the localization accuracy of arbitrary monocular 3D detectors through two geometrically synergistic constraints — 3D corner spatial alignment and 3D-to-2D projection alignment — coupled with a hierarchical task learning strategy, serving as a plug-and-play module.
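The 3D-to-2D projection-alignment constraint can be sketched with a KITTI-style camera model; the intrinsics below are illustrative, not from the paper:

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """8 corners of a 3D box in camera coordinates (x right, y down, z forward),
    KITTI-style: origin at the bottom-face center, yaw about the y axis."""
    l, h, w = dims
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 0,  0,  0,  0, -1, -1, -1, -1]) * h
    z = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return (R @ np.vstack([x, y, z])).T + center

def projection_alignment_error(corners, K, box2d):
    """Mean L1 gap between the projected 3D box's image-plane extent and a
    2D box; driving this to zero couples the 3D estimate to 2D evidence."""
    uvw = corners @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    pred = np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])
    return np.abs(pred - np.asarray(box2d, dtype=float)).mean()

K = np.array([[721.5, 0.0, 609.6],     # illustrative KITTI-like intrinsics
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
corners = box3d_corners(np.array([1.0, 1.5, 10.0]), (3.9, 1.5, 1.6), 0.3)
uvw = corners @ K.T
uv = uvw[:, :2] / uvw[:, 2:3]
box2d = [uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()]
err = projection_alignment_error(corners, K, box2d)  # 0 for a consistent pair
```

SPAN pairs this projection term with a direct 3D corner-alignment term; since both are differentiable functions of the predicted box, the module is plug-and-play for any monocular 3D detector.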
- SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras
-
This paper proposes SpiralDiff, a diffusion framework for RGB-to-RAW conversion that employs a signal-dependent noise weighting strategy to accommodate varying reconstruction difficulty across pixel intensity regions, and introduces a CamLoRA module for lightweight cross-camera adaptation within a single unified model.
- The COTe Score: A Decomposable Framework for Evaluating Document Layout Analysis Models
-
This paper proposes COTe (Coverage, Overlap, Trespass, Excess), a decomposable evaluation framework for Document Layout Analysis (DLA), along with the concept of Structural Semantic Units (SSUs). Compared to conventional IoU/mAP/F1 metrics, COTe more accurately reflects page parsing quality and reveals model-specific failure modes.
- Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data
-
This paper introduces CANVAS — the first large-scale, subcellular-resolution Light-Sheet Fluorescence Microscopy (LSFM) whole-brain benchmark dataset, encompassing 6 cell markers, approximately 93,000 annotated cells, and a public leaderboard. It reveals critical generalization failures of existing detection models across markers and brain regions, and explores the potential of 3D Masked Autoencoders (MAE) for self-supervised representation learning.
- Towards Intrinsic-Aware Monocular 3D Object Detection
-
MonoIA proposes converting numerical camera intrinsics into language-guided semantic representations (via LLM-generated intrinsic descriptions encoded by CLIP), and injects them into the detection network through a hierarchical adaptation module. This enables zero-shot generalization to unseen focal lengths and unified cross-dataset training, achieving new state-of-the-art results on KITTI, Waymo, and nuScenes.
- UAVGen: Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection
-
This paper proposes UAVGen, a layout-to-image data augmentation framework for UAV-based object detection. It addresses low-quality small object generation, inefficient model capacity allocation, and label inconsistency through a visual prototype conditioned diffusion model and a focal region enhancement pipeline.