
🎯 Object Detection

📷 CVPR2026 · 45 paper notes

A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

This paper proposes a Hybrid Ensemble Decoder (HED) and a progressive fine-tuning strategy for cross-domain few-shot object detection (CD-FSOD). By parallelizing a subset of decoder layers and randomly initializing denoising queries to introduce prediction diversity, the method achieves state-of-the-art performance on three benchmarks — CD-FSOD, ODinW-13, and RF100-VL — without introducing any additional parameters.

ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection

ABRA decouples domain knowledge from category knowledge, formulating cross-domain category transfer as rotation alignment in weight space: class-agnostic domain experts are trained via Objectification, lightweight per-category residuals are extracted with SVFT, and a closed-form orthogonal Procrustes solution "teleports" source-domain class knowledge to a target domain even when that domain has no data for the class.
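The weight-space alignment step at the heart of ABRA has a classic closed-form solution. The sketch below (plain NumPy, not the paper's code) recovers the orthogonal rotation between two sets of paired weight vectors; the toy data and dimensions are illustrative assumptions.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Closed-form orthogonal Procrustes: R minimizing ||A @ R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))          # ten 4-d "weight vectors" in the source basis
theta = 0.3
R_true = np.eye(4)
R_true[:2, :2] = [[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]]
B = A @ R_true                            # the same vectors in the rotated target basis
R_est = procrustes_rotation(A, B)         # recovers R_true exactly in this noiseless case
```

In practice the same one-SVD solve applies to noisy, high-dimensional weight matrices, which is what makes the alignment cheap compared to retraining.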

AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

By exploiting the temporal invariance of background structure in fixed-view videos, the paper constructs an offline Anchor Bank and an online Anchor Map as persistent language–scene memory. Combined with an anchor-guided re-entry prior and a ReID-Gating identity verification mechanism, the system achieves robust re-capture of targets after occlusion or departure, yielding a 10.3% improvement in RCR and a 24.2% reduction in RCL.

Beyond Caption-Based Queries for Video Moment Retrieval

This paper identifies a substantial gap between caption-based queries and real-world search queries in VMR, introduces three search-query benchmarks, and mitigates active decoder-query collapse in DETR via two architectural modifications—self-attention removal and query dropout—achieving gains of up to 21.83% mAPm on multi-moment search queries.

Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection

This paper proposes the PDP framework, which addresses prompt degradation in incremental object detection caused by prompt coupling and prompt drift via decoupled dual-pool prompting (shared pool + private pool) and Prototypical Pseudo-Label Generation (PPG), achieving state-of-the-art performance on COCO and VOC.

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

This paper proposes Object-Anchored Composed Image Retrieval (OACIR), a new task formulation, along with a large-scale benchmark OACIRR (160K+ quadruplets) and the AdaFocal framework. AdaFocal employs a context-aware attention modulator to adaptively enhance focus on anchored instance regions, substantially outperforming existing methods in instance-level retrieval fidelity.

CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

This paper proposes the CD-Buffer framework, which drives complementary collaboration between a subtractive buffer (channel suppression) and an additive buffer (lightweight adapter compensation) via a unified domain discrepancy measure, enabling robust test-time object detection adaptation across adverse weather conditions of varying severity.

CompAgent: An Agentic Framework for Visual Compliance Verification

This paper proposes CompAgent, the first agentic framework for visual compliance verification. A Planning Agent dynamically selects visual tools (object detection, face analysis, NSFW detection, etc.) based on compliance policies, while a Compliance Verification Agent integrates image content, tool outputs, and policy context for multimodal reasoning. Without any training, CompAgent surpasses the previous SOTA by 10% on UnsafeBench, achieving 76% F1.

DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

This paper proposes DA-Mamba, a CNN-SSM hybrid architecture that achieves image-level and instance-level global-local domain-invariant feature alignment with linear complexity via two modules—Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM)—attaining state-of-the-art performance on four domain adaptive detection benchmarks.

Detecting Unknown Objects via Energy-Based Separation for Open World Object Detection

This paper proposes DEUS, a framework whose ETF-Subspace Unknown Separation (EUS) builds geometrically orthogonal known/unknown subspaces via a Simplex ETF and uses energy scores to separate known, unknown, and background proposals, while an Energy-based Known Distinction (EKD) loss reduces cross-task interference between old and new classes during incremental learning, yielding substantial gains in unknown-object recall on OWOD benchmarks.
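The two standard ingredients DEUS builds on, Simplex ETF directions and the free-energy score, can be sketched in isolation (generic textbook forms; the paper's subspace construction and thresholds are not reproduced here):

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """K unit vectors in R^d (d >= K) with pairwise cosine exactly -1/(K-1)."""
    U, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, K)))
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

def energy_score(logits):
    """Free energy E(x) = -logsumexp(logits); lower for confident 'known' proposals."""
    m = logits.max(axis=-1, keepdims=True)
    return -(m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1)))

M = simplex_etf(K=5, d=8)   # columns: 5 maximally separated class directions
G = M.T @ M                 # Gram matrix: 1 on the diagonal, -1/4 elsewhere
```

An energy threshold on proposal logits then gives a simple known/unknown split, which is the role energy plays inside EUS.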

Does YOLO Really Need to See Every Training Image in Every Epoch?

This paper proposes the Anti-Forgetting Sampling Strategy (AFSS), which dynamically determines which training images participate in each epoch based on per-image learning sufficiency measured by min(Precision, Recall). AFSS achieves over 1.43× training speedup for YOLO-series detectors while maintaining or even improving detection accuracy.
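The selection rule is easy to sketch. Below is a minimal, hypothetical reading of it (the threshold value and per-image statistics bookkeeping are assumptions; the paper's schedule may differ):

```python
def afss_select(stats, tau=0.95):
    """Keep only images whose min(precision, recall) is below a sufficiency threshold.

    stats: {image_id: (precision, recall)} from the previous epoch's predictions.
    Images already learned well (min(P, R) >= tau) sit out this epoch.
    """
    return [img for img, (p, r) in stats.items() if min(p, r) < tau]

epoch_stats = {
    "a.jpg": (0.99, 0.98),   # well learned: skipped
    "b.jpg": (0.90, 0.40),   # low recall: kept
    "c.jpg": (0.50, 0.97),   # low precision: kept
}
active = afss_select(epoch_stats)
```

Skipped images can re-enter later epochs whenever their statistics degrade, which is what guards against forgetting.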

Evaluating Few-Shot Pill Recognition Under Visual Domain Shift

This paper systematically evaluates few-shot pill recognition under cross-dataset domain shift from a deployment-oriented perspective. It reveals a decoupling phenomenon in which semantic classification saturates at 1-shot while localization and recall degrade sharply under occlusion and overlap, and demonstrates that the visual realism of training data, rather than data volume or shot count, is the dominant factor governing few-shot generalization.

EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer

This paper proposes the Evolving World Object Detection (EWOD) paradigm and the EW-DETR framework, which jointly address class-incremental learning, domain-shift adaptation, and unknown object detection under a strict no-replay constraint (no historical data is stored) via three synergistic modules: incremental LoRA adapters, a query-norm objectness adapter, and entropy-aware unknown mixing. The method improves FOGS by 57.24% over prior approaches.
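The incremental adapters follow the standard LoRA form: a frozen weight plus a scaled low-rank update. A minimal sketch (generic LoRA, not EW-DETR's per-increment wiring, which is assumed here):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8):
    """y = x @ (W + (alpha/r) * B @ A).T, with W frozen and only A, B trained."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(1)
d_in, d_out, r = 16, 8, 2
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))                 # zero-init: the adapter starts as a no-op
x = rng.standard_normal((3, d_in))
y = lora_forward(x, W, A, B)             # identical to x @ W.T until B is trained
```

Adding one such (A, B) pair per increment keeps old weights untouched, which is what makes the no-replay constraint workable.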

Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

This paper proposes FI3Det, the first few-shot incremental 3D object detection framework. During the base training stage, a VLM-guided unknown object learning module enables early awareness of potential novel categories. During the incremental stage, a gated multimodal prototype imprinting module fuses 2D semantic and 3D geometric features for novel class detection. FI3Det achieves an average improvement of 17.37% in novel class mAP on ScanNet V2 and SUN RGB-D.

Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection

This paper proposes FALCON-SFOD, a framework that leverages class-agnostic binary masks generated by a foundation model (OV-SAM) to regularize the detector's feature space via Spatial Prior-Aware Regularization (SPAR), and introduces an Imbalance-aware Robust Pseudo Label loss (IRPL) to achieve object-focused representations in source-free object detection, attaining state-of-the-art results across multiple benchmarks.

Fourier Angle Alignment for Oriented Object Detection in Remote Sensing

By exploiting Fourier rotational equivariance to estimate the principal orientation of objects in the frequency domain and align features accordingly, this paper proposes two plug-and-play modules—FAAFusion and FAA Head—to address cross-scale directional incoherence in FPN and the classification–regression task conflict in detection heads, respectively, achieving new state-of-the-art results on DOTA-v1.0/v1.5 and HRSC2016.
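The underlying frequency-domain idea, reading a dominant orientation off the FFT magnitude spectrum, can be illustrated minimally (a toy sketch with a synthetic striped image; the FAAFusion/FAA Head modules themselves are not reproduced):

```python
import numpy as np

def principal_orientation(img):
    """Angle (radians) of the dominant non-DC frequency in the FFT magnitude."""
    F = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    c = np.array(F.shape) // 2
    F[c[0], c[1]] = 0.0                      # suppress the DC component
    ky, kx = np.unravel_index(np.argmax(F), F.shape)
    return np.arctan2(ky - c[0], kx - c[1])

# Vertical stripes: intensity varies along x only, so the dominant
# frequency direction is horizontal (angle 0 mod pi).
N = 64
x = np.arange(N)
img = np.cos(2 * np.pi * 4 * x / N)[None, :].repeat(N, axis=0)
theta = principal_orientation(img)
```

Because rotating the image rotates its spectrum by the same angle, an estimate like this is equivariant to rotation, the property the paper exploits for feature alignment.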

HeROD: Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

HeROD proposes a lightweight, model-agnostic framework that injects heuristic-inspired spatial and semantic reasoning priors into three stages of a DETR-style detection pipeline (candidate ranking, prediction fusion, and Hungarian matching), significantly improving data efficiency and convergence for referring object detection (ROD) under annotation-scarce conditions.

Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

This paper proposes LMP, a dual-branch framework built upon GroundingDINO that introduces a visual prototype branch (comprising positive class prototypes and hard negative prototypes) jointly trained and integrated with the text branch at inference, achieving state-of-the-art performance on cross-domain few-shot object detection.

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

This paper proposes InCoM-Net, which extracts intra-instance, inter-instance, and global context features separately for each instance from VLM features, and achieves state-of-the-art HOI detection on HICO-DET and V-COCO (HICO-DET Full mAP 43.96, V-COCO AP_role^S1 73.6) via progressive context aggregation and fusion with detector features.

Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

This paper proposes two complementary modules, Region-Aware Prompt Augmentation at Training time (RAPTA) and Attention-Driven Multimodal Copy Detection (ADMCD), that together form an end-to-end framework for memorization mitigation and detection in diffusion models. RAPTA generates semantically grounded prompt variants from object-detector proposals to mitigate memorization during training, while ADMCD fuses patch-level, CLIP, and texture features in a zero-training pipeline to classify copy behavior at inference time. On LAION-10k, the copy rate drops from 7.4 to 2.6.

MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

This work is the first to formally define and address the problem of sparsely annotated monocular 3D object detection. It proposes two modules—Road-Aware Patch Augmentation (RAPA) and Prototype-Based Filtering (PBF)—achieving substantial improvements over existing 2D SAOD methods under the KITTI 30% annotation setting (AP3D Easy: 21.28 vs. 17.14).

MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

This paper proposes MRD, a training-free multi-resolution retrieval-detection fusion framework that mitigates object fragmentation via cross-resolution semantic fusion and suppresses background interference through an open-vocabulary detector, substantially improving MLLM understanding of high-resolution images.

NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection

NoOVD proposes a framework that, during frozen-VLM-based OVD training, employs a parameter-free K-FPN to preserve CLIP knowledge for discovering potential novel-category objects, applies self-distillation to embed novel-category knowledge into the detector, and introduces R-RPN at inference to improve novel-category recall, achieving SOTA on OV-LVIS, OV-COCO, and Objects365.

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

PaQ-DETR proposes pattern-based dynamic query generation (content-aware weighted combination of shared basis patterns) combined with quality-aware one-to-many assignment (adaptive positive sample selection based on localization–classification consistency), jointly addressing query representation imbalance and supervision sparsity in DETR. It achieves consistent gains of 1.5%–4.2% mAP across multiple backbones.
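The pattern-based query generation amounts to a content-conditioned convex combination of shared basis vectors. A plausible instantiation (the scoring head is an assumption; PaQ-DETR's exact design is not reproduced):

```python
import numpy as np

def dynamic_queries(content, patterns):
    """Content-aware weighted combination of shared basis patterns.

    content:  (N, d) per-query content features
    patterns: (P, d) shared basis patterns
    Softmax over content-pattern similarity gives per-query mixing weights.
    """
    scores = content @ patterns.T                          # (N, P) similarities
    scores -= scores.max(axis=1, keepdims=True)            # numerical stability
    w = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return w @ patterns                                    # (N, d) dynamic queries

rng = np.random.default_rng(2)
q = dynamic_queries(rng.standard_normal((5, 16)), rng.standard_normal((4, 16)))
```

Sharing the basis keeps the parameter count fixed while letting each query specialize per image, which is the stated remedy for query representation imbalance.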

Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

HSA-DINO proposes a multi-scale prompt bank that learns hierarchical semantic prompts from the image feature pyramid to enrich text representations, and employs a semantics-aware router to dynamically determine at inference time whether domain-specific augmentation should be applied. This design achieves a superior balance between domain adaptation and open-vocabulary generalization, attaining the best harmonic mean (H) scores across three vertical-domain datasets.

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

PET-DINO builds a unified object detector supporting both text and visual prompts on top of Grounding DINO. It introduces an alignment-friendly visual prompt generation module (AFVPG) and two prompt-enriched training strategies (IBP and DMD), achieving competitive zero-shot detection performance with significantly less training data.

PHAC: Promptable Human Amodal Completion

This paper introduces Promptable Human Amodal Completion (PHAC), a novel task that accepts point-based user prompts (pose/bounding box) via dedicated ControlNet modules to inject conditional signals, and designs an inpainting-based refinement module to preserve the appearance of visible regions, achieving high-quality and controllable completion of occluded human images.

Prompt-Free Universal Region Proposal Network

PF-RPN replaces text/image prompts with learnable visual embeddings and introduces three modules—Sparse Image-Aware Adapter (SIA), Cascaded Self-Prompting (CSP), and Centrality-Guided Query Selection (CG-QS)—to achieve state-of-the-art zero-shot region proposals across 19 cross-domain datasets using only 5% of COCO training data.

Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

This paper is the first to identify an "astigmatism" phenomenon in cross-domain few-shot object detection (CD-FSOD), wherein model attention remains persistently diffuse in the target domain. Inspired by the human foveal visual system, the authors design three complementary modules — Positive Pattern Refinement (PPR), Negative Context Modulation (NCM), and Text Semantic Alignment (TSA) — to reshape attention, achieving state-of-the-art performance with significant margins across six cross-domain benchmarks.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

This paper proposes Saliency-R1, which uses a logit-decomposition-based efficient saliency map technique and chain-of-thought bottleneck attention rollout to compute alignment between saliency maps and human-annotated bounding boxes as a GRPO reward, training VLMs to focus on task-relevant image regions during reasoning and thereby improving the interpretability and faithfulness of the reasoning process.
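The reward itself is just an overlap measure between a saliency map and annotated boxes. The sketch below uses saliency mass inside the box as a hypothetical stand-in (the paper's exact alignment formula is not reproduced; any such overlap score plugs into GRPO the same way):

```python
import numpy as np

def box_alignment_reward(saliency, box):
    """Fraction of total saliency mass falling inside box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    total = saliency.sum()
    inside = saliency[y1:y2, x1:x2].sum()
    return inside / total if total > 0 else 0.0

sal = np.zeros((8, 8))
sal[2:4, 2:4] = 1.0                             # all saliency mass in one 2x2 patch
r = box_alignment_reward(sal, (2, 2, 4, 4))     # box covers the patch: reward 1.0
half = box_alignment_reward(sal, (2, 2, 3, 4))  # box covers half the mass: 0.5
```

A reasoning trace that attends to the annotated region scores high, so optimizing this reward pushes the VLM toward faithful, grounded attention.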

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification

SDF-Net is proposed to exploit the rigid-body geometric structure of ships as a cross-modal invariant anchor. It enforces structural consistency via gradient energy extracted from intermediate layers, and disentangles modality-shared/specific features at the terminal layer with additive residual fusion, achieving SOTA on HOSS-ReID (All mAP 60.9%, surpassing TransOSS by 3.5%).
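"Gradient energy" here is the usual sum of squared spatial derivatives of a feature map. A generic definition is assumed below (SDF-Net applies it to intermediate-layer features; this sketch is not its implementation):

```python
import numpy as np

def gradient_energy(feat):
    """Sum of squared finite-difference gradients of a 2D feature map."""
    gy, gx = np.gradient(feat)
    return (gx ** 2 + gy ** 2).sum()

e_flat = gradient_energy(np.ones((4, 4)))               # constant map: zero energy
e_ramp = gradient_energy(np.arange(16.0).reshape(4, 4)) # linear ramp: positive energy
```

Because edges and structural outlines dominate this quantity in both optical and SAR imagery, matching it across modalities encourages the shared branch to encode the ship's rigid geometry rather than modality-specific texture.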

Show, Don't Tell: Detecting Novel Objects by Watching Human Videos

This paper proposes the "Show, Don't Tell" paradigm: training bespoke object detectors by watching human demonstration videos, entirely bypassing language descriptions and prompt engineering. A SODC pipeline (HOIST-Former for hand-object detection → SAMURAI for tracking → DBSCAN for spatiotemporal clustering) automatically creates annotated datasets that train a lightweight F-RCNN customized detector (MOD). Without any language prompts, MOD achieves instance-level detection of novel objects, surpassing VLM baselines such as GroundingDINO, RexOmni, and YoloWorld in mAP and precision on the Meccano and in-house datasets, and is integrated end-to-end into a real robotic sorting system.

Small Target Detection Based on Mask-Enhanced Attention Fusion of Visible and Infrared Remote Sensing Images

This paper proposes ESM-YOLO+, a lightweight visible-infrared fusion network for small target detection. It achieves pixel-level cross-modal adaptive fusion via a Mask-Enhanced Attention Fusion (MEAF) module, and introduces a training-time structural representation enhancement to improve spatial discriminability. The method achieves 84.71% mAP on VEDAI while reducing parameter count by 93.6%.
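At its simplest, pixel-level mask-gated fusion blends the two modalities with a soft mask. The sketch below replaces MEAF's learned attention with a given mask (a deliberate simplification, not the module itself):

```python
import numpy as np

def mask_fuse(vis, ir, mask):
    """Pixel-wise fusion: mask in [0, 1] gates visible vs. infrared features."""
    return mask * vis + (1.0 - mask) * ir

vis = np.full((4, 4), 2.0)    # visible-band feature map
ir = np.zeros((4, 4))         # infrared feature map
m = np.full((4, 4), 0.25)     # soft mask favoring the infrared branch
fused = mask_fuse(vis, ir, m)
```

In MEAF the mask is predicted from both inputs, so each pixel can lean on whichever modality is more informative for small, low-contrast targets.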

SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection

This paper proposes Spatial-Projection Alignment (SPAN), which improves the localization accuracy of arbitrary monocular 3D detectors through two geometrically synergistic constraints — 3D corner spatial alignment and 3D-to-2D projection alignment — coupled with a hierarchical task learning strategy, serving as a plug-and-play module.
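The projection-alignment constraint rests on standard pinhole geometry: 3D box corners in the camera frame must reproject onto their 2D evidence. A minimal sketch (the intrinsics and corner coordinates below are illustrative, KITTI-like values, not from the paper):

```python
import numpy as np

def project_corners(corners_3d, K):
    """Project (N, 3) camera-frame 3D points to pixel coordinates via intrinsics K."""
    uvw = corners_3d @ K.T           # homogeneous image coordinates (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide by depth

K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
corners = np.array([[1.0, 0.5, 10.0],
                    [2.0, 0.5, 10.0]])   # two corners 10 m in front of the camera
uv = project_corners(corners, K)
```

Penalizing the discrepancy between these reprojected corners and the predicted 2D geometry gives the detector a differentiable signal that couples depth, size, and orientation errors.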

SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

This paper proposes SpiralDiff, a diffusion framework for RGB-to-RAW conversion that employs a signal-dependent noise weighting strategy to accommodate varying reconstruction difficulty across pixel intensity regions, and introduces a CamLoRA module for lightweight cross-camera adaptation within a single unified model.

The COTe Score: A Decomposable Framework for Evaluating Document Layout Analysis Models

This paper proposes COTe (Coverage, Overlap, Trespass, Excess), a decomposable evaluation framework for Document Layout Analysis (DLA), along with the concept of Structural Semantic Units (SSUs). Compared to conventional IoU/mAP/F1 metrics, COTe more accurately reflects page parsing quality and reveals model-specific failure modes.

Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

This paper introduces CANVAS — the first large-scale, subcellular-resolution Light-Sheet Fluorescence Microscopy (LSFM) whole-brain benchmark dataset, encompassing 6 cell markers, approximately 93,000 annotated cells, and a public leaderboard. It reveals critical generalization failures of existing detection models across markers and brain regions, and explores the potential of 3D Masked Autoencoders (MAE) for self-supervised representation learning.

Towards Intrinsic-Aware Monocular 3D Object Detection

MonoIA proposes converting numerical camera intrinsics into language-guided semantic representations (via LLM-generated intrinsic descriptions encoded by CLIP), and injects them into the detection network through a hierarchical adaptation module. This enables zero-shot generalization to unseen focal lengths and unified cross-dataset training, achieving new state-of-the-art results on KITTI, Waymo, and nuScenes.

UAVGen: Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

This paper proposes UAVGen, a layout-to-image data augmentation framework for UAV-based object detection. It addresses low-quality small object generation, inefficient model capacity allocation, and label inconsistency through a visual prototype conditioned diffusion model and a focal region enhancement pipeline.