🎯 Object Detection

📷 CVPR2026 · 97 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (31) · 🧪 ICML2026 (6) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (27) · 📹 ICCV2025 (28)

🔥 Top topics: Object Detection ×38 · Anomaly Detection ×19 · Few-/Zero-Shot Learning ×13 · Multimodal/VLM ×10 · Reasoning ×5

A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

This paper proposes the Hybrid Ensemble Decoder (HED) and a progressive fine-tuning strategy for cross-domain few-shot object detection (CD-FSOD). By parallelizing part of the decoding layers and introducing prediction diversity through randomly initialized denoising queries, the method achieves SOTA performance on CD-FSOD, ODinW-13, and RF100-VL benchmarks without introducing any additional parameters.

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection

The SeDiR framework is proposed to achieve semantically disentangled unified 3D anomaly detection through three modules: Coarse-to-Fine Global Tokenization (CFGT), Category-Conditional Contrastive Learning (C3L), and Geometric-Guided Decoder (GGD). It addresses the Inter-Category Entanglement (ICE) problem and outperforms SOTA by 2.8% and 9.1% AUROC on Real3D-AD and Anomaly-ShapeNet, respectively.

AKCMamba-YOLO: Selective State Space Models For Real-Time Object Detection

This paper integrates Selective State Space Models (Mamba/SSM) and adaptive kernel convolutions into YOLOv8. By replacing the C2f blocks in the backbone and neck with 3CAKCMamba and 4CAKCMamba modules, it compensates for the "short-range" limitation of standard convolutions while maintaining linear complexity and real-time speed. On COCO2017, the model achieves 46.3% mAP with 14.9G FLOPs (1.4 mAP points higher than YOLOv8-S with 47.9% fewer FLOPs).

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

ANoCo redefines anomaly detection from "how similar is this patch to normal ones" to "how much cost is required to pull this patch back to the normal manifold." By minimizing an anchored bipartite graph Laplacian energy to pull query patches toward the normal manifold, the displacement magnitude itself serves as the anomaly score. This approach requires no training, no message passing, and provides a closed-form solution, achieving new SOTA results on MVTec-AD / VisA in 1/2/4-shot settings.
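The "displacement as score" idea can be illustrated with a toy sketch. This is not the paper's formulation: it assumes a single query patch feature `q`, a small bank of normal patch features `m_j`, softmax-normalized edge weights `w_j`, and the energy `||z - q||^2 + lam * sum_j w_j * ||z - m_j||^2`, whose minimizer is `z* = (q + lam * sum_j w_j * m_j) / (1 + lam)`. The names `lam` and `tau` and the choice of kernel are hypothetical.

```python
import math

def anomaly_score(query, normal_bank, lam=1.0, tau=1.0):
    """Distance a query patch must move to minimize a toy anchored energy."""
    # Softmax edge weights from negative squared distances (assumed kernel).
    logits = [
        -sum((q - m) ** 2 for q, m in zip(query, anchor)) / tau
        for anchor in normal_bank
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Closed-form minimizer: weights sum to 1, so the denominator is 1 + lam.
    z_star = [
        (q + lam * sum(w * anchor[d] for w, anchor in zip(weights, normal_bank)))
        / (1.0 + lam)
        for d, q in enumerate(query)
    ]
    # The displacement magnitude serves as the anomaly score.
    return math.sqrt(sum((z - q) ** 2 for z, q in zip(z_star, query)))
```

Under these assumptions, a query that coincides with its only normal anchor scores zero, and the score grows as the query moves away from the bank. No training or message passing is involved, only one closed-form step per patch.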

AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

AnomalyVFM proposes a general framework that transforms any Vision Foundation Model (VFM) into a robust zero-shot anomaly detector through a three-stage synthetic data generation scheme and a parameter-efficient LoRA adaptation mechanism. Using RADIO as the backbone, it achieves 94.1% image-level AUROC on 9 industrial datasets, outperforming the SOTA by 3.3 percentage points.

AR²-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

AR²-4FV leverages the time-invariance of background structures in fixed-view videos to construct an offline Anchor Bank and an online Anchor Map as persistent language-scene memory. Combined with anchor-guided re-entry priors and a ReID-Gating identity verification mechanism, it achieves robust target re-capture after occlusion or departure, improving RCR by 10.3% and reducing RCL by 24.2%.

Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

AVI-Edit performs "audio-visual synchronized instance-level video editing" on a pre-trained video diffusion backbone. It utilizes a Granularity-Aware Mask Refiner to progressively refine rough user-provided masks (even bounding boxes) into precise instance contours, paired with a Self-Feedback Audio Agent (a separate-generate-remix-rework pipeline) to produce accompanying audio temporally aligned with the edited visuals. It significantly outperforms existing methods in visual quality, condition following, and audio-visual synchronization.

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

BTP applies pre-trained Point-Language Models (PLM, e.g., ULIP) to zero-shot 3D anomaly detection for the first time. It proposes a Multi-Granularity Feature Embedding Module (MGFEM) to fuse patch-level semantics, geometric descriptors, and global CLS tokens. Combined with a joint representation learning strategy, it achieves 84.5% point-level AUROC on Real3D-AD, significantly surpassing the VLM-based rendering approach of PointAD (73.5%).

Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

This paper embeds the hierarchical label tree from remote sensing fine-grained detection into the representation space of DETR. A "Balanced Hierarchical Contrastive Loss" (BHCL) is proposed to achieve gradient balancing via learnable class prototypes, combined with a strategy that decouples classification and localization queries. This allows contrastive learning to act solely on the classification branch without interfering with localization, reaching new SOTA on three hierarchically labeled remote sensing datasets.

BDNet: Bio-Inspired Dual-Backbone Small Object Detection Network

BDNet mimics the LGN/V1–V2–V4 color pathway and the V1–V4 edge pathway of the human visual system to construct a dual-backbone detection network featuring "color enhancement + edge strengthening + hierarchical fusion." Designed to remedy the insufficient feature extraction caused by low color contrast and blurred edges of small objects in remote sensing, it achieves SOTA results on VisDrone2019, NWPU VHR-10, and AI-TODv2 datasets with only 2.59M parameters.

Beyond Caption-Based Queries for Video Moment Retrieval

This work reveals a significant gap between caption-based queries and real user search queries in VMR. It proposes three search query benchmarks and alleviates the decoder query collapse in DETR by removing self-attention and introducing Query Dropout, achieving up to a 21.83% mAPm improvement on multi-moment search queries.

Beyond Duality: A Hybrid Framework of Leveraging Shared and Private Features for RGB-Event Object Detection

SPFD explicitly decouples RGB and event features into three streams—"Shared," "RGB-Private," and "Event-Private"—in the frequency domain using "spectral coherence." These features are then injected into a DETR-based encoder (via adaptive gated fusion of private features) and decoder (through layer-wise asymmetric injection), improving mAP on DSEC-Det from the SOTA of 30.4 to 34.6.

Beyond Prompt Degradation: Prototype-Guided Dual-Pool Prompting for Incremental Object Detection

The PDP framework is proposed to address prompt degradation caused by prompt coupling and prompt drift in incremental object detection. By utilizing dual-pool prompt decoupling (shared pool + private pool) and Prototype-Guided Pseudo-labeling (PPG), the method achieves SOTA performance on MS-COCO and PASCAL VOC.

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval

This paper proposes the Object-Anchored Composed Image Retrieval (OACIR) task, the OACIRR large-scale benchmark (160K+ quadruplets), and the AdaFocal framework. AdaFocal adaptively enhances focus on anchored instance regions through a Context-Aware Attention Modulator, significantly outperforming existing methods in instance-level retrieval fidelity.

Bidirectional Multimodal Prompt Learning with Scale-Aware Training for Few-Shot Multi-Class Anomaly Detection

AnoPLe is proposed as a lightweight multimodal bidirectional prompt learning framework. Without requiring manual anomaly descriptions or external auxiliary modules, it achieves few-shot multi-class anomaly detection through bidirectional text-visual prompt interaction and scale-aware prefixes, maintaining competitive performance on MVTec-AD/VisA/Real-IAD while ensuring efficient inference (~28 FPS).

Black-Box Domain Adaptation for Object Detection with Retention-Driven Knowledge Compression

Under the strictest privacy constraints where only a cloud-based black-box API is accessible (no source data or source model weights), this paper proposes RDKC for cross-domain object detection. Inspired by the "active forgetting + selective consolidation" mechanisms in lifelong learning, RDKC utilizes Memory Retention (MR) to partition candidate boxes by reliability and redistribute prediction scores for noise resistance, and Scene Compression (SC) to guide fine-grained localization through near-far contrastive weighting. RDKC consistently outperforms previous BBDA SOTA methods across four cross-domain benchmarks (e.g., +4.2 mAP gain over DINE on Cityscapes→Foggy).

Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting

The QICA framework is proposed to address the lack of quantity awareness and spatial insensitivity in zero-shot object counting. By utilizing a Synergistic Prompting Strategy (SPS) to jointly adapt vision-language encoders with quantity-conditioned prompts, combined with a Cost Aggregation Decoder (CAD) operating directly on similarity maps to maintain zero-shot transferability, it achieves zero-shot SOTA (12.41 MAE) on FSC-147 and demonstrates strong cross-domain generalization.

Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization

Addressing the issue where detectors easily learn spurious correlations from confounders like "lighting/co-occurrence/style" under data scarcity in single-source domains, this paper proposes the plug-and-play Causal Basis Block (CBB). By implementing causal front-door adjustment via learnable low-rank bases to "estimate two expectations," CBB allows for end-to-end calibration on frozen VFMs (DINOv2/3, SAM, Stable Diffusion). It consistently sets new SOTAs across five DGOD benchmarks (up to +5.4 mAP).

BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection

The authors propose BUSSARD, the first learning-based method for scene-specific anomalous relationship detection. It utilizes pre-trained language model embeddings for scene graph triplets, dimensionality reduction via an autoencoder, and likelihood estimation through normalizing flows. It achieves an AUROC improvement of approximately 10% on the SARD dataset and exhibits robustness to synonym variations.

CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection

CD-Buffer uses a unified domain discrepancy metric to drive two complementary buffers: a subtractive buffer (channel suppression) and an additive buffer (lightweight adapter compensation). Together they achieve robust test-time adaptation of object detectors across varying severities of adverse weather.

CHAL: Causal-guided Hierarchical Anomaly-aware Learning for Moving Infrared Small Target Detection

This work inverts "moving infrared small target detection" from "direct learning of weak target features" to "learning normal background patterns and treating targets as anomalies within the background." By utilizing spatio-temporal neural fields for background evolution modeling, hierarchical anomaly awareness (appearance anomaly → motion consistency verification), and causal backdoor adjustment to sever background confusion paths, the method achieves new SOTA performance on three infrared datasets.

Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection

Addressing the issue where "unconditional cross-modal mapping" in RGB-3D multimodal anomaly detection misidentifies diverse normal variations (e.g., different colors for the same geometry) as anomalies, CPMAD dynamically extracts "consensus prototypes" (cross-modal consistent, anomaly-free subspaces) and "supplementary prototypes" (capturing modality-specific cues ignored by consensus). These complementary prototypes guide cross-modal reconstruction, achieving 97.8% I-AUROC on MVTec-3D while the lightweight version offers 5× faster inference and 2.6× lower memory consumption.

Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

This paper discovers that open-vocabulary detectors output features that drift significantly when the same object appears in different backgrounds (background overfitting). It proposes the CCL framework, which utilizes diffusion models to generate paired "same-object-different-background" samples (CBDG) and enforces background invariance through an intra-class contrastive consistency loss (CCLoss). This approach achieves gains of +16.3 AP on OmniLabel and +14.9 AP on D3 with zero additional inference overhead and model-agnostic properties.

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

To address the "cross-view gap" where Vision-Language Models (VLMs) perform strongly in ground views but poorly in aerial views, CrossVL introduces a Complexity-aware Pathway Aggregation (CPA) module that routes visual features based on scene density (active only during training with zero inference overhead) and a Paired Curriculum Learning (PCL) strategy that transitions from paired to random sampling. CrossVL improves Florence-2's mAP on the MAVREC aerial dataset from 58.66% to 61.03%, reduces the ground-aerial gap from 8.63pp to 6.65pp, and decreases variance across random seeds by 3.3×.

DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

The authors propose DA-Mamba, a CNN-SSM hybrid architecture that achieves image-level and instance-level global-local domain-invariant feature alignment with linear complexity via two modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). It achieves SOTA performance on four domain adaptive detection benchmarks.

Detect Anything via Next Point Prediction

Object detection is reformulated as "generating quantized coordinate token sequences with an MLLM." By combining three components—learnable coordinate tokens, a self-built data engine generating 22 million annotations, and SFT followed by GRPO reinforcement training to rectify behavior—the authors develop Rex-Omni, a 3B model. It surpasses regression-based detectors like DINO and Grounding DINO in zero-shot performance on benchmarks such as COCO, while simultaneously handling eight task categories including referring, pointing, GUI localization, and OCR.
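A minimal sketch of what "quantized coordinate tokens" can look like in practice. The bin count (`num_bins=1000`), the helper names, and the bin-center decoding are assumptions made for illustration; the note does not give Rex-Omni's actual vocabulary or decoding rule.

```python
def box_to_tokens(box, width, height, num_bins=1000):
    """Map a pixel-space (x0, y0, x1, y1) box to quantized coordinate ids."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    # Clamp each normalized coordinate into one of `num_bins` discrete bins.
    return [min(num_bins - 1, max(0, int(v * num_bins))) for v in normalized]

def tokens_to_box(tokens, width, height, num_bins=1000):
    """Decode coordinate ids back to pixel coordinates using bin centers."""
    scale = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scale)]
```

With this scheme a box becomes four integers that a language model can emit one token at a time, and the round-trip error is bounded by one bin width (`width / num_bins` pixels horizontally).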

Detecting Unknown Objects via Energy-Based Separation for Open World Object Detection

This paper proposes the DEUS framework, which effectively separates known, unknown, and background proposals in geometrically orthogonal known/unknown subspaces via ETF-Subspace Unknown Separation (EUS). It further introduces Energy-based Known Distinction (EKD) loss to reduce cross-interference between old and new classes during incremental learning, significantly improving unknown object recall on OWOD benchmarks.

Distribution-Aligned Multimodal Fusion for Robust Object Detection

To address the poor generalization of RGB-Infrared multimodal detection in "unseen degradation scenarios," this paper freezes the pretrained detector and trains only a lightweight fusion module. It explicitly pulls fused features back to the "normal feature distribution \(P_\text{normal}\)" (where the pretrained detector performs best) using complementary information from infrared data, rather than adapting to the degradation distributions seen during training. This achieves SOTA on three benchmarks with a \(4 \times\) training speedup.

Does YOLO Really Need to See Every Training Image in Every Epoch?

This paper proposes the Anti-Forgetting Sampling Strategy (AFSS), which uses each image's learning sufficiency, \(\min(\text{Precision}, \text{Recall})\), to decide dynamically which training images participate in an epoch and which can be skipped. This accelerates training of YOLO-series detectors by more than 1.43× while maintaining or even improving detection accuracy.
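A rough sketch of per-image sufficiency-based sampling. Only the sufficiency measure, min(precision, recall), comes from the summary above; the `threshold`, the periodic `review_every` rule, and the function names are hypothetical stand-ins for whatever schedule the paper actually uses.

```python
def learning_sufficiency(tp, fp, fn):
    """min(precision, recall) for one image; 0.0 when a term is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return min(precision, recall)

def select_images(stats, epoch, threshold=0.85, review_every=5):
    """Return ids of images to train on in this epoch.

    `stats` maps image id -> (true positives, false positives, false negatives)
    measured with the current model.
    """
    selected = []
    for image_id, (tp, fp, fn) in stats.items():
        well_learned = learning_sufficiency(tp, fp, fn) >= threshold
        # Assumed anti-forgetting rule: revisit well-learned images periodically.
        if not well_learned or epoch % review_every == 0:
            selected.append(image_id)
    return selected
```

Taking the minimum means an image counts as well learned only when the detector neither misses its objects nor hallucinates extra ones, so skipping it is less likely to hide a weakness.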

DyFCLT: Dynamic Frequency-Decoupled Cross-Modal Learning Transformer for Multimodal Tiny Object Detection

For Visible-Infrared (RGBT) tiny object detection, DyFCLT first decouples cross-modal features into low/mid/high-frequency sub-bands using learnable dynamic frequency bands, performs Band-Wise Frequency Cross-modal Attention (DFCA) within each sub-band, and utilizes a foreground mask-guided Selective Smoothing and Enhancement (SSE) module to suppress background noise and enhance foreground details. It achieves SOTA AP on two RGBT tiny object benchmarks (48.2 AP on RGBT-Tiny, +9.5 over the previous best multimodal method).

ElasticFormer: Detecting Objects in HRW Shots via Elastic Computing Vision Transformer

ElasticFormer equips a sparse ViT backbone with a lightweight module called ElasticSelector, allowing it to dynamically determine the number of windows retained for local attention based on the "foreground ratio" of the image during the forward pass. This reduces backbone FLOPs by 80% on PANDA gigapixel detection while simultaneously improving AP50.

EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer

This paper proposes the Evolving World Object Detection (EWOD) paradigm and the EW-DETR framework. By synergizing three modules—Incremental LoRA Adapters, Query Norm Objectness Adapter, and Entropy-aware Unknown Mixing—the framework simultaneously addresses class-incremental learning, domain adaptation, and unknown object detection under no-replay constraints, achieving a 57.24% improvement in the FOGS metric.

Expert-Teacher-Student Collaborative Learning for Domain Adaptive Object Detection

To address the complementarity dilemma in domain adaptive object detection (DAOD)—where Visual Foundation Model (VFM) knowledge is too broad and teacher model knowledge is too narrow—this paper proposes the Expert-Teacher-Student (ETS) framework. By treating VFM as a "free lunch" expert model to generate offline pseudo-labels and prototypes, and employing a dual-layer mechanism of ETCT (Label-level Collaborative Teaching) and ETJC (Representation-level Joint Consolidation), the expert and teacher collaboratively supervise the student. ETS outperforms SOTA on three DAOD benchmarks (e.g., reaching 49.8% mAP on Cityscapes→BDD100k, 2.0% higher than DT).

Explaining Object Detectors via Collective Contribution of Pixels

This paper proposes VX-CODE, which explains object detectors using Shapley values (individual contribution) and interactions (collective contribution) from game theory. By utilizing a self-context variant and greedy patch selection, the exponential computation is reduced to a practical level, generating faithful heatmaps that cover both "primary features + collaborative background cues." Insertion/deletion AUC is improved by up to approximately 19% compared to the state-of-the-art (SOTA).
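For reference, the standard Shapley value can be computed exactly for a handful of patches. The sketch below is the textbook definition, not VX-CODE itself: the toy scoring function and patch names are invented for illustration, and the enumeration is exponential in the number of patches, which is the cost that greedy patch selection is meant to avoid.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small set of players (e.g. image patches)."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability that exactly `size` other players precede p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

def toy_score(patches):
    """Hypothetical detector score: "ctx" only helps when "obj" is visible."""
    score = 0.6 if "obj" in patches else 0.0
    if "obj" in patches and "ctx" in patches:
        score += 0.3
    return score
```

In this toy game the 0.3 interaction between `"obj"` and `"ctx"` is split evenly between them, while the irrelevant `"bg"` patch receives zero. This is the sense in which a background cue can earn credit only through its collective contribution with the object.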

FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Alignment

FALCON is proposed as a learning-based mini-batch construction strategy. It utilizes a negative mining scheduler to adaptively balance the trade-off between hard negatives and false negatives, significantly improving cross-modal alignment quality in vision-language pre-training.

FastRef: Fast Prototype Refinement for Few-shot Industrial Anomaly Detection

FastRef formulates "refining normal prototypes with query features" as a nested optimization problem involving feature migration + anomaly suppression. During inference, it uses a transform matrix with a closed-form update to migrate query information into prototypes, while employing Sinkhorn Optimal Transport to suppress incorporated anomalies. As a plug-and-play module for PatchCore, WinCLIP, and AnomalyDINO, it consistently improves detection and localization AUROC under 1/2/4-shot settings while meeting real-time requirements.
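The Sinkhorn step mentioned above can be sketched generically. This is the standard entropic optimal transport iteration in plain Python; how FastRef builds the cost matrix and uses the resulting plan to suppress anomalies is not stated in this note, so `cost`, `epsilon`, and the marginals here are placeholders.

```python
import math

def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, iterations=200):
    """Entropic optimal transport plan via Sinkhorn scaling iterations."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * len(row_mass)
    v = [1.0] * len(col_mass)
    for _ in range(iterations):
        # Rescale rows so the plan matches the row marginals.
        u = [
            r / sum(k * vj for k, vj in zip(row, v))
            for r, row in zip(row_mass, kernel)
        ]
        # Rescale columns so the plan matches the column marginals.
        v = [
            c / sum(kernel[i][j] * u[i] for i in range(len(u)))
            for j, c in enumerate(col_mass)
        ]
    return [
        [u[i] * kernel[i][j] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]
```

Because each iteration is only a pair of matrix-vector rescalings, the procedure is cheap enough to run at inference time, which fits the real-time claim in the summary.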

FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

FB-CLIP addresses the "foreground-background feature entanglement" problem in CLIP-based fine-grained zero-shot anomaly detection by treating text and vision paths simultaneously: the text side fuses EOT, global pooling, and attention tokens for richer semantic prompts, while the vision side softly separates foreground from background via identity, semantic, and spatial perspectives, applying background subtraction to suppress residual interference. Combined with semantic consistency regularization, it achieves SOTA AUPRO across 16 industrial and medical datasets.

Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection

The FALCON-SFOD framework is proposed to enhance object-focused representations in source-free object detection. It regularizes the detector's feature space using class-agnostic binary masks generated by a foundation model (OV-SAM) via Spatial Prior-Aware Regularization (SPAR) and incorporates an Imbalance-Aware Noise-Robust Pseudo-Label loss (IRPL). The method achieves SOTA results across multiple benchmarks.

Fourier Angle Alignment for Oriented Object Detection in Remote Sensing

Leveraging Fourier rotation equivariance to estimate principal orientations in the frequency domain for feature alignment, this paper proposes two plug-and-play modules, FAAFusion and FAA Head. These modules address cross-scale directional incoherence in FPN and task conflict between classification and regression in detection heads, respectively, achieving new SOTA results on DOTA-v1.0/v1.5 and HRSC2016.

From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

FDTA identifies that "excessively high inter-class similarity" in object embeddings produced by DETR is the root cause of poor association accuracy in end-to-end MOT. Consequently, three lightweight adapters—Spatial (Depth), Temporal (Trajectory), and Identity (Contrastive Learning)—are attached to a shared DETR to explicitly refine embeddings from the perspectives of spatial continuity, temporal dependence, and instance discriminativeness. This achieves SOTA performance across HOTA, IDF1, and AssA on DanceTrack, SportsMOT, and BFT.

FSLoRA: Harmonizing Detection and Re-Identification via Freq-Spatial Low-Rank Adapter for One-Stage Person Search

FSLoRA utilizes LoRA as a "layer-wise feature decoupler" integrated into the entire backbone. By employing Spatial MoE Routing (SLM) and Frequency-domain decomposition (FLM), the model separates shared detection features and ReID identity features at the bottom layers. This plug-and-play approach achieves new SOTA performance across multiple one-stage person search frameworks with <2% additional parameters.

GMT: Effective Global Framework for Multi-Camera Multi-Target Tracking

GMT reformulates the traditional two-stage pipeline of "Single-Camera Tracking (SCT) + Inter-Camera Association (ICT)" into a unified "Global Tracklet-to-Target" association. It first uses the CFCE module to align appearance and spatial features across different views into a consistent space, then employs a DETR-style GTA module to directly match new detections with global tracklets that encode multi-view historical information. The method achieves state-of-the-art results in metrics such as IDF1 and CVIDF1 across six datasets, including the large-scale self-collected VisionTrack.

GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection

GPFlow models the continuous distribution of "normal" samples using a set of learnable Gaussian prototypes (mean + diagonal covariance + mixture weights), then iteratively contracts input features towards the posterior mean of the Gaussian mixture via an analytically solvable "Posterior Mean Path (PMP) router." This naturally realizes a "covariance-aware information bottleneck," significantly outperforming previous SOTA methods such as FIND in few-shot industrial multi-modal (RGB+3D) anomaly detection with only 5/10/50 normal samples.
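A simplified reading of the contraction step, written as a sketch. It assumes the "posterior mean" is the responsibility-weighted average of the prototype means under a diagonal Gaussian mixture and that each step interpolates toward it with a fixed rate `alpha`; both the interpolation rule and the names `steps` and `alpha` are assumptions, not the paper's PMP router.

```python
import math

def posterior_mean(x, means, variances, weights):
    """Responsibility-weighted mean of prototypes for a diagonal GMM."""
    log_r = []
    for mu, var, pi in zip(means, variances, weights):
        log_pdf = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_r.append(math.log(pi) + log_pdf)
    top = max(log_r)  # stabilize the softmax over components
    resp = [math.exp(l - top) for l in log_r]
    norm = sum(resp)
    resp = [r / norm for r in resp]
    return [sum(r * mu[d] for r, mu in zip(resp, means)) for d in range(len(x))]

def contract(x, means, variances, weights, steps=3, alpha=0.5):
    """Iteratively pull a feature toward the mixture's posterior mean."""
    for _ in range(steps):
        target = posterior_mean(x, means, variances, weights)
        x = [(1 - alpha) * xi + alpha * ti for xi, ti in zip(x, target)]
    return x
```

The per-dimension variances enter through the responsibilities, so a feature is drawn toward the prototype that explains it best after accounting for how widely that prototype spreads in each dimension.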

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

GS-CLIP is a two-stage framework that injects global shape and local defect information of 3D point clouds into text prompts via a Geometry Defect Distillation Module (GDDM). It synergistically fuses rendered images and depth maps using a LoRA-based dual-stream architecture, achieving SOTA performance in zero-shot 3D anomaly detection across four large-scale datasets.

Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

To address the sharp performance drop of Referring Object Detection (ROD) models when "annotations are scarce," this paper first defines a low-data/few-shot De-ROD evaluation protocol and subsequently proposes HeROD. Interpretable spatial orientation priors and visual semantic priors, derived directly from referring phrases, are injected into three stages of the DETR detection pipeline (candidate ranking, final prediction, and Hungarian matching) as heuristic costs similar to A*. In extremely low-data (0.1%–5%) and few-shot settings on RefCOCO/+/g, HeROD consistently achieves gains of 3–16 points compared to Grounding DINO and UNINEXT.

Incremental Object Detection via Future-Aware Decoupled Cross-Head Distillation

To address the issue where "detection head bias contaminates backbone features leading to distillation failure" in incremental object detection, this paper proposes FaCHD. It utilizes two frozen teachers—a historical teacher and an intermediate teacher—to perform cross-head decoding of student ROI features for feature distillation. This decouples the classification head from the backbone. Combined with RPSC multi-granularity prototype semantic drift compensation for retraining the classification head, it achieves a new SOTA for non-exemplar-based methods on VOC and COCO incremental benchmarks.

InsCal: Calibrated Multi-Source Fully Test-Time Prompt Tuning for Object Detection

This paper extends test-time prompt tuning (TPT) from classification to text-driven object detection and identifies that entropy minimization leads to overconfidence and miscalibration. Consequently, the authors propose InsCal, which aggregates multi-domain knowledge through multi-source prompt tuning, narrows domain gaps via text-guided style augmentation, and suppresses overconfidence using instance-level calibration entropy. On cross-domain detection benchmarks, InsCal reduces the Detection Expected Calibration Error (D-ECE) from approximately 20% to 10% while improving mAP.
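To make the calibration metric concrete, here is a sketch of the plain expected calibration error over confidence bins. It is a simplification: detection-oriented variants such as D-ECE may bin over additional box properties, which is not shown, and `num_bins=10` is an arbitrary choice.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(num_bins - 1, int(conf * num_bins))
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes its |confidence - accuracy| gap, weighted by size.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

For example, four detections all scored at 0.9 of which only two are correct give an error of 0.4. That is the kind of overconfidence that entropy minimization tends to amplify.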

InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

InvAD shifts diffusion-based anomaly detection from the "RGB space denoising reconstruction" paradigm to a "latent space noise-adding inversion" paradigm. It infers the final latent variables directly through DDIM inversion and detects anomalies by measuring their deviation under the prior distribution, reaching SOTA performance with only 3 inversion steps while increasing inference speed by approximately 2x.

Learning to Track Instance from Single Nature Language Description

SVLTrack proposes a completely box-annotation-free self-supervised vision-language tracking framework. It utilizes a Large Vision-Language Model (LVLM) to generate a pseudo-box for the first frame of a video, performs forward/backward tracking self-supervision under weak-to-strong consistency, and designs a Dynamic Token Aggregation (DTA) module to tightly align language tokens with a few key visual tokens. Ultimately, it tracks arbitrary targets based solely on a single natural language description, surpassing existing self-supervised methods across four VL tracking benchmarks.

Mind the Gap: Transferring Labels to Align Object Detection Datasets

This paper proposes the Label-Aligned Transfer (LAT) framework, which projects annotations from multiple detection datasets with diverse labeling protocols into a fixed target dataset's label space in a multi-to-one manner. By utilizing a Privileged Proposal Generator (PPG) (replacing RPN with ground truth and cross-dataset pseudo-labels) and Semantic Feature Fusion (SFF) (denoising via class-aware attention), the method simultaneously resolves inconsistencies in category semantics and bounding box styles, achieving up to a +8.4 AP improvement across multiple benchmarks.

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

This paper proposes InCoM-Net, which extracts three levels of context—intra-instance, inter-instance, and global—separately for each instance from VLM features. Through progressive context aggregation and fusion with detector features, it achieves SOTA results in HOI detection on HICO-DET (Full mAP 43.96) and V-COCO (\(AP_{role}^{S1}\) 73.6).

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with MLLMs

MMR-AD constructs the largest reasoning-based multimodal industrial anomaly detection dataset to date (127K images, 188 product categories, 395 anomaly types) and proposes the Anomaly-R1 baseline model based on GRPO reinforcement learning, which significantly outperforms general MLLMs.

MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection

This paper proposes MoECLIP, which introduces the Mixture-of-Experts architecture into Zero-Shot Anomaly Detection (ZSAD). By employing Frozen Orthogonal Feature Separation (FOFS) and Equiangular Tight Frame (ETF) loss, it achieves patch-level dynamic expert routing and specialization, reaching SOTA performance across 14 industrial and medical benchmarks.
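For readers unfamiliar with the term, a simplex Equiangular Tight Frame is a set of K unit vectors whose pairwise cosine similarity is exactly -1/(K-1), the most evenly separated arrangement possible. The sketch below only constructs that standard frame; how MoECLIP turns it into a loss over experts is not described in this note.

```python
import math

def simplex_etf(k):
    """Return k unit vectors in R^k with pairwise cosine -1/(k-1)."""
    scale = math.sqrt(k / (k - 1))
    # Row i is scale * (e_i - 1/k * ones): a centered, rescaled one-hot vector.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Penalizing deviation from such a frame is one common way to push a set of learned directions (here, presumably expert representations) toward maximal mutual separation.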

MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

This paper proposes MRD, a training-free multi-resolution retrieval-detection fusion framework that alleviates object fragmentation through multi-resolution semantic fusion and suppresses background interference with an open-vocabulary detector, significantly enhancing MLLM capabilities for high-resolution image understanding.

NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection

The NoOVD framework is proposed to discover potential novel category objects during frozen VLM-based OVD training by preserving CLIP knowledge with a parameter-free K-FPN. It embeds novel category knowledge into the detector via self-distillation and enhances the recall of novel categories during inference using R-RPN, achieving SOTA results on OV-LVIS, OV-COCO, and Objects365.

Object-Generalized Re-Identification: A Step Towards Universal Instance Perception

This paper proposes the Object-Generalized ReID (OG-ReID) paradigm: using a unified model to recognize the "same instance" of heterogeneous objects such as people, vehicles, animals, ships, and buildings. The MGOR framework is designed to reinterpret meta-learning as "semantic distribution regularization," outperforming existing ReID methods on 100+ unseen categories without target domain adaptation.

Omni-AD: A Large-scale and Versatile Benchmark for Industrial Anomaly Detection

Omni-AD is an Industrial Anomaly Detection (IAD) benchmark collected from real production lines, covering 150 categories across 16 industries with approximately 35,000 pixel-level annotated images; it supports both traditional unsupervised IAD evaluation and, for the first time, introduces three progressive subtasks—"discrimination-classification-localization"—for Multimodal Large Language Models (MLLMs). Experiments demonstrate that both existing methods and MLLMs are far from saturated on this dataset.

Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision

DetGain is the first truly effective online data curation method for object detection. Instead of relying on unstable training losses, it estimates the "marginal contribution" of each image to the "dataset-level mAP." By using the teacher–student contribution difference as the learnability signal to select the most informative samples in each iteration, it is architecture-agnostic and plug-and-play. It brings stable improvements of up to +2.7 mAP for various detectors on COCO and up to +6.9 mAP under low-quality data.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

This paper proposes PALM, a unified VLA framework that uses structured fine-grained affordance predictions (global, local, spatial, and dynamic) as implicit reasoning anchors, combined with continuous sub-task progress estimation for seamless task switching. It achieves an average completion length of 4.48 on CALVIN ABCD (surpassing the previous SOTA by 12.5%), a 91.8% success rate on LIBERO-LONG, and over 2x the baseline performance in real-world long-horizon generalization tests.

PaQ-DETR: Learning Pattern and Quality-Aware Dynamic Queries for Object Detection

PaQ-DETR proposes dynamic query generation based on shared patterns (content-aware weighted combinations of shared base patterns) combined with quality-aware one-to-many assignment (adaptive positive sample selection based on localization-classification consistency). This uniformly addresses the imbalance in query representation and supervision in DETR, achieving stable gains of 1.5%-4.2% mAP across various backbones.

Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

HSA-DINO proposes a multi-scale prompt bank to learn hierarchical semantic prompts from image feature pyramids to enhance text representations. In parallel, a semantic-aware router dynamically determines whether to apply domain-specific augmentation during inference, achieving a superior balance between domain adaptation and open-vocabulary generalization (obtaining the best H-mean scores across three vertical domain datasets).

Parameterized Prompt for Incremental Object Detection

To address the failure of the "prompts pool" in Incremental Object Detection (IOD) caused by the inherent co-occurrence phenomenon in detection scenarios, this paper replaces the discrete prompts pool with a parameterizable MLP bottleneck. Combined with task-vector-based prompt fusion and sparse loss, this approach allows old task knowledge to be holistically preserved and updated, achieving SOTA results on PASCAL VOC2007 and MS COCO.

Partial Weakly-Supervised Oriented Object Detection

This paper introduces the new setting of "Partial Weakly-Supervised Oriented Object Detection (PWOOD)", which uses only a small amount of weak annotations (horizontal boxes or single points) combined with a large amount of unlabeled data. With a teacher-student framework (OS-Student) that learns orientation and scale from weak labels, plus a Class-agnostic Pseudo-label Filtering (CPF) mechanism based on Gaussian Mixture Models, the method comes close to or even surpasses semi-supervised methods that use rotated boxes on DOTA/DIOR, at a significantly lower annotation cost.

PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

PET-DINO constructs a universal object detector based on Grounding DINO that simultaneously supports text and visual prompts. It introduces an Alignment-Friendly Visual Prompt Generation (AFVPG) module and two prompt-enriched training strategies (IBP and DMD), achieving competitive performance on zero-shot detection tasks with significantly less training data.

PHAC: Promptable Human Amodal Completion

A new task titled Promptable Human Amodal Completion (PHAC) is proposed. By utilizing point-based user prompts (pose/bounding box) coupled with ControlNet for conditional signal injection, and a refinement module based on inpainting to preserve the appearance of visible regions, high-quality and controllable completion of occluded human images is achieved.

Portable Active Learning for Object Detection

PAL proposes an active learning framework that only reads detector inference outputs without modifying internal models or training pipelines. It estimates the probability of each detection being a True Positive (TP) or False Positive (FP) using a lightweight logistic regression classifier based on two features—"pre-NMS box count + confidence"—and uses entropy as the Instance Uncertainty Score (LIUS). This is combined with three image-level signals (GUIDE) for diversity and class-balanced selection. PAL achieves higher detection accuracy with fewer annotations than baselines like PPAL on COCO, VOC, and BDD100K.
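
The scoring step described above can be sketched as follows. This is a minimal illustration, not PAL's implementation: the coefficient values, the function names, and the choice to average entropies into an image-level score are all assumptions made for the example.

```python
import math

def tp_probability(pre_nms_count, confidence, weights, bias):
    """Logistic model over the two features; returns P(detection is a TP)."""
    z = weights[0] * pre_nms_count + weights[1] * confidence + bias
    return 1.0 / (1.0 + math.exp(-z))

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of the TP/FP Bernoulli variable."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def image_uncertainty(detections, weights, bias):
    """Mean per-detection entropy; the mean aggregation is an assumption."""
    if not detections:
        return 0.0
    scores = [binary_entropy(tp_probability(n, c, weights, bias))
              for n, c in detections]
    return sum(scores) / len(scores)

# Hypothetical coefficients; in practice they would be fitted on a small
# set of detections with known TP/FP labels.
WEIGHTS, BIAS = (0.05, 6.0), -4.0
```

A detection whose predicted TP probability sits near 0.5 gets the highest entropy, so images full of such borderline detections would be prioritized for annotation.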

Prompt-Free Universal Region Proposal Network

PF-RPN replaces text/image prompts with learnable visual embeddings. Through three modules—Sparse Image-Aware Adapter, Cascade Self-Prompting, and Centerness-Guided Query Selection—it achieves SOTA zero-shot region proposal across 19 cross-domain datasets using only 5% of COCO data for training.

RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection

RARE employs "ranking + retrieval" mechanisms to unify and solve two persistent issues in monocular 3D detection: it transforms confidence estimation from absolute regression to learning relative rankings, and constructs a set of queries for each object to predict multiple plausible 3D hypotheses, retrieving the optimal solution based on learned confidence. It outperforms several monocular SOTA methods on KITTI and nuScenes.
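
To make the "rank, then retrieve" idea concrete, here is a hedged sketch. The pairwise hinge loss, the margin value, and the use of a scalar quality measure per hypothesis are assumptions chosen for illustration; the summary does not specify the loss RARE actually uses.

```python
def pairwise_ranking_loss(confidences, qualities, margin=0.1):
    """Hinge loss over every pair where hypothesis i is better than j.

    `qualities` is a hypothetical per-hypothesis quality measure
    (for example, overlap with the ground-truth box).
    """
    loss, pairs = 0.0, 0
    for i in range(len(confidences)):
        for j in range(len(confidences)):
            if qualities[i] > qualities[j]:
                # Penalize unless confidence i beats confidence j by the margin.
                loss += max(0.0, margin - (confidences[i] - confidences[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0

def retrieve_best(hypotheses, confidences):
    """Pick the hypothesis with the highest learned confidence."""
    best = max(range(len(hypotheses)), key=lambda i: confidences[i])
    return hypotheses[best]
```

Only the ordering of the confidences matters for the loss, which is the point of replacing absolute regression with relative ranking.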

RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation

This paper proposes Robot-Conditioned Normalizing Flow (RC-NF), which models the joint distribution of robot states and object trajectories through conditional normalizing flows. It achieves <100ms real-time anomaly detection and serves as a plug-and-play monitoring module for VLA models (e.g., π₀), supporting task-level replanning and state-level trajectory homing.

Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision

The authors propose two modules, ReAL and CGRO, which generate pixel-level anomaly maps by extracting anomaly-related tokens from the MLLM autoregressive reasoning process and aggregating their visual attention. Combined with consistency-guided reinforcement learning to align reasoning with visual evidence, the system achieves end-to-end anomaly detection, localization, and explainable reasoning using only image-level supervision.

Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

This work first identifies the "astigmatism" phenomenon in Cross-Domain Few-Shot Object Detection (CD-FSOD), where model attention remains persistently dispersed in the target domain. Inspired by the human foveal vision system, three complementary modules—Positive Pattern Refinement (PPR), Negative Context Modulation (NCM), and Textual Semantic Alignment (TSA)—are designed to reshape attention, achieving SOTA performance on six cross-domain benchmarks by a significant margin.

Rotation Invariant and Symmetry Aware Pixel Difference Network for Remote Sensing Object Detection

This work integrates "continuous rotation invariance" and "structural symmetry" geometric priors directly into the convolutional kernel by proposing the RIS-PDC operator (Pixel Difference + Polar Harmonic Symmetry Kernel + SO(2) 8-direction kernel rotation averaging). As a plug-and-play replacement for convolutions in mainstream remote sensing detectors, it achieves 78.53% mAP on DOTA-v1.0 single-scale without increasing parameters.

Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

This paper proposes Saliency-R1, which leverages an efficient saliency map technique based on logit decomposition and Chain-of-Thought (CoT) bottleneck attention backtracking. By using the alignment between saliency maps and human-annotated bounding boxes as a GRPO reward, the model is trained to focus on task-relevant image regions during inference, enhancing the interpretability and faithfulness of VLM reasoning.

See What We Cannot See: A Geo-guided Reasoning Benchmark for Object Counting under Adverse Earth Observation Conditions

This work proposes GROC—the first large-scale benchmark for "geo-guided reasoning counting under adverse Earth observation conditions" (14K images, 1.2M point annotations, with each image aligned to land use / map / DSM geo-modalities alongside clear-degraded pairs). Constructed via a controllable degradation + interactive annotation data engine, it includes a GROC Agent (GPT-5 backbone calling expert counting tools) as a baseline. The study systematically reveals that existing counting models suffer significant performance drops when visual cues are occluded by clouds/fog or low light, whereas geo-modalities provide stable structural and contextual priors that significantly enhance robustness.

Seeing Through the Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

Addressing the issue where enhancing high-frequency features simultaneously increases false alarm rates in infrared small target detection, this paper proposes a noise-suppression feature pyramid network (NS-FPN) from a frequency-domain perspective. By replacing the 1×1 convolutions and upsampling in the FPN with a Low-frequency-guided Feature Purification (LFP) module and a Spiral-aware Feature Sampling (SFS) module, it significantly reduces false alarms and improves localization accuracy with almost no added computational cost.

SFR-Net: Steering-Fusion-Refining Network in Multi-label Zero-Shot Sewer Defect Detection

SFR-Net adapts CLIP to sewer defect scenarios using a three-stage "Steering (RS) → Fusion (MEF) → Refining (GR)" pipeline. It employs lightweight adapters to steer representations toward the pipe domain, fuses global and local evidence for initial scoring, and uses a GCN to learn a transferable "score refinement logic" from seen classes to unseen ones. It achieves SOTA on Sewer-ML and the self-collected WZ-Pipe datasets (e.g., 12.58% mAP on Sewer-ML ML-ZSL, approximately double the second-best method).

Spike-driven Discrete Aggregation for Event-based Object Detection

For event-based object detection, this paper proposes a "Discrete Aggregation" approach—utilizing the threshold-firing mechanism of spiking neurons to adaptively select and aggregate only informative events (SDA module + Gated Recurrent Spiking Neuron + Multi-Timescale Fusion). It achieves 43.4% mAP50:95 on Gen1 with fewer parameters, outperforming the previous fully spiking SOTA by 4.5%.

SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras

The authors propose SpiralDiff, a diffusion framework for RGB-to-RAW conversion that employs a signal-dependent noise weighting strategy to adapt to reconstruction difficulties across different pixel intensity regions and introduces the CamLoRA module to achieve lightweight cross-camera adaptation within a single model.

SRA-Det: Learning Omni-Grained Open-Vocabulary Detection Beyond Category Names

Addressing the issue where open-vocabulary detectors only match "category names" and remain insensitive to fine-grained attributes like color, material, and pattern, SRA-Det uses learnable retrieval queries to extract multiple semantic "facets" from text tokens. It employs soft-min matching as a "logic AND" to ensure all facets are satisfied. Combined with an attribute-augmented pipeline that uses LLMs for generation and CLIP for dual-verification, SRA-Det achieves 54.9 mAP on FG-OVD and maintains 40.4 AP on LVIS under zero-shot settings.
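
The "soft-min as logical AND" idea can be illustrated with a small sketch. The softmax-weighted form and the temperature value below are assumptions; the summary does not give SRA-Det's exact formula.

```python
import math

def soft_min(scores, temperature=0.05):
    """Softmax(-s / T)-weighted average of per-facet similarities.

    Tends to min(scores) as T -> 0 and to the plain mean as T grows,
    so a single unmatched facet pulls the overall score down.
    """
    lowest = min(scores)
    # Subtracting the minimum keeps the exponentials numerically stable.
    weights = [math.exp(-(s - lowest) / temperature) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

For example, a region matching "red", "leather", and "striped" with similarities `[0.9, 0.85, 0.88]` keeps a high score, while `[0.9, 0.85, 0.1]` drops to roughly the weakest facet, which a plain average would hide.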

SL-HOI: Streamlined Open-Vocabulary Human-Object Interaction Detection

SL-HOI utilizes a single frozen DINOv3 (dino.txt variant) for open-vocabulary HOI detection—using the backbone for precise localization and a text-aligned vision head for open-vocabulary interaction classification. By "inserting interaction queries and image tokens together into the frozen vision head," the representation gap is bridged. With only a small number of trainable parameters, it achieves SOTA performance on SWiG-HOI and HICO-DET.

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

SubspaceAD demonstrates that fitting a single PCA on features from a strong vision foundation model (DINOv2-G) is sufficient to outperform all few-shot anomaly detection methods requiring training, memory banks, or prompt tuning. In a 1-shot setting, it achieves 98.0% Image-level AUROC and 97.6% Pixel-level AUROC on MVTec-AD.
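
A minimal sketch of subspace-based scoring follows, using only the standard library. It assumes the anomaly score is the residual left after projecting a feature onto the PCA subspace fitted on normal features; that scoring rule, the power-iteration solver, and all function names are illustrative assumptions rather than the paper's code.

```python
import math
import random

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_subspace(features, k, iters=100, seed=0):
    """Return (mean, components): top-k principal directions of normal features."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in features]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            # Deflation: remove directions that were already extracted.
            for c in components:
                proj = _dot(v, c)
                v = [vi - proj * ci for vi, ci in zip(v, c)]
            # Multiply by X^T X without forming the covariance matrix.
            coeffs = [_dot(row, v) for row in centered]
            v = [sum(coeffs[i] * centered[i][j] for i in range(n))
                 for j in range(d)]
            norm = math.sqrt(_dot(v, v))
            if norm == 0.0:
                break  # no variance left in the remaining directions
            v = [vi / norm for vi in v]
        components.append(v)
    return mean, components

def anomaly_score(feature, mean, components):
    """Norm of the part of the feature that lies outside the normal subspace."""
    x = [fi - mi for fi, mi in zip(feature, mean)]
    recon = [0.0] * len(x)
    for c in components:
        a = _dot(x, c)
        recon = [r + a * ci for r, ci in zip(recon, c)]
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, recon)))
```

In a real pipeline the inputs would be patch features from the foundation model, and the per-patch scores would be reshaped into an anomaly map.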

Target-Aware Invertible Encoder with Reconstruction Guidance for Infrared Small Target Detection

InvDet utilizes an invertible encoder to transform the "information loss of infrared small targets caused by downsampling" into an observable and optimizable quantity. It employs a forward path for detection and a backward path for input reconstruction. By using TARM to focus reconstruction on the target and GCTM as a pixel-level weight map (replacing IoU) to supervise the reconstruction, it achieves competitive accuracy on five infrared benchmarks and demonstrates strong cross-dataset generalization.

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Thermal-Det utilizes "RGB-to-Thermal" translation to synthesize million-scale thermal data with text annotations for pre-training. It then employs a frozen RGB open-vocabulary detector as a teacher, transferring open-vocabulary capabilities to a thermal student through triple-path distillation (box/semantics/confidence). By combining a Thermal Text Alignment Head (TTAH) and thermal LLM caption supervision to calibrate the CLIP text space, it achieves zero-shot open-vocabulary thermal detection without any thermal annotations, outperforming RGB open-vocabulary detectors by 2–4% AP across 7 thermal benchmarks.

Toward Generalizable Whole Brain Representations with High-Resolution Light-Sheet Data

This paper proposes CANVAS, the first large-scale subcellular-resolution Light-Sheet Fluorescence Microscopy (LSFM) whole-brain benchmark dataset. It covers 6 cell markers, includes ~93,000 cell annotations and a public leaderboard, reveals the severe inadequacies of existing detection models in cross-marker and cross-region generalization, and explores the potential of 3D Masked Autoencoders (MAE) for self-supervised representation learning.

Towards an Incremental Unified Multimodal Anomaly Detection: Augmenting Multimodal Denoising From an Information Bottleneck Perspective

This paper proposes IB-IUMAD, a unified framework for industrial multimodal anomaly detection (RGB+Depth) that enables a single model to learn new objects incrementally. By employing a Mamba decoder to decouple spurious feature coupling between objects and an Information Bottleneck Fusion Module to filter redundant information from fused features, the framework significantly mitigates catastrophic forgetting in incremental learning. It consistently outperforms SOTA methods on MVTec 3D-AD and Eyecandies.

Towards Persistence: Learning Topological Constraints for Event-based Small Object Detection

To address the issue of fractured small object trajectories in event camera point clouds, this paper proposes SpTopoNet. It employs a "Topological Learning Module + Spatial Consistency Module" to implicitly encode trajectory connectivity within the network, and an EvTopoLoss based on persistent homology to explicitly constrain the trajectory topology. This approach improves the IoU from 55.18% to 66.62% on the EV-UAV benchmark.

Tri-Modal Fusion Transformers for UAV-based Object Detection

To address the failure of single sensors under low light, motion blur, and rapid scene changes in UAV applications, this paper employs a dual-stream hierarchical MiT Transformer to perform gated and bi-directional token exchange fusion across multiple resolution levels of the backbone for RGB, Thermal, and Event modalities. The authors release the first synchronized and aligned tri-modal UAV dataset (10,489 frames / 24,223 vehicle boxes). Through 61 sets of ablations, they systematically answer "at which layer and with which operator should tri-modal fusion occur," achieving an mAP of 84.24%.

UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection

Aiming at the challenges of UAV detection in low-altitude complex backgrounds—characterized by "low contrast, weak boundaries, and high confusion with cluttered textures"—this paper constructs the UAV-CB dataset (3,442 image pairs, 5 background categories) with deliberately sampled camouflaged/complex scenes. It further proposes LFBNet, which performs alignment in the local frequency domain: first unifying the amplitude and phase of both modalities in the frequency domain, and then using frequency cues to guide spatial deformable registration. Ultimately, it achieves an AP(0.5:0.95) of 54.4% on UAV-CB, outperforming the previous best multimodal baseline C2Former by 5.3 points.

UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression

UniMMAD is proposed as the first unified framework capable of handling multi-modal and multi-class anomaly detection using a single set of parameters. Its core is an MoE-driven feature decompression mechanism that adaptively decomposes general multi-modal encoded features into domain-specific single-modal reconstructions. It achieves SOTA performance across 9 datasets involving 3 domains, 12 modalities, and 66 categories.

UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

This paper proposes UniSpector, an open-set industrial defect detection framework. By integrating Spatial-Spectral Prompt Encoding (SSPE) and Angular-Margin Contrastive Prompt Encoding (CPE), it addresses the prompt embedding collapse issue. On the newly constructed Inspect Anything benchmark containing 360 defect categories, it outperforms the best baselines by 19.7% and 15.8% in AP50 for detection and segmentation, respectively.

Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

UAVGen uses diffusion models to synthesize annotated training data for UAV object detection. It replaces blurry small object layout conditions with high-quality reference instances via "visual prototypes," generates images only within target-dense "focal regions," and refines labels using a detector back-check. On VisDrone, it improves mAP from 24.5 to 25.9 using only 738 synthetic images.

VisualAD: Language-Free Zero-Shot Anomaly Detection via Vision Transformer

The necessity of the text branch in Zero-Shot Anomaly Detection (ZSAD) is revisited, leading to the proposal of VisualAD—a pure vision framework. By inserting two learnable tokens (anomaly/normal) into a frozen ViT, combined with Spatial-Aware Cross-Attention and a Self-Alignment Function, the model achieves SOTA performance across 13 industrial and medical benchmarks without requiring a text encoder.

ViTPrompt: Training-Free Prompt Refinement with Visual Tokens for Open-Vocabulary Detection

Addressing the issue where boxes remain unrefined under domain shift in open-vocabulary detection, ViTPrompt concatenates RoI visual tokens of high-confidence targets from the first-pass detection into the text prompts. By re-running Grounding DINO, it refreshes bounding boxes and classification scores simultaneously via a training-free two-stage inference, achieving SOTA on multiple ODD benchmarks.

WeDetect: Fast Open-Vocabulary Object Detection as Retrieval

This work treats open-vocabulary detection entirely as a "region × text" retrieval matching problem. It utilizes a non-fusion dual-tower structure, WeDetect, to achieve real-time SOTA detection. By freezing WeDetect, a general proposal generator WeDetect-Uni is derived (supporting the new task of local object retrieval). Finally, WeDetect-Ref reframes REC by transforming an LLM into a classifier for parallel scoring in a single forward pass, achieving both high precision and high throughput across 15 benchmarks.

When Transformers Meet Mamba: A Hybrid Transformer-Mamba Network for Video Object Detection

TMambaDet establishes a clear division of labor between Transformer and Mamba in video object detection: intra-frame spatial modeling is performed by an adaptive deformable Transformer, inter-frame temporal modeling is handled by a bidirectional Mamba with linear complexity, and the decoder interleaves both to align queries with spatio-temporal features. It achieves 87.9% mAP on ImageNet VID with ResNet-101 at 20.6 ms per frame.

YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

YOLO-Master integrates sparse MoE (ES-MoE blocks) into the YOLO backbone, enabling the network to dynamically activate different experts according to image complexity. It achieves 42.4% AP with a 1.62ms latency on MS COCO, surpassing YOLOv13-N by 0.8% mAP while being 18% faster.
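
Sparse expert activation is commonly implemented with top-k gating, sketched below. This is a generic illustration under stated assumptions: the internals of the ES-MoE block are not described in the summary, and the gate logits, expert functions, and `k` here are hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k largest gate logits."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    # Softmax restricted to the selected experts.
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sparse_moe(x, gate_logits, experts, k=2):
    """Run only the selected experts and mix their outputs.

    Assumes every expert maps a list of floats to a list of the same length.
    """
    routing = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for i, w in routing.items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out
```

Because unselected experts are never called, compute scales with `k` rather than with the total number of experts, which is what makes an input-dependent gate attractive for real-time detectors.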

YOLO-ULM: Ultra-Lightweight Models for Real-Time Object Detection

By replacing two types of high-cost operators in the YOLOv12 framework (feature aggregation replaced by D3C2f based on cascaded large-kernel depthwise convolutions; downsampling replaced by re-parameterizable dual-path RepDown) and incorporating a hardness-aware FoCIoU loss, an ultra-lightweight real-time detector is developed that achieves higher accuracy than YOLOv11/12/13 and RT-DETR with fewer parameters and lower computational cost when trained from scratch.