🎯 Object Detection

🤖 AAAI2026 · 17 paper notes

AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

This paper introduces AerialMind, the first large-scale Referring Multi-Object Tracking (RMOT) benchmark dataset for UAV scenarios, and proposes HawkEyeTrack (HETrack), a method that achieves language-guided multi-object tracking in aerial scenes via a co-evolutionary fusion encoder and a scale-adaptive contextual refinement module.

An Overall Real-Time Mechanism for Classification and Quality Evaluation of Rice

This paper proposes an overall real-time mechanism for rice classification and quality evaluation that integrates three modules: an improved YOLOv5 for variety detection, an improved ConvNeXt-Tiny for intactness grading, and K-means clustering for chalkiness-region quantification. The system achieves 99.14% mAP and 97.89% detection accuracy on a self-constructed dataset of 20,000 images spanning six rice varieties.
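The chalkiness-quantification step can be illustrated with a tiny 1-D K-means over grain-pixel intensities: chalky regions are opaque and bright, so the fraction of pixels falling in the brighter cluster serves as a chalkiness ratio. This is a hedged sketch under that assumption, not the paper's actual implementation (which operates on real grain images).

```python
# Hypothetical sketch: estimate chalkiness as the fraction of grain pixels
# assigned to the brighter cluster by 1-D K-means with k=2.

def kmeans_1d(values, k=2, iters=50):
    """Cluster scalar values into k groups; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Recompute centroids as cluster means.
        new = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            new.append(sum(members) / len(members) if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return centroids, labels

def chalkiness_ratio(intensities):
    """Fraction of pixels in the brighter ('chalky') cluster."""
    centroids, labels = kmeans_1d(intensities, k=2)
    chalky = max(range(2), key=lambda c: centroids[c])
    return sum(1 for l in labels if l == chalky) / len(labels)

# Toy grain: 80 translucent pixels (~60) and 20 chalky pixels (~230).
pixels = [60] * 80 + [230] * 20
print(round(chalkiness_ratio(pixels), 2))  # → 0.2
```

In practice the clustering would run only on pixels inside a detected grain mask, so the ratio is per-grain rather than per-image.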

Beyond Boundaries: Leveraging Vision Foundation Models for Source-Free Object Detection

This paper proposes a framework leveraging VFMs (DINOv2 + Grounding DINO) to enhance Source-Free Object Detection (SFOD) via three modules: Patch-weighted Global Feature Alignment (PGFA), Prototype-based Instance Feature Alignment (PIFA), and Dual-source Enhanced Pseudo-label Fusion (DEPF). The method achieves state-of-the-art results on 6 cross-domain detection benchmarks, e.g., 47.1% mAP on Cityscapes→Foggy Cityscapes (+3.5% over DRU) and 67.4% AP on Sim10k→Cityscapes (+8.7% over DRU).

Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

This paper proposes GroundingAgent, a visual grounding framework that requires no task-specific fine-tuning. By composing pretrained open-vocabulary detectors (YOLO World), an MLLM (Llama-3.2-11B-Vision), and an LLM (DeepSeek-V3) into a structured iterative reasoning pipeline, the method achieves a zero-shot average accuracy of 65.1% on RefCOCO/+/g, substantially outperforming prior zero-shot approaches.

LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

This paper proposes LampQ, a metric-based layer-wise mixed precision quantization method that measures the quantization sensitivity of each ViT layer via a type-aware Fisher information metric, combines integer linear programming to optimize bit-width allocation, and iteratively refines the allocation. LampQ achieves state-of-the-art performance across image classification, object detection, and zero-shot quantization tasks.
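The bit-width allocation step can be viewed as a small constrained optimization: given a per-layer sensitivity score, pick one bit-width per layer so that the total (sensitivity × quantization error) is minimized under a model-size budget. LampQ solves this with integer linear programming; the sketch below substitutes an exhaustive search, which is exact for a toy problem, and models b-bit error with the standard 2^(−2b) uniform-noise proxy. All names and numbers are illustrative.

```python
# Hypothetical sketch of layer-wise mixed-precision bit allocation:
# minimize sum_i sensitivity[i] * 2**(-2*bits[i]) subject to a size budget.
from itertools import product

def allocate_bits(sensitivity, sizes, bit_options, budget):
    """sensitivity[i]: Fisher-style score of layer i; sizes[i]: its parameter
    count; budget: maximum total bits across all layers."""
    best, best_cost = None, float("inf")
    for bits in product(bit_options, repeat=len(sensitivity)):
        size = sum(s * b for s, b in zip(sizes, bits))  # total bits used
        if size > budget:
            continue
        cost = sum(w * 2.0 ** (-2 * b) for w, b in zip(sensitivity, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

# Four equal-size layers; layer 0 is 10x more sensitive, so the lone
# 4-bit slot lands on a low-sensitivity layer, never on layer 0.
sens = [10.0, 1.0, 1.0, 1.0]
sizes = [100, 100, 100, 100]
bits = allocate_bits(sens, sizes, bit_options=(4, 6, 8), budget=2200)
print(bits)
```

An ILP solver replaces the brute-force loop once the number of layers grows, but the objective and constraint keep this shape.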

MonoCLUE: Object-Aware Clustering Enhances Monocular 3D Object Detection

This paper proposes MonoCLUE, which leverages local clustering to extract object-level visual patterns (e.g., hood, roof) and generalized scene memory to aggregate consistent appearance features across images, enhancing detection of occluded and truncated objects in monocular 3D detection. MonoCLUE achieves state-of-the-art performance on the KITTI benchmark without relying on additional depth or LiDAR information.

Real-Time 3D Object Detection with Inference-Aligned Learning

This paper proposes SR3D, a framework that bridges the training-inference gap in indoor dense 3D object detection via two training-phase components: Spatial-Priority Optimal Transport Assignment (SPOTA) and Ranking-Aware adaptive Self-distillation (RAS). SR3D achieves state-of-the-art performance among dense detectors on ScanNet V2 and SUN RGB-D with a real-time inference latency of 42 ms.

REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion

This paper extends the 2D bounding box diffusion paradigm of DiffusionDet to 3D radar space, proposing the REXO framework. It enables explicit cross-view radar feature association guided by noisy 3D bounding box projections, and introduces a ground-level constraint to reduce the diffusion parameter space. REXO surpasses the state of the art by +4.22 AP and +11.02 AP on the HIBER and MMVR indoor radar datasets, respectively.

SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

This paper proposes SAGA, a training-free method that learns prompt-aligned Gaussian distributions to improve semantic alignment in text-to-image generation models. Supporting both text and spatial conditioning, SAGA achieves substantial alignment gains on SD 1.4 and SD 3 (TIAM-3 improves from 8.4% to 50.7%).

SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements

This paper proposes SimROD, an extremely lightweight (only 0.003M parameters) RAW image object detection method that surpasses complex state-of-the-art approaches on multiple RAW detection benchmarks through global Gamma enhancement (4 learnable parameters) and green channel-guided local enhancement.
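A global gamma curve is cheap enough that four learnable parameters can cover an entire RAW frame. One plausible reading, assumed here rather than taken from the paper, is one exponent per RGGB Bayer plane; SimROD's exact parameterization may differ, but the sketch shows why the cost is negligible.

```python
# Hypothetical sketch of a global gamma enhancement for RAW input:
# one learnable exponent per RGGB Bayer plane (4 parameters total).
# The RGGB assignment is an assumption for illustration.

def gamma_enhance(raw, gammas):
    """raw: 2-D list of values in [0, 1] laid out as an RGGB mosaic.
    gammas: {"R": gR, "Gr": gGr, "Gb": gGb, "B": gB}."""
    plane = [["R", "Gr"], ["Gb", "B"]]  # 2x2 Bayer pattern
    out = [row[:] for row in raw]
    for i, row in enumerate(raw):
        for j, v in enumerate(row):
            out[i][j] = v ** gammas[plane[i % 2][j % 2]]
    return out

raw = [[0.25, 0.25], [0.25, 0.25]]
enhanced = gamma_enhance(raw, {"R": 0.5, "Gr": 1.0, "Gb": 1.0, "B": 2.0})
print(enhanced)  # → [[0.5, 0.25], [0.25, 0.0625]]
```

In a trainable setting the four exponents would be optimized jointly with the detector, since v ** g is differentiable in g for v > 0.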

SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

SM3Det introduces the M2Det task for remote sensing (multi-modal datasets plus multi-task object detection), employing a grid-level sparse Mixture-of-Experts (MoE) backbone and a Dynamic Sub-module Optimization (DSO) mechanism to handle SAR, optical, and infrared modalities with both horizontal and oriented bounding-box detection in a single unified model, substantially outperforming the combination of three independently trained modality-specific models.
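The core idea of grid-level sparse expert routing is that each spatial grid token is dispatched to only its best-scoring expert, so modality-specific computation can specialize without a dense multi-branch backbone. The top-1 gating below is a generic illustration; SM3Det's actual gating, expert count, and load balancing are not reproduced here.

```python
# Hypothetical sketch of grid-level top-1 sparse routing: each grid token
# goes to the expert with the highest gate score.

def route_top1(gate_scores):
    """gate_scores[i][e]: gate score of expert e for grid token i.
    Returns {expert_index: [token indices routed to it]}."""
    buckets = {}
    for i, scores in enumerate(gate_scores):
        expert = max(range(len(scores)), key=lambda e: scores[e])
        buckets.setdefault(expert, []).append(i)
    return buckets

# Three grid tokens, two experts: tokens 0 and 2 prefer expert 0.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
print(route_top1(scores))  # → {0: [0, 2], 1: [1]}
```

Because only one expert runs per token, compute stays close to a single-branch backbone even as experts are added for new modalities.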

T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection

This paper proposes T-Rex-Omni, the first framework to systematically incorporate negative visual prompts into open-set object detection. Through a training-free NNC module and an NNH loss, it substantially narrows the performance gap between visual-prompt and text-prompt detection methods, with particularly strong results in long-tail scenarios (LVIS-minival APr of 51.2).

Temporal Object-Aware Vision Transformer for Few-Shot Video Object Detection

This paper proposes an object-aware temporal modeling framework that achieves cross-frame temporal consistency through selective propagation of high-confidence detection features. Combined with a pretrained vision-language encoder (OWL-ViT) and a few-shot detection head, the method achieves an average improvement of 3.7%–5.3% AP across four few-shot video object detection benchmarks.

TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

This paper proposes TubeRMC, a framework that generates text-conditioned candidate tubes and performs tube-conditioned reconstruction along temporal, spatial, and spatio-temporal dimensions, augmented by spatial-temporal mutual constraints to improve weakly-supervised spatio-temporal video grounding.

VK-Det: Visual Knowledge Guided Prototype Learning for Open-Vocabulary Aerial Object Detection

VK-Det is proposed as a framework that leverages only the visual knowledge of VLMs (without any additional supervision signals) to achieve state-of-the-art performance in open-vocabulary aerial object detection through Adaptive Selection Knowledge Distillation (ASKD), Prototype-Aware Pseudo-Label generation (PAPL), and Synthetic Matching Inference (SMI), even surpassing methods that rely on extra supervision.

When Trackers Date Fish: A Benchmark and Framework for Underwater Multiple Fish Tracking

This paper proposes MFT25, a large-scale underwater multiple fish tracking dataset (15 sequences, 408K annotations), and SU-T, a tracking framework that combines an Unscented Kalman Filter (UKF) with a FishIoU association metric, achieving state-of-the-art performance of 34.1 HOTA and 44.6 IDF1. Statistical analyses further reveal fundamental differences between fish tracking and terrestrial object tracking.
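The association step in this family of trackers matches filter-predicted track boxes against new detections by overlap. FishIoU is the paper's fish-specific variant; plain axis-aligned IoU with greedy matching stands in below, purely as an illustration of the pipeline stage.

```python
# Hypothetical sketch of IoU-based track-detection association
# (greedy, highest-overlap pairs first). FishIoU would replace iou().

def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, dets, thresh=0.3):
    """Greedy one-to-one matching; returns {track_idx: det_idx}."""
    pairs = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(dets)), reverse=True)
    matched, used_t, used_d = {}, set(), set()
    for score, ti, di in pairs:
        if score < thresh or ti in used_t or di in used_d:
            continue
        matched[ti] = di
        used_t.add(ti)
        used_d.add(di)
    return matched

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(tracks, dets))  # matches track 0↔det 1 and track 1↔det 0
```

In a full tracker, the UKF supplies the predicted track boxes, matched detections update the filter state, and unmatched detections spawn new tracks.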

YOLO-IOD: Towards Real Time Incremental Object Detection

This work is the first to systematically integrate incremental object detection (IOD) into the YOLO real-time framework. It identifies three types of knowledge conflict, proposes a three-module solution (CPR + IKS + CAKD), and introduces the more realistic LoCo COCO benchmark for evaluation.