
🎯 Object Detection

🎞️ ECCV2024 · 31 paper notes


🔥 Top topics: Object Detection ×7 · Few-/Zero-Shot Learning ×2 · Self-Supervised Learning ×2 · Layout & Composition ×2 · Object Tracking ×2

Adaptive Bounding Box Uncertainties via Two-Step Conformal Prediction: This paper proposes a two-step conformal prediction framework for uncertainty quantification in multi-object detection: the first step generates conformal prediction sets of class labels to handle classification errors, and the second step produces adaptive bounding box uncertainty intervals based on ensembles and quantile regression, providing practically useful tight prediction intervals while guaranteeing coverage.
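
To make the coverage guarantee concrete, here is a minimal sketch of plain split conformal prediction applied to box coordinates. It is a simpler stand-in, not the paper's two-step procedure: it uses absolute residuals with a single shared half-width rather than ensembles or quantile regression, and it skips the class-label step. The function names, the calibration residuals, and the example box are all hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile of calibration scores at miscoverage alpha."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank in sorted order
    if rank > n:
        return math.inf  # too few calibration samples to certify this alpha
    return sorted(scores)[rank - 1]


def box_intervals(pred_box, q):
    """Widen each predicted coordinate into a symmetric interval of half-width q."""
    return [(c - q, c + q) for c in pred_box]


# Hypothetical calibration residuals |predicted - true| in pixels.
cal_residuals = [1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 0.8, 1.2, 4.0]
q = conformal_quantile(cal_residuals, alpha=0.25)

# Hypothetical predicted box as (x1, y1, x2, y2).
intervals = box_intervals([10.0, 20.0, 50.0, 80.0], q)
print(q, intervals)
```

A fixed half-width like this is not adaptive; the summary's point is that ensemble- or quantile-based scores let the interval width vary per object while keeping the same kind of guarantee.
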
Adaptive Multi-task Learning for Few-Shot Object Detection: This paper proposes an adaptive multi-task learning method (MTL-FSOD) that dynamically adjusts the gradient scales of classification and localization tasks using a precision-driven gradient balancer to alleviate their conflict. It also introduces CLIP-based knowledge distillation and a classification refinement scheme to enhance individual task performance, achieving consistent improvements across multiple few-shot object detection benchmarks.
AugDETR: Improving Multi-scale Learning for Detection Transformer: This paper proposes AugDETR (Augmented DETR), which expands the receptive field of the deformable encoder and introduces global context features to enhance feature representations through a Hybrid Attention Encoder. It then adaptively utilizes information from multiple encoder layers using Encoder-Mixing Cross-Attention to accelerate convergence, yielding improvements of 1.2, 1.1, and 1.0 AP over DINO, AlignDETR, and DDQ on COCO, respectively.
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos: Proposes the Boundary-Aligned Moment Detection Transformer (BAM-DETR), which models moments using an anchor-boundary triplet \((p, d_s, d_e)\) instead of the traditional center-length duplet \((c, l)\). Combined with a dual-pathway decoder and a quality-based ranking mechanism, it effectively addresses the issue of imprecise localization caused by center ambiguity.
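
The difference between the two moment parameterizations can be sketched in a few lines. This assumes that \(d_s\) and \(d_e\) are non-negative distances from the anchor \(p\) to the start and end boundaries; the function names and numbers are illustrative only and are not taken from the paper.

```python
def center_length_to_span(c, l):
    """Center-length duplet (c, l) -> (start, end)."""
    return (c - l / 2, c + l / 2)


def anchor_boundary_to_span(p, d_s, d_e):
    """Anchor-boundary triplet (p, d_s, d_e) -> (start, end).

    Assumes d_s and d_e are distances from the anchor to each boundary.
    """
    return (p - d_s, p + d_e)


# The same moment [2.0, 10.0] has exactly one center-length encoding...
span_from_center = center_length_to_span(6.0, 8.0)

# ...but any anchor inside the moment can describe it with asymmetric distances.
span_from_early_anchor = anchor_boundary_to_span(3.0, 1.0, 7.0)
span_from_late_anchor = anchor_boundary_to_span(9.0, 7.0, 1.0)
print(span_from_center, span_from_early_anchor, span_from_late_anchor)
```

Because the anchor does not have to sit at the center of the moment, the boundaries can be regressed from wherever the most informative frames are.
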
Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection: This paper proposes the Bridge Past and Future (BPF) method, which bridges past stages via pseudo-labels, excludes potential future objects using an attention mechanism, and incorporates dual-teacher distillation (Distillation with Future) to resolve the optimization goal inconsistency caused by cross-stage information asymmetry in incremental object detection.
Can OOD Object Detectors Learn from Foundation Models?: SyncOOD proposes an automated data curation method that leverages LLMs to imagine semantically novel OOD concepts and performs region-level editing on ID images via Stable Diffusion Inpainting to synthesize scene-level OOD samples. After refining bounding boxes with SAM and filtering via feature similarity, a lightweight MLP classifier is trained, substantially outperforming SOTA on multiple OOD detection benchmarks with a minimal amount of synthetic data.
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer: DAMSDet proposes a dynamic adaptive infrared-visible object detection method based on the DETR architecture. By utilizing Modality Competitive Query Selection (dynamically selecting the dominant modality feature as the initial query for each object) and Multispectral Deformable Cross-Attention (adaptively sampling and aggregating bi-modal features across multiple semantic levels), it simultaneously addresses the dual challenges of complementary information fusion and modality misalignment, significantly outperforming the state-of-the-art (SOTA) on four public datasets.
DSPDet3D: 3D Small Object Detection with Dynamic Spatial Pruning: Proposes a Dynamic Spatial Pruning (DSP) strategy that, within the decoders of multi-scale 3D detectors, progressively removes voxel features in areas where large objects have already been detected. This allows the detector to process scenes at extremely high spatial resolutions, significantly improving small object detection accuracy (ScanNet small-object mAP boosted from 27.5% to 44.8%) while reducing GPU memory to 1/5 of the baseline method at the same resolution.
GRA: Detecting Oriented Objects Through Group-Wise Rotating and Attention: A lightweight Group-wise Rotating and Attention (GRA) module is proposed. By grouping and rotating convolution kernels and applying group-wise spatial attention, it outperforms the previous SOTA method ARC with nearly 50% fewer parameters, achieving new state-of-the-art performance on DOTA-v2.0.
LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction: LaMI-DETR is proposed to address two core challenges in open-vocabulary object detection (insufficient concept representation and base-category overfitting) by leveraging GPT to generate visual concept descriptions and T5 to mine inter-category visual similarity relationships. It outperforms previous state-of-the-art methods by 7.8 rare AP on OV-LVIS, achieving 43.4 AP_rare.
MutDet: Mutually Optimizing Pre-training for Remote Sensing Object Detection: MutDet is a mutually optimizing pre-training framework for remote sensing oriented object detection. It systematically mitigates the feature discrepancy between object embeddings and detector features in detection pre-training through three components: bidirectional cross-attention fusion of object embeddings and encoder features, a contrastive alignment loss, and an auxiliary Siamese head.
Nonverbal Interaction Detection: This work presents the first systematic study of human nonverbal interaction (gestures, expressions, gaze, postures, touch), introducing a large-scale dataset NVI, a new task NVI-DET, and a dual multi-scale hypergraph-based detection model NVI-DEHR, which achieves state-of-the-art performance on both nonverbal interaction detection and HOI detection tasks.
On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines: This paper systematically reveals significant flaws in existing evaluation frameworks, evaluation metrics, and the use of Temperature Scaling in object detector calibration research. It proposes a principled joint evaluation framework along with post-hoc calibration methods tailored specifically for object detection (Platt Scaling and Isotonic Regression), demonstrating that correctly designed and evaluated post-hoc calibrators far outperform recent train-time calibration methods.
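
As an illustration of what a post-hoc calibrator does, the sketch below fits isotonic regression with the standard pool-adjacent-violators algorithm in pure Python. It is a generic sketch rather than the paper's detection-specific recipe: the function name, the confidence scores, and the binary labels (1 meaning the detection was counted as correct) are hypothetical.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of a non-decreasing map from score to target.

    Returns the sorted scores and their calibrated values.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted


# Hypothetical detection confidences and correctness labels on a held-out set.
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
labels = [0, 1, 0, 1, 1]
sorted_scores, calibrated = isotonic_fit(scores, labels)
print(list(zip(sorted_scores, calibrated)))
```

The fitted values are piecewise constant and non-decreasing in the raw score, so the ranking of detections is preserved up to ties while the confidence values are pulled toward observed correctness rates.
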
OpenKD: Opening Prompt Diversity for Zero- and Few-shot Keypoint Detection: This paper proposes the OpenKD model, which opens up prompt diversity across three dimensions: modality (vision + text), semantics (seen vs. unseen), and language (diverse text). By employing a multimodal prototype set, auxiliary keypoint-text interpolation, and LLM-based text parsing, OpenKD achieves generalized zero- and few-shot keypoint detection, obtaining SOTA performance on Animal Pose, AwA, CUB, and NABirds.
Plain-Det: A Plain Multi-Dataset Object Detector: Plain-Det proposes a simple and flexible multi-dataset object detection framework. By incorporating semantic space calibration, class-aware query compositor, and hardness-indicated dynamic sampling strategies, it achieves 51.9 mAP on COCO (matching the SOTA at that time) and can be flexibly scaled to new datasets while maintaining robust performance.
Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer: A new learning paradigm utilizing pseudo multi-view videos to train a feed-forward, one-shot 4D head synthesizer is proposed. It first learns a 3D head synthesizer from synthetic data to convert monocular videos into multi-view videos, and then trains the 4D synthesizer through cross-view self-reenactment using these pseudo multi-view videos. This avoids over-reliance on 3DMMs and significantly outperforms prior methods in reconstruction fidelity, geometric consistency, and motion control accuracy.
Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation: This paper proposes a Point-Axis representation method that decouples the position (point set) and orientation (axis encoding) of oriented objects. Facilitated by Max-Projection Loss and Cross-Axis Loss, this method achieves optimization without requiring extra annotations. Based on this, the Oriented DETR model is designed to resolve the loss discontinuity issue inherent in traditional oriented bounding box representations.
Rectify the Regression Bias in Long-Tailed Object Detection: This work is the first to reveal and systematically address the overlooked regression bias problem in long-tailed object detection. Because rare categories have too few samples, the parameters of their class-specific regression heads generalize poorly. By adding a class-agnostic regression branch as a trade-off, the method achieves state-of-the-art performance on datasets such as LVIS.
ReGround: Improving Textual and Spatial Grounding at No Cost: By changing the sequential connection of Gated Self-Attention (GSA) and Cross-Attention (CA) in GLIGEN to a parallel connection (network rewiring), the trade-off between textual and spatial grounding is significantly alleviated without introducing new parameters, fine-tuning, or computational overhead.
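
The rewiring can be pictured with a toy sketch. It assumes both modules are applied as residual updates and uses stand-in integer functions in place of real attention layers; `toy_gsa`, `toy_ca`, and both block functions are hypothetical names, not GLIGEN or ReGround code.

```python
def sequential_block(x, gsa, ca):
    """Sequential wiring: CA receives features already modified by GSA."""
    x = x + gsa(x)
    return x + ca(x)


def parallel_block(x, gsa, ca):
    """Parallel wiring: GSA and CA read the same input and their updates are summed."""
    return x + gsa(x) + ca(x)


def toy_gsa(x):
    # Stand-in for the gated self-attention update (spatial grounding).
    return 2 * x


def toy_ca(x):
    # Stand-in for the cross-attention update (textual grounding).
    return 10 * x


seq_out = sequential_block(1, toy_gsa, toy_ca)
par_out = parallel_block(1, toy_gsa, toy_ca)
print(seq_out, par_out)
```

Both wirings reuse the same two modules, which is consistent with the summary's claim that the change adds no parameters; only the input that the cross-attention update sees differs.
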
Responsible Visual Editing: Defines a new task of "Responsible Visual Editing" and proposes CoEditor, a cognitive editor that converts harmful images into responsible versions through a two-stage perceptual-behavioral cognitive process while minimizing modifications.
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection: This paper proposes the LSFA (Local-to-global Self-supervised Feature Adaptation) framework. It performs task-oriented adaptation of pretrained features through two self-supervised strategies: Intramodal Feature Compactness (IFC) optimization and Cross-modal Local-to-global Consistency (CLC) alignment. LSFA achieves 97.1% I-AUROC on MVTec-3D AD, outperforming the state-of-the-art (SOTA) by 3.4%.
Shifted Autoencoders for Point Annotation Restoration in Object Counting: Proposes Shifted AutoEncoders (SAE), an MAE-inspired point annotation restoration method: by applying random shifts to point annotations and training a UNet to restore them, the model learns "general location knowledge" while ignoring individual annotation noise. The trained SAE is used to restore original annotations to make them more consistent, which consistently improves the performance of any counting model (density-map or localization-based), setting new records across 9 datasets.
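
The data side of this idea, building (shifted, original) training pairs, can be sketched as follows. The uniform shift distribution, the `max_shift` value, and the function name are assumptions made for illustration; the summary does not specify them, and the UNet that learns the restoration is omitted.

```python
import random


def shift_points(points, max_shift, rng):
    """Return a copy of the point annotations with independent random offsets."""
    return [
        (x + rng.uniform(-max_shift, max_shift),
         y + rng.uniform(-max_shift, max_shift))
        for x, y in points
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical head-center annotations in pixel coordinates.
original = [(10.0, 12.0), (40.0, 8.0), (25.0, 30.0)]
shifted = shift_points(original, max_shift=4.0, rng=rng)

# The restoration model would be trained to map `shifted` back to `original`.
training_pair = (shifted, original)
print(training_pair)
```
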
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding: To address the improper negative sample construction in existing compositional temporal grounding methods and the failure of DETR models to generate reasonable saliency responses for negative queries, this paper proposes leveraging an LLM (GPT-3.5 Turbo) to generate semantically feasible hierarchical hard negative samples, and designs a coarse-to-fine saliency ranking strategy to establish multi-granularity semantic relations between video clips and hierarchical negative queries, significantly improving compositional generalization performance.
Stepwise Multi-grained Boundary Detector for Point-Supervised Temporal Action Localization: To address the semantic ambiguity of action boundaries caused by sparse annotations in point-supervised temporal action localization, this paper proposes a Stepwise Multi-grained Boundary Detector (SMBD). By employing a Background Anchor Generator (BAG) and a Dual Boundary Detector (DBD), SMBD provides fine-grained boundary supervision signals for training, achieving state-of-the-art performance on datasets such as THUMOS'14.
TAPTR: Tracking Any Point with Transformers as Detection: TAPTR reformulates the Tracking Any Point (TAP) task as a DETR-like detection problem. It represents each tracking point as a point query containing both position and content, which is layer-wise optimized through a multi-layer Transformer decoder. Combined with a cost volume and sliding-window feature update strategy, it achieves SOTA performance on the TAP-Vid benchmark with faster inference speed.
Tensorial Template Matching for Fast Cross-Correlation with Rotations and Its Application for Tomography: Proposes the Tensorial Template Matching (TTM) algorithm, which integrates the template information under all rotations into a symmetric tensor field to reduce the calculation to a fixed number of cross-correlations. This makes the computational complexity independent of rotational precision, achieving fast and accurate object detection and rotation estimation in 3D tomographic images.
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching: Constructs the first natural language-guided drone geolocalization benchmark, GeoText-1652 (276K bbox-text pairs, 316K descriptions), and proposes a blending spatial matching method that achieves region-level spatial relation matching by combining a grounding loss with a spatial relation loss, reaching a text retrieval Recall@10 of 31.2%.
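
For readers unfamiliar with the metric, Recall@K in retrieval is commonly computed as the fraction of queries with at least one correct match among the top K results. The sketch below follows that common definition; the benchmark's exact evaluation protocol may differ, and the query and image identifiers are made up.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries with at least one correct item among the top k results."""
    hits = 0
    for query, ranked in rankings.items():
        if any(item in ground_truth[query] for item in ranked[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical ranked retrieval results per query, best match first.
rankings = {
    "q1": ["img3", "img1", "img7"],
    "q2": ["img2", "img9", "img4"],
    "q3": ["img5", "img6", "img8"],
    "q4": ["img1", "img2", "img3"],
}
# Hypothetical sets of correct matches per query.
ground_truth = {"q1": {"img1"}, "q2": {"img4"}, "q3": {"img0"}, "q4": {"img1"}}

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, ground_truth, k))
```
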
Visible and Clear: Finding Tiny Objects in Difference Map: SR-TOD introduces the image self-reconstruction mechanism into object detection for the first time, discovering a strong correlation between reconstruction difference maps and tiny objects. It designs a Difference Map Guided Feature Enhancement (DGFE) module, achieving significant improvements on the self-built anti-UAV dataset DroneSwarms as well as VisDrone2019 and AI-TOD.
WALKER: Self-supervised Multiple Object Tracking by Walking on Temporal Appearance Graphs: This paper proposes Walker, the first self-supervised multiple object tracker. By constructing a quasi-dense temporal object appearance graph, designing a multi-positive contrastive loss to optimize random walks on the graph for instance similarity learning, and introducing mutually-exclusive connectivity constraints and a motion-constrained bidirectional walk inference strategy, Walker achieves competitive self-supervised tracking performance on MOT17, DanceTrack, and BDD100K, outperforming prior self-supervised methods even with 400 times fewer annotations.
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection: Proposes the WSCL framework: leveraging LLMs to generate diverse text descriptions, diffusion models to generate corresponding images, and a weak detector to decompose phrases and generate pseudo bounding boxes, constructing dense synthetic triplets (image, description, bbox). Together with compositional contrastive learning, it significantly improves language-guided object detection performance, achieving a +5.0 AP improvement for GLIP-T on OmniLabel.
YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information: YOLOv9 proposes Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (GELAN) to address the information bottleneck problem in deep networks. It comprehensively outperforms existing real-time object detectors on MS COCO with fewer parameters and computation, surpassing methods pre-trained on large-scale datasets while training from scratch.