
✂️ Segmentation

🎞️ ECCV2024 · 54 paper notes

📌 Same area in other venues: 📷 CVPR2026 (122) · 🔬 ICLR2026 (32) · 🧪 ICML2026 (14) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (45) · 📹 ICCV2025 (73)

🔥 Top topics: Segmentation ×33 · Object Detection ×4 · Few-/Zero-Shot Learning ×3 · Multimodal/VLM ×3 · Diffusion Models ×2

A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties: ProLab uses LLMs to generate common-sense descriptions of categories, compressing them into 256 interpretable descriptive properties via sentence embeddings and K-Means clustering. This constructs an attribute-level multi-hot label space to supervise the segmentation model, replacing traditional one-hot category labels. It consistently outperforms category-level supervision across five classic benchmarks and shows emergent out-of-domain generalization capabilities.
A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting: Based on Stable Diffusion, a minimalist latent diffusion segmentation framework named LDMSeg is proposed. It compresses segmentation masks into a latent space using a shallow autoencoder and then trains an image-conditioned diffusion model to generate panoptic segmentation results. This bypasses object detection modules, Hungarian matching, and complex post-processing found in traditional methods, while naturally supporting mask inpainting and multi-task extensions.
ActionVOS: Actions as Prompts for Video Object Segmentation: ActionVOS is a new setting for Referring Video Object Segmentation that uses human action narratives as additional language prompts. It generates pseudo-labels with a parameter-free action-aware labeling module and uses an action-guided focal loss to suppress false positives, reducing mis-segmentation of inactive objects by 35.6% mIoU on VISOR and improving segmentation of state-changing objects by 3.0% mIoU on VOST/VSCOS.
Active Coarse-to-Fine Segmentation of Moveable Parts from Real Images: Proposes the first active learning framework for instance segmentation of moveable parts in real-world indoor RGB images. Utilizing a pose-aware masked attention network, the framework achieves coarse-to-fine segmentation. It requires manual annotation of only 11.45% of the images to obtain fully verified high-quality segmentation results, saving 60% of manual effort compared to the best non-active learning methods.
Attention Decomposition for Cross-Domain Semantic Segmentation: This paper proposes ADFormer, a novel Transformer architecture for cross-domain semantic segmentation. By decomposing the cross-attention in the decoder into domain-agnostic and domain-specific components, combined with gradient reversal adversarial learning, it effectively bridges the distribution gap between source and target domains. It outperforms existing proposal-free methods on GTA→Cityscapes and SYNTHIA→Cityscapes benchmarks with significantly lower complexity.
CoLA: Conditional Dropout and Language-Driven Robust Dual-Modal Salient Object Detection: This paper proposes the CoLA framework, which introduces two core modules, Language-driven Quality Assessment (LQA) and Conditional Dropout (CD), to simultaneously address two key robustness issues in dual-modal salient object detection for the first time: noisy inputs and missing modalities.
ColorMAE: Exploring Data-Independent Masking Strategies in Masked AutoEncoders: This paper proposes ColorMAE, which generates data-independent masking patterns with spatial and semantic priors by applying different frequency domain filters to random noise. Without adding any parameters or computational overhead, ColorMAE significantly improves the downstream performance of MAE, particularly achieving a 2.72 mIoU improvement over random masking on semantic segmentation tasks.
ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback: This paper proposes ControlNet++, which explicitly optimizes the quality of conditional controllable generation through pixel-level cycle consistency loss. A pre-trained discriminative model is used to extract conditions from generated images and align them with the input conditions. To avoid the massive GPU memory overhead of multi-step sampling, an efficient single-step denoising reward strategy is designed. This significantly improves controllability (e.g., +11.1% segmentation mIoU) under various conditional controls such as segmentation masks, edges, and depth.
CoReS: Orchestrating the Dance of Reasoning and Segmentation: This paper proposes CoReS (Chains of Reasoning and Segmenting), a multimodal Chain-of-Thought framework with a dual-chain structure. Through the hierarchical collaboration of the reasoning chain and the segmenting chain, combined with an in-context guidance strategy, it achieves progressive and precise segmentation of target objects in complex reasoning text, outperforming LISA by 6.5% on the ReasonSeg dataset.
CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation: This paper proposes the Class-Conditional Prompting Machine (CPM), which enhances the stability of bipartite matching and the effectiveness of cross-modal attention in Mask2Former for audio-visual segmentation by combining class-agnostic queries with GMM-sampled class-conditional queries. Simultaneously, three auxiliary tasks are designed—Audio-Conditional Prompting (ACP), Visual-Conditional Prompting (VCP), and Prompt Contrastive Learning (PCL)—achieving state-of-the-art performance on AVSBench and VPO benchmarks.
Cs2K: Class-Specific and Class-Shared Knowledge Guidance for Incremental Semantic Segmentation: Proposes the Cs2K framework, which jointly mitigates catastrophic forgetting and underfitting of new categories in incremental semantic segmentation from two aspects: class-specific knowledge (prototype-guided pseudo-labeling + prototype-guided class adaptation) and class-shared knowledge (weight-guided selective consolidation).
Dataset Enhancement with Instance-Level Augmentations: An instance-level data augmentation method based on pretrained diffusion models is proposed. By repainting object instances individually in the image while maintaining the original labels, the method significantly enhances the performance of salient object detection, semantic segmentation, and object detection, while also supporting data anonymization.
Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time: This work shows that subsampling layers in deep networks discard a significant volume of useful activations during the default forward pass. It proposes a search-and-aggregate framework that uses these discarded activation maps at test time to improve classification and segmentation performance. The approach is orthogonal and complementary to traditional test-time augmentation (TTA) methods.
DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs: Revisiting the dense concatenation shortcut of DenseNet, this paper proposes RDNet (Revitalized DenseNet) through a systematic modernization (wider and shallower architecture, modernized blocks, enlarged intermediate dimensions, more transition layers, etc.). RDNet outperforms Swin Transformer, ConvNeXt, and DeiT-III on ImageNet-1K, suggesting that concatenation is an underestimated design paradigm.
Diffusion Models for Open-Vocabulary Segmentation: This paper proposes OVDiff, which leverages a pre-trained text-to-image diffusion model to generate support image sets for arbitrary textual categories. From these, multi-level prototypes (class-level, instance-level, and part-level) are extracted and combined with background prototypes to achieve training-free open-vocabulary semantic segmentation, outperforming prior methods by over 10% on PASCAL VOC.
DreamLIP: Language-Image Pre-training with Long Captions: By generating long captions for 30M images using MLLMs, this work proposes multi-positive contrastive learning with dynamic subcaption sampling and a subcaption-specific grouping loss to achieve fine-grained vision-language alignment, matching or even surpassing the performance of CLIP 400M on retrieval and semantic segmentation tasks using only 30M data.
EAFormer: Scene Text Segmentation with Edge-Aware Transformers: Proposes Edge-Aware Transformer (EAFormer), which filters non-text region edges using a Text Edge Extractor, and fuses text edge information in the encoder using symmetric cross-attention. This significantly improves the segmentation accuracy of text edge regions. Furthermore, the COCO_TS and MLT_S datasets are re-annotated for fairer evaluation.
Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation: This paper proposes NeST (New claSsifier pre-Tuning), which initializes new classifier weights by learning a linear transformation from all old classifiers to the new ones prior to formal training. It also designs a transformation matrix initialization strategy based on cross-task class similarity. NeST significantly improves the performance of multiple CISS methods on Pascal VOC and ADE20K.
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models: R-Adapter introduces a lightweight adapter module into the CLIP model along with three self-ensemble strategies (Adapter Dropping, Weight Accumulation, and Weight-Scaling Re-parameterization). While fine-tuning only 13% of the parameters, it simultaneously achieves high ID accuracy and strong OOD robustness, and extends robust fine-tuning to cross-modal retrieval and open-vocabulary segmentation tasks for the first time.
Eliminating Feature Ambiguity for Few-Shot Segmentation: This work proposes AENet, a plug-and-play network that eliminates feature ambiguity by mining discriminative query foreground regions to enhance foreground-foreground matching in cross-attention, consistently improving the performance of existing few-shot segmentation methods (e.g., +3.0% 1-shot mIoU for SCCAN on PASCAL-5i).
Frequency-Spatial Entanglement Learning for Camouflaged Object Detection: The Frequency-Spatial Entanglement Learning (FSEL) framework is proposed. By conducting entanglement learning between the frequency and spatial domains, it utilizes global frequency features to compensate for the locality and sensitivity limitations of spatial features, outperforming 21 state-of-the-art (SOTA) methods across three COD benchmarks.
FREST: Feature Restoration for Semantic Segmentation under Multiple Adverse Conditions: This paper proposes FREST, a source-free domain adaptation semantic segmentation framework for multiple adverse conditions (fog, rain, snow, night). By alternating between learning a condition embedding space (to extract condition-specific information) and feature restoration (to recover adverse condition features back to normal), it progressively eliminates the impact of adverse conditions on features, achieving new SOTAs on both ACDC and RobotCar benchmarks.
General and Task-Oriented Video Segmentation: GvSeg proposes a universal video segmentation framework. By decoupling the segmentation target into three factors—appearance, shape, and position—and dynamically adjusting the involvement of these three factors in query initialization, matching, and sampling based on task requirements (VIS/VSS/VPS/EVS), it achieves state-of-the-art (SOTA) performance across four video segmentation tasks within a unified architecture.
GiT: Towards Generalist Vision Transformer through Universal Language Interface: The GiT framework is proposed, which unifies five major vision tasks—image captioning, object detection, instance segmentation, semantic segmentation, and visual grounding—into autoregressive sequence generation through a universal language interface. Using only a pure ViT (without any task-specific modules), it achieves multi-task joint training where tasks mutually enhance each other.
LASS3D: Language-Assisted Semi-Supervised 3D Semantic Segmentation with Progressive Unreliable Data Exploitation: This paper proposes LASS3D, which introduces Large Vision-Language Models (LVMs) to generate multi-level text descriptions for enhancing 3D features within a MeanTeacher semi-supervised 3D semantic segmentation framework, and effectively exploits low-confidence pseudo-labeled points through a progressive negative learning strategy, achieving significant improvements on both indoor and outdoor datasets.
Learning Camouflaged Object Detection from Noisy Pseudo Label: The first weakly semi-supervised camouflaged object detection method (WSSCOD) is proposed, achieving comparable performance to fully supervised SOTA using only 20% pixel-level annotations + 80% box annotations. The core contribution is an adaptive noise correction loss \(\mathcal{L}_{NC}\), which can be optimized separately in the early learning and memorization phases.
Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation: This work is the first to propose using entirely web images (rather than well-curated dataset images) for weakly-supervised incremental semantic segmentation. By filtering web images with a Fourier domain discriminator and using a caption-driven rehearsal strategy to preserve old class knowledge, it achieves 73.4% mIoU under the PASCAL VOC 15-5 setting.
LiFT: A Surprisingly Simple Lightweight Feature Transform for Dense ViT Descriptors: This paper proposes LiFT, an extremely simple lightweight post-processing network (only 1.2M parameters) trained with a self-supervised multi-scale reconstruction objective. By blending coarse-grained semantic features from a frozen ViT with fine-grained image features extracted by a CNN, LiFT doubles the resolution of ViT features at a marginal cost of a 5.7% parameter increase and 22% more FLOPs, delivering significant performance improvements across dense tasks such as keypoint matching, detection, segmentation, and object discovery.
Long-Tail Temporal Action Segmentation with Group-wise Temporal Logit Adjustment: This work systematically addresses the long-tail problem in temporal action segmentation for the first time, proposing the Group-wise Temporal Logit Adjustment (G-TLA) framework. By leveraging activity labels for group-wise classification combined with temporal action priors for logit adjustment, it substantially improves the performance of tail classes without sacrificing head classes.
Occlusion-Aware Seamless Segmentation: Proposes a new task, Occlusion-Aware Seamless Segmentation (OASS), and the UnmaskFormer framework to simultaneously address three major challenges: overcoming the narrow field of view by using panoramic images, complete amodal segmentation of occluded objects, and pinhole-to-panoramic cross-domain adaptation. It achieves SOTA performance on the newly introduced BlendPASS dataset.
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing: A plug-and-play framework named OLAF is proposed. By incorporating foreground/edge masks as additional input channels, introducing a Low-level Dense Feature extraction module (LDF), and applying a targeted weight adaptation strategy, it brings significant multi-object multi-part segmentation gains to various segmentation networks (CNN/U-Net/Transformer) without changing the base architecture, surpassing the SOTA by 4.0 mIoU on the highly challenging Pascal-Parts-201 dataset.
OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models: This work defines the open-set panoptic scene graph generation (OpenPSG) task for the first time, leveraging BLIP-2 as a multimodal relation decoder in conjunction with a Relation Query Transformer (RelQ-Former) to achieve open-set relation prediction. The proposed model achieves 79.3% in PredCls R@100 on the PSG dataset, outperforming previous SOTA models by 26.6% in closed-set scenarios.
Part2Object: Hierarchical Unsupervised 3D Instance Segmentation: A hierarchical clustering framework, Part2Object, is proposed to leverage self-supervised features and 3D objectness priors to layer-by-layer merge part-level over-segmentations into object-level instances. This generates high-quality pseudo-labels for self-training Hi-Mask3D, achieving 3D instance segmentation without any human annotations.
PartSTAD: 2D-to-3D Part Segmentation Task Adaptation: PartSTAD proposes a task adaptation method for 2D-to-3D part segmentation. By introducing a learnable weight prediction network for GLIP's 2D bounding boxes (optimized directly for 3D mRIoU) and integrating SAM to obtain precise foreground masks, it improves semantic segmentation mIoU by 7.0 percentage points and instance segmentation mAP50 by 5.2 percentage points on PartNet-Mobility, relative to PartSLIP.
Point-Supervised Panoptic Segmentation via Estimating Pseudo Labels from Learnable Distance: This paper proposes a point-supervised panoptic segmentation method based on learnable distance, representing each instance with an anchor query and predicting pixel-to-instance distance via cross-attention. The distance learning is supervised by point labels in an end-to-end manner. Combined with iterative query aggregation and enhancement processes, it continuously optimizes pseudo-label quality and achieves state-of-the-art results in point-supervised panoptic segmentation.
ProMerge: Prompt and Merge for Unsupervised Instance Segmentation: ProMerge uses self-supervised visual features (DINO) for initial patch grouping, followed by strategic merging and background-aware mask pruning, to perform unsupervised instance segmentation. It runs significantly faster at inference than normalized-cut methods, and a detector trained on its pseudo-labels outperforms existing unsupervised SOTA models.
ReMamber: Referring Image Segmentation with Mamba Twister: This paper introduces the Mamba architecture to the Referring Image Segmentation (RIS) task for the first time. It proposes the Mamba Twister module, which achieves efficient vision-language feature fusion through a "twisting" mechanism of channel and spatial scanning. It achieves competitive results surpassing Transformer-based methods on RefCOCO/RefCOCO+/G-Ref benchmarks while maintaining linear computational complexity.
Representing Topological Self-Similarity Using Fractal Feature Maps for Accurate Segmentation of Tubular Structures: Using fractal theory, fractal dimension (FD) is extended from the image level to the pixel level to generate Fractal Feature Maps (FFM) as additional inputs and loss weights for deep learning models. A Multi-Decoder Network (MD-Net) containing a boundary decoder and a skeleton decoder is designed, significantly improving segmentation performance across five tubular structure datasets.
Rotary Position Embedding for Vision Transformer: This work systematically investigates extending RoPE (Rotary Position Embedding) from 1D language models to 2D vision tasks. It proposes RoPE-Mixed (mixed learnable frequencies) to replace the conventional Axial frequency allocation. The approach significantly improves resolution extrapolation on ViT and Swin Transformer and yields consistent gains in ImageNet classification, COCO detection, and ADE20k segmentation.
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference: This work discovers that the failure of CLIP in dense prediction stems from spatial misalignment caused by self-attention. It proposes the Correlative Self-Attention (CSA) mechanism, which modifies only the computation of the last self-attention layer (training-free). This improvement elevates CLIP's zero-shot semantic segmentation performance from 14.1% average mIoU to 38.2%, surpassing all existing methods.
SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis: Presents the SegGen data generation framework, reversing the conventional "generate image then label" pipeline to "generate segmentation mask from text first, then generate image from mask," breaking the "chicken-and-egg" bottleneck of segmentation data synthesis, and improving the mIoU of Mask2Former R50 on ADE20K from 47.2 to 49.9 (+2.7).
Segmentation-Guided Layer-Wise Image Vectorization with Gradient Fills: Proposes a segmentation-guided vectorization framework that guides the initialization and optimization of Bézier paths through a gradient-aware segmentation subroutine. It supports radial gradient fills in a layer-wise vectorization method with layered topology for the first time, achieving higher visual quality with fewer paths.
SeiT++: Masked Token Modeling Improves Storage-Efficient Training: This work introduces Masked Token Modeling (MTM) self-supervised pre-training into the tokenized training framework of SeiT and designs two token-specific data augmentation strategies, TokenAdapt and ColorAdapt, to address the challenge of data augmentation in the token domain. ImageNet-1k classification accuracy improves from 74.0% to 77.8% while using only 1% of the storage (1.4 GB).
Self-supervised Co-salient Object Detection via Feature Correspondences at Multiple Scales: This paper proposes SCoSPARC, a two-stage self-supervised co-salient object detection model. By detecting co-salient objects in image groups through patch-level and region-level ViT feature correspondences, it achieves a 13.7% higher F-measure on the CoCA dataset compared to the unsupervised SOTA, even outperforming several supervised methods.
SiLC: Improving Vision Language Pretraining with Self-Distillation: Introduces the SiLC framework, which incorporates local-to-global self-distillation into CLIP-style image-text contrastive learning, significantly boosting performance on dense prediction tasks (detection, segmentation) while also improving classification and retrieval.
SOS: Segment Object System for Open-World Instance Segmentation With Object Priors: The SOS method is proposed to generate object-focused SAM prompt points using DINO self-attention maps as object priors, thereby producing high-quality pseudo annotations to train standard instance segmentation systems. It significantly outperforms prior SOTA methods under COCO/LVIS/ADE20k cross-category/cross-dataset settings, achieving precision improvements of up to 81.6%.
SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images: SPIN constructs SubPartImageNet, the first hierarchical semantic segmentation dataset with subpart-level granularity for natural images, containing 203 subpart categories and 106k annotations. It proposes two hierarchical consistency evaluation metrics (SpCS / SeCS) and performs a comprehensive benchmark on over 20 modern models, revealing severe limitations of current models at the subpart level.
Un-EVIMO: Unsupervised Event-based Independent Motion Segmentation: Presents the first label-free independently moving object (IMO) segmentation framework for event cameras. It leverages optical flow and geometric constraints to generate pseudo-labels for training a segmentation network, achieving performance comparable to supervised methods on the EVIMO dataset.
UniFS: Universal Few-Shot Instance Perception with Point Representations: This paper proposes UniFS, the first universal few-shot instance perception model. It unifies object detection, instance segmentation, pose estimation, and object counting into a dynamic point representation learning paradigm and introduces a structure-aware point learning (SAPL) loss to capture high-order structural relations among points. UniFS achieves performance close to expert models while making minimal task-specific assumptions.
Unsupervised Moving Object Segmentation with Atmospheric Turbulence: This paper proposes an unsupervised method to segment moving objects in atmospheric turbulence videos using a "detect-then-grow" strategy. It first separates real motion from turbulent motion using epipolar geometric consistency checks based on Sampson distance, then performs region growing from high-confidence seed pixels to generate segmentation masks, and finally refines the masks with a spatio-temporal consistency loss. It substantially outperforms existing methods on DOST, the first real turbulence video dataset (achieving a 60.1% IoU gain).
VISA: Reasoning Video Object Segmentation via Large Language Models: Proposes the new ReasonVOS task and the VISA model, utilizing the world knowledge reasoning capabilities of multi-modal LLMs to achieve video object segmentation and tracking based on implicit text queries.
VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement: To address the association errors caused by over-reliance on spatial-temporal position information in existing online video instance segmentation (VIS) methods, VISAGE is proposed to enhance instance association accuracy. By explicitly extracting appearance embeddings from backbone features, combined with contrastive learning and a simplified tracker, the method achieves state-of-the-art results on YTVIS and OVIS benchmarks.
VP-SAM: Taming Segment Anything Model for Video Polyp Segmentation via Disentanglement and Spatio-Temporal Side Network: This paper proposes VP-SAM, which leverages the amplitude information of the Fourier spectrum via a Semantic Disentanglement Adapter (SDA) to help SAM distinguish low-contrast polyps from the background, while designing a Spatio-Temporal Side Network (STSN) to inject inter-frame temporal information into SAM. It achieves SOTA performance on datasets such as SUN-SEG, CVC-612, and CVC-300.
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception: Proposes the HQNet framework, which learns a unified Human Query representation to simultaneously perform multiple human-centric perception tasks (pedestrian detection, instance segmentation, 2D pose estimation, 3D mesh recovery, and attribute recognition) within a single-stage, single-model system. It also establishes the first comprehensive multi-task human perception benchmark, COCO-UniHuman.
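
📐 Reading aid: many of the notes above report results in mIoU (mean Intersection over Union). Below is a minimal Python sketch of the metric on flattened label lists. The function name `mean_iou` and the `ignore_index` parameter are illustrative assumptions, not taken from any of the papers. Benchmarks also differ in details: most accumulate intersections and unions over the whole dataset before averaging, and some papers use task-specific variants, so this should not be read as the evaluation protocol of any particular entry.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: flat sequences of integer class ids of equal length.
    ignore_index: hypothetical "void" label; pixels whose ground truth
    equals it are skipped (an assumption mirroring common practice).
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # A correct pixel adds to both intersection and union of its class.
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1

    # Classes absent from both prediction and ground truth are left out.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so mIoU = 7/12.
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 4))  # 0.5833
```

Under this definition, a reported gain such as "+2.7 mIoU" is a difference in the class-averaged score in percentage points, not a relative improvement.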