✂️ Segmentation
🧠 NeurIPS 2025 · 48 paper notes
- Alligat0R: Pre-Training through Covisibility Segmentation for Relative Camera Pose Regression
-
This paper replaces CroCo's cross-view completion with covisibility segmentation as a stereo vision pre-training task, predicting for each pixel a label of "co-visible", "occluded", or "out-of-view". The approach significantly outperforms CroCo in low-overlap scenarios and achieves a first-place overall success rate of 60.3% on the RUBIK benchmark.
- ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
-
This paper proposes ARGenSeg — the first unified MLLM framework that leverages the autoregressive image generation paradigm for image segmentation. The model directly outputs visual tokens decoded by a VQ-VAE into segmentation masks, requiring no additional segmentation head. A next-scale prediction parallel generation strategy enables 4× inference speedup, and the method surpasses state-of-the-art on RefCOCO/+/g with significantly less training data.
- Attention (as Discrete-Time Markov) Chains
-
This work reinterprets the softmax-normalized attention matrix as the transition probability matrix of a Discrete-Time Markov Chain (DTMC), and proposes Multi-Bounce Attention and TokenRank (stationary distribution, analogous to PageRank) to capture indirect attention paths and global token importance. The approach achieves 94.29% mAP on ImageNet segmentation and enhances image generation quality in Self-Attention Guidance.
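As a minimal sketch of the Markov-chain view (the function name and toy matrix are illustrative, not from the paper): treat a row-stochastic attention matrix as a DTMC transition matrix and power-iterate to its stationary distribution, the quantity TokenRank is built on:

```python
import numpy as np

def tokenrank(attn, iters=100, tol=1e-9):
    """Stationary distribution of a row-stochastic attention matrix,
    computed by power iteration (analogous to PageRank)."""
    n = attn.shape[0]
    pi = np.full(n, 1.0 / n)      # start from the uniform distribution
    for _ in range(iters):
        nxt = pi @ attn           # one DTMC step: pi_{t+1} = pi_t P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# toy 3-token attention matrix (each row sums to 1)
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
scores = tokenrank(A)             # global importance score per token
```

Tokens with high stationary mass are those that attention paths keep returning to, which is the global-importance signal the paper exploits.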
- ConnectomeBench: Can LLMs Proofread the Connectome?
-
This paper introduces ConnectomeBench, the first standardized benchmark for evaluating multimodal LLMs on three key connectomics proofreading tasks: segment identification, split error correction, and merge error detection. o4-mini achieves 85% on the split correction multiple-choice task, yet merge error detection remains significantly below human expert performance.
- COS3D: Collaborative Open-Vocabulary 3D Segmentation
-
This paper proposes COS3D — a collaborative prompt-segmentation framework that constructs a collaborative field comprising an instance field and a language field. During training, the language field is built via instance-to-language feature mapping; during inference, language-to-instance adaptive prompt refinement generates precise segmentation results. COS3D substantially outperforms existing methods on two mainstream benchmarks.
- Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation
-
A two-stage active learning pipeline (coverage → uncertainty) is proposed, leveraging multi-scale features from pretrained diffusion models to achieve efficient semantic segmentation under extremely low annotation budgets.
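The two-stage selection can be sketched as follows, assuming k-center greedy for the coverage stage and predictive entropy for the uncertainty stage (the paper's exact criteria may differ; all data here is synthetic):

```python
import numpy as np

def kcenter_greedy(feats, k):
    """Stage 1 (coverage): greedily pick k samples spread over feature space."""
    chosen = [0]
    dists = np.linalg.norm(feats - feats[0], axis=1)
    for _ in range(k - 1):
        idx = int(dists.argmax())  # farthest point from the current set
        chosen.append(idx)
        dists = np.minimum(dists, np.linalg.norm(feats - feats[idx], axis=1))
    return chosen

def entropy(probs):
    """Stage 2 (uncertainty): per-sample predictive entropy."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))            # stand-in for diffusion features
probs = rng.dirichlet(np.ones(4), size=50)  # stand-in for class posteriors
pool = kcenter_greedy(feats, 10)            # coverage-filtered candidate pool
ent = entropy(probs)
budget = sorted(pool, key=lambda i: -ent[i])[:5]  # final annotation requests
```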
- Exploring Structural Degradation in Dense Representations for Self-supervised Learning
-
This paper identifies and systematically investigates the Self-supervised Dense Degradation (SDD) phenomenon — where longer training improves classification yet hurts dense task performance — and proposes the DSE metric along with DSE-guided model selection and regularization strategies, achieving an average mIoU improvement of 3.0%.
- Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
-
By introducing convolutional decoding normalization (replacing hard semi-autoregressive chunking) and rule-based rejective fine-tuning (R2FT), the proposed method achieves generation quality at 128 inference steps comparable to 512+ steps, reaching state-of-the-art performance among diffusion language models (DLMs).
- FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
-
FAST introduces explicit mechanisms to preserve anomaly regions throughout the diffusion trajectory: AIAS compresses the multi-step reverse process of discrete diffusion into a small number of coarse-to-fine analytical updates, while FARM reconstructs and reinjects anomaly foregrounds at each step, yielding a method that is both fast and better suited for generating training data for downstream anomaly segmentation models.
- FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning
-
FineRS is a two-stage MLLM reinforcement learning framework comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR), coupled via a locate-informed retrospective reward. Evaluated on the newly constructed FineRS-4k UAV high-resolution dataset, it achieves reasoning and segmentation of ultra-small objects with a gIoU of 55.1% (surpassing Seg-Zero by 8.5%) while simultaneously supporting VQA (MVQA 83.3%).
- GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset
-
This paper introduces GTPBD, the first fine-grained global terraced parcel and boundary dataset, comprising 47,537 high-resolution images (0.5–0.7 m) with over 200,000 manually annotated parcels. It provides three-level labels supporting four tasks—semantic segmentation, edge detection, agricultural parcel extraction, and unsupervised domain adaptation—and presents comprehensive benchmarks across 20 methods.
- HAODiff: Human-Aware One-Step Diffusion via Dual-Prompt Guidance
-
This paper proposes HAODiff, a human-aware one-step diffusion model that generates adaptive positive–negative prompt pairs via a three-branch Dual-Prompt Guidance (DPG) module. Combined with an explicit Human Motion Blur (HMB) degradation pipeline and Classifier-Free Guidance (CFG), HAODiff substantially outperforms existing state-of-the-art methods on human image restoration tasks.
- HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios
-
This paper introduces the Referring Human Action Segmentation (RHAS) task—localizing a specific individual in multi-person videos via textual descriptions and performing frame-level action segmentation. The authors construct the RHAS133 dataset comprising 133 movies, 137 action categories, and 33 hours of video, and propose HopaDIFF, a holistic-partial aware Fourier-conditioned diffusion framework that substantially outperforms existing baselines across multiple evaluation settings.
- HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation
-
HumanCrafter is proposed as the first feed-forward framework that unifies single-image 3D human reconstruction with body-part semantic segmentation. A human geometry prior-guided Transformer aggregates multi-view features, while DINOv2 self-supervised semantic priors construct a 3D feature field. The method simultaneously surpasses existing SOTA in both 3D reconstruction and segmentation on 2K2K and THuman2.1.
- InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
-
This paper introduces a new task — Instruction-oriented Counting, Detection, and Segmentation (InstructCDS) — along with the EarthInstruct remote sensing benchmark covering three settings (open-vocabulary, open-ended, and open-subcategory). It proposes InstructSAM, a training-free framework that uses an LVLM to parse instructions and predict counts, SAM2 to generate mask proposals, and CLIP to compute similarities. A Binary Integer Programming (BIP) formulation then performs optimal mask-label assignment under counting constraints, achieving near-constant inference time while outperforming task-specific baselines.
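A greedy stand-in for the assignment step (the actual method solves an exact Binary Integer Program; this simplified routine and its inputs are purely illustrative) gives each label its best masks by similarity under per-label count constraints:

```python
def assign_masks(sim, counts):
    """Greedy approximation of counting-constrained mask-label assignment:
    walk (mask, label) pairs from highest to lowest similarity, assigning
    each mask at most once and each label at most counts[label] times."""
    pairs = sorted(((sim[m][l], m, l)
                    for m in range(len(sim))
                    for l in range(len(sim[0]))), reverse=True)
    used, left, out = set(), dict(counts), {}
    for score, m, l in pairs:
        if m not in used and left.get(l, 0) > 0:
            out[m] = l
            used.add(m)
            left[l] -= 1
    return out

sim = [[0.9, 0.1],   # mask 0 looks like label 0
       [0.2, 0.8],   # mask 1 looks like label 1
       [0.7, 0.6]]   # mask 2 is ambiguous
assignment = assign_masks(sim, {0: 2, 1: 1})  # predicted counts: 2x label 0, 1x label 1
```

The count constraint is what forces the ambiguous mask 2 onto label 0 here; the BIP in the paper makes this trade-off globally optimal rather than greedy.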
- Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
-
This paper proposes a neuron-attention decomposition method to interpret CLIP-ResNet by decomposing model outputs into pairwise contribution paths of neurons and attention pooling heads. The resulting neuron-head pairs are shown to admit rank-1 approximations, exhibit sparsity, and capture sub-concepts. The method is applied to training-free semantic segmentation (mIoU 26.2% on PASCAL Context, surpassing MaskCLIP by 15%) and dataset distribution shift monitoring.
- LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
-
LangHOPS is the first open-vocabulary object-part instance segmentation framework based on a multimodal large language model (MLLM). It establishes object-part hierarchical relationships in language space and leverages the knowledge and reasoning capabilities of MLLMs to bridge multi-granularity concepts. It achieves 56.9% AP on PartImageNet, surpassing the previous SOTA by 5.5%, and outperforms prior methods by 4.8% in cross-dataset settings.
- Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks
-
This paper presents Mars-Bench — the first comprehensive benchmark for Mars science tasks, encompassing 20 datasets across three task types (classification, segmentation, and object detection). It systematically evaluates ImageNet-pretrained models, Earth observation foundation models, and vision-language models on Martian data, revealing significant gaps in current general-purpose models and calling for the development of Mars-specific foundation models.
- Mechanistic Interpretability of RNNs Emulating Hidden Markov Models
-
A vanilla RNN is trained to reproduce the emission statistics of an HMM; reverse engineering then reveals the mechanism by which the RNN implements discrete stochastic state transitions: noise-driven orbital dynamics combined with rapid transitions triggered by "kick neurons." The underlying principle is self-induced stochastic resonance (SISR), and this dynamical motif can be composed and reused to emulate more complex discrete latent structures.
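The RNN's training target can be made concrete with a tiny HMM emission sampler (a generic illustration, not the paper's specific setup):

```python
import random

def sample_hmm(trans, emit, steps, state=0, seed=0):
    """Sample an emission sequence from a discrete HMM -- the statistics
    the RNN is trained to reproduce."""
    rng = random.Random(seed)
    out = []
    for _ in range(steps):
        out.append(rng.choices(range(len(emit[state])),
                               weights=emit[state])[0])
        state = rng.choices(range(len(trans[state])),
                            weights=trans[state])[0]
    return out

# two sticky hidden states with distinct emission profiles
trans = [[0.95, 0.05], [0.05, 0.95]]
emit  = [[0.9, 0.1], [0.1, 0.9]]
seq = sample_hmm(trans, emit, steps=200)
```

The sticky transition matrix produces long runs of one emission symbol punctuated by abrupt switches, which is exactly the discrete-jump behavior the reverse engineering traces back to kick neurons.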
- Mechanistic Interpretability of RNNs Emulating Hidden Markov Models
-
By training RNNs to emulate the emission statistics of HMMs, then reverse-engineering the learned solutions, this work reveals how RNNs exploit noise-driven orbital dynamics, structured connectivity (noise-integrating populations + kick neurons), and self-induced stochastic resonance to implement discrete stochastic state transitions.
- MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans
-
This paper introduces MultiHuman-Testbench, the first systematic benchmark for evaluating multi-human image generation. It comprises 1,800 test samples paired with 5,550 face images, a suite of multi-dimensional evaluation metrics including Hungarian-matching-based identity similarity, and proposes Regional Isolation and Implicit Region Assignment techniques to enhance existing methods without additional training.
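The Hungarian-matching-based identity metric amounts to an optimal one-to-one assignment between generated faces and reference identities; a brute-force version for small n (the similarity values are illustrative):

```python
from itertools import permutations

def best_identity_matching(sim):
    """Optimal assignment between generated faces (rows) and reference
    identities (cols), brute-forced over permutations for small n."""
    n = len(sim)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(sim[i][perm[i]] for i in range(n))
        if score > best:
            best, best_perm = score, perm
    return best_perm, best / n   # assignment + mean identity similarity

sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.2],
       [0.1, 0.3, 0.7]]
assign, mean_sim = best_identity_matching(sim)
```

In practice the benchmark would use the Hungarian algorithm (O(n³)) rather than this O(n!) enumeration, but the objective is the same.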
- Novel Class Discovery for Point Cloud Segmentation via Joint Learning of Causal Representation and Reasoning
-
This paper is the first to introduce causal learning into 3D point cloud novel class discovery (3D-NCD). By leveraging a Structural Causal Model (SCM) to analyze confounders in base classes and causal relationships between base and novel classes, it proposes Causal Representation Prototype learning (CRP, which eliminates confounders via an adversarial network) and graph-based causal reasoning (GCN-based pseudo-label generation), achieving state-of-the-art results on SemanticKITTI and SemanticPOSS.
- OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
-
OmniSegmentor constructs a large-scale ImageNeXt dataset encompassing 5 visual modalities (1.2M samples), proposes an efficient pretraining strategy that randomly selects one supplementary modality to align with RGB per iteration, and establishes the first flexible multi-modal pretrain-finetune pipeline, achieving state-of-the-art results on 6 multi-modal semantic segmentation benchmarks.
- Panoptic Captioning: An Equivalence Bridge for Image and Text
-
This paper proposes the novel task of Panoptic Captioning, which seeks the minimal text equivalent of an image—a comprehensive structured description along five dimensions (entity semantic tags, locations via bounding boxes, attributes, relations, and global state)—and introduces the PancapEngine data engine and the PancapChain decoupled multi-stage method. A 13B model trained under this framework surpasses InternVL-2.5-78B and GPT-4o.
- PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding
-
This paper presents PartNeXt, a fine-grained hierarchical part annotation dataset comprising 23,519 high-quality textured 3D models across 50 categories. Two benchmarks are established—category-agnostic part segmentation and 3D part question answering—revealing significant deficiencies of current methods in fine-grained part understanding.
- PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
-
This paper introduces the Partonomy part-level segmentation benchmark (862 part labels / 534 object labels) and the Plum model, which replaces the [SEG] token with BIO span tagging and incorporates a mask feedback loop. The study reveals that state-of-the-art segmentation LMMs achieve only 5.9% gIoU on part understanding; Plum achieves significant improvements by avoiding distribution shift and leveraging historical predictions.
- Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation
-
This paper proposes the Edge-awareness Semantic Concordance (ESC) framework, which leverages semantic edges as an intermediate bridge between heterogeneous Event and RGB modalities. Through discrete latent space modeling via an edge dictionary, ESC achieves cross-modal feature alignment and uncertainty optimization, surpassing the state of the art by 2.55% mIoU under extreme conditions.
- HCLFuse: Revisiting Generative Infrared and Visible Image Fusion Based on Human Cognitive Laws
-
HCLFuse performs modality alignment via the information bottleneck principle and optimal transport theory, combining a Variational Bottleneck Encoder (VBE) with a physics-guided conditional diffusion model. Three physical constraints—heat conduction, structure preservation, and physical consistency—are injected into the diffusion process. On the MSRS dataset, the average gradient (AG) metric improves by 69.87% and the spatial frequency (SF) metric by 39.41%.
- Robust Ego-Exo Correspondence with Long-Term Memory
-
This paper proposes LM-EEC, a SAM 2-based cross-view video object segmentation framework for ego-exo correspondence. It introduces a Memory-View MoE (MV-MoE) module to adaptively fuse memory features with cross-view features, coupled with a dual memory bank compression strategy for retaining long-term information. LM-EEC substantially outperforms existing methods on the EgoExo4D benchmark (Ego2Exo IoU: 54.98 vs. 38.26).
- Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention
-
This paper proposes CERES, a framework that addresses language bias and visual confusion in egocentric referring video object segmentation (Ego-RVOS) via dual-modal causal intervention — language backdoor adjustment to eliminate dataset statistical bias, and depth-guided visual frontdoor adjustment to construct causal mediators — achieving SOTA on VISOR/VOST/VSCOS.
- RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
-
This paper proposes RoMA — the first self-supervised autoregressive pre-training framework based on the Mamba architecture for remote sensing. By introducing an adaptive rotation encoding strategy and a multi-scale token prediction mechanism, RoMA addresses the challenges of orientation diversity and extreme scale variation inherent in remote sensing imagery, while empirically validating that Mamba follows data and parameter scaling laws in the remote sensing domain.
- SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
-
SaFiRe is a framework that simulates the human two-stage "saccade-fixation" cognitive process, leveraging Mamba's scan-then-update mechanism to achieve linear-complexity multi-round refinement for referring image segmentation under ambiguous expressions.
- SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning
-
SAM-R1 proposes an end-to-end reasoning segmentation framework that, for the first time, incorporates SAM as a reward provider within the reinforcement learning training loop. Combined with a tiered IoU accuracy reward, asymmetric clipping, and token-level loss normalization in an improved GRPO algorithm, the method achieves a gIoU of 60.2% on the ReasonSeg zero-shot benchmark—surpassing Seg-Zero and other approaches—using only 3K training samples.
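A tiered IoU reward can be sketched as follows; the thresholds and partial-credit values here are hypothetical, not the paper's:

```python
def mask_iou(a, b):
    """IoU between two flat binary masks (sequences of 0/1)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def tiered_iou_reward(iou, tiers=((0.5, 0.25), (0.7, 0.5), (0.9, 1.0))):
    """Tiered accuracy reward: a coarse mask earns partial credit, a
    near-perfect mask earns the full reward. The highest tier whose
    threshold is met determines the reward."""
    reward = 0.0
    for threshold, value in tiers:
        if iou >= threshold:
            reward = value
    return reward
```

Discretizing the reward like this gives the policy a clear signal to climb from "roughly right" to "precisely right" instead of chasing marginal IoU gains.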
- SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
-
SANSA reveals that SAM2, despite being pre-trained in a class-agnostic manner, implicitly encodes rich semantic structure in its features. By inserting lightweight AdaptFormer adapters into the last two layers of a frozen SAM2 Image Encoder, the method redirects the Memory Attention mechanism from visual-similarity matching to semantic-similarity matching. This unified architecture achieves state-of-the-art performance on few-shot segmentation while being more than 3× faster and 4–5× smaller in parameter count than competing approaches.
- Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
-
Seg-VAR reformulates image segmentation as a conditional autoregressive mask generation problem. By introducing seglat (a latent representation of segmentation masks) and spatial-aware color mapping, it encodes segmentation masks into discrete tokens processable by a VAR model. Seg-VAR comprehensively outperforms discriminative methods such as Mask2Former and generative methods such as GSS across semantic, instance, and panoptic segmentation tasks on COCO, Cityscapes, and ADE20K.
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers
-
Through systematic analysis of the joint attention mechanism in Multimodal Diffusion Transformers (MM-DiT), this paper identifies specific layers ("semantic localization expert layers") that inherently possess high-quality semantic segmentation capability, and proposes a lightweight fine-tuning method, MAGNET, that simultaneously improves both segmentation and generation performance.
- Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas
-
This paper proposes a "synthetic data-driven self-supervised pretraining" paradigm: one million synthetic fractal images are first generated via the Flame algorithm to pretrain a ViT-L/16 encoder using the DINOv2 framework; the frozen encoder is then transferred directly to an extremely limited set of magnetohydrodynamic (MHD) star-formation simulation data, achieving stellar mass prediction via kNN regression (\(R^2 = 0.81\)) and zero-shot unsupervised semantic segmentation via PCA projection—slightly outperforming a fully supervised ResNet-18 baseline trained on the same data.
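The kNN-regression readout over frozen features is simple to sketch (synthetic stand-in features; the real pipeline uses frozen DINOv2 embeddings of MHD simulation images):

```python
import numpy as np

def knn_regress(train_feats, train_y, query, k=3):
    """kNN regression in a frozen encoder's feature space: predict the
    mean target of the k nearest training embeddings."""
    d = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(d)[:k]
    return train_y[nearest].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))        # stand-in for encoder embeddings
y = feats[:, 0] * 2.0                     # toy "stellar mass" target
pred = knn_regress(feats, y, feats[7], k=1)
```

Because the encoder stays frozen, the only trainable component is this non-parametric readout, which is why the approach works with extremely limited simulation data.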
- SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning
-
SRSR proposes a training-free plug-and-play framework that addresses semantic hallucination caused by text guidance in diffusion-based super-resolution methods. It introduces two inference-time modules—Spatially Re-focused Cross-Attention (SRCA) and Spatially Targeted CFG (STCFG)—and comprehensively outperforms 7 SOTA baselines in both fidelity and perceptual quality.
- STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model
-
This paper proposes STEAD, the first provably secure and robust linguistic steganography method based on diffusion language models (DLMs). It exploits the parallel denoising property of DLMs to identify "robust positions" for message embedding, and combines repetitive error-correcting codes with a neighborhood search strategy to resist token-level substitution, insertion, and deletion attacks.
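In its simplest form, the repetitive error-correcting code mentioned above is bitwise repetition with majority-vote decoding (a textbook sketch; STEAD's robustness to insertions and deletions additionally relies on the neighborhood search strategy):

```python
def encode_repetition(bits, r=3):
    """Repeat each message bit r times."""
    return [b for b in bits for _ in range(r)]

def decode_repetition(coded, r=3):
    """Majority vote over each block of r received bits."""
    return [int(sum(coded[i:i + r]) > r // 2)
            for i in range(0, len(coded), r)]

msg = [1, 0, 1, 1]
tx = encode_repetition(msg)    # 12 coded bits
tx[1] ^= 1                     # a single token-level substitution
decoded = decode_repetition(tx)
```

An r-fold repetition code tolerates up to ⌊r/2⌋ substitutions per block, which is what makes the embedded message recoverable after paraphrase-style attacks.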
- STEP: A Unified Spiking Transformer Evaluation Platform for Fair and Reproducible Benchmarking
-
STEP is the first unified evaluation platform for Spiking Transformers (STs), supporting multi-task benchmarking (classification/segmentation/detection) and multiple backends (SpikingJelly/BrainCog/BrainPy). Through systematic ablation, it reveals that current STs rely heavily on convolutional frontends, that attention contributes minimally, and that temporal modeling capacity is insufficient. The platform further proposes a unified energy-consumption analysis framework that accounts for bit-width, sparsity, and memory-access costs.
- TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations
-
This paper proposes TabRAG, a parsing-based RAG framework that decomposes documents into fine-grained components via layout segmentation, extracts tables into hierarchical structured representations using vision-language models, and integrates a self-generated in-context learning module to adapt to diverse table formats, achieving comprehensive improvements over existing parsing techniques on tabular document question answering.
- Torch-Uncertainty: A Deep Learning Framework for Uncertainty Quantification
-
Torch-Uncertainty is the first unified, scalable, domain-agnostic, and evaluation-centric PyTorch/Lightning framework for uncertainty quantification (UQ), integrating 6 major UQ method families, 26 evaluation metrics, and 27 plug-and-play datasets across classification, segmentation, and regression tasks, along with comprehensive benchmark results.
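One representative metric such a UQ framework reports is expected calibration error (ECE); a minimal histogram-binning implementation (this is a generic sketch, not Torch-Uncertainty's API):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: the gap between mean confidence and mean accuracy,
    averaged over confidence bins weighted by bin occupancy."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

conf = np.array([0.9, 0.8, 0.6, 0.95])      # predicted confidences
correct = np.array([1.0, 1.0, 0.0, 1.0])    # 1 if prediction was right
score = expected_calibration_error(conf, correct)
```

A perfectly calibrated model (confidence equals accuracy in every bin) scores 0; the worked values above give 0.2375.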
- Towards Robust Pseudo-Label Learning in Semantic Segmentation: An Encoding Perspective
-
This paper proposes ECOCSeg, which replaces one-hot encoding with Error-Correcting Output Codes (ECOC) to represent semantic categories. It decomposes an N-class classification problem into K binary sub-tasks, and couples bit-level pseudo-label denoising with customized optimization losses to substantially improve the robustness of pseudo-label learning in UDA and SSL semantic segmentation.
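The ECOC idea can be sketched with a hand-built codebook of minimum Hamming distance 3, so any single flipped bit-prediction still decodes to the correct class (the codebook and decoding rule here are illustrative, not ECOCSeg's):

```python
import numpy as np

# 4 classes, 6-bit codewords, pairwise Hamming distance >= 3:
# nearest-codeword decoding corrects any single bit error.
BOOK = np.array([[0, 0, 0, 0, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [0, 0, 1, 1, 1, 1],
                 [1, 1, 0, 1, 1, 0]])

def ecoc_decode(bit_probs, codebook):
    """Map K binary predictions to the class whose codeword is nearest
    under soft Hamming distance."""
    dist = np.abs(bit_probs[None, :] - codebook).sum(axis=1)
    return int(dist.argmin())

noisy = BOOK[2].astype(float)
noisy[0] = 1.0 - noisy[0]        # one of the K binary heads is wrong
cls = ecoc_decode(noisy, BOOK)   # still decodes to class 2
```

This built-in redundancy is what makes bit-level pseudo-labels more tolerant of per-task noise than one-hot targets.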
- Towards Unsupervised Domain Bridging via Image Degradation in Semantic Segmentation
-
This paper proposes DiDA, which formalizes image degradation operations as the forward process of diffusion models to construct a continuous intermediate domain between the source and target domains. Combined with a semantic shift compensation mechanism, DiDA serves as a plug-and-play module that consistently improves existing UDA semantic segmentation methods.
- UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
-
UniPixel proposes the first end-to-end large multimodal model that unifies object referring and segmentation, leveraging a novel Object Memory Bank design to transform sparse visual prompts into dense object mask features and inject them into the reasoning process. The model achieves state-of-the-art performance on 10 benchmarks and introduces PixelQA, a new task requiring simultaneous referring, segmentation, and question answering.
- Unveiling the Spatial-Temporal Effective Receptive Fields of Spiking Neural Networks
-
This paper proposes a Spatial-Temporal Effective Receptive Field (ST-ERF) analysis framework to diagnose the bottleneck of Transformer-based SNNs in visual long-sequence modeling—namely, the lack of a global receptive field—and accordingly designs two channel mixers, MLPixer and SRB, to enhance the global modeling capability of SNNs.
- Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2
-
This paper proposes UAP-SAM2—the first cross-prompt universal adversarial attack against SAM2—which employs a dual semantic shift framework (intra-frame semantic confusion + inter-frame semantic inconsistency) to generate a single universal perturbation that causes segmentation targets to "vanish" across different videos, frames, and prompts.
- Vision Transformers with Self-Distilled Registers
-
This paper proposes PH-Reg (Post Hoc Registers), an efficient self-distillation approach that retrofits register tokens into existing pretrained ViTs without labeled data or full retraining. By combining test-time augmentation-based teacher feature denoising with student self-distillation, PH-Reg effectively eliminates artifact tokens in ViT dense features, improving performance on segmentation and depth estimation.