
✂️ Segmentation

🧪 ICML2026 · 14 paper notes

📌 Same area in other venues: 📷 CVPR2026 (122) · 🔬 ICLR2026 (32) · 🤖 AAAI2026 (29) · 🧠 NeurIPS2025 (45) · 📹 ICCV2025 (73) · 🧪 ICML2025 (18)

🔥 Top topics: Segmentation ×7

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models: This paper constructs PolyMLP, PolyConv, and PolyAttn using Hadamard products to replace pointwise activations/softmax in MLP, convolution, and attention. Without conventional activation functions, these modules allow MetaFormer-style backbones to reach or exceed the performance of activation-based models on ImageNet, robustness benchmarks, and ADE20K segmentation.
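
To make the idea of replacing a pointwise activation with a Hadamard product concrete, here is a minimal sketch of a degree-2 polynomial block. The name `poly_mlp`, the three weight matrices, and the exact form `W3 ((W1 x) ⊙ (W2 x))` are illustrative assumptions, not the paper's PolyMLP definition, which the summary above does not spell out.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def poly_mlp(x, W1, W2, W3):
    """Hypothetical activation-free block: W3 @ ((W1 @ x) * (W2 @ x)).

    The nonlinearity comes only from the elementwise (Hadamard) product
    of two linear projections, so the output is a degree-2 polynomial in x.
    """
    a = matvec(W1, x)
    b = matvec(W2, x)
    gated = [ai * bi for ai, bi in zip(a, b)]  # Hadamard product
    return matvec(W3, gated)


W1 = [[1, 0], [0, 1]]
W2 = [[1, 1], [1, -1]]
W3 = [[1, 1]]
y = poly_mlp([2, 3], W1, W2, W3)
y_scaled = poly_mlp([4, 6], W1, W2, W3)
```

Because the block has no bias terms, doubling the input multiplies the output by four, which is a quick way to check that the mapping is purely quadratic.
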
Beyond Detection: A Structure-Aware Framework for Scene Text Tracking: The authors propose SymTrack, a detection-free dual-branch scene text tracking framework. It addresses feature bottlenecks caused by perspective distortion through Predictive Token Rectification (PTR), resolves high visual ambiguity among text instances using Cross-Expert Calibration (CEC), and stabilizes fine-grained localization with an Adaptive Inference Engine (AIE). It outperforms prior state-of-the-art methods on three benchmarks by up to +12.32% AUC.
FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation: This paper observes that current query-based LLM-conditioned segmentation follows a "propose-then-select" paradigm: the candidate masks are often accurate enough, but the wrong one gets selected. FlowSeg instead lets the LLM conditional embeddings participate in query refinement at every decoder layer, where they are continuously updated with new visual evidence. Combined with a lightweight boundary refinement module, it achieves consistent performance gains on RefCOCO/+/g and ReasonSeg.
Functional Attention: From Pairwise Affinities to Functional Correspondences: This paper reinterprets softmax attention in Transformers as a "least-squares linear operator between two learned functional bases." Borrowing the idea of functional maps from shape matching, it compresses the \(n \times n\) pairwise affinity matrix into a \(k \times k\) compact spectral operator, achieving SOTA performance in PDE solving, 3D point cloud segmentation, and OOD generalization simultaneously.
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models: GPUA treats VLMs like CLIP (rich semantics, insufficient local precision) and VFMs like DINOv3 (fine-grained detail, lacking semantics) as two "visual languages." It uses Optimal Transport to mine soft correspondences and solves the Orthogonal Procrustes problem to learn a geometry-preserving linear mapping that translates VFM features into the VLM space. This process is entirely unsupervised, requires no updates to pre-trained parameters, and achieves an average 11.8% improvement in zero-shot classification.
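
The Orthogonal Procrustes step mentioned above has a well-known closed-form solution via SVD. The sketch below assumes both feature sets have the same dimensionality and uses NumPy; the function name and the optional coupling argument `T` (standing in for soft correspondences such as an Optimal Transport plan) are illustrative assumptions, not GPUA's actual interface.

```python
import numpy as np


def orthogonal_procrustes(X, Y, T=None):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X: (n, d) source features, Y: (m, d) target features.
    T: optional (n, m) soft correspondence matrix (assumption: e.g. an
       OT plan). If omitted, rows of X and Y are assumed to be paired.
    """
    M = X.T @ Y if T is None else X.T @ T @ Y  # cross-covariance
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a random orthogonal map
Y = X @ Q
W = orthogonal_procrustes(X, Y)
```

Because `W` is orthogonal, it preserves distances and angles between the source features, which is one plausible reading of "geometry-preserving" in the summary above.
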
LightAVSeg: Lightweight Audio-Visual Segmentation: LightAVSeg decouples "semantic filtering (what)" and "spatial localization (where)" by replacing \(\mathcal{O}(N^2)\) cross-modal attention with global channel modulation. This allows the AVS model to achieve 50.4 mIoU (MS3) with only 20.5M parameters and reach an on-device latency of 163.4 ms on Snapdragon 8 Elite, which is approximately \(8\times\) faster than AVSegFormer-R50.
MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation: MVR-cache upgrades the similarity metric for LLM semantic caching from "single-vector cosine" to "multi-vector MaxSim after learned segmentation." By training a lightweight segmentation model via REINFORCE, it boosts cache hit rates by up to 37% while maintaining the same error rate upper bound \(\delta\).
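
For readers unfamiliar with MaxSim, the following sketch shows the late-interaction score in plain Python: each query segment vector is matched to its most similar cached segment vector, and the best matches are averaged. The averaging (rather than summing) and the function names are assumptions made for illustration; how MVR-cache segments prompts and sets its threshold is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim(query_vecs, cached_vecs):
    """Average, over query segments, of the best cosine match in the cache entry."""
    best = [max(cosine(q, c) for c in cached_vecs) for q in query_vecs]
    return sum(best) / len(best)


query = [[1.0, 0.0], [0.0, 1.0]]    # two segment embeddings
cached = [[1.0, 0.0], [1.0, 0.0]]   # cache entry covers only the first segment
score = maxsim(query, cached)
```

In this toy case the second query segment has no good match in the cache entry, so the score drops to 0.5 even though the first segment matches perfectly.
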
Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion: CurriSeg keeps the segmentation network architecture unchanged and modifies only the training schedule: it first pushes the model to a stable state using a robust curriculum based on "temporal loss statistics + pixel entropy weighting," and then performs anti-curriculum "spectral blindness" fine-tuning (removing high frequencies to force the model to capture structural semantics). This approach consistently improves FEDER / FSEL / RUN by 2–4% on camouflaged/polyp segmentation benchmarks such as CHAMELEON / CAMO / COD10K / NC4K with zero additional parameters and shorter training time.
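
One simple way to remove high frequencies from an input image is a low-pass mask in the Fourier domain. The sketch below uses NumPy and is only an assumption about what "spectral blindness" might look like in practice; the function name, the square mask, and the `keep_ratio` parameter are hypothetical and not taken from CurriSeg.

```python
import numpy as np


def low_pass(image, keep_ratio=0.25):
    """Keep only a centred square of low spatial frequencies.

    image: 2D float array (single channel).
    keep_ratio: approximate fraction of each frequency axis that is kept.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # DC moves to the centre
    rh = max(1, int(h * keep_ratio / 2))
    rw = max(1, int(w * keep_ratio / 2))
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - rh:cy + rh + 1, cx - rw:cx + rw + 1] = True  # symmetric around DC
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real


# A checkerboard is the highest-frequency pattern a pixel grid can hold.
rows, cols = np.indices((16, 16))
checkerboard = ((rows + cols) % 2).astype(float)
smoothed = low_pass(checkerboard)
```

After filtering, the checkerboard collapses to its mean value, while a constant image passes through unchanged: fine texture is suppressed and only coarse structure survives.
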
Segment Anything with Robust Uncertainty-Accuracy Correlation: The SAM series outputs only a single mask-level confidence score and suffers from "Mask-level Confidence Confusion" under domain shift. To address this, the paper equips SAM2 with a Weibull dual-granularity Bayesian mask decoder for pixel-level epistemic uncertainty estimation. It incorporates a synergistic style-plus-deformation adversarial perturbation and a calibration loss inspired by human vision, so that uncertainty remains aligned with errors across 23 zero-shot target domains. The method achieves an average J&F of 79.87 with significantly more reliable uncertainty maps.
SPROUT: Supervise Less, See More — Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting: SPROUT is the first fully training-free, zero-annotation framework for pathological nuclei segmentation. It utilizes H&E staining priors to self-construct high-confidence foreground/background regions on each slide → extracts prototypes → performs feature-prototype soft alignment via Partial Optimal Transport (POT) → outputs positive/negative point prompts for SAM. On benchmarks like MoNuSeg, its AJI is 8.2% higher than training-based methods.
Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background: Automated waste recycling faces cluttered backgrounds, translucent or deformable waste, and heavy existing backbones. To address these challenges, this paper proposes EWSegNet, a lightweight segmentation network that cascades spatial-domain modules for local structures with frequency-domain modules for global context so that the two complement each other. It further incorporates an Auxiliary Feature Enhancement Module (AFEM) using Difference of Gaussians (DoG) + Pooling Attention to strengthen boundaries and blobs, achieving SOTA-level accuracy with fewer parameters and lower latency.
UGround: Towards Unified Visual Grounding with Unrolled Transformers: UGround flips the LMM-based visual grounding paradigm from "using the \(\langle\text{SEG}\rangle\) token of the last layer as a prompt" to "using the similarity maps of dynamically selected intermediate layers as prompts." Through a reinforcement learning strategy (SSC), the \(\langle\text{SEG}\rangle\) token slides through all transformer layers, treating the similarity map simultaneously as a soft logit mask for SAM and a backward supervision signal. This approach unifies five visual grounding tasks—RES, RS, FP-RES, gRES, and Multi-RS—within a single framework for the first time, achieving +9.0% cIoU on ReasonSeg test and +12.1% N-acc on gRefCOCO val.
Unsupervised Hierarchical Skill Discovery: Starting from unlabeled observation trajectories, HiSD segments skills via optimal transport and then discovers multi-level skill hierarchies using Sequitur grammar induction, without requiring action labels or reward signals.
What Makes Synthetic Data Effective in Image Segmentation: This paper systematically analyzes two key factors that make synthetic images effective for semantic segmentation: dense composition and fine instance fidelity. It proposes SENSE, which leverages Optimal Transport (OT) to stabilize pseudo-label assignment for synthetic images, achieving consistent improvements for DPT and Mask2Former on Cityscapes, COCO, and ADE20K.