✂️ Segmentation

🧪 ICML2026 · 4 paper notes

📌 Same area in other venues: 💬 ACL2026 (1) · 📷 CVPR2026 (83) · 🔬 ICLR2026 (11) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (48) · 📹 ICCV2025 (73)

LightAVSeg: Lightweight Audio-Visual Segmentation

LightAVSeg decouples "semantic selection (what)" from "spatial localization (where)", replacing \(\mathcal{O}(N^2)\) cross-modal attention with global channel modulation. The resulting AVS model reaches 50.4 mIoU on MS3 with only 20.5M parameters and runs in 163.4 ms on a Snapdragon 8 Elite, about \(8\times\) faster than AVSegFormer-R50.
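To make the cost argument concrete, here is a minimal sketch of global channel modulation, assuming the audio embedding is projected to one sigmoid gate per visual channel. The names `channel_gates`, `modulate`, `weight`, and `bias` are hypothetical and this is not the LightAVSeg architecture itself; it only shows why the operation scales as \(\mathcal{O}(N \cdot C)\) rather than \(\mathcal{O}(N^2)\).

```python
import math

def channel_gates(audio_vec, weight, bias):
    """Project an audio embedding to one sigmoid gate per visual channel.

    weight has one row per visual channel (hypothetical linear projection).
    """
    gates = []
    for row, b in zip(weight, bias):
        z = sum(w * a for w, a in zip(row, audio_vec)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    return gates

def modulate(visual, gates):
    """Scale the channels of every spatial position by the same global gates.

    visual is a list of N positions, each a list of C channel values.
    No position attends to any other position, so the cost is O(N * C).
    """
    return [[v * g for v, g in zip(pos, gates)] for pos in visual]

# Toy example: N = 3 positions, C = 2 channels, 2-dimensional audio embedding.
visual = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
audio = [0.5, -0.5]
weight = [[0.0, 0.0], [2.0, -2.0]]  # channel 0 ignores audio, channel 1 responds to it
bias = [0.0, 0.0]

gates = channel_gates(audio, weight, bias)
out = modulate(visual, gates)
```

In this toy setup the audio signal decides which channels matter ("what"), while the spatial layout of the visual features is left untouched ("where").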

Segment Anything with Robust Uncertainty-Accuracy Correlation

The SAM series outputs only a single mask-level confidence and suffers from "Mask-level Confidence Confusion" under domain shift. This work equips SAM2 with a Weibull dual-granularity Bayesian mask decoder for pixel-level epistemic uncertainty estimation. Inspired by human vision, it introduces a collaborative adversarial perturbation that combines style and deformation, together with a calibration loss, so that uncertainty remains aligned with error across 23 zero-shot target domains. The average J&F reaches 79.87, and the uncertainty maps become significantly more reliable.
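One simple way to check whether uncertainty is "aligned with error" is to correlate per-pixel uncertainty with a per-pixel error indicator. The sketch below uses a plain Pearson correlation as an illustrative stand-in; the note does not state which measure the paper uses, and `pearson`, `uncertainty`, and `error` are hypothetical names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Flattened toy maps: predicted uncertainty per pixel, and 1 where the mask is wrong.
uncertainty = [0.9, 0.1, 0.8, 0.2]
error = [1, 0, 1, 0]

aligned = pearson(uncertainty, error)            # high when uncertain pixels are the wrong ones
misaligned = pearson(uncertainty, [0, 1, 0, 1])  # negative when confident pixels are wrong
```

A well-calibrated uncertainty map should yield a strongly positive value; a value near zero or below suggests the uncertainty is uninformative about where the mask fails.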

SEMIR: Semantic Minor-Induced Representation Learning on Graphs for Visual Segmentation

SEMIR treats the voxel grid as a parent graph \(G\) and compresses it into a "boundary-aligned" graph minor \(H\) (reducing the node count from \(\sim10^7\) to \(\sim10^3\)) via parameterized edge contraction, node deletion, and edge deletion. Using only 5–20 few-shot samples, it optimizes the parameters \(\Theta\) as a black box to maximize boundary Dice, then applies a GNN for supernode classification on the minor, and finally performs exact lifting via a bijection between the minor and the voxel grid. On the BraTS / KiTS / LiTS tumor segmentation tasks, SEMIR consistently outperforms nnU-Net on minority-class Dice while requiring only a 16GB T4 GPU.
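The contract, classify, lift pipeline can be sketched on a toy 1D "voxel" chain. This is a simplified illustration under stated assumptions, not the SEMIR algorithm: edges are contracted when the intensity difference is at most a single hypothetical threshold `theta`, a mean-intensity rule stands in for the GNN classifier, and node and edge deletion are omitted.

```python
def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def contract(values, edges, theta):
    """Contract every edge whose endpoint intensities differ by at most theta.

    Returns, for each voxel, the id of the supernode it was merged into.
    """
    parent = list(range(len(values)))
    for u, v in edges:
        if abs(values[u] - values[v]) <= theta:
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)
    return [find(parent, i) for i in range(len(values))]

def classify_supernodes(values, assignment, cutoff=0.5):
    """Stand-in for the GNN: label a supernode 1 if its mean intensity exceeds cutoff."""
    members = {}
    for voxel, node in enumerate(assignment):
        members.setdefault(node, []).append(values[voxel])
    return {node: int(sum(vs) / len(vs) > cutoff) for node, vs in members.items()}

def lift(assignment, supernode_labels):
    """Copy each supernode's label back to all of its voxels."""
    return [supernode_labels[node] for node in assignment]

values = [0.1, 0.2, 0.15, 0.9, 0.95, 0.1]   # toy voxel intensities
edges = [(i, i + 1) for i in range(len(values) - 1)]

assignment = contract(values, edges, theta=0.2)
labels = classify_supernodes(values, assignment)
voxel_labels = lift(assignment, labels)
```

Because edges across the large intensity jumps are never contracted, the supernodes stay aligned with the boundaries, and the lifting step recovers a per-voxel labeling from only three supernodes.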

UGround: Towards Unified Visual Grounding with Unrolled Transformers

UGround reverses the LMM-based visual grounding paradigm from "using the final layer \(\langle\text{SEG}\rangle\) token as prompt" to "using dynamically selected intermediate layer similarity maps as prompt." Through the RL-based SSC strategy, the \(\langle\text{SEG}\rangle\) token slides across all transformer layers, and the similarity map serves both as a soft logit mask for SAM and as a backward supervision signal. For the first time, it unifies five visual grounding tasks (RES, RS, FP-RES, gRES, Multi-RS) within a single framework, achieving cIoU +9.0% on ReasonSeg test and N-acc +12.1% on gRefCOCO val.
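As a rough illustration of what a "similarity map" at one layer might look like, the sketch below takes the \(\langle\text{SEG}\rangle\) hidden state and the image patch hidden states from a single layer and computes a cosine similarity per patch. The function names, the use of cosine similarity, and the `temperature` scaling into logits are all assumptions for illustration; the RL-based layer selection is not modeled here.

```python
import math

def cosine(a, b):
    """Cosine similarity with a small guard against zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / max(na * nb, 1e-12)

def similarity_map(seg_token, patch_tokens, height, width, temperature=1.0):
    """Build a height x width grid of scaled similarities (soft mask logits).

    patch_tokens is assumed to be in row-major order with height * width entries.
    """
    sims = [cosine(seg_token, p) / temperature for p in patch_tokens]
    return [sims[r * width:(r + 1) * width] for r in range(height)]

# Toy layer output: a 2-dimensional <SEG> state and a 2x2 grid of patch states.
seg_token = [1.0, 0.0]
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]

sim = similarity_map(seg_token, patch_tokens, height=2, width=2)
```

Patches whose hidden states point in the same direction as the \(\langle\text{SEG}\rangle\) state receive high values, so the grid can act as a dense, soft prompt rather than a single token embedding.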