
AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

Conference: ICLR 2026 arXiv: 2602.22740 Code: GitHub Area: Image Segmentation Keywords: referring image segmentation, vision-language alignment, masked learning, cross-modal similarity

TL;DR

This paper proposes an Alignment-aware Masked Learning (AML) strategy that quantifies vision-language patch-level alignment and filters low-alignment pixels, enabling RIS models to focus on reliable regions during training. Without any architectural modifications, AML achieves state-of-the-art performance across all 8 splits of RefCOCO benchmarks.

Background & Motivation

Root Cause

Background & Key Challenges:

  1. Referring Image Segmentation (RIS) requires accurately segmenting the target object in an image based on a natural language expression, relying on fine-grained cross-modal alignment.
  2. RIS training typically provides only a single annotated target per sample, resulting in sparse supervision signals.
  3. Understanding expressions such as "the giraffe closest to the person" requires reasoning about spatial relationships among other objects in the visual context.
  4. Existing methods (LAVT/CARIS/DETRIS) enhance alignment through complex fusion modules, but applying the loss over all pixels introduces unreliable gradients.
  5. Under dense supervision, models tend to overfit to regions irrelevant to the referring expression.
  6. Common data augmentation strategies (flipping, color jittering) can disrupt the semantic consistency of referring expressions.

Method

Overall Architecture: Two-stage training with shared parameters — Stage 1 computes the alignment map and generates masks via a forward pass; Stage 2 trains normally on the masked images.
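The two-stage step can be sketched as follows. This is a minimal illustration, not the paper's code: `forward_features`, `compute_keep_mask`, and `supervised_step` are hypothetical placeholders for the shared-parameter forward pass, the mask computation, and the ordinary training step.

```python
import numpy as np

def aml_train_step(forward_features, compute_keep_mask, supervised_step,
                   image, text, target):
    """One AML training step (sketch). Both stages share the same model
    parameters; only Stage 2 updates them."""
    # Stage 1: forward pass only, to obtain features for the alignment map
    feats = forward_features(image, text)
    keep = compute_keep_mask(feats)          # boolean (H, W); True = keep pixel
    # Zero out weakly aligned blocks of the input image (C, H, W)
    masked_image = image * keep[None, :, :]
    # Stage 2: ordinary supervised training on the masked image
    return supervised_step(masked_image, text, target)
```

At inference both the Stage 1 pass and the masking are skipped, which is why the method adds zero deployment overhead.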

PatchMax Matching Evaluation (PMME):
  • \(\ell_2\)-normalizes visual features \(V\) and text features \(T\) separately
  • Projects both into a shared \(D_a\)-dimensional space via random Gaussian matrices \(W_i, W_t\) (distance-preserving by the Johnson-Lindenstrauss lemma)
  • Computes \(S_{norm} = \text{SoftMax}(V'T'^{\top})\) and assigns each patch the similarity score of its best-matching token (row-wise maximum)
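A minimal NumPy sketch of the PMME scoring, under my own assumptions about shapes and the softmax axis (over text tokens); the paper's exact formulation may differ:

```python
import numpy as np

def patchmax_scores(V, T, D_a=2048, seed=0):
    """Per-patch PatchMax alignment scores (sketch).

    V: (N_patches, D_v) visual features; T: (N_tokens, D_t) text features.
    Returns an (N_patches,) array of scores in (0, 1]."""
    rng = np.random.default_rng(seed)
    # l2-normalize each modality
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    # Random Gaussian projections into a shared D_a-dim space; by the
    # Johnson-Lindenstrauss lemma, inner products are roughly preserved
    W_i = rng.standard_normal((V.shape[1], D_a)) / np.sqrt(D_a)
    W_t = rng.standard_normal((T.shape[1], D_a)) / np.sqrt(D_a)
    S = (V @ W_i) @ (T @ W_t).T              # (N_patches, N_tokens)
    # Softmax over text tokens (numerically stable), then PatchMax:
    # each patch keeps the score of its best-matching token
    S = np.exp(S - S.max(axis=1, keepdims=True))
    S_norm = S / S.sum(axis=1, keepdims=True)
    return S_norm.max(axis=1)
```

Because the projection matrices are fixed random draws, this scoring adds no trainable parameters.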

Alignment-Aware Filtering Mask (AFM):
  • Bilinearly upsamples patch-level similarity scores to pixel resolution
  • Pixels below threshold \(\tau\) are marked as weakly aligned; a random fraction \(1-\rho\) of them is retained to prevent over-filtering
  • Masks are aggregated at the block level (any weakly aligned pixel masks the entire block), and the corresponding input-image regions are zeroed out
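The filtering-and-block-masking logic might look like the following sketch. Nearest-neighbor upsampling stands in for the paper's bilinear interpolation, and the grid and block sizes are illustrative:

```python
import numpy as np

def alignment_filter_mask(scores, grid, img_hw, tau=0.4, rho=0.25,
                          block=32, seed=0):
    """Alignment-aware Filtering Mask (sketch).

    scores: (grid*grid,) per-patch PatchMax scores.
    Returns a boolean (H, W) keep-mask; False marks regions to zero out."""
    rng = np.random.default_rng(seed)
    H, W = img_hw
    # Upsample patch scores to pixel resolution (nearest-neighbor here;
    # the paper uses bilinear interpolation)
    pix = np.kron(scores.reshape(grid, grid),
                  np.ones((H // grid, W // grid)))
    # Pixels below tau are weakly aligned; a random (1 - rho) fraction of
    # them is retained (left unmasked) to prevent over-filtering
    weak = (pix < tau) & (rng.random((H, W)) < rho)
    keep = np.ones((H, W), dtype=bool)
    # Block-level aggregation: any weak pixel masks the whole block
    for by in range(0, H, block):
        for bx in range(0, W, block):
            if weak[by:by + block, bx:bx + block].any():
                keep[by:by + block, bx:bx + block] = False
    return keep
```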

Key Hyperparameters: \(\tau=0.4\), \(\rho=0.25\), block size \(32\times32\), \(D_a=2048\)

Loss: Standard cross-entropy segmentation loss \(\mathcal{L}_{seg}\), with no additional loss terms

Key Experimental Results

Main Results (mIoU)

| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LAVT | 74.46 | 76.89 | 70.94 | 65.81 | 70.97 | 59.23 | 63.34 | 63.62 | 68.0 |
| CGFormer | 76.93 | 78.70 | 73.32 | 68.56 | 73.76 | 61.72 | 67.57 | 67.83 | 71.1 |
| CARIS* | 76.77 | 79.03 | 74.56 | 69.33 | 74.51 | 62.69 | 68.87 | 68.51 | 71.8 |
| MagNet | 77.43 | 79.43 | 74.11 | 70.10 | 74.50 | 63.59 | 68.53 | 69.15 | 72.1 |
| AMLRIS | 77.89 | 79.53 | 74.99 | 71.33 | 75.61 | 64.61 | 69.24 | 69.73 | 72.9 |

Ablation Study

| Configuration | RefCOCO val mIoU | Notes |
| --- | --- | --- |
| CARIS baseline | 76.77 | No masking |
| + Random Mask | 76.92 | Marginal improvement |
| + PMME + AFM (full AML) | 77.89 | Alignment-aware masking is effective |
| AML on DETRIS | 75.64 → 76.12 | Consistent gain across architectures |
| AML on ReLA | +0.5–1.0 | Also effective |

Cross-Dataset Robustness

Model trained only on RefCOCO+; mIoU on RefCOCO+ val under perturbation.

| Perturbation | CARIS baseline | AMLRIS |
| --- | --- | --- |
| Standard eval | 69.33 | 71.33 |
| Occlusion | 65.1 | 68.4 |
| Noise | 64.8 | 67.9 |
| Blur | 66.2 | 69.1 |
| Color shift | 67.5 | 70.2 |

Key Findings

  • SOTA on all 8 splits, with average mIoU of 72.9 (+0.8 vs. MagNet)
  • oIoU metrics are also comprehensively best, reaching 67.37 on RefCOCO+ val (+1.83 vs. CARIS)
  • Random masking is nearly ineffective (+0.15), confirming that alignment-aware mask selection is the critical factor
  • Advantages are more pronounced under perturbation scenarios such as occlusion and noise (+3.1–3.3), indicating that the model learns more robust alignment features
  • Overhead is minimal: only 4.9% additional memory and 17.2% additional training time; zero inference overhead (masking stage is skipped at inference)
  • Can be seamlessly integrated into various RIS frameworks including DETRIS, CARIS, and ReLA

Highlights & Insights

  • Plug-and-play training strategy: No architectural modifications and no inference cost — a purely training-stage improvement with zero deployment overhead.
  • Theoretical guarantee: The Johnson-Lindenstrauss lemma is formally applied to prove that random projection preserves cross-modal inner products, providing a mathematical foundation for the alignment metric.
  • Counter-intuitive effectiveness: The model never sees complete images during training (some regions are always masked), yet performs better on complete images at inference — demonstrating that filtering weakly aligned regions genuinely eliminates misleading gradients.
  • PatchMax matching strategy: Assigning each patch the similarity score of its best-matching token better captures local alignment quality compared to average matching.
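For reference, the inner-product form of the Johnson-Lindenstrauss guarantee invoked above can be stated as follows (a standard textbook statement, not quoted from the paper):

```latex
\text{For unit } u, v \in \mathbb{R}^{d} \text{ and random } W \in \mathbb{R}^{k \times d},\;
W_{ij} \sim \mathcal{N}(0, 1/k):\quad
k = O\!\left(\varepsilon^{-2}\log(1/\delta)\right)
\;\Longrightarrow\;
\Pr\!\left[\,\bigl|\langle Wu, Wv\rangle - \langle u, v\rangle\bigr| \le \varepsilon\,\right] \ge 1 - \delta .
```

Since PMME normalizes both modalities before projecting, this unit-vector form applies directly to its patch-token similarities.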

Limitations & Future Work

  • The threshold \(\tau=0.4\) and dropout ratio \(\rho=0.25\) require manual tuning, and different datasets may necessitate different configurations.
  • The random-projection-based alignment metric relies on initial feature similarity and may miss deep semantic alignment as the feature space evolves during later training stages.
  • The two-stage forward pass introduces a 17.2% increase in training time, which may become a bottleneck at large data scales.
  • Evaluation is limited to the RefCOCO series; generalization to open-vocabulary, large-scale, or more complex scenarios remains unverified.
  • Block-level masking (32×32) may inadvertently cover target regions in small-object scenarios.
Comparison with Related Work

  • vs. CARIS/LAVT/DETRIS: These serve as baselines and comparison methods; they improve alignment via fusion architectures but all apply full-pixel losses, whereas AML innovates from the perspective of the optimization signal.
  • vs. MaskRIS/NeMo/MagNet: Data augmentation approaches for RIS improvement that still apply loss over all pixels; AML directly suppresses low-quality gradients.
  • vs. CRIS: A CLIP-based pixel-level adaptation method that performs alignment in the pretrained feature space; AML is applicable to arbitrary backbones.

Rating

  • Novelty: ⭐⭐⭐⭐ The alignment-aware masking idea is concise and novel; PatchMax + JL projection is theoretically grounded.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Full-split SOTA + robustness evaluation + cross-architecture validation + complete ablations.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear; algorithmic pseudocode is complete.
  • Value: ⭐⭐⭐⭐ A general training strategy that can be plug-and-play integrated into existing RIS methods.