# Noise-Aware Few-Shot Learning through Bi-directional Multi-View Prompt Alignment
Conference: CVPR 2026 · arXiv: 2603.11617 · Code: none (no link provided in the paper) · Area: Multimodal VLM · Keywords: noisy labels, prompt learning, optimal transport, CLIP, few-shot learning
## TL;DR
This paper proposes NA-MVP, a framework that pairs a bi-directional (clean + noise-aware) multi-view prompt design with Unbalanced Optimal Transport (UOT) for fine-grained patch-to-prompt alignment, and applies classical OT for selective label correction on identified noisy samples. The method consistently surpasses state-of-the-art baselines in noisy few-shot learning.
## Background & Motivation
- Vision-language models such as CLIP can be efficiently adapted to downstream tasks via prompt learning; under label noise, however, even a handful of incorrect labels in a few-shot training set can disproportionately affect gradient updates.
- Existing noisy prompt learning methods suffer from three major limitations:
    - Insufficient prompt expressiveness: most methods rely on only 1–2 prompts (e.g., positive/negative pairs), and single-view alignment fails to capture fine-grained semantic cues.
    - Rigid explicit negative labels: assigning a hard negative label to each image produces a fixed counter-class signal that is often inaccurate or uninformative under noise.
    - Coarse denoising: reliance on fixed confidence thresholds or non-selective pseudo-labels leads to error propagation.
- Core insight: Robust noisy few-shot learning requires a shift from global matching toward region-aware fine-grained alignment that adaptively distinguishes clean from noisy semantics.
## Core Problem
How, within VLM prompt learning under severe label noise and few-shot settings, can one (1) adaptively distinguish clean from noisy semantic signals; (2) achieve fine-grained image-region-to-prompt alignment to suppress noisy regions; and (3) selectively correct erroneous labels without over-correction.
## Method

### Overall Architecture
NA-MVP comprises two core modules that operate jointly:

1. Noise-Aware Alignment (the blue pathway in the paper's overview figure): for each class, multiple clean and noise-aware prompts are constructed and aligned with local image patches via UOT for fine-grained matching, yielding clean/noisy probabilities.
2. Selective Label Correction (the green pathway): based on bi-directional alignment signals, mislabeled samples are adaptively identified and their labels are corrected using classical OT.
The two modules iteratively update the training set and optimize the prompts, producing a denoised dataset for robust prediction.
### Key Designs
- Bi-directional Multi-View Prompt Construction:
    - For each class \(k\), two sets of learnable prompts are constructed: clean-oriented \(\{\text{Prompt}_{m,k}^{c}\}_{m=1}^{N}\) and noise-aware \(\{\text{Prompt}_{m,k}^{n}\}_{m=1}^{N}\).
    - Each prompt consists of \(M\) learnable context tokens concatenated with a class-specific token.
    - Clean prompts capture class-relevant semantics, while noise-aware prompts serve as adaptive filters to suppress misleading signals.
    - Non-target classes function as implicit negative samples, avoiding the rigidity of explicit negative labels.
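Since no code is released, the following is a minimal sketch of how such a bi-directional multi-view prompt bank could be parameterized, in the style of CoOp-like context optimization; all names, shapes, and the initialization scale are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BiDirectionalPromptBank(nn.Module):
    """Hypothetical N clean + N noise-aware learnable prompts per class.

    Each prompt = M learnable context tokens followed by a frozen class-name
    token embedding. Shapes and names are assumptions for illustration.
    """

    def __init__(self, num_classes: int, n_views: int = 4, m_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Clean-oriented and noise-aware context tokens: (K, N, M, d) each.
        self.ctx_clean = nn.Parameter(0.02 * torch.randn(num_classes, n_views, m_ctx, dim))
        self.ctx_noise = nn.Parameter(0.02 * torch.randn(num_classes, n_views, m_ctx, dim))

    def forward(self, class_token_embed: torch.Tensor):
        # class_token_embed: (K, d) frozen embedding of each class name.
        K, N, M, d = self.ctx_clean.shape
        cls = class_token_embed[:, None, None, :].expand(K, N, 1, d)
        clean = torch.cat([self.ctx_clean, cls], dim=2)  # (K, N, M+1, d)
        noisy = torch.cat([self.ctx_noise, cls], dim=2)  # (K, N, M+1, d)
        return clean, noisy
```

Each token sequence would then pass through the frozen CLIP text encoder to yield the per-class prompt features \(G_k^c, G_k^n \in \mathbb{R}^{N \times d}\) used for alignment below.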
- Fine-Grained Noise-Aware Alignment via UOT:
    - Local image features \(F_i \in \mathbb{R}^{L \times d}\) and prompt features \(G_k \in \mathbb{R}^{N \times d}\) are treated as discrete distributions.
    - A cost matrix is computed from cosine similarity over \(\ell_2\)-normalized features: \(C_k = 1 - F_i G_k^\top\).
    - UOT relaxes the strict mass conservation constraint (\(T\mathbf{1}_N \leq \mu\) rather than equality), permitting partial matching.
    - Solved efficiently via Dykstra-based scaling iterations (Sinkhorn-style updates under entropic regularization).
    - Core advantage: not all features are forced to align; noisy or irrelevant patches can be safely "discarded."
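The paper reports a Dykstra-based routine for the inequality-constrained problem; the sketch below instead uses the more common KL-penalized entropic UOT with Sinkhorn-style scaling updates (Chizat et al.), which exhibits the same partial-matching behavior. The regularization strengths and the final similarity aggregation are assumptions for illustration.

```python
import torch

def uot_sinkhorn(cost: torch.Tensor, eps: float = 0.05, rho: float = 0.5,
                 n_iter: int = 50) -> torch.Tensor:
    """Entropic unbalanced OT between L patches and N prompts.

    Approximately solves
        min <C,T> - eps*H(T) + rho*KL(T 1 || mu) + rho*KL(T^T 1 || nu)
    with uniform marginals. The soft KL penalties let mass on noisy patches
    stay unmatched instead of being forced onto some prompt.
    """
    L, N = cost.shape
    mu = torch.full((L,), 1.0 / L)
    nu = torch.full((N,), 1.0 / N)
    K = torch.exp(-cost / eps)           # Gibbs kernel
    u, v = torch.ones(L), torch.ones(N)
    power = rho / (rho + eps)            # exponent < 1 softens the marginals
    for _ in range(n_iter):
        u = (mu / (K @ v + 1e-12)) ** power
        v = (nu / (K.T @ u + 1e-12)) ** power
    return u[:, None] * K * v[None, :]   # transport plan T, shape (L, N)

# Usage with l2-normalized patch features F_i (L, d) and prompt features G_k (N, d):
#   C_k = 1.0 - F_i @ G_k.T            # cosine cost
#   T = uot_sinkhorn(C_k)
#   s_ik = (T * (1.0 - C_k)).sum()     # one plausible similarity aggregation
```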
- Selective Label Correction:
    - Noise identification: UOT distances between each sample and the clean/noise-aware prompts yield similarities \(s_{i,k}^c\) and \(s_{i,k}^n\). An adaptive threshold \(\phi_{i,k} = \frac{\exp(s_{i,k}^n/\tau)}{\exp(s_{i,k}^c/\tau) + \exp(s_{i,k}^n/\tau)}\) determines whether a sample is noisy.
    - Label correction: for samples identified as noisy, classical OT (with strict mass conservation) computes an optimal transport plan \(T^*\) between global image features and class prompt features; \(\arg\max_j T^*_{ij}\) is assigned as the pseudo-label.
    - Only samples satisfying \(p_{i,k}^c < \phi_{i,k}\) are corrected; reliable samples are left unchanged to avoid over-correction.
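Putting the two signals together, a sketch of the selection-and-correction logic is shown below; the balanced Sinkhorn stands in for the paper's classical OT step, and all variable names and default values are illustrative assumptions.

```python
import torch

def sinkhorn_balanced(cost: torch.Tensor, eps: float = 0.05, n_iter: int = 100) -> torch.Tensor:
    """Classical entropic OT: strict mass conservation on both marginals."""
    B, K = cost.shape
    a = torch.full((B,), 1.0 / B)
    b = torch.full((K,), 1.0 / K)
    Kmat = torch.exp(-cost / eps)
    u, v = torch.ones(B), torch.ones(K)
    for _ in range(n_iter):
        u = a / (Kmat @ v + 1e-12)
        v = b / (Kmat.T @ u + 1e-12)
    return u[:, None] * Kmat * v[None, :]

def selective_correction(s_clean, s_noise, p_clean, global_cost, labels, tau=1.0):
    """s_clean/s_noise: (B,) UOT similarities to the labeled class's clean /
    noise-aware prompts; p_clean: (B,) clean-prompt probability of the given
    label; global_cost: (B, K) cost between global image features and class
    prompt features; labels: (B,) long tensor of current labels."""
    # Adaptive per-sample threshold phi_{i,k} from the two similarity channels.
    phi = torch.exp(s_noise / tau) / (torch.exp(s_clean / tau) + torch.exp(s_noise / tau))
    is_noisy = p_clean < phi                      # only these samples are touched
    T = sinkhorn_balanced(global_cost)            # strict OT over the batch
    pseudo = T.argmax(dim=1)                      # argmax_j T*_{ij} as pseudo-label
    return torch.where(is_noisy, pseudo, labels)  # reliable samples stay unchanged
```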
### Loss & Training
- Early stage (first \(T_{sup}\) epochs): \(\mathcal{L}_{sup} = \mathcal{L}_{gce} + \lambda_i \cdot \mathcal{L}_{itbp}\)
- GCE (Generalized Cross-Entropy): a noise-robust loss function.
- ITBP Loss: an auxiliary bi-directional contrastive loss encouraging image features to align with clean prompts and diverge from noise-aware prompts.
- Later stage: Label correction is activated and training continues with GCE on the denoised dataset.
- Inference: \(p(y=k \mid x_i) = (1 - p_{i,k}^n) \cdot p_{i,k}^c\), jointly leveraging the clean and noise-aware probabilities.
- SGD optimizer (lr=0.002, momentum=0.9, weight_decay=5e-4), 50 epochs, 16 shared context tokens, ResNet-50 image encoder.
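For concreteness, a minimal sketch of the GCE loss and the combined inference rule; the GCE form follows Zhang & Sabuncu (2018), while the value of q and the softmax-based probabilities are standard choices assumed here, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits: torch.Tensor, targets: torch.Tensor, q: float = 0.7) -> torch.Tensor:
    """Generalized cross-entropy: (1 - p_y^q) / q.

    Interpolates between CE (q -> 0) and MAE (q = 1), shrinking the gradient
    on low-confidence (likely mislabeled) samples for noise robustness.
    """
    p_y = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-12) ** q) / q).mean()

def predict(logits_clean: torch.Tensor, logits_noise: torch.Tensor) -> torch.Tensor:
    """Inference rule p(y=k|x) = (1 - p^n_k) * p^c_k over both prompt views."""
    p_c = F.softmax(logits_clean, dim=1)
    p_n = F.softmax(logits_noise, dim=1)
    return ((1.0 - p_n) * p_c).argmax(dim=1)
```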
## Key Experimental Results

### Synthetic Noise (16-shot, selected results; accuracy %)
| Dataset (noise setting) | Metric | NA-MVP | NLPrompt | Gain |
|---|---|---|---|---|
| OxfordPets (75% Sym) | Acc | 86.23 | 70.77 | +15.46 |
| OxfordPets (50% Sym) | Acc | 88.13 | 83.17 | +4.96 |
| DTD (75% Sym) | Acc | 48.63 | 39.80 | +8.83 |
| Caltech101 (75% Sym) | Acc | 89.37 | 86.70 | +2.67 |
### Real-World Noise (Food101N; accuracy %)
| Method | 4-shot | 8-shot | 16-shot | 32-shot |
|---|---|---|---|---|
| NLPrompt | 70.57 | 73.93 | 76.46 | 76.87 |
| NA-MVP | 76.10 | 76.27 | 76.90 | 77.03 |
The advantage is especially pronounced under extreme few-shot conditions (4-shot: +5.53), where the impact of noisy labels is most severe.
## Ablation Study
- Bi-directional prompts: single prompt (48.08%) → + explicit negative labels (48.82%) → + implicit bi-directional (49.63%) → + multi-view (51.83%), showing consistent incremental gains.
- UOT vs. OT vs. KL: UOT (54.18%) > OT (53.37%) > KL (52.60%); the relaxed constraints of UOT yield a clear advantage under noise.
- Selective correction: Global OT correction incorrectly modifies clean labels at low noise (25% noise: 59.60%), whereas selective correction is more stable (63.13%).
- Number of prompts N: \(N=4\) achieves the best balance (\(N=1\) is insufficient; \(N=8\) is redundant).
- vs. DEFT: NA-MVP consistently outperforms DEFT at all noise levels (75% noise: 86.23 vs. 75.87).
## Highlights & Insights
- A novel conceptual reframing: The noise robustness problem is reformulated as "region-aware clean-noisy semantic decomposition" rather than simple global matching.
- Judicious use of UOT: Relaxing the mass conservation constraint naturally suits noisy settings—noisy patches need not be forcibly aligned and can be safely discarded.
- Complementary design of two OT variants: UOT handles local fine-grained alignment (noise identification), while classical OT handles global label correction (strict mass conservation ensures assignment validity).
- Improvements are most pronounced at high noise rates (75%), demonstrating that the framework's noise robustness genuinely surpasses prior work.
- The inference formula \((1-p^n) \cdot p^c\) is concise and elegant.
- Introducing optimal transport theory into prompt-based noisy label learning represents a novel cross-domain transfer.
- Bi-directional multi-view alignment jointly addresses both prompt noise and label noise.
- Strong accuracy is maintained even under 60% symmetric noise, demonstrating practical utility.
## Limitations & Future Work
- Validated only on classification tasks; not extended to more complex visual tasks such as detection or segmentation.
- Relies on CLIP's ResNet-50 backbone; performance under stronger backbones (e.g., ViT-L/14) has not been verified.
- Sinkhorn iterations in UOT introduce additional computational overhead; the paper does not discuss computational cost in detail.
- The framework assumes standard symmetric/asymmetric noise patterns; real-world noise distributions may be more complex (e.g., instance-dependent noise).
- The optimal number of multi-view prompts (\(N=4\)) may depend on dataset-specific characteristics.
- UOT computational cost grows with the number of prompts and sample size.
## Related Work & Insights
- vs. CoOp/CoCoOp: Standard prompt learning methods that do not handle noisy labels; NA-MVP achieves substantially higher performance in the presence of noise.
- vs. NLPrompt: Employs OT-Filter for noise identification and global OT for relabeling—a coarse-grained global approach. NA-MVP achieves finer-grained noise handling via patch-level UOT and adaptive thresholding.
- vs. DEFT: Uses a fixed threshold (0.5) for clean sample identification; NA-MVP's adaptive threshold \(\phi_{i,k}\) is more flexible.
- vs. PLOT: PLOT applies OT for multi-prompt alignment on clean data; NA-MVP extends this to a noise-aware bi-directional design.
- vs. CLIPN: CLIPN uses positive/negative prompt pairs for OOD detection; NA-MVP transfers this idea to noisy label learning with added multi-view prompts.
- vs. ProGrad: uses gradient filtering to suppress noise-induced updates, but this proves less effective than UOT's globally optimized transport alignment.
- The "relaxed matching" intuition of UOT under noise is transferable to other noisy settings (e.g., noisy annotations in object detection, imprecise labels in medical image segmentation).
- The finding that implicit negatives outperform explicit negative labels also has implications for contrastive learning in NLP.
- The bi-directional prompt design can be extended to "multi-dimensional prompts" (e.g., a three-way clean/noisy/ambiguous formulation).
## Relevance to My Research
- Possibly related: 20260316_adaptive_model_routing.md
- Possibly related: 20260316_cross_species_framework.md
- Possibly related: 20260316_concept_bottleneck_world_model.md
## Rating
- Novelty: ⭐⭐⭐⭐ Introducing UOT into prompt-based noisy label learning is novel; the bi-directional multi-view design is distinctive.
- Experimental Thoroughness: ⭐⭐⭐⭐ Five synthetic noise datasets plus one real-world noise dataset; ablation studies comprehensively cover components, alignment methods, prompt count, and thresholding strategies.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; Figure 1 effectively summarizes the three key limitations; the method description is systematic.
- Value: ⭐⭐⭐⭐ Provides an effective solution for the practically important setting of noisy few-shot learning; substantial gains under high noise rates carry real-world significance.