NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval

Conference: CVPR 2025
arXiv: 2503.10526
Code: https://github.com/zzezze/NeighborRetr
Area: Information Retrieval
Keywords: Cross-Modal Retrieval, Hubness Problem, Centrality Weighting, Neighborhood Adjustment, Uniform Regularization

TL;DR

This paper proposes NeighborRetr, which addresses the hubness problem in cross-modal retrieval, where a few samples dominate the nearest-neighbor lists. It combines three mechanisms: a centrality-weighted loss (reducing the training weight of hub samples), a neighborhood adjustment loss (distinguishing good hubs from bad hubs), and uniform regularization (giving every sample a fair chance of being retrieved). It achieves 49.5% R@1 on MSR-VTT text-to-video retrieval, +0.9 over the previous SOTA.

Background & Motivation

Background: Cross-modal retrieval (e.g., text-to-video, text-to-image) maps data from different modalities into a shared embedding space. However, high-dimensional embedding spaces suffer from the hubness problem: a small number of samples (hubs) become nearest neighbors for many queries, while the majority of samples (anti-hubs) are rarely retrieved.
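To make the hub/anti-hub distinction concrete, the following minimal sketch counts how often each gallery item lands in a query's top-k list (its k-occurrence). The function name `k_occurrence` and the toy similarity matrix are illustrative assumptions, not taken from the paper; this is a common way to measure hubness rather than the paper's own centrality definition.

```python
import numpy as np

def k_occurrence(sim, k):
    """Count how often each gallery item appears in a query's top-k list.

    sim: (num_queries, num_items) similarity matrix, with k <= num_items.
    Returns an integer array of length num_items.
    """
    # Indices of the k most similar items per query (order inside the top-k is irrelevant).
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=sim.shape[1])

# Toy data (hypothetical): item 0 is close to every query, item 3 to none.
sim = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.9, 0.2, 0.7, 0.0],
    [0.9, 0.3, 0.6, 0.1],
])
counts = k_occurrence(sim, k=2)
print(counts)  # item 0 behaves like a hub, item 3 like an anti-hub
```

A heavily skewed count vector (a few large entries, many zeros) is the signature of hubness described above.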

Limitations of Prior Work: (1) Hub samples contain a mix of "good hubs" (genuinely semantically relevant) and "bad hubs" (located in specific spatial positions but irrelevant), meaning one cannot simply suppress all hubs. (2) The existence of anti-hubs causes a large number of relevant samples to never be retrieved. (3) Existing contrastive learning approaches ignore the disparity between hubs and anti-hubs.

Key Challenge: Simply suppressing hubness erroneously penalizes good hubs (genuinely relevant high-frequency samples), whereas failing to suppress it allows bad hubs to dominate the retrieval results.

Key Insight: Use a memory bank to estimate each sample's centrality (retrieval frequency) online, then handle good hubs, bad hubs, and anti-hubs through separate mechanisms.

Core Idea: Centrality weighting + Good/bad hub distinction + Anti-hub uniform regularization = Solving cross-modal hubness.

Method

Key Designs

  1. Centrality Weighted Loss: \(w(x_i) = \exp(C(x_i)/\kappa)\), where high-centrality (hub) samples are assigned lower weights in the contrastive loss, reducing their dominant effect on learning.

  2. Neighborhood Adjustment Loss: Uses "de-centralized similarity" to distinguish good hubs from bad hubs. Good hubs keep a high de-centralized similarity (they remain relevant after centrality is subtracted), while bad hubs score low.

  3. Uniform Regularization: \(\mathcal{L}_{Opt}\) forces the retrieval probability distribution to tend toward uniformity, ensuring anti-hubs also have opportunities to be retrieved.
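The three designs above can be sketched roughly as follows. Everything here is a hedged illustration under stated assumptions: the function names are hypothetical, the subtraction form of de-centralized similarity is inferred from the phrase "after subtracting centrality", and the KL-to-uniform penalty is a stand-in for \(\mathcal{L}_{Opt}\), whose exact formulation is not given here. Note also that \(\exp(C(x_i)/\kappa)\) is monotone in \(C(x_i)\), so whether hubs receive smaller weights depends on the sign of \(\kappa\) or on normalization details not stated here; the example uses a negative \(\kappa\) purely to reproduce the described down-weighting.

```python
import numpy as np

def centrality_weights(centrality, kappa):
    """w(x_i) = exp(C(x_i) / kappa), applied elementwise."""
    return np.exp(np.asarray(centrality, dtype=float) / kappa)

def decentralized_similarity(sim, item_centrality):
    """Similarity left once each item's centrality is subtracted (assumed form)."""
    sim = np.asarray(sim, dtype=float)
    return sim - np.asarray(item_centrality, dtype=float)[None, :]

def uniformity_gap(sim, temperature=0.05):
    """KL(average retrieval distribution || uniform); a stand-in, not the paper's L_Opt."""
    logits = np.asarray(sim, dtype=float) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # per-query softmax over items
    marginal = probs.mean(axis=0)                 # how much mass each item receives
    uniform = 1.0 / logits.shape[1]
    return float(np.sum(marginal * np.log(marginal / uniform + 1e-12)))

# Hypothetical values: sample 0 is a hub (centrality 0.8), sample 1 is not (0.2).
weights = centrality_weights([0.8, 0.2], kappa=-1.0)

# Item 0 is similar to every query; little relevance remains after subtraction.
dec = decentralized_similarity([[0.9, 0.5], [0.8, 0.1]], [0.85, 0.3])

gap_balanced = uniformity_gap(np.eye(3))                      # each query prefers a different item
gap_hubby = uniformity_gap(np.array([[1.0, 0.0, 0.0]] * 3))   # every query prefers item 0
```

In this toy setup the penalty is close to zero when retrieval mass is spread evenly and grows when a single item absorbs most of it, which is the behavior the uniform regularization is described as encouraging.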

Loss & Training

The total loss is \(\mathcal{L} = \mathcal{L}_{Wti} + \mathcal{L}_{Nbi} + \mathcal{L}_{Opt}\), used together with a fine-grained WTI module. A memory bank provides the online centrality estimates.
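As a rough illustration of online centrality estimation with a memory bank, the sketch below keeps a fixed-size FIFO buffer of embeddings and scores a sample by its mean cosine similarity to the stored entries. The class `CentralityBank`, the FIFO policy, and the mean-similarity definition of centrality are all assumptions made for this example; the paper's actual estimator may differ.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

class CentralityBank:
    """Fixed-size FIFO bank of embeddings (hypothetical helper)."""

    def __init__(self, dim, size):
        self.bank = np.zeros((size, dim), dtype=np.float32)
        self.size = size
        self.count = 0  # number of valid rows
        self.ptr = 0    # next row to overwrite

    def update(self, embeddings):
        """Insert normalized embeddings, overwriting the oldest entries when full."""
        for e in l2_normalize(embeddings):
            self.bank[self.ptr] = e
            self.ptr = (self.ptr + 1) % self.size
            self.count = min(self.count + 1, self.size)

    def centrality(self, embeddings):
        """Mean cosine similarity of each embedding to the stored entries."""
        if self.count == 0:
            return np.zeros(len(embeddings), dtype=np.float32)
        valid = self.bank[: self.count]
        return (l2_normalize(embeddings) @ valid.T).mean(axis=1)

# Toy usage: two stored queries point along x, one along y.
bank = CentralityBank(dim=2, size=4)
bank.update([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
scores = bank.centrality([[1.0, 0.0], [0.0, 3.0]])
```

Because the bank is refreshed batch by batch, the estimate tracks the current embedding space without a full pass over the dataset; the bank size trades estimate quality against memory and compute.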

Key Experimental Results

| Benchmark | R@1 | Rsum | Description |
|---|---|---|---|
| MSR-VTT T→V | 49.5% | 207.7 | +0.9 vs HBI |
| MSR-VTT V→T | 48.7% | 207.5 | +1.9 |
| MSVD T→V | SOTA | - | Consistently optimal across multiple benchmarks |
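For readers unfamiliar with the metrics in the table, here is a minimal sketch of how R@k is typically computed from a query-by-item similarity matrix. It assumes the ground-truth item for query i sits at index i, and that Rsum is the sum of R@1, R@5, and R@10, which is a common convention but is not stated in this summary. The function names and toy matrix are illustrative.

```python
import numpy as np

def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth item (the diagonal) ranks in the top k."""
    sim = np.asarray(sim, dtype=float)
    gt = np.diag(sim)[:, None]
    # Rank = number of items scoring strictly higher than the ground truth.
    ranks = (sim > gt).sum(axis=1)
    return 100.0 * float((ranks < k).mean())

def rsum(sim, ks=(1, 5, 10)):
    """Sum of recall values over several cutoffs (assumed definition of Rsum)."""
    return sum(recall_at_k(sim, k) for k in ks)

# Toy matrix (hypothetical): query 1 ranks a wrong item first.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.2, 0.7],
])
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
total = rsum(sim)
```

In the toy matrix, item 0 outranks the correct item for query 1, which is exactly the kind of error a bad hub introduces.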

Ablation Study

  • Bad hubs are reduced, good hubs are enhanced, and anti-hubs are minimized; the three effects work in synergy.
  • Decoupling intra-modal and cross-modal weighting improves stability.
  • Uniform regularization yields the most significant improvement for low-frequency samples.

Key Findings

  • The hubness problem is systemic in cross-modal retrieval; without addressing it, there exists a performance ceiling of 3-5%.
  • Distinguishing between good and bad hubs is a key innovation, performing 1-2% better than simply suppressing all hubs.

Highlights & Insights

  • First systematic solution to cross-modal hubness, spanning theory (centrality measurement) and practice (triple loss).
  • Distinction between good and bad hubs: not all high-frequency samples are bad, and semantically relevant hubs are valuable.

Limitations & Future Work

  • The hyperparameter \(\kappa\) needs to be tuned on a dataset-by-dataset basis.
  • Memory bank size affects efficiency.
  • Assumes single positive sample queries.

Rating

  • Novelty: ⭐⭐⭐⭐ The systematic solution to cross-modal hubness is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 7 benchmarks.
  • Writing Quality: ⭐⭐⭐⭐ Clear.
  • Value: ⭐⭐⭐⭐ Provides a solution to an overlooked problem in cross-modal retrieval.