NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval

Conference: CVPR 2025
arXiv: 2503.10526
Code: https://github.com/zzezze/NeighborRetr
Area: Information Retrieval
Keywords: Cross-Modal Retrieval, Hubness Problem, Centrality Weighting, Neighborhood Adjustment, Uniform Regularization

TL;DR

This paper proposes NeighborRetr, which addresses the hubness problem in cross-modal retrieval, where a few samples dominate the nearest-neighbor lists. It combines three mechanisms: a centrality-weighted loss (reducing the training weight of hub samples), a neighborhood adjustment loss (distinguishing good hubs from bad hubs), and uniform regularization (ensuring each sample is retrieved fairly). It reaches 49.5% R@1 on MSR-VTT text-to-video retrieval, 0.9 points above the previous best (HBI).

Background & Motivation

Background

Background: Cross-modal retrieval (e.g., text-to-video, text-to-image) maps data from different modalities into a shared embedding space. However, high-dimensional embedding spaces suffer from the hubness problem: a small number of samples (hubs) become nearest neighbors for many queries, while the majority of samples (anti-hubs) are rarely retrieved.
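Hubness as described above can be made concrete by counting how often each gallery item lands in a query's top-k list (its k-occurrence). The following is a minimal sketch on random embeddings, not the paper's measurement protocol; the function name `k_occurrence`, the sizes, and the choice of cosine similarity are illustrative assumptions.

```python
import numpy as np

def k_occurrence(queries, gallery, k=10):
    """Count how often each gallery item appears in the top-k of any query."""
    # Cosine similarity between L2-normalized embeddings (an assumption).
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sim = q @ g.T                               # (num_queries, num_gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]      # top-k gallery indices per query
    return np.bincount(topk.ravel(), minlength=gallery.shape[0])

rng = np.random.default_rng(0)
queries = rng.normal(size=(500, 64))            # synthetic stand-ins for text embeddings
gallery = rng.normal(size=(500, 64))            # synthetic stand-ins for video embeddings
counts = k_occurrence(queries, gallery, k=10)

# Items with large counts behave like hubs; items with a count of 0 are anti-hubs.
print("max k-occurrence:", counts.max())
print("never retrieved :", int((counts == 0).sum()))
```

A strongly right-skewed distribution of `counts` is the usual symptom of hubness: a handful of items absorb many top-k slots while others receive none.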

Limitations of Prior Work: (1) Hub samples are a mix of "good hubs" (genuinely semantically relevant) and "bad hubs" (favorably positioned in the embedding space but semantically irrelevant), so one cannot simply suppress all hubs. (2) Anti-hubs cause a large number of relevant samples to never be retrieved. (3) Existing contrastive learning approaches ignore the disparity between hubs and anti-hubs.

Key Challenge: Simply suppressing hubness erroneously penalizes good hubs (genuinely relevant high-frequency samples), whereas failing to suppress it allows bad hubs to dominate the retrieval results.

Key Insight: Use a memory bank to estimate each sample's centrality (retrieval frequency) online, then handle good hubs, bad hubs, and anti-hubs with separate mechanisms.
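One way to picture the online estimate is a fixed-size FIFO buffer of recent query embeddings, with a candidate's centrality taken as the fraction of buffered queries that rank it in their top-k. This is a sketch under assumptions: the class `QueryBank`, its methods, and this particular definition of centrality are hypothetical and may differ from the paper's formulation.

```python
import numpy as np
from collections import deque

class QueryBank:
    """FIFO buffer of recent, L2-normalized query embeddings (hypothetical helper)."""

    def __init__(self, max_size=4096):
        self.buffer = deque(maxlen=max_size)

    def push(self, embeddings):
        normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.buffer.extend(normed)              # oldest rows are evicted automatically

    def centrality(self, candidates, k=1):
        """Fraction of buffered queries that rank each candidate in their top-k."""
        bank = np.stack(list(self.buffer))      # (bank_size, dim)
        c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
        sim = bank @ c.T                        # (bank_size, num_candidates)
        topk = np.argsort(-sim, axis=1)[:, :k]
        hits = np.bincount(topk.ravel(), minlength=c.shape[0])
        return hits / len(self.buffer)

rng = np.random.default_rng(0)
bank = QueryBank(max_size=8)
bank.push(rng.normal(size=(5, 16)))
bank.push(rng.normal(size=(5, 16)))             # buffer now holds the 8 most recent rows
scores = bank.centrality(rng.normal(size=(6, 16)), k=1)
```

Because the buffer is refreshed every step, the estimate tracks the current embedding space without a full pass over the dataset; the buffer size trades memory and compute against the stability of the estimate.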

Core Idea: Centrality weighting + Good/bad hub distinction + Anti-hub uniform regularization = Solving cross-modal hubness.

Method

Key Designs

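Centrality Weighted Loss: each sample's term in the contrastive loss is scaled by \(w(x_i) = \exp(C(x_i)/\kappa)\), where \(C(x_i)\) is the sample's centrality and \(\kappa\) is a temperature. The intent is to give high-centrality (hub) samples lower weights and so reduce their dominant effect on learning. As written, the weight decreases with centrality only when \(\kappa < 0\) (or when \(C\) is lower for hubs); for \(\kappa > 0\) it increases.
Neighborhood Adjustment Loss: separates good hubs from bad hubs using a "de-centralized similarity", i.e. the similarity that remains after subtracting centrality. Good hubs keep a high de-centralized similarity (they stay relevant once centrality is removed), while bad hubs score low.
Uniform Regularization: \(\mathcal{L}_{Opt}\) pushes the retrieval probability distribution toward uniformity, so that anti-hubs also get a chance to be retrieved.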
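The first two designs can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the authors' implementation: the InfoNCE form, the temperature `tau`, the normalization of weights to mean 1, and the column-wise subtraction used for the de-centralized similarity are all choices made here for clarity. The sign of `kappa` is left as a parameter because it determines whether hubs are down-weighted or up-weighted.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def centrality_weights(centrality, kappa):
    """w(x_i) = exp(C(x_i) / kappa), rescaled to mean 1 (rescaling is an assumption)."""
    w = np.exp(centrality / kappa)
    return w / w.mean()

def weighted_contrastive_loss(sim, centrality, kappa, tau=0.05):
    """InfoNCE over rows of sim (matched pairs on the diagonal), reweighted per sample."""
    logp = log_softmax(sim / tau, axis=1)
    per_sample = -np.diag(logp)                 # loss of each query's true match
    return float((centrality_weights(centrality, kappa) * per_sample).mean())

def decentralized_similarity(sim, centrality):
    """Subtract each candidate's centrality from its column of similarities."""
    return sim - centrality[None, :]

sim = np.eye(4)                                 # toy similarities, perfect matches
centrality = np.array([0.1, 0.2, 0.3, 0.4])     # toy centrality scores
w_down = centrality_weights(centrality, kappa=-0.5)   # decreasing in centrality
w_up = centrality_weights(centrality, kappa=0.5)      # increasing in centrality
loss = weighted_contrastive_loss(sim, centrality, kappa=-0.5)
adjusted = decentralized_similarity(sim, centrality)
```

In this toy case a candidate that is similar to a query mainly because it is central loses most of its score in `adjusted`, while a candidate that is specifically similar to the query keeps a high value.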

Loss & Training

The total objective is \(\mathcal{L} = \mathcal{L}_{Wti} + \mathcal{L}_{Nbi} + \mathcal{L}_{Opt}\), combined with a fine-grained WTI module. A memory bank provides the online centrality estimates.
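To give a feel for what the \(\mathcal{L}_{Opt}\) term encourages, the sketch below penalizes the KL divergence between the average retrieval probability each candidate receives and the uniform distribution. This is a simple stand-in chosen for illustration, not the paper's formulation of \(\mathcal{L}_{Opt}\); the function names and the temperature `tau` are assumptions.

```python
import numpy as np

def retrieval_marginal(sim, tau=0.05):
    """Average softmax retrieval probability each candidate receives across queries."""
    z = sim / tau
    z = z - z.max(axis=1, keepdims=True)        # stabilize the exponentials
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)           # each row is a distribution over candidates
    return p.mean(axis=0)                       # shape (num_candidates,), sums to 1

def uniformity_penalty(sim, tau=0.05, eps=1e-12):
    """KL(marginal || uniform): near zero when all candidates are retrieved equally often."""
    m = retrieval_marginal(sim, tau)
    n = m.shape[0]
    return float(np.sum(m * (np.log(m + eps) + np.log(n))))

balanced = np.eye(4)                            # every query prefers a different candidate
hubby = np.zeros((4, 4))
hubby[:, 0] = 1.0                               # every query prefers candidate 0 (a hub)

penalty_balanced = uniformity_penalty(balanced)
penalty_hubby = uniformity_penalty(hubby)
```

The penalty is close to zero for the balanced case and approaches \(\log n\) when a single candidate absorbs all retrieval mass, which is the behavior a uniformity regularizer is meant to discourage.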

Key Experimental Results

Benchmark	R@1	Rsum	Notes
MSR-VTT T→V	49.5%	207.7	+0.9 R@1 vs. HBI
MSR-VTT V→T	48.7%	207.5	+1.9
MSVD T→V	SOTA	—	State of the art; results are consistent across multiple benchmarks
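The metrics in the table can be computed from a query-by-candidate similarity matrix. The sketch below assumes that Rsum denotes R@1 + R@5 + R@10 (the usual convention, not stated in this summary) and that the correct candidate for query i is candidate i; the function name `recall_at_k` is illustrative.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity of query i to candidate j; ground truth is j == i."""
    order = np.argsort(-sim, axis=1)            # candidates sorted best-first per query
    # 0-based rank of the correct candidate for each query.
    ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1)
    recalls = {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}
    recalls["Rsum"] = sum(recalls[f"R@{k}"] for k in ks)
    return recalls

perfect = recall_at_k(np.eye(20))               # every query ranks its match first

# Each query ranks one wrong candidate above its true match.
second_best = np.eye(4) + 2.0 * np.roll(np.eye(4), 1, axis=1)
shifted = recall_at_k(second_best)
```

Because Rsum aggregates several cutoffs, it rewards methods that improve the whole ranking rather than only the top position.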

Ablation Study

The three components work in synergy: bad hubs are reduced, good hubs are strengthened, and anti-hubs become rarer.
Decoupling intra-modal and cross-modal weighting improves stability.
Uniform regularization yields the most significant improvement for low-frequency samples.

Key Findings

The hubness problem is systemic in cross-modal retrieval; left unaddressed, it imposes a performance ceiling of roughly 3-5%.
Distinguishing between good and bad hubs is a key innovation, yielding 1-2% better performance than suppressing all hubs indiscriminately.

Highlights & Insights

First systematic solution to cross-modal hubness, spanning theory (centrality measurement) and practice (triple loss).
Distinction between good and bad hubs: not all high-frequency samples are harmful, and semantically relevant hubs are valuable.

Limitations & Future Work

The hyperparameter \(\kappa\) needs to be tuned on a dataset-by-dataset basis.
Memory bank size affects efficiency.
Assumes each query has a single positive sample.

Rating

Novelty: ⭐⭐⭐⭐ The systematic solution to cross-modal hubness is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ 7 benchmarks.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Provides a solution to an overlooked problem in cross-modal retrieval.