
Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Conference: ICCV 2025
arXiv: 2507.23284
Code: github.com/mlvlab/BLiM
Area: Multimodal Learning / Video Retrieval
Keywords: text-video retrieval, multimodal large language models, bidirectional likelihood estimation, candidate prior bias, score calibration

TL;DR

This paper identifies the candidate prior bias problem in MLLM-based retrieval systems, where candidate likelihood estimation favors candidates with high prior probability over those that are semantically most relevant. It proposes BLiM (Bidirectional Likelihood Estimation) and CPN (Candidate Prior Normalization) to address this issue, achieving an average R@1 gain of 6.4 points across four text-video retrieval benchmarks.

Background & Motivation

Text-video retrieval aims to find the most relevant text (or video) candidate given a video (or text) query. Existing approaches have evolved along two lines:

  • Dual-encoder architectures (CLIP, BERT): encode queries and candidates independently into single embeddings and retrieve via similarity; computationally efficient but limited in token-level alignment.
  • MLLM-based retrieval: processes concatenated query-candidate pairs to enable deep token-level interaction; more effective for long and complex query-candidate pairs.

The authors identify a candidate prior bias in MLLM-based retrieval. By Bayesian decomposition, \(P(\mathbf{t}|\mathbf{v}) = \frac{P(\mathbf{v}|\mathbf{t})\,P(\mathbf{t})}{P(\mathbf{v})}\), the candidate likelihood \(P(\mathbf{t}|\mathbf{v})\) is jointly determined by the query likelihood \(P(\mathbf{v}|\mathbf{t})\) and the candidate prior \(P(\mathbf{t})\). Because MLLMs are autoregressive, they tend to assign high probability to long, repetitive texts (high prior), so retrieval favors high-frequency patterns over the semantically best-matching candidates. Their experiments confirm the effect: certain high-prior texts are retrieved as the top match by 37% of video queries (374 out of 1,003 videos).
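In log space the decomposition makes the bias explicit. The restatement below paraphrases the reversal condition from the paper's Proposition 1; the notation \(\mathbf{t}^{(i)}\) / \(\mathbf{t}^{(j)}\) for the matching and mismatched candidates is ours:

```latex
% P(v) is shared by all candidates t^{(n)}, so in log space the ranking score
% splits into a semantic term and a prior term:
\log P(\mathbf{t}^{(n)} \mid \mathbf{v})
  = \log P(\mathbf{v} \mid \mathbf{t}^{(n)}) + \log P(\mathbf{t}^{(n)}) - \log P(\mathbf{v})

% Ranking reversal (the gist of Proposition 1): a mismatched candidate t^{(j)}
% outranks the matching t^{(i)} whenever the prior gap exceeds the
% query-likelihood gap:
\log P(\mathbf{t}^{(j)}) - \log P(\mathbf{t}^{(i)})
  \;>\; \log P(\mathbf{v} \mid \mathbf{t}^{(i)}) - \log P(\mathbf{v} \mid \mathbf{t}^{(j)})
```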

Method

Overall Architecture

BLiM is built upon a pretrained video MLLM (VideoChat-Flash 7B), consisting of a UMT video encoder, a linear projection layer, and a Qwen2 LLM. At inference time, a two-stage retrieval pipeline is adopted: InternVideo2 1B first retrieves top-K candidates, which BLiM then re-ranks.

Key Designs

  1. Bidirectional Likelihood Estimation Training:

    • Video-to-text generation \(P(\mathbf{t}|\mathbf{v})\): Standard MLLM pretraining paradigm; autoregressively generates text conditioned on video features. \(\mathcal{L}_{t|v} = -\sum_{i=1}^{L_t} \log P(t_i | t_{<i}, \mathbf{v})\)
    • Text-to-video feature generation \(P(\mathbf{v}|\mathbf{t})\): given the text, autoregressively predicts the next video clip feature with a contrastive softmax, where \(\tilde{v}_{i-1}\) is the LLM's output feature at step \(i-1\) and the \(N\) in-batch videos supply the negatives. \(\mathcal{L}_{v|t} = -\sum_{i=1}^{L_v} \log \frac{\exp(\tilde{v}_{i-1}^\top v_i)}{\sum_{n=1}^{N} \exp(\tilde{v}_{i-1}^\top v_i^{(n)})}\)
    • Overall training objective: \(\mathcal{L}_{BLiM} = \mathcal{L}_{t|v} + \mathcal{L}_{v|t}\)
    • The two directions swap the input order of the two modalities and use different prompts; a minimal sketch of both objectives appears after this list.
  2. Candidate Prior Normalization (CPN):

    • A training-free score-calibration module that estimates the candidate prior probability by attention-masking the query modality.
    • When computing candidate likelihood, all query tokens are masked so that the model generates the candidate without conditioning on the query, yielding a prior estimate.
    • The candidate likelihood is then normalized by the estimated prior, removing the prior bias (see the CPN sketch after this list).
    • Generality of CPN: applicable beyond retrieval to enhance visual grounding in tasks such as VQA and captioning.
  3. Inference Pipeline:

    • Video-to-text retrieval: \(n^* = \arg\max_n P(\mathbf{t}^{(n)}|\mathbf{v}) + P(\mathbf{v}|\mathbf{t}^{(n)})\)
    • Text-to-video retrieval: \(n^* = \arg\max_n P(\mathbf{t}|\mathbf{v}^{(n)}) + P(\mathbf{v}^{(n)}|\mathbf{t})\)
    • Both directions are jointly considered: the candidate likelihood identifies the most probable candidate to be generated, while the query likelihood identifies the candidate most likely to have generated the query.
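A minimal PyTorch sketch of the two training objectives. The function names, tensor layouts, and the exact handling of in-batch negatives are our assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def loss_t_given_v(logits: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    """L_{t|v}: autoregressive cross-entropy over the caption tokens, with the
    video prepended to the context (the standard MLLM objective).
    logits: (L_t, vocab), already shifted so that row i predicts token i."""
    return F.cross_entropy(logits, text_ids)

def loss_v_given_t(pred: torch.Tensor, batch_feats: torch.Tensor, pos: int) -> torch.Tensor:
    """L_{v|t}: contrastive next-feature prediction. pred[i-1] is the LLM
    output tilde v_{i-1} (text plus clips < i already seen) and predicts clip
    feature v_i; the positive is the paired video's v_i, and the same-position
    features of the other N-1 in-batch videos act as negatives.
    pred: (L_v, d) holding tilde v_0 .. tilde v_{L_v - 1};
    batch_feats: (N, L_v, d) holding each video's v_1 .. v_{L_v};
    pos: batch index of the video paired with this text."""
    logits = torch.einsum("ld,nld->ln", pred, batch_feats)  # (L_v, N) similarities
    labels = torch.full((pred.size(0),), pos, dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Overall objective, with only the projection layer and LoRA adapters trainable:
#   L_BLiM = loss_t_given_v(...) + loss_v_given_t(...)
```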
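And a sketch of CPN plus the bidirectional score at inference, again under assumptions: the mask construction is one plausible reading of 'mask all query tokens', and `gamma` is an assumed calibration weight that this summary does not name:

```python
import torch

def cpn_attention_mask(seq_len: int, query_positions: torch.Tensor) -> torch.Tensor:
    """Causal attention mask with every query (e.g. video) position blocked,
    so candidate tokens cannot attend to the query: the resulting token
    log-probs estimate the unconditioned prior P(t). True = attention allowed."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal
    mask[:, query_positions] = False  # CPN: hide the query from all tokens
    return mask

def blim_cpn_score(log_t_given_v: float, log_t_prior: float,
                   log_v_given_t: float, gamma: float = 1.0) -> float:
    """Retrieval score for one (video, text) pair: the candidate likelihood
    debiased by the masked-prior estimate, plus the query likelihood. Each
    argument is a summed per-token log-probability from the MLLM."""
    return (log_t_given_v - gamma * log_t_prior) + log_v_given_t

# v2t retrieval: argmax of blim_cpn_score over text candidates t^(n);
# t2v retrieval proceeds symmetrically over video candidates v^(n).
```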

Training & Efficiency

  • Only the linear projection layer and LoRA are fine-tuned, enabling parameter-efficient adaptation.
  • Two-stage retrieval: InternVideo2 1B for initial retrieval (top-K) → BLiM for re-ranking.
  • Inference complexity is reduced from \(O(N^2)\) (MLLM-scoring every query-candidate pair) to \(O(KN)\) (e.g., 307× faster on ActivityNet); the pipeline is sketched below.
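A minimal sketch of the two-stage pipeline, with `coarse_sim` standing in for the InternVideo2 dual-encoder similarity and `rerank_score` for the BLiM score; the names, the callable-based interface, and k=16 are our assumptions:

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")

def two_stage_retrieve(query, candidates: Sequence[C],
                       coarse_sim: Callable[[object, C], float],
                       rerank_score: Callable[[object, C], float],
                       k: int = 16) -> C:
    """Stage 1: rank all N candidates with the cheap dual-encoder similarity;
    stage 2: re-score only the top-K with the expensive MLLM. Per query this
    costs N cheap + K expensive evaluations, so all-pairs MLLM scoring O(N^2)
    drops to O(KN) overall. k=16 is an assumed value; the paper's K is not
    given in these notes."""
    shortlist = sorted(candidates, key=lambda c: coarse_sim(query, c), reverse=True)[:k]
    return max(shortlist, key=lambda c: rerank_score(query, c))
```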

Key Experimental Results

Main Results (Text-to-Video R@1)

Method                         DiDeMo  ActivityNet  LSMDC  MSRVTT  Average
InternVideo2 1B                57.0    60.4         32.0   51.9    50.3
InternVideo2 6B                57.9    63.2         33.8   55.9    52.7
UMT (fine-tuned)               70.4    66.8         43.0   58.8    59.8
InternVideo2 1B* (fine-tuned)  75.3    68.8         44.9   59.4    62.1
BLiM (Ours)                    86.4    81.0         55.7   64.7    71.9

Ablation Study (R@1, DiDeMo)

Configuration                                           T2V R@1  V2T R@1  Notes
Candidate likelihood only \(P(\mathbf{t}|\mathbf{v})\)  –        –        Lower (prior bias distorts the ranking)
Query likelihood only \(P(\mathbf{v}|\mathbf{t})\)      –        –        Higher (the more reliable signal)
Bidirectional likelihood (BLiM−)                        69.8     62.9     Substantial improvement
BLiM + CPN                                              86.4     82.8     Further alleviates prior bias

Zero-shot BLiM− (without CPN) already achieves 69.8 T2V R@1 on DiDeMo, competitive with the fully fine-tuned baselines above.

Key Findings

  • Candidate prior bias is pervasive in MLLM-based retrieval, affecting both video-to-text and text-to-video directions.
  • Query likelihood alone produces reasonably accurate retrieval results (high diagonal similarity), but high-prior candidates under the candidate likelihood distort the rankings.
  • CPN improves not only retrieval performance but also other multimodal tasks such as VQA, demonstrating its value as a general debiasing tool.
  • The two-stage retrieval pipeline substantially reduces computational cost, making MLLM-based re-ranking practically feasible.

Highlights & Insights

  • The formalization of candidate prior bias is notably rigorous; Proposition 1 formally proves that ranking reversal occurs when the prior gap exceeds the likelihood gap.
  • The CPN design is elegant and minimalist — prior estimation is achieved training-free via attention masking followed by normalization.
  • The training objective for text-to-video feature generation is novel: contrastive learning replaces actual video decoding, operating entirely within the LLM output space.

Limitations & Future Work

  • The approach relies on a two-stage pipeline (InternVideo2 for initial retrieval), leaving room for improvement in end-to-end efficiency.
  • The contrastive loss for text-to-video generation requires all in-batch videos as negatives, potentially sensitive to batch size.
  • Validation is currently limited to text-video retrieval; the effectiveness on text-image retrieval remains to be explored.
  • Candidate prior bias is conceptually analogous to language prior (language bias) in VQA; CPN shares a similar spirit with VCD (Visual Contrastive Decoding).
  • Bidirectional likelihood can be viewed as an approximation of pointwise mutual information (made precise below).
  • The attention-mask trick for prior estimation generalizes to other scenarios requiring the removal of conditioning bias.
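The PMI connection can be made precise; a short derivation of our own, taking the CPN normalization weight to be 1:

```latex
% PMI identities:
\mathrm{PMI}(\mathbf{v}, \mathbf{t})
  = \log \frac{P(\mathbf{v}, \mathbf{t})}{P(\mathbf{v})\,P(\mathbf{t})}
  = \log P(\mathbf{t} \mid \mathbf{v}) - \log P(\mathbf{t})
  = \log P(\mathbf{v} \mid \mathbf{t}) - \log P(\mathbf{v})

% CPN's prior-normalized candidate likelihood log P(t|v) - log P(t) is exactly
% the first form. Since log P(v) is constant across text candidates, the query
% likelihood log P(v|t) ranks candidates identically to PMI, so the combined
% BLiM + CPN score is 2 * PMI plus a candidate-independent constant.
```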

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The discovery of candidate prior bias and the bidirectional likelihood solution are both highly novel; CPN is concise and effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Four retrieval benchmarks plus extended analysis on multimodal tasks, with substantial performance gains.
  • Writing Quality: ⭐⭐⭐⭐⭐ — In-depth problem analysis, rigorous theoretical proofs, and intuitive visualizations.
  • Value: ⭐⭐⭐⭐⭐ — An average R@1 gain of 6.4 points is substantial; CPN as a general-purpose module has broad applicability.