MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence
Conference: NeurIPS 2025 | arXiv: 2510.21406 | Code: GitHub | Area: Video Understanding | Keywords: Video Retrieval, Multi-Modal Query, Untrimmed Video, Multi-Level Visual Correspondence, Benchmark Dataset
TL;DR
This paper introduces MUVR, a benchmark for multi-modal untrimmed video retrieval targeting real-world long-video platforms. It proposes a video-centric multi-modal query format (video + text + tag + mask) and a six-level visual correspondence matching criterion; the benchmark comprises 53K videos and 1,050 queries and is used to systematically evaluate the limitations of retrieval models and MLLMs.
Background & Motivation
Background: Video retrieval is a core technology in recommendation systems and content search. Existing tasks include text-to-video retrieval (TVR), composed video retrieval (CVR), and fine-grained video retrieval (FVR), each with distinct limitations.
Limitations of Prior Work: (1) Pure text queries fail to capture fine-grained visual details; (2) pure video queries introduce irrelevant visual information; (3) existing benchmarks support only a single matching criterion (e.g., near-duplicate or event-level), failing to cover diverse video categories; (4) most benchmarks are built on trimmed videos with one-to-one retrieval, which does not reflect real-world platform scenarios.
Key Challenge: Real-world video platforms require one-to-many retrieval over untrimmed videos with multi-modal queries and broad category coverage, yet no existing benchmark satisfies all these requirements simultaneously.
Goal: To construct a comprehensive video retrieval benchmark aligned with the demands of real-world video platforms.
Key Insight: Use video as the primary query modality, supplemented by text descriptions and tag/mask prompts, to perform one-to-many retrieval over an untrimmed video corpus.
Core Idea: Multi-level visual correspondence (copy, event, scene, instance, action, others) is proposed as a general retrieval matching criterion, and five partitions are designed to cover diverse video categories.
Method
Overall Architecture
The MUVR benchmark comprises three versions:
- MUVR-Base: 53K videos, 1,050 multi-modal queries, and 84K annotated matches.
- MUVR-Filter: Built upon MUVR-Base with 74K multi-label annotations to support tag-filtered retrieval (9,979 queries).
- MUVR-QA: 200 discriminative questions for evaluating MLLM reranking capability.
Key Designs
- Multi-Modal Query Design: Each query consists of four components:
  - Video Query: Carries primary visual information, suited for expressing details that are difficult to describe in text.
  - Text Description: Averages 20 words, specifying key visual content and retrieval intent.
  - Tag Prompt: User-specified desired/undesired video attributes (e.g., "animated style," "first-person view") to enable fine-grained filtering.
  - Mask Prompt: Key regions annotated using SAM2 to guide retrieval models toward specific parts of the query video.
- Multi-Level Visual Correspondence: Six correspondence levels are defined according to the granularity of visual content the user seeks:
  - Copy: The target is a copy or edited version of the query.
  - Event: Shares the same event (spatiotemporal overlap).
  - Scene: Shares the same scene, background, or location.
  - Instance: Shares the same instance or object.
  - Action: Shares the same human action.
  - Others: Subjectively perceived relevance.
- Five-Partition Design: News (news events), Region (travel scenes), Instance (products and pets), Dance (dance actions), and Others (memes/films), each emphasizing different visual correspondence levels and video categories.
- Reranking Score: An evaluation metric designed for MLLM reranking. Each query contains one true positive and one false positive; an MLLM's keep/remove decision can produce four outcomes (see the scoring sketch after this list):
  - 10 (correctly retains the true positive and removes the false positive): +1
  - 11 (retains both; no action): 0
  - 00 (removes both): −1
  - 01 (incorrect inversion): −2
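The four-outcome rule maps directly to a small scoring function. A minimal Python sketch (function and variable names are illustrative, not from the paper's code):

```python
def reranking_score(keep_tp: bool, keep_fp: bool) -> int:
    """Score one MUVR-QA reranking decision.

    Each question pairs one true positive (TP) with one false
    positive (FP); the MLLM decides which of the two to keep.
    """
    if keep_tp and not keep_fp:      # "10": correct rerank
        return +1
    if keep_tp and keep_fp:          # "11": no action taken
        return 0
    if not keep_tp and not keep_fp:  # "00": removes both
        return -1
    return -2                        # "01": inverts TP and FP


def mean_reranking_score(decisions) -> float:
    """Average per-question scores; `decisions` yields (keep_tp, keep_fp) pairs."""
    decisions = list(decisions)
    return sum(reranking_score(t, f) for t, f in decisions) / len(decisions)
```

The asymmetric penalties (−2 for an inversion vs. 0 for no action) encode that a reranker which actively swaps a true positive for a false positive is worse than one that does nothing.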
Loss & Training
MUVR is an evaluation benchmark rather than a training method. For tag-filtered retrieval on MUVR-Filter, the final score combines query-video similarity with a tag term:

\[ s(q, v) = s_{\text{query}}(q, v) + p \cdot s_{\text{tag}}(t, v), \qquad p = \pm 0.3, \]

where \(s_{\text{query}}\) is the similarity between the multi-modal query \(q\) and a gallery video \(v\), \(s_{\text{tag}}\) is the similarity between the tag prompt \(t\) and \(v\), and the sign of \(p\) is determined by the sign of the tag prompt (positive or negative).
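Under this reading of the scoring rule, tag filtering is a one-line adjustment on top of embedding similarity. A minimal sketch assuming CLIP-style normalized embeddings (names are illustrative):

```python
import torch
import torch.nn.functional as F

def tag_filtered_scores(
    query_emb: torch.Tensor,    # (d,)   fused multi-modal query embedding
    tag_emb: torch.Tensor,      # (d,)   text embedding of the tag prompt
    gallery_embs: torch.Tensor, # (N, d) embeddings of the gallery videos
    tag_is_positive: bool,
) -> torch.Tensor:
    """Tag-weighted retrieval scores per the formula above."""
    query_emb = F.normalize(query_emb, dim=-1)
    tag_emb = F.normalize(tag_emb, dim=-1)
    gallery_embs = F.normalize(gallery_embs, dim=-1)

    s_query = gallery_embs @ query_emb  # (N,) query-video similarity
    s_tag = gallery_embs @ tag_emb      # (N,) tag-video similarity
    p = 0.3 if tag_is_positive else -0.3  # sign follows the tag prompt
    return s_query + p * s_tag          # final ranking scores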
Key Experimental Results
Main Results
Multi-Modal Query Retrieval Performance on MUVR-Base (mAP %):
| Method | Avg. mAP | News | Others | Instance | Region | Dance |
|---|---|---|---|---|---|---|
| CLIP (RN50x4) | 42.9 | 49.4 | 53.6 | 46.5 | 43.8 | 21.2 |
| EVA-CLIP | 58.0 | 63.1 | 66.1 | 68.2 | 63.8 | 28.7 |
| InternVideo2 | 52.1 | 57.3 | 66.3 | 55.3 | 52.5 | 28.9 |
| S2VS | 47.2 | 51.3 | 63.7 | 49.5 | 49.1 | 22.5 |
| CoVR | 43.3 | 50.5 | 54.3 | 46.9 | 44.0 | 20.8 |
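All retrieval results are reported as mAP over one-to-many relevance: each query's average precision is computed against its full set of annotated matches, then averaged over queries. A minimal sketch of the standard computation (the paper's evaluation script may differ in details):

```python
def average_precision(ranked_ids, relevant_ids) -> float:
    """AP for one one-to-many query: `ranked_ids` is the gallery sorted
    by descending score, `relevant_ids` the annotated matches."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at each hit
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs) -> float:
    """mAP over queries; `runs` yields (ranked_ids, relevant_ids) pairs."""
    aps = [average_precision(r, rel) for r, rel in runs]
    return sum(aps) / len(aps)
```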
MLLM Reranking Evaluation on MUVR-QA:
| Method | Params | Frames | Accuracy (All) | Reranking Score (All) |
|---|---|---|---|---|
| InternVL2 (multi-image) | 8B | 6 | 58.5 | −0.23 |
| GPT-4o (multi-image) | N/A | 6 | 65.0 | 0.19 |
| Gemini-2.0-Flash | N/A | 12 | 63.5 | 0.07 |
| InternVL2.5 (text) | 8B | 12 | 58.5 | −0.37 |
Ablation Study
Effect of Query Format on Retrieval Performance (EVA-CLIP, mAP %):
| Query Format | Avg. mAP | Best Partition |
|---|---|---|
| Text only | 43.0 | Instance (59.7) |
| Video only | 50.7 | Others (59.2) |
| Multi-modal | 58.0 | Instance (68.2) |
| Multi-modal + Tag | 34.0 (Filter) | News (38.3) |
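One common way to realize the multi-modal rows above with a dual encoder such as EVA-CLIP is to average the normalized video and text embeddings before ranking. A sketch under that assumption (fusion details vary by baseline and are not specified here; `alpha` is illustrative):

```python
import torch
import torch.nn.functional as F

def fuse_query(video_emb: torch.Tensor, text_emb: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    """Weighted average of normalized video and text query embeddings.

    `alpha` balances the two modalities; 0.5 is an assumed default,
    not a value taken from the paper.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return F.normalize(alpha * v + (1 - alpha) * t, dim=-1)
```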
Key Findings
- Video queries outperform text queries: Video-only retrieval consistently surpasses text-only retrieval (50.7 vs. 43.0 mAP), as videos more precisely convey details that are difficult to express in words.
- Multi-modal queries yield substantial gains: Combining video and text achieves a 7.3-point mAP improvement (58.0 vs. 50.7), demonstrating the complementarity of the two modalities.
- Different partitions require different capabilities: VLMs perform best on the Instance/Region/News partitions (static spatial understanding), while video models are stronger on Dance/Others (dynamic temporal understanding).
- All methods perform worst on the Dance partition: The highest mAP reaches only 28.9%, indicating that action-level visual correspondence remains a significant challenge.
- Tag prompts pose substantial challenges for MLLMs: Some models achieve 70%+ accuracy on tag-free questions but degrade noticeably when tags are introduced.
- MLLM reranking is unreliable: Most models achieve negative reranking scores, indicating that current MLLMs are not yet dependable for video reranking.
Highlights & Insights
- Well-designed and practically motivated benchmark: The five-partition design covers core categories on video platforms, avoiding the bias of single-category benchmarks.
- Multi-level visual correspondence is a valuable formalization: Abstracting user retrieval intent into six granularity levels provides more precision than simple semantic matching.
- Tag prompts are a novel contribution: Enabling users to filter results with simple tags closely mirrors real-world product experiences.
- Reranking score design is well-justified: Distinguishing "no action" from "incorrect inversion" with different penalties reflects the practical requirements of reranking scenarios.
- The benchmark is large in scale (53K videos) and high in annotation quality (professional annotators with two-round verification).
Limitations & Future Work
- Data is sourced exclusively from Bilibili, potentially introducing a Chinese-video bias; generalizability to English-language platforms remains to be validated.
- Videos are clipped to a maximum of 2 minutes, providing insufficient coverage of retrieval scenarios involving longer content.
- The tag prompt weighting scheme (\(p = \pm 0.3\) fixed) is relatively simple; adaptive weighting strategies could be explored.
- MUVR-QA contains only 200 questions, limiting its scale.
- End-to-end multi-modal query understanding models are not explored; only existing models are evaluated.
Related Work & Insights
- Compared to FIVR (event-level retrieval), MUVR extends the framework to six levels of visual correspondence across multiple video categories.
- Compared to CoVR (composed retrieval), MUVR supports one-to-many retrieval over untrimmed videos.
- The extremely low performance on the Dance partition motivates future work on action-level video understanding.
- The unreliability of MLLM reranking motivates further research into two-stage retrieval + MLLM reranking pipelines.
Rating
- Novelty: ⭐⭐⭐⭐ The multi-level visual correspondence and multi-modal query paradigm are novel, though the primary contribution is a dataset rather than a methodological innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Systematic evaluation across 3 retrieval models, 6 VLMs, and 10 MLLMs, with comprehensive analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Benchmark design is described in detail, findings are clearly summarized, and readability is excellent.
- Value: ⭐⭐⭐⭐⭐ Fills a critical gap in benchmarks for multi-modal untrimmed video retrieval, making a significant contribution to the community.