CoVR-R: Reason-Aware Composed Video Retrieval¶
Conference: CVPR 2026
arXiv: 2603.20190
Code: github.com/mbzuai-oryx/CoVR-R
Area: Multimodal / Video-Language Models
Keywords: composed video retrieval, reason-aware retrieval, after-effect reasoning, zero-shot retrieval, large multimodal models
TL;DR¶
CoVR-R proposes a reasoning-first zero-shot composed video retrieval framework that leverages a large multimodal model (Qwen3-VL) to explicitly reason about the "after-effects" (state transitions, temporal phases, shot changes, etc.) implied by edit instructions. The paper further introduces the CoVR-R benchmark of 2,800 triplets with structured reasoning traces and hard negatives to evaluate reasoning capability. The method substantially outperforms existing approaches in retrieval accuracy.
Background & Motivation¶
Composed Video Retrieval (CoVR) aims to retrieve a target video that reflects requested changes, given a reference video and a modification text. Existing methods suffer from critical limitations:
Limitations of keyword matching: Most approaches rely on triplet-driven training that primarily rewards keyword overlap while ignoring the after-effects implied by the modification text. For example, "switch to a close-up shot" implies tighter framing and shorter duration; "deep-frying" implies smoke and faster hand movements.
The gap between what is said and what must occur: A gap exists between what the edit text explicitly states and what the target video must demonstrate. Bridging this gap requires reasoning — predicting the causal chain that connects the edit to plausible visual evidence.
Existing benchmarks do not evaluate reasoning: Prior CoVR datasets emphasize literal edit or description alignment, without assessing causal plausibility or temporal consistency.
Core Motivation: To explicitly incorporate reasoning into the retrieval loop by predicting the consequences of edits, shifting from "matching keywords" to "reasoning about consequences."
Method¶
Overall Architecture¶
CoVR-R adopts a two-stage Reason-then-Retrieve architecture:
- Stage 1 — Reasoning: Qwen3-VL-8B generates a structured after-effect reasoning trace \(R\) conditioned on the reference video \(V_r\) and edit text \(E\).
- Stage 2 — Retrieval: The tuple \((V_r, E, R)\) is encoded into an effect-aware query embedding, which is matched against pre-computed gallery embeddings via cosine similarity.
The entire framework keeps the LMM frozen and requires no CoVR-specific supervision, enabling zero-shot retrieval.
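To make the two-stage flow concrete, here is a minimal sketch in Python. The `generate` and `embed` callables stand in for the frozen LMM (e.g., Qwen3-VL) and the description-to-embedding step described under Key Designs below; their interfaces and the prompt wording are assumptions, not the authors' code.

```python
from typing import Callable
import numpy as np

def reason_then_retrieve(
    video,                # reference video V_r (frames, a file path, ...)
    edit: str,            # modification text E
    generate: Callable,   # frozen LMM generation (assumed interface)
    embed: Callable,      # description -> pooled query embedding
    gallery: np.ndarray,  # (N, d) cached, L2-normalized gallery embeddings
) -> np.ndarray:
    # Stage 1: structured after-effect reasoning trace R from (V_r, E)
    trace = generate(video, f"Edit: {edit}. List the implied states, "
                            "actions, scene, camera, and tempo changes.")
    # Hypothetical post-edit description D_target from (V_r, E, R)
    target = generate(video, f"Describe the video after applying '{edit}', "
                             f"given these after-effects: {trace}")
    # Stage 2: effect-aware query embedding, matched by cosine similarity
    q = embed(target)
    q = q / np.linalg.norm(q)
    return np.argsort(-(gallery @ q))  # gallery indices, best match first
```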
Key Designs¶
- Gallery Video Encoding: For each video \(V\), Qwen3-VL generates a detailed description \(D(V)\); the final-layer token embeddings are aggregated into a single vector via importance-weighted pooling (see the pooling sketch after this list). Weights are assigned in three tiers based on semantic informativeness: \(\alpha_{\text{high}}=1.0\) (actions, objects, states), \(\alpha_{\text{mid}}=0.3\) (attributes, scenes), \(\alpha_{\text{low}}=0.1\) (function words). All embeddings are L2-normalized and cached offline.
- Reason-Aware Query Encoding (three steps):
  - After-effect reasoning: Qwen3-VL is prompted to generate a structured reasoning trace \(R = \{\text{states}, \text{actions}, \text{scene}, \text{camera}, \text{tempo}\}\) conditioned on \((V_r, E)\), with at most four atomic assertions per slot (an illustrative trace follows this list).
  - Target description generation: A complete description \(D_{\text{target}}\) of the hypothetical post-edit video is generated conditioned on \((V_r, E, R)\).
  - Embedding extraction and pooling: Token embeddings are extracted and aggregated using the same importance-weighted pooling scheme.
- CoVR-R Benchmark Construction:
  - 2,800 high-quality triplets are constructed from Dense-WebVid-CoVR and Something-Something V2.
  - Each triplet includes a schema-constrained reasoning trace and hard negative candidates.
  - Selection criteria require satisfying at least two of: temporal dependency, state transition, camera technique, implicit causality, low lexical sufficiency.
  - Reasoning traces are generated following a fixed slot order (actions → camera → states → scene → tempo) and verified through human review.
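As a concrete illustration of the five-slot schema, here is a hypothetical trace for the edit "switch to a close-up shot" from the motivation above; the slot names follow the paper, but the assertions are our own example, not taken from the benchmark:

```python
# Hypothetical after-effect trace R, in the benchmark's fixed slot order
# (actions -> camera -> states -> scene -> tempo); assertions are illustrative.
trace = {
    "actions": ["camera moves in on the subject"],
    "camera":  ["tight close-up framing", "shallow depth of field"],
    "states":  ["background largely cropped out of frame"],
    "scene":   ["same location as the reference video"],
    "tempo":   ["shorter shot duration"],
}
```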
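The importance-weighted pooling used in both encoders can be sketched in a few lines. The tier weights are the paper's; how tokens get assigned to tiers is not fully specified here, so the stopword-and-suffix heuristic below is an assumption standing in for the actual informativeness classifier:

```python
import numpy as np

ALPHA_HIGH, ALPHA_MID, ALPHA_LOW = 1.0, 0.3, 0.1  # tier weights from the paper

# Crude tier assignment; the paper's real informativeness rule is not given,
# so a stopword list and adjective-like suffixes stand in for it here.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "are", "and", "with"}

def token_weight(token: str) -> float:
    t = token.lower()
    if t in STOPWORDS:
        return ALPHA_LOW   # function words
    if t.endswith(("ive", "ous", "ful", "y")):
        return ALPHA_MID   # rough proxy for attributes
    return ALPHA_HIGH      # actions, objects, states

def weighted_pool(tokens: list[str], embs: np.ndarray) -> np.ndarray:
    """Aggregate final-layer token embeddings of shape (len(tokens), d)."""
    w = np.array([token_weight(t) for t in tokens])
    v = (w[:, None] * embs).sum(axis=0) / w.sum()  # weighted mean (assumed)
    return v / np.linalg.norm(v)                   # L2-normalize before caching
```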
Loss & Training¶
- No training: The entire method is zero-shot and requires no task-specific fine-tuning.
- Retrieval ranking is based on cosine similarity, \(s(V) = \mathbf{q}(V_r, E)^\top \mathbf{v}(V)\); in the reasoning variant (+R), the query embedding additionally conditions on the trace, i.e. \(\mathbf{q}(V_r, E, R)\).
- Reasoning evaluation employs LLM-as-a-judge (GPT-4o), scoring across 10 dimensions (1–10), with the arithmetic mean serving as the overall reasoning score.
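For reference, the R@K numbers reported below follow the standard retrieval definition: the fraction of queries whose target video appears among the top-K ranked gallery candidates. A minimal sketch for a single query:

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, target: int, ks=(1, 5, 10, 50)) -> dict:
    """Per-query Recall@K: 1.0 if the target gallery index appears in the
    top K of the ranking, else 0.0; reported scores average over queries."""
    rank = int(np.nonzero(ranked == target)[0][0])
    return {f"R@{k}": float(rank < k) for k in ks}
```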
Key Experimental Results¶
Main Results¶
Zero-shot comparison on the CoVR-R benchmark
| Method | Backbone | R@1 | R@5 | R@10 | R@50 | Reasoning Score |
|---|---|---|---|---|---|---|
| CoVR-BLIP | BLIP | 30.30 | 51.07 | 57.05 | 73.82 | 4.85 |
| BSE-CoVR (CA) | BLIP | 37.90 | 57.67 | 64.48 | 79.47 | 6.42 |
| MVFT-JI† | BLIP | 34.40 | 54.15 | 62.30 | 77.40 | 6.28 |
| Ours | Qwen3-VL | 44.32 | 61.91 | 67.33 | 79.90 | 7.46 |
| Ours+R | Qwen3-VL | 49.88 | 66.99 | 72.97 | 85.14 | 8.31 |
Ours+R improves R@1 by 11.98 percentage points over the strongest baseline, BSE-CoVR (CA), a 31.6% relative gain.
Dense-WebVid-CoVR test set
| Method | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|
| BSE-CoVR (CA) | 48.08 | 73.36 | 81.06 | 93.78 |
| Ours | 58.19 | 80.50 | 86.92 | 97.14 |
| Ours+R | 61.21 | 83.40 | 89.39 | 97.61 |
Ours+R improves R@1 by 13.13 percentage points over BSE-CoVR (CA), surpassing all baselines.
Ablation Study¶
Token aggregation strategies
| Strategy | R@1 | R@5 | R@50 |
|---|---|---|---|
| Last token | 1.51 | 3.57 | 10.14 |
| Mean pooling | 44.87 | 63.67 | 82.44 |
| Max pooling | 35.95 | 52.02 | 93.98 |
| Weighted (ours) | 49.88 | 66.99 | 85.14 |
Importance-weighted pooling outperforms mean pooling by +5.01 R@1.
Effect of model scale
| Model | R@1 | Reasoning Score |
|---|---|---|
| Qwen3-VL-4B | 43.98 | 7.95 |
| Qwen3-VL-8B | 49.88 | 8.31 |
| Qwen3-VL-72B | 55.48 | 9.05 |
Performance scales consistently with model size; 8B offers the best efficiency-performance trade-off.
Key Findings¶
- The reasoning-augmented variant (+R) improves R@1 by +5.56 percentage points over the non-reasoning version, validating the value of explicit after-effect prediction.
- Prior methods perform notably worse on CoVR-R than on standard benchmarks (average R@1 of 32.05% vs. 40.66%, respectively), demonstrating that reasoning-dependent edits pose a distinct challenge.
- Iterative reasoning refinement (5 rounds) yields only marginal gains (R@1: 49.88% → 50.56%) at a 5× increase in inference cost; single-pass reasoning is adopted as the final design choice.
- The Qwen3 series consistently outperforms the Qwen2.5 series at comparable parameter counts.
Highlights & Insights¶
- Reasoning-first paradigm: Reasoning is elevated from a byproduct of retrieval to a first-class component — explicitly predicting the after-effects of edits before retrieval, offering greater interpretability than end-to-end feature fusion.
- No task-specific training required: The framework achieves zero-shot CoVR by leveraging the reasoning capabilities of general-purpose LMMs, reducing reliance on annotated data.
- Importance-weighted pooling: A simple yet effective parameter-free strategy that outperforms the alternative aggregation schemes (last-token, mean, and max pooling) by down-weighting function words and up-weighting semantically rich tokens.
- Structured reasoning traces: A five-dimensional schema constraint (states / actions / scene / camera / tempo) makes reasoning verifiable and comparable, facilitating future research.
Limitations & Future Work¶
- The framework depends on Qwen3-VL's video understanding capability, which may degrade on low-quality or extremely long videos.
- Gallery encoding requires generating descriptions and extracting embeddings for each video, incurring non-trivial preprocessing costs.
- The quality of reasoning traces is bounded by the LMM's reasoning capacity; subtle causal chains may be overlooked.
- The benchmark scale (2,800 triplets) is relatively small with limited domain coverage.
- Whether the zero-shot advantage over end-to-end fine-tuned methods on standard benchmarks can be sustained at larger scale remains to be verified.
Related Work & Insights¶
- The extension from CIR (Composed Image Retrieval) to CoVR introduces temporal and causal dimensions that are central to video understanding.
- The proposed approach is complementary to training-based methods such as MVFT-JI and CoVR-BLIP — reasoning-based and training-based paradigms can potentially be combined.
- The importance-weighted pooling idea generalizes to other tasks requiring semantic embeddings extracted from LMM-generated text.
- The zero-shot reasoning-retrieval paradigm may extend to composed retrieval in other modalities (3D, audio, etc.).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The reasoning-first zero-shot CoVR framework is novel, and the benchmark design is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluations span two benchmarks with multi-dimensional ablations and model-scale analysis.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, and the formal definition of reasoning traces is well-structured.
- Value: ⭐⭐⭐⭐ — Advances CoVR from keyword matching toward reasoning-driven retrieval.