CoVR-R: Reason-Aware Composed Video Retrieval¶
Conference: CVPR 2026
arXiv: 2603.20190
Code: github.com/mbzuai-oryx/CoVR-R
Area: Multimodal / Video-Language Models
Keywords: composed video retrieval, reason-aware retrieval, after-effect reasoning, zero-shot retrieval, large multimodal models
TL;DR¶
CoVR-R proposes a reasoning-first zero-shot composed video retrieval framework that leverages a large multimodal model (Qwen3-VL) to explicitly reason about the "after-effects" (state transitions, temporal phases, shot changes, etc.) implied by edit instructions. The paper further introduces the CoVR-R benchmark of 2,800 triplets with structured reasoning traces and hard negatives to evaluate reasoning capability. The method substantially outperforms existing approaches in retrieval accuracy.
Background & Motivation¶
Composed Video Retrieval (CoVR) aims to retrieve a target video that reflects requested changes, given a reference video and a modification text. Existing methods suffer from critical limitations:
Limitations of keyword matching: Most approaches rely on triplet-driven training that primarily rewards keyword overlap while ignoring the after-effects implied by the modification text. For example, "switch to a close-up shot" implies tighter framing and shorter duration; "deep-frying" implies smoke and faster hand movements.
The gap between what is said and what must occur: A gap exists between what the edit text explicitly states and what the target video must demonstrate. Bridging this gap requires reasoning — predicting the causal chain that connects the edit to plausible visual evidence.
Existing benchmarks do not evaluate reasoning: Prior CoVR datasets emphasize literal edit or description alignment, without assessing causal plausibility or temporal consistency.
Core Motivation: To explicitly incorporate reasoning into the retrieval loop by predicting the consequences of edits, shifting from "matching keywords" to "reasoning about consequences."
Method¶
Overall Architecture¶
CoVR-R adopts a two-stage Reason-then-Retrieve architecture:
- Stage 1 — Reasoning: Qwen3-VL-8B generates a structured after-effect reasoning trace \(R\) conditioned on the reference video \(V_r\) and edit text \(E\).
- Stage 2 — Retrieval: The tuple \((V_r, E, R)\) is encoded into an effect-aware query embedding, which is matched against pre-computed gallery embeddings via cosine similarity.
The entire framework keeps the LMM frozen and requires no CoVR-specific supervision, enabling zero-shot retrieval.
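To make the two-stage flow concrete, here is a minimal sketch in Python. The `generate` and `embed` callables stand in for the frozen LMM (e.g., Qwen3-VL) and the description-to-embedding step described under Key Designs below; their interfaces and the prompt wording are assumptions, not the authors' code.

```python
from typing import Callable
import numpy as np

def reason_then_retrieve(
    video,                # reference video V_r (frames, a file path, ...)
    edit: str,            # modification text E
    generate: Callable,   # frozen LMM generation (assumed interface)
    embed: Callable,      # description -> pooled query embedding
    gallery: np.ndarray,  # (N, d) cached, L2-normalized gallery embeddings
) -> np.ndarray:
    # Stage 1: structured after-effect reasoning trace R from (V_r, E)
    trace = generate(video, f"Edit: {edit}. List the implied states, "
                            "actions, scene, camera, and tempo changes.")
    # Hypothetical post-edit description D_target from (V_r, E, R)
    target = generate(video, f"Describe the video after applying '{edit}', "
                             f"given these after-effects: {trace}")
    # Stage 2: effect-aware query embedding, matched by cosine similarity
    q = embed(target)
    q = q / np.linalg.norm(q)
    return np.argsort(-(gallery @ q))  # gallery indices, best match first
```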
Key Designs¶
- Gallery Video Encoding: For each video \(V\), Qwen3-VL generates a detailed description \(D(V)\); the final-layer token embeddings are aggregated into a single vector via importance-weighted pooling (see the pooling sketch after this list). Weights are assigned in three tiers based on semantic informativeness: \(\alpha_{\text{high}}=1.0\) (actions, objects, states), \(\alpha_{\text{mid}}=0.3\) (attributes, scenes), \(\alpha_{\text{low}}=0.1\) (function words). All embeddings are L2-normalized and cached offline.
- Reason-Aware Query Encoding (three steps):
  - After-effect reasoning: Qwen3-VL is prompted to generate a structured reasoning trace \(R = \{\text{states}, \text{actions}, \text{scene}, \text{camera}, \text{tempo}\}\) conditioned on \((V_r, E)\), with at most four atomic assertions per slot (an illustrative trace follows this list).
  - Target description generation: A complete description \(D_{\text{target}}\) of the hypothetical post-edit video is generated conditioned on \((V_r, E, R)\).
  - Embedding extraction and pooling: Token embeddings are extracted and aggregated using the same importance-weighted pooling scheme.
- CoVR-R Benchmark Construction:
  - 2,800 high-quality triplets are constructed from Dense-WebVid-CoVR and Something-Something V2.
  - Each triplet includes a schema-constrained reasoning trace and hard negative candidates.
  - Selection criteria require satisfying at least two of: temporal dependency, state transition, camera technique, implicit causality, low lexical sufficiency.
  - Reasoning traces are generated following a fixed slot order (actions → camera → states → scene → tempo) and verified through human review.
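As a concrete illustration of the five-slot schema, here is a hypothetical trace for the edit "switch to a close-up shot" from the motivation above; the slot names follow the paper, but the assertions are our own example, not taken from the benchmark:

```python
# Hypothetical after-effect trace R, in the benchmark's fixed slot order
# (actions -> camera -> states -> scene -> tempo); assertions are illustrative.
trace = {
    "actions": ["camera moves in on the subject"],
    "camera":  ["tight close-up framing", "shallow depth of field"],
    "states":  ["background largely cropped out of frame"],
    "scene":   ["same location as the reference video"],
    "tempo":   ["shorter shot duration"],
}
```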
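The importance-weighted pooling used in both encoders can be sketched in a few lines. The tier weights are the paper's; how tokens get assigned to tiers is not fully specified here, so the stopword-and-suffix heuristic below is an assumption standing in for the actual informativeness classifier:

```python
import numpy as np

ALPHA_HIGH, ALPHA_MID, ALPHA_LOW = 1.0, 0.3, 0.1  # tier weights from the paper

# Crude tier assignment; the paper's real informativeness rule is not given,
# so a stopword list and adjective-like suffixes stand in for it here.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "is", "are", "and", "with"}

def token_weight(token: str) -> float:
    t = token.lower()
    if t in STOPWORDS:
        return ALPHA_LOW   # function words
    if t.endswith(("ive", "ous", "ful", "y")):
        return ALPHA_MID   # rough proxy for attributes
    return ALPHA_HIGH      # actions, objects, states

def weighted_pool(tokens: list[str], embs: np.ndarray) -> np.ndarray:
    """Aggregate final-layer token embeddings of shape (len(tokens), d)."""
    w = np.array([token_weight(t) for t in tokens])
    v = (w[:, None] * embs).sum(axis=0) / w.sum()  # weighted mean (assumed)
    return v / np.linalg.norm(v)                   # L2-normalize before caching
```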
Loss & Training¶
- No training: The entire method is zero-shot and requires no task-specific fine-tuning.
- Retrieval ranking is based on cosine similarity, \(s(V) = \mathbf{q}(V_r, E)^\top \mathbf{v}(V)\); in the reasoning variant (+R), the query embedding additionally conditions on the trace, i.e. \(\mathbf{q}(V_r, E, R)\).
- Reasoning evaluation employs LLM-as-a-judge (GPT-4o), scoring across 10 dimensions (1–10), with the arithmetic mean serving as the overall reasoning score.
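For reference, the R@K numbers reported below follow the standard retrieval definition: the fraction of queries whose target video appears among the top-K ranked gallery candidates. A minimal sketch for a single query:

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, target: int, ks=(1, 5, 10, 50)) -> dict:
    """Per-query Recall@K: 1.0 if the target gallery index appears in the
    top K of the ranking, else 0.0; reported scores average over queries."""
    rank = int(np.nonzero(ranked == target)[0][0])
    return {f"R@{k}": float(rank < k) for k in ks}
```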
Key Experimental Results¶
Main Results¶
Zero-shot comparison on the CoVR-R benchmark
| Method | Backbone | R@1 | R@5 | R@10 | R@50 | Reasoning Score |
|---|---|---|---|---|---|---|
| CoVR-BLIP | BLIP | 30.30 | 51.07 | 57.05 | 73.82 | 4.85 |
| BSE-CoVR (CA) | BLIP | 37.90 | 57.67 | 64.48 | 79.47 | 6.42 |
| MVFT-JI† | BLIP | 34.40 | 54.15 | 62.30 | 77.40 | 6.28 |
| Ours | Qwen3-VL | 44.32 | 61.91 | 67.33 | 79.90 | 7.46 |
| Ours+R | Qwen3-VL | 49.88 | 66.99 | 72.97 | 85.14 | 8.31 |
Ours+R improves R@1 by 11.98 percentage points over the strongest baseline, BSE-CoVR (CA), a 31.6% relative gain.
Dense-WebVid-CoVR test set
| Method | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|
| BSE-CoVR (CA) | 48.08 | 73.36 | 81.06 | 93.78 |
| Ours | 58.19 | 80.50 | 86.92 | 97.14 |
| Ours+R | 61.21 | 83.40 | 89.39 | 97.61 |
Ours+R improves R@1 by 13.13 percentage points over BSE-CoVR (CA), surpassing all baselines.
Ablation Study¶
Token aggregation strategies
| Strategy | R@1 | R@5 | R@50 |
|---|---|---|---|
| Last token | 1.51 | 3.57 | 10.14 |
| Mean pooling | 44.87 | 63.67 | 82.44 |
| Max pooling | 35.95 | 52.02 | 93.98 |
| Weighted (ours) | 49.88 | 66.99 | 85.14 |
Importance-weighted pooling outperforms mean pooling by +5.01 R@1.
Effect of model scale
| Model | R@1 | Reasoning Score |
|---|---|---|
| Qwen3-VL-4B | 43.98 | 7.95 |
| Qwen3-VL-8B | 49.88 | 8.31 |
| Qwen3-VL-72B | 55.48 | 9.05 |
Performance scales consistently with model size; 8B offers the best efficiency-performance trade-off.
Key Findings¶
- The reasoning-augmented variant (+R) improves R@1 by +5.56 percentage points over the non-reasoning version, validating the value of explicit after-effect prediction.
- Prior methods perform notably worse on CoVR-R than on standard benchmarks (average R@1 of 32.05% vs. 40.66%, respectively), demonstrating that reasoning-dependent edits pose a distinct challenge.
- Iterative reasoning refinement (5 rounds) yields only marginal gains (R@1: 49.88% → 50.56%) at a 5× increase in inference cost; single-pass reasoning is adopted as the final design choice.
- The Qwen3 series consistently outperforms the Qwen2.5 series at comparable parameter counts.
Highlights & Insights¶
- Reasoning-first paradigm: Reasoning is elevated from a byproduct of retrieval to a first-class component — explicitly predicting the after-effects of edits before retrieval, offering greater interpretability than end-to-end feature fusion.
- No task-specific training required: The framework achieves zero-shot CoVR by leveraging the reasoning capabilities of general-purpose LMMs, reducing reliance on annotated data.
- Importance-weighted pooling: A simple yet effective parameter-free strategy that outperforms the alternative aggregation schemes (last-token, mean, and max pooling) by down-weighting function words and up-weighting semantically rich tokens.
- Structured reasoning traces: A five-dimensional schema constraint (states / actions / scene / camera / tempo) makes reasoning verifiable and comparable, facilitating future research.
Limitations & Future Work¶
- The framework depends on Qwen3-VL's video understanding capability, which may degrade on low-quality or extremely long videos.
- Gallery encoding requires generating descriptions and extracting embeddings for each video, incurring non-trivial preprocessing costs.
- The quality of reasoning traces is bounded by the LMM's reasoning capacity; subtle causal chains may be overlooked.
- The benchmark scale (2,800 triplets) is relatively small with limited domain coverage.
- Whether the zero-shot advantage over end-to-end fine-tuned methods on standard benchmarks can be sustained at larger scale remains to be verified.
Related Work & Insights¶
- The extension from CIR (Composed Image Retrieval) to CoVR introduces temporal and causal dimensions that are central to video understanding.
- The proposed approach is complementary to training-based methods such as MVFT-JI and CoVR-BLIP — reasoning-based and training-based paradigms can potentially be combined.
- The importance-weighted pooling idea generalizes to other tasks requiring semantic embeddings extracted from LMM-generated text.
- The zero-shot reasoning-retrieval paradigm may extend to composed retrieval in other modalities (3D, audio, etc.).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The reasoning-first zero-shot CoVR framework is novel, and the benchmark design is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluations span two benchmarks with multi-dimensional ablations and model-scale analysis.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, and the formal definition of reasoning traces is well-structured.
- Value: ⭐⭐⭐⭐ — Advances CoVR from keyword matching toward reasoning-driven retrieval.