CoVR-R: Reason-Aware Composed Video Retrieval

Conference: CVPR 2026
arXiv: 2603.20190
Code: github.com/mbzuai-oryx/CoVR-R
Area: Multimodal / Video-Language Models
Keywords: composed video retrieval, reason-aware retrieval, after-effect reasoning, zero-shot retrieval, large multimodal models

TL;DR

CoVR-R proposes a reasoning-first zero-shot composed video retrieval framework that leverages a large multimodal model (Qwen3-VL) to explicitly reason about the "after-effects" (state transitions, temporal phases, shot changes, etc.) implied by edit instructions. The paper further introduces the CoVR-R benchmark, comprising structured reasoning traces and hard negatives, to evaluate reasoning capability. The method substantially outperforms existing approaches in retrieval accuracy.

Background & Motivation

Composed Video Retrieval (CoVR) aims to retrieve a target video that reflects requested changes, given a reference video and a modification text. Existing methods suffer from critical limitations:

Limitations of keyword matching: Most approaches rely on triplet-driven training that primarily rewards keyword overlap while ignoring the after-effects implied by the modification text. For example, "switch to a close-up shot" implies tighter framing and shorter duration; "deep-frying" implies smoke and faster hand movements.

The gap between what is said and what must occur: A gap exists between what the edit text explicitly states and what the target video must demonstrate. Bridging this gap requires reasoning — predicting the causal chain that connects the edit to plausible visual evidence.

Existing benchmarks do not evaluate reasoning: Prior CoVR datasets emphasize literal edit or description alignment, without assessing causal plausibility or temporal consistency.

Core Motivation: To explicitly incorporate reasoning into the retrieval loop by predicting the consequences of edits, shifting from "matching keywords" to "reasoning about consequences."

Method

Overall Architecture

CoVR-R adopts a two-stage Reason-then-Retrieve architecture:

  • Stage 1 — Reasoning: Qwen3-VL-8B generates a structured after-effect reasoning trace \(R\) conditioned on the reference video \(V_r\) and edit text \(E\).
  • Stage 2 — Retrieval: The tuple \((V_r, E, R)\) is encoded into an effect-aware query embedding, which is matched against pre-computed gallery embeddings via cosine similarity.

The entire framework keeps the LMM frozen and requires no CoVR-specific supervision, enabling zero-shot retrieval.
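The two-stage flow above can be sketched as follows. This is a minimal, hypothetical illustration of the control flow only: `generate_reasoning`, `generate_target_description`, and `embed_text` stand in for calls to the frozen LMM (Qwen3-VL in the paper) and are implemented here as toy stubs so the pipeline is runnable.

```python
import numpy as np

def generate_reasoning(reference_video, edit_text):
    # Stage 1 (stub): prompt the LMM for a structured after-effect trace
    # over the five slots used in the paper.
    return {"states": ["pan is hot"], "actions": ["food is frying"],
            "scene": ["kitchen"], "camera": ["close-up"], "tempo": ["fast"]}

def generate_target_description(reference_video, edit_text, trace):
    # Stage 2a (stub): describe the hypothetical post-edit video,
    # conditioned on the reference, the edit, and the reasoning trace.
    return f"{edit_text}; " + "; ".join(sum(trace.values(), []))

def embed_text(text, dim=8):
    # Placeholder embedding: a deterministic hash-seeded vector,
    # L2-normalized as in the paper's gallery encoding.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(reference_video, edit_text, gallery):
    trace = generate_reasoning(reference_video, edit_text)      # Stage 1
    query = embed_text(
        generate_target_description(reference_video, edit_text, trace))
    # Stage 2b: cosine similarity against pre-computed gallery embeddings.
    scores = {vid: float(query @ emb) for vid, emb in gallery.items()}
    return max(scores, key=scores.get)

gallery = {name: embed_text(name) for name in
           ["frying in a pan, close-up", "boiling pasta", "chopping onions"]}
best = retrieve("ref.mp4", "switch to a close-up shot of frying", gallery)
print(best)
```

Because the LMM stays frozen and the gallery side is cached offline, only the query side (reasoning plus one description) runs per request.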

Key Designs

  1. Gallery Video Encoding: For each video \(V\), Qwen3-VL generates a detailed description \(D(V)\); the final-layer token embeddings are aggregated into a single vector via importance-weighted pooling. Weights are assigned in three tiers based on semantic informativeness: \(\alpha_{\text{high}}=1.0\) (actions, objects, states), \(\alpha_{\text{mid}}=0.3\) (attributes, scenes), \(\alpha_{\text{low}}=0.1\) (function words). All embeddings are L2-normalized and cached offline.

  2. Reason-Aware Query Encoding (three steps):

    • After-effect reasoning: Qwen3-VL is prompted to generate a structured reasoning trace \(R = \{\text{states}, \text{actions}, \text{scene}, \text{camera}, \text{tempo}\}\) conditioned on \((V_r, E)\), with at most four atomic assertions per slot.
    • Target description generation: A complete description \(D_{\text{target}}\) of the hypothetical post-edit video is generated conditioned on \((V_r, E, R)\).
    • Embedding extraction and pooling: Token embeddings are extracted and aggregated using the same importance-weighted pooling scheme.
  3. CoVR-R Benchmark Construction:

    • 2,800 high-quality triplets are constructed from Dense-WebVid-CoVR and Something-Something V2.
    • Each triplet includes a schema-constrained reasoning trace and hard negative candidates.
    • Selection criteria require satisfying at least two of: temporal dependency, state transition, camera technique, implicit causality, low lexical sufficiency.
    • Reasoning traces are generated following a fixed slot order (actions → camera → states → scene → tempo) and verified through human review.
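The three-tier importance-weighted pooling from design 1 can be sketched as below. The tier weights follow the paper (\(\alpha_{\text{high}}=1.0\), \(\alpha_{\text{mid}}=0.3\), \(\alpha_{\text{low}}=0.1\)); the toy `tag_token` heuristic and the word lists are illustrative assumptions standing in for a real semantic tagger.

```python
import numpy as np

# Tier weights from the paper; the tagging heuristic below is a toy stand-in.
ALPHA = {"high": 1.0, "mid": 0.3, "low": 0.1}

def tag_token(token):
    # Assumed heuristic: function words -> low, attributes -> mid,
    # everything else (actions, objects, states) -> high.
    function_words = {"the", "a", "of", "is", "and", "to"}
    attributes = {"red", "slow", "indoor", "bright"}
    if token in function_words:
        return "low"
    if token in attributes:
        return "mid"
    return "high"

def weighted_pool(tokens, token_embeddings):
    # Aggregate final-layer token embeddings into one vector via a
    # weighted average, then L2-normalize for cosine retrieval.
    weights = np.array([ALPHA[tag_token(t)] for t in tokens])
    pooled = (weights[:, None] * token_embeddings).sum(0) / weights.sum()
    return pooled / np.linalg.norm(pooled)

tokens = ["the", "chef", "is", "frying", "red", "peppers"]
embs = np.random.default_rng(0).standard_normal((len(tokens), 8))
vec = weighted_pool(tokens, embs)
print(vec.shape, round(float(np.linalg.norm(vec)), 3))  # (8,) 1.0
```

The same pooling is applied to both gallery descriptions and the query-side target description, so the two sides live in the same embedding space.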

Loss & Training

  • No training: The entire method is zero-shot and requires no task-specific fine-tuning.
  • Retrieval ranking is based on cosine similarity: \(s(V) = \mathbf{q}(V_r, E)^\top \mathbf{v}(V)\)
  • Reasoning evaluation employs LLM-as-a-judge (GPT-4o), scoring across 10 dimensions (1–10), with the arithmetic mean serving as the overall reasoning score.
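Since all embeddings are unit-normalized, the score \(s(V) = \mathbf{q}^\top \mathbf{v}(V)\) is exactly cosine similarity, and ranking the whole gallery reduces to one matrix-vector product. A minimal sketch with synthetic embeddings (sizes and the noise level are arbitrary assumptions):

```python
import numpy as np

def rank_gallery(query, gallery_matrix, k=3):
    # query: (d,) unit vector; gallery_matrix: (N, d) rows of unit vectors.
    scores = gallery_matrix @ query          # (N,) cosine similarities
    order = np.argsort(-scores)              # indices sorted descending
    return order[:k], scores[order[:k]]

rng = np.random.default_rng(1)
g = rng.standard_normal((100, 16))
g /= np.linalg.norm(g, axis=1, keepdims=True)   # unit-normalize gallery
q = g[42] + 0.05 * rng.standard_normal(16)      # query near gallery item 42
q /= np.linalg.norm(q)
top, top_scores = rank_gallery(q, g)
print(top[0])  # item 42 should rank first
```

In practice the gallery matrix is the offline cache from the encoding stage, so retrieval at query time is a single dense matmul followed by a top-k sort.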

Key Experimental Results

Main Results

Zero-shot comparison on the CoVR-R benchmark

Method          Backbone   R@1     R@5     R@10    R@50    Reasoning Score
CoVR-BLIP       BLIP       30.30   51.07   57.05   73.82   4.85
BSE-CoVR (CA)   BLIP       37.90   57.67   64.48   79.47   6.42
MVFT-JI†        BLIP       34.40   54.15   62.30   77.40   6.28
Ours            Qwen3-VL   44.32   61.91   67.33   79.90   7.46
Ours+R          Qwen3-VL   49.88   66.99   72.97   85.14   8.31

R@1 improves by +11.98 percentage points over the strongest baseline, BSE-CoVR (CA) — a 31.6% relative gain.

Dense-WebVid-CoVR test set

Method          R@1     R@5     R@10    R@50
BSE-CoVR (CA)   48.08   73.36   81.06   93.78
Ours            58.19   80.50   86.92   97.14
Ours+R          61.21   83.40   89.39   97.61

R@1 improves by +13.13 percentage points, surpassing all baselines.

Ablation Study

Token aggregation strategies

Strategy          R@1     R@5     R@50
Last token         1.51    3.57   10.14
Mean pooling      44.87   63.67   82.44
Max pooling       35.95   52.02   93.98
Weighted (ours)   49.88   66.99   85.14

Importance-weighted pooling outperforms mean pooling, the strongest unweighted alternative, by +5.01 points in R@1.

Effect of model scale

Model           R@1     Reasoning Score
Qwen3-VL-4B     43.98   7.95
Qwen3-VL-8B     49.88   8.31
Qwen3-VL-72B    55.48   9.05

Performance scales consistently with model size; 8B offers the best efficiency-performance trade-off.

Key Findings

  • The reasoning-augmented variant (+R) improves R@1 by +5.56 percentage points over the non-reasoning version, validating the value of explicit after-effect prediction.
  • Prior methods perform notably worse on CoVR-R than on standard benchmarks (avg R@1: 32.05% vs. 40.66%), demonstrating that reasoning-dependent edits pose a distinct challenge.
  • Iterative reasoning refinement (5 rounds) yields only marginal gains (R@1: 49.88% → 50.56%) at a 5× increase in inference cost; single-pass reasoning is adopted as the final design choice.
  • The Qwen3 series consistently outperforms the Qwen2.5 series at comparable parameter counts.

Highlights & Insights

  • Reasoning-first paradigm: Reasoning is elevated from a byproduct of retrieval to a first-class component — explicitly predicting the after-effects of edits before retrieval, offering greater interpretability than end-to-end feature fusion.
  • No task-specific training required: The framework achieves zero-shot CoVR by leveraging the reasoning capabilities of general-purpose LMMs, reducing reliance on annotated data.
  • Importance-weighted pooling: A simple, parameter-free strategy that outperforms last-token, mean, and max pooling by down-weighting function words and up-weighting semantically rich tokens.
  • Structured reasoning traces: A five-dimensional schema constraint (states / actions / scene / camera / tempo) makes reasoning verifiable and comparable, facilitating future research.

Limitations & Future Work

  • The framework depends on Qwen3-VL's video understanding capability, which may degrade on low-quality or extremely long videos.
  • Gallery encoding requires generating descriptions and extracting embeddings for each video, incurring non-trivial preprocessing costs.
  • The quality of reasoning traces is bounded by the LMM's reasoning capacity; subtle causal chains may be overlooked.
  • The benchmark scale (2,800 triplets) is relatively small with limited domain coverage.
  • Whether the zero-shot advantage over end-to-end fine-tuned methods on standard benchmarks can be sustained at larger scale remains to be verified.
  • The extension from CIR (Composed Image Retrieval) to CoVR introduces temporal and causal dimensions that are central to video understanding.
  • The proposed approach is complementary to training-based methods such as MVFT-JI and CoVR-BLIP — reasoning-based and training-based paradigms can potentially be combined.
  • The importance-weighted pooling idea generalizes to other tasks requiring semantic embeddings extracted from LMM-generated text.
  • The zero-shot reasoning-retrieval paradigm may extend to composed retrieval in other modalities (3D, audio, etc.).

Rating

  • Novelty: ⭐⭐⭐⭐ — The reasoning-first zero-shot CoVR framework is novel, and the benchmark design is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluations span two benchmarks with multi-dimensional ablations and model-scale analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, and the formal definition of reasoning traces is well-structured.
  • Value: ⭐⭐⭐⭐ — Advances CoVR from keyword matching toward reasoning-driven retrieval.